initial modifications to use the conda env builder repo.

Kerem Kayabay 2024-03-25 14:52:40 +01:00
parent 5a8bf27936
commit 6c4b028131
7 changed files with 12 additions and 250 deletions


@@ -5,60 +5,28 @@ This guide shows you how to launch a Ray cluster on HLRS' Hawk system.
## Table of Contents
- [Ray: How to launch a Ray Cluster on Hawk?](#ray-how-to-launch-a-ray-cluster-on-hawk)
- [Table of Contents](#table-of-contents)
- [Prerequisites](#prerequisites)
- [Getting Started](#getting-started)
- [Launch a local Ray Cluster in Interactive Mode](#launch-a-local-ray-cluster-in-interactive-mode)
- [Launch a Ray Cluster in Batch Mode](#launch-a-ray-cluster-in-batch-mode)
## Prerequisites
Before building the environment, make sure you have the following prerequisites:
- [Conda Installation](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html): Ensure that Conda is installed on your local system.
- [Conda-Pack](https://conda.github.io/conda-pack/) installed in the base environment: conda-pack packages the Conda environment into a single tarball, which is then transferred to the target system.
- `linux-64` platform for installing the Conda packages, because Conda/pip downloads and installs precompiled binaries suited to the architecture and OS of the local environment.
For more information, see the documentation on [Conda on HLRS HPC systems](https://kb.hlrs.de/platforms/index.php/How_to_move_local_conda_environments_to_the_clusters).
## Getting Started
-Only the main and r channels are available using the conda module on the clusters. To use custom packages, we need to move the local conda environment to Hawk.
-**Step 1.** Clone this repository to your local machine:
-```bash
-git clone <repository_url>
-```
-**Step 2.** Go into the directory and create an environment using Conda and environment.yaml.
-Note: Be sure to add the necessary packages in `deployment_scripts/environment.yaml`:
-```bash
-cd deployment_scripts
-./create-env.sh <your-env>
-```
-**Step 3.** Package the environment and transfer the archive to the target system:
-```bash
-(base) $ conda pack -n <your-env> -o ray_env.tar.gz # conda-pack must be installed in the base environment
-```
-A workspace is suitable to store the compressed Conda environment archive on Hawk. Proceed to the next step if you have already configured your workspace. Use the following command to create a workspace on the high-performance filesystem, which will expire in 10 days. For more information, such as how to enable reminder emails, refer to the [workspace mechanism](https://kb.hlrs.de/platforms/index.php/Workspace_mechanism) guide.
+**Step 1.** Build and transfer the Conda environment to Hawk:
+Only the main and r channels are available using the Conda module on the clusters. To use custom packages, we need to move the local Conda environment to Hawk.
+Follow the instructions in the Conda environment builder repository, which includes a YAML file for building a test environment to run Ray workflows.
+**Step 2.** Allocate workspace on Hawk:
+Proceed to the next step if you have already configured your workspace. Use the following command to create a workspace on the high-performance filesystem, which will expire in 10 days. For more information, such as how to enable reminder emails, refer to the [workspace mechanism](https://kb.hlrs.de/platforms/index.php/Workspace_mechanism) guide.
```bash
ws_allocate hpda_project 10
ws_find hpda_project # find the path to workspace, which is the destination directory in the next step
```
-You can send your data to an existing workspace using:
-```bash
-scp ray_env.tar.gz <username>@hawk.hww.hlrs.de:<workspace_directory>
-rm ray_env.tar.gz # We don't need the archive locally anymore.
-```
-**Step 4.** Clone the repository on Hawk to use the deployment scripts and project structure:
+**Step 2.** Clone the repository on Hawk to use the deployment scripts and project structure:
```bash
cd <workspace_directory>


@@ -1,23 +0,0 @@
#!/bin/bash
# Display usage
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <conda_environment_name>"
    exit 1
fi
# Name of the Conda environment
CONDA_ENV_NAME=$1
# Check if the Conda environment already exists
if conda env list | grep -q "$CONDA_ENV_NAME"; then
    echo "Environment '$CONDA_ENV_NAME' already exists."
else
    echo "Environment '$CONDA_ENV_NAME' does not exist, creating it."
    # Create Conda environment
    CONDA_SUBDIR=linux-64 conda env create --name $CONDA_ENV_NAME -f environment.yaml
fi


@@ -1,43 +0,0 @@
#!/bin/bash
export WS_DIR=<workspace_dir>
# Get the first character of the hostname
first_char=$(hostname | cut -c1)
# Check if the first character is not "r"
if [[ $first_char != "r" ]]; then
    # It's not a CPU node.
    echo "Hostname does not start with 'r'."
    # Get the first seven characters of the hostname
    first_seven_chars=$(hostname | cut -c1-7)
    # Check if it is an AI node
    if [[ $first_seven_chars != "hawk-ai" ]]; then
        echo "Hostname does not start with 'hawk-ai' either. Exiting."
        return 1
    else
        echo "GPU node detected."
        export OBJ_STR_MEMORY=350000000000
        export TEMP_CHECKPOINT_DIR=/localscratch/$PBS_JOBID/model_checkpoints/
        mkdir -p $TEMP_CHECKPOINT_DIR
    fi
else
    echo "CPU node detected."
fi
module load bigdata/conda
export RAY_DEDUP_LOGS=0
export ENV_ARCHIVE=ray_env.tar.gz
export CONDA_ENVS=/run/user/$PBS_JOBID/envs
export ENV_NAME=ray_env
export ENV_PATH=$CONDA_ENVS/$ENV_NAME
mkdir -p $ENV_PATH
tar -xzf $WS_DIR/$ENV_ARCHIVE -C $ENV_PATH
source $ENV_PATH/bin/activate
export CONDA_ENVS_PATH=$CONDA_ENVS


@@ -1,104 +0,0 @@
# Reference: Cluster Deployment Scripts
Wiki link:
Motivation: This document aims to show users how to use additional Dask deployment scripts to streamline the deployment and management of a Dask cluster in a high-performance computing (HPC) environment.
Structure:
- [ ] [Tutorial](https://diataxis.fr/tutorials/)
- [ ] [How-to guide](https://diataxis.fr/how-to-guides/)
- [x] [Reference](https://diataxis.fr/reference/)
- [ ] [Explanation](https://diataxis.fr/explanation/)
To do:
---
## Overview
This repository contains a set of bash scripts designed to streamline the deployment and management of a Dask cluster in a high-performance computing (HPC) environment. These scripts facilitate the creation of Conda environments, deployment of the environment to a remote server, and initiation of Dask clusters on distributed systems. Below is a comprehensive guide on how to use and understand each script:
### Note: Permissions
Ensure that execution permissions (`chmod +x`) are granted to these scripts before attempting to run them. This can be done using the following command:
```bash
chmod +x script_name.sh
```
## Prerequisites
Before using these scripts, ensure that the following prerequisites are met:
1. **Conda Installation**: Ensure that Conda is installed on your local system. Follow the [official Conda installation guide](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) if not already installed.
2. **PBS Job Scheduler**: The deployment scripts (`deploy-dask.sh` and `dask-worker.sh`) are designed for use with the PBS job scheduler. Modify accordingly if using a different job scheduler.
3. **SSH Setup**: Ensure that SSH is set up and configured on your system for remote server communication.
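As a sketch of the SSH prerequisite (the key type and target address are assumptions, not requirements of these scripts), key-based access to the remote system can be set up with:
```bash
# Generate a key pair once (skip if you already have one), then install the
# public key on the remote system so ssh/scp work without a password prompt.
ssh-keygen -t ed25519
ssh-copy-id <username>@hawk.hww.hlrs.de
```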
## 1. create-env.sh
### Overview
`create-env.sh` is designed to create a Conda environment. It checks for the existence of the specified environment and either creates it or notifies the user if it already exists.
Note: Define your Conda environment in `environment.yaml` before running this script.
### Usage
```bash
./create-env.sh <conda_environment_name>
```
### Note
- This script is intended to run on a local system where Conda is installed.
## 2. deploy-env.sh
### Overview
`deploy-env.sh` is responsible for deploying the Conda environment to a remote server. If the tar.gz file already exists, it is copied; otherwise, it is created before being transferred.
### Usage
```bash
./deploy-env.sh <environment_name> <destination_directory>
```
### Note
- This script is intended to run on a local system.
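The script body is not included in this diff; a rough sketch of the copy-or-create behaviour described in the overview above, with all names and paths as assumptions, might look like:
```bash
#!/bin/bash
# Hypothetical sketch of deploy-env.sh: reuse an existing archive if present,
# otherwise pack the Conda environment first, then transfer it.
ENV_NAME=$1
DESTINATION=$2   # e.g. <username>@hawk.hww.hlrs.de:<workspace_directory>
ARCHIVE=${ENV_NAME}.tar.gz

if [ ! -f "$ARCHIVE" ]; then
    # conda-pack must be installed in the base environment.
    conda pack -n "$ENV_NAME" -o "$ARCHIVE"
fi
scp "$ARCHIVE" "$DESTINATION"
```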
## 3. deploy-dask.sh
### Overview
`deploy-dask.sh` initiates the Dask cluster in an HPC environment using the PBS job scheduler. It extracts the Conda environment, activates it, and starts the Dask scheduler and workers on allocated nodes.
### Usage
```bash
./deploy-dask.sh <current_workspace_directory>
```
### Notes
- This script is designed for an HPC environment with PBS job scheduling.
- Modifications may be necessary for different job schedulers.
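The script itself is also not part of this diff; a very rough sketch of the scheduler/worker startup it performs (the port, variable names, and pbsdsh invocation are assumptions) could look like:
```bash
# Hypothetical core of deploy-dask.sh inside a PBS job: one scheduler on the
# first node, one worker per remaining node started via pbsdsh.
dask-scheduler --host $IP_ADDRESS &
NUM_NODES=$(sort $PBS_NODEFILE | uniq | wc -l)
for ((i=1;i<$NUM_NODES;i++)); do
    pbsdsh -n $i -- bash -l -c "dask-worker $IP_ADDRESS:8786" &
done
```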
## 4. dask-worker.sh
### Overview
`dask-worker.sh` is a worker script designed to be executed on each allocated node. It sets up the Dask environment, extracts the Conda environment, activates it, and starts the Dask worker to connect to the scheduler. This script is not directly executed by the user.
### Notes
- Execute this script on each allocated node to connect them to the Dask scheduler.
- Designed for use with PBS job scheduling.
## Workflow
1. **Create Conda Environment**: Execute `create-env.sh` to create a Conda environment locally.
2. **Deploy Conda Environment**: Execute `deploy-env.sh` to deploy the Conda environment to a remote server.
3. **Deploy Dask Cluster**: Execute `deploy-dask.sh` to start the Dask cluster in an HPC environment.
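Put together, a minimal end-to-end run of this workflow (the environment name and the placeholder paths are assumptions) might look like:
```bash
# 1. Create the Conda environment locally from environment.yaml.
./create-env.sh dask_env

# 2. Package the environment and copy it to the remote destination.
./deploy-env.sh dask_env <destination_directory>

# 3. On the cluster, start the Dask cluster from within a PBS job.
./deploy-dask.sh <current_workspace_directory>
```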


@@ -1,23 +0,0 @@
name: ray
channels:
- defaults
dependencies:
- python=3.10
- pip
- pip:
- ray==2.8.0
- "ray[default]==2.8.0"
- dask==2022.10.1
- torch
- pydantic<2
- six
- tqdm
- pandas<2
- scikit-learn
- matplotlib
- optuna
- seaborn
- tabulate
- jupyterlab
- autopep8
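For reference, `create-env.sh` consumed this file; an equivalent manual invocation is:
```bash
# Force linux-64 packages so the environment matches the cluster architecture;
# the environment name ("ray") is taken from the file itself.
CONDA_SUBDIR=linux-64 conda env create -f environment.yaml
```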


@@ -1,26 +1,19 @@
#!/bin/bash
if [ $# -ne 5 ]; then
-echo "Usage: $0 <ws_dir> <env_archive> <ray_address> <redis_password> <obj_store_memory>"
+echo "Usage: $0 <ws_dir> <env_path> <ray_address> <redis_password> <obj_store_memory>"
exit 1
fi
export WS_DIR=$1
-export ENV_ARCHIVE=$2
+export ENV_PATH=$2
export RAY_ADDRESS=$3
export REDIS_PASSWORD=$4
export OBJECT_STORE_MEMORY=$5
-export ENV_PATH=/run/user/$PBS_JOBID/ray_env # We use the ram disk to extract the environment packages since a large number of files decreases the performance of the parallel file system.
-mkdir -p $ENV_PATH
-tar -xzf $WS_DIR/$ENV_ARCHIVE -C $ENV_PATH
source $ENV_PATH/bin/activate
-conda-unpack
ray start --address=$RAY_ADDRESS \
--redis-password=$REDIS_PASSWORD \
--object-store-memory=$OBJECT_STORE_MEMORY \
--block
-rm -rf $ENV_PATH # It's nice to clean up before you terminate the job


@@ -5,10 +5,9 @@
export WS_DIR=<workspace_dir>
export PROJECT_DIR=$WS_DIR/<project_name>
+export ENV_PATH=<env_path>
export JOB_SCRIPT=monte-carlo-pi.py
-export ENV_ARCHIVE=ray_env.tar.gz
export OBJECT_STORE_MEMORY=128000000000
# Environment variables after this line should not change
@@ -16,10 +15,7 @@ export OBJECT_STORE_MEMORY=128000000000
export SRC_DIR=$PROJECT_DIR/src
export PYTHON_FILE=$SRC_DIR/$JOB_SCRIPT
export DEPLOYMENT_SCRIPTS=$PROJECT_DIR/deployment_scripts
-export ENV_PATH=/run/user/$PBS_JOBID/ray_env # We use the ram disk to extract the environment packages since a large number of files decreases the performance of the parallel file system.
-mkdir -p $ENV_PATH
-tar -xzf $WS_DIR/$ENV_ARCHIVE -C $ENV_PATH # This line extracts the packages to ram disk.
source $ENV_PATH/bin/activate
export IP_ADDRESS=`ip addr show ib0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}' | awk '{print $1}'`
@@ -40,11 +36,9 @@ ray start --disable-usage-stats \
export NUM_NODES=$(sort $PBS_NODEFILE | uniq | wc -l)
for ((i=1;i<$NUM_NODES;i++)); do
-pbsdsh -n $i -- bash -l -c "'$DEPLOYMENT_SCRIPTS/start-ray-worker.sh' '$WS_DIR' '$ENV_ARCHIVE' '$RAY_ADDRESS' '$REDIS_PASSWORD' '$OBJECT_STORE_MEMORY'" &
+pbsdsh -n $i -- bash -l -c "'$DEPLOYMENT_SCRIPTS/start-ray-worker.sh' '$WS_DIR' '$ENV_PATH' '$RAY_ADDRESS' '$REDIS_PASSWORD' '$OBJECT_STORE_MEMORY'" &
done
python3 $PYTHON_FILE
ray stop --grace-period 30
-rm -rf $ENV_PATH # It's nice to clean up before you terminate the job.