Merge pull request 'use-conda-env-builder' (#2) from use-conda-env-builder into main
Reviewed-on: hpckkaya/ray_template#2
commit ac99d53c62
7 changed files with 12 additions and 250 deletions
README.md (44 changed lines)
@@ -5,60 +5,28 @@ This guide shows you how to launch a Ray cluster on HLRS' Hawk system.

## Table of Contents

- [Ray: How to launch a Ray Cluster on Hawk?](#ray-how-to-launch-a-ray-cluster-on-hawk)
  - [Table of Contents](#table-of-contents)
-  - [Prerequisites](#prerequisites)
  - [Getting Started](#getting-started)
  - [Launch a local Ray Cluster in Interactive Mode](#launch-a-local-ray-cluster-in-interactive-mode)
  - [Launch a Ray Cluster in Batch Mode](#launch-a-ray-cluster-in-batch-mode)

-## Prerequisites
-
-Before building the environment, make sure you have the following prerequisites:
-
-- [Conda Installation](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html): Ensure that Conda is installed on your local system.
-- [Conda-Pack](https://conda.github.io/conda-pack/) installed in the base environment: Conda-Pack packages the Conda environment into a single tarball, which is then transferred to the target system.
-- `linux-64` platform for installing the Conda packages, because Conda/pip download and install precompiled binaries that match the architecture and OS of the local environment.
-
-For more information, see the documentation on [Conda on HLRS HPC systems](https://kb.hlrs.de/platforms/index.php/How_to_move_local_conda_environments_to_the_clusters).
-
## Getting Started

-Only the main and r channels are available using the conda module on the clusters. To use custom packages, we need to move the local conda environment to Hawk.
+**Step 1.** Build and transfer the Conda environment to Hawk:

-**Step 1.** Clone this repository to your local machine:
+Only the main and r channels are available using the Conda module on the clusters. To use custom packages, we need to move the local Conda environment to Hawk.

-```bash
-git clone <repository_url>
-```
+Follow the instructions in [the Conda environment builder repository](https://code.hlrs.de/SiVeGCS/conda-env-builder), which includes a YAML file for building a test environment to run Ray workflows.

-**Step 2.** Go into the directory and create an environment using Conda and environment.yaml.
-
-Note: Be sure to add the necessary packages in `deployment_scripts/environment.yaml`:
-
-```bash
-cd deployment_scripts
-./create-env.sh <your-env>
-```
-
-**Step 3.** Package the environment and transfer the archive to the target system:
-
-```bash
-(base) $ conda pack -n <your-env> -o ray_env.tar.gz # conda-pack must be installed in the base environment
-```
-
-A workspace is suitable for storing the compressed Conda environment archive on Hawk. Proceed to the next step if you have already configured your workspace. Use the following command to create a workspace on the high-performance filesystem, which will expire in 10 days. For more information, such as how to enable reminder emails, refer to the [workspace mechanism](https://kb.hlrs.de/platforms/index.php/Workspace_mechanism) guide.
+**Step 2.** Allocate a workspace on Hawk:
+
+Proceed to the next step if you have already configured your workspace. Use the following command to create a workspace on the high-performance filesystem, which will expire in 10 days. For more information, such as how to enable reminder emails, refer to the [workspace mechanism](https://kb.hlrs.de/platforms/index.php/Workspace_mechanism) guide.

```bash
ws_allocate hpda_project 10
ws_find hpda_project # find the path to the workspace, which is the destination directory in the next step
```

-You can send your data to an existing workspace using:
-
-```bash
-scp ray_env.tar.gz <username>@hawk.hww.hlrs.de:<workspace_directory>
-rm ray_env.tar.gz # We don't need the archive locally anymore.
-```
-
-**Step 4.** Clone the repository on Hawk to use the deployment scripts and project structure:
+**Step 3.** Clone the repository on Hawk to use the deployment scripts and project structure:

```bash
cd <workspace_directory>
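The removed prerequisites above rely on Conda-Pack being available in the base environment before `conda pack` can be used. A minimal sketch of installing it, assuming the conda-forge channel is acceptable on your local machine:

```bash
conda install -n base -c conda-forge conda-pack   # conda-pack must be installed in the base environment
conda pack -n <your-env> -o ray_env.tar.gz        # afterwards, environments can be packaged as shown above
```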
@@ -1,23 +0,0 @@
#!/bin/bash

# Display usage
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <conda_environment_name>"
    exit 1
fi

# Name of the Conda environment
CONDA_ENV_NAME=$1

# Check if the Conda environment already exists
if conda env list | grep -q "$CONDA_ENV_NAME"; then
    echo "Environment '$CONDA_ENV_NAME' already exists."
else
    echo "Environment '$CONDA_ENV_NAME' does not exist, creating it."
    # Create Conda environment
    CONDA_SUBDIR=linux-64 conda env create --name $CONDA_ENV_NAME -f environment.yaml
fi
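The deleted script above matches the `create-env.sh` helper documented later in this change. A minimal usage sketch, following the old README; the environment name is just an example:

```bash
cd deployment_scripts
./create-env.sh ray_env   # creates the environment from environment.yaml if it does not already exist
```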
@@ -1,43 +0,0 @@
#!/bin/bash
# Detect the node type and prepare the Ray Conda environment; source this script so the exports persist in your shell.

export WS_DIR=<workspace_dir>

# Get the first character of the hostname
first_char=$(hostname | cut -c1)

# Check if the first character is not "r"
if [[ $first_char != "r" ]]; then
    # It is not a CPU node.
    echo "Hostname does not start with 'r'."
    # Get the first seven characters of the hostname
    first_seven_chars=$(hostname | cut -c1-7)
    # Check if it is an AI node
    if [[ $first_seven_chars != "hawk-ai" ]]; then
        echo "Hostname does not start with 'hawk-ai' either. Exiting."
        return 1
    else
        echo "GPU node detected."
        export OBJ_STR_MEMORY=350000000000
        export TEMP_CHECKPOINT_DIR=/localscratch/$PBS_JOBID/model_checkpoints/
        mkdir -p $TEMP_CHECKPOINT_DIR
    fi
else
    echo "CPU node detected."
fi

module load bigdata/conda

export RAY_DEDUP_LOGS=0

export ENV_ARCHIVE=ray_env.tar.gz
export CONDA_ENVS=/run/user/$PBS_JOBID/envs
export ENV_NAME=ray_env
export ENV_PATH=$CONDA_ENVS/$ENV_NAME

mkdir -p $ENV_PATH

tar -xzf $WS_DIR/$ENV_ARCHIVE -C $ENV_PATH

source $ENV_PATH/bin/activate

export CONDA_ENVS_PATH=$CONDA_ENVS
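Because the deleted script above uses `return` and exports variables such as `WS_DIR` and `ENV_PATH`, it is meant to be sourced in the current shell on a compute node rather than executed. A hedged usage sketch follows; the interactive allocation flags and the script filename are assumptions and must be adapted to your setup:

```bash
# Hypothetical interactive session on Hawk (flags and script name are placeholders)
qsub -I -l select=1:node_type=rome -l walltime=01:00:00
cd <workspace_directory>/<project_name>
source deployment_scripts/<setup-script>.sh   # sourced so WS_DIR, ENV_PATH, etc. remain set afterwards
```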
@@ -1,104 +0,0 @@
# Reference: Cluster Deployment Scripts

Wiki link:

Motivation: This document aims to show users how to use additional Dask deployment scripts to streamline the deployment and management of a Dask cluster in a high-performance computing (HPC) environment.

Structure:
- [ ] [Tutorial](https://diataxis.fr/tutorials/)
- [ ] [How-to guide](https://diataxis.fr/how-to-guides/)
- [x] [Reference](https://diataxis.fr/reference/)
- [ ] [Explanation](https://diataxis.fr/explanation/)

To do:

---

## Overview

This repository contains a set of bash scripts designed to streamline the deployment and management of a Dask cluster in a high-performance computing (HPC) environment. These scripts facilitate the creation of Conda environments, deployment of the environment to a remote server, and initiation of Dask clusters on distributed systems. Below is a comprehensive guide on how to use and understand each script.

### Note: Permissions

Ensure that execution permissions (`chmod +x`) are granted to these scripts before attempting to run them. This can be done using the following command:

```bash
chmod +x script_name.sh
```

## Prerequisites

Before using these scripts, ensure that the following prerequisites are met:

1. **Conda Installation**: Ensure that Conda is installed on your local system. Follow the [official Conda installation guide](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) if it is not already installed.
2. **PBS Job Scheduler**: The deployment scripts (`deploy-dask.sh` and `dask-worker.sh`) are designed for use with the PBS job scheduler. Modify them accordingly if you use a different job scheduler.
3. **SSH Setup**: Ensure that SSH is set up and configured on your system for remote server communication (a sample host entry is sketched after this list).
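For the SSH prerequisite, an optional host entry can shorten the later transfers. This is only a sketch; the alias and username are placeholders:

```bash
# Append a host alias for Hawk to the local SSH configuration (values are placeholders)
cat >> ~/.ssh/config <<'EOF'
Host hawk
    HostName hawk.hww.hlrs.de
    User <username>
EOF
```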
## 1. create-env.sh

### Overview

`create-env.sh` is designed to create a Conda environment. It checks for the existence of the specified environment and either creates it or notifies the user if it already exists.

Note: Define your Conda environment in `environment.yaml` before running this script.

### Usage

```bash
./create-env.sh <conda_environment_name>
```

### Note

- This script is intended to run on a local system where Conda is installed.

## 2. deploy-env.sh

### Overview

`deploy-env.sh` is responsible for deploying the Conda environment to a remote server. If the tar.gz file already exists, it is copied; otherwise, it is created before being transferred.

### Usage

```bash
./deploy-env.sh <environment_name> <destination_directory>
```

### Note

- This script is intended to run on a local system.
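The copy-or-create behaviour described above can be pictured with a short sketch; the variable names and the packaging call are assumptions for illustration, not the actual contents of `deploy-env.sh`:

```bash
#!/bin/bash
# Sketch of the described deploy-env.sh flow (illustrative only)
ENV_NAME=$1
DESTINATION=$2                 # e.g. <username>@hawk.hww.hlrs.de:<workspace_directory>
ARCHIVE="$ENV_NAME.tar.gz"

if [ ! -f "$ARCHIVE" ]; then
    # No archive yet: package the environment first (conda-pack must be installed in base)
    conda pack -n "$ENV_NAME" -o "$ARCHIVE"
fi

scp "$ARCHIVE" "$DESTINATION"  # copy the existing or freshly created archive to the remote server
```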
## 3. deploy-dask.sh

### Overview

`deploy-dask.sh` initiates the Dask cluster in an HPC environment using the PBS job scheduler. It extracts the Conda environment, activates it, and starts the Dask scheduler and workers on the allocated nodes.

### Usage

```bash
./deploy-dask.sh <current_workspace_directory>
```

### Notes

- This script is designed for an HPC environment with PBS job scheduling.
- Modifications may be necessary for different job schedulers.
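As a rough sketch of the scheduler-plus-workers start-up described above, modelled on the Ray scripts elsewhere in this repository; the commands, variables, and scheduler-file location are assumptions rather than the actual `deploy-dask.sh`:

```bash
# Activate the extracted environment, start the scheduler, then fan workers out with pbsdsh
source "$ENV_PATH/bin/activate"
dask-scheduler --scheduler-file "$WS_DIR/scheduler.json" &

NUM_NODES=$(sort "$PBS_NODEFILE" | uniq | wc -l)
for ((i = 1; i < NUM_NODES; i++)); do
    pbsdsh -n $i -- bash -l -c "'$DEPLOYMENT_SCRIPTS/dask-worker.sh' '$WS_DIR'" &
done
```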
## 4. dask-worker.sh

### Overview

`dask-worker.sh` is a worker script executed on each allocated node. It sets up the Dask environment, extracts the Conda environment, activates it, and starts a Dask worker that connects to the scheduler. This script is not executed directly by the user.

### Notes

- This script runs on each allocated node to connect it to the Dask scheduler; it is launched by `deploy-dask.sh` rather than by the user.
- Designed for use with PBS job scheduling.
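A compressed sketch of what such a per-node worker script does follows; the archive name, extraction path, and scheduler-file location are assumptions for illustration:

```bash
#!/bin/bash
# Illustrative per-node worker start-up (not the original dask-worker.sh)
WS_DIR=$1
ENV_PATH=/run/user/$PBS_JOBID/dask_env   # node-local extraction directory, as in the Ray scripts

mkdir -p "$ENV_PATH"
tar -xzf "$WS_DIR/dask_env.tar.gz" -C "$ENV_PATH"
source "$ENV_PATH/bin/activate"
conda-unpack

# Join the cluster; the scheduler address is assumed to be shared via a scheduler file
dask-worker --scheduler-file "$WS_DIR/scheduler.json"
```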
## Workflow

1. **Create Conda Environment**: Execute `create-env.sh` to create a Conda environment locally.
2. **Deploy Conda Environment**: Execute `deploy-env.sh` to deploy the Conda environment to a remote server.
3. **Deploy Dask Cluster**: Execute `deploy-dask.sh` to start the Dask cluster in an HPC environment (see the sketch below).
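Putting the three steps together; the arguments follow the usage sections above, while the environment name and destination are examples only:

```bash
./create-env.sh dask_env                                                     # 1. build the environment locally
./deploy-env.sh dask_env <username>@hawk.hww.hlrs.de:<workspace_directory>  # 2. ship it to the remote workspace
./deploy-dask.sh <workspace_directory>                                      # 3. start the cluster via PBS
```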
@@ -1,23 +0,0 @@
name: ray
channels:
  - defaults
dependencies:
  - python=3.10
  - pip
  - pip:
    - ray==2.8.0
    - "ray[default]==2.8.0"
    - dask==2022.10.1
    - torch
    - pydantic<2
    - six
    - tqdm
    - pandas<2
    - scikit-learn
    - matplotlib
    - optuna
    - seaborn
    - tabulate
    - jupyterlab
    - autopep8
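This appears to be the `deployment_scripts/environment.yaml` consumed by `create-env.sh`. A minimal stand-alone sketch of building and packaging the same environment for the cluster, following the commands used elsewhere in this repository:

```bash
# Build for the linux-64 target and pack the result for transfer to the cluster
CONDA_SUBDIR=linux-64 conda env create --name ray -f environment.yaml
conda pack -n ray -o ray_env.tar.gz   # requires conda-pack in the base environment
```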
@@ -1,26 +1,19 @@
#!/bin/bash

if [ $# -ne 5 ]; then
-    echo "Usage: $0 <ws_dir> <env_archive> <ray_address> <redis_password> <obj_store_memory>"
+    echo "Usage: $0 <ws_dir> <env_path> <ray_address> <redis_password> <obj_store_memory>"
    exit 1
fi

export WS_DIR=$1
-export ENV_ARCHIVE=$2
+export ENV_PATH=$2
export RAY_ADDRESS=$3
export REDIS_PASSWORD=$4
export OBJECT_STORE_MEMORY=$5

-export ENV_PATH=/run/user/$PBS_JOBID/ray_env # We use the ram disk to extract the environment packages since a large number of files decreases the performance of the parallel file system.
-
-mkdir -p $ENV_PATH
-tar -xzf $WS_DIR/$ENV_ARCHIVE -C $ENV_PATH
source $ENV_PATH/bin/activate
-conda-unpack

ray start --address=$RAY_ADDRESS \
    --redis-password=$REDIS_PASSWORD \
    --object-store-memory=$OBJECT_STORE_MEMORY \
    --block
-
-rm -rf $ENV_PATH # It's nice to clean up before you terminate the job
@@ -5,10 +5,9 @@

export WS_DIR=<workspace_dir>
export PROJECT_DIR=$WS_DIR/<project_name>
+export ENV_PATH=<env_path>
export JOB_SCRIPT=monte-carlo-pi.py

-export ENV_ARCHIVE=ray_env.tar.gz
-
export OBJECT_STORE_MEMORY=128000000000

# Environment variables after this line should not change

@@ -16,10 +15,7 @@ export OBJECT_STORE_MEMORY=128000000000
export SRC_DIR=$PROJECT_DIR/src
export PYTHON_FILE=$SRC_DIR/$JOB_SCRIPT
export DEPLOYMENT_SCRIPTS=$PROJECT_DIR/deployment_scripts
-export ENV_PATH=/run/user/$PBS_JOBID/ray_env # We use the ram disk to extract the environment packages since a large number of files decreases the performance of the parallel file system.
-
-mkdir -p $ENV_PATH
-tar -xzf $WS_DIR/$ENV_ARCHIVE -C $ENV_PATH # This line extracts the packages to the ram disk.
source $ENV_PATH/bin/activate

export IP_ADDRESS=`ip addr show ib0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}' | awk '{print $1}'`

@@ -40,11 +36,9 @@ ray start --disable-usage-stats \
export NUM_NODES=$(sort $PBS_NODEFILE | uniq | wc -l)

for ((i=1;i<$NUM_NODES;i++)); do
-    pbsdsh -n $i -- bash -l -c "'$DEPLOYMENT_SCRIPTS/start-ray-worker.sh' '$WS_DIR' '$ENV_ARCHIVE' '$RAY_ADDRESS' '$REDIS_PASSWORD' '$OBJECT_STORE_MEMORY'" &
+    pbsdsh -n $i -- bash -l -c "'$DEPLOYMENT_SCRIPTS/start-ray-worker.sh' '$WS_DIR' '$ENV_PATH' '$RAY_ADDRESS' '$REDIS_PASSWORD' '$OBJECT_STORE_MEMORY'" &
done

python3 $PYTHON_FILE

ray stop --grace-period 30
-
-rm -rf $ENV_PATH # It's nice to clean up before you terminate the job.
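For completeness, a hedged example of submitting the batch script above on a PBS system; the script filename and the resource selection are assumptions and must be adapted to your allocation:

```bash
# Hypothetical submission of the Ray batch job (adjust select/walltime and the filename)
qsub -l select=2:node_type=rome -l walltime=00:30:00 deployment_scripts/submit-ray-job.pbs
```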