prepare for multi-node cluster
Parent: 5e25f899fa
Commit: ad953fd96a
5 changed files with 152 additions and 110 deletions
README.md (74 lines changed)
@@ -7,7 +7,7 @@ This guide shows you how to launch a Ray cluster on HLRS' Hawk system.
 - [Table of Contents](#table-of-contents)
 - [Prerequisites](#prerequisites)
 - [Getting Started](#getting-started)
-- [Launch a Ray Cluster in Interactive Mode](#launch-a-ray-cluster-in-interactive-mode)
+- [Launch a local Ray Cluster in Interactive Mode](#launch-a-local-ray-cluster-in-interactive-mode)
 - [Launch a Ray Cluster in Batch Mode](#launch-a-ray-cluster-in-batch-mode)

 ## Prerequisites
@@ -23,13 +23,13 @@ For more information, look at the documentation for [Conda on HLRS HPC systems](

 Only the main and r channels are available using the conda module on the clusters. To use custom packages, we need to move the local conda environment to Hawk.

-1. Clone this repository to your local machine:
+**Step 1.** Clone this repository to your local machine:

 ```bash
 git clone <repository_url>
 ```

-2. Go into the directory and create an environment using Conda and environment.yaml.
+**Step 2.** Go into the directory and create an environment using Conda and environment.yaml.

 Note: Be sure to add the necessary packages in `deployment_scripts/environment.yaml`:

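Before packing the environment in the following steps, it can be worth confirming that it was created and actually contains the packages you listed in `environment.yaml`. A quick check (a sketch; `<your-env>` is the same placeholder used throughout this guide):

```bash
conda env list                # the new environment should appear in this list
conda list -n <your-env> ray  # confirm that ray (and any other packages you added) resolved
```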
@@ -38,7 +38,7 @@ cd deployment_scripts
 ./create-env.sh <your-env>
 ```

-3. Package the environment and transfer the archive to the target system:
+**Step 3.** Package the environment and transfer the archive to the target system:

 ```bash
 (base) $ conda pack -n <your-env> -o ray_env.tar.gz # conda-pack must be installed in the base environment
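The comment above assumes `conda-pack` is already installed in your base environment. If it is not, one way to add it before running the pack command (installing from conda-forge is an assumption about your local setup; `python -m pip install conda-pack` also works):

```bash
(base) $ conda install -n base -c conda-forge conda-pack  # install conda-pack into the base environment
```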
@@ -58,49 +58,76 @@ scp ray_env.tar.gz <username>@hawk.hww.hlrs.de:<workspace_directory>
 rm ray_env.tar.gz # We don't need the archive locally anymore.
 ```

-4. Clone the repository on Hawk to use the deployment scripts and project structure:
+**Step 4.** Clone the repository on Hawk to use the deployment scripts and project structure:

 ```bash
 cd <workspace_directory>
 git clone <repository_url>
 ```

-## Launch a Ray Cluster in Interactive Mode
+## Launch a local Ray Cluster in Interactive Mode

 Using a single node interactively provides opportunities for faster code debugging.

-1. On the Hawk login node, start an interactive job using:
+**Step 1.** On the Hawk login node, start an interactive job using:

 ```bash
 qsub -I -l select=1:node_type=rome -l walltime=01:00:00
 ```

-2. Go into the directory with all code:
+**Step 2.** Go into the project directory:

 ```bash
 cd <project_directory>/deployment_scripts
 ```

-3. Deploy the conda environment to the ram disk:
+**Step 3.** Deploy the conda environment to the ram disk:

+Change the following line by editing `deploy-env.sh`:
+
+```bash
+export WS_DIR=<workspace_dir>
+```
+
+Then, use the following command to deploy and activate the environment:
+
 ```bash
 source deploy-env.sh
 ```
 Note: Make sure all permissions are set using `chmod +x`.

-4. Initialize the Ray cluster.
+**Step 4.** Initialize the Ray cluster.

-You can use a Python interpreter to start a Ray cluster:
+You can use a Python interpreter to start a local Ray cluster:

 ```python
 import ray

-ray.init(dashboard_host='127.0.0.1')
+ray.init()
 ```

-1. Connect to the dashboard.
+**Step 5.** Connect to the dashboard.

-Warning: Always use `127.0.0.1` as the dashboard host to make the Ray cluster reachable by only you.
+Warning: Do not change the default dashboard host `127.0.0.1`; this keeps the Ray cluster reachable only by you.
+
+Note: We recommend using a dedicated Firefox profile for accessing web-based services on HLRS Compute Platforms. If you haven't created a profile, check out our [guide](https://kb.hlrs.de/platforms/index.php/How_to_use_Web_Based_Services_on_HLRS_Compute_Platforms).
+
+You need the job id and the hostname for your current job. You can obtain this information on the login node using:
+
+```bash
+qstat -anw # get the job id and the hostname
+```
+
+Then, on your local computer,
+
+```bash
+export PBS_JOBID=<job-id> # e.g., 2316419.hawk-pbs5
+ssh <compute-host> # e.g., r38c3t8n3
+```
+
+Check your SSH config in the first step if this doesn't work.
+
+Then, launch the Firefox web browser using the configured profile. Open `localhost:8265` to access the Ray dashboard.

 ## Launch a Ray Cluster in Batch Mode

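If the plain `ssh <compute-host>` route above does not work with your SSH configuration, an explicit port forward for the dashboard is an alternative (a sketch; the jump through the login node and the host names mirror the examples above and may need adapting to your setup):

```bash
# Forward local port 8265 to the Ray dashboard on the compute node, jumping via the login node.
ssh -J <username>@hawk.hww.hlrs.de -L 8265:localhost:8265 <username>@<compute-host>
# Then open localhost:8265 in the dedicated Firefox profile.
```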
@@ -108,11 +135,24 @@ Warning: Always use `127.0.0.1` as the dashboard host to make the Ray cluster reachable by only you.

 ```bash
 cd deployment_scripts
-chmod +x ray-start-worker.sh
+chmod +x start-ray-worker.sh
 ```

 2. Submit a job to launch the head and worker nodes.

-You must modify the following variables in `submit-ray-job.sh`:
+You must modify the following lines in `submit-ray-job.pbs`:
 - Line 3 changes the cluster size. The default configuration launches a 3 node cluster.
-- `$PROJECT_DIR`
+- `export WS_DIR=<workspace_dir>` - set the correct workspace directory.
+- `export PROJECT_DIR=$WS_DIR/<project_name>` - set the correct project directory.
+
+Note: The job script `src/monte-carlo-pi.py` waits for all nodes in the Ray cluster to become available. Preserve this pattern in your Python code when using a multi-node Ray cluster.
+
+Launch the job and monitor the progress. As the job starts, its status (S) shifts from Q (Queued) to R (Running). Upon completion, the job will no longer appear in the `qstat -a` display.
+
+```bash
+qsub submit-ray-job.pbs
+qstat -anw # Q: Queued, R: Running, E: Ending
+ls -l # list files after the job finishes
+cat ray-job.o... # inspect the output file
+cat ray-job.e... # inspect the error file
+```
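Two further PBS commands that are handy while the batch job above is queued or running (the job id placeholder is yours to fill in):

```bash
qstat -f <job-id> # full details of a single job, including the nodes it was assigned
qdel <job-id>     # cancel the job, e.g. to resubmit with different settings
```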
@@ -11,11 +11,7 @@ export RAY_ADDRESS=$3
 export REDIS_PASSWORD=$4
 export OBJECT_STORE_MEMORY=$5

-# printenv | grep 'RAY_ADDRESS\|REDIS_PASSWORD'
+export ENV_PATH=/run/user/$PBS_JOBID/ray_env # We use the ram disk to extract the environment packages since a large number of files decreases the performance of the parallel file system.

-# module load system/nvidia/ALL.ALL.525.125.06
-
-export ENV_PATH=/run/user/$PBS_JOBID/ray_env
-
 mkdir -p $ENV_PATH
 tar -xzf $WS_DIR/$ENV_ARCHIVE -C $ENV_PATH
@@ -27,4 +23,4 @@ ray start --address=$RAY_ADDRESS \
 --object-store-memory=$OBJECT_STORE_MEMORY \
 --block

-rm -rf $ENV_PATH
+rm -rf $ENV_PATH # It's nice to clean up before you terminate the job
deployment_scripts/submit-ray-job.pbs (new file, 50 lines)
@@ -0,0 +1,50 @@
+#!/bin/bash
+#PBS -N ray-job
+#PBS -l select=2:node_type=rome-ai
+#PBS -l walltime=1:00:00
+
+export WS_DIR=<workspace_dir>
+export PROJECT_DIR=$WS_DIR/<project_name>
+export JOB_SCRIPT=monte-carlo-pi.py
+
+export ENV_ARCHIVE=ray_env.tar.gz
+
+export OBJECT_STORE_MEMORY=128000000000
+
+# Environment variables after this line should not change
+
+export SRC_DIR=$PROJECT_DIR/src
+export PYTHON_FILE=$SRC_DIR/$JOB_SCRIPT
+export DEPLOYMENT_SCRIPTS=$PROJECT_DIR/deployment_scripts
+export ENV_PATH=/run/user/$PBS_JOBID/ray_env # We use the ram disk to extract the environment packages since a large number of files decreases the performance of the parallel file system.
+
+mkdir -p $ENV_PATH
+tar -xzf $WS_DIR/$ENV_ARCHIVE -C $ENV_PATH # This line extracts the packages to ram disk.
+source $ENV_PATH/bin/activate
+
+export IP_ADDRESS=`ip addr show ib0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}' | awk '{print $1}'`
+
+export RAY_ADDRESS=$IP_ADDRESS:6379
+export REDIS_PASSWORD=$(openssl rand -base64 32)
+
+export NCCL_DEBUG=INFO
+
+ray start --disable-usage-stats \
+    --head \
+    --node-ip-address=$IP_ADDRESS \
+    --port=6379 \
+    --dashboard-host=127.0.0.1 \
+    --redis-password=$REDIS_PASSWORD \
+    --object-store-memory=$OBJECT_STORE_MEMORY
+
+export NUM_NODES=$(sort $PBS_NODEFILE | uniq | wc -l)
+
+for ((i=1;i<$NUM_NODES;i++)); do
+    pbsdsh -n $i -- bash -l -c "'$DEPLOYMENT_SCRIPTS/ray-start-worker.sh' '$WS_DIR' '$ENV_ARCHIVE' '$RAY_ADDRESS' '$REDIS_PASSWORD' '$OBJECT_STORE_MEMORY'" &
+done
+
+python3 $PYTHON_FILE
+
+ray stop --grace-period 30
+
+rm -rf $ENV_PATH # It's nice to clean up before you terminate the job.
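Because `NUM_NODES` is derived from `$PBS_NODEFILE`, scaling the cluster only requires changing the `select` line at the top of the script; for example (node count and node type are whatever your allocation allows):

```bash
#PBS -l select=4:node_type=rome-ai
# -> NUM_NODES=4: node 0 runs the Ray head, pbsdsh starts workers on nodes 1-3
```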
@@ -1,59 +0,0 @@
-#!/bin/bash
-#PBS -N output-ray-job
-#PBS -l select=2:node_type=rome-ai
-#PBS -l walltime=1:00:00
-
-export JOB_SCRIPT=modeling_evaluation.py
-
-export WS_DIR=/lustre/hpe/ws10/ws10.3/ws/hpckkaya-ifu
-export ENV_ARCHIVE=ray-environment-v0.3.tar.gz
-
-export SRC_DIR=$WS_DIR/ifu/src/ray-workflow
-export DATA_DIR=/lustre/hpe/ws10/ws10.3/ws/hpckkaya-ifu-data/hpclzhon-ifu_data-1668830707
-export RESULTS_DIR=$WS_DIR/ray_results
-
-export NCCL_DEBUG=INFO
-
-# Environment variables after this line should not change
-
-export PYTHON_FILE=$SRC_DIR/$JOB_SCRIPT
-export ENV_PATH=/run/user/$PBS_JOBID/ray_env
-
-mkdir -p $ENV_PATH
-tar -xzf $WS_DIR/$ENV_ARCHIVE -C $ENV_PATH
-source $ENV_PATH/bin/activate
-conda-unpack
-
-export IP_ADDRESS=`ip addr show ib0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}' | awk '{print $1}'`
-
-export RAY_ADDRESS=$IP_ADDRESS:6379
-export REDIS_PASSWORD=$(uuidgen)
-
-# export RAY_scheduler_spread_threshold=0.0
-export OBJECT_STORE_MEMORY=128000000000
-
-ray start --disable-usage-stats \
-    --head \
-    --node-ip-address=$IP_ADDRESS \
-    --port=6379 \
-    --dashboard-host=127.0.0.1 \
-    --redis-password=$REDIS_PASSWORD \
-    --object-store-memory=$OBJECT_STORE_MEMORY
-
-export NUM_NODES=$(sort $PBS_NODEFILE | uniq | wc -l)
-
-for ((i=1;i<$NUM_NODES;i++)); do
-    pbsdsh -n $i -- bash -l -c "'$SRC_DIR/ray-start-worker.sh' '$WS_DIR' '$ENV_ARCHIVE' '$RAY_ADDRESS' '$REDIS_PASSWORD' '$OBJECT_STORE_MEMORY'" &
-done
-
-# uncomment if you don't already control inside the code
-# if [[ $NUM_NODES -gt 1 ]]
-# then
-#     sleep 90
-# fi
-
-python3 $PYTHON_FILE
-
-ray stop
-
-rm -rf $ENV_PATH
@@ -6,11 +6,10 @@ import time
 import random
 import os

-ray.init(address="auto", _node_ip_address=os.environ["IP_ADDRESS"], _redis_password=os.environ["REDIS_PASSWORD"])
-cluster_resources = ray.available_resources()
-available_cpu_cores = cluster_resources.get('CPU', 0)
-print(cluster_resources)
+# Change this to match your cluster scale.
+NUM_SAMPLING_TASKS = 100
+NUM_SAMPLES_PER_TASK = 10_000_000
+TOTAL_NUM_SAMPLES = NUM_SAMPLING_TASKS * NUM_SAMPLES_PER_TASK

 @ray.remote
 class ProgressActor:
@@ -44,31 +43,47 @@ def sampling_task(num_samples: int, task_id: int,
     progress_actor.report_progress.remote(task_id, num_samples)
     return num_inside

-# Change this to match your cluster scale.
-NUM_SAMPLING_TASKS = 100
-NUM_SAMPLES_PER_TASK = 10_000_000
-TOTAL_NUM_SAMPLES = NUM_SAMPLING_TASKS * NUM_SAMPLES_PER_TASK
-
-# Create the progress actor.
-progress_actor = ProgressActor.remote(TOTAL_NUM_SAMPLES)
-
-# Create and execute all sampling tasks in parallel.
-results = [
-    sampling_task.remote(NUM_SAMPLES_PER_TASK, i, progress_actor)
-    for i in range(NUM_SAMPLING_TASKS)
-]
-
-# Query progress periodically.
-while True:
-    progress = ray.get(progress_actor.get_progress.remote())
-    print(f"Progress: {int(progress * 100)}%")
-
-    if progress == 1:
-        break
-
-    time.sleep(1)
-
-# Get all the sampling tasks results.
-total_num_inside = sum(ray.get(results))
-pi = (total_num_inside * 4) / TOTAL_NUM_SAMPLES
-print(f"Estimated value of π is: {pi}")
+def wait_for_nodes(expected_num_nodes: int):
+    while True:
+        num_nodes = len(ray.nodes())
+        if num_nodes >= expected_num_nodes:
+            break
+        print(f'Currently {num_nodes} nodes connected. Waiting for more...')
+        time.sleep(5)  # wait for 5 seconds before checking again
+
+if __name__ == "__main__":
+    num_nodes = int(os.environ["NUM_NODES"])
+    assert num_nodes > 1, "If the environment variable NUM_NODES is set, it should be greater than 1."
+
+    redis_password = os.environ["REDIS_PASSWORD"]
+    ray.init(address="auto", _redis_password=redis_password)
+
+    wait_for_nodes(num_nodes)
+
+    cluster_resources = ray.available_resources()
+    print(cluster_resources)
+
+    # Create the progress actor.
+    progress_actor = ProgressActor.remote(TOTAL_NUM_SAMPLES)
+
+    # Create and execute all sampling tasks in parallel.
+    results = [
+        sampling_task.remote(NUM_SAMPLES_PER_TASK, i, progress_actor)
+        for i in range(NUM_SAMPLING_TASKS)
+    ]
+
+    # Query progress periodically.
+    while True:
+        progress = ray.get(progress_actor.get_progress.remote())
+        print(f"Progress: {int(progress * 100)}%")
+
+        if progress == 1:
+            break
+
+        time.sleep(1)
+
+    # Get all the sampling tasks results.
+    total_num_inside = sum(ray.get(results))
+    pi = (total_num_inside * 4) / TOTAL_NUM_SAMPLES
+    print(f"Estimated value of π is: {pi}")