Template repository for Dask workflows on HLRS HPC

Kerem Kayabay 2e5b17ab5e fix the second bash block		2024-02-07 17:19:05 +01:00

Ray: How to launch a Ray Cluster on Hawk?

This guide shows you how to launch a Ray cluster on HLRS' Hawk system.

Prerequisites

Before building the environment, make sure you have the following prerequisites:

Conda Installation: Ensure that Conda is installed on your local system.
Conda-Pack installed in the base environment: conda-pack packages the Conda environment into a single tarball, which is then transferred to the target system.
A linux-64 platform for installing the Conda packages: Conda and pip download and install precompiled binaries that match the architecture and OS of the local environment, so the environment must be built on the same platform as the target system.

For more information, look at the documentation for Conda on HLRS HPC systems

Getting Started

Only the main and r channels are available using the conda module on the clusters. To use custom packages, we need to move the local conda environment to Hawk.

Step 1. Clone this repository to your local machine:

git clone <repository_url>

Step 2. Go into the directory and create an environment using Conda and environment.yaml.

Note: Be sure to add the necessary packages in deployment_scripts/environment.yaml:

cd deployment_scripts
./create-env.sh <your-env>

Step 3. Package the environment and transfer the archive to the target system:

(base) $ conda pack -n <your-env> -o ray_env.tar.gz # conda-pack must be installed in the base environment

A workspace is suitable to store the compressed Conda environment archive on Hawk. Proceed to the next step if you have already configured your workspace. Use the following command to create a workspace on the high-performance filesystem, which will expire in 10 days. For more information, such as how to enable reminder emails, refer to the workspace mechanism guide.

ws_allocate hpda_project 10
ws_find hpda_project # find the path to workspace, which is the destination directory in the next step

You can send your data to an existing workspace using:

scp ray_env.tar.gz <username>@hawk.hww.hlrs.de:<workspace_directory>
rm ray_env.tar.gz # We don't need the archive locally anymore.

Step 4. Clone the repository on Hawk to use the deployment scripts and project structure:

cd <workspace_directory>
git clone <repository_url>

Launch a local Ray Cluster in Interactive Mode

Using a single node interactively provides opportunities for faster code debugging.

Step 1. On the Hawk login node, start an interactive job using:

qsub -I -l select=1:node_type=rome -l walltime=01:00:00

Step 2. Go into the project directory:

cd <project_directory>/deployment_scripts

Step 3. Deploy the conda environment to the ram disk:

Change the following line by editing deploy-env.sh:

export WS_DIR=<workspace_dir>

Then, use the following command to deploy and activate the environment:

source deploy-env.sh

Note: Make sure the deployment scripts are executable. If they are not, set the permission using chmod +x.

Step 4. Initialize the Ray cluster.

You can use a Python interpreter to start a local Ray cluster:

import ray

ray.init()
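To confirm that the local cluster is up, you can list the resources Ray has detected. The following is a minimal sketch and not part of this repository: the helper names summarize_resources and report_cluster are hypothetical, and report_cluster assumes that ray.init() has already been called in the same interpreter, as shown above.

```python
def summarize_resources(resources):
    """Format a resource dict (e.g. from ray.cluster_resources()) as sorted text lines."""
    lines = [f"{name}: {amount:g}" for name, amount in sorted(resources.items())]
    return "\n".join(lines)


def report_cluster():
    """Print the resources of the running Ray cluster.

    Assumption: ray.init() was already called in this interpreter.
    """
    import ray  # imported here so the formatting helper works without Ray installed

    print(summarize_resources(ray.cluster_resources()))
```

Calling report_cluster() on the compute node prints one line per resource, such as the CPU count that Ray found on the node.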

Step 5. Connect to the dashboard.

Warning: Keep the default dashboard host 127.0.0.1 so that the Ray cluster is reachable only by you.

Note: We recommend using a dedicated Firefox profile for accessing web-based services on HLRS Compute Platforms. If you haven't created a profile, check out our guide.

You need the job id and the hostname for your current job. You can obtain this information on the login node using:

qstat -anw # get the job id and the hostname

Then, on your local computer,

export PBS_JOBID=<job-id> # e.g., 2316419.hawk-pbs5
ssh <compute-host> # e.g., r38c3t8n3

Check your SSH config in the first step if this doesn't work.

Then, launch the Firefox web browser with the configured profile and open localhost:8265 to access the Ray dashboard.

Launch a Ray Cluster in Batch Mode

Let us estimate the value of π as an example application.
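As a rough illustration of the computation, the sketch below estimates π by sampling random points in the unit square and counting how many fall inside the quarter circle; the ratio approaches π/4. It is plain Python and is not the code in src/monte-carlo-pi.py. The function names count_inside and estimate_pi are hypothetical. In a Ray version, a function like count_inside would typically be declared as a remote task so that the chunks run in parallel across the cluster.

```python
import random


def count_inside(num_samples, seed):
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # one independent, reproducible stream per chunk
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside


def estimate_pi(num_tasks, samples_per_task):
    """Combine the per-chunk counts: pi is about 4 * inside / total."""
    inside = sum(count_inside(samples_per_task, seed) for seed in range(num_tasks))
    return 4.0 * inside / (num_tasks * samples_per_task)


if __name__ == "__main__":
    print(estimate_pi(num_tasks=4, samples_per_task=50_000))
```

Splitting the work into independent chunks with their own seeds mirrors how the sampling could be distributed: each chunk returns only a count, so very little data moves between nodes.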

Step 1. Add execution permissions to start-ray-worker.sh:

cd deployment_scripts
chmod +x start-ray-worker.sh

Step 2. Submit a job to launch the head and worker nodes.

You must modify the following lines in submit-ray-job.sh:

Line 3 changes the cluster size. The default configuration launches a 3-node cluster.
export WS_DIR=<workspace_dir> - set the correct workspace directory.
export PROJECT_DIR=$WS_DIR/<project_name> - set the correct project directory.

Note: The job script src/monte-carlo-pi.py waits for all nodes in the Ray cluster to become available. Preserve this pattern in your Python code when using a multi-node Ray cluster.
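The waiting pattern mentioned in the note can be sketched as a polling loop. This is one possible approach under stated assumptions, not necessarily how src/monte-carlo-pi.py implements it. The function wait_for_nodes and its parameters are hypothetical; by default it relies on ray.nodes(), which returns one dictionary per node with an "Alive" entry, and it assumes your script has already connected to the cluster with ray.init().

```python
import time


def wait_for_nodes(expected_nodes, list_nodes=None, timeout_s=300.0, poll_s=5.0):
    """Block until `expected_nodes` nodes are alive; raise TimeoutError otherwise.

    `list_nodes` is a callable returning node dicts with an "Alive" key.
    It defaults to ray.nodes (assumes ray.init() was already called).
    """
    if list_nodes is None:
        import ray  # imported lazily so the helper can be tried without a cluster

        list_nodes = ray.nodes
    deadline = time.monotonic() + timeout_s
    while True:
        alive = sum(1 for node in list_nodes() if node.get("Alive"))
        if alive >= expected_nodes:
            return alive
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {alive} of {expected_nodes} nodes became available"
            )
        time.sleep(poll_s)
```

For the default 3-node configuration, the script would call wait_for_nodes(3) before submitting any tasks. The timeout keeps a job from spending its whole walltime waiting for a worker that failed to start.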

Launch the job and monitor the progress. As the job starts, its status (S) shifts from Q (Queued) to R (Running). Upon completion, the job will no longer appear in the qstat -a display.

qsub submit-ray-job.pbs
qstat -anw # Q: Queued, R: Running, E: Ending
ls -l # list files after the job finishes
cat ray-job.o... # inspect the output file
cat ray-job.e... # inspect the error file

If you need to delete the job, use qdel <job-id>. If this doesn't work, add the -W force option: qdel -W force <job-id>.