# Ray: How to launch a Ray Cluster on Hawk?

This guide shows you how to launch a Ray cluster on HLRS' Hawk system.

## Table of Contents

- [Ray: How to launch a Ray Cluster on Hawk?](#ray-how-to-launch-a-ray-cluster-on-hawk)
  - [Table of Contents](#table-of-contents)
  - [Prerequisites](#prerequisites)
  - [Getting Started](#getting-started)
  - [Launch a local Ray Cluster in Interactive Mode](#launch-a-local-ray-cluster-in-interactive-mode)
  - [Launch a Ray Cluster in Batch Mode](#launch-a-ray-cluster-in-batch-mode)

## Prerequisites

Before building the environment, make sure you have the following prerequisites:

- [Conda Installation](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html): Ensure that Conda is installed on your local system.
- [Conda-Pack](https://conda.github.io/conda-pack/) installed in the base environment: Conda pack is used to package the Conda environment into a single tarball. This is used to transfer the environment to the target system.
- `linux-64` platform for installing the Conda packages because Conda/pip downloads and installs precompiled binaries suitable to the architecture and OS of the local environment.

For more information, look at the documentation for [Conda on HLRS HPC systems](https://kb.hlrs.de/platforms/index.php/How_to_move_local_conda_environments_to_the_clusters)

## Getting Started

Only the main and r channels are available using the conda module on the clusters. To use custom packages, we need to move the local conda environment to Hawk.

**Step 1.** Clone this repository to your local machine:

```bash
git clone
```

**Step 2.** Go into the directory and create an environment using Conda and environment.yaml.
Note: Be sure to add the necessary packages in `deployment_scripts/environment.yaml`:

```bash
cd deployment_scripts
./create-env.sh
```

**Step 3.** Package the environment and transfer the archive to the target system:

```bash
(base) $ conda pack -n -o ray_env.tar.gz # conda-pack must be installed in the base environment
```

A workspace is suitable to store the compressed Conda environment archive on Hawk. Proceed to the next step if you have already configured your workspace. Use the following command to create a workspace on the high-performance filesystem, which will expire in 10 days. For more information, such as how to enable reminder emails, refer to the [workspace mechanism](https://kb.hlrs.de/platforms/index.php/Workspace_mechanism) guide.

```bash
ws_allocate hpda_project 10
ws_find hpda_project # find the path to workspace, which is the destination directory in the next step
```

You can send your data to an existing workspace using:

```bash
scp ray_env.tar.gz @hawk.hww.hlrs.de:
rm ray_env.tar.gz # We don't need the archive locally anymore.
```

**Step 4.** Clone the repository on Hawk to use the deployment scripts and project structure:

```bash
cd
git clone
```

## Launch a local Ray Cluster in Interactive Mode

Using a single node interactively provides opportunities for faster code debugging.

**Step 1.** On the Hawk login node, start an interactive job using:

```bash
qsub -I -l select=1:node_type=rome -l walltime=01:00:00
```

**Step 2.** Go into the project directory:

```bash
cd /deployment_scripts
```

**Step 3.** Deploy the conda environment to the ram disk:

Change the following line by editing `deploy-env.sh`:

```bash
export WS_DIR=
```

Then, use the following command to deploy and activate the environment:

```bash
source deploy-env.sh
```

Note: Make sure all permissions are set using `chmod +x`.

**Step 4.** Initialize the Ray cluster.
You can use a Python interpreter to start a local Ray cluster:

```python
import ray

ray.init()
```

**Step 5.** Connect to the dashboard.

Warning: Keep the default dashboard host `127.0.0.1` so that the Ray cluster remains reachable only by you.

Note: We recommend using a dedicated Firefox profile for accessing web-based services on HLRS Compute Platforms. If you haven't created a profile, check out our [guide](https://kb.hlrs.de/platforms/index.php/How_to_use_Web_Based_Services_on_HLRS_Compute_Platforms).

You need the job id and the hostname for your current job. You can obtain this information on the login node using:

```bash
qstat -anw # get the job id and the hostname
```

Then, on your local computer,

```bash
export PBS_JOBID= # e.g., 2316419.hawk-pbs5
ssh # e.g., r38c3t8n3
```

Check your SSH config in the first step if this doesn't work.

Then, launch the Firefox web browser using the configured profile. Open `localhost:8265` to access the Ray dashboard.

## Launch a Ray Cluster in Batch Mode

Let us [estimate the value of π](https://docs.ray.io/en/releases-2.8.0/ray-core/examples/monte_carlo_pi.html) as an example application.

**Step 1.** Add execution permissions to `start-ray-worker.sh`:

```bash
cd deployment_scripts
chmod +x start-ray-worker.sh
```

**Step 2.** Submit a job to launch the head and worker nodes.

You must modify the following lines in `submit-ray-job.sh`:

- Line 3 changes the cluster size. The default configuration launches a 3 node cluster.
- `export WS_DIR=` - set the correct workspace directory.
- `export PROJECT_DIR=$WS_DIR/` - set the correct project directory.

Note: The job script `src/monte-carlo-pi.py` waits for all nodes in the Ray cluster to become available. Preserve this pattern in your Python code when using a multi-node Ray cluster.

Launch the job and monitor the progress. As the job starts, its status (S) shifts from Q (Queued) to R (Running). Upon completion, the job will no longer appear in the `qstat -a` display.
```bash
qsub submit-ray-job.pbs
qstat -anw # Q: Queued, R: Running, E: Ending
ls -l # list files after the job finishes
cat ray-job.o... # inspect the output file
cat ray-job.e... # inspect the error file
```

If you need to delete the job, use `qdel `. If this doesn't work, use the `-W force` option: `qdel -W force `
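
The note in Step 2 asks you to wait for all nodes before submitting work to a multi-node cluster. The following is a minimal sketch of one way to do that. It is not the content of `src/monte-carlo-pi.py`; the names `count_alive`, `wait_for_nodes`, and the `list_nodes` parameter are hypothetical and chosen for illustration. It assumes that `ray.nodes()` returns one dictionary per node with an `Alive` flag, and it imports Ray lazily so that the helper can be exercised without a running cluster.

```python
import time
from typing import Callable, Dict, List, Optional


def count_alive(nodes: List[Dict]) -> int:
    """Count the node entries whose 'Alive' flag is truthy."""
    return sum(1 for node in nodes if node.get("Alive"))


def wait_for_nodes(
    expected: int,
    list_nodes: Optional[Callable[[], List[Dict]]] = None,
    timeout_s: float = 300.0,
    poll_s: float = 2.0,
) -> int:
    """Block until at least `expected` nodes are alive, or raise TimeoutError.

    `list_nodes` is a hypothetical hook for testing; by default the helper
    polls `ray.nodes()`, which requires an initialized Ray session.
    """
    if list_nodes is None:
        import ray  # imported lazily so the helper is testable without Ray

        list_nodes = ray.nodes

    deadline = time.monotonic() + timeout_s
    while True:
        alive = count_alive(list_nodes())
        if alive >= expected:
            return alive
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {alive} of {expected} nodes became available"
            )
        time.sleep(poll_s)
```

In a batch job, you would typically connect to the running cluster first, for example with `ray.init(address="auto")`, and then call `wait_for_nodes(3)` for the default 3 node configuration before launching any remote tasks. The timeout keeps a job from idling until the walltime limit if a worker never joins.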