# Ray: How to launch a Ray Cluster on Hawk?
This guide shows you how to launch a Ray cluster on HLRS' Hawk system.
## Table of Contents
- [Ray: How to launch a Ray Cluster on Hawk?](#ray-how-to-launch-a-ray-cluster-on-hawk)
- [Table of Contents](#table-of-contents)
- [Getting Started](#getting-started)
- [Launch a local Ray Cluster in Interactive Mode](#launch-a-local-ray-cluster-in-interactive-mode)
- [Launch a Ray Cluster in Batch Mode](#launch-a-ray-cluster-in-batch-mode)
## Getting Started
**Step 1.** Build and transfer the Conda environment to Hawk:
Only the `main` and `r` channels are available through the Conda module on the clusters. To use custom packages, build the Conda environment locally and transfer it to Hawk.
Follow the instructions in the Conda environment builder repository, which includes a YAML file for building a test environment to run Ray workflows.
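As a rough sketch of this step, assuming the environment is packed with `conda-pack` (the environment name, YAML file name, and destination path below are placeholders; follow the builder repository for the exact procedure):
```bash
# Build the environment locally from the provided YAML file, then pack it.
conda env create -f environment.yaml -n ray_env
conda pack -n ray_env -o ray_env.tar.gz

# Transfer the packed environment to your workspace on Hawk.
scp ray_env.tar.gz <username>@hawk.hww.hlrs.de:<workspace_directory>/
```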
**Step 2.** Allocate workspace on Hawk:
Proceed to the next step if you have already configured your workspace. Otherwise, use the following command to create a workspace on the high-performance filesystem, which will expire in 10 days. For more information, such as how to enable reminder emails, refer to the [workspace mechanism](https://kb.hlrs.de/platforms/index.php/Workspace_mechanism) guide.
```bash
ws_allocate hpda_project 10
ws_find hpda_project # find the path to workspace, which is the destination directory in the next step
```
**Step 3.** Clone the repository on Hawk to use the deployment scripts and project structure:
```bash
cd <workspace_directory>
git clone <repository_url>
```
## Launch a local Ray Cluster in Interactive Mode
Working interactively on a single node allows for faster code debugging.
**Step 1.** On the Hawk login node, start an interactive job using:
```bash
qsub -I -l select=1:node_type=rome -l walltime=01:00:00
```
**Step 2.** Go into the project directory:
```bash
cd <project_directory>/deployment_scripts
```
**Step 3.** Deploy the Conda environment to the RAM disk:
Edit `deploy-env.sh` and change the following line:
```bash
export WS_DIR=<workspace_dir>
```
Then, use the following command to deploy and activate the environment:
```bash
source deploy-env.sh
```
Note: Make sure the deployment scripts are executable; set permissions with `chmod +x` if needed.
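As an optional sanity check after sourcing the script, you can confirm that the deployed environment is the one being picked up (assuming it provides Ray):
```bash
which python                                     # should resolve to the environment on the RAM disk
python -c "import ray; print(ray.__version__)"   # confirms Ray is importable
```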
**Step 4.** Initialize the Ray cluster.
You can use a Python interpreter to start a local Ray cluster:
```python
import ray
ray.init()
```
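To verify that the local cluster is up, you can, for example, inspect the available resources and run a small remote task in the same session (a quick sketch, not part of the repository's scripts):
```python
import ray

# The cluster was already started with ray.init() above.
print(ray.cluster_resources())  # CPUs and memory of the single interactive node

@ray.remote
def square(x):
    return x * x

# Run a few tasks on the local cluster and collect the results.
print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]
```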
**Step 5.** Connect to the dashboard.
Warning: Do not change the default dashboard host `127.0.0.1`; this keeps the Ray dashboard reachable only by you.
Note: We recommend using a dedicated Firefox profile for accessing web-based services on HLRS Compute Platforms. If you haven't created a profile, check out our [guide](https://kb.hlrs.de/platforms/index.php/How_to_use_Web_Based_Services_on_HLRS_Compute_Platforms).
You need the job ID and the hostname of your current job. You can obtain this information on the login node using:
```bash
qstat -anw # get the job id and the hostname
```
Then, on your local computer, run:
```bash
export PBS_JOBID=<job-id> # e.g., 2316419.hawk-pbs5
ssh <compute-host> # e.g., r38c3t8n3
```
If this doesn't work, check your SSH configuration from the first step.
Then, launch Firefox using the configured profile and open `localhost:8265` to access the Ray dashboard.
## Launch a Ray Cluster in Batch Mode
Let us [estimate the value of π](https://docs.ray.io/en/releases-2.8.0/ray-core/examples/monte_carlo_pi.html) as an example application.
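For orientation, here is the idea behind the example in a few lines (a simplified sketch of the linked Ray tutorial, not the exact code in `src/monte-carlo-pi.py`): sample random points in the unit square and count how many fall inside the quarter circle. It runs inside the batch job, connecting to the cluster started by the job script.
```python
import random

import ray

ray.init(address="auto")  # connect to the Ray cluster launched by the job script

@ray.remote
def count_inside(num_samples: int) -> int:
    # Count random points of the unit square that land inside the quarter circle.
    return sum(
        1 for _ in range(num_samples)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )

samples_per_task, num_tasks = 1_000_000, 8
inside = sum(ray.get([count_inside.remote(samples_per_task) for _ in range(num_tasks)]))
print("pi ~=", 4 * inside / (samples_per_task * num_tasks))
```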
**Step 1.** Add execution permissions to `start-ray-worker.sh`:
```bash
cd deployment_scripts
chmod +x start-ray-worker.sh
```
**Step 2.** Submit a job to launch the head and worker nodes.
You must modify the following lines in `submit-ray-job.sh`:
- Line 3 sets the cluster size; the default configuration launches a 3-node cluster.
- `export WS_DIR=<workspace_dir>` - set the correct workspace directory.
- `export PROJECT_DIR=$WS_DIR/<project_name>` - set the correct project directory.
Note: The job script `src/monte-carlo-pi.py` waits for all nodes in the Ray cluster to become available. Preserve this pattern in your Python code when using a multi-node Ray cluster.
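The wait pattern looks roughly like this (a sketch, not the exact contents of `src/monte-carlo-pi.py`; the environment variable holding the expected node count is hypothetical):
```python
import os
import time

import ray

ray.init(address="auto")  # connect to the Ray cluster started by the job script

# Block until every node launched by the job script has joined the cluster.
expected_nodes = int(os.environ.get("NUM_NODES", "3"))  # hypothetical variable name
while sum(1 for node in ray.nodes() if node["Alive"]) < expected_nodes:
    time.sleep(5)

print("Cluster ready:", ray.cluster_resources())
```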
Launch the job and monitor its progress. As the job starts, its status (S) changes from Q (Queued) to R (Running). Upon completion, the job will no longer appear in the `qstat -a` output.
```bash
qsub submit-ray-job.pbs
qstat -anw # Q: Queued, R: Running, E: Ending
ls -l # list files after the job finishes
cat ray-job.o... # inspect the output file
cat ray-job.e... # inspect the error file
```
If you need to delete the job, use `qdel <job-id>`. If this doesn't work, use the `-W force` option: `qdel -W force <job-id>`.