# Ray: How to launch a Ray Cluster on Hawk?
This guide shows you how to launch a Ray cluster on HLRS' Hawk system.
## Table of Contents
- [Ray: How to launch a Ray Cluster on Hawk?](#ray-how-to-launch-a-ray-cluster-on-hawk)
- [Table of Contents](#table-of-contents)
- [Getting Started](#getting-started)
- [Launch a local Ray Cluster in Interactive Mode](#launch-a-local-ray-cluster-in-interactive-mode)
- [Launch a Ray Cluster in Batch Mode](#launch-a-ray-cluster-in-batch-mode)
## Getting Started
**Step 1.** Build and transfer the Conda environment to Hawk:

Only the `main` and `r` channels are available using the Conda module on the clusters. To use custom packages, we need to move the local Conda environment to Hawk.

Follow the instructions in the Conda environment builder repository, which includes a YAML file for building a test environment to run Ray workflows.
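If you build the environment locally, one possible transfer workflow uses `conda-pack` (a sketch only; the environment name `ray-env`, the login address, and the destination path are assumptions to adapt to your setup):

```bash
# On your local machine: pack the Conda environment into a portable archive.
# Assumes conda-pack is installed and the environment is named ray-env.
conda pack -n ray-env -o ray-env.tar.gz

# Copy the archive to your workspace on Hawk (adjust username and path).
scp ray-env.tar.gz <username>@hawk.hlrs.de:<workspace_directory>/
```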
**Step 2.** Allocate a workspace on Hawk:

Proceed to the next step if you have already configured your workspace. Use the following command to create a workspace on the high-performance filesystem, which will expire in 10 days. For more information, such as how to enable reminder emails, refer to the [workspace mechanism](https://kb.hlrs.de/platforms/index.php/Workspace_mechanism) guide.
```bash
ws_allocate hpda_project 10 # allocate a workspace named hpda_project for 10 days
ws_find hpda_project # find the path to workspace, which is the destination directory in the next step
```
**Step 3.** Clone the repository on Hawk to use the deployment scripts and project structure:
```bash
cd <workspace_directory>
git clone <repository_url>
```
## Launch a local Ray Cluster in Interactive Mode
Working interactively on a single node allows for faster code debugging.

**Step 1.** On the Hawk login node, start an interactive job using:
```bash
qsub -I -l select=1:node_type=rome -l walltime=01:00:00 # interactive job: one Rome node for one hour
```
**Step 2.** Go into the project directory:
```bash
cd <project_directory>/deployment_scripts
```
**Step 3.** Deploy the Conda environment to the RAM disk:

Change the following line by editing `deploy-env.sh`:

```bash
export WS_DIR=<workspace_dir>
```
Then, use the following command to deploy and activate the environment:
```bash
source deploy-env.sh
```
Note: Make sure the script is executable; if needed, set permissions with `chmod +x deploy-env.sh`.

**Step 4.** Initialize the Ray cluster.

You can use a Python interpreter to start a local Ray cluster:
```python
import ray

ray.init()
```
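To check that the cluster came up, you can inspect its resources and run a small task. This is a minimal sketch; the exact resource counts depend on your node:

```python
import ray

ray.init(ignore_reinit_error=True)  # no-op if Ray is already running in this session

# Show the CPUs and memory that Ray detected on this node
print(ray.cluster_resources())

# Run a trivial remote task to confirm the cluster executes work
@ray.remote
def square(x):
    return x * x

print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]
```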
**Step 5.** Connect to the dashboard.

Warning: Do not change the default dashboard host `127.0.0.1`; this keeps the Ray cluster reachable only by you.

Note: We recommend using a dedicated Firefox profile for accessing web-based services on HLRS Compute Platforms. If you haven't created a profile, check out our [guide](https://kb.hlrs.de/platforms/index.php/How_to_use_Web_Based_Services_on_HLRS_Compute_Platforms).

You need the job ID and the hostname of your current job. You can obtain this information on the login node using:
```bash
qstat -anw # get the job id and the hostname
```
Then, on your local computer,
```bash
export PBS_JOBID=<job-id> # e.g., 2316419.hawk-pbs5
ssh <compute-host> # e.g., r38c3t8n3
```
If this doesn't work, check your SSH configuration (see the guide linked above).
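If your SSH setup does not already forward the dashboard port, you can forward it explicitly instead. This is a sketch; `<username>` and the login address `hawk.hlrs.de` are assumptions to adapt:

```bash
# Forward local port 8265 to the dashboard on the compute node,
# jumping through the Hawk login node.
ssh -L 8265:localhost:8265 -J <username>@hawk.hlrs.de <compute-host>
```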
Then, launch the Firefox web browser using the configured profile and open `localhost:8265` to access the Ray dashboard.
## Launch a Ray Cluster in Batch Mode
Let us [estimate the value of π](https://docs.ray.io/en/releases-2.8.0/ray-core/examples/monte_carlo_pi.html) as an example application.
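For orientation, here is a minimal sketch of such an estimator built on Ray tasks; the repository's `src/monte-carlo-pi.py` may differ in its details:

```python
import random

import ray

ray.init()  # or ray.init(address="auto") to join an already running cluster

@ray.remote
def count_hits(num_samples: int) -> int:
    # Count random points in the unit square that land inside the quarter circle
    hits = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

num_tasks = 8
samples_per_task = 1_000_000
counts = ray.get([count_hits.remote(samples_per_task) for _ in range(num_tasks)])
print("Estimated pi:", 4 * sum(counts) / (num_tasks * samples_per_task))
```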
**Step 1.** Add execution permissions to `start-ray-worker.sh`:
```bash
cd deployment_scripts
chmod +x start-ray-worker.sh
```
**Step 2.** Submit a job to launch the head and worker nodes.

You must modify the following lines in `submit-ray-job.sh`:
- Line 3 changes the cluster size. The default configuration launches a 3-node cluster (see the sketch after this list).
- `export WS_DIR=<workspace_dir>` - set the correct workspace directory.
- `export PROJECT_DIR=$WS_DIR/<project_name>` - set the correct project directory.
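
For reference, the lines in question might look like the following sketch; the actual contents of the script may differ:

```bash
#PBS -l select=3:node_type=rome # line 3: number of nodes, i.e., the cluster size

export WS_DIR=<workspace_dir>
export PROJECT_DIR=$WS_DIR/<project_name>
```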
Note: The job script `src/monte-carlo-pi.py` waits for all nodes in the Ray cluster to become available. Preserve this pattern in your Python code when using a multi-node Ray cluster.
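A sketch of such a wait loop is shown below; `expected_nodes` and the polling interval are assumptions, and the script's actual implementation may differ:

```python
import time

import ray

ray.init(address="auto")  # connect to the cluster started by the job script

expected_nodes = 3  # match the cluster size requested in the job script
while len([node for node in ray.nodes() if node["Alive"]]) < expected_nodes:
    time.sleep(5)  # poll until every node has joined the cluster
```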
Launch the job and monitor its progress. As the job starts, its status (S) changes from Q (Queued) to R (Running). Upon completion, the job will no longer appear in the `qstat -a` output.
```bash
qsub submit-ray-job.pbs
qstat -anw # Q: Queued, R: Running, E: Ending
ls -l # list files after the job finishes
cat ray-job.o... # inspect the output file
cat ray-job.e... # inspect the error file
```
If you need to delete the job, use `qdel <job-id>`. If this doesn't work, use the `-W force` option: `qdel -W force <job-id>`.