# Ray: How to launch a Ray Cluster on Hawk?

This guide shows you how to launch a Ray cluster on HLRS' Hawk system.

## Table of Contents
- [Ray: How to launch a Ray Cluster on Hawk?](#ray-how-to-launch-a-ray-cluster-on-hawk)
  - [Table of Contents](#table-of-contents)
  - [Prerequisites](#prerequisites)
  - [Getting Started](#getting-started)
  - [Launch a local Ray Cluster in Interactive Mode](#launch-a-local-ray-cluster-in-interactive-mode)
  - [Launch a Ray Cluster in Batch Mode](#launch-a-ray-cluster-in-batch-mode)

## Prerequisites

Before building the environment, make sure you have the following prerequisites:
- [Conda Installation](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html): Ensure that Conda is installed on your local system.
- [Conda-Pack](https://conda.github.io/conda-pack/) installed in the base environment: Conda pack is used to package the Conda environment into a single tarball. This is used to transfer the environment to the target system.
- A `linux-64` platform for building the environment: Conda and pip download and install precompiled binaries that match the architecture and OS of the machine where the environment is created.

For more information, see the documentation for [Conda on HLRS HPC systems](https://kb.hlrs.de/platforms/index.php/How_to_move_local_conda_environments_to_the_clusters).

## Getting Started

Only the `main` and `r` channels are available through the Conda module on the clusters. To use custom packages, you need to build the Conda environment locally and move it to Hawk.

**Step 1.** Clone this repository to your local machine:

```bash
git clone <repository_url>
```

**Step 2.** Change into the `deployment_scripts` directory and create a Conda environment from `environment.yaml`.

Note: Add the packages you need to `deployment_scripts/environment.yaml` before running the script:

```bash
cd deployment_scripts
./create-env.sh <your-env>
```

**Step 3.** Package the environment and transfer the archive to the target system:

```bash
(base) $ conda pack -n <your-env> -o ray_env.tar.gz # conda-pack must be installed in the base environment
```

A workspace is a suitable place to store the compressed Conda environment archive on Hawk. If you have already configured a workspace, skip the allocation and continue with the transfer. Otherwise, run the following commands on Hawk to create a workspace on the high-performance filesystem that expires in 10 days. For more information, such as how to enable reminder emails, refer to the [workspace mechanism](https://kb.hlrs.de/platforms/index.php/Workspace_mechanism) guide.

```bash
ws_allocate hpda_project 10
ws_find hpda_project # find the path to workspace, which is the destination directory in the next step
```

From your local machine, copy the archive to the workspace:

```bash
scp ray_env.tar.gz <username>@hawk.hww.hlrs.de:<workspace_directory>
rm ray_env.tar.gz # We don't need the archive locally anymore.
```

**Step 4.** Clone the repository on Hawk to use the deployment scripts and project structure:

```bash
cd <workspace_directory>
git clone <repository_url>
```

## Launch a local Ray Cluster in Interactive Mode

Using a single node interactively provides opportunities for faster code debugging.

**Step 1.** On the Hawk login node, start an interactive job using:

```bash
qsub -I -l select=1:node_type=rome -l walltime=01:00:00
```

**Step 2.** Go into the project directory:

```bash
cd <project_directory>/deployment_scripts
```

**Step 3.** Deploy the Conda environment to the RAM disk:

Edit `deploy-env.sh` and set your workspace directory in the following line:

```bash
export WS_DIR=<workspace_dir>
```

Then, use the following command to deploy and activate the environment:

```bash
source deploy-env.sh
```
Note: Make sure the deployment scripts are executable. If they are not, add the permission with `chmod +x`.

**Step 4.** Initialize the Ray cluster.

You can use a Python interpreter to start a local Ray cluster:

```python
import ray

ray.init()
```

**Step 5.** Connect to the dashboard.

Warning: Keep the default dashboard host `127.0.0.1` so that the Ray cluster stays reachable only by you.

Note: We recommend using a dedicated Firefox profile for accessing web-based services on HLRS Compute Platforms. If you haven't created a profile, check out our [guide](https://kb.hlrs.de/platforms/index.php/How_to_use_Web_Based_Services_on_HLRS_Compute_Platforms).

You need the job id and the hostname for your current job. You can obtain this information on the login node using:

```bash
qstat -anw # get the job id and the hostname
```

Then, on your local computer, run:

```bash
export PBS_JOBID=<job-id> # e.g., 2316419.hawk-pbs5
ssh <compute-host> # e.g., r38c3t8n3
```

If this doesn't work, check your SSH configuration (the first step of the guide linked above).

Then, launch the Firefox web browser with the configured profile and open `localhost:8265` to access the Ray dashboard.
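If the dashboard page does not load, it can help to check whether anything is listening on the forwarded port before looking at the browser profile. The following is a minimal sketch using only the Python standard library. It assumes the dashboard is reachable at `localhost:8265` on your local computer, as described above, and `port_is_open` is a hypothetical helper name, not part of this repository.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 8265, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

Calling `port_is_open()` on your local computer returns `False` when nothing accepts connections on port 8265, which points to the SSH connection rather than the browser.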

## Launch a Ray Cluster in Batch Mode

**Step 1.** Add execute permissions to `start-ray-worker.sh`:

```bash
cd deployment_scripts
chmod +x start-ray-worker.sh
```

**Step 2.** Submit a job to launch the head and worker nodes.

You must modify the following lines in `submit-ray-job.sh`:
- Line 3 sets the cluster size. The default configuration launches a 3-node cluster.
- `export WS_DIR=<workspace_dir>` - set the correct workspace directory.
- `export PROJECT_DIR=$WS_DIR/<project_name>` - set the correct project directory.

Note: The job script `src/monte-carlo-pi.py` waits for all nodes in the Ray cluster to become available. Keep this pattern in your own Python code when using a multi-node Ray cluster.
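The waiting pattern could look roughly like the sketch below. This is not the code from `src/monte-carlo-pi.py`; it is an illustration under stated assumptions. The helper names `wait_for_nodes` and `wait_for_ray_cluster`, the timeout, and the polling interval are hypothetical. The sketch relies on `ray.nodes()` returning one dictionary per node with an `"Alive"` flag, and on `ray.init(address="auto")` attaching to a cluster that is already running.

```python
import time
from typing import Callable, Iterable, Mapping


def wait_for_nodes(
    expected_nodes: int,
    list_nodes: Callable[[], Iterable[Mapping]],
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
) -> int:
    """Block until at least `expected_nodes` nodes report as alive.

    `list_nodes` is any callable returning node records shaped like the
    dictionaries from `ray.nodes()`, which carry an "Alive" flag.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        alive = sum(1 for node in list_nodes() if node.get("Alive"))
        if alive >= expected_nodes:
            return alive
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {alive} of {expected_nodes} nodes joined within {timeout_s} s"
            )
        time.sleep(poll_s)


def wait_for_ray_cluster(expected_nodes: int = 3) -> int:
    """Attach to the running cluster and wait for its nodes (hypothetical helper)."""
    import ray  # imported lazily so wait_for_nodes has no Ray dependency

    ray.init(address="auto")  # connect to the existing cluster instead of starting one
    return wait_for_nodes(expected_nodes, ray.nodes)
```

Calling `wait_for_ray_cluster(3)` at the top of a job script would match the default 3-node configuration. Raising a `TimeoutError` instead of waiting forever keeps a job from consuming its whole walltime if a worker never joins.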

Launch the job and monitor the progress. As the job starts, its status (S) shifts from Q (Queued) to R (Running). Upon completion, the job will no longer appear in the `qstat -a` display.

```bash
qsub submit-ray-job.pbs
qstat -anw # Q: Queued, R: Running, E: Ending
ls -l # list files after the job finishes
cat ray-job.o... # inspect the output file
cat ray-job.e... # inspect the error file
```