Dask: How to execute Python workloads using a Dask cluster on Vulcan

This repository covers the deployment of a Dask cluster on Vulcan and shows how to execute your programs on that cluster.

Table of Contents

1. Getting Started
2. Usage
3. Notes

Getting Started

1. Build and transfer the Conda environment to Vulcan:

Only the main and r channels are available through the Conda module on the clusters. To use custom packages, you need to build the Conda environment locally and transfer it to Vulcan.

Follow the instructions in the Conda environment builder repository, which includes a YAML file for building a test environment.

2. Allocate workspace on Vulcan:

Proceed to the next step if you have already configured your workspace. Otherwise, use the following command to create a workspace on the high-performance filesystem; it will expire in 10 days. For more information, such as how to enable reminder emails, refer to the workspace mechanism guide.

ws_allocate dask_workspace 10
ws_find dask_workspace # find the path to workspace, which is the destination directory in the next step

3. Clone the repository on Vulcan to use the deployment scripts and project structure:

cd <workspace_directory>
git clone <repository_url>

4. Send all the code to the appropriate directory on Vulcan using scp:

scp <your_script>.py <destination_host>:<destination_directory>

5. SSH into Vulcan and start a job interactively using:

qsub -I -N DaskJob -l select=1:node_type=clx-21 -l walltime=02:00:00

Note: For multiple nodes, it is recommended to write a .pbs script and submit it using qsub. See the Multiple Nodes section for more information.

6. Change into the directory containing your code:

cd <destination_directory>

7. Initialize the Dask cluster:

source deploy-dask.sh "$(pwd)"

Note: At the moment, the deployment is verbose, and there is no option to silence the logs (see the Notes section for reducing client-side log output).
Note: Make sure all scripts are executable by setting their permissions with chmod +x.

Usage

Single Node

To run the application interactively on a single node, follow steps 4, 5, 6, and 7 from Getting Started, then execute the following commands once the job has started:

# Load the Conda module
module load bigdata/conda
source activate # activates the base environment

# List available Conda environments for verification purposes
conda env list

# Activate a specific Conda environment.
conda activate dask_environment # you need to execute `source activate` first, or use `source [ENV_PATH]/bin/activate`

After the environment is activated, you can run the Python interpreter:

python

Or to run a full script:

python <your-script>.py
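
The script itself needs to connect to the Dask cluster started by deploy-dask.sh. The sketch below is a minimal, hypothetical example: the scheduler address placeholder and the default port 8786 are assumptions, so take the actual address from the output of deploy-dask.sh and substitute it.

# Minimal example script (hypothetical): connect to the running Dask cluster and run a small computation
from dask.distributed import Client
import dask.array as da

# <scheduler_address> is a placeholder; use the address printed by deploy-dask.sh (8786 is Dask's default scheduler port)
client = Client("tcp://<scheduler_address>:8786", silence_logs="error")
print(client)  # shows the connected scheduler and workers

# A small distributed computation as a sanity check
x = da.random.random((10000, 10000), chunks=(1000, 1000))
print(x.mean().compute())

client.close()

The same script works unchanged for the multi-node deployment described below, since it only talks to the scheduler.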

Multiple Nodes

To run the application on multiple nodes, you need to write a .pbs script and submit it using qsub. Follow steps 1-4 from the Getting Started section, then write a submit-dask-job.pbs script:

#!/bin/bash
#PBS -N dask-job
#PBS -l select=3:node_type=rome
#PBS -l walltime=1:00:00

# Go to the directory where the code is
cd <destination_directory>

# Load the Conda module and activate the environment (as in the Single Node section)
module load bigdata/conda
source activate
conda activate dask_environment

# Deploy the Dask cluster
source deploy-dask.sh "$(pwd)"

# Run the Python script
python <your-script>.py

A more thorough example is available in the deployment_scripts directory under submit-dask-job.pbs.

Then execute the following commands to submit the job and inspect its output:

qsub submit-dask-job.pbs
qstat -anw # Q: Queued, R: Running, E: Ending
ls -l # list files after the job finishes
cat dask-job.o... # inspect the output file
cat dask-job.e... # inspect the error file
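
Because workers on the other nodes may take a moment to register with the scheduler after the job starts, it can help to make your script wait for them. The snippet below is a hedged sketch: the scheduler address is again a placeholder, and the number of workers to wait for depends on how deploy-dask.sh distributes workers over the allocated nodes.

# Hypothetical snippet: block until the expected number of workers has joined
from dask.distributed import Client

client = Client("tcp://<scheduler_address>:8786", silence_logs="error")
client.wait_for_workers(n_workers=3)  # placeholder count; adjust to your deployment
print(list(client.scheduler_info()["workers"]))  # addresses of the connected workers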

Notes

Note: The Dask cluster is set to verbose. To reduce the log output, add the following to your code when connecting to the Dask cluster:

client = Client(..., silence_logs='error')

Note: Replace all placeholders within <> with the actual values applicable to your project.