Merge pull request 'use-conda-env-builder' (#2) from use-conda-env-builder into main
Reviewed-on: hpckkaya/ray_template#2
commit ac99d53c62
7 changed files with 12 additions and 250 deletions
README.md (44 changed lines)
@@ -5,60 +5,28 @@ This guide shows you how to launch a Ray cluster on HLRS' Hawk system.

## Table of Contents

- [Ray: How to launch a Ray Cluster on Hawk?](#ray-how-to-launch-a-ray-cluster-on-hawk)
  - [Table of Contents](#table-of-contents)
-  - [Prerequisites](#prerequisites)
  - [Getting Started](#getting-started)
  - [Launch a local Ray Cluster in Interactive Mode](#launch-a-local-ray-cluster-in-interactive-mode)
  - [Launch a Ray Cluster in Batch Mode](#launch-a-ray-cluster-in-batch-mode)

-## Prerequisites
-
-Before building the environment, make sure you have the following prerequisites:
-
-- [Conda Installation](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html): Ensure that Conda is installed on your local system.
-- [Conda-Pack](https://conda.github.io/conda-pack/) installed in the base environment: Conda-Pack packages the Conda environment into a single tarball, which is then transferred to the target system.
-- `linux-64` platform for installing the Conda packages, because Conda/pip download and install precompiled binaries that match the architecture and OS of the local environment.
-
-For more information, see the documentation on [Conda on HLRS HPC systems](https://kb.hlrs.de/platforms/index.php/How_to_move_local_conda_environments_to_the_clusters).
-
## Getting Started

-Only the main and r channels are available using the conda module on the clusters. To use custom packages, we need to move the local conda environment to Hawk.
+**Step 1.** Build and transfer the Conda environment to Hawk:

-**Step 1.** Clone this repository to your local machine:
+Only the main and r channels are available using the Conda module on the clusters. To use custom packages, we need to move the local Conda environment to Hawk.

-```bash
-git clone <repository_url>
-```
+Follow the instructions in [the Conda environment builder repository](https://code.hlrs.de/SiVeGCS/conda-env-builder), which includes a YAML file for building a test environment to run Ray workflows.

-**Step 2.** Go into the directory and create an environment using Conda and environment.yaml.
-
-Note: Be sure to add the necessary packages in `deployment_scripts/environment.yaml`:
-
-```bash
-cd deployment_scripts
-./create-env.sh <your-env>
-```
-
-**Step 3.** Package the environment and transfer the archive to the target system:
-
-```bash
-(base) $ conda pack -n <your-env> -o ray_env.tar.gz # conda-pack must be installed in the base environment
-```
-
-A workspace is suitable for storing the compressed Conda environment archive on Hawk. Proceed to the next step if you have already configured your workspace. Use the following command to create a workspace on the high-performance filesystem, which will expire in 10 days. For more information, such as how to enable reminder emails, refer to the [workspace mechanism](https://kb.hlrs.de/platforms/index.php/Workspace_mechanism) guide.
+**Step 2.** Allocate a workspace on Hawk:
+
+Proceed to the next step if you have already configured your workspace. Use the following command to create a workspace on the high-performance filesystem, which will expire in 10 days. For more information, such as how to enable reminder emails, refer to the [workspace mechanism](https://kb.hlrs.de/platforms/index.php/Workspace_mechanism) guide.

```bash
ws_allocate hpda_project 10
ws_find hpda_project # find the path to the workspace, which is the destination directory in the next step
```

-You can send your data to an existing workspace using:
-
-```bash
-scp ray_env.tar.gz <username>@hawk.hww.hlrs.de:<workspace_directory>
-rm ray_env.tar.gz # We don't need the archive locally anymore.
-```
-
-**Step 4.** Clone the repository on Hawk to use the deployment scripts and project structure:
+**Step 3.** Clone the repository on Hawk to use the deployment scripts and project structure:

```bash
cd <workspace_directory>
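The removed prerequisites above rely on Conda-Pack being available in the base environment before `conda pack` can be used. A minimal sketch of installing it, assuming the conda-forge channel is acceptable on your local machine:

```bash
conda install -n base -c conda-forge conda-pack   # conda-pack must be installed in the base environment
conda pack -n <your-env> -o ray_env.tar.gz        # afterwards, environments can be packaged as shown above
```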
@@ -1,23 +0,0 @@
#!/bin/bash

# Display usage
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <conda_environment_name>"
    exit 1
fi

# Name of the Conda environment
CONDA_ENV_NAME=$1

# Check if the Conda environment already exists
if conda env list | grep -q "$CONDA_ENV_NAME"; then
    echo "Environment '$CONDA_ENV_NAME' already exists."
else
    echo "Environment '$CONDA_ENV_NAME' does not exist, creating it."
    # Create Conda environment
    CONDA_SUBDIR=linux-64 conda env create --name $CONDA_ENV_NAME -f environment.yaml
fi
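The deleted script above matches the `create-env.sh` helper documented later in this change. A minimal usage sketch, following the old README; the environment name is just an example:

```bash
cd deployment_scripts
./create-env.sh ray_env   # creates the environment from environment.yaml if it does not already exist
```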
@@ -1,43 +0,0 @@
#!/bin/bash
# Detect the node type and prepare the Ray Conda environment; source this script so the exports persist in your shell.

export WS_DIR=<workspace_dir>

# Get the first character of the hostname
first_char=$(hostname | cut -c1)

# Check if the first character is not "r"
if [[ $first_char != "r" ]]; then
    # It is not a CPU node.
    echo "Hostname does not start with 'r'."
    # Get the first seven characters of the hostname
    first_seven_chars=$(hostname | cut -c1-7)
    # Check if it is an AI node
    if [[ $first_seven_chars != "hawk-ai" ]]; then
        echo "Hostname does not start with 'hawk-ai' either. Exiting."
        return 1
    else
        echo "GPU node detected."
        export OBJ_STR_MEMORY=350000000000
        export TEMP_CHECKPOINT_DIR=/localscratch/$PBS_JOBID/model_checkpoints/
        mkdir -p $TEMP_CHECKPOINT_DIR
    fi
else
    echo "CPU node detected."
fi

module load bigdata/conda

export RAY_DEDUP_LOGS=0

export ENV_ARCHIVE=ray_env.tar.gz
export CONDA_ENVS=/run/user/$PBS_JOBID/envs
export ENV_NAME=ray_env
export ENV_PATH=$CONDA_ENVS/$ENV_NAME

mkdir -p $ENV_PATH

tar -xzf $WS_DIR/$ENV_ARCHIVE -C $ENV_PATH

source $ENV_PATH/bin/activate

export CONDA_ENVS_PATH=$CONDA_ENVS
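Because the deleted script above uses `return` and exports variables such as `WS_DIR` and `ENV_PATH`, it is meant to be sourced in the current shell on a compute node rather than executed. A hedged usage sketch follows; the interactive allocation flags and the script filename are assumptions and must be adapted to your setup:

```bash
# Hypothetical interactive session on Hawk (flags and script name are placeholders)
qsub -I -l select=1:node_type=rome -l walltime=01:00:00
cd <workspace_directory>/<project_name>
source deployment_scripts/<setup-script>.sh   # sourced so WS_DIR, ENV_PATH, etc. remain set afterwards
```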
@@ -1,104 +0,0 @@
# Reference: Cluster Deployment Scripts

Wiki link:

Motivation: This document aims to show users how to use additional Dask deployment scripts to streamline the deployment and management of a Dask cluster in a high-performance computing (HPC) environment.

Structure:
- [ ] [Tutorial](https://diataxis.fr/tutorials/)
- [ ] [How-to guide](https://diataxis.fr/how-to-guides/)
- [x] [Reference](https://diataxis.fr/reference/)
- [ ] [Explanation](https://diataxis.fr/explanation/)

To do:

---

## Overview

This repository contains a set of bash scripts designed to streamline the deployment and management of a Dask cluster in a high-performance computing (HPC) environment. These scripts facilitate the creation of Conda environments, deployment of the environment to a remote server, and initiation of Dask clusters on distributed systems. Below is a comprehensive guide on how to use and understand each script.

### Note: Permissions

Ensure that execution permissions (`chmod +x`) are granted to these scripts before attempting to run them. This can be done using the following command:

```bash
chmod +x script_name.sh
```

## Prerequisites

Before using these scripts, ensure that the following prerequisites are met:

1. **Conda Installation**: Ensure that Conda is installed on your local system. Follow the [official Conda installation guide](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) if it is not already installed.
2. **PBS Job Scheduler**: The deployment scripts (`deploy-dask.sh` and `dask-worker.sh`) are designed for use with the PBS job scheduler. Modify them accordingly if you use a different job scheduler.
3. **SSH Setup**: Ensure that SSH is set up and configured on your system for remote server communication (a sample host entry is sketched after this list).
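For the SSH prerequisite, an optional host entry can shorten the later transfers. This is only a sketch; the alias and username are placeholders:

```bash
# Append a host alias for Hawk to the local SSH configuration (values are placeholders)
cat >> ~/.ssh/config <<'EOF'
Host hawk
    HostName hawk.hww.hlrs.de
    User <username>
EOF
```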
## 1. create-env.sh

### Overview

`create-env.sh` is designed to create a Conda environment. It checks for the existence of the specified environment and either creates it or notifies the user if it already exists.

Note: Define your Conda environment in `environment.yaml` before running this script.

### Usage

```bash
./create-env.sh <conda_environment_name>
```

### Note

- This script is intended to run on a local system where Conda is installed.

## 2. deploy-env.sh

### Overview

`deploy-env.sh` is responsible for deploying the Conda environment to a remote server. If the tar.gz file already exists, it is copied; otherwise, it is created before being transferred.

### Usage

```bash
./deploy-env.sh <environment_name> <destination_directory>
```

### Note

- This script is intended to run on a local system.
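The copy-or-create behaviour described above can be pictured with a short sketch; the variable names and the packaging call are assumptions for illustration, not the actual contents of `deploy-env.sh`:

```bash
#!/bin/bash
# Sketch of the described deploy-env.sh flow (illustrative only)
ENV_NAME=$1
DESTINATION=$2                 # e.g. <username>@hawk.hww.hlrs.de:<workspace_directory>
ARCHIVE="$ENV_NAME.tar.gz"

if [ ! -f "$ARCHIVE" ]; then
    # No archive yet: package the environment first (conda-pack must be installed in base)
    conda pack -n "$ENV_NAME" -o "$ARCHIVE"
fi

scp "$ARCHIVE" "$DESTINATION"  # copy the existing or freshly created archive to the remote server
```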
## 3. deploy-dask.sh

### Overview

`deploy-dask.sh` initiates the Dask cluster in an HPC environment using the PBS job scheduler. It extracts the Conda environment, activates it, and starts the Dask scheduler and workers on the allocated nodes.

### Usage

```bash
./deploy-dask.sh <current_workspace_directory>
```

### Notes

- This script is designed for an HPC environment with PBS job scheduling.
- Modifications may be necessary for different job schedulers.
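As a rough sketch of the scheduler-plus-workers start-up described above, modelled on the Ray scripts elsewhere in this repository; the commands, variables, and scheduler-file location are assumptions rather than the actual `deploy-dask.sh`:

```bash
# Activate the extracted environment, start the scheduler, then fan workers out with pbsdsh
source "$ENV_PATH/bin/activate"
dask-scheduler --scheduler-file "$WS_DIR/scheduler.json" &

NUM_NODES=$(sort "$PBS_NODEFILE" | uniq | wc -l)
for ((i = 1; i < NUM_NODES; i++)); do
    pbsdsh -n $i -- bash -l -c "'$DEPLOYMENT_SCRIPTS/dask-worker.sh' '$WS_DIR'" &
done
```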
## 4. dask-worker.sh

### Overview

`dask-worker.sh` is a worker script executed on each allocated node. It sets up the Dask environment, extracts the Conda environment, activates it, and starts a Dask worker that connects to the scheduler. This script is not executed directly by the user.

### Notes

- This script runs on each allocated node to connect it to the Dask scheduler; it is launched by `deploy-dask.sh` rather than by the user.
- Designed for use with PBS job scheduling.
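A compressed sketch of what such a per-node worker script does follows; the archive name, extraction path, and scheduler-file location are assumptions for illustration:

```bash
#!/bin/bash
# Illustrative per-node worker start-up (not the original dask-worker.sh)
WS_DIR=$1
ENV_PATH=/run/user/$PBS_JOBID/dask_env   # node-local extraction directory, as in the Ray scripts

mkdir -p "$ENV_PATH"
tar -xzf "$WS_DIR/dask_env.tar.gz" -C "$ENV_PATH"
source "$ENV_PATH/bin/activate"
conda-unpack

# Join the cluster; the scheduler address is assumed to be shared via a scheduler file
dask-worker --scheduler-file "$WS_DIR/scheduler.json"
```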
## Workflow

1. **Create Conda Environment**: Execute `create-env.sh` to create a Conda environment locally.
2. **Deploy Conda Environment**: Execute `deploy-env.sh` to deploy the Conda environment to a remote server.
3. **Deploy Dask Cluster**: Execute `deploy-dask.sh` to start the Dask cluster in an HPC environment (see the sketch below).
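Putting the three steps together; the arguments follow the usage sections above, while the environment name and destination are examples only:

```bash
./create-env.sh dask_env                                                     # 1. build the environment locally
./deploy-env.sh dask_env <username>@hawk.hww.hlrs.de:<workspace_directory>  # 2. ship it to the remote workspace
./deploy-dask.sh <workspace_directory>                                      # 3. start the cluster via PBS
```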
@@ -1,23 +0,0 @@
name: ray
channels:
  - defaults
dependencies:
  - python=3.10
  - pip
  - pip:
    - ray==2.8.0
    - "ray[default]==2.8.0"
    - dask==2022.10.1
    - torch
    - pydantic<2
    - six
    - tqdm
    - pandas<2
    - scikit-learn
    - matplotlib
    - optuna
    - seaborn
    - tabulate
    - jupyterlab
    - autopep8
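This appears to be the `deployment_scripts/environment.yaml` consumed by `create-env.sh`. A minimal stand-alone sketch of building and packaging the same environment for the cluster, following the commands used elsewhere in this repository:

```bash
# Build for the linux-64 target and pack the result for transfer to the cluster
CONDA_SUBDIR=linux-64 conda env create --name ray -f environment.yaml
conda pack -n ray -o ray_env.tar.gz   # requires conda-pack in the base environment
```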
@@ -1,26 +1,19 @@
#!/bin/bash

if [ $# -ne 5 ]; then
-    echo "Usage: $0 <ws_dir> <env_archive> <ray_address> <redis_password> <obj_store_memory>"
+    echo "Usage: $0 <ws_dir> <env_path> <ray_address> <redis_password> <obj_store_memory>"
    exit 1
fi

export WS_DIR=$1
-export ENV_ARCHIVE=$2
+export ENV_PATH=$2
export RAY_ADDRESS=$3
export REDIS_PASSWORD=$4
export OBJECT_STORE_MEMORY=$5

-export ENV_PATH=/run/user/$PBS_JOBID/ray_env # We use the ram disk to extract the environment packages since a large number of files decreases the performance of the parallel file system.
-
-mkdir -p $ENV_PATH
-tar -xzf $WS_DIR/$ENV_ARCHIVE -C $ENV_PATH
source $ENV_PATH/bin/activate
-conda-unpack

ray start --address=$RAY_ADDRESS \
    --redis-password=$REDIS_PASSWORD \
    --object-store-memory=$OBJECT_STORE_MEMORY \
    --block
-
-rm -rf $ENV_PATH # It's nice to clean up before you terminate the job
@@ -5,10 +5,9 @@

export WS_DIR=<workspace_dir>
export PROJECT_DIR=$WS_DIR/<project_name>
+export ENV_PATH=<env_path>
export JOB_SCRIPT=monte-carlo-pi.py

-export ENV_ARCHIVE=ray_env.tar.gz
-
export OBJECT_STORE_MEMORY=128000000000

# Environment variables after this line should not change

@@ -16,10 +15,7 @@ export OBJECT_STORE_MEMORY=128000000000
export SRC_DIR=$PROJECT_DIR/src
export PYTHON_FILE=$SRC_DIR/$JOB_SCRIPT
export DEPLOYMENT_SCRIPTS=$PROJECT_DIR/deployment_scripts
-export ENV_PATH=/run/user/$PBS_JOBID/ray_env # We use the ram disk to extract the environment packages since a large number of files decreases the performance of the parallel file system.
-
-mkdir -p $ENV_PATH
-tar -xzf $WS_DIR/$ENV_ARCHIVE -C $ENV_PATH # This line extracts the packages to the ram disk.
source $ENV_PATH/bin/activate

export IP_ADDRESS=`ip addr show ib0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}' | awk '{print $1}'`

@@ -40,11 +36,9 @@ ray start --disable-usage-stats \
export NUM_NODES=$(sort $PBS_NODEFILE | uniq | wc -l)

for ((i=1;i<$NUM_NODES;i++)); do
-    pbsdsh -n $i -- bash -l -c "'$DEPLOYMENT_SCRIPTS/start-ray-worker.sh' '$WS_DIR' '$ENV_ARCHIVE' '$RAY_ADDRESS' '$REDIS_PASSWORD' '$OBJECT_STORE_MEMORY'" &
+    pbsdsh -n $i -- bash -l -c "'$DEPLOYMENT_SCRIPTS/start-ray-worker.sh' '$WS_DIR' '$ENV_PATH' '$RAY_ADDRESS' '$REDIS_PASSWORD' '$OBJECT_STORE_MEMORY'" &
done

python3 $PYTHON_FILE

ray stop --grace-period 30
-
-rm -rf $ENV_PATH # It's nice to clean up before you terminate the job.
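For completeness, a hedged example of submitting the batch script above on a PBS system; the script filename and the resource selection are assumptions and must be adapted to your allocation:

```bash
# Hypothetical submission of the Ray batch job (adjust select/walltime and the filename)
qsub -l select=2:node_type=rome -l walltime=00:30:00 deployment_scripts/submit-ray-job.pbs
```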