# Testing Ray Tune with SLURM batch submission, Optuna, and wandb
This repository demonstrates/tests hyperparameter optimization with the following frameworks:

* [Ray Tune](https://docs.ray.io/en/latest/tune/index.html) to run the trials
* [Optuna](https://optuna.org/) to suggest the hyperparameters
* [Weights & Biases (wandb)](https://wandb.ai/) to log and track the results
* SLURM for batch job submission
> **Note**: If you want to see this tech stack in an actual use case, check out the GNN Tracking Hyperparameter Optimization repository.
First set up the conda environment, then `pip install` the package.
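A minimal sketch of the installation steps, assuming the usual file and environment names (the actual names shipped with this repository may differ):

```bash
conda env create -f environment.yml  # assumed name of the environment file
conda activate rtstest               # assumed environment name
pip install -e .                     # install this package in editable mode
```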
Run `src/rtstest/dothetune.py` (no batch submission) first to also download the data file.
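For orientation, a minimal Ray Tune + Optuna script looks roughly like the sketch below (Ray 2.x API; the actual `dothetune.py` in this repository may differ, e.g. by adding wandb logging):

```python
from ray import tune
from ray.tune.search.optuna import OptunaSearch


def objective(config):
    # Toy stand-in for real training; returning a dict reports final metrics.
    return {"score": (config["x"] - 2.0) ** 2}


tuner = tune.Tuner(
    objective,
    param_space={"x": tune.uniform(-10.0, 10.0)},
    tune_config=tune.TuneConfig(
        search_alg=OptunaSearch(),  # let Optuna suggest the hyperparameters
        metric="score",
        mode="min",
        num_samples=20,
    ),
)
results = tuner.fit()
print(results.get_best_result().config)
```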
For a single batch job that uses multiple nodes to start both the head node and the workers, see `slurm/all-in-one`. While this is the example used in the Ray documentation, it might not be the best fit for most use cases, as it relies on having enough nodes directly available for enough time to complete all requested trials.
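The pattern from the Ray documentation looks roughly like this (a condensed sketch, not the exact contents of `slurm/all-in-one`):

```bash
#!/bin/bash
#SBATCH --job-name=ray-all-in-one
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1

# Resolve the allocated nodes; the first one becomes the ray head.
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes[0]}
head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
port=6379  # default ray port, assumed

# Start the head node in the background.
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --node-ip-address="$head_ip" --port=$port --block &
sleep 10

# Start one ray worker on each remaining node.
for ((i = 1; i < SLURM_JOB_NUM_NODES; i++)); do
    srun --nodes=1 --ntasks=1 -w "${nodes[$i]}" \
        ray start --address="$head_ip:$port" --block &
    sleep 5
done

# Run the tuning script; it should attach to the running cluster,
# e.g. via ray.init(address="auto").
python -u src/rtstest/dothetune.py
```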
Because the compute nodes usually do not have internet access, we need a separate tool to sync the wandb data. See the documentation of wandb-osh for how to start the syncer on the head node.
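In short, the trials run wandb in offline mode and flag finished logging steps, while a syncer process on the head (login) node, which does have internet access, picks them up. Starting the syncer looks roughly like this (see the wandb-osh documentation for the exact invocation and options):

```bash
# On the head (login) node, where internet access is available:
# watch the communication directory and sync flagged runs to wandb.
wandb-osh
```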
Here, we start the Ray head on the head (login) node and then use batch submission to start worker nodes asynchronously. Follow these steps:

1. Run `slurm/head_workers/start-on-headnode.sh` and note down the IP and redis password that are printed out.
2. Submit worker jobs: `sbatch slurm/head_workers/start-on-worker.slurm <IP> <REDIS PWD>`
3. Start the tuning program: `slurm/head_workers/start-program.sh <IP> <REDIS PWD>`
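The worker batch script essentially just joins the existing cluster; a hypothetical sketch (the actual `start-on-worker.slurm`, including its port and resource settings, may differ):

```bash
#!/bin/bash
#SBATCH --job-name=ray-worker
#SBATCH --nodes=1
#SBATCH --time=01:00:00

ip="$1"         # head node IP, as printed by start-on-headnode.sh
redis_pwd="$2"  # redis password, as printed by start-on-headnode.sh

# Join the running cluster (default port assumed) and block so the
# allocation stays alive until the job ends or is cancelled.
ray start --address="$ip:6379" --redis-password="$redis_pwd" --block
```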
> **Note**: In my HPO scripts for my main ML project, I instead write the IP and password to files in my home directory and have dependent scripts read them from there, rather than passing them around on the command line.
Once the batch jobs for the workers start running, you should see activity in the tuning script output.