gpu tester detects broken and slow gpus in a cluster
MIT License
gpu tester finds all your bad GPUs.
Works on Slurm.
Features:
Roadmap:
Create a venv:
python3 -m venv .env
source .env/bin/activate
pip install -U pip
Then:
pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116
pip install gpu_tester
Check out these examples to call this as a lib:
Output looks like this:
job succeeded
0 have incorrect results, 1 have gpu errors and 319 succeeded
incorrect results:
[]
gpu errors:
[['gpu_error', 'compute-od-gpu-st-p4d-24xlarge-156', '3']]
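The printed error list is a valid Python literal, so it can be turned into a host list for a follow-up run. The sketch below assumes every entry has the shape [status, hostname, index] seen in the sample output; what the third field means is not stated here, so it is ignored. `failing_nodes` is a hypothetical helper, not part of gpu_tester.

```python
import ast


def failing_nodes(printed_list):
    """Return the sorted, de-duplicated hostnames found in a printed result list."""
    entries = ast.literal_eval(printed_list)
    # Assumption: the hostname is always the second field of each entry.
    return sorted({entry[1] for entry in entries})


line = "[['gpu_error', 'compute-od-gpu-st-p4d-24xlarge-156', '3']]"
nodes = failing_nodes(line)
# Comma-separated hostnames, the form Slurm accepts for host lists.
print(",".join(nodes))
```

Whether the full hostname or a shortened pattern should be given to --exclude depends on how the cluster names its nodes.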
The easiest way to quickly spot broken nodes is the pair-based strategy. It runs many jobs in parallel and finds which nodes can talk to each other. Here is one example:
gpu_tester --nodes 2 --parallel-tests 50 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 45 --exclude 'gpu-st-p4d-24xlarge-[66]'
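To see why pairing helps, consider how per-pair pass/fail results narrow down a culprit: a node that only ever shows up in failing pairs is a strong suspect, while a node that succeeded with at least one partner is probably fine. The sketch below illustrates that reasoning only; it is not how gpu_tester is implemented, and the node names and the `suspect_nodes` helper are made up.

```python
def suspect_nodes(pair_results):
    """pair_results maps (node_a, node_b) -> True if that pair's job succeeded."""
    passed, failed = set(), set()
    for pair, ok in pair_results.items():
        (passed if ok else failed).update(pair)
    # Nodes seen in a failing pair that never succeeded with any partner.
    return sorted(failed - passed)


# Hypothetical results collected over a few pair-based runs.
results = {
    ("node-1", "node-2"): True,
    ("node-3", "node-4"): False,
    ("node-3", "node-1"): False,
    ("node-4", "node-2"): True,
}
print(suspect_nodes(results))  # ['node-3']
```

Running the pair test more than once gives each node several partners, which is what makes this kind of elimination possible.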
Once you have validated that this works, you may want to try the DDP strategy over all nodes, e.g.:
gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "ddp" --job_timeout 300 --exclude 'gpu-st-p4d-24xlarge-[66]'
If you only want to validate the forward functionality of GPUs and not the communication, you may use:
gpu_tester --nodes 100 --parallel-tests 1 --job_comment laion --partition "gpu" --test_kind "simple_forward" --job_timeout 50 --exclude 'gpu-st-p4d-24xlarge-[66]'
This module exposes a single function, gpu_tester, which takes the same arguments as the command line tool.
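As a rough sketch, the pair-based command line example could be expressed as a library call like this. Two things are assumptions here: the import path `from gpu_tester import gpu_tester`, and the keyword names, which are taken from the CLI flags with hyphens turned into underscores (so `--parallel-tests` becomes `parallel_tests`). `ddp_pair_args` and `run_pair_test` are hypothetical helpers for illustration.

```python
def ddp_pair_args(partition="gpu", exclude=""):
    """Keyword arguments mirroring the pair-based CLI example."""
    return {
        "nodes": 2,
        "parallel_tests": 50,
        "job_comment": "laion",
        "partition": partition,
        "test_kind": "ddp",
        "job_timeout": 45,
        "exclude": exclude,
    }


def run_pair_test():
    # Imported lazily so the argument builder can be used without the package.
    # Assumed import path; needs `pip install gpu_tester` and access to Slurm.
    from gpu_tester import gpu_tester

    return gpu_tester(**ddp_pair_args(exclude="gpu-st-p4d-24xlarge-[66]"))
```

Calling `run_pair_test()` submits jobs to the cluster, so it is best tried from a login node where the command line tool already works.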
Either locally, or in gitpod (do export PIP_USER=false there)
Setup a virtualenv:
python3 -m venv .env
source .env/bin/activate
pip install -e .
To run tests:
pip install -r requirements-test.txt
Then:
make lint
make test
You can use make black to reformat the code.
To run a specific test:
python -m pytest -x -s -v tests -k "dummy"