Simple tests for JAX, PyTorch, and TensorFlow to test if the installed NVIDIA drivers are being properly picked up.
These instructions assume you are working on Ubuntu 20.04 LTS.
For example, this setup has been tested on the following systems:
Create a Python virtual environment and install the base libraries from the relevant `requirements.txt` files.

Examples:

```console
python -m pip install -r requirements-jax.txt
python -m pip install -r requirements-jax-metal.txt
```
The easiest way to determine the correct NVIDIA driver for your system is to let Ubuntu determine it automatically: open Ubuntu's Software & Updates utility and select the "Drivers" tab.
The "Drivers" tab should begin with a listbox containing a progress bar and the text "Searching for available drivers…" until the search is complete. Once the search is complete, the listbox should list each device for which proprietary drivers could be installed. Each item in the list should have an indicator light: green if a driver tested with that Ubuntu release is being used, yellow if any other driver is being used, or red if no driver is being used.
Select the recommended NVIDIA driver from the list (proprietary, tested) and then select "Apply Changes" to install the driver. After the driver has finished installing, restart the computer to verify the driver has been installed successfully. If you run

```console
nvidia-smi
```

from the command line, the displayed driver version should match the one you installed.
N.B.: To check all the GPUs that are currently visible to NVIDIA you can use

```console
nvidia-smi --list-gpus
```

See the output of `nvidia-smi --help` for more details.

Example:

```console
$ nvidia-smi --list-gpus
GPU 0: NVIDIA GeForce RTX 3050 Ti Laptop GPU (UUID: GPU-9b3a1382-1fb8-43c7-67b1-c28af22b6767)
```
Alternatively, if you are running headless or over a remote connection you can determine and install the correct driver from the command line. From the command line run

```console
ubuntu-drivers devices
```

to get a list of all devices on the machine that need drivers and the recommended drivers.

Example:

```console
$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd000025A0sv00001028sd00000A61bc03sc02i00
vendor   : NVIDIA Corporation
model    : GA107M [GeForce RTX 3050 Ti Mobile]
driver   : nvidia-driver-535-server-open - distro non-free
driver   : nvidia-driver-525-open - distro non-free
driver   : nvidia-driver-535-server - distro non-free
driver   : nvidia-driver-535 - distro non-free recommended
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-535-open - distro non-free
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-525-server - distro non-free
driver   : nvidia-driver-525 - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
```
You can now either install the supported driver you want directly through `apt`

```console
sudo apt-get install nvidia-driver-535
```

or you can let `ubuntu-drivers` install the recommended driver for you automatically

```console
sudo ubuntu-drivers autoinstall
```
After installing the NVIDIA driver, the NVIDIA CUDA Toolkit also needs to be installed. This needs to be done every time you update the NVIDIA driver. This can be done manually by following the instructions on the NVIDIA website, but it can also be done automatically through `apt` by installing the Ubuntu package `nvidia-cuda-toolkit`.

```console
sudo apt-get update -y
sudo apt-get install -y nvidia-cuda-toolkit
```
Example:

```console
$ apt show nvidia-cuda-toolkit | head -n 5
Package: nvidia-cuda-toolkit
Version: 11.5.1-1ubuntu1
Priority: extra
Section: multiverse/devel
Origin: Ubuntu
```
After the NVIDIA CUDA Toolkit is installed, restart the computer.

N.B.: If the NVIDIA drivers are ever changed, the NVIDIA CUDA Toolkit will need to be reinstalled.
Now that the system NVIDIA drivers are installed, the necessary requirements can be stepped through for the different machine learning backends in order (from easiest to hardest).
PyTorch makes things very easy by packaging all of the necessary CUDA libraries with its binary distributions (which is why they are so huge). So by `pip` installing the `torch` wheel all necessary libraries are installed.
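As a quick sanity check that the wheel's bundled CUDA libraries are being picked up, something along the following lines can be run (a minimal sketch; the device name shown in the comment is just an example):

```python
# Minimal sketch: check that PyTorch can see the GPU
import torch

print(torch.__version__)
print(torch.cuda.is_available())  # True if the NVIDIA driver is picked up
if torch.cuda.is_available():
    # e.g. "NVIDIA GeForce RTX 3050 Ti Laptop GPU"
    print(torch.cuda.get_device_name(0))
```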
For JAX, the CUDA and cuDNN release wheels can be installed from PyPI and Google with `pip`

```console
python -m pip install --upgrade "jax[cuda12_pip]" --find-links https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```
To instead install `jax` and `jaxlib` but use locally installed CUDA and cuDNN versions, follow the instructions in the JAX README. In these circumstances, to point JAX to the location of the installed CUDA release you can set the following environment variable before importing JAX

```
XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda
```

Example:

```console
XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/lib/cuda/ python jax_MNIST.py
```
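To verify that JAX actually targets the GPU, a check along these lines can be used (a minimal sketch; setting `XLA_FLAGS` from Python is only needed when using a locally installed CUDA, and the `/usr/lib/cuda` path matches the examples above):

```python
# Minimal sketch: check that JAX can see the GPU
import os

# Optional: point XLA at a locally installed CUDA before importing jax
os.environ.setdefault("XLA_FLAGS", "--xla_gpu_cuda_data_dir=/usr/lib/cuda")

import jax

print(jax.default_backend())  # "gpu" when CUDA is picked up
print(jax.devices())          # should list GPU devices, not just CPU
```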
WARNING: This section will be out of date fast, so you'll have to adapt it for your particular circumstances.
TensorFlow requires the NVIDIA cuDNN closed source libraries, which are a pain to get and have quite bad documentation. To download the libraries you will need to make an account with NVIDIA and register as a developer, which is also a bad experience. Once you've done that, go to the cuDNN download page, agree to the Software License Agreement, and then select the version of cuDNN that matches the version of CUDA your operating system has (the version from `nvidia-smi`, which is not necessarily the same as the version from `nvcc --version`).
Example: the output

```console
$ nvidia-smi | grep "CUDA Version"
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
```

would indicate that cuDNN v8.2.2 for CUDA 11.4 is the recommended version. (This is verified by noting that, when clicked on, the entry for cuDNN v8.2.2 for CUDA 11.4 lists support for Ubuntu 20.04, while the entry for cuDNN v8.2.2 for CUDA 10.2 lists support only for Ubuntu 18.04.)
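If you want to read that CUDA version programmatically (for example, to script the cuDNN choice), a small sketch like the following works; the regular expression is just an assumption about the format of `nvidia-smi`'s banner:

```python
# Sketch: extract the driver's supported CUDA version from nvidia-smi
import re
import subprocess

banner = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
match = re.search(r"CUDA Version:\s*([\d.]+)", banner)
print(match.group(1) if match else "no CUDA version reported")  # e.g. "11.4"
```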
Click on the cuDNN release you want to download to see the available libraries for supported system architectures. As these instructions are using Ubuntu, download the tarball for the cuDNN library as well as the Debian binaries for the cuDNN runtime library, developer library, and code samples.
Once all the libraries are downloaded locally, refer to the directions for installing on Linux in the cuDNN installation guide. The documentation refers to a CUDA directory path (which they generically call `/usr/local/cuda`) and a download path for all of the cuDNN libraries (referred to as `<cudnnpath>`). For the CUDA directory path we could use our existing symlink of `/usr/local/cuda-10.1`, but the cuDNN examples all assume the path is `/usr/local/cuda`, so it is easier to make a new symlink of `/usr/local/cuda` pointing to `/usr/lib/cuda`.

```console
sudo ln -s /usr/lib/cuda /usr/local/cuda
```
The examples are also going to assume that `nvcc` is at `/usr/local/cuda/bin/nvcc` and `cuda.h` is at `/usr/local/cuda/include/cuda.h`, so make additional symlinks of those paths pointing to `/usr/bin/nvcc` and `/usr/include/cuda.h`

```console
sudo ln -s /usr/bin/nvcc /usr/local/cuda/bin/nvcc
sudo ln -s /usr/include/cuda.h /usr/local/cuda/include/cuda.h
```
Navigate to the `<cudnnpath>` directory containing the cuDNN tar file (example: `cudnn-11.4-linux-x64-v8.2.2.26.tgz`), unpack it (which extracts into a directory named `cuda`), and copy the cuDNN headers and libraries into the CUDA Toolkit directory:

```console
tar -xzvf cudnn-*-linux-x64-v*.tgz
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
```
To use cuDNN in your applications, the cuDNN runtime library, developer library, and code samples should be installed too. This can be done with `apt install` from your `<cudnnpath>`.

Example:

```console
sudo apt install ./libcudnn8_8.2.2.26-1+cuda11.4_amd64.deb
sudo apt install ./libcudnn8-dev_8.2.2.26-1+cuda11.4_amd64.deb
sudo apt install ./libcudnn8-samples_8.2.2.26-1+cuda11.4_amd64.deb
```
Copy the cuDNN samples to a writable path

```console
cp -r /usr/src/cudnn_samples_v8/ $PWD
```

then navigate to the `mnistCUDNN` sample directory and compile and run the sample

```console
cd cudnn_samples_v8/mnistCUDNN
make clean && make
./mnistCUDNN
```

If everything is set up correctly then the resulting output should conclude with

```
Test passed!
```
The installed libraries should also be added to `PATH` and `LD_LIBRARY_PATH`, so add the following to your `~/.profile` to be loaded at system login

```bash
# Add CUDA Toolkit 10.1 to PATH
# /usr/local/cuda-10.1 should be a symlink to /usr/lib/cuda
if [ -d "/usr/local/cuda-10.1/bin" ]; then
    PATH="/usr/local/cuda-10.1/bin:${PATH}"; export PATH;
elif [ -d "/usr/lib/cuda/bin" ]; then
    PATH="/usr/lib/cuda/bin:${PATH}"; export PATH;
fi

# Add cuDNN to LD_LIBRARY_PATH
# /usr/local/cuda should be a symlink to /usr/lib/cuda
if [ -d "/usr/local/cuda/lib64" ]; then
    LD_LIBRARY_PATH="/usr/local/cuda/lib64:${LD_LIBRARY_PATH}"; export LD_LIBRARY_PATH;
elif [ -d "/usr/lib/cuda/lib64" ]; then
    LD_LIBRARY_PATH="/usr/lib/cuda/lib64:${LD_LIBRARY_PATH}"; export LD_LIBRARY_PATH;
fi
```
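After logging back in, a quick way to confirm that the dynamic loader can now find cuDNN is a check like the following (a sketch; the soname `libcudnn.so.8` matches the cuDNN v8.x install described above):

```python
# Sketch: verify the dynamic loader can resolve cuDNN via LD_LIBRARY_PATH
import ctypes

try:
    ctypes.CDLL("libcudnn.so.8")
    print("libcudnn.so.8 found")
except OSError as err:
    print(f"cuDNN not found; check LD_LIBRARY_PATH: {err}")
```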
TensorFlow does not really respect semver, as minor releases act essentially as major releases with breaking changes.
This comes into play when considering the tested build configurations for CUDA and cuDNN versions.
For example, looking at the supported ranges for TensorFlow `v2.3.0` through `v2.5.0`

| Version | Python version | Compiler | Build tools | cuDNN | CUDA |
|---|---|---|---|---|---|
| tensorflow-2.5.0 | 3.6-3.9 | GCC 7.3.1 | Bazel 3.7.2 | 8.1 | 11.2 |
| tensorflow-2.4.0 | 3.6-3.8 | GCC 7.3.1 | Bazel 3.1.0 | 8.0 | 11.0 |
| tensorflow-2.3.0 | 3.5-3.8 | GCC 7.3.1 | Bazel 3.1.0 | 7.6 | 10.1 |
it is seen that for our example of

```console
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
```

only TensorFlow `v2.3.0` will be compatible with our installation.
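To see which CUDA and cuDNN versions an installed TensorFlow wheel was actually built against, recent TensorFlow releases expose the build information (a sketch; the dictionary keys are an assumption based on what current releases report):

```python
# Sketch: report the CUDA/cuDNN versions a TensorFlow wheel was built against
import tensorflow as tf

info = tf.sysconfig.get_build_info()
print(info.get("cuda_version"))   # e.g. "10.1"
print(info.get("cudnn_version"))  # e.g. "7"
```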
However, TensorFlow `v2.3.0` requires cuDNN `v7.x` (`libcudnn.so.7`) and we have cuDNN `v8.x` (`libcudnn.so.8`). The NVIDIA cuDNN installation documentation notes that

> Since version 8 can coexist with previous versions of cuDNN, if the user has an older version of cuDNN such as v6 or v7, installing version 8 will not automatically delete an older revision.
While we could go and try to install cuDNN `v7.6` from the cuDNN archives, it turns out that TensorFlow is okay with a `libcudnn.so.7` symlink whose target is `libcudnn.so.8`, so until this causes problems move forward with this approach

```console
sudo ln -s /usr/lib/cuda/lib64/libcudnn.so.8 /usr/local/cuda/lib64/libcudnn.so.7
```
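With the symlink in place, TensorFlow should be able to load cuDNN and see the GPU; a check along these lines confirms it (a minimal sketch):

```python
# Minimal sketch: check that TensorFlow can see the GPU
import tensorflow as tf

print(tf.__version__)
print(tf.test.is_built_with_cuda())            # True for GPU-enabled builds
print(tf.config.list_physical_devices("GPU"))  # non-empty when CUDA and cuDNN load
```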
You should now have a directory structure for `/usr/local/cuda` that looks something like the following

```console
$ tree /usr/local/cuda
/usr/local/cuda
├── bin
│   └── nvcc -> /usr/bin/nvcc
├── include
│   ├── cuda.h -> /usr/include/cuda.h
│   ├── cudnn_adv_infer.h
│   ├── cudnn_adv_infer_v8.h
│   ├── cudnn_adv_train.h
│   ├── cudnn_adv_train_v8.h
│   ├── cudnn_backend.h
│   ├── cudnn_backend_v8.h
│   ├── cudnn_cnn_infer.h
│   ├── cudnn_cnn_infer_v8.h
│   ├── cudnn_cnn_train.h
│   ├── cudnn_cnn_train_v8.h
│   ├── cudnn.h
│   ├── cudnn_ops_infer.h
│   ├── cudnn_ops_infer_v8.h
│   ├── cudnn_ops_train.h
│   ├── cudnn_ops_train_v8.h
│   ├── cudnn_v8.h
│   ├── cudnn_version.h
│   └── cudnn_version_v8.h
├── lib64
│   ├── libcudnn_adv_infer.so
│   ├── libcudnn_adv_infer.so.8
│   ├── libcudnn_adv_infer.so.8.2.2
│   ├── libcudnn_adv_train.so
│   ├── libcudnn_adv_train.so.8
│   ├── libcudnn_adv_train.so.8.2.2
│   ├── libcudnn_cnn_infer.so
│   ├── libcudnn_cnn_infer.so.8
│   ├── libcudnn_cnn_infer.so.8.2.2
│   ├── libcudnn_cnn_infer_static.a
│   ├── libcudnn_cnn_infer_static_v8.a
│   ├── libcudnn_cnn_train.so
│   ├── libcudnn_cnn_train.so.8
│   ├── libcudnn_cnn_train.so.8.2.2
│   ├── libcudnn_cnn_train_static.a
│   ├── libcudnn_cnn_train_static_v8.a
│   ├── libcudnn_ops_infer.so
│   ├── libcudnn_ops_infer.so.8
│   ├── libcudnn_ops_infer.so.8.2.2
│   ├── libcudnn_ops_train.so
│   ├── libcudnn_ops_train.so.8
│   ├── libcudnn_ops_train.so.8.2.2
│   ├── libcudnn.so
│   ├── libcudnn.so.7 -> /usr/lib/cuda/lib64/libcudnn.so.8
│   ├── libcudnn.so.8
│   ├── libcudnn.so.8.2.2
│   ├── libcudnn_static.a
│   └── libcudnn_static_v8.a
├── nvvm
│   └── libdevice -> ../../nvidia-cuda-toolkit/libdevice
└── version.txt

5 directories, 49 files
```
With this final set of libraries installed, restart your computer.
For all of the ML libraries you can now run the `x_detect_GPU.py` tests, which test that the library can properly access the GPU and CUDA, where `x` is the library name/nickname. For all of the ML libraries you can also run a simple MNIST test by running `x_MNIST.py`, where `x` is the library name/nickname.
It is worthwhile to watch the GPU performance in another terminal with `nvidia-smi` while running the tests

```console
watch --interval 0.5 nvidia-smi
```
Thanks to Giordon Stark who greatly helped me scaffold the right approach to this setup, as well as for his help doing system setup comparisons.