scaling-democracy

GPU-accelerated Schulze voting method in Python, Numba, and CUDA, using ideas from Algebraic Graph Theory

This repository implements the Schulze voting algorithm using CUDA for hardware acceleration. The algorithm is often used by Pirate Parties and open-source foundations, and it is a good example of a combinatorial problem that parallelizes efficiently on GPUs. The project is built as a single scaling_democracy.cu CUDA file, wrapped with PyBind11 and compiled directly from setup.py, without CMake. To pull, build, and benchmark locally:

git clone https://github.com/ashvardanian/scaling-democracy.git
cd scaling-democracy
git submodule update --init --recursive
pip install pybind11 numpy numba cupy-cuda12x pycuda
pip install -e .
python benchmark.py --num-candidates 128 --num-voters 128 --run-openmp --run-numba --run-serial --run-cuda
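
For context on the CMake-free build mentioned above, here is a minimal sketch, not the repository's actual setup.py, of how a single CUDA translation unit can be compiled into a Python extension by calling nvcc from setuptools; the file name kernels.cu and the exact nvcc flags are illustrative assumptions:

import os
import subprocess
import sysconfig

import pybind11
from setuptools import Extension, setup
from setuptools.command.build_ext import build_ext


class NvccBuildExt(build_ext):
    # Bypass the default C/C++ toolchain and hand the whole translation
    # unit to nvcc, which compiles and links the shared library in one go.
    def build_extension(self, ext):
        output = self.get_ext_fullpath(ext.name)
        os.makedirs(os.path.dirname(output), exist_ok=True)
        command = [
            "nvcc", "-O3", "-std=c++17", "--shared",
            "-Xcompiler", "-fPIC",
            f"-I{pybind11.get_include()}",
            f"-I{sysconfig.get_paths()['include']}",
            *ext.sources,
            "-o", output,
        ]
        subprocess.check_call(command)


setup(
    name="schulze_cuda_sketch",  # hypothetical module name, for illustration only
    ext_modules=[Extension("schulze_cuda_sketch", sources=["kernels.cu"])],
    cmdclass={"build_ext": NvccBuildExt},
)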

It's a fun project in its own right. It might serve as a good example of using GPUs for combinatorial problems with order-dependent traversal, and as a starting point for single-day CUDA hacking experiences with the header-only CCCL, which bundles Thrust, CUB, and libcudacxx.
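
To make the order-dependent traversal concrete, here is a minimal sketch of the textbook Schulze strongest-paths relaxation in Numba, not the repository's tiled CUDA kernel. The outer k loop must run in order, while each k-step is a data-parallel sweep over the matrix, which is the part worth offloading to a GPU:

import numpy as np
from numba import njit, prange


@njit(parallel=True)
def strongest_paths(pairwise):
    # pairwise[i, j] = number of voters preferring candidate i over candidate j.
    # Widest-path variant of Floyd-Warshall: the k loop is order-dependent and
    # stays serial, while each k-step sweeps the (i, j) plane in parallel.
    n = pairwise.shape[0]
    strengths = np.zeros_like(pairwise)
    for i in range(n):
        for j in range(n):
            if i != j and pairwise[i, j] > pairwise[j, i]:
                strengths[i, j] = pairwise[i, j]
    for k in range(n):
        for i in prange(n):
            if i == k:
                continue
            for j in range(n):
                if j == k or j == i:
                    continue
                through_k = min(strengths[i, k], strengths[k, j])
                if through_k > strengths[i, j]:
                    strengths[i, j] = through_k
    return strengths


def schulze_winners(strengths):
    # Candidate i is a potential winner if no j has a stronger path to i
    # than i has back to j.
    beaten = (strengths.T > strengths).any(axis=1)
    return np.flatnonzero(~beaten)

The repository's CUDA kernel evaluates the same recurrence over blocked tiles, hence the tile-size parameter mentioned in the benchmarks below.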

Throughput

Typical benchmark output, comparing serial Numba code, 20 Intel Ice Lake threads, and an SXM Nvidia H100 GPU:

Generating 128 random voter rankings with 8,192 candidates
Generated voter rankings, proceeding with 20 threads
Serial: 454.5124 seconds, 1,209,550,826.73 candidates^3/sec
Parallel: 68.4111 seconds, 8,036,063,293.27 candidates^3/sec
CUDA: 1.9802 seconds, 277,620,752,084.85 candidates^3/sec

CUDA outperforms the baseline JIT-compiled parallel kernel by a factor of 34.55x. The 40-core CPU draws around 270 W, so the 10 cores behind those 20 threads draw roughly 67.5 W. The SXM Nvidia H100 has a ~700 W TDP but draws only about 360 W under this load, roughly 5x more than the CPU cores, which makes the CUDA implementation up to 7x more power-efficient than Numba on that Intel CPU. As the matrix grows, GPU utilization improves and the measured runtime grows sub-cubically with the number of candidates. A fairer comparison would be against Arm-based CPUs and natively SIMD-accelerated code. Repeating the experiment on 192-core AWS Graviton 4 (c8g) instances with a tile size of 32 gives:

Candidates   Numba on c8g   OpenMP on c8g   OpenMP + NEON on c8g   CUDA on h100
2'048        n/a            1.14 s          0.35 s                 0.16 s
4'096        n/a            1.84 s          1.02 s                 0.35 s
8'192        7.49 s         5.50 s          4.64 s                 1.98 s
16'384       38.04 s        24.67 s         24.20 s                9.53 s
32'768       302.85 s       246.85 s        179.82 s               42.90 s

Comparing these numbers, CUDA still delivers a roughly 4x speedup at the largest matrix size tested, at comparable power consumption and hardware rental cost.
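
The claims above reduce to a few lines of arithmetic. Here is a quick back-of-the-envelope check in Python, using the rough wattage figures quoted in the text rather than measured draws:

# Throughput and power-efficiency arithmetic behind the numbers above.
candidates = 8_192
parallel_s, cuda_s = 68.4111, 1.9802             # 20 Ice Lake threads vs. H100


def throughput(seconds):
    # The benchmark's metric: relaxations per second ~ candidates^3 / time.
    return candidates ** 3 / seconds


speedup = parallel_s / cuda_s                    # ~34.55x
cpu_watts, gpu_watts = 67.5, 360.0               # rough estimates quoted in the text
efficiency_gain = speedup / (gpu_watts / cpu_watts)

print(f"CUDA throughput: {throughput(cuda_s):,.0f} candidates^3/sec")
print(f"{speedup:.2f}x faster, ~{efficiency_gain:.1f}x more power-efficient")
print(f"Graviton 4 (NEON) vs. H100 at 32,768 candidates: {179.82 / 42.90:.1f}x")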

Links