Exploring George Hotz's minikeyvalue design in Python
MIT License
Exploring George Hotz's minikeyvalue design in Python. Overall, I think George's implementation (and language choice) is better. Sure, my Python version can match his when it comes to performance and throughput but it takes an extra reverse proxy layer (more NGINX) to get there. Go really is a powerhouse when it comes to high-performance network-based applications and it shows.
Volume "servers" store all of the actual data. The data is evenly distributed into a predictable nested directory structure based on the base64'd hash of its contents. Could be text, a file, doesn't really matter. It's all bytes. This part of the program is just an NGINX subprocess. The config is generated as needed, thrown into a tempfile, and started as a child process. NGINX then starts up its own worker child processes.
The index server manages a mapping of where stuff lives on the volume servers. A DBM database is used to store this mapping in a fast, persistent way. The index server also exposes the primary API. GET gets data, PUT stores data, DELETE deletes data. Pretty simple! The index server talks with the volume servers in a similar manner.
Besides Python, this project also depends on having NGINX installed.
# linux, debian-based
sudo apt install nginx
# macos
brew install nginx
If you are unfamiliar with virtual environments, I suggest taking a brief moment to learn about them and set one up. The Python docs provide a great tutorial for getting started with virtual environments and packages.
Install the project's Python dependencies via:
pip install -r requirements.txt
First start one or more volume servers (these could run on different systems):
# ignore the nginx "could not open error log" warnings
python3 volume.py 3001 /tmp/volume1/ &
python3 volume.py 3002 /tmp/volume2/ &
python3 volume.py 3003 /tmp/volume3/ &
Then start the index server using the addresses of the volume servers:
python3 index.py localhost:3001 localhost:3002 localhost:3003
With the index server running on port 3000, the following commands demonstrate core functionality:
# put "bigswag" in key "wehave"
curl -v -L -X PUT -d bigswag http://localhost:3000/wehave
# get key "wehave" (should be "bigswag")
curl -v -L -X GET http://localhost:3000/wehave
# delete key "wehave"
curl -v -L -X DELETE http://localhost:3000/wehave
# put file in key "file.txt"
curl -v -L -X PUT -T /path/to/local/file.txt http://localhost:3000/file.txt
# get file in key "file.txt"
curl -v -L -X GET -o /path/to/local/file.txt http://localhost:3000/file.txt
All of these benchmarks were executed against a small cluster (1 index, 3 volumes) of Digital Ocean droplets (1vCPU, 512MB, $5/month).
Benchmark | Command |
---|---|
fetch missing | hey -c 100 -z 10s http://<index_server_ip>:3000/missing |
fetch present | hey -c 100 -z 10s http://<index_server_ip>:3000/present |
thrasher.go | go run extras/thrasher.go -addr <index_server_ip>:3000 |
Each of these results are the average of three trials and are measured in requests per second.
Benchmark | pymkv | minikeyvalue |
---|---|---|
fetch missing | 1397 | 2152 |
fetch present | 1012 | 941 |
thrasher.go | 99 | 102 |