Zobrist hashing in C
APACHE-2.0 License
Zobrist hashing is the simplest form of tabulation-based hashing. It can be shown to be 3-wise independent. The Zobrist approach tested here is used in real systems, e.g., Gigablast (https://www.gigablast.com/). Alternatively, one could use a tabulation-based function as a complement to other hash functions: first hash the content down to a few bytes (e.g., 4), then apply a tabulation-based hash to the result.
Tabulation-based hashing uses a lot of memory and is susceptible to cache misses. E.g., to hash 4-byte strings to 64-bit values, you need 8 KB (4 tables of 256 entries, 8 bytes per entry). Moreover, its speed is limited (in part) by the system's ability to issue random-access loads.
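To make the 8 KB figure and the "hash down to 4 bytes first" idea concrete, here is a minimal sketch. It is not the code from this repository: the names `tab32_t`, `tab32_init`, `tab32_hash` and `tab32_hash_bytes` are hypothetical, the tables are filled with splitmix64 (a pseudo-random stand-in, whereas the theoretical guarantees assume truly random table entries), and FNV-1a is only a placeholder for "some other hash function".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 256-entry table of 64-bit words per key byte:
   4 tables * 256 entries * 8 bytes = 8192 bytes (8 KB). */
typedef struct {
    uint64_t table[4][256];
} tab32_t; /* hypothetical name, not the repository's zobrist_t */

/* splitmix64 step, used only to fill the tables reproducibly (assumption). */
static uint64_t splitmix64(uint64_t *state) {
    uint64_t z = (*state += UINT64_C(0x9E3779B97F4A7C15));
    z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
    return z ^ (z >> 31);
}

/* Fill all four tables from a seed; call once before hashing. */
static void tab32_init(tab32_t *t, uint64_t seed) {
    assert(t != NULL);
    for (size_t i = 0; i < 4; i++) {
        for (size_t j = 0; j < 256; j++) {
            t->table[i][j] = splitmix64(&seed);
        }
    }
}

/* Simple tabulation: look up each byte of the key in its own table
   and XOR the four results. Each lookup is a random-access load. */
static uint64_t tab32_hash(const tab32_t *t, uint32_t key) {
    uint64_t h = 0;
    for (size_t i = 0; i < 4; i++) {
        h ^= t->table[i][(key >> (8 * i)) & 0xFF];
    }
    return h;
}

/* FNV-1a (32-bit), a stand-in for any function that reduces
   arbitrary content to 4 bytes. */
static uint32_t fnv1a32(const void *data, size_t len) {
    const unsigned char *p = (const unsigned char *)data;
    uint32_t h = UINT32_C(2166136261);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= UINT32_C(16777619);
    }
    return h;
}

/* Composition: reduce the content to 4 bytes, then tabulate the result. */
static uint64_t tab32_hash_bytes(const tab32_t *t, const void *data, size_t len) {
    return tab32_hash(t, fnv1a32(data, len));
}
```

With this layout, the hashes of the four keys 0x000, 0x001, 0x100 and 0x101 always XOR to zero, whatever the table contents. This is the classic reason simple tabulation is 3-wise but not 4-wise independent.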
In an exhaustive experimental evaluation of hash-table performance, Richter et al. (VLDB, 2016) found that Zobrist hashing yields low throughput. Consequently, the authors declare it to be "less attractive in practice" than its strong randomness properties would suggest.
This C code expects a GCC-like compiler on an x64 system.
The code demonstrates that it is difficult to hash much more than 0.65 bytes per cycle on recent x64 Intel processors, even when repeatedly hashing the same short string. In contrast, it is possible to hash 4 to 10 bytes per cycle using fast hash families. See https://github.com/lemire/StronglyUniversalStringHashing
zobrist_t k;
init_zobrist(&k); // call once
// then you can hash as many strings as you want:
uint64_t hashvalue = zobrist(mystring, mystringsize, &k);
// for null-terminated strings, you can use:
uint64_t hashvalue_nt = zobrist_nt(mystring, &k);
// for strings longer than 256 bytes, the theoretical bounds no longer hold
make
./benchmark