fastscancount

Fast implementations of the scancount algorithm: C++ header-only library

APACHE-2.0 License

Stars
24

fastscancount

Fast implementations of the scancount algorithm

Given a set of arrays of integers, we seek to identify all values that occur more than 'threshold' times. We do so using the 'scancount' algorithm. It is assumed that you have fewer than 256 arrays of integers and that the threshold is no larger than 254.

We are effectively providing optimized versions of the following function:

void scancount(std::vector<std::vector<uint32_t>> &data, std::vector<uint32_t> &out, uint8_t threshold) {
  std::fill(counters.begin(), counters.end(), 0);
  out.clear();
  for (size_t c = 0; c < data.size(); c++) {
    std::vector<uint32_t> &v = data[c];
    for (size_t i = 0; i < v.size(); i++) {
      counters[v[i]]++;
    }
  }
  for (uint32_t i = 0; i < counters.size(); i++) {
    if (counters[i] > threshold)
      out.push_back(i);
  }
}

Our optimized versions assume that your arrays are made of sorted integers.

There are two headers, fastscancount.h uses plain C++ and should be portable. It has one main function in the fastscancount namespace. We always write the result to 'out'.

void fastscancount(std::vector<std::vector<uint32_t>> &data,
    std::vector<uint32_t> &out, uint8_t threshold)

There is another header fastscancount_avx2.h which expects an x64 processor supporting the AVX2 instruction set. It has a similar function signature:

void fastscancount_avx2(std::vector<std::vector<uint32_t>> &data,
    std::vector<uint32_t> &out, uint8_t threshold)

The AVX2 version assumes that you have fewer than 128 arrays of integers.

Because this library is made solely of headers, there is no need for a build system.

Linux benchmark

If you have bare metal access to a Linux box, you can run cycle-accurate benchmarks.

make
./counter

Sample output with GNU GCC 8.3:

$ ./counter
Got 2497 hits
optimized cache-sensitive scancount
4.01381 cycles/element
AVX2-based scancount
3.58494 cycles/element

With LLVM clang, we seem to get better results:

$ ./counter
Got 2497 hits
optimized cache-sensitive scancount
3.54267 cycles/element
2.8908 instructions/cycles
0.0134279 miss/element
AVX2-based scancount
3.57374 cycles/element
2.03391 instructions/cycles
0.0109755 miss/element

Blog post

How fast can scancount be?

Using actual data

./counter --postings data/postings.bin --queries data/queries.bin --threshold 3

Credit

The AVX2 version was designed and implemented by Travis Downs. The scalar version was designed and implemented by Daniel Lemire based on ideas by Nathan Kurz, Travis Downs and others.

Reference

Owen Kaser, Daniel Lemire, Compressed bitmap indexes: beyond unions and intersections, Software: Practice and Experience 46 (2), 2016