SimSIMD - v3.7.3

Published by ashvardanian 9 months ago

3.7.3 (2024-01-28)

Make

Revert Crate dependency versions (3807f28)

SimSIMD - v3.7.1

Published by ashvardanian 9 months ago

3.7.1 (2024-01-28)

Fix

_Float16 support in Go (d5534fc), closes #71

Make

Versioning Rust crates (ba1c941)

SimSIMD - v3.7.0

Published by ashvardanian 9 months ago

3.7.0 (2024-01-28)

Add

Rust binding for SimSIMD (#75) (ec4c686), closes #75

SimSIMD - v3.6.7

Published by ashvardanian 9 months ago

3.6.7 (2024-01-24)

Fix

GoLang bindings (#70) (0627795), closes #70

SimSIMD - v3.6.6

Published by ashvardanian 9 months ago

3.6.6 (2024-01-22)

Fallback for Vercel-based apps (#66) (dc6de11), closes #66
Memory leak in cdist (#61) (0469ec2), closes #61
Py version is inferred from macros (234a282)

Thanks to @sroussey and @smthngslv 👏

SimSIMD - v3.6.5

Published by ashvardanian 9 months ago

3.6.5 (2024-01-18)

Make

ESM and CommonJS release with fallbacks (#63) (d57f82b), closes #63

SimSIMD - v3.6.4

Published by ashvardanian 10 months ago

3.6.4 (2024-01-08)

Docs

TypeScript declaration file (#53) (5f6a688), closes #53

Make

Prebuild JavaScript bindings (#56) (1bd9001), closes #56

SimSIMD - v3.6.3

Published by ashvardanian 10 months ago

3.6.3 (2024-01-06)

Make

Revert test location (82c4dcb)

SimSIMD - v3.6.2

Published by ashvardanian 10 months ago

3.6.2 (2024-01-06)

Docs

Describe usage in C (555ce0c)
JS installation, grammar and counters (#50) (ba0e233), closes #50
typo in README.md (#49) (330c039), closes #49

Fix

Type errors in JS benchmarks (#51) (57ced28), closes #51

SimSIMD - v3.6.1

Published by ashvardanian 10 months ago

3.6.1 (2023-12-19)

Docs

New header (248082d)

Fix

SEGFAULT creating NumPy Array (6cccca9)

Improve

Cleaner accumulator init (af4a818)
Logging exceptions (c5b4c0e)

Make

Update Python library __version__ (14559ed)

Test

Increase error tolerance (d216035)

SimSIMD - Faster Double-Precision Math

Published by ashvardanian 11 months ago

As discussed in the SciPy integration thread, Python libraries use double-precision floating-point numbers by default.
So in this release I've extended the spatial distance functions (cosine, sqeuclidean, and inner) to accept double-precision arguments, with specialized implementations for AVX-512-capable x86 CPUs and SVE-capable Arm CPUs.

Benchmarking SimSIMD vs. SciPy on Intel Sapphire Rapids CPU

Vector dimensions: 1536
Vectors count: 1000
Hardware capabilities: serial, x86_avx2, x86_avx512, x86_avx2fp16, x86_avx512fp16, x86_avx512vpopcntdq, x86_avx512vnni
NumPy BLAS dependency: openblas64
NumPy LAPACK dependency: dep140640983012528

Between 2 Vectors, Batch Size: 1

Datatype	Method	Ops/s	SimSIMD Ops/s	SimSIMD Improvement
`f64`	`scipy.cosine`	63,612	572,605	9.00 x
`f64`	`scipy.sqeuclidean`	238,547	915,596	3.84 x
`f64`	`numpy.inner`	449,499	986,522	2.19 x

Between 2 Vectors, Batch Size: 1,000

Datatype	Method	Ops/s	SimSIMD Ops/s	SimSIMD Improvement
`f64`	`scipy.cosine`	68,962	1,457,172	21.13 x
`f64`	`scipy.sqeuclidean`	247,727	1,535,547	6.20 x
`f64`	`numpy.inner`	463,509	1,512,004	3.26 x

Benchmarking SimSIMD vs. SciPy on AWS Graviton 3

Vector dimensions: 1536
Vectors count: 1000
Hardware capabilities: serial, arm_neon, arm_sve
NumPy BLAS dependency: openblas64
NumPy LAPACK dependency: openblas64

Between 2 Vectors, Batch Size: 1

Datatype	Method	Ops/s	SimSIMD Ops/s	SimSIMD Improvement
`f64`	`scipy.cosine`	40,729	725,382	17.81 x
`f64`	`scipy.sqeuclidean`	160,812	728,114	4.53 x
`f64`	`numpy.inner`	473,443	767,374	1.62 x
`f64`	`scipy.jensenshannon`	15,684	38,528	2.46 x
`f64`	`scipy.kl_div`	49,983	61,811	1.24 x

Between 2 Vectors, Batch Size: 1,000

Datatype	Method	Ops/s	SimSIMD Ops/s	SimSIMD Improvement
`f64`	`scipy.cosine`	41,130	1,460,850	35.52 x
`f64`	`scipy.sqeuclidean`	162,147	1,486,255	9.17 x
`f64`	`numpy.inner`	473,856	1,580,136	3.33 x

SimSIMD - v3.5.5

Published by ashvardanian 12 months ago

3.5.5 (2023-11-11)

Docs

Reorder sections (ac2ee97)

Improve

Detecting compile-time capabilities (e02a24f)

SimSIMD - v3.5.4

Published by ashvardanian 12 months ago

3.5.4 (2023-11-09)

Docs

Improve comparison table (db65654)

Improve

revert back to using the reciprocal (3fe57c3)

Make

Remove a few more files from NPM pack (882614a)

SimSIMD - v3.5.3

Published by ashvardanian 12 months ago

3.5.3 (2023-10-31)

Make

Upgrade JS CI pipeline (7f78936)

SimSIMD - v3.5.2

Published by ashvardanian 12 months ago

3.5.2 (2023-10-31)

Make

Remove any package links for NPM (ea20fd0)

SimSIMD - v3.5.1

Published by ashvardanian 12 months ago

3.5.1 (2023-10-31)

Make

JS package hard-link resolved (971b9c5)

SimSIMD - v3.5.0

Published by ashvardanian 12 months ago

3.5.0 (2023-10-31)

Add

.npmignore & some minor fixes (#37) (f2555af), closes #37

Docs

Download stats (5df5220)

Fix

Avoid ifnan compilation issues for GBench (fe3286f)
normalize vectors for JS tests (7f9c6df)
SciPy JS uses square root (d8b9762)

Improve

Parameterize epsilon for different types (2a49d9e)
Randomize NumPy seed on every run (a49b866)
Same epsilon for JS/KL backends (a70479f)
use GlibC in CPython bindings (65fe343)

Test

Compare our f16 to SciPy f64 (dd655c1)

SimSIMD - v3.4.0

Published by ashvardanian 12 months ago

3.4.0 (2023-10-31)

Add

.npmignore & some minor fixes (#37) (#38) (de56014), closes #37 #38

SimSIMD - v3.3.0

Published by ashvardanian 12 months ago

3.3.0 (2023-10-27)

Add

VNNI capability (2dd106f)

Fix

AVX2 int8 angular distance (143aa34)
Use rtol for L2sq and atol for other (89a61b3)

Improve

goto to avoid more conditions (ef71253)
Run benchmarks on 1 thread (2e1a714)
Use BMI2 and AVX-512VNNI for masks & fma (161eee9)

Make

Disable -ffast-math (afcb7f8)
Separate CI (62c4901)
Use recent compilers (ac01aa2)

Test

distances for int8 arrays (c1c06ba)
Normalize bitwise distances (3edfe0b)

SimSIMD - Beating GCC 12 - 118x Speedup for Jensen Shannon Divergence via AVX-512FP16

Published by ashvardanian almost 1 year ago

Divergence functions are a bit more complex than the Cosine Similarity, primarily because they have to compute logarithms, which are relatively slow when using LibC's logf.

So, aside from minor patches, in this PR I've rewritten the Jensen Shannon distances with several optimizations, mainly targeting the AVX-512 and AVX-512FP16 extensions. The result is a 4.6x improvement over the auto-vectorized single-precision variant and a whopping 118x improvement over the half-precision code produced by GCC 12.

Optimizations

Logarithm Computation. Instead of multiple bitwise operations, _mm512_getexp_ph and _mm512_getmant_ph are now used to extract the exponent and the mantissa of the floating-point number, streamlining the process. I've also used Horner's method for the polynomial approximation.
Division Avoidance. To avoid expensive division operations, reciprocal approximations are utilized - _mm512_rcp_ph for half-precision and _mm512_rcp14_ps for single-precision. The _mm512_rcp28_ps was found to be unnecessary for this implementation.
Handling Zeros. The _mm512_cmp_ph_mask is used to compute a mask for close-to-zero values, avoiding the addition of an "epsilon" to every component, which is both cleaner and more accurate.
Parallel Accumulation. The $KL(P||M)$ and $KL(Q||M)$ sums are now accumulated in separate registers, and the masked _mm512_maskz_fmadd_ph replaces distinct addition and multiplication operations, optimizing the calculation further.
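
To make the logarithm trick easier to follow, here is a scalar sketch of the same idea in portable C. It assumes `frexpf` as a stand-in for the exponent and mantissa extraction intrinsics and reuses the polynomial coefficients from the AVX-512FP16 kernel in this post. The name `sketch_log2_f32` is hypothetical and is not a SimSIMD function.

```c
#include <assert.h>
#include <math.h>

// Hypothetical scalar analogue of the exponent/mantissa split:
// x = m * 2^e with m in [1, 2), so log2(x) = e + log2(m).
static float sketch_log2_f32(float x) {
    assert(x > 0.0f && isfinite(x)); // only positive finite inputs are handled
    int e;
    float m = frexpf(x, &e); // here m is in [0.5, 1) and x = m * 2^e
    m *= 2.0f;               // renormalize the mantissa to [1, 2) ...
    e -= 1;                  // ... and compensate in the exponent

    // Horner's method: one multiply-add per coefficient
    float p = -3.4436006e-2f;
    p = p * m + 3.1821337e-1f;
    p = p * m + -1.2315303f;
    p = p * m + 2.5988452f;
    p = p * m + -3.3241990f;
    p = p * m + 3.1157899f;

    // log2(m) is approximated as p(m) * (m - 1), which is exactly 0 at m = 1
    return (float)e + p * (m - 1.0f);
}
```

Factoring out `(m - 1)` keeps powers of two exact, since the polynomial term vanishes and only the integer exponent remains.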

Implementation

As a reminder, the Jensen Shannon divergence is a symmetrized version of the Kullback-Leibler divergence:

$$
JSD(P, Q) = \frac{1}{2} D(P || M) + \frac{1}{2} D(Q || M)
$$

$$
M = \frac{1}{2}(P + Q), \quad D(P || Q) = \sum_i P(i) \cdot \log \left( \frac{P(i)}{Q(i)} \right)
$$

For AVX-512FP16, the current implementation looks like this:

__attribute__((target("avx512f,avx512vl,avx512fp16")))
inline __m512h simsimd_avx512_f16_log2(__m512h x) {
    // Extract the exponent and mantissa
    __m512h one = _mm512_set1_ph((_Float16)1);
    __m512h e = _mm512_getexp_ph(x);
    __m512h m = _mm512_getmant_ph(x, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_src);

    // Compute the polynomial using Horner's method
    __m512h p = _mm512_set1_ph((_Float16)-3.4436006e-2f);
    p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)3.1821337e-1f));
    p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)-1.2315303f));
    p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)2.5988452f));
    p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)-3.3241990f));
    p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)3.1157899f));

    return _mm512_add_ph(_mm512_mul_ph(p, _mm512_sub_ph(m, one)), e);
}

__attribute__((target("avx512f,avx512vl,avx512fp16")))
inline static simsimd_f32_t simsimd_avx512_f16_js(simsimd_f16_t const* a, simsimd_f16_t const* b, simsimd_size_t n) {
    __m512h sum_a_vec = _mm512_set1_ph((_Float16)0);
    __m512h sum_b_vec = _mm512_set1_ph((_Float16)0);
    __m512h epsilon_vec = _mm512_set1_ph((_Float16)1e-6f);
    for (simsimd_size_t i = 0; i < n; i += 32) {
        __mmask32 mask = n - i >= 32 ? 0xFFFFFFFF : ((1u << (n - i)) - 1u);
        __m512h a_vec = _mm512_castsi512_ph(_mm512_maskz_loadu_epi16(mask, a + i));
        __m512h b_vec = _mm512_castsi512_ph(_mm512_maskz_loadu_epi16(mask, b + i));
        __m512h m_vec = _mm512_mul_ph(_mm512_add_ph(a_vec, b_vec), _mm512_set1_ph((_Float16)0.5f));

        // Avoid division-by-zero problems caused by probabilities close to zero.
        // Masking is cleaner than adding an `epsilon` to every component.
        __mmask32 nonzero_mask_a = _mm512_cmp_ph_mask(a_vec, epsilon_vec, _CMP_GE_OQ);
        __mmask32 nonzero_mask_b = _mm512_cmp_ph_mask(b_vec, epsilon_vec, _CMP_GE_OQ);
        __mmask32 nonzero_mask = nonzero_mask_a & nonzero_mask_b & mask;

        // Division is an expensive operation. Instead of doing it twice,
        // we can approximate the reciprocal of `m` and multiply instead.
        __m512h m_recip_approx = _mm512_rcp_ph(m_vec);
        __m512h ratio_a_vec = _mm512_mul_ph(a_vec, m_recip_approx);
        __m512h ratio_b_vec = _mm512_mul_ph(b_vec, m_recip_approx);

        // The natural logarithm equals `log2` multiplied by `ln(2)`; that scaling is applied once after the loop
        __m512h log_ratio_a_vec = simsimd_avx512_f16_log2(ratio_a_vec);
        __m512h log_ratio_b_vec = simsimd_avx512_f16_log2(ratio_b_vec);

        // Instead of separate multiplication and addition, invoke the FMA
        sum_a_vec = _mm512_maskz_fmadd_ph(nonzero_mask, a_vec, log_ratio_a_vec, sum_a_vec);
        sum_b_vec = _mm512_maskz_fmadd_ph(nonzero_mask, b_vec, log_ratio_b_vec, sum_b_vec);
    }
    simsimd_f32_t log2_normalizer = 0.693147181f;
    return _mm512_reduce_add_ph(_mm512_add_ph(sum_a_vec, sum_b_vec)) * 0.5f * log2_normalizer;
}
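
The loop above handles lengths that are not a multiple of 32 with a lane mask instead of a separate scalar tail. Isolated as portable C, the mask computation could look like the sketch below; `sketch_tail_mask32` is a hypothetical helper name, not part of SimSIMD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: bit k is set when lane k of a 32-lane block holds a valid element.
static uint32_t sketch_tail_mask32(size_t remaining) {
    assert(remaining > 0); // the loop never runs with nothing left to process
    // Shifting a 32-bit value by 32 is undefined in C, so a full block is special-cased
    return remaining >= 32 ? 0xFFFFFFFFu : (((uint32_t)1 << remaining) - 1u);
}
```

Masked loads then return zeros for the inactive lanes, so reading past the end of the input is avoided without branching inside the loop body.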

Benchmarks

I benchmarked both the higher-level Python and the lower-level C++ layers, comparing GCC 12 auto-vectorization to the new implementation on an Intel Sapphire Rapids CPU on AWS.

The program was compiled with -O3 and -ffast-math and ran on all cores of the 4-core instance, potentially favoring the non-vectorized solution. When normalized and tabulated, the results are as follows:

Benchmark	Pairs/s	Gigabytes/s	Absolute Error	Relative Error
`serial_f32_js_1536d`	0.243 M/s	2.98 G/s	0	0
`serial_f16_js_1536d`	0.018 M/s	0.11 G/s	0.123	0.035
`avx512_f32_js_1536d`	1.127 M/s	13.84 G/s	0.001	345u
`avx512_f16_js_1536d`	2.139 M/s	13.14 G/s	0.070	0.020
`avx2_f16_js_1536d`	0.547 M/s	3.36 G/s	0.011	0.003
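
The Gigabytes/s column is consistent with the Pairs/s column if each pair is assumed to read two 1536-dimensional vectors, at 4 bytes per f32 scalar and 2 bytes per f16 scalar. A small sketch of that arithmetic, with a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: bytes read per second, assuming each pair touches two vectors.
static double sketch_bytes_per_second(double pairs_per_second, size_t dimensions, size_t bytes_per_scalar) {
    assert(dimensions > 0 && bytes_per_scalar > 0);
    return pairs_per_second * 2.0 * (double)dimensions * (double)bytes_per_scalar;
}
```

For example, `sketch_bytes_per_second(0.243e6, 1536, 4)` gives about 2.99e9 bytes/s for `serial_f32_js_1536d`, and `sketch_bytes_per_second(2.139e6, 1536, 2)` gives about 13.14e9 bytes/s for `avx512_f16_js_1536d`. The headline speedups follow from the same column: 1.127 / 0.243 is roughly 4.6x and 2.139 / 0.018 is roughly 118x.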

Of course, the results will vary depending on the vector size. I generally use 1536 dimensions, matching the size of OpenAI Ada embeddings, standard in NLP workloads. The Jensen Shannon divergence, however, is used broadly in other domains of statistics, bio-informatics, and chem-informatics, so I'm adding it as a new out-of-the-box supported metric into USearch today 🥳

This further accelerates the k-approximate Nearest Neighbors Search and the clustering of Billions of different protein sequences without alignment procedures. Expect one more "Less Slow" post soon! 🤗