Up to 200x Faster Inner Products and Vector Similarity — for Python, JavaScript, Rust, and C, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE 📐
APACHE-2.0 License
Published by ashvardanian 10 months ago
Published by ashvardanian 11 months ago
As was discussed in the SciPy integration thread, Python libraries use double-precision floating-point numbers by default.
So in this release I've extended the spatial distance functions - cosine
, sqeuclidean
, inner
with support for double
arguments with specialized implementations on AVX-512-capable x86 CPUs and SVE-capable Arm CPUs.
Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
---|---|---|---|---|
f64 |
scipy.cosine |
63,612 | 572,605 | 9.00 x |
f64 |
scipy.sqeuclidean |
238,547 | 915,596 | 3.84 x |
f64 |
numpy.inner |
449,499 | 986,522 | 2.19 x |
Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
---|---|---|---|---|
f64 |
scipy.cosine |
68,962 | 1,457,172 | 21.13 x |
f64 |
scipy.sqeuclidean |
247,727 | 1,535,547 | 6.20 x |
f64 |
numpy.inner |
463,509 | 1,512,004 | 3.26 x |
Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
---|---|---|---|---|
f64 |
scipy.cosine |
40,729 | 725,382 | 17.81 x |
f64 |
scipy.sqeuclidean |
160,812 | 728,114 | 4.53 x |
f64 |
numpy.inner |
473,443 | 767,374 | 1.62 x |
f64 |
scipy.jensenshannon |
15,684 | 38,528 | 2.46 x |
f64 |
scipy.kl_div |
49,983 | 61,811 | 1.24 x |
Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
---|---|---|---|---|
f64 |
scipy.cosine |
41,130 | 1,460,850 | 35.52 x |
f64 |
scipy.sqeuclidean |
162,147 | 1,486,255 | 9.17 x |
f64 |
numpy.inner |
473,856 | 1,580,136 | 3.33 x |
Published by ashvardanian 12 months ago
ifnan
compilation issues for GBench (fe3286f)f16
to SciPy f64
(dd655c1)Published by ashvardanian almost 1 year ago
Divergence functions are a bit more complex than the Cosine Similarity, primarily because they have to compute logarithms, which are relatively slow when using LibC's logf
.
So, aside from minor patches, in this PR, I've rewritten the Jensen Shannon distances leveraging several optimizations, mainly focusing on AVX-512 and AVX-512FP16 extensions, which resulted in 4.6x improvement over the auto-vectorized single-precision variant and a whopping 118x improvement over the half-precision code produced by GCC 12.
_mm512_getexp_ph
and _mm512_getmant_ph
are now used to extract the exponent and the mantissa of the floating-point number, streamlining the process. I've also used Horner's method for the polynomial approximation._mm512_rcp_ph
for half-precision and _mm512_rcp14_ps
for single-precision. The _mm512_rcp28_ps
was found to be unnecessary for this implementation._mm512_cmp_ph_mask
is used to compute a mask for close-to-zero values, avoiding the addition of an "epsilon" to every component, which is both cleaner and more accurate._mm512_maskz_fmadd_ph
replaces distinct addition and multiplication operations, optimizing the calculation further.To remind, the Jensen Shannon divergence is the symmetric version of the Kullback-Leibler divergence:
JSD(P, Q) = \frac{1}{2} D(P || M) + \frac{1}{2} D(Q || M) \\
M = \frac{1}{2}(P + Q), D(P || Q) = \sum P(i) \cdot \log \left( \frac{P(i)}{Q(i)} \right)
For AVX-512FP16, the current implementation looks like this:
__attribute__((target("avx512f,avx512vl,avx512fp16")))
inline __m512h simsimd_avx512_f16_log2(__m512h x) {
// Extract the exponent and mantissa
__m512h one = _mm512_set1_ph((_Float16)1);
__m512h e = _mm512_getexp_ph(x);
__m512h m = _mm512_getmant_ph(x, _MM_MANT_NORM_1_2, _MM_MANT_SIGN_src);
// Compute the polynomial using Horner's method
__m512h p = _mm512_set1_ph((_Float16)-3.4436006e-2f);
p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)3.1821337e-1f));
p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)-1.2315303f));
p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)2.5988452f));
p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)-3.3241990f));
p = _mm512_fmadd_ph(m, p, _mm512_set1_ph((_Float16)3.1157899f));
return _mm512_add_ph(_mm512_mul_ph(p, _mm512_sub_ph(m, one)), e);
}
__attribute__((target("avx512f,avx512vl,avx512fp16")))
inline static simsimd_f32_t simsimd_avx512_f16_js(simsimd_f16_t const* a, simsimd_f16_t const* b, simsimd_size_t n) {
__m512h sum_a_vec = _mm512_set1_ph((_Float16)0);
__m512h sum_b_vec = _mm512_set1_ph((_Float16)0);
__m512h epsilon_vec = _mm512_set1_ph((_Float16)1e-6f);
for (simsimd_size_t i = 0; i < n; i += 32) {
__mmask32 mask = n - i >= 32 ? 0xFFFFFFFF : ((1u << (n - i)) - 1u);
__m512h a_vec = _mm512_castsi512_ph(_mm512_maskz_loadu_epi16(mask, a + i));
__m512h b_vec = _mm512_castsi512_ph(_mm512_maskz_loadu_epi16(mask, b + i));
__m512h m_vec = _mm512_mul_ph(_mm512_add_ph(a_vec, b_vec), _mm512_set1_ph((_Float16)0.5f));
// Avoid division by zero problems from probabilities under zero down the road.
// Masking is a nicer way to do this, than adding the `epsilon` to every component.
__mmask32 nonzero_mask_a = _mm512_cmp_ph_mask(a_vec, epsilon_vec, _CMP_GE_OQ);
__mmask32 nonzero_mask_b = _mm512_cmp_ph_mask(b_vec, epsilon_vec, _CMP_GE_OQ);
__mmask32 nonzero_mask = nonzero_mask_a & nonzero_mask_b & mask;
// Division is an expensive operation. Instead of doing it twice,
// we can approximate the reciprocal of `m` and multiply instead.
__m512h m_recip_approx = _mm512_rcp_ph(m_vec);
__m512h ratio_a_vec = _mm512_mul_ph(a_vec, m_recip_approx);
__m512h ratio_b_vec = _mm512_mul_ph(b_vec, m_recip_approx);
// The natural logarithm is equivalent to `log2`, multiplied by the `loge(2)`
__m512h log_ratio_a_vec = simsimd_avx512_f16_log2(ratio_a_vec);
__m512h log_ratio_b_vec = simsimd_avx512_f16_log2(ratio_b_vec);
// Instead of separate multiplication and addition, invoke the FMA
sum_a_vec = _mm512_maskz_fmadd_ph(nonzero_mask, a_vec, log_ratio_a_vec, sum_a_vec);
sum_b_vec = _mm512_maskz_fmadd_ph(nonzero_mask, b_vec, log_ratio_b_vec, sum_b_vec);
}
simsimd_f32_t log2_normalizer = 0.693147181f;
return _mm512_reduce_add_ph(_mm512_add_ph(sum_a_vec, sum_b_vec)) * 0.5f * log2_normalizer;
}
I conducted benchmarks at both the higher-level Python and lower-level C++ layers, comparing the auto-vectorization on GCC 12 to our new implementation on an Intel Sapphire Rapids CPU on AWS:
The program was compiled with -O3
and -ffast-math
and was running on all cores of the 4-core instance, potentially favoring the non-vectorized solution. When normalized and tabulated, the results are as follows:
Benchmark | Pairs/s | Gigabytes/s | Absolute Error | Relative Error |
---|---|---|---|---|
serial_f32_js_1536d |
0.243 M/s | 2.98 G/s | 0 | 0 |
serial_f16_js_1536d |
0.018 M/s | 0.11 G/s | 0.123 | 0.035 |
avx512_f32_js_1536d |
1.127 M/s | 13.84 G/s | 0.001 | 345u |
avx512_f16_js_1536d |
2.139 M/s | 13.14 G/s | 0.070 | 0.020 |
avx2_f16_js_1536d |
0.547 M/s | 3.36 G/s | 0.011 | 0.003 |
Of course, the results will vary depending on the vector size. I generally use 1536 dimensions, matching the size of OpenAI Ada embeddings, standard in NLP workloads. The Jensen Shannon divergence, however, is used broadly in other domains of statistics, bio-informatics, and chem-informatics, so I'm adding it as a new out-of-the-box supported metric into USearch today 🥳
This further accelerates the k-approximate Nearest Neighbors Search and the clustering of Billions of different protein sequences without alignment procedures. Expect one more "Less Slow" post soon! 🤗