SimSIMD - v5.1.1: Faster Dispatch & Better Docs in 🐍 Latest Release

Published by ashvardanian about 1 month ago

Preview

Most importantly, this is the first SimSIMD release deprecating Python 3.6, released in 2016. Now, 8 years later, we deprecated it to more broadly utilize the Fast Calling Convention. Read more in a dedicated article on the cost of function arguments parsing in Pyhton - 35% discount on keyword arguments 😄

Thanks to @stuartatnosible for noticing dtype= issues 👓
Thanks to @MarkReedZ for accelerating bf16 dot-product on Arm 🦾

SimSIMD - v5.1: Curved Spaces & Sparse Sets

Published by ashvardanian about 2 months ago

Major additions:

Set Intersections & Sparse Distances #100
Bilinear Forms & Curved Spaces #149, suggested by @guyrosin
Reduced Memory Consumption #142, suggested by @Charlyo
Fixing accuracy issues #153, spotted by @cbornet

Minor fixes:

Extends benchmarks, thanks to @MarkReedZ
Cleans up CMake, suggested by @rschu1ze
Exposing f16, i8, b8 to Python buffers in 33f1b13
Documenting algorithms in 405bef0
Exposing bf16 to Rust, thanks to @Wyctus

Some crazy findings:

rsqrt precision on Arm is not documented at all
rsqrt approximation for double in AVX-512 is only 6x more accurate, than for float
SciPy raises the wrong errors when overflows - ValueError instead of OverflowError
SciPy occasionally misuses math.sqrt over numpy.sqrt when dealing with NumPy array
sqrt in libc is bit-precise

SimSIMD - Release v5.0.1

Published by ashvardanian 2 months ago

Release: v5.0.1 [skip ci]

SimSIMD - New bf16 capabilities on Arm 🦾

Published by ashvardanian 3 months ago

Handling `bf16` & Dynamic Dispatch on Arm

This major release adds new capability levels for Arm allowing us to differentiate f16, bf16. and i8-supporting generations of CPUs, becoming increasingly popular in the datacenter. Similar to speedups on AMD Genoa, on Arm Graviton3 the bf16 kernels perform very well:

dot_bf16_neon_1536d/min_time:10.000/threads:1            183 ns          183 ns     76204478 abs_delta=0 bytes=33.5194G/s pairs=5.45563M/s relative_error=0
cos_bf16_neon_1536d/min_time:10.000/threads:1            239 ns          239 ns     58180403 abs_delta=0 bytes=25.7056G/s pairs=4.18386M/s relative_error=0
l2sq_bf16_neon_1536d/min_time:10.000/threads:1           312 ns          312 ns     43724273 abs_delta=0 bytes=19.7064G/s pairs=3.20742M/s relative_error=0

The bf16 kernels reach 33 GB/s as opposed to 19 GB/s for f16:

dot_f16_neon_1536d/min_time:10.000/threads:1             323 ns          323 ns     43311367 abs_delta=82.3015n bytes=19.0324G/s pairs=3.09772M/s relative_error=109.717n
cos_f16_neon_1536d/min_time:10.000/threads:1             367 ns          367 ns     38007895 abs_delta=1.5456m bytes=16.7349G/s pairs=2.72377M/s relative_error=6.19568m
l2sq_f16_neon_1536d/min_time:10.000/threads:1            341 ns          341 ns     41010555 abs_delta=66.7783n bytes=18.0436G/s pairs=2.93679M/s relative_error=133.449n

Researching MMLA Extensions & Future Work

Arm supports 2x2 matrix multiplications for i8 and bf16. All of our initial attempts with @eknag to use them for faster cosine computations for different length vectors have failed. Old measurements:

cos_i8_neon_16d/min_time:10.000/threads:1       5.41 ns         5.41 ns   1000000000 abs_delta=910.184u bytes=5.91441G/s pairs=184.825M/s relative_error=4.20295m
cos_i8_neon_64d/min_time:10.000/threads:1       7.63 ns         7.63 ns   1000000000 abs_delta=939.825u bytes=16.7729G/s pairs=131.039M/s relative_error=3.82144m
cos_i8_neon_1536d/min_time:10.000/threads:1        101 ns          101 ns    139085845 abs_delta=917.35u bytes=30.394G/s pairs=9.89387M/s relative_error=3.63925m

Attempts with i8 for different dimensionality vectors:

cos_i8_neon_16d/min_time:10.000/threads:1       5.72 ns         5.72 ns   1000000000 abs_delta=0.282084 bytes=5.59562G/s pairs=174.863M/s relative_error=1.15086
cos_i8_neon_64d/min_time:10.000/threads:1       8.40 ns         8.40 ns   1000000000 abs_delta=0.234385 bytes=15.2345G/s pairs=119.02M/s relative_error=0.923009
cos_i8_neon_1536d/min_time:10.000/threads:1        117 ns          117 ns    118998604 abs_delta=0.23264 bytes=26.2707G/s pairs=8.55167M/s relative_error=0.920099

This line of work is to be continued, in parallel with new similarity metrics and distance functions.

Other Improvements

Added the missing Haswell kernels for f32 spatial distances on older x86 CPUs. Thanks to @makarr 👏
Added Python typing extensions for the native modules. Thanks to @cbornet 👏
Fixed offline builds from the .tar.gz packages, that were missing a VERSION file. Thanks to @OD-Ice 👏

SimSIMD - New `bf16` & faster `i8` kernels ❤️ AMD Genoa

Published by ashvardanian 4 months ago

New `bf16` kernels

The "brain-float-16" is a popular machine learning format. It's broadly supported in hardware and is very machine-friendly, but software support is still lagging behind - https://github.com/numpy/numpy/issues/19808. Most importantly, low-precision bf16 dot-products are supported by the most recent Zen4-based AMD Genoa CPUs. Those have up-to 96 cores, and just one of those cores is capable of computing 86 GB/s worth of such dot-products.

------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
dot_bf16_haswell_1536d/min_time:10.000/threads:1         203 ns          203 ns     68785823 abs_delta=29.879n bytes=30.1978G/s pairs=4.91501M/s relative_error=39.8289n
dot_bf16_haswell_1536b/min_time:10.000/threads:1        93.0 ns         93.0 ns    150582910 abs_delta=24.8365n bytes=33.0344G/s pairs=10.7534M/s relative_error=33.1108n
dot_bf16_genoa_1536d/min_time:10.000/threads:1          71.0 ns         71.0 ns    197340105 abs_delta=23.6042n bytes=86.5917G/s pairs=14.0937M/s relative_error=31.4977n
dot_bf16_genoa_1536b/min_time:10.000/threads:1          36.1 ns         36.1 ns    387637713 abs_delta=22.3063n bytes=85.0019G/s pairs=27.6699M/s relative_error=29.7341n
dot_bf16_serial_1536d/min_time:10.000/threads:1        15992 ns        15991 ns       874491 abs_delta=311.896n bytes=384.216M/s pairs=62.5352k/s relative_error=415.887n
dot_bf16_serial_1536b/min_time:10.000/threads:1         7979 ns         7978 ns      1754703 abs_delta=193.719n bytes=385.045M/s pairs=125.34k/s relative_error=258.429n
dot_bf16c_serial_1536d/min_time:10.000/threads:1       16430 ns        16429 ns       852438 abs_delta=251.692n bytes=373.964M/s pairs=60.8665k/s relative_error=336.065n
dot_bf16c_serial_1536b/min_time:10.000/threads:1        8207 ns         8202 ns      1707289 abs_delta=165.209n bytes=374.54M/s pairs=121.92k/s relative_error=220.35n
vdot_bf16c_serial_1536d/min_time:10.000/threads:1      16489 ns        16488 ns       849194 abs_delta=247.646n bytes=372.639M/s pairs=60.6509k/s relative_error=330.485n
vdot_bf16c_serial_1536b/min_time:10.000/threads:1       8224 ns         8217 ns      1704397 abs_delta=162.036n bytes=373.839M/s pairs=121.693k/s relative_error=216.042n

That's a steep 3x improvement over single-precision FMA throughput we can obtain by simply shifting bf16 left by 16 bits and using _mm256_fmadd_ps intrinsic / vfmadd instruction available since Intel Haswell.

Faster `i8` kernels

We can't directly use _mm512_dpbusd_epi32 every time we want to compute a low-precision integer dot-product, as it's asymmetric with respect to the sign of the input arguments:

Signed(ZeroExtend16(a.byte[4j]) * SignExtend16(b.byte[4j]))

In the past we would just upcast to 16-bit integers and resort to _mm512_dpwssds_epi32. It is a much more costly multiplication circuit, and, assuming that I avoid loop unrolling, also implies 2x fewer scalars per loop. But for cosine distances there is something simple we can do. Assuming that we multiply the vector by itself, even if a certain vector component is negative, its square will always be positive. So we can avoid the expensive 16-bit operation at least where we compute the vector norms:

    a_abs_vec = _mm512_abs_epi8(a_vec);
    b_abs_vec = _mm512_abs_epi8(b_vec);
    a2_i32s_vec = _mm512_dpbusds_epi32(a2_i32s_vec, a_abs_vec, a_abs_vec);
    b2_i32s_vec = _mm512_dpbusds_epi32(b2_i32s_vec, b_abs_vec, b_abs_vec);

On Intel Sapphire Rapids it resulted in a higher single-thread utilization, but didn't lead to improvements on other platforms.

---------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
cos_i8_haswell_1536d/min_time:10.000/threads:1       92.4 ns         92.4 ns    151487077 abs_delta=105.739u bytes=33.2344G/s pairs=10.8185M/s relative_error=405.868u
cos_i8_haswell_1536b/min_time:10.000/threads:1       92.4 ns         92.4 ns    151478714 abs_delta=0 bytes=33.2383G/s pairs=10.8198M/s relative_error=0
cos_i8_ice_1536d/min_time:10.000/threads:1           61.6 ns         61.6 ns    227445214 abs_delta=0 bytes=49.898G/s pairs=16.2428M/s relative_error=0
cos_i8_ice_1536b/min_time:10.000/threads:1           61.5 ns         61.5 ns    227609621 abs_delta=0 bytes=49.9167G/s pairs=16.2489M/s relative_error=0
cos_i8_serial_1536d/min_time:10.000/threads:1         299 ns          299 ns     46788061 abs_delta=0 bytes=10.2666G/s pairs=3.34198M/s relative_error=0
cos_i8_serial_1536b/min_time:10.000/threads:1         299 ns          299 ns     46787275 abs_delta=0 bytes=10.2663G/s pairs=3.34191M/s relative_error=0

New timings:

---------------------------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------
cos_i8_haswell_1536d/min_time:10.000/threads:1       92.4 ns         92.4 ns    151463294 abs_delta=105.739u bytes=33.2359G/s pairs=10.819M/s relative_error=405.868u
cos_i8_haswell_1536b/min_time:10.000/threads:1       92.4 ns         92.4 ns    151470519 abs_delta=0 bytes=33.2392G/s pairs=10.82M/s relative_error=0
cos_i8_ice_1536d/min_time:10.000/threads:1           48.1 ns         48.1 ns    292087642 abs_delta=0 bytes=63.8408G/s pairs=20.7815M/s relative_error=0
cos_i8_ice_1536b/min_time:10.000/threads:1           48.2 ns         48.2 ns    291716009 abs_delta=0 bytes=63.7662G/s pairs=20.7572M/s relative_error=0
cos_i8_serial_1536d/min_time:10.000/threads:1         299 ns          299 ns     46784120 abs_delta=0 bytes=10.2647G/s pairs=3.34139M/s relative_error=0
cos_i8_serial_1536b/min_time:10.000/threads:1         299 ns          299 ns     46781350 abs_delta=0 bytes=10.2654G/s pairs=3.3416M/s relative_error=0

SimSIMD - v4.3.1

Published by ashvardanian 7 months ago

4.3.1 (2024-04-10)

Make

Lighter Rust package (c3bc06c)

SimSIMD - v4.3.0

Published by ashvardanian 7 months ago

4.3.0 (2024-04-08)

Add

toBinary for JavaScript (1f1fd3a)

Improve

Procedural Rust benchmarks (e01ec6c)
Unrolled Rust benchmarks (#108) (508e7a0), closes #108

SimSIMD - v4.2.2

Published by ashvardanian 7 months ago

4.2.2 (2024-03-31)

Fix

f16 casting on Arm (dda096b)
Dot-products compilation on SVE (f5fe36d)

Make

Test with Bun and Deno (3227b87)

SimSIMD - v4.2.1

Published by ashvardanian 7 months ago

4.2.1 (2024-03-26)

Fix

Function attributes (eefb54b)

Make

Avoid native f16 by default (bd02af2)
Shorter Rust compilation (db9a9d2)

SimSIMD - v4.2.0

Published by ashvardanian 7 months ago

4.2.0 (2024-03-25)

Add

Compile-time dispatch (b72bf56)

Fix

Rust benchmarks (5279de7)
Setting signaling NaN (4e6e686)

Make

Build with SIMD in Rust (391545e)
Sync VERSION for NPM (7ea0ea6)

SimSIMD - v4.1.1

Published by ashvardanian 7 months ago

4.1.1 (2024-03-24)

Fix

Detecting Ice Lake (098cd90)

Make

Unused function attributes (c85e098)

SimSIMD - SIMD on Windows 🤗 🪟

Published by ashvardanian 7 months ago

This release refactors compiler attributes and intrinsics usage to make it compatible with MSVC. Most noticeably, function defined like this:

__attribute__((target("+simd")))
inline static void simsimd_cos_f32_neon(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n, simsimd_distance_t* result) { }

Now look like this:

#pragma GCC push_options
#pragma GCC target("+simd")
#pragma clang attribute push(__attribute__((target("+simd"))), apply_to = function)

inline static void simsimd_cos_f32_neon(simsimd_f32_t const* a, simsimd_f32_t const* b, simsimd_size_t n, simsimd_distance_t* result) { }

#pragma clang attribute pop
#pragma GCC pop_options

Thanks to that SimSIMD on Windows is gonna be just as fast as on Linux and MacOS 🤗 🪟

4.1.0 vs 4.0.0 Change Log (2024-03-23)

Add

Bench against TF, PyTorch, JAX (5e64152)
Complex dot-products for Rust (936fe73)
Double-precision interfaces for JS (07f2aca)

Docs

List all APIs (0a987a3)

Fix

Avoid vld2_f16 on MSVC (ce9800e)
Complex dispatch in Rust & C (d349bcd)
Missing _mm_rsqrt14_ps in MSVC (21e30fe)
Missing float16_t in MSVC Arm64 builds (94442c3)

Improve

Pragmas for MSVC compatibility (9d8a8d0), closes #74
Silence TF warnings (0cffc9c)
Type-casting (353fe43)

Make

120 line width (35c1de3)
Enable SIMD in MSVC builds (27f24ef)
Missing Node & TS dependencies (9983167)

SimSIMD - v4.0.0

Published by ashvardanian 7 months ago

This is a packed redesign! Let's start with what's cool about it and later cover the mechanics.

Extends dot products covering the entire matrix:
- all IEEE 754 floating-point formats (f16, f32, f64)
- real, complex, complex-conjugate dot-products
- Arm NEON & SVE, x86 Haswell, Skylake, Ice Lake, Sapphire Rapids
Add support for complex32 Python type ... that:
- NumPy doesn't support,
- but CuPy wants.

SimSIMD is now the fastest and most popular library for computing half-precision products/similarities for Fourier Series and other complex data 🥳

What breaks:

Return types are now 64-bit floats, up from 32.
Inner products are now defined as AB, instead of 1 - AB for broader applicability.

SimSIMD - v3.9.0

Published by ashvardanian 8 months ago

3.9.0 (2024-03-04)

Add

Complex numbers support (0a0665a)
Hamming & Jaccard for pre-AVX512 CPUs (4f1eba1), closes #69
Rust binary distances (960af05), closes #84

Fix

datatype variable repeated (8558c4a)
VNNI casting on AVX-512 (c4398d1), closes #91

Improve

Python type inference (227de70)

Make

Bump ip from 2.0.0 to 2.0.1 (#92) (559a16d), closes #92

SimSIMD - v3.8.1

Published by ashvardanian 8 months ago

3.8.1 (2024-02-22)

Docs

Reorg contribution guide (e8be593)

Fix

Detect tests running in QEMU (80b9fec)
Numerical issues (cdd7516)
NumPy and SciPy issues with PyPy (9bd4ada)

Improve

Accurate tail math on Aarch64 (5b39a8e)
Accurate tail math with AVX2 (f61d5be)
Drop NumPy dependency (819a406)
Rust bench performance and style (#85) (da62146), closes #85
Test and ship w/out NumPy and SciPy (5b800d3)

Make

cibuildwheel for manylinux2014 (#87) (43cdc9a), closes #87 #86 #72
Clean-up CI (af895ac)
Extend PyPi builds to all archs (5b3e3c9)
Faster CI (568f806)
Patching toolchains (edb064d)
Upgrade NodeJS (3e3ed04)
Wheels w/out f16 requirements (5e58966)

SimSIMD - v3.8.0

Published by ashvardanian 8 months ago

3.8.0 (2024-02-13)

Add

enable/disable_capability in Python (c7e90f9), closes #68

Improve

Reuse caps in Rust and C standard compliance (aa164f6)

SimSIMD - v3.7.7

Published by ashvardanian 8 months ago

3.7.7 (2024-02-11)

Make

crates.io allows only 5 keywords/categories (25ea3c8)

SimSIMD - v3.7.6

Published by ashvardanian 8 months ago

3.7.6 (2024-02-11)

Make

Don't download Google Benchmark by default (0d14000)
Extend Rust package metadata (6f85161)
Remove redundant file (352cfe8)

SimSIMD - v3.7.5

Published by ashvardanian 9 months ago

3.7.5 (2024-01-30)

Fix

Forward SIMSIMD_NATIVE_F16 (#78) (aff3bcf), closes #78 /github.com/unum-cloud/usearch/blob/ce54b814a8a10f4c0c32fee7aad9451231b63f75/include/usearch/index_plugins.hpp#L50

SimSIMD - v3.7.4

Published by ashvardanian 9 months ago

3.7.4 (2024-01-30)

Fix

Rust cosine function (#77) (cdd282d), closes #77

Package Rankings

Top 8.07% on Proxy.golang.org

Top 25.76% on Npmjs.org

Top 10.54% on Pypi.org

Related Projects

spector

Sparse vectors.

05 Jul 2018 6

sci-rs

Rust Scientific Analysis similar to SciPy

09 Aug 2022 23

sparse_dot_topn

Python package to accelerate the sparse matrix multiplication and top-n similarity selection

01 Aug 2017 392

PyExtJS

Python Extension Packages in Javascript and NodeJS

22 Jul 2016 51

pyrsb

PyRSB: a Python interface to the librsb Sparse Matrix library

18 Jun 2017 19

ip-nonlinear-solver

A trust-region interior-point method for general nonlinear programing problems (GSoC 2017).

28 May 2017 35

PROPACK

Software for large and sparse SVD calculations

19 Jun 2021 16

pycasino

Python realisation of Casino (QMC) program

22 Jun 2023 2

scipy_dae

Python implementation of solvers for differential algebraic equation's (DAE's) that should be add...

20 Feb 2024 16

types-scipy-sparse

Type hints for scipy sparse module

04 Aug 2024 0

jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, an...

25 Oct 2018 29,958

sparse_dot

Python wrapper for Intel Math Kernel Library (MKL) matrix multiplication

08 Dec 2019 73

numpsy

Integration of NumPy, SymPy, SciPy, and Pandas into a single extendable mathematical framework.

06 Jan 2021 2

YouML

YouML: A Machine Learning Toolkit

11 Dec 2021 13

gcc_termux

Gcc for termux with fortran scipy etc... Use apt for newest updates instructions in README.txt

10 Oct 2016 461

SimSIMD

Handling bf16 & Dynamic Dispatch on Arm

Researching MMLA Extensions & Future Work

Other Improvements

New bf16 kernels

Faster i8 kernels

4.3.1 (2024-04-10)

Make

4.3.0 (2024-04-08)

Add

Improve

4.2.2 (2024-03-31)

Fix

Make

4.2.1 (2024-03-26)

Fix

Make

4.2.0 (2024-03-25)

Add

Fix

Make

4.1.1 (2024-03-24)

Fix

Make

4.1.0 vs 4.0.0 Change Log (2024-03-23)

Add

Docs

Fix

Improve

Make

3.9.0 (2024-03-04)

Add

Fix

Improve

Make

3.8.1 (2024-02-22)

Docs

Fix

Improve

Make

3.8.0 (2024-02-13)

Add

Improve

3.7.7 (2024-02-11)

Make

3.7.6 (2024-02-11)

Make

3.7.5 (2024-01-30)

Fix

3.7.4 (2024-01-30)

Fix

Related Projects

spector

sci-rs

sparse_dot_topn

PyExtJS

pyrsb

ip-nonlinear-solver

PROPACK

pycasino

scipy_dae

types-scipy-sparse

jax

sparse_dot

numpsy

YouML

gcc_termux

Handling `bf16` & Dynamic Dispatch on Arm

New `bf16` kernels

Faster `i8` kernels