General matrix multiplication of f32 and f64 matrices in Rust. Supports matrices with general strides.
APACHE-2.0 License
General matrix multiplication for f32, f64, and complex matrices. Operates on matrices with general layout (they can use arbitrary row and column stride).
Please read the `API documentation here`__

__ https://docs.rs/matrixmultiply/
We presently provide a few good microkernels, portable and for x86-64 and AArch64 NEON, and only one operation: the general matrix-matrix multiplication (“gemm”).
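To make "general layout" concrete, here is a minimal reference sketch of the operation C ← αAB + βC on matrices addressed through a row stride and a column stride. It is a naive triple loop, not the crate's implementation; the name ``gemm_ref`` is hypothetical, and safe slices with ``usize`` strides are an assumption made for readability. See the API documentation for the real signatures.

```rust
/// Naive reference for C <- alpha * A * B + beta * C on strided matrices.
/// A is m x k, B is k x n, C is m x n. Element (i, j) of a matrix lives at
/// index i * row_stride + j * col_stride in its slice.
/// (Hypothetical helper for illustration, not part of the crate.)
fn gemm_ref(
    m: usize, k: usize, n: usize,
    alpha: f32,
    a: &[f32], rsa: usize, csa: usize,
    b: &[f32], rsb: usize, csb: usize,
    beta: f32,
    c: &mut [f32], rsc: usize, csc: usize,
) {
    for i in 0..m {
        for j in 0..n {
            // Dot product of row i of A with column j of B.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * rsa + p * csa] * b[p * rsb + j * csb];
            }
            let idx = i * rsc + j * csc;
            c[idx] = alpha * acc + beta * c[idx];
        }
    }
}

fn main() {
    // A is 2 x 3 in row-major order: row stride 3, column stride 1.
    let a: [f32; 6] = [1.0, 2.0, 3.0,
                       4.0, 5.0, 6.0];
    // B is 3 x 2 in column-major order: row stride 1, column stride 3.
    let b: [f32; 6] = [1.0, 0.0, 1.0,  // column 0
                       0.0, 1.0, 1.0]; // column 1
    // C is 2 x 2 in row-major order.
    let mut c = [0.0f32; 4];
    gemm_ref(2, 3, 2, 1.0, &a, 3, 1, &b, 1, 3, 0.0, &mut c, 2, 1);
    println!("{:?}", c);
}
```

The same loop handles row-major, column-major, and mixed layouts because only the strides change, which is the property the crate's gemm functions expose.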
This crate was inspired by the macro/microkernel approach to matrix multiplication that is used by the BLIS_ project.
.. _BLIS: https://github.com/flame/blis
|crates|_
.. |crates| image:: https://img.shields.io/crates/v/matrixmultiply.svg
.. _crates: https://crates.io/crates/matrixmultiply
``cargo bench`` is useful for special cases and small matrices.

``examples/benchmarks.rs`` supports custom sizes.

``benches/benchloop.py`` runs benchmarks over parameter ranges.

`gemm: a rabbit hole`__

__ https://bluss.github.io/rust/2016/03/28/a-gemmed-rabbit-hole/
0.3.9
0.3.8
0.3.7
0.3.6
0.3.5
Significant improvements to complex matrix packing and kernels (#75)
Use a specialized AVX2 matrix packing function for sgemm, dgemm when this feature is detected on x86-64
0.3.4
Sgemm, dgemm microkernel implementations for AArch64 NEON (ARM)
matrixmultiply now uses autocfg to detect the Rust version and enables these kernels when AArch64 intrinsics are available (Rust 1.61 and later).
Small change to the matrix packing functions so that, in some cases, they optimize better thanks to improved pointer alias information.
0.3.3
Attempt to fix macOS bug #55 again (manifesting as a debug assertion, only in debug builds).
Updated comments for x86 kernels by @Tastaturtaste
Updates to MIRI/CI by @jturner314
Silenced Send/Sync future compatibility warnings for a raw pointer wrapper
0.3.2
Add optional feature ``cgemm`` for complex matmult functions ``cgemm`` and ``zgemm``.
Add optional feature ``constconf`` for compile-time configuration of matrix kernel parameters for chunking. Improved scripts for benchmarking over ranges of different settings. With thanks to @DutchGhost for the const-time parsing functions.
Improved benchmarking and testing.
Threading is now slightly more eager to use threads (depending on matrix element count).
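As an illustration of what "const-time parsing" for compile-time configuration can look like, here is a minimal sketch of a ``const fn`` that turns a decimal string into a ``usize``. The names ``parse_usize`` and ``KERNEL_BLOCK`` are hypothetical and the input string is hard-coded; this is not the crate's code, and how ``constconf`` actually receives its settings is not shown here.

```rust
/// Parse a non-empty ASCII decimal string into a usize.
/// In a const context, a failed assertion becomes a compile error.
/// (Hypothetical helper for illustration.)
const fn parse_usize(s: &str) -> usize {
    let bytes = s.as_bytes();
    assert!(!bytes.is_empty(), "expected at least one digit");
    let mut value = 0usize;
    let mut i = 0;
    // `for` loops are not allowed in const fn, so use `while`.
    while i < bytes.len() {
        let b = bytes[i];
        assert!(b.is_ascii_digit(), "expected an ASCII digit");
        value = value * 10 + (b - b'0') as usize;
        i += 1;
    }
    value
}

// Hypothetical tuning parameter, fixed at compile time.
const KERNEL_BLOCK: usize = parse_usize("64");

fn main() {
    // Using the constant as an array length shows it is a compile-time value.
    let buf = [0u8; KERNEL_BLOCK];
    println!("{}", buf.len());
}
```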
0.3.1
Attempt to fix bug #55 where the mask buffer in TLS did not seem to get its requested alignment on macOS. The mask buffer pointer is now aligned manually (again, like it was in 0.2.x).
Fix a minor issue where we were passing a buffer pointer as ``&T`` when it should have been ``&[T]``.
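Aligning a buffer pointer manually usually means over-allocating and skipping ahead to the next aligned address. The following is a minimal sketch of that general technique using safe slices; the function name ``aligned_window`` and the sizes are assumptions for illustration, and it does not reproduce the crate's mask buffer code.

```rust
/// Return the first sub-slice of `buf` whose start address is a multiple of
/// `align` and which holds `len` bytes, or None if it does not fit.
/// `align` must be a power of two. (Hypothetical helper for illustration.)
fn aligned_window(buf: &mut [u8], align: usize, len: usize) -> Option<&mut [u8]> {
    assert!(align.is_power_of_two());
    let addr = buf.as_ptr() as usize;
    // Bytes to skip to reach the next multiple of `align` (0 if already aligned).
    let offset = (align - addr % align) % align;
    buf.get_mut(offset..offset.checked_add(len)?)
}

fn main() {
    // Over-allocate by `align - 1` bytes so an aligned window always fits.
    let (align, len) = (32, 64);
    let mut storage = vec![0u8; len + align - 1];
    let window = aligned_window(&mut storage, align, len).expect("buffer too small");
    assert_eq!(window.as_ptr() as usize % align, 0);
    println!("aligned window of {} bytes", window.len());
}
```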
0.3.0
Implement initial support for threading using a bespoke thread pool with little contention. To use, enable feature ``threading`` (and configure number of threads with the variable ``MATMUL_NUM_THREADS``).
Initial support is for up to 4 threads - will be updated with more experience in coming versions.
Added a better benchmarking program for arbitrary size and layout, see ``examples/benchmark.rs`` for this; it supports csv output for better recording of measurements.
Minimum supported Rust version is 1.41.1 and the version update policy has been updated.
Updated to Rust 2018 edition
Moved CI to GitHub Actions (so long Travis and thanks for all the fish).
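For readers wondering how a variable such as ``MATMUL_NUM_THREADS`` might be turned into a thread count, here is a minimal sketch. The limit of 4 comes from the 0.3.0 note above; the function name ``thread_count``, the clamping, and the fallback to one thread for missing or invalid values are assumptions, not the crate's documented behaviour.

```rust
use std::env;

/// Upper bound taken from the 0.3.0 release note ("up to 4 threads").
const MAX_THREADS: usize = 4;

/// Turn the raw variable value into a thread count in 1..=MAX_THREADS.
/// Missing or unparsable values fall back to a single thread (an assumption).
fn thread_count(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .map(|n| n.clamp(1, MAX_THREADS))
        .unwrap_or(1)
}

fn main() {
    let raw = env::var("MATMUL_NUM_THREADS").ok();
    println!("using {} thread(s)", thread_count(raw.as_deref()));
}
```

Running with ``MATMUL_NUM_THREADS=2`` set in the environment would make this sketch report two threads.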
0.2.4
Support no-std mode by @vadixidav and @jturner314.
New (default) feature flag "std"; use default-features = false to disable and use no-std. Note that runtime CPU feature detection requires std.
Fix tests so that they build correctly on non-x86 platforms (#49), and manage the release by @bluss
0.2.3
``-Ctarget-cpu=native`` use (not recommended)
0.2.2
New dgemm avx and fma kernels implemented by R. Janis Goldschmidt (@SuperFluffy), with fast cases for both row and column major output.
Benchmark improvements: Using fma instructions reduces execution time on dgemm benchmarks by 25-35% compared with the avx kernel, see issue `#35`_
Using the avx dgemm kernel reduces execution time on dgemm benchmarks by 5-7% compared with the previous version's autovectorized kernel.
New fma adaptation of the sgemm avx kernel by R. Janis Goldschmidt (@SuperFluffy).
Benchmark improvement: Using fma instructions reduces execution time on sgemm benchmarks by 10-15% compared with the avx kernel, see issue `#35`_
More flexible kernel selection allows kernels to individually set all their parameters, ensures the fallback (plain Rust) kernels can be tuned for performance as well, and moves feature detection out of the gemm loop.
Benchmark improvement: Reduces execution time on various benchmarks by 1-2% in the avx kernels, see `#37`_.
Improved testing to cover input/output strides of more diversity.
.. _#35: https://github.com/bluss/matrixmultiply/issues/35
.. _#37: https://github.com/bluss/matrixmultiply/issues/37
0.2.1
Improve matrix packing by taking better advantage of contiguous inputs.
Benchmark improvement: execution time for 64×64 problem where inputs are either both row major or both column major changed by -5% for sgemm and -1% for dgemm. (#26)
In the sgemm avx kernel, handle column major output arrays just like it does row major arrays.
Benchmark improvement: execution time for 32×32 problem where output is column major changed by -11%. (#27)
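The idea of taking advantage of contiguous inputs during packing can be sketched as follows: a panel of ``mr`` rows is copied into a buffer so that the entries of each column sit next to each other, and when the input already has unit row stride each column is copied as one slice instead of element by element. The function name ``pack_panel`` and its signature are hypothetical; this is a simplified illustration, not the crate's packing routine.

```rust
/// Copy an `mr` x `k` panel of a strided matrix into `out` so that the `mr`
/// entries of each column are adjacent.
/// Element (i, p) of the input lives at index i * rs + p * cs.
/// (Hypothetical helper for illustration.)
fn pack_panel(a: &[f32], rs: usize, cs: usize, mr: usize, k: usize, out: &mut Vec<f32>) {
    out.clear();
    if rs == 1 {
        // Contiguous columns: copy each column of the panel as one slice.
        for p in 0..k {
            out.extend_from_slice(&a[p * cs..p * cs + mr]);
        }
    } else {
        // General strides: gather element by element.
        for p in 0..k {
            for i in 0..mr {
                out.push(a[i * rs + p * cs]);
            }
        }
    }
}

fn main() {
    // The same 2 x 3 matrix stored row-major and column-major.
    let row_major: [f32; 6] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let col_major: [f32; 6] = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0];
    let (mut p1, mut p2) = (Vec::new(), Vec::new());
    pack_panel(&row_major, 3, 1, 2, 3, &mut p1);
    pack_panel(&col_major, 1, 2, 2, 3, &mut p2);
    // Both layouts produce the same packed panel.
    assert_eq!(p1, p2);
    println!("{:?}", p1);
}
```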
0.2.0
Use runtime feature detection on x86 and x86-64 platforms, to enable AVX-specific microkernels at runtime if available on the currently executing configuration.
This means no special compiler flags are needed to enable native instruction performance!
Implement a specialized 8×8 sgemm (f32) AVX microkernel; this speeds up matrix multiplication by another 25%.
Use ``std::alloc`` for allocation of aligned packing buffers
We now require Rust 1.28 as the minimal version
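Runtime feature detection of the kind described for 0.2.0 can be sketched with the standard library's ``is_x86_feature_detected!`` macro. The function ``select_kernel`` and the returned names are placeholders for illustration; the crate's actual kernel selection is more involved and is not reproduced here.

```rust
/// Pick a kernel name at runtime based on what the executing CPU supports.
/// The returned names are placeholders, not the crate's identifiers.
fn select_kernel() -> &'static str {
    // The detection macro only exists on x86 and x86-64 targets.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx") {
            return "avx";
        }
    }
    // Portable fallback for other CPUs and other architectures.
    "fallback"
}

fn main() {
    println!("selected kernel: {}", select_kernel());
}
```

Because the check happens while the program runs, the same binary can use AVX where it exists and still work on CPUs without it.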
0.1.15
0.1.14
0.1.13
``rawpointer``, a µcrate with raw pointer methods taken from this
0.1.12
0.1.11
0.1.10
0.1.9
0.1.8
0.1.7
0.1.6
0.1.5
0.1.4
0.1.3
0.1.2
0.1.1