oneDNN

oneAPI Deep Neural Network Library (oneDNN)


oneDNN - graph-v0.2

Published by vpirogov about 3 years ago

This is a technical preview for oneDNN Graph API based on oneDNN v2.3.2.

oneDNN Graph API extends oneDNN with a unified, high-level graph API for multiple AI hardware classes (CPU, GPU, accelerators). The graph interface integrates with deep learning frameworks and inference engines to maximize opportunities for performance optimization across a variety of hardware targets. This preview has full support for the oneAPI Graph programming model and partial support for the operations in oneDNN Graph API specification v0.7.

Supported Functionality

  • C++ and DPC++ API.
  • Graph partition and compilation API.
  • Operations and fusions targeting fp32 inference for CNNs, MLPs, and transformer neural networks.
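
The typical flow through the graph API is: describe each operation with logical tensors, add the operations to a graph, ask the library for partitions, and compile each supported partition for a target engine. The sketch below illustrates that flow against the preview C++ API; the constructor signatures (engine, op, logical_tensor) follow the v0.2 preview headers and may differ in later revisions.

    #include "oneapi/dnnl/dnnl_graph.hpp"

    #include <vector>

    using namespace dnnl::graph;

    int main() {
        // Engine used for compilation; the graph is bound to an engine kind.
        engine eng(engine::kind::cpu, /*device_id=*/0);
        graph g(engine::kind::cpu);

        // Logical tensors carry an id, data type, shape, and layout kind.
        logical_tensor src {0, logical_tensor::data_type::f32,
                {1, 64, 56, 56}, logical_tensor::layout_type::strided};
        logical_tensor dst {1, logical_tensor::data_type::f32,
                {1, 64, 56, 56}, logical_tensor::layout_type::strided};

        // A single ReLU op; real graphs add many ops before partitioning.
        op relu {2, op::kind::ReLU, {src}, {dst}, "relu0"};
        g.add_op(relu);

        // The library groups ops into partitions it can fuse and compile.
        std::vector<partition> parts = g.get_partitions();
        for (auto &p : parts) {
            if (!p.is_supported()) continue;
            // Compilation needs fully specified shapes (see Known Issues).
            compiled_partition cp = p.compile({src}, {dst}, eng);
            // cp.execute(...) then runs the fused kernel on real tensors.
        }
        return 0;
    }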

Performance Optimizations

The backend implementation relies on oneDNN and includes performance optimizations for Intel Architecture processors with Intel SSE4.1, Intel AVX, Intel AVX2, or Intel AVX-512 instruction set support.

Validation

  • A gtest suite is available for basic functional testing.
  • Comprehensive functional and performance validation is covered by the extended version of benchdnn.

Known Issues and Limitations

  • Some subgraphs might not be recognized as a partition even if they match the general pattern description, due to internal implementation details.
  • The weight’s opaque layout can be queried only from a compiled partition, which requires tensor shapes to be known at compilation time.
  • Binary operations with scalar and tensor inputs are not optimized.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Pinzhen Xu, Chunyuan Wu, Jianping Chen, Scott Cyphers, Nishant Patel, Yiqiang Li, Yang Sheng, Kiefer Kuah, Adam Straw, Tim Zerrell, Namrata Choudhury and others.

oneDNN - v2.3.2

Published by tprimak about 3 years ago

This is a patch release containing the following changes to v2.3.1:

  • Fixed performance regression in fp32 inner product primitive for processors with Intel AVX-512 support (3e379b8c51a2fc2e72be6c49c9e6855f003af9e6)
  • Removed assert related to Winograd convolution algorithm dispatching on GEN9 GPUs (2b4f73adf89a3804dd5018014596ad2354309d40)
oneDNN - v2.3.1

Published by vpirogov about 3 years ago

This is a patch release containing the following changes to v2.3:

  • Improved int8 GEMM performance for processors with Intel AVX2 and Intel DL Boost support (f5c071bc371c26cac30bb68cda3ab1224ed697c1)
  • Fixed integer overflow for inner product implementation on CPUs (66971b57889d1246c643d736e50195c1bcd46a60)
  • Fixed out of bounds access in GEMM implementation for Intel SSE 4.1 (4e81df0a26e520c161527d52ce63d55734e9dabb)
  • Fixed correctness issue for depthwise convolution post-op with non-default scales on CPUs (783e1d6f035d20915cc1c8722d1b512888111beb, 066c832f7a2f6892a79c3f1b5a04b1a5f236e874)
  • Fixed crash for s8 binary primitive on Windows (d9fd397e2f130dddffbd2ced37edb300a2ba7649)
  • Fixed performance regression in fp32 to u8 reorder for Intel AMX specific memory formats (97f40cf0efef17361e948423a0b4fc2db04a903c, 532648adff4fe8590838f1f90409463b9237e358)
  • Fixed correctness issue for bfloat16 convolution weight gradient on processors with Intel AMX support (053406d0fd5a91f3e64adb81828be1632b74f9a5, 6649b759a5e801ad095c3c44d74c1dc27ab82617)
  • Fixed correctness issue for bfloat16 inner product backpropagation on processors with Intel AMX support (a2e6c55261bb3c353a295b7e2e57d403e5d73696)
  • Fixed correctness issue for bfloat16 convolution with padded memory formats on GEN9 GPUs (c0aea07a7e5b21829e4d484e232b9eccf49128d4)
  • Fixed correctness issue for int8 matmul primitive with zero points on processors with Intel AMX support (55cb716084cc625bc97e5f90b4f82bb2fcd72962)
  • Fixed segfault in depthwise convolution post-op on CPUs (ad466354b3108c4cacb1b85a6f93f8bdfe9d4e59)
oneDNN - v2.3

Published by vpirogov over 3 years ago

Performance Optimizations

  • Extended primitive cache to improve primitive descriptor creation performance.
  • Improved primitive cache performance in multithreaded configurations.
  • Intel Architecture Processors
    • Introduced initial optimizations for bfloat16 compute functionality for future Intel Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via the CPU dispatcher control (see the sketch after this list).
    • Improved performance of binary primitive and binary post-op for cases with broadcast and mixed source and destination formats.
    • Improved performance of reduction primitive.
    • Improved performance of depthwise convolution primitive with NHWC activations for training cases.
  • Intel Graphics Products
    • Improved fp32 and fp16 Winograd convolution performance.
    • Introduced support for automatic selection between direct and Winograd convolution algorithms.
    • Improved int8 depthwise convolution performance.
    • Improved performance of reorder, shuffle, concat, binary, and batch normalization primitives.
    • Improved layer normalization performance for blocked formats.
  • AArch64-based Processors
    • Improved reorder primitive performance for systems with SVE 128 and SVE 256 support.
    • Improved eltwise primitive performance for systems with SVE 512 support.
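
The CPU dispatcher control mentioned above is exposed as the dnnl::set_max_cpu_isa API and the DNNL_MAX_CPU_ISA environment variable. Below is a minimal sketch of the API route, assuming avx512_core_amx is the ISA token gating the Sapphire Rapids bfloat16 paths in this release:

    #include "oneapi/dnnl/dnnl.hpp"

    int main() {
        // Raise the ISA ceiling before any primitive is created; the call
        // has no effect on primitives whose kernels were already chosen.
        dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx512_core_amx);
        // ... create engine and primitives as usual ...
        return 0;
    }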

Functionality

Usability

  • Introduced binary distribution in conda-forge. Supported configurations cover Linux, Windows, and macOS operating systems and Intel64/AMD64, AArch64, and PPC64 architectures.
  • Introduced support for GPU-only builds. This configuration helps reduce the binary footprint of applications targeting GPUs.
  • Introduced an option to use GNU OpenMP as the CPU runtime for the DPC++ configuration.
  • Introduced verbose log converter. This tool processes oneDNN verbose logs and generates test cases for benchdnn.

Breaking Changes

  • Updated the minimal supported CMake version to 2.8.12 (was 2.8.11).
  • Updated the minimal supported ACL version to 21.05 (was 21.02).

Thanks to the Contributors

This release contains contributions from the project core team as well as Alexandre Truong @aletru01, Arthur Mitrano @aaraujom, fitchbe @fitchbe, Isuru Fernando @isuruf, Joe Ramsay @joeramsay, Kentaro Kawakami @kawakami-k, leizheng1 @leizheng1, Nomoto Kazuhiro @NomotoKazuhiro, Peter Caday @petercad, Pablo Romero @pablocum, Takumi-H @Takumi-Honda, Uwe L. Korn @xhochy, Vasily Rubtsov @vasilyru. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v2.3-rc2

Published by vpirogov over 3 years ago

This is a release candidate for oneDNN v2.3. Please provide feedback and submit defect reports via GitHub issues.

oneDNN - v2.2.4

Published by vpirogov over 3 years ago

This is a patch release containing the following changes to v2.2.3:

  • Fixed build error with GCC 11 (eda1add9567b2491a5e4892a0f8ba7aa1c0016cd)
  • Fixed an issue where reorder reported unimplemented when quantizing f32 weights to s8 (4f05b76bb765ed8a892be3325730992763025f0b, 5d3d1e18747f210a121cf00d909024ff7b5d8b16, cc77eef809d0331b245eb21a7956d507505700aa)
  • Updated name for GPU gen12 architecture to xe (3d202c205473daec426a6de3a32e074db372c09d)
oneDNN - v2.3-rc

Published by vpirogov over 3 years ago

This is a release candidate for oneDNN v2.3. Please provide feedback and submit defect reports via GitHub issues.

oneDNN - v2.2.3

Published by vpirogov over 3 years ago

This is a patch release containing the following changes to v2.2.2:

  • Fixed a bug in int8 depthwise convolution primitive with groups and 1d spatial size for processors with Intel AVX-512 and Intel AVX2 support (8a784c60fa3d074bd719ff7a8aecfe8ff7ff8966, f0e4af96163e5fa41320d24cc6952980b843ca7b)
  • Fixed correctness issue for PReLU primitive on Intel Processor Graphics (f3c3daf8a67477fcf3dceb826ea9e84c641ed67d)
  • Fixed correctness issue in reorder for blocked layouts with zero padding (68f05d00ae7743f16b41decd9da27599fdb191ec, d51616bc7ebee49f501086ace373d20833cea6fa, fd2c6421f1eff12822ba8808e0f979c60e21b2cd)
  • Improved performance of weights reorders used by BRGEMM-based convolution primitive for processors with Intel AVX-512 support (23b2ec0d6f73aba06c722c54eeb6d6ac0082242b, 10f81875774d0cdf8b293146bc0277daa330a48a, 4c0819c432cfad488c897cf1deefe0e89cb11749)
  • Added -fp-model=precise build flag for DPC++ code (3e40e5e92ebcf40a9115827ce568d32c5049f74a)
  • Fixed potential memory leak in matmul primitive (36dba73d0f584d30ce714415a59f42db735f4494)
  • Fixed performance of matmul primitive when fused with bias update and sum (f993b25dbe71010fc63ef0a5591ce6d85c9e47c3)
  • Fixed a bug in matmul primitive when writing to non-contiguous destination buffer (36d25d4308a0bc5906df44f6ef6afc2074699500)
oneDNN - v2.2.2

Published by tprimak over 3 years ago

This is a patch release containing the following changes to v2.2.1:

  • Fixed performance regression in fp32 forward inner product for shapes with number of output channels equal to 1 for processors with Intel AVX-512 support (714b1fd7f9ee51cc4b8f8a09ac9a0fc9be8403c9)
  • Fixed performance regression in forward convolutions with groups for processors with Intel AVX-512 support (3555d4a76e63f07fd36fdeea3947e0267bfcb814)
  • Removed -std=c++11 build flag for DPC++ headers (1fcb867e37ef48c82ee2c720a0405ad4e6299300)
  • Fixed buffer access in initializing workspace in RNN implementation on GPU (9b0309142937001f7140f80c451a294d31464626)
  • Fixed a bug in convolution with 1x1 kernel and mixed strides on processors with Intel AVX-512 support (d0b3e3fe0b15d9d8c05d21b97df303cdfb101076)
  • Used getauxval on Linux to detect CPU features on AArch64 systems (25c4ceaca3472dbd340dc942718a4e4b22c8a77c)
  • Added -fp-model=precise build flag for DPC++ code (3e40e5e92ebcf40a9115827ce568d32c5049f74a)
  • Fixed out-of-bounds writes in elementwise primitive on Intel Processor Graphics (bcf823c48574e163f34abbd4226d7a7af52bf374)
oneDNN - v2.2.1

Published by vpirogov over 3 years ago

This is a patch release containing the following changes to v2.2:

  • Fixed segfault for cases when primitive descriptor or attributes contain NaN values (e6d05ecf20a110f83bf037be99c6c5110bf4d981, dbca1e9370c49fa4fe0fa0b4a42a4fa86b6e64a6, 0326b096eff60a2813265dce1bcb31c12177023d)
  • Fixed engine creation failure for GPU subdevices (4c3a11438405ca191b1efc24b057286fc236c2d2)
  • Fixed long lines clipping in verbose output (70d70a8d064ad802344d90f6395760ef9bd720e2)
  • Fixed segfault in bfloat16 convolution weight gradient implementation on processors with Intel AMX support (a3a73a370797bc4b28a6868d533a6fbed0dad0df)
  • Fixed performance regression in binary primitive with per_oc broadcast strategy (9ac85d8508658adf0b141844f2355448aa5a3a2a)
  • Worked around a bug with Microsoft Visual C++ compiler version detection in CMake 3.19 (2f39155b256367e2b37ce782a222144a0b294cdc)
  • Removed -std=c++11 build flag for DPC++ code to align with SYCL standard (1b026f5e303649d9c0f98168a922e6f085001d3c)
oneDNN - v2.1.3

Published by vpirogov over 3 years ago

This is a patch release containing the following changes to v2.1.2:

  • Updated xbyak_aarch64 to support Apple silicon (dd1a02ab2a962bbeadfc0d2e53fedf39ed2b7b7e, 913010b253eccd4654c29f78c81227f7342e3262, 2d155dd22c59f4a059e9a7903c503d2221542811)
  • Fixed segfault in fp32 depthwise convolution with padded memory (2d8283f575d0a0a43a8a967f659f95e2fd8dd866)
  • Fixed potential issues in BRGEMM-based convolution implementation (b183dffa0fefa2c342070daae95c00ff274e8310, d2b1653f28f35ea3dc93c10ba6b9b538e80ba08e)
  • Fixed memory leak on NVIDIA GPUs (06803f2c2834b67a357fdb24d03ea906b9ffdd3a)
oneDNN - v2.2

Published by vpirogov over 3 years ago

Performance Optimizations

  • Intel Architecture processors
    • Improved performance of int8 compute functionality for future Intel Xeon Scalable processor (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
    • Improved performance of compute functionality for future Intel Core processor with Intel AVX2 and Intel DL Boost instructions support (code name Alder Lake).
    • Improved fp32 inner product forward propagation performance for processors with Intel AVX-512 support.
    • Improved dnnl_gemm performance for cases with n=1 on all supported processors.
  • Intel Graphics products
    • Introduced NHWC format support for activations for int8 primitives.
  • AArch64-based processors
    • Improved performance of fp32 and int8 convolution, and softmax primitives for processors with SVE 512 support.
    • Improved performance of fp32 convolution via Arm Compute Library (ACL).
    • Improved performance of convolution with a combination of sum and relu post-ops via ACL.

Functionality

  • Extended eltwise primitive with support for mish and hardswish algorithms.
  • Extended binary primitive with support for comparison operators.
  • Introduced support for post-ops in GPU resampling implementation.
  • Introduced asymmetric quantization support for int8 deconvolution.
  • Introduced binary post-ops support for matmul primitive (see the sketch below).
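
For illustration, here is a minimal sketch of attaching a binary post-op to matmul via primitive attributes, using the v2.x C++ API; the shapes, format tags, and variable names below are arbitrary:

    #include "oneapi/dnnl/dnnl.hpp"

    using namespace dnnl;

    int main() {
        engine eng(engine::kind::cpu, 0);

        // f32 matmul: [16, 64] x [64, 32] -> [16, 32].
        memory::desc src_md({16, 64}, memory::data_type::f32, memory::format_tag::ab);
        memory::desc wei_md({64, 32}, memory::data_type::f32, memory::format_tag::ab);
        memory::desc dst_md({16, 32}, memory::data_type::f32, memory::format_tag::ab);

        // Binary post-op: element-wise add of a second [16, 32] tensor,
        // fused into the matmul instead of a separate binary primitive.
        memory::desc add_md({16, 32}, memory::data_type::f32, memory::format_tag::ab);
        post_ops po;
        po.append_binary(algorithm::binary_add, add_md);
        primitive_attr attr;
        attr.set_post_ops(po);

        matmul::desc md(src_md, wei_md, dst_md);
        matmul::primitive_desc pd(md, attr, eng);
        matmul mm(pd);
        // At execution time the extra tensor is passed as
        // DNNL_ARG_ATTR_MULTIPLE_POST_OP(0) | DNNL_ARG_SRC_1.
        return 0;
    }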

Usability

  • Improved presentation of oneDNN primitives in VTune Amplifier.
  • Introduced Linux perf support for AArch64.
  • Introduced support for Fujitsu C++ compiler.
  • Introduced a build time check for minimal supported ACL version. Currently oneDNN requires ACL 21.02 or later.
  • Added support for cuDNN 8.x.

Thanks to the contributors

This release contains contributions from the project core team as well as Aleksandr Nikolaev @alenik01, araki.kenichi @qnet-araki, Arthur Mitrano @aaraujom, Dr-Noob @Dr-Noob, Gmc2 @GHGmc2, higuchi.motoko @higuchi-motoko, Joe Ramsay @joeramsay, Kentaro Kawakami @kawakami-k, Louie Tsai @louie-tsai, masafumi yamazaki @m-ymzk, Nathan John Sircombe @nSircombe, Takumi-H @Takumi-Honda. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v2.1.2

Published by tprimak over 3 years ago

This is a patch release containing the following changes to v2.1.1:

  • Improved performance of forward convolution with plain activations for processors with Intel AVX-512 support (2147a58a6b075edcbb8b03fb158a73b7e706c324)
  • Enabled I-cache refreshing before executing JIT-ed code for AArch64 systems (9f3bc1c9279dde44383ef476ae49e813142b3cdc)
  • Returned blocked layouts as default for forward training (7af2898e65136ad2dd8cfc280027428e3ef2ec72, bd4826d8f098d196a9502d0c6d347f0956a243ad)
oneDNN - v2.2-rc

Published by vpirogov over 3 years ago

This is a release candidate for oneDNN v2.2. Please provide feedback and submit defect reports via GitHub issues.

oneDNN - v2.1.1

Published by vpirogov over 3 years ago

This is a patch release containing the following changes to v2.1:

  • Improved performance of fp32 depthwise convolution with plain activations on CPU (762a9c75a01476457d705c1e98f4d28f74b80e4d)
  • Worked around internal compiler error in GCC 7.3.1 when building with --std=c++14 (f637501d41e0d9a1515430a5530fca53fe656903)
  • Fixed memory leaks in batchnorm and gemm implementations (2ea5385402c2b3d6995b9e6bb8cb773339d9b7c2, 4f3a7cf1bc3009415a2cd065ffe2ed4ed45fda6c)
  • Addressed several issues in benchdnn and gtests (bb7bdb41e13ff47d7993e29827b3e60697c4809a, 0e04cc29a09eacc81d9e0dd705b55381b19166ea, d7df8d2240ea0c4d5ce74a209ccf652dd7094570, a59354fad484c46dd98956c406534d371d3fd08e)
oneDNN - v2.1

Published by anita-intel over 3 years ago

Performance optimizations

  • Reduced overheads associated with primitive cache.

  • Intel Processor Graphics and Xe architecture-based Graphics:

    • Improved performance of Winograd convolution.
    • Improved performance of operations on padded memory formats.
    • Improved performance of reorder and shuffle primitives for multiple formats and all dimensions.
    • Improved performance of pooling primitive for float16 data type.
    • Improved performance of lnorm primitive for plain formats.
    • Improved performance of resampling primitive for blocked formats.
  • Intel Architecture processors

    • Introduced initial optimizations for bfloat16 functionality for future Intel Xeon Scalable processor with Intel AMX support (code name Sapphire Rapids).
    • Improved performance of int8 and bfloat16 RNN and inner product primitives.
    • Improved performance of shuffle primitive for bfloat16 data type.
    • Introduced CPU ISA hints environment variable and API. The new API is intended to dispatch function implementations using YMM registers to improve performance on processors with a single Intel AVX-512 compute unit (see the sketch after this list).
    • Improved forward convolution performance for Intel AVX-512 systems.
    • Introduced initial performance optimizations for future Intel Core processor with Intel AVX2 and Intel DL Boost instructions support (code name Alder Lake).
    • Improved performance of int8 primitive for processors with Intel SSE4.1 instruction set support.
    • Improved convolution and batch normalization performance with threadpool.
  • AArch64-based processors

    • Improved performance of Winograd convolution with ArmCL.
    • Improved performance of int8 convolution with ArmCL.
    • Added JIT support for AArch64 and JIT implementations for reorder, eltwise, pooling, and batch normalization primitives.
  • NVIDIA GPUs

    • (preview) Introduced support for NVIDIA GPU. The implementation relies on DPC++ Compiler, cuDNN, and cuBLAS libraries.
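
The CPU ISA hints control mentioned above is available both as an API call and as the DNNL_CPU_ISA_HINTS=PREFER_YMM environment variable. A minimal sketch of the API route:

    #include "oneapi/dnnl/dnnl.hpp"

    int main() {
        // Hint the dispatcher to prefer YMM (256-bit) kernels, which can be
        // faster on processors with a single Intel AVX-512 compute unit.
        // Must be called before the first primitive is created.
        dnnl::set_cpu_isa_hints(dnnl::cpu_isa_hints::prefer_ymm);
        // ... create engine and primitives as usual ...
        return 0;
    }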

New Functionality

  • Introduced int8 support for LSTM primitive with projection for CPU.
  • Introduced binary post-op for (de)convolution, pooling, eltwise, binary, inner product, matmul and reduction (GPU only) along with performance optimizations for CPUs and GPUs.
  • Increased the number of supported post-ops for primitives to 20.
  • Extended eltwise primitive with support for logsigmoid and clip_v2 algorithms.
  • Introduced support for PRelu primitive.
  • Extended matmul implementation with support for per-output channel zero-points for quantization.
  • Extended support for broadcasting in binary primitive to both inputs for CPU.
  • Introduced float16 support in reduction primitive for GPU.
  • Introduced support for mixed input and output types in binary primitive for GPU.

Usability

  • Added API to enable displaying timestamps in oneDNN verbose mode. Timestamps make it possible to correlate oneDNN verbose output with data from profiling tools (see the sketch below).
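
A minimal sketch of enabling the timestamps, assuming the C API entry point is dnnl_set_verbose_timestamp and DNNL_VERBOSE_TIMESTAMP=1 is the matching environment variable:

    #include "oneapi/dnnl/dnnl.h"

    int main() {
        // Prefix each DNNL_VERBOSE line with a timestamp so the log can be
        // aligned with traces from profiling tools (assumed entry point for
        // this release; DNNL_VERBOSE_TIMESTAMP=1 is the no-code equivalent).
        dnnl_set_verbose_timestamp(1);
        return 0;
    }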

Validation

  • Extended benchdnn to report operation bandwidth.
  • Added ability to choose target GPU in benchdnn.

Thanks to the contributors

This release contains contributions from the project core team as well as Alejandro Alvarez, Aleksandr Nikolaev @alenik01, araki.kenichi @qnet-araki, Arthur Mitrano @aaraujom, Benjamin Fitch, Ben Tracy @CodeplayBen, Daniel Soutar @danielsoutar, @dylan-angus-codeplay, Diana Bite @diaena, higuchi.motoko @higuchi-motoko, Jacob Kahn @jacobkahn, Kentaro Kawakami @kawakami-k, Kumudha KN @KumudhaN, kurihara @Koji-Kurihara, Mehdi Goli @mehdi-goli, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Rafik Saliev @rfsaliev, Xinyu Chen @xinyu-intel, yuri@FreeBSD @yurivict. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v2.1-rc

Published by anita-intel over 3 years ago

This is a release candidate for oneDNN v2.1. Please provide feedback and report bugs via GitHub issues.

oneDNN - v1.8.1

Published by vpirogov almost 4 years ago

This is a patch release containing the following changes to v1.8:

  • Fixed performance regression for fp32 convolutions forward propagation on Intel Processor Graphics and Xe architecture-based Graphics (2c8d20640d5068e2d85e378b266644fe86220e84, d8d6807c9c3b3346ac1045cf2dd88c0aaddfa5ce)
  • Fixed segmentation fault for fp32 and bfloat16 convolutions with huge spatial dimensions on processors with Intel AVX2 and Intel AVX-512 support (fe8487db3a85e4a497af3bfa7ed96a2a986ce5f6, cb8ef4ed81b8f1f63cf5c5e444dc31add17317fb)
  • Fixed correctness issue in depthwise convolution (groups = channels) weight gradient with non-trivial padding and strides on Intel64 processors (b7ffe4859a17c360849018b8b4c187ddcdb64dcc)
  • Fixed correctness issue in int8 convolution with 1x1 filter and non-trivial padding on Intel Processor Graphics and Xe architecture-based Graphics (5b4201c2f302ff770593140802f508041973e310)
  • Fixed performance regression for dnnl_sgemm, fp32 matmul and inner product on Intel64 processors and improved performance of this functionality with threadpool threading (32c1110807b999c9d434a1be8455e42c35124a93)
oneDNN - v1.8

Published by vpirogov almost 4 years ago

Performance optimizations

  • Intel Processor Graphics and Xe architecture-based Graphics:
    • Improved performance of Winograd convolution.
  • Intel Architecture processors
    • Introduced initial performance optimizations for future Intel Core processor with Intel AVX2 and Intel DL Boost instructions support (code name Alder Lake).
    • Improved performance of int8 primitive for processors with Intel SSE4.1 instruction set support.
    • Improved performance of int8 and bfloat16 RNN and Inner Product primitives.
  • AArch64-based processors
    • Improved performance of Winograd convolution with ArmCL.
    • Improved performance of int8 convolution with ArmCL.
    • Added JIT support for AArch64 and a JIT reorder implementation.

New Functionality

  • Introduced int8 support for LSTM primitive with projection for CPU.

Thanks to the contributors

This release contains contributions from the project core team as well as Alejandro Alvarez, Aleksandr Nikolaev @alenik01, Arthur Mitrano @aaraujom, Benjamin Fitch, Diana Bite @diaena, Kentaro Kawakami @kawakami-k, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Rafik Saliev @rfsaliev, yuri@FreeBSD @yurivict. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v2.0

Published by anita-intel almost 4 years ago

This is a major oneDNN release based on oneDNN v1.7.

Binary distribution of this software is available as Intel(R) oneAPI Deep Neural Network Library in Intel(R) oneAPI.

Breaking API changes

  • OpenCL API:
    • The OpenCL interoperability API moved to dnnl_ocl.hpp.
    • Engine, stream, and memory objects are created from the corresponding OpenCL objects using free functions (see the sketch after this list).
  • Threadpool:
    • The threadpool API moved to dnnl_threadpool.hpp.
    • A stream object for threadpool is created using the free function dnnl::threadpool_interop::make_stream.
    • Removed stream attributes.
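
For illustration, a minimal sketch of the new free-function interop style; cl_dev, cl_ctx, and q stand in for valid OpenCL objects the application already owns:

    #include <CL/cl.h>

    #include "dnnl.hpp"
    #include "dnnl_ocl.hpp"

    // Build a oneDNN engine and stream on top of existing OpenCL objects
    // using the v2.0 free functions (formerly constructor overloads).
    dnnl::engine wrap_engine(cl_device_id cl_dev, cl_context cl_ctx) {
        return dnnl::ocl_interop::make_engine(cl_dev, cl_ctx);
    }

    dnnl::stream wrap_stream(const dnnl::engine &eng, cl_command_queue q) {
        return dnnl::ocl_interop::make_stream(eng, q);
    }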

New Functionality

Known Issues and Limitations

  • Pooling, batch normalization, and binary primitives may segfault when executed on Xe architecture-based graphics. No workaround available.
  • Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check that the selected GPU device is an Intel GPU. For more control, users can create a DNNL engine by passing a SYCL device and context explicitly (see the sketch at the end of these notes).
  • When running GPU kernels that take longer than a certain time (the threshold depends on OS and system settings), the application may appear to hang. Driver or system settings can be configured to disable this timeout so that DPC++ or OpenCL programs, including oneDNN examples, are not interrupted:
    • On Linux* (see more details at OpenCL™ Driver for Intel® HD, Iris™, and Iris™ Pro Graphics for Linux):
      $ sudo bash -c 'echo N > /sys/module/i915/parameters/enable_hangcheck'
    • On Windows* (see more details at Timeout Detection and Recovery (TDR) Registry Keys): increase the TdrDelay and TdrDdiDelay values in the registry.
  • See DPC++ limitations that impact the library as well.
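
As a follow-up to the engine-selection note above, a minimal sketch of creating an engine from an explicit SYCL device and context, assuming the v2.0 header layout (dnnl_sycl.hpp) and the sycl_interop::make_engine free function:

    #include <CL/sycl.hpp>

    #include "dnnl.hpp"
    #include "dnnl_sycl.hpp"

    int main() {
        // Pick the SYCL device explicitly (checking the vendor if needed)
        // instead of relying on the engine index, whose order is set by the
        // SYCL runtime.
        cl::sycl::device dev {cl::sycl::gpu_selector {}};
        cl::sycl::context ctx {dev};
        dnnl::engine eng = dnnl::sycl_interop::make_engine(dev, ctx);
        (void)eng;
        return 0;
    }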