oneDNN

oneAPI Deep Neural Network Library (oneDNN)

Apache-2.0 License


oneDNN - v3.0.1

Published by vpirogov over 1 year ago

This is a patch release containing the following changes to v3.0:

  • Fixed potential correctness issue in convolution weight gradient with 1x1 filter and strides (e58996692802f4a94651f6baa6e3f0debf93b537)
  • Improved convolution, deconvolution, inner product, and matmul primitives performance with scales on Intel CPUs (38319f1f822387bd755183bcac2ec3d0745a88b4, 18de927dc205543701942f0f26d61f72c51f5f0b, b6170d1b79332d8ba0f72227cb5edd2aced837c0, 85171b0cc057d5ba682dee582cd72c48543389db)
  • Reverted MEMFD allocator in Xbyak to avoid failures in high-load scenarios (eaaa41b8a30101640094e46af7f27969ed105ee2)
  • Fixed array out of bounds issue in bfloat16 convolution weight gradient on Intel CPUs (a17a64c330d1153fdea3d81f1420fb38c50248bd)
  • Improved compatibility with future versions of Intel GPU driver (eb7a0a07df12874a40c0f135d8bf16116594e0e8)
  • Fixed segfault in fp16 and bfloat16 convolution backward propagation on systems with Intel AMX support (293561b6a2644ef05d8d664cd81c1bcde876b481)
  • Fixed build issue with GCC 13 (1d7971ce488da657e23f08488cdb6ef8e484c5e8)
  • Fixed correctness issue in int8 RNN primitive Vanilla GRU flavor on Intel CPUs (f4a149c16faff0fb51fb292d12a7b51f6fac53bf, fbf8dca1ba9b565ddedd1cb291d3b466d0a5a45b)
  • Added check for unsupported arguments in binary primitive implementation for AArch64-based processors (5bb907077cd7b4c3983f7215d5509b17f3da67e2)
  • Fixed correctness issue in int8 convolution with zero-points on Intel Data Center GPU Max Series (96e868c473bb0e2a9b1a42b51e8f91997b52b471)
  • Fixed runtime error in convolution primitive with small number of channels on Xe-based graphics (068893e1c792c8e9ad5b17bc6e494359b32f910f)
  • Removed use of OpenCL C variable length arrays in reduction primitive implementation for Intel GPUs (41e8612f212d939643932ef309cd78bd4194f42d)
  • Fixed correctness issue in matmul and inner product primitives on Intel Data Center GPU Max Series (a1e6bc57b233d85a6f382db611879614236d9b05, dbb7c284e0834cd0fe84c8311484880802fa9af0)
  • Fixed segfault in fp16 and bfloat16 convolution backward propagation on future Intel Xeon processors (code name Sierra Forest) (399b7c5af4c5238f9956d71270adbd44f3cb25a3)
  • Fixed runtime error in Graph API for partitions with quantized matmul and add operations (f881da5be31abc71f90a1a750c50ec2ea5dbc516, 699ba755fde86aea3714bbce75d5b0b274302545, b8d21a58d8247097ed26816b730e3cd4c19f61c, 9421fb2a453aee957a0c1dc10be5675e5f916c2e)
  • Fixed convolution performance regression on Xe-based graphics (1869bf26a92f8d8f36853e537f9727412a4d1f94)
  • Improved convolution performance with OHWI and OIHW weight formats on Intel Data Center GPU Max Series (2d0b31ee82dc681b829f67100c05ae4e689633e6, 5bd5d52e7ee832fb0d5ece6d42a6b230023c9dd0)
  • Fixed include files handling in build system affecting CMake projects relying on oneDNN (c61645392fde55ac361c95a752df0cfa7ef24345)
  • Added tbb::finalize to tests and examples to address intermittent test crashes with TBB runtime (891a41560382cc0f991c428392078d13ccb76129, c79e54322f251aa70783ca1b837ce0d558bf3396, 8312c3addc597e6565cf1233801234c2ffafd092, 1a32b95a2c61d094206ed49d69843fdcdeb2ffcd, bd0389d81509baf6696d3927d0da4cce4c06d2d4, f05013d0e419df22ec2755dc5d74f5974871cf9e, ab7938f1b889aa43f155216f774297e8c765cd97, 31c9e7b3c1a7e262cecafe98bed128843f1c2969, f3261e4556935424946697be4b336020653b41a5, d58ac41a12179f8cca48962c4b5a44940bea97d7, f8c67b9026dc2945ed66a8f1c276611c063dae4d, 258849b71c24a89b08ac12972ec1fcaa72a9da39, b20a8c786c5a2cb676a2a8b599edf5cfd7ee0c3a)
  • Fixed segfault in fp16 convolution primitive on future Intel Xeon processors (code name Granite Rapids) (a574ffff870318cc104d8af4a2368d47b433b27f)
  • Fixed correctness issue in fp16 convolution primitive on future Intel Xeon processors (code name Sierra Forest) (f165ed8a8872e72a7d9651c3dd38bd6c2909fdce)
  • Fixed correctness issue in int8 convolution primitive on Intel CPUs (ca1592237b87cae5e4a55fb464ad90fb9f91957d, 27845b8e66d354549ac6c6fceeb92c267a9e910f)
  • Fixed correctness issue in int8 convolution primitive on Intel Data Center GPU Max Series (8bb651cb99e2875aea44b907bdc54418b2d4932a)
  • Fixed correctness issue in resampling primitive with post-ops on Intel CPUs (aa52a5128d44c6d745b89beabcd47f428665843e)
  • Addressed excessive memory consumption in 3D convolution on Intel CPUs (3d6412af5cb99863ede8753238533dcabcd3c5d9, 097acb5e108eb57b38a8a2409b083a1819b9f962, fd696639c70c4cd92e2aaf871bc4165c269d29f7)
  • Fixed segfault in convolution with sum and relu post-ops on Intel CPUs (63ad769939dd8307935caac67c0fc7c9bc9206de, 1b1303748b80360e5f93740d6ea03063132fd8f8, 0a8116b3de98243a234680d8cda869d2f20dd178, 9972cb80a29da9f14efbe8518bc10a21f7ae6e36)
  • Addressed convolution performance regression with small number of channels on Intel GPUs (d3af87710fcae9561ae22017d45bd670f8858272)
  • Worked around an MSVS 2019 bug resulting in build failures on Windows (40247753290e3e886b9235c5f80a2997eb85372a)
  • Updated code base formatting to clang-format 11 (23576f935fcef245b26cc78ef74935ea6bb7e6b7, 0b1bf845e05da75e4d994e01a0d7996b64787ece)
oneDNN - graph-v0.8.1

Published by vpirogov over 1 year ago

This is a patch release containing the following changes to graph-v0.8:

  • Upgraded oneDNN dependency from v2.7.2 to v2.7.3 (93237aa, 260bdb5)
  • Fixed a correctness issue of quantized Convolution + Add fusion (26a9a5b, beba352)
  • Fixed query_dynamic_outputs() interface implementation in graph compiler backend (8dbca04)
oneDNN - v2.7.3

Published by tprimak almost 2 years ago

This is a patch release containing the following changes to v2.7.2:

  • Fixed segfault in int8 convolution with binary post-ops on Intel CPUs (c8d40c0719f9d9cffa1c5eb04f3f40fa1f9546b8)
  • Applied workaround for tanh post-op on some Xe architecture based GPUs (3eb3267dc3bcfe64a081731ac9d08c84bc6827f7)
  • Disabled fp16 post-ops with Compute Library for Arm Architecture (ACL) (f7b7dc0a8b3125602295047cdd7feb3cbb8d9a06)
  • Fixed incorrect results for sequence of eltwise post-op with same algorithm but different parameters (02c26781171f6350634b41d80cbff7ae5092c1a1, 1c36e27520617e23b74ed32e675804ac7806576e, 81ba0fe626c93e51935d5e8776dd7e8bf4105487)
  • Fixed issue in convolution with groups and plain activation layout on Intel GPUs (df6f2e34bfb1e3d6bcd5498a4febb149b2be8b2b, d0c14c204782945b3732bd83b7329c314c3339c1)
  • Fixed reorder failures on Xe HPC architecture based GPUs (c3cb1d5fa7e2e41c7059fa7e5ebcee34aa3e5242)
  • Fixed thread safety issue in convolution primitive (2955c9d5d5f97f03c4068af37f6783f0be256695)
  • Fixed scratchpad allocation issue in matmul (989acd3b0dbd304fe47ac7837bb33e73a4ca7cd6)
  • Disabled concat batching with scales on Intel GPUs since the implementation doesn't support it yet (8aab73fe1897542c5ec740ac718b00e7d72edd92, 1eac450ca742cd9905addf36ee038a8e17e03474, 82838de623057ffd1dfc0f879afcd02e72f9538f)
  • Fixed segfault and correctness issues in convolution primitive with sum and relu post-ops on Intel CPUs (fc335be0d1376f1dca527bd543f929739dffd55f, 0f4697a87c0f550339598c1918d5479801337426, 60f1727fcaf06416c5464b44c177ec16829bd2c1, d28f2c1757e2cc6b792e4fd5de40987e811d086d, 4761ee91b3729d124135273a7450d3d2cf0dce53, f674fbf917e92b2623184ad8c603f20ae4fe0ad7)
oneDNN - graph-v0.8

Published by vpirogov almost 2 years ago

This is the Beta Update 2 release of oneDNN Graph API based on oneDNN v2.7.2.

Functionality

  • Added HardSigmoid operation.
  • Added block tensor layout support to improve performance on Xe architecture-based GPUs.
  • Added support of IOX and XOI weight formats for ConvTranspose operation.
  • Added query_dynamic_outputs API to support dynamic shapes in the graph. This functionality allows Graph API to infer output tensor shapes from input tensors.
  • Experimental: Introduced dynamic shapes support for MHA via oneDNN Graph Compiler.
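
The HardSigmoid operation added above is a piecewise-linear approximation of the logistic sigmoid. A reference sketch in pure Python (not the oneDNN implementation; the alpha = 1/6, beta = 1/2 defaults below are a common convention, and the operation typically exposes both as attributes):

```python
# Reference sketch of HardSigmoid (illustrative, not the oneDNN code):
# hardsigmoid(x) = max(0, min(1, alpha * x + beta)).
# alpha = 1/6 and beta = 1/2 are common defaults; treat them as assumptions.

def hard_sigmoid(x: float, alpha: float = 1.0 / 6.0, beta: float = 0.5) -> float:
    """Piecewise-linear sigmoid approximation, clamped to [0, 1]."""
    return max(0.0, min(1.0, alpha * x + beta))

values = [hard_sigmoid(x) for x in (-6.0, 0.0, 6.0)]  # [0.0, 0.5, 1.0]
```

Unlike the exact sigmoid, this variant needs no exponential, which is why it is attractive as a fused activation.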

Known Issues and Limitations

  • The weight’s opaque layout can be queried only from a compiled partition, which requires input tensor shapes to be known at compilation time.
  • MHA and MLP fusion are not activated on machines without Intel AVX-512 support.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Chunyuan Wu, Sanchit Jain, Yiqiang Li, Yunfei Mao, Kiefer Kuah and others.

oneDNN - v3.0

Published by harrymao2022 almost 2 years ago

Performance Optimizations

  • Intel Architecture Processors:
    • Improved performance for 4th generation Intel Xeon Scalable processor (formerly Sapphire Rapids).
    • Introduced FP16 support and initial optimizations for future Intel Xeon Scalable processor (code name Granite Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
  • Intel Graphics Products:
    • Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
  • AArch64-based Processors:
    • Improved reorder performance for processors with Scalable Vector Extensions (SVE) support.
    • Improved pooling performance with post-ops for processors with SVE 512 support.
    • Improved batch normalization performance with non-default flags for processors with SVE 512 support.
    • Improved performance of FP16 functionality with Compute Library for Arm Architecture (ACL).
    • Improved deconvolution performance with ACL.
  • PowerPC64-based Processors:
    • Improved int8 GEMM performance.

Functionality

  • Introduced new quantization scheme. Major changes include support for per-argument runtime scales in all primitives and unquantized bias.
  • [experimental] Introduced Graph API support that simplifies oneDNN integration into applications. The functionality is disabled by default and can be enabled at build time with ONEDNN_BUILD_GRAPH=ON flag.
  • Introduced support for Intel DPC++/C++ Compiler 2023.0, including new features from the SYCL 2020 standard.
  • Extended persistent cache to cover GPU engine object. This improvement allows applications to further reduce oneDNN initialization time.
  • Extended threadpool API with a function to indicate maximum available concurrency.
  • Extended binary primitive implementation on GPU with bfloat16 source and int8 destination support.
  • Introduced pooling and reduction primitives support on AMD GPUs.
  • Introduced reduction primitive support on NVIDIA GPUs.
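
To make the new quantization scheme concrete: scales now arrive per argument at execution time rather than being baked in at primitive creation, and bias is applied unquantized in fp32. A pure-Python sketch of the arithmetic (all function names here are hypothetical; oneDNN expresses this via primitive attributes and runtime scale arguments):

```python
# Illustrative sketch of the v3.0 quantization scheme (NOT the oneDNN API):
# per-argument scales are supplied at execution time, the int8 computation
# accumulates in integers, and the fp32 bias is added without quantization.

def quantize(values, scale):
    """Symmetric int8 quantization: round(x / scale), clamped to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def int8_dot_runtime_scales(src, wei, bias, src_scale, wei_scale):
    """dot(src, wei) + bias with per-argument runtime scales and fp32 bias."""
    q_src = quantize(src, src_scale)
    q_wei = quantize(wei, wei_scale)
    acc = sum(s * w for s, w in zip(q_src, q_wei))  # int32-style accumulator
    # Dequantize the accumulator once, then add the unquantized fp32 bias.
    return acc * (src_scale * wei_scale) + bias

src, wei, bias = [0.5, -1.25, 2.0], [1.0, 0.75, -0.5], 0.125
ref = sum(s * w for s, w in zip(src, wei)) + bias  # fp32 reference
out = int8_dot_runtime_scales(src, wei, bias, 0.02, 0.01)
```

Because scales are no longer fixed at creation time, one compiled primitive can serve tensors with different quantization parameters.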

Usability

  • Extended the set of supported format tags to cover formats used in applications.

Validation

  • Extended the GoogleTest (gtest) suite with support for Parametric Rectified Linear Unit (PReLU) primitive.

Breaking Changes

  • Removed deprecated APIs.
  • Removed operation descriptor object and made memory descriptor object opaque. See details in operation and memory descriptors RFC.
  • Removed creation time primitive scales support and primitive output scales support. See details in quantization scaling RFC.
  • Removed support for Intel DPC++/C++ Compiler 2022 and the SYCL 1.2.1 (aka SYCL 2017) standard. Use Intel DPC++/C++ Compiler and the SYCL 2020 standard instead.
  • Removed Winograd convolution implementation for int8 data type.
  • Updated minimal supported ACL version to 22.08 (was 22.05).

Thanks to the Contributors

This release contains contributions from the project core team as well as @akshatasangelkar, Aryan Karumuri @AryanKarumuri, Crefeda Rodrigues @cfRod, Divakar Mariyanna @bmdivakar, Gordon Fossum @austinpagan, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, lilianhuang @lilh9598, Milos Puzovic @milpuz01, Mona Minakshi @monaminakshi, Nathan John Sircombe @nSircombe, Peter Caday @petercad, and Sreekanth Yalachigere @sreekanth-yalachigere. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v3.0-rc

Published by harrymao2022 almost 2 years ago

This is a release candidate for oneDNN v3.0. Please provide feedback and submit defect reports via GitHub issues.

Performance Optimizations

  • Intel Architecture Processors:
    • Improved performance for 4th generation Intel Xeon Scalable processor (formerly Sapphire Rapids).
    • Introduced FP16 support and initial optimizations for future Intel Xeon Scalable processor (code name Granite Rapids).
  • Intel Graphics Products:
    • Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
  • AArch64-based Processors:
    • Improved reorder performance for processors with Scalable Vector Extensions (SVE) support.
    • Improved pooling performance with post-ops for processors with SVE 512 support.
    • Improved batch normalization performance with non-default flags for processors with SVE 512 support.
    • Improved performance of FP16 functionality with Compute Library for Arm Architecture (ACL).
    • Improved deconvolution performance with ACL.
  • PowerPC64-based Processors:
    • Improved int8 GEMM performance.

Functionality

  • Introduced new quantization scheme. Major changes include support for per-argument runtime scales in all primitives and unquantized bias.
  • [experimental] Introduced Graph API support that simplifies oneDNN integration into applications. The functionality is disabled by default and can be enabled at build time with ONEDNN_BUILD_GRAPH=ON flag.
  • Introduced support for Intel DPC++/C++ Compiler 2023.0, including new features from the SYCL 2020 standard.
  • Extended persistent cache to cover GPU engine object. This improvement allows applications to further reduce oneDNN initialization time.
  • Extended threadpool API with a function to indicate maximum available concurrency.
  • Extended binary primitive implementation on GPU with bfloat16 source and int8 destination support.
  • Introduced pooling and reduction primitives support on AMD GPUs.
  • Introduced reduction primitive support on NVIDIA GPUs.

Usability

  • Extended the set of supported format tags to cover formats used in applications.

Validation

  • Extended the GoogleTest (gtest) suite with support for Parametric Rectified Linear Unit (PReLU) primitive.

Breaking Changes

  • Removed deprecated APIs.
  • Removed operation descriptor object and made memory descriptor object opaque. See details in operation and memory descriptors RFC.
  • Removed creation time primitive scales support and primitive output scales support. See details in quantization scaling RFC.
  • Removed support for Intel DPC++/C++ Compiler with SYCL 1.2.1 (aka SYCL 2017) standard.
  • Removed Winograd convolution implementation for int8 data type.
  • Updated minimal supported ACL version to 22.08 (was 22.05).

Thanks to the Contributors

This release contains contributions from the project core team as well as @akshatasangelkar, Aryan Karumuri @AryanKarumuri, Crefeda Rodrigues @cfRod, Divakar Mariyanna @bmdivakar, Gordon Fossum @austinpagan, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, lilianhuang @lilh9598, Milos Puzovic @milpuz01, Mona Minakshi @monaminakshi, Nathan John Sircombe @nSircombe, Peter Caday @petercad, and Sreekanth Yalachigere @sreekanth-yalachigere. We would also like to thank everyone who asked questions and reported issues.

oneDNN - graph-v0.7.2

Published by vpirogov almost 2 years ago

This is a patch release containing the following changes to graph-v0.7.1:

  • Upgraded oneDNN dependency to v2.7.2 (dec9f8cc6)
oneDNN - v2.7.2

Published by vpirogov almost 2 years ago

This is a patch release containing the following changes to v2.7.1:

  • Fixed segfaults in deconvolution backpropagation with ACL on AArch64-based processors (f02e6f3f262813b8d0b6cb1f7b55fcc08b4b5bac)
  • Fixed code generation issues in Intel AVX2 convolution implementation (2ba25236bc417c4d5fe1729ddf9e01f1d1d25fb3, b60633f79947199a1f0cfce7aa42b0ae14690401, 844326b853ba9ca9b7a34ec08ca6e2e28d7332e8, 2009164c2ae90e1e938ab8823c817a6c95fccc11)
  • Fixed correctness issues and runtime errors in deconvolution with binary post-ops on Intel GPUs (dd54d3906c9613a967b709907306b946cfe32cac)
  • Improved performance of convolutions with small number of channels and large spatial sizes on systems with Intel AMX (26f97dc7a47aa2c0f0e13e6ff61dd3fc28fa077b, 4cb648d9e3620876fa7d7dca38a902643cd97dbc)
  • Fixed runtime error in int8 convolutions with groups on Xe architecture based GPUs (e5a70f43639ba968869a99931d77116791ace355)
  • Improved inner product weight gradient performance on Xe architecture based GPUs (9e9b859fddc6f813f9b9cac093d7d131c84054ab, 12ec4e3a51ddc105e86e9d29661690750560cd1c)
  • Improved batch normalization performance with threadpool threading (4fd5ab2dd312b2b79e8f2f1b18b39a94fee39e84)
  • Improved inner product performance with binary post-ops in broadcast mode on Intel CPUs (d43c70d4aafd58c241d456453994f4c7fe6aefff, 49ca4e17e7fd889c6c153f52dffa6f4d4a10e7c9)
  • Fixed segfaults and correctness issues in sum primitive with threadpool threading (ee7a3219db8bcdb7870b65b6ee0aadfba2275513)
  • Extended persistent cache API to cover engine objects (58481d606c19f4e46c1cd7dbfd4aba819ae024d3, 5f69dade29e317eab37455d477892996e80aea75, 16c0a95180a362c079fb2d3f01a4cea084b99628, 068071b326f253791ae767cae25258e6d47426ad)
  • Added support for newer versions of Intel GPU drivers (71443935355ef4fc52b510be761c487de8677386)
  • Updated ITT API version to 3.23.0 (d23cc9503f94ea9267bc8b6e654a912caa70e333)
  • Fixed convolution correctness issue on Intel Data Center GPU Flex Series (365ac202ca2f58078549116a0650a91566a256b6)
  • Fixed fp64 convolution correctness issue on Intel Data Center GPU Max Series (9d4bf94d89b945cb703a7b4d04d539daf7aab8b5, 67054032e4b1b4eae11f006e3857fe20a0d7b16a)
  • Fixed correctness issues in reduction primitive with binary post-op on Intel GPUs (ae9d075dbba068287b6cb280f0f22d3cdcbfcb36, e3b80c58f493e7972eb4d0317518534c1d8412e9)
  • Improved convolution performance on Intel Data Center GPU Max Series (90be8d501f3b35e88f997bf9e0fd139a740f72f7, caf4863f40dd06b807d2bb1abb487aad21d586a6)
  • Fixed build errors with ONEDNN_ENABLE_PRIMITIVE_GPU_ISA build option (de2db042bbb733de7c925224934ded766de74d68)
  • Fixed correctness issues in convolution with per-tensor binary post-ops on Intel CPUs (9cf9c189f6f674bba38ea11217f4b06acab87194)
  • Improved convolution performance on Intel Data Center GPU Flex Series (8b08a07574888bc265818a751eab82aa28115d72)
oneDNN - graph-v0.7.1

Published by vpirogov almost 2 years ago

This is a patch release containing the following changes to graph-v0.7:

  • Fixed a build issue in compiler backend (70258d306)
  • Optimized for zero points folding (d6f12b50c)
  • Fixed a primitive descriptor cache issue in reorder fusion (08876524d)
oneDNN - v2.7.1

Published by vpirogov almost 2 years ago

This is a patch release containing the following changes to v2.7:

  • Fixed performance regression for batch normalization primitive in TBB and threadpool configurations (cd953e4ca7390387b53fba7105f81a6fc1fc0382)
  • Improved grouped convolution performance on Xe Architecture GPUs (d7a781e166ef3206d9b0ab79a69d76034d663c20, cb1f3fe27f466a26b484ed063546bd0b6c4cd306, 4e844740d6b26709c0aa3c2604ed52130560208a, 7ba3c40f65425c4bc2b922ae7b2cdd8cb8e5181c)
  • Fixed runtime error in int8 reorder on Intel GPUs (53532a9944b2e4694d4c0135f0a1a5102ca97613)
  • Reverted MEMFD allocator in Xbyak to avoid segfaults in high load scenarios (3e29ae26dba137a6232669bd1c5d42ad4449b794)
  • Fixed a defect with incorrect caching of BRGEMM-based matmul primitive implementations with trivial dimensions (87cd9796a98497ab9a3ff5250ad3a396199590fb)
  • Improved depthwise convolution performance with per-tensor binary post-ops for Intel CPUs (f430a5a4c883ef846f938f571020565d41719e9c)
  • Extended threadpool API to manage maximum concurrency (8a1e9595f131e1303887fba407a03dbd64ac301e, 64e559454787651186ed6a32e4eef2a17132b9b6)
  • Fixed potential integer overflow in BRGEMM-based convolution implementation (25ccee38b97e935e6c3c729b9134804c6a2ea6a7)
  • Fixed performance regression in concat primitive with any format on Intel CPUs (2a60adec0e73895caefb3dc7d1de74b5eac8c6da, feb614d5fef07fb2a188ceef15ebeaf9f9f45acf)
  • Fixed compile-time warnings in matmul_perf example (b5faa77a4a651f1e44fa77348eded54ea3ec3eef)
  • Fixed 'insufficient registers in requested bundle' runtime error in convolution primitive on Xe Architecture GPUs (4c9d46acc35126fec2b59125403566a90b6bed36)
  • Addressed performance regression for certain convolution cases on Xe Architecture GPUs (f28b58aec55c5087127702f7c0a38d21b3006d35, 18764fbef1f1f90bc696fe35d059685b2b37f149)
  • Added support for Intel DPC++/C++ Compiler 2023 (c3781c671dcc23c0fa16eb648c98ef33b79c737b, a1a8952656b2e84a4124cc0d2f8c7aae10e62a46, 9bc87e635dbeffd77808c70fbd51ac5dc834b582, e3b19871cab6c9b5c317cddb18f4264575868ed7)
  • Fixed int8 matmul and inner product performance regression on Xe Architecture GPUs (3693fbf0e8b0cd3bcc2308a4504772c0af2eaf88, c8adc179133f7212523f4ecb1cdab648b0cec796)
  • Fixed accuracy issue for convolution, inner product and matmul primitives with tanh post-op on Xe Architecture GPUs (88b4e57718014bd50f78461a5c80dc680074f9b6, 83ce6d27a8699d7ab0d1ee450e2e7e9ec87a6e13, 6224dc6b3e2073c98f4b8278bf7e87769dd85a55, 10f0d0ade797a90c93b7450c1e0b151dc415dab3)
  • Suppressed spurious build warnings with GCC 11 (44255a8a57dc40ccc8f7b464e5638d6715216756)
oneDNN - v2.6.3

Published by tprimak almost 2 years ago

This is a patch release containing the following changes to v2.6.2:

  • Fixed potential integer overflow in BRGEMM-based convolution implementation (deb5595a0f96b54f9106cb846e6fc4e0af49aadf)
  • Fixed a defect with incorrect caching of BRGEMM-based matmul primitive implementations with trivial dimensions (305bed526492f2400a1a7fdfcb54b0ee41adc67e)
  • Extended benchdnn performance benchmarking capabilities on GPU with device-side performance measurement mode (ba8632592018070a46e4d349bbe3628756022c15)
  • Fixed segfault in pooling primitive on CPUs (689d874bbf0a3e1bdc75e99ad2453e6aac9cfe84)
oneDNN - graph-v0.7

Published by vpirogov about 2 years ago

This is the Beta Update release for oneDNN Graph API based on the oneDNN v2.7 release.

Functionality

  • Added operations Select, LogicalAnd, LogicalOr, LogicalXor, LogicalNot, Greater, GreaterEqual, Equal, NotEqual, Less, and LessEqual.
  • Added boolean data type to support logical operations.
  • Added support for passing compilation context to the compile API. This feature allows passing additional information, like tensor shape context, for the backend to generate better kernel code.
  • Introduced convolution block fusion via oneDNN Graph Compiler.
  • Experimental: Introduced dynamic shapes support for multi-layer perceptron (MLP) block via oneDNN Graph Compiler.

Known Issues and Limitations

  • The weight’s opaque layout can be queried only from a compiled partition, which requires input tensor shapes to be known at compilation time.
  • MHA and MLP fusion are not activated on machines without Intel AVX-512 support.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Chunyuan Wu, Sanchit Jain, Yiqiang Li, Yunfei Mao, Kiefer Kuah and others.

oneDNN - graph-v0.6

Published by vpirogov about 2 years ago

This is the Beta release for oneDNN Graph based on the oneDNN v2.7 release.

Functionality

  • Introduced FP32, BF16, FP16, and INT8 inference support on GPU.
  • Introduced FP32 and BF16 training support on GPU.
  • Introduced support for floating point math mode at graph construction phase. The mode allows the implementation to use low-precision data types for computations when possible.
  • Added graph::finalize() function to indicate that the user has finished adding operations into the graph and the graph is ready for partitioning.
  • Added operations AbsBackprop, Mish, MishBackprop, and LeakyReLU.
  • Updated API and operation definitions to comply with oneDNN Graph Specification 1.0-beta.

Usability

  • Integrated Graph component headers, source and build system into oneDNN:
    • Headers moved to include/oneapi/dnnl.
    • Source moved to src/graph.
    • Graph functionality is included in the single shared object or dynamic library produced by the build system.
  • Aligned API with oneDNN:
    • Shared common dnnl::engine and dnnl::stream. The original dnnl::graph::engine and dnnl::graph::stream APIs were removed.
    • Added a new make_engine_with_allocator() API to create dnnl::engine with dnnl::graph::allocator.
    • A few common basic types are shared between oneDNN and oneDNN Graph, including dnnl_status_t, dnnl_data_type_t, and dnnl_dims_t.
  • Introduced ONEDNN_BUILD_GRAPH build option to manage Graph component build.

Validation

  • Introduced ONEDNN_GRAPH_DUMP environment variable that serializes library graphs and subgraphs into JSON files.
  • Added the initial version of benchdnn graph driver which can be used to benchmark the performance with a dumped graph JSON file.

Breaking changes

  • Removed operations HardTanh, Index, Pow, and others. See the operation kind list for details.

Known Issues and Limitations

  • Graph Compiler component is not included with this release. It will be reinstated in the oneDNN Graph Beta Update release.
  • The weight’s opaque layout can be queried only from a compiled partition, which requires input tensor shapes to be known at compilation time.
  • Build option ONEDNN_BUILD_GRAPH is not compatible with some of the build options supported by the build system including ONEDNN_GPU_RUNTIME=OCL, ONEDNN_ENABLE_WORKLOAD=INFERENCE, ONEDNN_ENABLE_PRIMITIVE, and others.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Chunyuan Wu, Sanchit Jain, Yiqiang Li, Yunfei Mao, Kiefer Kuah and others.

oneDNN - v2.7

Published by harrymao2022 about 2 years ago

Performance Optimizations

  • Intel Architecture Processors
    • Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids).
    • Introduced performance optimizations for bf16 floating point math mode on Intel Xeon Scalable processors (code name Sapphire Rapids). The bf16 math mode allows oneDNN to use bf16 arithmetic and Intel AMX instructions in computations on fp32 data.
  • Intel Graphics Products
    • Improved performance for future Xe Architecture graphics (code name Ponte Vecchio).
    • Introduced performance optimizations for tf32 floating point math mode on future Xe Architecture graphics (code name Ponte Vecchio). The tf32 math mode allows oneDNN to use tf32 arithmetic in computations on fp32 data.
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
  • AArch64-based Processors
    • Improved convolution and binary primitive performance for processors with SVE 512 support.
    • Improved shuffle and eltwise primitives performance for processors with SVE 256 and SVE 128 support.
    • Improved PReLU, batch normalization, and pooling primitives performance via Compute Library for the Arm Architecture (ACL).
    • Improved performance of inner product, matmul, convolution, and batch norm primitives with post-ops via ACL.
  • PowerPC64-based Processors
    • Introduced performance optimizations for int8 and bfloat16 GEMM.
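
The bf16 math mode above lets the library use bf16 arithmetic on fp32 data: bf16 keeps fp32's 8-bit exponent but only 8 bits of significand, so inputs are rounded down to bf16 before the multiply while accumulation stays in fp32. A self-contained sketch of that rounding step (this emulates the downconvert for illustration; it is not how the library implements it):

```python
# Emulate bf16 rounding of fp32 inputs (illustrative only).
import struct

def to_bf16(x: float) -> float:
    """Round an fp32 value to the nearest bf16 (round-to-nearest-even),
    returned as an ordinary float for inspection."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 16) & 1                      # tie-breaking bit, round-to-even
    bits = (bits + 0x7FFF + lsb) & 0xFFFF0000   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# fp32 dot product vs. the same dot product with bf16-rounded inputs and
# fp32 accumulation -- roughly what the bf16 math mode does on fp32 data.
a = [0.1, 0.2, 0.3]
b = [1.5, -2.5, 3.5]
exact = sum(x * y for x, y in zip(a, b))
approx = sum(to_bf16(x) * to_bf16(y) for x, y in zip(a, b))
```

The result differs from the fp32 path only in the low-order bits, which is the precision-for-throughput trade the math mode makes explicit.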

Functionality

  • Introduced runtime output scales support in all primitives.
  • Introduced scales support in concat primitive.
  • Extended floating point math mode API with tf32 data type option.
  • Extended eltwise primitive with support for hardsigmoid algorithm.
  • Extended layer normalization primitive with support for mixed source and destination data types.
  • Extended depthwise post-op with support for arbitrary padding size. The implementation is available only on Intel processors.
  • Added limited fp64 data type support in convolution primitive. Optimized implementation is available for future Xe Architecture graphics (code name Ponte Vecchio).
  • Extended int8 convolution and deconvolution implementations on GPUs with arbitrary destination data type support.
  • Extended batch normalization primitive with dnnl_fuse_norm_add_relu flag that allows fusing sum and relu operations. The implementation is available for Intel GPUs.
  • Extended GPU deconvolution primitive implementation with support for output scales and zero points.
  • Introduced threadpool threading support for AArch64-based processors.
  • Introduced Unified Shared Memory (USM) support for SYCL backend on NVIDIA GPUs.
  • Introduced initial support for AMD GPUs via MIOpen library. Supported primitives include Local Response Normalization (LRN), softmax, and eltwise.
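
The dnnl_fuse_norm_add_relu flag above fuses a residual add and relu into the batch normalization pass, i.e. dst = relu(BN(src) + src_add). A scalar Python sketch of the fused math (inference-style, with precomputed mean and variance; names are illustrative, not the oneDNN API):

```python
# Illustrative fused batchnorm + add + relu (NOT the oneDNN kernel):
# dst[i] = relu(gamma * (x[i] - mean) / sqrt(var + eps) + beta + aux[i])

def bn_add_relu(x, aux, mean, var, gamma, beta, eps=1e-5):
    """One pass over the data instead of three separate primitives."""
    out = []
    for xi, ai in zip(x, aux):
        bn = gamma * (xi - mean) / (var + eps) ** 0.5 + beta
        out.append(max(0.0, bn + ai))  # residual add, then relu
    return out

dst = bn_add_relu([1.0, -1.0], [0.5, 0.0], mean=0.0, var=1.0, gamma=1.0, beta=0.0)
```

Fusing the three steps avoids two round trips through memory, which is the main win on bandwidth-bound GPUs.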

Usability

  • Added matmul_perf example that benchmarks matmul primitive for all supported data types.
  • Introduced annotations for JIT kernels to allow profilers like Linux perf to correctly label JIT code.
  • Extended verbose logs converter with RNN primitive support.
  • Added verbose output for dnnl_*gemm* calls.
  • Removed Level Zero headers from the list of build time dependencies.
  • Adjusted NVIDIA GPU implementation to comply with oneDNN numerical behavior. Implicit downconversion to fp16 and tf32 is now managed via the math mode API.

Validation

  • Added benchdnn driver for validation of internal BRGEMM implementation.
  • Improved benchdnn reference implementation performance with threadpool threading model.
  • Extended benchdnn performance benchmarking capabilities on GPU with device-side performance measurement mode (mode=po).

Deprecated Functionality

  • Support for SYCL 1.2.1 (aka SYCL 2017 standard) is deprecated and will be removed in future releases.
  • Static output scales are deprecated and will be removed in the next release.
  • Convolution Winograd algorithm implementation for int8 data type is deprecated and will be removed in the next release.

Breaking Changes

  • Changed formula for AUGRU RNN cell to align with TensorFlow. See the proposal for details.

Thanks to the Contributors

This release contains contributions from the project core team as well as Aidan Belton @AidanBeltonS, @akshatasangelkar, Alex Bojan @lb991, Crefeda Rodrigues @cfRod, Damian Szwichtenberg @dszwicht, Diana Bite @diaena, Divakar Mariyanna @bmdivakar, Emilio Cota @cota, Gordon Fossum @austinpagan, Hugh Delaney @hdelan, Jacek Czaja @jczaja, @jakpiase, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Kotha Sowmya @Sowmyakotha1999, Louie Tsai @louie-tsai, Mark Ryan @markdryan, MITSUNARI Shigeo @herumi, Mona Minakshi @monaminakshi, @NaNAGISaSA, Nathan John Sircombe @nSircombe, Peter Caday @petercad, @pgorlani, Sreekanth Yalachigere @sreekanth-yalachigere, Tadej Ciglarič @t4c1, and Thiago Macieira @thiagomacieira. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v2.7-rc

Published by harrymao2022 about 2 years ago

This is a release candidate for oneDNN v2.7. Please provide feedback and submit defect reports via GitHub issues.

Performance Optimizations

  • Intel Architecture Processors
    • Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids).
    • Introduced performance optimizations for bf16 floating point math mode on Intel Xeon Scalable processors (code name Sapphire Rapids). The bf16 math mode allows oneDNN to use bf16 arithmetic and Intel AMX instructions in computations on fp32 data.
  • Intel Graphics Products
    • Improved performance for future Xe Architecture graphics (code name Ponte Vecchio).
    • Introduced performance optimizations for tf32 floating point math mode on future Xe Architecture graphics (code name Ponte Vecchio). The tf32 math mode allows oneDNN to use tf32 arithmetic in computations on fp32 data.
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
  • AArch64-based Processors
    • Improved convolution and binary primitive performance for processors with SVE 512 support.
    • Improved eltwise and shuffle primitive performance for processors with SVE 256 and SVE 128 support.
    • Improved PReLU, batch normalization, and pooling primitive performance via Compute Library for the Arm Architecture (ACL).
    • Improved performance of inner product, matmul, convolution, and batch normalization primitives with post-ops via ACL.
  • PowerPC64-based Processors
    • Introduced performance optimizations for int8 and bfloat16 GEMM.

Functionality

  • Introduced runtime output scales support in all primitives.
  • Introduced scales support in concat primitive.
  • Extended floating point math mode API with tf32 data type option.
  • Extended eltwise primitive with support for hardsigmoid algorithm.
  • Extended layer normalization primitive with support for mixed source and destination data types.
  • Extended depthwise post-op with support for arbitrary padding size. The implementation is available only on Intel processors.
  • Added limited fp64 data type support in convolution primitive. Optimized implementation is available for future Xe Architecture graphics (code name Ponte Vecchio).
  • Extended int8 convolution and deconvolution implementations on GPUs with arbitrary destination data type support.
  • Extended batch normalization primitive with the dnnl_fuse_norm_add_relu flag that allows fusing sum and ReLU operations. The implementation is available for Intel GPUs.
  • Extended GPU deconvolution primitive implementation with support for output scales and zero points.
  • Introduced threadpool threading support for AArch64-based processors.
  • Introduced Unified Shared Memory (USM) support for SYCL backend on NVIDIA GPUs.
  • Introduced initial support for AMD GPUs via MIOpen library. Supported primitives include Local Response Normalization (LRN), softmax, and eltwise.
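The math modes introduced above (bf16 on CPUs, tf32 on GPUs) can be selected either programmatically via `dnnl::set_default_fpmath_mode()` or globally through oneDNN's `ONEDNN_DEFAULT_FPMATH_MODE` environment variable. A minimal sketch of the environment-variable route; the application binary named here is a placeholder:

```shell
# Ask oneDNN to allow relaxed arithmetic on fp32 data globally.
# Documented values include STRICT (default), BF16, F16, TF32, and ANY.
export ONEDNN_DEFAULT_FPMATH_MODE=BF16

# Launch the workload with the relaxed math mode in effect;
# "inference_app" is a hypothetical stand-in for any oneDNN-based program.
# ./inference_app
```

On hardware with Intel AMX, the BF16 setting lets oneDNN route fp32 computations through bf16 AMX instructions, which is the mechanism the Sapphire Rapids optimizations in this release target.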

Usability

  • Introduced annotations for JIT kernels to allow profilers like Linux perf to correctly label JIT code.
  • Extended verbose logs converter with RNN primitive support.
  • Added verbose output for dnnl_*gemm* calls.
  • Removed Level Zero headers from the list of build time dependencies.
  • Adjusted the NVIDIA GPU implementation to comply with oneDNN numerical behavior. Implicit downconversion to fp16 and tf32 is now managed via the math mode API.
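The verbose tracing mentioned above (now covering dnnl_*gemm* calls) is enabled at run time through the `ONEDNN_VERBOSE` environment variable; a minimal sketch:

```shell
# 0 = off (default), 1 = trace primitive execution, 2 = also trace creation.
export ONEDNN_VERBOSE=1

# Any oneDNN-based program started from this shell now emits one log line
# per primitive (including dnnl_*gemm* calls) to stdout; "my_app" is a
# hypothetical placeholder binary.
# ./my_app
```

The JIT-kernel annotations for Linux perf are controlled separately, via the `ONEDNN_JIT_PROFILE` variable (exact flag values are documented in the oneDNN profiling guide).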

Validation

  • Added benchdnn driver for validation of internal BRGEMM implementation.
  • Improved benchdnn reference implementation performance with threadpool threading model.
  • Extended benchdnn performance benchmarking capabilities on GPU with device-side performance measurement mode (mode=po).

Deprecated Functionality

  • Support for SYCL 1.2.1 (aka SYCL 2017 standard) is deprecated and will be removed in future releases.
  • Static output scales are deprecated and will be removed in the next release.
  • Convolution Winograd algorithm implementation for int8 data type is deprecated and will be removed in the next release.

Breaking Changes

  • Changed formula for AUGRU RNN cell to align with TensorFlow. See the proposal for details.

Thanks to the Contributors

This release contains contributions from the project core team as well as Aidan Belton @AidanBeltonS, @akshatasangelkar, Alex Bojan @lb991, Crefeda Rodrigues @cfRod, Damian Szwichtenberg @dszwicht, Diana Bite @diaena, Divakar Mariyanna @bmdivakar, Emilio Cota @cota, Gordon Fossum @austinpagan, Hugh Delaney @hdelan, Jacek Czaja @jczaja, @jakpiase, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Kotha Sowmya @Sowmyakotha1999, Louie Tsai @louie-tsai, Mark Ryan @markdryan, MITSUNARI Shigeo @herumi, Mona Minakshi @monaminakshi, @NaNAGISaSA, Nathan John Sircombe @nSircombe, Peter Caday @petercad, @pgorlani, Sreekanth Yalachigere @sreekanth-yalachigere, Tadej Ciglarič @t4c1, and Thiago Macieira @thiagomacieira. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v2.6.2

Published by vpirogov about 2 years ago

This is a patch release containing the following changes to v2.6.1:

  • Removed unused variables (2500b0f6c1931f4b0b22b5fc92fcc87c6b875a3f, b4e00322c93984082b987408af8a2e341c7fd6c2)
  • Fixed correctness issue in fp32 convolution implementation for cases with large spatial size (207af06637ccf36fb08c5fd93b55d52a578cfa5a)
  • Fixed correctness issue in bfloat16 matmul implementation for processors with Intel AMX support (404b762f27350d5ad59225d966310b481951451e)
  • Fixed correctness issue in int8 reorder implementation with zero points (b340cba1cadc8fc6424945b5b2a09960bd8d47ec)
  • Improved int8 matmul and inner product primitives performance with small matrices for processors with Intel AMX support (73b75723921e9881b88b027a8f1b2d42251f6403, 58b386a21cfc9dbb7c331626e9e4752751cdf415)
  • Improved int8 convolution performance for processors with Intel DL Boost support (f35a62f9b3c1db5ce8a2704e530e050b2f4b1807)
  • Aligned AUGRU formula with TensorFlow definition (e47c6c570d97545b56f3afef77ce9fbd63ea320b, 4ba0a577947733690cdd0f9ecf269121148a28e1, b311e24ac3b669d6200b595201107601b6ce1f58)
  • Suppressed 'unvectorized loop' warning for Intel C/C++ Compiler (3932d0493586963df3cefb3c8f35cb6503cd444e)

oneDNN - graph-v0.5.2

Published by tprimak about 2 years ago

This is a patch release containing the following changes to graph-v0.5.1:

  • Deprecated quantized ReLU fusion patterns (85405a94)

oneDNN - v2.6.1

Published by vpirogov over 2 years ago

This is a patch release containing the following changes to v2.6:

  • Extended depthwise convolution post-op with support for arbitrary filter size, stride, and padding (79b019b102c5d68843d52473f7d26a80597d84d2)
  • Improved GEMM performance with threadpool threading on system with Intel AVX2 instruction set (2be0060dbf0291687bb8243068121d6cdda30ec2)
  • Fixed runtime error in GPU reduction primitive for specific tensor sizes (efbf9b5e8c12666314f3484ce279cee0a1a91a44)
  • Improved convolution performance on GPUs with Xe-HPG IP (f8de0c93e9ff53a7d0a41b97aabc85e828020881, c1fb8acd0f74f63db021d41dedcd54546aab5289)
  • Updated ITT API to 3.22.5 (9b186765dded79066e0cd9c17eb70b680b76fb8e)
  • Fixed correctness issues in reorder implementation for non-x64 systems (9961b8698b603842c79b492d82a05ba8dccb15da, 102063159c37b63c80fe6310e4d0481370a8ff02, 8b960dfaf43c417ed86b7da25451c12151c1a87b, ef1d9fa441f2e4e5c06a34042934cc272171a2e1, 8edd85907f42b72f9ace5dbc2bfcf43a63ce3d1b, 39edcf61e162d7f3a7449e05bfedccd1301fe34e, 3e0a0d9dbff6dd1c5e5d94f3c29727d489af7917, 1dff6251dd262c3bf1c5ec36a24ad9c2c46f2624, 8661958a4f4fce5c3f1dd65f30b03d9742579179)
  • Fixed handling of inf and -inf values in the eltwise log algorithm (732cbdd2651bc8ea4c7ae125c29e542fecd79b8e, 3fd0f2e44c84869181aa2506e8924c37e9267b64)
  • Improved depthwise convolution performance on GPUs with Xe-HPG IP (7a6fe1d964d423a22d9e3525f7851a7d221460ad)
  • Addressed fails in test_isa_hints gtest on GPUs (78c1c68305f81cb087f3e4dc2cebb07cace1ef4d)
  • Fixed issues with bfloat16 GEMM producing NaNs in certain cases on GPUs with Xe-HPC IP (5d659707f0cd9bc432e5f74d6e9d8b3bbc4776ad)
  • Changed default layout to blocked for depthwise convolutions to avoid spurious reorders (78f231b03f4a1126991f4e725b75c090925fd870)
  • Addressed issue with incorrect values in padded areas for convolution with post-ops on GPUs (2e4ad3ab7182cbc666af3a5c32d59bbd7cf710b7)
  • Fixed build issues with -Werror=odr option (27668dd728a3a3460315e44275490daab317fa8d)
  • Addressed issues detected by clang USAN in BRGEMM kernel (2bbaa3092b27dc0bf08dc2c534e3ee761d6fb6e0, 9b3826f762de28b2c35aa8f9249b916973b7b140, b59b02716367e64e35264093828da1c0b3edc646)

oneDNN - graph-v0.5.1

Published by vpirogov over 2 years ago

This is a patch release containing the following changes to graph-v0.5:

  • Fixed the layout propagation of Reshape and Transpose operators in oneDNN backend (3b681d4, 09863f9)
  • Enabled scalar Divide + MatMul fusion in oneDNN backend (d4c7dc6)
  • Enabled Convolution + LeakyReLU fusion in oneDNN backend (b0f4dbb, c8fb4c13, e15979e)
  • Improved the document of fusion patterns (b9a52384)
  • Fixed operands swapping for binary operators (a07bfdac, d2567d7)
  • Worked around a false positive build issue in GCC11 for compiler backend (17a40d0)

oneDNN - graph-v0.4.3

Published by vpirogov over 2 years ago

This is a patch release containing the following changes to graph-v0.4.2:

  • Upgraded to oneDNN v2.5.4 patch release (3418ec1)
  • Fixed compiler backend to build with downstream projects when LLVM is used (c73dd858)
  • Fixed the layout propagation of Reshape and Transpose operators in oneDNN backend (cbdb736f)