oneDNN

oneAPI Deep Neural Network Library (oneDNN)

Apache-2.0 License

Stars: 3.4K
Committers: 299

oneDNN - v3.4.1 Latest Release

Published by vpirogov 7 months ago

This is a patch release containing the following changes to v3.4:

  • Fixed an issue with caching and serialization of primitives in deterministic mode (7ed604a1e5688022a59444059e53a6a7967f679a)
  • Introduced memory descriptor serialization API (4cad420e673f4cd49568ea7c4dd6a55e6f55794e, 929a27ae0412a0851629da70916eee360a39baac, 9b848c859a6b1d046dd63cf20f817aa9428fb483)
  • Fixed incorrect results in fp64 convolution and deconvolution on Intel GPUs based on Xe-LPG architecture (ebe77b566bb1cd273e9bda99cc62063b7c2a7e45, 0b399ac42740a9c6ed458aacafdb31ce16205cbd, d748d642d7871608e09f5cee5d964ddcfc8a42ef, 9f4f3d510ddc9d639db052302be579621d46bb1f, 21a8caebb34a85074f3f8a5cef35ed85532a5bbe)
  • Fixed incorrect results in reorder with large sizes on Intel CPUs and GPUs (69a111e6d835f8632ea571f3ea0e273b22488d37, 4b7236134bde1c1a71859a844eae860a71670b97, 74a343bf66a1c8f113fa8e025391aba5015c6e48)
  • Reduced creation time for deconvolution primitive on Intel CPUs (bec487e4ae16b3e88382adf9574e9c62cc76d1bd, 1eab00586881f4fb6966a16f71216528ec549c11)
  • Fixed performance regression in deconvolution on Intel CPUs (fbe5b97c966696a3f5be2240c0eb4592ed548036, 1dd3c6af03addefcf92ac45eddeb8becf63d6a6e)
  • Removed dangling symbols from static builds (e92c4041b12e55837452327c3ebd9411dbc2e861, 6f5621aed75226b93f07879fafa6fb799a36f042)
  • Fixed crash during platform detection on some AArch64-based systems (406a0798c1c5b939726a892ad5a96e20298396ca)
  • Fixed performance regression in int8 deconvolution on Intel CPUs (7e50e152f21a79978b8910260e042b43941b601c)
  • Fixed handling of zero points for matmul in verbose logs converter (15c791686f94291eddda7a2e24835ba1113c530a)
oneDNN - v3.3.6

Published by vpirogov 7 months ago

This is a patch release containing the following changes to v3.3.5:

  • Fixed crash during platform detection on some AArch64-based systems (3e0e69b21ba0694db95bd2af0877f936dcc86dd2)
  • Improved inner product performance with Arm Compute Library (ACL) (e7abee2d883d41613cf243c135037fc68d2dacd0, 214fb9e14227880097729ffffac3b666a0febcd7, 8aacc8ff0dfefddfae30681d056757dba1fb0815)
  • Fixed incorrect results in int8 depthwise convolution with post-ops on processors with Intel AVX2 instruction set support (0c922e04df62cf3042ebdc578a72883bde35079a)
  • Fixed performance regression in fp32 convolution on processors with Intel AVX2 instruction set support (4efc0ad7234741459bab6afc21f571ddb645bcae)
oneDNN - v3.4

Published by vpirogov 8 months ago

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    • Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
    • Improved RNN primitive performance with LBR_GRU cell.
    • Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
    • Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
    • Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
    • Improved int8 matmul performance with transposed A tensor.
    • Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
    • Improved performance of int8 convolution with post-ops.
    • Optimized batch matmul with binary post-op and broadcast masks 1 and 14.
    • Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
    • Improved performance of subgraphs including matmul and add operations and mixed int8 and bfloat16 data types with Graph API.
    • [experimental] Improved performance of reduction, softmax and layernorm operations with experimental Graph Compiler backend.
    • [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
  • Intel Graphics Products:

    • Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
    • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
    • Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
    • Improved convolution performance for cases relevant to the Stable Diffusion model.
    • Improved RNN primitive performance.
    • Improved pooling forward propagation performance.
    • Improved batched matmul performance for cases with 5 dimensions or more.
  • AArch64-based Processors:

    • Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
    • Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
    • Improved bf16 inner product primitive performance with ACL.

Functionality

  • Introduced GPT-Q support to improve Large Language Model (LLM) performance with compressed weights. An optimized implementation is available for Intel Graphics Products and supports matmul with int8 weight compression.
  • Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio). A hedged sketch of the new data types follows this list.
  • Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
  • [experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
  • Intel Graphics Products
    • Introduced support for Intel Data Center GPU Max 1550VG.
    • Introduced PReLU post-op support for inner product and matmul primitives.
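
A minimal, hedged sketch of declaring fp8 memory descriptors. The f8_e4m3 and f8_e5m2 enum names are assumptions based on the v3.4 C++ headers rather than something these notes confirm, so verify them locally.

```cpp
// Hedged sketch: fp8 memory descriptors (enum names assumed from v3.4 headers).
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    // Weights in 8-bit floating point (e4m3), activations in e5m2.
    memory::desc wei_md({256, 256}, memory::data_type::f8_e4m3, memory::format_tag::ab);
    memory::desc act_md({16, 256}, memory::data_type::f8_e5m2, memory::format_tag::ab);

    // These descriptors are used like any other when creating primitives such
    // as matmul; optimized kernels target Intel Data Center GPU Max Series.
    return (wei_md.get_size() + act_md.get_size()) > 0 ? 0 : 1;
}
```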

Usability

  • Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment; a hedged sketch after this list shows how to enable it together with accumulation mode control.
  • Introduced accumulation mode control.
  • Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
  • Extended verbose diagnostics for Graph API with information on operation schema checks and pattern matching results.
  • Reduced RNN primitive memory consumption on GPUs.
  • Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
  • Extended tensor constructor in Graph API to support memory allocation and management by the library.
  • Introduced a new API and environment variable to manage Graph API constant tensor cache capacity.
  • Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing the number of patterns, and skipping inapplicable patterns.
  • Changed default optimization flags for AArch64 builds to -mcpu=generic to improve portability.
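
The deterministic and accumulation mode controls above are exposed through primitive attributes. Below is a minimal C++ sketch assuming the v3.4 attribute names set_deterministic and set_accumulation_mode; treat it as illustrative and check it against your oneDNN headers.

```cpp
// Hedged sketch: opt-in deterministic mode plus relaxed accumulation,
// both applied through dnnl::primitive_attr (names assumed from v3.4).
#include "oneapi/dnnl/dnnl.hpp"

int main() {
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);

    dnnl::primitive_attr attr;
    // Bitwise-identical results between runs in a fixed environment.
    attr.set_deterministic(true);
    // Let implementations relax accumulator precision for speed.
    attr.set_accumulation_mode(dnnl::accumulation_mode::relaxed);

    // The attribute is passed at primitive descriptor creation, e.g.:
    //   dnnl::matmul::primitive_desc pd(eng, src_md, weights_md, dst_md, attr);
    return 0;
}
```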

Validation

  • Improved benchdnn performance by optimizing bottlenecks in validation code.
  • Introduced --num-streams knob in benchdnn to support benchmarking in multi-stream scenarios.

Known Limitations

  • Intel Data Center GPU Flex Series driver for Windows has an issue resulting in program hangs or crashes when oneDNN primitives are created concurrently.
  • int8 concat primitive may produce incorrect results on integrated GPUs with the current GPU driver.
  • fp32 pooling primitive may produce incorrect results in rare conditions on Intel Data Center GPU Max Series with the current GPU driver.
  • reorder primitive causes a segmentation fault for prime sizes exceeding 2^31 on Intel CPUs.
  • fp64 convolution and deconvolution produce incorrect results on integrated graphics in future Intel Core processors (code-named Arrow Lake).
  • int8 matmul primitive creation with fp32 bias fails on Intel GPU Flex Series and Intel Arc Graphics.

Breaking Changes

  • Updated minimal supported ACL version to 23.11 (was 23.02.1).

Thanks to these Contributors

This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v3.3.5

Published by vpirogov 8 months ago

This is a patch release containing the following changes to v3.3.4:

  • Fixed undefined behavior in 3D depthwise convolution on Intel CPUs (bbaec145f8c64818fd5c3ed2cb9e2ae69daef887)
  • Added warning for ACL versions newer than maximum supported (7473012743ae3227dbfa208cad260d29d86d5080)
  • Added citation file (fea9f88fa7f8056a5addedfdebdb2dda35ee7a9d)
  • Fixed SEGFAULT in int8 convolution on processors with Intel AMX support (2a8e122b63b55f897c470d23f21003bb70f0e839)
oneDNN - v3.4-rc

Published by harrymao2022 8 months ago

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    • Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
    • Improved RNN primitive performance with LBR_GRU cell.
    • Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
    • Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
    • Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
    • Improved int8 matmul performance with transposed A tensor.
    • Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
    • Improved performance of int8 convolution with post-ops.
    • Optimized batch matmul with binary post-op and broadcast masks 1 and 14.
    • Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
    • Improved performance of subgraphs including matmul and add operations and mixed int8 and bfloat16 data types with Graph API.
    • [experimental] Improved performance of reduction, softmax and layernorm operations with experimental Graph Compiler backend.
    • [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
  • Intel Graphics Products:

    • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
    • Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
    • Improved convolution performance for cases relevant to the Stable Diffusion model.
    • Improved RNN primitive performance.
    • Improved pooling forward propagation performance.
    • Improved batched matmul performance for cases with 5 dimensions or more.
  • AArch64-based Processors:

    • Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
    • Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
    • Improved bf16 inner product primitive performance with ACL.

Functionality

  • Introduced GPT-Q support to improve Large Language Model (LLM) performance with compressed weights. An optimized implementation is available for Intel Graphics Products and supports matmul with int8 weight compression.
  • Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
  • Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
  • [experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
  • Intel Graphics Products
    • Introduced PReLU post-op support for inner product and matmul primitives.

Usability

  • Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
  • Introduced accumulation mode control.
  • Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
  • Extended verbose diagnostics for Graph API with information on operation schema checks and pattern matching results.
  • Reduced RNN primitive memory consumption on GPUs.
  • Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
  • Extended tensor constructor in Graph API to support memory allocation and management by the library.
  • Introduced a new API and environment variable to manage Graph API constant tensor cache capacity.
  • Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing the number of patterns, and skipping inapplicable patterns.
  • Changed default optimization flags for AArch64 builds to -mcpu=generic to improve portability.

Validation

  • Improved benchdnn performance by optimizing bottlenecks in validation code.
  • Introduced --num-streams knob in benchdnn to support benchmarking in multi-stream scenarios.

Breaking Changes

  • Updated minimal supported ACL version to 23.11 (was 23.02.1).

Thanks to these Contributors

This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v3.3.4

Published by vpirogov 10 months ago

This is a patch release containing the following changes to v3.3.3:

  • Fixed performance regression in convolution, matmul and inner product primitives with post-ops on Intel CPUs (2e3c94c5aeb6be1ce992d799943fdc4f3123905f)
  • Fixed performance regression in bfloat16 matmul on processors with Intel AMX instruction set support (c0ae38cdf1201caf8ffd2906077defdfe7f4aaa3, fa4364057891fdec528d9442c88d0715306bff2d)
  • Fixed segfault in 3D convolutions with different h and w parameters on Intel CPUs (b5f916ec068f783dbba2cd4f04a673e996f9efba)
  • Fixed performance regression in fp32 convolution backpropagation on Intel CPUs (ee3b12d5388d7d749a120cf8522efd6f5aeecc09)
  • Reduced benchdnn memory consumption on Intel GPUs (84a8f57d45f215cf89d0f80a57a66b78eaf9b440)
oneDNN - v3.3.3

Published by vpirogov 10 months ago

This is a patch release containing the following changes to v3.3.2:

  • Fixed performance regression in int8 convolutions on processors with Intel AVX-512 and Intel DL Boost support (a00661ff735e5448ef3a80e4e2df7a1556f8a84f)
  • Fixed race condition during library initialization on Intel Data Center GPU Max Series (7dfcd116e245e4a167a64bd39a24e957d2b939de)
  • Fixed accuracy issue in experimental Graph Compiler with LLVM code generator (8892e7efadeaf42d75f75e64d095635458836cd7)
  • Disabled int8 RNN implementation for cases with non-trivial strides (2195e4b23d57c38a439c50232783f654b96f575c)
  • Fixed incorrect results in bfloat16 convolution implementation on processors with Intel AMX support (9f00af9312a9b76a1880e1aaac513188793ecaa7)
  • Fixed incorrect results in fp16 and int8 convolution on Intel Core Ultra integrated GPUs (69cef84c4f09398858393035eafa2bd4a29ec0b0, 79bc6cc0477db1ce7e732f20d005ff2b9e88390e, c9c0b09c5e64114eada1b6beb7f6db36331e0fac)
oneDNN - v3.3.2

Published by vpirogov 11 months ago

This is a patch release containing the following changes to v3.3.1:

  • Fixed incorrect results in bfloat16 reorder on Intel Core Ultra integrated GPUs (9025980286c506908f98819e068a047a1d268842, ed9de2afd1fede32a317cbc5df953dfe997e78ea, 0c6bda10b3ea760205d4707a554b76045ef6f964)
  • Fixed incorrect results in matmul, inner product, and RNN primitives on Intel Core Ultra integrated GPUs (6edab9f01ec5cf8b30ee0b474aa25417f0493897)
  • Updated compiler optimization flags for AArch64 processors to make build portable (8829c249b713dddc87c2669120a9798e202ac633)
  • Fixed segmentation fault during library initialization on AArch64 processors (3e15c6113ffeff3545775cbcca9bd84911856cb9)
oneDNN - v3.3.1

Published by vpirogov 11 months ago

This is a patch release containing the following changes to v3.3:

  • Fixed int8 convolution accuracy issue on Intel GPUs (09c87c79bccbad8fa451b224a0f07f87095e3907)
  • Switched internal stream to in-order mode for NVIDIA and AMD GPUs to avoid synchronization issues (db01d62b3fc80897d88dc42f4dcdfcb0d90c131a)
  • Fixed runtime error for avgpool_bwd operation in Graph API (d025ef6620b131f3487bb748866ddd9d7225c09f, 9e0602ad37afa18d46f407cb52577f1afead238b, e0dc1b3d070313052f5fd6ac739778d45b57859c)
  • Fixed benchdnn error reporting for some Graph API cases (98dc9dbecb3f36234474c9d6e96ab6571497633b)
  • Fixed accuracy issue in experimental Graph Compiler for int8 MHA variant from StarCoder model (5476ef7c165d943fbce94ca0f44a13d6868e65f3)
  • Fixed incorrect results for layer normalization with trivial dimensions on Intel GPUs (a2ec0a0c5805314220db925e1323e4675e3ca379)
  • Removed redundant synchronization for out-of-order SYCL queues (a96e9b1a6769171e74b0b8e031489303438906e5)
  • Fixed runtime error in experimental Graph Compiler for int8 MLP subgraph from LLAMA model (595543dd093df3e92621c253d6da3f9092ec7ff8)
  • Fixed SEGFAULT in experimental Graph Compiler for fp32 MLP subgraph (42071057abb2fcbbca6ed67117bdb1a5ee3dc0cd)
  • Fixed incorrect results in experimental Graph Compiler for MLP subgraph (57e14b56d4e6fab2ab49dbd47fd579482d79535a)
  • Fixed an issue where the f16 inner product primitive with s8 output returned unimplemented on Intel GPUs (bf12207b0312c0174f0c47ae0d3abd70edc31957, 800b5e9613bd0994af82706ef024ad2b453be2b6, ec7054a2c79ae33d3db4ff04ce11360c2c896d56)
  • Fixed incorrect results for int8 deconvolution with zero-points on processors with Intel AMX instructions support (55d2cecd698f865efac2e1dbf2f701b4b8095df1)
oneDNN - v3.3

Published by harrymao2022 about 1 year ago

Performance Optimizations

  • Intel Architecture Processors:
    • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    • Improved int8 convolution performance with zero points on processors with Intel AMX instruction set support.
    • Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). This functionality is disabled by default and can be enabled via CPU dispatcher control.
    • Improved fp32 and int8 convolution performance for cases with small numbers of input channels for processors with Intel AVX-512 and/or Intel AMX instruction set support.
    • Improved s32 binary primitive performance.
    • Improved fp16, fp32, and int8 convolution performance for processors with Intel AVX2 instructions support.
    • Improved performance of subgraphs with convolution, matmul, avgpool, maxpool, and softmax operations followed by unary or binary operations with Graph API.
    • Improved performance of convolution for depthwise cases with Graph API.
    • [experimental] Improved performance of LLAMA2 MLP block with Graph Compiler.
  • Intel Graphics Products:
    • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
    • Reduced RNN primitive initialization time on Intel GPUs.
  • AArch64-based Processors:
    • Improved fp32 to bf16 reorder performance.
    • Improved max pooling performance with Arm Compute Library (ACL).
    • Improved dilated convolution performance for depthwise cases with ACL.

Functionality

  • Introduced group normalization primitive support. The functionality is currently available on CPUs; a hedged sketch follows this list.
  • Intel CPUs:
    • Introduced support for zero points in int8 convolution with groups and 3D spatial.
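
A hedged sketch of using the new group normalization primitive on CPU. The primitive_desc argument order (groups, epsilon, flags) is an assumption based on the v3.3 C++ API and should be verified against dnnl.hpp.

```cpp
// Hedged sketch: forward-inference group normalization on CPU.
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // NCHW tensor: batch 2, 32 channels normalized in 8 groups, 16x16 spatial.
    memory::desc md({2, 32, 16, 16}, memory::data_type::f32, memory::format_tag::nchw);

    auto pd = group_normalization_forward::primitive_desc(eng,
            prop_kind::forward_inference, md, md,
            /*groups=*/8, /*epsilon=*/1e-5f, normalization_flags::none);
    auto gnorm = group_normalization_forward(pd);

    memory src(md, eng), dst(md, eng);
    gnorm.execute(s, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
    s.wait();
    return 0;
}
```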

Usability

  • Extended verbose mode output:
    • Improved diagnostics on engine creation errors.
    • Added information on Graph API calls.
    • Added information on strides for non-dense memory objects.
    • Added values of runtime dimensions.
    • Added indication that primitive descriptor was created with any memory format tag.
  • Introduced examples for Graph API.
  • Graph API constant tensor cache is now disabled by default and requires opt-in with a dnnl::graph::set_constant_tensor_cache() call, as in the sketch after this list.
  • Reduced oneDNN Graph API memory consumption in certain scenarios.
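
Since the cache now defaults to off, frameworks that rely on it must opt in explicitly. A hedged sketch follows; set_constant_tensor_cache() is named in the note above, while the companion getter is an assumption from the C++ API.

```cpp
// Hedged sketch: re-enabling the Graph API constant tensor cache.
#include "oneapi/dnnl/dnnl_graph.hpp"

int main() {
    // Non-zero enables caching of constant tensors (e.g. reordered weights)
    // across compiled partition executions.
    dnnl::graph::set_constant_tensor_cache(1);

    // ... build a graph, compile partitions, execute ...

    int enabled = dnnl::graph::get_constant_tensor_cache();  // assumed getter
    return enabled ? 0 : 1;
}
```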

Validation

  • Extended benchdnn performance reporting with primitive creation time.
  • Introduced cold cache mode in benchdnn.

Known Limitations

  • Current GPU OpenCL runtime for Linux has an issue resulting in convolution producing incorrect results on integrated GPUs based on Xe architecture. SYCL configuration is not affected.
  • Pooling, resampling, prelu, batch normalization, layer normalization, and eltwise primitives may sporadically produce incorrect results on Intel Arc GPUs on Windows.
  • Current GPU driver for Linux has an issue resulting in program hangs or crashes when oneDNN primitives are executed concurrently on Intel Data Center GPU Max Series.
  • Extensive use of RNN primitive on Intel GPUs with default primitive cache setting may lead to a device reboot. Workaround: consider reducing primitive cache size to 100.
  • Int8 deconvolution with signed weights and activations may produce incorrect results on processors with Intel AMX support.
  • Int8 softmax may crash on Windows in the SYCL debug configuration.

Thanks to these Contributors

This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, @baibeta, Benjamin Taylor @bentaylorhk-arm, Ilya Lavrenov @ilya-lavrenov, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, Renato Barros Arantes @renato-arantes, @snadampal, @sparkyrider, and Thomas Köppe @tkoeppe. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v3.3-rc

Published by harrymao2022 about 1 year ago

Performance Optimizations

  • Intel Architecture Processors:
    • Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
    • Improved int8 convolution performance with zero points on processors with Intel AMX instruction set support.
    • Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). This functionality is disabled by default and can be enabled via CPU dispatcher control.
    • Improved fp32 and int8 convolution performance for cases with small numbers of input channels for processors with Intel AVX-512 and/or Intel AMX instruction set support.
    • Improved s32 binary primitive performance.
    • Improved fp16, fp32, and int8 convolution performance for processors with Intel AVX2 instructions support.
    • Improved performance of subgraphs with convolution, matmul, avgpool, maxpool, and softmax operations followed by unary or binary operations with Graph API.
    • Improved performance of convolution for depthwise cases with Graph API.
    • [experimental] Improved performance of LLAMA2 MLP block with Graph Compiler.
  • Intel Graphics Products:
    • Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
    • Reduced RNN primitive initialization time on Intel GPUs.
  • AArch64-based Processors:
    • Improved fp32 to bf16 reorder performance.
    • Improved max pooling performance with Arm Compute Library (ACL).
    • Improved dilated convolution performance for depthwise cases with ACL.

Functionality

  • Introduced group normalization primitive support. The functionality is currently available on CPUs.
  • Intel CPUs:
    • Introduced support for zero points in int8 convolution with groups and 3D spatial.

Usability

  • Extended verbose mode output:
    • Improved diagnostics on engine creation errors.
    • Added information on Graph API calls.
    • Added information on strides for non-dense memory objects.
    • Added values of runtime dimensions.
    • Added indication that primitive descriptor was created with any memory format tag.
  • Introduced examples for Graph API.
  • Graph API constant tensor cache is now disabled by default and requires opt-in with dnnl::graph::set_constant_tensor_cache() call.
  • Reduced oneDNN Graph API memory consumption in certain scenarios.

Validation

  • Extended benchdnn performance reporting with primitive creation time.
  • Introduced cold cache mode in benchdnn.

Thanks to these Contributors

This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, @baibeta, Benjamin Taylor @bentaylorhk-arm, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, @snadampal, @sparkyrider, Thomas Köppe @tkoeppe. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v2.7.5

Published by vpirogov about 1 year ago

This is a patch release containing the following changes to v2.7.4:

  • Fixed a correctness issue in fp32 batched matmul with transposed source tensor on processors with Intel AVX-512 instruction set support (1a9b80d7ad6856437c3e0b504bb53dca772eb0fe)
  • Improved batched matmul performance on processors with Intel AMX instructions support (8c20f62dbcd4d622c8a279b7c81dacb629f1de41, acb8e12b0f3e70d2e80543e31a91362c8852bbaf)
  • Fixed a correctness issue in int8 convolution primitive with zero points on processors with Intel AVX2 and Intel DL Boost support (0abbf225ce906987bc3728252b5842fb0239daab, d3a9f02e50334a0ebe102dd8bdb7887deeb12ec5)
  • Improved convolution performance with small number of input channels on processors with Intel AVX-512 instruction set support (fc7fced9988124027220fb53dfb16022c9be35c0)
oneDNN - v3.2.1

Published by vpirogov about 1 year ago

This is a patch release containing the following changes to v3.2:

  • Fixed a potential SEGFAULT when oneDNN primitives are created in parallel (0a6202f5000cf347995ab744c25aa26cabf2482d)
  • Replaced deprecated SYCL API get_pointer with get_multi_ptr (fdbff4591f952d02a0c934f854a9b225a7097a21, 51ed43bb5cb08f38b0b652255a13bb4072b2ee57)
  • Fixed an error in device indices detection for persistent cache (25575c2d20a9885640c89771c99a0d27b5444b4d)
  • Improved benchdnn performance results accuracy for Graph API (9dfe343992209ecc6eb1265a140b6f0db228d90a)
  • Fixed an issue with profiling API not respecting the ONEDNN_EXPERIMENTAL_PROFILING build knob. This behavior manifests as an apparent memory leak when oneDNN primitives are executed on a queue with profiling enabled (8d796efb609c33ecdd31e3e7b26d94d959dd83b9, 51a8f7ad892b1174d32cba8358804fad09b58f76, 2ca29381eeb5dde64d90468e440f87b6f9ad01d9)
  • Fixed a correctness issue in resampling primitive with binary and/or sum post-op on Intel CPUs (65ccd2506eeafb44822c682acfef97ef18bea09f, 4a0e087b405f4ebc682cf82c4a5bb96e9b9976d4, f333bb8c191fbfab368645aeac1c3a0d1892eda4)
  • Fixed a correctness issue in int8 matmul with zero-points for processors with Intel AVX2 and Intel DL Boost instructions support (ec0b2ee85fc2a2dbdeec10035c5ef5813d8fb5ea, 6d2e567c9361992adf235545c9fc2047184ed6e6)
  • Fixed a correctness issue in fp32 batched matmul with transposed source tensor on processors with Intel AVX-512 instruction set support (36f355e0773f79cca5a639a5a3558f45da57c35d)
  • Fixed a correctness issue in matmul and inner product with post-ops on processors with Intel AVX2 and Intel DL Boost with fp16 and bfloat16 instruction set support (b76d4cae333fc4e015d47eb737e10551daf30334)
  • Fixed a potential out of bounds issue during GPU kernel creation (190a9b28170f5326241c9c4ab6bc7964877e953d)
  • Updated build system to use TBB-provided CMake config file when available (40112196287e8866a7259df35f817229454d0c96)
oneDNN - v3.2

Published by harrymao2022 over 1 year ago

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable Processor (formerly Sapphire Rapids).
    • Improved performance for future Intel Xeon Scalable Processor (code-named Sierra Forest). The functionality is disabled by default and can be enabled via CPU dispatcher control.
    • Improved fp32 inner product performance for processors with Intel AVX-512 instructions support.
    • Improved bf16 and int8 matmul performance with runtime dimensions for processors with Intel AMX instructions support.
  • Intel Graphics Products:

    • Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc Graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
    • Reduced creation time for matmul, inner product, and RNN primitives.
  • AArch64-based Processors:

    • Improved convolution performance with post-ops on processors with SVE support.
    • Improved fp32 and fp16 depth-wise convolution performance with Arm Compute Library (ACL).
    • Improved fp32 deconvolution performance with ACL when the math mode is set to bf16 or any.
  • IBM Z Platform:

    • Improved int8 matmul, inner product, and RNN performance for s390 z15 systems.

Functionality

  • [experimental] Introduced Graph Compiler backend for Graph API. Graph Compiler improves performance of composite operations like multi-head attention (MHA), multi-layer perceptron (MLP), and convolution residual blocks for processors with Intel AVX-512 and Intel AMX instructions support.

  • Extended Graph API with boolean data type, select, and pow operations.

  • Introduced support for binary and eltwise post-ops in softmax primitives; a hedged sketch follows this list.

  • Introduced reference SYCL implementations of batch normalization, layer normalization, local response normalization (LRN), binary, softmax, eltwise, pooling, PReLU, shuffle, and resampling primitives. These implementations address functional gaps on NVIDIA and AMD GPUs where support is missing in native libraries.

  • Intel Graphics Products:

    • Introduced mixed precision support for binary primitives.
  • NVIDIA GPUs:

    • Introduced bfloat16 support for deconvolution and softmax primitives.
  • AMD GPUs:

    • Introduced support for inner product, convolution, deconvolution, batch normalization, and reorder primitives.
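
A hedged sketch of the new softmax post-op support. The shapes and the eltwise choice (a linear op scaling the output by 2) are illustrative, and the primitive_desc signature is assumed from the v3.2 C++ API.

```cpp
// Hedged sketch: softmax with a fused eltwise post-op (y = 2 * softmax(x)).
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    memory::desc md({16, 1000}, memory::data_type::f32, memory::format_tag::ab);

    // Attach a linear eltwise op after the softmax via primitive attributes.
    post_ops ops;
    ops.append_eltwise(algorithm::eltwise_linear, /*alpha=*/2.f, /*beta=*/0.f);
    primitive_attr attr;
    attr.set_post_ops(ops);

    auto pd = softmax_forward::primitive_desc(eng, prop_kind::forward_inference,
            algorithm::softmax_accurate, md, md, /*axis=*/1, attr);
    auto prim = softmax_forward(pd);

    memory src(md, eng), dst(md, eng);
    prim.execute(s, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
    s.wait();
    return 0;
}
```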

Usability

  • Extended verbose mode with additional capabilities, including information about implementation dispatching decisions and reasons for primitive creation errors.
  • Reduced stack consumption to less than 20 KB across implementations.
  • [experimental] Introduced profiling API for SYCL and OpenCL applications.

Validation

  • Introduced fast performance validation mode (--mode=F) in benchdnn. Testing speed is improved by initializing oneDNN objects in parallel and avoiding use of host memory when benchmarking GPU primitives.
  • Reduced benchdnn memory consumption in performance validation mode.
  • Introduced smoke test set for benchdnn. This test set provides basic validation for all primitives.

Known Limitations

  • On future Sierra Forest platforms, fp32 matmul with bfloat16 binary post-op may produce incorrect results on processors with Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Deep Learning Boost (Intel® DL Boost) support.
  • On SKX+ platforms, fp32 convolution forward propagation with strides has performance regression on processors with Intel® AVX-512 instructions support.
  • On all platforms, resampling primitive with binary post-op may produce incorrect results on CPUs.
  • On all GPU platforms, extensive use of the RNN primitive on Intel GPUs with default primitive cache settings may lead to a device reboot. Workaround: consider reducing primitive cache size to 100.
  • On DG2 and ATS-M platforms:
    • Convolution and deconvolution primitives on Intel® Arc™ GPUs on Windows may lead to memory corruption under heavy repeated use.
    • The bfloat16 matmul primitive may crash on Intel® Arc™ GPUs on Windows.
    • Pooling, resampling, prelu, batch normalization, and layer normalization may sporadically produce incorrect results on Intel® Arc™ GPUs on Windows.
    • oneDNN Graph partitions containing ConvTransposeBackwardWeights or int8 matmul operations may produce incorrect results on Intel® Processor Graphics on Windows.
  • On PVC platforms:
    • The bfloat16 matmul primitive has performance regression with shapes 14x128:128x200:14x200 and 200x128:128x200:200x200 on the Intel® Data Center GPU Max Series.
    • oneDNN primitives may crash or produce incorrect results with tensors exceeding 4 GB in size.
    • The softmax primitive with a NHWC memory format may produce incorrect results on the Intel® Data Center GPU Max Series.
  • On GEN12 platforms, the inner product weight gradient may produce incorrect results on Intel® Processor Graphics on Windows.

Thanks to the Contributors

This release contains contributions from the project core team as well as Abdelrauf @quickwritereader, Alexey Vishnyakov @SweetVishnya, Annop Wongwathanarat @annop-w, Anthony Roberts @anthony-linaro, Crefeda Rodrigues @cfRod, David Svantesson @davsva01, Fadi Arafeh @fadara01, Ilya Lavrenov @ilya-lavrenov, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, RambabuSwargam @RambabuSwargam, Sai Teja @saiteja13427, Taiju Tsuiki @tzik. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v3.1.1

Published by vpirogov over 1 year ago

This is a patch release containing the following changes to v3.1:

  • Fixed correctness issue in pooling primitive with post-ops on Intel GPUs (4b7bc1a7bf16909003f63bf66d3d730cee00e5db)
  • Fixed segfault in bfloat16 convolution on processors with Intel AMX support (461d55e65f2bc0f45fcdfc3405493226218d22ee)
  • Fixed correctness issue in deconvolution primitive with post-ops on Intel GPUs based on Xe-LP architecture (c8943f588e99f6251a443ee4eb5c274e9c942947, ad3c62f104b07d30cc0f5cf34ca7bf127041e4dc)
  • Fixed performance regression in int8 convolution primitive with scales (7fa3b6f335893270cdd079f4f8aadd36cf8f490b, bb3ecc460605eae3ca8a8ee79a8d9122f195730b)
  • Fixed correctness issue in int8 convolution primitive with zero points on processors with Intel AVX2 and Intel DL Boost support (d721767a554f9a4da70bd6bc1c27c00b1ea80cc2, f6365b1b2c6e6d79e59207dad090b9643224f147)
  • Fixed performance regression in int8 inner product on processors with Intel AVX-512 and Intel DL Boost or Intel AMX support (2ede31e834a25ca14c648e8617b972148c94554c)
  • Fixed segfault in pooling primitive with post-ops on processors with Intel SSE4.1 support (d712173a5b9df2bdefd12cc94be2e83e64cfb433, e4085a706dd0b41c3d8171193b816a3c4e52c01d)
  • Fixed integer overflow in eltwise primitive on Intel GPUs (1932b3d04e574745d54802ee19e18bcbe8887e2d, be05c3392eaf86f2d897c5ec42a8860361c290b8, 148006b86f66e4af8f3ebd7db94980de487b9287, 2e643692480be21019b2b71db69e07729bfbf26c, b4423fbc11e574697d97eda18d4b0d8d7b1f60f3, 87fd48f48847463cbd1c42a39c9aa092158dbf2f, 9a66ac6f394071b05285b063a393acd297e3c662, 6ce52eb340486373670a9975c54449cf14a73d4f, 36bf079e7e99e0408ec11fe94cd64439f30b4014, 161d2b6416f4e9c17eabd1d45b8a3aeb2d4e9dd0, a5ef0788afcb719d22a311f91b31f3afca392a7c, d058bd8898b92330546d3f8d52335631fda5051a)
  • Fixed primitive creation error in large 3D convolutions on Intel GPUs (7c23d9e85ef328081f7d9836ebfffda755f4b496)
  • Fixed performance regression in fp32 convolution primitive weight gradient on Intel GPUs (ff209f967c2bdfa1139779cf59dced374e2064c5, 87108392da71b06594356a18232ac1378e28adfc)
  • Fixed primitive creation error in int8 convolution with zero points on Intel GPUs (cb9169397ceee206fece71f73b5d627ee9eea33f, 85e58af6b5cb1a9cd42cd602832c035a3b3a660f)
  • Fixed correctness issue in fp32 convolution with Winograd algorithm on Intel GPUs (97ac88509bf8799fd03eb768faec302d44ce38dc)
  • Fixed primitive creation error in depthwise convolution on Intel GPUs based on Xe-LP architecture (51d608d24f09d6b0ad2d60008f09646dbf79ee60)
  • Fixed segfault during Graph partition compilation (a5d35682307ec81107f603b66c5f4ca95f421fbb)
  • Fixed crashes in inner product with unsupported weight formats on Intel64 CPUs (c0f4e93903f1c32bef8378d58177ef971c400e90)
  • Fixed an issue with compilation of Graph partitions containing matmul and using destination tensor layout any on Intel GPUs (ab2041d39862de747535037eb5a73c675d93d323, f2c457d72896d6c86245a6c6e80539b842aec369)
  • Improved accuracy of eltwise primitive with gelu_erf algorithm on Intel64 CPUs (e67abefadbb4fd73ea6a4d3981965bc56eb77b97)
  • Fixed correctness issue in int8 matmul and inner product primitives on Intel GPUs based on Xe-HPG and Xe-HPC architecture (36aa6224ebae1413a6badd43ffc96d3412c8f8ec)
  • Fixed potential correctness issue in bfloat16 convolution weight gradient on processors with Intel AMX support (c93e673bba299fdc62733f22d65d91f4dbc300dd, 8da108375bc02b08a385b167a49aa8d1189b66d6, f7acf9877b368a5f704dcc9efcb913345b477bbc)
  • Fixed memory corruption in inner product weight gradient on processors with Intel AMX support (b56a89e1b977d793f2de89dc95bb7f07f2449cd8)
  • Fixed integer overflow issue in convolution primitive on Intel GPUs (774deabcbb9dc3e452bdafcde5e92a55c3701309, 663c2e44272c57a97e5f20e3a7a28cb9ac91ae01, 12d57430c66eb4d83532a2338443faae7be8ea5c, 31ac0e045981b03434c7592fe84af97a79a3d4a8, e3cb07d60473c23829db987384e5366b924e22c4)
  • Fixed correctness issue in matmul primitive with broadcasted bias on Intel GPUs (3ba7e8b9c14948da35c86d4d74725f0d23511fc8)
  • Fixed correctness issue in inner product primitive with post-ops on processors with Intel AVX2 support (69260f661030f66b34fefeab97044c81769462a9)
  • Fixed out of bounds prefetching in matmul and inner product primitives on Intel GPUs (2b8f6b16dd894f7c13c33a9fd5c497cff10d66c2)
  • Fixed dispatching issues for fp32 inner product implementation on processors with Intel AVX2 and Intel DL Boost support (f27dedbfc093f51032a4580198bb80579440dc15, f8d7c2e40a965fc52521d4ba9c793d8adc2be4e1)
  • Fixed division by zero issue in eltwise and eltwise post-op on Intel GPUs (f5654f55582f003c22aee23e5a91acfead8d1e1b, a18c19e654483b547bbe791d0640eceef4ef2e79, a7c8cbc428ad361e2f290605be1280268eb8ea56, 44355a60e31fd20bf6fa029af5bf3eebc533ec2c)
  • Fixed correctness issue for 3D convolution primitive with post-ops (e6b93af5bdb32691ad90d3f537158649b61a6fc4)
oneDNN - v3.2-rc

Published by harrymao2022 over 1 year ago

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable Processor (formerly Sapphire Rapids).
    • Improved performance for future Intel Xeon Scalable Processor (code-named Sierra Forest). The functionality is disabled by default and can be enabled via CPU dispatcher control.
    • Improved fp32 inner product performance for processors with Intel AVX-512 instructions support.
    • Improved bf16 and int8 matmul performance with runtime dimensions for processors with Intel AMX instructions support.
  • Intel Graphics Products:

    • Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc Graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
    • Reduced creation time for matmul, inner product, and RNN primitives.
  • AArch64-based Processors:

    • Improved convolution performance with post-ops on processors with SVE support.
    • Improved fp32 and fp16 depth-wise convolution performance with Arm Compute Library (ACL).
    • Improved fp32 deconvolution performance with ACL when the math mode is set to bf16 or any.
  • IBM Z Platform:

    • Improved int8 matmul, inner product, and RNN performance for s390 z15 systems.

Functionality

  • [experimental] Introduced Graph Compiler backend for Graph API. Graph Compiler improves performance of composite operations like multi-head attention (MHA), multi-layer perceptron (MLP), and convolution residual blocks for processors with Intel AVX-512 and Intel AMX instructions support.

  • Extended Graph API with boolean data type, select, and pow operations.

  • Introduced support for binary and eltwise post-ops in softmax primitives.

  • Introduced reference SYCL implementations of batch normalization, layer normalization, local response normalization (LRN), binary, softmax, eltwise, pooling, PReLU, shuffle, and resampling primitives. These implementations address functional gaps on NVIDIA and AMD GPUs where support is missing in native libraries.

  • Intel Graphics Products:

    • Introduced mixed precision support for binary primitives.
  • NVIDIA GPUs:

    • Introduced bfloat16 support for deconvolution and softmax primitives.
  • AMD GPUs:

    • Introduced support for inner product, convolution, deconvolution, batch normalization, and reorder primitives.

Usability

  • Extended verbose mode with additional capabilities, including information about implementation dispatching decisions and reasons for primitive creation errors.
  • Reduced stack consumption to less than 20 KB across implementations.
  • [experimental] Introduced profiling API for SYCL and OpenCL applications.

Validation

  • Introduced fast performance validation mode (--mode=F) in benchdnn. Testing speed is improved by initializing oneDNN objects in parallel and avoiding use of host memory when benchmarking GPU primitives.
  • Reduced benchdnn memory consumption in performance validation mode.
  • Introduced smoke test set for benchdnn. This test set provides basic validation for all primitives.

Thanks to the Contributors

This release contains contributions from the project core team as well as Abdelrauf @quickwritereader, Alexey Vishnyakov @SweetVishnya, Annop Wongwathanarat @annop-w, Anthony Roberts @anthony-linaro, Crefeda Rodrigues @cfRod, David Svantesson @davsva01, Fadi Arafeh @fadara01, Ilya Lavrenov @ilya-lavrenov, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Milos Puzovic @milpuz01, RambabuSwargam @RambabuSwargam, Sai Teja @saiteja13427, Taiju Tsuiki @tzik. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v2.7.4

Published by vpirogov over 1 year ago

This is a patch release containing the following changes to v2.7.3:

  • Fixed potential NaN issue in convolution weight gradient on Intel CPUs (6d80bb48714f8f8d030f055435f5bfde3a382f15, 4c34f89653259b2e15e277ff0663d6705f093e1b, 017950a16168640764d17558e41010d0ae038377, 796a600c3de2993b5d5819995ad13eb70d097496)
  • Improved bfloat16 convolution weight gradient performance for processors with Intel AMX support (21bdc21f37ff835b9ce54d4b713d7bfd65060e30, 82cb7d37f861a471215b242e8df0330523cdf223, b2e948f931367c81a6887d4e0e544a9f50dcd673, 0a33f70c1b283d18631d299d3c907743d215e80d, ff05d0e8c2db056b0857bcbed22c5097f76529da)
  • Fixed out of bounds writes in bfloat16 inner product weight gradient for processors with Intel AMX support (caead724fc6d309c7706760a520908e28b8f8b0b)
  • Fixed illegal instruction in matmul for processors with Intel AMX support (be942a240e775dfda47bfff5622106851df218e5, 28ddb5bc91b01e266575047a676569c4af35a5eb, d264ba494a9f6b15d3eb21ec26e4606dd8d458c8)
  • Fixed segfault in convolution with depthwise post-op for processors with Intel SSE4.1 support (f7081009737b836f23ef8adce70994815acfa842)
  • Worked around segfaults for builds with Intel C/C++ Compiler 2021 for macOS (1382605c20bcdac9aa17c62cc38924138bc57db1)
  • Fixed segfault in bfloat16 convolution with strides for processors with Intel AMX support (c3b1dcd2605cae5609d7175fcf5223da16e03fb9)
  • Fixed correctness issue in int8 convolution with zero points for processors with Intel AMX support (5e76d8b07a431051b7d6a612c4933e36621fbc39)
  • Fixed assertion fail in int8 convolution for processors with Intel AMX support (05629a5ccfae9250e6495ffc7d51152025fcfee1)
  • Fixed incorrect results in vanilla GRU for Intel CPUs (2089770c4818be8933c5e9d1dd3cbaeba1457667)
  • Improved bfloat16 convolution performance for cases with a large number of channels and spatial dimensions (c67f46b0df29c3a7c6cbd0a9f1ebbc9adf4457e8, c9cb51d6bfb68aee8377e7781a5c4512f6aa4bea, 4e2c5730426422fc362c02a963b66072c083acaf, 474527f47acb1aeff2bf52efd64e09ac95d8ef5b, 87e8ea9d8e0499b19149c69748ef8503ad2fb75b)
  • Fixed an issue with incorrect header file locations when using oneDNN as a subproject (be6abca883303e0cb4d2edac28da929a21d5d2a2)
oneDNN - v3.1

Published by harrymao2022 over 1 year ago

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable processor (formerly Sapphire Rapids).
    • Introduced initial optimizations for future Intel Xeon Scalable processor (code-named Sierra Forest). The functionality is disabled by default and should be enabled via CPU dispatcher control.
  • Intel Graphics Products:

    • Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
    • Improved concat primitive performance with per-argument scales on Intel GPUs.
  • AArch64-based Processors:

    • Improved layer normalization primitive performance with Compute Library for the Arm Architecture (ACL).
  • AMD GPUs:

    • Introduced optimized matmul implementation.
  • RISC-V-based Processors:

    • Improved pooling primitive performance for processors with RISC-V vector extension (RVV) support.

Functionality

  • Enabled Graph API as a production feature. Graph API is intended to simplify oneDNN integration into frameworks; a minimal hedged sketch follows this list.
  • Added an option to zero-out weight gradient in RNN primitive. See details in the corresponding RFC.
  • [experimental] Added support for sparse memory and dense-by-sparse matrix-matrix multiplication in the matmul primitive. The functionality is supported on processors with Intel AVX2 and Intel AVX-512 instruction support.
  • Introduced out-of-order queues support for OpenCL runtime. See the OpenCL Interoperability section in the Developer Guide for more details.
  • Added support for the non-zero alpha parameter in the batch normalization ReLU post-op on Intel GPUs.
  • Enabled the layer normalization primitive with f64 datatype support on Intel GPUs.
  • Added support for per-argument scales in matmul, convolution, inner product, and reorder primitives on NVIDIA GPUs.
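
A minimal hedged sketch of the now-production Graph API: build a graph, finalize it, and ask for partitions. The names (graph, logical_tensor, op, get_partitions) are assumed from the v3.1 C++ headers; partition compilation and execution are elided.

```cpp
// Hedged sketch: minimal Graph API flow (names assumed from v3.1 headers).
#include "oneapi/dnnl/dnnl_graph.hpp"
using namespace dnnl::graph;

int main() {
    graph g(dnnl::engine::kind::cpu);

    logical_tensor src(0, logical_tensor::data_type::f32, {8, 64},
            logical_tensor::layout_type::strided);
    logical_tensor dst(1, logical_tensor::data_type::f32, {8, 64},
            logical_tensor::layout_type::strided);

    // A single ReLU op; real graphs add many ops before finalizing.
    op relu(2, op::kind::ReLU, {src}, {dst}, "relu0");
    g.add_op(relu);
    g.finalize();

    // Each partition would then be compiled and executed against real tensors.
    auto parts = g.get_partitions();
    return parts.empty() ? 1 : 0;
}
```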

Validation

  • Extended benchdnn with functional and performance validation for Graph API.

Breaking Changes

  • Builds with OpenCL runtime will fail unless Graph API is disabled with ONEDNN_BUILD_GRAPH=OFF.

Known Issues and Limitations

  • Graph API constant cache feature is disabled with SYCL CPU runtime due to an issue with the oneAPI DPC++ Compiler runtime. This will result in lower performance for some scenarios.

Thanks to the Contributors

This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, Annop Wongwathanarat @annop-w, @arlesniak, @bdmoore1, Crefeda Rodrigues @cfRod, David Svantesson @davsva01, Fadi Arafeh @fadara01, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Pavel Zamelin @pazamelin, Pawel Piotrowicz @pawelpiotrowicz, Peter Caday @petercad, @ranzhejiang, and Sanchit Grover @sanchit-grover-intel. We would also like to thank everyone who asked questions and reported issues.

oneDNN - graph-v0.9

Published by vpirogov over 1 year ago

This is the Beta Update 3 release of oneDNN Graph API based on oneDNN v3.0.1.

Performance Optimizations

  • Improved multi-layer perceptron (MLP) and residual block subgraphs performance with oneDNN Graph Compiler backend on 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
  • Improved dynamic shape performance for MLP and multi-head attention (MHA) patterns with oneDNN Graph Compiler backend.
  • Improved performance of oneDNN Graph Compiler built-in code generator.

Functionality

  • Extended the set of multi-head attention (MHA) variants supported by oneDNN Graph Compiler.

Known Issues and Limitations

  • The weight’s opaque layout can be queried only from a compiled partition.

Thanks to the Contributors

This release contains contributions from the project core teams as well as Jiong Gong, Chunyuan Wu, Sanchit Jain, Yiqiang Li, Yunfei Mao, Kiefer Kuah and others.

oneDNN - v3.1-rc

Published by harrymao2022 over 1 year ago

This is a release candidate for oneDNN v3.1. Please provide feedback and submit defect reports via GitHub issues.

Performance Optimizations

  • Intel Architecture Processors:

    • Improved performance for 4th generation Intel Xeon Scalable processor (formerly Sapphire Rapids).
    • Introduced initial optimizations for future Intel Xeon Scalable processor (code-named Sierra Forest). The functionality is disabled by default and should be enabled via CPU dispatcher control.
  • Intel Graphics Products:

    • Improved performance for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
    • Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and Intel Data Center GPU Flex Series (formerly Arctic Sound-M).
    • Improved concat primitive performance with per-argument scales on Intel GPUs.
  • AArch64-based Processors:

    • Improved layer normalization primitive performance with Compute Library for the Arm Architecture (ACL).
  • AMD GPUs:

    • Introduced optimized matmul implementation.
  • RISC-V-based Processors:

    • Improved pooling primitive performance for processors with RISC-V vector extension (RVV) support.

Functionality

  • Enabled Graph API as a production feature. Graph API is intended to simplify oneDNN integration into frameworks.
  • Added an option to zero-out weight gradient in RNN primitive. See details in the corresponding RFC.
  • [experimental] Added support for sparse memory and dense-by-sparse matrix-matrix multiplication in the matmul primitive. The functionality is supported on processors with Intel AVX2 and Intel AVX-512 instruction support.
  • Introduced out-of-order queues support for OpenCL runtime. See the OpenCL Interoperability section in the Developer Guide for more details.
  • Added support for the non-zero alpha parameter in the batch normalization ReLU post-op on Intel GPUs.
  • Enabled the layer normalization primitive with f64 datatype support on Intel GPUs.
  • Added support for per-argument scales in matmul, convolution, inner product, and reorder primitives on NVIDIA GPUs.

Validation

  • Extended benchdnn with functional and performance validation for Graph API.

Breaking Changes

  • Builds with OpenCL runtime will fail unless Graph API is disabled with ONEDNN_BUILD_GRAPH=OFF.

Known Issues and Limitations

  • Graph API constant cache feature is disabled with SYCL CPU runtime due to an issue with the oneAPI DPC++ Compiler runtime. This will result in lower performance for some scenarios.

Thanks to the Contributors

This release contains contributions from the project core team as well as Amy Wignall @AmyWignall-arm, Annop Wongwathanarat @annop-w, @arlesniak, @bdmoore1, Crefeda Rodrigues @cfRod, David Svantesson @davsva01, Fadi Arafeh @fadara01, Jonathan Deakin @jondea, Kentaro Kawakami @kawakami-k, Pavel Zamelin @pazamelin, Pawel Piotrowicz @pawelpiotrowicz, Peter Caday @petercad, @ranzhejiang, and Sanchit Grover @sanchit-grover-intel. We would also like to thank everyone who asked questions and reported issues.