oneDNN

oneAPI Deep Neural Network Library (oneDNN)

oneDNN - v1.8-rc

Published by anita-intel almost 4 years ago

This is a release candidate for oneDNN v1.8. Please provide feedback and report bugs in GitHub issues.

oneDNN - v1.7

Published by anita-intel almost 4 years ago

Performance optimizations

  • Intel Processor Graphics and Xe architecture-based Graphics:
    • Improved performance of convolutions and matmul primitives.
    • Improved performance of int8 convolutions for NHWC activations format.
  • Intel Architecture processors:
    • Improved performance of primitives for NHWC activations format.
    • Improved fp32 GEMM performance for small N.
    • Improved performance of int8 primitives for processors with Intel SSE4.1 instruction set support.
  • AArch64-based processors:
    • Added support for Arm Performance Libraries (ArmPL). ArmPL provides an optimized GEMM implementation for AArch64.
    • Added support for Arm Compute Library (ArmCL). ArmCL provides an optimized convolution implementation for AArch64.

New Functionality

  • Added support for IBM Z (s390x) and IBM POWER (powerpc64) architectures.
  • Introduced RNN GRU for GPU.
  • Introduced int8 RNN GRU for CPU.
  • Introduced asymmetric quantization support for convolutions and matmul.
  • Introduced dilated pooling support.
  • Extended the matmul primitive to support multiple batch dimensions and broadcast on CPU (see the sketch after this list).
  • (preview) Introduced binary post-op for (de)convolution, pooling, eltwise, binary, inner product, and matmul.
  • (preview) Extended the number of supported post-ops for primitives to 20.
  • (preview) Introduced reduction primitive for CPU. Together with post-ops, this functionality makes it possible to implement normalization.
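
The multi-dimensional batch and broadcast support can be exercised through the regular matmul API. Below is a minimal C++ sketch, assuming the v1.7 dnnl.hpp interface; the shapes are illustrative only, with the size-1 batch dimensions of the weights broadcast across the source batches:

    #include "dnnl.hpp"
    using namespace dnnl;

    int main() {
        engine eng(engine::kind::cpu, 0);
        stream strm(eng);

        // src: a 2x3 batch of 16x32 matrices; weights: one 32x64 matrix
        // broadcast over both batch dimensions; dst: a 2x3 batch of 16x64.
        memory::desc src_md({2, 3, 16, 32}, memory::data_type::f32, memory::format_tag::abcd);
        memory::desc wei_md({1, 1, 32, 64}, memory::data_type::f32, memory::format_tag::abcd);
        memory::desc dst_md({2, 3, 16, 64}, memory::data_type::f32, memory::format_tag::abcd);

        matmul::primitive_desc pd(matmul::desc(src_md, wei_md, dst_md), eng);
        matmul mm(pd);

        memory src(src_md, eng), wei(wei_md, eng), dst(dst_md, eng);
        mm.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei}, {DNNL_ARG_DST, dst}});
        strm.wait();
        return 0;
    }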

Thanks to the contributors

This release contains contributions from the project core team as well as Ben Fitch, Brian Shi, David Edelsohn @edelsohn, Diana Bite @diaena, Moaz Reyad @moazreyad, Nathan John Sircombe @nSircombe, Niels Dekker @N-Dekker, Peter Caday @petercad, Pinzhen Xu @pinzhenx, pkubaj @pkubaj, Tsao Zhong @CaoZhongZ. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v2.0-beta10

Published by anita-intel almost 4 years ago

This is a preview release for oneDNN v2.0. The release is based on oneDNN v1.7.

Binary distribution of this software is available as Intel(R) oneAPI Deep Neural Network Library in Intel(R) oneAPI.

Performance optimizations

  • Intel Processor Graphics and Xe architecture-based Graphics:
    • Improved performance of convolutions and matmul primitives.
    • Improved performance of int8 convolutions for NHWC activations format.
  • Intel Architecture processors:
    • Improved performance of primitives for NHWC activations format.
    • Improved fp32 GEMM performance for small N.
    • Improved performance of int8 primitives for processors with Intel SSE4.1 instruction set support.
  • AArch64-based processors:
    • Added support for Arm Performance Libraries (ArmPL). ArmPL provides an optimized GEMM implementation for AArch64.
    • Added support for Arm Compute Library (ArmCL, https://github.com/arm-software/ComputeLibrary). ArmCL provides an optimized convolution implementation for AArch64.

New Functionality

  • Added support for IBM Z (s390x) and IBM POWER (powerpc64) architectures.
  • Introduced RNN GRU for GPU.
  • Introduced int8 RNN GRU for CPU.
  • Introduced asymmetric quantization support for convolutions, matmul, and inner product.
  • Introduced dilated pooling support.
  • Extended the matmul primitive to support multiple batch dimensions and broadcast on CPU.
  • (preview) Introduced binary post-op for (de)convolution, pooling, eltwise, binary, inner product, and matmul (see the sketch after this list).
  • (preview) Extended the number of supported post-ops for primitives to 20.
  • (preview) Introduced reduction primitive for CPU. Together with post-ops, this functionality makes it possible to implement normalization.
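
As an illustration of the preview binary post-op, the fragment below attaches an elementwise addition to a matmul through primitive attributes. This is a minimal C++ sketch, assuming the preview post_ops::append_binary API and the DNNL_ARG_ATTR_MULTIPLE_POST_OP argument macro, both of which may change before the final release:

    #include "dnnl.hpp"
    using namespace dnnl;

    int main() {
        engine eng(engine::kind::cpu, 0);
        stream strm(eng);

        // C = A * B, followed by C += D as a fused binary post-op.
        memory::desc a_md({8, 16}, memory::data_type::f32, memory::format_tag::ab);
        memory::desc b_md({16, 32}, memory::data_type::f32, memory::format_tag::ab);
        memory::desc c_md({8, 32}, memory::data_type::f32, memory::format_tag::ab);
        memory::desc d_md = c_md; // second input of the binary post-op

        post_ops po;
        po.append_binary(algorithm::binary_add, d_md);
        primitive_attr attr;
        attr.set_post_ops(po);

        matmul::primitive_desc pd(matmul::desc(a_md, b_md, c_md), attr, eng);
        matmul mm(pd);

        memory a(a_md, eng), b(b_md, eng), c(c_md, eng), d(d_md, eng);
        mm.execute(strm,
                {{DNNL_ARG_SRC, a}, {DNNL_ARG_WEIGHTS, b}, {DNNL_ARG_DST, c},
                        {DNNL_ARG_ATTR_MULTIPLE_POST_OP(0) | DNNL_ARG_SRC_1, d}});
        strm.wait();
        return 0;
    }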

Thanks to the contributors

This release contains contributions from the project core team as well as Ben Fitch, Brian Shi, David Edelsohn @edelsohn, Diana Bite @diaena, Moaz Reyad @moazreyad, Nathan John Sircombe @nSircombe, Niels Dekker @N-Dekker, Peter Caday @petercad, Pinzhen Xu @pinzhenx, pkubaj @pkubaj, Tsao Zhong @CaoZhongZ. We would also like to thank everyone who asked questions and reported issues.

Known Issues and Limitations

  • f32 convolutions may hang sporadically on Intel Processor Graphics Gen11. No workaround available.
  • Pooling, batch normalization, and binary primitives may segfault when executed on Xe architecture-based graphics. No workaround available.
  • oneDNN functionality may corrupt memory and crash the application on GPU with the Level Zero runtime in USM mode on all GPU platforms. As a workaround, use SYCL buffers or the OpenCL runtime:
    export SYCL_BE=PI_OPENCL
  • The matmul primitive may hang on GPU with the Level Zero runtime on Windows. As a workaround, use the OpenCL runtime:
    export SYCL_BE=PI_OPENCL
  • Convolution may hang on GPU for shapes with 3 input channels. No workaround available.
  • Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check for GPU devices being non-Intel. For more control, users can create a DNNL engine by passing a SYCL device and context explicitly (see the sketch after this list).
  • When running GPU kernels that take longer than a certain time (which depends on OS and system settings), the application may appear to hang. Driver or system settings can be configured to disable this timeout and avoid hangs of DPC++ or OpenCL programs, including oneDNN examples:
    o On Linux* (see more details at OpenCL™ Driver for Intel® HD, Iris™, and Iris™ Pro Graphics for Linux):
    $ sudo bash -c 'echo N > /sys/module/i915/parameters/enable_hangcheck'
    o On Windows* (see more details at Timeout Detection and Recovery (TDR) Registry Keys):
    Increase the TdrDelay and TdrDdiDelay values in the registry.
  • See DPC++ limitations that impact the library as well.
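
For the explicit engine creation mentioned above, here is a minimal C++ sketch, assuming the SYCL interop engine constructor available in the v2.0 betas (the interop API has changed across beta releases):

    #include <CL/sycl.hpp>
    #include "dnnl.hpp"

    int main() {
        // Pick a specific GPU device instead of relying on engine indices.
        cl::sycl::device dev{cl::sycl::gpu_selector{}};
        cl::sycl::context ctx{dev};

        // Assumed interop constructor taking an explicit device and context.
        dnnl::engine eng(dnnl::engine::kind::gpu, dev, ctx);
        dnnl::stream strm(eng);
        return 0;
    }
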
oneDNN - v1.6.5

Published by vpirogov almost 4 years ago

This is a patch release containing the following changes to v1.6.4:

  • Fixed issue with memory descriptor size computations (fc836a38713d5a8fd8915f56c16f63b81e3973e2)
  • Reduced required scratchpad size for RNNs (c7e165a541b903c4e3c38dc27b3e7fcc0c1e1294)
  • Improved performance of fp16 convolution with bias on GPUs (943760e66e19dcbdb58ac8d24c0862a289d2c947)
  • Fixed segmentation fault for convolution weight gradient on systems with Intel AVX512 support (85e92b326ef7109ed68a7e9bdaf6113d6f59276d)

oneDNN - v1.6.4

Published by vpirogov about 4 years ago

This is a patch release containing the following changes to v1.6.3:

  • Fixed performance regression in dnnl_sgemm with N=1 (379a216b94393f17a37d5f042323fc923a7553af, f35e9917608925b57bb4e1486f77720f36970aef)
  • Extended matmul to support multiple dimensions and broadcast (0728f265f18448a3375574e622bdd6fcad0d2787)
  • Fixed performance regression for convolution weight gradient implementation for Intel AVX2 (9ab050b0f4a3d434cbb14b7ddb7056736564b9dc, 6cd0c352f9949191dac1938b8f16b53b5967c1ea)
  • Fixed unknown primitive kind assertion on GPU (c95a01cea1bd43445497eae4f1323947bd56c977)
  • Fixed build issue on Windows for the case when oneDNN is built as submodule (2fceddf2f564b729550b288eb2e7bba5523c223e)
  • Fixed issues with NaN results produced by dnnl_sgemm in some scenarios (5ce95efe6f5e86cddbf704b637063cd8dc914125)
  • Improved performance for convolution backpropagation with 1x1 filter and NHWC activations on systems with Intel AVX2 support (74bfc74ccb089c32829ffb1711842f880a1fb99b)
  • Fixed correctness issue for convolution with 3D spatial (bf6ee840bef680223ccdb0c358bfce460f10d371)
  • Fixed potential segmentation fault when destroying RNN primitive (0d9839b085263c0f4f6dcaf95e1bc2618a684297)
  • Fixed performance regression in the Intel AVX512 implementation of fp32 convolutions (668e28289ccf17dad541238155c03a42e99802ba)

oneDNN - v1.7-rc

Published by vpirogov about 4 years ago

This is a release candidate for oneDNN v1.7. Please provide feedback and report bugs in GitHub issues.

oneDNN - v2.0-beta09

Published by anita-intel about 4 years ago

This is a preview release for oneDNN v2.0, delivered as a patch release based on v2.0-beta08.

Binary distribution of this software is available as Intel(R) oneAPI Deep Neural Network Library in Intel(R) oneAPI.

Known Issues and Limitations

  • int8 LSTM cell may produce incorrect results when dimensions exceed 16.
  • oneDNN functions executed on GPU with the Level Zero driver in a Remote Desktop Connection session on Windows may produce incorrect results or hang the application. As a workaround, switch the Intel oneAPI DPC++ Runtime to the OpenCL backend by setting the environment variable SYCL_BE=PI_OPENCL.
  • Average pooling backpropagation may produce incorrect results for 1D spatial on Intel® Processor Graphics Gen9.
  • Optimized primitives can crash or fail for huge spatial sizes on CPU.
  • f32 convolutions may fail sporadically on Intel® Processor Graphics Gen11 due to a known issue in Intel Graphics Compiler.
  • Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check for GPU devices being non-Intel. For more control, users can create a DNNL engine by passing a SYCL device and context explicitly.
  • When running GPU kernels that take longer than a certain time (which depends on OS and system settings), the application may appear to hang. Driver or system settings can be configured to disable this timeout and avoid hangs of DPC++ or OpenCL programs, including oneDNN examples:
    o On Linux* (see more details at OpenCL™ Driver for Intel® HD, Iris™, and Iris™ Pro Graphics for Linux):
    $ sudo bash -c 'echo N > /sys/module/i915/parameters/enable_hangcheck'
    o On Windows* (see more details at Timeout Detection and Recovery (TDR) Registry Keys):
    Increase the TdrDelay and TdrDdiDelay values in the registry.
  • See DPC++ limitations that impact the library as well.

oneDNN - v1.6.3

Published by vpirogov about 4 years ago

This is a patch release containing the following changes to v1.6.2:

oneDNN - v1.6.2

Published by vpirogov about 4 years ago

This is a patch release containing the following changes to v1.6.1:

  • Implemented workaround for running examples using cmake on macOS (089a877733899fc1ac3d0b9028afe0ca2e1675ca)
  • Implemented workaround for internal compiler error when building oneDNN with Microsoft Visual Studio 2019 (c6f9b7a3e5833bfe06580be6c70c7a4e019e3a43)
  • Fixed segfault for grouped convolutions (77e5d5744d522cad984443e92bd1b95a9f55ae85)
  • Fixed segfault for convolutions with 1x1 filter on Intel AVX2 systems (09c18e65a2061bec659d073b8b0dc5f96e9d7312)
  • Fixed segfault for convolutions with 1x1 filter on Intel AVX-512 systems (2c4ad3806e344251d9555eaa02e9a803a652200f)
  • Fixed issue with zero padding in bfloat16 convolutions with NHWC activations (4c05c181b40cf7132f8943411fb3fab1786df0f7)

oneDNN - v1.6.1

Published by vpirogov about 4 years ago

This is a patch release containing the following changes to v1.6:

  • Fixed performance regression for convolutions with 1x1 filter on Intel AVX2 (8186817fa59b97c602944f4ff46ce9b5b63d217c)
  • Fixed invalid memory access issue for bfloat16 1D grouped convolutions (9ebda6517eb5a28d991d126da8d8babaa3d3c4dd)
  • Fixed the "RuntimeError: label is redefined" error for convolutions with large filter size on Intel AVX512 (f974b50ea37571826662f3d1fed7ced8642d6f43)
  • Suppressed MSBuild warning MSB8065 (f91e641b87c625d83b329164b2471a655a880447)
  • Restricted support for shared virtual memory (SVM) to OpenCL 2.0 and later (fa6bbf40f7aba32a9593d3703bfcaa4abd3dd379)
oneDNN - v1.6

Published by anita-intel about 4 years ago

Performance optimizations

Intel Architecture processors

  • Introduced initial int8 optimizations for future Intel Xeon Scalable processors (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via the CPU dispatcher control (see the example after this list).
  • Improved matmul and inner product performance with the bfloat16 data type.
  • Improved performance of the tanh algorithm for the eltwise primitive and LSTM cells.
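
The CPU dispatcher control mentioned above can be exercised through the DNNL_MAX_CPU_ISA environment variable or the corresponding function. A minimal C++ sketch follows; the exact cpu_isa token that unlocks the new int8 instructions is version-dependent, and cpu_isa::avx512_core_vnni below only illustrates the mechanism:

    #include "dnnl.hpp"

    int main() {
        // Cap (or extend) the instruction sets the library may dispatch to.
        // Must be called before any primitive is created; equivalent to
        // setting the DNNL_MAX_CPU_ISA environment variable.
        dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx512_core_vnni);

        // Engines and primitives created after this call respect the cap.
        dnnl::engine eng(dnnl::engine::kind::cpu, 0);
        return 0;
    }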

Intel Processor Graphics and Xe architecture-based Graphics

  • Improved performance of convolution, RNN, inner product, and matmul functionality for all supported GPUs.
  • Improved performance of int8 convolutions with activations in NHWC format for Xe architecture-based Graphics (code named DG1 and Tiger Lake).

AArch64-based processors

  • Added support for the Arm Performance Libraries (ArmPL) to improve performance of functionality relying on GEMM (matmul, inner product, convolutions).

New Functionality

  • Introduced support for processors based on IBM POWER architecture.
  • Introduced Linear-Before-Reset GRU for GPU.
  • Extended the eltwise primitive with support for the round operation (see the sketch after this list).
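
The round operation plugs into the regular eltwise API. A minimal C++ sketch, assuming the algorithm::eltwise_round token introduced in this release; shapes are illustrative only:

    #include "dnnl.hpp"
    using namespace dnnl;

    int main() {
        engine eng(engine::kind::cpu, 0);
        stream strm(eng);

        memory::desc md({64}, memory::data_type::f32, memory::format_tag::a);
        // Round each element to the nearest integer (alpha/beta are unused).
        eltwise_forward::desc d(prop_kind::forward_inference, algorithm::eltwise_round, md, 0.f, 0.f);
        eltwise_forward::primitive_desc pd(d, eng);
        eltwise_forward rnd(pd);

        memory src(md, eng), dst(md, eng);
        rnd.execute(strm, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
        strm.wait();
        return 0;
    }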

Usability

  • Reduced primitive creation time by enabling the OpenCL pre-compiled headers feature available in recent versions of the OpenCL driver.
  • Reduced the entitlement required on macOS with hardened runtime to allow-jit.
  • Extended documentation on runtime and build-time controls for JIT profiler support, primitive cache, CPU dispatcher controls, and verbose mode.

Validation

  • Introduced validation mode for out of memory situations.

Thanks to the contributors

This release contains contributions from the project core team as well as Alberto Gonzalez Palomo @AlbertoGP, Arthur Mitrano @aaraujom, Ilia Taraban @itaraban, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Tsao Zhong @CaoZhongZ. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v2.0-beta08

Published by anita-intel about 4 years ago

This is a preview release for oneDNN v2.0. The release is based on oneDNN v1.6.

Binary distribution of this software is available as Intel(R) oneAPI Deep Neural Network Library in Intel(R) oneAPI.

Performance Optimizations

Intel Architecture processors

  • Introduced initial int8 optimizations for future Intel Xeon Scalable processors (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via the CPU dispatcher control.
  • Improved matmul and inner product performance with the bfloat16 data type.
  • Improved performance of the tanh algorithm for the eltwise primitive and LSTM cells.

Intel Processor Graphics and Xe architecture-based Graphics

  • Improved performance of convolution, RNN, inner product, and matmul functionality for all supported GPUs.
  • Improved performance of int8 convolutions with activations in NHWC format for Xe architecture-based Graphics (code named DG1 and Tiger Lake).

New Functionality

  • Introduced support for processors based on IBM POWER architecture.
  • Introduced Linear-Before-Reset GRU for GPU.
  • Extended the eltwise primitive with support for the round operation.

Usability

  • Reduced primitive creation time by enabling the OpenCL pre-compiled headers feature available in recent versions of the OpenCL driver.
  • Reduced the entitlement required on macOS with hardened runtime to allow-jit.
  • Extended documentation on runtime and build-time controls for JIT profiler support, primitive cache, CPU dispatcher controls, and verbose mode.

Validation

  • Introduced validation mode for out of memory situations.

Known Issues and Limitations

  • RNN functionality does not work with the Level Zero GPU runtime. The workaround is to use the OpenCL GPU runtime by setting SYCL_BE=PI_OPENCL before running a DPC++ program.
  • Optimized primitives can crash or fail for huge spatial sizes on CPU.
  • f32 convolutions may fail sporadically on Intel® Processor Graphics Gen11 due to a known issue in Intel Graphics Compiler.
  • Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check for GPU devices being non-Intel. For more control, users can create a DNNL engine by passing a SYCL device and context explicitly.
  • When running GPU kernels that take longer than a certain time (which depends on OS and system settings), the application may appear to hang. Driver or system settings can be configured to disable this timeout and avoid hangs of DPC++ or OpenCL programs, including oneDNN examples:
    o On Linux* (see more details at OpenCL™ Driver for Intel® HD, Iris™, and Iris™ Pro Graphics for Linux):
    $ sudo bash -c 'echo N > /sys/module/i915/parameters/enable_hangcheck'
    o On Windows* (see more details at Timeout Detection and Recovery (TDR) Registry Keys):
    Increase the TdrDelay and TdrDdiDelay values in the registry.
  • See DPC++ limitations that impact the library as well.

oneDNN - v1.6-rc

Published by anita-intel over 4 years ago

This is a release candidate for oneDNN v1.6. Please provide feedback and report bugs in GitHub issues.

oneDNN - v1.5.1

Published by vpirogov over 4 years ago

This is a patch release containing the following changes to v1.5:

  • Fixed potential crash related to primitive cache (95eff24e7adae32fab844b3fb7dfb9f111441693, 00205d3816349826f72faf3280faeae9a818e563)
  • Fixed correctness issue for Winograd convolution implementation on Intel Xeon Phi processors (f310ded959d009f9ffe70d2c8611da4fb272abc8)
  • Fixed issue with tail processing in channel dimension for depthwise convolution (24eda67cd31fbfea4dd184a32577991ba6b9ea05)
oneDNN - v2.0-beta07

Published by anita-intel over 4 years ago

This is a preview release for oneDNN v2.0. The release is based on oneDNN v1.5.

Binary distribution of this software is available as Intel(R) oneAPI Deep Neural Network Library in Intel(R) oneAPI.

Performance optimizations

Intel Architecture processors

  • Improved performance of convolutional neural networks (CNN) related functionality with NHWC activations on all supported processors
  • Improved binary primitive performance for the broadcast case
  • Improved performance of eltwise primitive backpropagation and corresponding post-ops
  • Improved performance of pooling, resampling, LRN primitives
  • Improved performance of bfloat16 and fp32 weights gradient convolutions with groups
  • Improved performance of int8 convolutions with 1x1 kernel and spatial strides

Intel Processor Graphics and Xe architecture-based Graphics

  • Introduced initial optimizations for Xe architecture-based Graphics (code named DG1 and Tiger Lake).
  • Improved performance of convolutional neural networks (CNN) related functionality with NHWC activations.

New Functionality

  • The Level Zero (L0) GPU runtime is used by default on the Windows* operating system. The OpenCL GPU runtime can still be used if the SYCL_BE environment variable is set to PI_OPENCL before running a DPC++ program, as shown below.
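
For example, before launching the application:

    export SYCL_BE=PI_OPENCL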

Usability

Validation

  • Introduced validation mode to detect out of bounds access.

Known Limitations

  • RNN functionality does not work with the Level Zero GPU runtime. The workaround is to use the OpenCL GPU runtime by setting SYCL_BE=PI_OPENCL before running a DPC++ program.
  • Optimized primitives can crash or fail for huge spatial sizes on CPU.
  • f32 convolutions may fail sporadically on Intel® Processor Graphics Gen11 due to a known issue in Intel Graphics Compiler.
  • Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check for GPU devices being non-Intel. For more control, users can create a DNNL engine by passing a SYCL device and context explicitly.
  • When running GPU kernels that take longer than a certain time (which depends on OS and system settings), the application may appear to hang. Configure the driver to disable this timeout and avoid hangs of DPC++ or OpenCL programs, including DNNL examples.

On Linux:

$ sudo bash -c 'echo N > /sys/module/i915/parameters/enable_hangcheck'

On Windows, increase the TdrDelay and TdrDdiDelay values in the registry.

oneDNN - v1.5

Published by anita-intel over 4 years ago

Performance optimizations

Intel Architecture processors

  • Improved performance of convolutional neural networks (CNN) related functionality with NHWC activations on all supported processors
  • Improved binary primitive performance for the broadcast case
  • Improved performance of eltwise primitive backpropagation and corresponding post-ops
  • Improved performance of pooling, resampling, LRN primitives
  • Improved performance of bfloat16 and fp32 weights gradient convolutions with groups
  • Improved performance of int8 convolutions with 1x1 kernel and spatial strides

Intel Processor Graphics and Xe architecture-based Graphics

  • Introduced initial optimizations for Xe architecture-based Graphics (code named DG1 and Tiger Lake).
  • Improved performance of convolutional neural networks (CNN) related functionality with NHWC activations.

Usability

Validation

  • Introduced validation mode to detect out of bounds access.

Thanks to the contributors

This release contains contributions from the project core team as well as Anuj Mittal @anujm1, Arthur Mitrano @aaraujom, Benjamin Fitch, Ilia Taraban @itaraban, Leona C. @indie, Nathan John Sircombe @nSircombe, Sergey Nesterov @cepera, Tsao Zhong @CaoZhongZ, yuri@FreeBSD @yurivict. We would also like to thank everyone who asked questions and reported issues.

oneDNN - v1.5-rc

Published by anita-intel over 4 years ago

This is a release candidate for oneDNN v1.5. Please provide feedback and report bugs in GitHub issues.

oneDNN - v2.0-beta06

Published by anita-intel over 4 years ago

This is a preview release for oneDNN v2.0. The release is based on oneDNN v1.4.

Binary distribution of this software is available as Intel(R) oneAPI Deep Neural Network Library in Intel(R) oneAPI.

New Functionality

  • The Level Zero (L0) GPU runtime is used by default on Linux. The OpenCL GPU runtime can still be used if the SYCL_BE environment variable is set to PI_OPENCL before running a DPC++ program.

Known Limitations

  • The Level Zero GPU runtime is not supported on Windows OS.
  • RNN functionality does not work with the Level Zero GPU runtime. The workaround is to use the OpenCL GPU runtime by setting SYCL_BE=PI_OPENCL before running a DPC++ program.
  • The Level Zero runtime is enabled by default. Make sure the Level Zero driver is properly installed, including the level-zero-devel package, following the installation guide. If you still encounter runtime issues, set SYCL_BE=PI_OPENCL as a workaround before running a DPC++ program.
  • Optimized primitives can crash or fail for huge spatial sizes on CPU.
  • dnnl_sgemm, dnnl_gemm_u8s8u32, and inner product functionality do not support sizes exceeding 2^32.
  • f32 convolutions may fail sporadically on Intel® Processor Graphics Gen11 due to a known issue in Intel Graphics Compiler.
  • Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check for GPU devices being non-Intel. For more control, users can create a DNNL engine by passing a SYCL device and context explicitly.
  • When running GPU kernels that take longer than a certain time (which depends on OS and system settings), the application may appear to hang. Configure the driver to disable this timeout and avoid hangs of DPC++ or OpenCL programs, including DNNL examples.

On Linux:

$ sudo bash -c 'echo N > /sys/module/i915/parameters/enable_hangcheck'

On Windows, increase the TdrDelay and TdrDdiDelay values in the registry.

oneDNN - v0.21.5

Published by vpirogov over 4 years ago

This is a patch release containing the following changes to v0.21.4:

  • Fixed s8 reorders that did not compute compensation correctly (d446661de2865741b1ad5f35a913feb6953b2592, 7a497726bfaf009eeb92ca62f873b20d53b7a3d9)
  • Fixed potential buffer overflow in int8 convolution scratchpad (8c5c7cf34e1e36a4c47afa506ab3af510423e28e)
  • Fixed segfault for s8 reorders on blocked formats (9497accb06f3d0e4f53ac8719d0f9c6721e5df38, 6f1d0c93bf461be9adcbf25201a8905bd055e478)
  • Fixed correctness in fp32 convolution weight gradient with dilation and padding (503bf57e447b458dd26af03189b21603395c89aa, d00afabbdd8fb67eb07e65b8eb8445934789dfa6)
  • Fixed correctness issue in 1D bfloat16 dilated convolution (481dd391bee2442994db7589e00ddba3044ca682)

oneDNN - v1.4

Published by anita-intel over 4 years ago

Performance optimizations

  • Intel Architecture processors:
    • Improved performance of int8 GEMM, RNN, inner product, matmul and GEMM-based convolution for systems with Intel SSE4.1 and Intel AVX support.
    • Improved performance of eltwise backpropagation on all supported processors.
    • Improved performance of bfloat16 inner product for processors with Intel DL Boost support.
  • Intel Processor Graphics
    • Improved performance of the following functionality with NHWC activations:
      • f32 convolution forward propagation
      • f32 and f16 pooling
      • f32 and f16 batch normalization forward propagation
    • Improved performance of f32 and f16 batch normalization forward propagation and binary primitives

New functionality

  • Introduced support for LSTM cell with projection (LSTMP). The functionality is not implemented for Intel Processor Graphics.
  • Introduced bfloat16 data type support for Softmax and LogSoftmax primitives.

Usability improvements

  • Introduced the threadpool CPU runtime. The new runtime allows running multi-threaded computations with a user-provided threadpool implementation, for instance an Eigen threadpool (see the sketch after this list).
  • Extended the set of examples to cover all primitives supported by the library. New examples are included in the corresponding sections of the Developer Guide.
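
The threadpool runtime delegates parallelism to an object implementing the library's threadpool interface. Below is a minimal serial C++ sketch, assuming the dnnl::threadpool_iface interface from dnnl_threadpool_iface.hpp and the stream-attribute attachment used in this series; names may differ between releases:

    #include <functional>
    #include "dnnl.hpp"
    #include "dnnl_threadpool_iface.hpp"

    // A trivial, serial threadpool that satisfies the assumed interface;
    // a real implementation would dispatch the n jobs to worker threads.
    struct serial_threadpool : public dnnl::threadpool_iface {
        int get_num_threads() const override { return 1; }
        bool get_in_parallel() const override { return false; }
        uint64_t get_flags() const override { return 0; } // synchronous
        void parallel_for(int n, const std::function<void(int, int)> &fn) override {
            for (int i = 0; i < n; ++i) fn(i, n); // run job i out of n
        }
    };

    int main() {
        dnnl::engine eng(dnnl::engine::kind::cpu, 0);
        serial_threadpool tp;

        // Attach the threadpool through stream attributes (assumed API for
        // this series; later releases use a dedicated interop API instead).
        dnnl::stream_attr sattr(dnnl::engine::kind::cpu);
        sattr.set_threadpool(&tp);
        dnnl::stream strm(eng, dnnl::stream::flags::default_flags, sattr);
        return 0;
    }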

Thanks to the contributors

This release contains contributions from the project core team as well as Araujo Mitrano, Arthur @aaraujom, Ilya Taraban @itaraban, Nathan Sircombe @nSircombe, and Sergey Nesterov @cepera. We would also like to thank everyone who asked questions and reported issues.