ILGPU

ILGPU JIT Compiler for high-performance .Net GPU programs

OTHER License

Stars
1.3K
Committers
39

ILGPU - Release v0.10.1

Published by m4rs-mt over 3 years ago

The new stable version contains several bug fixes and improves the code quality of the generated kernel programs (get the NuGet package).

It is strongly recommended to upgrade to this version as soon as possible to avoid known bugs and some CPU-buffer deallocation issues.

Changes

  • Added CopySign intrinsic (#438).
  • Added intrinsic mappings for BitConverter functions (#437).
  • Added call stack recording during compilation for error reporting (#436).
  • Gracefully fail when loading symbols from in-memory assemblies (#435).
  • Fixed invalid detection of loop bodies (#452).
  • Fixed incorrect assertion on repeating successors (#447).
  • Fixed emitting switch statement with constant condition (#442).
  • Fixed invalid disposal of CPU buffers (#440).
  • Fixed applications blocking during tear-down by changing Accelerator GC thread to run in the background (#439).
  • Fixed bounds check on large views (#433).
  • Fixed retrieving field from structure types (#426).
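
As a rough illustration of what the CopySign intrinsic and the BitConverter mappings compute, here is a minimal Python sketch. It is not ILGPU code. The helper names are hypothetical and merely mirror the .NET methods Math.CopySign, BitConverter.SingleToInt32Bits and BitConverter.Int32BitsToSingle; the release note does not say which BitConverter functions are mapped, so that choice is an assumption.

```python
import math
import struct


def copy_sign(magnitude: float, sign: float) -> float:
    """Return `magnitude` carrying the sign bit of `sign` (IEEE 754 copysign)."""
    return math.copysign(magnitude, sign)


def single_to_int32_bits(value: float) -> int:
    """Reinterpret the 32-bit float encoding of `value` as a signed 32-bit int."""
    return struct.unpack("<i", struct.pack("<f", value))[0]


def int32_bits_to_single(bits: int) -> float:
    """Reinterpret a signed 32-bit int as a 32-bit float (inverse of the above)."""
    return struct.unpack("<f", struct.pack("<i", bits))[0]
```

Note that copy_sign also honors the sign of negative zero, which a plain comparison such as `sign < 0` would miss.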

Special thanks

Special thanks to @MoFtZ, @marcin-krystianc and @jgiannuzzi for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

ILGPU - Release v0.10.0

Published by m4rs-mt over 3 years ago

The new stable version offers significant performance improvements of the generated kernel programs and contains critical resource deallocation fixes (get the NuGet package).

It is strongly recommended to upgrade to this version as soon as possible to avoid resource and GC related deallocation issues.

Breaking changes

  • The inheritance hierarchy of the ExchangeBuffer class has been changed to avoid exposing internal memory buffers. If you previously relied on the immediate inheritance from ExchangeBufferBase on MemoryBuffer, you have to adapt your program to use the intermediate base class MemoryBuffer<T, TIndex> instead (see diff).
  • Properties exposing internal memory buffers of the high-level MemoryBufferXD classes have been removed to avoid ownership related GC-free issues (see diff).

Why are there breaking changes?

We have decided to remove dangerous properties from several memory buffer classes. The use of these properties can lead to program crashes, since buffers could be disposed asynchronously in the background by the GC without further notice.

Changes

  • Improved performance of kernel launchers by passing packed argument structures (#358, #372).
  • Graduated several optimizations from O2 to O1 (release mode) to improve performance in release builds using an additional set of stable optimization passes (#344).
  • Graduated O2 optimizations in the Cuda backend to O1 pipeline to generate vectorized IO operations in release builds (#350).
  • Added support for managed sizeof IL instruction (#380).
  • Added PrintInformation method to Accelerator instances to print detailed accelerator information (#389).
  • Added enhanced assertions and out-of-bounds checks to all ArrayView accesses on GPU devices (use the flag ContextFlags.EnableAssertions or attach a debugger to your application to enable assertion checks; make sure to use the portable debug information format for detailed source location information) (#375).
  • Added support for printf-like output in Kernels for CPU, Cuda and OpenCL accelerators (#342).
  • Added new utility Launch/LaunchAutoGrouped methods to immediately launch kernels using a separate strong-reference cache (#336).
  • Added new AlignTo alignment methods to explicitly align ArrayView instances to a particular alignment in bytes (#316).
  • Added enhanced support for local memory via a new LocalMemory class (#316).
  • Added support for several PopCount, CLZ and CTZ operations (#324).
  • Added new MemSet functions to all memory buffers (#338).
  • Added new IfConditionalConversion to fold nested and-also and or-else block chains to O2 pipeline (#328).
  • Added new local memory optimizations to simplify array accesses (#317).
  • Added simple 64-bit-based LongGlobalIndex helper to simplify correct computations using 64-bit integers (#337).
  • Added new CLPlatformVersion and fixed OpenCL 1.2 compatibility issues (#335).
  • Removed support for .NET Core 2.0 (#353).
  • Prevent using SharedMemory in implicitly grouped kernels (#354).
  • Prevent using CudaAccelerator and CLAccelerator instances to run on non-native OS .NET versions (#396).
  • Fixed critical GC-related resource deallocation issues (#376, #393).
  • Fixed returning correct length of dynamic shared memory buffers (#357).
  • Fixed invalid alignment information in the presence of reinterpret casts (#386).
  • Fixed invalid address computations of fixed array buffers (#361).
  • Fixed invalid PTX calling convention (#362).
  • Fixed edge cases in LoopUnrolling (#373).
  • Fixed invalid printf formats for int64 and uintX types (#391).
  • Fixed invalid DebugArrayView implementations (#345).
  • Fixed invalid initializations of local memory arrays (#287).
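
To show why a 64-bit global index helper is useful, the following Python sketch emulates 32-bit wrap-around when combining a grid index with a group size. It is only an illustration: the function names are hypothetical, and the formula (grid index times group dimension plus group index) is an assumption about what such a helper computes, not a description of ILGPU's LongGlobalIndex implementation.

```python
INT32_MASK = 0xFFFFFFFF


def wrap_int32(value: int) -> int:
    """Wrap an arbitrary integer to signed 32-bit two's complement."""
    value &= INT32_MASK
    return value - (1 << 32) if value & 0x80000000 else value


def global_index_32(grid_idx: int, group_dim: int, group_idx: int) -> int:
    """Global index computed with 32-bit arithmetic; may silently overflow."""
    return wrap_int32(wrap_int32(grid_idx * group_dim) + group_idx)


def global_index_64(grid_idx: int, group_dim: int, group_idx: int) -> int:
    """Global index computed with wide arithmetic; no overflow for these ranges."""
    return grid_idx * group_dim + group_idx
```

With 2^21 groups of 1024 threads, the 32-bit variant already yields a negative index, while the wide variant returns the intended position.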

Major internal changes:

  • Removed singleton instance of RuntimeSystem to avoid concurrency/reflection-API issues (#393).
  • Updated default optimizations for ILGPU debug builds (#384).
  • Added support for unit tests running on .NET Framework 4.7 (#355).
  • Migrated from FxCop analyzers to .NET analyzers (#352).
  • Redesigned internal address-space inference passes (#364).

Special thanks

Special thanks to @MoFtZ, @Ruberik and @jgiannuzzi for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

ILGPU - Release v0.10.0-beta2

Published by m4rs-mt over 3 years ago

This new beta version offers important bug fixes, performance improvements of the generated kernel programs and a set of new features (get the NuGet package).

  • Improved performance of kernel launchers by passing packed argument structures (#358, #372).
  • Added support for managed sizeof IL instruction (#380).
  • Added PrintInformation method to Accelerator instances to print detailed accelerator information (#389).
  • Added enhanced assertions and out-of-bounds checks to all ArrayView accesses on GPU devices (use the flag ContextFlags.EnableAssertions or attach a debugger to your application to enable assertion checks; make sure to use the portable debug information format for detailed source location information) (#375).
  • Removed support for .NET Core 2.0 (#353).
  • Prevent using SharedMemory in implicitly grouped kernels (#354).
  • Prevent using CudaAccelerator and CLAccelerator instances to run on non-native OS .NET versions (#396).
  • Fixed critical GC-related resource deallocation issues (#376, #393).
  • Fixed returning correct length of dynamic shared memory buffers (#357).
  • Fixed invalid alignment information in the presence of reinterpret casts (#386).
  • Fixed invalid address computations of fixed array buffers (#361).
  • Fixed invalid PTX calling convention (#362).
  • Fixed edge cases in LoopUnrolling (#373).
  • Fixed invalid printf formats for int64 and uintX types (#391).

Major internal changes:

  • Removed singleton instance of RuntimeSystem to avoid concurrency/reflection-API issues (#393).
  • Updated default optimizations for ILGPU debug builds (#384).
  • Added support for unit tests running on .NET Framework 4.7 (#355).
  • Migrated from FxCop analyzers to .NET analyzers (#352).
  • Redesigned internal address-space inference passes (#364).

Special thanks to @MoFtZ and @Ruberik for their contributions to this release and to the entire ILGPU community for providing feedback, submitting issues and feature requests.

ILGPU - Release v0.9.0

Published by m4rs-mt almost 4 years ago

This new stable version offers significant performance and code quality improvements of the generated kernel programs.

  • Fixed invalid range checks in memory buffer implementations.
  • Fixed invalid 32-bit offsets in memory buffer implementations.
  • Fixed if-conversion transformation generating invalid programs in some cases (#232, #233).
  • Fixed code-analyses issues that could cause invalid analysis results (#220).
  • Added support for 64-bit length buffers and views (#196, #210, #215, #216).
    Note that this feature includes breaking changes that might affect existing code bases. Please refer to the upgrade guide for more information.
  • Added new if-conversion transformation to improve performance (#183).
  • Added support for 16-bit float (Half) types (#180, #208).
  • Added initial support for fixed array buffers (#200).
  • Added support for non-capturing lambda kernels (#79, #136).
  • Added support for multidimensional ExchangeBuffers (#148).
  • Extended ExchangeBuffers to support conversions to Span and Memory instances (#122).
  • Fixed invalid lowering of arrays in divergent control flow (#201).
  • Fixed invalid handling of prefixed IL instructions (#204, #211).
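
To give a feel for the precision trade-off of 16-bit floats, here is a small Python sketch that rounds a value through the IEEE 754 binary16 encoding using the standard struct module. It says nothing about how ILGPU implements its Half type; the helper name to_half is hypothetical.

```python
import struct


def to_half(value: float) -> float:
    """Round `value` to the nearest IEEE 754 binary16 value and return it as a float.

    Raises OverflowError if `value` is too large to be represented as a half.
    """
    return struct.unpack("<e", struct.pack("<e", value))[0]
```

A half carries an 11-bit significand, so integers above 2048 are no longer all representable and 0.1 is stored only approximately; the largest finite value is 65504.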

Special thanks to @MoFtZ, @Yey007 and @jgiannuzzi for contributing to this release.

ILGPU - Release v0.9.0-beta1

Published by m4rs-mt almost 4 years ago

  • Added support for 64-bit length buffers and views (#196, #210, #215, #216).
    Note that this feature includes breaking changes that might affect existing code bases. Please refer to the upgrade guide for more information.
  • Added new if-conversion transformation to improve performance (#183).
  • Added support for 16-bit float (Half) types (#180, #208).
  • Added initial support for fixed array buffers (#200).
  • Added support for non-capturing lambda kernels (#79, #136).
  • Added support for multidimensional ExchangeBuffers (#148).
  • Extended ExchangeBuffers to support conversions to Span and Memory instances (#122).
  • Fixed invalid lowering of arrays in divergent control flow (#201).
  • Fixed invalid handling of prefixed IL instructions (#204, #211).

Special thanks to @MoFtZ, @Yey007 and @jgiannuzzi for contributing to this release.

ILGPU - Release v0.8.1.1

Published by m4rs-mt almost 4 years ago

The new stable version offers significant performance and code quality improvements of the generated kernel programs.

  • Fixed issues related to Trace and Debug asserts (#176).
  • Improved compile-time performance by up to 4X (#110).
  • Reduced memory footprint by up to 3X (#109, #118).
  • Added new optimization level O2 to enable expensive and aggressive optimizations (#70, #110, #111, #121).
  • No compiler release builds in Nuget package to improve runtime performance (#130).
  • Added new IR verifier that can be enabled via ContextFlags.EnableVerifier (#121).
  • Added generation of vectorized instructions to PTX backend (#111).
  • Fixed critical code-generation issue on Unix platforms (#116).
  • Added dynamic shared memory support for all platforms (#97, #98).
  • Added new KernelInfo objects to kernel loaders in order to query detailed kernel statistics (e.g. amount of local memory in bytes) (#104).

ILGPU - Release v0.8.1-beta1

Published by m4rs-mt almost 4 years ago

The new beta version offers significant performance improvements of the generated kernel programs.

  • Improved compile-time performance by up to 4X (#110).
  • Reduced memory footprint by up to 3X (#109, #118).
  • Added new optimization level O2 to enable expensive and aggressive optimizations (#70, #110, #111, #121).
  • No compiler release builds in Nuget package to improve runtime performance (#130).
  • Added new IR verifier that can be enabled via ContextFlags.EnableVerifier (#121).
  • Added generation of vectorized instructions to PTX backend (#111).
  • Fixed critical code-generation issue on Unix platforms (#116).
  • Added dynamic shared memory support for all platforms (#97, #98).
  • Added new KernelInfo objects to kernel loaders in order to query detailed kernel statistics (e.g. amount of local memory in bytes) (#104).

ILGPU - Release v0.8.0

Published by m4rs-mt almost 4 years ago

The new stable version offers significant performance and code quality improvements of the generated kernel programs.

  • Added support for on-the-fly specialization of kernels using dynamic partial evaluation.
  • Added support for dynamic shared memory (CPU & Cuda backends).
  • Added new KernelConfig structure to specify launch dimensions for explicitly grouped kernels.
  • Added new Index1 structure to avoid name clashes with new System.Index structure.
  • Added additional tuple conversion methods to Index2 and Index3 types.
  • Added new EntryPointDescription structure to specify an entry point and its index type.
  • Added RuntimeKernelConfig structure to combine static and dynamic information about a particular kernel launch.
  • Added support for linear arrays in local memory.
  • Added support for enum-value interop (#66).
  • Reworked explicitly grouped kernel launchers to use the new KernelConfig structure instead of GroupedIndex types.
  • Simplified static Grid and Group properties.
  • Removed all GroupedIndex types.
  • Updated the whole compilation pipeline to enable more aggressive optimizations.
  • Significantly improved performance of emitted PTX and OpenCL code by enabling more aggressive optimizations and clever code generation (#70).
  • Added support for "unmanaged" C# structures in the scope of buffers and views.
  • Reworked PTX backend to support all API changes and to fix several critical code-generation issues. This also includes emission of PTX instructions that mimic the Cuda compiler (#68).
  • Reworked OpenCL backend to support all API changes and to fix several critical code-generation issues (#67, #72, #73, #74, #78, #85, #88, #91, #92).
  • New debug information input module to support the latest PDB format updates.
  • Considerably improved error messages using debug information (#86).
  • Reduced memory consumption during the compilation process.
  • Performance improvements of the internal compilation pipeline.
  • Improved performance of kernel launchers.
  • Extended CudaAPI to support page-locked host-memory allocation functions.
  • Extended ExchangeBuffer to use new page-locked memory allocation (if available).
  • Added new IR-rewriter API to perform more advanced IR transformations.
  • Adapted all existing transformations to use the new rewriter API.
  • Reduced memory consumption of all nodes by compressing information.
  • Redesigned several IR nodes to support global program transformations.
  • Reworked implementation of GetSubView in the context of generic and multidimensional array views (#19).
  • Fixed several issues in the scope of address-space inference.
  • Fixed critical code generation issues that could occur when replacing values.
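
The idea behind on-the-fly kernel specialization can be sketched in a few lines of Python: when a parameter value is known at launch time, a variant of the function is generated for that value and cached for reuse. This is a conceptual analogue only. ILGPU performs specialization inside its compiler pipeline, and the function name specialize_scale below is hypothetical.

```python
from functools import lru_cache
from typing import Callable, List


@lru_cache(maxsize=None)
def specialize_scale(factor: int) -> Callable[[List[int]], List[int]]:
    """Return a scaling function specialized for a known constant `factor`.

    Special cases are folded when the variant is created, so the returned
    function does not re-check `factor` on every call.
    """
    if factor == 0:
        return lambda data: [0] * len(data)
    if factor == 1:
        return lambda data: list(data)
    return lambda data: [x * factor for x in data]
```

Requesting the same constant twice returns the cached variant, which loosely corresponds to reusing an already compiled specialized kernel.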

Special thanks to @MoFtZ for contributing to this release.

ILGPU - Release v0.8.0-beta3

Published by m4rs-mt almost 4 years ago

  • Considerably improved error messages using debug information (#86).
  • Reduced memory consumption during the compilation process.
  • Performance improvements of the internal compilation pipeline.
  • Added support for "unmanaged" C# structures in the scope of buffers and views.
  • New debug information input module to support the latest PDB format updates.
  • Fixed several OpenCL code generation issues (#85, #88, #91, #92).

Special thanks to @MoFtZ for contributing to this release.

ILGPU - Release v0.8.0-beta2

Published by m4rs-mt almost 4 years ago

  • Significantly improved performance of emitted PTX and OpenCL code by enabling more aggressive optimizations and clever code generation (#70).
  • Improved performance of kernel launchers.
  • Added support for linear arrays in local memory.
  • Added support for enum-value interop (#66).
  • Reworked PTXBackend to support all API changes and to fix several critical code-generation issues. This also includes emission of PTX instructions that mimic the Cuda compiler.
  • Reworked OpenCL backend to support all API changes and to fix several critical code-generation issues (#72, #73, #74, #78).
  • Updated the whole compilation pipeline to enable more aggressive optimizations.
  • Added new IR-rewriter API to perform more advanced IR transformations.
  • Adapted all existing transformations to use the new rewriter API.
  • Reduced memory consumption of all nodes by compressing information.
  • Redesigned several IR nodes to support global program transformations.

Special thanks to @MoFtZ for contributing to this release.

ILGPU - Release v0.8.0-beta1

Published by m4rs-mt almost 4 years ago

  • Added support for on-the-fly specialization of kernels using dynamic partial evaluation.
  • Added support for dynamic shared memory (CPU & Cuda backends).
  • Added new KernelConfig structure to specify launch dimensions for explicitly grouped kernels.
  • Reworked explicitly grouped kernel launchers to use the new KernelConfig structure instead of GroupedIndex types.
  • Simplified static Grid and Group properties.
  • Added new Index1 structure to avoid name clashes with new System.Index structure.
  • Added additional tuple conversion methods to Index2 and Index3 types.
  • Added new EntryPointDescription structure to specify an entry point and its index type.
  • Added RuntimeKernelConfig structure to combine static and dynamic information about a particular kernel launch.
  • Removed all GroupedIndex types.
  • Extended PTXInstructions to support bool-based IOs in PTXBackend (#68).
  • Extended ExchangeBuffer to use new page-locked memory allocation (if available).
  • Extended CudaAPI to support page-locked host-memory allocation functions.
  • Reworked implementation of GetSubView in the context of generic and multidimensional array views (#19).
  • Fixed several issues in the scope of address-space inference.
  • Fixed critical code generation issues that could occur when replacing values.
  • Fixed invalid pointer types in the scope of AtomicCAS operations on AMD hardware (#67).

ILGPU - Release v0.7.1

Published by m4rs-mt almost 4 years ago

  • Added extension method to load the effective address for Cuda and CPU-based array views.
  • Added support for data blocks (value containers) to ease interop with value tuples.
  • Added additional primitive data blocks to simplify operations on tuples consisting of primitive values.
  • Added new ExchangeBuffer class to simplify memory transfers between CPU and GPU memory.
  • Fixed invalid sub-group extension name in CLAccelerator.
  • Fixed invalid association of supported and unsupported CL accelerators.
  • Removed obsolete dispose functionality from AcceleratorId classes.
  • Fixed OpenCL code generator for float values that are assigned integer values.
  • Fixed invalid creation of kernel interop types in OpenCL backend.
  • Made ABI thread safe to support concurrent queries of size/alignment information.

ILGPU - Release v0.7.0

Published by m4rs-mt almost 4 years ago

  • Added support for .Net Standard 2.1.
  • Added support for OpenCL-compatible GPUs (beta).
  • Added parallel code generation in backends to improve code-generation speed.
  • Added minimum CUDA driver version detection.
  • Enabled adaptive shared-memory allocation in CPUAccelerator.
  • Added new Utility.Select method that can be used to create highly-efficient select instructions in favor of if branches.
  • Added support to access Grid and Group indices via properties.
  • Added support for generic Warp intrinsics that will be automatically generated by the compiler.
  • Redesigned intrinsic math functions and moved XMath functions to the ILGPU.Algorithms library. Use the new IntrinsicMath class for math functions that are supported on all platforms.
  • Reworked intrinsic functions to allow custom implementations of intrinsics for different backends.
  • Ported project to VS2019 including all static-program analysis checks.
  • Applied general code cleanup to be compliant with the new analysis checks.
  • Redesigned AcceleratorId functionality.
  • Updated CudaMemoryBuffer to support MemSetToZero using alternate streams.
  • Fixed retrieving version number of ILGPU assembly.
  • Fixed non-deterministic generation of Phi mappings.
  • Fixed invalid loading of small basic types onto the evaluation stack.
  • Added utility property to Accelerator to resolve a launch extent with the maximum number of groups.
  • Fixed invalid shared-memory allocation within non-kernel functions in PTXBackend.
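
The difference between a branch and a select instruction can be illustrated with a small Python sketch that picks one of two integers without an if statement. This is not the ILGPU Utility.Select method itself; the function below is a hypothetical stand-in that only demonstrates the branch-free idea on integers.

```python
def select(condition: bool, if_true: int, if_false: int) -> int:
    """Return `if_true` when `condition` holds, else `if_false`, without branching.

    The condition becomes an all-ones (-1) or all-zeros (0) mask that keeps
    exactly one of the two operands.
    """
    mask = -int(condition)
    return (if_true & mask) | (if_false & ~mask)


def clamp_non_negative(value: int) -> int:
    """Example use: clamp negative values to zero with a select."""
    return select(value < 0, 0, value)
```

Both operands are always evaluated, which is the usual trade-off of a select: it avoids divergent control flow at the cost of computing the unused value.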

Special thanks to @MoFtZ for contributing to this release.

ILGPU - Release v0.6.0

Published by m4rs-mt almost 4 years ago

Greatly improved ILGPU version that includes significant performance and code quality improvements.

  • Added support for new GeForce RTX cards.
  • Added initial support for arrays in kernels.
  • Added additional 3D indexing functionality to ArrayView types.
  • Added automatic binding of accelerators in advanced multi-GPU scenarios.
  • Tested debugging and profiling capabilities on NVIDIA GPUs.
  • Released test framework to verify generated kernel code.
  • Improved performance of predicates in PTXBackend.
  • Removed strict array-length restriction from allocation nodes.
  • Enhanced generation of get/set field operations.
  • Optimized generation of conditional branches.
  • Fixed invalid generation of predicate barriers in PTXBackend.
  • Fixed invalid register allocation of string types in PTXBackend.
  • Removed explicit tracking of predecessors in phi nodes.
  • Fixed invalid debug assertion in SequencePoint.
  • Fixed invalid alignment of shared-memory allocations in PTXBackend.
  • Fixed invalid shared memory configuration of Cuda kernels.

Special thanks to @MoFtZ and @mikhail-khalizev for contributing to this release.

ILGPU - Release v0.5.1

Published by m4rs-mt almost 4 years ago

Improved version of v0.5 that contains bug fixes and performance improvements and features based on community feedback.

  • Polished error messages and util methods.
  • Fixed invalid DebuggerDisplay attributes on array views.
  • Added support for loading addresses of static fields.
  • Added support to disable kernel caches and automatic disposal of kernels and memory buffers (Community request).
  • Extended kernel loaders with additional overloads.
  • Added support to clear internal caches (Community request).
  • Fixed invalid extent and bounds checks in MemoryBuffer.CopyTo.
  • Fixed invalid initialization of PTX-specific intrinsic functions.
  • Fixed invalid load/store instructions of bytes in PTXBackend.
  • Fixed invalid generation of null values in PTXBackend.

ILGPU - Release v0.5.0

Published by m4rs-mt almost 4 years ago

Note that this has been the first standalone compiler version without any native library dependencies.

  • Removed all native library dependencies (including LLVM).
  • Redesigned huge parts of the compiler.
  • Added conceptually new (experimental) IR.
  • Introduced generic data views in order to generate code for low-level (e.g. PTX) and high-level (e.g. Vulkan) targets.
  • Adapted IL-Frontend to generate IR code instead of LLVM-IR.
  • Added new code-transformation phases to optimize code.
  • Added support for parallel code generation.
  • Adapted support for portable PDBs.
  • Added support for basic line-based GPU debugging and profiling.
  • Added support for jagged arrays. (Community Request)
  • Redesigned support for sub-warp shuffles. (Community Request)
  • Redesigned implicit stream launchers. (Community Request)
  • Fixed several code generation issues. (see GitHub)
  • Redesigned all required transformations and code generators.
  • Redesigned IR in order to significantly improve compilation time and memory consumption.
  • Extended kernel loaders with additional delegate overloads. (Community Request)
  • Fixed invalid loading of debug symbols from dynamic assemblies.

ILGPU - Release v0.3.0 [LLVM based]

Published by m4rs-mt almost 4 years ago

Note that this version is based on the LLVM compiler framework and contains native dependencies. Please use ILGPU >= v0.5 to use the platform independent ILGPU compiler version.

*** This release has been the last LLVM-based compiler version ***

  • Added support for .Net Standard 2.0. (Community Request)
  • Added first support for specializing kernels during compilation.
  • Updated accelerator caching functionality. (Community Request)
  • Improved multithreading support. (Community Request)
  • Added automatic disposal of kernels, memory buffers and accelerator streams. (Community Request)
  • Integrated basic support for portable PDBs.
  • Added native build scripts for Linux operating systems.
  • Added support for Linux operating systems in DLLLoader and PTXBackend.

ILGPU - Release v0.2.1 [LLVM based]

Published by m4rs-mt almost 4 years ago

Note that this version is based on the LLVM compiler framework and contains native dependencies. Please use ILGPU >= v0.5 to use the platform independent ILGPU compiler version.

  • Fixed critical invalid code generation of non-zero (true) branches.

ILGPU - Release v0.2.0 [LLVM based]

Published by m4rs-mt almost 4 years ago

Note that this version is based on the LLVM compiler framework and contains native dependencies. Please use ILGPU >= v0.5 to use the platform independent ILGPU compiler version.

  • Added convenient kernel loading and caching to accelerator classes.
  • Added properties to query the maximum number of threads of an accelerator.
  • Added Disposed event to Accelerator.
  • Added support for Cuda 9.0.
  • Added new cross-platform Cuda API.
  • Added integer-division operators to GPUMath.
  • Added RadToDeg and DegToRad conversion methods to GPUMath.
  • Added support for .Net Core 2.0.
  • Removed LLVMSharp dependency.
  • Enhanced SSA code generation quality.
  • Fixed invalid constant generation of padded structures.
  • Updated CompilerServices.Unsafe dependency to version 4.4.0.

ILGPU - Release v0.1.4 [LLVM based]

Published by m4rs-mt almost 4 years ago

Note that this version is based on the LLVM compiler framework and contains native dependencies. Please use ILGPU >= v0.5 to use the platform independent ILGPU compiler version.

  • Fixed invalid code generation of float-based Atomic.Min/Atomic.Max functions.
  • Added support for nullable types in kernels.
  • Fixed invalid return value of atomic add in CPU mode.
  • Fixed invalid resolving of generic virtual methods in IL frontend.