ILGPU JIT Compiler for high-performance .Net GPU programs
Published by github-actions[bot] about 1 year ago
This new stable release includes bug fixes and internal improvements (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Special thanks to @MoFtZ, @pavlovic-ivan, and @jgiannuzzi for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.
Full Changelog: https://github.com/m4rs-mt/ILGPU/compare/v1.5.0...v1.5.1
Published by github-actions[bot] about 1 year ago
This new stable release includes bug fixes, new utility vector types, a newly introduced meta-optimization-targeted optimization API, and specific sparse-matrix extensions (get the ILGPU Nuget package and ILGPU Algorithms Nuget package). Furthermore, ILGPU now supports nullable annotations on all internal and external APIs.
Special thanks to @gartenkralle, @MoFtZ, @pavlovic-ivan, and @TriceHelix for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @kilngod, @NullandKale, @MPSQUARK, and @Yey007) for providing feedback, submitting issues and feature requests.
Published by github-actions[bot] over 1 year ago
This new beta release includes bug fixes, new utility vector types, a newly introduced meta-optimization-targeted optimization API, and specific sparse-matrix extensions (get the ILGPU Nuget package and ILGPU Algorithms Nuget package). Furthermore, ILGPU now supports nullable annotations on all internal and external APIs.
Special thanks to @gartenkralle, @MoFtZ, @pavlovic-ivan, and @TriceHelix for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @kilngod, @NullandKale, @MPSQUARK, and @Yey007) for providing feedback, submitting issues and feature requests.
Published by github-actions[bot] over 1 year ago
This new stable release includes bug fixes and compatibility with .Net 7.0 [in terms of abstract interface operators and generic math operations] (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Special thanks to @lostmsu, @MoFtZ, and @pavlovic-ivan for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @kilngod, @NullandKale, @MPSQUARK, and @Yey007) for providing feedback, submitting issues and feature requests.
Published by github-actions[bot] over 1 year ago
This new release candidate includes bug fixes and compatibility with .Net 7.0 [in terms of abstract interface operators and generic math operations] (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Special thanks to @MoFtZ for his contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.
Published by github-actions[bot] over 1 year ago
This new release candidate includes bug fixes and compatibility with .Net 7.0 [in terms of abstract interface operators and generic math operations] (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Special thanks to @lostmsu, @MoFtZ, and @pavlovic-ivan for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @kilngod, @NullandKale, @MPSQUARK, and @Yey007) for providing feedback, submitting issues and feature requests.
Published by github-actions[bot] over 1 year ago
This new stable release includes important bug fixes (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
Special thanks to @MoFtZ for his contribution to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.
Published by github-actions[bot] almost 2 years ago
This new stable release includes bug fixes, performance improvements and compatibility with Arm and Arm64 platforms (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
- `IOOperations()` (#818)

Special thanks to @GinkoBalboa, @KosmosisDire, @MoFtZ, @NullandKale, @TortillaZHawaii, @TriceHelix, @jgiannuzzi, @naskio, and @pavlovic-ivan for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @kilngod, @NullandKale, @MPSQUARK, and @Yey007) for providing feedback, submitting issues and feature requests.
Last but not least, we would like to thank our first-time contributors: @GinkoBalboa, @TortillaZHawaii, @TriceHelix, @naskio 🎉
Full Changelog: https://github.com/m4rs-mt/ILGPU/compare/v1.2.0...v1.3.0
Published by github-actions[bot] about 2 years ago
This new beta release includes bug fixes and compatibility with Arm and Arm64 platforms (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
- `IOOperations()` (#818)

Special thanks to @jgiannuzzi, @KosmosisDire, @MoFtZ, @NullandKale, and @pavlovic-ivan for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @kilngod, @NullandKale and @Yey007) for providing feedback, submitting issues and feature requests.
Published by github-actions[bot] over 2 years ago
This new release includes bug fixes and a significantly improved O2 optimization pipeline (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
- `LoopUnrolling` to cover more cases (#766)
- `LibDevice` integration (#784)

Special thanks to @hokb, @jgiannuzzi, @kilngod, @MoFtZ, @pavlovic-ivan and @Ruberik for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @Joey9801, @MPSQUARK, @NullandKale and @Yey007) for providing feedback, submitting issues and feature requests.
Published by github-actions[bot] over 2 years ago
This new beta release includes bug fixes and a significantly improved O2 optimization pipeline (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
- `LoopUnrolling` to cover more cases (#766)
- `LibDevice` integration (#784)

Special thanks to @hokb, @jgiannuzzi, @MoFtZ and @Ruberik for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @Joey9801, @kilngod, @MPSQUARK, @NullandKale and @Yey007) for providing feedback, submitting issues and feature requests.
Published by github-actions[bot] over 2 years ago
This new release includes bug fixes, a huge set of new features (e.g. `LibDevice` integration, `CudaFFT` and `NVML` bindings) and a significantly improved O2 optimization pipeline (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
- `System.Reflection.Metadata` from 6.0.0 to 6.0.1 (#767)
- `NVML` bindings (#518)
- `CuFFT` and `CuFFTW` bindings (#706)
- `NvJpeg` image-decoding bindings (#716, #721)
- `LibDevice` bindings to include highly optimized math functions on NVIDIA GPUs (#707)
- `FP16` support to `CuBlas` bindings (#658)
- `alignment` methods to views to improve performance (#684)
- O2 pipeline (#704, #734)
- `SetField` operations (#671)
- `LoadElementAddress` operations (#733)
- Cuda memcopy operations (#705)
- `CPUMultiprocessor` during lazy initialization (#747)
- `Accelerator` instance (#714)
- `T4.Build` from 0.2.3 to 0.2.4 (#767)
- `FluentAssertions` from 6.5.0 to 6.5.1 (#748)
- `Microsoft.NET.Test.SDK` from 17.0.0 to 17.1.0 (#752)
- `TraversalSuccessorsProvider` (#727)
- `logo` folder (#717)

Special thanks to @Debiday, @jgiannuzzi, @MoFtZ and @Ruberik for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @Joey9801, @kilngod, @mikhail-khalizev, @MPSQUARK, @NullandKale, @RER009 and @Yey007) for providing feedback, submitting issues and feature requests.
Published by github-actions[bot] over 2 years ago
This new beta release includes bug fixes, a huge set of new features (e.g. `LibDevice` integration, `CudaFFT` and `NVML` bindings) and a significantly improved O2 optimization pipeline (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
- `NVML` bindings (#518)
- `CuFFT` and `CuFFTW` bindings (#706)
- `NvJpeg` image-decoding bindings (#716, #721)
- `LibDevice` bindings to include highly optimized math functions on NVIDIA GPUs (#707)
- `FP16` support to `CuBlas` bindings (#658)
- `alignment` methods to views to improve performance (#684)
- O2 pipeline (#704, #734)
- `SetField` operations (#671)
- `LoadElementAddress` operations (#733)
- Cuda memcopy operations (#705)
- `Accelerator` instance (#714)
- `TraversalSuccessorsProvider` (#727)
- `logo` folder (#717)

Special thanks to @Debiday, @jgiannuzzi, @MoFtZ and @Ruberik for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @Joey9801, @kilngod, @mikhail-khalizev, @MPSQUARK, @NullandKale, @RER009 and @Yey007) for providing feedback, submitting issues and feature requests.
Published by github-actions[bot] almost 3 years ago
This new stable release offers major performance improvements, new APIs to simplify programming, improve productivity and reduce programming errors. It also includes a lot of amazing new features (see below and get the Nuget package).
- The `Memory API`, involving the `ArrayView` and `MemoryBuffer` types, has been significantly improved to support explicit `Stride` information (see below).
- The `IndexX` and `LongIndexX` types have been renamed to `IndexXD` and `LongIndexXD` to have a unified programming experience with respect to memory buffers and array views (see below).
- The `Device API` has been redesigned to explicitly enable, filter and configure the available hardware accelerator devices (see below).
- `Memory API` to support explicit stride information (#421, #475, #483)
- `Device API` to enable, filter and configure the available hardware accelerator devices (#428)
- `OpenCL 3.0` API (#464)
- `ProfilingMarker`s (#482)
- `Warp`/`Group`/`Multiprocessor` configurations (#402, #484)
- `IRBuilder` (#477)
- `OpenCL` kernels in the presence of constant switch conditions (#441)
- `.NET 5` to a default target framework (#529, #536)
- `Array` processing pipeline to have full support for nD-arrays (#513)
- `AsNDView` (#571)
- `SubView` operations (#550)
- `UCE` transformation to the backend optimization passes (#569)
- `EnableAlgorithms` on Context builder instances (#515)
- `IndexND` and `LongIndexND` types (#510)
- `InvalidEntryPointIndexParameterOfWrongType` error message to be more descriptive (#535)
- `DllImportSearchPath` to `LegacyBehavior` (#514)
- `Stride` and `ArrayView` types (#509)
- `RadixSortProvider` and `ScanProvider` test cases (#516)
- `feedz.io` (#521, #520)
- `v4.7` to `v4.7.1` (#594)
- `PTX` assembly instructions (#588)
- `CUDA` and `CL` allocations to enable allocations of zero bytes (#547, #610)
- `v4.7` to `v4.7.1` to benefit from the most recent dependency updates (#595)

The new API distinguishes between a coherent, strongly typed `ArrayView<T>` structure and its n-D versions `ArrayViewXD<T, TStride>`, which carry dimension-dependent stride information (the actual logic for computing element addresses is moved from the `IndexXD` types to the newly added `StrideXD` types). This allows developers to explicitly specify a particular stride of a view, reinterpret the data layout itself (by changing the stride), and perform compile-time optimizations based on explicitly typed stride information. Consequently, ILGPU's optimization pipeline is able to remove the overhead of these abstractions in most cases (except in rare use cases where strange-looking strides are used). It also makes all memory transfer-related operations explicit in terms of what memory layout the underlying data will have after an operation is performed.
In addition, it moves all copy-related methods to the `ArrayView` instances instead of exposing them on the memory buffers. This realizes a "separation of concerns": on the one hand, a `MemoryBuffer` holds a reference to the native memory area and controls its lifetime; on the other hand, `ArrayView` structures manage the contents of these buffers and make them available to the actual GPU kernels.
Example:
// Simple 1D allocation of 1024 longs with TStride = Stride1D.Dense (all elements are accessed contiguously in memory)
var t = accl.Allocate1D<long>(1024);
// Advanced 1D allocation of 1024 longs with TStride = Stride1D.General(2) (each memory access will skip 2 elements)
// -> allocates 1024 * 2 longs to be able to access all of them
var t1 = accl.Allocate1D<long, Stride1D.General>(1024, new Stride1D.General(2));
// Simple 1D allocation of 1024 longs using the array provided
var data1 = new long[1024];
var t2 = accl.Allocate1D(data1);
// Simple 2D allocation of 1024 * 1024 longs using the array provided with TStride = Stride2D.DenseX
// (all elements in X dimension are accessed contiguously in memory)
// -> this will *not* transpose the input buffer as the memory layout will be identical on CPU and GPU
var data2 = new long[1024, 1024];
var t3 = accl.Allocate2DDenseX(data2);
// Simple 2D allocation of 1024 * 1024 longs using the array provided, with TStride = Stride2D.DenseY
// (all elements in Y dimension are accessed contiguously in memory)
// -> this *will* transpose the input buffer to match the desired data layout
var data3 = new long[1024, 1024];
var t4 = accl.Allocate2DDenseY(data3);
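The stride abstraction above boils down to plain address arithmetic. As a language-neutral sketch (Python here, with illustrative helper names that are not part of the ILGPU API), this is roughly how the strides used in the example map an index to a linear element offset:

```python
# Illustrative sketch of the element-address arithmetic implied by the
# stride types above; these helpers are NOT the actual ILGPU API.

def dense_1d(i):
    # Stride1D.Dense: elements are packed side by side.
    return i

def general_1d(i, stride):
    # Stride1D.General(stride): every access skips `stride` elements,
    # which is why Allocate1D with General(2) reserves 1024 * 2 longs.
    return i * stride

def dense_x_2d(x, y, extent_x):
    # Stride2D.DenseX: elements are contiguous along the X dimension.
    return y * extent_x + x

def dense_y_2d(x, y, extent_y):
    # Stride2D.DenseY: elements are contiguous along the Y dimension,
    # i.e. the transposed layout of DenseX.
    return x * extent_y + y

# A General(2) view over 1024 elements touches offsets 0, 2, ..., 2046,
# hence the doubled allocation size in the example above.
assert general_1d(1023, 2) == 2046

# DenseX and DenseY assign transposed offsets to the same (x, y) index.
assert dense_x_2d(3, 2, 1024) == 2 * 1024 + 3
assert dense_y_2d(3, 2, 1024) == 3 * 1024 + 2
```

This is why copying between layouts with different dense dimensions requires a transpose, while copies between identical layouts are straight memmoves.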
The major changes/features of the new Memory API are:
- The `Index1`|`Index2`|`Index3` types have been renamed to `Index1D`|`Index2D`|`Index3D` to match the naming scheme of the `ArrayViewXD` and `MemoryBufferXD` types.
- The `LongIndex1`|`LongIndex2`|`LongIndex3` types have been renamed to `LongIndex1D`|`LongIndex2D`|`LongIndex3D` to match the naming scheme of the `ArrayViewXD` and `MemoryBufferXD` types.
- `MemoryBuffer` and `ArrayView` instances:
  - `ArrayView...` structures represent and manage the contents of buffers (or chunks of buffers).
  - `MemoryBuffer...` classes manage the lifetime of allocated memory chunks on a device.
  - The `ILGPU.ArrayView` intrinsic structure implements the newly added `IContiguousArrayView` interface that marks contiguous memory sections.
  - The `ILGPU.Runtime.MemoryBuffer...` classes implement the newly added `IContiguousArrayView` interface that marks contiguous memory sections.
  - The `IContiguousArrayView` interface provides extension methods for initializing the memory region and for copying from and to it (not supported on accelerators).
- `Stride`s: ILGPU contains built-in common strides for 1D, 2D and 3D views.
  - `Stride1D.Dense` represents contiguous chunks of memory that pack elements side by side.
  - `Stride1D.General` represents strides that skip a certain number of elements.
  - `Stride2D.DenseX` represents 2D strides that pack elements side by side in dimension X (transfers from arrays to views with this stride involve transpose operations).
  - `Stride2D.DenseY` represents 2D strides that pack elements in the Y dimension side by side.
  - `Stride2D.General` represents strides that skip a certain number of elements in the X and Y dimensions.
  - `Stride3D.DenseXY` represents 3D strides that pack elements in the X,Y dimensions side by side (transfers from arrays to views with this stride involve transposition operations).
  - `Stride3D.DenseZY` represents 3D strides that pack elements in the Z,Y dimensions side by side.
  - `Stride3D.General` represents strides that omit a certain number of elements in the X, Y and Z dimensions.
- The `ArrayViewXD` types have been moved to the `ILGPU.Runtime` namespace.
- The `ArrayViewXD` types do not implement `IContiguousArrayView`, as they support arbitrary stride information.
- The `ArrayView1D<T, Stride1D.Dense>` specialization has an implicit conversion to `ArrayView<T>` (and vice versa) for auxiliary purposes.
- The `CopyFromCPU` and `CopyToCPU` methods are provided with additional hints as to whether they are transposing the input elements or keeping the original layout.
- `GetAsXDArray(...)` always returns elements in .Net standard layout for 1D, 2D and 3D arrays (this may result in transposing the input elements of the buffer on the CPU). Use `view.AsContiguous().GetAsArray()` to get the memory layout of the input buffer.

This also affects the implementation of all `IndexND` types.
We moved the index reconstruction functions from the index types to the individual stride implementations:
Old way:
Index2D index = <some_extent>.ReconstructIndex(index);
New way:
Index2D index = Stride2D.DenseX.ReconstructFromElementIndex(index, <some_extent>);
// .. or ..
Index2D index = Stride2D.DenseY.ReconstructFromElementIndex(index, <some_extent>);
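Conceptually, `ReconstructFromElementIndex` just inverts the stride's address computation. A hedged Python sketch (illustrative names, not the real ILGPU API) of what the two dense 2D variants compute for a linear element index:

```python
# Illustrative inverse of the dense 2D stride address computations;
# these helpers are NOT the actual ILGPU API.

def reconstruct_dense_x(element_index, extent_x):
    # DenseX packs X contiguously: X is the remainder, Y the quotient.
    y, x = divmod(element_index, extent_x)
    return (x, y)

def reconstruct_dense_y(element_index, extent_y):
    # DenseY packs Y contiguously: the roles of X and Y are swapped.
    x, y = divmod(element_index, extent_y)
    return (x, y)

# Round trip: linear element index 2 * 1024 + 3 in a DenseX layout of
# width 1024 reconstructs to the 2D index (x=3, y=2).
assert reconstruct_dense_x(2 * 1024 + 3, 1024) == (3, 2)
assert reconstruct_dense_y(3 * 1024 + 2, 1024) == (3, 2)
```

Moving this logic onto the stride types is what lets the two variants give different (but each internally consistent) answers for the same linear index.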
The new Device API removes the `ContextFlags` enumeration and implements the same functionality in an object-oriented way using a `Context.Builder` class. It offers a fluent-API-like configuration interface which makes it easy to set up:
// Enables all supported accelerators (default CPU accelerator only) and puts the context
// into auto-assertion mode via "AutoAssertions()". In other words, if a debugger is attached,
// the `Context` instance will turn on all assertion checks. This behavior is identical
// to the current implementation via new Context();
using var context = Context.CreateDefault();
// Turns on O2 and enables all compatible Cuda devices.
using var context = Context.Create(builder =>
{
builder.Optimize(OptimizationLevel.O2).Cuda();
});
// Turns on all assertions, enables the IR verifier and enables all compatible OpenCL devices.
using var context = Context.Create(builder =>
{
builder.Assertions().Verify().OpenCL();
});
// Turns on kernel source-line annotations, fast math using 32-bit float and enables
// *all* (even incompatible) OpenCL devices.
using var context = Context.Create(builder =>
{
builder
.DebugSymbols(DebugSymbolsMode.KernelSourceAnnotations)
.Math(MathMode.Fast32BitOnly)
.OpenCL(device => true);
});
// Selects an OpenCL device with a warp size of at least 32:
using var context = Context.Create(builder =>
{
builder.OpenCL(device => device.WarpSize >= 32);
});
// Turns on all assertions in debug mode (same behavior as calling CreateDefault()):
using var context = Context.Create(builder =>
{
builder.AutoAssertions();
});
// Turns on debug optimizations (level O0) and all assertions if a debugger is attached:
using var context = Context.Create(builder =>
{
builder.AutoDebug();
});
// Turns on debug mode (optimization level O0, assertions and kernel debug information):
using var context = Context.Create(builder =>
{
builder.Debug();
});
// Disable caching, enable conservative inlining and inline mutable static field values:
using var context = Context.Create(builder =>
{
builder
.Caching(CachingMode.Disabled)
.Inlining(InliningMode.Conservative)
.StaticFields(StaticFieldMode.MutableStaticFields);
});
// Turn on *all* CPU accelerators that simulate different hardware platforms:
using var context = Context.Create(builder => builder.CPU());
// Turn on an AMD-based CPU accelerator:
using var context = Context.Create(builder => builder.CPU(CPUDeviceKind.AMD));
Note that by default all debug symbols are automatically turned off when a debugger is attached. If you want to turn on the debug information in all cases, call `builder.DebugSymbols(DebugSymbolsMode.Basic)`. At the same time, this PR introduces the notion of a `Device`, which replaces the implementation of `AcceleratorId`. This allows us to query detailed device information without explicitly instantiating an accelerator:
// Print all device information without instantiating a single accelerator
// (device context) instance.
using var context = Context.Create(...);
foreach (var device in context)
{
// Print detailed accelerator information
device.PrintInformation();
// ...
}
Note that we removed the ability to call the accelerator constructors (e.g. `new CudaAccelerator(...)`) directly. Either use the `CreateAccelerator` methods defined in the `Device` classes or use one of the extension methods like `CreateCudaAccelerator(...)` of the `Context` class itself:
using var context = Context.Create(...);
foreach (var device in context)
{
// Instantiate an accelerator instance on this device
using Accelerator accel = device.CreateAccelerator();
// ...
}
// Instantiate the 2nd Cuda accelerator (NOTE that this is the *2nd* Cuda device
// and *not* the 2nd device of your machine).
using CudaAccelerator cudaDevice = context.CreateCudaAccelerator(1);
// Instantiate the 1st OpenCL accelerator (NOTE that this is the *1st* OpenCL device
// and *not* the 1st device of your machine).
using CLAccelerator clDevice = context.CreateOpenCLAccelerator(0);
`Context` properties that expose types from other (ILGPU-internal) namespaces, which cannot or should not be covered by the API/ABI guarantees we want to give, have been made `internal`. To access these properties, use one of the available extension methods located in the corresponding namespaces:
using var context = ...
// OLD way
var internalIRContext = context.IRContext;
// NEW way:
// using namespace ILGPU.IR;
var internalIRContext = context.GetIRContext();
To use the new version of the algorithms library with ILGPU v1.0.0, you need to initialize the library with the help of the new builder pattern:
// Enables all algorithm library features
using var context = Context.Create(builder =>
{
builder.EnableAlgorithms();
});
The new CPU runtime significantly improves the existing `CPUAccelerator` runtime by adding support for user-defined `warp`, `group` and `multiprocessor` configurations. It changes the internal functionality to simulate a single warp of at least 2 threads (which ensures that all shuffle-based/reduction-like algorithms can also be run on the CPU by default). At the same time, each virtual multiprocessor can only execute a single thread group at a time. Increasing the number of virtual multiprocessors allows the user to simulate multiple concurrent groups. Most use cases will not require more than a single multiprocessor in practice.
Note that all device-wide static `Grid`/`Group`/`Atomic`/`Warp` classes are fully supported to debug/simulate all ILGPU kernels on the CPU.
Note that a custom warp size must be a multiple of 2.
This PR adds a new set of static creation methods:
- `CreateDefaultSimulator(...)` creates a `CPUAccelerator` instance with 4 threads per warp, 4 warps per multiprocessor and a single multiprocessor (`MaxGroupSize = 16`).
- `CreateNvidiaSimulator(...)` creates a `CPUAccelerator` instance with 32 threads per warp, 32 warps per multiprocessor and a single multiprocessor (`MaxGroupSize = 1024`).
- `CreateAMDSimulator(...)` creates a `CPUAccelerator` instance with 32 threads per warp, 8 warps per multiprocessor and a single multiprocessor (`MaxGroupSize = 256`).
- `CreateLegacyAMDSimulator(...)` creates a `CPUAccelerator` instance with 64 threads per warp, 4 warps per multiprocessor and a single multiprocessor (`MaxGroupSize = 256`).
- `CreateIntelSimulator(...)` creates a `CPUAccelerator` instance with 16 threads per warp, 8 warps per multiprocessor and a single multiprocessor (`MaxGroupSize = 128`).

Furthermore, this PR adds support for advanced debugging features that enable a "sequential-like" execution mode. In this mode, each thread of a group will run sequentially one after another until it hits a synchronization barrier or exits the kernel function. This allows users to conveniently debug larger thread groups consisting of concurrent threads without switching to single-threaded execution. This behavior can be controlled via the newly added `CPUAcceleratorMode` enum:
/// <summary>
/// The accelerator mode to be used with the <see cref="CPUAccelerator"/>.
/// </summary>
public enum CPUAcceleratorMode
{
/// <summary>
/// The automatic mode uses <see cref="Sequential"/> if a debugger is attached.
/// It uses <see cref="Parallel"/> if no debugger is attached to the
/// application.
/// </summary>
/// <remarks>
/// This is the default mode.
/// </remarks>
Auto = 0,
/// <summary>
/// If the CPU accelerator uses a simulated sequential execution mechanism. This
/// is particularly useful to simplify debugging. Note that different threads for
/// distinct multiprocessors may still run in parallel.
/// </summary>
Sequential = 1,
/// <summary>
/// A parallel execution mode that runs all execution threads in parallel. This
/// reduces processing time but makes it harder to use a debugger.
/// </summary>
Parallel = 2,
}
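As a quick sanity check, the `MaxGroupSize` quoted for each simulator preset listed earlier is simply threads-per-warp times warps-per-multiprocessor (each preset uses a single multiprocessor):

```python
# MaxGroupSize = threads per warp * warps per multiprocessor for the
# CPU simulator presets listed above (single multiprocessor each).
presets = {
    "Default": (4, 4, 16),
    "Nvidia": (32, 32, 1024),
    "AMD": (32, 8, 256),
    "LegacyAMD": (64, 4, 256),
    "Intel": (16, 8, 128),
}

for name, (threads_per_warp, warps_per_mp, max_group_size) in presets.items():
    assert threads_per_warp * warps_per_mp == max_group_size, name
```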
By default, all `CPUAccelerator` instances use the automatic mode (`CPUAcceleratorMode.Auto`) that switches to a sequential execution model as soon as a debugger is attached to the application.
Note that threads in the scope of multiple multiprocessors may still run in parallel.
Special thanks to @76creates, @conghuiw, @deng0, @GPSnoopy, @jgiannuzzi, @Joey9801, @ljubon, @MoFtZ, @Nnelg, @nullandkale and @sucrose0413 for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @faruknane, @mikhail-khalizev, @MPSQUARK, @Ruberik, @Yey007, and @yuryGotham) for providing feedback, submitting issues and feature requests.
Published by github-actions[bot] almost 3 years ago
This final release candidate is a preview of the upcoming ILGPU stable release with a frozen API surface/feature level. It includes performance improvements and several bug fixes including critical patches for the internal loop optimization phases and cross-device peer accesses (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
- `Atomic` implementations to overcome performance limitations (#667)
- `ArrayView` and `ArrayView1D` (#666)
- `Atomics` performance (#667)
- `IO` operations (#694)
- `LoopUnrolling` phases (#653, #657, #661)
- `CPUDevice` and `CPUMultiprocessor` classes (#665)
- `NotInsideKernel` attributes on `MemSet` functions (#651)

Special thanks to @MoFtZ, @jgiannuzzi, @deng0 and @conghuiw for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.
Full Changelog: https://github.com/m4rs-mt/ILGPU/compare/v1.0.0-rc2...v1.0.0-rc3
Published by github-actions[bot] about 3 years ago
This new release candidate is a preview of the upcoming ILGPU stable release with a frozen API surface/feature level. It includes bug fixes, new features and refined ILGPU `Index`/`Stride`, `ScanExtensions`, `RadixSortExtensions` and `CuBlas` APIs (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
- `Index1D`|`Index2D`|`Index3D`|`LongIndex1D`|`LongIndex2D`|`LongIndex3D` type API surface: removed multidimensional index reconstruction methods
- `Stride1D`|`Stride2D`|`Stride3D` types
- `ArrayView1D`|`ArrayView2D`|`ArrayView3D` types to `Stride1D`|`Stride2D`|`Stride3D` types
- `CuBlas` API to be compatible with stride information
- `Scan` and `RadixSort` APIs to be compatible with stride information
- `ValueType.GetHashCode` (#617)
- `OutOfResources` when emitting code with debug assertions turned on using the Cuda backend (#628)
- `net471` target without Windows (#616)

Special thanks to @MoFtZ, @nullandkale, @jgiannuzzi, @Joey9801, @lostmsu and @kilngod for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.
Full Changelog: https://github.com/m4rs-mt/ILGPU/compare/v1.0.0-rc1...v1.0.0-rc2
Published by github-actions[bot] about 3 years ago
This new release candidate is a preview of the upcoming ILGPU stable release with a frozen API surface/feature level. It includes bug fixes, a lot of amazing new features and improved samples and documentation (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
- `v4.7` to `v4.7.1` to benefit from the most recent dependency updates (#595)
- `v4.7` to `v4.7.1` (#594)
- `PTX` assembly instructions (#588)
- `CUDA` and `CL` allocations to enable allocations of zero bytes (#547, #610)

Special thanks to @MoFtZ, @nullandkale, @Joey9801, @jgiannuzzi and @sucrose0413 for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.
Published by github-actions[bot] about 3 years ago
This new beta offers significant performance improvements to the generated kernel programs and includes a lot of amazing new features (get the ILGPU Nuget package and ILGPU Algorithms Nuget package).
- `.NET 5` to a default target framework (#529, #536)
- `Array` processing pipeline to have full support for nD-arrays (#513)
- `AsNDView` (#571)
- `SubView` operations (#550)
- `UCE` transformation to the backend optimization passes (#569)
- `EnableAlgorithms` on Context builder instances (#515)
- `IndexND` and `LongIndexND` types (#510)
- `InvalidEntryPointIndexParameterOfWrongType` error message to be more descriptive (#535)
- `DllImportSearchPath` to `LegacyBehavior` (#514)
- `Stride` and `ArrayView` types (#509)
- `RadixSortProvider` and `ScanProvider` test cases (#516)
- `feedz.io` (#521, #520)

Special thanks to @MoFtZ, @Joey9801, @jgiannuzzi, @nullandkale, @76creates, @Nnelg and @ljubon for their contributions to this release in the form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community for providing feedback, submitting issues and feature requests.
Published by m4rs-mt over 3 years ago
This new beta offers significant performance improvements to the generated kernel programs and includes a lot of amazing new features (get the Nuget package).
Please note that this version has some breaking changes compared to previous ILGPU versions.
Refer to the v1.0-beta1 summary for more information.
Published by m4rs-mt over 3 years ago
This new beta offers significant performance improvements to the generated kernel programs and includes a lot of amazing new features (get the Nuget package).
Please note that this version has some breaking changes compared to previous ILGPU versions.
- The `Memory API`, involving the `ArrayView` and `MemoryBuffer` types, has been significantly improved to support explicit `Stride` information (see below).
- The `IndexX` and `LongIndexX` types have been renamed to `IndexXD` and `LongIndexXD` to have a unified programming experience with respect to memory buffers and array views (see below).
- The `Device API` has been redesigned to explicitly enable, filter and configure the available hardware accelerator devices (see below).
- `Memory API` to support explicit stride information (#421, #475, #483)
- `Device API` to enable, filter and configure the available hardware accelerator devices (#428)
- `OpenCL 3.0` API (#464)
- `ProfilingMarker`s (#482)
- `Warp`/`Group`/`Multiprocessor` configurations (#402, #484)
- `IRBuilder` (#477)
- `OpenCL` kernels in the presence of constant switch conditions (#441)

The new API distinguishes between a coherent, strongly typed `ArrayView<T>` structure and its n-D versions `ArrayViewXD<T, TStride>`, which carry dimension-dependent stride information (the actual logic for computing element addresses is moved from the `IndexXD` types to the newly added `StrideXD` types). This allows developers to explicitly specify a particular stride of a view, reinterpret the data layout itself (by changing the stride), and perform compile-time optimizations based on explicitly typed stride information. Consequently, ILGPU's optimization pipeline is able to remove the overhead of these abstractions in most cases (except in rare use cases where strange-looking strides are used). It also makes all memory transfer-related operations explicit in terms of what memory layout the underlying data will have after an operation is performed.
In addition, it moves all copy-related methods to the `ArrayView` instances instead of exposing them on the memory buffers. This realizes a "separation of concerns": on the one hand, a `MemoryBuffer` holds a reference to the native memory area and controls its lifetime; on the other hand, `ArrayView` structures manage the contents of these buffers and make them available to the actual GPU kernels.
Example:

```csharp
// Simple 1D allocation of 1024 longs with TStride = Stride1D.Dense
// (all elements are accessed contiguously in memory).
var t1 = accl.Allocate1D<long>(1024);

// Advanced 1D allocation of 1024 longs with TStride = Stride1D.General(2)
// (each memory access will skip 2 elements)
// -> allocates 1024 * 2 longs to be able to access all of them.
var t2 = accl.Allocate1D<long, Stride1D.General>(1024, new Stride1D.General(2));

// Simple 1D allocation of 1024 longs using the array provided.
var data1 = new long[1024];
var t3 = accl.Allocate1D(data1);

// Simple 2D allocation of 1024 * 1024 longs using the array provided,
// with TStride = Stride2D.DenseX (all elements in the X dimension are
// accessed contiguously in memory)
// -> this will *not* transpose the input buffer, as the memory layout
//    is identical on CPU and GPU.
var data2 = new long[1024, 1024];
var t4 = accl.Allocate2DDenseX(data2);

// Simple 2D allocation of 1024 * 1024 longs using the array provided,
// with TStride = Stride2D.DenseY (all elements in the Y dimension are
// accessed contiguously in memory)
// -> this *will* transpose the input buffer to match the desired layout.
var data3 = new long[1024, 1024];
var t5 = accl.Allocate2DDenseY(data3);
```
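For intuition, the element-offset logic that the stride types encapsulate can be sketched in plain C#. These helpers are purely illustrative and are not part of the ILGPU API:

```csharp
// Illustrative helpers mimicking the element-offset computations that the
// StrideXD types encapsulate; they are *not* part of the ILGPU API.
static class StrideSketch
{
    // Stride1D.Dense: elements are packed side by side.
    public static long Dense(long i) => i;

    // Stride1D.General(n): every access skips n elements.
    public static long General1D(long i, long stride) => i * stride;

    // Stride2D.DenseX: X elements are contiguous, so Y selects the row.
    public static long DenseX(long x, long y, long extentX) => y * extentX + x;

    // Stride2D.DenseY: Y elements are contiguous, so X selects the column.
    public static long DenseY(long x, long y, long extentY) => x * extentY + y;
}
```

For example, with an extent of 1024 in X, element (x: 3, y: 2) of a DenseX view lives at offset 2 * 1024 + 3 = 2051, while the same element of a DenseY view (extent 1024 in Y) lives at offset 3 * 1024 + 2 = 3074, which is why converting between the two layouts requires a transpose.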
The major changes/features of the new Memory API are:

- The Index1|Index2|Index3 types have been renamed to Index1D|Index2D|Index3D to match the naming scheme of the ArrayViewXD and MemoryBufferXD types.
- The LongIndex1|LongIndex2|LongIndex3 types have been renamed to LongIndex1D|LongIndex2D|LongIndex3D to match the naming scheme of the ArrayViewXD and MemoryBufferXD types.
- Separation of concerns between MemoryBuffer and ArrayView instances:
  - ArrayView... structures represent and manage the contents of buffers (or chunks of buffers).
  - MemoryBuffer... classes manage the lifetime of allocated memory chunks on a device.
- The ILGPU.ArrayView intrinsic structure implements the newly added IContiguousArrayView interface that marks contiguous memory sections.
- The ILGPU.Runtime.MemoryBuffer... classes implement the newly added IContiguousArrayView interface that marks contiguous memory sections.
- The IContiguousArrayView interface provides extension methods for initializing the memory region and copying from and to it (not supported on accelerators).
- Strides: ILGPU contains built-in common strides for 1D, 2D and 3D views.
  - Stride1D.Dense represents contiguous chunks of memory that pack elements side by side.
  - Stride1D.General represents strides that skip a certain number of elements.
  - Stride2D.DenseX represents 2D strides that pack elements side by side in the X dimension (transfers from and to views with this stride involve transpose operations).
  - Stride2D.DenseY represents 2D strides that pack elements side by side in the Y dimension.
  - Stride2D.General represents strides that skip a certain number of elements in the X and Y dimensions.
  - Stride3D.DenseXY represents 3D strides that pack elements side by side in the X and Y dimensions (transfers from and to views with this stride involve transpose operations).
  - Stride3D.DenseZY represents 3D strides that pack elements side by side in the Z and Y dimensions.
  - Stride3D.General represents strides that skip a certain number of elements in the X, Y and Z dimensions.
- The ArrayViewXD types have been moved to the ILGPU.Runtime namespace.
- The ArrayViewXD types do not implement IContiguousArrayView, as they support arbitrary stride information.
- The ArrayView1D<T, Stride1D.Dense> specialization has an implicit conversion to ArrayView<T> (and vice versa) for auxiliary purposes.
- The CopyFromCPU and CopyToCPU methods are provided with additional hints as to whether they transpose the input elements or keep the original layout.
- GetAsXDArray(...) always returns elements in .Net standard layout for 1D, 2D and 3D arrays (this may result in transposing the input elements of the buffer on the CPU). Use view.AsContiguous().GetAsArray() to get the memory layout of the input buffer.

The new Device API removes the ContextFlags enumeration and implements the same functionality in an object-oriented way using a Context.Builder class. It offers a fluent configuration interface which makes it easy to set up:
```csharp
// Enables all supported accelerators (default CPU accelerator only) and puts
// the context into auto-assertion mode via "AutoAssertions()". In other words,
// if a debugger is attached, the `Context` instance will turn on all assertion
// checks. This behavior is identical to the previous implementation via
// new Context().
using var context = Context.CreateDefault();

// Turns on O2 and enables all compatible Cuda devices.
using var context = Context.Create(builder =>
{
    builder.Optimize(OptimizationLevel.O2).Cuda();
});

// Turns on all assertions, enables the IR verifier and enables all compatible
// OpenCL devices.
using var context = Context.Create(builder =>
{
    builder.Assertions().Verify().OpenCL();
});

// Turns on kernel source-line annotations, fast math using 32-bit floats and
// enables *all* (even incompatible) OpenCL devices.
using var context = Context.Create(builder =>
{
    builder
        .DebugSymbols(DebugSymbolsMode.KernelSourceAnnotations)
        .Math(MathMode.Fast32BitOnly)
        .OpenCL(device => true);
});

// Selects an OpenCL device with a warp size of at least 32.
using var context = Context.Create(builder =>
{
    builder.OpenCL(device => device.WarpSize >= 32);
});

// Turns on all assertions in debug mode (same behavior as calling CreateDefault()).
using var context = Context.Create(builder =>
{
    builder.AutoAssertions();
});

// Turns on debug optimizations (level O0) and all assertions if a debugger
// is attached.
using var context = Context.Create(builder =>
{
    builder.AutoDebug();
});

// Turns on debug mode (optimization level O0, assertions and kernel debug
// information).
using var context = Context.Create(builder =>
{
    builder.Debug();
});

// Disables caching, enables conservative inlining and inlines mutable static
// field values.
using var context = Context.Create(builder =>
{
    builder
        .Caching(CachingMode.Disabled)
        .Inlining(InliningMode.Conservative)
        .StaticFields(StaticFieldMode.MutableStaticFields);
});

// Turns on *all* CPU accelerators that simulate different hardware platforms.
using var context = Context.Create(builder => builder.CPU());

// Turns on an AMD-based CPU accelerator.
using var context = Context.Create(builder => builder.CPU(CPUDeviceKind.AMD));
```
Note that by default all debug symbols are automatically turned off when no debugger is attached. If you want to turn on debug information in all cases, call builder.DebugSymbols(DebugSymbolsMode.Basic). At the same time, this PR introduces the notion of a Device, which replaces the implementation of AcceleratorId. This allows us to query detailed device information without explicitly instantiating an accelerator:
```csharp
// Print all device information without instantiating a single accelerator
// (device context) instance.
using var context = Context.Create(...);
foreach (var device in context)
{
    // Print detailed accelerator information.
    device.PrintInformation();
    // ...
}
```
Note that we removed the ability to call the accelerator constructors (e.g. new CudaAccelerator(...)
) directly. Either use the CreateAccelerator
methods defined in the Device
classes or use one of the extension methods like CreateCudaAccelerator(...)
of the Context
class itself:
```csharp
using var context = Context.Create(...);
foreach (var device in context)
{
    // Instantiate an accelerator instance on this device.
    using Accelerator accel = device.CreateAccelerator();
    // ...
}

// Instantiate the 2nd Cuda accelerator (NOTE that this is the *2nd* Cuda
// device and *not* the 2nd device of your machine).
using CudaAccelerator cudaDevice = context.CreateCudaAccelerator(1);

// Instantiate the 1st OpenCL accelerator (NOTE that this is the *1st* OpenCL
// device and *not* the 1st device of your machine).
using CLAccelerator clDevice = context.CreateOpenCLAccelerator(0);
```
Context properties that expose types from other (ILGPU-internal) namespaces, which cannot or should not be covered by the API/ABI guarantees we want to give, have been made internal. To access these properties, use one of the available extension methods located in the corresponding namespaces:
```csharp
using var context = ...

// OLD way:
var internalIRContext = context.IRContext;

// NEW way (requires "using ILGPU.IR;"):
var internalIRContext = context.GetIRContext();
```
The new CPU runtime significantly improves the existing CPUAccelerator
runtime by adding support for user-defined warp
, group
and multiprocessor
configurations. It changes the internal functionality to simulate a single warp of at least 2 threads (which ensures that all shuffle-based/reduction-like algorithms can also be run on the CPU by default). At the same time, each virtual multiprocessor can only execute a single thread group at a time. Increasing the number of virtual multiprocessors allows the user to simulate multiple concurrent groups. Most use cases will not require more than a single multiprocessor in practice.
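As a sketch of how such a custom configuration might be requested, the snippet below assumes a CPUDevice constructor taking the number of threads per warp, warps per multiprocessor and multiprocessors, plus a matching builder.CPU(...) overload; both shapes are assumptions for illustration, not confirmed API:

```csharp
// Hedged sketch: requests a CPU accelerator simulating 16-thread warps,
// 2 warps per multiprocessor and 2 multiprocessors (so two thread groups
// can run concurrently). The CPUDevice constructor shape and the
// builder.CPU(device) overload are assumptions.
using var context = Context.Create(builder =>
    builder.CPU(new CPUDevice(
        numThreadsPerWarp: 16,
        numWarpsPerMultiprocessor: 2,
        numMultiprocessors: 2)));
```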
Note that all device-wide static Grid
/Group
/Atomic
/Warp
classes are fully supported to debug/simulate all ILGPU kernels on the CPU.
Note that a custom warp size must be a multiple of 2.
This PR adds a new set of static creation methods:

- CreateDefaultSimulator(...) creates a CPUAccelerator instance with 4 threads per warp, 4 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 16).
- CreateNvidiaSimulator(...) creates a CPUAccelerator instance with 32 threads per warp, 32 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 1024).
- CreateAMDSimulator(...) creates a CPUAccelerator instance with 32 threads per warp, 8 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 256).
- CreateLegacyAMDSimulator(...) creates a CPUAccelerator instance with 64 threads per warp, 4 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 256).
- CreateIntelSimulator(...) creates a CPUAccelerator instance with 16 threads per warp, 8 warps per multiprocessor and a single multiprocessor (MaxGroupSize = 128).

Furthermore, this PR adds support for advanced debugging features that enable a "sequential-like" execution mode. In this mode, each thread of a group runs sequentially, one after another, until it hits a synchronization barrier or exits the kernel function. This allows users to conveniently debug larger thread groups consisting of concurrent threads without switching to single-threaded execution. This behavior can be controlled via the newly added CPUAcceleratorMode enum:
```csharp
/// <summary>
/// The accelerator mode to be used with the <see cref="CPUAccelerator"/>.
/// </summary>
public enum CPUAcceleratorMode
{
    /// <summary>
    /// The automatic mode uses <see cref="Sequential"/> if a debugger is
    /// attached. It uses <see cref="Parallel"/> if no debugger is attached
    /// to the application.
    /// </summary>
    /// <remarks>This is the default mode.</remarks>
    Auto = 0,

    /// <summary>
    /// The CPU accelerator uses a simulated sequential execution mechanism.
    /// This is particularly useful to simplify debugging. Note that different
    /// threads for distinct multiprocessors may still run in parallel.
    /// </summary>
    Sequential = 1,

    /// <summary>
    /// A parallel execution mode that runs all execution threads in parallel.
    /// This reduces processing time but makes it harder to use a debugger.
    /// </summary>
    Parallel = 2,
}
```
By default, all CPUAccelerator
instances use the automatic mode (CPUAcceleratorMode.Auto
) that switches to a sequential execution model as soon as a debugger is attached to the application.
Note that threads in the scope of multiple multiprocessors may still run in parallel.
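A minimal sketch of selecting a mode explicitly follows; it assumes an overload of CreateCPUAccelerator that accepts a CPUAcceleratorMode, which is an assumption based on the description above rather than confirmed API:

```csharp
// Hedged sketch: force sequential execution for easier debugging.
// The CreateCPUAccelerator overload taking a mode is an assumption.
using var context = Context.Create(builder => builder.CPU());
using var accelerator =
    context.CreateCPUAccelerator(0, CPUAcceleratorMode.Sequential);
```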
Special thanks to @MoFtZ, @Joey9801, @jgiannuzzi and @GPSnoopy for their contributions to this release in form of code, feedback, ideas and proposals. Furthermore, we would like to thank the entire ILGPU community (especially @MPSQUARK, @Nnelg, @Ruberik, @Yey007, @faruknane, @mikhail-khalizev, @nullandkale and @yuryGotham) for providing feedback, submitting issues and feature requests.