Implementation of SYCL and C++ standard parallelism for CPUs and GPUs from all vendors: The independent, community-driven compiler for C++-based heterogeneous programming models. Lets applications adapt themselves to all the hardware in the system - even at runtime!
BSD-2-Clause License
This release increases performance even further, while also adding various new features. AdaptiveCpp 24.06 is now without a doubt one of the leading heterogeneous C++ compilers when it comes to performance. In many cases, it is faster than vendor-supported compiler stacks such as CUDA or oneAPI. At the same time, as a purely community-driven project, it is completely free from vendor politics, giving the community back control over their preferred programming models.
Users are encouraged to read the performance guide for directions as to how to get the most out of the AdaptiveCpp stack.
- Support for the more aggressive `ACPP_ADAPTIVITY_LEVEL=2` setting of the `ACPP_ADAPTIVITY_LEVEL` environment variable.
- Support for the `std::execution::par` execution policy on devices which support strong forward progress guarantees. In this case, there is experimental support for `std::atomic` and `std::atomic_ref` in device code.
- Improvements to the `buffer` and `multi_ptr` interfaces.
- The `sycl::specialized` extension, which hints to the JIT compiler that a runtime kernel argument should be replaced with a constant at JIT-time. This makes AdaptiveCpp the first SYCL implementation to support specialization semantics across all backends thanks to its unified JIT compiler.
- Deprecates the old `HIPSYCL_*`, `__hipsycl*` identifiers and adds new versions following the `ACPP_*`, `__acpp_*` naming scheme. Users are encouraged to migrate to the new names.

The following benchmarks explore the performance of the new `ACPP_ADAPTIVITY_LEVEL=2` (AL2) feature as well as performance in general.
- AL2 refers to `ACPP_ADAPTIVITY_LEVEL=2`, set via the `ACPP_ADAPTIVITY_LEVEL` environment variable, which controls the aggressiveness of additional JIT-time optimizations that AdaptiveCpp supports. The AL2 results were obtained after 2-3 application runs, once performance had converged.
- All benchmarks were compiled with `-O3 -ffast-math` universally, which aligns most compilers. For `hipcc`, `-fno-hip-fp32-correctly-rounded-divide-sqrt` was used in addition to align its behavior with the other compilers.
- AdaptiveCpp used its generic compiler (`--acpp-targets=generic`).
- AdaptiveCpp benchmarks used the `-DACPP_ALLOW_INSTANT_SUBMISSION=1` compilation flag, in line with the recommendations in the AdaptiveCpp performance guide.

The figures below may be freely shared under CC-BY license, with attribution to the AdaptiveCpp project.
Full Changelog: https://github.com/AdaptiveCpp/AdaptiveCpp/compare/v24.02.0...v24.06.0
Published by illuhad 7 months ago
AdaptiveCpp 24.02 introduces multiple compiler improvements, making it one of the best SYCL compilers in the world - and in many cases the best - when it comes to extracting performance from the hardware.
If you are not using it already, try it now and perhaps save some compute time!
The following performance results have been obtained with AdaptiveCpp's generic single-pass compiler (`--acpp-targets=generic`).
Note: oneAPI by default compiles with `-ffast-math`, while AdaptiveCpp does not enable fast math by default. All benchmarks have been explicitly compiled with `-fno-fast-math` to align compiler behavior, except where noted otherwise.
Note: oneAPI for AMD does not correctly round `sqrt()` calls even if `-fno-fast-math` is passed, using approximate builtins instead. This loss of precision can substantially skew benchmark results and make performance comparisons misleading. AdaptiveCpp 24.02 correctly rounds math functions by default. To align precision and allowed compiler optimizations, AdaptiveCpp was allowed to use approximate `sqrt` builtins as well for the AMD results.
Note: AdaptiveCpp was running on the Intel GPU through OpenCL, while DPC++ was using its default backend Level Zero, which allows for more low-level control. Some of the differences may be explained by the different backend runtimes underneath the SYCL implementations.
AdaptiveCpp 24.02 ships with the world's fastest compiler for offloading C++ standard parallelism constructs. This functionality was already part of 23.10; however, 24.02 includes multiple important improvements. It can substantially outperform vendor compilers, and is the world's only compiler that can demonstrate C++ standard parallelism offloading performance across Intel, NVIDIA and AMD hardware. Consider the following performance results for the CloverLeaf, TeaLeaf and miniBUDE benchmarks:
oneAPI results were obtained with `icpx -fsycl-pstl-offload=gpu` on an Intel Data Center GPU Max 1550.

In particular, note that AdaptiveCpp does not depend on the XNACK hardware feature to obtain performance on AMD GPUs. XNACK is an elusive feature that is not available on most consumer hardware, and is usually not enabled on production HPC systems.
The default compilation target is now `--acpp-targets=generic`. This means that a simple compiler invocation such as `acpp -o test -O3 test.cpp` will create a binary that can run on Intel, NVIDIA and AMD GPUs. AdaptiveCpp 24.02 is the world's only SYCL compiler that does not require specifying compilation targets to generate a binary that can run "everywhere".
- `--acpp-targets=generic` can now also target the host CPU through the generic JIT compiler. This can lead to performance improvements over the old `omp` compiler; e.g. on AMD Milan, babelstream's dot benchmark was observed to improve from 280GB/s to 380GB/s. This also means that it is no longer necessary to target `omp` to run on the CPU: `generic` is sufficient, and will likely perform better. Not having to compile for `omp` explicitly can also reduce compile times noticeably (we observed e.g. ~15% for babelstream).
- Improvements when using `--acpp-targets=generic` that can substantially reduce JIT overheads.
- When using `--acpp-targets=generic`, AdaptiveCpp can now automatically apply optimizations to kernels at JIT-time based on runtime knowledge. This can lead to noticeable speedups in some cases, although the full potential of this is expected to only become apparent with future AdaptiveCpp versions. This behavior is controlled by the `ACPP_ADAPTIVITY_LEVEL` environment variable; set it to 0 to recover the old behavior. The default is currently 1. If you are running benchmarks, you may have to update your benchmarking infrastructure to run applications multiple times.
Full Changelog: https://github.com/AdaptiveCpp/AdaptiveCpp/compare/v23.10.0...v24.02.0
Published by illuhad 12 months ago
This release contains several major features, and introduces a major shift in the project's capabilities:
- The project has been renamed from hipSYCL to AdaptiveCpp. This affects the compiler driver (`acpp`), compiler flags (e.g. `--acpp-targets`), the cmake integration and more. The old name is still supported for backward compatibility during a transitional period. For details on why this renaming occurred, see https://github.com/AdaptiveCpp/AdaptiveCpp/issues/1147
- Generic single-pass compiler (`--acpp-targets=generic`): This release is the first to contain our new single-pass compiler. This is the world's only SYCL compiler that does not need to parse the code multiple times to generate a binary. Instead, during the regular host compilation, LLVM IR for kernels is extracted and embedded in the binary. At runtime, this IR is then JIT-compiled to whatever is needed (currently supported are PTX, amdgcn and SPIR-V). A binary produced by `acpp --acpp-targets=generic` can directly be executed on all supported GPUs from Intel, NVIDIA and AMD. The new approach can dramatically reduce compile times, especially when many devices need to be targeted, since the code is still only parsed a single time.
- Experimental offloading of C++ standard parallelism (`--acpp-stdpar`): This heterogeneous programming model was until now primarily supported by NVIDIA's nvc++ for NVIDIA GPUs. AdaptiveCpp not only supports it for NVIDIA, AMD and Intel GPUs, but also conveniently allows generating a binary that can dispatch to all supported devices using the new single-pass compiler. See here for details on this new experimental feature: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/stdpar.md
The full list of changes is too long for release pages; please see here for a comprehensive list of all changes:
Full Changelog: https://github.com/AdaptiveCpp/AdaptiveCpp/compare/v0.9.4...v23.10.0
Published by illuhad about 1 year ago
This is a prerelease for the upcoming 23.10.0 to provide a testing target.
- `multi_ptr == nullptr` by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/924
- `std::string` by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/926
- `sycl::vec` class to reflect SYCL 2020 requirements by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/907
- `halfn = vec<half, n>` by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/970
- `half` with other scalar types by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/969
- `Args...` of `vec` constructor in template parameter to allow SFINAE by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/954
- `id` class by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1018
- `buffer(Container)` constructor by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/990
- `range` class by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1027
- `half` by @normallytangent in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1038
- `Dimensions` template parameter for `{nd_}range`, `{nd_,h_}item` and `id` by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1016
- `buffer(Container)` and `buffer(Iterator, Iterator)` constructors by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1033
- `size_t` in decl and def of `createExitWithID` by @fodinabor in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1063
- `marray` by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1075
- `-std=c++20` by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1083
- `FindCUDAToolkit` for cmake versions >= 3.17 by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1124
- `{m,aligned_}alloc` and `free` by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1114
- `std::for_each_n` by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1132
- `get_backend` to interop handle by @normallytangent in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1141
- `sycl::exception` class to SYCL 2020 by @nilsfriess in https://github.com/AdaptiveCpp/AdaptiveCpp/pull/1066
Full Changelog: https://github.com/AdaptiveCpp/AdaptiveCpp/compare/v0.9.4...v23.10.0-alpha
Published by illuhad over 1 year ago
This is a maintenance release, intended as a last stop before major additions. It therefore does not include major functionality already available on the develop branch such as the generic single-pass compiler.
- `my_id` since it does not exist in class `item` by @nilsfriess in https://github.com/illuhad/hipSYCL/pull/868
- `device{} == device{default_selector{}}` by @nilsfriess in https://github.com/illuhad/hipSYCL/pull/888
- `global_mem_cache_type::write_only` to `read_write` by @nilsfriess in https://github.com/illuhad/hipSYCL/pull/875
Full Changelog: https://github.com/illuhad/hipSYCL/compare/v0.9.3...v0.9.4
Published by illuhad about 2 years ago
- New `hip.explicit-multipass` compilation flow
- Use `std::weak_ptr` instead of `shared_ptr` to express dependencies in the DAG, making old DAG nodes and their associated events eligible earlier for reuse by the event pool.
- `std::filesystem` support is used where available
- New `hipsycl-hcf-tool` to inspect and edit HCF files
- New `hipsycl-info` to print information about detected devices.

Thank you to our first-time contributors!
Full Changelog: https://github.com/illuhad/hipSYCL/compare/v0.9.2...v0.9.3
Published by illuhad over 2 years ago
The following is an incomplete list of changes and improvements:
- Kernel names of the form `__hipsycl_kernel<KernelNameT>` or `__hipsycl_kernel<KernelBodyT>`.
- `atomic_ref`, the device selector API, the device aspect API and others
- The `queue::get_wait_list()` hipSYCL extension to allow barrier-like semantics at the queue level
- The `accessor_variant` extension, which allows accessors to automatically optimize the internal data layout of the accessor object depending on how they were constructed. This can save registers on device without any changes needed by the user.
- The `handler::update_device()` extension, in analogy to the already existing `update_host()`. This can e.g. be used to prefetch data.

See the documentation on extensions for more details.

- Improvements to the `cuda.explicit-multipass` compilation flow when multiple translation units are involved.

Yes, a lot of them :-)
Published by illuhad over 3 years ago
-- This release is dedicated to the memory of Oliver M. Some things just end too soon.
- New `hipSYCL_retarget` command group property: execute an operation submitted to a queue on an arbitrary device instead of the one the queue is bound to.
- New `hipSYCL_prefer_group_size<Dim>` command group property: provides a recommendation to hipSYCL which group size to choose for basic parallel for kernels.
- New `hipSYCL_prefer_execution_lane` command group property: provides a hint to the runtime on which backend queue (e.g. CUDA stream) an operation should be executed. This can be used to optimize kernel concurrency or the overlap of data transfers and compute in case the hipSYCL scheduler does not already automatically submit an optimal configuration.
- Interoperability between `buffer` objects and USM: turn any `buffer` into a collection of USM pointers, as well as construct `buffer` objects on top of existing USM pointers.
- The `hipSYCL_page_size` buffer property can be used to enable data state tracking inside a buffer at a granularity below the buffer size. This can be used to allow multiple kernels to concurrently write to the same buffer as long as they access different hipSYCL data management pages. Unlike subbuffers, this also works with multi-dimensional strided memory accesses.
- `sycl::mem_advise()` as a free function
- `handler::prefetch_host()` and `queue::prefetch_host()` for a simpler mechanism of prefetching USM allocations to host memory.
- `auto v = sycl::make_async_view(ptr, range)` constructs a buffer that operates directly on the input pointer and does not block in the destructor.
- The `HIPSYCL_VISIBLITY_MASK` environment variable can be used to select which backends should be loaded.

See https://github.com/illuhad/hipSYCL/blob/develop/doc/extensions.md for a list of all hipSYCL extensions with more details.

- `host_accessor`
- `queue::wait()`

Yes, a lot of them!
Published by illuhad almost 4 years ago
hipSYCL 0.9 is packed with tons of new features compared to the older 0.8 series:
hipSYCL 0.9.0 introduces support for several key SYCL 2020 features, including:
There are two new extensions in hipSYCL 0.9.0:
hipSYCL 0.9.0 is the first release containing the entirely rewritten, brand new runtime library, which includes features such as:
- A single runtime library (`libhipSYCL-rt`) instead of libraries for each backend (`libhipSYCL_cpu`, `libhipSYCL_cuda` etc.)
- Building hipSYCL with `syclcc` is no longer required: only the runtime needs to be compiled, which can be done with any regular C++ compiler. This should simplify the build process greatly.
- No matter how many `sycl::queue`s exist, compute/memory-overlap always works equally well. This means a `sycl::queue` is now nothing more than an interface to the runtime.

syclcc and compilation improvements:

- A new `--hipsycl-targets` flag that allows compiling for multiple targets and backends; e.g. `syclcc --hipsycl-targets="omp;hip:gfx906,gfx900"` compiles for the OpenMP backend as well as for Vega 10 and Vega 20. Note that simultaneous compilation for both NVIDIA and AMD GPUs is not supported due to clang limitations.
- A configuration file (`syclcc.json`), giving the user more control to adapt the compilation flow to individual requirements. This can be helpful for uncommon setup scenarios where different flags may be required.
- Accelerated `nd_range` parallel for on CPU, bringing several orders of magnitude of performance improvement. Note that `nd_range` parallel for is inherently difficult to implement in library-only CPU backends, and basic parallel for or our scoped parallelism extension should be preferred if possible.

Yes, a lot of them :-)
Published by illuhad about 5 years ago
Note: hipSYCL 0.8.0 is deprecated; users are encouraged to use our package repositories instead.
This is the release of hipSYCL 0.8.0. We provide the following packages:
While we cannot provide matching CUDA packages for NVIDIA support due to legal reasons, scripts for installing a matching CUDA distribution as well as scripts to generate CUDA packages are provided. You will find further information in the readme here on github.
At the moment, Arch Linux, CentOS 7 and Ubuntu 18.04 packages are provided.
Published by illuhad about 5 years ago
This is a prerelease of hipSYCL 0.8.0. In particular, it serves to test new packages of the entire hipSYCL stack. We provide the following packages:
- `hipSYCL-base` provides the basic LLVM compiler stack that is needed in any case
- `hipSYCL-rocm` provides a compatible ROCm stack that additionally allows hipSYCL to target AMD GPUs
- `hipSYCL` provides the actual hipSYCL libraries, tools and headers

While we cannot provide matching CUDA packages due to legal reasons, CUDA installation scripts will be provided for the actual hipSYCL 0.8.0 release.
At the moment, Arch Linux and Ubuntu 18.04 packages are provided.