DaCe - Data Centric Parallel Programming
BSD-3-Clause License
- mktemp by @fazledyn-or in https://github.com/spcl/dace/pull/1428
- auto_optimizer() never took effect, by @philip-paul-mueller in https://github.com/spcl/dace/pull/1410
- to_sdfg() ignores the auto_optimize flag (Issue #1380), by @philip-paul-mueller in https://github.com/spcl/dace/pull/1395
- SDFG.arg_names was not a member but a class variable, by @philip-paul-mueller in https://github.com/spcl/dace/pull/1457

Full Changelog: https://github.com/spcl/dace/compare/v0.15...v0.15.1rc1
Published by tbennun about 1 year ago
A new analysis engine allows SDFGs to be statically analyzed for work and depth / average parallelism. The analysis allows specifying a series of assumptions about symbolic program parameters that can help simplify and improve the analysis results. The following example shows how to use it:
from dace.sdfg.work_depth_analysis import work_depth
# A dictionary mapping each SDFG element to a tuple (work, depth)
work_depth_map = {}
# Assumptions about symbolic parameters
assumptions = ['N>5', 'M<200', 'K>N']
work_depth.analyze_sdfg(mysdfg, work_depth_map, work_depth.get_tasklet_work_depth, assumptions)
# A dictionary mapping each SDFG element to its average parallelism
average_parallelism_map = {}
work_depth.analyze_sdfg(mysdfg, average_parallelism_map, work_depth.get_tasklet_avg_par, assumptions)
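As intuition for what the analysis computes, here is a small hand calculation in plain Python (not the DaCe API; the function name is illustrative) of work, depth, and average parallelism for a program that maps over n elements and then sum-reduces them in a tree:

```python
# Work = total number of operations, depth = longest dependency chain,
# average parallelism = work / depth. For a map over n elements followed
# by a tree reduction, the map contributes n independent ops at depth 1,
# and the reduction contributes n - 1 ops at depth ceil(log2(n)).
import math

def work_depth_of_map_then_reduce(n: int):
    map_work, map_depth = n, 1                            # fully parallel map
    red_work, red_depth = n - 1, math.ceil(math.log2(n))  # tree reduction
    work = map_work + red_work
    depth = map_depth + red_depth
    return work, depth, work / depth

work, depth, avg_par = work_depth_of_map_then_reduce(1024)
print(work, depth)  # prints 2047 11
```

A large average parallelism (here about 186) indicates the program is dominated by parallel work rather than sequential chains.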
To improve our integration with external codes, we limit the symbolic parameters generated by DaCe to only the used symbols. Take the following code for example:
@dace
def addone(a: dace.float64[N]):
    for i in dace.map[0:10]:
        a[i] += 1
Since the internal code does not actually need N to process the array, it will not appear in the generated code. Before this release, the signature of the generated code would be:
DACE_EXPORTED void __program_addone(addone_t *__state, double * __restrict__ a, int N);
After this release it is:
DACE_EXPORTED void __program_addone(addone_t *__state, double * __restrict__ a);
Note that this is a major, breaking change: users who manually interact with the generated .so files will have to adapt their code.
A new allocation lifetime, dace.AllocationLifetime.External, has been introduced into DaCe. Now you can use your DaCe code with external memory allocators (such as PyTorch) and ask DaCe for: (a) how much transient memory it will need; and (b) to use a specific pre-allocated pointer. Example:
@dace
def some_workspace(a: dace.float64[N]):
    workspace = dace.ndarray([N], dace.float64, lifetime=dace.AllocationLifetime.External)
    workspace[:] = a
    workspace += 1
    a[:] = workspace
csdfg = some_workspace.to_sdfg().compile()
sizes = csdfg.get_workspace_sizes() # Returns {dace.StorageType.CPU_Heap: N*8}
wsp = # ...Allocate externally...
csdfg.set_workspace(dace.StorageType.CPU_Heap, wsp)
The same interface is available in the generated code:
size_t __dace_get_external_memory_size_CPU_Heap(programname_t *__state, int N);
void __dace_set_external_memory_CPU_Heap(programname_t *__state, char *ptr, int N);
// or GPU_Global...
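The query-then-set protocol above can be sketched in plain Python. The class and method names below mirror the interface shown, but this is an illustrative mock of the contract, not DaCe's implementation:

```python
# Mock of the external-memory protocol: first query the required bytes per
# storage type, then hand over a pre-allocated buffer of at least that size.
# Class and storage names are illustrative only.
class ExternalMemoryProgram:
    def __init__(self, n: int):
        self.n = n
        self.buffers = {}

    def get_workspace_sizes(self):
        # e.g., one double-precision workspace array of length n (8 bytes each)
        return {'CPU_Heap': self.n * 8}

    def set_workspace(self, storage: str, buf: bytearray):
        needed = self.get_workspace_sizes()[storage]
        if len(buf) < needed:
            raise ValueError(f'buffer too small: {len(buf)} < {needed}')
        self.buffers[storage] = buf

prog = ExternalMemoryProgram(20)
sizes = prog.get_workspace_sizes()                    # {'CPU_Heap': 160}
prog.set_workspace('CPU_Heap', bytearray(sizes['CPU_Heap']))
```

The key design point is that allocation stays entirely on the caller's side; the program only validates and records the pointer it is given.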
A new experimental feature allows you to analyze your SDFGs in a schedule-oriented format. It takes in SDFGs (even after applying transformations) and outputs a tree of elements that can be printed out in a Python-like syntax. For example:
@dace.program
def matmul(A: dace.float32[10, 10], B: dace.float32[10, 10], C: dace.float32[10, 10]):
    for i in range(10):
        for j in dace.map[0:10]:
            atile = dace.define_local([10], dace.float32)
            atile[:] = A[i]
            for k in range(10):
                with dace.tasklet:
                    # ...
sdfg = matmul.to_sdfg()
from dace.sdfg.analysis.schedule_tree.sdfg_to_tree import as_schedule_tree
stree = as_schedule_tree(sdfg)
print(stree.as_string())
will print:
for i = 0; (i < 10); i = i + 1:
  map j in [0:10]:
    atile = copy A[i, 0:10]
    for k = 0; (k < 10); k = (k + 1):
      C[i, j] = tasklet(atile[k], B(10) [k, j], C[i, j])
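As a rough sketch of how such a tree can be rendered, the following illustrative node class (not DaCe's schedule tree API) prints nested scopes with two-space indentation, similar to the output above:

```python
# Tiny schedule-tree renderer: each scope node indents its children,
# producing Python-like output. The Node class is illustrative only.
class Node:
    def __init__(self, header, children=None):
        self.header = header
        self.children = children or []

    def as_string(self, indent=0):
        lines = ['  ' * indent + self.header]
        for child in self.children:
            lines.append(child.as_string(indent + 1))
        return '\n'.join(lines)

tree = Node('for i = 0; (i < 10); i = i + 1:', [
    Node('map j in [0:10]:', [
        Node('atile = copy A[i, 0:10]'),
    ]),
])
print(tree.as_string())
```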
There are some new transformation classes and passes in dace.sdfg.analysis.schedule_tree.passes, for example, to remove empty control flow scopes:
class RemoveEmptyScopes(tn.ScheduleNodeTransformer):
    def visit_scope(self, node: tn.ScheduleTreeScope):
        if len(node.children) == 0:
            return None
        return self.generic_visit(node)
We hope you find new ways to analyze and optimize DaCe programs with this feature!
- CPU_Persistent map schedule (OpenMP parallel regions) by @tbennun in #1330
- array_equal of empty arrays by @tbennun in #1374

Full Changelog: https://github.com/spcl/dace/compare/v0.14.4...v0.15
Published by tbennun over 1 year ago
Minor release; adds support for Python 3.11.
Published by phschaad over 1 year ago
The schedule type of a scope (e.g., a Map) is now also determined by the surrounding storage. If the surrounding storage is ambiguous, DaCe will fail with a descriptive exception. This means that code such as the one below:
@dace.program
def add(a: dace.float32[10, 10] @ dace.StorageType.GPU_Global,
        b: dace.float32[10, 10] @ dace.StorageType.GPU_Global):
    return a + b @ b
will now automatically run the + and @ operators on the GPU.
(#1262 by @tbennun)
Easier interface for profiling applications: dace.profile and dace.instrument can now be used within Python with a simple API:
with dace.profile(repetitions=100) as profiler:
    some_program(...)
    # ...
    other_program(...)

# Print all execution times of the last called program (other_program)
print(profiler.times[-1])
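Conceptually, the profiler is an object that runs each call repeatedly and records elapsed wall-clock times. The following is a minimal plain-Python sketch of that idea with illustrative names; it is not DaCe's implementation (which hooks program calls made inside the with block automatically):

```python
# Minimal sketch in the spirit of dace.profile: run a callable several
# times and record elapsed wall-clock times per profiled call.
import time

class SimpleProfiler:
    def __init__(self, repetitions: int = 100):
        self.repetitions = repetitions
        self.times = []  # one list of runtimes per profiled call

    def profile(self, func, *args, **kwargs):
        runs = []
        for _ in range(self.repetitions):
            start = time.perf_counter()
            func(*args, **kwargs)
            runs.append(time.perf_counter() - start)
        self.times.append(runs)

profiler = SimpleProfiler(repetitions=10)
profiler.profile(sum, range(1000))
profiler.profile(sorted, [3, 1, 2])
print(len(profiler.times[-1]))  # prints 10: recorded runs of the last call
```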
Where instrumentation is applied can be controlled with filters in the form of strings and wildcards, or with a function:
with dace.instrument(dace.InstrumentationType.GPU_Events,
                     filter='*add??') as profiler:
    some_program(...)
    # ...
    other_program(...)

# Print instrumentation report for last call
print(profiler.reports[-1])
With dace.builtin_hooks.instrument_data, the same technique can be applied to instrument data containers.
(#1197 by @tbennun)
Data container instrumentation can further now be used conditionally, allowing saving and restoring of data container contents only if certain conditions are met. In addition to this, data instrumentation now saves the SDFG's symbol values at the time of dumping data, allowing an entire SDFG's state / context to be restored from data reports.
(#1202, #1208 by @phschaad)
Two new passes (ScalarFission and StrictSymbolSSA) allow fissioning of scalar data containers (or arrays of size 1) and symbols, respectively, into separate containers and symbols, based on the scope or reach of writes to them. This is a form of restricted SSA, which performs SSA wherever possible without introducing Phi-nodes. This change is made possible by a set of new analysis passes that provide the scope or reach of each write to scalars or symbols.
(#1198, #1214 by @phschaad)
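The core idea behind scalar fission can be sketched as classic SSA renaming: every write to the scalar starts a fresh container, and subsequent reads refer to the latest one. A minimal illustrative sketch over a toy statement list (not the DaCe pass, and using naive string substitution that would mishandle names merely containing the scalar's name):

```python
# Sketch of scalar fission as SSA renaming over (target, expression) pairs:
# each write to the scalar defines a new versioned container, and reads in
# later expressions refer to the most recent version. Illustrative only.
def fission_scalar(statements, scalar='s'):
    version = 0
    renamed = []
    for target, expr in statements:
        # Reads in the expression see the current version
        expr = expr.replace(scalar, f'{scalar}_{version}')
        if target == scalar:
            version += 1  # a write starts a new container
            target = f'{scalar}_{version}'
        renamed.append((target, expr))
    return renamed

prog = [('s', '1'), ('a', 's + 2'), ('s', 'a * 3'), ('b', 's - 1')]
print(fission_scalar(prog))
# [('s_1', '1'), ('a', 's_1 + 2'), ('s_2', 'a * 3'), ('b', 's_2 - 1')]
```

Because every version has exactly one write, the two lifetimes of s can now be placed in separate containers without interference.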
SDFG Cutouts can now be taken from more than one state. Additionally, taking cutouts that only access a subset of a data container (e.g., A[2:5] from a data container A of size N) results in the cutout receiving an "Alibi Node" to represent only that subset of the data (A_cutout[0:3] -> A[2:5], where A_cutout is of size 4). This allows cutouts to be significantly smaller and have a smaller memory footprint, simplifying debugging and localized optimization.
Finally, cutouts now contain an exact description of their input and output configuration. The input configuration is anything that may influence a cutout's behavior and may contain data before the cutout is executed in the context of the original SDFG. Similarly, the output configuration is anything that a cutout writes to, that may be read externally or may influence the behavior of the remaining SDFG. This allows isolating all side effects of changes to a particular cutout, allowing transformations to be tested and verified in isolation and simplifying debugging.
(#1201 by @phschaad)
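The input/output configuration can be approximated by a read/write-set sweep over the cutout's statements. The following is an illustrative sketch (not DaCe's implementation) that conservatively treats every write as part of the output configuration:

```python
# Compute a cutout's input/output configuration from per-statement read and
# write sets. Inputs: containers read before any write inside the cutout.
# Outputs: containers written inside the cutout (conservatively, all of
# them, since any write may be observed externally). Illustrative only.
def io_configuration(statements):
    written, inputs, outputs = set(), set(), set()
    for reads, writes in statements:
        inputs |= set(reads) - written  # read before any internal write
        written |= set(writes)
        outputs |= set(writes)
    return inputs, outputs

# Example cutout: tmp = A + 1; B = tmp * 2
cut = [({'A'}, {'tmp'}), ({'tmp'}, {'B'})]
print(io_configuration(cut))  # ({'A'}, {'tmp', 'B'})
```

Note that tmp does not appear in the input configuration: it is fully defined inside the cutout, so changes to how it is computed stay isolated.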
unsqueeze_memlet fixes by @alexnick83 in https://github.com/spcl/dace/pull/1203
Full Changelog: https://github.com/spcl/dace/compare/v0.14.2...v0.14.3
Please let us know if there are any regressions with this new release.
Published by phschaad over 1 year ago
Full Changelog: https://github.com/spcl/dace/compare/v0.14.1...v0.14.2
Published by tbennun about 2 years ago
This release of DaCe offers mostly stability fixes for the Python frontend, transformations, and callbacks.
Full Changelog: https://github.com/spcl/dace/compare/v0.14...v0.14.1
Published by tbennun about 2 years ago
This release brings forth a major change to how SDFGs are simplified in DaCe, using the Simplify pass pipeline. This both improves the performance of DaCe's transformations and introduces new types of simplification, such as dead dataflow elimination.
Please let us know if there are any regressions with this new release.
- dace.constant type hint has now achieved stable status and was renamed to dace.compiletime
- ~/.dace.conf. The SDFG build folders still include the full configuration file. Old .dace.conf files are detected and migrated automatically.
- .instrument to dace.InstrumentationType.LIKWID_Counters
- mallocAsync API. To enable, set desc.pool = True on any GPU data descriptor.

import dace
from dace.dtypes import StorageType, ScheduleType

N = dace.symbol('N')

@dace
def add_on_gpu(a: dace.float64[N] @ StorageType.GPU_Global,
               b: dace.float64[N] @ StorageType.GPU_Global):
    # This map will become a GPU kernel
    for i in dace.map[0:N] @ ScheduleType.GPU_Device:
        b[i] = a[i] + 1.0
@dace
def optional(maybe: Optional[dace.float64[20]], always: dace.float64[20]):
    always += 1  # "always" is always used, so it will not be optional
    if maybe is None:  # This condition will stay in the code
        return 1
    if always is None:  # This condition will be eliminated in simplify
        return 2
    return 3
- "string") use in the Python frontend
- einsum is now a library node
- pip
- __array_interface__ objects by @gronerl in https://github.com/spcl/dace/pull/1071
Full Changelog: https://github.com/spcl/dace/compare/v0.13.3...v0.14
Published by tbennun over 2 years ago
- sdfg.view() inside a VSCode console or debug session will open the file directly in the editor!

Full Changelog: https://github.com/spcl/dace/compare/v0.13.2...v0.13.3
Published by tbennun over 2 years ago
- arange, round, etc.

def potentially_parsed_by_dace():
    if not dace.in_program():
        print('Called by Python interpreter!')
    else:
        print('Compiled with DaCe!')

sdfg.save('myprogram.sdfgz', compress=True)  # or just run gzip on your old SDFGs

- sdfg.view(8080) (or any other port)

Full Changelog: https://github.com/spcl/dace/compare/v0.13.1...v0.13.2
Published by tbennun over 2 years ago
- (StateFusion, RedundantSecondArray)

Full Changelog: https://github.com/spcl/dace/compare/v0.13...v0.13.1
Published by tbennun over 2 years ago
Cutout allows developers to take large DaCe programs and cut out subgraphs reliably to create a runnable sub-program. This sub-program can be then used to check for correctness, benchmark, and transform a part of a program without having to run the full application.
* Example usage from Python:
def my_method(sdfg: dace.SDFG, state: dace.SDFGState):
    nodes = [n for n in state if isinstance(n, dace.nodes.LibraryNode)]  # Cut every library node
    cut_sdfg: dace.SDFG = cutout.cutout_state(state, *nodes)
    # The cut SDFG now includes each library node and all the necessary arrays to call it with
Also available in the SDFG editor:
Just like node instrumentation for performance analysis, data instrumentation allows users to set access nodes to be saved to an instrumented data report, and loaded later for exact reproducible runs.
* Data instrumentation natively works with CPU and GPU global memory, so there is no need to copy data back
* Combined with Cutout, this is a powerful interface to perform local optimizations in large applications with ease!
* Example use:
@dace.program
def tester(A: dace.float64[20, 20]):
    tmp = A + 1
    return tmp + 5

sdfg = tester.to_sdfg()
for node, _ in sdfg.all_nodes_recursive():  # Instrument every access node
    if isinstance(node, nodes.AccessNode):
        node.instrument = dace.DataInstrumentationType.Save

A = np.random.rand(20, 20)
result = sdfg(A)

# Get instrumented data from report
dreport = sdfg.get_instrumented_data()
assert np.allclose(dreport['A'], A)
assert np.allclose(dreport['tmp'], A + 1)
assert np.allclose(dreport['__return'], A + 6)
SDFG elements can now be grouped by any criteria, and they will be colored during visualization by default (by @phschaad). See example in action:
- (sdfg.add_constant) can now be used as access nodes in SDFGs. The constants are hard-coded into the generated program, so you can run code with the best performance possible.
- views connector to disambiguate which access node is being viewed
- else clause is now handled in for and while loops
- __dace_init generated function signature (by @orausch)

Full Changelog available at https://github.com/spcl/dace/compare/v0.12...v0.13
Published by tbennun almost 3 years ago
Important: The pattern-matching transformation API has been significantly simplified. Transformations using the old API must be ported! Summary of changes:

- SingleStateTransformation or MultiStateTransformation classes instead of using decorators
- PatternNodes
- can_be_applied and apply directly using self.nodename
- strict is now replaced with permissive (False by default). Permissive mode allows transformations to match in more cases, but may be dangerous to apply (e.g., create race conditions).
- can_be_applied is now a method of the transformation
- apply method accepts a graph and the SDFG.

Example of using the new API:
import dace
from dace import nodes
from dace.sdfg import utils as sdutil
from dace.transformation import transformation as xf

class ExampleTransformation(xf.SingleStateTransformation):
    # Define pattern nodes
    map_entry = xf.PatternNode(nodes.MapEntry)
    access = xf.PatternNode(nodes.AccessNode)

    # Define matching subgraphs
    @classmethod
    def expressions(cls):
        # MapEntry -> Access
        return [sdutil.node_path_graph(cls.map_entry, cls.access)]

    def can_be_applied(self, graph: dace.SDFGState, expr_index: int, sdfg: dace.SDFG, permissive: bool = False) -> bool:
        # Returns True if the transformation can be applied on a subgraph
        if permissive:  # In permissive mode, we will always apply this transformation
            return True
        return self.map_entry.schedule == dace.ScheduleType.CPU_Multicore

    def apply(self, graph: dace.SDFGState, sdfg: dace.SDFG):
        # Apply the transformation using the SDFG API
        pass
- Simplifying SDFGs is renamed from sdfg.apply_strict_transformations() to sdfg.simplify()
- AccessNodes no longer have an AccessType field.
Full Changelog: https://github.com/spcl/dace/compare/v0.11.4...v0.12
Published by tbennun almost 3 years ago
- @dace.programs in JIT mode

Full Changelog: https://github.com/spcl/dace/compare/v0.11.3...v0.11.4
Published by tbennun almost 3 years ago
Full Changelog: https://github.com/spcl/dace/compare/v0.11.2...v0.11.3
Published by tbennun almost 3 years ago
Full Changelog: https://github.com/spcl/dace/compare/v0.11.1...v0.11.2
Published by tbennun about 3 years ago
@dace programs! Some examples:

- @dataclass and general object field support
- (dace.unroll generator)
- (dace.constant as a type hint)
- (dace.map, dace.tasklet) work in pure Python as well
- torch.tensor integration in @dace program arguments
- @dace.program(auto_optimize=True, device=dace.DeviceType.CPU) to automatically run some transformations, such as turning loops into parallel maps.

Miscellaneous:

- @dace programs

Full Changelog: https://github.com/spcl/dace/compare/v0.10.8...v0.11.1
Published by tbennun over 3 years ago
Published by tbennun about 4 years ago
- @dace.programs can now call other programs or SDFGs
- .instrument property for SDFG nodes and states enables easy-to-use, localized performance reporting with timers, GPU events, and PAPI performance counters
- SubgraphFusion
- (Vectorization) made more robust to corner cases
- sdfgcc to quickly compile and optimize .sdfg files from the command line, generating header and library files. Great for interoperability and Makefiles.
- @dace
- (einsum) and new properties added, enabling faster performance and more productive high-performance coding than ever.

Published by tbennun almost 5 years ago
- ReduceExpansion transformation.

Published by tbennun about 5 years ago

- @dace.program and create SDFGs from implicit dataflow.