Advent of Numba

Solutions to Advent of Code using Numba. Some notes on the solutions:

  • Numba 0.52 is required to run the solutions.
    • I may use features that only appear in the master branch in later days.
  • I will try to use CUDA for each solution.
  • Most solutions will be whatever is easiest for me to write.
    • This might mean a lot of brute force, given the capabilities of a GPU and
      the low effort needed to invent brute-force solutions.
    • The solutions will probably not be close to performance-optimal.
  • I will try to demonstrate something "interesting" about the CUDA target for
    each solution (a small atomics sketch follows this list), e.g.:
    • Atomic operations
    • Cooperative grids
    • etc.
  • I will try to annotate each solution to explain to a beginner the rationale
    behind the implementation.
  • I will probably fall a few days behind.
  • I am not optimistic about finishing all 24 days.
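
For a flavour of what those "interesting" features look like in a kernel, here is a
minimal, standalone sketch (not one of the actual solutions) that uses an atomic add
so that all threads can safely accumulate into a single output element:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def atomic_sum(values, result):
        i = cuda.grid(1)
        if i < values.size:
            # Atomic add avoids a race when every thread updates the same element.
            cuda.atomic.add(result, 0, values[i])

    values = np.arange(1000, dtype=np.float32)
    result = np.zeros(1, dtype=np.float32)
    atomic_sum[4, 256](values, result)  # 4 blocks of 256 threads covers 1000 elements
    print(result[0])                    # 499500.0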

Please direct comments / questions / criticisms / veneration to: @gmarkall.

Solutions

Links to solutions and some interesting features of them:

  • Day 1: 2D / 3D grids, atomic exchange for stores.
  • Day 2: Atomic increment, structured arrays.
  • Day 3: Building reduction kernels with @cuda.reduce, and
    host-to-device transfers to elide unnecessary copying (a minimal
    @cuda.reduce sketch appears after this list).
  • Day 4: I didn't finish doing this on the GPU.
  • Day 5: Cooperative Groups (grid group / grid sync) and
    device functions.
  • Day 6: Python solution only so far. Will need to
    revisit it to complete a CUDA implementation - should be doable, but I'm
    under time constraints.
  • Day 7: Python solution only so far. Will probably not
    do a CUDA implementation of this one as it doesn't easily map to a GPU.
  • Day 8: Sharing the core computation of an implementation
    on both the CPU and GPU targets by calling an @njit function from a
    @cuda.jit function (see the sketch after this list).
  • Day 9: Demonstrates some changes and workarounds needed when
    porting pure Python code to the CUDA target, e.g. involving lists, array
    slicing, and array functions.
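
To illustrate the @cuda.reduce mechanism mentioned for Day 3, here is a minimal
sketch (not the Day 3 code itself): the decorator turns a binary operation into a
full GPU array reduction.

    import numpy as np
    from numba import cuda

    @cuda.reduce
    def sum_reduce(a, b):
        # The binary operation; cuda.reduce builds the reduction around it.
        return a + b

    values = np.arange(1, 101, dtype=np.int64)
    print(sum_reduce(values))  # 5050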
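
To illustrate the Day 8 idea of sharing core logic between targets, here is a
minimal sketch (the real solution's computation differs): an @njit function can be
called both from host code and from inside a @cuda.jit kernel.

    import numpy as np
    from numba import cuda, njit

    @njit
    def core(x):
        # Shared computation: compiled for the CPU by @njit, and also compiled
        # as a device function when called from the kernel below.
        return x * x + 1

    @cuda.jit
    def apply_core(arr, out):
        i = cuda.grid(1)
        if i < arr.size:
            out[i] = core(arr[i])

    arr = np.arange(8, dtype=np.int64)
    out = np.zeros_like(arr)
    apply_core[1, 8](arr, out)
    print(out)       # result computed on the GPU
    print(core(4))   # the same function called on the CPU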

Other approaches

Notes

I'm using this section to collect thoughts I have, whilst working on the solutions, about improving the usability and accessibility of Numba and the CUDA target.

Nice-to-haves:

  • Ability to call atomic.inc without specifying a maximum, e.g.
    cuda.atomic.inc(arr, idx) (day 2). A sketch of the current workaround
    appears after this list.
  • The ability to return things from kernels (every day).
    • Kernel launches are asynchronous, so this could return a future.
    • Alternatively, allow an optional blocking launch to directly return the
      result.
  • A library of small sort functions (day 4).
    • E.g. a function for a block to cooperatively sort a small array,
    • A whole-grid sort for larger arrays,
    • etc.
  • Better string op support (day 2).
    • E.g. allow passing strings or arrays of bytes to kernels.
    • Lots of the lowering for string operations is missing in CUDA (but it is
      probably present for nopython mode).
  • Support for a better print, for "printf debugging" (all days). A sketch of
    what print() in a kernel can already do appears after this list.
    • There is a printf-like function somewhere (in libdevice?) that can format
      strings, which could be used.
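
As context for the atomic.inc item: a minimal sketch of what the call looks like
today. The third argument (the wrap-around value) is required, so passing the
largest value of the counter's dtype effectively means "never wrap".

    import numpy as np
    from numba import cuda

    WRAP = np.uint32(0xFFFFFFFF)  # largest uint32, so the counter never wraps

    @cuda.jit
    def count_matches(chars, target, counter):
        i = cuda.grid(1)
        if i < chars.size and chars[i] == target:
            # atomic.inc currently requires the wrap-around value as its third
            # argument; the wish above is to be able to omit it.
            cuda.atomic.inc(counter, 0, WRAP)

    chars = np.frombuffer(b"abcabcaaa", dtype=np.uint8).copy()
    counter = np.zeros(1, dtype=np.uint32)
    count_matches[1, 32](chars, ord("a"), counter)
    print(counter[0])  # 5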
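
For the "printf debugging" item: the CUDA target does already support a basic
print() inside kernels (constant strings and scalar arguments, with limited
formatting); the wish is for something richer. A minimal sketch of what works today:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def debug_kernel(arr):
        i = cuda.grid(1)
        if i < arr.size:
            # Basic print() works in kernels, but formatting is limited compared
            # to a full printf.
            print("thread", i, "value", arr[i])

    debug_kernel[1, 4](np.arange(4, dtype=np.int32))
    cuda.synchronize()  # ensure device-side output is flushed before exit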