perlAssembly

This is probably one of the things that should never be allowed to exist, but why not use Perl and its capabilities to inline foreign code, to FAFO with assembly without a build system? Everything in a single file! In the process one may find ways to use Perl to enhance NASM and vice versa. But for now, I make no such claims : I am just using the perlAssembly git repo to illustrate how one can use Perl to drive (and learn to code!) assembly programs from a single file.

x86-64 examples

Adding Two Integers

Script: addIntegers.pl

Simple integer addition in Perl - this is the Hello World version of this git repo

The sum of an array of integers

Scripts: addArrayofIntegers.pl & addArrayOfIntegers_C.pl

Explore multiple equivalent ways to add large arrays of short integers (-100 to 100 in this implementat) in Perl:

ASM_blank : tests the speed of calling ASM from Perl (no computations are done)
ASM : passes the integers as bytes and then uses conversion operations and scalar floating point addition
ASM_doubles : passes the array as a packed string of doubles and do scalar double floating addition in assembly
ASM_doubles_AVX: passes the array as a packed string of doubles and do packed floating point addition in assembly
ForLoop : standard for loop in Perl
ListUtil: sum function from list utilities
PDL : uses summation in PDL

Scenarios w_alloc : allocate memory for each iteration to test the speed of pack, those marked as wo_alloc, use a pre-computed data structure to pass the array to the underlying code. Benchmarks of the first scenario give the true cost of offloading summation to of a Perl array to a given function when the source data are in Perl. Timing the second scenario benchmarks speed of the underlying implementation.

The script illustrates

an important (but not the only one!) strategy to create a data structure
that is suitable for Assembly to work with, i.e. a standard array of the appropriate type,
in which one element is laid adjacent to the previous one in memory
the emulation of declaring a pointer as constant in the interface of a C function. In the
AVX code, we don't FAFO with the pointer (RSI in the calling convention) to the array directly,
but first load its address to another register that we manipulate at will.

Results

Those were obtained on the i7 with the following topology

And here are the timings!

	mean	median	stddev
ASM_blank	2.3e-06	2.0e-06	1.1e-06
ASM_doubles_AVX_w_alloc	3.6e-03	3.5e-03	4.2e-04
ASM_doubles_AVX_wo_alloc	3.0e-04	2.9e-04	2.7e-05
ASM_doubles_w_alloc	4.3e-03	4.1e-03	4.5e-04
ASM_doubles_wo_alloc	8.9e-04	8.7e-04	3.0e-05
ASM_w_alloc	4.3e-03	4.2e-03	4.5e-04
ASM_wo_alloc	9.2e-04	9.1e-04	4.1e-05
ForLoop	1.9e-02	1.9e-02	2.6e-04
ListUtil	4.5e-03	4.5e-03	1.4e-04
PDL_w_alloc	2.1e-02	2.1e-02	6.7e-04
PDL_wo_alloc	9.2e-04	9.0e-04	3.9e-05

Let's say we wanted to do this toy experiment in pure C (using Inline::C of course!) This code obtains the integers as a packed "string" of doubles and forms the sum in C

double sum_array_C(char *array_in, size_t length) {
    double sum = 0.0;
    double * array = (double *) array_in;
    for (size_t i = 0; i < length; i++) {
        sum += array[i];
    }
    return sum;
}

Here are the timing results:

	mean	median	stddev
C_doubles_w_alloc	4.1e-03	4.1e-03	2.3e-04
C_doubles_wo_alloc	9.0e-04	8.7e-04	4.6e-05

What if we used SIMD directives and parallel loop constructs in OpenMP? This was done in the file addArrayOfIntegers_C.pl. All three combinations were tested, i.e. SIMD directives alone (the C equivalent of the AVX code), OpenMP parallel loop threads and SIMD+OpenMP. Here are the timings!

	mean	median	stddev
C_OMP_w_alloc	4.0e-03	3.7e-03	1.4e-03
C_OMP_wo_alloc	3.1e-04	2.3e-04	9.5e-04
C_SIMD_OMP_w_alloc	4.0e-03	3.8e-03	8.6e-04
C_SIMD_OMP_wo_alloc	3.1e-04	2.5e-04	8.5e-04
C_SIMD_w_alloc	4.1e-03	4.0e-03	2.4e-04
C_SIMD_wo_alloc	5.0e-04	5.0e-04	8.9e-05

Discussion of the sum of an array of integers example

For calculations such as this, the price that must be paid is all in memory currency: it
takes time to generate these large arrays, and for code with low arithmetic intensity this
time dominates the numeric calculation time.
Look how insanely effective sum in List::Util is : even though it has to walk the Perl
array whose elements (the doubles, not the AV*) are not stored in a contiguous area in memory,
it is no more than 3x slower than the equivalent C code C_doubles_wo_alloc.
Look how optimized PDL is compared to the C code in the scenario without memory allocation.
Manual SIMD coded in assembly is 40% faster than the equivalent SIMD code in OpenMP (but it is
much more painful to write)
The threaded OpenMP version achieved equivalent performance to the single thread AVX assembly
programs, with no obvious improvement from combining SIMD+parallel loop for pragmas in OpenMP.
For the example considered here, it thus makes ZERO senso to offload a calculation as simple as a
summation because ListUtil is already within 15% of the assembly solution (at a latter iteration
we will also test AVX2 and AVX512 packed addition to see if we can improve the results).
If however, one was managing the array, not as a Perl array, but as an area in memory through
a Perl object, then one COULD consider offloading. It may be fun to consider an example in
which one adds the output of a function that has an efficient PDL and assembly implementation
to see how the calculus changes (in the to-do list for now).

Disclaimer

The code here is NOT meant to be portable. I code in Linux and in x86-64, so if you are looking into Window's ABI or ARM, you will be disappointed. But as my knowledge of ARM assembly grows, I intend to rewrite some examples in Arm assembly!