Examples of using Perl to augment NASM and vice versa
This is probably one of the things that should never be allowed to exist, but why not use Perl and its capabilities to inline foreign code to FAFO with assembly without a build system? Everything in a single file! In the process one may find ways to use Perl to enhance NASM and vice versa. But for now, I make no such claims: I am just using the perlAssembly git repo to illustrate how one can use Perl to drive (and learn to code!) assembly programs from a single file.
Simple integer addition in Perl - this is the Hello World version of this git repo
Explore multiple equivalent ways to add large arrays of short integers (-100 to 100 in this implementation) in Perl.
Scenarios marked w_alloc allocate memory on each iteration to test the speed of pack; those marked wo_alloc use a pre-computed data structure to pass the array to the underlying code. Benchmarks of the first scenario give the true cost of offloading the summation of a Perl array to a given function when the source data are in Perl. Timing the second scenario benchmarks the speed of the underlying implementation.
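To make the two scenarios concrete, here is a minimal C sketch of the idea; it is not the repo's Perl code. The names `pack_as_doubles`, `sum_w_alloc` and `sum_wo_alloc` are hypothetical, and the allocate-and-convert step only stands in for what Perl's pack does before the data reach the summation code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for Perl's pack: copy short integers into a
   freshly allocated buffer of doubles. The caller frees the result. */
static double *pack_as_doubles(const short *values, size_t length) {
    double *buffer = malloc(length * sizeof *buffer);
    assert(buffer != NULL || length == 0);
    for (size_t i = 0; i < length; i++) {
        buffer[i] = (double) values[i];
    }
    return buffer;
}

/* Sum a buffer that is already in the layout the kernel expects. */
static double sum_doubles(const double *buffer, size_t length) {
    double sum = 0.0;
    for (size_t i = 0; i < length; i++) {
        sum += buffer[i];
    }
    return sum;
}

/* w_alloc: every call pays for allocation and conversion, then sums. */
double sum_w_alloc(const short *values, size_t length) {
    double *buffer = pack_as_doubles(values, length);
    double sum = sum_doubles(buffer, length);
    free(buffer);
    return sum;
}

/* wo_alloc: the buffer was packed once, outside the timed region,
   so only the summation itself is measured. */
double sum_wo_alloc(const double *prepacked, size_t length) {
    return sum_doubles(prepacked, length);
}
```

Timing `sum_w_alloc` in a loop corresponds to the w_alloc rows, while timing `sum_wo_alloc` on a buffer prepared once corresponds to the wo_alloc rows.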
The script illustrates
Those were obtained on the i7 with the following topology
And here are the timings!
Benchmark | mean | median | stddev |
---|---|---|---|
ASM_blank | 2.3e-06 | 2.0e-06 | 1.1e-06 |
ASM_doubles_AVX_w_alloc | 3.6e-03 | 3.5e-03 | 4.2e-04 |
ASM_doubles_AVX_wo_alloc | 3.0e-04 | 2.9e-04 | 2.7e-05 |
ASM_doubles_w_alloc | 4.3e-03 | 4.1e-03 | 4.5e-04 |
ASM_doubles_wo_alloc | 8.9e-04 | 8.7e-04 | 3.0e-05 |
ASM_w_alloc | 4.3e-03 | 4.2e-03 | 4.5e-04 |
ASM_wo_alloc | 9.2e-04 | 9.1e-04 | 4.1e-05 |
ForLoop | 1.9e-02 | 1.9e-02 | 2.6e-04 |
ListUtil | 4.5e-03 | 4.5e-03 | 1.4e-04 |
PDL_w_alloc | 2.1e-02 | 2.1e-02 | 6.7e-04 |
PDL_wo_alloc | 9.2e-04 | 9.0e-04 | 3.9e-05 |
Let's say we wanted to do this toy experiment in pure C (using Inline::C, of course!). This code obtains the integers as a packed "string" of doubles and forms the sum in C:
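
    /* Sum `length` doubles stored back to back in the packed byte string. */
    double sum_array_C(char *array_in, size_t length) {
        double sum = 0.0;
        double *array = (double *) array_in;  /* reinterpret the bytes as doubles */
        for (size_t i = 0; i < length; i++) {
            sum += array[i];
        }
        return sum;
    }
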
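The function above reinterprets the packed bytes by casting the pointer, which relies on the buffer being suitably aligned for a `double`. As a hedged alternative, the sketch below reads each value with `memcpy` and so makes no assumption about alignment. The name `sum_packed_doubles` is hypothetical and is not part of the repo.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical variant of sum_array_C: copy each double out of the packed
   byte string instead of casting the pointer, so a misaligned buffer is
   still read correctly. */
double sum_packed_doubles(const char *packed, size_t length) {
    assert(packed != NULL || length == 0);
    double sum = 0.0;
    for (size_t i = 0; i < length; i++) {
        double value;
        memcpy(&value, packed + i * sizeof value, sizeof value);
        sum += value;
    }
    return sum;
}
```

Whether the extra copy costs anything measurable depends on the compiler; this variant was not part of the timings reported here.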
Here are the timing results:
Benchmark | mean | median | stddev |
---|---|---|---|
C_doubles_w_alloc | 4.1e-03 | 4.1e-03 | 2.3e-04 |
C_doubles_wo_alloc | 9.0e-04 | 8.7e-04 | 4.6e-05 |
What if we used SIMD directives and parallel loop constructs in OpenMP? This was done in the file addArrayOfIntegers_C.pl. All three combinations were tested, i.e. SIMD directives alone (the C equivalent of the AVX code), OpenMP parallel loop threads and SIMD+OpenMP. Here are the timings!
Benchmark | mean | median | stddev |
---|---|---|---|
C_OMP_w_alloc | 4.0e-03 | 3.7e-03 | 1.4e-03 |
C_OMP_wo_alloc | 3.1e-04 | 2.3e-04 | 9.5e-04 |
C_SIMD_OMP_w_alloc | 4.0e-03 | 3.8e-03 | 8.6e-04 |
C_SIMD_OMP_wo_alloc | 3.1e-04 | 2.5e-04 | 8.5e-04 |
C_SIMD_w_alloc | 4.1e-03 | 4.0e-03 | 2.4e-04 |
C_SIMD_wo_alloc | 5.0e-04 | 5.0e-04 | 8.9e-05 |
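The three combinations timed above can be sketched with standard OpenMP pragmas as follows. This is an illustration under assumptions, not a copy of addArrayOfIntegers_C.pl: the function names `sum_simd`, `sum_omp` and `sum_simd_omp` are hypothetical, and the pragmas only take effect when the compiler is asked to honor OpenMP (for example with `-fopenmp` in GCC or Clang); otherwise the loops run serially.

```c
#include <assert.h>
#include <stddef.h>

/* SIMD only: ask the compiler to vectorize the reduction in one thread. */
double sum_simd(const double *array, size_t length) {
    assert(array != NULL || length == 0);
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < length; i++) {
        sum += array[i];
    }
    return sum;
}

/* Threads only: split the loop across threads, each with a private
   partial sum that OpenMP combines at the end. */
double sum_omp(const double *array, size_t length) {
    assert(array != NULL || length == 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < length; i++) {
        sum += array[i];
    }
    return sum;
}

/* Threads plus SIMD: each thread vectorizes its own chunk of the loop. */
double sum_simd_omp(const double *array, size_t length) {
    assert(array != NULL || length == 0);
    double sum = 0.0;
    #pragma omp parallel for simd reduction(+:sum)
    for (size_t i = 0; i < length; i++) {
        sum += array[i];
    }
    return sum;
}
```

A reduction changes the order in which the additions happen, so for general floating-point data the three functions may differ in the last bits; for small integers stored as doubles, as in this experiment, the sums are exact.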
The code here is NOT meant to be portable. I code on Linux and on x86-64, so if you are looking for the Windows ABI or ARM, you will be disappointed. But as my knowledge of ARM assembly grows, I intend to rewrite some examples in ARM assembly!