Learning Rust: System V AMD64 ABI

Adrian Macal
Published in Level Up Coding · Nov 5, 2023 · 10 min read


Are you ready to leave Rust’s safety net to wield the raw, unbridled might of Assembly for coding marvels?

Rust enables us to do almost everything, from high-level programming to detailed bit manipulation — and it’s quite fast at that. However, Rust isn’t a panacea. Sometimes you need to step outside its domain and integrate it with code written in other languages. As programming in Rust is often viewed as low-level coding, developers tend to write performance-critical components in Rust and then integrate them into slower environments, like Python, using PyO3 bindings, for example.

In this article, I'll show you the contrary: how to integrate a Rust codebase with x86-64 assembly code to achieve incredible performance by utilizing the AVX-256 instruction set.

The System V AMD64 ABI is the calling convention used on Linux and other Unix-like systems for x86-64. It defines how calls work at the binary level: how parameters are passed, how the stack is managed, and which registers must be preserved across a call.

Suppose you wanted to add two numbers outside of Rust. You could declare an external function through the Foreign Function Interface (FFI) like so:

extern "C" {
fn sum_u8(a: u8, b: u8) -> u8;
}

This declaration informs the compiler that there exists some code capable of adding two u8 integers and returning another u8. It also specifies how to locate the code (by its name) and how to call it (using "C" linkage). The external code isn't compiled; it's merely linked when constructing the final executable.
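Because the compiler cannot verify that the foreign code actually upholds this signature, every call must go through an unsafe block. As a small addition to the article's snippet, a minimal call site might look like this:

fn main() {
    // Calling across the FFI boundary is unsafe: we vouch that the symbol
    // exists and matches the declared signature.
    let result = unsafe { sum_u8(2, 3) };
    println!("2 + 3 = {}", result);
}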

The assembly code that performs this sophisticated summation might look like this:

section .text
global sum_u8

sum_u8:
    movzx eax, dil
    add al, sil
    ret

This code declares that the sum_u8 symbol is globally exported and contains all the instructions needed to compute the sum. Unlike higher-level programming languages, assembly doesn't define a function signature. Instead, the signature is implicitly determined by the calling convention, and both the caller and the callee must adhere to it for the call to behave correctly.

To simplify things, let’s reduce our convention to the following rules:

  • Integer or pointer arguments are passed in the following registers in this exact order: RDI, RSI, RDX, RCX, R8, R9.
  • The return value is taken from RAX (for integers up to 64 bits).
  • Some registers are volatile and may be changed — or not restored — by the called code: RAX, RCX, RDX, RSI, RDI, R8, R9, R10, R11.

Looking back at our code snippet, we can decipher how parameters are passed and interpreted within the assembly code:

// Rust
extern "C" {
    // 'a' is passed in RDI
    // 'b' is passed in RSI
    // the result is returned in RAX
    fn sum_u8(a: u8, b: u8) -> u8;
}

; Assembly
section .text
global sum_u8

sum_u8:
    movzx eax, dil ; zero-extend the low 8 bits of RDI (DIL) into EAX
    add al, sil    ; add the low 8 bits of RSI (SIL) to AL
    ret            ; return; the caller reads the sum from AL (RAX)

Now, let’s leverage our newly acquired knowledge to tackle a more complex task. Suppose we need to transform an array of bytes into an array of bits (represented as bytes), such that the first bit corresponds to the least significant bit (LSB) of the first byte, and the final bit in the output array corresponds to the most significant bit (MSB) of the last byte. In Rust, we might approach this problem with the following code snippet:

let src: [u8; 8] = [15, 0, 255, 240, 240, 15, 0, 255];
let mut dst: [u8; 64] = [0; 64];

for index in 0..8 {
    for offset in 0..8 {
        dst[(index << 3) + offset] = if src[index] & (1 << offset) != 0 {
            1
        } else {
            0
        };
    }
}

println!("{:?}", &dst[0..32]);
println!("{:?}", &dst[32..64]);

This loop iterates over each byte and then over each of its bits, applying an AND mask to decide whether to store a 0 or a 1. If we run this program, the individual input values translate as follows:

  • For 0, it would display all zeros: [0, 0, 0, 0, 0, 0, 0, 0]
  • For 15, it would be transformed into: [1, 1, 1, 1, 0, 0, 0, 0]
  • For 240, it would be: [0, 0, 0, 0, 1, 1, 1, 1]
  • For 255, it would be all ones: [1, 1, 1, 1, 1, 1, 1, 1]

And the output would be:

[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

The code runs efficiently and is straightforward, but it can be optimized significantly with SIMD, especially if we are dealing with much larger datasets.

AVX-256, more precisely the set of 256-bit operations introduced with AVX and extended by AVX2, is a potent feature present in most of today's x86-64 processors. It provides 16 registers (YMM0 through YMM15), each 256 bits wide and further split into two 128-bit lanes. The extension introduces numerous instructions that can perform wonders with the data housed in these registers, though some may initially come across as complex, and certain instructions only operate within their lanes.
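Since not every CPU supports these instructions, a real program should guard the fast path. The check is a one-liner in Rust; the following snippet, my addition using the standard is_x86_feature_detected! macro, shows the idea:

fn main() {
    // is_x86_feature_detected! queries the CPU at runtime (x86/x86-64 only).
    if is_x86_feature_detected!("avx2") {
        println!("AVX2 available: the 256-bit path can be taken");
    } else {
        println!("AVX2 missing: fall back to the scalar loop");
    }
}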

Let’s conceptualize a method to transform our bit extraction code into a handful of AVX instructions. Consider processing 4 bytes simultaneously as we decode all 32 bits into an array of 32 bytes. Here’s a potential method:

  • Loading data (vpbroadcastd): transfer 4 bytes from 4-byte-aligned memory into the YMM0 register. The instruction replicates those 4 bytes 8 times across the 32-byte register, preparing them for the next step.
  • AND operation (vpand): AND the populated YMM0 register with a predefined 256-bit mask, so that each of the 32 result bytes isolates exactly one of the 32 input bits.
  • Shuffling bytes (vpshufb): rearrange the bytes within each 128-bit lane. This shuffle won't perfectly sort them, but it arranges the data adequately for the subsequent step.
  • Moving data blocks (vpermd): move the 32-bit sections between the 128-bit lanes into their correct positions.
  • Comparison and substitution (vpcmpeqb): compare each byte against zero, turning zero bytes into 255 and non-zero bytes into 0.
  • Increment and overflow (vpaddb): add one to each byte, wrapping 255 around to 0 and turning 0 into 1.
  • Data extraction (vmovdqa): finally, store the 32 bytes from the YMM0 register back into 32-byte-aligned memory.

By employing these steps, we move from shuffling individual bytes between registers to producing a complete 32-byte result with just a few AVX-256 instructions. With the standard x86 instruction set, the bit-by-bit approach costs roughly 17 instructions per byte, whereas the AVX version consolidates the work into merely 7 instructions for every 4 bytes. For readers more at home in Rust, the same pipeline can also be written with compiler intrinsics, as sketched below.
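The sketch below is my own intrinsics rendering of this seven-step pipeline, assuming an AVX2-capable CPU; the function name extract_bits_4 is hypothetical, and an unaligned store is used to sidestep the alignment requirement of the assembly version:

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn extract_bits_4(dst: *mut u8, src: *const u8) {
    use std::arch::x86_64::*;

    // One mask dword per bit index: every byte of dword j has only bit j set.
    let mask = _mm256_setr_epi32(
        0x01010101, 0x02020202, 0x04040404, 0x08080808,
        0x10101010, 0x20202020, 0x40404040, 0x80808080u32 as i32,
    );
    // Per-lane byte shuffle: gathers byte i of every dword next to each other.
    let shuffle = _mm256_setr_epi8(
        0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15,
        0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15,
    );
    // Cross-lane dword permutation interleaving low-bit and high-bit halves.
    let permute = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);

    let word = (src as *const i32).read_unaligned();
    let v = _mm256_set1_epi32(word);                      // vpbroadcastd
    let v = _mm256_and_si256(v, mask);                    // vpand
    let v = _mm256_shuffle_epi8(v, shuffle);              // vpshufb
    let v = _mm256_permutevar8x32_epi32(v, permute);      // vpermd
    let v = _mm256_cmpeq_epi8(v, _mm256_setzero_si256()); // vpcmpeqb
    let v = _mm256_add_epi8(v, _mm256_set1_epi8(1));      // vpaddb
    _mm256_storeu_si256(dst as *mut __m256i, v);          // 32 result bytes
}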

To incorporate assembly within Rust, the process is akin to what we previously applied for adding two bytes. The objective is to craft an assembly snippet that works in tandem with the Rust code calling an external function:

extern "C" {
fn extract_bits(dst: *mut u8, src: *const u8, count: usize);
}

#[repr(align(32))]
struct AlignedArray32<const T: usize>([u8; T]);

fn main() -> Result<(), Box<dyn std::error::Error>> {
let src = Box::new(AlignedArray32([15, 0, 255, 240, 240, 15, 0, 255]));
let mut dst = Box::new(AlignedArray32([0; 64]));

unsafe {
let src = src.0.as_ptr().add(0);
let dst = dst.0.as_mut_ptr().add(0);

extract_bits(dst, src, 8);
}

println!("{:?}", &dst.0[0..32]);
println!("{:?}", &dst.0[32..64]);

Ok(())
}

In the Rust main function, we ensure our data arrays are 32-byte aligned, yet the routine must also be robust against unaligned inputs. The assembly therefore handles any unaligned leading or trailing bytes in a scalar fallback. One subtlety: since every source byte expands to eight destination bytes, the routine implicitly assumes that the destination's misalignment mirrors the source's (the dst offset is 8 times the src offset relative to the 32-byte boundary), which the example below respects.

unsafe {
    let src = src.0.as_ptr().add(1);
    let dst = dst.0.as_mut_ptr().add(8);

    extract_bits(dst, src, 6);
}
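To make that contract harder to violate, the extern function could be wrapped in a safe Rust helper that validates the bounds and the alignment relationship before dropping into assembly. A minimal sketch, where the helper name and its checks are my own additions:

fn extract_bits_checked(dst: &mut [u8], src: &[u8]) {
    // Every source byte produces exactly 8 destination bytes.
    assert!(dst.len() >= src.len() * 8, "destination too small");

    // The aligned SIMD store requires that once the source pointer reaches
    // 4-byte alignment, the destination pointer reaches 32-byte alignment.
    let src_mis = src.as_ptr() as usize % 4;
    let dst_mis = dst.as_ptr() as usize % 32;
    assert_eq!(dst_mis, 8 * src_mis, "alignment invariant violated");

    unsafe { extract_bits(dst.as_mut_ptr(), src.as_ptr(), src.len()) };
}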

The assembly routine I've drafted handles unaligned heads and tails in scalar code, while the bulk is processed in aligned SIMD chunks, which is essential given its design for large inputs such as 32 KiB. The routine uses a few constants for the vector operations and looks like this:

section .rodata

align 32
and_256 dd 0b0000_0001_0000_0001_0000_0001_0000_0001 ; dword j selects bit j in every byte
        dd 0b0000_0010_0000_0010_0000_0010_0000_0010
        dd 0b0000_0100_0000_0100_0000_0100_0000_0100
        dd 0b0000_1000_0000_1000_0000_1000_0000_1000
        dd 0b0001_0000_0001_0000_0001_0000_0001_0000
        dd 0b0010_0000_0010_0000_0010_0000_0010_0000
        dd 0b0100_0000_0100_0000_0100_0000_0100_0000
        dd 0b1000_0000_1000_0000_1000_0000_1000_0000

align 32
shuffle_256 db 0, 4, 8, 12, 1, 5, 9, 13 ; in-lane byte shuffle control
            db 2, 6, 10, 14, 3, 7, 11, 15
            db 0, 4, 8, 12, 1, 5, 9, 13
            db 2, 6, 10, 14, 3, 7, 11, 15

align 32
permute_256 dd 0, 4, 1, 5, 2, 6, 3, 7 ; cross-lane dword permutation

align 32
one_256 db 0x01 ; broadcast source for the final increment

section .text
global extract_bits

; Function to extract bits from bytes in 'src' to 'dst'
; Arguments:
;   - rdi: pointer to destination u8 array (dst)
;   - rsi: pointer to source u8 array (src)
;   - rdx: number of bytes to extract (count)

extract_bits:
    vmovdqa ymm1, [rel and_256]      ; bit-selection masks
    vmovdqa ymm2, [rel shuffle_256]  ; in-lane shuffle control
    vmovdqa ymm3, [rel permute_256]  ; cross-lane permutation
    vpxor ymm4, ymm4, ymm4           ; all zeros, for the comparison
    vpbroadcastb ymm5, [rel one_256] ; 0x01 in every byte

.scalar_check:
    test rdx, rdx                    ; no bytes left?
    jz .done

    test rsi, 3                      ; is the source 4-byte aligned yet?
    jz .simd_check

.scalar_head:
    movzx eax, byte [rsi]            ; load one source byte

    test al, 1                       ; test each bit, store 0 or 1
    setnz byte [rdi]
    test al, 2
    setnz byte [rdi + 1]
    test al, 4
    setnz byte [rdi + 2]
    test al, 8
    setnz byte [rdi + 3]
    test al, 16
    setnz byte [rdi + 4]
    test al, 32
    setnz byte [rdi + 5]
    test al, 64
    setnz byte [rdi + 6]
    test al, 128
    setnz byte [rdi + 7]

    add rsi, 1                       ; advance: 1 source byte in, 8 bytes out
    add rdi, 8
    sub rdx, 1
    jmp .scalar_check

.simd_check:
    cmp rdx, 4                       ; enough bytes left for a SIMD step?
    jl .scalar_tail

.simd_loop:
    vpbroadcastd ymm0, dword [rsi]   ; replicate 4 source bytes 8 times
    vpand ymm0, ymm0, ymm1           ; isolate one bit per output byte
    vpshufb ymm0, ymm0, ymm2         ; sort bytes within each lane
    vpermd ymm0, ymm3, ymm0          ; move dwords across lanes
    vpcmpeqb ymm0, ymm0, ymm4        ; zero -> 0xFF, non-zero -> 0x00
    vpaddb ymm0, ymm0, ymm5          ; 0xFF wraps to 0, 0x00 becomes 1
    vmovdqa [rdi], ymm0              ; store 32 output bytes (aligned)

    add rsi, 4                       ; advance: 4 source bytes in, 32 bytes out
    add rdi, 32
    sub rdx, 4
    jmp .simd_check

.done:
    vzeroupper                       ; avoid AVX/SSE transition penalties
    ret

.scalar_tail:
    test rdx, rdx
    jz .done

    movzx eax, byte [rsi]            ; same bit-by-bit expansion as the head

    test al, 1
    setnz byte [rdi]
    test al, 2
    setnz byte [rdi + 1]
    test al, 4
    setnz byte [rdi + 2]
    test al, 8
    setnz byte [rdi + 3]
    test al, 16
    setnz byte [rdi + 4]
    test al, 32
    setnz byte [rdi + 5]
    test al, 64
    setnz byte [rdi + 6]
    test al, 128
    setnz byte [rdi + 7]

    add rsi, 1
    add rdi, 8
    sub rdx, 1

    jmp .scalar_tail

Regrettably, the code snippets I’ve provided are insufficient on their own. We cannot simply write code in a different language and expect Rust to integrate it without specific instructions for linking it into our executable. However, in Rust, this process is fairly straightforward. All that’s required is to modify the build process by creating a build.rs file, which is situated alongside the Cargo.toml, as shown below:

use std::process::Command;
use std::env;

fn main() {
    let dir = env::current_dir().unwrap();
    let asm_file = dir.join("src/extract.asm");
    let out_dir = env::var("OUT_DIR").unwrap();

    // Assemble the NASM source into an object file in OUT_DIR.
    Command::new("nasm")
        .args(&["-f", "elf64", "-o"])
        .arg(&format!("{}/extract.o", out_dir))
        .arg(asm_file)
        .status()
        .expect("Failed to run nasm");

    // Pack the object file into a static library.
    Command::new("ar")
        .args(&["rcs"])
        .arg(&format!("{}/libamacal.a", out_dir))
        .arg(&format!("{}/extract.o", out_dir))
        .status()
        .expect("Failed to run ar");

    // Tell Cargo where to find the library and to link it statically.
    println!("cargo:rustc-link-search=native={}", out_dir);
    println!("cargo:rustc-link-lib=static=amacal");
}

This code informs Rust that it needs to build the assembly code with NASM, assemble a static library with AR, and then link it all together. It’s quite a neat and efficient process, wouldn’t you agree?
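One small refinement worth considering, my addition rather than part of the original script: by default Cargo reruns the build script whenever any file in the package changes, and emitting a rerun-if-changed directive inside main narrows that to the assembly source itself:

// Rerun the build script only when the assembly source changes.
println!("cargo:rerun-if-changed=src/extract.asm");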

How fast is the new optimized code? Let's measure its speed with a benchmark and compare it against a slightly optimized naive solution. To give the compiler a fair chance at optimization, I've deliberately aligned the buffers. The code also skips the first and last bytes, forcing the unaligned data paths to be exercised.

#[macro_use]
extern crate criterion;

use criterion::Criterion;
use rand::prelude::*;

#[repr(align(32))]
struct AlignedArray32<const N: usize>([u8; N]);

fn criterion_benchmark(criterion: &mut Criterion) {
    let mut rng = StdRng::seed_from_u64(64);
    let mut src = Box::new(AlignedArray32([0; 131072]));
    let mut dst = Box::new(AlignedArray32([0; 1048576]));

    for i in 0..131072 {
        src.0[i] = rng.gen();
    }

    criterion.bench_function("loop", |b| {
        b.iter(|| unsafe {
            // Skip the first and last byte to mirror the unaligned scenario.
            for index in 1..131071 {
                for offset in 0..8 {
                    let bit = dst.0.get_unchecked_mut((index << 3) + offset);
                    let value = if src.0[index] & (1 << offset) != 0 { 1 } else { 0 };

                    *bit = value;
                }
            }
        })
    });

    // Checksum over every 7th output byte, to compare against the SIMD run.
    let mut sum: u64 = 0;
    for i in 0..1048576 {
        sum += if i % 7 == 0 { dst.0[i] as u64 } else { 0 };
    }

    println!("loop: {}", sum);
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

The SIMD benchmark looks similar but relies on the externally defined assembly function. It performs essentially the same work, just through a different route. I've included the same println! checksum in both benchmarks to verify that they produce identical output:

#[macro_use]
extern crate criterion;

use criterion::Criterion;
use rand::prelude::*;

extern "C" {
    fn extract_bits(dst: *mut u8, src: *const u8, count: usize);
}

#[repr(align(32))]
struct AlignedArray32<const N: usize>([u8; N]);

fn criterion_benchmark(criterion: &mut Criterion) {
    let mut rng = StdRng::seed_from_u64(64);
    let mut src = Box::new(AlignedArray32([0; 131072]));
    let mut dst = Box::new(AlignedArray32([0; 1048576]));

    for i in 0..131072 {
        src.0[i] = rng.gen();
    }

    criterion.bench_function("simd", |b| {
        b.iter(|| unsafe {
            // Offset by one source byte (eight output bytes) to force the
            // scalar head and tail paths around the aligned SIMD bulk.
            let src = src.0.as_ptr().add(1);
            let dst = dst.0.as_mut_ptr().add(8);

            extract_bits(dst, src, 131070);
        })
    });

    // The same checksum as in the naive benchmark.
    let mut sum: u64 = 0;
    for i in 0..1048576 {
        sum += if i % 7 == 0 { dst.0[i] as u64 } else { 0 };
    }

    println!("simd: {}", sum);
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

How do the two code snippets compare in speed? Following the benchmark execution, we observe:

  • Traditional loop: approximately 300 microseconds
  • Optimized AVX-256: approximately 30 microseconds

We’ve achieved a tenfold increase in speed.

The article outlines how to invoke code written outside of Rust within a Linux setting. It covers the fundamentals of the System V AMD64 ABI calling convention, demonstrates how to integrate with raw assembly code, and presents a functional prototype for bit extraction at remarkable speeds using AVX instructions.

As with most methods, there are trade-offs. My focus here was solely on performance, setting aside considerations of platform compatibility or specific CPU support. It’s a solid jumping-off point for your next coding venture. Give it a whirl!

https://github.com/amacal/learning-rust/tree/system-v-amd64-abi
