Name: 64-bit Assembly Programming: AArch64
Rating: 2.3 (9589 reviews)
Author: arilloid

Hello, Blog!

If you have just stumbled upon my SPO600 series of blog posts, I created it to document and share what I learn as I progress through my Software Portability and Optimization college course.

After getting our feet wet with the 6502, my classmates and I are now transitioning to modern 64-bit platforms - AArch64 and x86_64.

In this post, I'll cover the AArch64 part of the 64-bit assembly language lab: I will talk about the code I’ve written, and share my initial impressions of programming in 64-bit assembly.

Lab Intro

In this lab, we were tasked with writing a program that prints a message to stdout on each loop iteration, and then gradually enhancing its logic.

Here is the starter code provided by our professor, Chris Tyler:

Basic loop implementation:
https://github.com/arilloid/assembly/blob/main/lab4-aarch64/loop.s

 .text
 .globl _start
 min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
 max = 10                         /* loop exits when the index hits this number (loop condition is i<max) */
 _start:
     mov     x19, min        /* the value in register 19 = loop index */
 loop:

     /* ... body of the loop ... do something useful here ... */

     add     x19, x19, 1     /* increment the loop counter */
     cmp     x19, max        /* see if we've hit the max */
     b.ne    loop            /* if not, then continue the loop */

     mov     x0, 0           /* set exit status to 0 */
     mov     x8, 93          /* exit is syscall #93 */
     svc     0               /* invoke syscall */

Notes:

The loop executes 10 times (index 0 through 9), but since the loop body is empty, it does nothing on each iteration.
Register 19 (x19) is used to store the loop counter.
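
For readers who think more easily in C, the control flow of the starter loop can be sketched roughly as below. This is an illustration, not part of the lab code: the function name count_iterations and the runs counter are made up here. The point it shows is that the comparison sits at the bottom of the loop, so the body always runs at least once, and the != test assumes min is smaller than max.

```c
#include <assert.h>

/* Hypothetical helper (not from the lab): counts how many times the body
   of a bottom-tested loop like the starter code would run. */
int count_iterations(int min, int max) {
    assert(min < max);      /* the != check below relies on this */
    int i = min;            /* mov x19, min */
    int runs = 0;
    do {
        runs++;             /* stands in for the loop body */
        i++;                /* add x19, x19, 1 */
    } while (i != max);     /* cmp x19, max ; b.ne loop */
    return runs;
}
```

Calling count_iterations(0, 10) walks the index from 0 to 9, matching the 10 passes described in the notes.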

Printing "Hello, world!":
https://github.com/arilloid/assembly/blob/main/lab4-aarch64/hello.s

.text
.globl _start
_start:

        mov     x0, 1           /* file descriptor: 1 is stdout */
        adr     x1, msg         /* message location (memory address) */
        mov     x2, len         /* message length (bytes) */

        mov     x8, 64          /* write is syscall #64 */
        svc     0               /* invoke syscall */

        mov     x0, 0           /* status -> 0 */
        mov     x8, 93          /* exit is syscall #93 */
        svc     0               /* invoke syscall */

.data
msg:    .ascii      "Hello, world!\n"
len=    . - msg

Notes:

The message is printed to stdout using the write system call (syscall #64).
The write system call requires 3 parameters: file descriptor (goes into register 0 = x0), memory address (x1), and message length (x2).
On AArch64 Linux, the system call is selected by loading its number into register 8 (x8) before invoking it with svc 0.
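
As a rough C counterpart, the same three parameters appear as the arguments of the POSIX write() function. This is only a sketch for comparison: print_hello is a hypothetical name, and the C library picks the syscall number for us instead of us loading it into x8.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not from the lab): writes the greeting to the given
   file descriptor and returns whatever write() returned. */
ssize_t print_hello(int fd) {
    static const char msg[] = "Hello, world!\n";
    size_t len = strlen(msg);       /* plays the role of `len = . - msg` */
    assert(fd >= 0);
    /* In the assembly version: x0 = fd, x1 = msg, x2 = len, x8 = 64 */
    return write(fd, msg, len);
}
```

Calling print_hello(1) would send the message to stdout, just like the assembly program.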

If you are unsure about the register prefixes (x0 vs. w0) or how the registers are used during system calls, please refer to the AArch64 register and instruction quick start guide.

Setup

I wrote the code on the school-hosted AArch64 server. To make my coding experience more comfortable, I set up the college machine as a remote in VSCode using SSH (instead of going 100% CLI) and installed the ARM Assembly extension for syntax highlighting.

Code modifications

Part 1

The first objective was to combine the given code snippets, making the loop output a message that includes the loop index value.

Expected Output:

 Loop: 0
 Loop: 1
 Loop: 2
 Loop: 3
 Loop: 4
 Loop: 5
 Loop: 6
 Loop: 7
 Loop: 8
 Loop: 9

My implementation:

I combined the looping and message-printing by placing the printing logic inside the loop body and copying the .data section, which holds message-related constants, to the bottom of the file.

To print the loop index along with the message on each iteration, I added a placeholder # within the message, stored its address in the msg_digit symbol, and replaced the placeholder with the current loop index on each loop iteration. This way, the message is assembled dynamically on every pass before being printed to stdout.

https://github.com/arilloid/assembly/blob/main/lab4-aarch64/loop1.s

.text
.globl _start
min = 0                          /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                         /* loop exits when the index hits this number (loop condition is i<max) */
_start:
    mov     x19, min
    loop:
        /* convert loop index to ASCII and store it in the message buffer */
        add     w20, w19, #48   /* convert loop counter value to ASCII character (0-9) */
        adr     x21, msg_digit  /* load address of the digit placeholder in the message buffer */
        strb    w20, [x21]      /* store the digit at the placeholder position */

        /* write message to stdout */
        mov     x0, 1       /* file descriptor: 1 is stdout */
        adr     x1, msg     /* message location (memory address) */
        mov     x2, len     /* message length (bytes) */
        mov     x8, 64      /* write is syscall #64 */
        svc     0           /* invoke syscall */

        /* looping */
        add     x19, x19, 1     /* increment the loop counter */
        cmp     x19, max        /* see if we've hit the max */
        b.ne    loop            /* if not, then continue the loop */

        /* exit the program after loop exit */
        mov     x0, 0           /* set exit status to 0 */
        mov     x8, 93          /* exit is syscall #93 */
        svc     0               /* invoke syscall */

.data
msg:    .ascii      "Loop: #\n"
msg_digit = msg + 6     /* position of the digit placeholder */
len=    . - msg

Notes:

Converting loop index to ASCII: To print the loop index in a human-readable format, the binary-stored numeric value needs to be converted into its corresponding ASCII character. ASCII codes for digit characters range from 48 ('0') to 57 ('9'), so adding 48 to a single digit converts it from its numeric value to its ASCII code (add w20, w19, #48 adds 48 to the value in w19, the loop index).
The strb w20, [x21] instruction uses the w prefix for register 20 because strb (store byte) takes a 32-bit W register as its source operand and writes only the lowest 8 bits of that register into memory.
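
The placeholder trick can also be sketched in C, which may make the two steps (convert, then store one byte) easier to see. This is an illustration under my own naming: format_line and DIGIT_OFFSET are hypothetical and do not appear in the lab code.

```c
#include <assert.h>
#include <string.h>

#define DIGIT_OFFSET 6   /* same offset as `msg_digit = msg + 6` */

/* Hypothetical helper (not from the lab): builds "Loop: <digit>\n" in buf,
   which must have room for at least 9 bytes. */
void format_line(char *buf, unsigned index) {
    assert(index <= 9);                       /* only one digit fits */
    memcpy(buf, "Loop: #\n", 9);              /* 8 characters plus the NUL */
    buf[DIGIT_OFFSET] = (char)('0' + index);  /* add #48, then a one-byte store */
}
```

After format_line(buf, 7), the buffer holds "Loop: 7\n", ready to be passed to write with a length of 8.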

Part 2

The next step was to extend the code to loop from 00 to 30, printing each value as a 2-digit decimal number.

My implementation:

To achieve this, I split the 2-digit number into separate digits: I used udiv to divide the loop index by 10 for the first digit (quotient), and then calculated the second digit (remainder) using msub. I also added a placeholder for the second digit in the message template and reused the logic from the previous implementation to convert both digits to ASCII and store them in the message buffer.

The code snippet below focuses on the modified areas.
To see the full code go to: lab4-aarch64/loop2.s

divisor = 10   /* divisor for 2-digit conversion */
/*
        register assignments
        ---------------------
        r19 = loop index  
        r20 = divisor
        r21 = 1st digit of loop index
        r22 = 2nd digit of loop index
        r23 = address of the digit in the message buffer
*/
mov     x20, divisor 
loop:
        /* divide loop index by 10 to find the first digit (quotient) */
        udiv    x21, x19, x20       /* quotient for first digit = x21 = index / 10 */

        /* calculate the remainder (second digit) using msub */
        msub    x22, x21, x20, x19  /* x22 = index - (quotient * divisor) */

        /* convert first digit to ASCII */
        add     w21, w21, #48       /* convert the 1st digit to ASCII */
        adr     x23, msg_digit_1    /* load address of the first digit placeholder in the message buffer */
        strb    w21, [x23]          /* store the first digit in the message buffer */

        /* convert second digit to ASCII */
        add     w22, w22, #48       /* convert the 2nd digit to ASCII */
        adr     x23, msg_digit_2    /* load address of the second digit placeholder in the message buffer */
        strb    w22, [x23]          /* store the second digit in the message buffer */

.data
msg:    .ascii      "Loop: ##\n"

/* positions of the digit placeholders */
msg_digit_1 = msg + 6     
msg_digit_2 = msg + 7

Notes:

To help keep track of the values stored in the registers, I added a comment detailing the register assignments for easy reference during the coding process.
udiv x21, x19, x20 - performs an unsigned division: places the quotient of the division of x19 by x20 into x21 (remainder is not calculated).
msub (multiply and subtract) - calculates the remainder by subtracting the product of the quotient (x21) and the divisor (x20) from the original dividend (x19).
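
The arithmetic behind the udiv and msub pair can be written out in C as a quick sanity check. The sketch below is mine, not the lab's: split_digits is a hypothetical name, and it deliberately computes the remainder the long way (index minus quotient times divisor) instead of using the % operator, to mirror what msub does.

```c
#include <assert.h>

/* Hypothetical helper (not from the lab): splits a value from 0 to 99
   into its tens and ones digits. */
void split_digits(unsigned index, unsigned *tens, unsigned *ones) {
    assert(index <= 99);
    const unsigned divisor = 10;
    *tens = index / divisor;              /* udiv x21, x19, x20 */
    *ones = index - (*tens * divisor);    /* msub x22, x21, x20, x19 */
}
```

For index 27, the quotient is 2 and the remainder is 27 - (2 * 10) = 7, which are the two digits that get converted to ASCII.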

Part 3

Lastly, the code needed to be modified to suppress the leading zero, so that it prints 0-30 instead of 00-30.

My implementation:

To suppress the leading zero, I added a check after converting the first digit to ASCII. If the value is not equal to ASCII '0', the code branches to a continue label and proceeds as before. If the first digit is 0, it is replaced with a space, so single-digit values are printed with a leading space instead of a leading zero.

I added the continue label before writing the first digit into the message buffer to bypass the space replacement logic when the digit is non-zero.

 /* convert first digit to ASCII */
        add     w21, w21, #48       /* convert the 1st digit to ASCII */

 /* replace the first digit with space if = 0; else continue */
        cmp     w21, #48            /* 48 = '0' in ASCII */
        b.ne    continue
        mov     w21, #32            /* 32 = space in ASCII */

    continue:
        adr     x23, msg_digit_1    /* load address of the first digit placeholder in the message buffer */
        strb    w21, [x23]          /* store the first digit in the message buffer */

See the full code here: lab4-aarch64/loop3.s
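
Put together, the Part 3 logic can be sketched in C as follows. This is only an illustration with a hypothetical function name (format_line_padded), and it uses / and % where the assembly uses udiv and msub. It shows that a zero first digit is swapped for a space before being stored, so the line length stays the same.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not from the lab): builds "Loop: NN\n" in buf
   (at least 10 bytes), replacing a leading zero with a space. */
void format_line_padded(char *buf, unsigned index) {
    assert(index <= 99);
    memcpy(buf, "Loop: ##\n", 10);          /* 9 characters plus the NUL */
    char first = (char)('0' + index / 10);  /* first digit as ASCII */
    if (first == '0') {
        first = ' ';                        /* mov w21, #32 */
    }
    buf[6] = first;                         /* msg_digit_1 = msg + 6 */
    buf[7] = (char)('0' + index % 10);      /* msg_digit_2 = msg + 7 */
}
```

With this approach, index 5 produces "Loop:  5\n" (two spaces before the 5), while index 30 produces "Loop: 30\n".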

Thoughts on AArch64

To my surprise, writing 64-bit assembly was not as intimidating as I expected. After working with 6502 assembly, I found AArch64 to be quite intuitive: clear naming conventions for registers and relatively straightforward syntax made it easy to grasp. (Fortunately, we didn’t have to scour huge instruction manuals, as our professor provided us with a starter kit containing all the required instructions.) Math operations were significantly easier, as I didn't have to manage carry flags. Overall, the larger number of registers, the dedicated instructions for complex math operations (such as division, which has to be done manually on the 6502), and the availability of system calls for printing made my experience with AArch64 much more enjoyable than with the 6502.

Stay tuned! In the next blog post in the Software Optimization Series, I will implement the same logic in x86_64 assembly, contrasting x86_64 with AArch64 and 6502.