Tracing the Arm64 Linux System Call Path

Leesoo Ahn - Aug 13 - - Dev Community

An Arm64 system has two types of exceptions,

  • Synchronous
  • Asynchronous

and four exception levels, abbreviated el (exception level).

  • el0 (userspace)
  • el1 (kernel)
  • el2 (hypervisor)
  • el3 (secure monitor)

A system call is the best-known kind of synchronous exception, while a hardware interrupt is an asynchronous one, according to the Arm whitepaper. The latter is off-topic in this article.

A process runs in el0 and raises its hand whenever it needs a system resource. That is a system call, and it switches the CPU's exception level from el0 to el1. The kernel takes over the CPU and does the work on behalf of the process. Once it's done, it hands the CPU back to the process.


The following code is one of the (real) system-call helpers from musl, a well-known libc implementation.

#define __asm_syscall(...) do { \
    __asm__ __volatile__ ( "svc 0" \
    : "=r"(x0) : __VA_ARGS__ : "memory", "cc"); \
    return x0; \
} while (0)

static inline long __syscall0(long n)
{
    register long x8 __asm__("x8") = n;
    register long x0 __asm__("x0");
    __asm_syscall("r"(x8));
}

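Imagine that the process mentioned above is about to call fork(). The API takes no arguments and therefore maps to __syscall0(..). (Strictly speaking, native arm64 has no fork system call of its own and musl builds fork() on top of clone there, but the zero-argument path is the easiest one to follow, so fork() serves as the running example.)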

What you need to keep in mind about this code is the svc instruction (supervisor call), which switches from el0 to el1 while the x8 register holds the system-call number.
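
To see the same path from a program's point of view, here is a minimal sketch that skips the dedicated libc wrapper and goes through the generic syscall() entry point. It uses getpid rather than fork so that it has no side effects; raw_getpid is a name made up for this example. On arm64 the call is expected to end in the same svc 0 with the system-call number in x8, though the C code itself is portable to other Linux architectures.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_getpid */
#include <unistd.h>       /* syscall(), getpid() */

/* Hypothetical helper: ask the kernel for our PID through the generic
 * syscall(2) entry point instead of the getpid() wrapper. The libc places
 * the system-call number in the register the architecture expects
 * (x8 on arm64) and executes the trap instruction for us. */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Both routes should report the same process ID, since they reach the same kernel function through the same trap.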


In el1, the exception vector table describes what to do when svc is raised and leads to el0t_64_sync_handler. The handler reads the esr system register, which holds the syndrome information used to recognize the exception class (also known as the exception reason), and jumps to el0_svc(..).

el0t_64_sync_handler(struct pt_regs *regs)
{
    unsigned long esr = read_sysreg(esr_el1);

    switch (ESR_ELx_EC(esr)) {
    case ESR_ELx_EC_SVC64:
        el0_svc(regs);
        break;
    ...
}
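
The exception class lookup can be sketched in plain C. This is an illustration rather than kernel code: the macro and function names below are made up for the example, while the field position (bits 31:26 of ESR_ELx) and the value 0x15 for an SVC taken from AArch64 state follow the Arm architecture's encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names, not the kernel's. In ESR_ELx the exception class (EC)
 * occupies bits [31:26]. */
#define ESR_EC_SHIFT  26
#define ESR_EC_MASK   0x3fULL
#define ESR_EC_SVC64  0x15ULL  /* SVC instruction executed in AArch64 state */

/* Extract the 6-bit exception class from a syndrome value. */
static inline uint64_t esr_exception_class(uint64_t esr)
{
    return (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
}

/* True when the syndrome says "this trap was a 64-bit system call". */
static inline bool esr_is_svc64(uint64_t esr)
{
    return esr_exception_class(esr) == ESR_EC_SVC64;
}
```

For example, a syndrome value of 0x56000000 decodes to class 0x15, so a handler shaped like the one above would take the system-call branch.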

From here on, a code diagram will be easier to follow than words. (The code is based on v5.15.)

el0_svc(struct pt_regs *regs)
{
    ...
    do_el0_svc(regs);
    ...   |
}         |
    +-----+
    |
    V
do_el0_svc(struct pt_regs *regs)
{
    ...
    el0_svc_common(regs, regs->regs[8],
           |       __NR_syscalls,
    ...    |       sys_call_table);
}          |
    +------+
    |
    V
el0_svc_common(struct pt_regs *regs, int scno, int sc_nr,
               const syscall_fn_t syscall_table[])
{
    ...
    invoke_syscall(regs, scno, sc_nr, syscall_table);
    ...
}

We're almost at our destination now. scno came from the x8 register (again, it holds the system-call number), and invoke_syscall(..) looks up the system-call function in syscall_table using scno as the index. Eventually, it carries out what was requested.

invoke_syscall(struct pt_regs *regs, unsigned int scno,
               unsigned int sc_nr,
               const syscall_fn_t syscall_table[])
{
    ...
    if (scno < sc_nr) {
        syscall_fn_t syscall_fn;
        syscall_fn = syscall_table[array_index_nospec(scno, sc_nr)];
        ret = __invoke_syscall(regs, syscall_fn);
    }                |
    ...              |
}                    |
    +----------------+
    |
    V
__invoke_syscall(struct pt_regs *regs, syscall_fn_t syscall_fn)
{
    return syscall_fn(regs);
}
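
The lookup-and-call step can be modeled in user space. The sketch below is a toy under stated assumptions: fake_regs, toy_table, and toy_dispatch are hypothetical stand-ins for struct pt_regs, sys_call_table, and the el0_svc_common(..)/invoke_syscall(..) pair, and the speculation barrier provided by array_index_nospec(..) is left out.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the kernel's struct pt_regs. */
struct fake_regs {
    unsigned long regs[31];  /* x0..x30 as saved on exception entry */
};

/* Every handler has the same shape: it receives only the saved registers. */
typedef long (*fake_syscall_fn)(const struct fake_regs *regs);

/* Two toy handlers; in the kernel these would be __arm64_sys_*() functions. */
static long toy_answer(const struct fake_regs *regs)
{
    (void)regs;  /* takes no arguments */
    return 42;
}

static long toy_add(const struct fake_regs *regs)
{
    return (long)(regs->regs[0] + regs->regs[1]);
}

static const fake_syscall_fn toy_table[] = { toy_answer, toy_add };
#define TOY_NR_SYSCALLS (sizeof(toy_table) / sizeof(toy_table[0]))

/* Take the number from x8, bounds-check it, call through the table,
 * and leave the result in x0 for the caller to pick up. */
static void toy_dispatch(struct fake_regs *regs)
{
    unsigned long scno = regs->regs[8];
    long ret = -ENOSYS;  /* unknown numbers fail with -ENOSYS */

    if (scno < TOY_NR_SYSCALLS)
        ret = toy_table[scno](regs);

    regs->regs[0] = (unsigned long)ret;
}
```

Setting regs[8] to 1 with 2 and 3 in regs[0] and regs[1] leaves 5 in regs[0] after toy_dispatch(..), while an out-of-range number leaves -ENOSYS there.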

You may wonder: each system call takes a different number of parameters, yet syscall_fn(..) takes only one, regs. Let's look at two cases in code, one that takes no parameters and one that takes five.

fork() takes no parameters, so the struct pt_regs object passed to syscall_fn is unused.

#define SYSCALL_DEFINE0(sname) \
    ...
    asmlinkage long __arm64_sys_##sname(const struct pt_regs *__unused)

On the other hand, clone() takes five parameters, so the struct pt_regs object is expanded into that many arguments by SC_ARM64_REGS_TO_ARGS(..) and __MAP(..).

SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
      |        int __user *, parent_tidptr,
      |        unsigned long, tls,
      |        int __user *, child_tidptr)
      |
      +--------+
               |
               V
#define __SYSCALL_DEFINEx(x, name, ...) \
    ...
    __arm64_sys##name(const struct pt_regs *regs) \
    { \
        return __se_sys##name(SC_ARM64_REGS_TO_ARGS(x,__VA_ARGS__)); \
    } \           |
         +--------+
         |
         V
    __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
    { \
        long ret = __do_sys##name(__MAP(x,__SC_CAST,__VA_ARGS__)); \
        ...              |
        return ret; \    |
    } \                  |
         +---------------+
         |
         V
    __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))
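
Written out by hand, the effect of those macros looks roughly like the sketch below. It is an illustration only: demo_regs, do_sum3, wrapped_sum3, and wrapped_seven are hypothetical names, with wrapped_* playing the role of __arm64_sys##name and do_sum3 the role of __do_sys##name. A three-argument function is used instead of clone() to keep it short.

```c
/* Hypothetical, cut-down stand-in for the kernel's struct pt_regs. */
struct demo_regs {
    unsigned long regs[31];
};

/* Typed implementation, the part a developer actually writes. */
static long do_sum3(int a, long b, unsigned char c)
{
    return a + b + c;
}

/* Register-based entry point: arguments 0..2 are pulled out of x0..x2
 * and cast back to their declared types before the typed call. */
static long wrapped_sum3(const struct demo_regs *regs)
{
    return do_sum3((int)regs->regs[0],
                   (long)regs->regs[1],
                   (unsigned char)regs->regs[2]);
}

/* Zero-argument case: regs is accepted only so that every entry point
 * has the same signature, then ignored. */
static long wrapped_seven(const struct demo_regs *unused)
{
    (void)unused;
    return 7;
}
```

Because both entry points share one signature, they can sit side by side in a single table of function pointers, which is what lets syscall_fn(regs) work for every system call.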

We have walked through the system-call code from el0 to el1. It wasn't a long journey, but wasn't easy either. I hope this tiny map (I like using metaphors) guides you to where you want to be.

happy hacking!
