I am trying to build a mapping between the dynamic symbols in ELF files (from glibc) and the actual kernel syscalls they invoke.

My environment is x86_64 Ubuntu 22.04.

What I've Tried

  1. Parsing man 2 pages: My first attempt was to parse the man 2 text. This worked well for extracting argument types and names, but it could not reliably map a wrapper (e.g., open) to the kernel syscall it actually issues (e.g., openat), because the manuals do not consistently state which syscall a wrapper ends up invoking.

  2. AI Recommendation (AST): I was advised by an AI that using an Abstract Syntax Tree (AST), for example with libclang, would be a viable approach. I'm a computer science student, but my university doesn't offer a compiler course, so I lack a deep understanding of ASTs and am seeking expert advice here.

My Core Problem & Example

My main challenge is that glibc is extremely complex, full of preprocessor directives and symbol aliases.

For example, if I compile a C program that calls open(), readelf shows the dynamic symbol open@GLIBC_2.2.5.

I've traced this to the glibc source file open64.c. On my x86_64 system, the __OFF_T_MATCHES_OFF64_T preprocessor macro is defined, which leads to this block:

https://git.launchpad.net/ubuntu/+source/glibc/tree/sysdeps/unix/sysv/linux/open64.c?h=ubuntu/jammy

#ifdef __OFF_T_MATCHES_OFF64_T
strong_alias (__libc_open64, __libc_open)
strong_alias (__libc_open64, __open)
libc_hidden_weak (__open)
weak_alias (__libc_open64, open)
#endif

This weak_alias maps open to __libc_open64. The __libc_open64 function then internally calls SYSCALL_CANCEL (openat, ....). This macro (which eventually uses inline assembly) is the lowest-level call I'm trying to find.

My goal is to find this entire chain for all syscalls: open@GLIBC_2.2.5 -> weak_alias (__libc_open64, open) -> __libc_open64 -> SYSCALL_CANCEL (openat, ...)

...and ultimately build the mapping: open -> openat, openat -> openat (key -> value).
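
One way to recover the alias part of this chain without touching the source is to note that strong_alias and weak_alias produce several symbol names at the same address. The following is a minimal sketch, assuming the input looks like the output of nm -D --defined-only on libc.so.6; the sample addresses and the function name group_aliases are made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as "<hex address> W open64@@GLIBC_2.2.5"
NM_LINE = re.compile(r"^([0-9a-fA-F]+)\s+([A-Za-z])\s+(\S+)$")


def group_aliases(nm_output: str) -> dict[int, list[str]]:
    """Group defined code symbols by address; names sharing an address are aliases."""
    groups: dict[int, list[str]] = defaultdict(list)
    for line in nm_output.splitlines():
        m = NM_LINE.match(line.strip())
        if not m:
            continue
        addr, kind, name = int(m[1], 16), m[2], m[3]
        if kind in "TtWwi":  # text, weak, and IFUNC symbols
            groups[addr].append(name.split("@")[0])  # drop the version suffix
    return dict(groups)


# Hypothetical sample; real addresses come from running nm on your own libc.
SAMPLE = """\
00000000001144e0 W open@@GLIBC_2.2.5
00000000001144e0 W open64@@GLIBC_2.2.5
00000000001144e0 W __open@@GLIBC_2.2.5
0000000000114750 W openat@@GLIBC_2.4
"""

for addr, names in group_aliases(SAMPLE).items():
    print(hex(addr), sorted(names))
```

Every name in one group resolves to the same machine code, so a syscall found for one alias applies to all of them.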

My Questions

  1. Is it technically feasible to use an AST-based approach (like libclang) to reliably parse the entire glibc source and resolve all these preprocessor directives and aliases (strong_alias, weak_alias)?

  2. My ultimate goal is to create an N:1 mapping from all kernel syscalls (those found near SYS_ify(name)) to the various user-space aliases that call them. Does a public mapping of this information already exist? I would be overjoyed if I could just use an existing resource.
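
For the N:1 view in question 2, the data structure is just the inverse of the wrapper-to-syscall mapping. A small sketch, assuming the forward mapping has already been built somehow (the entries below are the open/openat example from this question, and invert is an illustrative name):

```python
from collections import defaultdict


def invert(wrapper_to_syscalls: dict[str, set[str]]) -> dict[str, set[str]]:
    """Turn wrapper -> {kernel syscalls} into kernel syscall -> {wrappers}."""
    by_syscall: dict[str, set[str]] = defaultdict(set)
    for wrapper, syscalls in wrapper_to_syscalls.items():
        for syscall in syscalls:
            by_syscall[syscall].add(wrapper)
    return dict(by_syscall)


# Forward mapping taken from the example in the question.
EXAMPLE = {"open": {"openat"}, "open64": {"openat"}, "openat": {"openat"}}

for syscall, wrappers in invert(EXAMPLE).items():
    print(syscall, "<-", sorted(wrappers))
```

Using sets on both sides means the same structure also works if the mapping turns out to be n:m.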

  • I can imagine that the mapping is actually n:m (i.e., more than one user-space function ends up calling the same kernel syscall (among others, perhaps)), and some single user-space functions make more than one kernel syscall. Commented Nov 14 at 15:06
  • Is there a case where a single user-space wrapper calls multiple kernel syscalls? To be clear, I'm trying to map wrapper syscalls to kernel syscalls, not arbitrary user-space functions to kernel syscalls. Commented Nov 14 at 15:18
  • "My goal is to find this entire chain for all syscalls": why? What for? "Is it technically feasible to use an AST-based approach": you will most probably have to write your own "AST-ish" parser on top of libclang to handle all cases. "Does a public mapping of this information already exist?": not that I am aware of. Commented Nov 14 at 15:46
  • This work is for my undergrad graduation project. My goal is to statically analyze a deployed image containing an ELF to identify potential syscalls and then provide that list to the administrator. The admin can then define and monitor prohibited syscall patterns (omitting the eBPF part). The reason I need this mapping is that the static analysis yields wrapper syscalls, but the administrator must define rules against the actual kernel syscalls. I need to translate the ELF analysis results from wrapper syscalls to kernel syscalls before presenting them to the admin. Commented Nov 14 at 15:59
  • N:1 is certainly likely -- there may be just one exec kernel syscall, and all the exec* wrappers call it after massaging the arguments. Commented Nov 14 at 16:50

2 Answers

Very challenging problem, good luck!

I hope this helps; it expands on the ideas already mentioned in the comments.

This approach is based on searching for the syscall instruction (opcode bytes 0x0f 0x05) in the memory of a process that has been loaded but is not running. Other fingerprints might be worth searching for as well, but one example is enough to convey the idea.

The major benefit here is that gdb has the real function tables in memory and can resolve which functions are being called.

Here is the process:

  1. load the program into memory (static or dynamic)
  2. stop program execution in the sandbox at specific points of interest
  3. look for prohibited fingerprints in process memory space
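
The fingerprint search in step 3 amounts to a plain byte scan. Here is a minimal sketch of that idea outside gdb, assuming you already have the raw bytes of a code section and its load address; find_syscall_offsets is an illustrative name. A raw byte match can also hit 0x0f 0x05 inside another instruction's operand, so treat the hits as candidates to confirm by disassembly.

```python
SYSCALL_OPCODE = b"\x0f\x05"  # x86_64 encoding of the syscall instruction


def find_syscall_offsets(code: bytes, base: int = 0) -> list[int]:
    """Return the address of every occurrence of the syscall opcode bytes."""
    addresses = []
    pos = code.find(SYSCALL_OPCODE)
    while pos != -1:
        addresses.append(base + pos)
        pos = code.find(SYSCALL_OPCODE, pos + 1)
    return addresses


# b8 01 01 00 00 = mov $0x101,%eax ; 0f 05 = syscall
snippet = bytes.fromhex("b8010100000f05")
print([hex(a) for a in find_syscall_offsets(snippet, base=0x456ABB)])
```

This is the same search that the gdb find /b command performs over an address range.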

Compile both a dynamic and static program:

gcc -g -o prog-dynamic main.c
gcc -static -ffunction-sections -fdata-sections -Wl,--gc-sections -O2 -o prog-static main.c

Analyze the static program:

gdb prog-static

Try these commands:

(gdb) starti
(gdb) info files
0x0000000000401180 - 0x000000000047e2a0 is .text
(gdb) find /b 0x0000000000401180, 0x000000000047e2a0, 0x0f, 0x05
. . . SNIP . . .
0x456ac0 <openat64+64>
0x456b35 <openat64+181>
0x419179 <fstat64+9>
0x4191a9 <lseek64+9>
0x419223 <open64+83>
0x419293 <open64+195>
0x4192ff <read+15>
0x419338 <read+72>
0x4193a2 <write+18>
0x4193db <write+75>
. . . SNIP . . .
110 patterns found.
(gdb) disassemble openat
Dump of assembler code for function openat64:
   0x0000000000456a80 <+0>:     endbr64
   0x0000000000456a84 <+4>:     push   %rbp
   0x0000000000456a85 <+5>:     mov    %rsp,%rbp
   0x0000000000456a88 <+8>:     sub    $0x70,%rsp
   0x0000000000456a8c <+12>:    mov    %rcx,-0x18(%rbp)
   0x0000000000456a90 <+16>:    mov    %fs:0x28,%rax
   0x0000000000456a99 <+25>:    mov    %rax,-0x38(%rbp)
   0x0000000000456a9d <+29>:    xor    %eax,%eax
   0x0000000000456a9f <+31>:    test   $0x40,%dl
   0x0000000000456aa2 <+34>:    jne    0x456ae8 <openat64+104>
   0x0000000000456aa4 <+36>:    mov    %edx,%eax
   0x0000000000456aa6 <+38>:    xor    %r10d,%r10d
   0x0000000000456aa9 <+41>:    not    %eax
   0x0000000000456aab <+43>:    test   $0x410000,%eax
   0x0000000000456ab0 <+48>:    je     0x456ae8 <openat64+104>
   0x0000000000456ab2 <+50>:    cmpb   $0x0,0x5457f(%rip)        # 0x4ab038 <__libc_single_threaded>
   0x0000000000456ab9 <+57>:    je     0x456b0c <openat64+140>
   0x0000000000456abb <+59>:    mov    $0x101,%eax
   0x0000000000456ac0 <+64>:    syscall
   0x0000000000456ac2 <+66>:    cmp    $0xfffffffffffff000,%rax
   0x0000000000456ac8 <+72>:    ja     0x456b58 <openat64+216>
. . . SNIP . . .
   0x0000000000456b30 <+176>:   mov    $0x101,%eax
   0x0000000000456b35 <+181>:   syscall

Notice:

  • 0x456ac0 <openat64+64>: the value 0x101 is moved into eax before the syscall
  • 0x456b35 <openat64+181>: the value 0x101 is moved into eax before the syscall

On x86_64 the syscall number is passed in eax, and 0x101 (257) is the number of openat.
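
Turning such a mov into a syscall name is a table lookup. A minimal sketch, assuming AT&T-syntax disassembly as shown above; the table is a small hand-picked excerpt of the x86_64 syscall numbers (a full table can be read from the kernel's unistd_64.h header), and syscall_from_mov is an illustrative name.

```python
import re
from typing import Optional

# Excerpt of the x86_64 syscall table; extend it from unistd_64.h as needed.
X86_64_SYSCALLS = {0: "read", 1: "write", 2: "open", 3: "close", 257: "openat"}

MOV_IMM_TO_EAX = re.compile(r"mov\s+\$(0x[0-9a-fA-F]+),%[er]ax")


def syscall_from_mov(instruction: str) -> Optional[str]:
    """Return the syscall name for 'mov $imm,%eax', or None for other instructions."""
    m = MOV_IMM_TO_EAX.search(instruction)
    if m is None:
        return None
    number = int(m[1], 16)
    return X86_64_SYSCALLS.get(number, f"unknown({number})")


print(syscall_from_mov("mov    $0x101,%eax"))
```

This only covers the case where the number is loaded as an immediate directly before the syscall; numbers computed at run time need a different approach.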

Analyze the dynamic program:

gdb prog-dynamic
(gdb) starti
(gdb) info files
0x0000555555555060 - 0x0000555555555167 is .text
(gdb) find /b 0x0000555555555060, 0x0000555555555167, 0x0f, 0x05
Pattern not found.
(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007ffff7fc6000  0x00007ffff7ff0195  Yes         /lib64/ld-linux-x86-64.so.2
(gdb) find /b 0x00007ffff7fc6000, 0x00007ffff7ff0195, 0x0f, 0x05
. . . SNIP . . .
56 patterns found.
(gdb) br main
(gdb) cont
(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
0x00007ffff7fc6000  0x00007ffff7ff0195  Yes         /lib64/ld-linux-x86-64.so.2
0x00007ffff7c28800  0x00007ffff7dafcf9  Yes         /lib/x86_64-linux-gnu/libc.so.6
(gdb) find /b 0x00007ffff7c28800, 0x00007ffff7dafcf9, 0x0f, 0x05
. . . SNIP . . .
575 patterns found.
. . . SAME AS STATIC ANALYSIS . . .

Another helpful gdb command is info function. It lists the source files that define the matching symbols, which can help you find your way around the cryptic glibc source:

(gdb) info function openat64
All functions matching regular expression "openat64":

File ../sysdeps/unix/sysv/linux/dl-openat64.c:
25:     int openat64(int, const char *, int, ...);

File ../sysdeps/unix/sysv/linux/openat64.c:
28:     int __libc_openat64(int, const char *, int, ...);

File ../sysdeps/unix/sysv/linux/openat64_nocancel.c:
26:     int __GI___openat64_nocancel(int, const char *, int, ...);

File ./io/openat64_2.c:
23:     int __openat64_2(int, const char *, int);

#!/usr/bin/env python3
# Reads objdump disassembly on stdin and prints, per function, the syscalls it issues.

import dataclasses
import re
import sys


# Syscall number -> name, read from the kernel's x86_64 header.
# On Debian/Ubuntu the header lives in /usr/include/x86_64-linux-gnu/asm/ instead.
SYSCALLS: dict[int, str] = {}

with open("/usr/include/asm/unistd_64.h") as f:
    for line in f:
        if m := re.match(r"#define __NR_(\S+)\s+([0-9]+)", line):
            SYSCALLS[int(m[2])] = m[1]


@dataclasses.dataclass
class Func:
    name: str
    numbers: list[str] = dataclasses.field(default_factory=list)  # instruction addresses
    syscalls: set[int] = dataclasses.field(default_factory=set)  # syscall numbers seen
    calls: set[str] = dataclasses.field(default_factory=set)  # symbols referenced


functions: list[Func] = []
eax = None  # last immediate loaded into eax/rax; not reset at function boundaries
for line in sys.stdin:
    # Function header: "<hex address> <name>:"
    if m := re.match(r"[0-9A-Fa-f]+ <([^>]+)>:", line):
        functions.append(Func(m[1]))
    # Instruction line: address, tab, hex bytes, then mnemonic and operands
    elif m := re.match(
        r"\s+([0-9A-Fa-f]+):\t+[0-9A-Fa-f]{2}( [0-9A-Fa-f]{2})*\s*(.*)", line
    ):
        functions[-1].numbers.append(m[1])
        instruction = m[3].strip()
        # 9305c:   ff 15 c6 5c 17 00       call   *0x175cc6(%rip)        # 208d28 <free@@GLIBC_2.2.5+0x160df8>
        if m := re.match(r".*<([^+>]*).*>", instruction):
            functions[-1].calls.add(m[1])
        if m := re.match(r"mov\s+\$(0x[0-9A-Fa-f]+),%(eax|rax)", instruction):
            eax = int(m[1], 16)
        elif re.match(r"xor\s+%eax,%eax", instruction):
            eax = 0  # the usual way compilers load 0 (the read syscall)
        elif instruction == "syscall":
            # Compare against None, because 0 is a valid syscall number
            assert eax is not None, f"{eax} {functions[-1]} {line}"
            functions[-1].syscalls.add(eax)

for f in functions:
    if f.syscalls:
        print(f.name, " ".join(str(SYSCALLS.get(s, s)) for s in sorted(f.syscalls)))

This script parses the output of objdump -d -C (the -d flag produces the disassembly that the script reads). For example, for libc.so.6 it outputs:

$ ./fun.py < /tmp/objdump_libc | head
abort@@GLIBC_2.2.5 rt_sigprocmask
__libc_init_first@@GLIBC_2.2.5 exit
__sigaction@@GLIBC_2.2.5 rt_sigreturn
__libc_sigaction@@GLIBC_PRIVATE rt_sigaction
kill@@GLIBC_2.2.5 kill
sigpending@@GLIBC_2.2.5 rt_sigpending
sigaltstack@@GLIBC_2.2.5 sigaltstack
sigqueue@@GLIBC_2.2.5 rt_sigqueueinfo
a64l@@GLIBC_2.2.5 rt_sigprocmask
abs@@GLIBC_2.2.5 ppoll
...

Additional work is needed to:

  • find the real function endings. Obviously abs does not call ppoll; the syscall instruction shows up there because the compiler placed code belonging to arc4random after the abs function, which looks like an optimization.
  • find all syscalls of subfunctions recursively
  • and then probably use Graphviz dot to present the output as a graph
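
The last two items can be sketched as follows. This is only an illustration under assumptions: direct maps a function to the syscalls found in its own body, calls maps a function to its callees (both would come from the script above), and the names reachable_syscalls, to_dot and log_to_file are made up for the example.

```python
def reachable_syscalls(func: str, direct: dict, calls: dict) -> set:
    """Collect syscalls issued by func or by anything it calls, transitively."""
    seen, stack, found = set(), [func], set()
    while stack:
        current = stack.pop()
        if current in seen:  # also protects against recursive call cycles
            continue
        seen.add(current)
        found |= direct.get(current, set())
        stack.extend(calls.get(current, ()))
    return found


def to_dot(direct: dict, calls: dict) -> str:
    """Render the call graph and the syscall edges as Graphviz DOT text."""
    lines = ["digraph syscalls {"]
    for caller, callees in sorted(calls.items()):
        for callee in sorted(callees):
            lines.append(f'  "{caller}" -> "{callee}";')
    for func, syscalls in sorted(direct.items()):
        for syscall in sorted(syscalls):
            lines.append(f'  "{func}" -> "sys_{syscall}" [style=dashed];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical input: log_to_file is an invented caller.
DIRECT = {"__libc_open64": {"openat"}, "write": {"write"}}
CALLS = {"log_to_file": {"__libc_open64", "write"}}

print(sorted(reachable_syscalls("log_to_file", DIRECT, CALLS)))
print(to_dot(DIRECT, CALLS))
```

The DOT text can be saved to a file and rendered with dot -Tsvg. The result is only as good as the function boundaries in the input, so the first item on the list still has to be solved.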
