From exit() to Code Execution

Code Execution via exit()

Overview

In this post, we explore how to achieve code execution during program termination by abusing glibc's exit() machinery, specifically its TLS destructors.

All demonstrations use glibc 2.42, but the concept applies broadly to versions >= 2.23.

Although this technique is not new, this post presents my analysis and understanding of how it works in practice.

Background

Before a program reaches main, several initialization steps are performed by the runtime.

Contrary to common assumption, main is not the first function executed. For dynamically linked binaries, execution begins at the ELF entry point (_start), which eventually transfers control to __libc_start_main_impl. This function is responsible for setting up the runtime environment before invoking main.

Internally, __libc_start_main_impl delegates the final call to main to __libc_start_call_main.

It is defined in the glibc source as follows:

_Noreturn static __always_inline void
__libc_start_call_main (int (*main) (int, char **, char ** MAIN_AUXVEC_DECL),
                        int argc, char **argv MAIN_AUXVEC_DECL)
{
  exit (main (argc, argv, __environ MAIN_AUXVEC_PARAM));
}

One important detail is that when main returns, its return value is passed to exit().

This analysis focuses on the implementation of exit().

Analysis

The referenced snippets come from the glibc source tree; they can be cross-referenced by searching for the relevant symbols.

The exit function is a thin wrapper around __run_exit_handlers.

void
exit (int status)
{
  __run_exit_handlers (status, &__exit_funcs, true, true);
}
libc_hidden_def (exit)

The function __run_exit_handlers takes four parameters:

extern void __run_exit_handlers (int status,
				 struct exit_function_list **listp,
				 bool run_list_atexit, bool run_dtors)
  attribute_hidden __attribute__ ((__noreturn__));

status: the exit status passed to exit(), either from the return value of main or from a direct call to exit().

listp: a pointer to the list of registered exit handlers (exit_function_list), which stores functions registered via atexit, __cxa_atexit, and on_exit.

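run_list_atexit: a boolean that, in the version shown here, controls whether _IO_cleanup is called after the handler list has been processed (see the end of __run_exit_handlers below).

run_dtors: a boolean that determines whether TLS destructors are run via __call_tls_dtors before the handler list is processed.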

The following data structures are used to manage functions registered using atexit, on_exit, and __cxa_atexit:

enum
{
  ef_free,	/* `ef_free' MUST be zero!  */
  ef_us,
  ef_on,
  ef_at,
  ef_cxa
};

struct exit_function
  {
    /* `flavour' should be of type of the `enum' above but since we need
       this element in an atomic operation we have to use `long int'.  */
    long int flavor;
    union
      {
	void (*at) (void);
	struct
	  {
	    void (*fn) (int status, void *arg);
	    void *arg;
	  } on;
	struct
	  {
	    void (*fn) (void *arg, int status);
	    void *arg;
	    void *dso_handle;
	  } cxa;
      } func;
  };
struct exit_function_list
  {
    struct exit_function_list *next;
    size_t idx;
    struct exit_function fns[32];
  };

extern struct exit_function_list *__exit_funcs attribute_hidden;

The exit_function_list structure is a node in a linked list of registered exit handlers. Each node holds up to 32 exit_function objects, which represent functions registered via atexit, on_exit, and __cxa_atexit; idx records how many slots of the node are in use.

Each exit_function stores a function pointer along with its associated metadata, depending on how it was registered.

Although these structures present potential avenues for code execution, they are not the focus of this analysis.

Moving on, here is the implementation of __run_exit_handlers:

void
attribute_hidden
__run_exit_handlers (int status, struct exit_function_list **listp,
		     bool run_list_atexit, bool run_dtors)
{
  /* The exit should never return, so there is no need to unlock it.  */
  __libc_lock_lock_recursive (__exit_lock);

  /* First, call the TLS destructors.  */
  if (run_dtors)
    call_function_static_weak (__call_tls_dtors);

  __libc_lock_lock (__exit_funcs_lock);

  /* We do it this way to handle recursive calls to exit () made by
     the functions registered with `atexit' and `on_exit'. We call
     everyone on the list and use the status value in the last
     exit (). */
  while (true)
    {
      struct exit_function_list *cur;

    restart:
      cur = *listp;

      if (cur == NULL)
	{
	  /* Exit processing complete.  We will not allow any more
	     atexit/on_exit registrations.  */
	  __exit_funcs_done = true;
	  break;
	}

      while (cur->idx > 0)
	{
	  struct exit_function *const f = &cur->fns[--cur->idx];
	  const uint64_t new_exitfn_called = __new_exitfn_called;

	  switch (f->flavor)
	    {
	      void (*atfct) (void);
	      void (*onfct) (int status, void *arg);
	      void (*cxafct) (void *arg, int status);
	      void *arg;

	    case ef_free:
	    case ef_us:
	      break;
	    case ef_on:
	      onfct = f->func.on.fn;
	      arg = f->func.on.arg;
	      PTR_DEMANGLE (onfct);

	      /* Unlock the list while we call a foreign function.  */
	      __libc_lock_unlock (__exit_funcs_lock);
	      onfct (status, arg);
	      __libc_lock_lock (__exit_funcs_lock);
	      break;
	    case ef_at:
	      atfct = f->func.at;
	      PTR_DEMANGLE (atfct);

	      /* Unlock the list while we call a foreign function.  */
	      __libc_lock_unlock (__exit_funcs_lock);
	      atfct ();
	      __libc_lock_lock (__exit_funcs_lock);
	      break;
	    case ef_cxa:
	      /* To avoid dlclose/exit race calling cxafct twice (BZ 22180),
		 we must mark this function as ef_free.  */
	      f->flavor = ef_free;
	      cxafct = f->func.cxa.fn;
	      arg = f->func.cxa.arg;
	      PTR_DEMANGLE (cxafct);

	      /* Unlock the list while we call a foreign function.  */
	      __libc_lock_unlock (__exit_funcs_lock);
	      cxafct (arg, status);
	      __libc_lock_lock (__exit_funcs_lock);
	      break;
	    }

	  if (__glibc_unlikely (new_exitfn_called != __new_exitfn_called))
	    /* The last exit function, or another thread, has registered
	       more exit functions.  Start the loop over.  */
	    goto restart;
	}

      *listp = cur->next;
      if (*listp != NULL)
	/* Don't free the last element in the chain, this is the statically
	   allocate element.  */
	free (cur);
    }

  __libc_lock_unlock (__exit_funcs_lock);

  if (run_list_atexit)
    call_function_static_weak (_IO_cleanup);

  _exit (status);
}

After all cleanup is complete, the process finally terminates via _exit(), which on Linux issues the exit_group system call.

Of particular interest is the following portion of __run_exit_handlers:

  /* The exit should never return, so there is no need to unlock it.  */
  __libc_lock_lock_recursive (__exit_lock);

  /* First, call the TLS destructors.  */
  if (run_dtors)
    call_function_static_weak (__call_tls_dtors);

  __libc_lock_lock (__exit_funcs_lock);

After acquiring __exit_lock, the function checks whether run_dtors is true. If so, __call_tls_dtors is invoked before any of the registered exit handlers run.

What are TLS destructors?

TLS (Thread-Local Storage) destructors are functions associated with thread-local variables. They are automatically invoked when a thread exits, or during program termination via exit().

Unlike regular destructors in object-oriented programming, TLS destructors are specifically tied to thread-local data and are managed by the runtime.

Here’s a code sample to demonstrate this.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

pthread_key_t key;

void destructor(void *ptr) {
    printf("TLS destructor called! ptr = %p\n", ptr);
}

void *thread_func(void *arg) {
    void *data = malloc(0x20);
    pthread_setspecific(key, data);
    printf("Thread exiting...\n");
    return NULL;
}

int main() {
    pthread_t t;

    // Register TLS destructor
    pthread_key_create(&key, destructor);

    pthread_create(&t, NULL, thread_func, NULL);
    pthread_join(t, NULL);

    printf("Main exiting...\n");
    return 0;
}

I compiled the program against glibc 2.42 built from source. On execution, the following output is produced:

[Image: tls sample]

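Notice that the thread function does not call destructor directly. Instead, glibc invokes it automatically when the thread exits, because it was registered as the cleanup routine for the thread-specific data associated with key. Note that pthread key destructors are a separate mechanism from the list examined next: entries on tls_dtor_list are registered through __cxa_thread_atexit_impl, which compilers use for C++ thread_local objects with non-trivial destructors.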

Moving on, the following is the implementation of __call_tls_dtors:

/* Call the destructors.  This is called either when a thread returns from the
   initial function or when the process exits via the exit function.  */
void
__call_tls_dtors (void)
{
  while (tls_dtor_list)
    {
      struct dtor_list *cur = tls_dtor_list;
      dtor_func func = cur->func;
      PTR_DEMANGLE (func);

      tls_dtor_list = tls_dtor_list->next;
      func (cur->obj);

      /* Ensure that the MAP dereference happens before
	 l_tls_dtor_count decrement.  That way, we protect this access from a
	 potential DSO unload in _dl_close_worker, which happens when
	 l_tls_dtor_count is 0.  See CONCURRENCY NOTES for more detail.  */
      atomic_fetch_add_release (&cur->map->l_tls_dtor_count, -1);
      free (cur);
    }
}
libc_hidden_def (__call_tls_dtors)

The logic is fairly straightforward. As long as tls_dtor_list is not empty, glibc repeatedly takes the current entry, extracts its function pointer, demangles it, advances the list head, and then invokes the destructor on the stored object.

After the destructor call, the associated TLS destructor count is decremented and the current list node is freed.

But what is dtor_list?

typedef void (*dtor_func) (void *);

struct dtor_list
{
  dtor_func func;
  void *obj;
  struct link_map *map;
  struct dtor_list *next;
};

static __thread struct dtor_list *tls_dtor_list;

Here, tls_dtor_list is a thread-local pointer to a dtor_list, meaning each thread maintains its own destructor list.

The structure itself forms a singly linked list with the following important fields:

func: the destructor function to invoke.

obj: the object or data passed as an argument to the destructor.

map: a pointer to the associated link_map, which describes the loaded shared object.

next: a pointer to the next destructor entry in the list.

From an exploitation perspective, the most important field is func, since it is eventually invoked indirectly after being passed through PTR_DEMANGLE.

Recall that tls_dtor_list resides in thread-local storage and is therefore writable at runtime. With an arbitrary write primitive, we could forge a fake dtor_list structure and corrupt tls_dtor_list so that it points to the forged node.

Control over the next field would also allow additional fake entries to be chained into the list.

When __call_tls_dtors is later executed, glibc will walk through our forged list and eventually invoke the destructor function stored in func.

This provides a powerful primitive: not only do we gain control over the function pointer, but we also control its first argument:

cur->func(cur->obj);  // system("/bin/sh") => profit!

However, this is not as straightforward as it appears.

Before invocation, the function pointer is passed through PTR_DEMANGLE, a glibc hardening mechanism that obfuscates stored function pointers so that an attacker cannot simply write a raw address into them.

How do we break this?

First we need to know how PTR_DEMANGLE works!

#define POINTER_GUARD 48

# ifdef __ASSEMBLER__
#  define PTR_MANGLE(reg)       xor %fs:POINTER_GUARD, reg;                   \
                                rol $2*LP_SIZE+1, reg
#  define PTR_DEMANGLE(reg)     ror $2*LP_SIZE+1, reg;                        \
                                xor %fs:POINTER_GUARD, reg
# else
#  define PTR_MANGLE(var)       asm ("xor %%fs:%c2, %0\n"                     \
                                     "rol $2*" LP_SIZE "+1, %0"               \
                                     : "=r" (var)                             \
                                     : "0" (var),                             \
                                       "i" (POINTER_GUARD))
#  define PTR_DEMANGLE(var)     asm ("ror $2*" LP_SIZE "+1, %0\n"             \
                                     "xor %%fs:%c2, %0"                       \
                                     : "=r" (var)                             \
                                     : "0" (var),                             \
                                       "i" (POINTER_GUARD))
# endif

The macro definition is somewhat messy, but its disassembly makes the operation clear:

0x00007ffff7e16dc4 <+36>:	mov    rax,QWORD PTR [rbx]
0x00007ffff7e16dc7 <+39>:	ror    rax,0x11
0x00007ffff7e16dcb <+43>:	xor    rax,QWORD PTR fs:0x30

In other words, glibc demangles the stored function pointer by rotating it right by 0x11 and then XORing it with the value at fs:0x30.

Mangling performs the inverse: it XORs the pointer with the guard and then rotates the result left by 0x11.

On x86_64, the fs register points to the thread control block (TCB), which provides access to thread-local storage. The offset 0x30 corresponds to the pointer_guard field in tcbhead_t:

typedef struct
{
  void *tcb;		/* Pointer to the TCB.  Not necessarily the
			   thread descriptor used by libpthread.  */
  dtv_t *dtv;
  void *self;		/* Pointer to the thread descriptor.  */
  int multiple_threads;
  int gscope_flag;
  uintptr_t sysinfo;
  uintptr_t stack_guard;
  uintptr_t pointer_guard;
  unsigned long int unused_vgetcpu_cache[2];
  /* Bit 0: X86_FEATURE_1_IBT.
     Bit 1: X86_FEATURE_1_SHSTK.
   */
  unsigned int feature_1;
  int __glibc_unused1;
  /* Reservation of some values for the TM ABI.  */
  void *__private_tm[4];
  /* GCC split stack support.  */
  void *__private_ss;
  /* The marker for the current shadow stack.  */
  unsigned long long int ssp_base;
  /* Must be kept even if it is no longer used by glibc since programs,
     like AddressSanitizer, depend on the size of tcbhead_t.  */
  __128bits __glibc_unused2[8][4] __attribute__ ((aligned (32)));

  void *__padding[8];
} tcbhead_t;

[Image: fs_sample]

Inspecting it with gdb shows that it contains what appears to be a randomized 8-byte value:

[Image: random]

The pointer guard is initialized in security_init:

static void
security_init (void)
{
  /* Set up the stack checker's canary.  */
  uintptr_t stack_chk_guard = _dl_setup_stack_chk_guard (_dl_random);
#ifdef THREAD_SET_STACK_GUARD
  THREAD_SET_STACK_GUARD (stack_chk_guard);
#else
  __stack_chk_guard = stack_chk_guard;
#endif

  /* Set up the pointer guard as well, if necessary.  */
  uintptr_t pointer_chk_guard
    = _dl_setup_pointer_guard (_dl_random, stack_chk_guard);
#ifdef THREAD_SET_POINTER_GUARD
  THREAD_SET_POINTER_GUARD (pointer_chk_guard);
#endif
  __pointer_chk_guard_local = pointer_chk_guard;

  /* We do not need the _dl_random value anymore.  The less
     information we leave behind, the better, so clear the
     variable.  */
  _dl_random = NULL;
}

For our purposes, the important takeaway is not the exact generation routine, but the fact that the pointer guard is stored per thread in TLS.

As a result, forging a valid mangled function pointer requires either recovering the current thread’s guard or overwriting it.

In our case, we assume an arbitrary write primitive, so we overwrite its current value.

Exploitation

I wrote a pwnable problem to showcase this technique.

Here’s the source code:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <asm/prctl.h>
#include <syscall.h>
#include <inttypes.h>

typedef struct buf_t {
    char buf[0x100];
} buf_t;

void setup() __attribute__((constructor));

void setup() {
    setvbuf(stdin, NULL, _IONBF, 0);
    setvbuf(stdout, NULL, _IONBF, 0);
    setvbuf(stderr, NULL, _IONBF, 0);
}

int main() {
    uint32_t choice = 0;
    uint64_t addr = 0, value = 0;
    void *tmp = malloc(0x20);
    size_t *fs_base;

    syscall(SYS_arch_prctl, ARCH_GET_FS, &fs_base);
    printf("stdout: %p\n", stdout);
    printf("fs_base: %p\n", fs_base);
    printf("heap: %p\n", tmp);

    printf("enter data: ");
    buf_t *buf = (buf_t *)malloc(sizeof(buf_t));
    read(STDIN_FILENO, buf, sizeof(buf_t));

    while (1) {
        printf("[1]. store addr and val\n[2]. arb write\n[3]. end program\n> ");
        
        if (scanf("%u", &choice) != 1)
            exit(-1);

        if (choice == 1) {
            printf("addr: ");
            if (scanf("%" SCNu64, &addr) != 1)
                exit(-1);

            printf("val: ");
            if (scanf("%" SCNu64, &value) != 1)
                exit(-1);
        } else if (choice == 2) {
            *(uint64_t *)addr = value;
        } else {
            break;
        }
    }

    return 0;
}

After compilation, we check the enabled protections with checksec:

mark@rwx:~/Desktop/CodeAnalysis$ ./run.sh 
mark@rwx:~/Desktop/CodeAnalysis$ checksec main
[*] '/home/mark/Desktop/CodeAnalysis/main'
    Arch:       amd64-64-little
    RELRO:      Full RELRO
    Stack:      Canary found
    NX:         NX enabled
    PIE:        PIE enabled
    RUNPATH:    b'/home/mark/Desktop/CodeAnalysis/glibc/glibc-2.42-out/lib'
    SHSTK:      Enabled
    IBT:        Enabled
    Stripped:   No
mark@rwx:~/Desktop/CodeAnalysis$ 

All protections are enabled.

The program itself is simple. First, it prints several leaks:

  • stdout which we can use to calculate the base of libc
  • fs base
  • heap leak

Our input is then read into a heap buffer, after which the program enters a loop that offers three options:

  • sets addr, val
  • stores val into addr
  • breaks out of the loop

So we are simply given a write-what-where primitive (arb write) and we have to use this to gain code execution.

As already explained, we can leverage the exit() mechanism to achieve this.

This post is licensed under CC BY 4.0 by the author.