Linux: unmap_mapping_range() race with munmap() on VM_PFNMAP mappings leads to stale TLB entry

For VM_PFNMAP VMAs, there is a race between unmap_mapping_range() and munmap() that can lead to a page being freed by a device driver while the page still has stale TLB entries.

There are drivers (in particular GPU drivers) that create VM_PFNMAP VMAs containing PTEs that point to normal pages from the page allocator. VM_PFNMAP means that the core kernel won't track these pages using the page mapcounts; instead, the driver is responsible for holding references to a page as long as it is mapped into userspace. Some of these drivers have codepaths that can remove userspace mappings of such pages using unmap_mapping_range() and then give these pages back to the page allocator; for example, i915 does this in its shrinker callback i915_gem_shrink(). For this driver behavior to be correct, it is necessary that, by the time unmap_mapping_range() returns, all PTEs in the specified range have been removed and the corresponding TLB flushes have been executed.

However, munmap() ends up in unmap_region(), which does this:

        struct mmu_gather tlb;

        lru_add_drain();
        tlb_gather_mmu(&tlb, mm);
        update_hiwater_rss(mm);
        unmap_vmas(&tlb, vma, start, end);
        free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
                      next ? next->vm_start : USER_PGTABLES_CEILING);
        tlb_finish_mmu(&tlb);

unmap_vmas() removes all PTEs in the range, but does not necessarily perform a TLB flush yet. free_pgtables() then removes the VMA from the mapping's rbtree (unlink_file_vma()) before tearing down the page tables in the range:

        void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
                        unsigned long floor, unsigned long ceiling)
        {
                while (vma) {
                        struct vm_area_struct *next = vma->vm_next;
                        unsigned long addr = vma->vm_start;

                        /*
                         * Hide vma from rmap and truncate_pagecache before freeing
                         * pgtables
                         */
                        unlink_anon_vmas(vma);
                        unlink_file_vma(vma);

                        if (is_vm_hugetlb_page(vma)) {
                                [...]
                        } else {
                                [... irrelevant optimization ...]
                                free_pgd_range(tlb, addr, vma->vm_end,
                                        floor, next ? next->vm_start : ceiling);
                        }
                        vma = next;
                }
        }

The TLB flush corresponding to the PTEs removed in unmap_vmas() might only happen afterwards, in tlb_finish_mmu(). This is bad because, starting at unlink_file_vma(), the VMA is no longer visible to unmap_mapping_range(). If the driver calls unmap_mapping_range() right after munmap() has reached unlink_file_vma(), unmap_mapping_range() won't notice the existence of this VMA; it might return while there are still stale TLB entries pointing to the page, and the driver could then free the page while userspace can still read and write it through the stale TLB entry.

It would be a pain to actually hit this bug through the i915 driver, though, since the only place it ever uses unmap_mapping_range() like this is the i915_gem_shrink() shrinker callback. Instead, I wrote a reproducer against an out-of-tree GPU driver in which the unmap_mapping_range() path can be triggered directly from userspace, and on a system with CONFIG_PAGE_POISONING, I managed to read PAGE_POISON (0xaa) from userspace through the stale TLB entry after a few iterations. So sadly I don't have a nice reproducer for this issue that works upstream.

I guess if we want to avoid extra TLB flushes for non-PFNMAP/MIXEDMAP VMAs, a possible fix would be to add a new bit in struct mmu_gather that tracks whether any PTEs without a corresponding struct page were removed, and to conditionally flush before free_pgtables() if either that bit is set or mm_tlb_flush_nested() is true? A rough sketch of this idea, along with a sketch of the affected driver pattern, follows below.
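To make the driver pattern described above concrete, here is a minimal sketch of what such a driver looks like. This is not taken from i915 or any other real driver; all pfnmap_demo_* names are made up, and the shrink path is simplified to a single page:

        #include <linux/fs.h>
        #include <linux/gfp.h>
        #include <linux/mm.h>

        struct pfnmap_demo_obj {
                struct address_space *mapping; /* the file's ->f_mapping */
                struct page *page;             /* from alloc_page(GFP_KERNEL) */
        };

        static vm_fault_t pfnmap_demo_fault(struct vm_fault *vmf)
        {
                struct pfnmap_demo_obj *obj = vmf->vma->vm_private_data;

                /* VM_PFNMAP: insert the PFN directly, no mapcount tracking */
                return vmf_insert_pfn(vmf->vma, vmf->address,
                                      page_to_pfn(obj->page));
        }

        static const struct vm_operations_struct pfnmap_demo_vm_ops = {
                .fault = pfnmap_demo_fault,
        };

        static int pfnmap_demo_mmap(struct file *file, struct vm_area_struct *vma)
        {
                vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND;
                vma->vm_ops = &pfnmap_demo_vm_ops;
                vma->vm_private_data = file->private_data;
                return 0;
        }

        /*
         * Shrinker-style teardown: the driver relies on the guarantee that,
         * once unmap_mapping_range() has returned, no CPU can still reach
         * the page through a stale TLB entry, so the page can be handed
         * back to the page allocator.
         */
        static void pfnmap_demo_shrink(struct pfnmap_demo_obj *obj)
        {
                unmap_mapping_range(obj->mapping, 0, PAGE_SIZE, 1);
                __free_page(obj->page); /* use-after-free if a stale TLB entry survives */
                obj->page = NULL;
        }

The race described above is exactly the window in which pfnmap_demo_shrink() can run between munmap()'s unlink_file_vma() and the deferred TLB flush in tlb_finish_mmu().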
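For context on why the VMA becomes invisible: unmap_mapping_range() only operates on VMAs that are still present in the file's mapping->i_mmap interval tree, and unlink_file_vma() is what takes the VMA out of that tree. The following is a simplified sketch of the lookup side, not verbatim kernel code, and the helper name is made up:

        /*
         * Simplified sketch of the VMA walk done on behalf of
         * unmap_mapping_range(): only VMAs still linked into
         * mapping->i_mmap are visited. After unlink_file_vma(), a VMA is
         * skipped here even though its already-removed PTEs may not have
         * been flushed from all TLBs yet.
         */
        static void sketch_unmap_mapping_range_tree(struct address_space *mapping,
                                                    pgoff_t first, pgoff_t last)
        {
                struct vm_area_struct *vma;

                i_mmap_lock_read(mapping);
                vma_interval_tree_foreach(vma, &mapping->i_mmap, first, last) {
                        /* zap the PTEs covering [first, last] in this VMA */
                        (void)vma;
                }
                i_mmap_unlock_read(mapping);
        }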
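The mitigation idea at the end of the report could look roughly like this. It is only a sketch, not an actual patch: the removed_pfnmap_ptes flag does not exist in struct mmu_gather and would have to be set by the PTE-zapping code whenever it removes entries from a VM_PFNMAP/VM_MIXEDMAP VMA; the surrounding code just mirrors the unmap_region() snippet quoted above (pre-maple-tree, with vma->vm_next still present):

        static void unmap_region(struct mm_struct *mm, struct vm_area_struct *vma,
                                 struct vm_area_struct *prev, unsigned long start,
                                 unsigned long end)
        {
                struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
                struct mmu_gather tlb;

                lru_add_drain();
                tlb_gather_mmu(&tlb, mm);
                update_hiwater_rss(mm);
                unmap_vmas(&tlb, vma, start, end);

                /*
                 * Hypothetical: flush the TLB entries for the PTEs removed
                 * above before free_pgtables() hides the VMAs from
                 * rmap/truncation, so that a concurrent unmap_mapping_range()
                 * cannot return while stale translations still exist. Skipped
                 * when no PFNMAP/MIXEDMAP PTEs were removed and no concurrent
                 * unmapper is active.
                 */
                if (tlb.removed_pfnmap_ptes || mm_tlb_flush_nested(mm))
                        tlb_flush_mmu(&tlb);

                free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
                              next ? next->vm_start : USER_PGTABLES_CEILING);
                tlb_finish_mmu(&tlb);
        }

Whether the flush belongs here or inside free_pgtables() itself is a design choice this sketch does not try to settle.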
This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2022-10-04.

Found by: jannh@google.com