Linux: unmap_mapping_range() race with munmap() on VM_PFNMAP mappings leads to stale TLB entry

For VM_PFNMAP VMAs, there is a race between unmap_mapping_range() and munmap() that can lead to a page being freed by a device driver while the page still has stale TLB entries.

There are drivers (in particular GPU drivers) that create VM_PFNMAP VMAs containing PTEs that point to normal pages from the page allocator. VM_PFNMAP means that the core kernel won't track these pages using the page mapcounts; instead, the driver is responsible for holding references to a page as long as it is mapped into userspace. Some of these drivers have codepaths that can remove userspace mappings of such pages using unmap_mapping_range() and then give these pages back to the page allocator; for example, i915 does this in its shrinker callback i915_gem_shrink(). For this driver behavior to be correct, it is necessary that, by the time unmap_mapping_range() returns, all PTEs in the specified range have been removed and the corresponding TLB flushes have been executed.

However, munmap() ends up in unmap_region(), which does this:

        struct mmu_gather tlb;

        lru_add_drain();
        tlb_gather_mmu(&tlb, mm);
        update_hiwater_rss(mm);
        unmap_vmas(&tlb, vma, start, end);
        free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
                      next ? next->vm_start : USER_PGTABLES_CEILING);
        tlb_finish_mmu(&tlb);

unmap_vmas() removes all PTEs in the range, but does not necessarily perform a TLB flush yet. free_pgtables() then removes the VMA from the mapping's rbtree (unlink_file_vma()) before tearing down the page tables in the range:

        void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
                        unsigned long floor, unsigned long ceiling)
        {
                while (vma) {
                        struct vm_area_struct *next = vma->vm_next;
                        unsigned long addr = vma->vm_start;

                        /*
                         * Hide vma from rmap and truncate_pagecache before freeing
                         * pgtables
                         */
                        unlink_anon_vmas(vma);
                        unlink_file_vma(vma);

                        if (is_vm_hugetlb_page(vma)) {
                                [...]
                        } else {
                                [... irrelevant optimization ...]
                                free_pgd_range(tlb, addr, vma->vm_end,
                                        floor, next ? next->vm_start : ceiling);
                        }
                        vma = next;
                }
        }

The TLB flush corresponding to the PTEs removed in unmap_vmas() might only happen afterwards, in tlb_finish_mmu(). This is bad because, starting at unlink_file_vma(), the VMA is no longer visible to unmap_mapping_range(). If the driver calls unmap_mapping_range() right after munmap() has reached unlink_file_vma(), unmap_mapping_range() won't notice the existence of this VMA; it might return while there are still stale TLB entries pointing to the page, and the driver could then free the page while userspace can still read and write it through the stale TLB entry.

It would be a pain to actually hit this bug through the i915 driver, though, since the only place it ever uses unmap_mapping_range() like this is the i915_gem_shrink() shrinker callback. Instead, I wrote a reproducer against an out-of-tree GPU driver in which the unmap_mapping_range() path can be triggered directly from userspace, and on a system with CONFIG_PAGE_POISONING, I managed to read PAGE_POISON (0xaa) from userspace through the stale TLB entry after a few iterations. So sadly I don't have a nice reproducer for this issue that works upstream.

I guess if we want to avoid extra TLB flushes for non-PFNMAP/MIXEDMAP VMAs, a possible fix would be to add a new bit in struct mmu_gather that tracks whether any PTEs without a corresponding struct page were removed, and to conditionally flush before free_pgtables() if either that bit is set or mm_tlb_flush_nested() is true? A rough sketch of this idea, along with a sketch of the affected driver pattern, follows below.
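To make the driver pattern described above concrete, here is a minimal sketch of what such a driver looks like. This is not taken from i915 or any other real driver; all pfnmap_demo_* names are made up, and the shrink path is simplified to a single page:

        #include <linux/fs.h>
        #include <linux/gfp.h>
        #include <linux/mm.h>

        struct pfnmap_demo_obj {
                struct address_space *mapping; /* the file's ->f_mapping */
                struct page *page;             /* from alloc_page(GFP_KERNEL) */
        };

        static vm_fault_t pfnmap_demo_fault(struct vm_fault *vmf)
        {
                struct pfnmap_demo_obj *obj = vmf->vma->vm_private_data;

                /* VM_PFNMAP: insert the PFN directly, no mapcount tracking */
                return vmf_insert_pfn(vmf->vma, vmf->address,
                                      page_to_pfn(obj->page));
        }

        static const struct vm_operations_struct pfnmap_demo_vm_ops = {
                .fault = pfnmap_demo_fault,
        };

        static int pfnmap_demo_mmap(struct file *file, struct vm_area_struct *vma)
        {
                vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND;
                vma->vm_ops = &pfnmap_demo_vm_ops;
                vma->vm_private_data = file->private_data;
                return 0;
        }

        /*
         * Shrinker-style teardown: the driver relies on the guarantee that,
         * once unmap_mapping_range() has returned, no CPU can still reach
         * the page through a stale TLB entry, so the page can be handed
         * back to the page allocator.
         */
        static void pfnmap_demo_shrink(struct pfnmap_demo_obj *obj)
        {
                unmap_mapping_range(obj->mapping, 0, PAGE_SIZE, 1);
                __free_page(obj->page); /* use-after-free if a stale TLB entry survives */
                obj->page = NULL;
        }

The race described above is exactly the window in which pfnmap_demo_shrink() can run between munmap()'s unlink_file_vma() and the deferred TLB flush in tlb_finish_mmu().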
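For context on why the VMA becomes invisible: unmap_mapping_range() only operates on VMAs that are still present in the file's mapping->i_mmap interval tree, and unlink_file_vma() is what takes the VMA out of that tree. The following is a simplified sketch of the lookup side, not verbatim kernel code, and the helper name is made up:

        /*
         * Simplified sketch of the VMA walk done on behalf of
         * unmap_mapping_range(): only VMAs still linked into
         * mapping->i_mmap are visited. After unlink_file_vma(), a VMA is
         * skipped here even though its already-removed PTEs may not have
         * been flushed from all TLBs yet.
         */
        static void sketch_unmap_mapping_range_tree(struct address_space *mapping,
                                                    pgoff_t first, pgoff_t last)
        {
                struct vm_area_struct *vma;

                i_mmap_lock_read(mapping);
                vma_interval_tree_foreach(vma, &mapping->i_mmap, first, last) {
                        /* zap the PTEs covering [first, last] in this VMA */
                        (void)vma;
                }
                i_mmap_unlock_read(mapping);
        }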
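The mitigation idea at the end of the report could look roughly like this. It is only a sketch, not an actual patch: the removed_pfnmap_ptes flag does not exist in struct mmu_gather and would have to be set by the PTE-zapping code whenever it removes entries from a VM_PFNMAP/VM_MIXEDMAP VMA; the surrounding code just mirrors the unmap_region() snippet quoted above (pre-maple-tree, with vma->vm_next still present):

        static void unmap_region(struct mm_struct *mm, struct vm_area_struct *vma,
                                 struct vm_area_struct *prev, unsigned long start,
                                 unsigned long end)
        {
                struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
                struct mmu_gather tlb;

                lru_add_drain();
                tlb_gather_mmu(&tlb, mm);
                update_hiwater_rss(mm);
                unmap_vmas(&tlb, vma, start, end);

                /*
                 * Hypothetical: flush the TLB entries for the PTEs removed
                 * above before free_pgtables() hides the VMAs from
                 * rmap/truncation, so that a concurrent unmap_mapping_range()
                 * cannot return while stale translations still exist. Skipped
                 * when no PFNMAP/MIXEDMAP PTEs were removed and no concurrent
                 * unmapper is active.
                 */
                if (tlb.removed_pfnmap_ptes || mm_tlb_flush_nested(mm))
                        tlb_flush_mmu(&tlb);

                free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
                              next ? next->vm_start : USER_PGTABLES_CEILING);
                tlb_finish_mmu(&tlb);
        }

Whether the flush belongs here or inside free_pgtables() itself is a design choice this sketch does not try to settle.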
This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2022-10-04.

Found by: jannh@google.com