tl;dr Grab the release binary from our repo and have fun. Also, happy new year; 2021 couldn’t end soon enough.
Background
A while back, one of my coworkers on the PSC team asked me about ways to make their custom credit card data scanner cloud native so that it could assess Kubernetes clusters. While the large quantity of advice and sympathy I offered them is out of scope for this post, the thought of such a tool got me thinking about ways to get external code running inside of running containers, especially read-only ones.
Note the phrase “external code,” as in code not already in a container image. This is the main problem that makes nsenter completely useless and an absolute headache to use; you can either run a binary from the container, or you can run your binary in every namespace of the container but the mount namespace. This is “fine” when all you want to do is some limited network namespace testing, but it’s a nonstarter when you want to grep across the filesystem that the container actually uses or interact with its actual procfs.
Obviously, outside of injecting code into a process already in a namespace, any implementation is going to involve setns(2). The problem is that you need to somehow call setns(2) after the execve(2) syscall, and ideally after the early init of the program being thrown into a foreign namespace.
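To make the ordering problem concrete, here is a minimal Python sketch of a process moving itself into another process’s namespaces once it is already running, which is the only point at which your own code gets to call setns(2). This is not insject’s implementation; the helper names `namespace_paths` and `join_namespace` are made up for illustration, and actually joining a namespace requires sufficient privileges (typically root).

```python
import ctypes
import os

# Namespace types as they appear under /proc/<pid>/ns on Linux.
NAMESPACE_TYPES = ("cgroup", "ipc", "mnt", "net", "pid", "time", "user", "uts")


def namespace_paths(pid, proc_root="/proc"):
    """Map each namespace type to the procfs path that identifies it for `pid`."""
    return {ns: os.path.join(proc_root, str(pid), "ns", ns) for ns in NAMESPACE_TYPES}


def join_namespace(path):
    """Call setns(2) on the namespace at `path` (hypothetical helper, needs privileges)."""
    libc = ctypes.CDLL(None, use_errno=True)
    fd = os.open(path, os.O_RDONLY)
    try:
        # nstype 0 means "accept whatever namespace type this fd refers to".
        if libc.setns(fd, 0) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)
```

Something like `join_namespace(namespace_paths(target_pid)["net"])` only works because the interpreter has long since been exec’d and initialized from the host filesystem; that is the property an arbitrary binary does not give you for free.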
Introducing insject, Your Friendly Neigh^H^H^H^Hamespace Injection Utility
insject started out life as a proof-of-concept C program and then an lldb script version. It is now a Rust program compiled into a dual shared library-executable; more on that later. insject’s primary feature is that it enables you to do all of your container attach fun after execve(2)-ing. Its secondary feature is the flexibility it offers in doing so.
insject is fundamentally an LD_PRELOAD library that uses libfrida-gum to set up a hook on a configurable symbol/function/address/etc. (default main()) that, when hit, kicks off the container attach logic (including all of the setns(2) calls). This enables deferring the namespace attaches to basically whenever it is fine to do so. It also does a lot of work to hide its LD_PRELOAD environment variable from the process loaded with it to prevent it from being propagated to child processes; this is a major issue because shells such as Bash do a lot of wacky things with their environment variables.
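The propagation problem is easy to see from the launcher’s side. The following Python sketch shows the cooperative version of the idea: build a child environment with the preload variable dropped. insject has to do the equivalent from inside an uncooperative process, which is much harder; the function name `scrubbed_environment` is purely illustrative.

```python
import os


def scrubbed_environment(environ=None, variable="LD_PRELOAD"):
    """Return a copy of the environment with `variable` removed."""
    env = dict(os.environ if environ is None else environ)
    env.pop(variable, None)  # no error if the variable was never set
    return env
```

A launcher would pass the result as `env=` to `subprocess.run(...)` so that children never see the preload.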
As is par for the course, insject supports using arbitrary combinations of Linux namespaces based on process PIDs, uid:gid:groups, and AppArmor profiles (defaulting to the profile used by the target process). Additionally, insject supports controlling whether the userns setns(2) call happens before or after the other setns(2) calls, which can be occasionally useful. insject also supports (in LD_PRELOAD-based modes) fork(2)-ing after namespace joins when joining a PID namespace.
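As a rough illustration of the ordering knobs described above, here is a small Python sketch that plans a join order: the user namespace goes either first or last, and the PID namespace goes after the other non-user namespaces. The function names and the exact order are my assumptions for the example, not a description of insject’s internals. The fork matters because setns(2) into a PID namespace only affects children created afterwards, not the caller itself.

```python
def plan_joins(namespaces, userns_first=True):
    """Order namespace joins: user ns first or last, pid ns after the rest."""
    requested = list(dict.fromkeys(namespaces))  # de-duplicate, keep order
    others = [ns for ns in requested if ns not in ("user", "pid")]
    order = others + (["pid"] if "pid" in requested else [])
    if "user" in requested:
        order = ["user"] + order if userns_first else order + ["user"]
    return order


def needs_fork(order):
    """A PID namespace join only takes effect for children, so fork afterwards."""
    return "pid" in order
```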
insject supports multiple methods of operation:
- shared library: LD_PRELOAD with CLI-style arguments in an environment variable
- shared library: LD_PRELOAD with JSON-encoded configuration in an environment variable
- executable: exec into target command with self as LD_PRELOADed shared library and JSON-encoded configuration environment variable
- executable: remotely debug a target process into joining namespaces
That’s .so executable
So as to gain maximum compatibility without having to resort to tricks like memfd_create(2), insject is built as a dual shared library-executable so that it may use its own path as an LD_PRELOAD target. Up until recently, such polyglot PIE executables were fairly standard affairs, but glibc in their infinite fatuity decided that such things should not be allowed to exist and on recent glibc versions, dlopen(3) will reject PIE executables. To get around this, we use the standard mechanism of removing the PIE flag from the executable because dynamically linked ELF executables are shared objects. However, due to this chicanery, there appear to be some weird edge cases with ld.so and/or Rust that result in the LD_PRELOAD execution environment being a bit finicky with such binaries, resulting in segfaults in seemingly non-unsafe code; so it’s very likely a glibc hubris bug.
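The claim that dynamically linked executables are shared objects can be checked at the ELF header level: a PIE executable and a .so both report an `e_type` of `ET_DYN`. The Python sketch below reads just that field. It does not inspect or modify the PIE flag itself, which lives in the dynamic section rather than the header, and the function name `elf_type` is only for illustration.

```python
import struct

ET_EXEC = 2  # fixed-address executable
ET_DYN = 3   # shared object; PIE executables report this type as well


def elf_type(header):
    """Return e_type from the first 18 (or more) bytes of an ELF file."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # e_ident[EI_DATA] is byte 5: 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_type is the 16-bit field immediately after the 16-byte e_ident array.
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return e_type
```

For example, `elf_type(open("/proc/self/exe", "rb").read(18))` reports the type of the running interpreter’s binary on Linux.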
Unfortunately, due to weird issues I’m still working out, debuggers seem to have issues with injecting these kinds of binaries into processes, so the remote debug mode is currently a C version of the insject payload packed into an lldb call expression.
The other oddity with the build is that Rust/cargo don’t really like linking arbitrary statically linked libraries (.a files) into shared libraries, which insject can be built as. And they also have all sorts of annoying issues with library lookup paths for such things, especially when compiling C as part of the build process, which insject intrinsically does due to its dependence on my binding wrapper for frida-gum’s gum_interceptor_attach API, which relies on GLib-ified C. So I updated that crate to provide a build.rs macro to place a .cargo/config file with the necessary rustflags. Meanwhile, Rust is handing out CVEs for whitespace tricks, *tsk* *tsk*.
Usage
The normal method of using insject is to just execute it, pass in the target PID, a -- separator, and then whatever command you want to run against the target PID’s namespaces.
$ ifconfig
br-1a3c6b64c540: flags=4099 mtu 1500
        ether 02:42:b2:46:70:9a txqueuelen 0 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
docker0: flags=4163 mtu 1500
        ether 02:42:bb:ee:b4:64 txqueuelen 0 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 26 bytes 5660 (5.6 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
...
$ sudo insject -S -- sh
[insject] -> mnt: 0, net: 0, time: 0, ipc: 0, uts: 0, pid: 0, cgroup: 0, userns: -1, apparmor: docker-default, user: 0/0/0
# ifconfig
eth0      Link encap:Ethernet HWaddr 02:42:AC:11:00:02
          inet addr:172.17.0.2 Bcast:172.17.255.255 Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:40 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:23261 (22.7 KiB) TX bytes:0 (0.0 B)
lo        Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
$ sudo ./target/release/insject -S -s setns -- python3
Python 3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> open("/etc/hostname","r").read()
'ubuntu\n'
>>> import sys, os, ctypes, subprocess; libc = ctypes.CDLL('libc.so.6'); libc.setns(-1, -1)
[insject] -> mnt: 0, net: 0, time: 0, ipc: 0, uts: 0, pid: 0, cgroup: 0, userns: -1, apparmor: docker-default, user: 0/0/0
-1
>>> open("/etc/hostname","r").read()
'af40a9036a27\n'
Wait, setns(2) After execve(2) as root, Isn’t That Dangerous?
Tremendously so. There is a reason why insject sets the PID namespace towards the end. If it did so earlier, processes in the target namespace could potentially ptrace(2) you before you finished leaving your namespace, and while you were still holding significant privileges. In fact, this is actually a fun way to enable cross-parent ptrace(2) support in contravention of yama.ptrace_scope using user namespaces: you have the target join a user namespace of yours in which you hold CAP_SYS_PTRACE, and then you ptrace(2) it.
However, in practice, the real risk here is the file descriptors and the willingness of the binary you want to inject to load arbitrary code from the new mount namespace’s filesystem. This can be a big issue with Bash, which really likes to load shell scripts, or with interpreters such as Python, which will prioritize local files in (what it thinks is) the directory of the script being run over standard library modules, due to how sys.path works.
A Short List of Things Not To Do
- With a program that loads code from disk at runtime, attach to any normal container’s mount namespace without attaching to the PID namespace and forking (the default for PID ns attach)
  - If you don’t do both, you run the risk of loading code into a process that still has access to the host PID namespace and likely still has privileges there.
- With a program that loads code from disk at runtime and holds writable (or directory) file descriptors to host resources from before being insjected, attach to any normal container’s mount namespace
  - Doing so runs the risk of loading code that then uses the existing file descriptors to host namespace resources to manipulate them (so don’t do something like open("/", 0) before insjecting)
- Attach to the PID namespace of a container with CAP_SYS_PTRACE, but not other namespaces, especially the mount namespace
  - If you do this, even without pre-forking, subprocesses can be ptrace(2)-d by container processes via their PID and used to access host namespace resources
- With a program that holds writable (or directory) file descriptors to host resources from before being insjected, attach to a container with CAP_SYS_PTRACE
  - Similarly to the previous bullet, subprocesses or the fork child can be ptrace(2)-d by container processes via their PID and used to access the host namespace resources via their file descriptors.
- insject a program after it has created additional threads
  - As setns(2) only assigns a namespace to the calling thread, insjecting after a process has created threads will mean that the process will exist in a state where one of its threads will enter separate namespaces, while the others will remain in their original namespaces. This results in a similar risk profile as the previous situations, as code execution in one thread can be trivially pivoted into another thread that can operate in the original namespaces. Additionally, setns(2) simply does not work with multithreaded programs for user namespaces and time namespaces, though the latter behavior is undocumented. setns(2) will also not work if any threads have been created with the clone(2) CLONE_FS flag, which is used to ensure that chdir(2) and umask(2) syscalls are applied across threads; this flag is used for all Golang runtime threads, resulting in post-runtime init Go processes not being able to join mount namespaces, time namespaces, and user namespaces.
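Several of the items above hinge on which file descriptors a process is already holding. As a sketch of how one might audit that before insjecting, the Python below classifies a descriptor as a directory handle, a writable handle, or neither. The function names are hypothetical and this is not something insject does for you.

```python
import fcntl
import os
import stat


def classify_fd(fd):
    """Return 'directory', 'writable', or None for an open file descriptor."""
    if stat.S_ISDIR(os.fstat(fd).st_mode):
        return "directory"
    # The access mode is stored in the low bits of the file status flags.
    access = fcntl.fcntl(fd, fcntl.F_GETFL) & os.O_ACCMODE
    if access in (os.O_WRONLY, os.O_RDWR):
        return "writable"
    return None


def risky_fds(fds):
    """Keep only the descriptors that are directory or writable handles."""
    found = {}
    for fd in fds:
        kind = classify_fd(fd)
        if kind is not None:
            found[fd] = kind
    return found
```

On Linux, the candidate descriptors for the current process can be taken from the entries of /proc/self/fd.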
There are likely a large number of other dangerous combinations that arise from attaching to semi-privileged containers in specific ways, but they likely generally involve privileges that would enable container breakout anyway. However, your best bet is to use the -S,--strict flag, which will exit(2) the process if the insject operation occurs while the process has multiple threads, or if any part of the operation fails.
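For a sense of what a multiple-threads check can look like, here is a small Python sketch that counts a process’s threads by listing /proc/<pid>/task, which has one entry per thread on Linux. This is only an illustration of the idea; I am not claiming it is how the strict flag is implemented, and the function names and the `proc_root` parameter are assumptions made so the example can be exercised against a fake procfs tree.

```python
import os


def thread_count(pid="self", proc_root="/proc"):
    """Count threads by listing /proc/<pid>/task (one entry per thread)."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "task")))


def single_threaded(pid="self", proc_root="/proc"):
    """True when the process has exactly one thread and setns(2) affects all of it."""
    return thread_count(pid, proc_root) == 1
```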
So be very careful. Like, use /bin/sh (dash, ash, etc.), not bash or zsh, careful, especially if you have custom shell configurations/integrations like rvm. You have been warned. insject is available at https://github.com/nccgroup/insject, have fun.
Conclusion
This was a fun little toy to put together. insject is a fairly useful utility and solves most of the problems of setns, so I’m fairly happy with it. I will continue to try to find a good solution to the debugger problems, but it might have to involve a custom ptrace(2)-based code injector, which would potentially help with the other main limitation of LD_PRELOAD and dlopen(3): they don’t really work with statically linked binaries.
Separately, while this isn’t my first time using Rust to build intricate LD_PRELOAD payloads, for all the annoyances involved, I would like to make it clear that Rust is still a great language to work with for these kinds of tasks since you have full access to its standard library and crate ecosystem.