Rocco Gagliardi
How to run Untrusted Code in Containers
The term container runtime can be very confusing. In general, the term refers to the software responsible for running the container, but some runtimes (called high-level runtimes) are a set of tools for running and managing containers. Ian Lewis explains the term Container Runtimes in detail.
Many articles on the internet tell the story of container runtimes and why Kubernetes ditched Docker as the default runtime. Docker started as a set of simple tools, in line with the Unix philosophy, each with a very specific task, generally geared towards communicating with peripherals such as networking or storage. Over time, given the problems this approach caused in large-scale deployments, Docker, CoreOS, Google, and others founded the Open Container Initiative (OCI), which defines open industry standards around container formats and runtimes. The OCI currently contains two specifications: the Runtime Specification (runtime-spec), which defines how to run an image, and the Image Specification (image-spec), which defines how to package an image so that it is ready to run (manifest, filesystem, configuration).
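To make the runtime-spec a little more concrete, here is a minimal sketch of the kind of config.json an OCI runtime reads from a bundle. The field names follow the published runtime-spec; the values (hostname, namespace selection, the rootfs directory name) and the helper name minimal_runtime_config are assumptions chosen for illustration, not a recommended configuration.

```python
import json


def minimal_runtime_config(args, rootfs="rootfs"):
    """Build a small runtime-spec style config as a plain dict.

    Only a handful of fields are set; a real bundle usually carries
    many more (mounts, capabilities, resource limits, ...).
    """
    return {
        "ociVersion": "1.0.2",
        "process": {
            "terminal": False,
            "user": {"uid": 0, "gid": 0},
            "args": list(args),  # command to run inside the container
            "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
            "cwd": "/",
        },
        # Path of the unpacked image filesystem, relative to the bundle.
        "root": {"path": rootfs, "readonly": True},
        "hostname": "sketch",  # illustrative value
        "linux": {
            "namespaces": [
                {"type": kind}
                for kind in ("pid", "network", "ipc", "uts", "mount")
            ]
        },
    }


if __name__ == "__main__":
    print(json.dumps(minimal_runtime_config(["uname", "-r"]), indent=2))
```

In practice such a file is normally generated by tooling (for example, runc spec writes a fuller default) rather than by hand.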
Docker developed runC, its container format and runtime, and donated it to the OCI to serve as a reference implementation. To let orchestrators communicate with the different runtimes, another standard was developed: the Container Runtime Interface (CRI). It allows orchestrators to choose among different runtimes as long as they are OCI-compatible.
After the standards for the different levels were defined, different tools were developed for each level, allowing us to choose the best combination for each need.
Containers have long been mistaken for a kind of sandbox, but they are not: they do not protect the host system by default, and configuration and tuning are required to keep their capabilities under control.
The typical configuration process involves:
For example, providing access to all syscalls (348) that the kernel offers can be dangerous if you do not have total control over the code that runs inside the container. And even then, it would be better to prevent the use of all unnecessary resources.
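One common way to restrict syscalls with a native runtime is a seccomp profile. The sketch below builds a deny-by-default profile in the JSON layout Docker accepts. The allowlist shown is a hypothetical, far too short example; a real workload needs a list derived from observing what it actually calls.

```python
import json

# Hypothetical allowlist for illustration only; almost any real program
# needs considerably more syscalls than this.
ALLOWED_SYSCALLS = ["read", "write", "close", "exit", "exit_group", "brk", "mmap"]


def build_seccomp_profile(allowed):
    """Return a profile that denies every syscall except those listed."""
    return {
        # Anything not explicitly allowed fails with an error.
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {"names": sorted(set(allowed)), "action": "SCMP_ACT_ALLOW"}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_seccomp_profile(ALLOWED_SYSCALLS), indent=2))
```

Saved to a file, the profile can be passed to a container with docker run --security-opt seccomp=profile.json.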
In its infancy, Docker was very generous with permissions and also blurred roles (Docker can do it all: run images and manage images, volumes, networks, and so on). These are some of the reasons that led to the development of other runtimes with minimal functionality and well-defined roles.
The OCI family of runtimes can be divided into two:
Native runtimes, such as runC or crun (Red Hat's implementation of the OCI specs).
Using virtualized runtimes could be a way to get the security of virtualization (VMs) and the speed of containers, but there are some drawbacks. Unlike native runtimes, sandboxed runtimes affect performance throughout the life of a containerized process. In container sandboxes there is an additional layer of abstraction: the process runs on the unikernel, which passes instructions to the host kernel. Access to the filesystem, implemented through a proxy, and to devices is also controlled, which further affects performance.
In the virtualized family, there is an additional differentiation: VMs and syscall filters.
Kata Containers implements VM-based virtualization. The key features can be summarized as follows:
As long as the performance of security-oriented runtimes does not match that of native runtimes, we will see a mix of all of them.
In 2018, Google open-sourced gVisor, a user-space kernel for containers. gVisor presents itself to applications as a normal kernel but exposes a smaller number of syscalls, is written in a different language (Go), which reduces the chance of sharing bugs with the host kernel, and filters filesystem and network access through a dedicated, separate mechanism.
The key features can be summarized as follows:
As mentioned, using gVisor has drawbacks, mainly regarding performance. For further analysis, read The True Cost of Containing: A gVisor Case Study.
The installation is pretty simple. Once installed, a new runtime can be added to Docker configuration:
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": [
        "--overlay",
        "--network=sandbox",
        "--debug-log=/tmp/runsc/",
        "--debug",
        "--strace"
      ]
    }
  }
}
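When the Docker configuration already contains other settings, the runtime entry has to be merged in rather than overwriting the file. The following is a minimal sketch of such a merge. The file location /etc/docker/daemon.json is the usual default but is an assumption about the host, and the function names are hypothetical.

```python
import json
from pathlib import Path


def add_runsc_runtime(config, path="/usr/local/bin/runsc", runtime_args=()):
    """Return a copy of a Docker daemon config with a 'runsc' runtime added.

    Existing keys and other registered runtimes are preserved.
    """
    updated = dict(config)
    runtimes = dict(updated.get("runtimes", {}))
    entry = {"path": path}
    if runtime_args:
        entry["runtimeArgs"] = list(runtime_args)
    runtimes["runsc"] = entry
    updated["runtimes"] = runtimes
    return updated


def update_daemon_file(daemon_json="/etc/docker/daemon.json", **kwargs):
    """Read, merge, and rewrite the daemon config file (needs root)."""
    target = Path(daemon_json)
    current = json.loads(target.read_text()) if target.exists() else {}
    merged = add_runsc_runtime(current, **kwargs)
    target.write_text(json.dumps(merged, indent=2) + "\n")
    return merged
```

For example, add_runsc_runtime({"log-driver": "journald"}, runtime_args=["--network=sandbox"]) keeps the log driver setting and adds the runtime. The Docker daemon has to be restarted before the new runtime becomes available.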
Specifying additional protection:
Use runsc flags to see the documentation of the runtime options.
Running the same container with runc and runsc shows the difference between native (passing calls to the host kernel) and sandboxed (presenting itself as a different kernel):
[root@localhost cnt]# docker image ls
REPOSITORY   TAG      IMAGE ID       CREATED        SIZE
alpine       latest   14119a10abf4   2 months ago   5.6MB
[root@localhost cnt]# docker run --rm --runtime=runc alpine uname -r
4.18.0-348.el8.x86_64
[root@localhost cnt]# docker run --rm --runtime=runsc alpine uname -r
4.4.0
runsc presents itself as kernel 4.4.0 to the application.
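The same comparison can be scripted, for instance as a quick check that a host really routes containers through the sandbox. This is a sketch that assumes Docker is installed, that both runtimes are registered under the names runc and runsc, and that the alpine image is available; the helper names are hypothetical.

```python
import shutil
import subprocess


def uname_command(runtime, image="alpine"):
    """Build the docker command that prints the kernel release seen in the container."""
    return ["docker", "run", "--rm", f"--runtime={runtime}", image, "uname", "-r"]


def kernel_release(runtime, image="alpine"):
    """Run the container and return the kernel release it reports."""
    result = subprocess.run(
        uname_command(runtime, image),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    if shutil.which("docker") is None:
        print("docker not found; nothing to compare")
    else:
        for runtime in ("runc", "runsc"):
            try:
                print(f"{runtime}: {kernel_release(runtime)}")
            except (subprocess.CalledProcessError, OSError) as exc:
                print(f"{runtime}: failed ({exc})")
```

If the two runtimes report the same release, the container is most likely not running under gVisor.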
At the same time, simple attacks like root mount do not work:
[roga@localhost cnt]$ docker run --rm -it --runtime=runc -v /:/pesc pesc /bin/bash
root@1770c2eb6e12:/pesc# exit
exit
[roga@localhost cnt]$ docker run --rm -it --runtime=runsc -v /:/pesc pesc /bin/bash
docker: Error response from daemon: OCI runtime create failed: reading spec: Mount.Destination must be an absolute path: &{] bind /var/lib/docker/volumes/f2a1cf46b1780a3becea9e75278f0f373948975784c3679f5f73d749147904f0/_data [rbind]}: unknown.
[roga@localhost runsc]$ cat runsc.log.20211028-080751.680195.create
I1028 08:07:51.680381 4892 main.go:218] ***************************
I1028 08:07:51.680410 4892 main.go:219] Args: [/usr/local/bin/runsc --overlay --network=sandbox --debug-log=/tmp/runsc/ --debug --strace --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/a4b3f7175ca8ca397b08a676bf2df320579cda792babe3d50593f86c80149174/log.json --log-format json create --bundle /run/containerd/io.containerd.runtime.v2.task/moby/a4b3f7175ca8ca397b08a676bf2df320579cda792babe3d50593f86c80149174 --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/a4b3f7175ca8ca397b08a676bf2df320579cda792babe3d50593f86c80149174/init.pid --console-socket /tmp/pty948542374/pty.sock a4b3f7175ca8ca397b08a676bf2df320579cda792babe3d50593f86c80149174]
I1028 08:07:51.680438 4892 main.go:220] Version release-20211019.0
I1028 08:07:51.680444 4892 main.go:221] GOOS: linux
I1028 08:07:51.680450 4892 main.go:222] GOARCH: amd64
I1028 08:07:51.680456 4892 main.go:223] PID: 4892
I1028 08:07:51.680463 4892 main.go:224] UID: 0, GID: 0
I1028 08:07:51.680469 4892 main.go:225] Configuration:
I1028 08:07:51.680475 4892 main.go:226] RootDir: /var/run/docker/runtime-runc/moby
I1028 08:07:51.680485 4892 main.go:227] Platform: ptrace
I1028 08:07:51.680492 4892 main.go:228] FileAccess: exclusive, overlay: true
I1028 08:07:51.680500 4892 main.go:229] Network: sandbox, logging: false
I1028 08:07:51.680507 4892 main.go:230] Strace: true, max size: 1024, syscalls:
I1028 08:07:51.680513 4892 main.go:231] VFS2 enabled: false, LISAFS: false
I1028 08:07:51.680518 4892 main.go:232] ***************************
W1028 08:07:51.681354 4892 specutils.go:112] noNewPrivileges ignored. PR_SET_NO_NEW_PRIVS is assumed to always be set.
W1028 08:07:51.681388 4892 error.go:48] FATAL ERROR: reading spec: Mount.Destination must be an absolute path: &{[ bind /var/lib/docker/volumes/f51e3dcbea53b32e7682bc8a58e63a493decde9115488f6fc98a772972cd4b22/_data [rbind]}
W1028 08:07:51.681447 4892 main.go:257] Failure to execute command, err: 1
In general, the perfect use case for gVisor is when a container needs to run untrusted code. In this case, gVisor can add additional layers of security with a simple switch in the Docker or Kubernetes configuration.
For a typical customer environment that does not need to host untrusted code, one option is to mix gVisor for exposed applications with standard runtimes for applications whose code has been reviewed or is trusted. In a Kubernetes cluster, it is possible to dedicate a subset of nodes to sensitive code and install gVisor on those nodes. The remaining nodes can run containers using standard, better-performing runtimes.
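In Kubernetes, this split is typically expressed with a RuntimeClass that maps a name to the runtime handler and pins pods using it to the prepared nodes. The sketch below generates the two manifests as JSON, which kubectl accepts alongside YAML. The node label sandbox=gvisor, the class name gvisor, and the pod and image names are hypothetical; the handler value has to match whatever name the runtime was registered under on the nodes.

```python
import json


def runtime_class(name="gvisor", handler="runsc", node_selector=None):
    """Build a RuntimeClass manifest, optionally pinned to labeled nodes."""
    manifest = {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }
    if node_selector:
        # Pods using this class are only scheduled on matching nodes.
        manifest["scheduling"] = {"nodeSelector": dict(node_selector)}
    return manifest


def sandboxed_pod(name, image, runtime_class_name="gvisor"):
    """Build a Pod manifest that requests the sandboxed runtime."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "runtimeClassName": runtime_class_name,
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    manifests = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [
            runtime_class(node_selector={"sandbox": "gvisor"}),
            sandboxed_pod("exposed-app", "alpine"),
        ],
    }
    print(json.dumps(manifests, indent=2))
```

Pods that do not set runtimeClassName keep using the cluster's default runtime, so trusted workloads are unaffected.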
Containers are very popular because of their power and flexibility, but the complex technology governing the environment poses many security challenges. If you need to run untrusted code in your containers, it is useful to implement a second layer of protection; in this case, gVisor is a good choice to start with.
It requires extensive compatibility testing with the application and careful consideration of the performance impact, but with its use of a different language (Go) and strict control over syscalls, gVisor provides a solid second line of defense for an environment that does not trust the code it runs.
Considering the article Secure containerized environments with updated threat matrix for Kubernetes, gVisor can help mitigate some of the techniques used in the execution, persistence, and privilege escalation tactics. We will see such implementations more often in the future.