Sandboxing Containers

Run Untrusted Code in a Container

by Rocco Gagliardi

on November 18, 2021

Keypoints

How to run Untrusted Code in Containers

Containers are not virtual machines
Choose the right container runtime for the right task
Use sandboxed containers to run untrusted code
Carefully consider the performance impact of sandboxed runtimes

Talking about containers can be very complicated. There is a lot of software, interfaces, standards, and specifications involved, and each of them can be very complex. Kubernetes (and derivatives like OpenShift) became the standard for running containerized applications, but the goal of Kubernetes is to orchestrate containers, not to run them. Even Kubernetes and OpenShift need a complex underlying technology layer, and the pieces in this layer can be replaced or fine-tuned one by one. In this post, we will discuss how to add protection layers to a host that runs untrusted code in a container.

The term container runtime can be very confusing. In general, the term refers to the software responsible for running the container, but some runtimes (called high-level runtimes) are a set of tools for running and managing containers. Ian Lewis explains the term Container Runtimes in detail.

Many articles on the internet tell the story of container runtimes and why Kubernetes ditched Docker as the default runtime. Docker started as a set of different simple tools (according to the Unix philosophy), each with a very specific task, generally geared towards communicating with peripherals, like dealing with networking or storage. Over time, given the various problems caused by this approach in large-scale deployments, Docker, CoreOS, Google, and others founded the Open Container Initiative (OCI), which defines open industry standards around container formats and runtimes. The OCI currently contains two specifications: the Runtime Specification (runtime-spec), defining how to run an image, and the Image Specification (image-spec), defining how to pack information in an image ready to run (manifest, filesystem, configuration).

Docker donated its container format and runtime, runC, to the OCI to serve as a reference implementation. To communicate with the different runtimes, another standard was developed: the Container Runtime Interface (CRI). It makes it possible for orchestrators to choose different runtimes as long as they are OCI-compatible.

After the standards for the different levels were defined, different tools were developed for each level, allowing us to choose the best combination for each need.

Container Security

Containers have long been confused with some sort of sandbox, but containers are not sandboxes. They do not protect the host system by default; configuration and tuning are required to keep their potential under control.

The typical configuration process involves:

User permission reduction: Limit access to devices, network, filesystem, signals, and syscalls
Drop capabilities: Drop unneeded capabilities as early as possible
Cgroups tuning: Limit resources, assign cores, control device access
Namespace setup: Isolate containers from each other and from the host

For example, providing access to all syscalls (348) that the kernel offers can be dangerous if you do not have total control over the code that runs inside the container. And even then, it would be better to prevent the use of all unnecessary resources.
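The steps above can be sketched as a set of docker run options. The following Python snippet is only an illustration, not a recommended baseline: the function names, the user ID, the resource limits, and the three-syscall allowlist are assumptions chosen for the example (a real workload needs many more syscalls to start at all). The flags themselves are standard Docker CLI options, and the profile follows Docker's seccomp profile format.

```python
import json


def seccomp_allowlist(syscalls):
    """Build a Docker seccomp profile that denies every syscall except `syscalls`."""
    return {
        # Any syscall that is not explicitly allowed fails with an error.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(set(syscalls)), "action": "SCMP_ACT_ALLOW"},
        ],
    }


def hardened_run_args(image, command, seccomp_path=None):
    """Return a `docker run` argument list with reduced privileges (values are examples)."""
    args = [
        "docker", "run", "--rm",
        "--user", "1000:1000",                   # permission reduction: no root in the container
        "--cap-drop=ALL",                        # drop all capabilities
        "--security-opt", "no-new-privileges",   # block privilege escalation via setuid binaries
        "--read-only",                           # read-only root filesystem
        "--network", "none",                     # no network access
        "--memory", "256m",                      # cgroups: memory limit
        "--cpus", "1.0",                         # cgroups: CPU time limit
        "--cpuset-cpus", "0",                    # cgroups: pin to core 0
        "--pids-limit", "64",                    # cgroups: limit number of processes
    ]
    if seccomp_path:
        args += ["--security-opt", f"seccomp={seccomp_path}"]
    return args + [image] + list(command)


if __name__ == "__main__":
    # Deliberately tiny allowlist, for illustration only.
    print(json.dumps(seccomp_allowlist(["read", "write", "exit_group"]), indent=2))
    print(" ".join(hardened_run_args("alpine", ["uname", "-r"], "/tmp/profile.json")))
```

Building the argument list in one place makes it easy to review which privileges a container keeps, and to tighten them per workload.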

In its infancy, Docker was very generous with permissions and also blurred roles: it did everything, from running and managing images to handling volumes, networks, and so on. These are some of the reasons that led to the development of other runtimes with minimal functionality and well-defined roles.

Native or virtualized runtime

The OCI family of runtimes can be divided into two groups:

native: Run the containerized process on the host kernel; low-level runtimes like runC or crun (Red Hat's implementation of the OCI specs).
virtualized:
- Sandboxed runtimes with a dedicated kernel, which further isolate the host from the containerized process; examples are gVisor or Nabla.
- Lightweight virtual machines (VMs) that feel and perform like containers but provide the workload isolation and security advantages of VMs, for example Kata.

Virtualized runtimes may look like the way to combine the security of virtualization (VMs) with the speed of containers, but there are drawbacks: unlike native runtimes, sandboxed runtimes affect performance throughout the life of a containerized process. Container sandboxes add a layer of abstraction: the process runs on the sandbox's own kernel, which passes instructions to the host kernel. Access to the filesystem (implemented through a proxy) and to devices is also controlled, which further affects performance.

In the virtualized family, there is an additional differentiation: VMs and syscall filters.

Kata Containers implements VM virtualization. Its key features can be summarized as follows:

Full kernel on top of a lightweight VM
Full syscall support, since syscalls are passed through to the guest kernel
Performance penalty due to the VM layer
Can run any application
Can run in nested virtualized environments if the hypervisor and hardware support it

As long as the performance of security-oriented runtimes does not match that of native runtimes, we will see a mix of all of them.

gVisor

In 2018, Google open-sourced gVisor, a user-space kernel for containers. gVisor presents itself to applications as a normal kernel, but it exposes a smaller number of syscalls, is written in a different language (Go), which reduces the chance of sharing bugs with the host kernel, and filters filesystem and network access through a dedicated, separate mechanism.

The key features can be summarized as follows:

Partial kernel in user space: of 348 syscalls, 260 have a full or partial implementation.
Intercepts and filters syscalls through the Sentry module (interpretation and filtering).
Runtime performance penalty for syscall-intensive applications, due to syscall filtering.
Runtime performance penalty for filesystem-intensive applications, because filesystem access is proxied through Gofer.
Can run only applications that use implemented syscalls.

As noted, gVisor has drawbacks, mainly regarding performance. For further analysis, read "The True Cost of Containing: A gVisor Case Study".
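To get a rough feeling for the penalty on a specific host, a micro-benchmark can be run inside the same image once per runtime. The sketch below is an assumption-laden example, not a rigorous measurement: the function name and iteration count are made up for illustration, and the result includes Python interpreter overhead. It times repeated os.stat() calls, each of which triggers a filesystem syscall and is therefore exposed to both the syscall interception and the filesystem proxying described above.

```python
import os
import time


def time_syscalls(iterations=100_000, path="/"):
    """Return the average wall-clock seconds per os.stat() call on `path`."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.stat(path)  # one filesystem syscall per iteration
    elapsed = time.perf_counter() - start
    return elapsed / iterations


if __name__ == "__main__":
    per_call = time_syscalls()
    print(f"os.stat: {per_call * 1e6:.2f} microseconds per call")
```

Running the script in a container started with the native runtime and again with the sandboxed runtime gives two numbers that can be compared directly; absolute values depend heavily on the host and the platform used by the sandbox.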

How to use

The installation is pretty simple. Once gVisor is installed, a new runtime can be added to the Docker configuration:

{
    "runtimes": {
        "runsc": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": [
                "--overlay",
                "--network=sandbox",
                "--debug-log=/tmp/runsc/",
                "--debug",
                "--strace"
            ]
        }
    }
}

The runtime arguments specify additional protection:

overlay: Wrap filesystem mounts with a writable overlay. All modifications are stored in memory inside the sandbox.
network: Specifies which network to use: sandbox (default), host, or none. Using a network inside the sandbox is more secure because it is isolated from the host network.
debug-log, debug, strace: Debug options

Use runsc flags to see the documentation of the runtime options.
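When a daemon configuration already exists, the runsc entry has to be merged into it rather than replacing the file. A minimal sketch in Python follows; the function name and the sample pre-existing setting are hypothetical, and it assumes the configuration has already been loaded into a dictionary (on most Linux installations the file is /etc/docker/daemon.json, and the Docker daemon must be restarted after a change).

```python
import json


def add_runsc_runtime(config, path="/usr/local/bin/runsc",
                      runtime_args=("--overlay", "--network=sandbox")):
    """Return a copy of a Docker daemon config dict with a 'runsc' runtime registered."""
    updated = dict(config)
    # Copy the existing runtimes so other registered runtimes are preserved.
    runtimes = dict(updated.get("runtimes", {}))
    runtimes["runsc"] = {"path": path, "runtimeArgs": list(runtime_args)}
    updated["runtimes"] = runtimes
    return updated


if __name__ == "__main__":
    existing = {"log-driver": "journald"}  # hypothetical pre-existing settings
    print(json.dumps(add_runsc_runtime(existing), indent=4))
```

The debug options are left out of the default arguments here because they are mainly useful while testing.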

Running the same container with runc and runsc shows the difference between native (passing calls to the host kernel) and sandboxed (presenting itself as a different kernel):

[root@localhost cnt]# docker image ls
REPOSITORY   TAG      IMAGE ID       CREATED        SIZE
alpine       latest   14119a10abf4   2 months ago   5.6MB

[root@localhost cnt]# docker run --rm --runtime=runc alpine uname -r
4.18.0-348.el8.x86_64

[root@localhost cnt]# docker run --rm --runtime=runsc alpine uname -r
4.4.0

runsc presents itself as kernel 4.4.0 to the application.
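The same comparison can be scripted, for example as a smoke test that confirms a workload really runs under the sandboxed runtime. This is a sketch under assumptions: the function names are invented for the example, and it presumes Docker is installed with both runtimes registered as shown above.

```python
import subprocess


def uname_command(runtime, image="alpine"):
    """Build the `docker run` command that prints the kernel release seen in the container."""
    return ["docker", "run", "--rm", f"--runtime={runtime}", image, "uname", "-r"]


def kernel_seen_by(runtime):
    """Run the container and return the kernel release it reports (requires Docker)."""
    result = subprocess.run(uname_command(runtime), capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()


def compare_runtimes(runtimes=("runc", "runsc"), probe=kernel_seen_by):
    """Map each runtime name to the kernel release reported inside its container."""
    return {runtime: probe(runtime) for runtime in runtimes}
```

Calling compare_runtimes() on the host used in this article should return the two values shown in the session above; if both runtimes report the host kernel release, the container is not running in the sandbox.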

At the same time, simple attacks like root mount do not work:

[roga@localhost cnt]$ docker run --rm -it --runtime=runc -v /:/pesc pesc /bin/bash
root@1770c2eb6e12:/pesc# exit
exit
[roga@localhost cnt]$ docker run --rm -it --runtime=runsc -v /:/pesc pesc /bin/bash
docker: Error response from daemon: OCI runtime create failed: reading spec: Mount.Destination must be an absolute path: &{] bind /var/lib/docker/volumes/f2a1cf46b1780a3becea9e75278f0f373948975784c3679f5f73d749147904f0/_data [rbind]}: unknown.

[roga@localhost runsc]$ cat runsc.log.20211028-080751.680195.create
I1028 08:07:51.680381 4892 main.go:218] ***************************
I1028 08:07:51.680410 4892 main.go:219] Args: [/usr/local/bin/runsc --overlay --network=sandbox --debug-log=/tmp/runsc/ --debug --strace --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/a4b3f7175ca8ca397b08a676bf2df320579cda792babe3d50593f86c80149174/log.json --log-format json create --bundle /run/containerd/io.containerd.runtime.v2.task/moby/a4b3f7175ca8ca397b08a676bf2df320579cda792babe3d50593f86c80149174 --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/a4b3f7175ca8ca397b08a676bf2df320579cda792babe3d50593f86c80149174/init.pid --console-socket /tmp/pty948542374/pty.sock a4b3f7175ca8ca397b08a676bf2df320579cda792babe3d50593f86c80149174]
I1028 08:07:51.680438 4892 main.go:220] Version release-20211019.0
I1028 08:07:51.680444 4892 main.go:221] GOOS: linux
I1028 08:07:51.680450 4892 main.go:222] GOARCH: amd64
I1028 08:07:51.680456 4892 main.go:223] PID: 4892
I1028 08:07:51.680463 4892 main.go:224] UID: 0, GID: 0
I1028 08:07:51.680469 4892 main.go:225] Configuration:
I1028 08:07:51.680475 4892 main.go:226] RootDir: /var/run/docker/runtime-runc/moby
I1028 08:07:51.680485 4892 main.go:227] Platform: ptrace
I1028 08:07:51.680492 4892 main.go:228] FileAccess: exclusive, overlay: true
I1028 08:07:51.680500 4892 main.go:229] Network: sandbox, logging: false
I1028 08:07:51.680507 4892 main.go:230] Strace: true, max size: 1024, syscalls:
I1028 08:07:51.680513 4892 main.go:231] VFS2 enabled: false, LISAFS: false
I1028 08:07:51.680518 4892 main.go:232] ***************************
W1028 08:07:51.681354 4892 specutils.go:112] noNewPrivileges ignored. PR_SET_NO_NEW_PRIVS is assumed to always be set.
W1028 08:07:51.681388 4892 error.go:48] FATAL ERROR: reading spec: Mount.Destination must be an absolute path: &{[ bind /var/lib/docker/volumes/f51e3dcbea53b32e7682bc8a58e63a493decde9115488f6fc98a772972cd4b22/_data [rbind]}
W1028 08:07:51.681447 4892 main.go:257] Failure to execute command, err: 1

Use Cases

In general, the perfect use case for gVisor is when a container needs to run untrusted code. In this case, gVisor can add additional layers of security with a simple switch in the Docker or Kubernetes configuration.

In a typical customer environment that does not need to host untrusted code, one option is to mix runtimes: gVisor for exposed applications and standard runtimes for applications with reviewed or trusted code. In a Kubernetes cluster, it is possible to dedicate a subset of nodes to sensitive code and install gVisor on those nodes. The remaining nodes can run containers using standard, better-performing runtimes.
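In Kubernetes, this split is usually expressed with a RuntimeClass that maps a name to the runtime handler and, optionally, to a node selector. The sketch below generates the two manifests as JSON, which kubectl accepts alongside YAML. The class name gvisor, the node label sandbox=gvisor, and the pod and image names are assumptions for illustration; the handler name must match the runtime configured in the CRI runtime on the selected nodes.

```python
import json


def runtime_class(name="gvisor", handler="runsc", node_selector=None):
    """Build a RuntimeClass manifest that maps `name` to a CRI runtime handler."""
    manifest = {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }
    if node_selector:
        # Pods using this class are scheduled only on nodes with these labels.
        manifest["scheduling"] = {"nodeSelector": dict(node_selector)}
    return manifest


def sandboxed_pod(pod_name, image, runtime_class_name="gvisor"):
    """Build a Pod manifest that requests the sandboxed runtime class."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "runtimeClassName": runtime_class_name,
            "containers": [{"name": pod_name, "image": image}],
        },
    }


if __name__ == "__main__":
    # Hypothetical label carried by the nodes that have gVisor installed.
    print(json.dumps(runtime_class(node_selector={"sandbox": "gvisor"}), indent=2))
    print(json.dumps(sandboxed_pod("untrusted-job", "alpine"), indent=2))
```

Pods that do not set runtimeClassName keep using the default runtime of their node, so trusted workloads are unaffected.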

Summary

Containers are very popular because of their power and flexibility, but the complex technology governing the environment poses many security challenges. If you need to run untrusted code in your containers, it is useful to implement a second layer of protection; in this case, gVisor is a good choice to start with.

It requires extensive testing of compatibility with the application and careful consideration of the performance impact. However, by using a different language (Go) and strictly controlling syscalls, gVisor provides a solid second line of defense for an environment that cannot trust the code it runs.

Considering the article "Secure containerized environments with updated threat matrix for Kubernetes", gVisor can help mitigate some techniques used in the execution, persistence, and privilege escalation tactics. We will see such implementations more often in the future.

About the Author

Rocco Gagliardi has been working in IT since the 1980s and specialized in IT security in the 1990s. His main focus lies in security frameworks, network routing, firewalling and log management.
