Rocco Gagliardi
Use Syscalls Filters to Contain Containers
During our projects, we are confronted with different aspects of security. Broadly speaking, the model we use can be summarized as follows:
In each area, we analyze the state of security and suggest improvements. Specifically, regarding containerized applications, we looked at how to secure the delivery of the container and how to mitigate a specific case where the code to run is untrusted, using a technique between containerization and virtualization.
In this article we will look specifically at the security policies that can be imposed at the kernel level on an application that runs in a cloud environment, using seccomp-bpf.
If you have experience as a programmer, during the last few years you have obsessively repeated a single golden rule: “Don’t Repeat Yourself!”, or DRY. In the security world, DRY is banned in favour of !DRY. Deploying multiple layers of redundant security measures is a common pattern, part of a larger strategy named Defense in Depth. It can be complicated at times, but it makes sense: Apply redundant controls at different parts of a system.
Containers, we want to re-emphasize, are not designed for containing! All containers share the same (host) kernel: If the kernel has a vulnerability, any container could leverage that vulnerability and possibly access unauthorized data on the host and potentially in another container.
Basically, a kernel is responsible for the execution of low-level operations that are invoked by high-level programs, so it manages privileged access for unprivileged users. Those operations are accessible via system calls (syscalls). In a modern system, about 400 different syscalls exist: Functionality for a programmer, attack vectors for an attacker.
The idea is simple: If we could block unneeded syscalls, even if the code of the syscalls is vulnerable, the container/application cannot leverage the vulnerability.
This is the idea behind the seccomp (SECure COMPuting) mode, introduced in 2005 with kernel 2.6.12. With the introduction of seccomp-bpf in 2012 with kernel 3.5, a human-friendly mechanism to create syscall filters, it became possible to manage syscalls other than the original magic four: exit(), sigreturn(), read(), write().
In the last article we saw a method to limit syscalls in a simple manner: Sandboxing the container by substituting the runtime kernel. This approach addressed a specific threat (running untrusted code) by swapping the host kernel with a dedicated one. The reduced number of syscalls is a side effect of the implementation: Only 260 syscalls are implemented in the new kernel, therefore the exposure is reduced by about a third.
For a general use case, without swapping or virtualizing the kernel, we want to check which privileged features offered by the host an unprivileged container can access. The privileged features made available by the kernel are obtained through syscalls and we want to make only the strictly necessary syscalls available to a container.
The challenge is to find out which syscalls are necessary for a container to run. And this is not a trivial task.
It may sound odd, but the syscalls a program needs in order to run depend on many factors, among the main ones: CPU architecture, programming language, compiler. Take the example of Hello World; we have the same program in C, Go, and Python:
    bash$ cat hello.c
    #include <stdio.h>
    int main() {
        printf("Hello World\n");
        return 0;
    }

    bash$ cat hello.go
    package main
    import "fmt"
    func main() {
        fmt.Println("Hello World")
    }

    bash$ cat hello.py
    #!/usr/bin/env python3
    print("Hello World")
Based on the task, we could expect similar results because the same functionality is requested from the kernel (print the string “Hello World”), but if we look at the syscalls used by each program we notice some differences:
    bash$ strace -c ./run.hello.c
    Hello World
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     41.37    0.000163          40         4           mprotect
     17.51    0.000069          69         1           write
     17.26    0.000068           8         8           mmap
      8.38    0.000033          33         1           munmap
      4.82    0.000019           6         3           newfstatat
      2.79    0.000011           3         3           brk
      2.28    0.000009           9         1           getrandom
      1.78    0.000007           7         1           prlimit64
      1.27    0.000005           2         2           close
      1.02    0.000004           4         1           set_tid_address
      0.76    0.000003           1         2         1 arch_prctl
      0.76    0.000003           3         1           set_robust_list
      0.00    0.000000           0         1           read
      0.00    0.000000           0         4           pread64
      0.00    0.000000           0         1         1 access
      0.00    0.000000           0         1           execve
      0.00    0.000000           0         2           openat
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.000394          10        37         2 total

    bash$ strace -c ./run.hello.go
    Hello World
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ------------------
     81.19    0.003276          28       113           rt_sigaction
     11.15    0.000450         150         3           clone
      2.16    0.000087          10         8           rt_sigprocmask
      2.16    0.000087          29         3           fcntl
      2.16    0.000087          29         3         1 futex
      1.09    0.000044          44         1           write
      0.10    0.000004           4         1           rt_sigreturn
      0.00    0.000000           0         1           read
      0.00    0.000000           0         1           close
      0.00    0.000000           0        18           mmap
      0.00    0.000000           0         1           execve
      0.00    0.000000           0         2           sigaltstack
      0.00    0.000000           0         1           arch_prctl
      0.00    0.000000           0         1           gettid
      0.00    0.000000           0         1           sched_getaffinity
      0.00    0.000000           0         1           openat
    ------ ----------- ----------- --------- --------- ------------------
    100.00    0.004035          25       159         1 total

    bash$ strace -c python3 hello.py
    Hello World
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     29.96    0.009716          41       232        31 newfstatat
      9.33    0.003025          45        66           rt_sigaction
      8.22    0.002665          37        71           read
      8.21    0.002661          38        69         3 lseek
      7.90    0.002562         111        23           mmap
      7.23    0.002345          45        52         3 openat
      6.86    0.002224          42        52           close
      5.60    0.001815          42        43        36 ioctl
      3.66    0.001188         198         6           mprotect
      2.98    0.000967          48        20           getdents64
      2.51    0.000814          90         9           brk
      1.60    0.000520         260         2         1 arch_prctl
      0.90    0.000292         146         2           munmap
      0.89    0.000290          72         4         3 readlink
      0.82    0.000267         133         2           getrandom
      0.81    0.000264         264         1           set_tid_address
      0.59    0.000190          95         2           getcwd
      0.56    0.000182         182         1           prlimit64
      0.49    0.000160         160         1           set_robust_list
      0.47    0.000154          51         3           dup
      0.14    0.000047          47         1           fcntl
      0.12    0.000039          39         1           write
      0.08    0.000026          26         1           futex
      0.04    0.000013          13         1           sysinfo
      0.00    0.000000           0         4           pread64
      0.00    0.000000           0         1         1 access
      0.00    0.000000           0         1           execve
      0.00    0.000000           0         1           getuid
      0.00    0.000000           0         1           getgid
      0.00    0.000000           0         1           geteuid
      0.00    0.000000           0         1           getegid
    ------ ----------- ----------- --------- --------- ----------------
    100.00    0.032426          48       675        78 total
As expected, an interpreted language needs more syscalls than a compiled one. Less obvious is the difference between the two compiled languages, C and Go.
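To compare such traces without reading the tables by eye, the summaries can be parsed and compared as sets. The following is a minimal sketch, assuming the column layout printed by `strace -c` as shown above; the function name `parse_strace_summary` and the shortened trace excerpts are ours and only serve as illustration.

```python
def parse_strace_summary(text):
    """Return {syscall_name: call_count} from the table printed by `strace -c`."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a percentage such as "41.37"; this skips the
        # header, the separator lines and the program's own output.
        if len(fields) < 5 or not fields[0].replace(".", "", 1).isdigit():
            continue
        name = fields[-1]
        if name == "total":
            continue
        counts[name] = int(fields[3])  # fourth column is "calls"
    return counts


# Shortened excerpts of the C and Go traces (most rows omitted).
C_TRACE = """\
Hello World
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 41.37    0.000163          40         4           mprotect
 17.51    0.000069          69         1           write
 17.26    0.000068           8         8           mmap
  0.76    0.000003           1         2         1 arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000394          10        37         2 total
"""

GO_TRACE = """\
Hello World
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 81.19    0.003276          28       113           rt_sigaction
 11.15    0.000450         150         3           clone
  1.09    0.000044          44         1           write
  0.00    0.000000           0        18           mmap
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ------------------
100.00    0.004035          25       159         1 total
"""

c_calls = parse_strace_summary(C_TRACE)
go_calls = parse_strace_summary(GO_TRACE)
only_c = sorted(set(c_calls) - set(go_calls))
only_go = sorted(set(go_calls) - set(c_calls))
print("only in C: ", only_c)
print("only in Go:", only_go)
```

Run against the full traces, the same comparison makes the Go-specific calls (signal handling and thread creation) stand out immediately.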
In addition, tracing the same code running in a container adds a lot of “noise”:
    bash$ strace -c podman run --rm --name labs_20220203 labs20220203 /tmp/run.hello.go
    Hello World
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ------------------
     66.48    0.120977         822       147         4 futex
      5.29    0.009625          53       180        39 newfstatat
      3.95    0.007187          52       136           read
      3.18    0.005796          36       160           mmap
      3.06    0.005569          51       109        27 openat
      2.58    0.004692          43       107           close
      2.40    0.004370          58        75           rt_sigprocmask
      2.24    0.004070          47        86        34 epoll_ctl
      0.79    0.001431          44        32           epoll_wait
      0.76    0.001384          12       115           rt_sigaction
      0.73    0.001327          45        29           fcntl
      0.66    0.001194          56        21           socket
      0.64    0.001156          25        46           mprotect
      0.59    0.001080          90        12           munmap
      0.56    0.001020          56        18         6 connect
      0.56    0.001012        1012         1           execve
      0.44    0.000802          61        13           geteuid
      0.42    0.000772          55        14           fstat
      0.41    0.000751          35        21         4 rt_sigreturn
      0.34    0.000626          52        12           getdents64
      0.32    0.000585          83         7           sched_yield
      0.26    0.000477          47        10           sendto
      0.25    0.000464          51         9           statfs
      0.25    0.000458          26        17           pread64
      0.23    0.000427          42        10           fstatfs
      0.19    0.000346          43         8           recvfrom
      0.18    0.000322          80         4           clone3
      0.18    0.000321          80         4           brk
      0.18    0.000321          45         7           getsockname
      0.15    0.000275          34         8         7 setrlimit
      0.15    0.000268          44         6         4 prctl
      0.13    0.000245          35         7           getpid
      0.12    0.000215          35         6           recvmsg
      0.10    0.000186          62         3           timerfd_create
      0.10    0.000174          87         2           setns
      0.08    0.000154         154         1           chdir
      0.08    0.000150          30         5           getuid
      0.08    0.000141          28         5           lseek
      0.08    0.000140          46         3           bind
      0.08    0.000138          46         3         3 mkdirat
      0.07    0.000132          44         3           epoll_pwait
      0.07    0.000120          40         3           epoll_create1
      0.06    0.000118          29         4         3 access
      0.05    0.000096          32         3           fchmodat
      0.05    0.000094          94         1           prlimit64
      0.05    0.000082          27         3           gettid
      0.04    0.000076          76         1           setresgid
      0.04    0.000072          36         2           getrandom
      0.04    0.000071          35         2           getrlimit
      0.04    0.000069          69         1           setresuid
      0.04    0.000066          66         1           pipe2
      0.03    0.000063          63         1           getegid
      0.03    0.000058          29         2           timerfd_settime
      0.03    0.000057          57         1           capget
      0.03    0.000051          51         1           getcwd
      0.02    0.000044          44         1           sysinfo
      0.02    0.000036          18         2         1 arch_prctl
      0.01    0.000027          27         1           readlinkat
      0.00    0.000008           8         1           sched_getaffinity
      0.00    0.000000           0         1           uname
      0.00    0.000000           0         1           umask
      0.00    0.000000           0         2           sigaltstack
      0.00    0.000000           0         1           set_tid_address
      0.00    0.000000           0         1           set_robust_list
    ------ ----------- ----------- --------- --------- ------------------
    100.00    0.181988         121      1499       132 total
It is pretty hard to tell which syscalls were made by our code and which by the container runtime. And since we run a single container in a test environment, we are still dealing with the easy part of the problem: In practice, we need to trace the syscalls of a specific container on a cluster running a lot of containers, so we need to identify which syscall belongs to which container.
To address this problem, some tools have been developed. We will see the basic procedure to trace a container, create and apply policies. In the second part, we will apply the same procedure on a Kubernetes Cluster in an automated mode (CI/CD).
We will try to stay compliant with OCI, so we will use podman instead of Docker to manage and run containers. In our case, podman runs inside a Fedora 35 VM on VMware Workstation on Windows 10, but the same applies to a generic container environment: The goal is to identify when a specific container starts making syscalls and trace them until the container exits.
Here is where seccomp-bpf helps: As super-user, we can create a filter to inject into the kernel, hook it to the syscall entry point (sys_enter), mark the process (--annotation), and collect the syscalls executed. We just need to add some OCI tools to our system (with Fedora, just install the oci-seccomp-bpf-hook package or compile your own) and configure a hook:
    bash$ cat /usr/share/containers/oci/hooks.d/oci-seccomp-bpf-hook.json
    {
      "version": "1.0.0",
      "hook": {
        "path": "/usr/libexec/oci/hooks.d/oci-seccomp-bpf-hook",
        "args": [
          "oci-seccomp-bpf-hook",
          "-s"
        ]
      },
      "when": {
        "annotations": {
          "^io\\.containers\\.trace-syscall$": ".*"
        }
      },
      "stages": [
        "prestart"
      ]
    }
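The "when" clause of the hook configuration decides whether the hook is triggered for a given container. The following is a simplified sketch of that matching, assuming that both keys and values under "annotations" are regular expressions and that every configured pair must match some annotation of the container; the function name `hook_applies` is ours, and the real runtime logic handles more conditions than shown here.

```python
import json
import re

# Same content as the hook file above, in compact form.
HOOK_CONFIG = json.loads(r"""
{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/libexec/oci/hooks.d/oci-seccomp-bpf-hook",
    "args": ["oci-seccomp-bpf-hook", "-s"]
  },
  "when": {"annotations": {"^io\\.containers\\.trace-syscall$": ".*"}},
  "stages": ["prestart"]
}
""")


def hook_applies(config, container_annotations):
    """Simplified model: each configured key/value pattern pair must match
    at least one annotation set on the container."""
    wanted = config["when"]["annotations"]
    return all(
        any(
            re.search(key_re, key) and re.search(value_re, value)
            for key, value in container_annotations.items()
        )
        for key_re, value_re in wanted.items()
    )


traced = {"io.containers.trace-syscall": "of:/tmp/run.hello.c.json"}
print(hook_applies(HOOK_CONFIG, traced))  # annotation present
print(hook_applies(HOOK_CONFIG, {}))      # no annotation set
```

Because the value pattern is ".*", any value of the annotation triggers the hook; the value itself (for example an output file) is interpreted by the hook, not by the matching step.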
With the tools installed and configured, we can proceed to create the policy.
Based on some research, the majority of containers only use around 80-120 syscalls. Applying filters will reduce the exposure of the kernel from ~400 to ~100 syscalls. Note that Docker has a default seccomp-bpf policy that drops ~50 syscalls.
We can use podman to create a seccomp-bpf policy. In the following example, we create a very simple dedicated container based on fedora, copy our software into the /tmp directory and run the container with the required io.containers.trace-syscall= annotation. When the annotation matches, the seccomp-bpf hook starts logging the syscalls and creates the ready-to-use policy file /tmp/run.hello.c.json.
    # podman build -t labs20220203 .
    STEP 1/2: FROM fedora
    STEP 2/2: COPY run.hello* /tmp
    COMMIT labs20220203
    --> b3717541ab7
    Successfully tagged localhost/labs20220203:latest
    b3717541ab7ad6fe59b34dde61afdbb0f1cefcece5b9d28d426443805cb0ab94

    # podman run --rm --name labs_20220203 --annotation io.containers.trace-syscall=of:/tmp/run.hello.c.json labs20220203 /tmp/run.hello.c
    Hello World

    # cat /tmp/run.hello.c.json | jq
    {
      "architectures": [
        "SCMP_ARCH_X86_64"
      ],
      "defaultAction": "SCMP_ACT_ERRNO",
      "syscalls": [
        {
          "action": "SCMP_ACT_ALLOW",
          "args": [],
          "comment": "",
          "excludes": {},
          "includes": {},
          "names": [
            "access", "arch_prctl", "brk", "capset", "chdir", "close",
            "close_range", "dup2", "execve", "exit_group", "fchdir", "fchown",
            "fstatfs", "getegid", "geteuid", "getgid", "getrandom", "getuid",
            "ioctl", "lseek", "mmap", "mount", "mprotect", "munmap",
            "newfstatat", "openat", "openat2", "pivot_root", "prctl", "pread64",
            "prlimit64", "pselect6", "read", "rt_sigaction", "rt_sigprocmask", "seccomp",
            "set_robust_list", "set_tid_address", "sethostname", "setresgid", "setresuid", "setsid",
            "statx", "umask", "umount2", "write"
          ]
        }
      ]
    }
Not too bad: With pretty simple commands we created a usable seccomp-bpf policy that we can ship with our containerized application and use in the Kubernetes Pod definition (spec/securityContext/seccompProfile).
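As an illustration of where the profile ends up in a Pod definition, the following sketch builds a minimal manifest as a Python dictionary and prints it as JSON. The Pod name, the image reference and the profile path `profiles/run.hello.c.json` are assumptions for the example; with the `Localhost` type, the path is interpreted relative to the seccomp profile directory of the kubelet on the node, so the file has to be distributed to the nodes separately.

```python
import json


def pod_with_seccomp(name, image, profile_path):
    """Build a minimal Pod manifest that references a node-local seccomp profile."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "securityContext": {
                "seccompProfile": {
                    "type": "Localhost",
                    # Hypothetical location, relative to the kubelet seccomp directory.
                    "localhostProfile": profile_path,
                }
            },
            "containers": [{"name": name, "image": image}],
        },
    }


manifest = pod_with_seccomp(
    "hello",                          # hypothetical Pod name
    "localhost/labs20220203:latest",  # image built in the example above
    "profiles/run.hello.c.json",      # hypothetical profile path
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied like any other manifest, since Kubernetes accepts JSON as well as YAML.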
Too easy, right? True.
Take as an example the very common utility ls and try to create a profile for it:
    # podman run --rm --name labs_20220203 --annotation io.containers.trace-syscall=of:/tmp/fedora_ls_no-opt.json fedora ls /
    ...

    # cat /tmp/fedora_ls_no-opt.json | jq
    {
      "defaultAction": "SCMP_ACT_ERRNO",
      "architectures": [
        "SCMP_ARCH_X86_64"
      ],
      "syscalls": [
        {
          "names": [
            "access", "arch_prctl", "brk", "capset", "chdir", "close",
            "close_range", "dup2", "execve", "exit_group", "fchdir", "fchown",
            "fstatfs", "getdents64", "getegid", "geteuid", "getgid", "getrandom",
            "getuid", "ioctl", "lseek", "mmap", "mount", "mprotect",
            "munmap", "newfstatat", "openat", "openat2", "pivot_root", "prctl",
            "pread64", "prlimit64", "pselect6", "read", "rt_sigaction", "rt_sigprocmask",
            "seccomp", "set_robust_list", "set_tid_address", "sethostname", "setresgid", "setresuid",
            "setsid", "statfs", "statx", "umask", "umount2", "write"
          ],
          "action": "SCMP_ACT_ALLOW",
          "args": [],
          "comment": "",
          "includes": {},
          "excludes": {}
        }
      ]
    }

    # cat /tmp/fedora_ls_no-opt.json | jq '.syscalls[].names | length'
    48
Now try ls with some options, for example -lisa:
    # podman run --rm --name labs_20220203 --annotation io.containers.trace-syscall=of:/tmp/fedora_ls_lisa.json fedora ls -lisa /
    ...

    # cat /tmp/fedora_ls_lisa.json | jq
    {
      "defaultAction": "SCMP_ACT_ERRNO",
      "architectures": [
        "SCMP_ARCH_X86_64"
      ],
      "syscalls": [
        {
          "names": [
            "access", "arch_prctl", "brk", "capset", "chdir", "close",
            "close_range", "connect", "dup2", "execve", "exit_group", "fchdir",
            "fchown", "fstatfs", "futex", "getdents64", "getegid", "geteuid",
            "getgid", "getrandom", "getuid", "getxattr", "ioctl", "lgetxattr",
            "lseek", "mmap", "mount", "mprotect", "munmap", "newfstatat",
            "openat", "openat2", "pivot_root", "prctl", "pread64", "prlimit64",
            "pselect6", "read", "readlink", "rt_sigaction", "rt_sigprocmask", "seccomp",
            "set_robust_list", "set_tid_address", "sethostname", "setresgid", "setresuid", "setsid",
            "socket", "statfs", "statx", "umask", "umount2", "write"
          ],
          "action": "SCMP_ACT_ALLOW",
          "args": [],
          "comment": "",
          "includes": {},
          "excludes": {}
        }
      ]
    }

    # cat /tmp/fedora_ls_lisa.json | jq '.syscalls[].names | length'
    54
So, same command, but we have 6 more syscalls if we use the -lisa options!
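Which syscalls make up the difference can be computed directly from the two profiles. The sketch below assumes the profile layout shown above; the helper names are ours, and the two in-memory profiles are shortened stand-ins for the generated files (in practice they would be loaded with `json.load`).

```python
def allowed_syscalls(profile):
    """Collect every syscall name that the profile explicitly allows."""
    names = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            names.update(rule.get("names", []))
    return names


def profile_delta(baseline, candidate):
    """Syscalls allowed by `candidate` but not by `baseline`, sorted by name."""
    return sorted(allowed_syscalls(candidate) - allowed_syscalls(baseline))


# Shortened stand-ins for fedora_ls_no-opt.json and fedora_ls_lisa.json.
common = ["access", "getdents64", "statfs", "write"]
ls_plain = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{"names": common, "action": "SCMP_ACT_ALLOW"}],
}
ls_lisa = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{
        "names": common + ["connect", "futex", "getxattr",
                           "lgetxattr", "readlink", "socket"],
        "action": "SCMP_ACT_ALLOW",
    }],
}

delta = profile_delta(ls_plain, ls_lisa)
print(len(delta), delta)
```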
To have a complete and stable seccomp-bpf policy, we must execute ls with all possible combinations of options, otherwise an untested combination could crash our application. This is already complex to do manually for a relatively simple application like ls; doing the same for a complex distributed application that accesses network and volumes in a rapid deployment cycle is a non-trivial challenge.
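One pragmatic way to approach this is to trace several representative runs and merge the resulting profiles into a single policy. The following is a minimal sketch under that assumption; `merge_profiles` is a hypothetical helper, the two input profiles are shortened examples, and the merged policy is only as complete as the runs that fed it.

```python
def merge_profiles(profiles):
    """Union the allowed syscalls of several traced profiles into one policy."""
    names = set()
    architectures = []
    for profile in profiles:
        for arch in profile.get("architectures", []):
            if arch not in architectures:
                architectures.append(arch)
        for rule in profile.get("syscalls", []):
            if rule.get("action") == "SCMP_ACT_ALLOW":
                names.update(rule.get("names", []))
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": architectures,
        "syscalls": [{
            "names": sorted(names),
            "action": "SCMP_ACT_ALLOW",
            "args": [],
            "comment": "",
            "includes": {},
            "excludes": {},
        }],
    }


# Shortened example profiles from two traced runs of the same tool.
run_a = {
    "architectures": ["SCMP_ARCH_X86_64"],
    "syscalls": [{"names": ["getdents64", "write"], "action": "SCMP_ACT_ALLOW"}],
}
run_b = {
    "architectures": ["SCMP_ARCH_X86_64"],
    "syscalls": [{"names": ["readlink", "write"], "action": "SCMP_ACT_ALLOW"}],
}

merged = merge_profiles([run_a, run_b])
print(merged["syscalls"][0]["names"])
```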
Once the policy is ready, it is easy to apply it to the container. Note that it is not possible to apply a seccomp-bpf profile to a container running with privileged: true set in the container’s securityContext. Privileged containers always run as Unconfined.
Running the ls utility with the created profile means that syscalls not explicitly allowed are blocked. If we try to run ls -lisa with the policy created for plain ls, we receive errors:
    bash$ sudo podman run --security-opt seccomp=./fedora_ls_no-opt.json fedora ls /
    bin
    boot
    ...

    bash$ sudo podman run --security-opt seccomp=./fedora_ls_no-opt.json fedora ls -lisa /
    ls: /: Operation not permitted
    ls: /bin: Operation not permitted
    ls: Cannot read symbolic link '/bin': Operation not permitted
    ...
    total 16
    441735 0 dr-xr-xr-x 1 root root 6 Jan 20 15:15 .
    441735 0 dr-xr-xr-x 1 root root 6 Jan 20 15:15 ..
    303907 4 lrwxrwxrwx 1 root root 7 Jul 21 2021 bin
    ...
    (ERROR)-(Exit Code 1)-(General error)
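The behaviour above can be reasoned about with a small model of how a profile is evaluated: a syscall listed in a rule gets that rule's action, everything else gets the default action. This is a deliberately simplified sketch (it ignores argument filters and architectures, and the function name `evaluate` is ours), not a description of how the kernel filter is implemented.

```python
import errno
import os


def evaluate(profile, syscall):
    """Return the action a profile assigns to a syscall name (simplified model)."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


# Shortened version of the profile traced for plain `ls`.
NO_OPT_PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [{
        "names": ["getdents64", "newfstatat", "openat", "write"],
        "action": "SCMP_ACT_ALLOW",
    }],
}

for name in ("getdents64", "lgetxattr", "readlink"):
    action = evaluate(NO_OPT_PROFILE, name)
    # With SCMP_ACT_ERRNO the call fails with an error code instead of running;
    # the messages in the output above correspond to EPERM.
    outcome = "runs" if action == "SCMP_ACT_ALLOW" else os.strerror(errno.EPERM)
    print(f"{name:12} {action:16} {outcome}")
```

This also explains why ls -lisa still prints a partial listing: the blocked calls return an error to the program, which decides for itself how to continue.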
We can fine-tune the policy to allow a baseline and log other syscalls in the audit.log:
    {
      "defaultAction": "SCMP_ACT_ERRNO",
      "architectures": [
        "SCMP_ARCH_X86_64"
      ],
      "syscalls": [
        {
          "names": [
            "access",
            ...
            "write"
          ],
          "action": "SCMP_ACT_ALLOW",
          "args": [],
          "comment": "",
          "includes": {},
          "excludes": {}
        },
        {
          "names": [
            "connect",
            "futex",
            "getxattr",
            "lgetxattr",
            "readlink",
            "socket"
          ],
          "action": "SCMP_ACT_LOG",
          "args": [],
          "comment": "",
          "includes": {},
          "excludes": {}
        }
      ]
    }

    bash$ sudo podman run --security-opt seccomp=./fedora_ls_no-opt_log-delta.json fedora ls -lisa /
    total 16
    441781 0 dr-xr-xr-x. 1 root root    6 Jan 20 15:29 .
    441781 0 dr-xr-xr-x. 1 root root    6 Jan 20 15:29 ..
    303907 4 lrwxrwxrwx. 1 root root    7 Jul 21  2021 bin -> usr/bin
    303903 0 dr-xr-xr-x. 1 root root    0 Jul 21  2021 boot
         1 0 drwxr-xr-x. 5 root root  340 Jan 20 15:29 dev
    296107 0 drwxr-xr-x. 1 root root 1682 Nov 25 06:49 etc

    bash$ cat /var/log/audit.log
    ...
    type=SECCOMP msg=audit(1642692596.061:447): auid=1000 uid=0 gid=0 ses=3 subj=system_u:system_r:container_t:s0:c390,c427 pid=17505 comm="ls" exe="/usr/bin/ls" sig=0 arch=c000003e syscall=202 compat=0 ip=0x7f323c139d40 code=0x7ffc0000 AUID="rcc" UID="root" GID="root" ARCH=x86_64 SYSCALL=futex
    type=SECCOMP msg=audit(1642692596.061:454): auid=1000 uid=0 gid=0 ses=3 subj=system_u:system_r:container_t:s0:c390,c427 pid=17505 comm="ls" exe="/usr/bin/ls" sig=0 arch=c000003e syscall=41 compat=0 ip=0x7f323c1ba55b code=0x7ffc0000 AUID="rcc" UID="root" GID="root" ARCH=x86_64 SYSCALL=socket
    type=SECCOMP msg=audit(1642692596.061:455): auid=1000 uid=0 gid=0 ses=3 subj=system_u:system_r:container_t:s0:c390,c427 pid=17505 comm="ls" exe="/usr/bin/ls" sig=0 arch=c000003e syscall=42 compat=0 ip=0x7f323c1b9f27 code=0x7ffc0000 AUID="rcc" UID="root" GID="root" ARCH=x86_64 SYSCALL=connect
    type=SECCOMP msg=audit(1642692596.062:499): auid=1000 uid=0 gid=0 ses=3 subj=system_u:system_r:container_t:s0:c390,c427 pid=17505 comm="ls" exe="/usr/bin/ls" sig=0 arch=c000003e syscall=191 compat=0 ip=0x7f323c1b50be code=0x7ffc0000 AUID="rcc" UID="root" GID="root" ARCH=x86_64 SYSCALL=getxattr
    type=SECCOMP msg=audit(1642692596.062:500): auid=1000 uid=0 gid=0 ses=3 subj=system_u:system_r:container_t:s0:c390,c427 pid=17505 comm="ls" exe="/usr/bin/ls" sig=0 arch=c000003e syscall=192 compat=0 ip=0x7f323c1b511e code=0x7ffc0000 AUID="rcc" UID="root" GID="root" ARCH=x86_64 SYSCALL=lgetxattr
    type=SECCOMP msg=audit(1642692596.062:501): auid=1000 uid=0 gid=0 ses=3 subj=system_u:system_r:container_t:s0:c390,c427 pid=17505 comm="ls" exe="/usr/bin/ls" sig=0 arch=c000003e syscall=191 compat=0 ip=0x7f323c1b50be code=0x7ffc0000 AUID="rcc" UID="root" GID="root" ARCH=x86_64 SYSCALL=getxattr
    type=SECCOMP msg=audit(1642692596.062:504): auid=1000 uid=0 gid=0 ses=3 subj=system_u:system_r:container_t:s0:c390,c427 pid=17505 comm="ls" exe="/usr/bin/ls" sig=0 arch=c000003e syscall=89 compat=0 ip=0x7f323c1aa05b code=0x7ffc0000 AUID="rcc" UID="root" GID="root" ARCH=x86_64 SYSCALL=readlink
    ...
With this method, we could also build a SOC use case that alerts on “unexpected” syscall events.
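As a rough sketch of such a use case, the SECCOMP records can be parsed and compared with what is expected for each command. The record layout is assumed to be the one shown in the audit log above (with the enriched SYSCALL= field, falling back to the numeric syscall= field); the sample records are shortened, and the names `seccomp_events`, `unexpected_events` and the `EXPECTED` table are hypothetical.

```python
import re
from collections import Counter


def seccomp_events(log_text):
    """Yield (command, syscall) for every SECCOMP record in an audit log."""
    for line in log_text.splitlines():
        if "type=SECCOMP" not in line:
            continue
        comm = re.search(r'\bcomm="([^"]*)"', line)
        name = re.search(r"\bSYSCALL=(\w+)", line)   # enriched name, if present
        number = re.search(r"\bsyscall=(\d+)", line)  # raw syscall number
        if comm and (name or number):
            syscall = name.group(1) if name else number.group(1)
            yield comm.group(1), syscall


def unexpected_events(log_text, expected):
    """Return the sorted (command, syscall) pairs not covered by `expected`."""
    return sorted({
        (comm, syscall)
        for comm, syscall in seccomp_events(log_text)
        if syscall not in expected.get(comm, set())
    })


# Shortened records in the format of the audit log above.
SAMPLE_LOG = """\
type=SECCOMP msg=audit(1642692596.061:447): pid=17505 comm="ls" exe="/usr/bin/ls" syscall=202 code=0x7ffc0000 ARCH=x86_64 SYSCALL=futex
type=SECCOMP msg=audit(1642692596.061:454): pid=17505 comm="ls" exe="/usr/bin/ls" syscall=41 code=0x7ffc0000 ARCH=x86_64 SYSCALL=socket
type=SECCOMP msg=audit(1642692596.062:499): pid=17505 comm="ls" exe="/usr/bin/ls" syscall=191 code=0x7ffc0000 ARCH=x86_64 SYSCALL=getxattr
type=SECCOMP msg=audit(1642692596.062:501): pid=17505 comm="ls" exe="/usr/bin/ls" syscall=191 code=0x7ffc0000 ARCH=x86_64 SYSCALL=getxattr
"""

EXPECTED = {"ls": {"futex"}}  # hypothetical: logged syscalls we consider normal

counts = Counter(seccomp_events(SAMPLE_LOG))
alerts = unexpected_events(SAMPLE_LOG, EXPECTED)
print(counts)
print(alerts)
```

In a real deployment the alerting logic would more likely live in the log pipeline or SIEM; the sketch only illustrates which fields carry the relevant information.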
We have seen how to control which privileged functionality a kernel should offer to a specific unprivileged process. So why don’t we see a specific policy for each application? Because the real challenge is creating a policy that makes the application safe but does not “disturb” operation.
As we have seen, the filters reside in the kernel and prevent unwanted syscalls. But the list of syscalls used by an application is usually not available. It depends on a considerable number of factors, including architecture and compiler.
The importance of microservice environments will increase in the future, and with it the number of applications; the security controls will become more and more complex. In the end, we must go deeper into the system and install controls also in its “trusted” part: The kernel.
seccomp-bpf gives us the ability to enforce policies at the kernel level (once enforced, a filter cannot be removed from the running process), but tuning the policies, especially in a dynamic environment, is not trivial. Nevertheless, we have to deal with them, now and even more in the future.
How a policy is created automatically will be the topic of the next Lab, in which we will see examples of the two principal strategies used: Static and dynamic analysis, automated in a CI/CD chain.