eBPF - A First Look

by Ahmet Hrnjadovic
time to read: 11 minutes

Key Points

This is the Hot New Kernel Feature

  • BPF is executed in kernel space
  • BPF can be loaded into the kernel at runtime
  • BPF can be used to dynamically trace kernel function calls
  • BPF runs in a sandboxed environment, minimizing the risk of harm to the system

BPF (Berkeley Packet Filter) is nothing new; it has been used in tools such as tcpdump for efficient packet filtering for years. In recent kernel versions, especially since version 4.1, a range of additions has been made to BPF, allowing it to be used for much more than packet filtering. This article gives a first look at using BPF and showcases some of BPF's capabilities.

Because of the many additions to BPF, it is also referred to as eBPF, the ‘e’ standing for extended. For tasks handled by the Linux kernel, BPF is more efficient than traditional event tracing methods because it allows direct instrumentation of almost any kernel function, so BPF programs can be attached to all kinds of event sources. BPF programs run in a sandboxed kernel environment: the kernel verifies each program before loading it, which is meant to ensure that its application logic cannot harm the system. This makes BPF well suited for monitoring events or debugging on production systems. The remaining risk is a performance impact when a large number of events is monitored.

An existing collection of tools using BPF can be found in the bpfcc-tools package for Debian systems. One useful tool is opensnoop, which monitors the files opened either system-wide or by a single program by tracing the do_sys_open kernel function.

Another useful BPF program is tcpretrans, which traces TCP retransmits. Without BPF, a common way to trace TCP retransmits would be to take a packet capture and analyze the packets with a tool such as tcpdump. With BPF, we can attach a kprobe to the tcp_retransmit_skb kernel function, which is a more elegant solution. From tcpretrans:

# initialize BPF
b = BPF(text=bpf_text)
b.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")

The function trace_retransmit is a C function defined earlier in the source file. It gathers information on the retransmit and is executed every time a retransmit happens.

Using BPF

Writing programs directly in eBPF instructions is time-consuming and comparable to coding in assembly. There are, however, approachable front ends available with varying levels of abstraction and capabilities. In this article, bcc (BPF Compiler Collection) is used. bcc offers front ends in both Python and Lua. While this first example using the Python interface is minimalistic, it shows how little code is needed to monitor calls to the ptrace_attach kernel function and thereby detect a form of process injection.

#!/usr/bin/python

from bcc import BPF

BPF(text='int kprobe__ptrace_attach(void *ctx) { bpf_trace_printk("ptrace_attach called\\n"); return 0; }').trace_print()

This program generates the following output when ptrace_attach is called:

# ./ptrace.py
         derusbi-6572  [003] .... 55527.716367: 0x00000001: ptrace_attach called

The quoted string passed as the text argument is restricted C code. kprobe__ is a special function prefix that creates a kprobe (dynamic tracing of a kernel function call) for the kernel function named after the prefix. Our function is executed every time the ptrace_attach kernel function is called.
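To make the naming convention concrete, the following pure-Python sketch shows how a function name carrying the kprobe__ prefix could be mapped to the kernel function it probes. The helper name kprobe_target is hypothetical and not part of bcc; it only illustrates the idea and does not load or attach anything.

```python
KPROBE_PREFIX = "kprobe__"


def kprobe_target(fn_name):
    """Return the kernel function a kprobe__-prefixed name refers to.

    Returns None if the name does not follow the convention.
    (Hypothetical helper for illustration, not a bcc API.)
    """
    if fn_name.startswith(KPROBE_PREFIX) and len(fn_name) > len(KPROBE_PREFIX):
        # Everything after the prefix is taken as the kernel function name.
        return fn_name[len(KPROBE_PREFIX):]
    return None
```

For the example above, kprobe_target("kprobe__ptrace_attach") yields "ptrace_attach", while a name without the prefix, such as "p_event", yields None and would have to be attached explicitly.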

bpf_trace_printk can be used as a convenient hack to write to the common /sys/kernel/debug/tracing/trace_pipe for debugging purposes, but shouldn’t be used otherwise because trace_pipe is globally shared. The exact output of bpf_trace_printk depends on the options set in /sys/kernel/debug/tracing/trace_options (see /sys/kernel/debug/tracing/README for more information). In this case, derusbi is the name and 6572 the PID of the current task. [003] is the number of the CPU the current task is running on, followed by the IRQ and preemption flags, a timestamp in seconds (with microsecond resolution), a fake value used by BPF for the instruction pointer register and our formatted message.
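The field layout described above can be sketched as a small parser. This is only an illustration for the exact line format shown in this article (default trace_options, and a kernel that prints the fake instruction pointer value); the names TRACE_LINE and parse_trace_line are made up for this example and are not part of bcc.

```python
import re

# Pattern for the line layout shown in the article's sample output.
# Assumption: default trace_options; other settings change the columns.
TRACE_LINE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+"   # task name and PID
    r"\[(?P<cpu>\d+)\]\s+"                # CPU number in brackets
    r"(?P<flags>\S+)\s+"                  # IRQ/preemption flags
    r"(?P<ts>\d+\.\d+):\s+"               # timestamp in seconds
    r"(?P<ip>\S+):\s+"                    # fake instruction pointer value
    r"(?P<msg>.*)$"                       # formatted message
)


def parse_trace_line(line):
    """Split one trace_pipe line into its fields, or return None."""
    match = TRACE_LINE.match(line)
    if match is None:
        return None
    return {
        "task": match.group("task"),
        "pid": int(match.group("pid")),
        "cpu": int(match.group("cpu")),
        "flags": match.group("flags"),
        "ts": float(match.group("ts")),
        "ip": match.group("ip"),
        "msg": match.group("msg"),
    }


sample = "         derusbi-6572  [003] .... 55527.716367: 0x00000001: ptrace_attach called"
fields = parse_trace_line(sample)
```

In practice there is no need to write such a parser by hand, since bcc's trace_fields() already returns these values; the sketch only makes the meaning of each column explicit.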

We can attach the same BPF function to several kprobes by calling attach_kprobe in Python. Also, using trace_fields() instead of trace_print() gives us more control over the values printed.

from bcc import BPF

# define BPF program
bpf_program = """
int p_event(void *ctx) {
    bpf_trace_printk("traced very meaningful event!\\n");
    return 0;
}
"""

# load BPF program
b = BPF(text=bpf_program)

# attach kprobes for the clone and execve system calls
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="p_event")
b.attach_kprobe(event=b.get_syscall_fnname("execve"), fn_name="p_event")

while True:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
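    # task and msg are bytes under Python 3, so decode them before printing
    print("%f\t%d\t%s\t%s" % (ts, pid, task.decode("utf-8", "replace"), msg.decode("utf-8", "replace")))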

This next example uses the BPF_PERF_OUTPUT() interface. This is the preferred method of pushing per-event data to user space.

#!/usr/bin/python

#Adapted example from https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md
#as well as https://github.com/iovisor/bcc/blob/master/docs/tutorial_bcc_python_developer.md

from bcc import BPF

# define BPF program
bpf_program = """
#include <linux/sched.h>

struct omg_data {
    u64 gid_pid;
    u32 pid;
    u32 gid;
    u64 ts;  //timestamp with nanosecond precision
    char procname[TASK_COMM_LEN]; //holds the name of the current process
};

BPF_PERF_OUTPUT(custom_event);

int get_thi_dete(struct pt_regs *ctx) {
    struct omg_data mdata;

    mdata.gid_pid = bpf_get_current_pid_tgid();
    mdata.pid = (u32) mdata.gid_pid;  //lower 32 bits: kernel pid (thread ID)
    mdata.gid = mdata.gid_pid >> 32;  //upper 32 bits: tgid (process ID as seen from user space)

    mdata.ts = bpf_ktime_get_ns(); //gets timestamp in nanoseconds

    bpf_get_current_comm(&mdata.procname, sizeof(mdata.procname));

    custom_event.perf_submit(ctx, &mdata, sizeof(mdata));

    return 0;
}
"""

# load BPF program
b = BPF(text=bpf_program)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="get_thi_dete")


# header
print("%-18s %-6s %-6s %-16s %s" % ("TIME(s)", "PID", "TID", "TASK", "MESSAGE"))

# process event
start = 0
def print_event(cpu, omg_data, size):
    global start
    event = b["custom_event"].event(omg_data)
    if start == 0:
        start = event.ts
    time_s = (float(event.ts - start)) / 1000000000
    # procname is bytes under Python 3, so decode it before printing
    print("%-18.9f %-6d %-6d %-16s %s" % (time_s, event.gid, event.pid, event.procname.decode("utf-8", "replace"), "Traced sys_clone!"))

# loop with callback to print_event
b["custom_event"].open_perf_buffer(print_event)
while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        # exit cleanly on Ctrl-C
        exit()

Because we don’t use bpf_trace_printk() here, we also don’t get the prepackaged information it provides. This means we need to gather event data ourselves. We use the omg_data struct to pass data from kernel to user space. BPF_PERF_OUTPUT(custom_event) creates a BPF table named custom_event for pushing custom event data to user space via the perf ring buffer. From bpf.h:

* u64 bpf_get_current_pid_tgid(void)
*  Return
*      A 64-bit integer containing the current tgid and pid, and
*      created as such:
*      current_task->tgid << 32 | current_task->pid.

Thus, to get the thread group ID we shift the returned value right by 32 bits, and to get the PID we keep only its lower 32 bits (here by casting to u32).
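The same arithmetic can be tried out in plain Python. This is a minimal sketch with made-up ID values; split_pid_tgid is a hypothetical helper name, not a bcc or kernel function.

```python
def split_pid_tgid(pid_tgid):
    """Split the 64-bit value into (tgid, pid)."""
    tgid = pid_tgid >> 32        # upper 32 bits: thread group ID
    pid = pid_tgid & 0xFFFFFFFF  # lower 32 bits: kernel pid (thread ID)
    return tgid, pid


# Made-up example: thread 6580 belonging to thread group 6572.
packed = (6572 << 32) | 6580
tgid, pid = split_pid_tgid(packed)
```

The mask with 0xFFFFFFFF plays the role of the (u32) cast in the C code: both discard the upper half of the value.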

To get the name of the current task we call bpf_get_current_comm(). From bpf.h:

* int bpf_get_current_comm(char *buf, u32 size_of_buf)
*  Description
*      Copy the **comm** attribute of the current task into *buf* of
*      *size_of_buf*. The **comm** attribute contains the name of
*      the executable (excluding the path) for the current task. The
*      *size_of_buf* must be strictly positive. On success, the
*      helper makes sure that the *buf* is NUL-terminated. On failure,
*      it is filled with zeroes.

perf_submit submits the event to user space via a perf ring buffer. print_event is the Python function that handles events read from the custom_event stream. b.perf_buffer_poll() waits for events; by default, this call blocks until events arrive.
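To illustrate what arrives in user space, the sketch below mirrors the omg_data struct with ctypes and decodes a hand-built buffer. It is an illustration only: with bcc, the event() call already returns an object with these fields, so none of this is needed in the tracing script. The class name OmgData, the helper decode_event and the sample values are assumptions made for this example; TASK_COMM_LEN is 16 in the kernel headers.

```python
import ctypes
import struct

TASK_COMM_LEN = 16  # length of the comm field, as defined in linux/sched.h


class OmgData(ctypes.Structure):
    """User-space mirror of struct omg_data (hypothetical name)."""
    _fields_ = [
        ("gid_pid", ctypes.c_uint64),
        ("pid", ctypes.c_uint32),
        ("gid", ctypes.c_uint32),
        ("ts", ctypes.c_uint64),
        ("procname", ctypes.c_char * TASK_COMM_LEN),
    ]


def decode_event(raw):
    """Interpret a raw byte buffer as one omg_data record."""
    return OmgData.from_buffer_copy(raw)


# Hand-built sample record in native byte order (made-up values):
# tgid 100, pid 101, timestamp 5 s in nanoseconds, task name "bash".
raw = struct.pack("=QIIQ16s", (100 << 32) | 101, 101, 100, 5000000000, b"bash")
event = decode_event(raw)
seconds = event.ts / 1e9  # nanoseconds to seconds, as in print_event
```

Because the struct consists of a u64, two u32 values, another u64 and a 16-byte array, it contains no padding, which is why the packed buffer and the ctypes layout line up at 40 bytes.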

Conclusion

The new additions to BPF allow for compact, powerful and performant tracing programs. Now it’s just a matter of finding suitable use cases to harness its power. Although BPF programs touch sensitive areas of the system, the risk of impacting other parts of the system is reduced because they run in a sandboxed environment and are statically analyzed by the kernel before being loaded.

About the Author

Ahmet Hrnjadovic

Ahmet Hrnjadovic has been working in cybersecurity since 2017, focusing on topics such as Linux, secure development and web application security testing. (ORCID 0000-0003-1320-8655)
