In this lab, no special hardware is needed and you are free to use any machine you like, including a personal machine or a machine that you provisioned on CloudLab. However, you must run a Linux distribution that supports seccomp filter mode (any recent Linux kernel will work).
The Berkeley Packet Filter (BPF) provides a safe way of loading and executing a code extension inside the kernel. It can be used to modify the kernel’s behavior or to provide new functionality. In the case of this lab, you will use BPF to modify how the kernel handles system calls. Specifically, you will use an interface called seccomp that is designed to filter system calls.
To use BPF, you supply a filter program that is loaded into the kernel. Each filter program consists of an array of instructions written in a small, custom instruction set that the kernel can verify is safe to run. This instruction set is more restrictive than others you may have used in the past (e.g., RISC-V): the BPF virtual machine provides 32-bit words and fixed-size instructions, and it supports just a single accumulator and a single index register.
More recently (kernel version 3.18 and later), an extension to BPF has been made available, called eBPF. eBPF is less restrictive, providing 64-bit words and ten registers. The hints in this lab will focus on the original BPF, but feel free to use eBPF instead.
Here are several references on BPF and seccomp that you may find helpful in completing the lab:
There are two main goals in this lab assignment: a) intercepting system calls and redirecting them to userspace, and b) handling the redirected system calls in a way that provides a custom filesystem.
For intercepting system calls, you will use seccomp and a custom BPF filter that you will implement. The BPF filter must allow only the minimal set of system calls through, such as those needed to support redirection and to allow printing to STDOUT. However, all other system calls must be delivered to a signal handler for emulation. Then, in the handler, extract the system call number and arguments, and implement the appropriate behavior.
Using these emulated system calls, the filesystem should provide the following nonstandard behavior. Each file that is opened will contain (as its file contents) exactly one 32-bit (4 byte) word that stores the length of the file’s name (i.e., the number of characters). For example, a file named “foo” would contain the integer value “3”, while a file named “firecracker” would contain the integer value “11”.
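As a minimal sketch of the file contents described above, the helper below computes the 32-bit word for a given file name. The name `name_length_word` is hypothetical and not part of the skeleton code. The sketch assumes that the "name" is the exact string passed to open() and that the word is stored in the machine's native byte order; the lab text does not specify either.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the 32-bit word an emulated file would contain.
 * Assumption: the "name" is the full string passed to open(). */
static uint32_t name_length_word(const char *name)
{
    return (uint32_t)strlen(name);
}
```

With this helper, `name_length_word("foo")` yields 3 and `name_length_word("firecracker")` yields 11, matching the two examples above.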
We provide you with skeleton code that includes test cases. If successful, the test code will print SUCCESS, indicating you have a working solution. Otherwise, if a system call returns an unexpected value, or the character count is wrong, the test code will print FAIL. The test code uses only the system calls open(), read(), and close(), but it relies on other system calls to exit and to handle signals for emulation. Your job is to take the test program, modify it to redirect system calls to userspace, and then provide a user-level implementation of the above system calls, delivering the specified file system behavior.
Warning: Any operation that you perform in your signal handler must technically be reentrant (async-signal-safe). We recommend that you familiarize yourself with this problem and aim to write code that follows these restrictions. If you don’t, your solution may still work, but we can’t guarantee there won’t be undefined behavior.
Hint: First, define some constants for the system call number and registers or any other parameters you may need to access inside your filter.
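One possible set of constants is sketched below. A seccomp filter reads its input from a `struct seccomp_data`, so the useful constants are byte offsets into that structure plus the system call numbers to compare against. The macro names (`OFF_NR`, `OFF_ARCH`, `OFF_ARG_LO`, `NR_*`) are made up for illustration, and the argument offset assumes a little-endian machine, where the low 32 bits of each 64-bit argument come first.

```c
#include <stddef.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

/* Byte offsets into struct seccomp_data, the record a seccomp filter inspects. */
#define OFF_NR        (offsetof(struct seccomp_data, nr))
#define OFF_ARCH      (offsetof(struct seccomp_data, arch))
/* Low 32 bits of argument n. Assumption: little-endian machine. */
#define OFF_ARG_LO(n) (offsetof(struct seccomp_data, args[n]))

/* System call numbers the filter compares against (hypothetical names). */
#define NR_WRITE        SYS_write
#define NR_EXIT_GROUP   SYS_exit_group
#define NR_RT_SIGRETURN SYS_rt_sigreturn
```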
Hint: Then implement a filter program that redirects system calls to userspace. You will need to allow the kernel to directly handle rt_sigreturn(), exit_group(), and write() to STDOUT but not to other file descriptors. For all other system calls, you should trap (i.e., deliver a signal) so it can be emulated. Because the goal is to minimize the attack surface, your filter must provide this behavior (allowing the minimum set of system calls through) to receive full credit.
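The policy in this hint could be written as a classic BPF program along the following lines. This is a sketch, not a reference solution: the array name `trap_filter` is hypothetical, the fd comparison looks only at the low 32 bits of the first argument (assuming a little-endian machine), and the architecture check that a hardened filter would normally perform first is omitted for brevity.

```c
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Jump offsets (jt, jf) are counted from the instruction after the jump. */
static struct sock_filter trap_filter[] = {
    /* [0] A = system call number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [1] rt_sigreturn: allow (skip 5 to [7]) */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_rt_sigreturn, 5, 0),
    /* [2] exit_group: allow (skip 4 to [7]) */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 4, 0),
    /* [3] write: fall through to the fd check; anything else: trap ([6]) */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 0, 2),
    /* [4] A = low 32 bits of the first argument (the fd) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
    /* [5] fd == STDOUT: allow ([7]); otherwise trap ([6]) */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, STDOUT_FILENO, 1, 0),
    /* [6] deliver SIGSYS so the call can be emulated */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
    /* [7] let the kernel handle the call */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
```

Placing both return instructions at the end keeps every jump short and forward-only, which classic BPF requires.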
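Hint: Next you must call the prctl() system call with the argument PR_SET_NO_NEW_PRIVS. This does not drop privileges you already hold; it prevents the process from gaining new ones, which the kernel requires before an unprivileged process may install a seccomp filter.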
Hint: Then you must call the prctl() system call with the argument PR_SET_SECCOMP, in mode SECCOMP_MODE_FILTER, to load the filter program.
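The two prctl() calls might be wrapped as follows. The helper name `install_filter` is hypothetical; it takes the instruction array and its length, packages them in a `struct sock_fprog`, and returns 0 on success or -1 on failure.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

/* Hypothetical helper: returns 0 on success, -1 on failure (errno set by prctl). */
static int install_filter(struct sock_filter *insns, unsigned short count)
{
    struct sock_fprog prog = { .len = count, .filter = insns };

    /* Step 1: forbid gaining new privileges, so that an unprivileged
     * process is permitted to install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;

    /* Step 2: hand the filter program to the kernel. */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        return -1;

    return 0;
}
```

Once the second call succeeds, the filter applies to every later system call in the process and cannot be removed, so any setup that itself needs system calls has to happen first.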
Hint: Now the filter is done, but you must still configure and handle signals to perform the emulation. Use sigaction() to set this up for SIGSYS and sigprocmask() to unblock SIGSYS.
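A sketch of the signal setup is shown below. The names `setup_sigsys`, `sigsys_placeholder`, and `sigsys_seen` are hypothetical, and the placeholder handler only records that the signal arrived. The SA_SIGINFO flag is what gives the handler access to the `siginfo_t` and the saved register context.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t sigsys_seen = 0;

/* Placeholder handler: only records that SIGSYS was delivered. */
static void sigsys_placeholder(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)info; (void)ctx;
    sigsys_seen = 1;
}

/* Hypothetical helper: installs `handler` for SIGSYS and unblocks the signal.
 * Returns 0 on success, -1 on failure. */
static int setup_sigsys(void (*handler)(int, siginfo_t *, void *))
{
    struct sigaction sa;
    sigset_t mask;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO;   /* three-argument handler form */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSYS, &sa, NULL) != 0)
        return -1;

    sigemptyset(&mask);
    sigaddset(&mask, SIGSYS);
    return sigprocmask(SIG_UNBLOCK, &mask, NULL);
}
```

Note that sigaction() and sigprocmask() are themselves system calls, so under a filter that traps everything but the minimal set they need to run before the filter is loaded.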
Hint: Finally, build a handler that parses the system call number and its arguments and implements the specified behavior above. You may find it helpful to use a switch statement that calls a function to handle each system call. Be sure to set the return value of the system call before exiting the handler.
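The handler structure suggested here might look like the sketch below. It makes several assumptions that the lab text does not state: an x86-64 machine (arguments in RDI, RSI, RDX and the return value in RAX), glibc's `REG_*` indices, a small fixed table of emulated files, and emulated descriptors starting at an arbitrary base. All `emu_*` and `EMU_*` names are hypothetical. Errors are returned as negative errno values, following the kernel's own convention. Because the C library may implement open() with the openat system call, both are handled.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <ucontext.h>

#define EMU_FD_BASE  1000   /* hypothetical: first emulated descriptor */
#define EMU_MAX_OPEN 16     /* hypothetical: table size */

static struct { int used; uint32_t word; size_t offset; } emu_files[EMU_MAX_OPEN];

static long emu_open(const char *path)
{
    for (int i = 0; i < EMU_MAX_OPEN; i++) {
        if (!emu_files[i].used) {
            emu_files[i].used = 1;
            emu_files[i].word = (uint32_t)strlen(path); /* file contents */
            emu_files[i].offset = 0;
            return EMU_FD_BASE + i;
        }
    }
    return -EMFILE;
}

static long emu_read(long fd, void *buf, size_t count)
{
    long i = fd - EMU_FD_BASE;
    if (i < 0 || i >= EMU_MAX_OPEN || !emu_files[i].used)
        return -EBADF;
    size_t left = sizeof(uint32_t) - emu_files[i].offset;
    size_t n = count < left ? count : left;   /* 0 at end of file */
    memcpy(buf, (const char *)&emu_files[i].word + emu_files[i].offset, n);
    emu_files[i].offset += n;
    return (long)n;
}

static long emu_close(long fd)
{
    long i = fd - EMU_FD_BASE;
    if (i < 0 || i >= EMU_MAX_OPEN || !emu_files[i].used)
        return -EBADF;
    emu_files[i].used = 0;
    return 0;
}

/* SIGSYS handler: dispatch on the trapped system call number. */
static void sigsys_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    greg_t *regs = ((ucontext_t *)ctx)->uc_mcontext.gregs;
    long ret;

    switch (info->si_syscall) {
    case SYS_open:   /* open(path, flags, mode) */
        ret = emu_open((const char *)regs[REG_RDI]);
        break;
    case SYS_openat: /* openat(dirfd, path, flags, mode) */
        ret = emu_open((const char *)regs[REG_RSI]);
        break;
    case SYS_read:
        ret = emu_read(regs[REG_RDI], (void *)regs[REG_RSI], (size_t)regs[REG_RDX]);
        break;
    case SYS_close:
        ret = emu_close(regs[REG_RDI]);
        break;
    default:
        ret = -ENOSYS;
    }
    regs[REG_RAX] = ret;  /* becomes the system call's return value */
}
```

Writing the result into the saved RAX slot works because the register context is restored when the handler returns, so the interrupted code sees the value as if the kernel had produced it.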
Optional Challenge: Measure the performance of your seccomp filtered and emulated system calls versus normal Linux system calls. We suggest using getpid() to make this characterization, as it returns a simple number and performs no other work.
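One simple way to take this measurement is to time a loop of getpid() calls and divide by the iteration count, as in the sketch below. The function name `getpid_ns_per_call` is hypothetical. It issues the call through syscall() so that every iteration really enters the kernel (or the filter), and it assumes that CLOCK_MONOTONIC is precise enough for the chosen iteration count.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: average wall-clock nanoseconds per getpid() call. */
static double getpid_ns_per_call(long iterations)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        syscall(SYS_getpid);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Running the same function once before the filter is installed and once after (with getpid emulated in the handler) gives the two numbers to compare. If clock_gettime() is not served from the vDSO on your system, it would itself be trapped by the filter, so check that before trusting the second measurement.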