Some thoughts on sandboxing programs
A few years back I wrote an online judge to host contests in college, with my admittedly beginner-level knowledge of Linux. I went through a couple of iterations to improve the judge itself, but the core problem of sandboxing arbitrary code correctly remained unsolved, and it is an interesting one. It also turns out that even compiling a program is not safe outside a sandbox: there are actual C programs whose compilation can bring the system to a halt. So the entire process, from compiling the user-submitted code to running the program on the test cases, needs to happen in a sandbox. Recently, I have been thinking a lot about how to sandbox programs correctly (not just in the context of an online judge), given any set of capabilities that you want to restrict for the program.
Everything below is in the context of a Linux system and may not work on an OS X machine. Most of these methods have a corresponding shell command that sets restrictions for the current session, but they also have a corresponding system call that can be called directly from a C program, which gives much more flexibility. After setting the restrictions, you can call execve(2) to turn the process into the required program, which then runs with the restricted capabilities.
A chroot jail
A chroot jail changes the apparent root directory of a process and its sub-processes. This helps because the program cannot access files outside the jailed directory, and operations that try to reach above it (through the parent directory, for example) fail. It is useful for blocking a process from accessing any file that you have kept out of the jailed environment.
Even then, doing a chroot jail correctly is hard, because a process running as root can break out of it: chroot(2) does not change the current working directory, so a root process can chroot into a subdirectory and then change its current directory to something outside the jail. The pitfalls are pretty well documented in the man page of the chroot(2) system call, which gives the sequence mkdir foo; chroot foo; cd .. as an example. A correct implementation of a chroot jail therefore changes the user ID of the process to something non-root after entering the jail.
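A minimal sketch of that idea in C, under a few assumptions: the function name enter_jail is made up for illustration, the caller starts as root, and the uid and gid belong to some unprivileged account of your choosing. It moves into the jail directory before calling chroot(2) and only then gives up root.

```c
#include <grp.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Confine the calling process to `dir` and give up root.
 * Hypothetical helper: must be called as root; `uid` and `gid` are an
 * unprivileged account picked by the caller.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int enter_jail(const char *dir, uid_t uid, gid_t gid)
{
    /* Enter the directory first so the working directory cannot be
       left outside the new root. */
    if (chdir(dir) == -1)
        return -1;
    if (chroot(".") == -1)
        return -1;

    /* Drop supplementary groups, then the group ID, then the user ID.
       The order matters: after setuid() the other two would fail. */
    if (setgroups(0, NULL) == -1)
        return -1;
    if (setgid(gid) == -1)
        return -1;
    if (setuid(uid) == -1)
        return -1;

    /* Sanity check: getting root back must no longer be possible. */
    if (uid != 0 && setuid(0) == 0)
        return -1;
    return 0;
}
```

After a successful call the process would typically execve(2) the untrusted program, which then sees the jail directory as / and runs without root privileges.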
This gives us some idea of how to prevent an arbitrary program from getting file access outside of its jail.
Limit resource usage
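The idea is that a program should be allotted only a limited amount of CPU time and RAM; a process that exceeds the set limits is stopped by the kernel itself, either with a specific signal or by having the offending call fail. This is where the setrlimit(2) call comes in. It also has a corresponding command-line utility called ulimit(1) (a shell built-in) that can be used to set different resource usage caps for a process. The man page of setrlimit(2) explains that every resource has a soft and a hard limit: for RLIMIT_CPU, for example, the process receives SIGXCPU when it reaches the soft limit and SIGKILL when it reaches the hard limit. Note that RLIMIT_RSS has no effect on current Linux kernels, so memory is usually capped with RLIMIT_AS (the size of the address space) instead. The setrlimit(2) system call can control a number of other parameters as well, such as the maximum number of processes and the maximum file descriptor number, and the limits are inherited by child processes.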
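As a rough sketch of how this fits together with fork(2) and execve(2), the helper below (run_with_cpu_limit is a name invented for this example) sets RLIMIT_CPU in the child and then replaces the child with the target program. The choice of a hard limit one second above the soft limit is also just an assumption for illustration.

```c
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Run `path` with arguments `argv` under a CPU-time cap of `seconds`.
 * Returns the wait status of the child, or -1 if fork/waitpid fails.
 */
int run_with_cpu_limit(const char *path, char *const argv[], rlim_t seconds)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Soft limit: the kernel sends SIGXCPU.
           Hard limit (one second later here): the kernel sends SIGKILL. */
        struct rlimit cpu = { .rlim_cur = seconds, .rlim_max = seconds + 1 };
        if (setrlimit(RLIMIT_CPU, &cpu) == -1)
            _exit(126);
        execve(path, argv, environ);
        _exit(127); /* only reached if execve failed */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return status;
}
```

The caller can inspect the result with WIFSIGNALED(status) and compare WTERMSIG(status) against SIGXCPU to tell a time-limit violation apart from a normal exit. Other limits such as RLIMIT_AS or RLIMIT_NPROC can be set in the child in the same way before the execve(2).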
Restrict network access
If network namespaces are enabled in your kernel, you can use the unshare command to create a new network namespace, which has its own set of network interfaces, separate from those of the host system. By default it has no network connectivity, which means we have effectively cut off all network access for the program. It is as simple as running unshare -n ./myprocess (as root). If we replace ./myprocess with one of the programs above, we effectively get a sandboxed program that has no network access and, at the same time, is limited to a maximum CPU time of 1 second.
seccomp - Secure Computing Mode
seccomp is a simple sandboxing tool provided by the Linux kernel since version 2.6.12. When enabled on a process in strict mode, it allows only a handful of system calls: read(), write(), _exit() and sigreturn(). Any other system call gets the process killed, so a whole bunch of library functions like malloc(), which may need further system calls to obtain memory, become unusable once it is enabled.
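A small sketch of strict mode in action, assuming a kernel built with seccomp support. The function strict_mode_demo is a made-up name: it forks a child, switches the child to strict mode with prctl(2), and lets it attempt getpid(2), which is outside the allowed set.

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the signal that terminated the child (SIGKILL is expected),
 * 0 if the child exited normally, or -1 on fork/waitpid failure.
 */
int strict_mode_demo(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) == -1)
            _exit(1); /* seccomp not available */

        /* write(2) is still permitted in strict mode. */
        static const char msg[] = "inside strict mode\n";
        ssize_t n = write(STDERR_FILENO, msg, sizeof msg - 1);
        (void)n;

        /* getpid(2) is not permitted: the kernel kills the child here. */
        syscall(SYS_getpid);

        /* Not reached. A well-behaved strict-mode process would leave
           through the raw exit system call, because glibc's _exit()
           uses exit_group(2), which is not on the allowed list. */
        syscall(SYS_exit, 0);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

The parent should see the child terminated by SIGKILL, which is how the kernel enforces strict mode.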
A more recent extension called Seccomp-BPF, available since kernel version 3.5, allows filtering of system calls via a small filter program written for the Berkeley Packet Filter (BPF). It can be used to allow or deny an arbitrary set of system calls for a target process. Brendan Gregg has an awesome talk about BPF programs and how they can be used for filtering system calls or for tracing purposes. The seccomp(2) man page describes a lot of the low-level details. Seccomp-BPF is also used for sandboxing in Google Chrome and in Mozilla's Gecko engine. Mozilla's wiki page makes it much easier to understand how to create a basic filter and how it is used in Gecko.
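To make the filter idea concrete, here is a minimal sketch that makes socket(2) fail with EPERM and allows everything else. The function names are invented for this example, and it deliberately skips the architecture check that a real filter should perform on seccomp_data.arch, so treat it as an illustration rather than a hardened filter.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Install a filter on the calling process. Returns 0 on success, -1 on error. */
int deny_socket_syscall(void)
{
    struct sock_filter filter[] = {
        /* Load the system call number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* socket(2)? Continue to the ERRNO return; otherwise skip over it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* Everything else is allowed. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof filter / sizeof filter[0]),
        .filter = filter,
    };

    /* Required so that an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Returns 1 if socket(2) now fails with EPERM, 0 otherwise. */
int socket_is_blocked(void)
{
    errno = 0;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd != -1) {
        close(fd);
        return 0;
    }
    return errno == EPERM;
}
```

A filter is inherited across fork(2) and preserved across execve(2), so installing it right before the execve(2) of the untrusted program is the usual pattern. Writing raw BPF by hand gets tedious quickly; a library such as libseccomp is a common alternative.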
Virtualization is a well-known method: run the application process inside a VM with its own OS running on the host. While it is the strongest way to sandbox something, it is probably also the most heavyweight of all the approaches. You end up running an OS inside an OS, with all the hardware that the sandboxed process uses being virtualized.
Another recent development is containers, a lightweight way to isolate an application. They are based on newer kernel features like namespaces and cgroups. Namespaces make it appear to a process that it has its own copy of a resource, like the network or the filesystem. Cgroups set bounds on the resources, like memory or CPU, used by a group of processes. Remember the unshare command shown above? There is a corresponding system call of the same name, unshare(2), that creates a new namespace. Together with clone(2) and setns(2) it forms the namespaces API, which allows a process to create a new namespace with its own view of the network, the mount points, etc., all described in detail in the namespaces(7) man page. What we did in the previous example is equivalent to calling unshare(2) with the CLONE_NEWNET flag before running the program.
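In C, that call could look roughly like the sketch below. The helper name exec_without_network is an assumption made for this example, and the call needs CAP_SYS_ADMIN, which in practice usually means running as root.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

extern char **environ;

/*
 * Move the calling process into a fresh network namespace and then
 * replace it with the program at `path`.
 * Returns -1 on failure; does not return on success.
 */
int exec_without_network(const char *path, char *const argv[])
{
    /* The new namespace starts with no usable network interfaces. */
    if (unshare(CLONE_NEWNET) == -1) {
        perror("unshare(CLONE_NEWNET)");
        return -1;
    }
    execve(path, argv, environ);
    perror("execve"); /* only reached if execve failed */
    return -1;
}
```

Other flags such as CLONE_NEWNS (mount points) or CLONE_NEWPID can be OR-ed into the same unshare(2) call to isolate more resources at once.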
It turns out that sandboxing a userspace process is non-trivial, so the sandbox that I had made for the judge was not really foolproof. Real-world applications like Mozilla Firefox and Google Chrome implement sophisticated multi-level sandboxing based on seccomp-BPF and user namespaces. I have experimented a bit with these system calls from basic C programs and shell scripts for my own understanding, but a full end-to-end implementation needs to take a lot of specific details into account. This is something that I would like to work on in the future and integrate into the fully working judge (and I just added another item to an already long to-do list).