Seccomp Filtering
Canister uses seccomp BPF to restrict which Linux syscalls the sandboxed process can invoke. This document explains how the default baseline works, how recipes customize it, and the enforcement modes available.
Table of Contents
- How Seccomp Works in Canister
- Default Baseline
- Customizing via Recipes
- Always-Denied Syscalls
- Deny Action: Errno, Kill, and Strict Mode
- Monitor Mode and SECCOMP_RET_LOG
- Architecture Validation
- USER_NOTIF Supervisor
- Inspecting the Baseline
How Seccomp Works in Canister
Canister generates a classic BPF (Berkeley Packet Filter) program at runtime
and loads it via prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER) right before
execve().
Two enforcement modes
| Mode | Default action | Listed syscalls | Config value |
|---|---|---|---|
| Allow-list (default) | DENY | Only listed syscalls permitted | seccomp_mode = "allow-list" |
| Deny-list | ALLOW | Only listed syscalls blocked | seccomp_mode = "deny-list" |
Allow-list mode (recommended, default) inverts the security model:
every syscall not explicitly in the baseline (plus allow_extra) is denied.
This provides a much smaller kernel attack surface.
Deny-list mode is the permissive fallback: everything is allowed except
the syscalls in the deny list (plus deny_extra). Use this when you need
maximum compatibility with unknown workloads, at the cost of a larger
attack surface.
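The two modes differ only in how a syscall's presence in the list is interpreted. A minimal sketch of that decision, assuming hypothetical names (SeccompMode, is_permitted) that are not taken from Canister's source:

```rust
/// The two enforcement modes from the table above (hypothetical type name).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SeccompMode {
    AllowList,
    DenyList,
}

/// Returns true if `syscall` would be permitted under `mode`,
/// given the syscalls listed for that mode.
pub fn is_permitted(mode: SeccompMode, listed: &[&str], syscall: &str) -> bool {
    let is_listed = listed.contains(&syscall);
    match mode {
        // Allow-list: default action is DENY, only listed syscalls pass.
        SeccompMode::AllowList => is_listed,
        // Deny-list: default action is ALLOW, only listed syscalls are blocked.
        SeccompMode::DenyList => !is_listed,
    }
}
```

With the same list, the two modes give opposite answers for every syscall, which is why an unlisted syscall is the interesting case: allow-list mode rejects it, deny-list mode lets it through.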
The filter cannot be removed or modified after loading. The PR_SET_NO_NEW_PRIVS
flag is set first, which is required for unprivileged seccomp and also prevents
the sandboxed process from gaining new privileges via execve of setuid
binaries.
Default Baseline
Canister ships a single default seccomp baseline defined in
recipes/default.toml. The baseline is embedded in the binary at compile
time via include_str!(), so it always works standalone. At runtime, the
search path is checked for an external override:
- ./.canister/default.toml (project-local)
- $XDG_CONFIG_HOME/canister/recipes/default.toml (per-user)
- /etc/canister/recipes/default.toml (system-wide)
- Embedded fallback (compiled into the binary)
This lets teams pin, audit, or version-control the baseline independently of the binary.
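The lookup can be sketched as a first-match search over the candidate paths. The function names below are hypothetical, and the sketch assumes the per-user entry is simply skipped when XDG_CONFIG_HOME is unset, which the text above does not specify:

```rust
use std::path::PathBuf;

/// Candidate locations for an external default.toml, in the order listed above.
/// `xdg_config_home` is passed in (rather than read from the environment)
/// so the function stays pure and easy to test.
pub fn baseline_candidates(xdg_config_home: Option<&str>) -> Vec<PathBuf> {
    let mut paths = vec![PathBuf::from("./.canister/default.toml")];
    if let Some(xdg) = xdg_config_home {
        paths.push(PathBuf::from(xdg).join("canister/recipes/default.toml"));
    }
    paths.push(PathBuf::from("/etc/canister/recipes/default.toml"));
    paths
}

/// Picks the first candidate for which `exists` returns true.
/// `None` means no override was found, so the embedded baseline applies.
pub fn resolve_baseline(
    candidates: &[PathBuf],
    exists: impl Fn(&PathBuf) -> bool,
) -> Option<PathBuf> {
    candidates.iter().find(|p| exists(*p)).cloned()
}
```

In real use the `exists` predicate would be something like `|p| p.is_file()`; taking it as a parameter keeps the search order testable without touching the filesystem.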
The baseline provides:
- ~187 allowed syscalls — the common syscalls needed by most programs (read, write, open, mmap, clone, futex, getpgrp, etc.)
- ~18 always-denied syscalls — dangerous operations that no sandboxed process should ever need (reboot, kexec_load, mount, etc.)
The default.toml uses absolute [syscalls] allow = [...] and deny = [...]
fields. Regular recipes use the relative allow_extra / deny_extra fields
to layer overrides on top. These two modes are mutually exclusive — a
recipe either IS the baseline (uses allow/deny) or EXTENDS it (uses
allow_extra/deny_extra).
The baseline was derived by analyzing the syscall needs of Python, Node.js, Elixir/BEAM, and general-purpose binaries. The old 4-profile system (generic, python, node, elixir) was collapsed into this single baseline because:
- Python and Node were literally identical — same allow list, same deny list.
- The total delta across all 4 profiles was only 6 syscalls: ptrace, personality, seccomp, io_uring_setup, io_uring_enter, io_uring_register.
- The 4-profile taxonomy gave a false sense of specificity.
Recipes that need syscalls beyond the baseline use [syscalls] allow_extra.
Recipes that want tighter restrictions use [syscalls] deny_extra.
Customizing via Recipes
The [syscalls] section in a recipe TOML customizes the baseline:
[syscalls]
allow_extra = ["ptrace"] # add to the allow list
deny_extra = ["personality"] # add to deny list AND remove from allow list
seccomp_mode = "allow-list" # default; or "deny-list"
How overrides work:
- Start with the default baseline (ALLOW_BASE + DENY_ALWAYS).
- Add allow_extra syscalls to the allow list (deduplicated).
- Add deny_extra syscalls to the deny list.
- Remove deny_extra syscalls from the allow list (deny takes precedence).
- Generate the BPF filter from the final lists.
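The merge steps can be sketched as a small pure function. This is an illustration only, not Canister's SeccompProfile::apply_overrides(); the SyscallLists type and the free function below are hypothetical stand-ins:

```rust
/// Hypothetical stand-in for the resolved allow/deny lists.
#[derive(Clone, Debug, PartialEq)]
pub struct SyscallLists {
    pub allow: Vec<String>,
    pub deny: Vec<String>,
}

/// Layers allow_extra / deny_extra on top of a baseline.
pub fn apply_overrides(
    base: &SyscallLists,
    allow_extra: &[&str],
    deny_extra: &[&str],
) -> SyscallLists {
    let mut out = base.clone();
    // Add allow_extra to the allow list, skipping names already present.
    for name in allow_extra {
        if !out.allow.iter().any(|s| s.as_str() == *name) {
            out.allow.push((*name).to_string());
        }
    }
    // Add deny_extra to the deny list.
    for name in deny_extra {
        if !out.deny.iter().any(|s| s.as_str() == *name) {
            out.deny.push((*name).to_string());
        }
    }
    // Deny takes precedence: drop every deny_extra name from the allow list.
    out.allow.retain(|s| !deny_extra.contains(&s.as_str()));
    out
}
```

Because the removal runs last, a syscall that appears in both allow_extra and deny_extra ends up denied.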
Common recipes:
| Workload | allow_extra | deny_extra | Why |
|---|---|---|---|
| Python scripts | (none) | — | Default baseline is sufficient |
| Node.js builds | (none) | — | Default baseline is sufficient |
| Elixir/BEAM | ["ptrace"] | — | BEAM tools (:observer, :dbg, recon) need ptrace |
| Generic (permissive) | ["ptrace", "personality", "seccomp", "io_uring_setup", "io_uring_enter", "io_uring_register"] | — | Maximum compatibility |
| Hardened | — | ["personality"] | Block multilib/personality switching |
Always-Denied Syscalls
The default baseline includes ~16 syscalls that are always denied. These are dangerous kernel-level operations that a sandboxed process should never need:
| Syscall | Why it’s blocked |
|---|---|
| reboot | Reboots the system |
| kexec_load | Loads a new kernel |
| init_module | Loads a kernel module |
| finit_module | Loads a kernel module (from fd) |
| delete_module | Unloads a kernel module |
| swapon | Enables swap space |
| swapoff | Disables swap space |
| acct | Enables/disables process accounting |
| mount | Mounts a filesystem |
| umount2 | Unmounts a filesystem |
| pivot_root | Changes the root filesystem |
| chroot | Changes the root directory |
| syslog | Reads/controls kernel message buffer |
| settimeofday | Changes the system clock |
| unshare | Creates new namespaces (escape vector) |
| setns | Joins existing namespaces (escape vector) |
These are blocked because they represent operations that only system administrators should perform, and a sandboxed process has no legitimate reason to invoke them.
Deny Action: Errno, Kill, and Strict Mode
Canister supports three deny actions depending on the mode:
| Mode | Deny action | Behavior |
|---|---|---|
| Normal | SECCOMP_RET_ERRNO \| EPERM | Denied syscall returns -1 with errno = EPERM. Process survives. |
| Strict (--strict) | SECCOMP_RET_KILL_PROCESS | Process is immediately terminated with SIGSYS. |
| Monitor (--monitor) | SECCOMP_RET_LOG | Syscall is allowed but logged to kernel audit. |
Normal mode (default) uses Errno because:
- Most programs check return values and can handle EPERM gracefully.
- Kill mode makes debugging harder (process just dies with no error message).
- The denied syscalls are operations that programs generally don’t invoke accidentally. If a program calls reboot(), it’s intentional and getting EPERM back is the right response.
Strict mode (--strict) uses Kill because:
- In CI/production, a denied syscall indicates a policy violation or attack.
- Immediate termination prevents any further execution after a violation.
- The process cannot observe or react to the denial (no information leak).
The architecture validation check (wrong CPU architecture) always uses
SECCOMP_RET_KILL_PROCESS regardless of mode, since an architecture
mismatch indicates an actual attack (e.g., an attempt to bypass the filter through a 32-bit compat ABI).
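The mapping from run mode to seccomp return value can be sketched as follows. The constant values are the kernel's; the RunMode enum and function names are hypothetical and chosen for illustration:

```rust
// Kernel-defined seccomp return values (linux/seccomp.h).
pub const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
pub const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
pub const SECCOMP_RET_LOG: u32 = 0x7ffc_0000;
pub const EPERM: u32 = 1;

/// Hypothetical enum for the three modes in the table above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum RunMode {
    Normal,
    Strict,
    Monitor,
}

/// Return value the filter uses for a denied syscall.
pub fn deny_action(mode: RunMode) -> u32 {
    match mode {
        // The low 16 bits of an ERRNO action carry the errno value.
        RunMode::Normal => SECCOMP_RET_ERRNO | EPERM,
        RunMode::Strict => SECCOMP_RET_KILL_PROCESS,
        RunMode::Monitor => SECCOMP_RET_LOG,
    }
}

/// The architecture check kills the process in every mode.
pub fn arch_mismatch_action(_mode: RunMode) -> u32 {
    SECCOMP_RET_KILL_PROCESS
}
```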
Monitor Mode and SECCOMP_RET_LOG
When running with --monitor, the seccomp filter uses SECCOMP_RET_LOG
(0x7ffc0000) instead of SECCOMP_RET_ERRNO. This is a third deny action
mode:
| Mode | Return value | Behavior |
|---|---|---|
| Errno | SECCOMP_RET_ERRNO \| EPERM | Denied syscall returns EPERM |
| Kill | SECCOMP_RET_KILL_PROCESS | Process killed immediately |
| Log | SECCOMP_RET_LOG | Syscall is allowed but logged to kernel audit |
In Log mode, the BPF filter structure is identical to Errno mode — same architecture check, same deny list, same jump offsets. Only the return value for matched syscalls changes. This means the filter accurately reflects what would be blocked in enforcement mode.
Viewing logged syscalls:
# After running with --monitor
journalctl -k | grep seccomp
# or
dmesg | grep seccomp
Each log line shows the syscall number, PID, and other context. Map syscall
numbers back to names with ausyscall (from the auditd package):
ausyscall --dump | grep <number>
SECCOMP_RET_LOG is available since Linux 4.14 (well within the 5.6+
minimum kernel requirement).
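When triaging many log lines, extracting the syscall number programmatically can help. The sketch below assumes each line carries space-separated key=value tokens such as syscall=165 and pid=4242; the exact layout varies between kernels and audit setups, and the sample line in the usage note is illustrative rather than captured output:

```rust
/// Extracts the decimal value of `key` from a line of space-separated
/// `key=value` tokens. Returns None if the key is absent or not decimal.
pub fn audit_field(line: &str, key: &str) -> Option<u64> {
    line.split_whitespace()
        .filter_map(|tok| tok.split_once('='))
        .find(|(k, _)| *k == key)
        .and_then(|(_, v)| v.parse().ok())
}
```

For a line such as `audit: type=1326 pid=4242 comm="python3" sig=0 arch=c000003e syscall=165 compat=0 code=0x7ffc0000`, `audit_field(line, "syscall")` yields 165, which can then be passed to ausyscall.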
Architecture Validation
The BPF filter’s first check validates that the syscall comes from the expected CPU architecture:
- x86_64: AUDIT_ARCH_X86_64 (0xC000003E)
- aarch64: AUDIT_ARCH_AARCH64 (0xC00000B7)
If the architecture doesn’t match, the process is killed immediately
(SECCOMP_RET_KILL_PROCESS).
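Why this matters: on x86_64 the kernel can accept syscalls through more than one ABI. The 32-bit i386 compat ABI reports a different architecture value (AUDIT_ARCH_I386) and uses different syscall numbers than native x86_64, so without this check an attacker could invoke compat syscalls to bypass a filter whose comparisons assume x86_64 numbers. The x32 ABI (32-bit pointers in 64-bit mode) is a separate case: per seccomp(2) it reports the same AUDIT_ARCH_X86_64 value and is distinguished by __X32_SYSCALL_BIT (0x40000000) in the syscall number, so it has to be handled through the syscall number rather than the architecture field.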
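To make the filter layout concrete, here is a sketch that builds a classic BPF program with the architecture check first, followed by a chain of syscall-number comparisons, plus a tiny interpreter for the three opcodes used. It is an assumption about the general shape of such a filter, not Canister's generator; the Insn type, build_filter, and run are hypothetical, and the sketch limits the list to 255 entries because BPF jump offsets are 8-bit:

```rust
/// One classic BPF instruction (mirrors the kernel's struct sock_filter).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Insn { pub code: u16, pub jt: u8, pub jf: u8, pub k: u32 }

const LD_W_ABS: u16 = 0x20; // BPF_LD | BPF_W | BPF_ABS
const JEQ_K: u16 = 0x15;    // BPF_JMP | BPF_JEQ | BPF_K
const RET_K: u16 = 0x06;    // BPF_RET | BPF_K
const OFF_NR: u32 = 0;      // offset of seccomp_data.nr
const OFF_ARCH: u32 = 4;    // offset of seccomp_data.arch

pub const AUDIT_ARCH_X86_64: u32 = 0xC000_003E;
pub const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
pub const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;
pub const SECCOMP_RET_ERRNO_EPERM: u32 = 0x0005_0000 | 1;

/// Builds: arch check, then one comparison per listed syscall number.
/// A hit returns `on_match`; falling through returns `default`.
pub fn build_filter(arch: u32, listed: &[u32], on_match: u32, default: u32) -> Vec<Insn> {
    assert!(listed.len() <= 255, "jump offsets are 8-bit in this sketch");
    let n = listed.len();
    let mut prog = vec![
        Insn { code: LD_W_ABS, jt: 0, jf: 0, k: OFF_ARCH },
        Insn { code: JEQ_K, jt: 1, jf: 0, k: arch }, // match: skip the kill
        Insn { code: RET_K, jt: 0, jf: 0, k: SECCOMP_RET_KILL_PROCESS },
        Insn { code: LD_W_ABS, jt: 0, jf: 0, k: OFF_NR },
    ];
    for (i, &nr) in listed.iter().enumerate() {
        // On a hit, jump over the remaining comparisons and the default return.
        prog.push(Insn { code: JEQ_K, jt: (n - i) as u8, jf: 0, k: nr });
    }
    prog.push(Insn { code: RET_K, jt: 0, jf: 0, k: default });
    prog.push(Insn { code: RET_K, jt: 0, jf: 0, k: on_match });
    prog
}

/// Interprets the subset of BPF used above, for testing the layout.
pub fn run(prog: &[Insn], arch: u32, nr: u32) -> u32 {
    let (mut acc, mut pc) = (0u32, 0usize);
    loop {
        let insn = prog[pc];
        pc += 1;
        match insn.code {
            LD_W_ABS => acc = if insn.k == OFF_ARCH { arch } else { nr },
            JEQ_K => {
                let step = if acc == insn.k { insn.jt } else { insn.jf };
                pc += usize::from(step);
            }
            RET_K => return insn.k,
            other => panic!("unsupported opcode {:#x}", other),
        }
    }
}
```

Allow-list mode corresponds to `on_match = SECCOMP_RET_ALLOW` with a denying default; deny-list mode swaps the two. In the allow-list case, a syscall number with bit 30 set (as x32 calls have) matches no listed entry and falls through to the default deny.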
USER_NOTIF Supervisor
Classic BPF can only inspect the syscall number and architecture (seccomp_data.nr
and seccomp_data.arch). It cannot inspect syscall arguments — for pointer-based
arguments like connect()’s sockaddr or execve()’s pathname, the BPF filter
only sees the raw pointer value, not the data it points to.
Canister uses SECCOMP_RET_USER_NOTIF (Linux 5.9+) to bridge this gap. When the
sandboxed process invokes a syscall that requires argument inspection, the kernel
suspends the calling thread and delivers a notification to a supervisor process.
The supervisor reads the actual argument data (via /proc/<pid>/mem), makes a
policy decision, and sends an ALLOW or DENY verdict back to the kernel.
How it works
The supervisor runs as PID 1 inside the sandbox’s PID namespace, not as a thread in the parent. This architecture is required because of three cascading kernel restrictions:
- After unshare(CLONE_NEWPID), clone(CLONE_THREAD) returns EINVAL (pid_ns_for_children != task_active_pid_ns), so a supervisor thread cannot be spawned.
- The host’s procfs (s_user_ns = init_user_ns) denies /proc/<pid>/mem opens from a child user namespace, so the supervisor must mount its own procfs.
- PID 1 is an ancestor of all sandboxed processes, satisfying Yama ptrace_scope=1 without PR_SET_PTRACER.
PID 1 (supervisor, same user ns + PID ns)   PID 2+ (worker / sandboxed)
─────────────────────────────────────────   ────────────────────────────
1. unshare(CLONE_NEWNS)                     1. Sandbox setup
2. mount /proc (owned by user ns)              (overlay, pivot_root, etc.)
3. recv_fd() via SCM_RIGHTS                 2. seccomp() → notifier fd
   → notifier_fd                            3. send_fd() via SCM_RIGHTS
4. Loop:                                    4. Install main BPF filter
   a. poll(notifier_fd, 200ms)              5. execve()
   b. ioctl(NOTIF_RECV) → read notification
   c. open+read /proc/<pid>/mem
   d. Evaluate against policy
   e. ioctl(NOTIF_ID_VALID) → TOCTOU check
   f. ioctl(NOTIF_SEND) → verdict
   g. waitpid(WNOHANG) → check child status
The supervisor runs inline (single-threaded) using poll() with a 200ms timeout,
interleaved with non-blocking waitpid to detect when the worker exits. After the
worker exits, remaining in-flight notifications are drained before the supervisor
terminates.
Two-filter architecture
The worker installs two seccomp filters:
- Notifier filter (installed first via the seccomp() syscall with SECCOMP_FILTER_FLAG_NEW_LISTENER): Returns SECCOMP_RET_USER_NOTIF for the eight intercepted syscalls (connect, sendto, sendmsg, clone, clone3, socket, execve, execveat). All other syscalls return SECCOMP_RET_ALLOW.
- Main filter (installed second via prctl(PR_SET_SECCOMP)): The existing allow-list or deny-list BPF filter. Returns SECCOMP_RET_ERRNO, SECCOMP_RET_KILL_PROCESS, or SECCOMP_RET_LOG depending on mode.
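The kernel runs every installed filter on each syscall (most recently installed first) and applies the single result with the highest precedence. Per seccomp(2), SECCOMP_RET_USER_NOTIF outranks SECCOMP_RET_LOG and SECCOMP_RET_ALLOW but ranks below SECCOMP_RET_ERRNO and SECCOMP_RET_KILL_PROCESS. A notification therefore reaches the supervisor only when the main filter allows the intercepted syscall; if the main filter denies it, the deny action wins and the supervisor is never consulted.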
Intercepted syscalls
| Syscall | Argument inspected | Policy |
|---|---|---|
| connect() | sockaddr (destination address) | Allow only IPs pre-resolved from each [[host]] block’s domain and explicit allow_ips. Loopback and Unix domain sockets always allowed. |
| sendto() | dest_addr + msg_controllen | DNS queries on port 53 trigger supervisor-side resolution and dynamic allowlist population. Connected sockets (NULL dest_addr) allowed. |
| sendmsg() | msghdr struct (msg_controllen) | Blocks any sendmsg() with ancillary data (msg_controllen > 0), preventing SCM_RIGHTS fd passing regardless of outbound restriction settings. |
| clone() | flags (register value) | Deny namespace-creating flags: CLONE_NEWNS, CLONE_NEWCGROUP, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWUSER, CLONE_NEWPID, CLONE_NEWNET |
| clone3() | clone_args.flags (read from userspace struct) | Same flag check as clone(), read from the clone_args struct via /proc/<pid>/mem |
| socket() | domain, type, protocol (register values) | SOCK_RAW denied. AF_NETLINK restricted to NETLINK_ROUTE (protocol 0) only; all other netlink protocols denied. Normal TCP/UDP/Unix sockets allowed. |
| execve() | pathname (read from userspace string) | Validate against allow_execve paths. If allow_execve is empty, allow all. |
| execveat() | pathname (read from userspace string) | Same as execve(). Resolves the path relative to the dirfd argument. |
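Two of the register-level checks in the table are simple enough to sketch directly. The CLONE_* values are the kernel's; the function names and the NAMESPACE_FLAGS constant are hypothetical, and the sketch covers only the seven flags named in the table:

```rust
// Namespace-creating clone flags (values from linux/sched.h).
pub const CLONE_NEWNS: u64 = 0x0002_0000;
pub const CLONE_NEWCGROUP: u64 = 0x0200_0000;
pub const CLONE_NEWUTS: u64 = 0x0400_0000;
pub const CLONE_NEWIPC: u64 = 0x0800_0000;
pub const CLONE_NEWUSER: u64 = 0x1000_0000;
pub const CLONE_NEWPID: u64 = 0x2000_0000;
pub const CLONE_NEWNET: u64 = 0x4000_0000;

/// Union of the flags the table lists as denied.
pub const NAMESPACE_FLAGS: u64 = CLONE_NEWNS
    | CLONE_NEWCGROUP
    | CLONE_NEWUTS
    | CLONE_NEWIPC
    | CLONE_NEWUSER
    | CLONE_NEWPID
    | CLONE_NEWNET;

/// clone()/clone3(): allowed only if no namespace-creating flag is set.
pub fn clone_allowed(flags: u64) -> bool {
    (flags & NAMESPACE_FLAGS) == 0
}

/// sendmsg(): allowed only if the message carries no ancillary data.
pub fn sendmsg_allowed(msg_controllen: usize) -> bool {
    msg_controllen == 0
}
```

Ordinary fork-style and thread-style clone calls pass this check, since their flags (for example CLONE_VM or CLONE_THREAD) do not overlap with the namespace bits.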
TOCTOU protection
A time-of-check-time-of-use race exists: a multi-threaded sandboxed process
could modify the memory that the supervisor reads between the read and the
verdict. Canister mitigates this with SECCOMP_IOCTL_NOTIF_ID_VALID:
- Read notification (gets syscall args and a unique notification ID).
- Read memory via /proc/<pid>/mem for pointer-based arguments.
- Evaluate policy.
- Call ioctl(SECCOMP_IOCTL_NOTIF_ID_VALID, &id). If the kernel returns an error (ENOENT), the syscall was interrupted (the thread exited or the memory was unmapped) and the notification is stale. The supervisor skips sending a verdict.
- Send verdict.
This is the standard mitigation recommended by the seccomp_unotify(2) man page.
It is not airtight against a determined attacker with precise timing, but it
eliminates the most common race windows.
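The ordering of those steps can be sketched against an abstract notifier. The Notifier trait, Outcome enum, and FakeNotifier test double below are all hypothetical; a real implementation would back the trait methods with /proc/<pid>/mem reads and the SECCOMP_IOCTL_NOTIF_* ioctls:

```rust
/// Hypothetical abstraction over the notifier fd.
pub trait Notifier {
    /// Reads the argument bytes for notification `id`.
    fn read_args(&mut self, id: u64) -> Option<Vec<u8>>;
    /// Mirrors SECCOMP_IOCTL_NOTIF_ID_VALID: false once the notification is stale.
    fn id_valid(&mut self, id: u64) -> bool;
    /// Mirrors SECCOMP_IOCTL_NOTIF_SEND.
    fn send(&mut self, id: u64, allow: bool);
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Sent(bool),
    Skipped,
}

/// Read, evaluate, re-validate the ID, and only then send the verdict.
pub fn handle<N: Notifier>(n: &mut N, id: u64, policy: impl Fn(&[u8]) -> bool) -> Outcome {
    let args = match n.read_args(id) {
        Some(a) => a,
        None => return Outcome::Skipped,
    };
    let allow = policy(&args);
    // The validity check sits between the memory read and the verdict.
    if !n.id_valid(id) {
        return Outcome::Skipped;
    }
    n.send(id, allow);
    Outcome::Sent(allow)
}

/// In-memory test double used to exercise `handle`.
pub struct FakeNotifier {
    pub data: Vec<u8>,
    pub stale: bool,
    pub sent: Vec<(u64, bool)>,
}

impl Notifier for FakeNotifier {
    fn read_args(&mut self, _id: u64) -> Option<Vec<u8>> {
        Some(self.data.clone())
    }
    fn id_valid(&mut self, _id: u64) -> bool {
        !self.stale
    }
    fn send(&mut self, id: u64, allow: bool) {
        self.sent.push((id, allow));
    }
}
```

A stale notification produces no verdict at all, which matches the "skips sending a verdict" step in the list.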
CIDR matching
For connect() filtering, the supervisor supports both exact IP matches and CIDR
range matches (e.g., 10.0.0.0/8, 2606:2800:220:1::/64). The resolved IPs from
each [[host]] block’s domain are combined with any allow_ips CIDR ranges
from the config to build the allowlist. Loopback addresses (127.0.0.0/8,
::1) and AF_UNIX sockets are always permitted.
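A minimal sketch of the matching logic using only the standard library. The Cidr type and connect_allowed function are hypothetical names; the sketch treats a bare IP as a full-length prefix and does not handle IPv4-mapped IPv6 addresses:

```rust
use std::net::IpAddr;

/// An IP network: an exact address is a /32 (IPv4) or /128 (IPv6).
pub struct Cidr {
    pub network: IpAddr,
    pub prefix_len: u8,
}

impl Cidr {
    /// Parses "10.0.0.0/8", "2606:2800:220:1::/64", or a bare address.
    pub fn parse(s: &str) -> Option<Cidr> {
        let (addr, len) = match s.split_once('/') {
            Some((a, l)) => (a.parse::<IpAddr>().ok()?, l.parse::<u8>().ok()?),
            None => {
                let a = s.parse::<IpAddr>().ok()?;
                (a, if a.is_ipv4() { 32 } else { 128 })
            }
        };
        let max = if addr.is_ipv4() { 32 } else { 128 };
        if len > max {
            return None;
        }
        Some(Cidr { network: addr, prefix_len: len })
    }

    /// True if `ip` is in the same family and shares the network prefix.
    pub fn contains(&self, ip: IpAddr) -> bool {
        match (self.network, ip) {
            (IpAddr::V4(n), IpAddr::V4(a)) => {
                prefix_eq(u128::from(u32::from(n)), u128::from(u32::from(a)), self.prefix_len, 32)
            }
            (IpAddr::V6(n), IpAddr::V6(a)) => {
                prefix_eq(u128::from(n), u128::from(a), self.prefix_len, 128)
            }
            _ => false,
        }
    }
}

/// Compares the top `prefix_len` bits of two `width`-bit addresses.
fn prefix_eq(a: u128, b: u128, prefix_len: u8, width: u32) -> bool {
    if prefix_len == 0 {
        return true; // a /0 matches everything in the family
    }
    let shift = width - u32::from(prefix_len);
    (a >> shift) == (b >> shift)
}

/// Loopback is always permitted; otherwise the allowlist decides.
pub fn connect_allowed(ip: IpAddr, allow: &[Cidr]) -> bool {
    ip.is_loopback() || allow.iter().any(|c| c.contains(ip))
}
```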
DNS proxy integration
When the notifier is active, a DNS proxy runs in the parent process on
an ephemeral port. The sandbox’s /etc/resolv.conf points to pasta’s
DNS address (169.254.0.1:53), which is configured via --dns-forward
to forward queries to the parent’s DNS proxy. The proxy
only resolves domains that have a matching [[host]] block — all other
queries receive an NXDOMAIN response. This prevents DNS-based information
exfiltration and ensures the sandbox can only resolve allowed domains.
Configuration
The notifier is controlled by the notifier field in [syscalls]:
[syscalls]
notifier = true # force on; set to false to force off
# omit the key → auto-detect (default)
Auto-detection logic:
- If notifier is explicitly set in the config, that value is used.
- If running in monitor mode, the notifier is disabled (monitor mode uses SECCOMP_RET_LOG, which is incompatible with SECCOMP_RET_USER_NOTIF).
- Otherwise, the notifier is enabled if the kernel version is 5.9 or later (the minimum version that supports all required seccomp_unotify ioctls).
Kernel version detection reads /proc/sys/kernel/osrelease and parses the
major.minor version.
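The auto-detection rules can be sketched as a pure function over the osrelease string. The function names are hypothetical, and the parser assumes the string starts with major.minor (for example "6.8.0-45-generic"):

```rust
/// Parses the leading "major.minor" from an osrelease string.
pub fn parse_kernel_version(osrelease: &str) -> Option<(u32, u32)> {
    let mut parts = osrelease.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    // The minor component may carry a suffix such as "1-rc3"; keep the digits.
    let digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor = digits.parse().ok()?;
    Some((major, minor))
}

/// Applies the three auto-detection rules in order.
pub fn notifier_enabled(explicit: Option<bool>, monitor: bool, osrelease: &str) -> bool {
    // 1. An explicit config value always wins.
    if let Some(v) = explicit {
        return v;
    }
    // 2. Monitor mode disables the notifier.
    if monitor {
        return false;
    }
    // 3. Otherwise require kernel 5.9 or later; unparseable versions disable it.
    matches!(parse_kernel_version(osrelease), Some(v) if v >= (5, 9))
}
```

The caller would pass the contents of /proc/sys/kernel/osrelease as the last argument. Tuple comparison is lexicographic, so (5, 10) and (6, 0) both satisfy the 5.9 threshold.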
Requirements
- Linux 5.9+: for SECCOMP_IOCTL_NOTIF_RECV, SECCOMP_IOCTL_NOTIF_SEND, and SECCOMP_IOCTL_NOTIF_ID_VALID.
- PR_SET_NO_NEW_PRIVS must be set on the worker before installing the filter (already done by both the notifier and main filter installation paths). The supervisor (PID 1) must NOT have PR_SET_NO_NEW_PRIVS set, as it would break /proc/<pid>/mem access.
- AppArmor: the canister_sandboxed profile must allow ptrace (readby tracedby) from the canister peer profile. This is configured in the shipped canister.apparmor profile.
Inspecting the Baseline
List discovered recipes and the default baseline:
$ can recipe list
Discovered recipes:
elixir Elixir/Erlang (BEAM VM) — mix, iex, Phoenix
+ptrace recipes/elixir.toml
...
Default baseline: ~187 allowed, ~18 denied syscalls
Customize per-recipe with [syscalls] allow_extra / deny_extra
To see exactly which syscalls the baseline allows/blocks, open
recipes/default.toml. The [syscalls] allow array is the allow set,
[syscalls] deny is the deny set. The file is the single source of truth —
it is embedded into the binary at compile time via include_str!() and can
be overridden by placing a default.toml in the recipe search path
(./.canister/, $XDG_CONFIG_HOME/canister/recipes/, /etc/canister/recipes/).
SeccompProfile::apply_overrides() merges per-recipe allow_extra /
deny_extra customizations on top of this baseline.
To see the fully resolved policy (after all recipe merging and env var
expansion), use can recipe show:
$ can recipe show -r elixir
strict = false
[filesystem]
allow = ["/bin", "/sbin", ...]
[syscalls]
seccomp_mode = "allow-list"
allow_extra = ["ptrace"]
...
The output is valid TOML that can be saved as a standalone recipe file.