Seccomp Filtering

Canister uses seccomp BPF to restrict which Linux syscalls the sandboxed process can invoke. This document explains how the default baseline works, how recipes customize it, and the enforcement modes available.

How Seccomp Works in Canister

Canister generates a classic BPF (Berkeley Packet Filter) program at runtime and loads it via prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER) right before execve().

Two enforcement modes

| Mode | Default action | Listed syscalls | Config value |
|---|---|---|---|
| Allow-list (default) | DENY | Only listed syscalls permitted | seccomp_mode = "allow-list" |
| Deny-list | ALLOW | Only listed syscalls blocked | seccomp_mode = "deny-list" |

Allow-list mode (recommended, default) inverts the security model: every syscall not explicitly in the baseline (plus allow_extra) is denied. This provides a much smaller kernel attack surface.

Deny-list mode is the permissive fallback: everything is allowed except the syscalls in the deny list (plus deny_extra). Use this when you need maximum compatibility with unknown workloads, at the cost of a larger attack surface.

The filter cannot be removed or modified after loading. The PR_SET_NO_NEW_PRIVS flag is set first, which is required for unprivileged seccomp and also prevents the sandboxed process from gaining new privileges via execve of setuid binaries.


Default Baseline

Canister ships a single default seccomp baseline defined in recipes/default.toml. The baseline is embedded in the binary at compile time via include_str!(), so it always works standalone. At runtime, the search path is checked for an external override:

  1. ./.canister/default.toml (project-local)
  2. $XDG_CONFIG_HOME/canister/recipes/default.toml (per-user)
  3. /etc/canister/recipes/default.toml (system-wide)
  4. Embedded fallback (compiled into the binary)

This lets teams pin, audit, or version-control the baseline independently of the binary.
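
The lookup order above can be sketched as a first-match search with an embedded fallback. This is a minimal illustration, not Canister's actual loader: the function names, the EMBEDDED_BASELINE placeholder, and the choice to skip the per-user entry when XDG_CONFIG_HOME is unset are all assumptions.

```rust
use std::path::{Path, PathBuf};

/// Placeholder for the compiled-in baseline. The real binary embeds
/// recipes/default.toml via include_str!(); this literal is only a stand-in.
const EMBEDDED_BASELINE: &str = "[syscalls]\nallow = []\ndeny = []\n";

/// Candidate override locations, highest priority first.
/// Assumption: the per-user entry is skipped when XDG_CONFIG_HOME is unset.
fn candidate_paths(xdg_config_home: Option<&str>) -> Vec<PathBuf> {
    let mut paths = vec![PathBuf::from("./.canister/default.toml")];
    if let Some(xdg) = xdg_config_home {
        paths.push(Path::new(xdg).join("canister/recipes/default.toml"));
    }
    paths.push(PathBuf::from("/etc/canister/recipes/default.toml"));
    paths
}

/// Returns the first readable override together with its path,
/// or (None, embedded text) when no external file is found.
fn load_baseline(candidates: &[PathBuf]) -> (Option<PathBuf>, String) {
    for path in candidates {
        if let Ok(text) = std::fs::read_to_string(path) {
            return (Some(path.clone()), text);
        }
    }
    (None, EMBEDDED_BASELINE.to_string())
}
```

A caller would pass `std::env::var("XDG_CONFIG_HOME").ok().as_deref()` to `candidate_paths` and hand the result to `load_baseline`. Returning the winning path alongside the text makes it easy to log which baseline is in effect.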

The baseline provides:

  • ~187 allowed syscalls — the common syscalls needed by most programs (read, write, open, mmap, clone, futex, getpgrp, etc.)
  • ~18 always-denied syscalls — dangerous operations that no sandboxed process should ever need (reboot, kexec_load, mount, etc.)

The default.toml uses absolute [syscalls] allow = [...] and deny = [...] fields. Regular recipes use the relative allow_extra / deny_extra fields to layer overrides on top. These two modes are mutually exclusive — a recipe either IS the baseline (uses allow/deny) or EXTENDS it (uses allow_extra/deny_extra).
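
The mutual-exclusion rule can be expressed as a small validation step. The sketch below is illustrative only: SyscallsSection, RecipeKind, and classify are hypothetical names, and treating a recipe with no syscall fields at all as an extension with empty overrides is an assumption.

```rust
/// Hypothetical in-memory form of a recipe's [syscalls] table.
#[derive(Debug, Default)]
struct SyscallsSection {
    allow: Option<Vec<String>>,
    deny: Option<Vec<String>>,
    allow_extra: Option<Vec<String>>,
    deny_extra: Option<Vec<String>>,
}

#[derive(Debug, PartialEq)]
enum RecipeKind {
    /// Uses absolute allow/deny: the recipe IS the baseline.
    Baseline,
    /// Uses allow_extra/deny_extra (or nothing): the recipe EXTENDS the baseline.
    Extension,
}

/// Rejects recipes that mix absolute and relative fields.
fn classify(s: &SyscallsSection) -> Result<RecipeKind, String> {
    let absolute = s.allow.is_some() || s.deny.is_some();
    let relative = s.allow_extra.is_some() || s.deny_extra.is_some();
    match (absolute, relative) {
        (true, true) => Err(
            "allow/deny cannot be combined with allow_extra/deny_extra".to_string(),
        ),
        (true, false) => Ok(RecipeKind::Baseline),
        (false, _) => Ok(RecipeKind::Extension),
    }
}
```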

The baseline was derived by analyzing the syscall needs of Python, Node.js, Elixir/BEAM, and general-purpose binaries. The old 4-profile system (generic, python, node, elixir) was collapsed into this single baseline because:

  1. Python and Node were literally identical — same allow list, same deny list.
  2. The total delta across all 4 profiles was only 6 syscalls: ptrace, personality, seccomp, io_uring_setup, io_uring_enter, io_uring_register.
  3. The 4-profile taxonomy gave a false sense of specificity.

Recipes that need syscalls beyond the baseline use [syscalls] allow_extra. Recipes that want tighter restrictions use [syscalls] deny_extra.


Customizing via Recipes

The [syscalls] section in a recipe TOML customizes the baseline:

[syscalls]
allow_extra = ["ptrace"]           # add to the allow list
deny_extra  = ["personality"]      # add to deny list AND remove from allow list
seccomp_mode = "allow-list"        # default; or "deny-list"

How overrides work:

  1. Start with the default baseline (ALLOW_BASE + DENY_ALWAYS).
  2. Add allow_extra syscalls to the allow list (deduplicated).
  3. Add deny_extra syscalls to the deny list.
  4. Remove deny_extra syscalls from the allow list (deny takes precedence).
  5. Generate the BPF filter from the final lists.
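
Steps 1 to 4 above can be sketched as a pure function over two lists. This is a simplified model rather than the body of SeccompProfile::apply_overrides(): the Profile struct and the function signature are assumptions made for illustration.

```rust
/// Simplified stand-in for the resolved syscall policy.
#[derive(Debug, Clone, PartialEq)]
struct Profile {
    allow: Vec<String>,
    deny: Vec<String>,
}

/// Layers allow_extra / deny_extra on top of a baseline.
fn apply_overrides(base: &Profile, allow_extra: &[&str], deny_extra: &[&str]) -> Profile {
    // Step 1: start from the baseline lists.
    let mut allow = base.allow.clone();
    let mut deny = base.deny.clone();

    // Step 2: add allow_extra, skipping names that are already present.
    for name in allow_extra {
        if !allow.iter().any(|s| s.as_str() == *name) {
            allow.push(name.to_string());
        }
    }

    // Step 3: add deny_extra to the deny list.
    for name in deny_extra {
        if !deny.iter().any(|s| s.as_str() == *name) {
            deny.push(name.to_string());
        }
    }

    // Step 4: deny takes precedence, so drop deny_extra names from allow.
    allow.retain(|s| !deny_extra.contains(&s.as_str()));

    Profile { allow, deny }
}
```

For example, layering `allow_extra = ["ptrace"]` and `deny_extra = ["personality"]` on a baseline that allows `read` and `personality` yields an allow list of `read, ptrace`, with `personality` moved to the deny list.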

Common recipes:

| Workload | allow_extra | deny_extra | Why |
|---|---|---|---|
| Python scripts | (none) | | Default baseline is sufficient |
| Node.js builds | (none) | | Default baseline is sufficient |
| Elixir/BEAM | ["ptrace"] | | BEAM tools (:observer, :dbg, recon) need ptrace |
| Generic (permissive) | ["ptrace", "personality", "seccomp", "io_uring_setup", "io_uring_enter", "io_uring_register"] | | Maximum compatibility |
| Hardened | | ["personality"] | Block multilib/personality switching |

Always-Denied Syscalls

The default baseline includes ~18 syscalls that are always denied. These are dangerous kernel-level operations that a sandboxed process should never need. The table below describes 16 of them; recipes/default.toml holds the authoritative list:

| Syscall | Why it’s blocked |
|---|---|
| reboot | Reboots the system |
| kexec_load | Loads a new kernel |
| init_module | Loads a kernel module |
| finit_module | Loads a kernel module (from fd) |
| delete_module | Unloads a kernel module |
| swapon | Enables swap space |
| swapoff | Disables swap space |
| acct | Enables/disables process accounting |
| mount | Mounts a filesystem |
| umount2 | Unmounts a filesystem |
| pivot_root | Changes the root filesystem |
| chroot | Changes the root directory |
| syslog | Reads/controls kernel message buffer |
| settimeofday | Changes the system clock |
| unshare | Creates new namespaces (escape vector) |
| setns | Joins existing namespaces (escape vector) |

These are blocked because they represent operations that only system administrators should perform, and a sandboxed process has no legitimate reason to invoke them.


Deny Action: Errno, Kill, and Strict Mode

Canister supports three deny actions depending on the mode:

| Mode | Deny action | Behavior |
|---|---|---|
| Normal | SECCOMP_RET_ERRNO \| EPERM | Denied syscall returns -1 with errno = EPERM. Process survives. |
| Strict (--strict) | SECCOMP_RET_KILL_PROCESS | Process is immediately terminated with SIGSYS. |
| Monitor (--monitor) | SECCOMP_RET_LOG | Syscall is allowed but logged to kernel audit. |

Normal mode (default) uses Errno because:

  1. Most programs check return values and can handle EPERM gracefully.
  2. Kill mode makes debugging harder (process just dies with no error message).
  3. The denied syscalls are operations that programs generally don’t invoke accidentally – if a program calls reboot(), it’s intentional and getting EPERM back is the right response.

Strict mode (--strict) uses Kill because:

  1. In CI/production, a denied syscall indicates a policy violation or attack.
  2. Immediate termination prevents any further execution after a violation.
  3. The process cannot observe or react to the denial (no information leak).

The architecture validation check (wrong CPU architecture) always uses SECCOMP_RET_KILL_PROCESS regardless of mode, since an architecture mismatch indicates an actual attack (e.g., an attempt to bypass the filter through a different syscall ABI).
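
The mapping from run mode to the filter's deny return value can be sketched as below. The numeric constants are the values from the Linux seccomp UAPI header; the RunMode enum and function names are hypothetical, and how Canister resolves --strict combined with --monitor is not covered here.

```rust
// Values from <linux/seccomp.h>.
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
const SECCOMP_RET_LOG: u32 = 0x7ffc_0000;
/// Low 16 bits of the return value carry the errno for SECCOMP_RET_ERRNO.
const SECCOMP_RET_DATA: u32 = 0x0000_ffff;
const EPERM: u32 = 1;

/// Hypothetical representation of the three modes described above.
#[derive(Debug, Clone, Copy)]
enum RunMode {
    Normal,
    Strict,
    Monitor,
}

/// Return value the filter emits for a syscall that the policy denies.
fn deny_return_value(mode: RunMode) -> u32 {
    match mode {
        RunMode::Normal => SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA),
        RunMode::Strict => SECCOMP_RET_KILL_PROCESS,
        RunMode::Monitor => SECCOMP_RET_LOG,
    }
}

/// The architecture check ignores the mode and always kills.
fn arch_mismatch_return_value(_mode: RunMode) -> u32 {
    SECCOMP_RET_KILL_PROCESS
}
```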


Monitor Mode and SECCOMP_RET_LOG

When running with --monitor, the seccomp filter uses SECCOMP_RET_LOG (0x7ffc0000) instead of SECCOMP_RET_ERRNO. This is a third deny action mode:

| Mode | Return value | Behavior |
|---|---|---|
| Errno | SECCOMP_RET_ERRNO \| EPERM | Denied syscall returns EPERM |
| Kill | SECCOMP_RET_KILL_PROCESS | Process killed immediately |
| Log | SECCOMP_RET_LOG | Syscall is allowed but logged to kernel audit |

In Log mode, the BPF filter structure is identical to Errno mode — same architecture check, same deny list, same jump offsets. Only the return value for matched syscalls changes. This means the filter accurately reflects what would be blocked in enforcement mode.

Viewing logged syscalls:

# After running with --monitor
journalctl -k | grep seccomp
# or
dmesg | grep seccomp

Each log line shows the syscall number, PID, and other context. Map syscall numbers back to names with ausyscall (from the auditd package):

ausyscall --dump | grep <number>

SECCOMP_RET_LOG is available since Linux 4.14 (well within the 5.6+ minimum kernel requirement).


Architecture Validation

The BPF filter’s first check validates that the syscall comes from the expected CPU architecture:

  • x86_64: AUDIT_ARCH_X86_64 (0xC000003E)
  • aarch64: AUDIT_ARCH_AARCH64 (0xC00000B7)

If the architecture doesn’t match, the process is killed immediately (SECCOMP_RET_KILL_PROCESS).

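Why this matters: syscall numbers are architecture-specific. On an x86_64 kernel, a process can also issue 32-bit compatibility syscalls (for example through int 0x80), which are reported with AUDIT_ARCH_I386 and use different numbers than native x86_64. Without this check, an attacker could invoke those syscalls to bypass the filter, since the BPF checks are written against x86_64 numbers. The x32 ABI (32-bit pointers in 64-bit mode) is a separate case: it reports the same AUDIT_ARCH_X86_64 value and is distinguished only by the __X32_SYSCALL_BIT (0x40000000) set in the syscall number, so it has to be caught by the syscall-number checks rather than by the architecture check.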
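
To make the filter layout concrete, here is a sketch of an allow-list program that starts with the architecture check, followed by a tiny interpreter for exercising it in userspace. The opcode values and seccomp_data offsets are the standard classic-BPF and seccomp ones; the instruction layout, helper names, and the interpreter are assumptions for illustration and are not taken from Canister's generator.

```rust
/// Mirrors the kernel's struct sock_filter.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SockFilter { code: u16, jt: u8, jf: u8, k: u32 }

const BPF_LD_W_ABS: u16 = 0x20; // BPF_LD | BPF_W | BPF_ABS
const BPF_JEQ_K: u16 = 0x15;    // BPF_JMP | BPF_JEQ | BPF_K
const BPF_RET_K: u16 = 0x06;    // BPF_RET | BPF_K

const OFF_NR: u32 = 0;   // offsetof(struct seccomp_data, nr)
const OFF_ARCH: u32 = 4; // offsetof(struct seccomp_data, arch)

const AUDIT_ARCH_X86_64: u32 = 0xC000_003E;
const RET_ALLOW: u32 = 0x7fff_0000;
const RET_KILL_PROCESS: u32 = 0x8000_0000;
const RET_ERRNO_EPERM: u32 = 0x0005_0001;

fn stmt(code: u16, k: u32) -> SockFilter { SockFilter { code, jt: 0, jf: 0, k } }
fn jump(k: u32, jt: u8, jf: u8) -> SockFilter { SockFilter { code: BPF_JEQ_K, jt, jf, k } }

/// Builds: arch check, load nr, one JEQ per allowed syscall, deny, allow.
/// Jump offsets are 8-bit, so this simple layout supports at most 255 entries.
fn build_allow_list(arch: u32, allowed: &[u32], deny_ret: u32) -> Vec<SockFilter> {
    assert!(allowed.len() <= 255, "layout needs 8-bit jump offsets");
    let mut prog = vec![
        stmt(BPF_LD_W_ABS, OFF_ARCH),
        jump(arch, 1, 0),                 // match: skip the kill instruction
        stmt(BPF_RET_K, RET_KILL_PROCESS),
        stmt(BPF_LD_W_ABS, OFF_NR),
    ];
    let n = allowed.len();
    for (i, &nr) in allowed.iter().enumerate() {
        // On match, skip the remaining comparisons plus the deny return.
        prog.push(jump(nr, (n - i) as u8, 0));
    }
    prog.push(stmt(BPF_RET_K, deny_ret));
    prog.push(stmt(BPF_RET_K, RET_ALLOW));
    prog
}

/// Userspace interpreter for the three opcodes used above (testing aid only).
fn run(prog: &[SockFilter], nr: u32, arch: u32) -> u32 {
    let (mut acc, mut pc) = (0u32, 0usize);
    loop {
        let ins = prog[pc];
        match ins.code {
            BPF_LD_W_ABS => { acc = if ins.k == OFF_ARCH { arch } else { nr }; pc += 1; }
            BPF_JEQ_K => { pc += 1 + if acc == ins.k { ins.jt as usize } else { ins.jf as usize }; }
            BPF_RET_K => return ins.k,
            other => panic!("unsupported opcode {other:#x}"),
        }
    }
}
```

With `build_allow_list(AUDIT_ARCH_X86_64, &[0, 1, 60], RET_ERRNO_EPERM)` (read, write, and exit on x86_64), `run` returns RET_ALLOW for syscall 1, RET_ERRNO_EPERM for syscall 169 (reboot), and RET_KILL_PROCESS for any call made with a different arch value. Swapping only `deny_ret` leaves every jump offset unchanged.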


USER_NOTIF Supervisor

Classic BPF can only inspect the syscall number and architecture (seccomp_data.nr and seccomp_data.arch). It cannot inspect syscall arguments — for pointer-based arguments like connect()’s sockaddr or execve()’s pathname, the BPF filter only sees the raw pointer value, not the data it points to.

Canister uses SECCOMP_RET_USER_NOTIF (Linux 5.9+) to bridge this gap. When the sandboxed process invokes a syscall that requires argument inspection, the kernel suspends the calling thread and delivers a notification to a supervisor process. The supervisor reads the actual argument data (via /proc/<pid>/mem), makes a policy decision, and sends an ALLOW or DENY verdict back to the kernel.

How it works

The supervisor runs as PID 1 inside the sandbox’s PID namespace, not as a thread in the parent. This architecture is required because of three cascading kernel restrictions:

  1. After unshare(CLONE_NEWPID), clone(CLONE_THREAD) returns EINVAL (pid_ns_for_children != task_active_pid_ns), so a supervisor thread cannot be spawned.
  2. The host’s procfs (s_user_ns = init_user_ns) denies /proc/<pid>/mem opens from a child user namespace, so the supervisor must mount its own procfs.
  3. PID 1 is an ancestor of all sandboxed processes, satisfying Yama ptrace_scope=1 without PR_SET_PTRACER.

The supervisor and the worker each run the following sequence:
  PID 1 (supervisor, same user ns + PID ns)    PID 2+ (worker / sandboxed)
  ─────────────────────────────────────────    ────────────────────────────
  1. unshare(CLONE_NEWNS)                      1. Sandbox setup
  2. mount /proc (owned by user ns)               (overlay, pivot_root, etc.)
  3. recv_fd() via SCM_RIGHTS                  2. seccomp() → notifier fd
     → notifier_fd                             3. send_fd() via SCM_RIGHTS
  4. Loop:                                     4. Install main BPF filter
     a. poll(notifier_fd, 200ms)               5. execve()
     b. ioctl(NOTIF_RECV) → read notification
     c. open+read /proc/<pid>/mem
     d. Evaluate against policy
     e. ioctl(NOTIF_ID_VALID) → TOCTOU check
     f. ioctl(NOTIF_SEND) → verdict
     g. waitpid(WNOHANG) → check child status

The supervisor runs inline (single-threaded) using poll() with a 200ms timeout, interleaved with non-blocking waitpid to detect when the worker exits. After the worker exits, remaining in-flight notifications are drained before the supervisor terminates.

Two-filter architecture

The worker installs two seccomp filters:

  1. Notifier filter (installed first via seccomp() syscall with SECCOMP_FILTER_FLAG_NEW_LISTENER): Returns SECCOMP_RET_USER_NOTIF for the eight intercepted syscalls (connect, sendto, sendmsg, clone, clone3, socket, execve, execveat). All other syscalls return SECCOMP_RET_ALLOW.

  2. Main filter (installed second via prctl(PR_SET_SECCOMP)): The existing allow-list or deny-list BPF filter. Returns SECCOMP_RET_ERRNO, SECCOMP_RET_KILL_PROCESS, or SECCOMP_RET_LOG depending on mode.

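The kernel runs every installed filter for each syscall and applies the highest-precedence result. In that ordering SECCOMP_RET_KILL_PROCESS and SECCOMP_RET_ERRNO outrank SECCOMP_RET_USER_NOTIF, which in turn outranks SECCOMP_RET_LOG and SECCOMP_RET_ALLOW. A notification therefore reaches the supervisor only when the main filter does not deny the syscall: an intercepted syscall that the main filter allows is handed to the supervisor for the final decision, while one that the main filter denies is rejected without a notification.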

Intercepted syscalls

| Syscall | Argument inspected | Policy |
|---|---|---|
| connect() | sockaddr (destination address) | Allow only IPs pre-resolved from each [[host]] block’s domain and explicit allow_ips. Loopback and Unix domain sockets always allowed. |
| sendto() | dest_addr + msg_controllen | DNS queries on port 53 trigger supervisor-side resolution and dynamic allowlist population. Connected sockets (NULL dest_addr) allowed. |
| sendmsg() | msghdr struct (msg_controllen) | Blocks any sendmsg() with ancillary data (msg_controllen > 0), preventing SCM_RIGHTS fd passing regardless of outbound restriction settings. |
| clone() | flags (register value) | Deny namespace-creating flags: CLONE_NEWNS, CLONE_NEWCGROUP, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWUSER, CLONE_NEWPID, CLONE_NEWNET |
| clone3() | clone_args.flags (read from userspace struct) | Same flag check as clone(), read from the clone_args struct via /proc/<pid>/mem |
| socket() | domain, type, protocol (register values) | SOCK_RAW denied. AF_NETLINK restricted to NETLINK_ROUTE (protocol 0) only; all other netlink protocols denied. Normal TCP/UDP/Unix sockets allowed. |
| execve() | pathname (read from userspace string) | Validate against allow_execve paths. If allow_execve is empty, allow all. |
| execveat() | pathname (read from userspace string) | Same as execve(). Resolves the path relative to the dirfd argument. |
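
The register-only checks for clone() and socket() reduce to bit tests. The sketch below uses the flag values from the Linux UAPI headers; the function names are hypothetical, and evaluating AF_NETLINK by protocol before applying the SOCK_RAW rule is an assumption about ordering that the table does not specify.

```rust
// Namespace-creating clone flags, from <linux/sched.h>.
const CLONE_NEWNS: u64 = 0x0002_0000;
const CLONE_NEWCGROUP: u64 = 0x0200_0000;
const CLONE_NEWUTS: u64 = 0x0400_0000;
const CLONE_NEWIPC: u64 = 0x0800_0000;
const CLONE_NEWUSER: u64 = 0x1000_0000;
const CLONE_NEWPID: u64 = 0x2000_0000;
const CLONE_NEWNET: u64 = 0x4000_0000;

const NAMESPACE_FLAGS: u64 = CLONE_NEWNS
    | CLONE_NEWCGROUP
    | CLONE_NEWUTS
    | CLONE_NEWIPC
    | CLONE_NEWUSER
    | CLONE_NEWPID
    | CLONE_NEWNET;

// Socket constants, from <sys/socket.h> and <linux/netlink.h>.
const AF_NETLINK: u64 = 16;
const SOCK_RAW: u64 = 3;
const NETLINK_ROUTE: u64 = 0;
/// Strips SOCK_NONBLOCK / SOCK_CLOEXEC from the type argument.
const SOCK_TYPE_MASK: u64 = 0xf;

/// clone()/clone3(): allowed only if no namespace-creating flag is set.
fn clone_allowed(flags: u64) -> bool {
    (flags & NAMESPACE_FLAGS) == 0
}

/// socket(): netlink is judged by protocol; elsewhere raw sockets are denied.
/// Assumption: the netlink rule is evaluated before the SOCK_RAW rule.
fn socket_allowed(domain: u64, sock_type: u64, protocol: u64) -> bool {
    if domain == AF_NETLINK {
        return protocol == NETLINK_ROUTE;
    }
    (sock_type & SOCK_TYPE_MASK) != SOCK_RAW
}
```

For clone3() the same `clone_allowed` test would apply once the flags field has been read out of the userspace clone_args struct.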

TOCTOU protection

A time-of-check-time-of-use race exists: a multi-threaded sandboxed process could modify the memory that the supervisor reads between the read and the verdict. Canister mitigates this with SECCOMP_IOCTL_NOTIF_ID_VALID:

  1. Read notification (gets syscall args and a unique notification ID).
  2. Read memory via /proc/<pid>/mem for pointer-based arguments.
  3. Evaluate policy.
  4. Call ioctl(SECCOMP_IOCTL_NOTIF_ID_VALID, &id). If the kernel returns an error (ENOENT), the notification is stale: the target thread was killed or its syscall was interrupted by a signal. The supervisor skips sending a verdict.
  5. Send verdict.

This is the standard mitigation recommended by the seccomp_unotify(2) man page. It is not airtight against a determined attacker with precise timing, but it eliminates the most common race windows.
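
The ordering of the five steps can be illustrated with the kernel interaction hidden behind a trait. Everything here is a hypothetical model: the Kernel trait, the FakeKernel stand-in, and the choice to deny when memory cannot be read are assumptions, and a real supervisor would back these methods with the seccomp ioctls and /proc/<pid>/mem.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Verdict { Allow, Deny }

struct Notification { id: u64, pid: u32, arg_ptr: u64 }

/// Hypothetical abstraction over the notifier fd and /proc/<pid>/mem.
trait Kernel {
    fn recv(&mut self) -> Option<Notification>;
    fn read_mem(&mut self, pid: u32, addr: u64) -> Option<Vec<u8>>;
    fn id_valid(&mut self, id: u64) -> bool;
    fn send(&mut self, id: u64, verdict: Verdict);
}

/// Handles one notification. Returns None if nothing was pending or the
/// notification went stale before a verdict could be sent.
fn handle_one<K: Kernel>(k: &mut K, policy: impl Fn(&[u8]) -> Verdict) -> Option<Verdict> {
    let n = k.recv()?;                                  // step 1
    let verdict = match k.read_mem(n.pid, n.arg_ptr) {  // step 2
        Some(bytes) => policy(&bytes),                  // step 3
        None => Verdict::Deny, // assumption: unreadable memory is denied
    };
    if !k.id_valid(n.id) {                              // step 4
        return None; // stale: skip the verdict entirely
    }
    k.send(n.id, verdict);                              // step 5
    Some(verdict)
}

/// In-memory stand-in used to exercise the ordering without a kernel.
struct FakeKernel {
    pending: Option<Notification>,
    memory: Vec<u8>,
    still_valid: bool,
    sent: Vec<(u64, Verdict)>,
}

impl Kernel for FakeKernel {
    fn recv(&mut self) -> Option<Notification> { self.pending.take() }
    fn read_mem(&mut self, _pid: u32, _addr: u64) -> Option<Vec<u8>> { Some(self.memory.clone()) }
    fn id_valid(&mut self, _id: u64) -> bool { self.still_valid }
    fn send(&mut self, id: u64, verdict: Verdict) { self.sent.push((id, verdict)); }
}
```

The validity check sits after the memory read and before the send, so a verdict is never issued for a notification whose target has already gone away.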

CIDR matching

For connect() filtering, the supervisor supports both exact IP matches and CIDR range matches (e.g., 10.0.0.0/8, 2606:2800:220:1::/64). The resolved IPs from each [[host]] block’s domain are combined with any allow_ips CIDR ranges from the config to build the allowlist. Loopback addresses (127.0.0.0/8, ::1) and AF_UNIX sockets are always permitted.
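
A minimal sketch of the matching logic, using only the standard library, might look like this. The Cidr type and function names are hypothetical; exact IPs are modelled as /32 or /128 ranges, and IPv4-mapped IPv6 addresses are not given any special handling here.

```rust
use std::net::IpAddr;

/// A network address plus prefix length. A bare IP parses as /32 or /128.
#[derive(Debug, Clone, Copy)]
struct Cidr { network: IpAddr, prefix: u8 }

impl Cidr {
    fn parse(s: &str) -> Option<Cidr> {
        let (network, prefix) = match s.split_once('/') {
            Some((addr, len)) => (addr.parse::<IpAddr>().ok()?, len.parse::<u8>().ok()?),
            None => {
                let addr = s.parse::<IpAddr>().ok()?;
                (addr, if addr.is_ipv4() { 32 } else { 128 })
            }
        };
        let max = if network.is_ipv4() { 32 } else { 128 };
        if prefix > max {
            return None;
        }
        Some(Cidr { network, prefix })
    }

    /// True if `ip` shares the first `prefix` bits with the network address.
    fn contains(&self, ip: IpAddr) -> bool {
        match (self.network, ip) {
            (IpAddr::V4(net), IpAddr::V4(addr)) => {
                let mask = if self.prefix == 0 { 0 } else { u32::MAX << (32 - self.prefix as u32) };
                (u32::from(net) & mask) == (u32::from(addr) & mask)
            }
            (IpAddr::V6(net), IpAddr::V6(addr)) => {
                let mask = if self.prefix == 0 { 0 } else { u128::MAX << (128 - self.prefix as u32) };
                (u128::from(net) & mask) == (u128::from(addr) & mask)
            }
            _ => false, // address families differ
        }
    }
}

/// Loopback is always permitted; otherwise the IP must fall in the allowlist.
fn connect_allowed(ip: IpAddr, allowlist: &[Cidr]) -> bool {
    ip.is_loopback() || allowlist.iter().any(|c| c.contains(ip))
}
```

Representing resolved host IPs and allow_ips ranges with the same type lets the supervisor keep a single list and a single lookup path.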

DNS proxy integration

When the notifier is active, a DNS proxy runs in the parent process on an ephemeral port. The sandbox’s /etc/resolv.conf points to pasta’s DNS address (169.254.0.1:53), which is configured via --dns-forward to forward queries to the parent’s DNS proxy. The proxy only resolves domains that have a matching [[host]] block — all other queries receive an NXDOMAIN response. This prevents DNS-based information exfiltration and ensures the sandbox can only resolve allowed domains.

Configuration

The notifier is controlled by the notifier field in [syscalls]:

[syscalls]
notifier = true       # force on
# notifier = false    # force off
# omit the key        → auto-detect (default)

Auto-detection logic:

  1. If notifier is explicitly set in the config, that value is used.
  2. If running in monitor mode, the notifier is disabled (monitor mode uses SECCOMP_RET_LOG, which is incompatible with SECCOMP_RET_USER_NOTIF).
  3. Otherwise, the notifier is enabled if the kernel version is 5.9 or later (the minimum version that supports all required seccomp_unotify ioctls).

Kernel version detection reads /proc/sys/kernel/osrelease and parses the major.minor version.
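
The three auto-detection rules and the version parse can be sketched as follows. The function names are hypothetical, the osrelease string is passed in rather than read from /proc so the logic stays testable, and disabling the notifier when the version cannot be parsed is an assumption.

```rust
/// Extracts (major, minor) from an osrelease string such as "5.15.0-91-generic".
fn parse_kernel_version(osrelease: &str) -> Option<(u32, u32)> {
    let mut parts = osrelease.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    // The minor field may carry a suffix (e.g. "1-rc3"); keep leading digits only.
    let digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor = digits.parse().ok()?;
    Some((major, minor))
}

/// Applies the rules in order: explicit config, monitor mode, kernel version.
/// Assumption: an unparseable version string disables the notifier.
fn notifier_enabled(explicit: Option<bool>, monitor: bool, osrelease: &str) -> bool {
    if let Some(value) = explicit {
        return value;
    }
    if monitor {
        return false;
    }
    matches!(parse_kernel_version(osrelease), Some(v) if v >= (5, 9))
}
```

A caller would read the string with `std::fs::read_to_string("/proc/sys/kernel/osrelease")` and pass it in. Comparing `(major, minor)` tuples gives the correct lexicographic ordering, so 6.1 counts as newer than 5.9.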

Requirements

  • Linux 5.9+ — for SECCOMP_IOCTL_NOTIF_RECV, SECCOMP_IOCTL_NOTIF_SEND, and SECCOMP_IOCTL_NOTIF_ID_VALID.
  • PR_SET_NO_NEW_PRIVS must be set on the worker before installing the filter (already done by both the notifier and main filter installation paths). The supervisor (PID 1) must NOT have PR_SET_NO_NEW_PRIVS set, as it would break /proc/<pid>/mem access.
  • AppArmor — the canister_sandboxed profile must allow ptrace (readby tracedby) from the canister peer profile. This is configured in the shipped canister.apparmor profile.

Inspecting the Baseline

List discovered recipes and the default baseline:

$ can recipe list
Discovered recipes:

  elixir               Elixir/Erlang (BEAM VM) — mix, iex, Phoenix
                       +ptrace                        recipes/elixir.toml
  ...

Default baseline: ~187 allowed, ~18 denied syscalls
  Customize per-recipe with [syscalls] allow_extra / deny_extra

To see exactly which syscalls the baseline allows/blocks, open recipes/default.toml. The [syscalls] allow array is the allow set, [syscalls] deny is the deny set. The file is the single source of truth — it is embedded into the binary at compile time via include_str!() and can be overridden by placing a default.toml in the recipe search path (./.canister/, $XDG_CONFIG_HOME/canister/recipes/, /etc/canister/recipes/).

SeccompProfile::apply_overrides() merges per-recipe allow_extra / deny_extra customizations on top of this baseline.

To see the fully resolved policy (after all recipe merging and env var expansion), use can recipe show:

$ can recipe show -r elixir
strict = false

[filesystem]
allow = ["/bin", "/sbin", ...]

[syscalls]
seccomp_mode = "allow-list"
allow_extra = ["ptrace"]
...

The output is valid TOML that can be saved as a standalone recipe file.