Canister

A lightweight sandbox for running untrusted code safely on Linux.

Canister (can) runs any command inside an isolated sandbox with restricted filesystem, network, and syscall access. No root required. Single binary, zero runtime dependencies.

$ can run --recipe recipes/example.toml -- python3 untrusted_script.py

The script sees an empty filesystem (except explicitly allowed paths), can only reach allowed domains, and is blocked from dangerous syscalls.

Key Features

  • Namespace isolation — mount, PID, network, user, and UTS namespaces
  • Filesystem control — read-only bind mounts with explicit write paths
  • Network filtering — DNS-level domain filtering + IP/CIDR rules
  • Seccomp BPF — syscall allow/deny lists with optional SECCOMP_RET_USER_NOTIF supervisor
  • L7 egress proxy + DLP — TLS-terminating proxy with credential pattern detection and per-service domain scoping; canary tokens and session entropy budgeting catch exfiltration attempts
  • Recipe composition — layered TOML configs that merge predictably
  • Zero dependencies — static Rust binary, no daemon, no root

Documentation Structure

This documentation is organized into three sections:

  • User Guide — conceptual explanations, getting started, configuration patterns
  • Reference — auto-generated from source code: CLI flags, config schema, recipes, merge semantics
  • Architecture — system design overview and Architecture Decision Records (ADRs)

Getting Started

Installation

Download the latest binary from GitHub Releases:

# Download and extract
curl -fsSL https://github.com/dergraf/canister/releases/latest/download/canister-x86_64-linux.tar.gz \
  | tar xz -C ~/.local/bin

# Verify
can --version

Or build from source:

git clone https://github.com/dergraf/canister.git
cd canister
cargo build --release
cp target/release/can ~/.local/bin/

First-time Setup

Run the setup command to configure your system for unprivileged user namespaces:

can setup

Quick Start

Run a command inside a sandbox:

can run -- ls /

This runs ls / inside an isolated environment with the default recipe applied. The sandbox restricts filesystem access, blocks network traffic, and filters syscalls.

Using Recipes

Recipes are TOML files that define sandbox policies. Use built-in recipes or write your own:

# List available built-in recipes
can recipe list

# Run with a specific recipe
can run --recipe python -- python3 script.py

# Auto-detect recipe from command
can run -- python3 script.py

See Configuration for the full configuration guide and Built-in Recipes for all available recipes.

Project Manifests

For projects that need reproducible sandbox configurations, create a canister.toml manifest:

[sandbox.dev]
recipes = ["python", "network-curl"]
command = "python3 script.py"

[[sandbox.dev.host]]
domain = "pypi.org"

[[sandbox.dev.host]]
domain = "files.pythonhosted.org"

Then use can up to launch the sandbox:

can up dev

See the Manifest Reference for full schema documentation.

Configuration Reference

Canister uses TOML configuration files with strict schema validation. Unknown fields are rejected at parse time.

When no config file is provided (can run -- command), the default policy uses proxy-only egress with strict filesystem defaults and the default seccomp baseline.


Project Manifest (canister.toml)

A project manifest declares named sandboxes for a project. Instead of remembering which -r flags to pass, you define sandboxes once in canister.toml and run them with can up.

Place canister.toml in your project root (next to .git/). Canister discovers it by walking up from the current directory, similar to .gitignore.

Manifest Format

[sandbox.dev]
description = "Neovim + Elixir development"
recipes = ["neovim", "elixir", "nix"]
command = "nvim"

[sandbox.dev.filesystem]
allow_write = ["$HOME/.local/share/nvim"]

[[sandbox.dev.host]]
domain = "api.myproject.dev"

[sandbox.test]
description = "Mix test runner"
recipes = ["elixir", "nix"]
command = "mix test"

[sandbox.ci]
description = "CI — strict, no network"
recipes = ["elixir", "nix", "generic-strict"]
command = "mix test --cover"
strict = true

[sandbox.ci.resources]
memory_mb = 2048
cpu_percent = 100

[sandbox.<name>] fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| description | string | No | Human-readable description |
| recipes | string[] | Yes | Recipe names to compose (resolved via recipe search path) |
| command | string | Yes | Command to run (may include arguments) |
| strict | bool | No | Override strict mode for this sandbox |

Override sections:

Each sandbox can include optional override sections that merge on top of the composed recipes. These use the same schema as recipe files:

  • [sandbox.<name>.filesystem]: allow, allow_write, deny
  • [sandbox.<name>.network]: egress, allow_ips, ports, contract_mode
  • [[sandbox.<name>.host]]: one or more per-destination contracts (see [[host]] below)
  • [sandbox.<name>.process]: max_pids, allow_execve, env_passthrough
  • [sandbox.<name>.resources]: memory_mb, cpu_percent
  • [sandbox.<name>.syscalls]: allow_extra, deny_extra, seccomp_mode, notifier

Overrides follow the same merge semantics as recipe composition: Vec fields are unioned, scalar fields use last-Some-wins, strict uses OR.

Validation rules:

  • At least one [sandbox.<name>] must be defined.
  • Each sandbox must have recipes = [...] with at least one entry.
  • Each sandbox must have a non-empty command.
  • Unknown fields are rejected (deny_unknown_fields).
  • Mixing absolute (allow/deny) and relative (allow_extra/deny_extra) syscall fields in a sandbox’s [syscalls] section is an error.

can up

Run a named sandbox from the manifest:

# Run the default sandbox (alphabetically first).
can up

# Run a specific sandbox by name.
can up dev
can up test
can up ci

# With CLI overrides.
can up dev --strict
can up dev --monitor
can up test -p 4000:4000

Default sandbox: When no name is given, can up uses the alphabetically first sandbox. Use descriptive names so the default is predictable (e.g., dev sorts before test).

Error handling: If canister.toml is not found, can up prints an error suggesting can run for ad-hoc use. If the named sandbox doesn’t exist, it lists available sandbox names.
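The validation rules and default-sandbox selection described above can be sketched in Python. This is an illustration only, not Canister's implementation: the function names are hypothetical, and a plain dict stands in for a parsed canister.toml.

```python
def validate_manifest(manifest: dict) -> list[str]:
    """Return rule violations for a parsed manifest (hypothetical helper)."""
    errors = []
    sandboxes = manifest.get("sandbox", {})
    if not sandboxes:
        errors.append("at least one [sandbox.<name>] must be defined")
    for name, sb in sandboxes.items():
        if not sb.get("recipes"):
            errors.append(f"sandbox '{name}': recipes needs at least one entry")
        if not str(sb.get("command", "")).strip():
            errors.append(f"sandbox '{name}': command must be non-empty")
    return errors


def default_sandbox(manifest: dict) -> str:
    """Name used when `can up` gets no argument: the alphabetically first."""
    return sorted(manifest["sandbox"])[0]
```

With sandboxes named dev, test, and ci, the sketch selects ci, which is worth keeping in mind when naming sandboxes.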

Dry-Run Preview

Use --dry-run to see the fully resolved policy without running anything:

can up dev --dry-run
can up ci --dry-run

The output shows the merged result of base.toml + auto-detected recipes + manifest recipes + manifest overrides, including filesystem paths, network domains, syscall overrides, and resource limits.

Composition Order (can up)

base.toml
  → auto-detected recipes (match_prefix against command binary)
  → recipes listed in manifest (left to right)
  → manifest overrides ([sandbox.<name>.filesystem], etc.)
  = final SandboxConfig

This is the same merge chain as can run, except the explicit -r flags are replaced by the manifest’s recipes = [...] list, and manifest overrides act as the final layer.

Design note: Package manager recipes (nix, homebrew, etc.) should be listed explicitly in recipes = [...]. While auto-detection via match_prefix still works for the command binary, explicit declaration preserves the principle of least privilege — auditors can see exactly which recipes are composed by reading canister.toml.


Recipe Composition

Canister supports composing multiple recipes via repeated -r / --recipe flags. Recipes are merged left-to-right into a single resolved config.

Composition order: base.toml → auto-detected recipes → explicit --recipe args.

base.toml provides essential OS bind mounts and is always loaded first (embedded in the binary, overridable on disk). Auto-detected recipes are matched by match_prefix before explicit recipes are applied. The default.toml seccomp baseline is resolved separately by the seccomp module and is NOT part of this composition chain.

# base.toml (always) → nix.toml (auto-detected) → elixir.toml (explicit)
can run -r elixir -- mix test    # mix resolves to /nix/store/..., nix.toml auto-detected

# Explicit composition
can run -r nix -r elixir -- mix test
can run -r cargo -r generic-strict -- cargo build

Merge Semantics

When multiple recipes are merged, each field type follows a specific strategy:

| Field type | Strategy | Example |
| --- | --- | --- |
| Vec fields (paths, domains, syscalls, env vars) | Union: deduplicated, order preserved | Two recipes allowing /a and /b → ["/a", "/b"] |
| strict (Option<bool>) | OR: any Some(true) wins, can never be loosened | Recipe A: strict = true, Recipe B: omitted → true |
| egress (Option<EgressMode>) | Last-Some-wins: None preserves earlier value | Recipe A: egress = "proxy-only", Recipe B: egress = "direct" → direct |
| seccomp_mode (Option<SeccompMode>) | Last-Some-wins | Same as egress |
| Numeric (max_pids, memory_mb, cpu_percent) | Last-Some-wins | Recipe A: max_pids = 64, Recipe B: max_pids = 128 → 128 |
| RecipeMeta | Overlay: later recipe’s metadata wins if present | |

The “last-Some-wins” strategy means None (field not specified) preserves the value from an earlier recipe, while Some(value) overwrites it.
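As a rough model of these strategies, the following Python sketch merges two recipes represented as dicts. It is not Canister's code (which is Rust); the helper names and the handful of fields shown are chosen for illustration, with None playing the role of an unset Option.

```python
from typing import Optional


def union(a: list, b: list) -> list:
    """Vec fields: union, deduplicated, first-seen order preserved."""
    out = []
    for item in a + b:
        if item not in out:
            out.append(item)
    return out


def last_some_wins(earlier, later):
    """Scalar fields: a later None keeps the earlier value."""
    return later if later is not None else earlier


def merge_strict(earlier: Optional[bool], later: Optional[bool]) -> Optional[bool]:
    """strict: OR, so a True from any recipe is never loosened."""
    if earlier is None and later is None:
        return None
    return bool(earlier) or bool(later)


def merge(a: dict, b: dict) -> dict:
    """Merge recipe b on top of recipe a (illustrative subset of fields)."""
    return {
        "allow": union(a.get("allow", []), b.get("allow", [])),
        "strict": merge_strict(a.get("strict"), b.get("strict")),
        "egress": last_some_wins(a.get("egress"), b.get("egress")),
        "max_pids": last_some_wins(a.get("max_pids"), b.get("max_pids")),
    }
```

Folding merge over the chain (base.toml, auto-detected recipes, explicit recipes) left to right gives the resolved config under this model.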

Name-Based Lookup

The -r argument is resolved as follows:

  1. If the argument contains / or ends with .toml, treat as a file path.
  2. Otherwise, search for <name>.toml in the recipe search path:
    • ./.canister/
    • $XDG_CONFIG_HOME/canister/recipes/
    • /etc/canister/recipes/
  3. First match wins (project-local takes precedence over user-global).

can run -r elixir -- mix test              # name lookup → elixir.toml
can run -r recipes/custom.toml -- mix test # file path
can run -r ./my-policy.toml -- echo hi     # file path (contains /)
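The lookup steps can be approximated with a short Python sketch. It models only the on-disk search path listed above and ignores recipes embedded in the binary. The function names are hypothetical, and the fallback to ~/.config when XDG_CONFIG_HOME is unset is an assumption based on the XDG convention, not something this page states.

```python
import os
from pathlib import Path
from typing import Optional


def search_path() -> list[Path]:
    """Recipe search path, highest precedence first."""
    # Assumption: use ~/.config when XDG_CONFIG_HOME is unset.
    xdg = os.environ.get("XDG_CONFIG_HOME") or os.path.expanduser("~/.config")
    return [
        Path(".canister"),
        Path(xdg) / "canister" / "recipes",
        Path("/etc/canister/recipes"),
    ]


def resolve_recipe(arg: str) -> Optional[Path]:
    """Resolve a -r argument to a file path, or None if nothing matches."""
    if "/" in arg or arg.endswith(".toml"):
        return Path(arg)  # treated as a file path
    for directory in search_path():
        candidate = directory / f"{arg}.toml"
        if candidate.is_file():
            return candidate  # first match wins
    return None
```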

Auto-Detection via match_prefix

Recipes can declare match_prefix patterns in their [recipe] metadata. During CLI setup (before forking), the command binary path is resolved and canonicalized. Each discovered recipe’s match_prefix is checked against the resolved path. Matching recipes are automatically merged into the chain between base.toml and explicit -r args.

This replaces the previous hardcoded detect_command_prefix() logic. Adding support for a new package manager is “write a .toml file” rather than “modify Rust code”.
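A minimal Python sketch of the matching step, for illustration only. The recipes argument (recipe name mapped to already-expanded match_prefix entries) is a hypothetical data shape, and matching on a path-component boundary is an assumption; this page only says the prefix is checked against the resolved path.

```python
import os
import shutil


def auto_detect(command: str, recipes: dict[str, list[str]]) -> list[str]:
    """Return names of recipes whose match_prefix covers the resolved binary."""
    located = shutil.which(command) or command
    resolved = os.path.realpath(located)  # canonicalize: resolve symlinks
    matches = []
    for name, prefixes in recipes.items():
        for prefix in prefixes:
            # Assumption: /nix/store should not match /nix/store-extra.
            base = prefix.rstrip("/")
            if resolved == base or resolved.startswith(base + "/"):
                matches.append(name)
                break
    return matches
```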

Environment Variable Expansion

Recipe paths support environment variable expansion:

| Syntax | Expansion |
| --- | --- |
| $HOME | Value of $HOME |
| $USER | Value of $USER |
| ${XDG_CONFIG_HOME} | Value of $XDG_CONFIG_HOME |
| $$ | Literal $ |

Expansion applies to [filesystem].allow, [filesystem].deny, [process].allow_execve, and [recipe].match_prefix. It is performed during config resolution (after merge, before the sandbox uses the paths).

[filesystem]
allow = ["$HOME/.cargo", "$HOME/.rustup", "$HOME/project"]

[recipe]
match_prefix = ["$HOME/.cargo"]
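The three syntaxes in the table can be modeled with one regular expression. The Python sketch below is an approximation, not Canister's parser: expand is a hypothetical name, and replacing unset variables with an empty string is an assumption, since this page does not say how they are handled.

```python
import os
import re

# Matches $$ (literal dollar), ${NAME}, or $NAME.
_VAR = re.compile(
    r"\$(?:(\$)|\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))"
)


def expand(path: str, env=None) -> str:
    """Expand $VAR, ${VAR}, and $$ in a recipe path."""
    env = os.environ if env is None else env

    def repl(match: re.Match) -> str:
        if match.group(1):
            return "$"
        name = match.group(2) or match.group(3)
        # Assumption: unset variables expand to an empty string.
        return env.get(name, "")

    return _VAR.sub(repl, path)
```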

[recipe] (metadata)

Optional metadata section for recipe files. Not used for policy enforcement but controls recipe discovery and composition behavior.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| name | string (optional) | | Human-readable recipe name |
| description | string (optional) | | Short description shown by can recipe list |
| match_prefix | string[] | [] | Path prefixes for auto-detection (env vars expanded) |

[recipe]
name = "nix"
description = "Nix package manager (/nix/store)"
match_prefix = ["/nix/store"]

[filesystem]

Controls what the sandboxed process can see and access on the filesystem.

When filesystem isolation is active (requires a MAC policy on Ubuntu 24.04+ and Fedora 41+), the sandbox starts with an empty tmpfs root. Only explicitly allowed paths and essential system paths are bind-mounted read-only.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| allow | string[] | [] | Paths to bind-mount read-only into the sandbox |
| deny | string[] | [] | Paths explicitly denied (checked before allow) |

Behavior:

  • Deny rules take precedence over allow rules.
  • Paths are matched by prefix: allowing /usr/lib also allows /usr/lib/python3.
  • Essential paths are defined in recipes/base.toml (embedded in the binary, overridable on disk) and always bind-mounted: /bin, /sbin, /usr/bin, /usr/sbin, /lib, /lib64, /usr/lib, /etc.
  • Auto-detection: When the command binary lives outside standard FHS paths (e.g., installed via Nix, Homebrew, Cargo, etc.), Canister auto-detects the appropriate package manager recipe via match_prefix and merges it into the recipe chain, bringing the necessary mount paths automatically. See Auto-Detection via match_prefix.
  • When filesystem isolation is blocked (MAC system blocks mounts), the sandbox aborts. Run sudo can setup to install the security policy (use --force to reinstall if the policy is outdated).

[filesystem]
allow = ["/usr/lib", "/usr/bin", "/tmp/workspace", "/home/user/data"]
deny  = ["/etc/shadow", "/etc/passwd", "/root", "/home/user/.ssh"]
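The precedence rule (deny first, then prefix-matched allow) can be expressed as a small Python predicate. This is a model of the rule only; Canister enforces it through bind mounts rather than per-path checks. The function names are hypothetical, and matching on a "/" boundary (so /usr/lib does not cover /usr/libexec) is an assumption.

```python
def under(path: str, prefix: str) -> bool:
    """True if path equals prefix or lies beneath it."""
    base = prefix.rstrip("/")
    return path == prefix or path == base or path.startswith(base + "/")


def is_visible(path: str, allow: list[str], deny: list[str]) -> bool:
    """Deny is checked first; otherwise an allow prefix must cover the path."""
    if any(under(path, d) for d in deny):
        return False
    return any(under(path, a) for a in allow)
```

Under this model, allowing /home/user while denying /home/user/.ssh keeps the SSH directory hidden, because the deny entry is consulted before any allow entry.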

Package Manager Support

When the command binary is installed outside standard system paths, Canister uses recipe-based auto-detection to ensure the binary is visible inside the sandbox. Each package manager has a recipe with match_prefix patterns:

| Recipe | Auto-detects when binary is under | Mounts |
| --- | --- | --- |
| nix.toml | /nix/store | /nix/store (read-only) |
| homebrew.toml | /opt/homebrew, /home/linuxbrew/.linuxbrew | The matching prefix |
| cargo.toml | $HOME/.cargo, $HOME/.rustup | $HOME/.cargo, $HOME/.rustup |
| snap.toml | /snap | /snap |
| flatpak.toml | /var/lib/flatpak, $HOME/.local/share/flatpak | The matching prefix |
| gnu-store.toml | /gnu/store | /gnu/store |

How it works:

  1. The command path is canonicalized (all symlinks resolved) at startup.
  2. Each discovered recipe’s match_prefix is checked against the resolved path.
  3. Matching recipes are merged into the composition chain, bringing their [filesystem].allow paths, [process].allow_execve entries, and any other policy fields.
  4. For content-addressed stores like /nix/store, the entire tree is mounted. Binaries reference sibling store entries freely, making individual-entry mounting impractical.

Security note: Auto-detection makes the prefix visible inside the sandbox but does not grant execution permission. The [process] allow_execve list independently controls what binaries can be executed. Package manager recipes include allow_execve prefix rules (e.g., /nix/store/*) to authorize execution within the mounted tree.

Adding a new package manager: Create a new .toml recipe with appropriate match_prefix, [filesystem].allow, and [process].allow_execve entries. No Rust code changes needed.


[network]

Controls network access. Secure by default: all network access is denied unless explicitly allowed.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| egress | "proxy-only", "none", or "direct" | "proxy-only" | Outbound networking mode |
| allow_ips | string[] | [] | Allowed IPs or CIDR ranges (IPv4 and IPv6) |
| ports | string[] | [] | Port forwarding specs ([ip:]hostPort:containerPort[/protocol]) |
| contract_mode | "strict" or "relaxed" | "strict" | Default for hosts without a [[host]] block. strict refuses; relaxed allows + logs. |

FQDN egress goes through the top-level [[host]] table, not this section. Each [[host]] block names a domain and the request shapes accepted on it; see that section for the full schema.

Network mode determination:

The effective egress mode determines isolation behavior:

| egress | Mode | Description |
| --- | --- | --- |
| none | None | No outbound network. Empty network namespace, loopback only. |
| proxy-only | Filtered | Outbound traffic must go through local proxy (kernel-enforced). |
| direct | Full/Filtered | Direct outbound. If allowlists/ports are set, filtered mode is used for policy checks; otherwise full host network namespace. |

Specifying ports automatically upgrades None mode to Filtered mode (port forwarding requires a functional network namespace with pasta).

Domain matching:

A [[host]] domain matches the domain itself and its subdomains (see the domain field under [[host]] for the exact rules). Unrelated domains are never implied: allowing pypi.org does not allow files.pythonhosted.org, so each distinct domain needs its own [[host]] block.

IP/CIDR matching:

IPs support both exact match and CIDR notation:

[network]
allow_ips = [
    "93.184.216.34",        # exact IPv4
    "10.0.0.0/8",           # IPv4 CIDR
    "2606:2800:220:1::/64", # IPv6 CIDR
]
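Exact-IP and CIDR matching of this kind can be reproduced with Python's standard ipaddress module. The sketch below shows the matching rule only and is not Canister's implementation; ip_allowed is a hypothetical name.

```python
import ipaddress


def ip_allowed(address: str, allow_ips: list[str]) -> bool:
    """True if address equals an allowed IP or falls inside an allowed CIDR."""
    addr = ipaddress.ip_address(address)
    for entry in allow_ips:
        # A bare IP parses as a /32 (or /128) network; strict=False also
        # tolerates entries with host bits set, such as "10.1.2.3/8".
        network = ipaddress.ip_network(entry, strict=False)
        if addr.version == network.version and addr in network:
            return True
    return False
```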

Filtered mode requirements:

Filtered mode requires pasta (from the passt project) installed on the host:

sudo apt install passt       # Debian/Ubuntu
sudo dnf install passt       # Fedora

In filtered mode, pasta mirrors the host’s network configuration into the sandbox namespace, copying the host’s real IP addresses and routes. DNS is handled via a link-local address:

| Address | Role |
| --- | --- |
| Host’s default gateway | Gateway |
| 169.254.0.1 | DNS server (link-local, pasta --dns) |
| Host’s real IP | Sandbox IP (mirrored from host) |

[network]
egress = "proxy-only"

[[host]]
domain = "pypi.org"

[[host]]
domain = "files.pythonhosted.org"

[[host]]
domain = "registry.npmjs.org"

Port forwarding (ports):

Port forwarding uses Docker/Podman-compatible syntax and is specified via the -p / --port CLI flag or the ports config field:

# CLI usage
can run -p 8080:80 -- my-server
can run -p 127.0.0.1:3000:3000 -p 5432:5432/tcp -- my-app

# Config usage
[network]
egress = "proxy-only"
ports = ["8080:80", "127.0.0.1:3000:3000", "5353:53/udp"]

Syntax: [ip:]hostPort:containerPort[/protocol]

| Component | Required | Default | Description |
| --- | --- | --- | --- |
| ip | No | 0.0.0.0 | Host IP to bind (e.g., 127.0.0.1) |
| hostPort | Yes | | Port on the host |
| containerPort | Yes | | Port inside the sandbox |
| protocol | No | tcp | tcp or udp |
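To make the syntax concrete, here is a small Python parser for port specs that applies the defaults from the table. It is a sketch rather than Canister's parser: PortSpec and parse_port are hypothetical names, and bracketed IPv6 host addresses are left out of scope.

```python
from typing import NamedTuple


class PortSpec(NamedTuple):
    ip: str
    host_port: int
    container_port: int
    protocol: str


def parse_port(spec: str) -> PortSpec:
    """Parse [ip:]hostPort:containerPort[/protocol]."""
    rest, _, proto = spec.partition("/")
    protocol = proto or "tcp"
    if protocol not in ("tcp", "udp"):
        raise ValueError(f"unknown protocol: {protocol!r}")
    # Split from the right so an IPv4 host address stays in one piece.
    parts = rest.rsplit(":", 2)
    if len(parts) == 2:
        ip, host, container = "0.0.0.0", parts[0], parts[1]
    elif len(parts) == 3:
        ip, host, container = parts
    else:
        raise ValueError(f"invalid port spec: {spec!r}")
    host_port, container_port = int(host), int(container)
    for port in (host_port, container_port):
        if not 1 <= port <= 65535:
            raise ValueError(f"port out of range: {port}")
    return PortSpec(ip, host_port, container_port, protocol)
```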

[[host]]

Per-destination egress contract. Write one [[host]] block per FQDN the sandbox may reach. The block answers every question about that upstream in one place: whether the sandbox may connect, which request shapes are legitimate, and which DLP detectors have their verdicts on this host downgraded from Block to Warn.

There is no separate connect-permission list. Having a [[host]] block at all is the permission to dial; the block’s other fields tighten what’s allowed from there. The minimum block is one line (domain = "x") — equivalent to “allow this host, any shape.”

# Minimum-viable allow.
[[host]]
domain = "static.example.com"

# Full picture for a service we care about.
[[host]]
domain             = "api.github.com"
methods            = ["GET", "POST", "PATCH", "PUT", "DELETE"]
content_types      = ["application/json", "application/vnd.github+json"]
paths              = ["/repos/", "/user/", "/orgs/"]
max_request_bytes  = 1_048_576                       # 1 MiB
allow_credentials  = ["github_pat"]                  # downgrade github_pat hits to Warn here

# Per-host escape hatch.
[[host]]
domain        = "weird-tool.corp.internal"
contract_mode = "relaxed"

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| domain | string | required | FQDN this block applies to. Wildcards (*.github.com) match one or more subdomain levels; bare domains match exact + any subdomain. Most-specific match wins. |
| methods | string[] | [] | Allowed HTTP methods (case-insensitive). Empty = any. |
| content_types | string[] | [] | Allowed request Content-Type values (matched on mime/subtype portion; ; charset=... parameters are ignored). Empty = any. |
| paths | string[] | [] | Path prefixes the request URI must start with. Empty = any. |
| max_request_bytes | u64 | unset | Per-host request body cap. Applies after the global max_streamed_body_bytes. |
| allow_credentials | string[] | [] | DLP detector ids whose verdicts on this host downgrade from Block to Warn (e.g. ["github_pat"] means the worker may legitimately carry a github PAT in Authorization to this host). |
| contract_mode | "strict" or "relaxed" | inherit [network] contract_mode | Per-host override of the global default. Only affects the unknown-host decision once you’re inside this block; field-level checks still run. |

Multiple [[host]] entries with the same domain merge by: union on vec fields, max for max_request_bytes, last-Some-wins for contract_mode. A project recipe can extend (never silently restrict) a canister-shipped contract by writing another [[host]] with the same domain.
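The domain-matching rules from the field table can be sketched in Python as follows. This is an interpretation, not Canister's code: the function names are hypothetical, and treating "most specific" as "longest pattern after stripping a leading *." is an assumption, since the tie-breaking rule is not spelled out here.

```python
from typing import Optional


def matches(pattern: str, host: str) -> bool:
    """Apply the domain rules to one pattern."""
    pattern, host = pattern.lower().rstrip("."), host.lower().rstrip(".")
    if pattern.startswith("*."):
        # Wildcard: one or more subdomain levels, but not the bare domain.
        suffix = pattern[1:]  # ".github.com"
        return host.endswith(suffix) and len(host) > len(suffix)
    # Bare domain: exact match or any subdomain.
    return host == pattern or host.endswith("." + pattern)


def best_match(host: str, patterns: list[str]) -> Optional[str]:
    """Pick the winning pattern for host, or None if no block applies."""
    candidates = [p for p in patterns if matches(p, host)]
    # Assumption: longest pattern (ignoring "*.") is the most specific.
    return max(candidates, key=lambda p: len(p.removeprefix("*.")), default=None)
```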

Refusal behaviour

If a request reaches the proxy with a destination that has no matching [[host]] block, the gate decides based on [network] contract_mode:

  • strict (default) — refuse with 415 (or 413 for body-size), x-canister-error: contract-refused, and a response body that carries the exact [[host]] patch to paste into canister.toml to allow this exact shape.
  • relaxed — allow the request but emit an unknown_host_contract tracing event. Intended for prototyping where the upstream set isn’t known up front.

See docs/refusals.md for the operator-facing walkthrough (415 vs 451, how to read the patch, escape hatches).

Shipped service contracts

Canister ships contracts for the upstreams workers most commonly hit under recipes/services/: github.toml, openai.toml, anthropic.toml, npm.toml, pypi.toml, huggingface.toml, docker.toml, aws.toml, stripe.toml, slack.toml. Compose them with -r service:github (etc.) or via a project manifest.


[network.dlp]

Data Loss Prevention layer running inside the L7 egress proxy. Scans outbound HTTP traffic for credential patterns (GitHub PATs, npm tokens, AWS keys, Slack tokens, SSH private keys, generic bearer tokens) and enforces per-host credential scoping via the allow_credentials field on each [[host]] block — a GitHub PAT bound for registry.npmjs.org will be blocked even though both hosts are reachable. See DLP for the full threat model and detector list.

DLP only runs when traffic is inspectable, i.e. when network.egress = "proxy-only".

[network.dlp]
enabled = true
canary_tokens = true              # default when DLP is enabled
max_decode_depth = 32             # base64/hex/percent recursion cap
decompress = true                 # gzip/deflate/brotli before scan
dns_entropy_threshold = 4.5       # Shannon entropy per DNS label
session_entropy_budget = 8192     # cumulative high-entropy bytes/session

# Extend credential scope by adding allow_credentials on the host:
[[host]]
domain            = "github.corp.example.com"
methods           = ["GET", "POST", "PATCH", "PUT", "DELETE"]
content_types     = ["application/json"]
allow_credentials = ["github_pat"]

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable DLP scanning. Implicitly true under --strict when egress = "proxy-only". |
| canary_tokens | bool | true when DLP enabled | Inject fake credentials into the sandbox env and treat any outbound appearance as exfiltration. |
| max_decode_depth | usize | 32 | Encoding chain recursion depth (base64 / hex / percent). |
| decompress | bool | true | Inflate gzip / deflate / brotli bodies before scanning. |
| dns_entropy_threshold | f64 | 4.5 | Shannon entropy per DNS label above which the hostname is blocked. |
| session_entropy_budget | u64 | 8192 | Cumulative high-entropy bytes allowed across one sandbox session before further requests are blocked. |

Credential-flow scope is configured per host via the allow_credentials field on [[host]].
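For intuition about dns_entropy_threshold, the Python sketch below computes per-character Shannon entropy for each DNS label. It illustrates the standard formula only; whether Canister computes the value in exactly this way is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(label: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not label:
        return 0.0
    total = len(label)
    counts = Counter(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def hostname_exceeds(hostname: str, threshold: float = 4.5) -> bool:
    """True if any DNS label's entropy is above the threshold."""
    return any(shannon_entropy(label) > threshold for label in hostname.split("."))
```

Under this formula a label's entropy is at most log2 of its number of distinct characters, so a label needs at least 23 distinct characters to exceed 4.5 bits. Ordinary hostnames stay well below that; long random-looking labels of the kind used for DNS tunnelling do not.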

Built-in scopes

Each detector has a baseline list of home domains hardcoded in the detector registry — destinations where it’s universally legitimate for that credential type to flow:

| Detector | Built-in home domains |
| --- | --- |
| github_pat | github.com, *.github.com |
| npm_token | registry.npmjs.org |
| aws_access_key | *.amazonaws.com |
| slack_token | *.slack.com |
| bearer_token | (none; requires explicit allow_credentials = ["bearer_token"] on the host) |
| ssh_private_key, canary_token | (none; always block) |
| generic_high_entropy | (warn only, block in --strict) |

Add to this set per-host via allow_credentials on the relevant [[host]] block. Built-in lists are never narrowed.

Merge semantics

| Field | Merge rule |
| --- | --- |
| enabled | OR: any Some(true) wins (security escalation, never reversed) |
| canary_tokens | OR |
| max_decode_depth, decompress, dns_entropy_threshold, session_entropy_budget | last-Some-wins |

A downstream recipe can never disable DLP that an upstream recipe enabled.

Interaction with --strict / --monitor

  • --strict implicitly enables DLP (when egress = "proxy-only") and promotes generic_high_entropy from warn to block.
  • --monitor logs DLP findings at warn! level but forwards the request, adding an x-canister-dlp-warning header so the sandboxed process can observe what would have been blocked.

On block, the proxy returns 451 Unavailable For Legal Reasons with headers x-canister-error: dlp-blocked and x-canister-dlp-detector: <name>.


[process]

Controls process creation, environment filtering, and executable restrictions.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_pids | int (optional) | none | Maximum number of processes (via RLIMIT_NPROC) |
| allow_execve | string[] | [] | Executables the sandbox may exec (empty = allow all) |
| env_passthrough | string[] | [] | Environment variables to pass from host (all others stripped) |

PID namespace isolation:

Every sandboxed process runs in its own PID namespace. The sandboxed command becomes PID 1 and cannot see or signal any host processes.

Environment filtering:

When env_passthrough is empty, the sandbox starts with a completely clean environment — zero host environment variables are inherited. This is the most secure default.

When env_passthrough contains variable names, only those variables are kept. A minimal PATH=/usr/local/bin:/usr/bin:/bin is injected if PATH is not in the passthrough list.
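The filtering rule can be written as a few lines of Python. This is a literal reading of the two paragraphs above, not Canister's implementation: filter_env is a hypothetical name, and returning a completely empty environment (without an injected PATH) when env_passthrough is empty is an assumption.

```python
DEFAULT_PATH = "/usr/local/bin:/usr/bin:/bin"


def filter_env(host_env: dict[str, str], env_passthrough: list[str]) -> dict[str, str]:
    """Keep only passthrough variables; inject a minimal PATH if not listed."""
    if not env_passthrough:
        # Assumption: an empty list means nothing at all is inherited.
        return {}
    env = {k: v for k, v in host_env.items() if k in env_passthrough}
    if "PATH" not in env_passthrough:
        env["PATH"] = DEFAULT_PATH
    return env
```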

max_pids enforcement:

Uses RLIMIT_NPROC to cap the number of processes. When exceeded, fork() returns EAGAIN. This is a per-UID limit, which is effective inside the sandbox’s user namespace (where the process runs as UID 0 mapped to the host user).

allow_execve validation:

When non-empty, the resolved command path must match one of the listed paths. If the command is not in the allow list, execution is rejected before forking.

Prefix rules: Entries ending in /* match any binary under that directory tree. For example, /nix/store/* allows any binary whose resolved path starts with /nix/store/. The match requires a / boundary — /nix/store-extra/foo does NOT match /nix/store/*. This is essential for content-addressed stores like Nix where binary paths contain unpredictable hashes.
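The exact-path and prefix rules can be checked with a short Python function. It is a sketch of the matching logic described above, not the supervisor's code, and execve_allowed is a hypothetical name.

```python
def execve_allowed(resolved: str, allow_execve: list[str]) -> bool:
    """Check a resolved binary path against allow_execve entries."""
    if not allow_execve:
        return True  # empty list = allow all
    for entry in allow_execve:
        if entry.endswith("/*"):
            # Drop only the "*", keeping the trailing "/" so the match
            # requires a directory boundary.
            if resolved.startswith(entry[:-1]):
                return True
        elif resolved == entry:
            return True
    return False
```

Keeping the trailing slash in the comparison is what makes /nix/store-extra/foo fail against /nix/store/*.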

Ongoing enforcement: When the USER_NOTIF supervisor is active (kernel 5.9+, default), every execve() and execveat() call inside the sandbox is intercepted and validated against allow_execve. This means child processes cannot exec arbitrary binaries. When the notifier is disabled (kernel < 5.9 or notifier = false), only the initial command is validated, and child processes can exec any binary visible in the mount namespace.

[process]
max_pids = 64
allow_execve = ["/usr/bin/python3", "/usr/bin/pip", "/nix/store/*"]
env_passthrough = ["PATH", "HOME", "LANG", "TERM", "VIRTUAL_ENV"]

[resources]

Resource limits enforced via cgroups v2. Requires systemd with per-user cgroup delegation (default on most modern distributions).

Opt-in: Resource limits are not included in any of the shipped base recipes. They are entirely opt-in — add memory_mb and/or cpu_percent to your own recipe when needed.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| memory_mb | int (optional) | none | Memory limit in megabytes |
| cpu_percent | int (optional) | none | CPU limit as percentage of one core (e.g., 50 = 50%) |

How it works:

Canister detects the current cgroup from /proc/self/cgroup, creates a child cgroup (canister-<pid>), writes memory.max and cpu.max, and moves the sandboxed process into it. No root required.

  • memory_mb = 512 → memory.max = 536870912 (512 MiB). Exceeding the limit triggers the kernel OOM killer.
  • cpu_percent = 50 → cpu.max = "50000 100000" (50ms quota per 100ms period), capping the process to 50% of one CPU core.
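The arithmetic behind the two cgroup values is simple enough to show directly. In this Python sketch the helper names are hypothetical, and the fixed 100 ms period is taken from the cpu.max example above.

```python
CPU_PERIOD_US = 100_000  # 100 ms period, in microseconds


def memory_max(memory_mb: int) -> str:
    """Value written to memory.max: the limit in bytes (MiB * 1024 * 1024)."""
    return str(memory_mb * 1024 * 1024)


def cpu_max(cpu_percent: int) -> str:
    """Value written to cpu.max: '<quota> <period>' in microseconds."""
    quota = cpu_percent * CPU_PERIOD_US // 100
    return f"{quota} {CPU_PERIOD_US}"
```

For example, cpu_percent = 100 yields a quota equal to the period, which corresponds to one full core.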

Failure behavior: If cgroup setup fails (e.g., no cgroup v2, no delegation), the sandbox aborts. In strict mode (--strict), seccomp uses KILL_PROCESS for immediate termination on any denied syscall.

[resources]
memory_mb = 512
cpu_percent = 100

[syscalls]

Customizes the seccomp BPF baseline and enforcement mode.

Canister ships a single default seccomp baseline defined in recipes/default.toml (~187 allowed syscalls, ~18 always-denied). The baseline is embedded in the binary at compile time and can be overridden by placing a default.toml in the recipe search path (./.canister/, $XDG_CONFIG_HOME/canister/recipes/, /etc/canister/recipes/).

Regular recipes customize the baseline by adding or removing syscalls with allow_extra / deny_extra. The baseline itself uses allow / deny (absolute lists). These two pairs are mutually exclusive — a recipe either IS the baseline (uses allow/deny) or EXTENDS it (uses allow_extra/deny_extra).

Override fields (for regular recipes)

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| seccomp_mode | string | "allow-list" | Seccomp mode: "allow-list" (default deny) or "deny-list" (default allow) |
| allow_extra | string[] | [] | Syscalls to add to the baseline allow list |
| deny_extra | string[] | [] | Syscalls to add to the deny list (also removed from allow list) |
| notifier | bool (optional) | auto-detect | Enable/disable the USER_NOTIF supervisor for argument-level syscall filtering |

Absolute fields (for default.toml only)

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| allow | string[] | [] | Complete allow list (replaces the baseline, not additive) |
| deny | string[] | [] | Complete deny list (replaces the baseline, not additive) |

Mutual exclusion: Using allow or deny together with allow_extra or deny_extra in the same [syscalls] section is a validation error.

Seccomp modes:

| Mode | Default action | Listed syscalls | Use case |
| --- | --- | --- | --- |
| allow-list | DENY all | Only baseline + allow_extra syscalls permitted | Production, CI (recommended) |
| deny-list | ALLOW all | Only baseline deny + deny_extra syscalls blocked | Compatibility, unknown workloads |
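The mutual-exclusion rule and the two modes can be modeled in Python as below. This is a simplified sketch with hypothetical function names, not Canister's seccomp module. In particular, how allow_extra interacts with a syscall that the baseline denies is not modeled, because it is not described here.

```python
def validate_syscalls(section: dict) -> None:
    """Reject sections that mix absolute and relative syscall fields."""
    absolute = {"allow", "deny"} & section.keys()
    relative = {"allow_extra", "deny_extra"} & section.keys()
    if absolute and relative:
        raise ValueError("allow/deny cannot be combined with allow_extra/deny_extra")


def effective_lists(baseline_allow, baseline_deny, allow_extra=(), deny_extra=()):
    """Apply allow_extra / deny_extra on top of the baseline lists."""
    deny = set(baseline_deny) | set(deny_extra)
    # deny_extra entries are also removed from the allow list.
    allow = (set(baseline_allow) | set(allow_extra)) - set(deny_extra)
    return allow, deny


def permitted(syscall: str, mode: str, allow: set, deny: set) -> bool:
    """allow-list denies by default; deny-list allows by default."""
    if mode == "allow-list":
        return syscall in allow
    if mode == "deny-list":
        return syscall not in deny
    raise ValueError(f"unknown seccomp_mode: {mode!r}")
```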

Examples:

# Elixir/BEAM: needs ptrace for observer/dbg/recon
[syscalls]
allow_extra = ["ptrace"]

# Strict: also block personality for extra hardening
[syscalls]
deny_extra = ["personality"]

# Broader extension: ptrace, personality, seccomp, and io_uring support
[syscalls]
allow_extra = ["ptrace", "personality", "seccomp", "io_uring_setup", "io_uring_enter", "io_uring_register"]

# Deny-list mode for maximum compatibility
[syscalls]
seccomp_mode = "deny-list"

See SECCOMP.md for details on the baseline syscall set and how the embed+override resolution works.

USER_NOTIF supervisor (notifier)

The notifier field controls the SECCOMP_RET_USER_NOTIF supervisor, which provides argument-level filtering for connect(), clone()/clone3(), socket(), execve(), and execveat().

| Value | Behavior |
| --- | --- |
| true | Force the notifier on (fails if kernel < 5.9) |
| false | Force the notifier off |
| omitted | Auto-detect: enabled if kernel >= 5.9 and not in monitor mode |

When the notifier is active, connect() calls are filtered against the resolved IPs from each [[host]].domain and allow_ips, clone()/clone3() are blocked from creating new namespaces, socket() is blocked from creating AF_NETLINK or SOCK_RAW sockets, and execve()/execveat() are validated against allow_execve paths for every execution (not just the initial command).

The notifier field is merged using the last-Some-wins strategy during recipe composition, like other optional scalar fields such as egress and seccomp_mode (strict is the exception: it uses OR).
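The three notifier settings reduce to a small decision function. The Python sketch below is illustrative only: the function names are hypothetical, and parsing the kernel release string (as reported by uname -r) with a regular expression is an assumption about how the version might be obtained.

```python
import re
from typing import Optional


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Compare the leading 'major.minor' of a kernel release string."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if not m:
        return False
    return (int(m.group(1)), int(m.group(2))) >= (major, minor)


def notifier_enabled(setting: Optional[bool], release: str, monitor: bool) -> bool:
    """Resolve notifier = true / false / omitted (None)."""
    supported = kernel_at_least(release, 5, 9)
    if setting is True:
        if not supported:
            raise RuntimeError("notifier = true requires kernel >= 5.9")
        return True
    if setting is False:
        return False
    # Omitted: auto-detect.
    return supported and not monitor
```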

# Disable the notifier for compatibility with older kernels
[syscalls]
notifier = false

# Force it on (fail loudly if not supported)
[syscalls]
notifier = true

See SECCOMP.md for the full technical description.


[proxy]

L7 proxy settings used by proxy-only egress mode.

[proxy]
max_buffered_body_bytes = 8388608   # 8 MiB (default)
upstream_request_timeout_ms = 30000  # 30 s (default)

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_buffered_body_bytes | usize | 8388608 | Max bytes buffered for DLP body scanning |
| upstream_request_timeout_ms | u64 | 30000 | Upstream request timeout in milliseconds |

Enforcement semantics

When network.egress = "proxy-only":

  • sandboxed processes may only open outbound INET/INET6 connections to:
    • loopback proxy endpoint (127.0.0.1:<proxy_port> / ::1:<proxy_port>)
    • configured DNS server on port 53
  • direct outbound internet access is denied by seccomp USER_NOTIF policy even if HTTP_PROXY/HTTPS_PROXY env vars are unset.

This makes seccomp-notify the first-line defense and proxy the forwarding path for legitimate traffic.
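The connect() decision in proxy-only mode can be summarized as a predicate. This Python sketch models only the policy stated above, not the actual USER_NOTIF handler; the function name and its parameters are hypothetical.

```python
def connect_allowed(ip: str, port: int, proxy_port: int, dns_server: str) -> bool:
    """Allow only the loopback proxy endpoint and the configured DNS server."""
    if ip in ("127.0.0.1", "::1") and port == proxy_port:
        return True  # loopback proxy endpoint
    if ip == dns_server and port == 53:
        return True  # configured DNS server
    return False  # direct outbound access is denied
```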


Strict Mode

Strict mode (--strict or strict = true in config) tightens all enforcement for CI and production use.

Config:

strict = true

CLI:

can run --strict --recipe policy.toml -- python3 script.py

The CLI --strict flag can only tighten — if the config sets strict = true, the CLI cannot override it to false.

Changes in strict mode:

| Enforcement point | Normal mode | Strict mode |
| --- | --- | --- |
| Filesystem isolation | Aborts on failure | Aborts on failure |
| Network setup | Aborts on failure | Aborts on failure |
| Loopback bring-up | Aborts on failure | Aborts on failure |
| Seccomp deny action | EPERM (process survives) | KILL_PROCESS (immediate termination) |
| Cgroup setup | Aborts on failure | Aborts on failure |

The key difference is the seccomp deny action: normal mode returns EPERM so the process can handle denials gracefully; strict mode kills the process immediately on any denied syscall.

Mutual exclusion: --strict and --monitor cannot be used together. Strict mode ensures full enforcement; monitor mode relaxes it. These are contradictory intents.
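The tighten-only rule and the mutual exclusion can be sketched together. The function and enum names here are hypothetical, and treating config-level strict = true combined with --monitor as an error is an assumption; the text above only states the conflict for the two CLI flags.

```rust
#[derive(Debug, PartialEq)]
enum Mode {
    Normal,
    Strict,
    Monitor,
}

/// Resolve the enforcement mode from the config value and the CLI flags.
fn resolve_mode(config_strict: bool, cli_strict: bool, cli_monitor: bool) -> Result<Mode, String> {
    // The CLI can only tighten: strict is the OR of both sources,
    // so nothing on the command line can turn a config-level strict off.
    let strict = config_strict || cli_strict;
    match (strict, cli_monitor) {
        (true, true) => Err("--strict and --monitor cannot be used together".to_string()),
        (true, false) => Ok(Mode::Strict),
        (false, true) => Ok(Mode::Monitor),
        (false, false) => Ok(Mode::Normal),
    }
}
```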


Monitor Mode

Monitor mode (--monitor) is a CLI flag, not a config field. It relaxes enforcement across all policy sections so you can observe what would be blocked without actually blocking it.

can run --monitor --recipe my_policy.toml -- python3 script.py

What changes in monitor mode:

| Section | Normal | Monitor |
|---|---|---|
| [process].allow_execve | Blocks unlisted commands | Logs warning, allows |
| [process].env_passthrough | Strips unlisted vars | Logs stripped count, passes all |
| [process].max_pids | Enforces RLIMIT_NPROC | Logs limit, skips enforcement |
| [syscalls] seccomp | Returns EPERM on denied syscalls | Logs to kernel audit, allows |
| [filesystem] | Overlay + pivot_root | Unchanged (isolation active) |
| [network] | Namespace + pasta | Unchanged (isolation active) |

Reading monitor output:

  • Look for MONITOR: prefixed log lines in stderr.
  • Seccomp events appear in kernel logs: journalctl -k | grep seccomp.
  • A pre-run policy preview and post-run summary are printed automatically.

Monitor mode is a development tool. It provides no security guarantees. Cannot be combined with --strict.


Inspecting the Resolved Policy

Use can recipe show to see the fully resolved policy after all recipe merging and environment variable expansion:

# Show the base policy (no recipes)
can recipe show

# Show the resolved policy with a recipe
can recipe show -r elixir

# Show with auto-detection (pass the command to trigger match_prefix)
can recipe show -r elixir -- mix test

# Compose multiple recipes and see the result
can recipe show -r nix -r elixir

# Save to a standalone recipe file
can recipe show -r nix -r elixir > my-custom.toml
can run -r my-custom.toml -- mix test

The output is valid TOML and includes all resolved fields:

strict = false

[filesystem]
allow = ["/bin", "/sbin", "/usr/bin", ...]
deny = ["/etc/shadow", "/etc/gshadow"]

[network]
egress = "proxy-only"

[[host]]
domain = "hex.pm"

[[host]]
domain = "repo.hex.pm"

[[host]]
domain = "builds.hex.pm"

[process]
allow_execve = ["/nix/store/*"]
env_passthrough = ["PATH", "HOME", ...]

[resources]

[syscalls]
seccomp_mode = "allow-list"
allow_extra = ["ptrace"]

This serves two purposes:

  1. Auditing — see exactly what policy will be enforced before running.
  2. Standalone recipes — capture the output and use it as a custom recipe that doesn’t depend on any other recipe files.

Examples

Minimal: deny everything

No config file needed. This is the default.

can run -- echo "hello"

Equivalent to:

[filesystem]
[network]
egress = "none"
[syscalls]

Python data science

Allow pip installs from PyPI and access to a workspace directory.

[filesystem]
allow = [
    "/usr/lib",
    "/usr/bin",
    "/usr/local/lib",
    "/tmp/workspace",
]
deny = ["/etc/shadow", "/root"]

[network]
egress = "proxy-only"
[[host]]
domain = "pypi.org"

[[host]]
domain = "files.pythonhosted.org"

[process]
env_passthrough = ["PATH", "HOME", "LANG", "VIRTUAL_ENV"]

Node.js build

Allow npm registry access and a project directory.

[filesystem]
allow = [
    "/usr/lib",
    "/usr/bin",
    "/usr/local",
    "/home/user/project",
]

[network]
egress = "proxy-only"
[[host]]
domain = "registry.npmjs.org"

[[host]]
domain = "nodejs.org"

[process]
env_passthrough = ["PATH", "HOME", "NODE_ENV"]

Full network trust

For trusted code that needs unrestricted network access but should still be filesystem- and syscall-restricted.

[filesystem]
allow = ["/tmp/workspace"]

[network]
egress = "direct"

Air-gapped

No network, no filesystem beyond essentials, strict seccomp.

[filesystem]
allow = ["/tmp/workspace"]
deny  = ["/etc", "/root", "/home"]

[network]
egress = "none"

Strict CI

All-or-nothing enforcement. If any isolation layer can’t be set up, the sandbox refuses to start. Denied syscalls kill the process immediately.

strict = true

[filesystem]
allow = ["/tmp/workspace"]

[network]
egress = "none"

[process]
max_pids = 64
allow_execve = ["/usr/bin/python3"]

[resources]
memory_mb = 512
cpu_percent = 100

[syscalls]
seccomp_mode = "allow-list"

Elixir/Erlang (mix tasks, iex, Phoenix)

Run mix tasks, iex shells, or Phoenix servers with hex.pm access. Use with -r nix or -r homebrew if Elixir is installed via a package manager.

[recipe]
name = "elixir"
description = "Elixir/Erlang (BEAM VM) — mix, iex, Phoenix"

[filesystem]
allow = [
    "/usr/lib",
    "/usr/bin",
    "/usr/local/lib",
    "/usr/local/bin",
    "/lib",
    "/tmp/workspace",
]
deny = ["/etc/shadow", "/root"]

[network]
egress = "proxy-only"

[[host]]
domain = "hex.pm"

[[host]]
domain = "repo.hex.pm"

[[host]]
domain = "builds.hex.pm"

[process]
max_pids = 256
env_passthrough = [
    "PATH", "HOME", "LANG", "TERM",
    "MIX_ENV", "MIX_HOME", "HEX_HOME",
    "ERL_AFLAGS", "ELIXIR_ERL_OPTIONS",
    "SECRET_KEY_BASE", "DATABASE_URL", "PORT",
]

[syscalls]
allow_extra = ["ptrace"]   # BEAM tracing tools (:observer, :dbg, recon)

Usage with composition:

# Nix-installed Elixir: nix.toml auto-detected, elixir.toml explicit
can run -r elixir -- mix test

# Explicit composition
can run -r nix -r elixir -- mix test

# Strict CI
can run --strict -r elixir -- mix test

Seccomp Filtering

Canister uses seccomp BPF to restrict which Linux syscalls the sandboxed process can invoke. This document explains how the default baseline works, how recipes customize it, and the enforcement modes available.


How Seccomp Works in Canister

Canister generates a classic BPF (Berkeley Packet Filter) program at runtime and loads it via prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER) right before execve().

Two enforcement modes

| Mode | Default action | Listed syscalls | Config value |
|---|---|---|---|
| Allow-list (default) | DENY | Only listed syscalls permitted | seccomp_mode = "allow-list" |
| Deny-list | ALLOW | Only listed syscalls blocked | seccomp_mode = "deny-list" |

Allow-list mode (recommended, default) inverts the security model: every syscall not explicitly in the baseline (plus allow_extra) is denied. This provides a much smaller kernel attack surface.

Deny-list mode is the permissive fallback: everything is allowed except the syscalls in the deny list (plus deny_extra). Use this when you need maximum compatibility with unknown workloads, at the cost of a larger attack surface.

The filter cannot be removed or modified after loading. The PR_SET_NO_NEW_PRIVS flag is set first, which is required for unprivileged seccomp and also prevents the sandboxed process from gaining new privileges via execve of setuid binaries.


Default Baseline

Canister ships a single default seccomp baseline defined in recipes/default.toml. The baseline is embedded in the binary at compile time via include_str!(), so it always works standalone. At runtime, the search path is checked for an external override:

  1. ./.canister/default.toml (project-local)
  2. $XDG_CONFIG_HOME/canister/recipes/default.toml (per-user)
  3. /etc/canister/recipes/default.toml (system-wide)
  4. Embedded fallback (compiled into the binary)

This lets teams pin, audit, or version-control the baseline independently of the binary.
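The lookup order above can be sketched as a list of candidate paths checked in sequence. The helper names are hypothetical, and skipping the per-user entry when XDG_CONFIG_HOME is unset is an assumption of this sketch (the real lookup may fall back to a default directory).

```rust
use std::path::PathBuf;

/// Candidate override locations, in the order they are checked.
fn baseline_candidates(xdg_config_home: Option<&str>) -> Vec<PathBuf> {
    let mut paths = vec![PathBuf::from("./.canister/default.toml")];
    if let Some(xdg) = xdg_config_home {
        paths.push(PathBuf::from(xdg).join("canister/recipes/default.toml"));
    }
    paths.push(PathBuf::from("/etc/canister/recipes/default.toml"));
    paths
}

/// Returns the first candidate accepted by `exists`.
/// `None` means: use the baseline embedded in the binary.
fn pick_baseline<'a>(
    candidates: &'a [PathBuf],
    exists: impl Fn(&PathBuf) -> bool,
) -> Option<&'a PathBuf> {
    candidates.iter().find(|p| exists(*p))
}
```

In real use the `exists` closure would be something like `|p| p.is_file()`; passing it in keeps the ordering logic testable without touching the filesystem.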

The baseline provides:

  • ~187 allowed syscalls — the common syscalls needed by most programs (read, write, open, mmap, clone, futex, getpgrp, etc.)
  • ~18 always-denied syscalls — dangerous operations that no sandboxed process should ever need (reboot, kexec_load, mount, etc.)

The default.toml uses absolute [syscalls] allow = [...] and deny = [...] fields. Regular recipes use the relative allow_extra / deny_extra fields to layer overrides on top. These two modes are mutually exclusive — a recipe either IS the baseline (uses allow/deny) or EXTENDS it (uses allow_extra/deny_extra).

The baseline was derived by analyzing the syscall needs of Python, Node.js, Elixir/BEAM, and general-purpose binaries. The old 4-profile system (generic, python, node, elixir) was collapsed into this single baseline because:

  1. Python and Node were literally identical — same allow list, same deny list.
  2. The total delta across all 4 profiles was only 6 syscalls: ptrace, personality, seccomp, io_uring_setup, io_uring_enter, io_uring_register.
  3. The 4-profile taxonomy gave a false sense of specificity.

Recipes that need syscalls beyond the baseline use [syscalls] allow_extra. Recipes that want tighter restrictions use [syscalls] deny_extra.


Customizing via Recipes

The [syscalls] section in a recipe TOML customizes the baseline:

[syscalls]
allow_extra = ["ptrace"]           # add to the allow list
deny_extra  = ["personality"]      # add to deny list AND remove from allow list
seccomp_mode = "allow-list"        # default; or "deny-list"

How overrides work:

  1. Start with the default baseline (ALLOW_BASE + DENY_ALWAYS).
  2. Add allow_extra syscalls to the allow list (deduplicated).
  3. Add deny_extra syscalls to the deny list.
  4. Remove deny_extra syscalls from the allow list (deny takes precedence).
  5. Generate the BPF filter from the final lists.
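Steps 1 to 4 above amount to two set operations. The sketch below uses a hypothetical `SeccompLists` struct and a free function; the source only names `SeccompProfile::apply_overrides()`, so the shape shown here is an assumption.

```rust
use std::collections::BTreeSet;

/// Hypothetical container for the resolved allow and deny sets.
struct SeccompLists {
    allow: BTreeSet<String>,
    deny: BTreeSet<String>,
}

/// Layer a recipe's `allow_extra` / `deny_extra` on top of the baseline.
fn apply_overrides(
    mut base: SeccompLists,
    allow_extra: &[&str],
    deny_extra: &[&str],
) -> SeccompLists {
    // Step 2: add allow_extra (the set deduplicates automatically).
    base.allow.extend(allow_extra.iter().map(|s| s.to_string()));
    // Steps 3 and 4: deny_extra joins the deny set and is removed from
    // the allow set, so deny always takes precedence.
    for name in deny_extra {
        base.deny.insert(name.to_string());
        base.allow.remove(*name);
    }
    base
}
```

Applying the deny pass last is what makes a syscall listed in both `allow_extra` and `deny_extra` end up denied.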

Common recipes:

| Workload | allow_extra | deny_extra | Why |
|---|---|---|---|
| Python scripts | (none) | | Default baseline is sufficient |
| Node.js builds | (none) | | Default baseline is sufficient |
| Elixir/BEAM | ["ptrace"] | | BEAM tools (:observer, :dbg, recon) need ptrace |
| Generic (permissive) | ["ptrace", "personality", "seccomp", "io_uring_setup", "io_uring_enter", "io_uring_register"] | | Maximum compatibility |
| Hardened | | ["personality"] | Block multilib/personality switching |

Always-Denied Syscalls

The default baseline includes ~16 syscalls that are always denied. These are dangerous kernel-level operations that a sandboxed process should never need:

| Syscall | Why it’s blocked |
|---|---|
| reboot | Reboots the system |
| kexec_load | Loads a new kernel |
| init_module | Loads a kernel module |
| finit_module | Loads a kernel module (from fd) |
| delete_module | Unloads a kernel module |
| swapon | Enables swap space |
| swapoff | Disables swap space |
| acct | Enables/disables process accounting |
| mount | Mounts a filesystem |
| umount2 | Unmounts a filesystem |
| pivot_root | Changes the root filesystem |
| chroot | Changes the root directory |
| syslog | Reads/controls kernel message buffer |
| settimeofday | Changes the system clock |
| unshare | Creates new namespaces (escape vector) |
| setns | Joins existing namespaces (escape vector) |

These are blocked because they represent operations that only system administrators should perform, and a sandboxed process has no legitimate reason to invoke them.


Deny Action: Errno, Kill, and Strict Mode

Canister supports three deny actions depending on the mode:

| Mode | Deny action | Behavior |
|---|---|---|
| Normal | SECCOMP_RET_ERRNO \| EPERM | Denied syscall returns -1 with errno = EPERM. Process survives. |
| Strict (--strict) | SECCOMP_RET_KILL_PROCESS | Process is immediately terminated with SIGSYS. |
| Monitor (--monitor) | SECCOMP_RET_LOG | Syscall is allowed but logged to kernel audit. |

Normal mode (default) uses Errno because:

  1. Most programs check return values and can handle EPERM gracefully.
  2. Kill mode makes debugging harder (process just dies with no error message).
  3. The denied syscalls are operations that programs generally don’t invoke accidentally – if a program calls reboot(), it’s intentional and getting EPERM back is the right response.

Strict mode (--strict) uses Kill because:

  1. In CI/production, a denied syscall indicates a policy violation or attack.
  2. Immediate termination prevents any further execution after a violation.
  3. The process cannot observe or react to the denial (no information leak).

The architecture validation check (wrong CPU architecture) always uses SECCOMP_RET_KILL_PROCESS regardless of mode, since an architecture mismatch indicates an actual attack (e.g., x32 ABI bypass attempt).
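The three deny actions correspond to fixed 32-bit filter return values. The constants below are the kernel's values from the seccomp UAPI header; the `Mode` enum and `deny_action` function are hypothetical names used only for this sketch.

```rust
// Kernel UAPI values (linux/seccomp.h).
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
const SECCOMP_RET_LOG: u32 = 0x7ffc_0000;
/// The low 16 bits of the return value carry the errno for SECCOMP_RET_ERRNO.
const SECCOMP_RET_DATA: u32 = 0x0000_ffff;
const EPERM: u32 = 1;

#[derive(Clone, Copy)]
enum Mode {
    Normal,
    Strict,
    Monitor,
}

/// Filter return value used for a denied syscall in each mode.
fn deny_action(mode: Mode) -> u32 {
    match mode {
        Mode::Normal => SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA),
        Mode::Strict => SECCOMP_RET_KILL_PROCESS,
        Mode::Monitor => SECCOMP_RET_LOG,
    }
}
```

Only this one value differs between modes, which is why the rest of the filter can stay identical.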


Monitor Mode and SECCOMP_RET_LOG

When running with --monitor, the seccomp filter uses SECCOMP_RET_LOG (0x7ffc0000) instead of SECCOMP_RET_ERRNO. This is a third deny action mode:

| Mode | Return value | Behavior |
|---|---|---|
| Errno | SECCOMP_RET_ERRNO \| EPERM | Denied syscall returns EPERM |
| Kill | SECCOMP_RET_KILL_PROCESS | Process killed immediately |
| Log | SECCOMP_RET_LOG | Syscall is allowed but logged to kernel audit |

In Log mode, the BPF filter structure is identical to Errno mode — same architecture check, same deny list, same jump offsets. Only the return value for matched syscalls changes. This means the filter accurately reflects what would be blocked in enforcement mode.

Viewing logged syscalls:

# After running with --monitor
journalctl -k | grep seccomp
# or
dmesg | grep seccomp

Each log line shows the syscall number, PID, and other context. Map syscall numbers back to names with ausyscall (from the auditd package):

ausyscall --dump | grep <number>

SECCOMP_RET_LOG is available since Linux 4.14 (well within the 5.6+ minimum kernel requirement).


Architecture Validation

The BPF filter’s first check validates that the syscall comes from the expected CPU architecture:

  • x86_64: AUDIT_ARCH_X86_64 (0xC000003E)
  • aarch64: AUDIT_ARCH_AARCH64 (0xC00000B7)

If the architecture doesn’t match, the process is killed immediately (SECCOMP_RET_KILL_PROCESS).

Why this matters: On x86_64, the kernel also supports the x32 ABI (32-bit pointers in 64-bit mode). x32 syscalls use different numbers than native x86_64. Without this check, an attacker could invoke x32 syscalls to bypass the filter (since the BPF checks are against x86_64 numbers).
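As an illustration, the architecture check can be written as the first instructions of a classic BPF program. The opcode values and the `seccomp_data` offsets below are the kernel's; the `SockFilter` struct mirrors `struct sock_filter`, and the function name and exact instruction layout are assumptions rather than Canister's generated code.

```rust
/// Mirrors the kernel's `struct sock_filter`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SockFilter {
    code: u16,
    jt: u8,
    jf: u8,
    k: u32,
}

const BPF_LD_W_ABS: u16 = 0x20; // BPF_LD | BPF_W | BPF_ABS
const BPF_JEQ_K: u16 = 0x15; // BPF_JMP | BPF_JEQ | BPF_K
const BPF_RET_K: u16 = 0x06; // BPF_RET | BPF_K
const SECCOMP_DATA_NR_OFFSET: u32 = 0; // offsetof(struct seccomp_data, nr)
const SECCOMP_DATA_ARCH_OFFSET: u32 = 4; // offsetof(struct seccomp_data, arch)
const AUDIT_ARCH_X86_64: u32 = 0xC000_003E;
const AUDIT_ARCH_AARCH64: u32 = 0xC000_00B7;
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;

/// First four instructions of the filter: kill on an unexpected
/// architecture, otherwise load the syscall number for later checks.
fn arch_check_prologue(expected_arch: u32) -> [SockFilter; 4] {
    [
        // A = seccomp_data.arch
        SockFilter { code: BPF_LD_W_ABS, jt: 0, jf: 0, k: SECCOMP_DATA_ARCH_OFFSET },
        // If A == expected_arch, skip the next instruction.
        SockFilter { code: BPF_JEQ_K, jt: 1, jf: 0, k: expected_arch },
        // Architecture mismatch: kill the process regardless of mode.
        SockFilter { code: BPF_RET_K, jt: 0, jf: 0, k: SECCOMP_RET_KILL_PROCESS },
        // A = seccomp_data.nr
        SockFilter { code: BPF_LD_W_ABS, jt: 0, jf: 0, k: SECCOMP_DATA_NR_OFFSET },
    ]
}
```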


USER_NOTIF Supervisor

Classic BPF sees only the fixed fields of seccomp_data: the syscall number (nr), the architecture (arch), the instruction pointer, and the six raw argument register values. It cannot dereference pointers, so for pointer-based arguments like connect()’s sockaddr or execve()’s pathname, the BPF filter only sees the raw pointer value, not the data it points to.

Canister uses SECCOMP_RET_USER_NOTIF (Linux 5.9+) to bridge this gap. When the sandboxed process invokes a syscall that requires argument inspection, the kernel suspends the calling thread and delivers a notification to a supervisor process. The supervisor reads the actual argument data (via /proc/<pid>/mem), makes a policy decision, and sends an ALLOW or DENY verdict back to the kernel.

How it works

The supervisor runs as PID 1 inside the sandbox’s PID namespace, not as a thread in the parent. This architecture is required because of three cascading kernel restrictions:

  1. After unshare(CLONE_NEWPID), clone(CLONE_THREAD) returns EINVAL (pid_ns_for_children != task_active_pid_ns), so a supervisor thread cannot be spawned.
  2. The host’s procfs (s_user_ns = init_user_ns) denies /proc/<pid>/mem opens from a child user namespace, so the supervisor must mount its own procfs.
  3. PID 1 is an ancestor of all sandboxed processes, satisfying Yama ptrace_scope=1 without PR_SET_PTRACER.

  PID 1 (supervisor, same user ns + PID ns)    PID 2+ (worker / sandboxed)
  ─────────────────────────────────────────    ────────────────────────────
  1. unshare(CLONE_NEWNS)                      1. Sandbox setup
  2. mount /proc (owned by user ns)               (overlay, pivot_root, etc.)
  3. recv_fd() via SCM_RIGHTS                  2. seccomp() → notifier fd
     → notifier_fd                             3. send_fd() via SCM_RIGHTS
  4. Loop:                                     4. Install main BPF filter
     a. poll(notifier_fd, 200ms)               5. execve()
     b. ioctl(NOTIF_RECV) → read notification
     c. open+read /proc/<pid>/mem
     d. Evaluate against policy
     e. ioctl(NOTIF_ID_VALID) → TOCTOU check
     f. ioctl(NOTIF_SEND) → verdict
     g. waitpid(WNOHANG) → check child status

The supervisor runs inline (single-threaded) using poll() with a 200ms timeout, interleaved with non-blocking waitpid to detect when the worker exits. After the worker exits, remaining in-flight notifications are drained before the supervisor terminates.

Two-filter architecture

The worker installs two seccomp filters:

  1. Notifier filter (installed first via seccomp() syscall with SECCOMP_FILTER_FLAG_NEW_LISTENER): Returns SECCOMP_RET_USER_NOTIF for the eight intercepted syscalls (connect, sendto, sendmsg, clone, clone3, socket, execve, execveat). All other syscalls return SECCOMP_RET_ALLOW.

  2. Main filter (installed second via prctl(PR_SET_SECCOMP)): The existing allow-list or deny-list BPF filter. Returns SECCOMP_RET_ERRNO, SECCOMP_RET_KILL_PROCESS, or SECCOMP_RET_LOG depending on mode.

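The kernel runs every installed filter on each syscall and applies the highest-precedence return value. SECCOMP_RET_USER_NOTIF ranks above SECCOMP_RET_LOG and SECCOMP_RET_ALLOW but below SECCOMP_RET_ERRNO and SECCOMP_RET_KILL_PROCESS, so an intercepted syscall reaches the supervisor whenever the main filter allows (or only logs) it, regardless of install order, while a syscall the main filter denies is rejected without a notification.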

Intercepted syscalls

| Syscall | Argument inspected | Policy |
|---|---|---|
| connect() | sockaddr (destination address) | Allow only IPs pre-resolved from each [[host]] block’s domain and explicit allow_ips. Loopback and Unix domain sockets always allowed. |
| sendto() | dest_addr (destination address) | DNS queries on port 53 trigger supervisor-side resolution and dynamic allowlist population. Connected sockets (NULL dest_addr) allowed. |
| sendmsg() | msghdr struct (msg_controllen) | Blocks any sendmsg() with ancillary data (msg_controllen > 0), preventing SCM_RIGHTS fd passing regardless of outbound restriction settings. |
| clone() | flags (register value) | Deny namespace-creating flags: CLONE_NEWNS, CLONE_NEWCGROUP, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWUSER, CLONE_NEWPID, CLONE_NEWNET |
| clone3() | clone_args.flags (read from userspace struct) | Same flag check as clone(), read from the clone_args struct via /proc/<pid>/mem |
| socket() | domain, type, protocol (register values) | SOCK_RAW denied. AF_NETLINK restricted to NETLINK_ROUTE (protocol 0) only; all other netlink protocols denied. Normal TCP/UDP/Unix sockets allowed. |
| execve() | pathname (read from userspace string) | Validate against allow_execve paths. If allow_execve is empty, allow all. |
| execveat() | pathname (read from userspace string) | Same as execve(). Resolves the path relative to the dirfd argument. |
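The clone()/clone3() rule in the table is a bitmask test on the flags value. The flag constants below are the kernel's values; the function name is hypothetical and the sketch covers only the flag check, not reading `clone_args` from the target's memory.

```rust
// Namespace-creating clone flags (linux/sched.h).
const CLONE_NEWNS: u64 = 0x0002_0000;
const CLONE_NEWCGROUP: u64 = 0x0200_0000;
const CLONE_NEWUTS: u64 = 0x0400_0000;
const CLONE_NEWIPC: u64 = 0x0800_0000;
const CLONE_NEWUSER: u64 = 0x1000_0000;
const CLONE_NEWPID: u64 = 0x2000_0000;
const CLONE_NEWNET: u64 = 0x4000_0000;

/// Union of every flag the supervisor refuses.
const NAMESPACE_FLAGS: u64 = CLONE_NEWNS
    | CLONE_NEWCGROUP
    | CLONE_NEWUTS
    | CLONE_NEWIPC
    | CLONE_NEWUSER
    | CLONE_NEWPID
    | CLONE_NEWNET;

/// Allow the call only if no namespace-creating flag is present.
/// Ordinary thread and process creation flags pass through untouched.
fn clone_flags_allowed(flags: u64) -> bool {
    (flags & NAMESPACE_FLAGS) == 0
}
```

The same predicate serves both syscalls: for clone() the flags come from a register, for clone3() from the `flags` field of the struct the supervisor reads.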

TOCTOU protection

A time-of-check-time-of-use race exists: a multi-threaded sandboxed process could modify the memory that the supervisor reads between the read and the verdict. Canister mitigates this with SECCOMP_IOCTL_NOTIF_ID_VALID:

  1. Read notification (gets syscall args and a unique notification ID).
  2. Read memory via /proc/<pid>/mem for pointer-based arguments.
  3. Evaluate policy.
  4. Call ioctl(SECCOMP_IOCTL_NOTIF_ID_VALID, &id) — if the kernel returns an error (ENOENT), the syscall was interrupted (the thread exited or the memory was unmapped) and the notification is stale. The supervisor skips sending a verdict.
  5. Send verdict.

This is the standard mitigation recommended by the seccomp_unotify(2) man page. It is not airtight against a determined attacker with precise timing, but it eliminates the most common race windows.

CIDR matching

For connect() filtering, the supervisor supports both exact IP matches and CIDR range matches (e.g., 10.0.0.0/8, 2606:2800:220:1::/64). The resolved IPs from each [[host]] block’s domain are combined with any allow_ips CIDR ranges from the config to build the allowlist. Loopback addresses (127.0.0.0/8, ::1) and AF_UNIX sockets are always permitted.
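A minimal sketch of the allowlist check, using only the standard library. The `Cidr` type and `connect_allowed` function are hypothetical; a bare IP is treated as a full-length prefix (/32 or /128) so exact matches and ranges share one code path. AF_UNIX handling is outside this sketch.

```rust
use std::net::IpAddr;

/// Hypothetical allowlist entry: a network address plus prefix length.
struct Cidr {
    network: IpAddr,
    prefix_len: u8,
}

impl Cidr {
    /// Parse "10.0.0.0/8", "2606:2800:220:1::/64", or a bare IP.
    fn parse(s: &str) -> Option<Cidr> {
        let (addr_str, len_str) = match s.split_once('/') {
            Some((a, l)) => (a, Some(l)),
            None => (s, None),
        };
        let network: IpAddr = addr_str.parse().ok()?;
        let max: u8 = if network.is_ipv4() { 32 } else { 128 };
        let prefix_len = match len_str {
            Some(l) => l.parse::<u8>().ok()?,
            None => max,
        };
        if prefix_len > max {
            return None;
        }
        Some(Cidr { network, prefix_len })
    }

    /// True if `ip` falls inside this range. Address families never mix.
    fn contains(&self, ip: IpAddr) -> bool {
        match (self.network, ip) {
            (IpAddr::V4(net), IpAddr::V4(ip)) => {
                let mask = if self.prefix_len == 0 { 0 } else { u32::MAX << (32 - self.prefix_len as u32) };
                (u32::from(net) & mask) == (u32::from(ip) & mask)
            }
            (IpAddr::V6(net), IpAddr::V6(ip)) => {
                let mask = if self.prefix_len == 0 { 0 } else { u128::MAX << (128 - self.prefix_len as u32) };
                (u128::from(net) & mask) == (u128::from(ip) & mask)
            }
            _ => false,
        }
    }
}

/// Loopback is always permitted; everything else must match the allowlist.
fn connect_allowed(allowlist: &[Cidr], ip: IpAddr) -> bool {
    ip.is_loopback() || allowlist.iter().any(|c| c.contains(ip))
}
```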

DNS proxy integration

When the notifier is active, a DNS proxy runs in the parent process on an ephemeral port. The sandbox’s /etc/resolv.conf points to pasta’s DNS address (169.254.0.1:53), which is configured via --dns-forward to forward queries to the parent’s DNS proxy. The proxy only resolves domains that have a matching [[host]] block — all other queries receive an NXDOMAIN response. This prevents DNS-based information exfiltration and ensures the sandbox can only resolve allowed domains.

Configuration

The notifier is controlled by the notifier field in [syscalls]:

[syscalls]
notifier = true       # force on
# notifier = false    # force off
# omit the key        → auto-detect (default)

Auto-detection logic:

  1. If notifier is explicitly set in the config, that value is used.
  2. If running in monitor mode, the notifier is disabled (monitor mode uses SECCOMP_RET_LOG, which is incompatible with SECCOMP_RET_USER_NOTIF).
  3. Otherwise, the notifier is enabled if the kernel version is 5.9 or later (the minimum version that supports all required seccomp_unotify ioctls).

Kernel version detection reads /proc/sys/kernel/osrelease and parses the major.minor version.
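The auto-detection rules can be sketched as two small functions. The names are hypothetical, the caller is assumed to have already read /proc/sys/kernel/osrelease into a string, and treating an unparseable release string as "not supported" is an assumption of this sketch.

```rust
/// Parse the leading "major.minor" from a kernel release string
/// such as "5.15.0-91-generic".
fn parse_major_minor(osrelease: &str) -> Option<(u32, u32)> {
    let mut parts = osrelease.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    // The minor field may carry a suffix ("8-rc1"); keep the leading digits.
    let digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor = digits.parse().ok()?;
    Some((major, minor))
}

/// Decide whether the notifier runs, following the three rules in order.
fn notifier_enabled(explicit: Option<bool>, monitor: bool, osrelease: &str) -> bool {
    // 1. An explicit config value always wins.
    if let Some(value) = explicit {
        return value;
    }
    // 2. Monitor mode disables the notifier.
    if monitor {
        return false;
    }
    // 3. Otherwise require kernel 5.9 or later.
    matches!(parse_major_minor(osrelease), Some(v) if v >= (5, 9))
}
```

Tuple comparison is lexicographic, so `(6, 1) >= (5, 9)` holds even though the minor number is smaller.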

Requirements

  • Linux 5.9+ — for SECCOMP_IOCTL_NOTIF_RECV, SECCOMP_IOCTL_NOTIF_SEND, and SECCOMP_IOCTL_NOTIF_ID_VALID.
  • PR_SET_NO_NEW_PRIVS must be set on the worker before installing the filter (already done by both the notifier and main filter installation paths). The supervisor (PID 1) must NOT have PR_SET_NO_NEW_PRIVS set, as it would break /proc/<pid>/mem access.
  • AppArmor — the canister_sandboxed profile must allow ptrace (readby tracedby) from the canister peer profile. This is configured in the shipped canister.apparmor profile.

Inspecting the Baseline

List discovered recipes and the default baseline:

$ can recipe list
Discovered recipes:

  elixir               Elixir/Erlang (BEAM VM) — mix, iex, Phoenix
                       +ptrace                        recipes/elixir.toml
  ...

Default baseline: ~187 allowed, ~18 denied syscalls
  Customize per-recipe with [syscalls] allow_extra / deny_extra

To see exactly which syscalls the baseline allows/blocks, open recipes/default.toml. The [syscalls] allow array is the allow set, [syscalls] deny is the deny set. The file is the single source of truth — it is embedded into the binary at compile time via include_str!() and can be overridden by placing a default.toml in the recipe search path (./.canister/, $XDG_CONFIG_HOME/canister/recipes/, /etc/canister/recipes/).

SeccompProfile::apply_overrides() merges per-recipe allow_extra / deny_extra customizations on top of this baseline.

To see the fully resolved policy (after all recipe merging and env var expansion), use can recipe show:

$ can recipe show -r elixir
strict = false

[filesystem]
allow = ["/bin", "/sbin", ...]

[syscalls]
seccomp_mode = "allow-list"
allow_extra = ["ptrace"]
...

The output is valid TOML that can be saved as a standalone recipe file.

Data Loss Prevention (DLP)

Canister’s L7 egress proxy includes a built-in DLP layer that scans outbound HTTP traffic for credential patterns and enforces per-detector domain scoping. Even when a sandboxed process has filesystem access to credential files (because the user wants npm, gh, or aws to keep working), DLP blocks those credentials from leaving the sandbox over HTTP(S) to unauthorised destinations.


Threat Model

A sandboxed process typically has filesystem access to credential-bearing files — intentionally, because the user wants their package managers and CLI tools to keep working against private registries. That process is potentially:

  • Untrusted — a build script, post-install hook, or LLM-generated command running with read access to ~/.npmrc, ~/.aws/credentials, the GitHub keyring, etc.
  • Trusted-but-buggy — telemetry code that accidentally serialises environment variables containing tokens.
  • Trusted-but-compromised — a supply-chain attack inside an otherwise reputable dependency.

DLP’s goal: even when a credential is readable, it cannot leave the sandbox via HTTP(S) unless flowing to an explicitly authorised destination for that credential’s service.

In scope:

  • HTTP/1.1 and HTTP/2 request headers, bodies, trailers
  • URI query parameters and path segments
  • Bodies wrapped in gzip / deflate / brotli
  • Multi-layer encoded payloads (base64 / hex / percent), up to 32 levels
  • DNS-label exfiltration via high-entropy hostname labels
  • Slow byte-at-a-time exfiltration via cumulative entropy budgeting

Out of scope:

  • Covert timing channels
  • In-memory key extraction
  • Filesystem-write exfiltration to shared/CWD mounts
  • Pixel-level steganography in image payloads
  • Plain CONNECT (L4) tunnels — DLP forces interception when enabled, so any traffic that bypasses interception (e.g. non-HTTP protocols) is denied rather than inspected.

Architecture

DLP lives in the standalone can-dlp crate so it can be reused by both the proxy and the sandbox (for canary generation) without pulling proxy dependencies into the sandbox crate.

crates/can-dlp/
  src/
    detectors.rs      — DetectorId enum, compiled RegexSet, Finding
    scopes.rs         — per-detector domain matching (built-in + extras)
    decode.rs         — base64/hex/percent recursion, up to N layers
    decompress.rs     — gzip/deflate/brotli body decompression
    normalize.rs      — whitespace/unicode normalisation before scanning
    entropy.rs        — Shannon entropy + SessionEntropyBudget
    canary.rs         — fake credential generation
    scanner.rs        — DlpScanner: orchestrates the full pipeline
    error.rs          — DlpError (thiserror)

The DlpConfig serde struct lives in can-policy (next to NetworkConfig) to avoid a can-dlp → can-policy circular dependency.

Activation chain:

recipe / manifest [network.dlp]
        │
        ▼
NetworkConfig::dlp (Option<DlpConfig>)
        │
        ▼
ProxyServer constructed with DlpScanner + SessionEntropyBudget
        │
        ▼
Per-request: scan headers + URI + (decompressed, decoded) body

When DLP is enabled, the proxy forces interception of all traffic. The passthrough path (which is opaque to the proxy) is disabled because it would bypass scanning.


Detectors and Scope Model

Each detector has hardcoded home domains baked into the binary. Tokens can only flow to their home service — even if a [[host]] block permits the destination, a GitHub PAT bound for registry.npmjs.org is blocked.

| Detector | Pattern | Built-in home domains | Default action |
|---|---|---|---|
| github_pat | gh[pousr]_[A-Za-z0-9]{36} and github_pat_[A-Za-z0-9]{22}_[A-Za-z0-9]{59} | github.com, *.github.com | block |
| npm_token | npm_[A-Za-z0-9]{36} | registry.npmjs.org | block |
| aws_access_key | AKIA[A-Z0-9]{16} | *.amazonaws.com | block |
| slack_token | xox[baprs]-[0-9]{10,13}-[0-9]{10,13}-[A-Za-z0-9]{24} | *.slack.com | block |
| ssh_private_key | -----BEGIN (RSA\|EC\|OPENSSH\|DSA )?PRIVATE KEY----- | none (always block) | block |
| bearer_token | Bearer\s+[A-Za-z0-9\-._~+/]{20,}=* | (requires explicit allow_credentials = ["bearer_token"] on a host) | block |
| generic_high_entropy | Sliding window, Shannon entropy > 4.5, 20+ chars | (warn only) | warn (promoted to block in --strict) |
| canary_token | Exact match against injected fake credentials | none (always block) | block (error log) |

Enforcement rules:

  1. Known-service tokens (github_pat, npm_token, aws_access_key, slack_token) — destination must be in the detector’s home domains or in a [[host]] block whose allow_credentials includes the detector id. Mismatched service → 451 block.
  2. bearer_token — generic; requires explicit per-host opt-in via allow_credentials = ["bearer_token"]. No implicit scope.
  3. ssh_private_key and canary_token — no legitimate HTTP destination; always blocked.
  4. generic_high_entropy — too noisy to scope; always warn, blocks only in --strict.

The shipped service contracts under recipes/services/*.toml (github.toml, npm.toml, …) already include the right allow_credentials for their detector — composing tools = ["npm", "gh"] produces the right behaviour: npm tokens can only reach npmjs.org, GitHub PATs can only reach GitHub.

Extending scopes for self-hosted services

Self-hosted services (GitHub Enterprise, private npm registries) extend the built-in scopes via extra_scopes:

[network.dlp]
enabled = true

[network.dlp.extra_scopes]
github_pat = ["github.corp.example.com"]
npm_token = ["npm.internal.example.com"]

Extras are unioned with the built-in domains. They never replace or narrow them, so a self-hosted override cannot accidentally weaken the default scope for the public service.
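The union of built-in and extra scopes can be sketched as a single membership test. The function names are hypothetical, and the wildcard semantics (a `*.` pattern matches subdomains but not the bare domain) are an assumption, suggested by the built-in list naming both github.com and *.github.com.

```rust
/// Match a host against an exact domain or a "*.suffix" wildcard.
fn domain_matches(pattern: &str, host: &str) -> bool {
    // Compare case-insensitively and ignore a trailing root dot.
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    let pattern = pattern.to_ascii_lowercase();
    match pattern.strip_prefix("*.") {
        // Wildcard: require at least one label followed by a dot before
        // the suffix, so "evilgithub.com" does not match "*.github.com".
        Some(suffix) => host
            .strip_suffix(suffix)
            .map_or(false, |rest| rest.len() > 1 && rest.ends_with('.')),
        None => host == pattern,
    }
}

/// A destination is in scope if it matches any built-in OR any extra
/// domain. Extras only ever widen the set.
fn in_scope(builtin: &[&str], extras: &[&str], host: &str) -> bool {
    builtin
        .iter()
        .chain(extras.iter())
        .any(|p| domain_matches(p, host))
}
```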


Scan Pipeline

Per request, the proxy runs:

1. Headers (Authorization, Cookie, Proxy-Authorization, X-*)
   → scan_text → token detected? scope check
2. URI (full reconstructed authority + path + query)
   → scan_text → token detected? scope check
3. Body
   a. Read Content-Encoding header
   b. Decompress (gzip / deflate / brotli) if configured
   c. Run encoding chain recursion (base64 / hex / percent)
   d. Pattern match each layer against PatternSet
4. For every finding:
   - canary    → BLOCK + error! log (zero false positives)
   - ssh key   → BLOCK
   - scoped    → BLOCK if destination not in home/extras
   - bearer    → BLOCK unless the destination's `[[host]]` block lists `"bearer_token"` in `allow_credentials`
   - generic   → WARN (BLOCK in --strict)
5. Session entropy budget update; BLOCK if exceeded.
6. Build response:
   - On allow: forward upstream with `update_content_length()` if body
     was buffered.
   - On block: 451 + `x-canister-error: dlp-blocked` +
     `x-canister-dlp-detector: <name>`.
   - On monitor-mode warn: forward upstream + add
     `x-canister-dlp-warning` so the sandboxed process can observe what
     would have been blocked.

DLP forces request body buffering within the existing max_buffered_body_bytes cap. A streaming scan would miss tokens that straddle chunk boundaries; the cap (default 8 MiB) prevents memory abuse.


Encoding Chain Recursion

decode.rs walks every layer of base64 / base64url / hex / percent-encoding up to max_decode_depth (default 32). At each layer the scanner attempts all decoders; any that produces output different from its input is recursed into. All decoded layers are matched against PatternSet, so:

  • Authorization: Bearer dGVzdA== (Bearer test) is matched at the original layer.
  • body={"x":"Z2hwX0FBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQQ=="} (a base64-wrapped GitHub PAT) is caught at depth 1.
  • base64(base64(token)) is caught at depth 2.
  • Garbage / malformed encoding at any layer is fail-closed: the original bytes are scanned as-is and the recursion stops on that branch — never silently skipped.

The depth cap is a fuse against adversarially deep nesting designed to exhaust CPU.
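The recursion can be illustrated with hand-rolled decoders and a plain substring search standing in for the pattern set. Everything here is a simplified sketch under stated assumptions: the function names are hypothetical, the whole input is treated as one encoded blob (the real scanner also finds encoded tokens embedded in larger bodies), and the base64 decoder is deliberately lenient about padding.

```rust
/// Decode standard or URL-safe base64; None on malformed input.
fn b64_decode(input: &[u8]) -> Option<Vec<u8>> {
    let trimmed: Vec<u8> = input.iter().copied().filter(|b| *b != b'=').collect();
    if trimmed.is_empty() || trimmed.len() % 4 == 1 {
        return None;
    }
    let (mut acc, mut bits, mut out) = (0u32, 0u32, Vec::new());
    for b in trimmed {
        let v = match b {
            b'A'..=b'Z' => b - b'A',
            b'a'..=b'z' => b - b'a' + 26,
            b'0'..=b'9' => b - b'0' + 52,
            b'+' | b'-' => 62,
            b'/' | b'_' => 63,
            _ => return None,
        };
        acc = (acc << 6) | v as u32;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
        }
    }
    Some(out)
}

/// Decode an even-length string of hex digits.
fn hex_decode(input: &[u8]) -> Option<Vec<u8>> {
    if input.is_empty() || input.len() % 2 != 0 {
        return None;
    }
    input
        .chunks(2)
        .map(|pair| {
            let hi = (pair[0] as char).to_digit(16)?;
            let lo = (pair[1] as char).to_digit(16)?;
            Some((hi * 16 + lo) as u8)
        })
        .collect()
}

/// Decode %XX escapes; None if an escape is truncated or not hex.
fn percent_decode(input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        if input[i] == b'%' {
            out.extend(hex_decode(input.get(i + 1..i + 3)?)?);
            i += 3;
        } else {
            out.push(input[i]);
            i += 1;
        }
    }
    Some(out)
}

/// Scan this layer, then recurse into every decoder whose output differs
/// from its input, until `depth_left` reaches zero.
fn found_in_any_layer(data: &[u8], needle: &[u8], depth_left: usize) -> bool {
    if !needle.is_empty() && data.windows(needle.len()).any(|w| w == needle) {
        return true;
    }
    if depth_left == 0 {
        return false;
    }
    let decoders: [fn(&[u8]) -> Option<Vec<u8>>; 3] = [b64_decode, hex_decode, percent_decode];
    decoders.iter().any(|decode| match decode(data) {
        Some(next) if next.as_slice() != data => found_in_any_layer(&next, needle, depth_left - 1),
        _ => false, // malformed or unchanged: this branch stops here
    })
}
```

Note that the current layer is always scanned before any decoding is attempted, so a decoder failure never causes bytes to go unscanned.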


Content Decompression

decompress.rs inspects the Content-Encoding header and inflates gzip / deflate / brotli bodies before scanning. This is gated by network.dlp.decompress (default true).

Malformed or truncated compressed bodies fail the request rather than being forwarded unscanned — fail-closed.


DNS Entropy Check

Independently of HTTP scanning, the proxy applies a Shannon-entropy check to the destination hostname before resolving it. Each DNS label (the parts between the dots) is scored; if any label exceeds dns_entropy_threshold (default 4.5) the request is blocked with dlp-blocked + dns-entropy reason. This catches the classic DNS exfiltration pattern: <base64-of-secret>.attacker.example where the high-entropy subdomain is the payload.

The check runs even on CONNECT tunnels (before resolution), so it applies regardless of L7 protocol.
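A per-label entropy check can be sketched in a few lines. The function names are hypothetical and entropy is computed here over bytes in bits per character, which is an assumption about how the real scorer counts symbols.

```rust
/// Shannon entropy of `data`, in bits per byte.
fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// True if any dot-separated label of `hostname` scores above `threshold`.
fn hostname_exceeds_entropy(hostname: &str, threshold: f64) -> bool {
    hostname
        .split('.')
        .any(|label| shannon_entropy(label.as_bytes()) > threshold)
}
```

Under this definition a label needs at least 23 distinct characters to exceed 4.5 bits, since a label with k distinct characters scores at most log2(k); ordinary hostnames stay well below that.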


Session Entropy Budget

A sandbox session keeps a SessionEntropyBudget (default 8192 bytes). After each request scan, the count of high-entropy bytes (Shannon entropy > 4.0 in any 32-byte sliding window) is recorded against the budget. When the budget is exhausted, further requests are blocked.

This catches slow exfiltration: a credential split across many small requests, each individually below the per-request entropy threshold but collectively well above plausible legitimate traffic patterns.

The budget is per ProxyServer instance, which is one per sandbox session — it resets when the sandbox exits.
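One way to model the budget is a running counter fed by a sliding-window scan. This is a sketch under assumptions: the struct fields, the `record` method, and the rule that a byte counts as high-entropy when it lies inside at least one window above the threshold are all illustrative choices, not the crate's actual definitions.

```rust
/// Shannon entropy of `data`, in bits per byte.
fn shannon_entropy(data: &[u8]) -> f64 {
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// Count bytes covered by at least one `window`-sized slice whose
/// entropy exceeds `threshold`.
fn count_high_entropy_bytes(data: &[u8], window: usize, threshold: f64) -> usize {
    if window == 0 || data.len() < window {
        return 0;
    }
    let mut flagged = vec![false; data.len()];
    for (start, w) in data.windows(window).enumerate() {
        if shannon_entropy(w) > threshold {
            flagged[start..start + window].iter_mut().for_each(|f| *f = true);
        }
    }
    flagged.iter().filter(|f| **f).count()
}

/// Hypothetical per-session budget.
struct SessionEntropyBudget {
    limit: usize,
    used: usize,
}

impl SessionEntropyBudget {
    fn new(limit: usize) -> Self {
        Self { limit, used: 0 }
    }

    /// Record one request's high-entropy byte count.
    /// Returns false once the cumulative total exceeds the limit.
    fn record(&mut self, high_entropy_bytes: usize) -> bool {
        self.used = self.used.saturating_add(high_entropy_bytes);
        self.used <= self.limit
    }
}
```

Because `used` only grows, once `record` returns false every later request in the same session is rejected too, which matches the "further requests are blocked" behaviour described above.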


Canary Tokens

When canary_tokens = true (default when DLP is enabled), the sandbox generates three fake credentials at startup and exports them as environment variables into the sandboxed process:

| Env var | Shape |
|---|---|
| CANISTER_CANARY_GITHUB_PAT | ghp_ + 36 random alphanumeric |
| CANISTER_CANARY_NPM_TOKEN | npm_ + 36 random alphanumeric |
| CANISTER_CANARY_AWS_ACCESS_KEY | AKIA + 16 random uppercase |

The exact same strings are registered as the highest-priority pattern in the DlpScanner. Any outbound appearance of one of these strings is treated as proof of exfiltration: the values authenticate to nothing, so no legitimate process has a reason to send them. The detector's action is always block, and each hit is logged at error! level with full request context.

Canary tokens are an active honeypot: they plant bait that a malicious process is incentivised to scoop up (env vars are the first thing credential thieves enumerate) and that the proxy treats as a high-confidence trip wire.
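
A sketch of the generate-then-match idea, using the shapes from the table above. The function names are hypothetical, and reading "random uppercase" as the letters A to Z only is an assumption.

```python
import secrets
import string

ALNUM = string.ascii_letters + string.digits

# env var name -> (prefix, number of random characters, alphabet)
CANARY_SHAPES = {
    "CANISTER_CANARY_GITHUB_PAT": ("ghp_", 36, ALNUM),
    "CANISTER_CANARY_NPM_TOKEN": ("npm_", 36, ALNUM),
    "CANISTER_CANARY_AWS_ACCESS_KEY": ("AKIA", 16, string.ascii_uppercase),
}


def generate_canaries() -> dict:
    """Create one fresh fake credential per env var for this session."""
    return {
        name: prefix + "".join(secrets.choice(alphabet) for _ in range(length))
        for name, (prefix, length, alphabet) in CANARY_SHAPES.items()
    }


def leaked_canary(outbound: bytes, canaries: dict):
    """Return the env var name whose exact value appears in outbound data, else None."""
    for name, token in canaries.items():
        if token.encode() in outbound:
            return name
    return None
```

Because the match is on the exact generated value rather than on the token shape, a hit cannot be a false positive from a real credential of the same format.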


Enforcement Modes

DLP integrates with the existing sandbox enforcement modes rather than introducing a separate kill switch.

| Mode | DLP enabled? | generic_high_entropy | Block action |
| --- | --- | --- | --- |
| Default | Per recipe enabled = true | warn | 451 |
| --monitor | As configured | warn (logged) | Not blocked; request forwarded with x-canister-dlp-warning header |
| --strict | Implicitly enabled when egress = "proxy-only" | promoted to block | 451 |

  • Default: DLP runs if the recipe enables it; violations return 451.
  • --monitor: DLP findings are logged at warn! level with full detector / host / fingerprint detail but requests still go through. Mirrors how monitor mode handles seccomp and filesystem checks. Use this to dry-run a new policy before flipping it on.
  • --strict: DLP is implicitly enabled even without dlp.enabled = true, provided the recipe uses egress = "proxy-only" (strict mode requires DLP-grade enforcement). generic_high_entropy is promoted from warn to block.

No new flags or kill switches were added — --strict plus recipe config cover the same activation surface as a dedicated enable knob.
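
The mode table can be read as two small decisions: whether DLP runs at all, and what a finding does to the request. The sketch below encodes that reading. The function names and return labels are hypothetical, and treating a warn finding in default mode as "log and forward" is an assumption.

```python
def dlp_active(mode: str, recipe_enabled: bool, egress: str) -> bool:
    """DLP runs when the recipe enables it, or implicitly under --strict with proxy-only egress."""
    return recipe_enabled or (mode == "strict" and egress == "proxy-only")


def outcome(mode: str, detector: str, default_action: str) -> str:
    """Map a finding to what happens to the request.

    default_action is the detector's normal action: "block" or "warn".
    """
    if mode == "monitor":
        return "forward-with-warning-header"  # logged, never blocked
    action = default_action
    if mode == "strict" and detector == "generic_high_entropy":
        action = "block"  # promoted from warn under --strict
    return "block-451" if action == "block" else "forward-and-log"
```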


Response Headers and Status Codes

| Outcome | Status | Headers |
| --- | --- | --- |
| Token detected, blocked | 451 Unavailable For Legal Reasons | x-canister-error: dlp-blocked, x-canister-dlp-detector: <name> |
| Token detected, monitor mode | (upstream status) | x-canister-dlp-warning: <name> |
| DNS-label entropy block | 451 | x-canister-error: dlp-blocked, x-canister-dlp-reason: dns-entropy |
| Session budget exhausted | 451 | x-canister-error: dlp-blocked, x-canister-dlp-reason: session-budget |

451 is used so DLP blocks are distinguishable from upstream 403s. The detector name is exposed in the header so the sandboxed process / calling tool can produce a sensible error message.
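
A calling tool could turn these headers into a readable message along the following lines. The header names and values are the ones in the table above; the helper name and the wording of the messages are hypothetical.

```python
def describe_dlp_outcome(status: int, headers: dict) -> str:
    """Summarise what the proxy's DLP layer did, based on status code and headers."""
    h = {name.lower(): value for name, value in headers.items()}
    if status == 451 and h.get("x-canister-error") == "dlp-blocked":
        if "x-canister-dlp-detector" in h:
            return "blocked: detector " + h["x-canister-dlp-detector"]
        return "blocked: " + h.get("x-canister-dlp-reason", "unspecified reason")
    if "x-canister-dlp-warning" in h:
        # Monitor mode: the upstream status is passed through unchanged.
        return "forwarded with warning: detector " + h["x-canister-dlp-warning"]
    return "no DLP finding"
```

Checking x-canister-error as well as the status keeps a genuine upstream 451 from being misreported as a DLP block.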


Configuration

Full schema (all fields optional; defaults shown):

[network.dlp]
enabled = false                   # implicit true under --strict + proxy-only
canary_tokens = true              # default when DLP is enabled
max_decode_depth = 32             # encoding chain recursion cap
decompress = true                 # gzip/deflate/brotli before scan
dns_entropy_threshold = 0.92      # normalised entropy ratio per DNS label (0.0 to 1.0)
session_entropy_budget = 8192     # cumulative high-entropy bytes/session

[network.dlp.extra_scopes]
github_pat = ["github.corp.example.com"]
npm_token = ["npm.internal.example.com"]

Merge semantics

When recipes / manifests are merged left-to-right (base.toml → auto-detected → explicit -r → manifest overrides), each field uses:

| Field | Merge rule | Rationale |
| --- | --- | --- |
| enabled | OR (any Some(true) wins) | Security escalation, never reversed |
| canary_tokens | OR | Same |
| extra_scopes | per-detector domain union | Never narrows |
| max_decode_depth | last-Some-wins | Numeric tuning |
| decompress | last-Some-wins | |
| dns_entropy_threshold | last-Some-wins | |
| session_entropy_budget | last-Some-wins | |

This guarantees a downstream recipe can never disable DLP that an upstream recipe enabled, and can never shrink the scope set.
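
The table translates into a merge function roughly like the one below. It is a Python sketch over plain dicts, with None standing in for an omitted field; merge_dlp and its helpers are hypothetical names, not the Rust implementation.

```python
LAST_SOME_FIELDS = (
    "max_decode_depth",
    "decompress",
    "dns_entropy_threshold",
    "session_entropy_budget",
)


def or_merge(earlier, later):
    """Any True wins for good; otherwise keep the most recent explicit value."""
    if earlier is True or later is True:
        return True
    return later if later is not None else earlier


def last_some(earlier, later):
    """A later explicit value overwrites; an omitted one preserves the earlier value."""
    return later if later is not None else earlier


def union_scopes(earlier, later):
    """Per-detector union of domains, keeping first-seen order."""
    merged = {detector: list(domains) for detector, domains in (earlier or {}).items()}
    for detector, domains in (later or {}).items():
        known = merged.setdefault(detector, [])
        for domain in domains:
            if domain not in known:
                known.append(domain)
    return merged


def merge_dlp(base: dict, overlay: dict) -> dict:
    """Merge two [network.dlp] tables, base first, overlay second."""
    merged = {
        "enabled": or_merge(base.get("enabled"), overlay.get("enabled")),
        "canary_tokens": or_merge(base.get("canary_tokens"), overlay.get("canary_tokens")),
        "extra_scopes": union_scopes(base.get("extra_scopes"), overlay.get("extra_scopes")),
    }
    for field in LAST_SOME_FIELDS:
        merged[field] = last_some(base.get(field), overlay.get(field))
    return merged
```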

Where to put it

  • Project-level: [network.dlp] in canister.toml enables DLP for every sandbox in the project.
  • Per-sandbox: same key under [sandbox.<name>.network.dlp].
  • Recipe-level: drop a [network.dlp] block into a custom recipe. Tool recipes (tool:gh, tool:npm, etc.) deliberately do not ship [network.dlp] — they declare the right [[host]] blocks with allow_credentials, and the scope check does the rest.

Limitations

  • Pattern coverage is finite. A novel credential shape (a vendor introducing a new prefix) won’t be caught until a detector is added. generic_high_entropy is the catch-all, but its warn-by-default posture means it’s only fatal in --strict.
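  • Body buffering ceiling. Requests up to max_buffered_body_bytes (default 8 MiB) get the full whole-buffer pipeline. Larger requests, up to max_streamed_body_bytes (default 64 MiB), are scanned by the chunked streaming path (regex only, no decompression or decode chain), and anything beyond that cap is rejected with 413 Payload Too Large rather than forwarded unscanned. This is fail-closed by design, but it limits the protocol shapes DLP can cover (large file uploads get weaker analysis and may need a higher cap or a different egress path).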
  • TLS interception is required. DLP relies on the proxy’s MITM CA; it does not inspect end-to-end-pinned TLS (e.g. when the sandboxed process pins its own cert). Such traffic fails to handshake under the proxy, which is the same fail-closed posture.
  • No regex on raw binary. Detectors operate on UTF-8 text after decompression and decoding. Binary protocols carrying credentials outside text fields (e.g. proprietary RPC over HTTP) need a custom detector or a different egress strategy.

CLI Reference

This file is auto-generated by can-docgen. Do not edit manually.

can

Canister: a lightweight sandbox for running untrusted code safely

Usage: can [OPTIONS] <COMMAND>

Commands:
  up      Run a named sandbox from canister.toml
  run     Run a command inside the sandbox
  check   Check available kernel capabilities for sandboxing
  setup   Install or manage the security policy (AppArmor/SELinux) for filesystem isolation
  recipe  Manage and inspect recipes
  init    Download community recipes to the local config directory
  update  Update community recipes from the remote repository
  help    Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose  Enable verbose (debug) logging
  -h, --help     Print help
  -V, --version  Print version

can run

Run a command inside the sandbox

Usage: can run [OPTIONS] <COMMAND>...

Arguments:
  <COMMAND>...
          The command to execute

Options:
  -r, --recipe <RECIPE>
          Recipe name or path. Can be repeated for composition.
          
          If the argument contains `/` or ends with `.toml`, it is treated as a file path. Otherwise it is looked up by name across the recipe search path (e.g., `-r nix` resolves to `nix.toml`).
          
          Multiple recipes are merged left-to-right.

  -v, --verbose
          Enable verbose (debug) logging

  -m, --monitor
          Run in monitor mode: log access attempts without enforcing

  -s, --strict
          Strict mode: fail hard on all setup failures. Seccomp uses KILL_PROCESS, filesystem isolation failures are fatal. Intended for CI / production use

  -p, --port <PORTS>
          Publish a container port to the host.
          
          Syntax: [ip:]hostPort:containerPort[/protocol] Examples: -p 8080:80, -p 127.0.0.1:8443:443/tcp, -p 5000:5000/udp Can be repeated. Implies filtered network mode.

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version
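
The two argument formats described above (-r and -p) can be illustrated with a small parser. This is a Python sketch, not the CLI's own parsing: the names are hypothetical, only IPv4 addresses and the tcp and udp protocols shown in the examples are handled, and an omitted protocol is reported as None rather than guessing a default.

```python
import re

PORT_SPEC = re.compile(
    r"^(?:(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):)?"   # optional bind address
    r"(?P<host>\d+):(?P<container>\d+)"         # hostPort:containerPort
    r"(?:/(?P<proto>tcp|udp))?$"                # optional protocol
)


def parse_port(spec: str):
    """Split [ip:]hostPort:containerPort[/protocol] into its parts."""
    match = PORT_SPEC.match(spec)
    if match is None:
        raise ValueError(f"invalid port spec: {spec!r}")
    return (match["ip"], int(match["host"]), int(match["container"]), match["proto"])


def is_recipe_path(arg: str) -> bool:
    """A recipe argument is a file path if it contains '/' or ends with '.toml'."""
    return "/" in arg or arg.endswith(".toml")
```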

can up

Run a named sandbox from canister.toml.

Discovers canister.toml by walking up from the current directory, resolves the named sandbox (or the first-defined one), composes its recipes, and runs the command.

Usage: can up [OPTIONS] [NAME]

Arguments:
  [NAME]
          Sandbox name to run (defaults to the first defined in canister.toml)

Options:
      --dry-run
          Preview the resolved policy without running the sandbox

  -v, --verbose
          Enable verbose (debug) logging

  -m, --monitor
          Run in monitor mode: log access attempts without enforcing

  -s, --strict
          Override strict mode from the CLI

  -p, --port <PORTS>
          Publish a container port to the host.
          
          Syntax: [ip:]hostPort:containerPort[/protocol] Can be repeated. Implies filtered network mode.

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

can check

Check available kernel capabilities for sandboxing

Usage: can check [OPTIONS]

Options:
  -v, --verbose  Enable verbose (debug) logging
  -h, --help     Print help
  -V, --version  Print version

can setup

Install or manage the security policy (AppArmor/SELinux) for filesystem isolation

Usage: can setup [OPTIONS]

Options:
      --remove
          Remove the security policy instead of installing it

  -v, --verbose
          Enable verbose (debug) logging

  -f, --force
          Force reinstall even if the policy is already installed. Useful after upgrading canister to pick up policy changes

      --pasta-path <PASTA_PATH>
          Explicit path to the pasta binary for non-standard installations.
          
          When pasta is installed via Nix, Homebrew, or custom builds, sudo may not find it in PATH. Use this to generate correct AppArmor rules: sudo can setup --pasta-path $(which pasta)

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

can recipe

Manage and inspect recipes

Usage: can recipe [OPTIONS] <COMMAND>

Commands:
  list     List available recipes and the default baseline syscall counts
  show     Show the fully resolved recipe as TOML
  explain  Explain what a recipe does in human-readable form
  suggest  Suggest recipes for a command
  help     Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose  Enable verbose (debug) logging
  -h, --help     Print help
  -V, --version  Print version

can init

Download community recipes to the local config directory.

Clones the canister GitHub repository (shallow) and copies recipe .toml files into $XDG_CONFIG_HOME/canister/recipes/. Requires git. Prints manual instructions if git is unavailable.

Usage: can init [OPTIONS]

Options:
      --repo <REPO>
          GitHub repository (owner/repo) to fetch from

  -v, --verbose
          Enable verbose (debug) logging

      --branch <BRANCH>
          Branch to fetch

      --no-verify
          Skip SHA-256 checksum verification of recipe files. Required when using custom/forked repositories

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

can update

Update community recipes from the remote repository.

Re-downloads and overwrites all recipes. Equivalent to `can init`.

Usage: can update [OPTIONS]

Options:
      --repo <REPO>
          GitHub repository (owner/repo) to fetch from

  -v, --verbose
          Enable verbose (debug) logging

      --branch <BRANCH>
          Branch to fetch

      --no-verify
          Skip SHA-256 checksum verification of recipe files. Required when using custom/forked repositories

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

can recipe list

List available recipes and the default baseline syscall counts

Usage: can recipe list [OPTIONS]

Options:
  -v, --verbose  Enable verbose (debug) logging
  -h, --help     Print help
  -V, --version  Print version

can recipe show

Show the fully resolved recipe as TOML.

Merges base.toml, auto-detected recipes, and explicit --recipe arguments, expands environment variables, then prints the final effective policy. The output is valid TOML that can be saved as a standalone recipe file.

Usage: can recipe show [OPTIONS] [COMMAND]...

Arguments:
  [COMMAND]...
          Optional command to resolve (enables auto-detection of recipes).
          
          The command is NOT executed — it is only used to determine which recipes would be auto-detected based on `match_prefix`.

Options:
  -r, --recipe <RECIPE>
          Recipe name or path. Can be repeated for composition

  -v, --verbose
          Enable verbose (debug) logging

  -h, --help
          Print help (see a summary with '-h')

  -V, --version
          Print version

Configuration Reference

This file is auto-generated by can-docgen. Do not edit manually.

Canister uses TOML recipe files with strict schema validation. Unknown fields are rejected at parse time.

Top-level fields

A recipe file — the only entry point for parsing policy TOML files.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| host | object[] | [] | Per-destination egress contracts. See host::HostBlock and docs/adr/0007-per-destination-egress-contracts.md. Multiple blocks targeting the same domain are merged in RecipeFile::merge (vec union, max for max_request_bytes, last-Some-wins for contract_mode). |
| strict | bool (optional) | | Strict mode: fail hard instead of degrading gracefully. |

[filesystem]

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| allow | string[] | [] | Paths the sandboxed process is allowed to access (read-only). |
| allow_write | string[] | [] | Paths bind-mounted writable into the sandbox. |
| deny | string[] | [] | Paths explicitly denied (checked before allow and allow_write). |
| mask | string[] | | Paths to mask inside the sandbox (bind /dev/null over them). |
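
The precedence between these lists (deny first, then allow_write, then allow) can be sketched as below. The function name and result labels are hypothetical, matching by path component is an assumption, and mask is left out of the sketch.

```python
from pathlib import PurePosixPath


def _covers(rule: str, path: PurePosixPath) -> bool:
    """True when path is the rule itself or lies underneath it."""
    rule_path = PurePosixPath(rule)
    return path == rule_path or rule_path in path.parents


def access_for(path: str, allow, allow_write, deny) -> str:
    """Resolve a path against the three lists, deny taking precedence."""
    target = PurePosixPath(path)
    if any(_covers(rule, target) for rule in deny):
        return "denied"
    if any(_covers(rule, target) for rule in allow_write):
        return "read-write"
    if any(_covers(rule, target) for rule in allow):
        return "read-only"
    return "absent"  # not mounted into the sandbox at all
```

With the cargo.toml values, $HOME/.cargo/bin/cargo resolves to read-only while $HOME/.cargo/credentials.toml is denied even though its parent directory is allowed.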

[network]

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| allow_host_loopback | bool | false | Allow the sandbox to reach host loopback services through the egress proxy via the magic alias host.canister.local. |
| allow_ips | string[] | [] | Allowed IP addresses or CIDR ranges. IP-literal egress is a separate concept from FQDN egress (no service identity, no per-route shape gates apply), so it stays here rather than folding into the [[host]] table. |
| contract_mode | strict \| relaxed (optional) | | Mode for hosts that have no matching [[host]] entry. |
| egress | "none", "proxy-only", "direct" (optional) | | |
| ports | object[] | [] | Port forwarding rules: map host ports to sandbox ports. |

[network.dlp]

Data Loss Prevention configuration for the egress proxy.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| canary_tokens | bool (optional) | | Inject canary tokens (fake credentials) into the sandbox environment to detect exfiltration attempts. Default: true when DLP is enabled. |
| decompress | bool (optional) | | Decompress request bodies (gzip/deflate/brotli) before scanning. Default: true. |
| dns_entropy_threshold | number (optional) | | Normalised per-label entropy ratio for DNS exfiltration detection. A label’s Shannon entropy is divided by log2(len) to get a value in [0.0, 1.0]; the FQDN trips when two or more labels exceed this ratio. Default: 0.92. (Pre-2026-05 configs used absolute bits; those values are now clamped to 1.0 and effectively disable the check.) |
| enabled | bool (optional) | | Enable DLP scanning. Implicitly enabled in --strict mode when egress = "proxy-only". |
| max_decode_depth | integer (optional) | | Maximum encoding chain recursion depth (base64, hex, percent-encoding). Default: 32. |
| session_entropy_budget | integer (optional) | | Cumulative high-entropy bytes allowed per sandbox session before requests are blocked. Default: 8192. |

[process]

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| allow_execve | string[] | [] | Paths to executables the sandboxed process may exec. |
| env | object | {} | Environment variables to set in the sandbox. These are evaluated after passthrough. |
| env_passthrough | string[] | [] | Environment variables to pass through from the host. All others are stripped. |
| max_pids | integer (optional) | | Maximum number of child PIDs allowed. |

[proxy]

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_buffered_body_bytes | integer (optional) | | Maximum bytes buffered for DLP body scanning via the full whole-buffer pipeline (decode chains, decompression, unescape). Requests at or under this size get the strongest analysis. Default 8 MiB. Requests above this size up to max_streamed_body_bytes are still scanned but via the chunked streaming path (regex only, no decode chain). |
| max_streamed_body_bytes | integer (optional) | | Hard upper bound on request body size. Beyond this, the proxy returns 413. Defaults to 64 MiB. Requests between max_buffered_body_bytes and this cap are scanned by the streaming detector: regex passes with a 256-byte overlap window, no decompression / decode chain. |
| upstream_request_timeout_ms | integer (optional) | | Upstream request total timeout in milliseconds. Defaults to 30 000 ms. |
| upstream_scheme | string (optional) | | Force the upstream request scheme. Accepts "http" or "h2c". When unset (the default), the scheme is inferred from the inbound request URI. Prior versions consulted a client-controlled x-canister-upstream-scheme header for this; that was a footgun (the sandboxed process picked the proxy’s egress protocol) and is no longer honoured. h2c also requires the experimental-h2c build feature; without it, setting this to "h2c" returns an upstream error. |
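
The two size limits and the overlap window can be sketched as follows. The defaults (8 MiB, 64 MiB, 256 bytes) come from the table above; the function names are hypothetical and the streaming loop is a simplified stand-in for the real detector.

```python
import re

MiB = 1024 * 1024
OVERLAP = 256  # bytes carried over between chunks


def scan_path(body_len: int, max_buffered: int = 8 * MiB, max_streamed: int = 64 * MiB) -> str:
    """Pick the scanning path for a request body of the given size."""
    if body_len <= max_buffered:
        return "buffered"      # full pipeline: decode chains, decompression
    if body_len <= max_streamed:
        return "streamed"      # regex only, chunk by chunk
    return "reject-413"


def stream_scan(chunks, pattern) -> bool:
    """Regex-scan a chunked body, keeping a tail so matches that straddle chunks are seen."""
    tail = b""
    for chunk in chunks:
        window = tail + chunk
        if pattern.search(window):
            return True
        tail = window[-OVERLAP:]
    return False
```

Carrying the last 256 bytes forward means a token split across two chunks is still matched, as long as the token itself fits in the overlap plus the next chunk.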

[recipe]

Metadata section for recipe files.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| description | string (optional) | | One-line description of what this recipe is for. |
| match_prefix | string[] | [] | Path prefixes that trigger auto-detection of this recipe. |
| name | string (optional) | | Human-readable recipe name. Defaults to the filename stem when omitted. |
| version | string (optional) | | Opaque version string (for humans, not parsed). |
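
Auto-detection by match_prefix can be pictured like this. It is a Python sketch under stated assumptions: the function name is hypothetical, environment variables are expanded from an explicit env mapping, and prefixes are compared by path component against the command's binary path as given.

```python
from pathlib import PurePosixPath
from string import Template


def auto_detected(binary: str, recipes: dict, env: dict) -> list:
    """Return the names of recipes whose match_prefix covers the binary path.

    recipes maps a recipe name to its list of match_prefix entries.
    """
    target = PurePosixPath(binary)
    hits = []
    for name, prefixes in recipes.items():
        for raw in prefixes:
            prefix = PurePosixPath(Template(raw).safe_substitute(env))
            if prefix == target or prefix in target.parents:
                hits.append(name)
                break  # one matching prefix is enough for this recipe
    return hits
```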

[resources]

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| cpu_percent | integer (optional) | | CPU limit as a percentage (e.g., 50 = 50% of one core). |
| memory_mb | integer (optional) | | Memory limit in megabytes. |

[syscalls]

Syscall customization.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| allow | string[] | [] | Absolute allow list: the complete set of permitted syscalls. Only valid in default.toml. Mutually exclusive with allow_extra. |
| allow_extra | string[] | [] | Syscalls to add to the allow list (on top of the default baseline). |
| deny | string[] | [] | Absolute deny list: syscalls always blocked. Only valid in default.toml. Mutually exclusive with deny_extra. |
| deny_extra | string[] | [] | Syscalls to add to the deny list (also removed from allow list). |
| notifier | bool (optional) | | Enable the SECCOMP_RET_USER_NOTIF supervisor for argument-level syscall filtering (connect, clone, socket, execve). |
| seccomp_mode | allow-list \| deny-list (optional) | | Seccomp enforcement mode. |
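
How allow_extra and deny_extra layer on the baseline can be sketched as set arithmetic. The function name is hypothetical. Two points are assumptions rather than documented rules: that allow_extra lifts a syscall out of the baseline deny list (the elixir.toml recipe opts into memfd_create this way), and that deny_extra wins if a syscall appears in both extra lists.

```python
def effective_syscalls(base_allow, base_deny, allow_extra=(), deny_extra=()):
    """Combine the default.toml baseline with a recipe's extra lists.

    Returns (allowed, denied) as sets of syscall names.
    """
    allowed = (set(base_allow) | set(allow_extra)) - set(deny_extra)
    denied = (set(base_deny) - set(allow_extra)) | set(deny_extra)
    return allowed, denied
```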

Manifest Reference (canister.toml)

This file is auto-generated by can-docgen. Do not edit manually.

A project manifest declares named sandboxes. Place canister.toml in your project root.

Top-level fields

Top-level project manifest parsed from canister.toml.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| sandbox | object | | Named sandbox definitions. |

Merge Semantics

This file is auto-generated by can-docgen. Do not edit manually.

When multiple recipes are composed, each field follows a specific merge strategy.

Composition Order

base.toml (always loaded first)
  → auto-detected recipes (match_prefix against command binary)
  → explicit --recipe args (left to right)
  → manifest overrides (for `can up`)
  = final SandboxConfig

Field Merge Strategies

| Field | Type | Strategy | Description |
| --- | --- | --- | --- |
| recipe | RecipeMeta | Overlay | Later recipe’s metadata wins if present |
| strict | Option<bool> | OR | Any Some(true) wins; can never be loosened |
| filesystem.allow | Vec<PathBuf> | Union | Deduplicated, preserving first-occurrence order |
| filesystem.allow_write | Vec<PathBuf> | Union | Deduplicated, preserving first-occurrence order |
| filesystem.deny | Vec<PathBuf> | Union | Deduplicated, preserving first-occurrence order |
| filesystem.mask | Vec<PathBuf> | Union | Deduplicated, preserving first-occurrence order |
| host (top-level [[host]] blocks) | Vec<HostBlock> | Union by domain | Same domain → field-merged via HostBlock::merge; distinct domains preserved |
| network.allow_ips | Vec<String> | Union | Deduplicated, preserving first-occurrence order |
| network.egress | Option<EgressMode> | Last-Some-wins | None preserves earlier value; Some(x) overwrites |
| network.ports | Vec<PortMapping> | Union | Deduplicated, preserving first-occurrence order |
| process.max_pids | Option<u32> | Last-Some-wins | None preserves earlier value; Some(x) overwrites |
| process.allow_execve | Vec<PathBuf> | Union | Deduplicated, preserving first-occurrence order |
| process.env_passthrough | Vec<String> | Union | Deduplicated, preserving first-occurrence order |
| resources.memory_mb | Option<u64> | Last-Some-wins | None preserves earlier value; Some(x) overwrites |
| resources.cpu_percent | Option<u32> | Last-Some-wins | None preserves earlier value; Some(x) overwrites |
| syscalls.seccomp_mode | Option<SeccompMode> | Last-Some-wins | None preserves earlier value; Some(x) overwrites |
| syscalls.notifier | Option<bool> | Last-Some-wins | None preserves earlier value; Some(x) overwrites |
| syscalls.allow | Vec<String> | Union | Absolute allow list (baseline only) |
| syscalls.deny | Vec<String> | Union | Absolute deny list (baseline only) |
| syscalls.allow_extra | Vec<String> | Union | Deduplicated, preserving first-occurrence order |
| syscalls.deny_extra | Vec<String> | Union | Deduplicated, preserving first-occurrence order |

Strategy Definitions

  • Union: Both base and overlay values are combined into a single list, deduplicated by value, preserving first-occurrence order.
  • OR: Any Some(true) wins permanently. Once strict mode is enabled by any recipe in the chain, it cannot be disabled.
  • Last-Some-wins: The last recipe that specifies a value (Some(x)) wins. None (field omitted) preserves the earlier value.
  • Overlay: The later recipe’s value replaces the earlier one entirely if present.
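
Folding a chain of recipes with these strategies might look like the sketch below. It uses flat dotted keys and plain Python values, with None for an omitted field; the names are hypothetical and only a handful of fields are wired up to keep it short.

```python
def union(earlier, later):
    """Combine two lists, dropping duplicates and keeping first-occurrence order."""
    return list(dict.fromkeys([*(earlier or []), *(later or [])]))


def any_true(earlier, later):
    """OR: once True, always True; otherwise keep the latest explicit value."""
    if earlier is True or later is True:
        return True
    return later if later is not None else earlier


def last_some(earlier, later):
    """Last-Some-wins: an omitted value (None) preserves the earlier one."""
    return later if later is not None else earlier


STRATEGIES = {
    "strict": any_true,
    "filesystem.allow": union,
    "filesystem.deny": union,
    "network.egress": last_some,
    "process.max_pids": last_some,
}


def compose(chain):
    """Merge recipes left to right: base, auto-detected, explicit, manifest overrides."""
    final = {}
    for recipe in chain:
        for field, merge in STRATEGIES.items():
            final[field] = merge(final.get(field), recipe.get(field))
    return final
```

A later recipe that sets strict to false has no effect once an earlier one set it to true, while a later max_pids simply replaces the earlier number.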

Built-in Recipes

This file is auto-generated by can-docgen. Do not edit manually.

Canister ships with the following recipe files in the recipes/ directory. The base.toml and default.toml recipes are embedded in the binary.

Overview

| Recipe | Description | Auto-detected |
| --- | --- | --- |
| base.toml | Essential OS paths for any Linux binary | No |
| cargo.toml | Rust/Cargo toolchain ($HOME/.cargo, $HOME/.rustup) | Yes ($HOME/.cargo, $HOME/.rustup) |
| default.toml | Default baseline: common syscalls for any Linux process. Blocks dangerous kernel operations and namespace escapes. | No |
| elixir.toml | Elixir/Erlang development: mix tasks, iex, Phoenix server | No |
| example.toml | Example recipe showing all available options | No |
| flatpak.toml | Flatpak applications | Yes (/var/lib/flatpak, $HOME/.local/share/flatpak) |
| generic-strict.toml | Strict no-network policy for untrusted binaries (CI/production) | No |
| gnu-store.toml | GNU Guix package manager (/gnu/store) | Yes (/gnu/store) |
| homebrew.toml | Homebrew/Linuxbrew package manager | Yes (/opt/homebrew, /home/linuxbrew/.linuxbrew) |
| neovim.toml | Neovim editor with LSP, tree-sitter, and plugin support | No |
| nix.toml | Nix package manager (/nix/store) | Yes (/nix/store) |
| node-build.toml | Node.js build tasks: npm install, build, test | No |
| opencode.toml | OpenCode AI coding agent with scoped filesystem and restricted network | No |
| python-pip.toml | Install Python packages with pip (network access to PyPI) | No |
| snap.toml | Snap package manager (/snap) | Yes (/snap) |

Recipe Contents

base.toml

# Canister base recipe — essential OS bind mounts.
#
# This recipe provides the minimal set of system paths required for any
# Linux binary to execute inside the sandbox. It replaces the hardcoded
# ESSENTIAL_BIND_MOUNTS list for auditability and customization.
#
# Composition order: base.toml is always loaded first, before default.toml,
# auto-detected recipes, and explicit --recipe arguments.
#
# Override: Place a base.toml in $XDG_CONFIG_HOME/canister/recipes/ or
# ./.canister/ to customize. The embedded copy is used as fallback.

[recipe]
name = "base"
description = "Essential OS paths for any Linux binary"
version = "2"

[filesystem]
# Paths bind-mounted read-only for basic execution. These provide:
# - Shell utilities and system binaries (/bin, /sbin, /usr/bin, /usr/sbin)
# - Shared libraries and the dynamic linker (/lib, /lib64, /usr/lib, /usr/lib64)
# - Locally installed binaries and libraries (/usr/local/bin, /usr/local/lib)
# - Shared data files (/usr/share)
# - Dynamic linker cache and configuration (/etc/ld.so.*)
# - DNS and network resolution (/etc/resolv.conf, /etc/nsswitch.conf, /etc/hosts)
# - TLS certificates (/etc/ssl, /etc/ca-certificates)
# - Timezone data (/etc/localtime)
# - Alternatives system (/etc/alternatives)
# - User/group databases (/etc/passwd, /etc/group) — needed by Go, Python, etc.
# - Temporary file directory (/tmp) — isolated per sandbox via overlay
allow = [
    "/bin",
    "/sbin",
    "/lib",
    "/lib64",
    "/usr/bin",
    "/usr/sbin",
    "/usr/lib",
    "/usr/lib64",
    "/usr/local/bin",
    "/usr/local/lib",
    "/usr/share",
    "/tmp",
    "/etc/ld.so.cache",
    "/etc/ld.so.conf",
    "/etc/ld.so.conf.d",
    "/etc/resolv.conf",
    "/etc/nsswitch.conf",
    "/etc/hosts",
    "/etc/ssl",
    "/etc/ca-certificates",
    "/etc/localtime",
    "/etc/alternatives",
    "/etc/passwd",
    "/etc/group",
]

deny = [
    "/etc/shadow",
    "/etc/gshadow",
]

cargo.toml

# Canister recipe for Cargo (Rust) toolchain.
#
# Cargo installs binaries to $HOME/.cargo/bin and stores the toolchain
# (rustc, rustup) under $HOME/.rustup. Both are needed for Rust
# compilation and tool execution.
#
# Auto-detection: triggered by match_prefix (uses env var expansion)

[recipe]
name = "cargo"
description = "Rust/Cargo toolchain ($HOME/.cargo, $HOME/.rustup)"
version = "1"
match_prefix = ["$HOME/.cargo", "$HOME/.rustup"]

[filesystem]
allow = ["$HOME/.cargo", "$HOME/.rustup"]
deny = ["$HOME/.cargo/credentials.toml", "$HOME/.cargo/credentials"]

default.toml

# Default baseline — the canonical syscall policy for Canister.
#
# This file defines the base set of allowed and denied syscalls used by
# every sandbox invocation. It is embedded into the binary via
# include_str!() as a fallback, but can be overridden by placing a
# default.toml in the recipe search path:
#
#   1. ./.canister/default.toml              (project-local)
#   2. $XDG_CONFIG_HOME/canister/recipes/  (per-user)
#   3. /etc/canister/recipes/              (system-wide)
#
# Regular recipes extend this baseline with allow_extra / deny_extra.
# Only the baseline itself uses the absolute allow / deny fields.

[recipe]
name = "default"
description = "Default baseline — common syscalls for any Linux process. Blocks dangerous kernel operations and namespace escapes."
version = "1"

[syscalls]
# Absolute allow list — syscalls needed by virtually any Linux process:
# libc init, memory allocation, signal handling, file I/O, threading.
#
# Recipes MUST NOT use these fields — they use allow_extra / deny_extra
# to layer on top of this baseline.
allow = [
    # Process lifecycle
    "fork",
    "vfork",
    "clone",
    "clone3",
    "execve",
    "kill",
    "tkill",
    "tgkill",
    "exit",
    "exit_group",
    "wait4",
    "waitid",

    # Process control (prctl only — ptrace, personality, seccomp per-recipe)
    "prctl",

    # File I/O
    "open",
    "openat",
    "openat2",
    "creat",
    "close",
    "close_range",
    "read",
    "write",
    "readv",
    "writev",
    "pread64",
    "pwrite64",
    "lseek",
    "dup",
    "dup2",
    "dup3",
    "fcntl",
    "flock",
    "fsync",
    "fdatasync",
    "truncate",
    "ftruncate",
    "fallocate",

    # File metadata
    "stat",
    "fstat",
    "lstat",
    "newfstatat",
    "statx",
    "access",
    "faccessat",
    "faccessat2",
    "chmod",
    "fchmod",
    "fchmodat",
    "chown",
    "fchown",
    "lchown",
    "fchownat",

    # Directory operations
    "mkdir",
    "mkdirat",
    "rmdir",
    "rename",
    "renameat",
    "renameat2",
    "link",
    "linkat",
    "unlink",
    "unlinkat",
    "symlink",
    "symlinkat",
    "readlink",
    "readlinkat",
    "getdents",
    "getdents64",

    # Memory
    "mmap",
    "mprotect",
    "munmap",
    "mremap",
    "madvise",
    "msync",
    "brk",
    "mlock",
    "mlock2",
    "munlock",
    "mlockall",
    "munlockall",

    # Network
    "socket",
    "connect",
    "accept",
    "accept4",
    "bind",
    "listen",
    "sendto",
    "recvfrom",
    "sendmsg",
    "sendmmsg",
    "recvmsg",
    "shutdown",
    "getsockopt",
    "setsockopt",
    "getsockname",
    "getpeername",
    "socketpair",

    # Signals
    "rt_sigaction",
    "rt_sigprocmask",
    "rt_sigreturn",
    "rt_sigsuspend",
    "sigaltstack",

    # Time
    "nanosleep",
    "clock_nanosleep",
    "clock_gettime",
    "clock_getres",
    "gettimeofday",

    # Polling / async I/O
    "poll",
    "ppoll",
    "select",
    "pselect6",
    "epoll_create",
    "epoll_create1",
    "epoll_ctl",
    "epoll_wait",
    "epoll_pwait",
    "epoll_pwait2",
    "eventfd",
    "eventfd2",
    "timerfd_create",
    "timerfd_settime",
    "timerfd_gettime",

    # File monitoring
    "inotify_init",
    "inotify_init1",
    "inotify_add_watch",
    "inotify_rm_watch",

    # IPC
    "pipe",
    "pipe2",
    "shmget",
    "shmat",
    "shmctl",
    "shmdt",
    "semget",
    "semop",
    "semctl",
    "msgget",
    "msgsnd",
    "msgrcv",
    "msgctl",

    # Process info
    "getpid",
    "getppid",
    "getuid",
    "getgid",
    "geteuid",
    "getegid",
    "gettid",
    "getpgid",
    "getpgrp",
    "setpgid",
    "setsid",
    "getgroups",
    "setgroups",
    "setuid",
    "setgid",
    "setreuid",
    "setregid",
    "setresuid",
    "setresgid",

    # I/O control + legacy AIO
    "ioctl",
    "io_setup",
    "io_submit",
    "io_getevents",
    "io_destroy",

    # Misc / threading
    "futex",
    "set_tid_address",
    "set_robust_list",
    "get_robust_list",
    "sched_yield",
    "sched_getaffinity",
    "sched_setaffinity",
    "sched_setscheduler",
    "sched_getscheduler",
    "rseq",
    "getcwd",
    "chdir",
    "fchdir",
    "umask",
    "uname",
    "sysinfo",
    "getrusage",
    "getrandom",
    "prlimit64",
    "pidfd_open",
    "copy_file_range",
    "sendfile",
    "splice",
    "tee",

    # Arch-specific
    "arch_prctl",
]

# Absolute deny list — dangerous kernel operations that a sandboxed
# process should never need. Always denied regardless of mode.
deny = [
    "reboot",
    "kexec_load",
    "init_module",
    "finit_module",
    "delete_module",
    "swapon",
    "swapoff",
    "acct",
    "mount",
    "umount2",
    "pivot_root",
    "chroot",
    "syslog",
    "settimeofday",
    # Namespace escapes — a sandboxed process must never create new
    # namespaces or join existing ones.
    "unshare",
    "setns",
    # In-memory code execution — memfd_create + execveat enables fileless
    # execution, a common technique for running malicious payloads without
    # touching disk. Recipes that legitimately need these (e.g., BEAM VM)
    # can opt in via allow_extra.
    "memfd_create",
    "execveat",
]

elixir.toml

# Canister recipe for Elixir/Erlang workloads.
#
# The BEAM VM needs ptrace for :observer, :dbg, and erlang:trace/3.
# This recipe adds BEAM-specific syscalls, Hex.pm network access, and
# BEAM environment variables. System paths are provided by base.toml.
#
# Usage:
#   can run -r elixir -- mix test
#   can run -r elixir -r nix -- iex -S mix

[recipe]
name = "elixir-dev"
description = "Elixir/Erlang development: mix tasks, iex, Phoenix server"
version = "2"

# Strict mode: abort if any isolation layer cannot be set up.
# Recommended for CI. Uncomment to enable.
# strict = true

[syscalls]
# BEAM needs ptrace for tracing/debugging tools (:observer, :dbg).
# BEAM uses memfd_create for JIT code loading (moved to deny list in default baseline).
allow_extra = ["ptrace", "memfd_create"]

[filesystem]
deny = ["/etc/shadow", "/root"]

[network]
egress = "proxy-only"
# Allow hex.pm for dependency fetching and common Elixir registries.
[[host]]
domain = "hex.pm"

[[host]]
domain = "repo.hex.pm"

[[host]]
domain = "builds.hex.pm"

[[host]]
domain = "github.com"
[process]
# BEAM spawns many lightweight processes via OS threads; the default
# scheduler count equals the CPU core count. 256 is generous for most
# mix tasks and development servers.
max_pids = 256
# Environment variables the BEAM commonly needs.
env_passthrough = [
    "PATH",
    "HOME",
    "LANG",
    "TERM",
    "MIX_ENV",
    "MIX_HOME",
    "HEX_HOME",
    "ERL_AFLAGS",
    "ELIXIR_ERL_OPTIONS",
    "RELEASE_COOKIE",
    "RELEASE_NODE",
    "RELEASE_DISTRIBUTION",
    "SECRET_KEY_BASE",
    "DATABASE_URL",
    "PHX_HOST",
    "PHX_SERVER",
    "PORT",
]

example.toml

# Example Canister recipe — a complete sandbox policy for Python scripts.
#
# System paths (/usr/lib, /usr/bin, /lib, /tmp, etc.) are provided by
# base.toml — recipes only need to add application-specific paths.
#
# Usage: can run -r example -- python3 script.py

[recipe]
name = "example"
description = "Example recipe showing all available options"
version = "2"

# Strict mode: abort if any isolation layer cannot be set up.
# Recommended for CI. Uncomment to enable.
# strict = true

[filesystem]
# Paths the sandboxed process can read (mounted read-only).
# System paths are already provided by base.toml — only add app-specific paths here.
allow = []
# Paths the sandboxed process can write to (changes persist on host).
# The working directory ($PWD) is always writable. Uncomment for additional paths:
# allow_write = ["/var/data/myapp"]
# Paths explicitly denied (checked before allow and allow_write).
deny = ["/etc/shadow", "/root"]

[network]
egress = "proxy-only"
# Allowed IPs or CIDRs for direct connections. Empty = no IP-literal
# egress; everything must go through the DNS-resolved `[[host]]` list
# below.
allow_ips = []

# Allowed upstream destinations. The proxy refuses any other host.
[[host]]
domain = "pypi.org"

[[host]]
domain = "files.pythonhosted.org"

[process]
# Max child PIDs.
max_pids = 64
# Restrict which executables the sandbox may run (optional, opt-in).
# By default, all executables are allowed — the real security boundaries
# are network, filesystem, and seccomp. Uncomment to lock down:
# allow_execve = ["/usr/bin/python3"]
# Environment variables passed through from host.
env_passthrough = ["PATH", "HOME", "LANG", "TERM"]

# [resources] section — cgroup v2 resource limits (optional).
# Requires cgroups v2 on the host. Uncomment to enable:
# [resources]
# memory_mb = 512
# cpu_percent = 50

# [syscalls] section — customize the default seccomp baseline.
# Uncomment to add or remove specific syscalls:
# [syscalls]
# seccomp_mode = "allow-list"
# allow_extra = ["ptrace"]        # add ptrace to the allow list
# deny_extra = ["personality"]    # remove personality from allow, add to deny
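
The `[filesystem]` comments in this recipe state that `deny` is checked before `allow` and `allow_write`. A minimal Python sketch of that precedence follows. The function name `check_path`, the result labels, and the component-wise prefix rule are illustrative assumptions, not Canister's implementation.

```python
from pathlib import PurePosixPath

def _covers(rule: str, path: str) -> bool:
    """True if `path` equals `rule` or lies beneath it (component-wise)."""
    rule_parts = PurePosixPath(rule).parts
    return PurePosixPath(path).parts[:len(rule_parts)] == rule_parts

def check_path(path, allow, allow_write, deny):
    """Classify an absolute path as 'deny', 'write', 'read', or 'hidden'."""
    if any(_covers(rule, path) for rule in deny):
        return "deny"    # deny wins, even inside an allowed tree
    if any(_covers(rule, path) for rule in allow_write):
        return "write"
    if any(_covers(rule, path) for rule in allow):
        return "read"
    return "hidden"      # never mounted, so not visible in the sandbox
```

Matching on whole path components means a rule for `/root` does not accidentally cover `/rootfs`.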

flatpak.toml

# Canister recipe for Flatpak applications.
#
# Flatpak installs applications under /var/lib/flatpak (system-wide)
# and $HOME/.local/share/flatpak (per-user).
#
# Auto-detection: triggered by match_prefix

[recipe]
name = "flatpak"
description = "Flatpak applications"
version = "1"
match_prefix = ["/var/lib/flatpak", "$HOME/.local/share/flatpak"]

[filesystem]
allow = ["/var/lib/flatpak", "$HOME/.local/share/flatpak"]

generic-strict.toml

# Canister recipe for strict-mode execution of arbitrary binaries.
#
# Enables strict mode: any setup failure is fatal, seccomp uses
# KILL_PROCESS. No network access. Minimal filesystem (base.toml only).
# Intended for CI pipelines and production jobs where security is paramount.
#
# Usage:
#   can run -r generic-strict -- ./my-binary --flag
#   can run -r generic-strict -- cargo test

strict = true

[recipe]
name = "generic-strict"
description = "Strict no-network policy for untrusted binaries (CI/production)"
version = "2"

[syscalls]
# For compiled binaries that may use ptrace (debuggers), io_uring
# (modern async I/O), personality (multilib), or seccomp (self-sandboxing).
allow_extra = [
    "ptrace",
    "personality",
    "seccomp",
    "io_uring_setup",
    "io_uring_enter",
    "io_uring_register",
]

[filesystem]
deny = ["/etc/shadow", "/root", "/home"]

[network]
egress = "none"

[process]
max_pids = 64
env_passthrough = ["PATH", "LANG", "TERM"]

gnu-store.toml

# Canister recipe for GNU Guix package manager.
#
# Guix uses a content-addressed store at /gnu/store, similar to Nix.
# Binaries reference sibling store entries, so the entire store must
# be mounted.
#
# Auto-detection: triggered by match_prefix

[recipe]
name = "gnu-store"
description = "GNU Guix package manager (/gnu/store)"
version = "1"
match_prefix = ["/gnu/store"]

[filesystem]
allow = ["/gnu/store"]

homebrew.toml

# Canister recipe for Homebrew (Linuxbrew) package manager.
#
# On Linux, Homebrew installs to /home/linuxbrew/.linuxbrew or
# /opt/homebrew (rare on Linux, common path on macOS). Binaries
# reference the Cellar and shared libraries within the prefix.
#
# Auto-detection: triggered by match_prefix

[recipe]
name = "homebrew"
description = "Homebrew/Linuxbrew package manager"
version = "1"
match_prefix = ["/opt/homebrew", "/home/linuxbrew/.linuxbrew"]

[filesystem]
allow = ["/opt/homebrew", "/home/linuxbrew/.linuxbrew"]

neovim.toml

# Canister recipe for Neovim with LSP support.
#
# Mounts the user's full Neovim configuration (config, data, state, cache)
# so plugin managers (lazy.nvim), LSP servers (via Mason), tree-sitter
# parsers, and other tooling work out of the box.
#
# System paths (/usr/lib, /usr/bin, /lib, /tmp, etc.) are provided by
# base.toml — this recipe only adds Neovim-specific paths.
#
# Usage:
#   can run -r neovim -- nvim
#   can run -r neovim -r elixir -r nix -- nvim

[recipe]
name = "neovim"
description = "Neovim editor with LSP, tree-sitter, and plugin support"
version = "2"

[filesystem]
allow = [
    # Neovim XDG directories
    "$HOME/.config/nvim",
    "$HOME/.local/share/nvim",
    "$HOME/.local/state/nvim",
    "$HOME/.cache/nvim",
]
deny = ["/etc/shadow", "/root"]

[network]
egress = "proxy-only"
# LSP servers, Mason, and plugin managers may need network access
# for installation and updates. Lock down to known hosts.
[[host]]
domain = "github.com"

[[host]]
domain = "objects.githubusercontent.com"

[[host]]
domain = "raw.githubusercontent.com"

[[host]]
domain = "api.github.com"

[[host]]
domain = "registry.npmjs.org"

[[host]]
domain = "pypi.org"

[[host]]
domain = "files.pythonhosted.org"

[[host]]
domain = "luarocks.org"

[process]
max_pids = 256
allow_execve = []
env_passthrough = [
    "PATH",
    "HOME",
    "LANG",
    "TERM",
    "COLORTERM",
    "TERMINFO",
    "USER",
    "SHELL",
    "EDITOR",
    "VISUAL",

    # XDG directories (so nvim finds its config)
    "XDG_CONFIG_HOME",
    "XDG_DATA_HOME",
    "XDG_STATE_HOME",
    "XDG_CACHE_HOME",
    "XDG_RUNTIME_DIR",

    # Nix
    "NIX_PATH",
    "NIX_PROFILES",

    # LSP / language tooling
    "CARGO_HOME",
    "RUSTUP_HOME",
    "GOPATH",
    "GOROOT",
    "NODE_PATH",
    "npm_config_prefix",
]

[syscalls]
# Neovim + LSP servers need ptrace (for debugging) and
# memfd_create (used by various runtimes)
allow_extra = ["ptrace", "memfd_create"]

nix.toml

# Canister recipe for Nix package manager.
#
# Nix stores all packages in /nix/store with content-addressed paths.
# Binaries freely reference sibling store entries, so the entire store
# must be mounted. This recipe is auto-detected when the resolved
# command binary lives under /nix/store.
#
# Auto-detection: triggered by match_prefix

[recipe]
name = "nix"
description = "Nix package manager (/nix/store)"
version = "1"
match_prefix = ["/nix/store"]

[filesystem]
# Mount the entire Nix store — binaries reference sibling entries
# via rpaths and wrapper scripts.
allow = ["/nix/store"]

node-build.toml

# Canister recipe for Node.js build tasks (npm/yarn/pnpm).
#
# Allows network access to the npm registry and common CDNs.
# System paths are provided by base.toml — this recipe only adds
# Node.js-specific network and environment configuration.
#
# Usage:
#   can run -r node-build -- npm install
#   can run -r node-build -- npm run build

[recipe]
name = "node-build"
description = "Node.js build tasks: npm install, build, test"
version = "2"

[filesystem]
deny = ["/etc/shadow", "/root"]

[network]
egress = "proxy-only"
[[host]]
domain = "registry.npmjs.org"

[[host]]
domain = "registry.yarnpkg.com"

[[host]]
domain = "registry.npmmirror.com"

[process]
max_pids = 128
env_passthrough = [
    "PATH",
    "HOME",
    "LANG",
    "TERM",
    "NODE_ENV",
    "NPM_CONFIG_REGISTRY",
    "NPM_TOKEN",
]

opencode.toml

# Canister recipe for OpenCode — AI coding agent.
#
# OpenCode is a Go binary that talks to LLM providers (GitHub Copilot,
# Anthropic, OpenAI, etc.) and runs developer tools (git, cargo, npm,
# ripgrep, etc.) to assist with coding tasks.
#
# This recipe scopes filesystem access to the project working directory,
# OpenCode's own state/config dirs, and common tool locations. Network
# is restricted to known LLM API domains.
#
# Usage:
#   can run --recipe recipes/opencode.toml -- opencode
#
# Note: Run from the project directory you want OpenCode to work on.
# The $PWD will be bind-mounted into the sandbox.

[recipe]
name = "opencode"
description = "OpenCode AI coding agent with scoped filesystem and restricted network"
version = "2"

[filesystem]
# OpenCode needs access to:
# - The project directory (implicitly mounted writable as the working dir)
# - Its own config and state dirs
# System paths (/usr/lib, /usr/bin, /lib, /tmp, etc.) are provided by base.toml.
# Language toolchains ($HOME/.cargo, $HOME/.rustup, /nix/store, etc.) should
# be added via recipe composition in canister.toml, NOT baked in here.
#
# SECURITY: We intentionally do NOT mount:
# - $HOME/.ssh        — SSH private keys. Use SSH_AUTH_SOCK (agent) instead.
# - $HOME/.gitconfig  — may contain credential helpers or tokens. Git identity
#                       is passed via GIT_AUTHOR_NAME/EMAIL env vars.
# - $HOME/.cargo      — contains credentials.toml (crates.io tokens). Mount
#                       via the cargo recipe when needed.
# - $HOME/.gnupg      — GPG private keys.
# - $HOME/.aws        — AWS credentials.
# - $HOME/.kube       — Kubernetes config with tokens.
allow = [
    "$HOME/.opencode",
    "$HOME/.config/opencode",
]
# OpenCode writes to its state dir (SQLite DB, logs, tool output).
allow_write = [
    "$HOME/.local/share/opencode",
]
deny = [
    "/etc/shadow",
    "/root",
    "$HOME/.ssh",
    "$HOME/.gnupg",
    "$HOME/.aws",
    "$HOME/.kube",
    "$HOME/.docker",
    "$HOME/.npmrc",
    "$HOME/.pypirc",
    "$HOME/.netrc",
    "$HOME/.config/gh",
]

[network]
egress = "proxy-only"
# GitHub Copilot auth and API
# - github.com: OAuth device flow
# - api.github.com: token exchange and user info
# - api.githubcopilot.com: chat completions endpoint
# - copilot-proxy.githubusercontent.com: Copilot proxy
#
# Anthropic API (Claude models)
# - api.anthropic.com
#
# OpenAI API
# - api.openai.com
#
# OpenCode's own services (Zen, auth, sharing)
# - opencode.ai
#
# WebFetch tool — OpenCode fetches arbitrary URLs for research.
# If you need unrestricted web access, set egress = "direct".
[[host]]
domain = "github.com"

[[host]]
domain = "api.github.com"

[[host]]
domain = "api.githubcopilot.com"

[[host]]
domain = "copilot-proxy.githubusercontent.com"

[[host]]
domain = "api.anthropic.com"

[[host]]
domain = "api.openai.com"

[[host]]
domain = "opencode.ai"

[process]
# OpenCode spawns subprocesses for tools (git, cargo, rg, etc.)
max_pids = 256
# Environment variables OpenCode and its tools need.
env_passthrough = [
    "PATH",
    "HOME",
    "LANG",
    "TERM",
    "COLORTERM",
    "EDITOR",
    "SHELL",
    "USER",
    "XDG_CONFIG_HOME",
    "XDG_DATA_HOME",
    "XDG_STATE_HOME",
    "XDG_CACHE_HOME",
    "GIT_AUTHOR_NAME",
    "GIT_AUTHOR_EMAIL",
    "GIT_COMMITTER_NAME",
    "GIT_COMMITTER_EMAIL",
    "HTTPS_PROXY",
    "HTTP_PROXY",
    "NO_PROXY",
    "SSH_AUTH_SOCK",
    "CARGO_HOME",
    "RUSTUP_HOME",
    "NODE_PATH",
]

python-pip.toml

# Canister recipe for installing Python packages with pip.
#
# Allows network access to PyPI. System paths are provided by base.toml.
#
# Usage:
#   can run -r python-pip -- pip install requests
#   can run -r python-pip -- python3 -m pip install -r requirements.txt

[recipe]
name = "python-pip"
description = "Install Python packages with pip (network access to PyPI)"
version = "2"

[filesystem]
deny = ["/etc/shadow", "/root"]

[network]
egress = "proxy-only"
[[host]]
domain = "pypi.org"

[[host]]
domain = "files.pythonhosted.org"

[process]
max_pids = 32
env_passthrough = [
    "PATH",
    "HOME",
    "LANG",
    "TERM",
    "PIP_INDEX_URL",
    "PIP_TRUSTED_HOST",
    "VIRTUAL_ENV",
]

snap.toml

# Canister recipe for Snap packages.
#
# Snap installs applications under /snap with the runtime under
# /snap/core*. Binaries are launched via /snap/bin wrappers.
#
# Auto-detection: triggered by match_prefix

[recipe]
name = "snap"
description = "Snap package manager (/snap)"
version = "1"
match_prefix = ["/snap"]

[filesystem]
allow = ["/snap"]

Architecture

This document describes the internal design of Canister, the execution flow of a sandboxed process, and the security properties of each isolation layer.


Design Principles

  1. Unprivileged by default. No root, no suid, no capabilities. Everything runs as the calling user using unprivileged user namespaces.

  2. Defense in depth. Multiple independent isolation mechanisms. Bypassing one layer does not compromise the others.

  3. Fail closed. When a feature cannot be set up (e.g., a MAC system blocks mounts), Canister aborts. All setup failures are fatal in both normal and strict mode — the sandbox runs at full strength or not at all.

  4. Single binary. No runtime dependencies beyond the Linux kernel (and optionally pasta for filtered networking). No dynamic linking to external libraries.

  5. One-shot execution. Fork, isolate, exec, wait, exit. No daemon, no long-running supervisor process. The sandbox lifetime equals the command lifetime.


Crate Structure

canister/
├── can-cli        CLI binary. Argument parsing (clap), recipe
│                  resolution (name-based lookup, auto-detection
│                  via match_prefix), composition chain assembly,
│                  `can up` (manifest-driven sandboxes from
│                  canister.toml), `can recipe show` (emit resolved
│                  policy as TOML), can init / can update lifecycle
│                  commands.
│
├── can-sandbox    Core runtime. Orchestrates the fork/unshare/exec
│                  sequence. Contains the namespace, overlay, and
│                  seccomp modules.
│
├── can-policy     Policy engine. TOML config parsing, RecipeFile
│                  merge logic, environment variable expansion
│                  ($HOME, $USER, etc.), access control enforcement
│                  (path, domain, IP/CIDR), seccomp profile
│                  definitions. Also contains the project manifest
│                  module (manifest.rs) for canister.toml parsing,
│                  validation, and upward directory discovery.
│                  No Linux-specific code.
│
├── can-net        Network isolation. Network namespace setup,
│                  loopback interface, pasta integration,
│                  DNS proxy with domain filtering.
│
└── can-log        Logging setup. TTY detection, human vs JSON
                   output selection, monitor-mode event types
                   and summary output.

Dependencies flow downward: can-cli -> can-sandbox -> can-policy, can-net. can-policy and can-log have no internal dependencies.

Outbound Defense Model (Filtered + Proxy + DLP)

When the proxy is enabled with enforcement, outbound networking follows a three-layer model:

  1. Kernel first-line (seccomp USER_NOTIF): sandboxed processes may connect only to the local proxy’s loopback endpoint and the DNS server; all other direct outbound INET/INET6 traffic is denied.
  2. Proxy second-line (user space): the proxy validates the destination against the allow policy and forwards the connection via either the L7 HTTP interception path or the L4 CONNECT passthrough path.
  3. DLP third-line (content scanning): when [network.dlp] is enabled (implicit under --strict + proxy-only), the L7 path scans request headers, URI, and body for credential patterns (GitHub PATs, npm tokens, AWS keys, SSH keys, etc.) and enforces per-detector domain scoping. A GitHub PAT bound for registry.npmjs.org is blocked even though registry.npmjs.org has a [[host]] block. Bodies are decompressed (gzip/deflate/brotli) and decoded (base64/hex/percent, up to 32 layers) before pattern matching. See DLP.md for the threat model, detector list, and canary-token / session-entropy-budget mechanisms.

This prevents bypass by unsetting proxy environment variables, and prevents exfiltration of credentials that the sandbox legitimately needs read access to.
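
Per-detector domain scoping can be pictured with a small Python sketch. Everything in it is an assumption for illustration: the `DETECTORS` table, its two simplified regexes, the scope domains, and the `scan` function are hypothetical, not Canister's detector list or API. Only base64 unwrapping is shown; the gzip/deflate/brotli, hex, and percent decoding mentioned above is left out.

```python
import base64
import re

# Hypothetical detectors: name -> (simplified pattern, domains the secret may go to).
DETECTORS = {
    "github_pat": (re.compile(r"ghp_[A-Za-z0-9]{36}"), {"github.com"}),
    "aws_access_key": (re.compile(r"AKIA[0-9A-Z]{16}"), {"amazonaws.com"}),
}
MAX_DECODE_LAYERS = 32  # the text mentions decoding up to 32 layers

def in_scope(dest: str, scope: set) -> bool:
    """True if dest is one of the scoped domains or a subdomain of one."""
    return any(dest == d or dest.endswith("." + d) for d in scope)

def decoded_views(body: str):
    """Yield the body, then each successive base64-decoded layer of it."""
    yield body
    current = body
    for _ in range(MAX_DECODE_LAYERS):
        try:
            current = base64.b64decode(current, validate=True).decode("utf-8")
        except ValueError:  # not base64, or not text: stop unwrapping
            return
        yield current

def scan(dest_domain: str, body: str):
    """Return the detectors whose secret is headed to an out-of-scope domain."""
    violations = set()
    for view in decoded_views(body):
        for name, (pattern, scope) in DETECTORS.items():
            if pattern.search(view) and not in_scope(dest_domain, scope):
                violations.add(name)
    return sorted(violations)
```

Under these assumptions, `scan("registry.npmjs.org", body)` reports `github_pat` for a body carrying a GitHub token, while the same body sent to `api.github.com` passes, which mirrors the example in the third item above.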


Execution Flow

Canister supports two entry points:

  • can run -r ... -- command — ad-hoc sandboxing with explicit recipe flags
  • can up [name] — manifest-driven sandboxing from canister.toml

Both converge on the same fork/unshare/exec pipeline. The only difference is how the recipe chain is assembled in step 1.

Manifest Discovery (can up)

When can up is invoked, the CLI discovers canister.toml by walking up from the current directory (like .gitignore). It parses the manifest, resolves the named sandbox (or, if no name is given, the first sandbox in alphabetical order), and assembles the recipe chain from the manifest’s recipes = [...] list plus any per-sandbox overrides such as [sandbox.<name>.filesystem] and [sandbox.<name>.network].
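
The discovery and sandbox-selection steps can be sketched in a few lines of Python. The helper names `find_manifest` and `pick_sandbox` are hypothetical, and the sketch assumes the search stops at the first canister.toml found on the way up to the filesystem root.

```python
from pathlib import Path
from typing import Optional

def find_manifest(start: Path, name: str = "canister.toml") -> Optional[Path]:
    """Walk from `start` toward the root; return the first manifest found."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None

def pick_sandbox(sandboxes: dict, requested: Optional[str] = None) -> str:
    """Use the requested sandbox, else the alphabetically first one."""
    if requested is not None:
        if requested not in sandboxes:
            raise KeyError(f"no sandbox named {requested!r}")
        return requested
    return min(sandboxes)
```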

Composition order for can up:

base.toml
  → auto-detected recipes (match_prefix against command binary)
  → recipes listed in manifest (left to right)
  → manifest overrides ([sandbox.<name>.filesystem], etc.)
  = final SandboxConfig

This replaces the explicit --recipe flags from can run with the manifest’s declarative recipe list. The resolved SandboxConfig is identical in structure and is passed to the same sandbox runtime.
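
The layering order above can be illustrated with a short Python sketch. The merge rule used here is an assumption made for the example (lists are unioned in order, tables merge recursively, later scalars win); the actual semantics are defined by the merge-semantics reference, and `merge_layer` and `compose` are hypothetical names.

```python
def merge_layer(base: dict, layer: dict) -> dict:
    """Merge one recipe layer onto `base` without mutating either input.

    ASSUMED rule: lists union in order, tables merge recursively,
    scalars from the later layer win.
    """
    merged = dict(base)
    for key, value in layer.items():
        current = merged.get(key)
        if isinstance(value, dict) and isinstance(current, dict):
            merged[key] = merge_layer(current, value)
        elif isinstance(value, list) and isinstance(current, list):
            merged[key] = current + [v for v in value if v not in current]
        else:
            merged[key] = value
    return merged

def compose(base, auto_detected, listed, overrides):
    """base -> auto-detected -> manifest recipes (left to right) -> overrides."""
    config = {}
    for layer in (base, *auto_detected, *listed, overrides):
        config = merge_layer(config, layer)
    return config
```

Because every layer passes through the same function, a `can run` chain would be the same call with the explicit `--recipe` layers in place of the manifest list and an empty override table.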

can run Flow

The complete lifecycle of can run -r nix -r elixir -- mix test:

┌─────────────────────────────────────────────────────────────────────┐
│ 1. CLI SETUP                                                        │
│    a. Parse args, resolve + canonicalize command path               │
│    b. Load base.toml (embedded, overridable)                        │
│    c. Auto-detect recipes: match resolved binary path against       │
│       match_prefix in all discovered recipe files                   │
│    d. Load explicit --recipe args (name-based lookup or file path)  │
│    e. Merge recipe chain: base → auto-detected → explicit (L-to-R) │
│    f. Expand env vars ($HOME, $USER, etc.) in merged config         │
│    g. Validate allow_execve, determine network mode                 │
└──────────────────────────┬──────────────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────────────┐
│ 1b. NOTIFIER SETUP (before fork)                                    │
│    a. Resolve notifier_enabled (config override / monitor mode /    │
│       kernel version auto-detect)                                   │
│    b. If enabled: create anonymous Unix socket pair for fd passing  │
│       (parent_sock, child_sock)                                     │
│    c. Pre-resolve allowed domains to IPs (already done in network   │
│       setup — IPs stored for building notifier policy)              │
└──────────────────────────┬──────────────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────────────┐
│ 2. FORK                                                             │
│    Create three pipes (child_ready, maps_done, network_done).       │
│    Capture UID/GID. Call fork().                                    │
└──────────┬──────────────────────────────────────┬───────────────────┘
           │                                      │
    ┌──────▼──────┐                        ┌──────▼──────┐
    │   PARENT    │                        │    CHILD    │
    │             │                        │             │
    │             │                        │ 3. UNSHARE  │
    │             │                        │    Phase 1: │
    │             │                        │    USER+PID │
    │             │                        │    [+NET]   │
    │             │    "ready" ◄────────── │             │
    │             │                        │             │
    │ 4. UID/GID  │                        │   (blocks   │
    │    MAPPING  │                        │   maps_done)│
    │    Write    │                        │             │
    │    /proc/   │                        │             │
    │    <pid>/   │                        │             │
    │    uid_map  │                        │             │
    │    gid_map  │                        │             │
    │             │ ──► "maps_done"        │             │
    │             │                        │   (blocks   │
    │ 5. NETWORK  │                        │   net_done) │
    │    Start    │                        │             │
    │    pasta    │                        │             │
    │    --userns │                        │             │
    │    --netns  │                        │             │
    │             │ ──► "net_done"         │             │
    │             │                        │             │
    │             │                        │ 5b.UNSHARE  │
    │             │                        │    Phase 2: │
    │             │                        │    NEWNS    │
    │             │                        │             │
    │             │                        │ 6. PID NS   │
    │             │                        │    First    │
    │             │                        │    fork()   │
    │             │                        │    (creates │
    │             │                        │    new PID  │
    │             │                        │    ns)      │
    │             │                        │             │
    │             │                        │    Interme- │
    │             │                        │    diate:   │
    │             │                        │    waitpid  │
    │             │                        │    + exit   │
    │             │                        │             │
    │             │                        │ 6a. SECOND  │
    │             │                        │    FORK     │
    │             │                        │    (when    │
    │             │                        │    notifier │
    │             │                        │    enabled) │
    │             │                        │             │
    │             │                        │   ┌─ PID 1: │
    │             │                        │   │  SUPER- │
    │             │                        │   │  VISOR  │
    │             │                        │   │  unshare│
    │             │                        │   │  NEWNS  │
    │             │                        │   │  mount  │
    │             │                        │   │  /proc  │
    │             │                        │   │  recv   │
    │             │                        │   │  notif  │
    │             │                        │   │  fd via │
    │             │                        │   │  SCM_   │
    │             │                        │   │  RIGHTS │
    │             │                        │   │  poll + │
    │             │                        │   │  waitpid│
    │             │                        │   │  loop   │
    │             │                        │   │         │
    │             │                        │   └─ PID 2: │
    │             │                        │      WORKER │
    │             │                        │      setsid │
    │             │                        │             │
    │             │                        │ 6b. CGROUP  │
    │             │                        │    Create   │
    │             │                        │    child    │
    │             │                        │    cgroup,  │
    │             │                        │    write    │
    │             │                        │    memory   │
    │             │                        │    .max +   │
    │             │                        │    cpu.max  │
    │             │                        │    (before  │
    │             │                        │    pivot_   │
    │             │                        │    root)    │
    │             │                        │             │
    │             │                        │ 7. OVERLAY  │
    │             │                        │    tmpfs    │
    │             │                        │    root,    │
    │             │                        │    bind     │
    │             │                        │    mounts   │
    │             │                        │    (from    │
    │             │                        │    merged   │
    │             │                        │    config), │
    │             │                        │    CWD bind │
    │             │                        │    mount    │
    │             │                        │    (RW),    │
    │             │                        │    pivot_   │
    │             │                        │    root,    │
    │             │                        │    chdir()  │
    │             │                        │             │
    │             │                        │ 7b. PROC    │
    │             │                        │    HARDEN   │
    │             │                        │    Mask     │
    │             │                        │    /proc/*  │
    │             │                        │             │
    │             │                        │ 8. NET      │
    │             │                        │    SETUP    │
    │             │                        │    loopback │
    │             │                        │    resolv.  │
    │             │                        │    conf     │
    │             │                        │             │
    │             │                        │ 9. PROCESS  │
    │             │                        │    RLIMIT   │
    │             │                        │    NPROC    │
    │             │                        │             │
    │             │                        │ 10. NOTIF   │
    │             │                        │    FILTER   │
    │             │                        │    Worker   │
    │             │                        │    installs │
    │             │                        │    USER_    │
    │             │                        │    NOTIF    │
    │             │                        │    BPF,     │
    │             │                        │    sends fd │
    │             │                        │    to PID 1 │
    │             │                        │    (super-  │
    │             │                        │    visor)   │
    │             │                        │    via SCM_ │
    │             │                        │    RIGHTS   │
    │             │                        │             │
    │             │                        │ 11. SECCOMP │
    │             │                        │    Load     │
    │             │                        │    main BPF │
    │             │                        │    filter   │
    │             │                        │             │
    │             │                        │ 12. ENV     │
    │             │                        │    Filter   │
    │             │                        │    env vars │
    │             │                        │             │
    │             │                        │ 13. EXEC    │
    │             │                        │    execve() │
    │             │                        │             │
    │ 14. WAIT    │                        │  (running)  │
    │     waitpid │                        │             │
    │             │                        │ (exits)     │
    │             │                        └─────────────┘
    │ 15. CLEANUP │
    │    Kill     │
    │    pasta    │
    │    Stop DNS │
    │    proxy    │
    │    Return   │
    │    exit code│
    └─────────────┘

Critical ordering constraints:

  • Recipe composition (load, merge, env expansion) happens entirely in the CLI layer before forking. The child receives an already-resolved SandboxConfig.
  • unshare() is split into two phases. Phase 1: unshare(CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET) — creates user, PID, and network namespaces. Phase 2: unshare(CLONE_NEWNS) — creates the mount namespace. The split is necessary so pasta can access /proc/<child_pid>/ns/net before the child’s mount namespace changes.
  • UID/GID maps must be written from the parent process. The child cannot write its own maps after unshare(CLONE_NEWUSER).
  • pasta must be started after the child creates CLONE_NEWNET and after UID/GID maps are written, but before the child calls unshare(CLONE_NEWNS). pasta is invoked with --userns /proc/<child_pid>/ns/user --netns /proc/<child_pid>/ns/net --runas <uid>.
  • The inner fork for PID namespace must happen before filesystem setup so /proc mount reflects the new PID namespace.
  • setsid() must be called after the inner PID namespace fork. PID 1 inherits an invisible session/process-group from the parent namespace; without setsid(), bash’s job control initialization fails (getpgrp returns the parent-namespace group).
  • /proc hardening must happen after overlay + /proc mount but before seccomp.
  • RLIMIT_NPROC must be set before seccomp (which blocks prctl).
  • Cgroups v2 setup must happen before pivot_root, because the cgroup filesystem (/sys/fs/cgroup) is on the host and becomes inaccessible after the root is swapped. This is step 6b in the execution flow.
  • The CWD bind-mount must happen during overlay setup (step 7), before pivot_root. The host’s current working directory is captured before unshare() and bind-mounted writable into the new root. After pivot_root, the child calls chdir() to the mounted CWD path.
  • The notifier filter must be installed before the main seccomp filter. The seccomp() syscall with SECCOMP_FILTER_FLAG_NEW_LISTENER returns the notification fd. The worker (PID 2) sends this fd to PID 1 (supervisor) via SCM_RIGHTS, then installs the main filter via prctl(PR_SET_SECCOMP).
  • PID 1 (the supervisor) must receive the notifier fd and begin its poll loop before the worker calls execve(), so the supervisor is ready to handle notifications from the target program.
  • Seccomp must be loaded after all setup is complete, right before exec.
  • Environment filtering happens at exec time — execve() receives the filtered environment directly.
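
The parent/child synchronization described above (child_ready, maps_done, network_done) can be sketched with plain pipes. This Python example is a simplified illustration, not Canister's code: it forks a real child and enforces the same ordering, but the namespace, mapping, and pasta steps are only recorded as log entries, and the extra log pipe exists purely so the example can report what the child did.

```python
import os

def run_handshake():
    """Walk parent and child through the three-pipe handshake.

    Returns (parent_events, child_events). Nothing is actually unshared.
    """
    ready_r, ready_w = os.pipe()  # child -> parent: phase 1 unshare finished
    maps_r, maps_w = os.pipe()    # parent -> child: uid_map/gid_map written
    net_r, net_w = os.pipe()      # parent -> child: networking is up
    log_r, log_w = os.pipe()      # child -> parent: event log (illustration only)

    pid = os.fork()
    if pid == 0:  # CHILD
        for fd in (ready_r, maps_w, net_w, log_r):
            os.close(fd)
        events = ["unshare phase 1 (USER+PID+NET)"]
        os.write(ready_w, b"1")
        os.read(maps_r, 1)        # blocks until the parent has written the maps
        events.append("maps_done received")
        os.read(net_r, 1)         # blocks until the parent has started pasta
        events.append("net_done received")
        events.append("unshare phase 2 (NEWNS)")
        os.write(log_w, "\n".join(events).encode())
        os._exit(0)

    # PARENT
    for fd in (ready_w, maps_r, net_r, log_w):
        os.close(fd)
    parent_events = []
    os.read(ready_r, 1)           # wait until the child's namespaces exist
    parent_events.append("write uid_map/gid_map")
    os.write(maps_w, b"1")
    parent_events.append("start pasta")
    os.write(net_w, b"1")
    chunks = []
    while True:                   # read the child's log until it closes the pipe
        chunk = os.read(log_r, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.waitpid(pid, 0)
    for fd in (ready_r, maps_w, net_w, log_r):
        os.close(fd)
    return parent_events, b"".join(chunks).decode().split("\n")
```

Each blocking `os.read` is what keeps a step from running early: the child cannot reach its second unshare until the parent has both written the maps and attached the network.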

Isolation Layers

1. User Namespaces

Syscall: unshare(CLONE_NEWUSER)

The child process gets a new user namespace where it is mapped as UID 0 / GID 0. This gives it “root inside the namespace” which is required for mount operations, but grants zero real privileges on the host.

The parent writes the mapping:

/proc/<pid>/setgroups → "deny"
/proc/<pid>/uid_map   → "0 <host_uid> 1"
/proc/<pid>/gid_map   → "0 <host_gid> 1"

Security property: The child appears to be root but cannot affect any resources outside its namespace. All privilege checks are scoped to the namespace.
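
As a rough Python sketch, the three writes listed above look like this when performed by the parent. The function names are hypothetical. The order matters: the kernel only lets an unprivileged process write gid_map after setgroups has been set to "deny".

```python
import os

def id_map_writes(child_pid: int, host_uid: int, host_gid: int):
    """Return the (path, content) pairs the parent writes, in order."""
    proc = f"/proc/{child_pid}"
    return [
        (f"{proc}/setgroups", "deny"),           # must precede gid_map
        (f"{proc}/uid_map", f"0 {host_uid} 1"),  # namespace UID 0 -> host UID
        (f"{proc}/gid_map", f"0 {host_gid} 1"),  # namespace GID 0 -> host GID
    ]

def apply_id_maps(child_pid: int) -> None:
    """Write the maps for a child that already called unshare(CLONE_NEWUSER)."""
    for path, content in id_map_writes(child_pid, os.getuid(), os.getgid()):
        with open(path, "w") as handle:
            handle.write(content)
```

The trailing `1` in each map is the range length: exactly one ID is mapped, so no other host UID or GID is reachable from inside the namespace.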

2. Mount Namespace + pivot_root

Syscall: unshare(CLONE_NEWNS) + pivot_root()

The child gets its own mount table. The setup sequence:

1.  mount("", "/", MS_SLAVE | MS_REC)     # break propagation to host
2.  mount("tmpfs", new_root)               # empty tmpfs as new root
3.  mkdir skeleton dirs                     # /bin, /lib, /usr, /proc, /dev, /tmp, ...
4.  bind-mount essentials (read-only)       # from base.toml: /bin, /sbin, /usr/bin, ...
5.  bind-mount allowed paths (RO)           # from merged [filesystem].allow (all recipes)
5b. bind-mount CWD (read-write)            # host working directory, always mounted
6.  mount /tmp (read-write)                 # ephemeral writable space
7.  mount /proc                             # needed by many programs
8.  set up /dev                             # null, zero, urandom, tty, fd symlinks
9.  pivot_root(new_root, old_root)          # swap filesystem root
10. umount(old_root, MNT_DETACH)           # detach host filesystem entirely
11. chdir(cwd_path)                         # restore working directory inside new root

Recipe-based mount resolution:

The paths visible inside the sandbox come from the merged recipe chain (base.toml → auto-detected → explicit). There is no hardcoded prefix detection at runtime. Instead:

  1. base.toml defines essential OS bind mounts (/bin, /sbin, /usr/bin, /usr/sbin, /lib, /lib64, /usr/lib, /etc). It is embedded in the binary via include_str!() and overridable on disk, following the same pattern as default.toml.

  2. Auto-detected recipes provide package-manager mounts. Each recipe declares match_prefix patterns in its [recipe] metadata. During CLI setup (before fork), the resolved binary path is matched against all discovered recipes. Matching recipes are merged into the chain, bringing their [filesystem].allow paths with them:

    Recipe          match_prefix                                  Adds to allow
    ──────────────────────────────────────────────────────────────────────────────────────────────
    nix.toml        /nix/store                                    /nix/store
    homebrew.toml   /opt/homebrew, /home/linuxbrew/.linuxbrew     /opt/homebrew (or linuxbrew)
    cargo.toml      $HOME/.cargo, $HOME/.rustup                   $HOME/.cargo, $HOME/.rustup
    snap.toml       /snap                                         /snap
    flatpak.toml    /var/lib/flatpak, $HOME/.local/share/flatpak  prefix paths
    gnu-store.toml  /gnu/store                                    /gnu/store

  3. Explicit recipes (--recipe / -r flags) add whatever [filesystem].allow paths they declare.

  4. Environment variable expansion ($HOME, $USER, ${XDG_CONFIG_HOME}) is performed during into_sandbox_config(), after merge but before the paths are used by the overlay module.

This design means adding support for a new package manager is “write a .toml file” rather than “modify Rust code”. The detect_command_prefix() function was removed entirely.
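As a rough illustration of the auto-detection step, the following Python sketch expands environment variables in each match_prefix and compares the result with the resolved binary path. The RECIPES mapping, the function names, and the rule that a prefix only matches on a / boundary are assumptions made for this example; Canister itself is written in Rust and may differ in detail.

```python
from string import Template

# Hypothetical view of discovered recipes: file name -> match_prefix entries.
RECIPES = {
    "nix.toml": ["/nix/store"],
    "cargo.toml": ["$HOME/.cargo", "$HOME/.rustup"],
    "snap.toml": ["/snap"],
}

def expand(path, env):
    """Expand $VAR and ${VAR}; unknown variables are left untouched."""
    return Template(path).safe_substitute(env)

def under_prefix(path, prefix):
    """True if path equals prefix or lies below it (assumed / boundary rule)."""
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")

def matching_recipes(binary_path, recipes, env):
    """Return the recipes whose match_prefix covers the resolved binary path."""
    return [
        name
        for name, prefixes in recipes.items()
        if any(under_prefix(binary_path, expand(p, env)) for p in prefixes)
    ]
```

With HOME set to /home/u, a binary at /home/u/.cargo/bin/rg selects only cargo.toml, while /snapshots/tool selects nothing because /snap must be followed by a path separator.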

Security model: Filesystem visibility does not equal execution permission. Mounted paths are visible inside the sandbox, but allow_execve and the USER_NOTIF supervisor’s execve()/execveat() filtering control what can actually be executed.

Security property: The process cannot see or access any host path that was not explicitly included in the merged recipe chain. The host’s current working directory is always bind-mounted writable so the sandboxed process can read/write files in its working directory. All other writes go to tmpfs and are discarded when the process exits.

MAC systems: When a Mandatory Access Control system (AppArmor on Ubuntu, SELinux on Fedora/RHEL) blocks mount operations, filesystem isolation cannot be established and the sandbox aborts. Run sudo can setup to install the appropriate security policy (see MAC section).

3. Network Namespace

Syscall: unshare(CLONE_NEWNET) + pasta

Three modes, determined from config:

None mode: The sandbox has an empty network namespace with only loopback. No external connectivity.

Filtered mode: The parent starts pasta which mirrors the host’s network configuration into the child’s network namespace. pasta copies the host’s real IP addresses, routes, and gateway into the namespace:

┌──────────────────────────────────┐
│         Host network             │
│                                  │
│   pasta ◄──── namespace fd       │
│       │                          │
│       │  mirrors host config     │
│       │                          │
└───────┼──────────────────────────┘
        │
┌───────┼──────────────────────────┐
│       ▼      Sandbox network     │
│   Host's real IP (mirrored)      │
│   gateway: host's default gw     │
│   DNS: 169.254.0.1 (link-local)  │
│                                  │
│   ┌─────────────────────────┐    │
│   │   sandboxed process     │    │
│   └─────────────────────────┘    │
└──────────────────────────────────┘

Allowed domains are pre-resolved to IP addresses at startup (from the parent, which still has host DNS access). These resolved IPs are passed to the USER_NOTIF supervisor, which intercepts connect() syscalls and validates the destination IP against the allow list. A DNS proxy runs in the parent process on an ephemeral port, filtering DNS queries to only resolve allowed domains. The sandbox’s /etc/resolv.conf is configured to use pasta’s DNS address (169.254.0.1:53, set via --dns), which routes queries to the parent’s DNS proxy via --dns-forward. This prevents DNS-based information exfiltration.
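The destination check on connect() can be pictured with a short Python sketch. It assumes the policy is available as a list of pre-resolved IP strings plus a list of allow_ips CIDR strings; the function name and argument layout are hypothetical and only illustrate the decision, not the supervisor's actual code.

```python
import ipaddress

def connect_allowed(dest, resolved_ips, allow_cidrs):
    """Decide whether a connect() destination passes the allow list."""
    addr = ipaddress.ip_address(dest)
    if addr.is_loopback:
        return True  # loopback is always reachable inside the sandbox
    if addr in {ipaddress.ip_address(ip) for ip in resolved_ips}:
        return True  # IP was resolved from an allowed domain at startup
    # Fall back to explicit CIDR rules; a v4 address never matches a v6 network.
    return any(addr in ipaddress.ip_network(cidr) for cidr in allow_cidrs)
```

For example, connect_allowed("10.1.2.3", [], ["10.0.0.0/8"]) is accepted through the CIDR rule, while an address that appears in neither list is rejected.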

Port forwarding: When -p / --port flags are specified, pasta is configured with explicit port forwarding rules via -t (TCP) and -u (UDP) options. Auto-forwarding is disabled (-t none -u none) and only the specified ports are forwarded.

Full mode: No CLONE_NEWNET. The sandbox shares the host network.

Security property: In None mode, the process has zero network access. In Filtered mode, connectivity is routed through pasta, and the USER_NOTIF supervisor enforces IP-level connect() filtering against the allowed domain/IP list. DNS queries are restricted to allowed domains. In Full mode, there is no network isolation.

4. Seccomp BPF

Syscall: prctl(PR_SET_NO_NEW_PRIVS) + prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER)

A classic BPF program is loaded right before execve(). The filter is generated at runtime from the default baseline defined in recipes/default.toml (~187 allowed, ~18 always-denied) plus any [syscalls] overrides (allow_extra / deny_extra).

When the USER_NOTIF supervisor is enabled, two BPF filters are installed:

  1. Notifier filter (installed first, via seccomp() with SECCOMP_FILTER_FLAG_NEW_LISTENER): Returns SECCOMP_RET_USER_NOTIF for eight intercepted syscalls (connect, sendto, sendmsg, clone, clone3, socket, execve, execveat). All others return SECCOMP_RET_ALLOW.

  2. Main filter (installed second, via prctl(PR_SET_SECCOMP)): The standard allow-list or deny-list filter described below.

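The kernel runs every installed filter on each syscall and applies the result with the highest precedence. SECCOMP_RET_USER_NOTIF outranks SECCOMP_RET_ALLOW, so when the main filter allows one of the intercepted syscalls, the kernel delivers the notification to the supervisor. A syscall that the main filter rejects with SECCOMP_RET_ERRNO or a kill action never reaches the supervisor, because those actions outrank SECCOMP_RET_USER_NOTIF. See Seccomp USER_NOTIF Supervisor for details.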

When the notifier is disabled (kernel < 5.9, monitor mode, or notifier = false), only the main filter is installed.

The baseline is embedded in the binary via include_str!() so it works standalone. At runtime, Canister searches for an external default.toml in ./.canister/, $XDG_CONFIG_HOME/canister/recipes/, and /etc/canister/recipes/. If found, the external file takes precedence over the embedded copy. This lets users pin, audit, or version-control the baseline without recompiling.

Two modes — allow-list (default) and deny-list:

Mode                  Default action  Listed syscalls  Recommended for
──────────────────────────────────────────────────────────────────────────────────────
Allow-list (default)  DENY            Permitted        Production, CI
Deny-list             ALLOW           Blocked          Compatibility, unknown workloads

Allow-list mode is the default and recommended mode. Only syscalls explicitly listed in the profile are permitted; everything else is denied. This gives a much smaller attack surface than a deny-list, where any syscall that is not named stays permitted.

BPF program structure (allow-list mode):

Instruction  What it does
─────────────────────────────────────────────────
[0]          Load seccomp_data.arch
[1]          If arch == x86_64: skip to [3]
[2]          Return KILL_PROCESS (wrong architecture)
[3]          Load seccomp_data.nr (syscall number)
[4]          If nr == allowed_0: jump to [ALLOW]
[5]          If nr == allowed_1: jump to [ALLOW]
...
[N]          If nr == allowed_K: jump to [ALLOW]
[N+1]        Return ERRNO(EPERM) (no match → denied)
[N+2]        Return ALLOW (match → permitted)

BPF program structure (deny-list mode):

Instruction  What it does
─────────────────────────────────────────────────
[0]          Load seccomp_data.arch
[1]          If arch == x86_64: skip to [3]
[2]          Return KILL_PROCESS (wrong architecture)
[3]          Load seccomp_data.nr (syscall number)
[4]          If nr == denied_0: jump to [DENY]
[5]          If nr == denied_1: jump to [DENY]
...
[N]          If nr == denied_K: jump to [DENY]
[N+1]        Return ALLOW (no match → permitted)
[N+2]        Return ERRNO(EPERM) (match → denied)

The mode is selected via [syscalls] seccomp_mode in the config file (default: "allow-list").
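The two program layouts above reduce to the same decision procedure with the hit and miss outcomes swapped. The following Python sketch models that procedure so the behaviour can be checked by hand; it is a simulation under stated assumptions, not the BPF generator itself. The function name and the strict parameter (mirroring --strict) are illustrative, and the syscall numbers in the usage note are the x86_64 values.

```python
AUDIT_ARCH_X86_64 = 0xC000003E  # native architecture value in seccomp_data.arch
AUDIT_ARCH_I386 = 0x40000003    # 32-bit compatibility architecture

def evaluate(arch, nr, listed, mode="allow-list", strict=False):
    """Return the action the filter would produce for one syscall."""
    if arch != AUDIT_ARCH_X86_64:
        return "KILL_PROCESS"           # wrong architecture
    deny = "KILL_PROCESS" if strict else "ERRNO(EPERM)"
    hit = nr in listed                  # the chain of 'nr == listed_k' comparisons
    if mode == "allow-list":
        return "ALLOW" if hit else deny
    return deny if hit else "ALLOW"     # deny-list mode
```

With listed = {0, 1} (read and write), an allow-list filter permits syscall 0 and rejects syscall 165 (mount) with EPERM; the same set in deny-list mode blocks 0 and permits 165.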

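Architecture validation: The first check rejects any syscall made through a non-native architecture. On x86_64, this blocks the 32-bit i386 compatibility ABI, whose syscall numbers differ from the native ones. The x32 ABI is a separate case: it reports the same AUDIT_ARCH_X86_64 value and is distinguished only by the __X32_SYSCALL_BIT flag in the syscall number, so in allow-list mode x32 calls are denied because their numbers match no allowed entry.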

Deny action: In normal mode, Canister uses SECCOMP_RET_ERRNO | EPERM which allows the sandboxed process to handle denied syscalls gracefully. In strict mode (--strict), it uses SECCOMP_RET_KILL_PROCESS — the process is killed immediately on any denied syscall.

Security property: Even if a process escapes all namespace isolation, it cannot invoke unlisted syscalls. The filter is enforced by the kernel and cannot be removed or modified by the filtered process (loading new seccomp filters is blocked by the default baseline’s deny list).

4b. Seccomp USER_NOTIF Supervisor

Syscall: seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER)

Module: notifier.rs

The USER_NOTIF supervisor extends seccomp BPF with argument-level inspection. Classic BPF can only check the syscall number and architecture — it cannot dereference pointers or read memory. The supervisor intercepts specific syscalls via SECCOMP_RET_USER_NOTIF, reads the actual argument data from /proc/<pid>/mem, and makes a policy decision.
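To make "reads the actual argument data" concrete, here is a Python sketch that decodes the bytes of a struct sockaddr_in, the structure a connect() call points to for IPv4. In the real supervisor the bytes would come from /proc/<pid>/mem at the address held in the syscall argument; this example takes them as a plain buffer. The function name is hypothetical and the layout is the Linux one.

```python
import ipaddress
import struct

AF_INET = 2  # Linux value of the IPv4 address family

def parse_sockaddr_in(buf):
    """Decode (ip, port) from raw sockaddr_in bytes, or None if not IPv4."""
    if len(buf) < 8:
        return None
    (family,) = struct.unpack_from("=H", buf, 0)   # sa_family, host byte order
    if family != AF_INET:
        return None
    (port,) = struct.unpack_from("!H", buf, 2)     # sin_port, network byte order
    addr = ipaddress.IPv4Address(bytes(buf[4:8]))  # sin_addr, network byte order
    return str(addr), port
```

The decoded pair is what a policy check would compare against the allowed IPs; a buffer that is too short or carries another address family is reported as None so the caller can apply its own rule.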

Architecture:

The supervisor runs as PID 1 inside the sandbox’s PID namespace. This is necessary because:

  1. After unshare(CLONE_NEWPID), clone(CLONE_THREAD) fails with EINVAL, so a supervisor thread cannot be spawned.
  2. The host’s procfs denies /proc/<pid>/mem opens from a child user namespace. PID 1 mounts its own procfs (owned by the sandbox’s user namespace).
  3. As PID 1, the supervisor is an ancestor of all sandboxed processes, satisfying Yama ptrace_scope=1 without PR_SET_PTRACER.
  PID 1 (supervisor)                     PID 2+ (worker / sandboxed)
  ──────────────────                     ──────────────────────────
  unshare(CLONE_NEWNS)                   Sandbox setup (overlay, pivot_root)
  mount /proc                            seccomp() → notifier fd
  recv_fd() via SCM_RIGHTS               send_fd() via SCM_RIGHTS
       │                                 install main BPF filter
       │    connect(AF_INET, ...)        execve()
       │ ──── SUSPENDED ────────────►         │
       │                               ┌──────┴──────────────┐
       │                               │  Supervisor (PID 1) │
       │                               │  1. NOTIF_RECV      │
       │                               │  2. open+read       │
       │                               │     /proc/<pid>/mem │
       │                               │  3. Check policy    │
       │                               │  4. NOTIF_ID_VALID  │
       │    ALLOW / ERRNO(EPERM)       │  5. NOTIF_SEND      │
       │ ◄──────────────────────────── └──────────────────────┘
       │
       ▼  (continues or gets EPERM)

The supervisor runs inline (single-threaded) using poll() with a 200ms timeout, interleaved with non-blocking waitpid to detect when the worker exits. After the worker exits, remaining in-flight notifications are drained before the supervisor terminates.

Filtered syscalls:

Syscall     What is inspected                     Policy enforcement
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
connect()   sockaddr struct (IP + port)           Must match IPs resolved from each [[host]]’s domain, allow_ips CIDRs, or loopback
sendto()    dest_addr + msg_controllen            DNS queries on port 53 trigger supervisor-side resolution; connected sockets (NULL addr) allowed
sendmsg()   msghdr struct (msg_controllen)        Blocks any sendmsg() with ancillary data (msg_controllen > 0), preventing SCM_RIGHTS fd passing
clone()     flags register                        Namespace flags (CLONE_NEWNS, CLONE_NEWCGROUP, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWUSER, CLONE_NEWPID, CLONE_NEWNET) denied
clone3()    clone_args.flags in userspace memory  Same namespace flag check, struct read via /proc/<pid>/mem
socket()    domain + type + protocol registers    SOCK_RAW denied; AF_NETLINK restricted to NETLINK_ROUTE (protocol 0) only
execve()    Pathname string in userspace memory   Must match allow_execve paths (empty = allow all)
execveat()  Pathname + dirfd                      Same as execve(), with dirfd resolution
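The clone() and socket() rows only need register values, so they are easy to model. The Python sketch below uses the Linux constant values; the function names are hypothetical, and masking the socket type with 0xF to strip SOCK_NONBLOCK and SOCK_CLOEXEC is an assumption about how the type would be normalised before comparison.

```python
# Namespace-creating clone flags (Linux values).
CLONE_NEWNS = 0x00020000
CLONE_NEWCGROUP = 0x02000000
CLONE_NEWUTS = 0x04000000
CLONE_NEWIPC = 0x08000000
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID = 0x20000000
CLONE_NEWNET = 0x40000000
NAMESPACE_FLAGS = (CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWUTS | CLONE_NEWIPC
                   | CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET)

AF_NETLINK = 16
SOCK_RAW = 3
NETLINK_ROUTE = 0
SOCK_TYPE_MASK = 0xF  # low bits hold the type; higher bits are flags

def clone_allowed(flags):
    """Deny any clone()/clone3() that asks for a new namespace."""
    return (flags & NAMESPACE_FLAGS) == 0

def socket_allowed(domain, sock_type, protocol):
    """Deny raw sockets and netlink protocols other than NETLINK_ROUTE."""
    if (sock_type & SOCK_TYPE_MASK) == SOCK_RAW:
        return False
    if domain == AF_NETLINK and protocol != NETLINK_ROUTE:
        return False
    return True
```

A thread-style clone with CLONE_VM | CLONE_THREAD passes, while any flag word containing CLONE_NEWUSER is refused regardless of the other bits.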

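TOCTOU mitigation: Between reading the worker’s memory and sending the verdict, the target can change underneath the supervisor. The supervisor calls ioctl(SECCOMP_IOCTL_NOTIF_ID_VALID) after the policy check; if the notification ID is no longer valid (the thread exited or the syscall was interrupted, so the PID may refer to a different task), the verdict is skipped. This check confirms that the inspected memory belonged to the notifying task. It does not by itself detect another thread of a multi-threaded process rewriting the inspected buffer after it was read.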

Memory access: The supervisor (PID 1) runs in the same user namespace and PID namespace as all sandboxed processes. It mounts its own procfs (the user namespace owns the PID namespace, so the mount succeeds). notif.pid in the seccomp notification matches PIDs visible in this procfs. As PID 1, the supervisor is an ancestor of all sandboxed processes, so Yama ptrace_scope=1 is satisfied without PR_SET_PTRACER. The supervisor does NOT have PR_SET_NO_NEW_PRIVS set, which would otherwise block /proc/<pid>/mem access.

Fd passing protocol: Before fork(), the parent creates an anonymous Unix socket pair (socketpair(AF_UNIX, SOCK_STREAM)). One end is inherited by the worker (PID 2+), the other by the supervisor (PID 1). After the notifier filter is installed, the worker sends the notifier fd to PID 1 as SCM_RIGHTS ancillary data.
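The fd passing step can be tried out in isolation. The Python sketch below (Python 3.9+ on a Unix system) sends a descriptor across a socketpair as SCM_RIGHTS ancillary data; a pipe write end stands in for the notifier fd, and both socket ends live in one process to keep the example short. The function name is hypothetical.

```python
import os
import socket

def pass_fd_demo():
    """Send a descriptor over a Unix socketpair and use the received copy."""
    worker_end, supervisor_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_fd, write_fd = os.pipe()          # write_fd stands in for the notifier fd

    # "Worker" side: one payload byte plus the fd as SCM_RIGHTS ancillary data.
    socket.send_fds(worker_end, [b"F"], [write_fd])

    # "Supervisor" side: receives a new descriptor for the same open file.
    data, fds, _flags, _addr = socket.recv_fds(supervisor_end, 1, 1)
    received_fd = fds[0]

    os.write(received_fd, b"hello")        # prove the received fd is usable
    echoed = os.read(read_fd, 5)

    for fd in (read_fd, write_fd, received_fd):
        os.close(fd)
    worker_end.close()
    supervisor_end.close()
    return data, echoed
```

The received descriptor has its own number but refers to the same open file description, which is why writing through it shows up on the original pipe.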

Requirements: Linux 5.9+ (auto-detected from /proc/sys/kernel/osrelease). Disabled in monitor mode (incompatible with SECCOMP_RET_LOG). Configurable via [syscalls] notifier in recipe config.

Security property: Even syscalls that pass the main BPF filter are subject to argument-level inspection. A sandboxed process cannot connect to unauthorized IPs, pass file descriptors via SCM_RIGHTS, create new namespaces via clone flags, open raw sockets, open AF_NETLINK sockets beyond NETLINK_ROUTE, or exec binaries outside the allow_execve list.

5. Process Control

Modules: process.rs (environment filtering, PID namespace, RLIMIT_NPROC, allow_execve validation)

Process control enforces the [process] config section:

PID Namespace (CLONE_NEWPID + two forks):

The child calls unshare(CLONE_NEWPID) atomically with the other namespace flags. Since CLONE_NEWPID affects children of the calling process (not the caller itself), the child forks to enter the new PID namespace.

When the USER_NOTIF supervisor is enabled, a second fork inside the new PID namespace creates the supervisor/worker split:

  • PID 1 (supervisor): Mounts its own /proc via unshare(CLONE_NEWNS), receives the notifier fd from the worker via SCM_RIGHTS, and runs the supervisor loop inline (single-threaded poll + waitpid).
  • PID 2+ (worker): Performs sandbox setup (overlay, pivot_root, seccomp), installs the USER_NOTIF filter, sends the notifier fd to PID 1, then execs the target command.

When the notifier is disabled, there is only one fork. The child becomes PID 1 and proceeds directly with sandbox setup and exec.

After the inner fork, setsid() is called to create a new session and process group. This is necessary because PID 1 inherits an invisible session/process-group from the parent namespace. Without setsid(), bash’s job control initialization fails because getpgrp() returns the parent-namespace process group ID, which doesn’t exist in the new PID namespace — causing “initialize_job_control: getpgrp failed”.

The intermediate parent (in the old PID namespace) waits and propagates the exit code.

  Outer child (after unshare)
       │
       ├── fork()  (enters new PID namespace)
       │     │
       │     ├── [notifier enabled] fork() again:
       │     │     │
       │     │     ├── PID 1: Supervisor
       │     │     │     ├── unshare(CLONE_NEWNS)
       │     │     │     ├── mount /proc
       │     │     │     ├── recv notifier fd
       │     │     │     └── poll/waitpid supervisor loop
       │     │     │
       │     │     └── PID 2: Worker
       │     │           ├── setsid()
       │     │           ├── setup overlay, network, seccomp
       │     │           ├── install notifier filter, send fd to PID 1
       │     │           └── execve()
       │     │
       │     ├── [notifier disabled] PID 1: direct setup + exec
       │     │     ├── setsid()
       │     │     ├── setup overlay, network, seccomp
       │     │     └── execve()
       │     │
       │     └── (intermediate parent waits, exits with child's code)

Security property: The sandboxed process tree is completely isolated. It cannot see or signal host processes via /proc or kill().

Environment Filtering (env_passthrough):

Before execve(), the environment is reconstructed from scratch. Only variables listed in env_passthrough are kept. If the list is empty, the process starts with a completely clean environment (zero host leakage).

A minimal PATH is injected if PATH is not explicitly passed through, so that the sandboxed process can still find executables.

Uses execve() instead of execvp() to pass the filtered environment explicitly.

Security property: Sensitive environment variables (API keys, tokens, credentials in AWS_SECRET_ACCESS_KEY, GITHUB_TOKEN, etc.) are never leaked to the sandbox unless explicitly listed in env_passthrough.
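A minimal sketch of the environment reconstruction, written in Python for illustration. The DEFAULT_PATH value and the function name are assumptions; the source only states that a minimal PATH is injected when none is passed through.

```python
DEFAULT_PATH = "/usr/local/bin:/usr/bin:/bin"  # assumed fallback value

def filter_env(host_env, passthrough):
    """Build the sandbox environment from scratch out of the allowed names."""
    env = {name: host_env[name] for name in passthrough if name in host_env}
    env.setdefault("PATH", DEFAULT_PATH)  # only injected when PATH was not kept
    return env
```

The resulting mapping is what would be handed to an execve()-style call (for example os.execve(path, argv, env) in Python), so nothing from the host environment reaches the child implicitly.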

max_pids (RLIMIT_NPROC):

Sets RLIMIT_NPROC via setrlimit() to cap the number of processes the sandbox can create. This is a per-UID limit — effective because the sandbox runs as UID 0 in its own user namespace, mapped to the host user.

Security property: Prevents fork bombs. A process that exceeds the limit gets EAGAIN from fork().

allow_execve (pre-exec validation):

The resolved command path is checked against the allow_execve list before forking. If the command is not in the list (and the list is non-empty), execution is rejected immediately.

Prefix rules: Entries ending in /* match any binary under that directory tree. For example, /nix/store/* allows any binary whose resolved path starts with /nix/store/. The match requires a / boundary to prevent false positives (e.g., /nix/store-extra/foo does NOT match /nix/store/*). This is essential for content-addressed stores like Nix where binary paths contain unpredictable hashes.
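The prefix rule can be expressed in a few lines. This Python sketch treats entries ending in /* as directory prefixes and every other entry as an exact path; the function name is hypothetical, and exact matching for plain entries is an assumption.

```python
def execve_allowed(path, allow_execve):
    """Check a resolved binary path against allow_execve entries."""
    if not allow_execve:
        return True                      # an empty list allows everything
    for entry in allow_execve:
        if entry.endswith("/*"):
            # Keep the trailing slash so the match stops at a / boundary.
            if path.startswith(entry[:-1]):
                return True
        elif path == entry:
            return True
    return False
```

Because the comparison keeps the trailing slash of /nix/store/, a path such as /nix/store-extra/foo is rejected while /nix/store/<hash>-python/bin/python3 is accepted.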

Limitation: allow_execve validates the initial command at the CLI level. Ongoing enforcement of every execve() call inside the sandbox is provided by the USER_NOTIF supervisor (see Seccomp USER_NOTIF Supervisor), which intercepts execve() and execveat() syscalls and validates the pathname against the allow_execve list. When the notifier is disabled (kernel < 5.9 or notifier = false), only the initial command is validated.

6. Cgroups v2

Files: cgroups.rs

Cgroups v2 enforces resource limits (memory and CPU) without requiring root. It leverages systemd’s per-user cgroup delegation, which is available on any modern system running systemd (Ubuntu 22.04+, Fedora 36+, etc.).

Resource limits are opt-in — none of the shipped base recipes include [resources]. Users add memory_mb and/or cpu_percent in their own recipes when needed.

Setup sequence (happens before pivot_root, while /sys/fs/cgroup is still accessible):

  1. Detect the current cgroup by reading /proc/self/cgroup.
  2. Create a child cgroup at <parent>/canister-<pid>.
  3. Write memory.max (bytes) and cpu.max (quota/period) to the child cgroup’s control files.
  4. Move the sandboxed process into the child cgroup by writing its PID to cgroup.procs.

CPU limiting: cpu_percent = 50 translates to cpu.max = "50000 100000" (50ms quota per 100ms period), effectively capping the process to 50% of one CPU core.

Memory limiting: memory_mb = 512 translates to memory.max = 536870912 (512 * 1024 * 1024 bytes). When exceeded, the kernel OOM-kills the process.
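The two conversions are plain arithmetic. The Python sketch below reproduces them; the function names and the fixed 100 ms period parameter are illustrative, chosen to match the examples in the text.

```python
def cpu_max(cpu_percent, period_us=100_000):
    """Format cpu.max as '<quota> <period>' in microseconds."""
    quota_us = cpu_percent * period_us // 100
    return f"{quota_us} {period_us}"

def memory_max(memory_mb):
    """Format memory.max in bytes."""
    return str(memory_mb * 1024 * 1024)

def control_files(memory_mb=None, cpu_percent=None):
    """Map control file names to the values that would be written."""
    files = {}
    if memory_mb is not None:
        files["memory.max"] = memory_max(memory_mb)
    if cpu_percent is not None:
        files["cpu.max"] = cpu_max(cpu_percent)
    return files
```

Since both settings are opt-in, control_files() with no arguments yields an empty mapping and nothing would be written.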

Cleanup: The child cgroup is removed when the sandboxed process exits. A cgroup v2 directory can only be removed with rmdir() once it contains no processes; the kernel does not delete it on its own.

Failure handling: Cgroup setup failure aborts the sandbox. All setup failures are fatal regardless of mode.

Security property: The sandboxed process cannot consume unbounded memory or CPU. The limits are enforced by the kernel’s cgroup controller and cannot be modified by the sandboxed process (which has no write access to the cgroup filesystem after seccomp is loaded).

7. /proc Hardening

Files: overlay.rs (mount_proc function)

After mounting /proc inside the sandbox, Canister masks sensitive paths following Docker’s default behavior, plus additional hardening:

Masked files (bind-mount /dev/null over them):

  • /proc/kcore — physical memory access
  • /proc/keys — kernel keyring contents
  • /proc/key-users — keyring user counts (information leak)
  • /proc/sysrq-trigger — kernel SysRq commands
  • /proc/timer_list — timer details (information leak)
  • /proc/latency_stats — latency statistics
  • /proc/kallsyms — kernel symbol addresses (KASLR bypass)
  • /proc/schedstat — scheduler statistics (information leak)

Masked per-process files (bind-mount /dev/null over them):

  • /proc/self/mountinfo — mount topology (reveals sandbox structure)
  • /proc/1/mountinfo — same, for PID 1

Masked directories (mount empty read-only tmpfs over them):

  • /proc/acpi — ACPI interface
  • /proc/scsi — SCSI device interface

Read-only remount:

  • /proc/sys — prevents writing to sysctl tunables

Failure handling: Individual mask failures are logged at debug level and are non-fatal. The sandbox continues with whatever masking succeeded.

Security property: The sandboxed process cannot read sensitive kernel information from /proc, trigger SysRq commands, modify sysctl values, or inspect the sandbox’s mount topology via mountinfo.

8. Capability Dropping

Module: namespace.rs (drop_capabilities())

After all namespace setup is complete and before execve(), Canister drops all Linux capabilities from the bounding set and clears the inheritable and ambient sets.

Setup sequence:

  1. Read /proc/sys/kernel/cap_last_cap to discover the highest capability number on the running kernel (currently 40, which gives 41 capabilities numbered 0 through 40).
  2. Drop each capability from the bounding set using prctl(PR_CAPBSET_DROP, cap).
  3. Clear the inheritable capability set using capset().
  4. Clear the ambient capability set using prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_CLEAR_ALL).

Result after exec with NO_NEW_PRIVS:

CapEff: 0000000000000000
CapPrm: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000
CapInh: 0000000000000000
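These masks are the Cap* lines of /proc/<pid>/status, where bit N stands for capability number N. As a way to check a running sandbox, the following Python sketch decodes such lines into capability numbers; the helper names are hypothetical.

```python
def caps_in_mask(hex_mask):
    """Return the capability numbers whose bits are set in a hex mask."""
    value = int(hex_mask, 16)
    return [bit for bit in range(64) if (value >> bit) & 1]

def read_cap_sets(status_text):
    """Extract every Cap* line from the text of /proc/<pid>/status."""
    sets = {}
    for line in status_text.splitlines():
        name, _, value = line.partition(":")
        if name.startswith("Cap"):
            sets[name] = caps_in_mask(value.strip())
    return sets
```

An all-zero mask decodes to an empty list, whereas 0000000000200000 decodes to [21], which is CAP_SYS_ADMIN.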

Why this matters: Inside a user namespace, the sandboxed process has CAP_SYS_ADMIN and other capabilities that allow namespace operations. While seccomp blocks the dangerous syscalls (mount, unshare, etc.), dropping capabilities provides defense-in-depth. Even if a seccomp bypass were found, the empty capability set prevents privilege escalation.

AppArmor interaction: The canister AppArmor profile requires allow capability setpcap, to permit PR_CAPBSET_DROP calls. This is included in the shipped profile and installed via sudo can setup.

Security property: The sandboxed process executes with no capabilities in any set. It cannot gain capabilities through any mechanism (exec of setuid binaries is also blocked by NO_NEW_PRIVS).

9. Default Resource Limits

Module: process.rs (apply_default_resource_limits())

Before execve(), Canister applies conservative resource limits that provide baseline protection even when no [resources] section is present in the recipe:

Limit          Value  Purpose
──────────────────────────────────────────────────────────────────
RLIMIT_NPROC   4096   Limits total processes (fork bomb defense)
RLIMIT_AS      8 GB   Limits virtual address space
RLIMIT_NOFILE  4096   Limits open file descriptors
RLIMIT_FSIZE   4 GB   Limits maximum file size
RLIMIT_CORE    0      Disables core dumps (prevents data leakage)

These defaults are applied first, then any explicit limits from the recipe’s [resources] section override them. The RLIMIT_NPROC from [process].max_pids takes precedence over the default if specified.
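The override order can be sketched with Python's resource module (Linux). The sketch assumes that "GB" in the table means GiB and models only the max_pids override; the function names are hypothetical. apply_limits() is shown for completeness and lowers the limits of the calling process, so it belongs in a child just before exec.

```python
import resource

GIB = 1024 ** 3

# Defaults from the table above (8 GB / 4 GB interpreted as GiB).
DEFAULT_LIMITS = {
    resource.RLIMIT_NPROC: 4096,
    resource.RLIMIT_AS: 8 * GIB,
    resource.RLIMIT_NOFILE: 4096,
    resource.RLIMIT_FSIZE: 4 * GIB,
    resource.RLIMIT_CORE: 0,
}

def effective_limits(max_pids=None):
    """Start from the defaults, then let [process].max_pids win."""
    limits = dict(DEFAULT_LIMITS)
    if max_pids is not None:
        limits[resource.RLIMIT_NPROC] = max_pids
    return limits

def apply_limits(limits):
    """Apply limits to the current process, never exceeding the hard limit."""
    for res, value in limits.items():
        _soft, hard = resource.getrlimit(res)
        if hard != resource.RLIM_INFINITY:
            value = min(value, hard)   # an unprivileged process cannot raise it
        resource.setrlimit(res, (value, value))
```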

Security property: Fork bombs are bounded, memory-hungry processes are capped, and core dumps cannot leak sandbox state to disk.

10. Monitor Mode

Flag: --monitor

Monitor mode runs the sandbox with all namespace isolation active (for accurate observation) but relaxes policy enforcement. Each enforcement point logs what would have been blocked without actually blocking it.

Enforcement points and their monitor-mode behavior:

Enforcement point      Normal mode                   Monitor mode
──────────────────────────────────────────────────────────────────────────────────────────────────
allow_execve           Rejects unlisted commands     Logs warning, allows through
env_passthrough        Strips unlisted env vars      Logs what would be stripped, passes full env
max_pids               Sets RLIMIT_NPROC             Logs the limit, skips setrlimit()
Seccomp BPF            SECCOMP_RET_ERRNO (EPERM)     SECCOMP_RET_LOG (allowed but kernel-logged)
USER_NOTIF supervisor  Active (intercepts syscalls)  Disabled (incompatible with SECCOMP_RET_LOG)
Filesystem isolation   Full overlay + pivot_root     Full overlay + pivot_root (unchanged)
Network isolation      Namespace + pasta             Namespace + pasta (unchanged)

Key design decisions:

  1. Namespaces stay active. Monitor mode does NOT skip namespace creation. This ensures the process runs in the same environment it would in enforced mode, so observations are accurate. If namespaces were disabled, the process might behave differently (different PIDs, different filesystem view, etc.).

  2. SECCOMP_RET_LOG for syscalls. Instead of returning EPERM, denied syscalls are allowed through but logged to the kernel audit subsystem. View these with journalctl -k | grep seccomp. This uses a real BPF filter (same structure as enforcement mode) so the observation is exact.

  3. USER_NOTIF is disabled. The notifier supervisor is incompatible with monitor mode because SECCOMP_RET_USER_NOTIF suspends the syscall (it does not log-and-allow like SECCOMP_RET_LOG). In monitor mode, all syscalls pass through to the kernel with logging only.

  4. Pre-run policy preview. Before forking, the CLI prints a summary of the active policy so the user knows what enforcement points will be observed.

  5. Post-run summary. After the sandboxed process exits, the CLI prints a summary with the exit code and hints for reviewing the monitor output.

Intended workflow:

# 1. Run with monitor to see what the policy would block
can run --monitor --recipe my_policy.toml -- ./my_program

# 2. Review MONITOR: lines in output and seccomp audit logs
journalctl -k | grep seccomp

# 3. Adjust policy based on observations

# 4. Run with enforcement
can run --recipe my_policy.toml -- ./my_program

Security property: Monitor mode provides NO security guarantees. It is a development/debugging tool for iterating on sandbox policies.

Warning: A malicious process can detect monitor mode (e.g., by attempting a denied syscall and observing it succeeds) and behave differently. Always validate policies with enforcement enabled.

11. Strict Mode

Flag: --strict (or strict = true in config)

Strict mode is the inverse of monitor mode: instead of relaxing enforcement, it tightens it. Both normal and strict mode treat all setup failures as fatal. The key difference is the seccomp deny action.

Changes in strict mode:

Enforcement point     Normal mode                Strict mode
─────────────────────────────────────────────────────────────────────────
Filesystem isolation  Aborts on failure          Aborts on failure
Network setup         Aborts on failure          Aborts on failure
Loopback bring-up     Aborts on failure          Aborts on failure
Seccomp deny action   SECCOMP_RET_ERRNO (EPERM)  SECCOMP_RET_KILL_PROCESS
Cgroup setup          Aborts on failure          Aborts on failure

Mutual exclusion: --strict and --monitor cannot be used together. This is enforced at the CLI level.

Recommended for: CI pipelines, production deployments, and any environment where reduced isolation is worse than no execution.


Parent-Child Protocol

The parent and child synchronize via three anonymous pipes:

Pipe 1: child_ready  (child → parent)   "namespaces created"
Pipe 2: maps_done    (parent → child)   "UID/GID maps written"
Pipe 3: network_done (parent → child)   "pasta started, network ready"

Timeline:
  Child: unshare(USER+PID+NET)
  Child: write(child_ready, 0x00)       ← "namespaces created"
  Child: read(maps_done)                ← blocks

  Parent: read(child_ready)             ← unblocks
  Parent: write uid_map, gid_map
  Parent: write(maps_done, 0x00)        ← "maps written"

  Child: read(maps_done)                ← unblocks
  Child: read(network_done)             ← blocks

  Parent: start pasta --userns /proc/<child>/ns/user --netns /proc/<child>/ns/net --runas <uid>
  Parent: write(network_done, 0x00)     ← "network ready"

  Child: read(network_done)             ← unblocks
  Child: unshare(NEWNS)
  Child: setup overlay, pivot_root
  Child: install notifier filter, send fd to PID 1 supervisor (if enabled)
  Child: install main seccomp filter
  Child: execve()

  PID 1: receive notifier fd, run inline supervisor loop (if enabled)

This three-pipe protocol is necessary because:

  1. UID/GID maps must be written from outside the namespace. The kernel requires an external process to write /proc/<pid>/uid_map.

  2. pasta needs the child’s user and network namespaces. pasta is invoked with --userns /proc/<child_pid>/ns/user --netns /proc/<child_pid>/ns/net --runas <uid>. The setns(CLONE_NEWNET) syscall requires CAP_SYS_ADMIN in the user namespace that owns the target network namespace — not the caller’s user namespace. Since the child created both namespaces atomically via unshare(CLONE_NEWUSER | CLONE_NEWNET), the network namespace is owned by the child’s user namespace. pasta must therefore first join the child’s user namespace (setns(CLONE_NEWUSER)) to acquire CAP_SYS_ADMIN there, then join the network namespace (setns(CLONE_NEWNET)). The child calls prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY) before signaling the parent, so that pasta (a sibling process) can open /proc/<child>/ns/* despite Yama ptrace_scope=1. --runas <uid> prevents pasta from dropping to “nobody”, which would fail the kernel’s UID ownership check on namespace files. This must happen after the child creates CLONE_NEWNET but before the child tries to use the network.

  3. Mount namespace is split from the initial unshare. The child first calls unshare(USER+PID+NET), then waits for pasta, then calls unshare(NEWNS) separately. This split ensures pasta can access /proc/<child_pid>/ns/net before the child’s mount namespace changes.

  4. Mount operations need mapped UIDs. The child cannot mount anything until its UID is mapped (otherwise the kernel rejects it).

  5. The notifier fd must be passed from worker to supervisor. The seccomp() syscall returns the notifier fd in the worker’s process. The fd is sent to PID 1 (supervisor) via SCM_RIGHTS over an anonymous Unix socket pair created before the supervisor/worker fork.
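The ordering guarantees of the three-pipe handshake can be reproduced with ordinary pipes and fork(). The Python sketch below (Python 3.9+ on a Unix system) leaves out the namespace, map-writing, and pasta steps and keeps only the synchronization; the function name and the event labels are illustrative.

```python
import os

def handshake():
    """Run the three-pipe handshake between a parent and a forked child."""
    ready_r, ready_w = os.pipe()   # child_ready:  child -> parent
    maps_r, maps_w = os.pipe()     # maps_done:    parent -> child
    net_r, net_w = os.pipe()       # network_done: parent -> child
    events = []

    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            # Child: a real sandbox would call unshare() before signalling.
            os.write(ready_w, b"\x00")
            os.read(maps_r, 1)     # blocks until the maps are written
            os.read(net_r, 1)      # blocks until the network is ready
            code = 0
        finally:
            os._exit(code)         # never fall through into the parent's code

    os.read(ready_r, 1)            # blocks until the child has signalled
    events.append("child_ready")
    # Parent: uid_map and gid_map would be written here.
    os.write(maps_w, b"\x00")
    events.append("maps_done")
    # Parent: pasta would be started here.
    os.write(net_w, b"\x00")
    events.append("network_done")

    _, status = os.waitpid(pid, 0)
    for fd in (ready_r, ready_w, maps_r, maps_w, net_r, net_w):
        os.close(fd)
    return events, os.waitstatus_to_exitcode(status)
```

Each read blocks until the other side has finished its step, so the child cannot reach its mount and exec phase before the maps are written and the network is up.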


Mandatory Access Control (MAC)

Linux distributions use Mandatory Access Control systems to restrict unprivileged processes. Canister detects the active MAC system at runtime and manages the appropriate security policy via can setup. See ADR-0004 for the design rationale.

Supported MAC Systems

Distribution              MAC System  Restriction Mechanism
─────────────────────────────────────────────────────────────────────────────────────
Ubuntu 24.04+             AppArmor    kernel.apparmor_restrict_unprivileged_userns=1
Fedora 41+ / RHEL 10+     SELinux     user_namespace { create } permission
Arch, Void, Gentoo, etc.  None        No restriction; works natively

Detection

Canister detects the active MAC system at startup:

  1. AppArmor: /sys/module/apparmor/parameters/enabled == "Y"
  2. SELinux: /sys/fs/selinux/enforce exists
  3. Neither: no policy needed, sandbox works natively
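The detection order above translates into a short probe. This Python sketch follows the same two checks; the root parameter is an addition made here so the function can be pointed at a fake directory tree for testing, and the returned labels are illustrative.

```python
from pathlib import Path

def detect_mac(root="/"):
    """Return 'apparmor', 'selinux', or 'none' based on the sysfs probes."""
    base = Path(root)
    try:
        enabled = (base / "sys/module/apparmor/parameters/enabled").read_text()
        if enabled.strip() == "Y":
            return "apparmor"
    except OSError:
        pass  # file missing or unreadable: AppArmor is not active
    if (base / "sys/fs/selinux/enforce").exists():
        return "selinux"
    return "none"
```

AppArmor is probed first, so a system exposing both files is reported as AppArmor, matching the order of the list.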

can check reports the active MAC system, its restriction status, and the canister policy status.

AppArmor (Ubuntu)

Two-profile architecture:

Canister uses two AppArmor profiles, managed by can setup:

  1. canister: attached to the can binary. Grants mount, pivot_root, capabilities (sys_admin, net_admin, sys_chroot, sys_ptrace, dac_override, dac_read_search), userns creation, and full file/network access. Has a catch-all px /** -> canister//&canister_sandboxed rule that transitions all child execs to the restricted sub-profile. Also has specific ux (unconfined exec) rules for:

    • pasta (/usr/bin/pasta, /usr/bin/pasta.avx2, /bin/pasta, /bin/pasta.avx2): pasta needs CAP_SYS_ADMIN to call setns(CLONE_NEWUSER), which is denied by canister_sandboxed. The ux rules take precedence over the px /** glob, so pasta runs unconfined.
    • apparmor_parser (/usr/sbin/apparmor_parser, /sbin/apparmor_parser): needs CAP_MAC_ADMIN to load/unload profiles during can setup.
  2. canister_sandboxed: maximally strict sub-profile for sandboxed commands. Denies all capabilities (audit deny capability), mount/umount/pivot_root, user namespace creation, ptrace (except allowing the USER_NOTIF supervisor to read process memory), and DBus.

Profile transition chain:

canister (binary starts, never execs itself)
    ├─ fork (child inherits "canister") → all namespace setup happens here
    │   └─ execve(command) → "canister//&canister_sandboxed"
    ├─ spawn(pasta) → ux rule fires → runs unconfined
    └─ spawn(apparmor_parser) → ux rule fires → runs unconfined

AppArmor specific path rules (ux /usr/bin/pasta) take precedence over glob rules (px /**), so the ux rules for pasta and apparmor_parser work without conflicting with the catch-all px rule.

One-time upgrade note: When upgrading from an older profile (without ux rules for pasta/apparmor_parser) to the new profile, apparmor_parser may be confined by the old profile and fail with “Access denied”. In this case, manually reload: sudo apparmor_parser -r /etc/apparmor.d/canister.

SELinux (Fedora/RHEL)

Policy module architecture:

Canister’s SELinux policy defines three types:

  1. canister_t — domain for the can binary. Grants user_namespace { create }, cap_userns { sys_admin sys_ptrace net_admin sys_chroot }, mount/pivot_root permissions, full file access, and ptrace over sandboxed children.

  2. canister_sandboxed_t — restricted domain for sandboxed child processes. Basic file read/execute and network socket access only. No namespace creation, no capabilities, no mount operations.

  3. canister_exec_t — file type for the can binary. It triggers an automatic domain transition from unconfined_t to canister_t on exec.

Installation: SELinux policy installation requires checkmodule, semodule_package, and semodule (from the policycoreutils and checkpolicy packages). can setup generates .te (type enforcement) and .fc (file context) files, compiles them, and installs the module.
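
Before running can setup on an SELinux system, it can help to confirm that the three tools are on PATH. The helper below is a hypothetical sketch (missing_tools is not part of Canister). The commented commands show the conventional manual build of a policy module with these tools. The canister.te, canister.fc, canister.mod, and canister.pp file names are assumptions, not necessarily the names can setup uses.

```shell
# Hypothetical helper: print each named tool that is not found on PATH.
# Prints nothing when every tool is available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing_tools checkmodule semodule_package semodule

# Conventional manual equivalent of compile + install (file names are assumptions):
#   checkmodule -M -m -o canister.mod canister.te
#   semodule_package -o canister.pp -m canister.mod -f canister.fc
#   sudo semodule -i canister.pp
```

Any tool name printed by the call needs to be installed from the packages named above before can setup can build the module.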

Impact on Canister

Feature               | With MAC restriction       | With canister policy
User namespace        | Works                      | Works
Mount namespace       | Mounts fail → aborts       | Full isolation
Filesystem isolation  | Aborts (cannot establish)  | Full
Network namespace     | Works                      | Works
Loopback bring-up     | Fails → aborts             | Works
pasta                 | N/A (no connectivity)      | Works
Seccomp               | Works                      | Works
USER_NOTIF supervisor | Works                      | Works

Policy management (can setup)

# Install the security policy (auto-detects MAC system and binary path)
sudo can setup

# Force reinstall (even if policy exists and appears current)
sudo can setup --force

# Remove the policy
sudo can setup --remove

can setup is interactive when stdout is a terminal: it shows the generated policy content (or a diff when updating), and asks for confirmation before writing. In non-interactive mode (piped/CI), it writes without prompting.
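
The interactive/non-interactive split follows the usual "is stdout a terminal" test. As an illustration only (setup_mode is a hypothetical name, and this is not Canister's code), the same check in shell looks like this:

```shell
# Hypothetical helper: report which mode a stdout-is-a-terminal check would select.
setup_mode() {
    if [ -t 1 ]; then
        echo "interactive"
    else
        echo "non-interactive"
    fi
}

setup_mode          # result depends on where stdout points
setup_mode | cat    # stdout is a pipe here, so this prints "non-interactive"
```

This is why running the command under a CI runner or piping its output skips the confirmation prompt.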

The command auto-detects the active MAC system and generates the appropriate policy. On systems with no MAC, it reports that no policy is needed.

Stale policy detection: when the installed policy content doesn’t match the current template (e.g., after a Canister upgrade), can check reports the policy as “OUTDATED” and can setup will update it.
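
The staleness check amounts to comparing the installed policy with a freshly rendered template. A minimal sketch, assuming both are available as files: policy_status, its arguments, and the "NOT INSTALLED" and "CURRENT" labels are hypothetical. Only "OUTDATED" comes from the text above.

```shell
# Hypothetical helper: classify an installed policy file against a template file.
# $1 = installed policy path, $2 = freshly rendered template path.
policy_status() {
    installed="$1"
    template="$2"

    if [ ! -e "$installed" ]; then
        echo "NOT INSTALLED"
    elif cmp -s "$installed" "$template"; then
        # Byte-for-byte identical: nothing to update.
        echo "CURRENT"
    else
        echo "OUTDATED"
    fi
}
```

For AppArmor the installed file would be /etc/apparmor.d/canister, the path mentioned in the upgrade note. Where the rendered template comes from is left open in this sketch.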


Known Limitations

Fundamental limitations

  • Kernel exploits. No userspace sandbox can protect against kernel vulnerabilities. Seccomp reduces the attack surface but cannot eliminate it.

  • Side channels. Timing attacks, cache attacks, and speculative execution attacks are out of scope.

  • RLIMIT_NPROC is per-UID. The max_pids limit applies to the user’s total process count, not just the sandbox. Inside a user namespace this is usually fine (the sandbox runs as a mapped UID), but if multiple sandboxes share a UID they share the limit.

  • DNS resolution timing. Domain pre-resolution happens at sandbox startup. If DNS records change during execution, the resolved IP set becomes stale. TTL-aware re-resolution is not implemented.

  • USER_NOTIF TOCTOU window. The SECCOMP_IOCTL_NOTIF_ID_VALID check mitigates but does not fully eliminate the time-of-check-time-of-use race in the USER_NOTIF supervisor. A highly concurrent, adversarial workload with precise timing could theoretically modify memory between the supervisor’s read and verdict. This is an inherent limitation of the seccomp_unotify mechanism.