Canister
A lightweight sandbox for running untrusted code safely on Linux.
Canister (can) runs any command inside an isolated sandbox with restricted filesystem, network, and syscall access. No root required. Single binary, zero runtime dependencies.
$ can run --recipe recipes/example.toml -- python3 untrusted_script.py
The script sees an empty filesystem (except explicitly allowed paths), can only reach allowed domains, and is blocked from dangerous syscalls.
Key Features
- Namespace isolation — mount, PID, network, user, and UTS namespaces
- Filesystem control — read-only bind mounts with explicit write paths
- Network filtering — DNS-level domain filtering + IP/CIDR rules
- Seccomp BPF — syscall allow/deny lists with optional SECCOMP_RET_USER_NOTIF supervisor
- L7 egress proxy + DLP — TLS-terminating proxy with credential pattern detection and per-service domain scoping; canary tokens and session entropy budgeting catch exfiltration attempts
- Recipe composition — layered TOML configs that merge predictably
- Zero dependencies — static Rust binary, no daemon, no root
Documentation Structure
This documentation is organized into three sections:
- User Guide — conceptual explanations, getting started, configuration patterns
- Reference — auto-generated from source code: CLI flags, config schema, recipes, merge semantics
- Architecture — system design overview and Architecture Decision Records (ADRs)
Getting Started
Installation
Download the latest binary from GitHub Releases:
# Download and extract
curl -fsSL https://github.com/dergraf/canister/releases/latest/download/canister-x86_64-linux.tar.gz \
| tar xz -C ~/.local/bin
# Verify
can --version
Or build from source:
git clone https://github.com/dergraf/canister.git
cd canister
cargo build --release
cp target/release/can ~/.local/bin/
First-time Setup
Run the setup command to configure your system for unprivileged user namespaces:
can setup
Quick Start
Run a command inside a sandbox:
can run -- ls /
This runs ls / inside an isolated environment with the default recipe applied. The sandbox restricts filesystem access, blocks network traffic, and filters syscalls.
Using Recipes
Recipes are TOML files that define sandbox policies. Use built-in recipes or write your own:
# List available built-in recipes
can recipe list
# Run with a specific recipe
can run --recipe python -- python3 script.py
# Auto-detect recipe from command
can run -- python3 script.py
See Configuration for the full configuration guide and Built-in Recipes for all available recipes.
Project Manifests
For projects that need reproducible sandbox configurations, create a canister.toml manifest:
[sandbox.dev]
recipes = ["python", "network-curl"]
[[sandbox.dev.host]]
domain = "pypi.org"
[[sandbox.dev.host]]
domain = "files.pythonhosted.org"
Then use can up to launch the sandbox:
can up dev
See the Manifest Reference for full schema documentation.
Configuration Reference
Canister uses TOML configuration files with strict schema validation. Unknown fields are rejected at parse time.
When no config file is provided (can run -- command), the default policy uses
proxy-only egress, strict filesystem defaults, and the default seccomp baseline.
Table of Contents
- Project Manifest (canister.toml)
- Recipe Composition
- recipe (metadata)
- filesystem
- network
- [host]
- network.dlp
- process
- resources
- syscalls
- proxy
- Strict Mode
- Monitor Mode
- Inspecting the Resolved Policy
- Examples
Project Manifest (canister.toml)
A project manifest declares named sandboxes for a project. Instead of
remembering which -r flags to pass, you define sandboxes once in
canister.toml and run them with can up.
Place canister.toml in your project root (next to .git/). Canister
discovers it by walking up from the current directory, similar to .gitignore.
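The upward search can be pictured with a short Python sketch. This is only an illustration of the described behaviour, not Canister's implementation; the function name `find_manifest` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_manifest(start: Path, name: str = "canister.toml") -> Optional[Path]:
    """Walk up from `start` and return the first manifest found, or None."""
    current = start.resolve()
    # Check the starting directory first, then each ancestor up to the root.
    for directory in (current, *current.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_manifest(Path.cwd())` from any subdirectory of a project would return the manifest in the nearest enclosing directory that has one.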
Manifest Format
[sandbox.dev]
description = "Neovim + Elixir development"
recipes = ["neovim", "elixir", "nix"]
command = "nvim"
[sandbox.dev.filesystem]
allow_write = ["$HOME/.local/share/nvim"]
[[sandbox.dev.host]]
domain = "api.myproject.dev"
[sandbox.test]
description = "Mix test runner"
recipes = ["elixir", "nix"]
command = "mix test"
[sandbox.ci]
description = "CI — strict, no network"
recipes = ["elixir", "nix", "generic-strict"]
command = "mix test --cover"
strict = true
[sandbox.ci.resources]
memory_mb = 2048
cpu_percent = 100
[sandbox.<name>] fields:
| Field | Type | Required | Description |
|---|---|---|---|
| description | string | No | Human-readable description |
| recipes | string[] | Yes | Recipe names to compose (resolved via recipe search path) |
| command | string | Yes | Command to run (may include arguments) |
| strict | bool | No | Override strict mode for this sandbox |
Override sections:
Each sandbox can include optional override sections that merge on top of the composed recipes. These use the same schema as recipe files:
- [sandbox.<name>.filesystem] — allow, allow_write, deny
- [sandbox.<name>.network] — egress, allow_ips, ports, contract_mode
- [[sandbox.<name>.host]] — one or more per-destination contracts (see [[host]] below)
- [sandbox.<name>.process] — max_pids, allow_execve, env_passthrough
- [sandbox.<name>.resources] — memory_mb, cpu_percent
- [sandbox.<name>.syscalls] — allow_extra, deny_extra, seccomp_mode, notifier
Overrides follow the same merge semantics as recipe composition: Vec fields are unioned, scalar fields use last-Some-wins, strict uses OR.
Validation rules:
- At least one [sandbox.<name>] must be defined.
- Each sandbox must have recipes = [...] with at least one entry.
- Each sandbox must have a non-empty command.
- Unknown fields are rejected (deny_unknown_fields).
- Mixing absolute (allow/deny) and relative (allow_extra/deny_extra) syscall fields in a sandbox’s [syscalls] section is an error.
can up
Run a named sandbox from the manifest:
# Run the default sandbox (alphabetically first).
can up
# Run a specific sandbox by name.
can up dev
can up test
can up ci
# With CLI overrides.
can up dev --strict
can up dev --monitor
can up test -p 4000:4000
Default sandbox: When no name is given, can up uses the alphabetically
first sandbox. Use descriptive names so the default is predictable (e.g.,
dev sorts before test).
Error handling: If canister.toml is not found, can up prints an error
suggesting can run for ad-hoc use. If the named sandbox doesn’t exist, it
lists available sandbox names.
Dry-Run Preview
Use --dry-run to see the fully resolved policy without running anything:
can up dev --dry-run
can up ci --dry-run
The output shows the merged result of base.toml + auto-detected recipes +
manifest recipes + manifest overrides, including filesystem paths, network
domains, syscall overrides, and resource limits.
Composition Order (can up)
base.toml
→ auto-detected recipes (match_prefix against command binary)
→ recipes listed in manifest (left to right)
→ manifest overrides ([sandbox.<name>.filesystem], etc.)
= final SandboxConfig
This is the same merge chain as can run, except the explicit -r flags
are replaced by the manifest’s recipes = [...] list, and manifest
overrides act as the final layer.
Design note: Package manager recipes (nix, homebrew, etc.) should be
listed explicitly in recipes = [...]. While auto-detection via
match_prefix still works for the command binary, explicit declaration
preserves the principle of least privilege — auditors can see exactly
which recipes are composed by reading canister.toml.
Recipe Composition
Canister supports composing multiple recipes via repeated -r / --recipe
flags. Recipes are merged left-to-right into a single resolved config.
Composition order: base.toml → auto-detected recipes → explicit --recipe args.
base.toml provides essential OS bind mounts and is always loaded first
(embedded in the binary, overridable on disk). Auto-detected recipes are
matched by match_prefix before explicit recipes are applied. The
default.toml seccomp baseline is resolved separately by the seccomp module
and is NOT part of this composition chain.
# base.toml (always) → nix.toml (auto-detected) → elixir.toml (explicit)
can run -r elixir -- mix test # mix resolves to /nix/store/..., nix.toml auto-detected
# Explicit composition
can run -r nix -r elixir -- mix test
can run -r cargo -r generic-strict -- cargo build
Merge Semantics
When multiple recipes are merged, each field type follows a specific strategy:
| Field type | Strategy | Example |
|---|---|---|
| Vec fields (paths, domains, syscalls, env vars) | Union — deduplicated, order preserved | Two recipes allowing /a and /b → ["/a", "/b"] |
| strict (Option<bool>) | OR — any Some(true) wins, can never be loosened | Recipe A: strict = true, Recipe B: omitted → true |
| egress (Option<EgressMode>) | Last-Some-wins — None preserves earlier value | Recipe A: egress = "proxy-only", Recipe B: egress = "direct" → direct |
| seccomp_mode (Option<SeccompMode>) | Last-Some-wins | Same as egress |
| Numeric (max_pids, memory_mb, cpu_percent) | Last-Some-wins | Recipe A: max_pids = 64, Recipe B: max_pids = 128 → 128 |
| RecipeMeta | Overlay — later recipe’s metadata wins if present | — |
The “last-Some-wins” strategy means None (field not specified) preserves
the value from an earlier recipe, while Some(value) overwrites it.
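The three strategies in the table can be sketched in Python as follows. This models recipes as plain dicts with a handful of fields and uses `None` in place of Rust's `None`; the helper names are hypothetical and the real merge covers more fields.

```python
from typing import Any, Dict, List, Optional


def union(a: List[str], b: List[str]) -> List[str]:
    """Union with duplicates removed and first-seen order preserved."""
    return list(dict.fromkeys([*a, *b]))


def last_some_wins(a: Optional[Any], b: Optional[Any]) -> Optional[Any]:
    """A later None keeps the earlier value; a later value overwrites it."""
    return a if b is None else b


def strict_or(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Any True wins; the result stays None only if no recipe set the field."""
    if a is None and b is None:
        return None
    return bool(a) or bool(b)


def merge(earlier: Dict[str, Any], later: Dict[str, Any]) -> Dict[str, Any]:
    """Merge two recipe-like dicts; `later` sits further right in the chain."""
    return {
        "allow": union(earlier.get("allow", []), later.get("allow", [])),
        "strict": strict_or(earlier.get("strict"), later.get("strict")),
        "egress": last_some_wins(earlier.get("egress"), later.get("egress")),
        "max_pids": last_some_wins(earlier.get("max_pids"), later.get("max_pids")),
    }


recipe_a = {"allow": ["/a"], "strict": True, "egress": "proxy-only", "max_pids": 64}
recipe_b = {"allow": ["/b", "/a"], "egress": "direct"}
merged = merge(recipe_a, recipe_b)
```

Here `recipe_b` omits `strict` and `max_pids`, so both keep the values from `recipe_a`, while `egress` is overwritten by the later recipe.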
Name-Based Lookup
The -r argument is resolved as follows:
- If the argument contains / or ends with .toml, treat as a file path.
- Otherwise, search for <name>.toml in the recipe search path:
  - ./.canister/
  - $XDG_CONFIG_HOME/canister/recipes/
  - /etc/canister/recipes/
- First match wins (project-local takes precedence over user-global).
can run -r elixir -- mix test # name lookup → elixir.toml
can run -r recipes/custom.toml -- mix test # file path
can run -r ./my-policy.toml -- echo hi # file path (contains /)
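A minimal Python sketch of the lookup rules above, assuming the search path is passed in as an ordered list of directories. The function name `resolve_recipe` is hypothetical and not part of Canister.

```python
from pathlib import Path
from typing import List, Optional


def resolve_recipe(arg: str, search_path: List[Path]) -> Optional[Path]:
    """Resolve a -r argument to a recipe file path, or None if no match."""
    # Anything that looks like a path is used directly, with no search.
    if "/" in arg or arg.endswith(".toml"):
        return Path(arg)
    # Otherwise look for <name>.toml in each directory; first match wins.
    for directory in search_path:
        candidate = directory / f"{arg}.toml"
        if candidate.is_file():
            return candidate
    return None
```

Because the list is scanned in order, placing the project-local directory first gives it precedence over the user-global and system-wide ones.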
Auto-Detection via match_prefix
Recipes can declare match_prefix patterns in their [recipe] metadata.
During CLI setup (before forking), the command binary path is resolved and
canonicalized. Each discovered recipe’s match_prefix is checked against
the resolved path. Matching recipes are automatically merged into the chain
between base.toml and explicit -r args.
This replaces the previous hardcoded detect_command_prefix() logic.
Adding support for a new package manager is “write a .toml file” rather
than “modify Rust code”.
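As a rough Python illustration of the detection step: the command path is canonicalized, then compared against each recipe's prefixes. Matching on a path-component boundary is an assumption made here (the text only specifies that boundary for allow_execve), and `detect_recipes` is a hypothetical name.

```python
import os
from typing import Dict, List


def detect_recipes(command_path: str, recipes: Dict[str, List[str]]) -> List[str]:
    """Return names of recipes whose match_prefix covers the resolved command."""
    resolved = os.path.realpath(command_path)  # resolve all symlinks first
    matched = []
    for name, prefixes in recipes.items():
        for prefix in prefixes:
            base = prefix.rstrip("/")
            # Assumption: a prefix only matches whole path components.
            if resolved == base or resolved.startswith(base + "/"):
                matched.append(name)
                break
    return matched
```

With `recipes = {"nix": ["/nix/store"]}`, a `mix` binary that is a symlink into `/nix/store/...` would select the nix recipe even though the command was invoked through a path outside the store.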
Environment Variable Expansion
Recipe paths support environment variable expansion:
| Syntax | Expansion |
|---|---|
| $HOME | Value of $HOME |
| $USER | Value of $USER |
| ${XDG_CONFIG_HOME} | Value of $XDG_CONFIG_HOME |
| $$ | Literal $ |
Expansion applies to [filesystem].allow, [filesystem].deny,
[process].allow_execve, and [recipe].match_prefix. It is performed
during config resolution (after merge, before the sandbox uses the paths).
[filesystem]
allow = ["$HOME/.cargo", "$HOME/.rustup", "$HOME/project"]
[recipe]
match_prefix = ["$HOME/.cargo"]
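The expansion rules in the table can be approximated with a small Python function. This is a sketch only: leaving unset variables untouched is an assumption, since the text does not say how they are handled, and `expand` is a hypothetical name.

```python
import re
from typing import Mapping

# Matches "$$", "${NAME}" or "$NAME".
_VAR = re.compile(r"\$(?:(\$)|\{(\w+)\}|(\w+))")


def expand(path: str, env: Mapping[str, str]) -> str:
    """Expand $VAR, ${VAR} and $$ in a recipe path using `env`."""
    def replace(match):
        if match.group(1):  # "$$" becomes a literal dollar sign
            return "$"
        name = match.group(2) or match.group(3)
        # Assumption: unknown variables are left as written.
        return env.get(name, match.group(0))
    return _VAR.sub(replace, path)
```

For example, `expand("$HOME/.cargo", {"HOME": "/home/user"})` yields the absolute path that would then be bind-mounted or matched.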
[recipe] (metadata)
Optional metadata section for recipe files. Not used for policy enforcement but controls recipe discovery and composition behavior.
| Field | Type | Default | Description |
|---|---|---|---|
| name | string (optional) | — | Human-readable recipe name |
| description | string (optional) | — | Short description shown by can recipe list |
| match_prefix | string[] | [] | Path prefixes for auto-detection (env vars expanded) |
[recipe]
name = "nix"
description = "Nix package manager (/nix/store)"
match_prefix = ["/nix/store"]
[filesystem]
Controls what the sandboxed process can see and access on the filesystem.
When filesystem isolation is active (requires a MAC policy on Ubuntu 24.04+ and Fedora 41+), the sandbox starts with an empty tmpfs root. Only explicitly allowed paths and essential system paths are bind-mounted read-only.
| Field | Type | Default | Description |
|---|---|---|---|
| allow | string[] | [] | Paths to bind-mount read-only into the sandbox |
| deny | string[] | [] | Paths explicitly denied (checked before allow) |
Behavior:
- Deny rules take precedence over allow rules.
- Paths are matched by prefix: allowing /usr/lib also allows /usr/lib/python3.
- Essential paths are defined in recipes/base.toml (embedded in the binary, overridable on disk) and always bind-mounted: /bin, /sbin, /usr/bin, /usr/sbin, /lib, /lib64, /usr/lib, /etc.
- Auto-detection: When the command binary lives outside standard FHS paths (e.g., installed via Nix, Homebrew, Cargo, etc.), Canister auto-detects the appropriate package manager recipe via match_prefix and merges it into the recipe chain, bringing the necessary mount paths automatically. See Auto-Detection via match_prefix.
- When filesystem isolation is blocked (MAC system blocks mounts), the sandbox aborts. Run sudo can setup to install the security policy (use --force to reinstall if the policy is outdated).
[filesystem]
allow = ["/usr/lib", "/usr/bin", "/tmp/workspace", "/home/user/data"]
deny = ["/etc/shadow", "/etc/passwd", "/root", "/home/user/.ssh"]
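The deny-before-allow rule can be sketched in Python as below. It assumes prefixes match on path-component boundaries, which the text does not spell out, and the function names are hypothetical.

```python
from typing import List


def under(path: str, prefix: str) -> bool:
    """True if `path` equals `prefix` or lies beneath it."""
    base = prefix.rstrip("/")
    return path == (base or "/") or path.startswith(base + "/")


def is_visible(path: str, allow: List[str], deny: List[str]) -> bool:
    """Deny rules are checked first; otherwise any allow prefix admits the path."""
    if any(under(path, d) for d in deny):
        return False
    return any(under(path, a) for a in allow)


allow = ["/usr/lib", "/usr/bin", "/tmp/workspace", "/home/user/data"]
deny = ["/etc/shadow", "/etc/passwd", "/root", "/home/user/.ssh"]
```

With these lists, `/usr/lib/python3` is visible through the `/usr/lib` prefix, while anything under `/home/user/.ssh` stays hidden even if a broader allow rule such as `/home/user` were added later.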
Package Manager Support
When the command binary is installed outside standard system paths, Canister
uses recipe-based auto-detection to ensure the binary is visible inside the
sandbox. Each package manager has a recipe with match_prefix patterns:
| Recipe | Auto-detects when binary is under | Mounts |
|---|---|---|
| nix.toml | /nix/store | /nix/store (read-only) |
| homebrew.toml | /opt/homebrew, /home/linuxbrew/.linuxbrew | The matching prefix |
| cargo.toml | $HOME/.cargo, $HOME/.rustup | $HOME/.cargo, $HOME/.rustup |
| snap.toml | /snap | /snap |
| flatpak.toml | /var/lib/flatpak, $HOME/.local/share/flatpak | The matching prefix |
| gnu-store.toml | /gnu/store | /gnu/store |
How it works:
- The command path is canonicalized (all symlinks resolved) at startup.
- Each discovered recipe’s match_prefix is checked against the resolved path.
- Matching recipes are merged into the composition chain, bringing their [filesystem].allow paths, [process].allow_execve entries, and any other policy fields.
- For content-addressed stores like /nix/store, the entire tree is mounted. Binaries reference sibling store entries freely, making individual-entry mounting impractical.
Security note: Auto-detection makes the prefix visible inside the
sandbox but does not grant execution permission. The [process] allow_execve
list independently controls what binaries can be executed. Package
manager recipes include allow_execve prefix rules (e.g., /nix/store/*)
to authorize execution within the mounted tree.
Adding a new package manager: Create a new .toml recipe with
appropriate match_prefix, [filesystem].allow, and
[process].allow_execve entries. No Rust code changes needed.
[network]
Controls network access. Secure by default: all network access is denied unless explicitly allowed.
| Field | Type | Default | Description |
|---|---|---|---|
| egress | "proxy-only" \| "none" \| "direct" | "proxy-only" | Outbound networking mode |
| allow_ips | string[] | [] | Allowed IPs or CIDR ranges (IPv4 and IPv6) |
| ports | string[] | [] | Port forwarding specs ([ip:]hostPort:containerPort[/protocol]) |
| contract_mode | "strict" \| "relaxed" | "strict" | Default for hosts without a [[host]] block. strict refuses; relaxed allows + logs. |
FQDN egress goes through the top-level [[host]] table, not
this section. Each [[host]] block names a domain and the request
shapes accepted on it; see that section for the full schema.
Network mode determination:
The effective egress mode determines isolation behavior:
| egress | Mode | Description |
|---|---|---|
| none | None | No outbound network. Empty network namespace, loopback only. |
| proxy-only | Filtered | Outbound traffic must go through local proxy (kernel-enforced). |
| direct | Full/Filtered | Direct outbound. If allowlists/ports are set, filtered mode is used for policy checks; otherwise full host network namespace. |
Specifying ports automatically upgrades None mode to Filtered
mode (port forwarding requires a functional network namespace with pasta).
Domain matching:
Domains are matched per [[host]] block. A bare domain matches itself and
its subdomains (see the domain field under [[host]]), but unrelated domains
are never implied: allowing pypi.org does not allow files.pythonhosted.org.
Each distinct domain must be listed explicitly.
IP/CIDR matching:
IPs support both exact match and CIDR notation:
[network]
allow_ips = [
"93.184.216.34", # exact IPv4
"10.0.0.0/8", # IPv4 CIDR
"2606:2800:220:1::/64", # IPv6 CIDR
]
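Exact and CIDR matching of this kind can be reproduced with Python's standard `ipaddress` module. The sketch below only illustrates the matching rule; `ip_allowed` is a hypothetical helper, not Canister code.

```python
import ipaddress
from typing import List


def ip_allowed(address: str, allow_ips: List[str]) -> bool:
    """True if `address` equals an allowed IP or falls inside an allowed CIDR."""
    ip = ipaddress.ip_address(address)
    for entry in allow_ips:
        # A bare IP parses as a single-host network (/32 or /128).
        network = ipaddress.ip_network(entry, strict=False)
        if ip.version == network.version and ip in network:
            return True
    return False


allow_ips = ["93.184.216.34", "10.0.0.0/8", "2606:2800:220:1::/64"]
```

Treating a bare address as a one-host network keeps exact and CIDR entries on the same code path.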
Filtered mode requirements:
Filtered mode requires pasta (from the passt project) installed on the host:
sudo apt install passt # Debian/Ubuntu
sudo dnf install passt # Fedora
In filtered mode, pasta mirrors the host’s network configuration into the sandbox namespace by copying the host’s real IP addresses and routes. DNS is handled via a link-local address:
| Address | Role |
|---|---|
| Host’s default gateway | Gateway |
| 169.254.0.1 | DNS server (link-local, pasta --dns) |
| Host’s real IP | Sandbox IP (mirrored from host) |
[network]
egress = "proxy-only"
[[host]]
domain = "pypi.org"
[[host]]
domain = "files.pythonhosted.org"
[[host]]
domain = "registry.npmjs.org"
Port forwarding (ports):
Port forwarding uses Docker/Podman-compatible syntax and is specified via
the -p / --port CLI flag or the ports config field:
# CLI usage
can run -p 8080:80 -- my-server
can run -p 127.0.0.1:3000:3000 -p 5432:5432/tcp -- my-app
# Config usage
[network]
egress = "proxy-only"
ports = ["8080:80", "127.0.0.1:3000:3000", "5353:53/udp"]
Syntax: [ip:]hostPort:containerPort[/protocol]
| Component | Required | Default | Description |
|---|---|---|---|
| ip | No | 0.0.0.0 | Host IP to bind (e.g., 127.0.0.1) |
| hostPort | Yes | — | Port on the host |
| containerPort | Yes | — | Port inside the sandbox |
| protocol | No | tcp | tcp or udp |
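To make the syntax concrete, here is a small Python parser for the port spec, applying the defaults from the table. It is a sketch that handles IPv4 host addresses only (an IPv6 address would need bracket handling, which is not covered here), and `PortSpec` and `parse_port` are hypothetical names.

```python
from typing import NamedTuple


class PortSpec(NamedTuple):
    ip: str
    host_port: int
    container_port: int
    protocol: str


def parse_port(spec: str) -> PortSpec:
    """Parse [ip:]hostPort:containerPort[/protocol] (IPv4 host IPs only)."""
    mapping, _, protocol = spec.partition("/")
    protocol = protocol or "tcp"  # default protocol
    if protocol not in ("tcp", "udp"):
        raise ValueError(f"unknown protocol in {spec!r}")
    parts = mapping.split(":")
    if len(parts) == 2:
        ip, host, container = "0.0.0.0", parts[0], parts[1]  # default bind IP
    elif len(parts) == 3:
        ip, host, container = parts
    else:
        raise ValueError(f"expected [ip:]hostPort:containerPort in {spec!r}")
    ports = (int(host), int(container))
    if not all(1 <= p <= 65535 for p in ports):
        raise ValueError(f"port out of range in {spec!r}")
    return PortSpec(ip, *ports, protocol)
```

For instance, `parse_port("5353:53/udp")` binds all host interfaces on port 5353 and forwards UDP to port 53 inside the sandbox.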
[[host]]
Per-destination egress contract. One [[host]] block per FQDN you
allow the sandbox to reach. The block answers every question about
that upstream in one place: connect permission, the request shapes
that are legitimate, and which DLP detectors may carry verdicts on
this host as Warn instead of Block.
There is no separate connect-permission list. Having a [[host]]
block at all is the permission to dial; the block’s other fields
tighten what’s allowed from there. The minimum block is one line
(domain = "x") — equivalent to “allow this host, any shape.”
# Minimum-viable allow.
[[host]]
domain = "static.example.com"
# Full picture for a service we care about.
[[host]]
domain = "api.github.com"
methods = ["GET", "POST", "PATCH", "PUT", "DELETE"]
content_types = ["application/json", "application/vnd.github+json"]
paths = ["/repos/", "/user/", "/orgs/"]
max_request_bytes = 1_048_576 # 1 MiB
allow_credentials = ["github_pat"] # downgrade github_pat hits to Warn here
# Per-host escape hatch.
[[host]]
domain = "weird-tool.corp.internal"
contract_mode = "relaxed"
Fields
| Field | Type | Default | Description |
|---|---|---|---|
| domain | string | required | FQDN this block applies to. Wildcards (*.github.com) match one or more subdomain levels; bare domains match exact + any subdomain. Most-specific match wins. |
| methods | string[] | [] | Allowed HTTP methods (case-insensitive). Empty = any. |
| content_types | string[] | [] | Allowed request Content-Type values (matched on mime/subtype portion; ; charset=... parameters are ignored). Empty = any. |
| paths | string[] | [] | Path prefixes the request URI must start with. Empty = any. |
| max_request_bytes | u64 | unset | Per-host request body cap. Applies after the global max_streamed_body_bytes. |
| allow_credentials | string[] | [] | DLP detector ids whose verdicts on this host downgrade from Block to Warn (e.g. ["github_pat"] means the worker may legitimately carry a github PAT in Authorization to this host). |
| contract_mode | "strict" \| "relaxed" | inherit [network] contract_mode | Per-host override of the global default. Only affects the unknown-host decision once you’re inside this block; field-level checks still run. |
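The field semantics above can be sketched in Python. Two points are assumptions made for illustration: "most specific" is taken to mean the longest domain pattern, and a request without a Content-Type header is not rejected by content_types. All function names are hypothetical.

```python
from typing import Dict, List, Optional


def domain_matches(pattern: str, host: str) -> bool:
    """Wildcards need at least one subdomain level; bare domains also match exactly."""
    pattern, host = pattern.lower(), host.lower()
    if pattern.startswith("*."):
        suffix = pattern[1:]  # ".github.com"
        return host.endswith(suffix) and host != suffix
    return host == pattern or host.endswith("." + pattern)


def find_host(blocks: List[Dict], host: str) -> Optional[Dict]:
    """Pick the matching block; assumes 'most specific' = longest pattern."""
    candidates = [b for b in blocks if domain_matches(b["domain"], host)]
    return max(candidates, key=lambda b: len(b["domain"]), default=None)


def request_allowed(block: Dict, method: str, path: str,
                    content_type: Optional[str] = None) -> bool:
    """Check method, path prefix and Content-Type; empty lists mean 'any'."""
    methods = [m.upper() for m in block.get("methods", [])]
    if methods and method.upper() not in methods:
        return False
    paths = block.get("paths", [])
    if paths and not any(path.startswith(p) for p in paths):
        return False
    types = [t.lower() for t in block.get("content_types", [])]
    if types and content_type is not None:
        mime = content_type.split(";", 1)[0].strip().lower()  # drop parameters
        if mime not in types:
            return False
    return True
```

A block with only `domain` set has empty lists everywhere, so every request shape passes, which mirrors the "allow this host, any shape" minimum described earlier.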
Multiple [[host]] entries with the same domain merge by:
union on vec fields, max for max_request_bytes, last-Some-wins
for contract_mode. A project recipe can extend (never silently
restrict) a canister-shipped contract by writing another [[host]]
with the same domain.
Refusal behaviour
If a request reaches the proxy with a destination that has no matching
[[host]] block, the gate decides based on [network] contract_mode:
- strict (default) — refuse with 415 (or 413 for body-size), x-canister-error: contract-refused, and a response body that carries the exact [[host]] patch to paste into canister.toml to allow this exact shape.
- relaxed — allow the request but emit an unknown_host_contract tracing event. Intended for prototyping where the upstream set isn’t known up front.
See docs/refusals.md for the operator-facing
walkthrough (415 vs 451, how to read the patch, escape hatches).
Shipped service contracts
Canister ships contracts for the upstreams workers most commonly hit
under recipes/services/: github.toml, openai.toml,
anthropic.toml, npm.toml, pypi.toml, huggingface.toml,
docker.toml, aws.toml, stripe.toml, slack.toml. Compose
them with -r service:github (etc.) or via a project manifest.
[network.dlp]
Data Loss Prevention layer running inside the L7 egress proxy. Scans
outbound HTTP traffic for credential patterns (GitHub PATs, npm tokens,
AWS keys, Slack tokens, SSH private keys, generic bearer tokens) and
enforces per-host credential scoping via the allow_credentials field
on each [[host]] block — a GitHub PAT bound for
registry.npmjs.org will be blocked even though both hosts are
reachable. See DLP for the full threat model and detector
list.
DLP only runs when traffic is inspectable, i.e. when
network.egress = "proxy-only".
[network.dlp]
enabled = true
canary_tokens = true # default when DLP is enabled
max_decode_depth = 32 # base64/hex/percent recursion cap
decompress = true # gzip/deflate/brotli before scan
dns_entropy_threshold = 4.5 # Shannon entropy per DNS label
session_entropy_budget = 8192 # cumulative high-entropy bytes/session
# Extend credential scope by adding allow_credentials on the host:
[[host]]
domain = "github.corp.example.com"
methods = ["GET", "POST", "PATCH", "PUT", "DELETE"]
content_types = ["application/json"]
allow_credentials = ["github_pat"]
Fields
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable DLP scanning. Implicitly true under --strict when egress = "proxy-only". |
| canary_tokens | bool | true when DLP enabled | Inject fake credentials into the sandbox env and treat any outbound appearance as exfiltration. |
| max_decode_depth | usize | 32 | Encoding chain recursion depth (base64 / hex / percent). |
| decompress | bool | true | Inflate gzip / deflate / brotli bodies before scanning. |
| dns_entropy_threshold | f64 | 4.5 | Shannon entropy per DNS label above which the hostname is blocked. |
| session_entropy_budget | u64 | 8192 | Cumulative high-entropy bytes allowed across one sandbox session before further requests are blocked. |
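To give a feel for dns_entropy_threshold, the following Python sketch computes Shannon entropy per DNS label and applies the 4.5 default. Measuring entropy in bits per character over the label's own character distribution is an assumption; the detector's exact formula is not given here.

```python
import math
from collections import Counter


def shannon_entropy(label: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not label:
        return 0.0
    total = len(label)
    counts = Counter(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def hostname_blocked(hostname: str, threshold: float = 4.5) -> bool:
    """Block when any single label exceeds the entropy threshold."""
    return any(shannon_entropy(label) > threshold for label in hostname.split("."))
```

Under this measure a label needs more than 22 distinct characters to exceed 4.5 bits, so ordinary hostnames pass while long random-looking labels (a typical shape for DNS tunnelling) do not.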
Credential-flow scope is configured per host via the
allow_credentials field on [[host]].
Built-in scopes
Each detector has a baseline list of home domains hardcoded in the detector registry — destinations where it’s universally legitimate for that credential type to flow:
| Detector | Built-in home domains |
|---|---|
| github_pat | github.com, *.github.com |
| npm_token | registry.npmjs.org |
| aws_access_key | *.amazonaws.com |
| slack_token | *.slack.com |
| bearer_token | (none — requires explicit allow_credentials = ["bearer_token"] on the host) |
| ssh_private_key, canary_token | (none — always block) |
| generic_high_entropy | (warn only, block in --strict) |
Add to this set per-host via allow_credentials on the relevant
[[host]] block. Built-in lists are never narrowed.
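One way to read the scoping rules is as a small decision function, sketched here in Python. Several details are assumptions made for illustration: a hit on a built-in home domain is treated as "allow", bare entries in the built-in table are matched exactly, and allow_credentials cannot override the always-block detectors. The names are hypothetical.

```python
from typing import Iterable

BUILTIN_HOME_DOMAINS = {
    "github_pat": ["github.com", "*.github.com"],
    "npm_token": ["registry.npmjs.org"],
    "aws_access_key": ["*.amazonaws.com"],
    "slack_token": ["*.slack.com"],
    "bearer_token": [],
}
ALWAYS_BLOCK = {"ssh_private_key", "canary_token"}


def pattern_matches(pattern: str, host: str) -> bool:
    """Assumption: bare entries match exactly, '*.' entries match subdomains."""
    if pattern.startswith("*."):
        return host.endswith(pattern[1:]) and host != pattern[1:]
    return host == pattern


def verdict(detector: str, host: str,
            allow_credentials: Iterable[str] = (), strict: bool = False) -> str:
    """Return 'allow', 'warn' or 'block' for a detector hit bound for `host`."""
    if detector in ALWAYS_BLOCK:
        return "block"
    if detector == "generic_high_entropy":
        return "block" if strict else "warn"
    if any(pattern_matches(p, host) for p in BUILTIN_HOME_DOMAINS.get(detector, [])):
        return "allow"
    if detector in allow_credentials:
        return "warn"  # per-host downgrade from Block to Warn
    return "block"
```

Under these assumptions a GitHub PAT sent to registry.npmjs.org is blocked, while the same token sent to a host whose [[host]] block lists github_pat in allow_credentials is downgraded to a warning.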
Merge semantics
| Field | Merge rule |
|---|---|
| enabled | OR — any Some(true) wins (security escalation, never reversed) |
| canary_tokens | OR |
| max_decode_depth, decompress, dns_entropy_threshold, session_entropy_budget | last-Some-wins |
A downstream recipe can never disable DLP that an upstream recipe enabled.
Interaction with --strict / --monitor
- --strict implicitly enables DLP (when egress = "proxy-only") and promotes generic_high_entropy from warn to block.
- --monitor logs DLP findings at warn! level but forwards the request, adding an x-canister-dlp-warning header so the sandboxed process can observe what would have been blocked.
On block, the proxy returns 451 Unavailable For Legal Reasons with
headers x-canister-error: dlp-blocked and
x-canister-dlp-detector: <name>.
[process]
Controls process creation, environment filtering, and executable restrictions.
| Field | Type | Default | Description |
|---|---|---|---|
| max_pids | int (optional) | none | Maximum number of processes (via RLIMIT_NPROC) |
| allow_execve | string[] | [] | Executables the sandbox may exec (empty = allow all) |
| env_passthrough | string[] | [] | Environment variables to pass from host (all others stripped) |
PID namespace isolation:
Every sandboxed process runs in its own PID namespace. The sandboxed command becomes PID 1 and cannot see or signal any host processes.
Environment filtering:
When env_passthrough is empty, the sandbox starts with a completely clean
environment — zero host environment variables are inherited. This is the most
secure default.
When env_passthrough contains variable names, only those variables are kept.
A minimal PATH=/usr/local/bin:/usr/bin:/bin is injected if PATH is not
in the passthrough list.
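The filtering rule can be sketched in a few lines of Python. Whether the minimal PATH is also injected when env_passthrough is empty is not stated explicitly; this sketch assumes it is. `filter_env` is a hypothetical name.

```python
from typing import Dict, List

DEFAULT_PATH = "/usr/local/bin:/usr/bin:/bin"


def filter_env(host_env: Dict[str, str], env_passthrough: List[str]) -> Dict[str, str]:
    """Keep only listed variables; inject a minimal PATH if PATH is not listed."""
    env = {name: host_env[name] for name in env_passthrough if name in host_env}
    if "PATH" not in env_passthrough:
        # Assumption: injected even when the passthrough list is empty.
        env["PATH"] = DEFAULT_PATH
    return env
```

With `env_passthrough = ["HOME", "LANG"]`, a host variable such as `AWS_SECRET_ACCESS_KEY` never reaches the sandboxed process, and the process still gets a usable PATH.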
max_pids enforcement:
Uses RLIMIT_NPROC to cap the number of processes. When exceeded, fork()
returns EAGAIN. This is a per-UID limit, which is effective inside the
sandbox’s user namespace (where the process runs as UID 0 mapped to the host
user).
allow_execve validation:
When non-empty, the resolved command path must match one of the listed paths. If the command is not in the allow list, execution is rejected before forking.
Prefix rules: Entries ending in /* match any binary under that
directory tree. For example, /nix/store/* allows any binary whose resolved
path starts with /nix/store/. The match requires a / boundary —
/nix/store-extra/foo does NOT match /nix/store/*. This is essential for
content-addressed stores like Nix where binary paths contain unpredictable
hashes.
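The exact-match and prefix rules can be expressed as a short Python check. This is an illustrative sketch of the matching logic only; `execve_allowed` is a hypothetical name.

```python
from typing import List


def execve_allowed(resolved_path: str, allow_execve: List[str]) -> bool:
    """Check a resolved binary path against allow_execve entries."""
    if not allow_execve:
        return True  # empty list = allow all
    for entry in allow_execve:
        if entry.endswith("/*"):
            # Strip only the "*" so the trailing "/" enforces the boundary.
            if resolved_path.startswith(entry[:-1]):
                return True
        elif resolved_path == entry:
            return True
    return False


allow_execve = ["/usr/bin/python3", "/usr/bin/pip", "/nix/store/*"]
```

Keeping the trailing slash in the comparison is what stops `/nix/store-extra/foo` from matching `/nix/store/*`.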
Ongoing enforcement: When the USER_NOTIF supervisor is active (kernel
5.9+, default), every execve() and execveat() call inside the sandbox is
intercepted and validated against allow_execve. This means child processes
cannot exec arbitrary binaries. When the notifier is disabled (kernel < 5.9
or notifier = false), only the initial command is validated, and child
processes can exec any binary visible in the mount namespace.
[process]
max_pids = 64
allow_execve = ["/usr/bin/python3", "/usr/bin/pip", "/nix/store/*"]
env_passthrough = ["PATH", "HOME", "LANG", "TERM", "VIRTUAL_ENV"]
[resources]
Resource limits enforced via cgroups v2. Requires systemd with per-user cgroup delegation (default on most modern distributions).
Opt-in: Resource limits are not included in any of the shipped base
recipes. They are entirely opt-in — add memory_mb and/or cpu_percent
to your own recipe when needed.
| Field | Type | Default | Description |
|---|---|---|---|
| memory_mb | int (optional) | none | Memory limit in megabytes |
| cpu_percent | int (optional) | none | CPU limit as percentage of one core (e.g., 50 = 50%) |
How it works:
Canister detects the current cgroup from /proc/self/cgroup, creates a
child cgroup (canister-<pid>), writes memory.max and cpu.max, and
moves the sandboxed process into it. No root required.
- memory_mb = 512 → memory.max = 536870912 (512 MiB). Exceeding the limit triggers the kernel OOM killer.
- cpu_percent = 50 → cpu.max = "50000 100000" (50ms quota per 100ms period), capping the process to 50% of one CPU core.
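The two conversions are simple arithmetic, shown here in Python for reference. The function names are hypothetical; the 100 ms period is taken from the example above.

```python
def memory_max(memory_mb: int) -> int:
    """Bytes written to memory.max for a limit given in MiB."""
    return memory_mb * 1024 * 1024


def cpu_max(cpu_percent: int, period_us: int = 100_000) -> str:
    """'quota period' string written to cpu.max, both in microseconds."""
    quota_us = period_us * cpu_percent // 100
    return f"{quota_us} {period_us}"
```

So `cpu_percent = 100` corresponds to a quota equal to the full period, which is one whole core.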
Failure behavior: If cgroup setup fails (e.g., no cgroup v2, no
delegation), the sandbox aborts. In strict mode (--strict), seccomp
uses KILL_PROCESS for immediate termination on any denied syscall.
[resources]
memory_mb = 512
cpu_percent = 100
[syscalls]
Customizes the seccomp BPF baseline and enforcement mode.
Canister ships a single default seccomp baseline defined in
recipes/default.toml (~187 allowed syscalls, ~18 always-denied). The
baseline is embedded in the binary at compile time and can be overridden by
placing a default.toml in the recipe search path (./.canister/,
$XDG_CONFIG_HOME/canister/recipes/, /etc/canister/recipes/).
Regular recipes customize the baseline by adding or removing syscalls with
allow_extra / deny_extra. The baseline itself uses allow / deny
(absolute lists). These two pairs are mutually exclusive — a recipe
either IS the baseline (uses allow/deny) or EXTENDS it (uses
allow_extra/deny_extra).
Override fields (for regular recipes)
| Field | Type | Default | Description |
|---|---|---|---|
| seccomp_mode | string | "allow-list" | Seccomp mode: "allow-list" (default deny) or "deny-list" (default allow) |
| allow_extra | string[] | [] | Syscalls to add to the baseline allow list |
| deny_extra | string[] | [] | Syscalls to add to the deny list (also removed from allow list) |
| notifier | bool (optional) | auto-detect | Enable/disable the USER_NOTIF supervisor for argument-level syscall filtering |
Absolute fields (for default.toml only)
| Field | Type | Default | Description |
|---|---|---|---|
| allow | string[] | [] | Complete allow list (replaces the baseline, not additive) |
| deny | string[] | [] | Complete deny list (replaces the baseline, not additive) |
Mutual exclusion: Using allow or deny together with allow_extra
or deny_extra in the same [syscalls] section is a validation error.
Seccomp modes:
| Mode | Default action | Listed syscalls | Use case |
|---|---|---|---|
| allow-list | DENY all | Only baseline + allow_extra syscalls permitted | Production, CI (recommended) |
| deny-list | ALLOW all | Only baseline deny + deny_extra syscalls blocked | Compatibility, unknown workloads |
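How a recipe's relative fields combine with the baseline, and how the two modes differ, can be sketched in Python. The tiny baseline in the usage note is made up for illustration (the shipped baseline is far larger), and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def validate(section: Dict[str, object]) -> None:
    """Reject sections that mix absolute and relative syscall fields."""
    keys = set(section)
    if keys & {"allow", "deny"} and keys & {"allow_extra", "deny_extra"}:
        raise ValueError("allow/deny cannot be combined with allow_extra/deny_extra")


def effective_lists(baseline_allow: List[str], baseline_deny: List[str],
                    section: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Apply allow_extra / deny_extra on top of the baseline lists."""
    validate(section)
    deny_extra = section.get("deny_extra", [])
    deny = list(dict.fromkeys([*baseline_deny, *deny_extra]))
    combined = dict.fromkeys([*baseline_allow, *section.get("allow_extra", [])])
    # deny_extra entries are also removed from the allow list.
    allow = [name for name in combined if name not in deny_extra]
    return allow, deny


def permitted(syscall: str, mode: str, allow: List[str], deny: List[str]) -> bool:
    """allow-list: default deny; deny-list: default allow."""
    if mode == "allow-list":
        return syscall in allow
    return syscall not in deny
```

For a hypothetical baseline allowing `read`, `write`, and `personality` and denying `reboot`, a recipe with `allow_extra = ["ptrace"]` and `deny_extra = ["personality"]` ends up allowing `read`, `write`, and `ptrace`. In deny-list mode the same recipe blocks only `reboot` and `personality`.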
Examples:
# Elixir/BEAM: needs ptrace for observer/dbg/recon
[syscalls]
allow_extra = ["ptrace"]
# Strict: also block personality for extra hardening
[syscalls]
deny_extra = ["personality"]
# Several additions at once, including io_uring support
[syscalls]
allow_extra = ["ptrace", "personality", "seccomp", "io_uring_setup", "io_uring_enter", "io_uring_register"]
# Deny-list mode for maximum compatibility
[syscalls]
seccomp_mode = "deny-list"
See SECCOMP.md for details on the baseline syscall set and how the embed+override resolution works.
USER_NOTIF supervisor (notifier)
The notifier field controls the SECCOMP_RET_USER_NOTIF supervisor, which
provides argument-level filtering for connect(), clone()/clone3(),
socket(), execve(), and execveat().
| Value | Behavior |
|---|---|
| true | Force the notifier on (fails if kernel < 5.9) |
| false | Force the notifier off |
| omitted | Auto-detect: enabled if kernel >= 5.9 and not in monitor mode |
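The table amounts to a three-way decision, sketched here in Python with the kernel version given as a `(major, minor)` tuple. `notifier_enabled` is a hypothetical name, and raising an exception stands in for whatever error Canister actually reports.

```python
from typing import Optional, Tuple


def notifier_enabled(setting: Optional[bool], kernel: Tuple[int, int],
                     monitor: bool) -> bool:
    """Resolve the notifier field: True, False, or None when omitted."""
    supported = kernel >= (5, 9)
    if setting is True:
        if not supported:
            raise RuntimeError("notifier = true requires kernel 5.9 or newer")
        return True
    if setting is False:
        return False
    # Omitted: auto-detect.
    return supported and not monitor
```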
When the notifier is active, connect() calls are filtered against the
resolved IPs from each [[host]].domain and allow_ips, clone()/clone3() are
blocked from creating new namespaces, socket() is blocked from creating
AF_NETLINK or SOCK_RAW sockets, and execve()/execveat() are validated
against allow_execve paths for every execution (not just the initial command).
The notifier is merged using the last-Some-wins strategy during recipe
composition, consistent with other Option<bool> scalar fields.
# Disable the notifier for compatibility with older kernels
[syscalls]
notifier = false
# Force it on (fail loudly if not supported)
[syscalls]
notifier = true
See SECCOMP.md for the full technical description.
[proxy]
L7 proxy settings used by proxy-only egress mode.
[proxy]
max_buffered_body_bytes = 8388608 # 8 MiB (default)
upstream_request_timeout_ms = 30000 # 30 s (default)
Fields
| Field | Type | Default | Description |
|---|---|---|---|
| max_buffered_body_bytes | usize | 8388608 | Max bytes buffered for DLP body scanning |
| upstream_request_timeout_ms | u64 | 30000 | Upstream request timeout in milliseconds |
Enforcement semantics
When network.egress = "proxy-only":
- sandboxed processes may only open outbound INET/INET6 connections to:
  - loopback proxy endpoint (127.0.0.1:<proxy_port> / ::1:<proxy_port>)
  - configured DNS server on port 53
- direct outbound internet access is denied by seccomp USER_NOTIF policy even if HTTP_PROXY/HTTPS_PROXY env vars are unset.
This makes seccomp-notify the first-line defense and proxy the forwarding path for legitimate traffic.
Strict Mode
Strict mode (--strict or strict = true in config) tightens all enforcement
for CI and production use.
Config:
strict = true
CLI:
can run --strict --recipe policy.toml -- python3 script.py
The CLI --strict flag can only tighten — if the config sets strict = true,
the CLI cannot override it to false.
Changes in strict mode:
| Enforcement point | Normal mode | Strict mode |
|---|---|---|
| Filesystem isolation | Aborts on failure | Aborts on failure |
| Network setup | Aborts on failure | Aborts on failure |
| Loopback bring-up | Aborts on failure | Aborts on failure |
| Seccomp deny action | EPERM (process survives) | KILL_PROCESS (immediate termination) |
| Cgroup setup | Aborts on failure | Aborts on failure |
The key difference is the seccomp deny action: normal mode returns EPERM so the process can handle denials gracefully; strict mode kills the process immediately on any denied syscall.
Mutual exclusion: --strict and --monitor cannot be used together.
Strict mode ensures full enforcement; monitor mode relaxes it. These are
contradictory intents.
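The tightening rule and the mutual exclusion can be summarised in a few lines. This is a minimal sketch rather than Canister's actual code: Mode and resolve_mode are hypothetical names, and it assumes that strict = true in the config conflicts with --monitor in the same way the --strict flag does.

```rust
#[derive(Debug, PartialEq)]
enum Mode {
    Normal,
    Strict,
    Monitor,
}

/// Hypothetical resolution of the effective mode from config and CLI flags.
fn resolve_mode(config_strict: bool, cli_strict: bool, cli_monitor: bool) -> Result<Mode, String> {
    // The CLI can only tighten: strict is the logical OR of both sources.
    let strict = config_strict || cli_strict;
    match (strict, cli_monitor) {
        (true, true) => Err("--strict and --monitor cannot be used together".to_string()),
        (true, false) => Ok(Mode::Strict),
        (false, true) => Ok(Mode::Monitor),
        (false, false) => Ok(Mode::Normal),
    }
}
```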
Monitor Mode
Monitor mode (--monitor) is a CLI flag, not a config field. It relaxes
enforcement across all policy sections so you can observe what would be
blocked without actually blocking it.
can run --monitor --recipe my_policy.toml -- python3 script.py
What changes in monitor mode:
| Section | Normal | Monitor |
|---|---|---|
| [process].allow_execve | Blocks unlisted commands | Logs warning, allows |
| [process].env_passthrough | Strips unlisted vars | Logs stripped count, passes all |
| [process].max_pids | Enforces RLIMIT_NPROC | Logs limit, skips enforcement |
| [syscalls] seccomp | Returns EPERM on denied syscalls | Logs to kernel audit, allows |
| [filesystem] | Overlay + pivot_root | Unchanged (isolation active) |
| [network] | Namespace + pasta | Unchanged (isolation active) |
Reading monitor output:
- Look for MONITOR: prefixed log lines in stderr.
- Seccomp events appear in kernel logs: journalctl -k | grep seccomp.
- A pre-run policy preview and post-run summary are printed automatically.
Monitor mode is a development tool. It provides no security guarantees.
Cannot be combined with --strict.
Inspecting the Resolved Policy
Use can recipe show to see the fully resolved policy after all recipe
merging and environment variable expansion:
# Show the base policy (no recipes)
can recipe show
# Show the resolved policy with a recipe
can recipe show -r elixir
# Show with auto-detection (pass the command to trigger match_prefix)
can recipe show -r elixir -- mix test
# Compose multiple recipes and see the result
can recipe show -r nix -r elixir
# Save to a standalone recipe file
can recipe show -r nix -r elixir > my-custom.toml
can run -r my-custom.toml -- mix test
The output is valid TOML and includes all resolved fields:
strict = false
[filesystem]
allow = ["/bin", "/sbin", "/usr/bin", ...]
deny = ["/etc/shadow", "/etc/gshadow"]
[network]
egress = "proxy-only"
[[host]]
domain = "hex.pm"
[[host]]
domain = "repo.hex.pm"
[[host]]
domain = "builds.hex.pm"
[process]
allow_execve = ["/nix/store/*"]
env_passthrough = ["PATH", "HOME", ...]
[resources]
[syscalls]
seccomp_mode = "allow-list"
allow_extra = ["ptrace"]
This serves two purposes:
- Auditing — see exactly what policy will be enforced before running.
- Standalone recipes — capture the output and use it as a custom recipe that doesn’t depend on any other recipe files.
Examples
Minimal: deny everything
No config file needed. This is the default.
can run -- echo "hello"
Equivalent to:
[filesystem]
[network]
egress = "none"
[syscalls]
Python data science
Allow pip installs from PyPI and access to a workspace directory.
[filesystem]
allow = [
"/usr/lib",
"/usr/bin",
"/usr/local/lib",
"/tmp/workspace",
]
deny = ["/etc/shadow", "/root"]
[network]
egress = "proxy-only"
[[host]]
domain = "pypi.org"
[[host]]
domain = "files.pythonhosted.org"
[process]
env_passthrough = ["PATH", "HOME", "LANG", "VIRTUAL_ENV"]
Node.js build
Allow npm registry access and a project directory.
[filesystem]
allow = [
"/usr/lib",
"/usr/bin",
"/usr/local",
"/home/user/project",
]
[network]
egress = "proxy-only"
[[host]]
domain = "registry.npmjs.org"
[[host]]
domain = "nodejs.org"
[process]
env_passthrough = ["PATH", "HOME", "NODE_ENV"]
Full network trust
For trusted code that needs unrestricted network access but should still be filesystem- and syscall-restricted.
[filesystem]
allow = ["/tmp/workspace"]
[network]
egress = "direct"
Air-gapped
No network, no filesystem beyond essentials, strict seccomp.
[filesystem]
allow = ["/tmp/workspace"]
deny = ["/etc", "/root", "/home"]
[network]
egress = "none"
Strict CI
All-or-nothing enforcement. If any isolation layer can’t be set up, the sandbox refuses to start. Denied syscalls kill the process immediately.
strict = true
[filesystem]
allow = ["/tmp/workspace"]
[network]
egress = "none"
[process]
max_pids = 64
allow_execve = ["/usr/bin/python3"]
[resources]
memory_mb = 512
cpu_percent = 100
[syscalls]
seccomp_mode = "allow-list"
Elixir/Erlang (mix tasks, iex, Phoenix)
Run mix tasks, iex shells, or Phoenix servers with hex.pm access.
Use with -r nix or -r homebrew if Elixir is installed via a package manager.
[recipe]
name = "elixir"
description = "Elixir/Erlang (BEAM VM) — mix, iex, Phoenix"
[filesystem]
allow = [
"/usr/lib",
"/usr/bin",
"/usr/local/lib",
"/usr/local/bin",
"/lib",
"/tmp/workspace",
]
deny = ["/etc/shadow", "/root"]
[network]
egress = "proxy-only"
[[host]]
domain = "hex.pm"
[[host]]
domain = "repo.hex.pm"
[[host]]
domain = "builds.hex.pm"
[process]
max_pids = 256
env_passthrough = [
"PATH", "HOME", "LANG", "TERM",
"MIX_ENV", "MIX_HOME", "HEX_HOME",
"ERL_AFLAGS", "ELIXIR_ERL_OPTIONS",
"SECRET_KEY_BASE", "DATABASE_URL", "PORT",
]
[syscalls]
allow_extra = ["ptrace"] # BEAM tracing tools (:observer, :dbg, recon)
Usage with composition:
# Nix-installed Elixir: nix.toml auto-detected, elixir.toml explicit
can run -r elixir -- mix test
# Explicit composition
can run -r nix -r elixir -- mix test
# Strict CI
can run --strict -r elixir -- mix test
Seccomp Filtering
Canister uses seccomp BPF to restrict which Linux syscalls the sandboxed process can invoke. This document explains how the default baseline works, how recipes customize it, and the enforcement modes available.
Table of Contents
- How Seccomp Works in Canister
- Default Baseline
- Customizing via Recipes
- Always-Denied Syscalls
- Deny Action: Errno, Kill, and Strict Mode
- Monitor Mode and SECCOMP_RET_LOG
- Architecture Validation
- USER_NOTIF Supervisor
- Inspecting the Baseline
How Seccomp Works in Canister
Canister generates a classic BPF (Berkeley Packet Filter) program at runtime
and loads it via prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER) right before
execve().
Two enforcement modes
| Mode | Default action | Listed syscalls | Config value |
|---|---|---|---|
| Allow-list (default) | DENY | Only listed syscalls permitted | seccomp_mode = "allow-list" |
| Deny-list | ALLOW | Only listed syscalls blocked | seccomp_mode = "deny-list" |
Allow-list mode (recommended, default) inverts the security model:
every syscall not explicitly in the baseline (plus allow_extra) is denied.
This provides a much smaller kernel attack surface.
Deny-list mode is the permissive fallback: everything is allowed except
the syscalls in the deny list (plus deny_extra). Use this when you need
maximum compatibility with unknown workloads, at the cost of a larger
attack surface.
The filter cannot be removed or modified after loading. The PR_SET_NO_NEW_PRIVS
flag is set first, which is required for unprivileged seccomp and also prevents
the sandboxed process from gaining new privileges via execve of setuid
binaries.
Default Baseline
Canister ships a single default seccomp baseline defined in
recipes/default.toml. The baseline is embedded in the binary at compile
time via include_str!(), so it always works standalone. At runtime, the
search path is checked for an external override:
- ./.canister/default.toml (project-local)
- $XDG_CONFIG_HOME/canister/recipes/default.toml (per-user)
- /etc/canister/recipes/default.toml (system-wide)
- Embedded fallback (compiled into the binary)
This lets teams pin, audit, or version-control the baseline independently of the binary.
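The lookup order can be pictured as a first-match search with an embedded fallback. The sketch below is illustrative only: BaselineSource, candidate_paths, and resolve_baseline are hypothetical names, and the existence check is passed in as a closure so the logic can be exercised without touching the filesystem.

```rust
use std::path::{Path, PathBuf};

/// Where the baseline ends up being loaded from (hypothetical type).
#[derive(Debug, PartialEq)]
enum BaselineSource {
    External(PathBuf),
    Embedded,
}

/// Build the search path in the documented order.
fn candidate_paths(project_dir: &Path, xdg_config_home: &Path) -> Vec<PathBuf> {
    vec![
        project_dir.join(".canister/default.toml"),
        xdg_config_home.join("canister/recipes/default.toml"),
        PathBuf::from("/etc/canister/recipes/default.toml"),
    ]
}

/// Return the first candidate that exists, else fall back to the embedded copy.
fn resolve_baseline(candidates: &[PathBuf], exists: impl Fn(&Path) -> bool) -> BaselineSource {
    candidates
        .iter()
        .find(|p| exists(p.as_path()))
        .cloned()
        .map(BaselineSource::External)
        .unwrap_or(BaselineSource::Embedded)
}
```

In real use the closure would be something like |p| p.is_file(); the embedded variant corresponds to the copy compiled in via include_str!().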
The baseline provides:
- ~187 allowed syscalls — the common syscalls needed by most programs (read, write, open, mmap, clone, futex, getpgrp, etc.)
- ~18 always-denied syscalls — dangerous operations that no sandboxed process should ever need (reboot, kexec_load, mount, etc.)
The default.toml uses absolute [syscalls] allow = [...] and deny = [...]
fields. Regular recipes use the relative allow_extra / deny_extra fields
to layer overrides on top. These two modes are mutually exclusive — a
recipe either IS the baseline (uses allow/deny) or EXTENDS it (uses
allow_extra/deny_extra).
The baseline was derived by analyzing the syscall needs of Python, Node.js, Elixir/BEAM, and general-purpose binaries. The old 4-profile system (generic, python, node, elixir) was collapsed into this single baseline because:
- Python and Node were literally identical — same allow list, same deny list.
- The total delta across all 4 profiles was only 6 syscalls: ptrace, personality, seccomp, io_uring_setup, io_uring_enter, io_uring_register.
- The 4-profile taxonomy gave a false sense of specificity.
Recipes that need syscalls beyond the baseline use [syscalls] allow_extra.
Recipes that want tighter restrictions use [syscalls] deny_extra.
Customizing via Recipes
The [syscalls] section in a recipe TOML customizes the baseline:
[syscalls]
allow_extra = ["ptrace"] # add to the allow list
deny_extra = ["personality"] # add to deny list AND remove from allow list
seccomp_mode = "allow-list" # default; or "deny-list"
How overrides work:
- Start with the default baseline (ALLOW_BASE + DENY_ALWAYS).
- Add allow_extra syscalls to the allow list (deduplicated).
- Add deny_extra syscalls to the deny list.
- Remove deny_extra syscalls from the allow list (deny takes precedence).
- Generate the BPF filter from the final lists.
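The override steps above can be sketched with two sets. This is an assumption-laden illustration, not the body of SeccompProfile::apply_overrides(): SyscallLists and the free function apply_overrides below are hypothetical, and BPF generation (the last step) is left out.

```rust
use std::collections::BTreeSet;

/// Hypothetical container for the final allow and deny sets.
struct SyscallLists {
    allow: BTreeSet<String>,
    deny: BTreeSet<String>,
}

fn apply_overrides(
    base_allow: &[&str],
    base_deny: &[&str],
    allow_extra: &[&str],
    deny_extra: &[&str],
) -> SyscallLists {
    // Steps 1 and 2: baseline plus allow_extra; the set deduplicates.
    let mut allow: BTreeSet<String> = base_allow
        .iter()
        .chain(allow_extra)
        .map(|s| s.to_string())
        .collect();
    let mut deny: BTreeSet<String> = base_deny.iter().map(|s| s.to_string()).collect();
    // Steps 3 and 4: deny_extra is added to deny and removed from allow,
    // so deny wins even if the same name also appears in allow_extra.
    for name in deny_extra {
        deny.insert(name.to_string());
        allow.remove(*name);
    }
    SyscallLists { allow, deny }
}
```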
Common recipes:
| Workload | allow_extra | deny_extra | Why |
|---|---|---|---|
| Python scripts | (none) | — | Default baseline is sufficient |
| Node.js builds | (none) | — | Default baseline is sufficient |
| Elixir/BEAM | ["ptrace"] | — | BEAM tools (:observer, :dbg, recon) need ptrace |
| Generic (permissive) | ["ptrace", "personality", "seccomp", "io_uring_setup", "io_uring_enter", "io_uring_register"] | — | Maximum compatibility |
| Hardened | — | ["personality"] | Block multilib/personality switching |
Always-Denied Syscalls
The default baseline includes ~16 syscalls that are always denied. These are dangerous kernel-level operations that a sandboxed process should never need:
| Syscall | Why it’s blocked |
|---|---|
| reboot | Reboots the system |
| kexec_load | Loads a new kernel |
| init_module | Loads a kernel module |
| finit_module | Loads a kernel module (from fd) |
| delete_module | Unloads a kernel module |
| swapon | Enables swap space |
| swapoff | Disables swap space |
| acct | Enables/disables process accounting |
| mount | Mounts a filesystem |
| umount2 | Unmounts a filesystem |
| pivot_root | Changes the root filesystem |
| chroot | Changes the root directory |
| syslog | Reads/controls kernel message buffer |
| settimeofday | Changes the system clock |
| unshare | Creates new namespaces (escape vector) |
| setns | Joins existing namespaces (escape vector) |
These are blocked because they represent operations that only system administrators should perform, and a sandboxed process has no legitimate reason to invoke them.
Deny Action: Errno, Kill, and Strict Mode
Canister supports three deny actions depending on the mode:
| Mode | Deny action | Behavior |
|---|---|---|
| Normal | SECCOMP_RET_ERRNO \| EPERM | Denied syscall returns -1 with errno = EPERM. Process survives. |
| Strict (--strict) | SECCOMP_RET_KILL_PROCESS | Process is immediately terminated with SIGSYS. |
| Monitor (--monitor) | SECCOMP_RET_LOG | Syscall is allowed but logged to kernel audit. |
Normal mode (default) uses Errno because:
- Most programs check return values and can handle EPERM gracefully.
- Kill mode makes debugging harder (process just dies with no error message).
- The denied syscalls are operations that programs generally don’t invoke accidentally: if a program calls reboot(), it’s intentional and getting EPERM back is the right response.
Strict mode (--strict) uses Kill because:
- In CI/production, a denied syscall indicates a policy violation or attack.
- Immediate termination prevents any further execution after a violation.
- The process cannot observe or react to the denial (no information leak).
The architecture validation check (wrong CPU architecture) always uses
SECCOMP_RET_KILL_PROCESS regardless of mode, since an architecture
mismatch indicates an actual attack (e.g., x32 ABI bypass attempt).
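For reference, the three deny actions map to the following 32-bit filter return values. The constants are the ones defined in the Linux UAPI seccomp header; Mode and deny_action are hypothetical names used only for this sketch.

```rust
// Values from <linux/seccomp.h>; EPERM is errno 1 on Linux.
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
const SECCOMP_RET_LOG: u32 = 0x7ffc_0000;
const EPERM: u32 = 1;

#[derive(Clone, Copy)]
enum Mode {
    Normal,
    Strict,
    Monitor,
}

/// Return value the filter would use for a denied syscall in each mode.
fn deny_action(mode: Mode) -> u32 {
    match mode {
        // The errno travels in the low 16 bits of the return value.
        Mode::Normal => SECCOMP_RET_ERRNO | EPERM,
        Mode::Strict => SECCOMP_RET_KILL_PROCESS,
        Mode::Monitor => SECCOMP_RET_LOG,
    }
}
```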
Monitor Mode and SECCOMP_RET_LOG
When running with --monitor, the seccomp filter uses SECCOMP_RET_LOG
(0x7ffc0000) instead of SECCOMP_RET_ERRNO. This is a third deny action
mode:
| Mode | Return value | Behavior |
|---|---|---|
| Errno | SECCOMP_RET_ERRNO \| EPERM | Denied syscall returns EPERM |
| Kill | SECCOMP_RET_KILL_PROCESS | Process killed immediately |
| Log | SECCOMP_RET_LOG | Syscall is allowed but logged to kernel audit |
In Log mode, the BPF filter structure is identical to Errno mode — same architecture check, same deny list, same jump offsets. Only the return value for matched syscalls changes. This means the filter accurately reflects what would be blocked in enforcement mode.
Viewing logged syscalls:
# After running with --monitor
journalctl -k | grep seccomp
# or
dmesg | grep seccomp
Each log line shows the syscall number, PID, and other context. Map syscall
numbers back to names with ausyscall (from the auditd package):
ausyscall --dump | grep <number>
SECCOMP_RET_LOG is available since Linux 4.14 (well within the 5.6+
minimum kernel requirement).
Architecture Validation
The BPF filter’s first check validates that the syscall comes from the expected CPU architecture:
- x86_64: AUDIT_ARCH_X86_64 (0xC000003E)
- aarch64: AUDIT_ARCH_AARCH64 (0xC00000B7)
If the architecture doesn’t match, the process is killed immediately
(SECCOMP_RET_KILL_PROCESS).
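Why this matters: On x86_64, the kernel can also accept syscalls made through other ABIs, such as the 32-bit i386 compatibility ABI and the x32 ABI (which runs in 64-bit mode but uses 32-bit pointers). These ABIs use different syscall numbers than native x86_64. Without an ABI check, an attacker could invoke syscalls through an alternate ABI to bypass the filter (since the BPF checks are against native x86_64 numbers). Note that at the kernel level x32 calls report the same AUDIT_ARCH_X86_64 value and are distinguished by the __X32_SYSCALL_BIT (0x40000000) set in the syscall number, so the architecture comparison alone does not identify them.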
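A typical classic-BPF prologue for this check looks like the sketch below. It is an illustration, not Canister's generated filter: SockFilter is a hypothetical mirror of the kernel's struct sock_filter, arch_check_prologue is an invented helper, and the opcode and offset constants are taken from the Linux UAPI headers.

```rust
/// Same field layout as the kernel's `struct sock_filter`.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
struct SockFilter {
    code: u16,
    jt: u8,
    jf: u8,
    k: u32,
}

const BPF_LD_W_ABS: u16 = 0x20; // BPF_LD | BPF_W | BPF_ABS
const BPF_JMP_JEQ_K: u16 = 0x15; // BPF_JMP | BPF_JEQ | BPF_K
const BPF_RET_K: u16 = 0x06; // BPF_RET | BPF_K
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const AUDIT_ARCH_X86_64: u32 = 0xC000_003E;
const AUDIT_ARCH_AARCH64: u32 = 0xC000_00B7;
const OFFSET_NR: u32 = 0; // offsetof(struct seccomp_data, nr)
const OFFSET_ARCH: u32 = 4; // offsetof(struct seccomp_data, arch)

/// First four instructions: load arch, compare, kill on mismatch, load nr.
fn arch_check_prologue(expected_arch: u32) -> [SockFilter; 4] {
    [
        SockFilter { code: BPF_LD_W_ABS, jt: 0, jf: 0, k: OFFSET_ARCH },
        // On a match, skip one instruction (the kill); otherwise fall through.
        SockFilter { code: BPF_JMP_JEQ_K, jt: 1, jf: 0, k: expected_arch },
        SockFilter { code: BPF_RET_K, jt: 0, jf: 0, k: SECCOMP_RET_KILL_PROCESS },
        // The syscall number is now in the accumulator for the per-syscall checks.
        SockFilter { code: BPF_LD_W_ABS, jt: 0, jf: 0, k: OFFSET_NR },
    ]
}
```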
USER_NOTIF Supervisor
Classic BPF can only inspect the syscall number and architecture (seccomp_data.nr
and seccomp_data.arch). It cannot inspect syscall arguments — for pointer-based
arguments like connect()’s sockaddr or execve()’s pathname, the BPF filter
only sees the raw pointer value, not the data it points to.
Canister uses SECCOMP_RET_USER_NOTIF (Linux 5.9+) to bridge this gap. When the
sandboxed process invokes a syscall that requires argument inspection, the kernel
suspends the calling thread and delivers a notification to a supervisor process.
The supervisor reads the actual argument data (via /proc/<pid>/mem), makes a
policy decision, and sends an ALLOW or DENY verdict back to the kernel.
How it works
The supervisor runs as PID 1 inside the sandbox’s PID namespace, not as a thread in the parent. This architecture is required because of three cascading kernel restrictions:
- After unshare(CLONE_NEWPID), clone(CLONE_THREAD) returns EINVAL (pid_ns_for_children != task_active_pid_ns), so a supervisor thread cannot be spawned.
- The host’s procfs (s_user_ns = init_user_ns) denies /proc/<pid>/mem opens from a child user namespace, so the supervisor must mount its own procfs.
- PID 1 is an ancestor of all sandboxed processes, satisfying Yama ptrace_scope=1 without PR_SET_PTRACER.
PID 1 (supervisor, same user ns + PID ns) PID 2+ (worker / sandboxed)
───────────────────────────────────────── ────────────────────────────
1. unshare(CLONE_NEWNS) 1. Sandbox setup
2. mount /proc (owned by user ns) (overlay, pivot_root, etc.)
3. recv_fd() via SCM_RIGHTS 2. seccomp() → notifier fd
→ notifier_fd 3. send_fd() via SCM_RIGHTS
4. Loop: 4. Install main BPF filter
a. poll(notifier_fd, 200ms) 5. execve()
b. ioctl(NOTIF_RECV) → read notification
c. open+read /proc/<pid>/mem
d. Evaluate against policy
e. ioctl(NOTIF_ID_VALID) → TOCTOU check
f. ioctl(NOTIF_SEND) → verdict
g. waitpid(WNOHANG) → check child status
The supervisor runs inline (single-threaded) using poll() with a 200ms timeout,
interleaved with non-blocking waitpid to detect when the worker exits. After the
worker exits, remaining in-flight notifications are drained before the supervisor
terminates.
Two-filter architecture
The worker installs two seccomp filters:
- Notifier filter (installed first via seccomp() syscall with SECCOMP_FILTER_FLAG_NEW_LISTENER): Returns SECCOMP_RET_USER_NOTIF for the eight intercepted syscalls (connect, sendto, sendmsg, clone, clone3, socket, execve, execveat). All other syscalls return SECCOMP_RET_ALLOW.
- Main filter (installed second via prctl(PR_SET_SECCOMP)): The existing allow-list or deny-list BPF filter. Returns SECCOMP_RET_ERRNO, SECCOMP_RET_KILL_PROCESS, or SECCOMP_RET_LOG depending on mode.
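The kernel runs every installed filter and applies the result with the highest
precedence. SECCOMP_RET_USER_NOTIF outranks SECCOMP_RET_ALLOW and SECCOMP_RET_LOG,
so an intercepted syscall that the main filter permits is still delivered to the
supervisor. SECCOMP_RET_ERRNO and SECCOMP_RET_KILL_PROCESS outrank
SECCOMP_RET_USER_NOTIF, so a syscall that the main filter denies is rejected
without a notification.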
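How the kernel combines the results of stacked filters can be modelled in a few lines. Per seccomp(2), the action with the highest precedence wins, which the kernel implements by comparing the action bits as a signed 32-bit integer and keeping the smallest. The combine function below is a hypothetical model of that rule, not code from Canister.

```rust
// Action values from <linux/seccomp.h>.
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
const SECCOMP_RET_USER_NOTIF: u32 = 0x7fc0_0000;
const SECCOMP_RET_LOG: u32 = 0x7ffc_0000;
const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;
const SECCOMP_RET_ACTION_FULL: u32 = 0xffff_0000;

/// Pick the winning return value among all filters' results.
fn combine(results: &[u32]) -> u32 {
    results
        .iter()
        .copied()
        // Mask off the data bits, then compare as signed: KILL_PROCESS
        // (sign bit set) sorts lowest, ALLOW sorts highest.
        .min_by_key(|r| (r & SECCOMP_RET_ACTION_FULL) as i32)
        .unwrap_or(SECCOMP_RET_ALLOW)
}
```

With a notifier filter returning USER_NOTIF and a main filter returning ALLOW, the notification is delivered; if the main filter returns ERRNO or KILL_PROCESS for the same syscall, that result wins instead.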
Intercepted syscalls
| Syscall | Argument inspected | Policy |
|---|---|---|
| connect() | sockaddr (destination address) | Allow only IPs pre-resolved from each [[host]] block’s domain and explicit allow_ips. Loopback and Unix domain sockets always allowed. |
| sendto() | dest_addr + msg_controllen | DNS queries on port 53 trigger supervisor-side resolution and dynamic allowlist population. Connected sockets (NULL dest_addr) allowed. |
| sendmsg() | msghdr struct (msg_controllen) | Blocks any sendmsg() with ancillary data (msg_controllen > 0), preventing SCM_RIGHTS fd passing regardless of outbound restriction settings. |
| clone() | flags (register value) | Deny namespace-creating flags: CLONE_NEWNS, CLONE_NEWCGROUP, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWUSER, CLONE_NEWPID, CLONE_NEWNET |
| clone3() | clone_args.flags (read from userspace struct) | Same flag check as clone(), read from the clone_args struct via /proc/<pid>/mem |
| socket() | domain, type, protocol (register values) | SOCK_RAW denied. AF_NETLINK restricted to NETLINK_ROUTE (protocol 0) only; all other netlink protocols denied. Normal TCP/UDP/Unix sockets allowed. |
| execve() | pathname (read from userspace string) | Validate against allow_execve paths. If allow_execve is empty, allow all. |
| execveat() | pathname (read from userspace string) | Same as execve(). Resolves the path relative to the dirfd argument. |
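As one concrete case from the table, the clone()/clone3() flag check reduces to a bitmask test. The flag values below come from the Linux UAPI sched header; clone_allowed is a hypothetical helper, and the sketch ignores how the supervisor obtains the flags (a register for clone(), a struct read for clone3()).

```rust
// Namespace-creating flags from <linux/sched.h>.
const CLONE_NEWNS: u64 = 0x0002_0000;
const CLONE_NEWCGROUP: u64 = 0x0200_0000;
const CLONE_NEWUTS: u64 = 0x0400_0000;
const CLONE_NEWIPC: u64 = 0x0800_0000;
const CLONE_NEWUSER: u64 = 0x1000_0000;
const CLONE_NEWPID: u64 = 0x2000_0000;
const CLONE_NEWNET: u64 = 0x4000_0000;

const NAMESPACE_FLAGS: u64 = CLONE_NEWNS
    | CLONE_NEWCGROUP
    | CLONE_NEWUTS
    | CLONE_NEWIPC
    | CLONE_NEWUSER
    | CLONE_NEWPID
    | CLONE_NEWNET;

/// Allow the call only if no namespace-creating flag is set.
fn clone_allowed(flags: u64) -> bool {
    (flags & NAMESPACE_FLAGS) == 0
}
```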
TOCTOU protection
A time-of-check-time-of-use race exists: a multi-threaded sandboxed process
could modify the memory that the supervisor reads between the read and the
verdict. Canister mitigates this with SECCOMP_IOCTL_NOTIF_ID_VALID:
- Read notification (gets syscall args and a unique notification ID).
- Read memory via /proc/<pid>/mem for pointer-based arguments.
- Evaluate policy.
- Call ioctl(SECCOMP_IOCTL_NOTIF_ID_VALID, &id). If the kernel returns an error (ENOENT), the notification is stale: the target thread is no longer blocked in that syscall (for example, it was killed or interrupted by a signal). The supervisor skips sending a verdict.
- Send verdict.
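This is the validity check recommended by the seccomp_unotify(2) man page.
It guards against acting on a stale notification (for example, after the target
has exited and its PID has been reused), but it cannot detect another thread
rewriting the pointed-to memory after the supervisor has read it. It narrows the
race window rather than closing it.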
CIDR matching
For connect() filtering, the supervisor supports both exact IP matches and CIDR
range matches (e.g., 10.0.0.0/8, 2606:2800:220:1::/64). The resolved IPs from
each [[host]] block’s domain are combined with any allow_ips CIDR ranges
from the config to build the allowlist. Loopback addresses (127.0.0.0/8,
::1) and AF_UNIX sockets are always permitted.
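A minimal sketch of this kind of matching, using only the standard library, is shown below. The function names (parse_prefix, rule_contains, connect_allowed) are hypothetical, rules are kept as strings for brevity, and AF_UNIX handling is out of scope because only IP addresses reach this check.

```rust
use std::net::IpAddr;

/// Parse an optional prefix length; a bare address means an exact match.
fn parse_prefix(prefix: Option<&str>, max: u32) -> Option<u32> {
    match prefix {
        None => Some(max),
        Some(p) => p.parse::<u32>().ok().filter(|bits| *bits <= max),
    }
}

/// True if `ip` falls inside `rule`, where `rule` is "addr" or "addr/prefix".
fn rule_contains(rule: &str, ip: IpAddr) -> bool {
    let (net, prefix) = match rule.split_once('/') {
        Some((net, prefix)) => (net, Some(prefix)),
        None => (rule, None),
    };
    match (net.parse::<IpAddr>(), ip) {
        (Ok(IpAddr::V4(net)), IpAddr::V4(ip)) => match parse_prefix(prefix, 32) {
            Some(bits) => {
                let mask = if bits == 0 { 0 } else { u32::MAX << (32 - bits) };
                (u32::from(net) & mask) == (u32::from(ip) & mask)
            }
            None => false,
        },
        (Ok(IpAddr::V6(net)), IpAddr::V6(ip)) => match parse_prefix(prefix, 128) {
            Some(bits) => {
                let mask = if bits == 0 { 0 } else { u128::MAX << (128 - bits) };
                (u128::from(net) & mask) == (u128::from(ip) & mask)
            }
            None => false,
        },
        // A parse failure or an address-family mismatch never matches.
        _ => false,
    }
}

/// Loopback is always permitted; everything else must match a rule.
fn connect_allowed(rules: &[&str], ip: IpAddr) -> bool {
    ip.is_loopback() || rules.iter().any(|rule| rule_contains(rule, ip))
}
```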
DNS proxy integration
When the notifier is active, a DNS proxy runs in the parent process on
an ephemeral port. The sandbox’s /etc/resolv.conf points to pasta’s
DNS address (169.254.0.1:53), which is configured via --dns-forward
to forward queries to the parent’s DNS proxy. The proxy
only resolves domains that have a matching [[host]] block — all other
queries receive an NXDOMAIN response. This prevents DNS-based information
exfiltration and ensures the sandbox can only resolve allowed domains.
Configuration
The notifier is controlled by the notifier field in [syscalls]:
[syscalls]
notifier = true      # force on
# notifier = false   # force off
# omit → auto-detect (default)
Auto-detection logic:
- If notifier is explicitly set in the config, that value is used.
- If running in monitor mode, the notifier is disabled (monitor mode uses SECCOMP_RET_LOG, which is incompatible with SECCOMP_RET_USER_NOTIF).
- Otherwise, the notifier is enabled if the kernel version is 5.9 or later (the minimum version that supports all required seccomp_unotify ioctls).
Kernel version detection reads /proc/sys/kernel/osrelease and parses the
major.minor version.
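The decision order and the version parse can be sketched as follows. Both functions are hypothetical, the osrelease string is passed in rather than read from /proc, and two behaviours are assumptions of this sketch: an unparseable version disables the notifier, and the hard failure for notifier = true on an old kernel is not modelled.

```rust
/// Parse "major.minor" from an osrelease string such as "5.15.0-91-generic".
fn parse_kernel_version(osrelease: &str) -> Option<(u32, u32)> {
    let mut parts = osrelease.trim().split('.');
    let major: u32 = parts.next()?.parse().ok()?;
    // The minor field may carry a suffix (e.g. "9-rc1"); keep leading digits.
    let minor_digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor: u32 = minor_digits.parse().ok()?;
    Some((major, minor))
}

/// Explicit config wins, then monitor mode, then the kernel version check.
fn notifier_enabled(explicit: Option<bool>, monitor: bool, kernel: Option<(u32, u32)>) -> bool {
    if let Some(value) = explicit {
        return value;
    }
    if monitor {
        return false;
    }
    // Tuples compare lexicographically, so (5, 10) and (6, 0) both pass.
    matches!(kernel, Some(version) if version >= (5, 9))
}
```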
Requirements
- Linux 5.9+: for SECCOMP_IOCTL_NOTIF_RECV, SECCOMP_IOCTL_NOTIF_SEND, and SECCOMP_IOCTL_NOTIF_ID_VALID.
- PR_SET_NO_NEW_PRIVS must be set on the worker before installing the filter (already done by both the notifier and main filter installation paths). The supervisor (PID 1) must NOT have PR_SET_NO_NEW_PRIVS set, as it would break /proc/<pid>/mem access.
- AppArmor: the canister_sandboxed profile must allow ptrace (readby tracedby) from the canister peer profile. This is configured in the shipped canister.apparmor profile.
Inspecting the Baseline
List discovered recipes and the default baseline:
$ can recipe list
Discovered recipes:
elixir Elixir/Erlang (BEAM VM) — mix, iex, Phoenix
+ptrace recipes/elixir.toml
...
Default baseline: ~187 allowed, ~18 denied syscalls
Customize per-recipe with [syscalls] allow_extra / deny_extra
To see exactly which syscalls the baseline allows/blocks, open
recipes/default.toml. The [syscalls] allow array is the allow set,
[syscalls] deny is the deny set. The file is the single source of truth —
it is embedded into the binary at compile time via include_str!() and can
be overridden by placing a default.toml in the recipe search path
(./.canister/, $XDG_CONFIG_HOME/canister/recipes/, /etc/canister/recipes/).
SeccompProfile::apply_overrides() merges per-recipe allow_extra /
deny_extra customizations on top of this baseline.
To see the fully resolved policy (after all recipe merging and env var
expansion), use can recipe show:
$ can recipe show -r elixir
strict = false
[filesystem]
allow = ["/bin", "/sbin", ...]
[syscalls]
seccomp_mode = "allow-list"
allow_extra = ["ptrace"]
...
The output is valid TOML that can be saved as a standalone recipe file.
Data Loss Prevention (DLP)
Canister’s L7 egress proxy includes a built-in DLP layer that scans
outbound HTTP traffic for credential patterns and enforces per-detector
domain scoping. Even when a sandboxed process has filesystem access to
credential files (because the user wants npm, gh, or aws to keep
working), DLP blocks those credentials from leaving over HTTP(S) to
unauthorised destinations.
Table of Contents
- Threat Model
- Architecture
- Detectors and Scope Model
- Scan Pipeline
- Encoding Chain Recursion
- Content Decompression
- DNS Entropy Check
- Session Entropy Budget
- Canary Tokens
- Enforcement Modes (--strict and --monitor)
- Response Headers and Status Codes
- Configuration
- Limitations
Threat Model
A sandboxed process typically has filesystem access to credential-bearing files — intentionally, because the user wants their package managers and CLI tools to keep working against private registries. That process is potentially:
- Untrusted: a build script, post-install hook, or LLM-generated command running with read access to ~/.npmrc, ~/.aws/credentials, the GitHub keyring, etc.
- Trusted-but-buggy: telemetry code that accidentally serialises environment variables containing tokens.
- Trusted-but-compromised: a supply-chain attack inside an otherwise reputable dependency.
DLP’s goal: even when a credential is readable, it cannot leave the sandbox via HTTP(S) unless flowing to an explicitly authorised destination for that credential’s service.
In scope:
- HTTP/1.1 and HTTP/2 request headers, bodies, trailers
- URI query parameters and path segments
- Bodies wrapped in gzip / deflate / brotli
- Multi-layer encoded payloads (base64 / hex / percent), up to 32 levels
- DNS-label exfiltration via high-entropy hostname labels
- Slow byte-at-a-time exfiltration via cumulative entropy budgeting
Out of scope:
- Covert timing channels
- In-memory key extraction
- Filesystem-write exfiltration to shared/CWD mounts
- Pixel-level steganography in image payloads
- Plain CONNECT (L4) tunnels: DLP forces interception when enabled, so any traffic that bypasses interception (e.g. non-HTTP protocols) is denied rather than inspected.
Architecture
DLP lives in the standalone can-dlp crate so it can be reused by both
the proxy and the sandbox (for canary generation) without pulling proxy
dependencies into the sandbox crate.
crates/can-dlp/
src/
detectors.rs — DetectorId enum, compiled RegexSet, Finding
scopes.rs — per-detector domain matching (built-in + extras)
decode.rs — base64/hex/percent recursion, up to N layers
decompress.rs — gzip/deflate/brotli body decompression
normalize.rs — whitespace/unicode normalisation before scanning
entropy.rs — Shannon entropy + SessionEntropyBudget
canary.rs — fake credential generation
scanner.rs — DlpScanner: orchestrates the full pipeline
error.rs — DlpError (thiserror)
The DlpConfig serde struct lives in can-policy (next to
NetworkConfig) to avoid a can-dlp → can-policy circular dependency.
Activation chain:
recipe / manifest [network.dlp]
│
▼
NetworkConfig::dlp (Option<DlpConfig>)
│
▼
ProxyServer constructed with DlpScanner + SessionEntropyBudget
│
▼
Per-request: scan headers + URI + (decompressed, decoded) body
When DLP is enabled, the proxy forces interception of all traffic. The passthrough path (which is opaque to the proxy) is disabled because it would bypass scanning.
Detectors and Scope Model
Each detector has hardcoded home domains baked into the binary.
Tokens can only flow to their home service — even if a [[host]]
block permits the destination, a GitHub PAT bound for
registry.npmjs.org is blocked.
| Detector | Pattern | Built-in home domains | Default action |
|---|---|---|---|
| github_pat | gh[pousr]_[A-Za-z0-9]{36} and github_pat_[A-Za-z0-9]{22}_[A-Za-z0-9]{59} | github.com, *.github.com | block |
| npm_token | npm_[A-Za-z0-9]{36} | registry.npmjs.org | block |
| aws_access_key | AKIA[A-Z0-9]{16} | *.amazonaws.com | block |
| slack_token | xox[baprs]-[0-9]{10,13}-[0-9]{10,13}-[A-Za-z0-9]{24} | *.slack.com | block |
| ssh_private_key | -----BEGIN (RSA\|EC\|OPENSSH\|DSA )?PRIVATE KEY----- | none (always block) | block |
| bearer_token | Bearer\s+[A-Za-z0-9\-._~+/]{20,}=* | (requires explicit allow_credentials = ["bearer_token"] on a host) | block |
| generic_high_entropy | Sliding window, Shannon entropy > 4.5, 20+ chars | (warn only) | warn (promoted to block in --strict) |
| canary_token | Exact match against injected fake credentials | none (always block) | block (error log) |
Enforcement rules:
- Known-service tokens (github_pat, npm_token, aws_access_key, slack_token): destination must be in the detector’s home domains or in a [[host]] block whose allow_credentials includes the detector id. Mismatched service → 451 block.
- bearer_token: generic; requires explicit per-host opt-in via allow_credentials = ["bearer_token"]. No implicit scope.
- ssh_private_key and canary_token: no legitimate HTTP destination; always blocked.
- generic_high_entropy: too noisy to scope; always warn, blocks only in --strict.
The shipped service contracts under recipes/services/*.toml
(github.toml, npm.toml, …) already include the right
allow_credentials for their detector — composing
tools = ["npm", "gh"] produces the right behaviour: npm tokens
can only reach npmjs.org, GitHub PATs can only reach GitHub.
Extending scopes for self-hosted services
Self-hosted services (GitHub Enterprise, private npm registries) extend
the built-in scopes via extra_scopes:
[network.dlp]
enabled = true
[network.dlp.extra_scopes]
github_pat = ["github.corp.example.com"]
npm_token = ["npm.internal.example.com"]
Extras are unioned with the built-in domains. They never replace or narrow them, so a self-hosted override cannot accidentally weaken the default scope for the public service.
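The union of built-in and extra scopes can be sketched as a wildcard-aware domain match. This is an illustration under stated assumptions, not the logic in scopes.rs: domain_matches and destination_in_scope are hypothetical names, a leading "*." is assumed to match subdomains only (which is why the table lists github.com and *.github.com separately), and the allow_credentials path is left out.

```rust
/// Match `host` against an exact domain or a "*.suffix" wildcard pattern.
fn domain_matches(pattern: &str, host: &str) -> bool {
    let host = host.to_ascii_lowercase();
    let pattern = pattern.to_ascii_lowercase();
    match pattern.strip_prefix("*.") {
        // Require a non-empty label followed by a dot before the suffix,
        // so "evilgithub.com" does not match "*.github.com".
        Some(suffix) => host
            .strip_suffix(suffix)
            .map_or(false, |rest| rest.len() > 1 && rest.ends_with('.')),
        None => host == pattern,
    }
}

/// Built-in home domains and extra_scopes are unioned; extras never narrow.
fn destination_in_scope(builtin: &[&str], extras: &[&str], host: &str) -> bool {
    builtin
        .iter()
        .chain(extras)
        .any(|pattern| domain_matches(pattern, host))
}
```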
Scan Pipeline
Per request, the proxy runs:
1. Headers (Authorization, Cookie, Proxy-Authorization, X-*)
→ scan_text → token detected? scope check
2. URI (full reconstructed authority + path + query)
→ scan_text → token detected? scope check
3. Body
a. Read Content-Encoding header
b. Decompress (gzip / deflate / brotli) if configured
c. Run encoding chain recursion (base64 / hex / percent)
d. Pattern match each layer against PatternSet
4. For every finding:
- canary → BLOCK + error! log (zero false positives)
- ssh key → BLOCK
- scoped → BLOCK if destination not in home/extras
- bearer → BLOCK unless the destination's `[[host]]` block lists `"bearer_token"` in `allow_credentials`
- generic → WARN (BLOCK in --strict)
5. Session entropy budget update; BLOCK if exceeded.
6. Build response:
- On allow: forward upstream with `update_content_length()` if body
was buffered.
- On block: 451 + `x-canister-error: dlp-blocked` +
`x-canister-dlp-detector: <name>`.
- On monitor-mode warn: forward upstream + add
`x-canister-dlp-warning` so the sandboxed process can observe what
would have been blocked.
DLP forces request body buffering within the existing
max_buffered_body_bytes cap. A streaming scan would miss tokens that
straddle chunk boundaries; the cap (default 8 MiB) prevents memory
abuse.
Encoding Chain Recursion
decode.rs walks every layer of base64 / base64url / hex / percent-encoding up to max_decode_depth (default 32). At each layer
the scanner attempts all decoders; any that produces output different
from its input is recursed into. All decoded layers are matched
against PatternSet, so:
- Authorization: Bearer dGVzdA== (Bearer test) is matched at the original layer.
- body={"x":"Z2hwX0FBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQQ=="} (a base64-wrapped GitHub PAT) is caught at depth 1.
- base64(base64(token)) is caught at depth 2.
- Garbage / malformed encoding at any layer is fail-closed: the original bytes are scanned as-is and the recursion stops on that branch; they are never silently skipped.
The depth cap is a fuse against adversarially deep nesting designed to exhaust CPU.
Content Decompression
decompress.rs inspects the Content-Encoding header and inflates
gzip / deflate / brotli bodies before scanning. This is gated by
network.dlp.decompress (default true).
Malformed or truncated compressed bodies fail the request rather than being forwarded unscanned — fail-closed.
DNS Entropy Check
Independently of HTTP scanning, the proxy applies a Shannon-entropy
check to the destination hostname before resolving it. Each DNS label
(the parts between the dots) is scored; if any label exceeds
dns_entropy_threshold (default 4.5) the request is blocked with
dlp-blocked + dns-entropy reason. This catches the classic DNS
exfiltration pattern: <base64-of-secret>.attacker.example where the
high-entropy subdomain is the payload.
The check runs even on CONNECT tunnels (before resolution), so it
applies regardless of L7 protocol.
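A per-label Shannon entropy check can be written with the standard library alone. The sketch below is a hedged illustration, not the code in entropy.rs: shannon_entropy and hostname_exceeds are hypothetical names, and entropy is computed over bytes in bits per byte.

```rust
/// Shannon entropy of `data` in bits per byte.
fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// True if any dot-separated label scores above `threshold`.
fn hostname_exceeds(host: &str, threshold: f64) -> bool {
    host.split('.')
        .any(|label| shannon_entropy(label.as_bytes()) > threshold)
}
```

Because a label of n bytes can score at most log2(n) bits, a threshold of 4.5 can only be exceeded by labels of at least 23 bytes, so short ordinary labels such as registry or npmjs never trip it.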
Session Entropy Budget
A sandbox session keeps a SessionEntropyBudget (default 8192 bytes).
After each request scan, the count of high-entropy bytes (Shannon
entropy > 4.0 in any 32-byte sliding window) is recorded against the
budget. When the budget is exhausted, further requests are blocked.
This catches slow exfiltration: a credential split across many small requests, each individually below the per-request entropy threshold but collectively well above plausible legitimate traffic patterns.
The budget is per ProxyServer instance, which is one per sandbox
session — it resets when the sandbox exits.
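As a rough model of the accounting, the sketch below counts every byte covered by at least one 32-byte window whose Shannon entropy exceeds 4.0 bits and charges that count to a per-session budget. The exact counting rule is an assumption, and the Python class only borrows the SessionEntropyBudget name for illustration.

```python
import math
from collections import Counter

WINDOW = 32        # sliding window size in bytes
THRESHOLD = 4.0    # bits of Shannon entropy per window


def shannon(window: bytes) -> float:
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def high_entropy_bytes(body: bytes) -> int:
    """Count bytes covered by at least one window above the threshold."""
    covered = bytearray(len(body))
    for start in range(len(body) - WINDOW + 1):
        if shannon(body[start:start + WINDOW]) > THRESHOLD:
            covered[start:start + WINDOW] = b"\x01" * WINDOW
    return sum(covered)


class SessionEntropyBudget:
    """Illustrative model: one instance per sandbox session."""

    def __init__(self, budget: int = 8192):
        self.remaining = budget

    def charge(self, body: bytes) -> bool:
        """Return True if the request may proceed, then record its cost."""
        if self.remaining <= 0:
            return False             # budget exhausted: block further requests
        self.remaining -= high_entropy_bytes(body)
        return True
```

Many small requests that each look harmless still draw down the same counter, which is how the budget catches slow exfiltration.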
Canary Tokens
When canary_tokens = true (default when DLP is enabled), the sandbox
generates three fake credentials at startup and exports them as
environment variables into the sandboxed process:
| Env var | Shape |
|---|---|
| CANISTER_CANARY_GITHUB_PAT | ghp_ + 36 random alphanumeric |
| CANISTER_CANARY_NPM_TOKEN | npm_ + 36 random alphanumeric |
| CANISTER_CANARY_AWS_ACCESS_KEY | AKIA + 16 random uppercase |
The same exact strings are registered as the highest-priority pattern in
the DlpScanner. Any outbound appearance of one of these strings is
proof of exfiltration — no legitimate process produces these values,
because they don’t authenticate to anything. The detection is always
block and is logged at error! level with full request context.
Canary tokens are an active honeypot: they create traffic that a malicious process is incentivised to scoop up (env vars are the first thing credential thieves enumerate) and which the proxy treats as a high-confidence trip wire.
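The token shapes in the table can be reproduced with the Python standard library. This is a sketch of the idea only: `make_canaries` and `leaked_canary` are hypothetical helpers, and "uppercase" is read here as uppercase ASCII letters.

```python
import secrets
import string
from typing import Dict, Optional

ALNUM = string.ascii_letters + string.digits


def _random(alphabet: str, length: int) -> str:
    return "".join(secrets.choice(alphabet) for _ in range(length))


def make_canaries() -> Dict[str, str]:
    """Generate one fake credential per env var, shaped like the real thing."""
    return {
        "CANISTER_CANARY_GITHUB_PAT": "ghp_" + _random(ALNUM, 36),
        "CANISTER_CANARY_NPM_TOKEN": "npm_" + _random(ALNUM, 36),
        "CANISTER_CANARY_AWS_ACCESS_KEY": "AKIA" + _random(string.ascii_uppercase, 16),
    }


def leaked_canary(outbound: bytes, canaries: Dict[str, str]) -> Optional[str]:
    """Return the env var name whose exact value appears in outbound bytes."""
    for name, value in canaries.items():
        if value.encode("ascii") in outbound:
            return name
    return None
```

Matching on the exact generated string, rather than on the token shape, is what makes a hit high-confidence: the value exists nowhere except in the sandbox environment.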
Enforcement Modes
DLP integrates with the existing sandbox enforcement modes rather than introducing a separate kill switch.
| Mode | DLP enabled? | generic_high_entropy | Block action |
|---|---|---|---|
| Default | Per recipe enabled = true | warn | 451 |
| --monitor | As configured | warn (logged) | Not blocked; request forwarded with x-canister-dlp-warning header |
| --strict | Implicitly enabled when egress = "proxy-only" | promoted to block | 451 |
- Default: DLP runs if the recipe enables it; violations are 451.
- --monitor: DLP findings are logged at warn! level with full detector / host / fingerprint detail but requests still go through. Mirrors how monitor mode handles seccomp and filesystem checks. Use this to dry-run a new policy before flipping it on.
- --strict: DLP is implicitly enabled even without dlp.enabled = true, provided the recipe uses egress = "proxy-only" (strict mode requires DLP-grade enforcement). generic_high_entropy is promoted from warn to block.
No new flags or kill switches were added — --strict plus recipe
config cover the same activation surface as a dedicated enable knob.
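The mode table can be read as two small decisions: whether DLP is active, and what happens to a finding. The Python sketch below encodes just that; the mode strings and the returned labels are names invented for this illustration.

```python
def dlp_active(mode: str, recipe_enabled: bool, egress: str) -> bool:
    """DLP runs when the recipe enables it, or implicitly under strict + proxy-only."""
    return recipe_enabled or (mode == "strict" and egress == "proxy-only")


def action_for(mode: str, detector: str) -> str:
    """Map a finding to an outcome for mode in {"default", "monitor", "strict"}."""
    if mode == "monitor":
        return "forward-with-warning"   # x-canister-dlp-warning header, no block
    if detector == "generic_high_entropy" and mode != "strict":
        return "warn"                   # the catch-all detector only warns by default
    return "block-451"
```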
Response Headers and Status Codes
| Outcome | Status | Headers |
|---|---|---|
| Token detected, blocked | 451 Unavailable For Legal Reasons | x-canister-error: dlp-blocked, x-canister-dlp-detector: <name> |
| Token detected, monitor mode | (upstream status) | x-canister-dlp-warning: <name> |
| DNS-label entropy block | 451 | x-canister-error: dlp-blocked, x-canister-dlp-reason: dns-entropy |
| Session budget exhausted | 451 | x-canister-error: dlp-blocked, x-canister-dlp-reason: session-budget |
451 is used so DLP blocks are distinguishable from upstream 403s.
The detector name is exposed in the header so the sandboxed process /
calling tool can produce a sensible error message.
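A calling tool could turn these headers into a message along the following lines. The function name and the message wording are hypothetical; only the status code and header names come from the table above.

```python
from typing import Mapping, Optional


def explain_dlp_outcome(status: int, headers: Mapping[str, str]) -> Optional[str]:
    """Describe a DLP block or warning, or return None for unrelated responses."""
    h = {key.lower(): value for key, value in headers.items()}
    if status == 451 and h.get("x-canister-error") == "dlp-blocked":
        if "x-canister-dlp-detector" in h:
            return f"blocked: detector {h['x-canister-dlp-detector']} matched"
        return f"blocked: {h.get('x-canister-dlp-reason', 'unspecified reason')}"
    if "x-canister-dlp-warning" in h:
        return f"forwarded (monitor mode): detector {h['x-canister-dlp-warning']} matched"
    return None
```

Checking x-canister-error as well as the status keeps a genuine upstream 451 from being reported as a DLP block.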
Configuration
Full schema (all fields optional; defaults shown):
[network.dlp]
enabled = false # implicit true under --strict + proxy-only
canary_tokens = true # default when DLP is enabled
max_decode_depth = 32 # encoding chain recursion cap
decompress = true # gzip/deflate/brotli before scan
dns_entropy_threshold = 0.92 # normalised Shannon entropy ratio per DNS label
session_entropy_budget = 8192 # cumulative high-entropy bytes/session
[network.dlp.extra_scopes]
github_pat = ["github.corp.example.com"]
npm_token = ["npm.internal.example.com"]
Merge semantics
When recipes / manifests are merged left-to-right (base.toml →
auto-detected → explicit -r → manifest overrides), each field uses:
| Field | Merge rule | Rationale |
|---|---|---|
| enabled | OR (any Some(true) wins) | Security escalation, never reversed |
| canary_tokens | OR | Same |
| extra_scopes | per-detector domain union | Never narrows |
| max_decode_depth | last-Some-wins | Numeric tuning |
| decompress | last-Some-wins | |
| dns_entropy_threshold | last-Some-wins | |
| session_entropy_budget | last-Some-wins | |
This guarantees a downstream recipe can never disable DLP that an upstream recipe enabled, and can never shrink the scope set.
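The three rules in the table can be modelled as follows. This is a Python illustration of the semantics, with `None` standing in for an omitted field; the `DlpConfig` dataclass and `merge` function are hypothetical and do not mirror the Rust types.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class DlpConfig:
    enabled: Optional[bool] = None
    canary_tokens: Optional[bool] = None
    max_decode_depth: Optional[int] = None
    decompress: Optional[bool] = None
    dns_entropy_threshold: Optional[float] = None
    session_entropy_budget: Optional[int] = None
    extra_scopes: Dict[str, Set[str]] = field(default_factory=dict)


def _or(a, b):
    """OR: any True wins and can never be reversed."""
    if a is True or b is True:
        return True
    return b if b is not None else a


def _last(a, b):
    """Last-Some-wins: an omitted (None) overlay keeps the earlier value."""
    return b if b is not None else a


def merge(base: DlpConfig, overlay: DlpConfig) -> DlpConfig:
    scopes = {name: set(domains) for name, domains in base.extra_scopes.items()}
    for name, domains in overlay.extra_scopes.items():
        scopes.setdefault(name, set()).update(domains)   # union, never narrows
    return DlpConfig(
        enabled=_or(base.enabled, overlay.enabled),
        canary_tokens=_or(base.canary_tokens, overlay.canary_tokens),
        max_decode_depth=_last(base.max_decode_depth, overlay.max_decode_depth),
        decompress=_last(base.decompress, overlay.decompress),
        dns_entropy_threshold=_last(base.dns_entropy_threshold,
                                    overlay.dns_entropy_threshold),
        session_entropy_budget=_last(base.session_entropy_budget,
                                     overlay.session_entropy_budget),
        extra_scopes=scopes,
    )
```

Folding `merge` left-to-right over the recipe chain gives the final configuration; a later `enabled = false` has no effect once any earlier recipe set it to true.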
Where to put it
- Project-level: [network.dlp] in canister.toml enables DLP for every sandbox in the project.
- Per-sandbox: same key under [sandbox.<name>.network.dlp].
- Recipe-level: drop a [network.dlp] block into a custom recipe. Tool recipes (tool:gh, tool:npm, etc.) deliberately do not ship [network.dlp]; they declare the right [[host]] blocks with allow_credentials, and the scope check does the rest.
Limitations
- Pattern coverage is finite. A novel credential shape (a vendor introducing a new prefix) won’t be caught until a detector is added. generic_high_entropy is the catch-all, but its warn-by-default posture means it’s only fatal in --strict.
- Body buffering ceiling. Requests above max_buffered_body_bytes (default 8 MiB) are rejected with 413 Payload Too Large rather than forwarded unscanned. This is fail-closed by design, but it limits the protocol shapes DLP can cover (large file uploads need a higher cap or a different egress path).
- TLS interception is required. DLP relies on the proxy’s MITM CA; it does not inspect end-to-end-pinned TLS (e.g. when the sandboxed process pins its own cert). Such traffic fails to handshake under the proxy, which is the same fail-closed posture.
- No regex on raw binary. Detectors operate on UTF-8 text after decompression and decoding. Binary protocols carrying credentials outside text fields (e.g. proprietary RPC over HTTP) need a custom detector or a different egress strategy.
CLI Reference
This file is auto-generated by
can-docgen. Do not edit manually.
can
Canister: a lightweight sandbox for running untrusted code safely
Usage: can [OPTIONS] <COMMAND>
Commands:
up Run a named sandbox from canister.toml
run Run a command inside the sandbox
check Check available kernel capabilities for sandboxing
setup Install or manage the security policy (AppArmor/SELinux) for filesystem isolation
recipe Manage and inspect recipes
init Download community recipes to the local config directory
update Update community recipes from the remote repository
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose Enable verbose (debug) logging
-h, --help Print help
-V, --version Print version
can run
Run a command inside the sandbox
Usage: can run [OPTIONS] <COMMAND>...
Arguments:
<COMMAND>...
The command to execute
Options:
-r, --recipe <RECIPE>
Recipe name or path. Can be repeated for composition.
If the argument contains `/` or ends with `.toml`, it is treated as a file path. Otherwise it is looked up by name across the recipe search path (e.g., `-r nix` resolves to `nix.toml`).
Multiple recipes are merged left-to-right.
-v, --verbose
Enable verbose (debug) logging
-m, --monitor
Run in monitor mode: log access attempts without enforcing
-s, --strict
Strict mode: fail hard on all setup failures. Seccomp uses KILL_PROCESS, filesystem isolation failures are fatal. Intended for CI / production use
-p, --port <PORTS>
Publish a container port to the host.
Syntax: [ip:]hostPort:containerPort[/protocol] Examples: -p 8080:80, -p 127.0.0.1:8443:443/tcp, -p 5000:5000/udp Can be repeated. Implies filtered network mode.
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
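The path-versus-name rule for -r can be stated compactly. The Python sketch below only classifies the argument according to the help text above; it does not model the recipe search path, and `classify_recipe_arg` is a hypothetical helper.

```python
from typing import Tuple


def classify_recipe_arg(arg: str) -> Tuple[str, str]:
    """Return ("path", arg) or ("name", filename to look up on the search path)."""
    if "/" in arg or arg.endswith(".toml"):
        return ("path", arg)
    return ("name", f"{arg}.toml")
```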
can up
Run a named sandbox from canister.toml.
Discovers canister.toml by walking up from the current directory, resolves the named sandbox (or the first-defined one), composes its recipes, and runs the command.
Usage: can up [OPTIONS] [NAME]
Arguments:
[NAME]
Sandbox name to run (defaults to the first defined in canister.toml)
Options:
--dry-run
Preview the resolved policy without running the sandbox
-v, --verbose
Enable verbose (debug) logging
-m, --monitor
Run in monitor mode: log access attempts without enforcing
-s, --strict
Override strict mode from the CLI
-p, --port <PORTS>
Publish a container port to the host.
Syntax: [ip:]hostPort:containerPort[/protocol] Can be repeated. Implies filtered network mode.
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
can check
Check available kernel capabilities for sandboxing
Usage: can check [OPTIONS]
Options:
-v, --verbose Enable verbose (debug) logging
-h, --help Print help
-V, --version Print version
can setup
Install or manage the security policy (AppArmor/SELinux) for filesystem isolation
Usage: can setup [OPTIONS]
Options:
--remove
Remove the security policy instead of installing it
-v, --verbose
Enable verbose (debug) logging
-f, --force
Force reinstall even if the policy is already installed. Useful after upgrading canister to pick up policy changes
--pasta-path <PASTA_PATH>
Explicit path to the pasta binary for non-standard installations.
When pasta is installed via Nix, Homebrew, or custom builds, sudo may not find it in PATH. Use this to generate correct AppArmor rules: sudo can setup --pasta-path $(which pasta)
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
can recipe
Manage and inspect recipes
Usage: can recipe [OPTIONS] <COMMAND>
Commands:
list List available recipes and the default baseline syscall counts
show Show the fully resolved recipe as TOML
explain Explain what a recipe does in human-readable form
suggest Suggest recipes for a command
help Print this message or the help of the given subcommand(s)
Options:
-v, --verbose Enable verbose (debug) logging
-h, --help Print help
-V, --version Print version
can init
Download community recipes to the local config directory.
Clones the canister GitHub repository (shallow) and copies recipe .toml files into $XDG_CONFIG_HOME/canister/recipes/. Requires git. Prints manual instructions if git is unavailable.
Usage: can init [OPTIONS]
Options:
--repo <REPO>
GitHub repository (owner/repo) to fetch from
-v, --verbose
Enable verbose (debug) logging
--branch <BRANCH>
Branch to fetch
--no-verify
Skip SHA-256 checksum verification of recipe files. Required when using custom/forked repositories
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
can update
Update community recipes from the remote repository.
Re-downloads and overwrites all recipes. Equivalent to `can init`.
Usage: can update [OPTIONS]
Options:
--repo <REPO>
GitHub repository (owner/repo) to fetch from
-v, --verbose
Enable verbose (debug) logging
--branch <BRANCH>
Branch to fetch
--no-verify
Skip SHA-256 checksum verification of recipe files. Required when using custom/forked repositories
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
can recipe list
List available recipes and the default baseline syscall counts
Usage: can recipe list [OPTIONS]
Options:
-v, --verbose Enable verbose (debug) logging
-h, --help Print help
-V, --version Print version
can recipe show
Show the fully resolved recipe as TOML.
Merges base.toml, auto-detected recipes, and explicit --recipe arguments, expands environment variables, then prints the final effective policy. The output is valid TOML that can be saved as a standalone recipe file.
Usage: can recipe show [OPTIONS] [COMMAND]...
Arguments:
[COMMAND]...
Optional command to resolve (enables auto-detection of recipes).
The command is NOT executed — it is only used to determine which recipes would be auto-detected based on `match_prefix`.
Options:
-r, --recipe <RECIPE>
Recipe name or path. Can be repeated for composition
-v, --verbose
Enable verbose (debug) logging
-h, --help
Print help (see a summary with '-h')
-V, --version
Print version
Configuration Reference
This file is auto-generated by
can-docgen. Do not edit manually.
Canister uses TOML recipe files with strict schema validation. Unknown fields are rejected at parse time.
Top-level fields
A recipe file — the only entry point for parsing policy TOML files.
| Field | Type | Default | Description |
|---|---|---|---|
| host | object[] | [] | Per-destination egress contracts. See host::HostBlock and docs/adr/0007-per-destination-egress-contracts.md. Multiple blocks targeting the same domain are merged in RecipeFile::merge (vec union, max for max_request_bytes, last-Some-wins for contract_mode). |
| strict | bool (optional) | — | Strict mode: fail hard instead of degrading gracefully. |
[filesystem]
| Field | Type | Default | Description |
|---|---|---|---|
| allow | string[] | [] | Paths the sandboxed process is allowed to access (read-only). |
| allow_write | string[] | [] | Paths bind-mounted writable into the sandbox. |
| deny | string[] | [] | Paths explicitly denied (checked before allow and allow_write). |
| mask | string[] | — | Paths to mask inside the sandbox (bind /dev/null over them). |
[network]
| Field | Type | Default | Description |
|---|---|---|---|
| allow_host_loopback | bool | false | Allow the sandbox to reach host loopback services through the egress proxy via the magic alias host.canister.local. |
| allow_ips | string[] | [] | Allowed IP addresses or CIDR ranges. IP-literal egress is a separate concept from FQDN egress (no service identity, no per-route shape gates apply), so it stays here rather than folding into the [[host]] table. |
| contract_mode | strict \| relaxed (optional) | — | Mode for hosts that have no matching [[host]] entry. |
| egress | "none", "proxy-only", "direct" (optional) | — | |
| ports | object[] | [] | Port forwarding rules: map host ports to sandbox ports. |
[network.dlp]
Data Loss Prevention configuration for the egress proxy.
| Field | Type | Default | Description |
|---|---|---|---|
| canary_tokens | bool (optional) | — | Inject canary tokens (fake credentials) into the sandbox environment to detect exfiltration attempts. Default: true when DLP is enabled. |
| decompress | bool (optional) | — | Decompress request bodies (gzip/deflate/brotli) before scanning. Default: true. |
| dns_entropy_threshold | number (optional) | — | Normalised per-label entropy ratio for DNS exfiltration detection. A label’s Shannon entropy is divided by log2(len) to get a value in [0.0, 1.0]; the FQDN trips when two or more labels exceed this ratio. Default: 0.92. (Pre-2026-05 configs used absolute bits; those values are now clamped to 1.0 and effectively disable the check.) |
| enabled | bool (optional) | — | Enable DLP scanning. Implicitly enabled in --strict mode when egress = "proxy-only". |
| max_decode_depth | integer (optional) | — | Maximum encoding chain recursion depth (base64, hex, percent-encoding). Default: 32. |
| session_entropy_budget | integer (optional) | — | Cumulative high-entropy bytes allowed per sandbox session before requests are blocked. Default: 8192. |
[process]
| Field | Type | Default | Description |
|---|---|---|---|
| allow_execve | string[] | [] | Paths to executables the sandboxed process may exec. |
| env | object | {} | Environment variables to set in the sandbox. These are evaluated after passthrough. |
| env_passthrough | string[] | [] | Environment variables to pass through from the host. All others are stripped. |
| max_pids | integer (optional) | — | Maximum number of child PIDs allowed. |
[proxy]
| Field | Type | Default | Description |
|---|---|---|---|
| max_buffered_body_bytes | integer (optional) | — | Maximum bytes buffered for DLP body scanning via the full whole-buffer pipeline (decode chains, decompression, unescape). Requests at or under this size get the strongest analysis. Default 8 MiB. Requests above this size up to max_streamed_body_bytes are still scanned but via the chunked streaming path (regex only, no decode chain). |
| max_streamed_body_bytes | integer (optional) | — | Hard upper bound on request body size. Beyond this, the proxy returns 413. Defaults to 64 MiB. Requests between max_buffered_body_bytes and this cap are scanned by the streaming detector: regex passes with a 256-byte overlap window, no decompression / decode chain. |
| upstream_request_timeout_ms | integer (optional) | — | Upstream request total timeout in milliseconds. Defaults to 30 000 ms. |
| upstream_scheme | string (optional) | — | Force the upstream request scheme. Accepts "http" or "h2c". When unset (the default), the scheme is inferred from the inbound request URI. Prior versions consulted a client-controlled x-canister-upstream-scheme header for this; that was a footgun (the sandboxed process picked the proxy’s egress protocol) and is no longer honoured. h2c also requires the experimental-h2c build feature; without it, setting this to "h2c" returns an upstream error. |
[recipe]
Metadata section for recipe files.
| Field | Type | Default | Description |
|---|---|---|---|
| description | string (optional) | — | One-line description of what this recipe is for. |
| match_prefix | string[] | [] | Path prefixes that trigger auto-detection of this recipe. |
| name | string (optional) | — | Human-readable recipe name. Defaults to the filename stem when omitted. |
| version | string (optional) | — | Opaque version string (for humans, not parsed). |
[resources]
| Field | Type | Default | Description |
|---|---|---|---|
| cpu_percent | integer (optional) | — | CPU limit as a percentage (e.g., 50 = 50% of one core). |
| memory_mb | integer (optional) | — | Memory limit in megabytes. |
[syscalls]
Syscall customization.
| Field | Type | Default | Description |
|---|---|---|---|
| allow | string[] | [] | Absolute allow list: the complete set of permitted syscalls. Only valid in default.toml. Mutually exclusive with allow_extra. |
| allow_extra | string[] | [] | Syscalls to add to the allow list (on top of the default baseline). |
| deny | string[] | [] | Absolute deny list: syscalls always blocked. Only valid in default.toml. Mutually exclusive with deny_extra. |
| deny_extra | string[] | [] | Syscalls to add to the deny list (also removed from allow list). |
| notifier | bool (optional) | — | Enable the SECCOMP_RET_USER_NOTIF supervisor for argument-level syscall filtering (connect, clone, socket, execve). |
| seccomp_mode | allow-list \| deny-list (optional) | — | Seccomp enforcement mode. |
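How allow_extra and deny_extra layer on the baseline can be sketched with set arithmetic. This Python model is an interpretation of the field descriptions and of the default.toml comments (recipes opt in to a baseline-denied syscall via allow_extra). Letting deny_extra win when a syscall appears in both lists is an assumption of the sketch.

```python
from typing import Iterable, Set, Tuple


def effective_syscalls(
    baseline_allow: Iterable[str],
    baseline_deny: Iterable[str],
    allow_extra: Iterable[str],
    deny_extra: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Return (allow, deny) after layering recipe extras on the baseline."""
    allow_extra, deny_extra = set(allow_extra), set(deny_extra)
    # deny_extra entries are added to deny and also removed from allow.
    allow = (set(baseline_allow) | allow_extra) - deny_extra
    # allow_extra opts a recipe in to a syscall the baseline denies.
    deny = (set(baseline_deny) - allow_extra) | deny_extra
    return allow, deny
```

With the elixir recipe's `allow_extra = ["ptrace", "memfd_create"]`, for example, this model moves memfd_create out of the baseline deny list and into the allow list.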
Manifest Reference (canister.toml)
This file is auto-generated by
can-docgen. Do not edit manually.
A project manifest declares named sandboxes. Place canister.toml in your project root.
Top-level fields
Top-level project manifest parsed from canister.toml.
| Field | Type | Default | Description |
|---|---|---|---|
| sandbox | object | — | Named sandbox definitions. |
Merge Semantics
This file is auto-generated by
can-docgen. Do not edit manually.
When multiple recipes are composed, each field follows a specific merge strategy.
Composition Order
base.toml (always loaded first)
→ auto-detected recipes (match_prefix against command binary)
→ explicit --recipe args (left to right)
→ manifest overrides (for `can up`)
= final SandboxConfig
Field Merge Strategies
| Field | Type | Strategy | Description |
|---|---|---|---|
| recipe | RecipeMeta | Overlay | Later recipe’s metadata wins if present |
| strict | Option<bool> | OR | Any Some(true) wins; can never be loosened |
| filesystem.allow | Vec<PathBuf> | Union | Deduplicated, preserving first-occurrence order |
| filesystem.allow_write | Vec<PathBuf> | Union | Deduplicated, preserving first-occurrence order |
| filesystem.deny | Vec<PathBuf> | Union | Deduplicated, preserving first-occurrence order |
| filesystem.mask | Vec<PathBuf> | Union | Deduplicated, preserving first-occurrence order |
| host (top-level [[host]] blocks) | Vec<HostBlock> | Union by domain | Same domain → field-merged via HostBlock::merge; distinct domains preserved |
| network.allow_ips | Vec<String> | Union | Deduplicated, preserving first-occurrence order |
| network.egress | Option<EgressMode> | Last-Some-wins | None preserves earlier value; Some(x) overwrites |
| network.ports | Vec<PortMapping> | Union | Deduplicated, preserving first-occurrence order |
| process.max_pids | Option<u32> | Last-Some-wins | None preserves earlier value; Some(x) overwrites |
| process.allow_execve | Vec<PathBuf> | Union | Deduplicated, preserving first-occurrence order |
| process.env_passthrough | Vec<String> | Union | Deduplicated, preserving first-occurrence order |
| resources.memory_mb | Option<u64> | Last-Some-wins | None preserves earlier value; Some(x) overwrites |
| resources.cpu_percent | Option<u32> | Last-Some-wins | None preserves earlier value; Some(x) overwrites |
| syscalls.seccomp_mode | Option<SeccompMode> | Last-Some-wins | None preserves earlier value; Some(x) overwrites |
| syscalls.notifier | Option<bool> | Last-Some-wins | None preserves earlier value; Some(x) overwrites |
| syscalls.allow | Vec<String> | Union | Absolute allow list (baseline only) |
| syscalls.deny | Vec<String> | Union | Absolute deny list (baseline only) |
| syscalls.allow_extra | Vec<String> | Union | Deduplicated, preserving first-occurrence order |
| syscalls.deny_extra | Vec<String> | Union | Deduplicated, preserving first-occurrence order |
Strategy Definitions
- Union: Both base and overlay values are combined into a single list, deduplicated by value, preserving first-occurrence order.
- OR: Any Some(true) wins permanently. Once strict mode is enabled by any recipe in the chain, it cannot be disabled.
- Last-Some-wins: The last recipe that specifies a value (Some(x)) wins. None (field omitted) preserves the earlier value.
- Overlay: The later recipe’s value replaces the earlier one entirely if present.
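The Union and Overlay strategies are small enough to show directly. The Python helpers below are a hypothetical illustration of the definitions above, using `None` for a field that a recipe omits.

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")


def union(base: List[str], overlay: List[str]) -> List[str]:
    """Combine both lists, deduplicated, preserving first-occurrence order."""
    return list(dict.fromkeys([*base, *overlay]))


def overlay_value(base: Optional[T], later: Optional[T]) -> Optional[T]:
    """The later value replaces the earlier one entirely if present."""
    return later if later is not None else base
```

For instance, `union(["/bin", "/tmp"], ["/tmp", "/opt"])` keeps "/tmp" in its original position and appends only "/opt".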
Built-in Recipes
This file is auto-generated by
can-docgen. Do not edit manually.
Canister ships with the following recipe files in the recipes/ directory.
The base.toml and default.toml recipes are embedded in the binary.
Overview
| Recipe | Description | Auto-detected |
|---|---|---|
| base.toml | Essential OS paths for any Linux binary | No |
| cargo.toml | Rust/Cargo toolchain ($HOME/.cargo, $HOME/.rustup) | Yes ($HOME/.cargo, $HOME/.rustup) |
| default.toml | Default baseline — common syscalls for any Linux process. Blocks dangerous kernel operations and namespace escapes. | No |
| elixir.toml | Elixir/Erlang development: mix tasks, iex, Phoenix server | No |
| example.toml | Example recipe showing all available options | No |
| flatpak.toml | Flatpak applications | Yes (/var/lib/flatpak, $HOME/.local/share/flatpak) |
| generic-strict.toml | Strict no-network policy for untrusted binaries (CI/production) | No |
| gnu-store.toml | GNU Guix package manager (/gnu/store) | Yes (/gnu/store) |
| homebrew.toml | Homebrew/Linuxbrew package manager | Yes (/opt/homebrew, /home/linuxbrew/.linuxbrew) |
| neovim.toml | Neovim editor with LSP, tree-sitter, and plugin support | No |
| nix.toml | Nix package manager (/nix/store) | Yes (/nix/store) |
| node-build.toml | Node.js build tasks: npm install, build, test | No |
| opencode.toml | OpenCode AI coding agent with scoped filesystem and restricted network | No |
| python-pip.toml | Install Python packages with pip (network access to PyPI) | No |
| snap.toml | Snap package manager (/snap) | Yes (/snap) |
Recipe Contents
base.toml
# Canister base recipe — essential OS bind mounts.
#
# This recipe provides the minimal set of system paths required for any
# Linux binary to execute inside the sandbox. It replaces the hardcoded
# ESSENTIAL_BIND_MOUNTS list for auditability and customization.
#
# Composition order: base.toml is always loaded first, before default.toml,
# auto-detected recipes, and explicit --recipe arguments.
#
# Override: Place a base.toml in $XDG_CONFIG_HOME/canister/recipes/ or
# ./.canister/ to customize. The embedded copy is used as fallback.
[recipe]
name = "base"
description = "Essential OS paths for any Linux binary"
version = "2"
[filesystem]
# Paths bind-mounted read-only for basic execution. These provide:
# - Shell utilities and system binaries (/bin, /sbin, /usr/bin, /usr/sbin)
# - Shared libraries and the dynamic linker (/lib, /lib64, /usr/lib, /usr/lib64)
# - Locally installed binaries and libraries (/usr/local/bin, /usr/local/lib)
# - Shared data files (/usr/share)
# - Dynamic linker cache and configuration (/etc/ld.so.*)
# - DNS and network resolution (/etc/resolv.conf, /etc/nsswitch.conf, /etc/hosts)
# - TLS certificates (/etc/ssl, /etc/ca-certificates)
# - Timezone data (/etc/localtime)
# - Alternatives system (/etc/alternatives)
# - User/group databases (/etc/passwd, /etc/group) — needed by Go, Python, etc.
# - Temporary file directory (/tmp) — isolated per sandbox via overlay
allow = [
"/bin",
"/sbin",
"/lib",
"/lib64",
"/usr/bin",
"/usr/sbin",
"/usr/lib",
"/usr/lib64",
"/usr/local/bin",
"/usr/local/lib",
"/usr/share",
"/tmp",
"/etc/ld.so.cache",
"/etc/ld.so.conf",
"/etc/ld.so.conf.d",
"/etc/resolv.conf",
"/etc/nsswitch.conf",
"/etc/hosts",
"/etc/ssl",
"/etc/ca-certificates",
"/etc/localtime",
"/etc/alternatives",
"/etc/passwd",
"/etc/group",
]
deny = [
"/etc/shadow",
"/etc/gshadow",
]
cargo.toml
# Canister recipe for Cargo (Rust) toolchain.
#
# Cargo installs binaries to $HOME/.cargo/bin and stores the toolchain
# (rustc, rustup) under $HOME/.rustup. Both are needed for Rust
# compilation and tool execution.
#
# Auto-detection: triggered by match_prefix (uses env var expansion)
[recipe]
name = "cargo"
description = "Rust/Cargo toolchain ($HOME/.cargo, $HOME/.rustup)"
version = "1"
match_prefix = ["$HOME/.cargo", "$HOME/.rustup"]
[filesystem]
allow = ["$HOME/.cargo", "$HOME/.rustup"]
deny = ["$HOME/.cargo/credentials.toml", "$HOME/.cargo/credentials"]
default.toml
# Default baseline — the canonical syscall policy for Canister.
#
# This file defines the base set of allowed and denied syscalls used by
# every sandbox invocation. It is embedded into the binary via
# include_str!() as a fallback, but can be overridden by placing a
# default.toml in the recipe search path:
#
# 1. ./.canister/default.toml (project-local)
# 2. $XDG_CONFIG_HOME/canister/recipes/ (per-user)
# 3. /etc/canister/recipes/ (system-wide)
#
# Regular recipes extend this baseline with allow_extra / deny_extra.
# Only the baseline itself uses the absolute allow / deny fields.
[recipe]
name = "default"
description = "Default baseline — common syscalls for any Linux process. Blocks dangerous kernel operations and namespace escapes."
version = "1"
[syscalls]
# Absolute allow list — syscalls needed by virtually any Linux process:
# libc init, memory allocation, signal handling, file I/O, threading.
#
# Recipes MUST NOT use these fields — they use allow_extra / deny_extra
# to layer on top of this baseline.
allow = [
# Process lifecycle
"fork",
"vfork",
"clone",
"clone3",
"execve",
"kill",
"tkill",
"tgkill",
"exit",
"exit_group",
"wait4",
"waitid",
# Process control (prctl only — ptrace, personality, seccomp per-recipe)
"prctl",
# File I/O
"open",
"openat",
"openat2",
"creat",
"close",
"close_range",
"read",
"write",
"readv",
"writev",
"pread64",
"pwrite64",
"lseek",
"dup",
"dup2",
"dup3",
"fcntl",
"flock",
"fsync",
"fdatasync",
"truncate",
"ftruncate",
"fallocate",
# File metadata
"stat",
"fstat",
"lstat",
"newfstatat",
"statx",
"access",
"faccessat",
"faccessat2",
"chmod",
"fchmod",
"fchmodat",
"chown",
"fchown",
"lchown",
"fchownat",
# Directory operations
"mkdir",
"mkdirat",
"rmdir",
"rename",
"renameat",
"renameat2",
"link",
"linkat",
"unlink",
"unlinkat",
"symlink",
"symlinkat",
"readlink",
"readlinkat",
"getdents",
"getdents64",
# Memory
"mmap",
"mprotect",
"munmap",
"mremap",
"madvise",
"msync",
"brk",
"mlock",
"mlock2",
"munlock",
"mlockall",
"munlockall",
# Network
"socket",
"connect",
"accept",
"accept4",
"bind",
"listen",
"sendto",
"recvfrom",
"sendmsg",
"sendmmsg",
"recvmsg",
"shutdown",
"getsockopt",
"setsockopt",
"getsockname",
"getpeername",
"socketpair",
# Signals
"rt_sigaction",
"rt_sigprocmask",
"rt_sigreturn",
"rt_sigsuspend",
"sigaltstack",
# Time
"nanosleep",
"clock_nanosleep",
"clock_gettime",
"clock_getres",
"gettimeofday",
# Polling / async I/O
"poll",
"ppoll",
"select",
"pselect6",
"epoll_create",
"epoll_create1",
"epoll_ctl",
"epoll_wait",
"epoll_pwait",
"epoll_pwait2",
"eventfd",
"eventfd2",
"timerfd_create",
"timerfd_settime",
"timerfd_gettime",
# File monitoring
"inotify_init",
"inotify_init1",
"inotify_add_watch",
"inotify_rm_watch",
# IPC
"pipe",
"pipe2",
"shmget",
"shmat",
"shmctl",
"shmdt",
"semget",
"semop",
"semctl",
"msgget",
"msgsnd",
"msgrcv",
"msgctl",
# Process info
"getpid",
"getppid",
"getuid",
"getgid",
"geteuid",
"getegid",
"gettid",
"getpgid",
"getpgrp",
"setpgid",
"setsid",
"getgroups",
"setgroups",
"setuid",
"setgid",
"setreuid",
"setregid",
"setresuid",
"setresgid",
# I/O control + legacy AIO
"ioctl",
"io_setup",
"io_submit",
"io_getevents",
"io_destroy",
# Misc / threading
"futex",
"set_tid_address",
"set_robust_list",
"get_robust_list",
"sched_yield",
"sched_getaffinity",
"sched_setaffinity",
"sched_setscheduler",
"sched_getscheduler",
"rseq",
"getcwd",
"chdir",
"fchdir",
"umask",
"uname",
"sysinfo",
"getrusage",
"getrandom",
"prlimit64",
"pidfd_open",
"copy_file_range",
"sendfile",
"splice",
"tee",
# Arch-specific
"arch_prctl",
]
# Absolute deny list — dangerous kernel operations that a sandboxed
# process should never need. Always denied regardless of mode.
deny = [
"reboot",
"kexec_load",
"init_module",
"finit_module",
"delete_module",
"swapon",
"swapoff",
"acct",
"mount",
"umount2",
"pivot_root",
"chroot",
"syslog",
"settimeofday",
# Namespace escapes — a sandboxed process must never create new
# namespaces or join existing ones.
"unshare",
"setns",
# In-memory code execution — memfd_create + execveat enables fileless
# execution, a common technique for running malicious payloads without
# touching disk. Recipes that legitimately need these (e.g., BEAM VM)
# can opt in via allow_extra.
"memfd_create",
"execveat",
]
elixir.toml
# Canister recipe for Elixir/Erlang workloads.
#
# The BEAM VM needs ptrace for :observer, :dbg, and erlang:trace/3.
# This recipe adds BEAM-specific syscalls, Hex.pm network access, and
# BEAM environment variables. System paths are provided by base.toml.
#
# Usage:
# can run -r elixir -- mix test
# can run -r elixir -r nix -- iex -S mix
[recipe]
name = "elixir-dev"
description = "Elixir/Erlang development: mix tasks, iex, Phoenix server"
version = "2"
# Strict mode: abort if any isolation layer cannot be set up.
# Recommended for CI. Uncomment to enable.
# strict = true
[syscalls]
# BEAM needs ptrace for tracing/debugging tools (:observer, :dbg).
# BEAM uses memfd_create for JIT code loading (moved to deny list in default baseline).
allow_extra = ["ptrace", "memfd_create"]
[filesystem]
deny = ["/etc/shadow", "/root"]
[network]
egress = "proxy-only"
# Allow hex.pm for dependency fetching and common Elixir registries.
[[host]]
domain = "hex.pm"
[[host]]
domain = "repo.hex.pm"
[[host]]
domain = "builds.hex.pm"
[[host]]
domain = "github.com"
[process]
# BEAM spawns many lightweight processes via OS threads; the default
# scheduler count equals the CPU core count. 256 is generous for most
# mix tasks and development servers.
max_pids = 256
# Environment variables the BEAM commonly needs.
env_passthrough = [
"PATH",
"HOME",
"LANG",
"TERM",
"MIX_ENV",
"MIX_HOME",
"HEX_HOME",
"ERL_AFLAGS",
"ELIXIR_ERL_OPTIONS",
"RELEASE_COOKIE",
"RELEASE_NODE",
"RELEASE_DISTRIBUTION",
"SECRET_KEY_BASE",
"DATABASE_URL",
"PHX_HOST",
"PHX_SERVER",
"PORT",
]
example.toml
# Example Canister recipe — a complete sandbox policy for Python scripts.
#
# System paths (/usr/lib, /usr/bin, /lib, /tmp, etc.) are provided by
# base.toml — recipes only need to add application-specific paths.
#
# Usage: can run -r example -- python3 script.py
[recipe]
name = "example"
description = "Example recipe showing all available options"
version = "2"
# Strict mode: abort if any isolation layer cannot be set up.
# Recommended for CI. Uncomment to enable.
# strict = true
[filesystem]
# Paths the sandboxed process can read (mounted read-only).
# System paths are already provided by base.toml — only add app-specific paths here.
allow = []
# Paths the sandboxed process can write to (changes persist on host).
# The working directory ($PWD) is always writable. Uncomment for additional paths:
# allow_write = ["/var/data/myapp"]
# Paths explicitly denied (checked before allow and allow_write).
deny = ["/etc/shadow", "/root"]
[network]
egress = "proxy-only"
# Allowed IPs or CIDRs for direct connections. Empty = no IP-literal
# egress; everything must go through the DNS-resolved `[[host]]` list
# below.
allow_ips = []
# Allowed upstream destinations. The proxy refuses any other host.
[[host]]
domain = "pypi.org"
[[host]]
domain = "files.pythonhosted.org"
[process]
# Max child PIDs.
max_pids = 64
# Restrict which executables the sandbox may run (optional, opt-in).
# By default, all executables are allowed — the real security boundaries
# are network, filesystem, and seccomp. Uncomment to lock down:
# allow_execve = ["/usr/bin/python3"]
# Environment variables passed through from host.
env_passthrough = ["PATH", "HOME", "LANG", "TERM"]
# [resources] section — cgroup v2 resource limits (optional).
# Requires cgroups v2 on the host. Uncomment to enable:
# [resources]
# memory_mb = 512
# cpu_percent = 50
# [syscalls] section — customize the default seccomp baseline.
# Uncomment to add or remove specific syscalls:
# [syscalls]
# seccomp_mode = "allow-list"
# allow_extra = ["ptrace"] # add ptrace to the allow list
# deny_extra = ["personality"] # remove personality from allow, add to deny
flatpak.toml
# Canister recipe for Flatpak applications.
#
# Flatpak installs applications under /var/lib/flatpak (system-wide)
# and $HOME/.local/share/flatpak (per-user).
#
# Auto-detection: triggered by match_prefix
[recipe]
name = "flatpak"
description = "Flatpak applications"
version = "1"
match_prefix = ["/var/lib/flatpak", "$HOME/.local/share/flatpak"]
[filesystem]
allow = ["/var/lib/flatpak", "$HOME/.local/share/flatpak"]
generic-strict.toml
# Canister recipe for strict-mode execution of arbitrary binaries.
#
# Enables strict mode: any setup failure is fatal, seccomp uses
# KILL_PROCESS. No network access. Minimal filesystem (base.toml only).
# Intended for CI pipelines and production jobs where security is paramount.
#
# Usage:
# can run -r generic-strict -- ./my-binary --flag
# can run -r generic-strict -- cargo test
strict = true
[recipe]
name = "generic-strict"
description = "Strict no-network policy for untrusted binaries (CI/production)"
version = "2"
[syscalls]
# For compiled binaries that may use ptrace (debuggers), io_uring
# (modern async I/O), personality (multilib), or seccomp (self-sandboxing).
allow_extra = [
"ptrace",
"personality",
"seccomp",
"io_uring_setup",
"io_uring_enter",
"io_uring_register",
]
[filesystem]
deny = ["/etc/shadow", "/root", "/home"]
[network]
egress = "none"
[process]
max_pids = 64
env_passthrough = ["PATH", "LANG", "TERM"]
gnu-store.toml
# Canister recipe for GNU Guix package manager.
#
# Guix uses a content-addressed store at /gnu/store, similar to Nix.
# Binaries reference sibling store entries, so the entire store must
# be mounted.
#
# Auto-detection: triggered by match_prefix
[recipe]
name = "gnu-store"
description = "GNU Guix package manager (/gnu/store)"
version = "1"
match_prefix = ["/gnu/store"]
[filesystem]
allow = ["/gnu/store"]
homebrew.toml
# Canister recipe for Homebrew (Linuxbrew) package manager.
#
# On Linux, Homebrew installs to /home/linuxbrew/.linuxbrew or
# /opt/homebrew (rare on Linux, common path on macOS). Binaries
# reference the Cellar and shared libraries within the prefix.
#
# Auto-detection: triggered by match_prefix
[recipe]
name = "homebrew"
description = "Homebrew/Linuxbrew package manager"
version = "1"
match_prefix = ["/opt/homebrew", "/home/linuxbrew/.linuxbrew"]
[filesystem]
allow = ["/opt/homebrew", "/home/linuxbrew/.linuxbrew"]
neovim.toml
# Canister recipe for Neovim with LSP support.
#
# Mounts the user's full Neovim configuration (config, data, state, cache)
# so plugin managers (lazy.nvim), LSP servers (via Mason), tree-sitter
# parsers, and other tooling work out of the box.
#
# System paths (/usr/lib, /usr/bin, /lib, /tmp, etc.) are provided by
# base.toml — this recipe only adds Neovim-specific paths.
#
# Usage:
# can run -r neovim -- nvim
# can run -r neovim -r elixir -r nix -- nvim
[recipe]
name = "neovim"
description = "Neovim editor with LSP, tree-sitter, and plugin support"
version = "2"
[filesystem]
allow = [
# Neovim XDG directories
"$HOME/.config/nvim",
"$HOME/.local/share/nvim",
"$HOME/.local/state/nvim",
"$HOME/.cache/nvim",
]
deny = ["/etc/shadow", "/root"]
[network]
egress = "proxy-only"
# LSP servers, Mason, and plugin managers may need network access
# for installation and updates. Lock down to known hosts.
[[host]]
domain = "github.com"
[[host]]
domain = "objects.githubusercontent.com"
[[host]]
domain = "raw.githubusercontent.com"
[[host]]
domain = "api.github.com"
[[host]]
domain = "registry.npmjs.org"
[[host]]
domain = "pypi.org"
[[host]]
domain = "files.pythonhosted.org"
[[host]]
domain = "luarocks.org"
[process]
max_pids = 256
allow_execve = []
env_passthrough = [
"PATH",
"HOME",
"LANG",
"TERM",
"COLORTERM",
"TERMINFO",
"USER",
"SHELL",
"EDITOR",
"VISUAL",
# XDG directories (so nvim finds its config)
"XDG_CONFIG_HOME",
"XDG_DATA_HOME",
"XDG_STATE_HOME",
"XDG_CACHE_HOME",
"XDG_RUNTIME_DIR",
# Nix
"NIX_PATH",
"NIX_PROFILES",
# LSP / language tooling
"CARGO_HOME",
"RUSTUP_HOME",
"GOPATH",
"GOROOT",
"NODE_PATH",
"npm_config_prefix",
]
[syscalls]
# Neovim + LSP servers need ptrace (for debugging) and
# memfd_create (used by various runtimes)
allow_extra = ["ptrace", "memfd_create"]
nix.toml
# Canister recipe for Nix package manager.
#
# Nix stores all packages in /nix/store with content-addressed paths.
# Binaries freely reference sibling store entries, so the entire store
# must be mounted. This recipe is auto-detected when the resolved
# command binary lives under /nix/store.
#
# Auto-detection: triggered by match_prefix
[recipe]
name = "nix"
description = "Nix package manager (/nix/store)"
version = "1"
match_prefix = ["/nix/store"]
[filesystem]
# Mount the entire Nix store — binaries reference sibling entries
# via rpaths and wrapper scripts.
allow = ["/nix/store"]
node-build.toml
# Canister recipe for Node.js build tasks (npm/yarn/pnpm).
#
# Allows network access to the npm registry and common CDNs.
# System paths are provided by base.toml — this recipe only adds
# Node.js-specific network and environment configuration.
#
# Usage:
# can run -r node-build -- npm install
# can run -r node-build -- npm run build
[recipe]
name = "node-build"
description = "Node.js build tasks: npm install, build, test"
version = "2"
[filesystem]
deny = ["/etc/shadow", "/root"]
[network]
egress = "proxy-only"
[[host]]
domain = "registry.npmjs.org"
[[host]]
domain = "registry.yarnpkg.com"
[[host]]
domain = "registry.npmmirror.com"
[process]
max_pids = 128
env_passthrough = [
"PATH",
"HOME",
"LANG",
"TERM",
"NODE_ENV",
"NPM_CONFIG_REGISTRY",
"NPM_TOKEN",
]
opencode.toml
# Canister recipe for OpenCode — AI coding agent.
#
# OpenCode is a Go binary that talks to LLM providers (GitHub Copilot,
# Anthropic, OpenAI, etc.) and runs developer tools (git, cargo, npm,
# ripgrep, etc.) to assist with coding tasks.
#
# This recipe scopes filesystem access to the project working directory,
# OpenCode's own state/config dirs, and common tool locations. Network
# is restricted to known LLM API domains.
#
# Usage:
# can run --recipe recipes/opencode.toml -- opencode
#
# Note: Run from the project directory you want OpenCode to work on.
# The $PWD will be bind-mounted into the sandbox.
[recipe]
name = "opencode"
description = "OpenCode AI coding agent with scoped filesystem and restricted network"
version = "2"
[filesystem]
# OpenCode needs access to:
# - The project directory (implicitly mounted writable as the working dir)
# - Its own config and state dirs
# System paths (/usr/lib, /usr/bin, /lib, /tmp, etc.) are provided by base.toml.
# Language toolchains ($HOME/.cargo, $HOME/.rustup, /nix/store, etc.) should
# be added via recipe composition in canister.toml, NOT baked in here.
#
# SECURITY: We intentionally do NOT mount:
# - $HOME/.ssh — SSH private keys. Use SSH_AUTH_SOCK (agent) instead.
# - $HOME/.gitconfig — may contain credential helpers or tokens. Git identity
# is passed via GIT_AUTHOR_NAME/EMAIL env vars.
# - $HOME/.cargo — contains credentials.toml (crates.io tokens). Mount
# via the cargo recipe when needed.
# - $HOME/.gnupg — GPG private keys.
# - $HOME/.aws — AWS credentials.
# - $HOME/.kube — Kubernetes config with tokens.
allow = [
"$HOME/.opencode",
"$HOME/.config/opencode",
]
# OpenCode writes to its state dir (SQLite DB, logs, tool output).
allow_write = [
"$HOME/.local/share/opencode",
]
deny = [
"/etc/shadow",
"/root",
"$HOME/.ssh",
"$HOME/.gnupg",
"$HOME/.aws",
"$HOME/.kube",
"$HOME/.docker",
"$HOME/.npmrc",
"$HOME/.pypirc",
"$HOME/.netrc",
"$HOME/.config/gh",
]
[network]
egress = "proxy-only"
# GitHub Copilot auth and API
# - github.com: OAuth device flow
# - api.github.com: token exchange and user info
# - api.githubcopilot.com: chat completions endpoint
# - copilot-proxy.githubusercontent.com: Copilot proxy
#
# Anthropic API (Claude models)
# - api.anthropic.com
#
# OpenAI API
# - api.openai.com
#
# OpenCode's own services (Zen, auth, sharing)
# - opencode.ai
#
# WebFetch tool — OpenCode fetches arbitrary URLs for research.
# If you need unrestricted web access, set egress = "direct".
[[host]]
domain = "github.com"
[[host]]
domain = "api.github.com"
[[host]]
domain = "api.githubcopilot.com"
[[host]]
domain = "copilot-proxy.githubusercontent.com"
[[host]]
domain = "api.anthropic.com"
[[host]]
domain = "api.openai.com"
[[host]]
domain = "opencode.ai"
[process]
# OpenCode spawns subprocesses for tools (git, cargo, rg, etc.)
max_pids = 256
# Environment variables OpenCode and its tools need.
env_passthrough = [
"PATH",
"HOME",
"LANG",
"TERM",
"COLORTERM",
"EDITOR",
"SHELL",
"USER",
"XDG_CONFIG_HOME",
"XDG_DATA_HOME",
"XDG_STATE_HOME",
"XDG_CACHE_HOME",
"GIT_AUTHOR_NAME",
"GIT_AUTHOR_EMAIL",
"GIT_COMMITTER_NAME",
"GIT_COMMITTER_EMAIL",
"HTTPS_PROXY",
"HTTP_PROXY",
"NO_PROXY",
"SSH_AUTH_SOCK",
"CARGO_HOME",
"RUSTUP_HOME",
"NODE_PATH",
]
python-pip.toml
# Canister recipe for installing Python packages with pip.
#
# Allows network access to PyPI. System paths are provided by base.toml.
#
# Usage:
# can run -r python-pip -- pip install requests
# can run -r python-pip -- python3 -m pip install -r requirements.txt
[recipe]
name = "python-pip"
description = "Install Python packages with pip (network access to PyPI)"
version = "2"
[filesystem]
deny = ["/etc/shadow", "/root"]
[network]
egress = "proxy-only"
[[host]]
domain = "pypi.org"
[[host]]
domain = "files.pythonhosted.org"
[process]
max_pids = 32
env_passthrough = [
"PATH",
"HOME",
"LANG",
"TERM",
"PIP_INDEX_URL",
"PIP_TRUSTED_HOST",
"VIRTUAL_ENV",
]
snap.toml
# Canister recipe for Snap packages.
#
# Snap installs applications under /snap with the runtime under
# /snap/core*. Binaries are launched via /snap/bin wrappers.
#
# Auto-detection: triggered by match_prefix
[recipe]
name = "snap"
description = "Snap package manager (/snap)"
version = "1"
match_prefix = ["/snap"]
[filesystem]
allow = ["/snap"]
Architecture
This document describes the internal design of Canister, the execution flow of a sandboxed process, and the security properties of each isolation layer.
Table of Contents
- Design Principles
- Crate Structure
- Execution Flow
- Isolation Layers
- Parent-Child Protocol
- Mandatory Access Control (MAC)
- Known Limitations
Design Principles
- Unprivileged by default. No root, no suid, no capabilities. Everything runs as the calling user using unprivileged user namespaces.
- Defense in depth. Multiple independent isolation mechanisms. Bypassing one layer does not compromise the others.
- Fail closed. When a feature cannot be set up (e.g., a MAC system blocks mounts), Canister aborts. All setup failures are fatal in both normal and strict mode: the sandbox runs at full strength or not at all.
- Single binary. No runtime dependencies beyond the Linux kernel (and optionally pasta for filtered networking). No dynamic linking to external libraries.
- One-shot execution. Fork, isolate, exec, wait, exit. No daemon, no long-running supervisor process. The sandbox lifetime equals the command lifetime.
Crate Structure
canister/
├── can-cli CLI binary. Argument parsing (clap), recipe
│ resolution (name-based lookup, auto-detection
│ via match_prefix), composition chain assembly,
│ `can up` (manifest-driven sandboxes from
│ canister.toml), `can recipe show` (emit resolved
│ policy as TOML), can init / can update lifecycle
│ commands.
│
├── can-sandbox Core runtime. Orchestrates the fork/unshare/exec
│ sequence. Contains the namespace, overlay, and
│ seccomp modules.
│
├── can-policy Policy engine. TOML config parsing, RecipeFile
│ merge logic, environment variable expansion
│ ($HOME, $USER, etc.), access control enforcement
│ (path, domain, IP/CIDR), seccomp profile
│ definitions. Also contains the project manifest
│ module (manifest.rs) for canister.toml parsing,
│ validation, and upward directory discovery.
│ No Linux-specific code.
│
├── can-net Network isolation. Network namespace setup,
│ loopback interface, pasta integration,
│ DNS proxy with domain filtering.
│
└── can-log Logging setup. TTY detection, human vs JSON
output selection, monitor-mode event types
and summary output.
Dependencies flow downward: can-cli -> can-sandbox -> can-policy,
can-net. can-policy and can-log have no internal dependencies.
Outbound Defense Model (Filtered + Proxy + DLP)
When proxy is enabled with enforcement, outbound networking follows a three-layer model:
- Kernel first-line (seccomp USER_NOTIF): sandboxed processes may only connect to local proxy loopback endpoint and DNS server; all other direct outbound INET/INET6 traffic is denied.
- Proxy second-line (user space): proxy validates destination against allow policy and forwards via L7 HTTP interception path or L4 CONNECT passthrough path.
- DLP third-line (content scanning): when [network.dlp] is enabled (implicit under --strict + proxy-only), the L7 path scans request headers, URI, and body for credential patterns (GitHub PATs, npm tokens, AWS keys, SSH keys, etc.) and enforces per-detector domain scoping. A GitHub PAT bound for registry.npmjs.org is blocked even though registry.npmjs.org has a [[host]] block. Bodies are decompressed (gzip/deflate/brotli) and decoded (base64/hex/percent, up to 32 layers) before pattern matching. See DLP.md for the threat model, detector list, and canary-token / session-entropy-budget mechanisms.
This prevents bypass by unsetting proxy environment variables, and prevents exfiltration of credentials that the sandbox legitimately needs read access to.
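As a rough illustration of the third layer, the sketch below peels encoding layers off a request body and applies a single detector with domain scoping. The detector table, the ghp_ pattern, the scope set, and all function names are assumptions made for this example; the real detectors are described in DLP.md. Decompression is left out to keep the sketch short.

```python
import base64
import binascii
import re
from urllib.parse import unquote_to_bytes

# Hypothetical detector table: pattern plus the domains a match may be sent to.
DETECTORS = {
    "github_pat": (re.compile(rb"ghp_[A-Za-z0-9]{36}"), {"github.com", "api.github.com"}),
}
MAX_LAYERS = 32  # decode depth stated in the text above


def _strip_one_layer(data: bytes):
    """Return data with one encoding layer removed, or None if none applies."""
    if b"%" in data:
        decoded = unquote_to_bytes(data)
        if decoded != data:
            return decoded
    stripped = data.strip()
    if re.fullmatch(rb"(?:[0-9a-fA-F]{2})+", stripped):
        return binascii.unhexlify(stripped)
    try:
        decoded = base64.b64decode(stripped, validate=True)
    except (binascii.Error, ValueError):
        return None
    return decoded or None


def scan(body: bytes, destination: str) -> list:
    """Return the detectors that fire for a body bound for destination."""
    hits = set()
    current = body
    for _ in range(MAX_LAYERS + 1):  # the raw body plus up to MAX_LAYERS decodes
        for name, (pattern, scope) in DETECTORS.items():
            if pattern.search(current) and destination not in scope:
                hits.add(name)
        current = _strip_one_layer(current)
        if current is None:
            break
    return sorted(hits)
```

Calling scan() with a doubly base64-encoded token and the destination registry.npmjs.org reports the github_pat detector, while the same body sent to api.github.com passes, which mirrors the scoping rule described above.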
Execution Flow
Canister supports two entry points:
- can run -r ... -- command: ad-hoc sandboxing with explicit recipe flags
- can up [name]: manifest-driven sandboxing from canister.toml
Both converge on the same fork/unshare/exec pipeline. The only difference is how the recipe chain is assembled in step 1.
Manifest Discovery (can up)
When can up is invoked, the CLI discovers canister.toml by walking up
from the current directory (like .gitignore). It parses the manifest,
resolves the named sandbox (or the first defined sandbox alphabetically),
and assembles the recipe chain from the manifest’s recipes = [...] list
plus any [sandbox.<name>.filesystem] / [sandbox.<name>.network] / etc.
overrides.
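The upward walk can be sketched in a few lines. This is a minimal illustration of the discovery step only; find_manifest and demo are hypothetical names, and the real CLI also parses and validates the file it finds.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_manifest(start: Path, name: str = "canister.toml") -> Optional[Path]:
    """Walk from start up to the filesystem root; return the first manifest found."""
    current = start.resolve()
    for directory in [current, *current.parents]:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


def demo() -> bool:
    """Build a throwaway project tree and look the manifest up from a nested directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        (root / "canister.toml").write_text("# hypothetical manifest\n")
        nested = root / "src" / "deep"
        nested.mkdir(parents=True)
        return find_manifest(nested) == root / "canister.toml"
```

The nearest manifest wins, so a canister.toml in a subproject shadows one further up the tree.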
Composition order for can up:
base.toml
→ auto-detected recipes (match_prefix against command binary)
→ recipes listed in manifest (left to right)
→ manifest overrides ([sandbox.<name>.filesystem], etc.)
= final SandboxConfig
This replaces the explicit --recipe flags from can run with the
manifest’s declarative recipe list. The resolved SandboxConfig is
identical in structure and is passed to the same sandbox runtime.
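The composition order can be modelled as a left-to-right fold over layers. The merge rules used here (tables merge recursively, lists are unioned in order, scalars are overridden by the later layer) are assumptions for illustration; the Reference section documents the actual merge semantics. The function names are hypothetical.

```python
def merge(base: dict, layer: dict) -> dict:
    """Merge layer on top of base under the assumed rules described above."""
    result = dict(base)
    for key, value in layer.items():
        existing = result.get(key)
        if isinstance(existing, dict) and isinstance(value, dict):
            result[key] = merge(existing, value)
        elif isinstance(existing, list) and isinstance(value, list):
            # Keep earlier entries first and skip duplicates.
            result[key] = existing + [item for item in value if item not in existing]
        else:
            result[key] = value
    return result


def resolve_chain(base: dict, auto_detected: list, manifest_recipes: list, overrides: dict) -> dict:
    """Apply the order: base, auto-detected recipes, manifest recipes, manifest overrides."""
    config = {}
    for layer in [base, *auto_detected, *manifest_recipes, overrides]:
        config = merge(config, layer)
    return config
```

Because every layer goes through the same merge function, the result depends only on the order of the chain, which is what makes the final SandboxConfig predictable.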
can run Flow
The complete lifecycle of can run -r nix -r elixir -- mix test:
┌─────────────────────────────────────────────────────────────────────┐
│ 1. CLI SETUP │
│ a. Parse args, resolve + canonicalize command path │
│ b. Load base.toml (embedded, overridable) │
│ c. Auto-detect recipes: match resolved binary path against │
│ match_prefix in all discovered recipe files │
│ d. Load explicit --recipe args (name-based lookup or file path) │
│ e. Merge recipe chain: base → auto-detected → explicit (L-to-R) │
│ f. Expand env vars ($HOME, $USER, etc.) in merged config │
│ g. Validate allow_execve, determine network mode │
└──────────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────────────────┐
│ 1b. NOTIFIER SETUP (before fork) │
│ a. Resolve notifier_enabled (config override / monitor mode / │
│ kernel version auto-detect) │
│ b. If enabled: create anonymous Unix socket pair for fd passing │
│ (parent_sock, child_sock) │
│ c. Pre-resolve allowed domains to IPs (already done in network │
│ setup — IPs stored for building notifier policy) │
└──────────────────────────┬──────────────────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────────────────┐
│ 2. FORK │
│ Create three pipes (child_ready, maps_done, network_done). │
│ Capture UID/GID. Call fork(). │
└──────────┬──────────────────────────────────────┬───────────────────┘
│ │
┌──────▼──────┐ ┌──────▼──────┐
│ PARENT │ │ CHILD │
│ │ │ │
│ │ │ 3. UNSHARE │
│ │ │ Phase 1: │
│ │ │ USER+PID │
│ │ │ [+NET] │
│ │ "ready" ◄────────── │ │
│ │ │ │
│ 4. UID/GID │ │ (blocks │
│ MAPPING │ │ maps_done)│
│ Write │ │ │
│ /proc/ │ │ │
│ <pid>/ │ │ │
│ uid_map │ │ │
│ gid_map │ │ │
│ │ ──► "maps_done" │ │
│ │ │ (blocks │
│ 5. NETWORK │ │ net_done) │
│ Start │ │ │
│ pasta │ │ │
│ --userns │ │ │
│ --netns │ │ │
│ │ ──► "net_done" │ │
│ │ │ │
│ │ │ 5b.UNSHARE │
│ │ │ Phase 2: │
│ │ │ NEWNS │
│ │ │ │
│ │ │ 6. PID NS │
│ │ │ First │
│ │ │ fork() │
│ │ │ (creates │
│ │ │ new PID │
│ │ │ ns) │
│ │ │ │
│ │ │ Interme- │
│ │ │ diate: │
│ │ │ waitpid │
│ │ │ + exit │
│ │ │ │
│ │ │ 6a. SECOND │
│ │ │ FORK │
│ │ │ (when │
│ │ │ notifier │
│ │ │ enabled) │
│ │ │ │
│ │ │ ┌─ PID 1: │
│ │ │ │ SUPER- │
│ │ │ │ VISOR │
│ │ │ │ unshare│
│ │ │ │ NEWNS │
│ │ │ │ mount │
│ │ │ │ /proc │
│ │ │ │ recv │
│ │ │ │ notif │
│ │ │ │ fd via │
│ │ │ │ SCM_ │
│ │ │ │ RIGHTS │
│ │ │ │ poll + │
│ │ │ │ waitpid│
│ │ │ │ loop │
│ │ │ │ │
│ │ │ └─ PID 2: │
│ │ │ WORKER │
│ │ │ setsid │
│ │ │ │
│ │ │ 6b. CGROUP │
│ │ │ Create │
│ │ │ child │
│ │ │ cgroup, │
│ │ │ write │
│ │ │ memory │
│ │ │ .max + │
│ │ │ cpu.max │
│ │ │ (before │
│ │ │ pivot_ │
│ │ │ root) │
│ │ │ │
│ │ │ 7. OVERLAY │
│ │ │ tmpfs │
│ │ │ root, │
│ │ │ bind │
│ │ │ mounts │
│ │ │ (from │
│ │ │ merged │
│ │ │ config), │
│ │ │ CWD bind │
│ │ │ mount │
│ │ │ (RW), │
│ │ │ pivot_ │
│ │ │ root, │
│ │ │ chdir() │
│ │ │ │
│ │ │ 7b. PROC │
│ │ │ HARDEN │
│ │ │ Mask │
│ │ │ /proc/* │
│ │ │ │
│ │ │ 8. NET │
│ │ │ SETUP │
│ │ │ loopback │
│ │ │ resolv. │
│ │ │ conf │
│ │ │ │
│ │ │ 9. PROCESS │
│ │ │ RLIMIT │
│ │ │ NPROC │
│ │ │ │
│ │ │ 10. NOTIF │
│ │ │ FILTER │
│ │ │ Worker │
│ │ │ installs │
│ │ │ USER_ │
│ │ │ NOTIF │
│ │ │ BPF, │
│ │ │ sends fd │
│ │ │ to PID 1 │
│ │ │ (super- │
│ │ │ visor) │
│ │ │ via SCM_ │
│ │ │ RIGHTS │
│ │ │ │
│ │ │ 11. SECCOMP │
│ │ │ Load │
│ │ │ main BPF │
│ │ │ filter │
│ │ │ │
│ │ │ 12. ENV │
│ │ │ Filter │
│ │ │ env vars │
│ │ │ │
│ │ │ 13. EXEC │
│ │ │ execve() │
│ │ │ │
│ 14. WAIT │ │ (running) │
│ waitpid │ │ │
│ │ │ (exits) │
│ │ └─────────────┘
│ 15. CLEANUP│
│ Kill │
│ pasta │
│ Stop DNS │
│ proxy │
│ Return │
│ exit code│
└─────────────┘
Critical ordering constraints:
- Recipe composition (load, merge, env expansion) happens entirely in the CLI layer before forking. The child receives an already-resolved SandboxConfig.
- unshare() is split into two phases. Phase 1: unshare(CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET) creates user, PID, and network namespaces. Phase 2: unshare(CLONE_NEWNS) creates the mount namespace. The split is necessary so pasta can access /proc/<child_pid>/ns/net before the child’s mount namespace changes.
- UID/GID maps must be written from the parent process. The child cannot write its own maps after unshare(CLONE_NEWUSER).
- pasta must be started after the child creates CLONE_NEWNET and after UID/GID maps are written, but before the child calls unshare(CLONE_NEWNS). pasta is invoked with --userns /proc/<child_pid>/ns/user --netns /proc/<child_pid>/ns/net --runas <uid>.
- The inner fork for PID namespace must happen before filesystem setup so the /proc mount reflects the new PID namespace.
- setsid() must be called after the inner PID namespace fork. PID 1 inherits an invisible session/process-group from the parent namespace; without setsid(), bash’s job control initialization fails (getpgrp returns the parent-namespace group).
- /proc hardening must happen after overlay + /proc mount but before seccomp.
- RLIMIT_NPROC must be set before seccomp (which blocks prctl).
- Cgroups v2 setup must happen before pivot_root, because the cgroup filesystem (/sys/fs/cgroup) is on the host and becomes inaccessible after the root is swapped. This is step 6b in the execution flow.
- The CWD bind-mount must happen during overlay setup (step 7), before pivot_root. The host’s current working directory is captured before unshare() and bind-mounted writable into the new root. After pivot_root, the child calls chdir() to the mounted CWD path.
- The notifier filter must be installed before the main seccomp filter. The seccomp() syscall with SECCOMP_FILTER_FLAG_NEW_LISTENER returns the notification fd. The worker (PID 2) sends this fd to PID 1 (supervisor) via SCM_RIGHTS, then installs the main filter via prctl(PR_SET_SECCOMP).
- PID 1 (the supervisor) must receive the notifier fd and begin its poll loop before the worker calls execve(), so the supervisor is ready to handle notifications from the target program.
- Seccomp must be loaded after all setup is complete, right before exec.
- Environment filtering happens at exec time: execve() receives the filtered environment directly.
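One way to keep these constraints honest is to encode them as (earlier, later) pairs and check a proposed step sequence against them. The step labels below are hypothetical shorthand for the stages in the flow diagram, not identifiers from the Canister source.

```python
# (earlier step, later step) pairs derived from the constraints listed above.
CONSTRAINTS = [
    ("unshare_phase1", "write_id_maps"),
    ("unshare_phase1", "start_pasta"),
    ("write_id_maps", "start_pasta"),
    ("start_pasta", "unshare_phase2"),
    ("pid_ns_fork", "overlay"),
    ("pid_ns_fork", "setsid"),
    ("overlay", "proc_harden"),
    ("proc_harden", "main_seccomp"),
    ("rlimit_nproc", "main_seccomp"),
    ("cgroup_setup", "pivot_root"),
    ("cwd_bind_mount", "pivot_root"),
    ("notifier_filter", "main_seccomp"),
    ("main_seccomp", "execve"),
]

# The order used in the execution flow diagram, expressed with the same labels.
FLOW = [
    "unshare_phase1", "write_id_maps", "start_pasta", "unshare_phase2",
    "pid_ns_fork", "setsid", "cgroup_setup", "overlay", "cwd_bind_mount",
    "pivot_root", "proc_harden", "rlimit_nproc", "notifier_filter",
    "main_seccomp", "execve",
]


def violations(sequence: list) -> list:
    """Return every constraint that the given step sequence breaks."""
    position = {step: index for index, step in enumerate(sequence)}
    broken = []
    for earlier, later in CONSTRAINTS:
        if earlier in position and later in position and position[earlier] > position[later]:
            broken.append((earlier, later))
    return broken
```

violations(FLOW) returns an empty list, while moving cgroup_setup after pivot_root reports exactly that pair.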
Isolation Layers
1. User Namespaces
Syscall: unshare(CLONE_NEWUSER)
The child process gets a new user namespace where it is mapped as UID 0 / GID 0. This gives it “root inside the namespace” which is required for mount operations, but grants zero real privileges on the host.
The parent writes the mapping:
/proc/<pid>/setgroups → "deny"
/proc/<pid>/uid_map → "0 <host_uid> 1"
/proc/<pid>/gid_map → "0 <host_gid> 1"
Security property: The child appears to be root but cannot affect any resources outside its namespace. All privilege checks are scoped to the namespace.
2. Mount Namespace + pivot_root
Syscall: unshare(CLONE_NEWNS) + pivot_root()
The child gets its own mount table. The setup sequence:
1. mount("", "/", MS_SLAVE | MS_REC) # break propagation to host
2. mount("tmpfs", new_root) # empty tmpfs as new root
3. mkdir skeleton dirs # /bin, /lib, /usr, /proc, /dev, /tmp, ...
4. bind-mount essentials (read-only) # from base.toml: /bin, /sbin, /usr/bin, ...
5. bind-mount allowed paths (RO) # from merged [filesystem].allow (all recipes)
5b. bind-mount CWD (read-write) # host working directory, always mounted
6. mount /tmp (read-write) # ephemeral writable space
7. mount /proc # needed by many programs
8. set up /dev # null, zero, urandom, tty, fd symlinks
9. pivot_root(new_root, old_root) # swap filesystem root
10. umount(old_root, MNT_DETACH) # detach host filesystem entirely
11. chdir(cwd_path) # restore working directory inside new root
Recipe-based mount resolution:
The paths visible inside the sandbox come from the merged recipe chain
(base.toml → auto-detected → explicit). There is no hardcoded prefix
detection at runtime. Instead:
- base.toml defines essential OS bind mounts (/bin, /sbin, /usr/bin, /usr/sbin, /lib, /lib64, /usr/lib, /etc). It is embedded in the binary via include_str!() and overridable on disk, following the same pattern as default.toml.
- Auto-detected recipes provide package-manager mounts. Each recipe declares match_prefix patterns in its [recipe] metadata. During CLI setup (before fork), the resolved binary path is matched against all discovered recipes. Matching recipes are merged into the chain, bringing their [filesystem].allow paths with them:

| Recipe | match_prefix | Adds to allow |
|---|---|---|
| nix.toml | /nix/store | /nix/store |
| homebrew.toml | /opt/homebrew, /home/linuxbrew/.linuxbrew | /opt/homebrew (or linuxbrew) |
| cargo.toml | $HOME/.cargo, $HOME/.rustup | $HOME/.cargo, $HOME/.rustup |
| snap.toml | /snap | /snap |
| flatpak.toml | /var/lib/flatpak, $HOME/.local/share/flatpak | prefix paths |
| gnu-store.toml | /gnu/store | /gnu/store |

- Explicit recipes (--recipe / -r flags) add whatever [filesystem].allow paths they declare.
- Environment variable expansion ($HOME, $USER, ${XDG_CONFIG_HOME}) is performed during into_sandbox_config(), after merge but before the paths are used by the overlay module.
This design means adding support for a new package manager is “write a
.toml file” rather than “modify Rust code”. The detect_command_prefix()
function was removed entirely.
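The auto-detection step can be approximated as a prefix match between the resolved binary path and each recipe's match_prefix list. The sketch assumes that prefixes are expanded before comparison and that matching is per path component (so /snap does not match /snapshot); both are assumptions, as are the function name and the recipe subset.

```python
from string import Template

# Hypothetical subset of discovered recipes: name -> match_prefix list.
RECIPES = {
    "nix": ["/nix/store"],
    "homebrew": ["/opt/homebrew", "/home/linuxbrew/.linuxbrew"],
    "cargo": ["$HOME/.cargo", "$HOME/.rustup"],
    "snap": ["/snap"],
}


def auto_detect(binary: str, recipes: dict, env: dict) -> list:
    """Return the recipe names whose match_prefix covers the resolved binary path."""
    matched = []
    for name, prefixes in recipes.items():
        for prefix in prefixes:
            # safe_substitute leaves unknown variables untouched instead of raising.
            root = Template(prefix).safe_substitute(env).rstrip("/")
            if binary == root or binary.startswith(root + "/"):
                matched.append(name)
                break
    return matched
```

With this model, supporting another package manager means adding one entry to the table, which is the property the recipe-based design is after.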
Security model: Filesystem visibility does not equal execution permission.
Mounted paths are visible inside the sandbox, but allow_execve and the
USER_NOTIF supervisor’s execve()/execveat() filtering control what can
actually be executed.
Security property: The process cannot see or access any host path that was not explicitly included in the merged recipe chain. The host’s current working directory is always bind-mounted writable so the sandboxed process can read/write files in its working directory. All other writes go to tmpfs and are discarded when the process exits.
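A small model of the resulting visibility rules, assuming deny entries are checked first (as the recipe comments state), the working directory is always writable, and everything unlisted is simply absent. The function names and the four result labels are hypothetical, and the ephemeral /tmp mount is not modelled.

```python
from pathlib import PurePosixPath


def _covers(root: str, path: str) -> bool:
    """True when path equals root or lies beneath it."""
    root_path, target = PurePosixPath(root), PurePosixPath(path)
    return target == root_path or root_path in target.parents


def access_for(path: str, allow: list, allow_write: list, deny: list, cwd: str) -> str:
    """Classify a host path as 'denied', 'read-write', 'read-only', or 'hidden'."""
    if any(_covers(entry, path) for entry in deny):
        return "denied"
    if _covers(cwd, path) or any(_covers(entry, path) for entry in allow_write):
        return "read-write"
    if any(_covers(entry, path) for entry in allow):
        return "read-only"
    return "hidden"
```

For example, with deny = ["/etc/shadow", "/root"] and cwd = "/work/proj", a file under /work/proj is read-write, /root/.bashrc is denied, and $HOME/.ssh is hidden unless a recipe lists it.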
MAC systems: When a Mandatory Access Control system (AppArmor on Ubuntu,
SELinux on Fedora/RHEL) blocks mount operations, filesystem isolation cannot be
established and the sandbox aborts. Run sudo can setup to install the
appropriate security policy (see MAC section).
3. Network Namespace
Syscall: unshare(CLONE_NEWNET) + pasta
Three modes, determined from config:
None mode: The sandbox has an empty network namespace with only loopback. No external connectivity.
Filtered mode: The parent starts pasta which mirrors the host’s
network configuration into the child’s network namespace. pasta copies the
host’s real IP addresses, routes, and gateway into the namespace:
┌──────────────────────────────────┐
│ Host network │
│ │
│ pasta ◄──── namespace fd │
│ │ │
│ │ mirrors host config │
│ │ │
└───────┼──────────────────────────┘
│
┌───────┼──────────────────────────┐
│ ▼ Sandbox network │
│ Host's real IP (mirrored) │
│ gateway: host's default gw │
│ DNS: 169.254.0.1 (link-local) │
│ │
│ ┌─────────────────────────┐ │
│ │ sandboxed process │ │
│ └─────────────────────────┘ │
└──────────────────────────────────┘
Allowed domains are pre-resolved to IP addresses at startup (from the
parent, which still has host DNS access). These resolved IPs are passed to
the USER_NOTIF supervisor, which intercepts connect() syscalls and validates
the destination IP against the allow list. A DNS proxy runs in the parent
process on an ephemeral port, filtering DNS queries to only resolve
allowed domains. The sandbox’s /etc/resolv.conf is configured to use
pasta’s DNS address (169.254.0.1:53, set via --dns), which routes
queries to the parent’s DNS proxy via --dns-forward. This prevents
DNS-based information exfiltration.
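The IP-level decision made for each intercepted connect() can be sketched as follows. It assumes the policy is "loopback, or a pre-resolved host IP, or inside an allow_ips CIDR"; connect_allowed is a hypothetical name and the addresses in the usage note are examples only.

```python
import ipaddress


def connect_allowed(destination: str, resolved_ips: list, allow_cidrs: list) -> bool:
    """Decide whether a connect() to destination passes the IP-level policy."""
    address = ipaddress.ip_address(destination)
    if address.is_loopback:
        return True
    if any(address == ipaddress.ip_address(ip) for ip in resolved_ips):
        return True
    # Membership tests across IPv4/IPv6 evaluate to False rather than raising.
    return any(address in ipaddress.ip_network(cidr, strict=False) for cidr in allow_cidrs)
```

With resolved_ips = ["151.101.0.223"] and allow_cidrs = ["10.0.0.0/8"], connections to 151.101.0.223, 10.1.2.3, and 127.0.0.1 pass, while 8.8.8.8 is refused.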
Port forwarding: When -p / --port flags are specified, pasta is
configured with explicit port forwarding rules via -t (TCP) and -u
(UDP) options. Auto-forwarding is disabled (-t none -u none) and only
the specified ports are forwarded.
Full mode: No CLONE_NEWNET. The sandbox shares the host network.
Security property: In None mode, the process has zero network access. In Filtered mode, connectivity is routed through pasta, and the USER_NOTIF supervisor enforces IP-level connect() filtering against the allowed domain/IP list. DNS queries are restricted to allowed domains. In Full mode, there is no network isolation.
4. Seccomp BPF
Syscall: prctl(PR_SET_NO_NEW_PRIVS) + prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER)
A classic BPF program is loaded right before execve(). The filter is
generated at runtime from the default baseline defined in
recipes/default.toml (~187 allowed, ~18 always-denied) plus any
[syscalls] overrides (allow_extra / deny_extra).
When the USER_NOTIF supervisor is enabled, two BPF filters are installed:
- Notifier filter (installed first, via seccomp() with SECCOMP_FILTER_FLAG_NEW_LISTENER): Returns SECCOMP_RET_USER_NOTIF for eight intercepted syscalls (connect, sendto, sendmsg, clone, clone3, socket, execve, execveat). All others return SECCOMP_RET_ALLOW.
- Main filter (installed second, via prctl(PR_SET_SECCOMP)): The standard allow-list or deny-list filter described below.
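The kernel runs every installed filter on each syscall and applies the
highest-precedence action among the results. SECCOMP_RET_USER_NOTIF outranks
SECCOMP_RET_ALLOW, so an intercepted syscall that the main filter allows is
delivered to the supervisor. A syscall that the main filter denies never
reaches the supervisor, because SECCOMP_RET_ERRNO and the kill actions rank
higher. See Seccomp USER_NOTIF Supervisor for details.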
When the notifier is disabled (kernel < 5.9, monitor mode, or notifier = false),
only the main filter is installed.
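How two stacked filters combine can be modelled with the action precedence documented in seccomp(2), listed highest first. The two filter functions are simplified stand-ins for the notifier and main filters, not Canister code, and the main filter is shown in allow-list mode with the normal (non-strict) deny action.

```python
# Action precedence from seccomp(2), highest first.
PRECEDENCE = [
    "KILL_PROCESS", "KILL_THREAD", "TRAP", "ERRNO",
    "USER_NOTIF", "TRACE", "LOG", "ALLOW",
]
NOTIFIED = {"connect", "sendto", "sendmsg", "clone", "clone3", "socket", "execve", "execveat"}


def notifier_filter(syscall: str) -> str:
    """Stand-in for the notifier filter: intercept eight syscalls, allow the rest."""
    return "USER_NOTIF" if syscall in NOTIFIED else "ALLOW"


def main_filter(syscall: str, allowed: set) -> str:
    """Stand-in for the main filter in allow-list mode."""
    return "ALLOW" if syscall in allowed else "ERRNO"


def effective_action(syscall: str, allowed: set) -> str:
    """Combine both filter results the way the kernel does: highest precedence wins."""
    results = [notifier_filter(syscall), main_filter(syscall, allowed)]
    return min(results, key=PRECEDENCE.index)
```

An intercepted syscall therefore reaches the supervisor only when the main filter also allows it; if the main filter returns ERRNO, that result wins.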
The baseline is embedded in the binary via include_str!() so it works
standalone. At runtime, Canister searches for an external default.toml in
./.canister/, $XDG_CONFIG_HOME/canister/recipes/, and
/etc/canister/recipes/. If found, the external file takes precedence over
the embedded copy. This lets users pin, audit, or version-control the
baseline without recompiling.
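The lookup can be pictured as "first external file wins, otherwise the embedded copy". The sketch assumes the directories are tried in the order listed above; load_baseline, demo, and the EMBEDDED placeholder are hypothetical names.

```python
import tempfile
from pathlib import Path

EMBEDDED = "# embedded default.toml (placeholder for the compiled-in copy)\n"


def load_baseline(search_dirs: list, embedded: str) -> str:
    """Return the first external default.toml found, else the embedded text."""
    for directory in search_dirs:
        candidate = Path(directory) / "default.toml"
        if candidate.is_file():
            return candidate.read_text()
    return embedded


def demo() -> tuple:
    """Show both outcomes: a pinned external baseline and the embedded fallback."""
    with tempfile.TemporaryDirectory() as project, tempfile.TemporaryDirectory() as empty:
        override_dir = Path(project) / ".canister"
        override_dir.mkdir()
        (override_dir / "default.toml").write_text("# pinned baseline\n")
        pinned = load_baseline([override_dir, Path(empty)], EMBEDDED)
        fallback = load_baseline([Path(empty)], EMBEDDED)
    return pinned, fallback
```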
Two modes — allow-list (default) and deny-list:
| Mode | Default action | Listed syscalls | Recommended for |
|---|---|---|---|
| Allow-list (default) | DENY | Permitted | Production, CI |
| Deny-list | ALLOW | Blocked | Compatibility, unknown workloads |
Allow-list mode is the default and recommended mode. It inverts the security model: only syscalls explicitly listed in the profile are permitted; everything else is denied. This provides a much smaller attack surface than a deny-list.
BPF program structure (allow-list mode):
Instruction What it does
─────────────────────────────────────────────────
[0] Load seccomp_data.arch
[1] If arch == x86_64: skip to [3]
[2] Return KILL_PROCESS (wrong architecture)
[3] Load seccomp_data.nr (syscall number)
[4] If nr == allowed_0: jump to [ALLOW]
[5] If nr == allowed_1: jump to [ALLOW]
...
[N] If nr == allowed_K: jump to [ALLOW]
[N+1] Return ERRNO(EPERM) (no match → denied)
[N+2] Return ALLOW (match → permitted)
BPF program structure (deny-list mode):
Instruction What it does
─────────────────────────────────────────────────
[0] Load seccomp_data.arch
[1] If arch == x86_64: skip to [3]
[2] Return KILL_PROCESS (wrong architecture)
[3] Load seccomp_data.nr (syscall number)
[4] If nr == denied_0: jump to [DENY]
[5] If nr == denied_1: jump to [DENY]
...
[N] If nr == denied_K: jump to [DENY]
[N+1] Return ALLOW (no match → permitted)
[N+2] Return ERRNO(EPERM) (match → denied)
The mode is selected via [syscalls] seccomp_mode in the config file
(default: "allow-list").
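Both layouts can be reproduced with a tiny generator and interpreter. This is a model of the instruction listing above, not real BPF bytecode: jumps use absolute indices for readability (classic BPF uses relative offsets), the deny action shown is the normal-mode ERRNO(EPERM), and the function names are hypothetical.

```python
AUDIT_ARCH_X86_64 = 0xC000003E  # architecture value reported for native x86_64 calls


def build_filter(mode: str, syscalls: list) -> list:
    """Return the instruction listing for 'allow-list' or 'deny-list' mode."""
    if mode == "allow-list":
        on_match, default = "ALLOW", "ERRNO(EPERM)"
    else:
        on_match, default = "ERRNO(EPERM)", "ALLOW"
    program = [
        ("load", "arch"),
        ("jeq", AUDIT_ARCH_X86_64, 3),  # native architecture: skip the kill
        ("ret", "KILL_PROCESS"),
        ("load", "nr"),
    ]
    match_index = 4 + len(syscalls) + 1  # index of the final "match" return
    for number in syscalls:
        program.append(("jeq", number, match_index))
    program.append(("ret", default))   # no comparison matched
    program.append(("ret", on_match))  # some comparison matched
    return program


def run(program: list, arch: int, nr: int) -> str:
    """Interpret the listing for one syscall and return the resulting action."""
    data = {"arch": arch, "nr": nr}
    accumulator, pc = 0, 0
    while True:
        op = program[pc]
        if op[0] == "load":
            accumulator = data[op[1]]
            pc += 1
        elif op[0] == "jeq":
            pc = op[2] if accumulator == op[1] else pc + 1
        else:
            return op[1]
```

With the x86_64 numbers read = 0, write = 1, and ptrace = 101, an allow-list built from [0, 1] permits write and rejects ptrace, while a deny-list built from [101] does the opposite for ptrace and permits everything else.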
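Architecture validation: The first check rejects any syscall made through a non-native architecture. On x86_64, this blocks the 32-bit i386 compatibility entry points, which use a different syscall table. x32 calls report the same architecture value as native calls and are distinguished only by the __X32_SYSCALL_BIT in the syscall number, so this check alone does not reject them; in allow-list mode they are denied because they match no listed number, as long as the list contains only native numbers.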
Deny action: In normal mode, Canister uses SECCOMP_RET_ERRNO | EPERM
which allows the sandboxed process to handle denied syscalls gracefully. In
strict mode (--strict), it uses SECCOMP_RET_KILL_PROCESS — the
process is killed immediately on any denied syscall.
Security property: Even if a process escapes all namespace isolation, it cannot invoke unlisted syscalls. The filter is enforced by the kernel and cannot be removed or modified by the filtered process (loading new seccomp filters is blocked by the default baseline’s deny list).
4b. Seccomp USER_NOTIF Supervisor
Syscall: seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER)
Module: notifier.rs
The USER_NOTIF supervisor extends seccomp BPF with argument-level inspection.
Classic BPF can only check the syscall number and architecture — it cannot
dereference pointers or read memory. The supervisor intercepts specific
syscalls via SECCOMP_RET_USER_NOTIF, reads the actual argument data from
/proc/<pid>/mem, and makes a policy decision.
Architecture:
The supervisor runs as PID 1 inside the sandbox’s PID namespace. This is necessary because:
- After unshare(CLONE_NEWPID), clone(CLONE_THREAD) fails with EINVAL, so a supervisor thread cannot be spawned.
- The host’s procfs denies /proc/<pid>/mem opens from a child user namespace. PID 1 mounts its own procfs (owned by the sandbox’s user namespace).
- As PID 1, the supervisor is an ancestor of all sandboxed processes, satisfying Yama ptrace_scope=1 without PR_SET_PTRACER.
PID 1 (supervisor) PID 2+ (worker / sandboxed)
────────────────── ──────────────────────────
unshare(CLONE_NEWNS) Sandbox setup (overlay, pivot_root)
mount /proc seccomp() → notifier fd
recv_fd() via SCM_RIGHTS send_fd() via SCM_RIGHTS
│ install main BPF filter
│ connect(AF_INET, ...) execve()
│ ──── SUSPENDED ────────────► │
│ ┌──────┴──────────────┐
│ │ Supervisor (PID 1) │
│ │ 1. NOTIF_RECV │
│ │ 2. open+read │
│ │ /proc/<pid>/mem │
│ │ 3. Check policy │
│ │ 4. NOTIF_ID_VALID │
│ ALLOW / ERRNO(EPERM) │ 5. NOTIF_SEND │
│ ◄──────────────────────────── └──────────────────────┘
│
▼ (continues or gets EPERM)
The supervisor runs inline (single-threaded) using poll() with a 200ms timeout,
interleaved with non-blocking waitpid to detect when the worker exits. After the
worker exits, remaining in-flight notifications are drained before the supervisor
terminates.
Filtered syscalls:
| Syscall | What is inspected | Policy enforcement |
|---|---|---|
| connect() | sockaddr struct (IP + port) | Must match IPs resolved from each [[host]]’s domain, allow_ips CIDRs, or loopback |
| sendto() | dest_addr + msg_controllen | DNS queries on port 53 trigger supervisor-side resolution; connected sockets (NULL addr) allowed |
| sendmsg() | msghdr struct (msg_controllen) | Blocks any sendmsg() with ancillary data (msg_controllen > 0), preventing SCM_RIGHTS fd passing |
| clone() | flags register | Namespace flags (CLONE_NEWNS, CLONE_NEWCGROUP, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWUSER, CLONE_NEWPID, CLONE_NEWNET) denied |
| clone3() | clone_args.flags in userspace memory | Same namespace flag check, struct read via /proc/<pid>/mem |
| socket() | domain + type + protocol registers | SOCK_RAW denied; AF_NETLINK restricted to NETLINK_ROUTE (protocol 0) only |
| execve() | Pathname string in userspace memory | Must match allow_execve paths (empty = allow all) |
| execveat() | Pathname + dirfd | Same as execve(), with dirfd resolution |
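The register-only checks in the table (clone() flags and socket() arguments) need no memory reads, so they reduce to bit tests. The sketch below is an assumption about how such checks could look, not Canister's implementation: `Verdict`, `check_clone`, and `check_socket` are hypothetical names, and the choice to evaluate the AF_NETLINK rule before the SOCK_RAW rule is also an assumption. The numeric constants are the standard Linux values.

```rust
// Flag values from the Linux UAPI headers (<linux/sched.h>, <sys/socket.h>).
const CLONE_NEWNS: u64 = 0x0002_0000;
const CLONE_NEWCGROUP: u64 = 0x0200_0000;
const CLONE_NEWUTS: u64 = 0x0400_0000;
const CLONE_NEWIPC: u64 = 0x0800_0000;
const CLONE_NEWUSER: u64 = 0x1000_0000;
const CLONE_NEWPID: u64 = 0x2000_0000;
const CLONE_NEWNET: u64 = 0x4000_0000;
const NAMESPACE_FLAGS: u64 = CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWUTS
    | CLONE_NEWIPC | CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET;

const AF_NETLINK: u64 = 16;
const SOCK_RAW: u64 = 3;
const SOCK_TYPE_MASK: u64 = 0xF; // drops SOCK_NONBLOCK and SOCK_CLOEXEC
const NETLINK_ROUTE: u64 = 0;

#[derive(Debug, PartialEq)]
enum Verdict { Allow, Deny }

/// Denies clone() when any namespace-creating flag is set.
fn check_clone(flags: u64) -> Verdict {
    if (flags & NAMESPACE_FLAGS) != 0 { Verdict::Deny } else { Verdict::Allow }
}

/// Applies the socket() rules from the table.
/// Assumption: the AF_NETLINK rule is evaluated first, then SOCK_RAW.
fn check_socket(domain: u64, sock_type: u64, protocol: u64) -> Verdict {
    if domain == AF_NETLINK {
        return if protocol == NETLINK_ROUTE { Verdict::Allow } else { Verdict::Deny };
    }
    if (sock_type & SOCK_TYPE_MASK) == SOCK_RAW { Verdict::Deny } else { Verdict::Allow }
}
```

Masking the type argument matters because callers commonly OR in SOCK_CLOEXEC or SOCK_NONBLOCK; comparing the raw register against SOCK_RAW would miss those calls.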
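TOCTOU mitigation: Between reading the worker’s memory and sending the verdict,
a multi-threaded sandbox process could modify the inspected memory. The supervisor
calls ioctl(SECCOMP_IOCTL_NOTIF_ID_VALID) after the policy check. If the
notification ID is no longer valid (for example, the target thread was killed or
its syscall was interrupted by a signal), the verdict is skipped. This confirms
that the data was read from the task that is still blocked in the syscall; it
does not detect a concurrent write to that memory (see Known Limitations).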
Memory access: The supervisor (PID 1) runs in the same user namespace and PID
namespace as all sandboxed processes. It mounts its own procfs (the user namespace
owns the PID namespace, so the mount succeeds). notif.pid in the seccomp
notification matches PIDs visible in this procfs. As PID 1, the supervisor is an
ancestor of all sandboxed processes, so Yama ptrace_scope=1 is satisfied without
PR_SET_PTRACER. The supervisor does NOT have PR_SET_NO_NEW_PRIVS set, which
would otherwise block /proc/<pid>/mem access.
Fd passing protocol: Before fork(), the parent creates an anonymous
Unix socket pair (socketpair(AF_UNIX, SOCK_STREAM)). One end is inherited by
the worker (PID 2+), the other by the supervisor (PID 1). After the notifier
filter is installed, the worker sends the notifier fd to PID 1 as SCM_RIGHTS
ancillary data.
Requirements: Linux 5.9+ (auto-detected from /proc/sys/kernel/osrelease).
Disabled in monitor mode (incompatible with SECCOMP_RET_LOG). Configurable
via [syscalls] notifier in recipe config.
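A version gate of this kind can be sketched as a small parser over the osrelease string. The function names are hypothetical, and treating an unparseable string as "unsupported" is an assumption rather than documented behavior.

```rust
/// Extracts (major, minor) from a release string such as "6.8.0-45-generic".
fn parse_release(release: &str) -> Option<(u32, u32)> {
    let mut parts = release.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    // The minor field may carry a suffix (e.g. "10-rc1"), so keep leading digits only.
    let digits: String = parts
        .next()?
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let minor = digits.parse().ok()?;
    Some((major, minor))
}

/// Returns true when the kernel is at least 5.9.
/// Assumption: an unparseable release string disables the notifier.
fn notifier_supported(release: &str) -> bool {
    match parse_release(release) {
        Some(version) => version >= (5, 9),
        None => false,
    }
}
```

In practice the input would come from reading /proc/sys/kernel/osrelease, for example with `std::fs::read_to_string`.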
Security property: Even syscalls that pass the main BPF filter are
subject to argument-level inspection. A sandboxed process cannot connect to
unauthorized IPs, pass file descriptors via SCM_RIGHTS, create new namespaces
via clone flags, open raw sockets, open AF_NETLINK sockets beyond NETLINK_ROUTE,
or exec binaries outside the allow_execve list.
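As an illustration of the connect() policy, the check against resolved IPs, allow_ips CIDRs, and loopback could look roughly like the following. This is a simplified IPv4-only sketch; `Cidr` and `connect_allowed` are hypothetical names, and port handling is omitted.

```rust
use std::net::Ipv4Addr;

/// An IPv4 network such as 10.0.0.0/8 (hypothetical helper type).
struct Cidr { base: Ipv4Addr, prefix: u8 }

impl Cidr {
    fn contains(&self, ip: Ipv4Addr) -> bool {
        let prefix = u32::from(self.prefix.min(32));
        // A /0 network matches everything; shifting a u32 by 32 would overflow.
        let mask = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };
        (u32::from(ip) & mask) == (u32::from(self.base) & mask)
    }
}

/// Allows loopback, any pre-resolved host IP, or any address inside an allowed CIDR.
fn connect_allowed(dest: Ipv4Addr, resolved: &[Ipv4Addr], allow_ips: &[Cidr]) -> bool {
    dest.is_loopback()
        || resolved.contains(&dest)
        || allow_ips.iter().any(|cidr| cidr.contains(dest))
}
```

Anything that does not match one of the three sources is denied, which mirrors the default-deny stance of the allow-list seccomp mode.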
5. Process Control
Modules: process.rs (environment filtering, PID namespace, RLIMIT_NPROC,
allow_execve validation)
Process control enforces the [process] config section:
PID Namespace (CLONE_NEWPID + two forks):
The child calls unshare(CLONE_NEWPID) atomically with the other namespace
flags. Since CLONE_NEWPID affects children of the calling process (not the
caller itself), the child forks to enter the new PID namespace.
When the USER_NOTIF supervisor is enabled, a second fork inside the new PID namespace creates the supervisor/worker split:
- PID 1 (supervisor): Mounts its own /proc via unshare(CLONE_NEWNS), receives the notifier fd from the worker via SCM_RIGHTS, and runs the supervisor loop inline (single-threaded poll + waitpid).
- PID 2+ (worker): Performs sandbox setup (overlay, pivot_root, seccomp), installs the USER_NOTIF filter, sends the notifier fd to PID 1, then execs the target command.
When the notifier is disabled, there is only one fork. The child becomes PID 1 and proceeds directly with sandbox setup and exec.
After the inner fork, setsid() is called to create a new session and
process group. This is necessary because PID 1 inherits an invisible
session/process-group from the parent namespace. Without setsid(),
bash’s job control initialization fails because getpgrp() returns the
parent-namespace process group ID, which doesn’t exist in the new PID
namespace — causing “initialize_job_control: getpgrp failed”.
The intermediate parent (in the old PID namespace) waits and propagates the exit code.
Outer child (after unshare)
│
├── fork() (enters new PID namespace)
│ │
│ ├── [notifier enabled] fork() again:
│ │ │
│ │ ├── PID 1: Supervisor
│ │ │ ├── unshare(CLONE_NEWNS)
│ │ │ ├── mount /proc
│ │ │ ├── recv notifier fd
│ │ │ └── poll/waitpid supervisor loop
│ │ │
│ │ └── PID 2: Worker
│ │ ├── setsid()
│ │ ├── setup overlay, network, seccomp
│ │ ├── install notifier filter, send fd to PID 1
│ │ └── execve()
│ │
│ ├── [notifier disabled] PID 1: direct setup + exec
│ │ ├── setsid()
│ │ ├── setup overlay, network, seccomp
│ │ └── execve()
│ │
│ └── (intermediate parent waits, exits with child's code)
Security property: The sandboxed process tree is completely isolated. It cannot see or signal host processes via /proc or kill().
Environment Filtering (env_passthrough):
Before execve(), the environment is reconstructed from scratch. Only
variables listed in env_passthrough are kept. If the list is empty, the
process starts with a completely clean environment (zero host leakage).
A minimal PATH is injected if not explicitly passed through, to prevent
the sandbox from being unable to find executables.
Uses execve() instead of execvp() to pass the filtered environment
explicitly.
Security property: Sensitive environment variables (API keys, tokens,
credentials in AWS_SECRET_ACCESS_KEY, GITHUB_TOKEN, etc.) are never
leaked to the sandbox unless explicitly listed in env_passthrough.
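The reconstruction step can be sketched as a pure function over the host environment. `filter_env` and `FALLBACK_PATH` are hypothetical; in particular, the fallback PATH value shown here is an assumption, since the text only says that a minimal PATH is injected.

```rust
use std::collections::BTreeMap;

// Assumed value for illustration; the real minimal PATH is not specified here.
const FALLBACK_PATH: &str = "/usr/local/bin:/usr/bin:/bin";

/// Builds the sandbox environment from scratch: only names listed in
/// `passthrough` are copied, then PATH is injected if it is still missing.
fn filter_env(
    host: &BTreeMap<String, String>,
    passthrough: &[&str],
) -> BTreeMap<String, String> {
    let mut env = BTreeMap::new();
    for name in passthrough {
        if let Some(value) = host.get(*name) {
            env.insert((*name).to_string(), value.clone());
        }
    }
    env.entry("PATH".to_string())
        .or_insert_with(|| FALLBACK_PATH.to_string());
    env
}
```

Starting from an empty map, rather than deleting known-sensitive names from a copy of the host environment, means a variable nobody anticipated cannot leak. With `std::process::Command`, the same shape is available through `env_clear()` followed by `envs(...)`.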
max_pids (RLIMIT_NPROC):
Sets RLIMIT_NPROC via setrlimit() to cap the number of processes the
sandbox can create. This is a per-UID limit — effective because the sandbox
runs as UID 0 in its own user namespace, mapped to the host user.
Security property: Prevents fork bombs. A process that exceeds the limit
gets EAGAIN from fork().
allow_execve (pre-exec validation):
The resolved command path is checked against the allow_execve list
before forking. If the command is not in the list (and the list is non-empty),
execution is rejected immediately.
Prefix rules: Entries ending in /* match any binary under that
directory tree. For example, /nix/store/* allows any binary whose
resolved path starts with /nix/store/. The match requires a / boundary
to prevent false positives (e.g., /nix/store-extra/foo does NOT match
/nix/store/*). This is essential for content-addressed stores like Nix
where binary paths contain unpredictable hashes.
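The exact-match and prefix rules can be expressed in a few lines. This is a sketch under the assumption that `path` has already been resolved to a canonical absolute path; `execve_allowed` is a hypothetical name.

```rust
/// Returns true when `path` is permitted by `allow_execve`.
/// An empty list allows everything; entries ending in "/*" match any path
/// below that directory, and all other entries must match exactly.
fn execve_allowed(path: &str, allow_execve: &[&str]) -> bool {
    if allow_execve.is_empty() {
        return true;
    }
    allow_execve.iter().any(|rule| match rule.strip_suffix("/*") {
        // Require a '/' right after the directory so that
        // "/nix/store-extra/foo" does not match "/nix/store/*".
        Some(dir) => path
            .strip_prefix(dir)
            .map_or(false, |rest| rest.starts_with('/')),
        None => path == *rule,
    })
}
```

For example, with the rules `["/usr/bin/python3", "/nix/store/*"]`, a path such as `/nix/store/abc123-python/bin/python3` is accepted while `/nix/store-extra/foo` and `/usr/bin/curl` are rejected.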
Limitation: allow_execve validates the initial command at the CLI
level. Ongoing enforcement of every execve() call inside the sandbox is
provided by the USER_NOTIF supervisor (see
Seccomp USER_NOTIF Supervisor), which
intercepts execve() and execveat() syscalls and validates the pathname
against the allow_execve list. When the notifier is disabled (kernel
< 5.9 or notifier = false), only the initial command is validated.
6. Cgroups v2
Files: cgroups.rs
Cgroups v2 enforces resource limits (memory and CPU) without requiring root. It leverages systemd’s per-user cgroup delegation, which is available on any modern system running systemd (Ubuntu 22.04+, Fedora 36+, etc.).
Resource limits are opt-in — none of the shipped base recipes include
[resources]. Users add memory_mb and/or cpu_percent in their own
recipes when needed.
Setup sequence (happens before pivot_root, while /sys/fs/cgroup is
still accessible):
- Detect the current cgroup by reading /proc/self/cgroup.
- Create a child cgroup at <parent>/canister-<pid>.
- Write memory.max (bytes) and cpu.max (quota/period) to the child cgroup’s control files.
- Move the sandboxed process into the child cgroup by writing its PID to cgroup.procs.
CPU limiting: cpu_percent = 50 translates to cpu.max = "50000 100000"
(50ms quota per 100ms period), effectively capping the process to 50% of
one CPU core.
Memory limiting: memory_mb = 512 translates to memory.max = 536870912
(512 * 1024 * 1024 bytes). When exceeded, the kernel OOM-kills the process.
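The two conversions are simple arithmetic, sketched below together with the file writes. The helper names are hypothetical, and the fixed 100 ms period is taken from the example above rather than from the source code.

```rust
use std::path::Path;
use std::{fs, io};

const CPU_PERIOD_US: u64 = 100_000; // 100 ms period, as in the example above

/// cpu_percent = 50 becomes "50000 100000" (quota and period in microseconds).
fn cpu_max(cpu_percent: u64) -> String {
    format!("{} {}", cpu_percent * CPU_PERIOD_US / 100, CPU_PERIOD_US)
}

/// memory_mb = 512 becomes 536870912 bytes.
fn memory_max(memory_mb: u64) -> u64 {
    memory_mb * 1024 * 1024
}

/// Writes whichever limits are configured into the child cgroup directory.
fn write_limits(
    cgroup_dir: &Path,
    memory_mb: Option<u64>,
    cpu_percent: Option<u64>,
) -> io::Result<()> {
    if let Some(mb) = memory_mb {
        fs::write(cgroup_dir.join("memory.max"), memory_max(mb).to_string())?;
    }
    if let Some(percent) = cpu_percent {
        fs::write(cgroup_dir.join("cpu.max"), cpu_max(percent))?;
    }
    Ok(())
}
```

Under this formula a value above 100 yields a quota larger than the period (for example, `cpu_max(200)` gives `"200000 100000"`), which cgroup v2 interprets as more than one core's worth of CPU time.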
Cleanup: Child cgroups are removed when the sandboxed process exits. A cgroup can be removed with rmdir once it has no member processes; the kernel does not delete empty cgroups on its own.
Failure handling: Cgroup setup failure aborts the sandbox. All setup failures are fatal regardless of mode.
Security property: The sandboxed process cannot consume unbounded memory or CPU. The limits are enforced by the kernel’s cgroup controller and cannot be modified by the sandboxed process (which has no write access to the cgroup filesystem after seccomp is loaded).
7. /proc Hardening
Files: overlay.rs (mount_proc function)
After mounting /proc inside the sandbox, Canister masks sensitive paths
following Docker’s default behavior, plus additional hardening:
Masked files (bind-mount /dev/null over them):
- /proc/kcore — physical memory access
- /proc/keys — kernel keyring contents
- /proc/key-users — keyring user counts (information leak)
- /proc/sysrq-trigger — kernel SysRq commands
- /proc/timer_list — timer details (information leak)
- /proc/latency_stats — latency statistics
- /proc/kallsyms — kernel symbol addresses (KASLR bypass)
- /proc/schedstat — scheduler statistics (information leak)
Masked per-process files (bind-mount /dev/null over them):
- /proc/self/mountinfo — mount topology (reveals sandbox structure)
- /proc/1/mountinfo — same, for PID 1
Masked directories (mount empty read-only tmpfs over them):
- /proc/acpi — ACPI interface
- /proc/scsi — SCSI device interface
Read-only remount:
- /proc/sys — prevents writing to sysctl tunables
Failure handling: Individual mask failures are logged at debug level and are non-fatal. The sandbox continues with whatever masking succeeded.
Security property: The sandboxed process cannot read sensitive kernel information from /proc, trigger SysRq commands, modify sysctl values, or inspect the sandbox’s mount topology via mountinfo.
8. Capability Dropping
Module: namespace.rs (drop_capabilities())
After all namespace setup is complete and before execve(), Canister drops
all Linux capabilities from the bounding set and clears the inheritable and
ambient sets.
Setup sequence:
- Read CAP_LAST_CAP from /proc/sys/kernel/cap_last_cap to discover the number of capabilities on the running kernel (currently 41).
- Drop each capability from the bounding set using prctl(PR_CAPBSET_DROP, cap).
- Clear the inheritable capability set using capset().
- Clear the ambient capability set using prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_CLEAR_ALL).
Result after exec with NO_NEW_PRIVS:
CapEff: 0000000000000000
CapPrm: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000
CapInh: 0000000000000000
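One way to check this result from inside a sandbox is to parse the Cap* lines of /proc/self/status. The helper below is a verification sketch, not part of Canister; both function names are hypothetical.

```rust
/// Parses the Cap* lines of a /proc/<pid>/status dump into (name, bitmask) pairs.
fn capability_sets(status: &str) -> Vec<(String, u64)> {
    status
        .lines()
        .filter_map(|line| {
            let (key, value) = line.split_once(':')?;
            if !key.starts_with("Cap") {
                return None;
            }
            // Values are printed as 16 hex digits without a "0x" prefix.
            let bits = u64::from_str_radix(value.trim(), 16).ok()?;
            Some((key.to_string(), bits))
        })
        .collect()
}

/// True when all five capability sets are present and empty.
fn all_caps_dropped(status: &str) -> bool {
    let sets = capability_sets(status);
    sets.len() == 5 && sets.iter().all(|(_, bits)| *bits == 0)
}
```

Running something like `can run -- grep Cap /proc/self/status` and comparing the output against the listing above gives the same information without any code.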
Why this matters: Inside a user namespace, the sandboxed process has
CAP_SYS_ADMIN and other capabilities that allow namespace operations. While
seccomp blocks the dangerous syscalls (mount, unshare, etc.), dropping
capabilities provides defense-in-depth. Even if a seccomp bypass were found,
the empty capability set prevents privilege escalation.
AppArmor interaction: The canister AppArmor profile requires
allow capability setpcap, to permit PR_CAPBSET_DROP calls. This is
included in the shipped profile and installed via sudo can setup.
Security property: The sandboxed process executes with no capabilities
in any set. It cannot gain capabilities through any mechanism (exec of
setuid binaries is also blocked by NO_NEW_PRIVS).
9. Default Resource Limits
Module: process.rs (apply_default_resource_limits())
Before execve(), Canister applies conservative resource limits that
provide baseline protection even when no [resources] section is present
in the recipe:
| Limit | Value | Purpose |
|---|---|---|
| RLIMIT_NPROC | 4096 | Limits total processes (fork bomb defense) |
| RLIMIT_AS | 8 GB | Limits virtual address space |
| RLIMIT_NOFILE | 4096 | Limits open file descriptors |
| RLIMIT_FSIZE | 4 GB | Limits maximum file size |
| RLIMIT_CORE | 0 | Disables core dumps (prevents data leakage) |
These defaults are applied first, then any explicit limits from the recipe’s
[resources] section override them. The RLIMIT_NPROC from
[process].max_pids takes precedence over the default if specified.
Security property: Fork bombs are bounded, memory-hungry processes are capped, and core dumps cannot leak sandbox state to disk.
10. Monitor Mode
Flag: --monitor
Monitor mode runs the sandbox with all namespace isolation active (for accurate observation) but relaxes policy enforcement. Each enforcement point logs what would have been blocked without actually blocking it.
Enforcement points and their monitor-mode behavior:
| Enforcement point | Normal mode | Monitor mode |
|---|---|---|
| allow_execve | Rejects unlisted commands | Logs warning, allows through |
| env_passthrough | Strips unlisted env vars | Logs what would be stripped, passes full env |
| max_pids | Sets RLIMIT_NPROC | Logs the limit, skips setrlimit() |
| Seccomp BPF | SECCOMP_RET_ERRNO (EPERM) | SECCOMP_RET_LOG (allowed but kernel-logged) |
| USER_NOTIF supervisor | Active (intercepts syscalls) | Disabled (incompatible with SECCOMP_RET_LOG) |
| Filesystem isolation | Full overlay + pivot_root | Full overlay + pivot_root (unchanged) |
| Network isolation | Namespace + pasta | Namespace + pasta (unchanged) |
Key design decisions:
- Namespaces stay active. Monitor mode does NOT skip namespace creation. This ensures the process runs in the same environment it would in enforced mode, so observations are accurate. If namespaces were disabled, the process might behave differently (different PIDs, different filesystem view, etc.).
- SECCOMP_RET_LOG for syscalls. Instead of returning EPERM, denied syscalls are allowed through but logged to the kernel audit subsystem. View these with journalctl -k | grep seccomp. This uses a real BPF filter (same structure as enforcement mode) so the observation is exact.
- USER_NOTIF is disabled. The notifier supervisor is incompatible with monitor mode because SECCOMP_RET_USER_NOTIF suspends the syscall (it does not log-and-allow like SECCOMP_RET_LOG). In monitor mode, all syscalls pass through to the kernel with logging only.
- Pre-run policy preview. Before forking, the CLI prints a summary of the active policy so the user knows what enforcement points will be observed.
- Post-run summary. After the sandboxed process exits, the CLI prints a summary with the exit code and hints for reviewing the monitor output.
Intended workflow:
# 1. Run with monitor to see what the policy would block
can run --monitor --recipe my_policy.toml -- ./my_program
# 2. Review MONITOR: lines in output and seccomp audit logs
journalctl -k | grep seccomp
# 3. Adjust policy based on observations
# 4. Run with enforcement
can run --recipe my_policy.toml -- ./my_program
Security property: Monitor mode provides NO security guarantees. It is a development/debugging tool for iterating on sandbox policies.
Warning: A malicious process can detect monitor mode (e.g., by attempting a denied syscall and observing it succeeds) and behave differently. Always validate policies with enforcement enabled.
11. Strict Mode
Flag: --strict (or strict = true in config)
Strict mode is the inverse of monitor mode: instead of relaxing enforcement, it tightens it. Both normal and strict mode treat all setup failures as fatal. The key difference is the seccomp deny action.
Changes in strict mode:
| Enforcement point | Normal mode | Strict mode |
|---|---|---|
| Filesystem isolation | Aborts on failure | Aborts on failure |
| Network setup | Aborts on failure | Aborts on failure |
| Loopback bring-up | Aborts on failure | Aborts on failure |
| Seccomp deny action | SECCOMP_RET_ERRNO (EPERM) | SECCOMP_RET_KILL_PROCESS |
| Cgroup setup | Aborts on failure | Aborts on failure |
Mutual exclusion: --strict and --monitor cannot be used together.
This is enforced at the CLI level.
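Taken together, the normal, strict, and monitor modes differ mainly in the seccomp action applied to denied syscalls. A compact way to model that, with hypothetical names and an illustrative error message, is:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode { Normal, Strict, Monitor }

// Standard seccomp return values from <linux/seccomp.h>.
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
const SECCOMP_RET_LOG: u32 = 0x7FFC_0000;
const EPERM: u32 = 1;

/// Resolves the two CLI flags into a mode, rejecting the forbidden combination.
/// The error text is illustrative, not the CLI's actual message.
fn select_mode(strict: bool, monitor: bool) -> Result<Mode, String> {
    match (strict, monitor) {
        (true, true) => Err("--strict and --monitor cannot be used together".to_string()),
        (true, false) => Ok(Mode::Strict),
        (false, true) => Ok(Mode::Monitor),
        (false, false) => Ok(Mode::Normal),
    }
}

/// Seccomp action applied to denied syscalls in each mode.
fn deny_action(mode: Mode) -> u32 {
    match mode {
        Mode::Normal => SECCOMP_RET_ERRNO | EPERM,
        Mode::Strict => SECCOMP_RET_KILL_PROCESS,
        Mode::Monitor => SECCOMP_RET_LOG,
    }
}

/// The USER_NOTIF supervisor is disabled in monitor mode.
fn notifier_enabled(mode: Mode, requested: bool) -> bool {
    requested && mode != Mode::Monitor
}
```

Modeling the mode as a single enum, instead of two independent booleans, makes the invalid strict-plus-monitor state unrepresentable after argument parsing.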
Recommended for: CI pipelines, production deployments, and any environment where reduced isolation is worse than no execution.
Parent-Child Protocol
The parent and child synchronize via three anonymous pipes:
Pipe 1: child_ready (child → parent) "namespaces created"
Pipe 2: maps_done (parent → child) "UID/GID maps written"
Pipe 3: network_done (parent → child) "pasta started, network ready"
Timeline:
Child: unshare(USER+PID+NET)
Child: write(child_ready, 0x00) ← "namespaces created"
Child: read(maps_done) ← blocks
Parent: read(child_ready) ← unblocks
Parent: write uid_map, gid_map
Parent: write(maps_done, 0x00) ← "maps written"
Child: read(maps_done) ← unblocks
Child: read(network_done) ← blocks
Parent: start pasta --userns /proc/<child>/ns/user --netns /proc/<child>/ns/net --runas <uid>
Parent: write(network_done, 0x00) ← "network ready"
Child: read(network_done) ← unblocks
Child: unshare(NEWNS)
Child: install notifier filter, send fd to PID 1 supervisor (if enabled)
Child: setup overlay, network, seccomp
Child: execve()
PID 1: receive notifier fd, run inline supervisor loop (if enabled)
This three-pipe protocol is necessary because:
- UID/GID maps must be written from outside the namespace. The kernel requires an external process to write /proc/<pid>/uid_map.
- pasta needs the child’s user and network namespaces. pasta is invoked with --userns /proc/<child_pid>/ns/user --netns /proc/<child_pid>/ns/net --runas <uid>. The setns(CLONE_NEWNET) syscall requires CAP_SYS_ADMIN in the user namespace that owns the target network namespace, not the caller’s user namespace. Since the child created both namespaces atomically via unshare(CLONE_NEWUSER | CLONE_NEWNET), the network namespace is owned by the child’s user namespace. pasta must therefore first join the child’s user namespace (setns(CLONE_NEWUSER)) to acquire CAP_SYS_ADMIN there, then join the network namespace (setns(CLONE_NEWNET)). The child calls prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY) before signaling the parent, so that pasta (a sibling process) can open /proc/<child>/ns/* despite Yama ptrace_scope=1. --runas <uid> prevents pasta from dropping to “nobody”, which would fail the kernel’s UID ownership check on namespace files. This must happen after the child creates CLONE_NEWNET but before the child tries to use the network.
- Mount namespace is split from the initial unshare. The child first calls unshare(USER+PID+NET), then waits for pasta, then calls unshare(NEWNS) separately. This split ensures pasta can access /proc/<child_pid>/ns/net before the child’s mount namespace changes.
- Mount operations need mapped UIDs. The child cannot mount anything until its UID is mapped (otherwise the kernel rejects it).
- The notifier fd must be passed from worker to supervisor. The seccomp() syscall returns the notifier fd in the worker’s process. The fd is sent to PID 1 (supervisor) via SCM_RIGHTS over an anonymous Unix socket pair created before the supervisor/worker fork.
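The ordering guarantees of the three-pipe handshake can be modeled in a few lines. The sketch below substitutes threads and std channels for the real processes and anonymous pipes, so it only illustrates the blocking order, not the namespace operations themselves; `run_handshake` is a hypothetical name.

```rust
use std::sync::mpsc::channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Simulates the handshake and returns the observed event order.
/// Channels stand in for the three pipes; a thread stands in for the child.
fn run_handshake() -> Vec<&'static str> {
    let log = Arc::new(Mutex::new(Vec::new()));
    let (child_ready_tx, child_ready_rx) = channel::<u8>();
    let (maps_done_tx, maps_done_rx) = channel::<u8>();
    let (network_done_tx, network_done_rx) = channel::<u8>();

    let child_log = Arc::clone(&log);
    let child = thread::spawn(move || {
        child_log.lock().unwrap().push("child: unshare(USER+PID+NET)");
        child_ready_tx.send(0).unwrap(); // "namespaces created"
        maps_done_rx.recv().unwrap(); // blocks until the maps are written
        network_done_rx.recv().unwrap(); // blocks until pasta is running
        child_log.lock().unwrap().push("child: unshare(NEWNS)");
        child_log.lock().unwrap().push("child: execve()");
    });

    child_ready_rx.recv().unwrap(); // parent waits for the namespaces
    log.lock().unwrap().push("parent: write uid_map, gid_map");
    maps_done_tx.send(0).unwrap();
    log.lock().unwrap().push("parent: start pasta");
    network_done_tx.send(0).unwrap();
    child.join().unwrap();

    let events = log.lock().unwrap().clone();
    events
}
```

Each side blocks on a read before the step that depends on the other side, so the event order is fixed regardless of scheduling: namespaces, ID maps, pasta, mount namespace, exec.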
Mandatory Access Control (MAC)
Linux distributions use Mandatory Access Control systems to restrict
unprivileged processes. Canister detects the active MAC system at runtime and
manages the appropriate security policy via can setup. See ADR-0004 for the
design rationale.
Supported MAC Systems
| Distribution | MAC System | Restriction Mechanism |
|---|---|---|
| Ubuntu 24.04+ | AppArmor | kernel.apparmor_restrict_unprivileged_userns=1 |
| Fedora 41+ / RHEL 10+ | SELinux | user_namespace { create } permission |
| Arch, Void, Gentoo, etc. | None | No restriction — works natively |
Detection
Canister detects the active MAC system at startup:
- AppArmor: /sys/module/apparmor/parameters/enabled == "Y"
- SELinux: /sys/fs/selinux/enforce exists
- Neither: no policy needed, sandbox works natively
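The detection rules above translate into a small probe. This is a sketch: `MacSystem`, `classify`, and `detect_mac` are hypothetical names, and checking AppArmor before SELinux when both are present is an assumption.

```rust
use std::fs;
use std::path::Path;

#[derive(Debug, PartialEq)]
enum MacSystem { AppArmor, SeLinux, Neither }

/// Pure decision logic, separated from file access so it can be tested.
/// Assumption: AppArmor takes precedence if both indicators are present.
fn classify(apparmor_enabled: Option<&str>, selinux_enforce_exists: bool) -> MacSystem {
    if apparmor_enabled.map(str::trim) == Some("Y") {
        MacSystem::AppArmor
    } else if selinux_enforce_exists {
        MacSystem::SeLinux
    } else {
        MacSystem::Neither
    }
}

/// Probes the two paths listed above on the running system.
fn detect_mac() -> MacSystem {
    let apparmor = fs::read_to_string("/sys/module/apparmor/parameters/enabled").ok();
    let selinux = Path::new("/sys/fs/selinux/enforce").exists();
    classify(apparmor.as_deref(), selinux)
}
```

The sysfs file ends with a newline, which is why the value is trimmed before comparison.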
can check reports the active MAC system, its restriction status, and the
canister policy status.
AppArmor (Ubuntu)
Two-profile architecture:
Canister uses two AppArmor profiles, managed by can setup:
- canister — attached to the can binary. Grants mount, pivot_root, capabilities (sys_admin, net_admin, sys_chroot, sys_ptrace, dac_override, dac_read_search), userns creation, and full file/network access. Has a catch-all px /** -> canister//&canister_sandboxed rule that transitions all child execs to the restricted sub-profile. Also has specific ux (unconfined exec) rules for:
  - pasta (/usr/bin/pasta, /usr/bin/pasta.avx2, /bin/pasta, /bin/pasta.avx2): pasta needs CAP_SYS_ADMIN to call setns(CLONE_NEWUSER), which is denied by canister_sandboxed. The ux rules take precedence over the px /** glob, so pasta runs unconfined.
  - apparmor_parser (/usr/sbin/apparmor_parser, /sbin/apparmor_parser): needs CAP_MAC_ADMIN to load/unload profiles during can setup.
- canister_sandboxed — maximally strict sub-profile for sandboxed commands. Denies all capabilities (audit deny capability), mount/umount/pivot_root, user namespace creation, ptrace (except allowing the USER_NOTIF supervisor to read process memory), and DBus.
Profile transition chain:
canister (binary starts, never execs itself)
├─ fork (child inherits "canister") → all namespace setup happens here
│ └─ execve(command) → "canister//&canister_sandboxed"
├─ spawn(pasta) → ux rule fires → runs unconfined
└─ spawn(apparmor_parser) → ux rule fires → runs unconfined
In AppArmor, rules for a specific path (ux /usr/bin/pasta) take precedence over
glob rules (px /**), so the ux rules for pasta and apparmor_parser work without
conflicting with the catch-all px rule.
One-time upgrade note: When upgrading from an older profile (without ux
rules for pasta/apparmor_parser) to the new profile, apparmor_parser may be
confined by the old profile and fail with “Access denied”. In this case,
manually reload: sudo apparmor_parser -r /etc/apparmor.d/canister.
SELinux (Fedora/RHEL)
Policy module architecture:
Canister’s SELinux policy defines three types:
- canister_t — domain for the can binary. Grants user_namespace { create }, cap_userns { sys_admin sys_ptrace net_admin sys_chroot }, mount/pivot_root permissions, full file access, and ptrace over sandboxed children.
- canister_sandboxed_t — restricted domain for sandboxed child processes. Basic file read/execute and network socket access only. No namespace creation, no capabilities, no mount operations.
- canister_exec_t — file type for the can binary, triggers automatic domain transition from unconfined_t to canister_t on exec.
Installation: SELinux policy installation requires checkmodule,
semodule_package, and semodule (from policycoreutils and checkpolicy
packages). can setup generates .te (type enforcement) and .fc (file
context) files, compiles them, and installs the module.
Impact on Canister
| Feature | With MAC restriction | With canister policy |
|---|---|---|
| User namespace | Works | Works |
| Mount namespace | Mounts fail → aborts | Full isolation |
| Filesystem isolation | Aborts (cannot establish) | Full |
| Network namespace | Works | Works |
| Loopback bring-up | Fails → aborts | Works |
| pasta | N/A (no connectivity) | Works |
| Seccomp | Works | Works |
| USER_NOTIF supervisor | Works | Works |
Policy management (can setup)
# Install the security policy (auto-detects MAC system and binary path)
sudo can setup
# Force reinstall (even if policy exists and appears current)
sudo can setup --force
# Remove the policy
sudo can setup --remove
can setup is interactive when stdout is a terminal: it shows the generated
policy content (or a diff when updating), and asks for confirmation before
writing. In non-interactive mode (piped/CI), it writes without prompting.
The command auto-detects the active MAC system and generates the appropriate policy. On systems with no MAC, it reports that no policy is needed.
Stale policy detection: when the installed policy content doesn’t match the
current template (e.g., after a Canister upgrade), can check reports the
policy as “OUTDATED” and can setup will update it.
Known Limitations
Fundamental limitations
- Kernel exploits. No userspace sandbox can protect against kernel vulnerabilities. Seccomp reduces the attack surface but cannot eliminate it.
- Side channels. Timing attacks, cache attacks, and speculative execution attacks are out of scope.
- RLIMIT_NPROC is per-UID. The max_pids limit applies to the user’s total process count, not just the sandbox. Inside a user namespace this is usually fine (the sandbox runs as a mapped UID), but if multiple sandboxes share a UID they share the limit.
- DNS resolution timing. Domain pre-resolution happens at sandbox startup. If DNS records change during execution, the resolved IP set becomes stale. TTL-aware re-resolution is not implemented.
- USER_NOTIF TOCTOU window. The SECCOMP_IOCTL_NOTIF_ID_VALID check mitigates but does not fully eliminate the time-of-check-time-of-use race in the USER_NOTIF supervisor. A highly concurrent, adversarial workload with precise timing could theoretically modify memory between the supervisor’s read and verdict. This is an inherent limitation of the seccomp_unotify mechanism.