Data Loss Prevention (DLP)

Canister’s L7 egress proxy includes a built-in DLP layer that scans outbound HTTP traffic for credential patterns and enforces per-detector domain scoping. Even when a sandboxed process has filesystem access to credential files (because the user wants npm, gh, or aws to keep working), DLP blocks those credentials from leaving over HTTP(S) to unauthorised destinations, within the limits set out in the threat model.

Threat Model

A sandboxed process typically has filesystem access to credential-bearing files — intentionally, because the user wants their package managers and CLI tools to keep working against private registries. That process is potentially:

  • Untrusted — a build script, post-install hook, or LLM-generated command running with read access to ~/.npmrc, ~/.aws/credentials, the GitHub keyring, etc.
  • Trusted-but-buggy — telemetry code that accidentally serialises environment variables containing tokens.
  • Trusted-but-compromised — a supply-chain attack inside an otherwise reputable dependency.

DLP’s goal: even when a credential is readable, it cannot leave the sandbox via HTTP(S) unless flowing to an explicitly authorised destination for that credential’s service.

In scope:

  • HTTP/1.1 and HTTP/2 request headers, bodies, trailers
  • URI query parameters and path segments
  • Bodies wrapped in gzip / deflate / brotli
  • Multi-layer encoded payloads (base64 / hex / percent), up to 32 levels
  • DNS-label exfiltration via high-entropy hostname labels
  • Slow byte-at-a-time exfiltration via cumulative entropy budgeting

Out of scope:

  • Covert timing channels
  • In-memory key extraction
  • Filesystem-write exfiltration to shared/CWD mounts
  • Pixel-level steganography in image payloads
  • Plain CONNECT (L4) tunnels — DLP forces interception when enabled, so any traffic that bypasses interception (e.g. non-HTTP protocols) is denied rather than inspected.

Architecture

DLP lives in the standalone can-dlp crate so it can be reused by both the proxy and the sandbox (for canary generation) without pulling proxy dependencies into the sandbox crate.

crates/can-dlp/
  src/
    detectors.rs      — DetectorId enum, compiled RegexSet, Finding
    scopes.rs         — per-detector domain matching (built-in + extras)
    decode.rs         — base64/hex/percent recursion, up to N layers
    decompress.rs     — gzip/deflate/brotli body decompression
    normalize.rs      — whitespace/unicode normalisation before scanning
    entropy.rs        — Shannon entropy + SessionEntropyBudget
    canary.rs         — fake credential generation
    scanner.rs        — DlpScanner: orchestrates the full pipeline
    error.rs          — DlpError (thiserror)

The DlpConfig serde struct lives in can-policy (next to NetworkConfig) to avoid a can-dlp → can-policy circular dependency.

Activation chain:

recipe / manifest [network.dlp]
        │
        ▼
NetworkConfig::dlp (Option<DlpConfig>)
        │
        ▼
ProxyServer constructed with DlpScanner + SessionEntropyBudget
        │
        ▼
Per-request: scan headers + URI + (decompressed, decoded) body

When DLP is enabled, the proxy forces interception of all traffic. The passthrough path (which is opaque to the proxy) is disabled because it would bypass scanning.


Detectors and Scope Model

Each detector has hardcoded home domains baked into the binary. Tokens can only flow to their home service — even if a [[host]] block permits the destination, a GitHub PAT bound for registry.npmjs.org is blocked.

| Detector | Pattern | Built-in home domains | Default action |
|---|---|---|---|
| github_pat | `gh[pousr]_[A-Za-z0-9]{36}` and `github_pat_[A-Za-z0-9]{22}_[A-Za-z0-9]{59}` | `github.com`, `*.github.com` | block |
| npm_token | `npm_[A-Za-z0-9]{36}` | `registry.npmjs.org` | block |
| aws_access_key | `AKIA[A-Z0-9]{16}` | `*.amazonaws.com` | block |
| slack_token | `xox[baprs]-[0-9]{10,13}-[0-9]{10,13}-[A-Za-z0-9]{24}` | `*.slack.com` | block |
| ssh_private_key | `-----BEGIN (RSA\|EC\|OPENSSH\|DSA )?PRIVATE KEY-----` | none (always block) | block |
| bearer_token | `Bearer\s+[A-Za-z0-9\-._~+/]{20,}=*` | (requires explicit `allow_credentials = ["bearer_token"]` on a host) | block |
| generic_high_entropy | Sliding window, Shannon entropy > 4.5, 20+ chars | (warn only) | warn (promoted to block in --strict) |
| canary_token | Exact match against injected fake credentials | none (always block) | block (error log) |

Enforcement rules:

  1. Known-service tokens (github_pat, npm_token, aws_access_key, slack_token) — destination must be in the detector’s home domains or in a [[host]] block whose allow_credentials includes the detector id. Mismatched service → 451 block.
  2. bearer_token — generic; requires explicit per-host opt-in via allow_credentials = ["bearer_token"]. No implicit scope.
  3. ssh_private_key and canary_token — no legitimate HTTP destination; always blocked.
  4. generic_high_entropy — too noisy to scope; always warn, blocks only in --strict.
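
The four rules above can be read as a single decision function. The sketch below is illustrative only: the `Detector`, `Verdict`, and `Destination` types and the `decide` function are hypothetical names, not the can-dlp API, and the two booleans stand in for whatever scope lookup the real scanner performs.

```rust
/// Detector ids, mirroring the detector table (hypothetical enum shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Detector {
    GithubPat,
    NpmToken,
    AwsAccessKey,
    SlackToken,
    SshPrivateKey,
    BearerToken,
    GenericHighEntropy,
    CanaryToken,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Allow,
    Warn,
    Block,
}

/// Assumed summary of how the request destination relates to one detector.
#[derive(Debug, Clone, Copy)]
pub struct Destination {
    /// Host matches the detector's built-in or extra home domains.
    pub in_home_scope: bool,
    /// The matching [[host]] block lists this detector in allow_credentials.
    pub host_allows_detector: bool,
}

pub fn decide(detector: Detector, dest: Destination, strict: bool) -> Verdict {
    match detector {
        // Rule 1: known-service tokens may only flow home or to an opted-in host.
        Detector::GithubPat
        | Detector::NpmToken
        | Detector::AwsAccessKey
        | Detector::SlackToken => {
            if dest.in_home_scope || dest.host_allows_detector {
                Verdict::Allow
            } else {
                Verdict::Block
            }
        }
        // Rule 2: bearer tokens have no implicit scope, only per-host opt-in.
        Detector::BearerToken => {
            if dest.host_allows_detector {
                Verdict::Allow
            } else {
                Verdict::Block
            }
        }
        // Rule 3: no legitimate HTTP destination exists.
        Detector::SshPrivateKey | Detector::CanaryToken => Verdict::Block,
        // Rule 4: warn by default, block under --strict.
        Detector::GenericHighEntropy => {
            if strict {
                Verdict::Block
            } else {
                Verdict::Warn
            }
        }
    }
}
```

Under these assumptions, a GitHub PAT bound for a host that is neither in scope nor opted in yields `Verdict::Block`, while a canary is blocked regardless of what the destination allows.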

The shipped service contracts under recipes/services/*.toml (github.toml, npm.toml, …) already include the right allow_credentials for their detector — composing tools = ["npm", "gh"] produces the right behaviour: npm tokens can only reach npmjs.org, GitHub PATs can only reach GitHub.

Extending scopes for self-hosted services

Self-hosted services (GitHub Enterprise, private npm registries) extend the built-in scopes via extra_scopes:

[network.dlp]
enabled = true

[network.dlp.extra_scopes]
github_pat = ["github.corp.example.com"]
npm_token = ["npm.internal.example.com"]

Extras are unioned with the built-in domains. They never replace or narrow them, so a self-hosted override cannot accidentally weaken the default scope for the public service.
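
A minimal sketch of the union-then-match behaviour follows. The function names are hypothetical, and the wildcard semantics are an assumption: a leading `*.` is taken to match subdomains but not the bare apex, which is consistent with the table listing both github.com and *.github.com.

```rust
use std::collections::BTreeSet;

/// Returns true if `host` matches `pattern`.
/// Assumption: "*.example.com" matches "a.example.com" but not "example.com".
pub fn domain_matches(pattern: &str, host: &str) -> bool {
    let pattern = pattern.to_ascii_lowercase();
    let host = host.to_ascii_lowercase();
    match pattern.strip_prefix("*.") {
        // Wildcard: the host must end in ".<suffix>" with a non-empty label before it.
        Some(suffix) => host
            .strip_suffix(suffix)
            .map_or(false, |rest| rest.len() > 1 && rest.ends_with('.')),
        // No wildcard: exact match only.
        None => host == pattern,
    }
}

/// Built-in domains plus extras; the union can only grow the set.
pub fn effective_scope(built_in: &[&str], extras: &[&str]) -> BTreeSet<String> {
    built_in
        .iter()
        .chain(extras.iter())
        .map(|d| d.to_ascii_lowercase())
        .collect()
}

pub fn in_scope(scope: &BTreeSet<String>, host: &str) -> bool {
    scope.iter().any(|pattern| domain_matches(pattern, host))
}
```

Because `effective_scope` only ever adds entries, the public github.com scope stays intact when github.corp.example.com is added as an extra.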


Scan Pipeline

Per request, the proxy runs:

1. Headers (Authorization, Cookie, Proxy-Authorization, X-*)
   → scan_text → token detected? scope check
2. URI (full reconstructed authority + path + query)
   → scan_text → token detected? scope check
3. Body
   a. Read Content-Encoding header
   b. Decompress (gzip / deflate / brotli) if configured
   c. Run encoding chain recursion (base64 / hex / percent)
   d. Pattern match each layer against PatternSet
4. For every finding:
   - canary    → BLOCK + error! log (zero false positives)
   - ssh key   → BLOCK
   - scoped    → BLOCK if destination not in home/extras
   - bearer    → BLOCK unless the destination's `[[host]]` block lists `"bearer_token"` in `allow_credentials`
   - generic   → WARN (BLOCK in --strict)
5. Session entropy budget update; BLOCK if exceeded.
6. Build response:
   - On allow: forward upstream with `update_content_length()` if body
     was buffered.
   - On block: 451 + `x-canister-error: dlp-blocked` +
     `x-canister-dlp-detector: <name>`.
   - On monitor-mode warn: forward upstream + add
     `x-canister-dlp-warning` so the sandboxed process can observe what
     would have been blocked.

DLP forces request body buffering within the existing max_buffered_body_bytes cap. A streaming scan would miss tokens that straddle chunk boundaries; the cap (default 8 MiB) prevents memory abuse.


Encoding Chain Recursion

decode.rs walks every layer of base64 / base64url / hex / percent-encoding up to max_decode_depth (default 32). At each layer the scanner attempts all decoders; any that produces output different from its input is recursed into. All decoded layers are matched against PatternSet, so:

  • Authorization: Bearer dGVzdA== (Bearer test) is matched at the original layer.
  • body={"x":"Z2hwX0FBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQUFBQQ=="} (a base64-wrapped GitHub PAT) is caught at depth 1.
  • base64(base64(token)) is caught at depth 2.
  • Garbage / malformed encoding at any layer is fail-closed: the original bytes are scanned as-is and the recursion stops on that branch — never silently skipped.

The depth cap is a fuse against adversarially deep nesting designed to exhaust CPU.
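
The recursion can be sketched with the standard library alone. This is not the decode.rs implementation: the decoders below are deliberately small, hand-written stand-ins (lenient about base64 padding), and `decoded_layers` is a hypothetical helper that returns every layer with its depth so a caller could run pattern matching over each one.

```rust
/// Shared decoder signature: `None` means "this input is not that encoding".
type Decoder = fn(&[u8]) -> Option<Vec<u8>>;

fn hex_value(c: u8) -> Option<u8> {
    match c {
        b'0'..=b'9' => Some(c - b'0'),
        b'a'..=b'f' => Some(c - b'a' + 10),
        b'A'..=b'F' => Some(c - b'A' + 10),
        _ => None,
    }
}

fn decode_hex(input: &[u8]) -> Option<Vec<u8>> {
    if input.is_empty() || input.len() % 2 != 0 {
        return None;
    }
    input
        .chunks(2)
        .map(|pair| Some((hex_value(pair[0])? << 4) | hex_value(pair[1])?))
        .collect()
}

fn decode_percent(input: &[u8]) -> Option<Vec<u8>> {
    if !input.contains(&b'%') {
        return None;
    }
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        if input[i] == b'%' {
            // A truncated or non-hex escape makes the whole layer "not percent-encoded".
            let hi = hex_value(*input.get(i + 1)?)?;
            let lo = hex_value(*input.get(i + 2)?)?;
            out.push((hi << 4) | lo);
            i += 3;
        } else {
            out.push(input[i]);
            i += 1;
        }
    }
    Some(out)
}

fn base64_value(c: u8) -> Option<u32> {
    match c {
        b'A'..=b'Z' => Some(u32::from(c - b'A')),
        b'a'..=b'z' => Some(u32::from(c - b'a') + 26),
        b'0'..=b'9' => Some(u32::from(c - b'0') + 52),
        b'+' | b'-' => Some(62), // standard and URL-safe alphabets
        b'/' | b'_' => Some(63),
        _ => None,
    }
}

fn decode_base64(input: &[u8]) -> Option<Vec<u8>> {
    // Drop trailing '=' padding, then decode 6 bits per character.
    let unpadded = input.iter().rposition(|&c| c != b'=').map_or(0, |i| i + 1);
    let body = &input[..unpadded];
    if body.len() < 2 || body.len() % 4 == 1 {
        return None;
    }
    let (mut acc, mut bits) = (0u32, 0u32);
    let mut out = Vec::with_capacity(body.len() * 3 / 4);
    for &c in body {
        acc = (acc << 6) | base64_value(c)?;
        bits += 6;
        if bits >= 8 {
            bits -= 8;
            out.push((acc >> bits) as u8);
            acc &= (1 << bits) - 1;
        }
    }
    Some(out)
}

fn walk(layer: &[u8], depth: usize, max_depth: usize, out: &mut Vec<(usize, Vec<u8>)>) {
    // Every layer is recorded for scanning, including ones no decoder accepts.
    out.push((depth, layer.to_vec()));
    if depth >= max_depth {
        return;
    }
    let decoders: [Decoder; 3] = [decode_base64, decode_hex, decode_percent];
    for decode in decoders {
        match decode(layer) {
            Some(next) if next.as_slice() != layer => walk(&next, depth + 1, max_depth, out),
            _ => {} // not this encoding, or a no-op decode: the branch ends here
        }
    }
}

/// Returns `(depth, bytes)` for the original input and every decoded layer.
pub fn decoded_layers(input: &[u8], max_depth: usize) -> Vec<(usize, Vec<u8>)> {
    let mut out = Vec::new();
    walk(input, 0, max_depth, &mut out);
    out
}
```

Calling `decoded_layers(b"ZEdWemRBPT0=", 32)` yields the input at depth 0, `dGVzdA==` at depth 1, and `test` at depth 2. Short text often happens to be valid base64 as well, so a few spurious binary layers appear too; they cost a scan but cannot hide a token.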


Content Decompression

decompress.rs inspects the Content-Encoding header and inflates gzip / deflate / brotli bodies before scanning. This is gated by network.dlp.decompress (default true).

Malformed or truncated compressed bodies fail the request rather than being forwarded unscanned — fail-closed.


DNS Entropy Check

Independently of HTTP scanning, the proxy applies a Shannon-entropy check to the destination hostname before resolving it. Each DNS label (the parts between the dots) is scored; if any label exceeds dns_entropy_threshold (default 4.5), the request is blocked with x-canister-error: dlp-blocked and x-canister-dlp-reason: dns-entropy. This catches the classic DNS exfiltration pattern, <base64-of-secret>.attacker.example, where the high-entropy subdomain is the payload.

The check runs even on CONNECT tunnels (before resolution), so it applies regardless of L7 protocol.
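
As an illustration, a per-label check could look like the sketch below. `shannon_entropy` computes the standard byte-frequency entropy in bits per symbol; `hostname_exceeds_threshold` is a hypothetical helper name, and how the real proxy normalises hostnames before scoring is not covered here.

```rust
/// Shannon entropy of `data` in bits per byte, from byte frequencies.
pub fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// True if any dot-separated label of `host` scores above `threshold`.
pub fn hostname_exceeds_threshold(host: &str, threshold: f64) -> bool {
    host.split('.')
        .any(|label| shannon_entropy(label.as_bytes()) > threshold)
}
```

Since entropy is bounded by log2 of the number of distinct symbols, a label needs at least 23 distinct characters to exceed 4.5 bits. Ordinary labels such as `www` or `registry` fall well below that, while a long base64 chunk can cross it.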


Session Entropy Budget

A sandbox session keeps a SessionEntropyBudget (default 8192 bytes). After each request scan, the count of high-entropy bytes (Shannon entropy > 4.0 in any 32-byte sliding window) is recorded against the budget. When the budget is exhausted, further requests are blocked.

This catches slow exfiltration: a credential split across many small requests, each individually below the per-request entropy threshold but collectively well above plausible legitimate traffic patterns.

The budget is per ProxyServer instance, which is one per sandbox session — it resets when the sandbox exits.
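
One way to picture the accounting is the sketch below. It assumes that a byte counts as high-entropy when it is covered by at least one 32-byte window scoring above 4.0, and that payloads shorter than one window contribute nothing. Both are simplifications for illustration; the real SessionEntropyBudget may count differently, and the struct fields and method names here are hypothetical.

```rust
/// Shannon entropy of `data` in bits per byte.
fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// Counts bytes covered by at least one `window`-sized slice above `threshold`.
pub fn high_entropy_bytes(data: &[u8], window: usize, threshold: f64) -> usize {
    if window == 0 || data.len() < window {
        return 0; // simplification: short payloads are not charged
    }
    let mut hot = vec![false; data.len()];
    for start in 0..=data.len() - window {
        if shannon_entropy(&data[start..start + window]) > threshold {
            hot[start..start + window].iter_mut().for_each(|h| *h = true);
        }
    }
    hot.iter().filter(|&&h| h).count()
}

/// Hypothetical stand-in for the per-session budget.
pub struct SessionEntropyBudget {
    limit: usize,
    used: usize,
}

impl SessionEntropyBudget {
    pub fn new(limit: usize) -> Self {
        Self { limit, used: 0 }
    }

    /// Records one scanned payload; returns `true` while the session is within budget.
    pub fn record(&mut self, payload: &[u8]) -> bool {
        self.used = self.used.saturating_add(high_entropy_bytes(payload, 32, 4.0));
        self.used <= self.limit
    }
}
```

With `SessionEntropyBudget::new(8192)`, each request adds its high-entropy byte count to the running total, and `record` starts returning `false` once the total passes 8192.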


Canary Tokens

When canary_tokens = true (default when DLP is enabled), the sandbox generates three fake credentials at startup and exports them as environment variables into the sandboxed process:

| Env var | Shape |
|---|---|
| CANISTER_CANARY_GITHUB_PAT | `ghp_` + 36 random alphanumeric |
| CANISTER_CANARY_NPM_TOKEN | `npm_` + 36 random alphanumeric |
| CANISTER_CANARY_AWS_ACCESS_KEY | `AKIA` + 16 random uppercase |

The same exact strings are registered as the highest-priority pattern in the DlpScanner. Any outbound appearance of one of these strings is proof of exfiltration — no legitimate process produces these values, because they don’t authenticate to anything. The detection is always block and is logged at error! level with full request context.

Canary tokens are an active honeypot: they plant bait that a malicious process is incentivised to scoop up (env vars are the first thing credential thieves enumerate), and the proxy treats any outbound appearance of that bait as a high-confidence trip wire.
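
The shapes in the table are simple to generate. The sketch below is not canary.rs: to stay dependency-free it draws randomness from `std::collections::hash_map::RandomState`, which is only a stand-in for a proper CSPRNG, and `generate_canaries` and `leaked_canary` are hypothetical names.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

const ALNUM: &[u8] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
const UPPER: &[u8] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ";

/// Stand-in randomness source: each RandomState is seeded with fresh keys.
/// A real implementation would use a cryptographically secure generator.
fn random_u64() -> u64 {
    RandomState::new().build_hasher().finish()
}

fn random_suffix(alphabet: &[u8], len: usize) -> String {
    (0..len)
        .map(|_| alphabet[(random_u64() % alphabet.len() as u64) as usize] as char)
        .collect()
}

/// Returns `(env var name, fake credential)` pairs in the shapes listed above.
pub fn generate_canaries() -> Vec<(&'static str, String)> {
    vec![
        ("CANISTER_CANARY_GITHUB_PAT", format!("ghp_{}", random_suffix(ALNUM, 36))),
        ("CANISTER_CANARY_NPM_TOKEN", format!("npm_{}", random_suffix(ALNUM, 36))),
        ("CANISTER_CANARY_AWS_ACCESS_KEY", format!("AKIA{}", random_suffix(UPPER, 16))),
    ]
}

/// Exact-match check: names the first canary whose value appears in `outbound`.
pub fn leaked_canary<'a>(
    outbound: &str,
    canaries: &'a [(&'static str, String)],
) -> Option<&'a str> {
    canaries
        .iter()
        .find(|(_, value)| outbound.contains(value.as_str()))
        .map(|(name, _)| *name)
}
```

Because the match is against the exact generated strings rather than a pattern, a hit cannot be a false positive in the way a regex or entropy detector can.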


Enforcement Modes

DLP integrates with the existing sandbox enforcement modes rather than introducing a separate kill switch.

| Mode | DLP enabled? | generic_high_entropy | Block action |
|---|---|---|---|
| Default | Per recipe (`enabled = true`) | warn | 451 |
| --monitor | As configured | warn (logged) | Not blocked; request forwarded with `x-canister-dlp-warning` header |
| --strict | Implicitly enabled when `egress = "proxy-only"` | promoted to block | 451 |

  • Default: DLP runs if the recipe enables it; violations are 451.
  • --monitor: DLP findings are logged at warn! level with full detector / host / fingerprint detail but requests still go through. Mirrors how monitor mode handles seccomp and filesystem checks. Use this to dry-run a new policy before flipping it on.
  • --strict: DLP is implicitly enabled even without dlp.enabled = true, provided the recipe uses egress = "proxy-only" (strict mode requires DLP-grade enforcement). generic_high_entropy is promoted from warn to block.

No new flags or kill switches were added — --strict plus recipe config cover the same activation surface as a dedicated enable knob.
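
The interaction between mode and recipe can be summarised in two small functions. This is a sketch under the rules stated above; `Mode`, `Action`, `dlp_active`, and `effective_action` are hypothetical names, not the sandbox's actual types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Default,
    Monitor,
    Strict,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Warn,
    Block,
}

/// DLP runs if the recipe enables it, or implicitly under --strict
/// when the recipe uses egress = "proxy-only".
pub fn dlp_active(mode: Mode, recipe_enabled: bool, egress_proxy_only: bool) -> bool {
    recipe_enabled || (mode == Mode::Strict && egress_proxy_only)
}

/// Maps a detector's default action to what the current mode enforces.
pub fn effective_action(mode: Mode, default_action: Action) -> Action {
    match mode {
        // Default: detectors keep their table defaults.
        Mode::Default => default_action,
        // --monitor: findings are logged, requests still go through.
        Mode::Monitor => Action::Warn,
        // --strict: warn-level findings are promoted to block.
        Mode::Strict => Action::Block,
    }
}
```

For example, `effective_action(Mode::Strict, Action::Warn)` gives `Action::Block`, which is the generic_high_entropy promotion, and `effective_action(Mode::Monitor, Action::Block)` gives `Action::Warn`.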


Response Headers and Status Codes

| Outcome | Status | Headers |
|---|---|---|
| Token detected, blocked | 451 Unavailable For Legal Reasons | `x-canister-error: dlp-blocked`, `x-canister-dlp-detector: <name>` |
| Token detected, monitor mode | (upstream status) | `x-canister-dlp-warning: <name>` |
| DNS-label entropy block | 451 | `x-canister-error: dlp-blocked`, `x-canister-dlp-reason: dns-entropy` |
| Session budget exhausted | 451 | `x-canister-error: dlp-blocked`, `x-canister-dlp-reason: session-budget` |

451 is used so DLP blocks are distinguishable from upstream 403s. The detector name is exposed in the header so the sandboxed process / calling tool can produce a sensible error message.
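
The table maps directly onto a small lookup. The sketch below only restates the status codes and header values listed above; the `Outcome` enum and `response_parts` function are hypothetical and do not reflect how the proxy actually builds responses.

```rust
/// Hypothetical outcome type covering the four rows of the table.
pub enum Outcome<'a> {
    TokenBlocked { detector: &'a str },
    TokenMonitored { detector: &'a str, upstream_status: u16 },
    DnsEntropyBlocked,
    SessionBudgetExhausted,
}

/// Returns the status code and the DLP headers to attach to the response.
pub fn response_parts(outcome: &Outcome<'_>) -> (u16, Vec<(&'static str, String)>) {
    let blocked = ("x-canister-error", "dlp-blocked".to_string());
    match outcome {
        Outcome::TokenBlocked { detector } => (
            451,
            vec![blocked, ("x-canister-dlp-detector", detector.to_string())],
        ),
        // Monitor mode keeps the upstream status and only annotates the response.
        Outcome::TokenMonitored { detector, upstream_status } => (
            *upstream_status,
            vec![("x-canister-dlp-warning", detector.to_string())],
        ),
        Outcome::DnsEntropyBlocked => (
            451,
            vec![blocked, ("x-canister-dlp-reason", "dns-entropy".to_string())],
        ),
        Outcome::SessionBudgetExhausted => (
            451,
            vec![blocked, ("x-canister-dlp-reason", "session-budget".to_string())],
        ),
    }
}
```

A calling tool can branch on status 451 plus `x-canister-error: dlp-blocked`, then read either the detector or the reason header to explain the failure to the user.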


Configuration

Full schema (all fields optional; defaults shown):

[network.dlp]
enabled = false                   # implicit true under --strict + proxy-only
canary_tokens = true              # default when DLP is enabled
max_decode_depth = 32             # encoding chain recursion cap
decompress = true                 # gzip/deflate/brotli before scan
dns_entropy_threshold = 4.5       # Shannon entropy per DNS label
session_entropy_budget = 8192     # cumulative high-entropy bytes/session

[network.dlp.extra_scopes]
github_pat = ["github.corp.example.com"]
npm_token = ["npm.internal.example.com"]

Merge semantics

When recipes / manifests are merged left-to-right (base.toml → auto-detected → explicit -r → manifest overrides), each field uses:

| Field | Merge rule | Rationale |
|---|---|---|
| enabled | OR (any Some(true) wins) | Security escalation, never reversed |
| canary_tokens | OR | Same |
| extra_scopes | per-detector domain union | Never narrows |
| max_decode_depth | last-Some-wins | Numeric tuning |
| decompress | last-Some-wins | |
| dns_entropy_threshold | last-Some-wins | |
| session_entropy_budget | last-Some-wins | |

This guarantees a downstream recipe can never disable DLP that an upstream recipe enabled, and can never shrink the scope set.
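
The three merge rules can be sketched as follows. The field names come from the schema above, but the struct layout, the integer widths, and the `merge` method are assumptions for illustration; the real DlpConfig in can-policy is a serde struct whose exact shape is not shown here.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Assumed shape of the config: every scalar is optional so that
/// "not set" can be told apart from an explicit value during merging.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct DlpConfig {
    pub enabled: Option<bool>,
    pub canary_tokens: Option<bool>,
    pub max_decode_depth: Option<u32>,
    pub decompress: Option<bool>,
    pub dns_entropy_threshold: Option<f64>,
    pub session_entropy_budget: Option<u64>,
    pub extra_scopes: BTreeMap<String, BTreeSet<String>>,
}

/// OR rule: any Some(true) wins; Some(false) survives only if nothing says true.
fn or_merge(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), _) | (_, Some(false)) => Some(false),
        (None, None) => None,
    }
}

impl DlpConfig {
    /// Merges `later` on top of `self` (left-to-right order).
    pub fn merge(mut self, later: DlpConfig) -> DlpConfig {
        self.enabled = or_merge(self.enabled, later.enabled);
        self.canary_tokens = or_merge(self.canary_tokens, later.canary_tokens);
        // last-Some-wins: a later explicit value overrides, None leaves it alone.
        self.max_decode_depth = later.max_decode_depth.or(self.max_decode_depth);
        self.decompress = later.decompress.or(self.decompress);
        self.dns_entropy_threshold = later.dns_entropy_threshold.or(self.dns_entropy_threshold);
        self.session_entropy_budget = later.session_entropy_budget.or(self.session_entropy_budget);
        // Per-detector union: domains are only ever added.
        for (detector, domains) in later.extra_scopes {
            self.extra_scopes.entry(detector).or_default().extend(domains);
        }
        self
    }
}
```

Merging a base with `enabled = Some(true)` and a later layer with `enabled = Some(false)` still yields `Some(true)`, while a later `max_decode_depth` replaces the earlier one.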

Where to put it

  • Project-level: [network.dlp] in canister.toml enables DLP for every sandbox in the project.
  • Per-sandbox: same key under [sandbox.<name>.network.dlp].
  • Recipe-level: drop a [network.dlp] block into a custom recipe. Tool recipes (tool:gh, tool:npm, etc.) deliberately do not ship [network.dlp] — they declare the right [[host]] blocks with allow_credentials, and the scope check does the rest.

Limitations

  • Pattern coverage is finite. A novel credential shape (a vendor introducing a new prefix) won’t be caught until a detector is added. generic_high_entropy is the catch-all, but its warn-by-default posture means it’s only fatal in --strict.
  • Body buffering ceiling. Requests above max_buffered_body_bytes (default 8 MiB) are rejected with 413 Payload Too Large rather than forwarded unscanned. This is fail-closed by design, but it limits the protocol shapes DLP can cover (large file uploads need a higher cap or a different egress path).
  • TLS interception is required. DLP relies on the proxy’s MITM CA; it does not inspect end-to-end-pinned TLS (e.g. when the sandboxed process pins its own cert). Such traffic fails to handshake under the proxy, which is the same fail-closed posture.
  • No regex on raw binary. Detectors operate on UTF-8 text after decompression and decoding. Binary protocols carrying credentials outside text fields (e.g. proprietary RPC over HTTP) need a custom detector or a different egress strategy.