Policies

How DefenseClaw decides — repo policies (OPA/Rego), guardrail rule packs (regex + LLM judge), scanner policies, and the suppression layer that keeps your alert volume sane.

DefenseClaw records a normalized runtime action—allow, alert, confirm, or block—for guardrail events. Connector adapters then map that action onto the host's actual response vocabulary, such as allow/ask/deny. Admission, firewall, sandbox, and asset-policy evaluations have their own domain-specific outputs.
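Here is a minimal Python sketch of what one connector adapter's translation table could look like. The `RuntimeAction` enum, the `to_host_verdict` helper, and especially the choice to pass `alert` through as the host's `allow` are illustrative assumptions. They are not the mapping any shipped connector uses.

```python
from enum import Enum


class RuntimeAction(Enum):
    """DefenseClaw's normalized runtime actions, as named on this page."""
    ALLOW = "allow"
    ALERT = "alert"
    CONFIRM = "confirm"
    BLOCK = "block"


# Hypothetical adapter table for a host whose vocabulary is allow / ask / deny.
# Assumption: "alert" lets the call proceed (the finding is only recorded),
# and "confirm" becomes the host's interactive "ask".
HOST_VERDICTS = {
    RuntimeAction.ALLOW: "allow",
    RuntimeAction.ALERT: "allow",
    RuntimeAction.CONFIRM: "ask",
    RuntimeAction.BLOCK: "deny",
}


def to_host_verdict(action: RuntimeAction) -> str:
    """Translate a normalized runtime action into the host's response word."""
    return HOST_VERDICTS[action]
```

Because the normalized vocabulary stays separate from each host's vocabulary, one policy can drive several connectors. Only the table changes per host.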

Guided example · Synthetic runtime event

Trace a runtime verdict from event to action

Follow normalization, deterministic matching, suppressions, severity, and the active action mapping.

Deterministic configuration:

rules:
  - id: shell.data-egress
    match: sensitive_source_and_external_destination
    severity: high
judge:
  enabled: false
suppressions:
  trusted_destinations: []
actions:
  high: block

Decision: Block runtime action
Reason: HIGH runtime finding maps to block
Action: Emit decision record

Step 6 of 6, Resolve action: the runtime HIGH mapping resolves to block.
What DefenseClaw did — and did not do

What it did

  • Show the ordered stages that assemble a runtime verdict
  • Distinguish the optional judge from deterministic rules

What it did not do

  • Evaluate Rego or scanner output in the browser
  • Reuse runtime actions as skill or MCP admission actions

What you just saw

The guided trace normalizes a connector event, matches a deterministic rule, checks suppressions, skips the optional judge when it is disabled, assigns severity, and resolves the active runtime action. Runtime guardrail actions are separate from the skill_actions and mcp_actions admission mappings used by scanners.
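The same six stages fit in a few lines of Python. This is a hypothetical model of the trace, not DefenseClaw's engine. The following are all assumptions: the `Rule` and `PolicyConfig` shapes, the key-lower-casing `normalize` step, the `trusted_destinations` check, the `alert` fallback for unmapped tiers, and the predicate standing in for `sensitive_source_and_external_destination`.

```python
from dataclasses import dataclass, field
from typing import Callable

SEVERITY_ORDER = ["low", "medium", "high", "critical"]


@dataclass
class Rule:
    id: str
    severity: str                    # one of SEVERITY_ORDER
    matches: Callable[[dict], bool]  # deterministic predicate over the normalized event


@dataclass
class PolicyConfig:
    rules: list
    actions: dict                    # severity -> runtime action, e.g. {"high": "block"}
    judge_enabled: bool = False
    trusted_destinations: list = field(default_factory=list)


def normalize(raw: dict) -> dict:
    # Hypothetical normalization step: lower-case every key so rules see one shape.
    return {k.lower(): v for k, v in raw.items()}


def trace(raw: dict, cfg: PolicyConfig) -> dict:
    stages = ["normalize"]
    event = normalize(raw)

    stages.append("match")
    hits = [r for r in cfg.rules if r.matches(event)]

    stages.append("suppressions")
    if event.get("destination") in cfg.trusted_destinations:
        hits = []  # trusted destination: suppress the deterministic hit

    # The optional judge is consulted only when enabled; this sketch never calls one.
    stages.append("judge" if cfg.judge_enabled else "judge skipped")

    stages.append("severity")
    if not hits:
        return {"action": "allow", "stages": stages, "reason": "no finding"}
    top = max(hits, key=lambda r: SEVERITY_ORDER.index(r.severity))

    stages.append("action")
    action = cfg.actions.get(top.severity, "alert")  # assumed fallback for unmapped tiers
    return {"action": action, "rule": top.id, "stages": stages,
            "reason": f"{top.severity.upper()} runtime finding maps to {action}"}
```

Consider an event with a sensitive source and an external destination, run with `actions={"high": "block"}`. `trace` returns `block` with the reason shown in the guided example. Add that destination to `trusted_destinations` and the same event resolves to `allow`.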

How the layers fit together

  1. Prompt / tool / result (system)
  2. Local rules and triage (policy): pattern matched
  3. Pre-judge strips (policy)
  4. Optional LLM judge (policy)
  5. Finding and tool suppressions (policy)
  6. Severity thresholds + HILT policy (policy)
  7. allow / alert / confirm / block (system)
  8. Evidence store: persist finding
  9. Session correlator (policy)
  10. Synthetic CORR-* finding (system)
The in-flight guardrail path ends at a connector verdict. The correlator runs afterward over persisted findings and emits a separate synthetic finding; it does not retroactively change the request that just completed.

Plain local rules cover many secrets and explicit prompt-injection patterns. The optional judge handles ambiguous and semantic cases. Suppressions remove known noise before the verdict is aggregated. The session correlator independently watches persisted findings for cross-event patterns such as the lethal trifecta. OPA/Rego also powers admission, firewall, sandbox, audit, and direct guardrail-policy evaluation.

Layer 1 — Repo policies (OPA / Rego)

The top-level policy file (policies/default.yaml and friends) is the source used to generate the active OPA data document. It declares admission behavior, severity-to-action mappings, scanner overrides, guardrail thresholds, firewall policy, audit settings, and related operator defaults:

policies/default.yaml (excerpt)
name: default
admission:
  scan_on_install: true
  allow_list_bypass_scan: true

skill_actions:
  critical:
    file: quarantine
    runtime: disable
    install: block
  high:
    file: quarantine
    runtime: disable
    install: block

guardrail:
  block_threshold: 4
  alert_threshold: 2
  hilt:
    enabled: false
    min_severity: HIGH
  cisco_trust_level: full

firewall:
  default_action: deny
  allowed_ports: [443, 80]
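This Python sketch shows how the two numeric guardrail thresholds could interact with severities. It assumes a NONE=0 through CRITICAL=4 scale. Under that scale, `block_threshold: 4` blocks only CRITICAL and `alert_threshold: 2` alerts from MEDIUM up. It also assumes that enabling HILT turns qualifying non-blocking findings into `confirm`. This page confirms neither assumption.

```python
# Assumed numeric scale; the real gateway may rank severities differently.
SEVERITY_RANK = {"NONE": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


def runtime_action(severity, block_threshold=4, alert_threshold=2,
                   hilt_enabled=False, hilt_min_severity="HIGH"):
    """Map a severity to allow / alert / confirm / block using the YAML knobs."""
    rank = SEVERITY_RANK[severity]
    if rank >= block_threshold:
        return "block"
    if hilt_enabled and rank >= SEVERITY_RANK[hilt_min_severity]:
        return "confirm"  # assumed: HILT asks a human instead of alerting silently
    if rank >= alert_threshold:
        return "alert"
    return "allow"
```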

defenseclaw policy activate <name> normalizes this YAML into the active data.json consumed by the bundled Rego modules. Each package exposes domain-specific results rather than one universal decision object:

Domain | Rego package | Primary outputs
--- | --- | ---
admission | defenseclaw.admission | verdict, reason
guardrail | defenseclaw.guardrail | action, severity, reason, scanner_sources
firewall | defenseclaw.firewall | action, rule_name
sandbox | defenseclaw.sandbox | allowed/denied endpoints, permissions, skills
audit | defenseclaw.audit | retain, retain_reason, export_to
skill_actions | defenseclaw.skill_actions | runtime/file/install actions and boolean helpers

You can hand-edit the Rego — the engine is just OPA — but most operators stay in the YAML layer because the bundled Rego covers the common cases.

How the engine evaluates

  1. Gateway → Policy engine (OPA): input
  2. Policy engine (OPA) → Rego module: query package output rules
  3. Rego module → Policy engine (OPA): domain-specific result
  4. Policy engine (OPA) → Gateway: normalized caller result
  5. Gateway → Sinks: emit audit / telemetry event

Callers query the output rules exported by the selected Rego package. Output shape varies by domain.
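Suppose you run the bundled modules in a standalone OPA server. Steps 1 through 4 then map onto OPA's REST Data API. A package such as `defenseclaw.guardrail` is queried at `/v1/data/defenseclaw/guardrail` with an `{"input": ...}` body. The sketch below assumes that deployment, although the gateway may instead embed OPA in-process. It also assumes OPA's default port 8181 and uses a made-up input document.

```python
import json
import urllib.request


def opa_data_path(package: str) -> str:
    """Map a Rego package name to its OPA REST Data API path."""
    return "/v1/data/" + package.replace(".", "/")


def build_query(package: str, input_doc: dict,
                base_url: str = "http://localhost:8181") -> urllib.request.Request:
    """Build (but do not send) a POST that evaluates the package against input_doc."""
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        base_url + opa_data_path(package),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it requires a running `opa run --server` with the bundle loaded:
# with urllib.request.urlopen(build_query("defenseclaw.guardrail", {...})) as resp:
#     result = json.load(resp).get("result")
```

OPA puts whatever the package's output rules produce under `result`. That is why each caller has to know its domain's output shape.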

Guardrail scanner and hook events separately preserve matching rule IDs, evaluation IDs, reasons, and severity in their structured audit payloads. Those fields power the per-rule breakdowns in Splunk and Grafana; they are not a universal reasons[] field returned by every Rego package.

Layer 2 — Guardrail rule packs

The guardrail rule pack is the content the gateway evaluates against: the actual regex patterns, judge prompts, and category taxonomies. Three packs ship out of the box: default, strict, and permissive.

Each pack is a directory:

policies/guardrail/<pack>/
  rules/
    local-patterns.yaml       # triage phrases, PII regexes, secrets, exfil
    commands.yaml             # shell and execution rules
    sensitive-paths.yaml      # credential/config paths
    c2.yaml                   # command-and-control indicators
    cognitive.yaml            # cognitive tampering
    enterprise-data.yaml      # enterprise data exposure
    secrets.yaml              # credential signatures
    trust-exploit.yaml        # prompt/trust exploitation
    custom-org.yaml           # your org's regex extensions (optional)
  judge/
    injection.yaml
    pii.yaml
    tool-injection.yaml
  sensitive-tools.yaml
  suppressions.yaml           # the suppression layer (see below)

Pick a pack at setup time:

defenseclaw setup guardrail --rule-pack strict

--rule-pack only accepts the three bundled profiles (default, strict, permissive). To run a custom pack from your own directory, point guardrail.rule_pack_dir at it in ~/.defenseclaw/config.yaml — see Authoring custom rule packs below for the full workflow.

Rule shape

Rules are YAML for portability and review:

policies/guardrail/default/rules/local-patterns.yaml (excerpt)
version: 1

# Literal substring matches (case-folded against the normalized
# triage view, so "/ etc / passwd" still matches "/etc/passwd").
injection:
  - "ignore previous"
  - "ignore all instructions"
  - "jailbreak"

# Regex matches. Compiled at load time; bad regexes are logged and
# dropped (the rest of the file still applies).
injection_regexes:
  - 'ignore\s+(?:all\s+)?(?:previous|prior|above|your)\s+(?:instructions|rules|directives|guidelines)'

# Per-family pattern sets the gateway scans on every request.
# Omitting any key keeps the compiled-in default for that family.
# Setting `field: []` explicitly clears it (rare, used in testbed
# profiles that want a family disabled).
secrets:
  - "sk-"
  - "ghp_"
  - "-----begin rsa"

pii_requests:
  - "social security number"
  - "credit card number"

pii_data_regexes:
  - '\b\d{3}-\d{2}-\d{4}\b'

exfiltration:
  - "/etc/passwd"
  - "exfiltrate"
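Here is a minimal Python sketch of the two matching modes in this file. Three things in it are assumptions. The whitespace-stripping `triage_view` was chosen to reproduce the `/ etc / passwd` example. Running regexes against case-folded text is also assumed. And Python's `re` is not necessarily the regex engine the gateway uses.

```python
import logging
import re


def triage_view(text: str) -> str:
    # Assumed normalization: case-fold and drop all whitespace so that
    # "/ etc / passwd" and "/etc/passwd" yield the same triage string.
    return re.sub(r"\s+", "", text.casefold())


def compile_regexes(patterns):
    compiled = []
    for pattern in patterns:
        try:
            compiled.append(re.compile(pattern))
        except re.error as exc:
            # Mirror the documented behaviour: log the bad regex, keep the rest.
            logging.warning("dropping regex %r: %s", pattern, exc)
    return compiled


def scan(text, literals, regexes):
    """Return the literal patterns and regex sources that match text."""
    view = triage_view(text)
    hits = [lit for lit in literals if triage_view(lit) in view]
    folded = text.casefold()  # assumption: regexes run on case-folded raw text
    hits += [rx.pattern for rx in compile_regexes(regexes) if rx.search(folded)]
    return hits
```

A real loader would compile the regexes once at load time, as the comment in the file says, rather than on every `scan` call.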

Three-state semantics for each top-level key:

  • Omit the key entirely — keep the compiled-in default for that family. Useful when you only want to tweak one thing without copying the whole baseline.
  • field: [...] — operator override; replaces the default wholesale.
  • field: [] — operator explicitly cleared the family. Mostly used in testbed profiles; rarely useful in production.
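The three states hinge on telling a missing key apart from an empty list. The Python sketch below uses a hypothetical `COMPILED_DEFAULTS` as a stand-in for the compiled-in baseline:

```python
# Hypothetical stand-in for the gateway's compiled-in baseline.
COMPILED_DEFAULTS = {
    "injection": ["ignore previous", "jailbreak"],
    "secrets": ["sk-", "ghp_"],
    "exfiltration": ["/etc/passwd"],
}


def effective_patterns(overrides: dict) -> dict:
    """Merge a parsed local-patterns.yaml over the defaults using three-state rules."""
    merged = {}
    for family, default in COMPILED_DEFAULTS.items():
        if family not in overrides:
            merged[family] = list(default)            # key omitted: keep the default
        else:
            merged[family] = list(overrides[family])  # [...] replaces; [] clears
    return merged
```

The important choice is testing membership with `in`. A shortcut like `overrides.get(family) or default` would silently turn an explicit `[]` back into the default.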

The bundled default, strict, and permissive local-patterns.yaml files each carry the full baseline 1:1 with what the gateway ships compiled-in. The Go test TestLocalPatternsDefaultsParity fails CI if the bundled YAML drifts from the Go source, so editing one without the other can't silently downgrade posture.

Layer 3 — LLM judge

The LLM judge is an optional second opinion. Three judge configuration files ship out of the box—injection, PII, and tool-injection. Data-exfiltration is a category inside the tool-injection judge rather than a fourth exfil.yaml file. They're particularly useful for:

  • Subtle prompt injection that doesn't match a regex.
  • Tool calls whose stated purpose disagrees with their effects (most common in MCP).
  • Categorical content checks — "does this prompt try to extract proprietary code?"

Shared severity rubric

All three judges evaluate against the same severity rubric. The rubric is embedded directly in each judge's system prompt so verdicts stay consistent across categories.

Tier | Meaning
--- | ---
CRITICAL | Direct unambiguous harm, provable from the content alone. Credential exfil, destructive shell, jailbreak succeeded, SSN/passport disclosed.
HIGH | Clear adversarial intent or high-impact sensitive data. Prompt injection with explicit override, /etc/passwd probe, phone number in completion.
MEDIUM | Suspicious but ambiguous. Benign readings are plausible.
LOW | Weak indicator; context-dependent (e.g. user self-disclosing their own email).
NONE | No concern.

Signal-strength scoring

For each category a judge flags, it also emits a signal_strength label derived from two booleans: unambiguous (no plausible benign reading) and high_impact (hard-to-reverse damage possible).

unambiguous | high_impact | signal_strength
--- | --- | ---
yes | yes | strong_signal
yes | no | signal
no | yes | needs_review
no | no | weak_signal
The Go-side verdict aggregation uses these labels to keep weakly-corroborated flags from stacking to CRITICAL. A weak_signal on a single category downgrades to MEDIUM/alert; strong_signal on a structural category (destructive command, exfil channel) escalates to CRITICAL. The calibration metrics for this path are reproducible via go test -run TestEval — see the evaluation corpus README for the scorecard.
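The label derivation and the two aggregation rules above can be sketched in Python. The mapping assumes the table rows run in this order: both true, unambiguous only, high_impact only, neither. The `STRUCTURAL` category names are hypothetical, and the real Go aggregation may weigh more inputs than this.

```python
RANK = {"NONE": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


def signal_strength(unambiguous: bool, high_impact: bool) -> str:
    """Derive the judge's signal_strength label from its two booleans."""
    if unambiguous and high_impact:
        return "strong_signal"
    if unambiguous:
        return "signal"
    if high_impact:
        return "needs_review"
    return "weak_signal"


# Hypothetical keys for the "structural" categories (destructive command, exfil channel).
STRUCTURAL = {"destructive_command", "exfil_channel"}


def adjust_severity(severity: str, strength: str, categories: list) -> str:
    if strength == "weak_signal" and len(categories) == 1:
        # Weak, uncorroborated flag: cap at MEDIUM (runtime action "alert").
        return min(severity, "MEDIUM", key=RANK.__getitem__)
    if strength == "strong_signal" and STRUCTURAL & set(categories):
        return "CRITICAL"
    return severity
```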

Judge configuration is split by detector under judge/. For example, the injection judge declares its categories and severity mapping in judge/injection.yaml:

policies/guardrail/default/judge/injection.yaml (excerpt)
version: 1
name: injection
enabled: true
categories:
  "Instruction Manipulation":
    finding_id: JUDGE-INJ-INSTRUCT
    severity: HIGH
  "Obfuscation":
    finding_id: JUDGE-INJ-OBFUSC
    severity: CRITICAL
min_categories_for_high: 1
single_category_max_severity: HIGH
min_categories_for_critical: 2
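This Python sketch shows one plausible reading of the three aggregation knobs. The highest category severity wins. A single flagged category is capped at `single_category_max_severity`. CRITICAL requires at least `min_categories_for_critical` categories. That interpretation is an assumption, so check the gateway's judge code for the exact rules.

```python
RANK = {"NONE": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}
NAMES = {v: k for k, v in RANK.items()}

# Parsed form of the injection.yaml excerpt above.
INJECTION = {
    "categories": {
        "Instruction Manipulation": {"finding_id": "JUDGE-INJ-INSTRUCT", "severity": "HIGH"},
        "Obfuscation": {"finding_id": "JUDGE-INJ-OBFUSC", "severity": "CRITICAL"},
    },
    "min_categories_for_high": 1,
    "single_category_max_severity": "HIGH",
    "min_categories_for_critical": 2,
}


def judge_severity(flagged, cfg=INJECTION):
    """Aggregate the categories the judge flagged into one severity (assumed semantics)."""
    if not flagged:
        return "NONE"
    n = len(flagged)
    rank = max(RANK[cfg["categories"][c]["severity"]] for c in flagged)
    if n == 1:
        rank = min(rank, RANK[cfg["single_category_max_severity"]])
    if n < cfg["min_categories_for_critical"]:
        rank = min(rank, RANK["HIGH"])
    if n < cfg["min_categories_for_high"]:
        rank = min(rank, RANK["MEDIUM"])
    return NAMES[rank]
```

Under this reading, Obfuscation alone reports HIGH even though its mapped severity is CRITICAL. It reaches CRITICAL only when a second category corroborates it.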

The judge call goes through the same Bifrost pipeline as everything else and uses the unified LLM key. Token usage is emitted as a judge.call event so you can monitor cost from your dashboards.

You can target a different model than the rest of the stack by setting guardrail.judge.llm in config.yaml:

guardrail:
  judge:
    enabled: true
    llm:
      provider: anthropic
      model: claude-3-5-haiku-20241022
      api_key_env: DEFENSECLAW_LLM_KEY

Layer 4 — Suppressions

Suppressions are the difference between "we deployed DefenseClaw" and "we use DefenseClaw daily." Three suppression flavours ship with every rule pack:

pre_judge_strips — redact before the judge sees it

policies/guardrail/default/suppressions.yaml
pre_judge_strips:
  - id: STRIP-SYSTEM-SENDER
    pattern: '\b(cli|system|bot|admin)\b'
    context: "System sender metadata injected by agent framework"
    applies_to: [pii]

The redaction happens before the LLM judge is invoked, so the stripped text never reaches a third-party API, even when the judge is enabled. That makes strips useful for fields that can never legitimately contain prompt content, such as auth headers and API tokens.
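Here is a Python sketch of applying strips before the judge prompt is built. The `[STRIPPED]` placeholder and the `pre_judge_strip` helper are hypothetical. The pattern and `applies_to` value come from the suppressions.yaml excerpt above.

```python
import re

# Parsed form of the pre_judge_strips excerpt above.
STRIPS = [
    {"id": "STRIP-SYSTEM-SENDER",
     "pattern": r"\b(cli|system|bot|admin)\b",
     "applies_to": ["pii"]},
]


def pre_judge_strip(text: str, detector: str, strips=STRIPS,
                    placeholder: str = "[STRIPPED]") -> str:
    """Redact matches of every strip that applies to this judge before it sees the text."""
    for strip in strips:
        if detector in strip["applies_to"]:
            text = re.sub(strip["pattern"], placeholder, text)
    return text
```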

finding_suppressions — silence known-good signatures

finding_suppressions:
  - id: SUPP-EMAIL-CHATID
    finding_pattern: JUDGE-PII-EMAIL
    entity_pattern: '^19:[a-f0-9\-]+@unq\.gbl\.spaces$'
    reason: "Teams chatId format, not email address"

When both patterns match, that entity is removed from the judge result before the final verdict is aggregated. Use narrow patterns and an auditable reason; the current schema has no expiry field, so time-bounded exceptions must be removed by your configuration-management workflow.
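Here is a Python sketch of the rule that both patterns must match. It assumes `finding_pattern` is treated as a regex, although here it happens to be a literal ID. It also assumes suppressed entities are kept in a side list, so the required `reason` ends up in your audit trail. The helper name and return shape are hypothetical.

```python
import re

# Parsed form of the finding_suppressions excerpt above.
FINDING_SUPPRESSIONS = [
    {"id": "SUPP-EMAIL-CHATID",
     "finding_pattern": "JUDGE-PII-EMAIL",
     "entity_pattern": r"^19:[a-f0-9\-]+@unq\.gbl\.spaces$",
     "reason": "Teams chatId format, not email address"},
]


def apply_finding_suppressions(entities, rules=FINDING_SUPPRESSIONS):
    """Split (finding_id, entity) pairs into kept entities and an audit list of suppressions."""
    kept, audit = [], []
    for finding_id, entity in entities:
        rule = next((r for r in rules
                     if re.search(r["finding_pattern"], finding_id)
                     and re.search(r["entity_pattern"], entity)), None)
        if rule is None:
            kept.append((finding_id, entity))
        else:
            audit.append((rule["id"], entity, rule["reason"]))
    return kept, audit
```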

tool_suppressions — scope HITL prompts

tool_suppressions:
  - tool_pattern: '^(graph_auth_status|session_status|get_status)$'
    suppress_findings: [JUDGE-PII-USER]
    reason: "Status check tools return expected system metadata"

These remove only the listed findings for matching tool names. Other findings from the same call still participate in the verdict.
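Here is a Python sketch of tool-scoped suppression that reuses the rule from the excerpt above. The helper name is hypothetical. Only the listed finding IDs are dropped, and only when the tool name matches:

```python
import re

# Parsed form of the tool_suppressions excerpt above.
TOOL_SUPPRESSIONS = [
    {"tool_pattern": r"^(graph_auth_status|session_status|get_status)$",
     "suppress_findings": ["JUDGE-PII-USER"],
     "reason": "Status check tools return expected system metadata"},
]


def apply_tool_suppressions(tool_name, finding_ids, rules=TOOL_SUPPRESSIONS):
    """Drop only the named finding IDs for tools whose name matches a rule."""
    dropped = set()
    for rule in rules:
        if re.search(rule["tool_pattern"], tool_name):
            dropped.update(rule["suppress_findings"])
    return [f for f in finding_ids if f not in dropped]
```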

Why three layers?

Layer | Where it runs | Use it for
--- | --- | ---
pre_judge_strips | Before the LLM judge | Keeping secrets out of third-party APIs
finding_suppressions | After the PII judge fires | Narrow entity exceptions with a required reason; the schema has no expiry field
tool_suppressions | After a tool-scoped PII judge result | Dropping only named finding IDs for matching tools

Together they make it realistic to leave the gateway in action mode without operator fatigue.

Layer 5 — Session correlator

The first four layers evaluate each event in isolation. The correlator is the cross-event layer: it watches the recent-finding stream for the same session and raises a synthetic CORR-* meta-finding when a pattern of individually-benign-looking events combines into something worse.

This closes a class of attacks where no single event crosses the action threshold on its own, but the sequence across turns is an exfil path — for example, Simon Willison's lethal trifecta (untrusted content + sensitive data access + external egress in the same session).

How it runs

  1. New finding inserted (system)
  2. Sliding window: last N findings in session (policy)
  3. Pattern matcher: axes · severities · fingerprints (policy)
  4. CORR-* meta-finding: severity_on_match (policy)
  5. Audit DB, plus webhook / Splunk / Rego (system)
The correlator runs over the last N findings per session on every new finding insert. A match writes a synthetic CORR-<id> row back into scan_findings.
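A bounded deque can model a per-session sliding window. This in-memory Python class is a hypothetical stand-in for the correlator's storage, which works over persisted `scan_findings` rows rather than process memory.

```python
from collections import defaultdict, deque


class SessionWindows:
    """Keep only the last `size` findings per session (hypothetical in-memory model)."""

    def __init__(self, size: int):
        self.size = size
        self._windows = defaultdict(lambda: deque(maxlen=self.size))

    def insert(self, session_id, finding):
        """Append a finding and return the snapshot the pattern matchers would scan."""
        window = self._windows[session_id]
        window.append(finding)  # deque(maxlen=...) evicts the oldest entry automatically
        return list(window)
```

Bundled patterns use different window sizes. One approach is to size the deque for the largest window and have each pattern slice off its own tail.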

Every finding DefenseClaw records is tagged with a data axis (one or more of ingress_untrusted, sensitive_access, egress_external) and — on the tool-call surface — a tool capability class (read_fs, write_fs, exec_shell, network_fetch, send_message, none). Regex rules and judge categories resolve through the canonical mapping in internal/guardrail/axes.go, so patterns can reason across detectors without hard-coding rule IDs.

Bundled patterns

Four patterns ship in internal/guardrail/defaults/correlation-patterns.yaml:

Pattern ID | Window | Match severity | Trigger
--- | --- | --- | ---
LETHAL-TRIFECTA | 30 events | CRITICAL | Ordered axes: untrusted ingress, then sensitive access, then external egress.
TRIFECTA-WITH-FINGERPRINT-MATCH | 30 events | CRITICAL | Same content fingerprint seen in a sensitive_access finding and a later egress_external finding: direct exfil, not just temporal coincidence.
ESCALATION-CHAIN | 10 events | CRITICAL | MEDIUM → HIGH → HIGH severity progression inside the same session: an attacker iterating on a prompt.
DESTRUCTIVE-FLOW | 50 events | CRITICAL | An exec_shell finding at HIGH+ in the same session window as a sensitive_access finding.

Each pattern sets its own window size — ESCALATION-CHAIN wants the progression tight; DESTRUCTIVE-FLOW tolerates 50 events between a credential read and an rm -rf.
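Here are two of the bundled triggers sketched in Python. Each finding in the window is a dict with an `axes` set and an optional `fingerprint`. The dict shape and function names are hypothetical. The axis names and the 30-event window come from the table above.

```python
LETHAL_TRIFECTA = ("ingress_untrusted", "sensitive_access", "egress_external")


def ordered_axes_match(window, axes=LETHAL_TRIFECTA, size=30):
    """True if `axes` occur in order (gaps allowed) within the last `size` findings."""
    step = 0
    for finding in window[-size:]:
        if axes[step] in finding["axes"]:
            step += 1
            if step == len(axes):
                return True
    return False


def fingerprint_exfil_match(window, size=30):
    """True if a sensitive_access fingerprint reappears on a later egress_external finding."""
    sensitive = set()
    for finding in window[-size:]:
        fp = finding.get("fingerprint")
        if fp is None:
            continue
        # Check egress before recording, so one finding cannot match itself.
        if "egress_external" in finding["axes"] and fp in sensitive:
            return True
        if "sensitive_access" in finding["axes"]:
            sensitive.add(fp)
    return False
```

Greedy in-order matching is enough for the ordered-axes check. Taking the earliest finding that satisfies each step never rules out a later match.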

What the operator sees

When a pattern matches, the correlator writes a synthetic row into scan_findings:

scanner: "correlator"
rule_id: "CORR-LETHAL-TRIFECTA"
severity: CRITICAL
tags: ["correlation", "LETHAL-TRIFECTA"]
description: |
  contributing findings:
    - INJ-IGNORE-ALL   at 2026-05-12T14:03:11Z
    - SENSITIVE-PATH-SSH-KEY at 2026-05-12T14:05:42Z
    - SRC-FETCH        at 2026-05-12T14:06:18Z

The same row is emitted to configured observability sinks — Splunk, webhooks, and the TUI audit panel. The four bundled patterns currently emit CRITICAL findings, so they route to operator attention without per-sink threshold changes.

The correlator is post-event detection, not an in-flight block of the request that completed the pattern. It writes a synthetic scan summary whose verdict is block and a CRITICAL CORR-* finding for sinks and operator workflows, but it does not retroactively stop the contributing action or automatically quarantine later requests.

Why alert-only, not block

Promoting a later request in the session to deny based on an earlier pattern match has two honest costs:

  • It inverts the existing "each request is evaluated on its own" contract.
  • It can cause spurious blocks on legitimate parallel/retry traffic in the same session.

Ways to act on correlations without changing the gateway runtime:

  • Tail scan_findings for rule_id LIKE 'CORR-%' and invoke a custom script.
  • Wire the existing Splunk / webhook sink to any CRITICAL severity — CORR-* inherits automatically.
  • Feed the sink event into an external response workflow that disables a connector, revokes a credential, or opens an incident under your organization's policy.
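Here is a minimal Python sketch of the first option. It assumes the audit DB is a SQLite file whose `scan_findings` table exposes `rule_id` and `severity` columns. The file location, column names, and rowid cursor are all assumptions, so check your install's actual schema before relying on this.

```python
import sqlite3


def new_correlations(conn: sqlite3.Connection, last_rowid: int = 0):
    """Return CORR-* findings inserted after last_rowid (assumed SQLite schema)."""
    return conn.execute(
        "SELECT rowid, rule_id, severity FROM scan_findings "
        "WHERE rule_id LIKE 'CORR-%' AND rowid > ? ORDER BY rowid",
        (last_rowid,),
    ).fetchall()
```

A cron job or a small loop can call `new_correlations`, remember the highest rowid it has handled, and hand each new row to your response script.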

Authoring custom rule packs

The path of least resistance is to extend rather than replace. Custom rule packs live on disk and are pointed at via the guardrail.rule_pack_dir config key — there is no built-in linter or fixture-runner today, so the workflow is git-driven.

Locate the active rule-pack directory. The bundled packs ship inside the installed Python package; the operator-editable copies live under ~/.defenseclaw/policies/guardrail/{default,strict,permissive}/. Find the active one with:

awk '/rule_pack_dir/ {print $2}' ~/.defenseclaw/config.yaml

Copy the closest pack.

cp -r ~/.defenseclaw/policies/guardrail/default \
      ~/.defenseclaw/policies/guardrail/my-org

Add a rules file alongside local-patterns.yaml. Keep your custom rules in their own file so updates to the bundled patterns don't conflict.

Validate the surrounding config. There is no rule-pack-specific CLI linter today. defenseclaw config validate checks config.yaml itself (including the rule_pack_dir value), but it does not parse every YAML file inside that directory:

defenseclaw config validate
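Until a rule-pack linter exists, a small script can catch the most common breakage before you restart the gateway. The Python sketch below checks a `local-patterns.yaml`-shaped document after it has been loaded, for example with PyYAML's `yaml.safe_load`. Every family must be a list, and every `*_regexes` entry must compile. It also flags lookaround and backreference syntax, because Go's RE2-based `regexp` package rejects it. Two caveats apply. That the gateway compiles these patterns with Go's `regexp` is an assumption. And Python's `re` accepts a different syntax than RE2, so a clean run is a useful signal, not a guarantee.

```python
import re

# Heuristic for RE2-incompatible constructs: (?= (?! (?<= (?<! and \1..\9.
RE2_UNSUPPORTED = re.compile(r"\(\?<?[=!]|\\[1-9]")


def lint_patterns(doc: dict, filename: str) -> list:
    """Return human-readable problems found in a parsed local-patterns.yaml document."""
    problems = []
    for key, values in doc.items():
        if key == "version":
            continue
        if not isinstance(values, list):
            problems.append(f"{filename}: {key} should be a list")
            continue
        if not key.endswith("_regexes"):
            continue  # literal families need no compilation
        for pattern in values:
            try:
                re.compile(pattern)
            except re.error as exc:
                problems.append(f"{filename}: {key}: {pattern!r} does not compile ({exc})")
                continue
            if RE2_UNSUPPORTED.search(pattern):
                problems.append(f"{filename}: {key}: {pattern!r} uses lookaround/backreference")
    return problems
```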

Point the gateway at the new directory. Edit ~/.defenseclaw/config.yaml and set:

guardrail:
  rule_pack_dir: ~/.defenseclaw/policies/guardrail/my-org

Then restart the gateway so it rebuilds the local rule set and judge from the new directory:

defenseclaw-gateway restart

defenseclaw-gateway policy reload is narrower: it recompiles the OPA modules under policy_dir; it is not the rule-pack reload command.

defenseclaw setup guardrail --rule-pack <name> is for switching between the three bundled packs (default, strict, permissive); custom directories go through rule_pack_dir.

Operator commands

The policy group manages named OPA / asset policies (stored under ~/.defenseclaw/policies/) — different surface from the guardrail rule packs above. Subcommands:

defenseclaw policy list                       # every named policy on disk
defenseclaw policy show my-policy             # normalized summary of one policy
defenseclaw policy create my-policy -d "Production policy"
defenseclaw policy activate my-policy         # set as active
defenseclaw policy delete my-policy           # remove
defenseclaw policy validate                   # validates data.json schema + compiles bundled Rego
defenseclaw policy test --verbose             # runs `opa test` against the bundled Rego (requires `opa` on PATH)
defenseclaw policy edit actions --severity high --runtime disable --policy-name my-policy

policy show and policy activate always take a policy name, not a filesystem path. policy show renders a normalized summary rather than the source YAML. policy validate checks that data.json parses, that every severity tier in actions and scanner_overrides has the required fields, and that the bundled Rego modules compile (it is not an opa fmt --diff). Both validate and test only accept --rego-dir to override the bundled Rego location; neither takes a path argument or --fixtures. There is no policy diff subcommand; compare custom YAML files with diff or another YAML-aware tool.
