Architecture
Skill Scanner is a modular security scanner built around a central orchestrator and pluggable analyzers. Scans execute in two phases: deterministic analysis first, then LLM-powered analysis enriched with Phase 1 context.
Scanning Pipeline
Every scan follows the same six-stage pipeline:
1. Load and Pre-process — SkillLoader parses the skill's SKILL.md frontmatter, discovers files recursively, classifies file types, and extracts referenced file hints. ContentExtractor safely unpacks any embedded archives (ZIP, TAR) with layered protections against zip bombs, path traversal, and symlink attacks.
2. Phase 1: Deterministic Analyzers — All non-LLM analyzers run: static (YAML + YARA + Python checks), bytecode, pipeline, behavioral (if enabled), VirusTotal, AI Defense, and trigger. The scanner collects validated binary files and unreferenced scripts for later enrichment.
3. Phase 2: LLM Analyzers with Enrichment — LLM and meta analyzers receive enrichment context built from Phase 1 results: file inventory, type distribution, magic mismatches, and top critical/high findings. This structured context gives the LLM a picture of what deterministic analysis already found.
4. Post-Processing — Policy enforcement: suppress VT-validated binary findings, enforce disabled rules, apply severity overrides, compute analyzability scores, normalize/deduplicate findings, annotate co-occurrence metadata, and attach the policy fingerprint.
5. Cleanup — Temporary extraction directories are removed. A ScanResult is built with findings, timing, analyzer names, analyzability score, and scan metadata.
6. Reporting — The ScanResult is passed to the chosen reporter (summary, JSON, Markdown, table, SARIF, or HTML).
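The two-phase structure can be pictured as a simple loop: deterministic analyzers run first, their results are distilled into an enrichment context, and that context is handed to the second phase. This is a minimal sketch, not the actual SkillScanner implementation; the Finding shape and analyzer interface here are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    rule_id: str
    severity: str
    message: str


def run_scan(skill, phase1_analyzers, phase2_analyzers):
    findings: list[Finding] = []

    # Phase 1: deterministic analyzers run first.
    for analyzer in phase1_analyzers:
        findings.extend(analyzer.analyze(skill))

    # Build enrichment context from Phase 1 output for the LLM pass.
    context = {
        "file_count": len(skill.files),
        "top_findings": [f.message for f in findings
                         if f.severity in ("CRITICAL", "HIGH")][:10],
    }

    # Phase 2: LLM/meta analyzers see what Phase 1 already found.
    for analyzer in phase2_analyzers:
        findings.extend(analyzer.analyze(skill, context=context))

    return findings
```

The key property this preserves is ordering: Phase 2 never runs until the full deterministic picture exists.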
Core Components
Scanner Orchestrator
The SkillScanner class in core/scanner.py runs the full pipeline. It manages analyzer lifecycle, enrichment context building, post-processing, and cleanup. For directory scans, it adds cross-skill analysis (description overlap, data relay patterns, shared external URLs).
Skill Loader
SkillLoader in core/loader.py handles:
- Validating skill directory structure and SKILL.md presence
- Parsing YAML frontmatter (name, description, metadata)
- Recursive file discovery (excluding .git internals)
- File-type classification (python, bash, markdown, binary, other)
- Lenient mode fallback to .md files when SKILL.md is absent
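The frontmatter step above amounts to splitting a leading "---"-delimited YAML block from the markdown body. A minimal sketch, assuming that common convention (the real SkillLoader performs full YAML parsing and validation):

```python
import re

# Matches a leading "---" fenced block at the very start of the file.
FRONTMATTER_RE = re.compile(r"\A---\s*\n(.*?)\n---\s*\n", re.DOTALL)


def parse_frontmatter(text: str) -> dict:
    """Extract simple key: value pairs from a SKILL.md frontmatter block."""
    match = FRONTMATTER_RE.match(text)
    if not match:
        return {}
    meta = {}
    for line in match.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta
```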
Analyzer Factory
analyzer_factory.py is the single source of truth for analyzer assembly. CLI, API, pre-commit hook, and eval runners all use this factory to ensure parity.
- build_core_analyzers(policy) — static, bytecode, pipeline (gated by policy)
- build_analyzers(policy, use_behavioral, use_llm, ...) — adds optional analyzers based on flags
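The factory pattern can be sketched as two layered builders: a policy-gated core set, plus flag-gated optional additions. Class names, policy fields, and signatures below are illustrative assumptions, not the project's real definitions.

```python
from dataclasses import dataclass


class StaticAnalyzer:
    def __init__(self, policy):
        self.policy = policy

class BytecodeAnalyzer(StaticAnalyzer): pass
class PipelineAnalyzer(StaticAnalyzer): pass
class BehavioralAnalyzer(StaticAnalyzer): pass
class LLMAnalyzer(StaticAnalyzer): pass


@dataclass
class Policy:
    enable_bytecode: bool = True
    enable_pipeline: bool = True


def build_core_analyzers(policy):
    # Core analyzers are gated by policy toggles, not CLI flags.
    analyzers = [StaticAnalyzer(policy)]
    if policy.enable_bytecode:
        analyzers.append(BytecodeAnalyzer(policy))
    if policy.enable_pipeline:
        analyzers.append(PipelineAnalyzer(policy))
    return analyzers


def build_analyzers(policy, use_behavioral=False, use_llm=False):
    # Optional analyzers are layered on top of the core set by flags.
    analyzers = build_core_analyzers(policy)
    if use_behavioral:
        analyzers.append(BehavioralAnalyzer(policy))
    if use_llm:
        analyzers.append(LLMAnalyzer(policy))
    return analyzers
```

Because every entry point calls the same builders, a flag behaves identically whether the scan came from the CLI, the API, or the pre-commit hook.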
Content Extractor
Safely unpacks archives embedded in skills with protections for zip bombs, nesting depth, path traversal, symlinks, and total size/file count limits.
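One of these protections, rejecting archive members whose resolved path escapes the extraction directory, can be sketched as follows; this is an illustrative check, not the extractor's actual code:

```python
import os


def is_safe_member(extract_dir: str, member_name: str) -> bool:
    """Reject path-traversal member names like '../../etc/passwd'."""
    dest = os.path.realpath(os.path.join(extract_dir, member_name))
    root = os.path.realpath(extract_dir)
    # The resolved destination must stay inside the extraction root.
    return dest == root or dest.startswith(root + os.sep)
```

The same realpath-based comparison also neutralizes symlinked members that point outside the root, since resolution happens before the containment check.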
Analyzer Inventory
Core (Policy-Driven)
| Analyzer | Detection Method |
|---|---|
| static_analyzer | YAML signatures + YARA rules + inventory checks |
| bytecode_analyzer | Python bytecode/source consistency |
| pipeline_analyzer | Shell pipeline taint analysis and command-risk checks |
Optional (Flag-Driven)
| Analyzer | Detection Method | Requires |
|---|---|---|
| behavioral_analyzer | AST dataflow + cross-file correlation | --use-behavioral |
| llm_analyzer | Semantic threat analysis with structured schema | API key + --use-llm |
| meta_analyzer | Second-pass LLM validation/filtering | API key + --enable-meta |
| virustotal_analyzer | Binary hash lookup + optional upload | API key + --use-virustotal |
| aidefense_analyzer | Cisco AI Defense cloud inspection | API key + --use-aidefense |
| trigger_analyzer | Overly broad trigger/description checks | --use-trigger |
| cross_skill_scanner | Multi-skill coordination detection | --check-overlap |
Policy System
ScanPolicy in core/scan_policy.py centralizes all runtime configuration across 14 sections, including:
- File limits and thresholds
- Rule scoping and docs-path behavior
- Command safety tiers
- Hidden file allowlists
- Severity overrides and disabled rules
- Output deduplication and metadata behavior
- Core analyzer toggles
Three built-in presets: strict, balanced (default), and permissive.
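The preset mechanism can be pictured as a lookup over immutable policy objects. The fields shown here (max_file_size_mb, fail_on_severity) are illustrative placeholders, not the real ScanPolicy schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanPolicy:
    name: str
    max_file_size_mb: int   # hypothetical file-limit knob
    fail_on_severity: str   # hypothetical threshold knob


# Stricter presets use tighter limits and lower failure thresholds.
PRESETS = {
    "strict": ScanPolicy("strict", 5, "LOW"),
    "balanced": ScanPolicy("balanced", 10, "MEDIUM"),
    "permissive": ScanPolicy("permissive", 25, "HIGH"),
}


def load_policy(preset: str = "balanced") -> ScanPolicy:
    return PRESETS[preset]
```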
See Scan Policies for configuration details.
Data Models
Primary data structures in core/models.py:
| Model | Purpose |
|---|---|
| Skill | Loaded skill package with files and metadata |
| Finding | Individual security finding with severity, category, location, and remediation |
| ScanResult | Single-skill scan output with findings, timing, and analyzability |
| Report | Multi-skill scan output aggregating multiple ScanResult objects |
| Severity | Enum: CRITICAL, HIGH, MEDIUM, LOW, INFO, SAFE |
| ThreatCategory | Enum: prompt injection, data exfiltration, command injection, etc. |
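The rough shape of these models can be sketched with dataclasses; the real definitions in core/models.py carry more fields and validation, and the severity-aggregation helper shown is an illustrative assumption:

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    INFO = "info"
    SAFE = "safe"


@dataclass
class Finding:
    rule_id: str
    severity: Severity
    category: str
    location: str
    message: str
    remediation: str = ""


@dataclass
class ScanResult:
    skill_name: str
    findings: list = field(default_factory=list)
    duration_s: float = 0.0

    @property
    def worst_severity(self) -> Severity:
        # Enum declaration order doubles as severity ranking.
        order = list(Severity)
        return min((f.severity for f in self.findings),
                   key=order.index, default=Severity.SAFE)
```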
Entry Points
| Entry Point | Source | Description |
|---|---|---|
| CLI | skill_scanner/cli/cli.py | Main command-line interface |
| API | skill_scanner/api/router.py | FastAPI REST server |
| Pre-commit | skill_scanner/hooks/pre_commit.py | Git hook integration |
| SDK | skill_scanner/__init__.py | Python library import |
All entry points use analyzer_factory.py for consistent analyzer construction.
Rule Packs
Built-in detection rules live in skill_scanner/data/packs/:
| Pack | Contents |
|---|---|
| core | YAML signatures, YARA rules, Python checks — the main detection pack |
| atr | Additional threat research signatures |
Each pack has a pack.yaml manifest that declares its rules and metadata.
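A manifest might look like the following hypothetical sketch; the field names are illustrative, not the real pack.yaml schema:

```yaml
# Hypothetical pack.yaml sketch
name: core
version: "1.0"
rules:
  - signatures/prompt_injection.yaml
  - yara/exfiltration.yar
  - checks/inventory_checks.py
```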
Threat Taxonomy
Every finding is normalized to Cisco AI framework mappings (AITech/AISubtech) so findings from different analyzers use consistent category labels. The taxonomy can be overridden at runtime for custom organizational classifications.
Extension Points
To add new detection capabilities:
- Add an analyzer class inheriting BaseAnalyzer
- Register the construction path in analyzer_factory.py
- Add policy knobs in scan_policy.py (if needed)
- Add tests under tests/
- Document CLI/API toggles
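The steps above can be sketched as a minimal analyzer subclass. BaseAnalyzer's real interface may differ; the method name, the finding tuple shape, and treating a skill as a {path: content} mapping are all simplifying assumptions:

```python
class BaseAnalyzer:
    name = "base"

    def analyze(self, skill):
        raise NotImplementedError


class SecretScanAnalyzer(BaseAnalyzer):
    """Hypothetical analyzer flagging hardcoded credential markers."""
    name = "secret_scan"

    MARKERS = ("AWS_SECRET", "PRIVATE KEY")

    def analyze(self, skill):
        findings = []
        # Skill is modeled as {path: content} here for brevity.
        for path, text in skill.items():
            for marker in self.MARKERS:
                if marker in text:
                    findings.append((self.name, path, marker))
        return findings
```

After defining the class, the remaining steps are wiring: register it in analyzer_factory.py so every entry point picks it up, and gate it behind a policy knob or flag if it should be optional.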
For rule-based detection, prefer extending skill_scanner/data/packs/core/ (signatures, YARA, Python checks) before adding analyzer-level logic.
See Writing Custom Rules for the full guide.