# SecureBERT 2.0

**Advanced Domain-Specific Language Model for Cybersecurity Intelligence**
SecureBERT 2.0 is a domain-adapted, encoder-based language model for cybersecurity and threat intelligence, officially released by Cisco AI. Built on the ModernBERT architecture, it incorporates hierarchical encoding and long-context modeling, enabling it to process complex cybersecurity documents, source code, and threat intelligence reports effectively.
Pretrained on a massive, multi-modal corpus -- including over 13 billion text tokens and 53 million code tokens -- SecureBERT 2.0 achieves state-of-the-art performance in semantic search, named entity recognition, code vulnerability detection, and threat analysis.
View on GitHub | Paper (arXiv) | Join Discord
## Key Features
- Domain-Specific Pretraining -- Extensive cybersecurity corpus, including threat reports, vulnerability advisories, technical blogs, and source code
- Multi-Modal Understanding -- Integrates natural language and code for advanced vulnerability detection and threat intelligence
- Hierarchical & Long-Context Modeling -- Captures both fine-grained and high-level structures across extended documents
- Optimized for Cybersecurity Tasks -- Semantic search, NER, code vulnerability detection, and threat intelligence analysis
## Hugging Face Models
| Task | Model Path |
|---|---|
| SecureBERT 2.0 Base | `cisco-ai/SecureBERT2.0-base` |
| Cross Encoder | `cisco-ai/SecureBERT2.0-cross_encoder` |
| Bi-Encoder | `cisco-ai/SecureBERT2.0-biencoder` |
| Named Entity Recognition | `cisco-ai/SecureBERT2.0-NER` |
| Vulnerability Classification | `cisco-ai/SecureBERT2.0-code-vuln-detection` |
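A minimal usage sketch for the base checkpoint, assuming the standard Hugging Face `fill-mask` pipeline API (the model path is taken from the table above; `top_tokens` is a hypothetical post-processing helper, not part of the repository):

```python
def top_tokens(predictions, k=3):
    """Return the k highest-scoring token strings from fill-mask output.

    `predictions` is a list of dicts with "token_str" and "score" keys,
    the format returned by the Hugging Face fill-mask pipeline.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"].strip() for p in ranked[:k]]


def demo():
    # Requires network access to download the checkpoint; transformers is
    # imported lazily so the helper above stays usable offline.
    from transformers import pipeline

    fill = pipeline("fill-mask", model="cisco-ai/SecureBERT2.0-base")
    preds = fill("The attacker gained access through a [MASK] vulnerability.")
    print(top_tokens(preds))
```

Call `demo()` to run against the hosted checkpoint.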
## Pretraining Dataset
| Dataset Category | Code Tokens | Text Tokens |
|---|---|---|
| Seed corpus | 9,406,451 | 256,859,788 |
| Large-scale web text | 268,993 | 12,231,942,693 |
| Reasoning-focused data | -- | 3,229,293 |
| Instruction-tuning data | 61,590 | 2,336,218 |
| Code vulnerability corpus | 2,146,875 | -- |
| Cybersecurity dialogue data | 41,503,749 | 56,871,556 |
| Original baseline dataset | -- | 1,072,798,637 |
| Total | 53,387,658 | 13,623,037,185 |
## Downstream Task Performance
### Document Embedding -- Cross-Encoder
| Model | mAP | R@1 | NDCG@10 | MRR@10 |
|---|---|---|---|---|
| ms-marco-TinyBERT-L2 | 0.920 | 0.849 | 0.964 | 0.955 |
| SecureBERT 2.0 | 0.955 | 0.948 | 0.986 | 0.983 |
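For reference, the retrieval metrics in these tables follow their standard definitions over per-query ranked relevance judgments; a minimal sketch (not the repository's evaluation code):

```python
def mrr_at_k(ranked_relevance, k=10):
    """Mean reciprocal rank: 1/rank of the first relevant hit in the top k,
    averaged over queries. `ranked_relevance` is a list of per-query lists
    of 0/1 relevance labels, ordered by the model's ranking."""
    total = 0.0
    for labels in ranked_relevance:
        for rank, rel in enumerate(labels[:k], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def average_precision(labels):
    """Average precision for one query's ranked 0/1 labels."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0


def mean_ap(ranked_relevance):
    """mAP: average precision, averaged over queries."""
    return sum(average_precision(q) for q in ranked_relevance) / len(ranked_relevance)
```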
### Document Embedding -- Bi-Encoder
| Model | mAP | R@1 | MRR@10 |
|---|---|---|---|
| all-MiniLM-L12-v2 | 0.912 | 0.924 | 0.945 |
| SecureBERT 2.0 | 0.951 | 0.984 | 0.989 |
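A sketch of how the bi-encoder might be used for semantic search, assuming standard `AutoModel`/`AutoTokenizer` loading with mean pooling over non-padding tokens (a common bi-encoder convention; the repository's pooling strategy may differ). `cosine` is a plain-Python helper:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def embed(texts, model_id="cisco-ai/SecureBERT2.0-biencoder"):
    # Requires network access; torch/transformers imported lazily so the
    # helper above stays usable offline.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq, dim)
    # Mean-pool over non-padding tokens only.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.tolist()
```

Rank candidate documents by `cosine(embed([query])[0], doc_vec)`.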
### Named Entity Recognition (NER)
| Model | F1 | Recall | Precision |
|---|---|---|---|
| CyBERT | 0.351 | 0.281 | 0.467 |
| SecureBERT | 0.734 | 0.759 | 0.717 |
| SecureBERT 2.0 | 0.945 | 0.965 | 0.927 |
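As a quick sanity check, each reported F1 is the harmonic mean of the reported precision and recall, up to rounding of the published figures:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# SecureBERT 2.0 row from the NER table: precision 0.927, recall 0.965.
# Evaluates to ~0.946, matching the reported 0.945 up to rounding.
print(round(f1_score(0.927, 0.965), 3))
```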
### Code Vulnerability Detection
| Model | Accuracy | F1 | Recall | Precision |
|---|---|---|---|---|
| CodeBERT | 0.627 | 0.372 | 0.241 | 0.821 |
| CyBERT | 0.459 | 0.630 | 1.000 | 0.459 |
| SecureBERT 2.0 | 0.655 | 0.616 | 0.602 | 0.630 |
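A sketch of running the released vulnerability-detection checkpoint as a sequence classifier, assuming a standard `AutoModelForSequenceClassification` head (the label mapping is read from the model's `config.id2label` rather than assumed). `softmax` is a plain-Python helper:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(code_snippet, model_id="cisco-ai/SecureBERT2.0-code-vuln-detection"):
    # Requires network access; imported lazily so softmax stays usable offline.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    batch = tok(code_snippet, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits[0].tolist()
    probs = softmax(logits)
    return model.config.id2label[probs.index(max(probs))]
```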
## Installation

**Prerequisites:** Python 3.10+, PyTorch 2.1+ with CUDA, Hugging Face Transformers

```bash
git clone https://github.com/cisco-ai-defense/securebert2.git
cd securebert2
python -m venv venv
source venv/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers lightning tqdm pandas pyarrow
```
## Repository Structure

```
.
├── mlm/                     # Model pretraining (Masked Language Modeling)
│   ├── train.py
│   └── SecureBERT_mlm_eval.py
├── vuln_classification/     # Code vulnerability detection
│   ├── CodeVuln_train.py
│   └── CodeVuln_eval.py
├── rt2/ner/                 # Named Entity Recognition (NER)
│   ├── NER_train.py
│   └── NER_eval.py
├── doc_embedding/           # Document embedding tasks
│   ├── BiEncoder_train.py
│   ├── CrossEncoder_train.py
│   ├── BiEncoder_eval.py
│   └── CrossEncoder_eval.py
├── opensource_data/         # Preprocessed datasets
├── dataset.py               # Dataset loading utilities
└── requirements.txt
```
## Training and Evaluation

Train on a single GPU:

```bash
cd mlm
python train.py
```

For multi-GPU training (here, 8 GPUs):

```bash
torchrun --nproc_per_node=8 train.py
```
Evaluation example (MLM):

```python
sentences = [
    "The attacker gained access through a [MASK] vulnerability.",
    "Users should always enable [MASK] authentication for better security.",
    "The malicious [MASK] was detected by the intrusion detection system.",
]
ground_truths = ["software", "multi-factor", "payload"]
model_ids = [
    "cisco-ai/SecureBERT2.0-base",
    "answerdotai/ModernBERT-base",
    "ehsanaghaei/SecureBERT",
]
```

```bash
python SecureBERT2_mlm_eval.py
```
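The script scores each model's top prediction for the masked token against the ground truth; the core loop might look like this sketch (`top1_accuracy` and `evaluate` are hypothetical helpers, not the repository's exact code, and the fill-mask call requires network access):

```python
def top1_accuracy(predictions, ground_truths):
    """Fraction of masked positions where the top prediction matches the truth."""
    hits = sum(
        p.strip().lower() == t.lower() for p, t in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


def evaluate(model_id, sentences, ground_truths):
    # Requires network access to download each checkpoint; transformers is
    # imported lazily so the helper above stays usable offline.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    top1 = [fill(s)[0]["token_str"] for s in sentences]  # best guess per sentence
    return top1_accuracy(top1, ground_truths)
```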
## Contributing
We welcome contributions including new datasets, additional downstream cybersecurity tasks, model architecture enhancements, and optimized evaluation pipelines.
Please review CONTRIBUTING.md for guidelines.
## License
Apache 2.0 -- See LICENSE for details.