# SecureBERT 2.0

**Advanced Domain-Specific Language Model for Cybersecurity Intelligence**
SecureBERT 2.0 is a domain-adapted, encoder-based language model for cybersecurity and threat intelligence, officially released by Cisco AI. Built on the ModernBERT architecture, it incorporates hierarchical encoding and long-context modeling, enabling it to process complex cybersecurity documents, source code, and threat intelligence reports effectively.
Pretrained on a massive, multi-modal corpus -- including over 13 billion text tokens and 53 million code tokens -- SecureBERT 2.0 achieves state-of-the-art performance in semantic search, named entity recognition, code vulnerability detection, and threat analysis.
View on GitHub | Paper (arXiv) | Join Discord
## Key Features
- Domain-Specific Pretraining -- Extensive cybersecurity corpus, including threat reports, vulnerability advisories, technical blogs, and source code
- Multi-Modal Understanding -- Integrates natural language and code for advanced vulnerability detection and threat intelligence
- Hierarchical & Long-Context Modeling -- Captures both fine-grained and high-level structures across extended documents
- Optimized for Cybersecurity Tasks -- Semantic search, NER, code vulnerability detection, and threat intelligence analysis
## Hugging Face Models
| Task | Model Path |
|---|---|
| SecureBERT 2.0 Base | `cisco-ai/SecureBERT2.0-base` |
| Cross Encoder | `cisco-ai/SecureBERT2.0-cross_encoder` |
| Bi-Encoder | `cisco-ai/SecureBERT2.0-biencoder` |
| Named Entity Recognition | `cisco-ai/SecureBERT2.0-NER` |
| Vulnerability Classification | `cisco-ai/SecureBERT2.0-code-vuln-detection` |
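A minimal usage sketch for the base checkpoint, assuming the standard Hugging Face `fill-mask` pipeline API (the model path is taken from the table above; `top_tokens` is a hypothetical post-processing helper, not part of the repository):

```python
def top_tokens(predictions, k=3):
    """Return the k highest-scoring token strings from fill-mask output.

    `predictions` is a list of dicts with "token_str" and "score" keys,
    the format returned by the Hugging Face fill-mask pipeline.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"].strip() for p in ranked[:k]]


def demo():
    # Requires network access to download the checkpoint; transformers is
    # imported lazily so the helper above stays usable offline.
    from transformers import pipeline

    fill = pipeline("fill-mask", model="cisco-ai/SecureBERT2.0-base")
    preds = fill("The attacker gained access through a [MASK] vulnerability.")
    print(top_tokens(preds))
```

Call `demo()` to run against the hosted checkpoint.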
## Pretraining Dataset
| Dataset Category | Code Tokens | Text Tokens |
|---|---|---|
| Seed corpus | 9,406,451 | 256,859,788 |
| Large-scale web text | 268,993 | 12,231,942,693 |
| Reasoning-focused data | -- | 3,229,293 |
| Instruction-tuning data | 61,590 | 2,336,218 |
| Code vulnerability corpus | 2,146,875 | -- |
| Cybersecurity dialogue data | 41,503,749 | 56,871,556 |
| Original baseline dataset | -- | 1,072,798,637 |
| Total | 53,387,658 | 13,623,037,185 |
## Downstream Task Performance
### Document Embedding -- Cross-Encoder
| Model | mAP | R@1 | NDCG@10 | MRR@10 |
|---|---|---|---|---|
| ms-marco-TinyBERT-L2 | 0.920 | 0.849 | 0.964 | 0.955 |
| SecureBERT 2.0 | 0.955 | 0.948 | 0.986 | 0.983 |
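For reference, the retrieval metrics in these tables follow their standard definitions over per-query ranked relevance judgments; a minimal sketch (not the repository's evaluation code):

```python
def mrr_at_k(ranked_relevance, k=10):
    """Mean reciprocal rank: 1/rank of the first relevant hit in the top k,
    averaged over queries. `ranked_relevance` is a list of per-query lists
    of 0/1 relevance labels, ordered by the model's ranking."""
    total = 0.0
    for labels in ranked_relevance:
        for rank, rel in enumerate(labels[:k], start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def average_precision(labels):
    """Average precision for one query's ranked 0/1 labels."""
    hits, score = 0, 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0


def mean_ap(ranked_relevance):
    """mAP: average precision, averaged over queries."""
    return sum(average_precision(q) for q in ranked_relevance) / len(ranked_relevance)
```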
### Document Embedding -- Bi-Encoder
| Model | mAP | R@1 | MRR@10 |
|---|---|---|---|
| all-MiniLM-L12-v2 | 0.912 | 0.924 | 0.945 |
| SecureBERT 2.0 | 0.951 | 0.984 | 0.989 |
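A sketch of how the bi-encoder might be used for semantic search, assuming standard `AutoModel`/`AutoTokenizer` loading with mean pooling over non-padding tokens (a common bi-encoder convention; the repository's pooling strategy may differ). `cosine` is a plain-Python helper:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def embed(texts, model_id="cisco-ai/SecureBERT2.0-biencoder"):
    # Requires network access; torch/transformers imported lazily so the
    # helper above stays usable offline.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq, dim)
    # Mean-pool over non-padding tokens only.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.tolist()
```

Rank candidate documents by `cosine(embed([query])[0], doc_vec)`.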
### Named Entity Recognition (NER)
| Model | F1 | Recall | Precision |
|---|---|---|---|
| CyBERT | 0.351 | 0.281 | 0.467 |
| SecureBERT | 0.734 | 0.759 | 0.717 |
| SecureBERT 2.0 | 0.945 | 0.965 | 0.927 |
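As a quick sanity check, each reported F1 is the harmonic mean of the reported precision and recall, up to rounding of the published figures:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# SecureBERT 2.0 row from the NER table: precision 0.927, recall 0.965.
# Evaluates to ~0.946, matching the reported 0.945 up to rounding.
print(round(f1_score(0.927, 0.965), 3))
```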
### Code Vulnerability Detection
| Model | Accuracy | F1 | Recall | Precision |
|---|---|---|---|---|
| CodeBERT | 0.627 | 0.372 | 0.241 | 0.821 |
| CyBERT | 0.459 | 0.630 | 1.000 | 0.459 |
| SecureBERT 2.0 | 0.655 | 0.616 | 0.602 | 0.630 |
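A sketch of running the released vulnerability-detection checkpoint as a sequence classifier, assuming a standard `AutoModelForSequenceClassification` head (the label mapping is read from the model's `config.id2label` rather than assumed). `softmax` is a plain-Python helper:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(code_snippet, model_id="cisco-ai/SecureBERT2.0-code-vuln-detection"):
    # Requires network access; imported lazily so softmax stays usable offline.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    batch = tok(code_snippet, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits[0].tolist()
    probs = softmax(logits)
    return model.config.id2label[probs.index(max(probs))]
```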
## Installation

**Prerequisites:** Python 3.10+, PyTorch 2.1+ with CUDA, Hugging Face Transformers

```bash
git clone https://github.com/cisco-ai-defense/securebert2.git
cd securebert2
python -m venv venv
source venv/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers lightning tqdm pandas pyarrow
```
## Repository Structure

```
.
├── mlm/                     # Model pretraining (Masked Language Modeling)
│   ├── train.py
│   └── SecureBERT_mlm_eval.py
├── vuln_classification/     # Code vulnerability detection
│   ├── CodeVuln_train.py
│   └── CodeVuln_eval.py
├── rt2/ner/                 # Named Entity Recognition (NER)
│   ├── NER_train.py
│   └── NER_eval.py
├── doc_embedding/           # Document embedding tasks
│   ├── BiEncoder_train.py
│   ├── CrossEncoder_train.py
│   ├── BiEncoder_eval.py
│   └── CrossEncoder_eval.py
├── opensource_data/         # Preprocessed datasets
├── dataset.py               # Dataset loading utilities
└── requirements.txt
```
## Training and Evaluation

Train on a single GPU:

```bash
cd mlm
python train.py
```

For multi-GPU training (here, 8 GPUs):

```bash
torchrun --nproc_per_node=8 train.py
```
Evaluation example (MLM):

```python
sentences = [
    "The attacker gained access through a [MASK] vulnerability.",
    "Users should always enable [MASK] authentication for better security.",
    "The malicious [MASK] was detected by the intrusion detection system.",
]
ground_truths = ["software", "multi-factor", "payload"]
model_ids = [
    "cisco-ai/SecureBERT2.0-base",
    "answerdotai/ModernBERT-base",
    "ehsanaghaei/SecureBERT",
]
```

```bash
python SecureBERT2_mlm_eval.py
```
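The script scores each model's top prediction for the masked token against the ground truth; the core loop might look like this sketch (`top1_accuracy` and `evaluate` are hypothetical helpers, not the repository's exact code, and the fill-mask call requires network access):

```python
def top1_accuracy(predictions, ground_truths):
    """Fraction of masked positions where the top prediction matches the truth."""
    hits = sum(
        p.strip().lower() == t.lower() for p, t in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


def evaluate(model_id, sentences, ground_truths):
    # Requires network access to download each checkpoint; transformers is
    # imported lazily so the helper above stays usable offline.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    top1 = [fill(s)[0]["token_str"] for s in sentences]  # best guess per sentence
    return top1_accuracy(top1, ground_truths)
```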
## Contributing
We welcome contributions including new datasets, additional downstream cybersecurity tasks, model architecture enhancements, and optimized evaluation pipelines.
Please review CONTRIBUTING.md for guidelines.
## License
Apache 2.0 -- See LICENSE for details.