
SecureBERT 2.0

Advanced Domain-Specific Language Model for Cybersecurity Intelligence

SecureBERT 2.0 is Cisco AI's officially released, domain-adapted encoder-based language model for cybersecurity and threat intelligence. Built on the ModernBERT architecture, it incorporates hierarchical encoding and long-context modeling, enabling effective processing of complex cybersecurity documents, source code, and threat intelligence reports.

Pretrained on a massive, multi-modal corpus -- including over 13 billion text tokens and 53 million code tokens -- SecureBERT 2.0 achieves state-of-the-art performance in semantic search, named entity recognition, code vulnerability detection, and threat analysis.

View on GitHub | Paper (arXiv) | Join Discord


Key Features

  • Domain-Specific Pretraining -- Extensive cybersecurity corpus, including threat reports, vulnerability advisories, technical blogs, and source code
  • Multi-Modal Understanding -- Integrates natural language and code for advanced vulnerability detection and threat intelligence
  • Hierarchical & Long-Context Modeling -- Captures both fine-grained and high-level structures across extended documents
  • Optimized for Cybersecurity Tasks -- Semantic search, NER, code vulnerability detection, and threat intelligence analysis

Hugging Face Models

| Task | Model Path |
| --- | --- |
| SecureBERT 2.0 Base | cisco-ai/SecureBERT2.0-base |
| Cross Encoder | cisco-ai/SecureBERT2.0-cross_encoder |
| Bi-Encoder | cisco-ai/SecureBERT2.0-biencoder |
| Named Entity Recognition | cisco-ai/SecureBERT2.0-NER |
| Vulnerability Classification | cisco-ai/SecureBERT2.0-code-vuln-detection |

Pretraining Dataset

| Dataset Category | Code Tokens | Text Tokens |
| --- | --- | --- |
| Seed corpus | 9,406,451 | 256,859,788 |
| Large-scale web text | 268,993 | 12,231,942,693 |
| Reasoning-focused data | -- | 3,229,293 |
| Instruction-tuning data | 61,590 | 2,336,218 |
| Code vulnerability corpus | 2,146,875 | -- |
| Cybersecurity dialogue data | 41,503,749 | 56,871,556 |
| Original baseline dataset | -- | 1,072,798,637 |
| Total | 53,387,658 | 13,623,037,185 |

Downstream Task Performance

Document Embedding -- Cross-Encoder

| Model | mAP | R@1 | NDCG@10 | MRR@10 |
| --- | --- | --- | --- | --- |
| ms-marco-TinyBERT-L2 | 0.920 | 0.849 | 0.964 | 0.955 |
| SecureBERT 2.0 | 0.955 | 0.948 | 0.986 | 0.983 |
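NDCG@10 in the table above rewards rankings that place relevant documents near the top. A minimal sketch of how it is computed, in plain Python with illustrative binary relevance labels (not the actual evaluation data or harness):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of an ideal (descending) ordering."""
    ideal = sorted(relevances, reverse=True)
    denom = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / denom if denom > 0 else 0.0

# Binary relevance of ten retrieved documents; relevant hits at ranks 1 and 3
ranked = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(round(ndcg_at_k(ranked), 3))  # 0.92
```

A perfect ranking (all relevant documents first) scores exactly 1.0, which is why the 0.986 above indicates near-ideal ordering.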

Document Embedding -- Bi-Encoder

| Model | mAP | R@1 | MRR@10 |
| --- | --- | --- | --- |
| all-MiniLM-L12-v2 | 0.912 | 0.924 | 0.945 |
| SecureBERT 2.0 | 0.951 | 0.984 | 0.989 |
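R@1 and MRR@10 in the embedding tables both follow from the rank at which the first relevant document appears for each query. A minimal sketch with hypothetical ranks (not the actual evaluation data):

```python
def recall_at_1(first_relevant_ranks):
    """Fraction of queries whose top-ranked document is relevant (rank 1)."""
    return sum(r == 1 for r in first_relevant_ranks) / len(first_relevant_ranks)

def mrr_at_10(first_relevant_ranks):
    """Mean reciprocal rank, counting only hits within the top 10 results."""
    return sum(1.0 / r if r <= 10 else 0.0 for r in first_relevant_ranks) / len(first_relevant_ranks)

# 1-based rank of the first relevant hit for four hypothetical queries
ranks = [1, 2, 1, 5]
print(recall_at_1(ranks))            # 0.5
print(round(mrr_at_10(ranks), 3))    # 0.675
```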

Named Entity Recognition (NER)

| Model | F1 | Recall | Precision |
| --- | --- | --- | --- |
| CyBERT | 0.351 | 0.281 | 0.467 |
| SecureBERT | 0.734 | 0.759 | 0.717 |
| SecureBERT 2.0 | 0.945 | 0.965 | 0.927 |
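The NER scores above are standard precision/recall/F1. A token-level, micro-averaged sketch that ignores the "O" (outside) tag, using made-up tag names; note that entity-level scoring (as in CoNLL-style evaluation) is stricter, and the repo's NER_eval.py may differ in detail:

```python
def micro_prf(gold, pred, outside="O"):
    """Micro-averaged precision/recall/F1 over non-outside token tags."""
    tp = sum(g == p != outside for g, p in zip(gold, pred))          # correct entity tags
    fp = sum(p != outside and g != p for g, p in zip(gold, pred))    # spurious predictions
    fn = sum(g != outside and g != p for g, p in zip(gold, pred))    # missed entity tags
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical BIO tags for five tokens (MAL = malware, CVE = vulnerability ID)
gold = ["B-MAL", "I-MAL", "O", "B-CVE", "O"]
pred = ["B-MAL", "O",     "O", "B-CVE", "B-MAL"]
p, r, f = micro_prf(gold, pred)
```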

Code Vulnerability Detection

| Model | Accuracy | F1 | Recall | Precision |
| --- | --- | --- | --- | --- |
| CodeBERT | 0.627 | 0.372 | 0.241 | 0.821 |
| CyBERT | 0.459 | 0.630 | 1.000 | 0.459 |
| SecureBERT 2.0 | 0.655 | 0.616 | 0.602 | 0.630 |
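The CyBERT row is worth reading carefully: recall of 1.000 with precision equal to accuracy (0.459) is the signature of a classifier that labels everything positive. A quick sketch with illustrative binary labels shows the effect:

```python
def binary_metrics(gold, pred):
    """Accuracy, F1, recall, precision for binary labels (1 = vulnerable)."""
    tp = sum(g == p == 1 for g, p in zip(gold, pred))
    fp = sum(p == 1 and g == 0 for g, p in zip(gold, pred))
    fn = sum(p == 0 and g == 1 for g, p in zip(gold, pred))
    tn = sum(g == p == 0 for g, p in zip(gold, pred))
    acc = (tp + tn) / len(gold)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, f1, rec, prec

# An always-positive classifier: recall 1.0, precision = positive rate
gold = [1, 0, 0, 1, 0]
acc, f1, rec, prec = binary_metrics(gold, [1] * len(gold))
```

This is why the balanced precision/recall of SecureBERT 2.0 reflects genuinely better discrimination despite the lower headline recall.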

Installation

Prerequisites: Python 3.10+, PyTorch 2.1+ with CUDA, Hugging Face Transformers

git clone https://github.com/cisco-ai-defense/securebert2.git

python -m venv venv
source venv/bin/activate

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers lightning tqdm pandas pyarrow

Repository Structure

.
├── mlm/                        # Model pretraining (Masked Language Modeling)
│   ├── train.py
│   └── SecureBERT_mlm_eval.py
├── vuln_classification/        # Code vulnerability detection
│   ├── CodeVuln_train.py
│   └── CodeVuln_eval.py
├── rt2/ner/                    # Named Entity Recognition (NER)
│   ├── NER_train.py
│   └── NER_eval.py
├── doc_embedding/              # Document embedding tasks
│   ├── BiEncoder_train.py
│   ├── CrossEncoder_train.py
│   ├── BiEncoder_eval.py
│   └── CrossEncoder_eval.py
├── opensource_data/            # Preprocessed datasets
├── dataset.py                  # Dataset loading utilities
└── requirements.txt

Training and Evaluation

Train on a single GPU:

cd mlm
python train.py

For multi-GPU training (here, 8 GPUs on one node):

torchrun --nproc_per_node=8 train.py

Evaluation example (MLM) -- the evaluation script fills each [MASK] token and compares the predictions of several models against expected tokens. Configure the test sentences, ground truths, and models to compare:

sentences = [
    "The attacker gained access through a [MASK] vulnerability.",
    "Users should always enable [MASK] authentication for better security.",
    "The malicious [MASK] was detected by the intrusion detection system.",
]

ground_truths = ["software", "multi-factor", "payload"]

model_ids = [
    "cisco-ai/SecureBERT2.0-base",
    "answerdotai/ModernBERT-base",
    "ehsanaghaei/SecureBERT",
]

Then run the script:

python SecureBERT2_mlm_eval.py
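The comparison the script performs can be sketched without downloading any weights: given each model's ranked candidate tokens for a [MASK] position, count how often the ground truth lands in the top k. The function name and prediction lists below are hypothetical, for illustration only:

```python
def top_k_accuracy(predictions, ground_truths, k=5):
    """predictions: one list of candidate tokens per example, best-ranked first."""
    hits = sum(gt in preds[:k] for preds, gt in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Hypothetical ranked masked-token predictions for the three test sentences
predictions = [
    ["software", "zero-day", "known"],
    ["two-factor", "multi-factor", "strong"],
    ["payload", "file", "script"],
]
ground_truths = ["software", "multi-factor", "payload"]
print(top_k_accuracy(predictions, ground_truths, k=2))  # 1.0
```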

Contributing

We welcome contributions including new datasets, additional downstream cybersecurity tasks, model architecture enhancements, and optimized evaluation pipelines.

Please review CONTRIBUTING.md for guidelines.


License

Apache 2.0 -- See LICENSE for details.