OASB Skills Security Benchmark
The first ground-truth benchmark for AI agent skill scanners. 4,245 labeled samples. 9 attack categories. Verified precision and recall.
Last updated: April 2, 2026 | Dataset: v2.0 | Paper comparison: Holzbauer et al. (arXiv:2603.16572)
Why this benchmark exists
Holzbauer et al. evaluated 9 scanners across 238,180 skills from 3 marketplaces. Flag rates ranged from 3.8% to 41.9%, but only 33 out of 27,111 skills (0.12%) were flagged by all scanners. No scanner reported precision, recall, or F1 because no ground-truth labeled dataset existed.
OASB provides that ground truth: 4,245 samples with verified labels across 9 attack categories, sourced from DVAA scenarios, ARIA research findings, expert-reviewed payloads, and real registry data. Any scanner can submit results for standardized evaluation.
Scanner Leaderboard
| # | Scanner | Tier | F1 | Precision | Recall | FPR | Flag Rate | Categories |
|---|---|---|---|---|---|---|---|---|
| 1 | NanoMind TME v0.5.0 - ONNX neural classifier only (best balanced) | GOLD | 89.2% | 88.4% | 90.0% | 0.82% | 6.9% | 9/9 |
| 2 | HMA Full Pipeline v0.12.9 - AST compilation + 6 analyzers + NanoMind | GOLD | 81.3% | 68.5% | 100.0% | 3.20% | 10.3% | 9/9 |
| 3 | HMA Static Patterns v0.12.9 - Regex-only, no NanoMind | SILVER | 67.5% | 99.3% | 51.1% | 0.03% | 3.6% | 9/9 |
Per-Category Detection (NanoMind TME v0.5.0)
30 malicious samples per category. Sorted by F1.
| Category | Recall | Precision | F1 |
|---|---|---|---|
| Persistence | 96.7% | 100.0% | 98.3% |
| Unicode Steganography | 100.0% | 96.6% | 98.2% |
| Social Engineering | 100.0% | 90.6% | 95.1% |
| Privilege Escalation | 93.3% | 96.6% | 94.9% |
| Credential Exfiltration | 80.0% | 91.7% | 85.4% |
| Data Exfiltration | 76.7% | 92.0% | 83.6% |
| Supply Chain | 83.3% | -- | -- |
| Prompt Injection | 86.7% | 61.8% | 72.1% |
| Heartbeat RCE | 93.3% | 50.0% | 65.1% |
DVAA Ground-Truth Validation
70 intentionally vulnerable scenarios from the Damn Vulnerable AI Agent (DVAA). Each scenario has a known attack type and an expected detection outcome.
Industry Comparison
Scanner flag rates from Holzbauer et al. (arXiv:2603.16572), 238,180 skills across 3 marketplaces. These scanners report flag rates only; no precision/recall is available (no ground truth).
| Scanner | Platform | Flag Rate | Precision | Recall |
|---|---|---|---|---|
| HMA Static | OASB v2 | 3.6% | 99.3% | 51.1% |
| NanoMind TME v0.5.0 | OASB v2 | 6.9% | 88.4% | 90.0% |
| HMA Full Pipeline | OASB v2 | 10.3% | 68.5% | 100.0% |
| Socket | Skills.sh | 3.8% | -- | -- |
| Snyk | Skills.sh | 7.7% | -- | -- |
| agent-trust-hub | Skills.sh | 13.8% | -- | -- |
| Cisco Skill Scanner | Skills.sh | 14.0% | -- | -- |
| Cisco Skill Scanner | ClawHub | 16.7% | -- | -- |
| GPT 5.3-based LLM | Skills.sh | 27.3% | -- | -- |
| VirusTotal | ClawHub | 36.2% | -- | -- |
| GPT 5.3-based LLM | ClawHub | 38.8% | -- | -- |
| OpenClaw Scanner | ClawHub | 41.9% | -- | -- |
Paper scanners were tested on 238K real, unlabeled marketplace skills; HMA scanners were tested on the OASB v2 corpus (4,245 labeled samples). Only flag rates are directly comparable, since precision and recall require ground truth.
Scoring Tiers
| Tier | F1 Score | False Positive Rate | Category Coverage | Kappa vs HMA |
|---|---|---|---|---|
| Platinum | >=0.90 | <=5% | 9/9 | >=0.85 |
| Gold | >=0.80 | <=10% | >=7/9 | -- |
| Silver | >=0.65 | <=20% | >=5/9 | -- |
| Listed | Any | Any | Any | -- |
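Read as code, the tier table is a threshold cascade evaluated top-down. The sketch below is a direct transcription of the table in Python, assuming the kappa requirement applies only where the table lists one (Platinum); it is an illustration of the table, not the official gating logic.

```python
def assign_tier(f1: float, fpr: float, categories: int,
                kappa: float | None = None) -> str:
    """Map OASB metrics to a tier, transcribing the table above.

    f1 and fpr are fractions (e.g. 0.892, 0.0082); categories is the
    number of attack categories covered, out of 9.
    """
    if (f1 >= 0.90 and fpr <= 0.05 and categories == 9
            and kappa is not None and kappa >= 0.85):
        return "Platinum"
    if f1 >= 0.80 and fpr <= 0.10 and categories >= 7:
        return "Gold"
    if f1 >= 0.65 and fpr <= 0.20 and categories >= 5:
        return "Silver"
    return "Listed"  # any valid submission is at least Listed
```

For example, `assign_tier(0.892, 0.0082, 9)` returns `"Gold"`: NanoMind TME v0.5.0 covers all nine categories but its F1 sits just under the 0.90 Platinum cutoff, matching the leaderboard above.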
Methodology
Dataset Composition
- 270 malicious samples (30 per category)
- 3,881 benign samples from real registries
- 94 edge cases (security tools, defensive configs)
- Sources: DVAA scenarios, ARIA research, HMA payloads, expert review, registry scans
- 225 registry metadata-flagged stubs excluded (no malicious content)
Scoring
- Binary detection: malicious/benign verdict per sample
- Category assignment: 9 attack categories for malicious verdicts
- Metrics: macro-averaged F1, precision, and recall across categories (see the sketch after this list)
- FPR: false positives / (false positives + true negatives)
- Edge case samples excluded from scoring
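To make the arithmetic concrete, here is a minimal Python sketch of the scoring above. It is illustrative, not the official scorer: the toy `CATEGORIES` list and `ground_truth` data are hypothetical stand-ins for the real dataset loader, and the actual harness may handle category assignment differently.

```python
def binary_metrics(verdicts, labels):
    """Precision, recall, F1, and FPR for binary malicious/benign verdicts.

    verdicts: {sample_id: "malicious" | "benign"} from the scanner
    labels:   {sample_id: "malicious" | "benign"} ground truth
    Samples missing from `verdicts` count as "benign" (a missed detection).
    Edge-case samples are assumed to be filtered out beforehand.
    """
    tp = fp = fn = tn = 0
    for sid, truth in labels.items():
        flagged = verdicts.get(sid, "benign") == "malicious"
        if truth == "malicious":
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0  # FP / (FP + TN), as defined above
    return precision, recall, f1, fpr


# Toy example: two categories, each scored against its own slice of the
# corpus, then macro-averaged (unweighted mean across categories).
CATEGORIES = ["prompt_injection", "persistence"]  # illustrative subset of the 9
ground_truth = {
    "prompt_injection": {"m001": "malicious", "m002": "malicious", "b001": "benign"},
    "persistence":      {"m010": "malicious", "m011": "malicious", "b001": "benign"},
}
verdicts = {"m001": "malicious", "m010": "malicious", "m011": "malicious"}

per_cat = [binary_metrics(verdicts, ground_truth[cat]) for cat in CATEGORIES]
macro_f1 = sum(m[2] for m in per_cat) / len(per_cat)  # mean of per-category F1
```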
Submit Your Scanner
The benchmark is open: any scanner can submit results for evaluation. Ratings follow from the published methodology, not from a gating decision.
Submissions expire after 90 days. Scanners must resubmit against each new dataset version to maintain their rating.
```
POST https://api.oa2a.org/api/v1/benchmark/submit
Content-Type: application/json

{
  "scannerId": "your-scanner-id",
  "scannerName": "Your Scanner",
  "scannerVersion": "1.0.0",
  "datasetVersion": "v2.0",
  "results": [
    { "sampleId": "m001", "verdict": "malicious", "category": "supply_chain" },
    ...
  ]
}
```
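Since a submission is a single JSON POST, a client fits in a few lines. Below is a minimal sketch using Python's `requests`; the results file name is hypothetical, and any authentication the endpoint may require is not shown here.

```python
import json

import requests  # third-party: pip install requests

# Load per-sample verdicts produced by your scanner. The file name is
# hypothetical, but each entry follows the format shown above.
with open("results.json") as f:
    results = json.load(f)  # [{"sampleId": ..., "verdict": ..., "category": ...}, ...]

resp = requests.post(
    "https://api.oa2a.org/api/v1/benchmark/submit",
    json={
        "scannerId": "your-scanner-id",
        "scannerName": "Your Scanner",
        "scannerVersion": "1.0.0",
        "datasetVersion": "v2.0",
        "results": results,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # response shape is not documented here
```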
References
- Holzbauer et al., "Malicious Or Not: Adding Repository Context to Agent Skill Classification," arXiv:2603.16572, March 2026
- OASB benchmark code and dataset: github.com/opena2a-org/oasb
- DVAA scenarios: github.com/opena2a-org/damn-vulnerable-ai-agent