OASB Skills Security Benchmark

The first ground-truth benchmark for AI agent skill scanners. 4,245 labeled samples. 9 attack categories. Verified precision and recall.

Last updated: April 2, 2026 | Dataset: v2.0 | Paper comparison: Holzbauer et al. (arXiv:2603.16572)

  • 4,245 labeled samples
  • 9 attack categories
  • 89.2% best F1 score
  • 100% best recall
  • 87.1% DVAA detection rate

Why this benchmark exists

Holzbauer et al. evaluated 9 scanners across 238,180 skills from 3 marketplaces. Flag rates ranged from 3.8% to 41.9%, but only 33 out of 27,111 skills (0.12%) were flagged by all scanners. No scanner reported precision, recall, or F1 because no ground-truth labeled dataset existed.

OASB provides that ground truth: 4,245 samples with verified labels across 9 attack categories, sourced from DVAA scenarios, ARIA research findings, expert-reviewed payloads, and real registry data. Any scanner can submit results for standardized evaluation.

Scanner Leaderboard

 # | Scanner             | Configuration                                         | Tier   | F1    | Precision | Recall | FPR   | Flag Rate | Categories
 1 | NanoMind TME v0.5.0 | v0.12.9, ONNX neural classifier only (best balanced)  | GOLD   | 89.2% | 88.4%     | 90.0%  | 0.82% | 6.9%      | 9/9
 2 | HMA Full Pipeline   | v0.12.9, AST compilation + 6 analyzers + NanoMind     | GOLD   | 81.3% | 68.5%     | 100.0% | 3.20% | 10.3%     | 9/9
 3 | HMA Static Patterns | v0.12.9, regex-only, no NanoMind                      | SILVER | 67.5% | 99.3%     | 51.1%  | 0.03% | 3.6%      | 9/9

Per-Category Detection (NanoMind TME v0.5.0)

30 malicious samples per category. Sorted by F1.
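F1 here is the harmonic mean of precision and recall, F1 = 2PR / (P + R); for Persistence, for example, 2(0.967)(1.000) / (0.967 + 1.000) ≈ 0.983.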

Category                | Recall | Precision | F1
Persistence             | 96.7%  | 100.0%    | 98.3%
Unicode Steganography   | 100.0% | 96.6%     | 98.2%
Social Engineering      | 100.0% | 90.6%     | 95.1%
Privilege Escalation    | 93.3%  | 96.6%     | 94.9%
Credential Exfiltration | 80.0%  | 91.7%     | 85.4%
Data Exfiltration       | 76.7%  | 92.0%     | 83.6%
Supply Chain            | 83.3%  | --        | --
Prompt Injection        | 86.7%  | 61.8%     | 72.1%
Heartbeat RCE           | 93.3%  | 50.0%     | 65.1%

DVAA Ground-Truth Validation

70 intentionally vulnerable scenarios from the Damn Vulnerable AI Agent. Each scenario has a known attack type and expected detection.

  • 61/70 scenarios detected (87.1% detection rate)
  • 4 categories at 100% detection
  • 9 scenarios missed (config/binary files)

Industry Comparison

Scanner flag rates from Holzbauer et al. (arXiv:2603.16572), 238,180 skills across 3 marketplaces. These scanners report flag rates only; no precision/recall is available (no ground truth).

Scanner             | Platform  | Flag Rate | Precision | Recall
HMA Static          | OASB v2   | 3.6%      | 99.3%     | 51.1%
NanoMind TME v0.5.0 | OASB v2   | 6.9%      | 88.4%     | 90.0%
HMA Full Pipeline   | OASB v2   | 10.3%     | 68.5%     | 100%
Socket              | Skills.sh | 3.8%      | --        | --
Snyk                | Skills.sh | 7.7%      | --        | --
agent-trust-hub     | Skills.sh | 13.8%     | --        | --
Cisco Skill Scanner | Skills.sh | 14.0%     | --        | --
Cisco Skill Scanner | ClawHub   | 16.7%     | --        | --
GPT 5.3-based LLM   | Skills.sh | 27.3%     | --        | --
VirusTotal          | ClawHub   | 36.2%     | --        | --
GPT 5.3-based LLM   | ClawHub   | 38.8%     | --        | --
OpenClaw Scanner    | ClawHub   | 41.9%     | --        | --

Paper scanners tested on 238K real marketplace skills (no labels). HMA scanners tested on OASB v2 corpus (4,245 labeled samples). Flag rate comparison only; precision/recall requires ground truth.

Scoring Tiers

Tier     | F1 Score | False Positive Rate | Category Coverage | Kappa vs HMA
Platinum | >= 0.90  | <= 5%               | 9/9               | >= 0.85
Gold     | >= 0.80  | <= 10%              | >= 7/9            | --
Silver   | >= 0.65  | <= 20%              | >= 5/9            | --
Listed   | Any      | Any                 | Any               | --
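
The Kappa vs HMA column measures verdict-level agreement with the HMA reference pipeline. A minimal sketch of Cohen's kappa over paired binary verdicts; pairing by sample ID and the use of the plain binary form are assumptions, not the benchmark's published procedure:

def cohens_kappa(a, b):
    # a, b: parallel lists of booleans (True = malicious verdict),
    # one entry per sample, from two scanners run on the same corpus.
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n    # raw agreement
    pa, pb = sum(a) / n, sum(b) / n                     # marginal flag rates
    expected = pa * pb + (1 - pa) * (1 - pb)            # chance agreement
    return (observed - expected) / (1 - expected)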

Methodology

Dataset Composition

  • 270 malicious samples (30 per category)
  • 3,881 benign samples from real registries
  • 94 edge cases (security tools, defensive configs)
  • Sources: DVAA scenarios, ARIA research, HMA payloads, expert review, registry scans
  • 225 registry metadata-flagged stubs excluded (no malicious content)

Scoring

  • Binary detection: malicious/benign verdict per sample
  • Category assignment: 9 attack categories for malicious verdicts
  • Metrics: macro-averaged F1, precision, recall across categories (see the sketch after this list)
  • FPR: false positives / (false positives + true negatives)
  • Edge case samples excluded from scoring
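
A minimal sketch of this scoring, assuming gold labels and verdicts are paired per sample, with None meaning benign and a string naming one of the 9 attack categories; the handling of category mismatches on malicious samples is an assumption, not the published harness:

from collections import defaultdict

def score(pairs):
    # pairs: list of (gold, predicted) per sample; None = benign verdict,
    # a string = attack category. Edge cases are assumed filtered out
    # before this point, per the methodology above.
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    false_pos = true_neg = 0
    for gold, pred in pairs:
        if gold is None:                      # benign sample
            if pred is None:
                true_neg += 1
            else:
                false_pos += 1                # flagged a benign sample
                fp[pred] += 1
        elif pred == gold:
            tp[gold] += 1                     # right verdict, right category
        else:
            fn[gold] += 1                     # missed or miscategorized
            if pred is not None:
                fp[pred] += 1
    ps, rs, f1s = [], [], []
    for cat in sorted(tp.keys() | fn.keys()):
        p = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        r = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        ps.append(p)
        rs.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return {
        "macro_precision": sum(ps) / len(ps),
        "macro_recall": sum(rs) / len(rs),
        "macro_f1": sum(f1s) / len(f1s),
        "fpr": false_pos / (false_pos + true_neg),   # FP / (FP + TN)
    }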

Submit Your Scanner

The benchmark is open: any scanner can submit results for evaluation. Ratings are determined by the published methodology, not by a gating decision.

Submissions expire after 90 days. Scanners must resubmit against each new dataset version to maintain their rating.

POST https://api.oa2a.org/api/v1/benchmark/submit
Content-Type: application/json

{
  "scannerId": "your-scanner-id",
  "scannerName": "Your Scanner",
  "scannerVersion": "1.0.0",
  "datasetVersion": "v2.0",
  "results": [
    { "sampleId": "m001", "verdict": "malicious", "category": "supply_chain" },
    ...
  ]
}
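
A hedged example of submitting over HTTP with Python's requests library. The endpoint and payload shape come from the spec above; any authentication requirements are undocumented here, and the sample IDs are placeholders:

import requests

SUBMIT_URL = "https://api.oa2a.org/api/v1/benchmark/submit"

payload = {
    "scannerId": "your-scanner-id",
    "scannerName": "Your Scanner",
    "scannerVersion": "1.0.0",
    "datasetVersion": "v2.0",
    "results": [
        # One entry per sample in the v2.0 corpus; "category" accompanies
        # malicious verdicts only, per the methodology above.
        {"sampleId": "m001", "verdict": "malicious", "category": "supply_chain"},
    ],
}

resp = requests.post(SUBMIT_URL, json=payload, timeout=30)
resp.raise_for_status()    # surface 4xx/5xx responses as exceptions
print(resp.json())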

References

  • Holzbauer et al., "Malicious Or Not: Adding Repository Context to Agent Skill Classification," arXiv:2603.16572, March 2026
  • OASB benchmark code and dataset: github.com/opena2a-org/oasb
  • DVAA scenarios: github.com/opena2a-org/damn-vulnerable-ai-agent