OASB Skills Security Benchmark
The first ground-truth benchmark for AI agent skill scanners. 4,245 labeled samples. 9 attack categories. Verified precision and recall.
Last updated: April 2, 2026 | Dataset: v2.0 | Paper comparison: Holzbauer et al. (arXiv:2603.16572)
Why this benchmark exists
Holzbauer et al. evaluated 9 scanners across 238,180 skills from 3 marketplaces. Flag rates ranged from 3.8% to 41.9%, but only 33 out of 27,111 skills (0.12%) were flagged by all scanners. No scanner reported precision, recall, or F1 because no ground-truth labeled dataset existed.
OASB provides that ground truth: 4,245 samples with verified labels across 9 attack categories, sourced from DVAA scenarios, ARIA research findings, expert-reviewed payloads, and real registry data. Any scanner can submit results for standardized evaluation.
Scanner Leaderboard
| # | Scanner | Tier | F1 | Precision | Recall | FPR | Flag Rate | Categories |
|---|---|---|---|---|---|---|---|---|
| 1 | NanoMind TME v0.5.0 - ONNX neural classifier only (best balanced) | GOLD | 89.2% | 88.4% | 90.0% | 0.82% | 6.9% | 9/9 |
| 2 | HMA Full Pipeline v0.12.9 - AST compilation + 6 analyzers + NanoMind | GOLD | 81.3% | 68.5% | 100.0% | 3.20% | 10.3% | 9/9 |
| 3 | HMA Static Patterns v0.12.9 - Regex-only, no NanoMind | SILVER | 67.5% | 99.3% | 51.1% | 0.03% | 3.6% | 9/9 |
Per-Category Detection (NanoMind TME v0.5.0)
30 malicious samples per category. Sorted by F1.
| Category | Recall | Precision | F1 |
|---|---|---|---|
| Persistence | 96.7% | 100.0% | 98.3% |
| Unicode Steganography | 100.0% | 96.6% | 98.2% |
| Social Engineering | 100.0% | 90.6% | 95.1% |
| Privilege Escalation | 93.3% | 96.6% | 94.9% |
| Credential Exfiltration | 80.0% | 91.7% | 85.4% |
| Data Exfiltration | 76.7% | 92.0% | 83.6% |
| Supply Chain | 83.3% | -- | -- |
| Prompt Injection | 86.7% | 61.8% | 72.1% |
| Heartbeat RCE | 93.3% | 50.0% | 65.1% |
DVAA Ground-Truth Validation
70 intentionally vulnerable scenarios from the Damn Vulnerable AI Agent (DVAA). Each scenario has a known attack type and an expected detection outcome.
Industry Comparison
Scanner flag rates from Holzbauer et al. (arXiv:2603.16572), 238,180 skills across 3 marketplaces. These scanners report flag rates only; no precision/recall is available (no ground truth).
| Scanner | Platform | Flag Rate | Precision | Recall |
|---|---|---|---|---|
| HMA Static | OASB v2 | 3.6% | 99.3% | 51.1% |
| NanoMind TME v0.5.0 | OASB v2 | 6.9% | 88.4% | 90.0% |
| HMA Full Pipeline | OASB v2 | 10.3% | 68.5% | 100.0% |
| Socket | Skills.sh | 3.8% | -- | -- |
| Snyk | Skills.sh | 7.7% | -- | -- |
| agent-trust-hub | Skills.sh | 13.8% | -- | -- |
| Cisco Skill Scanner | Skills.sh | 14.0% | -- | -- |
| Cisco Skill Scanner | ClawHub | 16.7% | -- | -- |
| GPT 5.3-based LLM | Skills.sh | 27.3% | -- | -- |
| VirusTotal | ClawHub | 36.2% | -- | -- |
| GPT 5.3-based LLM | ClawHub | 38.8% | -- | -- |
| OpenClaw Scanner | ClawHub | 41.9% | -- | -- |
Paper scanners were tested on 238K real, unlabeled marketplace skills; HMA scanners were tested on the OASB v2 corpus (4,245 labeled samples). Only flag rates are directly comparable, since precision and recall require ground truth.
Scoring Tiers
| Tier | F1 Score | False Positive Rate | Category Coverage | Kappa vs HMA |
|---|---|---|---|---|
| Platinum | >=0.90 | <=5% | 9/9 | >=0.85 |
| Gold | >=0.80 | <=10% | >=7/9 | -- |
| Silver | >=0.65 | <=20% | >=5/9 | -- |
| Listed | Any | Any | Any | -- |
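Read as code, the tier table is a threshold cascade evaluated top-down. The sketch below is a direct transcription of the table in Python, assuming the kappa requirement applies only where the table lists one (Platinum); it is an illustration of the table, not the official gating logic.

```python
def assign_tier(f1: float, fpr: float, categories: int,
                kappa: float | None = None) -> str:
    """Map OASB metrics to a tier, transcribing the table above.

    f1 and fpr are fractions (e.g. 0.892, 0.0082); categories is the
    number of attack categories covered, out of 9.
    """
    if (f1 >= 0.90 and fpr <= 0.05 and categories == 9
            and kappa is not None and kappa >= 0.85):
        return "Platinum"
    if f1 >= 0.80 and fpr <= 0.10 and categories >= 7:
        return "Gold"
    if f1 >= 0.65 and fpr <= 0.20 and categories >= 5:
        return "Silver"
    return "Listed"  # any valid submission is at least Listed
```

For example, `assign_tier(0.892, 0.0082, 9)` returns `"Gold"`: NanoMind TME v0.5.0 covers all nine categories but its F1 sits just under the 0.90 Platinum cutoff, matching the leaderboard above.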
Methodology
Dataset Composition
- 270 malicious samples (30 per category)
- 3,881 benign samples from real registries
- 94 edge cases (security tools, defensive configs)
- Sources: DVAA scenarios, ARIA research, HMA payloads, expert review, registry scans
- 225 registry metadata-flagged stubs excluded (no malicious content)
Scoring
- Binary detection: malicious/benign verdict per sample
- Category assignment: 9 attack categories for malicious verdicts
- Metrics: macro-averaged F1, precision, and recall across categories (see the sketch after this list)
- FPR: false positives / (false positives + true negatives)
- Edge case samples excluded from scoring
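To make the arithmetic concrete, here is a minimal Python sketch of the scoring above. It is illustrative, not the official scorer: the toy `CATEGORIES` list and `ground_truth` data are hypothetical stand-ins for the real dataset loader, and the actual harness may handle category assignment differently.

```python
def binary_metrics(verdicts, labels):
    """Precision, recall, F1, and FPR for binary malicious/benign verdicts.

    verdicts: {sample_id: "malicious" | "benign"} from the scanner
    labels:   {sample_id: "malicious" | "benign"} ground truth
    Samples missing from `verdicts` count as "benign" (a missed detection).
    Edge-case samples are assumed to be filtered out beforehand.
    """
    tp = fp = fn = tn = 0
    for sid, truth in labels.items():
        flagged = verdicts.get(sid, "benign") == "malicious"
        if truth == "malicious":
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0  # FP / (FP + TN), as defined above
    return precision, recall, f1, fpr


# Toy example: two categories, each scored against its own slice of the
# corpus, then macro-averaged (unweighted mean across categories).
CATEGORIES = ["prompt_injection", "persistence"]  # illustrative subset of the 9
ground_truth = {
    "prompt_injection": {"m001": "malicious", "m002": "malicious", "b001": "benign"},
    "persistence":      {"m010": "malicious", "m011": "malicious", "b001": "benign"},
}
verdicts = {"m001": "malicious", "m010": "malicious", "m011": "malicious"}

per_cat = [binary_metrics(verdicts, ground_truth[cat]) for cat in CATEGORIES]
macro_f1 = sum(m[2] for m in per_cat) / len(per_cat)  # mean of per-category F1
```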
Submit Your Scanner
The benchmark is open: any scanner can submit results for evaluation. Ratings follow from the published methodology, not from a gating decision.
Submissions expire after 90 days. Scanners must resubmit against each new dataset version to maintain their rating.
```
POST https://api.oa2a.org/api/v1/benchmark/submit
Content-Type: application/json

{
  "scannerId": "your-scanner-id",
  "scannerName": "Your Scanner",
  "scannerVersion": "1.0.0",
  "datasetVersion": "v2.0",
  "results": [
    { "sampleId": "m001", "verdict": "malicious", "category": "supply_chain" },
    ...
  ]
}
```
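Since a submission is a single JSON POST, a client fits in a few lines. Below is a minimal sketch using Python's `requests`; the results file name is hypothetical, and any authentication the endpoint may require is not shown here.

```python
import json

import requests  # third-party: pip install requests

# Load per-sample verdicts produced by your scanner. The file name is
# hypothetical, but each entry follows the format shown above.
with open("results.json") as f:
    results = json.load(f)  # [{"sampleId": ..., "verdict": ..., "category": ...}, ...]

resp = requests.post(
    "https://api.oa2a.org/api/v1/benchmark/submit",
    json={
        "scannerId": "your-scanner-id",
        "scannerName": "Your Scanner",
        "scannerVersion": "1.0.0",
        "datasetVersion": "v2.0",
        "results": results,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # response shape is not documented here
```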
References
- Holzbauer et al., "Malicious Or Not: Adding Repository Context to Agent Skill Classification," arXiv:2603.16572, March 2026
- OASB benchmark code and dataset: github.com/opena2a-org/oasb
- DVAA scenarios: github.com/opena2a-org/damn-vulnerable-ai-agent