Evaluation Suite

OASB Eval

MITRE ATT&CK Evaluations, but for AI agent security products.

222 standardized attack scenarios across 10 MITRE ATLAS techniques. Test whether runtime guards, EDRs, and security products detect real AI agent attacks — process spawning, network exfiltration, filesystem manipulation, and multi-step campaigns.

Different tools, different jobs

OASB Eval vs HackMyAgent

OASB Eval evaluates security products. HackMyAgent pentests agents. They complement each other but serve different audiences.

| | OASB Eval | HackMyAgent |
|---|---|---|
| Purpose | Evaluate security products | Pentest AI agents |
| Tests | "Does your EDR catch this?" | "Is your agent leaking?" |
| Audience | Security vendors, evaluators | Agent developers, red teams |
| Analogous to | MITRE ATT&CK Evaluations | OWASP ZAP / Burp Suite |

222 attack scenarios

Test categories

Each category targets a specific detection surface. Tests range from basic OS-level operations to complex multi-step attack campaigns.

| Category | Scenarios |
|---|---|
| Multi-step | 33 |
| Filesystem | 28 |
| Process | 25 |
| Intelligence | 21 |
| Network | 20 |
| Enforcement | 18 |
| Real OS | 14 |
| App hooks | 14 |
| Baseline | 13 |

222 total scenarios across 9 categories

Framework alignment

MITRE ATLAS coverage

All 222 scenarios map to 10 MITRE ATLAS techniques, providing standardized coverage across known AI/ML attack vectors.

| Technique | ATLAS ID | Description |
|---|---|---|
| Reconnaissance | AML.T0013 | Discover ML model artifacts, configurations, and deployment details |
| Resource Development | AML.T0017 | Develop adversarial ML capabilities and infrastructure |
| Initial Access | AML.T0019 | Gain initial entry to ML systems via prompt injection, API abuse |
| ML Attack Staging | AML.T0040 | Prepare and stage attacks against ML models and pipelines |
| Execution | AML.T0041 | Execute adversarial actions through model inference, tool calls |
| Persistence | AML.T0042 | Maintain access via poisoned models, backdoored pipelines |
| Privilege Escalation | AML.T0043 | Escalate from model context to system-level access |
| Defense Evasion | AML.T0044 | Evade detection through adversarial examples, model manipulation |
| Exfiltration | AML.T0024 | Extract training data, model weights, or sensitive context |
| Impact | AML.T0029 | Denial of service, model degradation, integrity compromise |
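The mapping above lends itself to per-technique coverage reporting. The sketch below is illustrative only: the `Scenario` shape and the scenario names are assumptions, not the repository's actual schema; only the ATLAS technique IDs come from the table.

```typescript
// Hypothetical scenario metadata carrying its ATLAS mapping.
interface Scenario {
  name: string;
  category: string;
  atlasId: string; // e.g. "AML.T0041" (Execution)
}

// Example entries (names invented for illustration).
const scenarios: Scenario[] = [
  { name: "spawn-reverse-shell", category: "process", atlasId: "AML.T0041" },
  { name: "dns-exfil-weights", category: "network", atlasId: "AML.T0024" },
  { name: "poison-model-cache", category: "filesystem", atlasId: "AML.T0042" },
];

// Count scenarios per ATLAS technique to report coverage.
function coverageByTechnique(list: Scenario[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const s of list) {
    counts.set(s.atlasId, (counts.get(s.atlasId) ?? 0) + 1);
  }
  return counts;
}

console.log(coverageByTechnique(scenarios));
```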

Get started

Run the evaluation

Clone the repository, install dependencies, and run the full test suite against your security product.

```bash
# Clone and install
git clone https://github.com/opena2a-org/oasb.git
cd oasb
npm install

# Run the full evaluation suite
npm test

# Run a specific category
npm test -- --grep "process"
npm test -- --grep "network"
npm test -- --grep "multi-step"
```

Scorecard

Product comparison

Same 222 tests, different products. Implement the SecurityProductAdapter interface and run the benchmark against your own product.

| Category | Tests | arp-guard | llm-guard |
|---|---|---|---|
| Process detection | 19 | 19 | 19 |
| Network detection | 18 | 18 | 18 |
| Filesystem detection | 28 | 28 | 28 |
| AI-layer scanning | 40 | 40 | 13 |
| Intelligence (L0/L1/L2) | 21 | 21 | 21 |
| Enforcement actions | 18 | 18 | 18 |
| Integration chains | 38 | 38 | 37 |
| Baseline (false positives) | 12 | 12 | 12 |
| E2E live detection | 28 | 28 | 28 |
| Total | 222 | 222 (100%) | 194 (87.4%) |

The AI-layer category is the primary differentiator. arp-guard covers 19 threat patterns across prompt injection, jailbreak, MCP exploitation, and A2A attacks. llm-guard covers prompt injection and PII detection but has no MCP, A2A, or output scanning.
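A minimal sketch of what implementing the `SecurityProductAdapter` interface might look like. The method name `inspect` and the `AttackEvent`/`Verdict` shapes are assumptions for illustration; consult the repository for the actual contract.

```typescript
// Assumed event and verdict shapes (not the repo's actual types).
interface AttackEvent {
  category: string; // e.g. "process", "network", "filesystem"
  payload: string;  // the simulated attack action
}

interface Verdict {
  detected: boolean;
  reason?: string;
}

// Hypothetical adapter contract a vendor would implement.
interface SecurityProductAdapter {
  name: string;
  inspect(event: AttackEvent): Promise<Verdict>;
}

// Stub adapter that flags only process-category events,
// standing in for a call into a real product's API.
class StubAdapter implements SecurityProductAdapter {
  name = "stub";
  async inspect(event: AttackEvent): Promise<Verdict> {
    const detected = event.category === "process";
    return { detected, reason: detected ? "process rule hit" : undefined };
  }
}

async function main(): Promise<void> {
  const adapter = new StubAdapter();
  const verdict = await adapter.inspect({
    category: "process",
    payload: "spawn /bin/sh",
  });
  console.log(verdict.detected); // true
}
main();
```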

Transparency

Known detection gaps

No security product catches everything. OASB Eval is designed to surface these gaps transparently so vendors and users can make informed decisions.

Multi-step campaigns

Attacks that chain multiple benign operations into malicious sequences are difficult to detect at the individual step level.

Application-level hooks

Runtime monitors operating at the OS level may miss attacks that exploit application-layer APIs and SDKs.

Encrypted exfiltration

Data exfiltration over encrypted channels (HTTPS, DNS-over-HTTPS) often bypasses network-level monitors.

Living-off-the-land

Attacks using legitimate system tools (curl, python, node) are inherently harder to distinguish from normal operations.
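The multi-step gap can be made concrete with a small illustrative detector (not from the OASB repo): each event below is benign in isolation, and only the ordered sequence read, encode, send signals exfiltration, which is why step-by-step inspection misses it.

```typescript
// Simplified event vocabulary for the sketch.
type Step = "read-file" | "base64-encode" | "http-post" | "other";

// Flags a trace only when the full suspicious sequence
// appears in order, possibly interleaved with benign steps.
function isExfilCampaign(trace: Step[]): boolean {
  const pattern: Step[] = ["read-file", "base64-encode", "http-post"];
  let matched = 0;
  for (const step of trace) {
    if (step === pattern[matched]) matched++;
    if (matched === pattern.length) return true;
  }
  return false;
}

// Each step alone is benign; the ordered chain is not.
console.log(isExfilCampaign(["read-file", "other", "base64-encode", "http-post"])); // true
console.log(isExfilCampaign(["http-post", "read-file"])); // false
```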

Run the evaluation

Test your security product against 222 attack scenarios. View the repository for setup instructions.

$ git clone ... && npm test