Product Evaluation Suite

OASB Eval

MITRE ATT&CK Evaluations, but for AI agent security products.

182 standardized attack scenarios across 10 MITRE ATLAS techniques. Test whether runtime guards, EDRs, and other security products detect real AI agent attacks: process spawning, network exfiltration, filesystem manipulation, and multi-step campaigns.

Different tools, different jobs

OASB Eval vs HackMyAgent

OASB Eval evaluates security products. HackMyAgent pentests agents. They complement each other but serve different audiences.

             | OASB Eval                    | HackMyAgent
Purpose      | Evaluate security products   | Pentest AI agents
Tests        | "Does your EDR catch this?"  | "Is your agent leaking?"
Audience     | Security vendors, evaluators | Agent developers, red teams
Analogous to | MITRE ATT&CK Evaluations     | OWASP ZAP / Burp Suite

182 attack scenarios

Test categories

Each category targets a specific detection surface. Tests range from basic OS-level operations to complex multi-step attack campaigns.

Multi-step: 33
Filesystem: 28
Process: 25
Intelligence: 21
Network: 20
Enforcement: 18
Real OS: 14
App hooks: 14
Baseline: 13

182 total scenarios across 9 categories

Framework alignment

MITRE ATLAS coverage

All 182 scenarios map to 10 MITRE ATLAS techniques, providing standardized coverage across known AI/ML attack vectors.

Technique            | ATLAS ID  | Description
Reconnaissance       | AML.T0013 | Discover ML model artifacts, configurations, and deployment details
Resource Development | AML.T0017 | Develop adversarial ML capabilities and infrastructure
Initial Access       | AML.T0019 | Gain initial entry to ML systems via prompt injection, API abuse
ML Attack Staging    | AML.T0040 | Prepare and stage attacks against ML models and pipelines
Execution            | AML.T0041 | Execute adversarial actions through model inference, tool calls
Persistence          | AML.T0042 | Maintain access via poisoned models, backdoored pipelines
Privilege Escalation | AML.T0043 | Escalate from model context to system-level access
Defense Evasion      | AML.T0044 | Evade detection through adversarial examples, model manipulation
Exfiltration         | AML.T0024 | Extract training data, model weights, or sensitive context
Impact               | AML.T0029 | Denial of service, model degradation, integrity compromise
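
One way to put this mapping to work is to tally detections per technique after a run. The sketch below assumes a hypothetical results stream with one line per scenario in the form `<scenario-id> <atlas-id> <detected|missed>`. The page does not describe OASB's actual output format, so the input format and the function name `coverage_by_technique` are illustrative only.

```shell
# Summarize detections per ATLAS technique.
# Input on stdin (hypothetical format): "<scenario-id> <atlas-id> <detected|missed>"
# Output: one "<atlas-id> <detected>/<total>" line per technique, sorted by ID.
coverage_by_technique() {
  awk '
    { total[$2]++ }                    # count every scenario for its technique
    $3 == "detected" { hit[$2]++ }     # count only the detected ones
    END {
      for (id in total)
        printf "%s %d/%d\n", id, hit[id], total[id]
    }
  ' | sort
}
```

For example, piping three lines tagged `AML.T0024 detected`, `AML.T0024 missed`, and `AML.T0029 missed` through the function yields `AML.T0024 1/2` and `AML.T0029 0/1`, which makes per-technique blind spots easier to see than a single overall score.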

Get started

Run the evaluation

Clone the repository, install dependencies, and run the full test suite against your security product.

# Clone and install

git clone https://github.com/opena2a-org/oasb.git
cd oasb
npm install

# Run the full evaluation suite

npm test

# Run a specific category

npm test -- --grep "process"
npm test -- --grep "network"
npm test -- --grep "multi-step"
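
To run several categories in one pass and see which ones exit non-zero, a small wrapper along these lines may help. It is a sketch under assumptions: `run_categories` and the `TEST_CMD` override are names introduced here, only the three grep patterns shown above are used, and it assumes `npm test` exits non-zero when a scenario fails.

```shell
# Command that runs one category. Override TEST_CMD (for example with "echo")
# to dry-run the wrapper without invoking npm.
TEST_CMD="${TEST_CMD:-npm test --}"

# Run each category given as an argument, write its output to a log file,
# and print one PASS/FAIL line per category based on the exit status.
run_categories() {
  oasb_failed=0
  for oasb_category in "$@"; do
    # $TEST_CMD is left unquoted on purpose so it splits into command + args.
    if $TEST_CMD --grep "$oasb_category" > "oasb-$oasb_category.log" 2>&1; then
      echo "PASS $oasb_category"
    else
      echo "FAIL $oasb_category"
      oasb_failed=$((oasb_failed + 1))
    fi
  done
  # Return non-zero when at least one category failed.
  [ "$oasb_failed" -eq 0 ]
}
```

Calling `run_categories process network multi-step` from the repository root would then leave one `oasb-<category>.log` file per category next to a short PASS/FAIL summary.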

Transparency

Known detection gaps

No security product catches everything. OASB Eval is designed to surface these gaps transparently so vendors and users can make informed decisions.

Multi-step campaigns

Attacks that chain multiple benign operations into malicious sequences are difficult to detect at the individual step level.

Application-level hooks

Runtime monitors operating at the OS level may miss attacks that exploit application-layer APIs and SDKs.

Encrypted exfiltration

Data exfiltration over encrypted channels (HTTPS, DNS-over-HTTPS) often bypasses network-level monitors.

Living-off-the-land

Attacks using legitimate system tools (curl, python, node) are inherently harder to distinguish from normal operations.
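
The multi-step gap can be made concrete with a small correlation sketch. Each event on its own looks benign; an alert fires only when one process reads a credential-like file and then opens a network connection within a time window. The event format (`<epoch-seconds> <pid> <action> <target>`), the path patterns, and the function name `detect_read_then_connect` are assumptions for illustration, not part of OASB or any specific product.

```shell
# Read events on stdin and flag a sensitive read followed by an outbound
# connection from the same PID within the given window (seconds, default 60).
# Event format (hypothetical): "<epoch-seconds> <pid> <action> <target>"
detect_read_then_connect() {
  awk -v window="${1:-60}" '
    # Remember the time of the most recent sensitive read per PID.
    $3 == "read" && $4 ~ /\.ssh\/|\.env$/ { last_read[$2] = $1 }

    # A connect is only flagged if the same PID read a secret recently.
    $3 == "connect" && ($2 in last_read) && ($1 - last_read[$2] <= window) {
      print "ALERT pid=" $2 " read-then-connect to " $4
    }
  '
}
```

A monitor that scores each event in isolation sees a file read and an HTTPS connection, both routine. Keeping per-process state across events, as this sketch does, is what allows the sequence to be recognized.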

Run the evaluation

Test your security product against 182 attack scenarios. View the repository for setup instructions.

$ git clone ... && npm test