Evaluation Suite

OASB Eval

MITRE ATT&CK Evaluations, but for AI agent security products.

222 standardized attack scenarios across 10 MITRE ATLAS techniques. Test whether runtime guards, EDRs, and security products detect real AI agent attacks — process spawning, network exfiltration, filesystem manipulation, and multi-step campaigns.

Different tools, different jobs

OASB Eval vs HackMyAgent

OASB Eval evaluates security products. HackMyAgent pentests agents. They complement each other but serve different audiences.

| | OASB Eval | HackMyAgent |
|---|---|---|
| Purpose | Evaluate security products | Pentest AI agents |
| Tests | "Does your EDR catch this?" | "Is your agent leaking?" |
| Audience | Security vendors, evaluators | Agent developers, red teams |
| Analogous to | MITRE ATT&CK Evaluations | OWASP ZAP / Burp Suite |

222 attack scenarios

Test categories

Each category targets a specific detection surface. Tests range from basic OS-level operations to complex multi-step attack campaigns.

| Category | Scenarios |
|---|---|
| Multi-step | 33 |
| Filesystem | 28 |
| Process | 25 |
| Intelligence | 21 |
| Network | 20 |
| Enforcement | 18 |
| Real OS | 14 |
| App hooks | 14 |
| Baseline | 13 |

222 total scenarios across 9 categories

Framework alignment

MITRE ATLAS coverage

All 222 scenarios map to 10 MITRE ATLAS techniques, providing standardized coverage across known AI/ML attack vectors.

| Technique | ATLAS ID | Description |
|---|---|---|
| Reconnaissance | AML.T0013 | Discover ML model artifacts, configurations, and deployment details |
| Resource Development | AML.T0017 | Develop adversarial ML capabilities and infrastructure |
| Initial Access | AML.T0019 | Gain initial entry to ML systems via prompt injection, API abuse |
| ML Attack Staging | AML.T0040 | Prepare and stage attacks against ML models and pipelines |
| Execution | AML.T0041 | Execute adversarial actions through model inference, tool calls |
| Persistence | AML.T0042 | Maintain access via poisoned models, backdoored pipelines |
| Privilege Escalation | AML.T0043 | Escalate from model context to system-level access |
| Defense Evasion | AML.T0044 | Evade detection through adversarial examples, model manipulation |
| Exfiltration | AML.T0024 | Extract training data, model weights, or sensitive context |
| Impact | AML.T0029 | Denial of service, model degradation, integrity compromise |
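The mapping above lends itself to per-technique coverage reporting. The sketch below is illustrative only: the `Scenario` shape and the scenario names are assumptions, not the repository's actual schema; only the ATLAS technique IDs come from the table.

```typescript
// Hypothetical scenario metadata carrying its ATLAS mapping.
interface Scenario {
  name: string;
  category: string;
  atlasId: string; // e.g. "AML.T0041" (Execution)
}

// Example entries (names invented for illustration).
const scenarios: Scenario[] = [
  { name: "spawn-reverse-shell", category: "process", atlasId: "AML.T0041" },
  { name: "dns-exfil-weights", category: "network", atlasId: "AML.T0024" },
  { name: "poison-model-cache", category: "filesystem", atlasId: "AML.T0042" },
];

// Count scenarios per ATLAS technique to report coverage.
function coverageByTechnique(list: Scenario[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const s of list) {
    counts.set(s.atlasId, (counts.get(s.atlasId) ?? 0) + 1);
  }
  return counts;
}

console.log(coverageByTechnique(scenarios));
```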

Get started

Run the evaluation

Clone the repository, install dependencies, and run the full test suite against your security product.

```bash
# Clone and install
git clone https://github.com/opena2a-org/oasb.git
cd oasb
npm install

# Run the full evaluation suite
npm test

# Run a specific category
npm test -- --grep "process"
npm test -- --grep "network"
npm test -- --grep "multi-step"
```

Scorecard

Product comparison

Same 222 tests, different products. Implement the SecurityProductAdapter interface and run the benchmark against your own product.

| Category | Tests | arp-guard | llm-guard |
|---|---|---|---|
| Process detection | 19 | 19 | 19 |
| Network detection | 18 | 18 | 18 |
| Filesystem detection | 28 | 28 | 28 |
| AI-layer scanning | 40 | 40 | 13 |
| Intelligence (L0/L1/L2) | 21 | 21 | 21 |
| Enforcement actions | 18 | 18 | 18 |
| Integration chains | 38 | 38 | 37 |
| Baseline (false positives) | 12 | 12 | 12 |
| E2E live detection | 28 | 28 | 28 |
| Total | 222 | 222 (100%) | 194 (87.4%) |

The AI-layer category is the primary differentiator. arp-guard covers 19 threat patterns across prompt injection, jailbreak, MCP exploitation, and A2A attacks. llm-guard covers prompt injection and PII detection but has no MCP, A2A, or output scanning.
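A minimal sketch of what implementing the `SecurityProductAdapter` interface might look like. The method name `inspect` and the `AttackEvent`/`Verdict` shapes are assumptions for illustration; consult the repository for the actual contract.

```typescript
// Assumed event and verdict shapes (not the repo's actual types).
interface AttackEvent {
  category: string; // e.g. "process", "network", "filesystem"
  payload: string;  // the simulated attack action
}

interface Verdict {
  detected: boolean;
  reason?: string;
}

// Hypothetical adapter contract a vendor would implement.
interface SecurityProductAdapter {
  name: string;
  inspect(event: AttackEvent): Promise<Verdict>;
}

// Stub adapter that flags only process-category events,
// standing in for a call into a real product's API.
class StubAdapter implements SecurityProductAdapter {
  name = "stub";
  async inspect(event: AttackEvent): Promise<Verdict> {
    const detected = event.category === "process";
    return { detected, reason: detected ? "process rule hit" : undefined };
  }
}

async function main(): Promise<void> {
  const adapter = new StubAdapter();
  const verdict = await adapter.inspect({
    category: "process",
    payload: "spawn /bin/sh",
  });
  console.log(verdict.detected); // true
}
main();
```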

Transparency

Known detection gaps

No security product catches everything. OASB Eval is designed to surface these gaps transparently so vendors and users can make informed decisions.

Multi-step campaigns

Attacks that chain multiple benign operations into malicious sequences are difficult to detect at the individual step level.

Application-level hooks

Runtime monitors operating at the OS level may miss attacks that exploit application-layer APIs and SDKs.

Encrypted exfiltration

Data exfiltration over encrypted channels (HTTPS, DNS-over-HTTPS) often bypasses network-level monitors.

Living-off-the-land

Attacks using legitimate system tools (curl, python, node) are inherently harder to distinguish from normal operations.
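The multi-step gap can be made concrete with a small illustrative detector (not from the OASB repo): each event below is benign in isolation, and only the ordered sequence read, encode, send signals exfiltration, which is why step-by-step inspection misses it.

```typescript
// Simplified event vocabulary for the sketch.
type Step = "read-file" | "base64-encode" | "http-post" | "other";

// Flags a trace only when the full suspicious sequence
// appears in order, possibly interleaved with benign steps.
function isExfilCampaign(trace: Step[]): boolean {
  const pattern: Step[] = ["read-file", "base64-encode", "http-post"];
  let matched = 0;
  for (const step of trace) {
    if (step === pattern[matched]) matched++;
    if (matched === pattern.length) return true;
  }
  return false;
}

// Each step alone is benign; the ordered chain is not.
console.log(isExfilCampaign(["read-file", "other", "base64-encode", "http-post"])); // true
console.log(isExfilCampaign(["http-post", "read-file"])); // false
```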

Run the evaluation

Test your security product against 222 attack scenarios. View the repository for setup instructions.

$ git clone ... && npm test