OpenAI and Paradigm’s EVMbench tests AI on EVM smart contract vulnerabilities
Developed by OpenAI with Paradigm, EVMbench is a new benchmark that evaluates AI agents on Ethereum Virtual Machine (EVM) smart contract vulnerabilities across three tasks: detect, patch, and exploit, as reported by The Defiant. It targets a measurable, security-relevant domain in which smart contracts hold large amounts of on-chain value, with the aim of clarifying what current agents can and cannot do.
The launch coincides with a $10 million commitment by OpenAI to cybersecurity research, according to Crypto Briefing. The initiative places AI agents within crypto audit workflows while emphasizing defensive applications in an area where model capabilities are changing quickly.
Detect, patch, exploit: how EVMbench evaluates AI agents
EVMbench assesses whether an agent can identify a vulnerability, propose a safe fix, and exploit the flaw in a controlled environment. According to OpenAI, exploit performance has improved faster than detection and patching: a model variant dubbed GPT-5.3-Codex scored 72.2% in exploit mode versus 31.9% for GPT-5, while success rates in detect and patch modes remain below full coverage of the benchmarked set.
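To make the headline numbers concrete, the sketch below computes a per-mode success rate from pass/fail task outcomes and compares the two reported exploit scores. It assumes each task is scored simply as pass or fail; the `TaskResult` record and the `success_rate` and `compare` helpers are hypothetical illustrations, not EVMbench's actual harness or result format.

```python
from dataclasses import dataclass


# Hypothetical per-task record; the article does not describe EVMbench's real result format.
@dataclass
class TaskResult:
    task_id: str
    mode: str  # assumed to be one of "detect", "patch", "exploit"
    passed: bool


def success_rate(results: list[TaskResult], mode: str) -> float:
    """Percentage of tasks in `mode` marked as passed (assumes simple pass/fail scoring)."""
    in_mode = [r for r in results if r.mode == mode]
    if not in_mode:
        raise ValueError(f"no results for mode {mode!r}")
    return 100.0 * sum(r.passed for r in in_mode) / len(in_mode)


def compare(new_pct: float, old_pct: float) -> tuple[float, float]:
    """Return (percentage-point gain, ratio of new score to old score)."""
    return new_pct - old_pct, new_pct / old_pct


# Exploit-mode scores as reported by OpenAI: GPT-5.3-Codex vs GPT-5.
gain_pp, ratio = compare(72.2, 31.9)
print(f"{gain_pp:.1f} pp gain, {ratio:.2f}x")  # prints "40.3 pp gain, 2.26x"
```

The two numbers describe the same jump differently: a gain of about 40 percentage points, or a score roughly 2.3 times as high. Keeping both in view helps avoid overstating or understating the change.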
Paradigm said these results preview a structural shift in auditing. “It’s now clear to us that a growing portion of audits in the future will be done by agents,” said Alpin Yukseloglu, Partner, Investing & Research, at Paradigm. The firm also noted that when this work began, top models exploited fewer than 20% of critical fund-draining bugs, whereas leading agents now exceed 70% in exploit mode, an improvement with clear implications for both offense and defense.
Independent research points to the same dual-use dynamic: Anthropic’s SCONE-bench showed agents could autonomously generate exploit code simulating $4.6 million in losses, including on contracts deployed after the models’ training cutoffs, as reported by Cointelegraph. These findings suggest defenders face a narrowing window between disclosure and exploitation, reinforcing the need for measurable evaluations like EVMbench.
What EVMbench means for audits, defense, and governance
Agent-in-the-loop audits are likely to expand as exploit capabilities advance, but lagging detect-and-patch performance indicates that human-led review, threat modeling, and governance will remain central. Industry practitioners have cautioned against assuming AI can replace expert auditors outright; OpenZeppelin, for example, has observed that models can handle many known challenges yet still struggle with novel or adversarial cases, as reported by BitcoinInsider.
From a policy and controls perspective, benchmarks like EVMbench may inform pre-deployment scanning, continuous monitoring of on-chain behavior, and norms for structured vulnerability disclosure. As agent capabilities evolve, boards, protocol DAOs, and security leads are likely to weigh agent-access governance and audit sign-off criteria more explicitly.
At the time of writing, Coinbase Global (COIN) traded near $164.81 in after-hours trading, based on data from NasdaqGS. The figure is not a directional signal, but it provides context for market attention around AI–crypto security initiatives that could influence operational priorities, vendor selection, and oversight frameworks.
Disclaimer: The content on The CCPress is provided for informational purposes only and should not be considered financial or investment advice. Cryptocurrency investments carry inherent risks. Please consult a qualified financial advisor before making any investment decisions.

