Logical Intelligence Tops Leading AI Verification Benchmarks as Verified Code Generation Nears Reality with Aleph

News provided by

May 21, 2026, 11:30 ET

Aleph, an AI coding agent sets new records on four major formal reasoning benchmarks, proving that automated code generation can be formally verified for mission-critical systems

SAN FRANCISCO, May 21, 2026 /PRNewswire/ -- Logical Intelligence, an AI lab pioneering energy-based reasoning models (EBRMs), today announced that its AI coding agent, Aleph, achieved top scores on four public formal reasoning benchmarks: PutnamBench, VeriSoftBench, LeanEval, and Verina. This milestone demonstrates that formally verified code generation is no longer theoretical, but a practical reality for software that powers critical infrastructure.

"Organizations should use caution with vibe coding without verification, as AI systems are rapidly moving into environments where correctness is no longer optional," said Eve Bodina, founder and CEO of Logical Intelligence. "Aleph's performance on these benchmarks demonstrates meaningful progress on AI capable of reasoning under constraints, verification, and provable correctness. We believe formally verified code generation will become a foundational requirement for critical infrastructure, industrial software, semiconductor design, energy systems, and finance."

Bodnia noted that some people look at AI model benchmark performance as an indication that we are nearing AGI, while others dismiss them as irrelevant academic exercises. She thinks both views are wrong.

"Formal reasoning benchmarks are increasingly important because they force AI systems to operate in environments where correctness is mathematically enforced, and small reasoning failures compound into complete breakdowns," Bodnia said. "That matters enormously once AI moves into realms where being 'mostly right' is effectively the same as being wrong."

Many organizations are rapidly scaling AI-generated software, despite growing concerns around hallucinated code, hidden vulnerabilities, and unverifiable outputs. With AI now generating substantial portions of code, there's a new focus on formal verification – mathematical techniques used to prove software correctness under defined constraints. Unlike traditional AI benchmarks that reward plausible outputs, formal reasoning systems require machine-verifiable correctness where proofs either succeed or fail.

Logical Intelligence develops AI for environments where correctness is essential. Aleph operates in formal verification environments that require machine-checkable proofs rather than probabilistic outputs. The benchmarks Aleph now leads are among the most rigorous tests for reliable AI reasoning under mathematical and software constraints. As recently as last summer, state-of-the-art AI provers were solving less than 2% of PutnamBench problems. Aleph has now solved 668 of the benchmark's 672 formally stated mathematical problems.

Aleph has produced leading performance across the major AI benchmarks.

PutnamBench (advanced mathematical theorem proving): Solved 99.4% of the Putnam problem set, outperforming ByteDance's Seed-Prover 1.5 at 86% and Hilbert at 69%.
VeriSoftBench (real-world software verification): Achieved 94% success, outperforming Harmonic's Aristotle at 69% and Google Gemini-3 Pro at 65%.
LeanEval (formalization in Lean, a proof assistant and a functional programming language): State-of-the-art performance, surpassing frontier systems including Claude Opus 4.7 and Harmonic Aristotle.
Verina (formal verification benchmark): Verified 100% score, independently confirmed by benchmark authors.

"Today's AI systems are optimized to generate plausible outputs, not necessarily provable ones," said Patrick Hillmann, Chief Operating Officer at Logical Intelligence. "That becomes a serious problem once AI-generated software starts scaling across critical systems where failures carry real-world consequences. These benchmarks matter because they test whether AI systems can reason reliably in environments where correctness is mathematically enforced, not statistically inferred. We believe verified code generation will become one of the defining infrastructure layers of the AI era."

Logical Intelligence believes increasingly capable AI systems will require reasoning architectures optimized for generation, verification, and constraint satisfaction. Aleph is already being used in production verification workflows, including work involving the Ethereum Foundation's ArkLib cryptographic libraries supporting zkEVM infrastructure. Logical Intelligence plans to launch a beta version of Aleph later this year.

About Aleph
Aleph is Logical Intelligence's agent built for teams who cannot afford to be wrong. It automates formal verification end to end and produces machine-checkable proofs that critical logic behaves correctly across every execution path, not just the ones engineers thought to test.

Aleph is already replacing slow, manual verification work that normally takes months and turns it into something repeatable and scalable. It fits into existing engineering workflows and flags failure modes before code ever reaches production, instead of after something breaks.

The agent is currently being used in a limited set of pilots for verified code generation. If you are an operator of critical infrastructure or safety-sensitive systems who wants to help define how this becomes the new baseline, please reach out to us here.

About Logical Intelligence
Logical Intelligence develops energy-based AI systems for reasoning in mission-critical environments, including energy, advanced manufacturing, industrial automation, financial automation, and hardware design. Its Aleph agent is available today for formal verification and automated code generation with machine-checkable proofs, while its forthcoming Kona platform extends this approach into frontier-scale energy-based reasoning for continuously verifiable systems. To learn more, visit https://logicalintelligence.com/.

SOURCE Logical Intelligence