
Pearl.com evaluation of 25 models finds leading AI systems align with experts only ~70% of the time, with some widely used models dropping to ~20% in certain domains.
SAN FRANCISCO, May 13, 2026 /PRNewswire/ -- Despite gains on AI benchmarks, leading frontier models still fall short of expert-level performance on real-world professional questions, according to a new evaluation from Pearl Enterprise. Across 25 models from OpenAI, Anthropic, Google DeepMind, Microsoft and others, the top systems aligned with licensed professionals only about 70% of the time, raising questions about whether benchmark progress translates into deployable trust.
The findings, published as the Pearl Expert Alignment Leaderboard, point to a disconnect between public benchmark performance and performance on the professional-grade questions that define real-world deployment. While models post higher benchmark scores, those gains are not consistently translating to the judgment and reasoning required in professional settings.
Pearl's Expert Alignment Leaderboard measures what public benchmarks often miss: whether AI answers meet the standard of professional judgment required in real-world settings.
Pearl evaluated 25 models across five domains (business, health, law, pets and technology) using approximately 510 questions answered by credentialed experts. OpenAI's GPT-5.5 led with 72.7% expert alignment, followed by GPT-5 at 72.5%, GPT-5.1 at 72.0% and Anthropic's Claude Opus 4.7 at 71.9%.
For companies deploying AI in high-stakes areas like healthcare, legal and finance, the results raise a critical question: are today's models optimizing for benchmarks, or for real-world trust?
"Benchmarks measure whether a model can pass a test. We're asking whether a professional would trust the answer, and right now, the answer is no," said Andy Kurtzig, CEO of Pearl. "That's the gap companies need to solve before deploying AI in high-stakes environments. Almost right is still wrong."
Four findings from the leaderboard:
Public benchmark gains are not translating to expert-aligned performance. Models that lead on GPQA, SWE-bench and similar public benchmarks do not consistently lead on Pearl's private evaluation, suggesting benchmark gains may not fully capture performance on unseen, professional-grade questions.
The frontier has converged below expert level. No tested model exceeded 73% expert alignment in aggregate, suggesting today's top systems may be converging below expert-level performance on professional tasks.
Performance varies sharply by domain. While top scores reached 80.9% in business, some widely used models scored as low as ~20% in domains like law and health, revealing significant gaps in consistency across real-world use cases.
Increased reasoning yields diminishing returns. Maximum-reasoning configurations improved performance by only 1.0 to 2.6 percentage points over minimal-reasoning settings, gains that may not justify the latency and cost in production. In some cases, increased reasoning led to lower-quality responses, suggesting more compute does not consistently translate to better judgment.
Methodology
Each model received the same prompt, with no tuning, and responses were scored on a 1–5 rubric measuring correctness, completeness, prioritization and professional judgment. The dataset has not been previously released and was unavailable to model developers.
The leaderboard is available at www.pearl.com/leaderboard.
About Pearl Enterprise
Pearl builds AI systems for professional services, combining AI with expert judgment to deliver trusted answers and human-in-the-loop intelligence.
SOURCE PEARL.COM