
Chinese Medical AI Team's MedGPT Tops Global Rankings in New Clinical Benchmark Published in Nature Portfolio Journal
BEIJING, Jan. 8, 2026 /PRNewswire/ -- A research team from China has introduced the first standardized framework for evaluating the clinical applicability of medical AI systems. The findings were published in npj Digital Medicine, a leading Nature Portfolio journal with a 2024 impact factor of 15.1, ranked in the top tier of medical journals by the Chinese Academy of Sciences. The study presents the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a comprehensive evaluation system designed to assess how well AI models perform in real-world clinical settings.
This marks the first time a Chinese research team has published benchmark standards for large language models in healthcare in a leading international journal. The establishment of CSEDB fills a critical gap in the global evaluation of medical AI's clinical capabilities, provides direction for the iterative improvement of medical large language models, and lays essential groundwork for deploying AI in serious clinical settings. In systematic testing of major global AI models using this benchmark, MedGPT—developed by Chinese AI medical technology company Future Doctor—achieved the highest scores across all metrics.
Bridging the Gap Between Testing and Clinical Reality
Patient safety is the fundamental priority in healthcare. As artificial intelligence moves deeper into serious medical scenarios such as diagnosis and treatment, every AI-assisted decision must withstand the rigorous scrutiny of clinical practice.
However, current global medical AI evaluation systems have significant limitations. Mainstream assessments largely rely on standardized medical licensing examinations, which feature fixed answers and limited options. In contrast, real-world clinical practice involves highly individualized cases and dynamic, evolving conditions. Evaluating AI's clinical applicability based solely on examination performance creates a substantial disconnect from the demands of actual patient care.
As medical AI advances rapidly, the industry urgently needs a scientific evaluation standard rooted in clinical practice and aligned with real-world decision-making scenarios—a challenge shared by the global medical AI community.
A New Dual-Track Evaluation Paradigm Developed by Chinese Experts
The CSEDB framework was developed collaboratively by Future Doctor's research team and 32 leading clinical experts from 23 top-tier medical institutions in China, including Peking Union Medical College Hospital, The Cancer Hospital of the Chinese Academy of Medical Sciences, Chinese PLA General Hospital, and Huashan Hospital affiliated to Fudan University.
This new standard breaks away from the previous model of evaluating medical AI capabilities by question-answering accuracy alone. CSEDB is the first assessment system globally to measure safety and effectiveness on parallel tracks, fully aligned with real clinical decision-making scenarios.
The evaluation encompasses 30 core indicators based on clinical expert consensus: 17 focused on safety considerations, including critical illness recognition, fatal diagnostic errors, absolute contraindicated medications, and medication safety; and 13 addressing effectiveness, covering guideline adherence, prioritization in multimorbidity, and optimization of diagnostic and therapeutic pathways. Each indicator is weighted according to clinical risk level on a scale of 1 to 5. A score of 5 represents life-threatening scenarios, such as critical illness recognition, fatal diagnostic errors, and fatal drug–drug interactions, while a score of 1 indicates lower-risk, routine clinical scenarios.
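As a rough illustration of how such risk-weighted, dual-track scoring can be computed, the sketch below shows a weighted average per track in Python. The indicator names, weights, and result values here are hypothetical examples; the study's exact scoring formula is not detailed in this release.

```python
# Minimal sketch of risk-weighted dual-track scoring. Indicators, weights,
# and results below are hypothetical illustrations, not the CSEDB's actual
# indicator set or formula.

# Each indicator carries a clinical-risk weight from 1 (routine) to 5
# (life-threatening) and belongs to one of the two tracks.
INDICATORS = {
    "critical_illness_recognition":  {"track": "safety", "weight": 5},
    "fatal_drug_drug_interaction":   {"track": "safety", "weight": 5},
    "medication_safety":             {"track": "safety", "weight": 3},
    "guideline_adherence":           {"track": "effectiveness", "weight": 2},
    "multimorbidity_prioritization": {"track": "effectiveness", "weight": 3},
}

def track_score(results: dict[str, float], track: str) -> float:
    """Weighted average of per-indicator results (each in [0, 1]) for one track."""
    items = [(results[name], spec["weight"])
             for name, spec in INDICATORS.items()
             if spec["track"] == track and name in results]
    total_weight = sum(w for _, w in items)
    return sum(r * w for r, w in items) / total_weight if total_weight else 0.0

# Example: one model's graded results per indicator, as fractions in [0, 1].
results = {
    "critical_illness_recognition": 0.95,
    "fatal_drug_drug_interaction": 0.90,
    "medication_safety": 0.88,
    "guideline_adherence": 0.85,
    "multimorbidity_prioritization": 0.80,
}
print(f"safety:        {track_score(results, 'safety'):.3f}")
print(f"effectiveness: {track_score(results, 'effectiveness'):.3f}")
```

Under this kind of weighting, a single failure on a weight-5 indicator such as fatal drug-drug interactions drags a model's safety score down far more than a miss on a routine weight-1 item, which mirrors the benchmark's emphasis on clinical risk.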
The evaluation methodology also departs from traditional static question-and-answer formats. Based on the above indicators, the framework comprises 2,069 open-ended clinical scenarios spanning 26 medical specialties, designed to comprehensively simulate the complexity of real diagnostic and treatment decisions.
MedGPT Leads Global Rankings as Future Doctor Pioneers the Next Era of Medical AI
The establishment of CSEDB marks the first systematic evaluation standard in the AI era capable of genuinely reflecting medical AI's clinical capabilities. Major global AI models were put to the test, including DeepSeek-R1, OpenAI o3, Gemini 2.5, Qwen3-235B, and Claude 3.7.[1]
In this systematic evaluation, MedGPT, the AI medical cognitive system independently developed by Chinese AI medical technology company Future Doctor, achieved the top position across all metrics, posting the highest scores in both safety (0.912) and effectiveness (0.861) and outscoring the second-best model by 15.3% overall and by 19.8% in the safety dimension.[1]
Notably, while most models scored lower on safety than on effectiveness, MedGPT was the only model whose safety score surpassed its effectiveness score. This suggests that even as its capabilities approach the professional level of physicians, it retains clinical caution, an essential trait in healthcare.
MedGPT's performance stems from Future Doctor's founding philosophy: from the very beginning, the team embedded safety and effectiveness, grounded in clinical expert consensus, into the system's core architecture. The goal was to create a medical AI that "thinks like a physician," rather than one that merely "sounds like a physician." Its underlying technical architecture is modeled on human cognitive reasoning processes rather than relying solely on intelligence emerging from massive-scale data training.
As early as 2023, MedGPT demonstrated strong clinical adaptability in trials involving real patients, achieving 96% diagnostic concordance with attending physicians at tertiary hospitals. This capability continues to evolve: over 10,000 physicians now use the Future Doctor platform to interact with patients, generating approximately 20,000 real clinical feedback entries weekly. Through this "feedback-driven iteration" flywheel mechanism, MedGPT's accuracy improves by 1.2% to 1.5% monthly, continuously advancing medical AI's clinical capabilities to higher levels.
Media Contact
Yoyo Yi
[email protected]
SOURCE Future Doctor