Benchmark

LingxiDiagBench

A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Extensive experiments across state-of-the-art LLMs yield three key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially on depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality, as assessed by LLM-as-a-Judge, shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions.
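
The findings above rest on two quantities: diagnostic accuracy, and the correlation between judge-rated consultation quality and diagnostic correctness. The Python sketch below shows one way such quantities could be computed. It is an illustration under assumptions, not the benchmark's own code: the function names, the ICD-10 codes used as toy labels, and the judge scores are all hypothetical, and LingxiDiagBench may define its metrics differently.

```python
from statistics import mean


def accuracy(gold, pred):
    """Fraction of cases where the predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return mean(1.0 if g == p else 0.0 for g, p in zip(gold, pred))


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    With a binary sequence (correct / incorrect) on one side this is the
    point-biserial correlation.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data: gold and predicted diagnoses for five dialogues.
gold = ["F32", "F41", "F32", "F41", "F32"]
pred = ["F32", "F41", "F41", "F41", "F20"]

# Hypothetical judge scores (higher = better consultation) per dialogue.
judge_scores = [4.5, 4.0, 4.2, 3.0, 2.5]

# 1 where the diagnosis was correct, 0 otherwise.
correct = [1 if g == p else 0 for g, p in zip(gold, pred)]

acc = accuracy(gold, pred)
r = pearson(judge_scores, correct)
print(f"accuracy = {acc:.2f}, judge/accuracy correlation = {r:.2f}")
```

A correlation well below 1 on real data would correspond to finding (3): a dialogue can score highly on questioning quality and still end in a wrong diagnosis.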
