A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%), performance deteriorates substantially for depression-anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions.

Technical Cooperation

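This study used functional connectivity markers of Alzheimer's disease as features and applied a non-smooth non-negative matrix factorization algorithm to investigate subtypes of functional connectivity impairment in Alzheimer's disease. Patients with Alzheimer's disease were reproducibly divided into four groups.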