Proceedings chapter
OA Policy
English

MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

Presented at Suzhou (China), November 4–9, 2025
Published in Christodoulopoulos, C., Chakraborty, T., Rose, C. & Peng, V. (Ed.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, p. 1513–1532
Publisher Kerrville, TX: Association for Computational Linguistics
Publication date 2025-11
First online date 2025-11
Abstract

Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-lingual reasoning abilities. This dual limitation makes it difficult to comprehensively assess LLM performance in multilingual settings. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built on the English MMLU-Pro benchmark. Each language version consists of the same 11,829 questions, enabling direct cross-lingual comparisons. To support efficient evaluation, we additionally provide a lite version containing 658 questions per language. To ensure the high quality of MMLU-ProX, we employ a rigorous development process in which multiple powerful LLMs perform the translation, followed by expert review for accurate expression, consistent terminology, and cultural relevance. Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized models. The results reveal significant disparities in the multilingual capabilities of LLMs: while they perform well in high-resource languages, their performance declines markedly in low-resource languages, particularly African languages. Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.
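Because every language version contains the same 11,829 questions, cross-lingual comparison reduces to scoring each language split against a shared answer key. Below is a minimal sketch of that protocol; the record layout ("question_id", "question", "options", "answer") and the predict() stub are illustrative assumptions, not the dataset's actual schema or any model's real API.

```python
# Minimal sketch of per-language scoring on a parallel benchmark.
# Field names and predict() are hypothetical stand-ins.

def predict(question: str, options: list[str]) -> int:
    """Stand-in for a real LLM call; returns the index of the chosen option."""
    return 0  # placeholder: always picks the first option

def evaluate(splits: dict[str, list[dict]]) -> dict[str, float]:
    """Score each language split of a parallel multiple-choice benchmark.

    splits maps a language code to records that share the same
    question_ids across languages, so the resulting per-language
    accuracies are directly comparable.
    """
    return {
        lang: sum(
            predict(r["question"], r["options"]) == r["answer"]
            for r in records
        ) / len(records)
        for lang, records in splits.items()
    }

# Toy example: the same item appears in English and Swahili, so any
# accuracy gap is attributable to language rather than item difficulty.
splits = {
    "en": [{"question_id": 1, "question": "What is 2 + 2?",
            "options": ["4", "5"], "answer": 0}],
    "sw": [{"question_id": 1, "question": "2 + 2 ni ngapi?",
            "options": ["4", "5"], "answer": 0}],
}
print(evaluate(splits))  # e.g. {'en': 1.0, 'sw': 1.0}
```

On identical question sets, this design isolates the language variable, which is what allows the paper to attribute the reported drop on low-resource languages to multilingual capability rather than to differing item pools.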

Citation (ISO format)
XUAN, Weihao et al. MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Christodoulopoulos, C., Chakraborty, T., Rose, C. & Peng, V. (Ed.). Suzhou (China). Kerrville, TX: Association for Computational Linguistics, 2025. p. 1513–1532. doi: 10.18653/v1/2025.emnlp-main.79
Main files (1)
Proceedings chapter (Published version)
Identifiers
Additional URL for this publication: https://aclanthology.org/2025.emnlp-main.79/
ISBN 979-8-89176-332-6

Technical information

Creation 28/02/2026 09:24:18
First validation 02/03/2026 09:19:00
Update time 02/03/2026 09:19:00
Status update 02/03/2026 09:19:00
Last indexation 02/03/2026 09:19:01
All rights reserved by Archive ouverte UNIGE and the University of Geneva