We introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and waves & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning: GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy, respectively, a performance gap exceeding 29% compared to human experts.
To further distinguish PhyX from existing benchmarks, we elaborate on the benchmark details in the figure below. In terms of realism, prior benchmarks rely heavily on abstract, schematic line drawings, whereas our benchmark uses realistic visual scenarios and removes textual redundancy.
Comparison with existing physics benchmarks.
Sampled PhyX examples from each discipline.
Key statistics of the PhyX benchmark.
Distribution of question domains in the PhyX dataset.
You can download our data directly from Hugging Face Datasets. For guidance on how to access and use the data, please consult the instructions in our GitHub repository.
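As a quick start, the sketch below shows how the data could be loaded with the Hugging Face `datasets` library. The repository ID, split name, and field names are placeholders (assumptions); please refer to our GitHub instructions for the exact values.

```python
# Minimal sketch of loading PhyX via the Hugging Face `datasets` library.
# The repository ID and split name are assumptions -- see our GitHub
# instructions for the exact ones.
from datasets import load_dataset

phyx = load_dataset("your-org/PhyX", split="test")  # hypothetical repo ID and split

sample = phyx[0]
print(sample.keys())  # inspect the available fields (question, image, answer, ...)
```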
We evaluate a variety of models, including LLMs and LMMs, and for each type we consider both closed- and open-source models. Our evaluation is conducted in a zero-shot setting to assess models' ability to generate accurate answers without fine-tuning or few-shot demonstrations on our benchmark.
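To illustrate the protocol, the sketch below shows one way a zero-shot pass over the multiple-choice setting could look: a fixed instruction, the raw question, and a simple answer-extraction rule. The `query_model` callable, the field names, and the letter-extraction regex are illustrative assumptions, not our released evaluation code.

```python
import re

# A fixed zero-shot instruction: no few-shot demonstrations are prepended.
PROMPT = (
    "Answer the following physics question. "
    "Reply with the letter of the correct option only.\n\n"
    "{question}\n{options}"
)

def evaluate_zero_shot(samples, query_model):
    """Score a model on multiple-choice items without any fine-tuning.

    `query_model(prompt, image=...)` stands in for an arbitrary LLM/LMM API;
    the field names (question/options/image/answer) are assumptions.
    """
    correct = 0
    for s in samples:
        prompt = PROMPT.format(question=s["question"], options=s["options"])
        reply = query_model(prompt, image=s.get("image"))
        match = re.search(r"\b([A-E])\b", reply)  # pull out the predicted letter
        pred = match.group(1) if match else None
        correct += int(pred == s["answer"])
    return correct / len(samples)  # zero-shot accuracy
```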
Click on PhyX(OE), PhyX(MC), or PhyX(domain) to expand detailed results.
| Name | Size | Date | PhyX(OE) Text-DeRedundancy | PhyX(OE) Full-Text | PhyX(OE) Text-Minimal | PhyX(MC) Text-DeRedundancy | PhyX(MC) Full-Text | PhyX(MC) Text-Minimal | PhyX(domain) Overall | Mechanics | Electromagnetism | Thermodynamics | Waves & Acoustics | Optics | Modern Physics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Overall results of different models on the PhyX leaderboard. The best-performing model in each category is in bold, and the second best is underlined. *: results provided by the authors.
@misc{shen2025phyxdoesmodelwits,
title={PhyX: Does Your Model Have the "Wits" for Physical Reasoning?},
author={Hui Shen and Taiqiang Wu and Qi Han and Yunta Hsieh and Jizhou Wang and Yuyue Zhang and Yuxin Cheng and Zijian Hao and Yuansheng Ni and Xin Wang and Zhongwei Wan and Kai Zhang and Wendong Xu and Jing Xiong and Ping Luo and Wenhu Chen and Chaofan Tao and Zhuoqing Mao and Ngai Wong},
year={2025},
eprint={2505.15929},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2505.15929},
}