Publisher: arXiv
We present SeePhys, a large-scale multimodal benchmark for LLM reasoning
grounded in physics questions ranging from middle school to PhD qualifying
exams. The benchmark covers 7 fundamental domains spanning the physics
discipline, incorporating 21 categories of highly heterogeneous diagrams. In
contrast to prior works where visual elements mainly serve auxiliary purposes,
our benchmark features a substantial proportion of vision-essential problems
(75%) that mandate visual information extraction for correct solutions. Through
extensive evaluation, we observe that even the most advanced visual reasoning
models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our
benchmark. These results reveal fundamental challenges in current large
language models' visual understanding capabilities, particularly in: (i)
establishing rigorous coupling between diagram interpretation and physics
reasoning, and (ii) overcoming their persistent reliance on textual cues as
cognitive shortcuts.
