Publisher: arXiv
Large Multimodal Models (LMMs) have achieved remarkable progress in integrating
vision and language, enabling strong performance across perception, reasoning,
and domain-specific tasks. However, their capacity to reason over multiple,
visually similar inputs remains insufficiently explored. Such fine-grained
comparative reasoning is central to real-world tasks, especially in mathematics
and education, where learners must often distinguish between nearly identical
diagrams to identify correct solutions. To address this gap, we present
VisioMath, a curated benchmark of 1,800 high-quality K-12 mathematics problems
in which all candidate answers are diagrams that differ only in subtle visual details. A
comprehensive evaluation of state-of-the-art LMMs, covering both leading
closed-source systems and widely adopted open-source models, reveals a
consistent decline in accuracy as inter-image similarity increases. Analysis
indicates that the dominant failure mode stems from image-text misalignment:
rather than grounding reasoning in textual cues, models often resort to shallow
positional heuristics, resulting in systematic errors. We further explore three
alignment-oriented strategies, spanning training-free approaches and
fine-tuning, and achieve substantial accuracy gains. We hope that VisioMath will
serve as a rigorous benchmark and catalyst for developing LMMs toward deeper
diagram understanding, precise comparative reasoning, and grounded
multi-image-text integration.
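The failure mode reported above, models keying on answer position rather than diagram content, can be probed with a simple permutation test: re-ask the same question with the candidate diagrams shuffled and check whether the prediction tracks the content or the slot. The sketch below illustrates the idea only; the item schema, the ask_model stub, and the shuffling protocol are illustrative assumptions, not the paper's evaluation code.

```python
import random
from collections import Counter

# Hypothetical item layout: a question, four candidate diagrams (paths),
# and the index of the correct one. Not the actual VisioMath schema.
item = {
    "question": "Which graph shows y = |x - 1|?",
    "diagrams": ["opt_a.png", "opt_b.png", "opt_c.png", "opt_d.png"],
    "answer": 2,  # index into diagrams
}

def ask_model(question, diagrams):
    """Placeholder for a real LMM call; returns a predicted index.

    This stub mimics a pure positional heuristic: it always picks
    the first option, regardless of the diagrams shown.
    """
    return 0

def shuffled_accuracy(item, trials=20, seed=0):
    """Re-ask the same item with the candidate diagrams permuted.

    A model that grounds its choice in diagram content should be stable
    under permutation; a model relying on answer position will not be.
    """
    rng = random.Random(seed)
    correct, position_picks = 0, Counter()
    for _ in range(trials):
        order = list(range(len(item["diagrams"])))
        rng.shuffle(order)
        shuffled = [item["diagrams"][i] for i in order]
        pred = ask_model(item["question"], shuffled)
        position_picks[pred] += 1
        # order[pred] maps the chosen slot back to the original diagram.
        if order[pred] == item["answer"]:
            correct += 1
    return correct / trials, position_picks

acc, picks = shuffled_accuracy(item)
print(f"accuracy under shuffling: {acc:.2f}; position picks: {dict(picks)}")
```

For the positional stub, accuracy under shuffling collapses toward chance (about 0.25 with four options) while the position histogram concentrates on a single slot; a content-grounded model would instead keep its accuracy roughly constant across permutations.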
