Large Language Models (LLMs) have tremendous potential to play a key role in
supporting mathematical reasoning, with growing use in education and AI
research. However, most existing benchmarks are limited to English, creating a
significant gap for low-resource languages. For example, Bangla is spoken by
nearly 250 million people who would collectively benefit from LLMs capable of
native fluency. To address this, we present BanglaMATH, a dataset of 1.7k
Bangla math word problems across topics such as Arithmetic, Algebra, Geometry,
and Logical Reasoning, sourced from Bangla elementary school workbooks and
annotated with metadata such as grade level and number of reasoning steps. We have
designed BanglaMATH to evaluate the mathematical capabilities of both
commercial and open-source LLMs in Bangla, and we find that Gemini 2.5 Flash
and DeepSeek V3 are the only models to achieve strong performance, with $\ge$
80\% accuracy across three elementary school grades. Furthermore, we assess the
robustness and language bias of these top-performing LLMs by augmenting the
original problems with distracting information and by translating the problems
into English. We show that neither LLM remains robust under these
perturbations, and both perform significantly worse in Bangla than in English.
Our study underlines current
limitations of LLMs in handling arithmetic and mathematical reasoning in
low-resource languages, and highlights the need for further research on
multilingual and equitable mathematical understanding. Dataset link:
\href{https://github.com/TabiaTanzin/BanglaMATH-A-Bangla-benchmark-dataset-for-testing-LLM-mathematical-reasoning-at-grades-6-7-and-8.git}{BanglaMATH GitHub repository}
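
The evaluation described above amounts to prompting a model on each word problem and scoring exact-match accuracy on the final answer. The following is a minimal sketch of such a loop, assuming a JSONL file of question/answer records; the file name, the field names, and the ask_model placeholder are illustrative assumptions, not part of the released dataset or the paper's exact protocol.

import json

def ask_model(prompt: str) -> str:
    """Placeholder for a call to an LLM (e.g., a commercial or open-source API)."""
    raise NotImplementedError

def extract_final_answer(response: str) -> str:
    """Naive answer extraction: take the last whitespace-separated token."""
    tokens = response.strip().split()
    return tokens[-1] if tokens else ""

def evaluate(path: str = "banglamath.jsonl") -> float:
    """Score exact-match accuracy over {question, answer} records (assumed schema)."""
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            response = ask_model(item["question"])
            correct += extract_final_answer(response) == str(item["answer"])
            total += 1
    return correct / total if total else 0.0

The same loop would cover the robustness and bias probes by swapping in the distractor-augmented problems or their English translations as the question field.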
