Date
Publisher
arXiv
Large Language Models (LLMs) are increasingly utilized in AI-driven
educational instruction and assessment, particularly within mathematics
education. The capability of LLMs to generate accurate answers and detailed
solutions for math problem-solving tasks is foundational for ensuring reliable
and precise feedback and assessment in math education practices. Our study
focuses on evaluating the accuracy of four LLMs (OpenAI GPT-4o and o1,
DeepSeek-V3 and DeepSeek-R1) in solving three categories of math tasks, namely
arithmetic, algebra, and number theory, and identifies step-level reasoning
errors within their solutions. Instead of relying on standard benchmarks, we
intentionally constructed math tasks (via item models) that are challenging for LLMs
and prone to errors. The accuracy of final answers and the presence of errors
in individual solution steps were systematically analyzed and coded. Both
single-agent and dual-agent configurations were tested. We observed that the
reasoning-enhanced OpenAI o1 model consistently achieved higher, and often
near-perfect, accuracy across all three math task categories. Analysis of errors
revealed that procedural slips were the most frequent and significantly
impacted overall performance, while conceptual misunderstandings were less
frequent. Deploying dual-agent configurations substantially improved overall
performance. These findings offer actionable insights into enhancing LLM
performance and underscore effective strategies for integrating LLMs into
mathematics education, thereby advancing AI-driven instructional practices and
assessment precision.
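
The dual-agent configuration mentioned in the abstract can be pictured as a solver model whose step-by-step solution is checked by a second reviewer model before the final answer is scored. The Python sketch below is only an illustration under assumed interfaces, not the authors' pipeline: call_llm, SOLVER_MODEL, REVIEWER_MODEL, and the prompt wording are hypothetical placeholders for whatever provider API the study actually used, and the answer check shown is a naive stand-in for the paper's systematic coding of final-answer accuracy and step-level errors.

# Minimal sketch of a dual-agent solve-then-review loop (hypothetical; the
# study's actual prompts, models, and error-coding procedure are not shown).

def call_llm(model: str, prompt: str) -> str:
    """Placeholder for a real chat-completion call (e.g., an OpenAI or DeepSeek SDK)."""
    raise NotImplementedError("Wire this to the provider SDK you are using.")

SOLVER_MODEL = "solver-model-name"      # hypothetical identifier for agent 1
REVIEWER_MODEL = "reviewer-model-name"  # hypothetical identifier for agent 2

def solve_and_review(problem: str, answer_key: str) -> dict:
    # Agent 1: produce a step-by-step solution ending with a final answer.
    solution = call_llm(
        SOLVER_MODEL,
        "Solve the following problem step by step, then state the final answer.\n"
        + problem,
    )
    # Agent 2: check each step, flag procedural slips or conceptual
    # misunderstandings, and give a corrected final answer if needed.
    review = call_llm(
        REVIEWER_MODEL,
        "Check each step of this solution, list any procedural or conceptual "
        "errors, then state the corrected final answer.\n" + solution,
    )
    # Naive answer check; the study coded accuracy and step-level errors
    # systematically, which this simple string match does not replace.
    correct = answer_key.strip() in review
    return {"solution": solution, "review": review, "correct": correct}
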
What is the application?
Who is the user?
Why use AI?
Study design
