Date
Publisher
arXiv
Large Language Models (LLMs) have been applied to Math Word Problems (MWPs)
with transformative impacts, revolutionizing how these complex problems are
approached and solved in various domains including educational settings.
However, the evaluation of these models often prioritizes final accuracy,
overlooking the crucial aspect of reasoning capabilities. This work addresses
this gap by focusing on the ability of LLMs to detect and correct reasoning
mistakes. We introduce a novel dataset, MWP-MISTAKE, incorporating MWPs with
both correct and incorrect reasoning steps generated through rule-based methods
and smaller language models. Our comprehensive benchmarking reveals significant
insights into the strengths and weaknesses of state-of-the-art models, such as
GPT-4o, GPT-4, GPT-3.5 Turbo, and others. We highlight GPT-4o's superior
performance in mistake detection and rectification and the persistent
challenges faced by smaller models. Additionally, we identify issues related to
data contamination and memorization, impacting the reliability of LLMs in
real-world applications. Our findings emphasize the importance of rigorous
evaluation of reasoning processes and propose future directions to enhance the
generalization and robustness of LLMs in mathematical problem-solving.
What is the application?
Why use AI?
Study design
