Date:
Publisher: arXiv
Recent advances in multimodal large language models (MLLMs) raise the
question of their potential for grading, analyzing, and offering feedback on
handwritten student classwork. This capability would be particularly beneficial
in elementary and middle-school mathematics education, where most work remains
handwritten, because students' full written working of a problem provides valuable insight into their learning processes but is extremely time-consuming to grade. We present two experiments investigating MLLM
performance on handwritten student mathematics classwork. Experiment A examines
288 handwritten responses from Ghanaian middle school students solving
arithmetic problems with objective answers. In this context, models achieved
near-human accuracy (95%, κ = 0.90) but exhibited occasional errors that human
educators would be unlikely to make. Experiment B evaluates 150 mathematical
illustrations from American elementary students, where the drawings are the
answer to the question. These tasks lack single objective answers, and analyzing and evaluating them requires sophisticated visual interpretation as well as pedagogical judgment. We attempted to separate MLLMs' visual capabilities
from their pedagogical abilities by first asking them to grade the student
illustrations directly, and then by augmenting the image with a detailed human
description of the illustration. We found that when the models had to analyze
the student illustrations directly, they struggled, achieving agreement of only κ = 0.20 with ground-truth scores, but when given human descriptions, their agreement improved dramatically to κ = 0.47, which was in line with human-to-human
agreement levels. This gap suggests MLLMs can "see" and interpret arithmetic
work relatively well, but still struggle to "see" student mathematical
illustrations.
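
The agreement figures above are Cohen's kappa. As a minimal, hypothetical sketch (not taken from the paper) of how such model-versus-ground-truth agreement could be computed, assuming scikit-learn is available and using placeholder binary grades rather than the study's data:

# Hypothetical example: Cohen's kappa between ground-truth grades and model grades.
# The grade lists are placeholders, not data from the study.
from sklearn.metrics import cohen_kappa_score

ground_truth = [1, 0, 1, 1, 0, 1, 0, 1]  # 1 = acceptable, 0 = not acceptable
model_grades = [1, 0, 1, 0, 0, 1, 0, 1]

kappa = cohen_kappa_score(ground_truth, model_grades)
print(f"Cohen's kappa: {kappa:.2f}")

Kappa corrects raw percent agreement for agreement expected by chance, which is why the abstract reports both 95% accuracy and κ = 0.90 for Experiment A.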
What is the application?
Who is the user?
What age is the user?
Why use AI?
Study design
