Date:
Publisher: arXiv
Large language models (LLMs) have demonstrated strong potential in performing
automatic scoring for constructed-response assessments. While human scoring of
constructed responses is usually based on given grading rubrics, the methods by
which LLMs assign scores remain largely unclear. It is also
uncertain how closely AI's scoring process mirrors that of humans or if it
adheres to the same grading criteria. To address this gap, this paper uncovers
the grading rubrics that LLMs use to score students' written responses to
science tasks and examines their alignment with human grading. We also
investigate whether enhancing this alignment can improve scoring accuracy. Specifically, we prompt
LLMs to generate analytic rubrics that they use to assign scores and study the
alignment gap with human grading rubrics. Based on a series of experiments with
various configurations of LLM settings, we reveal a notable alignment gap
between human and LLM graders. While LLMs can adapt quickly to scoring tasks,
they often resort to shortcuts, bypassing deeper logical reasoning expected in
human grading. We find that incorporating high-quality analytic rubrics
designed to reflect human grading logic can mitigate this gap and enhance LLMs'
scoring accuracy. These results underscore the need for a nuanced approach when
applying LLMs in science education and highlight the importance of aligning LLM
outputs with human expectations to ensure efficient and accurate automatic
scoring.
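The rubric-elicitation and rubric-guided scoring steps described in the abstract can be illustrated with a minimal sketch, assuming an OpenAI-style chat-completions API; the model name, prompt wording, and helper functions below are illustrative assumptions, not the study's actual experimental protocol.

from openai import OpenAI

# Sketch only: prompts, model name, and rubric handling are illustrative
# assumptions, not the paper's actual configuration.
client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"      # placeholder model identifier

def elicit_llm_rubric(task: str) -> str:
    """Ask the model to state the analytic rubric it would apply to a task."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                "You will score student written responses to the science task below. "
                "Before scoring anything, write out the analytic rubric you will use, "
                "listing each criterion and its point value.\n\nTask: " + task
            ),
        }],
    )
    return resp.choices[0].message.content

def score_with_rubric(task: str, student_response: str, rubric: str) -> str:
    """Score one response with an explicit rubric (e.g., a human-designed one) in the prompt."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                "Task: " + task + "\n\nRubric:\n" + rubric +
                "\n\nStudent response: " + student_response +
                "\n\nAssign a score strictly according to the rubric and justify it briefly."
            ),
        }],
    )
    return resp.choices[0].message.content

Comparing the rubric returned by elicit_llm_rubric with a human-authored rubric, and the outputs of score_with_rubric with human-assigned scores, is one way to probe the alignment gap the abstract describes.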
What is the application?
Who is the user?
What age?
Why use AI?
Study design