Date:
Publisher: arXiv
Automated scoring plays a crucial role in education by reducing the reliance
on human raters, offering scalable and immediate evaluation of student work.
While large language models (LLMs) have shown strong potential in this task,
their use as end-to-end raters faces challenges such as low accuracy, prompt
sensitivity, limited interpretability, and rubric misalignment. These issues
hinder the implementation of LLM-based automated scoring in assessment
practice. To address these limitations, we propose AutoSCORE, a multi-agent LLM
framework that enhances automated scoring via rubric-aligned Structured COmponent
REcognition. AutoSCORE employs two agents: a Scoring Rubric Component Extraction
Agent that extracts rubric-relevant components from student responses and encodes
them into a structured representation, and a Scoring Agent that uses this
representation to assign the final score. This design ensures that
model reasoning follows a human-like grading process, enhancing
interpretability and robustness. We evaluate AutoSCORE on four datasets from the
ASAP benchmark, using both proprietary and open-source LLMs (GPT-4o,
LLaMA-3.1-8B, and LLaMA-3.1-70B). Across diverse tasks and rubrics, AutoSCORE
consistently improves scoring accuracy and human-machine agreement (QWK,
correlations) and reduces scoring error (MAE, RMSE) compared to single-agent
baselines, with particularly strong benefits on complex, multi-dimensional
rubrics and especially large relative gains for smaller LLMs. These results
demonstrate that structured component recognition combined with multi-agent
design offers a scalable, reliable, and interpretable solution for automated
scoring.
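The two-agent pipeline described in the abstract (component extraction followed by scoring) can be pictured as a short extraction-then-scoring chain. The sketch below is illustrative only: it assumes a generic chat-completion API (OpenAI's Python client), a JSON hand-off between the agents, and hypothetical function names (extract_components, assign_score, autoscore) and prompts that are not taken from the paper.

```python
# Minimal sketch of a two-agent extraction-then-scoring pipeline, inspired by the
# AutoSCORE description above. Function names, prompts, and the JSON schema are
# illustrative assumptions, not the authors' implementation.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def call_llm(prompt: str, model: str = "gpt-4o") -> str:
    """Send a single prompt to the chosen LLM and return its text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def extract_components(response_text: str, rubric: str) -> dict:
    """Agent 1 (component extraction): identify rubric-relevant components in the
    student response and return them as a structured (JSON) representation."""
    prompt = (
        "Given the scoring rubric below, extract the rubric-relevant components "
        "from the student response and return them as a JSON object with one key "
        "per rubric dimension.\n\n"
        f"Rubric:\n{rubric}\n\nStudent response:\n{response_text}"
    )
    # In practice the reply would need validation; json.loads assumes clean JSON.
    return json.loads(call_llm(prompt))


def assign_score(components: dict, rubric: str) -> int:
    """Agent 2 (scoring): map the structured components to a final score."""
    prompt = (
        "Using the rubric and the extracted components below, assign a final "
        "integer score. Reply with the score only.\n\n"
        f"Rubric:\n{rubric}\n\nExtracted components:\n{json.dumps(components)}"
    )
    # Assumes the model replies with a bare integer; real code should parse defensively.
    return int(call_llm(prompt).strip())


def autoscore(response_text: str, rubric: str) -> int:
    """Full pipeline: structured extraction first, then scoring on that output."""
    components = extract_components(response_text, rubric)
    return assign_score(components, rubric)
```

For the agreement metrics mentioned in the abstract, quadratic weighted kappa between human and machine scores can be computed with scikit-learn's cohen_kappa_score(human_scores, machine_scores, weights="quadratic").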
