Date:
Publisher: arXiv
The use of LLM tutors to provide students with automated educational feedback
on their assignment submissions has received much attention in the AI in
Education field. However, the stochastic nature of LLMs and their tendency to
hallucinate can undermine both the quality of the learning experience and
adherence to ethical standards. To address this concern, we propose a method
that uses LLM feedback evaluators (DeanLLMs) to automatically and
comprehensively evaluate feedback generated by an LLM tutor on university
assignment submissions before it is delivered to students. This allows
low-quality feedback to be rejected and enables LLM tutors to improve the
feedback they generate based on
the evaluation results. We first proposed a comprehensive evaluation framework
for LLM-generated educational feedback, comprising six dimensions for feedback
content, seven for feedback effectiveness, and three for hallucination types.
Next, we generated a virtual assignment submission dataset covering 85
university assignments from 43 computer science courses using eight commonly
used commercial LLMs. We labelled and open-sourced the assignment dataset to
support the fine-tuning and evaluation of LLM feedback evaluators. Our findings
show that o3-pro achieved the best performance in zero-shot labelling of
feedback, while o4-mini achieved the best performance in few-shot labelling.
Moreover, GPT-4.1 reached human-expert-level performance after fine-tuning
(accuracy 79.8%, F1-score 79.4%; human average accuracy 78.3%, F1-score 82.6%).
Finally, we used our best-performing model to evaluate 2,000 assignment
feedback instances generated by 10 common commercial LLMs (200 per model) to
compare the quality of feedback generated by different LLMs. Our LLM
feedback evaluator method advances our ability to automatically provide
high-quality and reliable educational feedback to students.
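
The abstract describes an evaluate-then-deliver loop: an LLM tutor drafts feedback, a DeanLLM evaluator scores it against the framework's dimensions and hallucination types, and low-quality feedback is revised or withheld rather than delivered. The sketch below illustrates that control flow only; the dimension labels, threshold, revision cap, and the call_llm helper are hypothetical placeholders, not the authors' implementation or prompts.

# Minimal sketch of the evaluate-then-deliver loop described in the abstract.
# All names (call_llm, QUALITY_THRESHOLD, dimension labels, etc.) are assumed.

from dataclasses import dataclass, field

# The paper's framework: six dimensions for feedback content, seven for
# feedback effectiveness, three hallucination types (labels are placeholders).
CONTENT_DIMENSIONS = [f"content_{i}" for i in range(1, 7)]
EFFECTIVENESS_DIMENSIONS = [f"effectiveness_{i}" for i in range(1, 8)]
HALLUCINATION_TYPES = [f"hallucination_{i}" for i in range(1, 4)]

QUALITY_THRESHOLD = 0.7   # assumed acceptance threshold
MAX_REVISIONS = 3         # assumed cap on regeneration attempts


@dataclass
class Evaluation:
    scores: dict = field(default_factory=dict)          # dimension -> score in [0, 1]
    hallucinations: list = field(default_factory=list)  # detected hallucination types

    @property
    def acceptable(self) -> bool:
        # Reject feedback that contains any hallucination or scores below the
        # threshold on any dimension.
        return not self.hallucinations and all(
            s >= QUALITY_THRESHOLD for s in self.scores.values()
        )


def call_llm(prompt: str) -> str:
    # Stand-in for a call to a tutor or evaluator LLM API.
    return "placeholder response"


def evaluate_feedback(submission: str, feedback: str) -> Evaluation:
    # DeanLLM step: in the paper this would be a fine-tuned evaluator
    # (e.g. GPT-4.1) returning structured scores; dummy values keep the
    # sketch self-contained and runnable.
    _ = call_llm(f"Evaluate this feedback:\n{submission}\n{feedback}")
    scores = {d: 1.0 for d in CONTENT_DIMENSIONS + EFFECTIVENESS_DIMENSIONS}
    return Evaluation(scores=scores, hallucinations=[])


def tutor_feedback_with_review(submission: str) -> str | None:
    # Generate tutor feedback, but deliver it only if the evaluator accepts it.
    feedback = call_llm(f"Give feedback on this submission:\n{submission}")
    for _ in range(MAX_REVISIONS):
        evaluation = evaluate_feedback(submission, feedback)
        if evaluation.acceptable:
            return feedback  # deliver to the student
        # Otherwise ask the tutor LLM to revise using the evaluation results.
        feedback = call_llm(
            f"Revise this feedback given these issues:\n"
            f"{evaluation.scores}\n{evaluation.hallucinations}\n{feedback}"
        )
    return None  # persistently low-quality feedback is withheld


if __name__ == "__main__":
    print(tutor_feedback_with_review("def add(a, b): return a - b"))

Rejected feedback is routed back to the tutor LLM with the evaluation results rather than delivered, matching the abstract's description of rejecting low-quality feedback and letting the tutor improve it.
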
What is the application?
Who is the user?
What is the user's age?
Why use AI?
Study design
