Date
Publisher: arXiv
Large language models (LLMs) can act as evaluators, a role studied by methods
like LLM-as-a-Judge and fine-tuned judging LLMs. In the field of education,
LLMs have been studied as assistant tools for students and teachers. Our
research investigates LLM-driven automatic evaluation systems for academic
Text-Input Problems using rubrics. We propose five evaluation systems that have
been tested on a custom dataset of 110 answers about computer science from
higher education students with three models: JudgeLM, Llama-3.1-8B and
DeepSeek-R1-Distill-Llama-8B. The evaluation systems include: JudgeLM Evaluation, which uses the model's single-answer prompt to obtain a score; Reference Aided Evaluation, which provides a correct answer as a guide alongside the original context of the question; No Reference Evaluation, which omits the reference answer; Additive Evaluation, which uses atomic criteria; and Adaptive Evaluation, which uses criteria generated to fit each question. All evaluation methods have been compared with the results of a human
evaluator. Results show that the best method to automatically evaluate and
score Text-Input Problems using LLMs is Reference Aided Evaluation. With the
lowest median absolute deviation (0.945) and the lowest root mean square
deviation (1.214) when compared to human evaluation, Reference Aided Evaluation
offers fair scoring as well as insightful and complete evaluations. Other methods, such as Additive and Adaptive Evaluation, fail to provide good results for concise answers; No Reference Evaluation lacks the information needed to assess answers correctly; and JudgeLM Evaluation does not provide good results due to the model's limitations. As a result, we conclude that
Artificial Intelligence-driven automatic evaluation systems, aided with proper
methodologies, show potential to work as complementary tools to other academic
resources.
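A minimal sketch of how a Reference Aided Evaluation prompt could be assembled, assuming a 0–10 scoring scale and a generic judge model call; the function name, prompt wording, and `judge_model` handle are illustrative placeholders, not the paper's exact template.

```python
# Hypothetical Reference Aided Evaluation prompt builder (illustrative, not the paper's template).

def build_reference_aided_prompt(question: str, reference: str, answer: str, max_score: int = 10) -> str:
    """Assemble a judging prompt that gives the evaluator model the question,
    a correct reference answer, and the student's answer, asking for a rubric score."""
    return (
        "You are grading a student's answer to a computer science exam question.\n"
        f"Question: {question}\n"
        f"Reference answer (correct): {reference}\n"
        f"Student answer: {answer}\n"
        f"Give a score from 0 to {max_score} and a short justification.\n"
        "Respond as:\nScore: <number>\nJustification: <text>"
    )

# Example usage (placeholder content; the model call is a stand-in for an LLM such as
# Llama-3.1-8B or DeepSeek-R1-Distill-Llama-8B):
prompt = build_reference_aided_prompt(
    question="Explain the difference between a process and a thread.",
    reference="A process has its own address space, while threads share the address space of their parent process.",
    answer="Threads are lightweight and share memory; processes are isolated.",
)
# score_text = judge_model.generate(prompt)  # hypothetical judge model interface
```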
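The agreement metrics reported in the abstract can be reproduced as follows, assuming "median absolute deviation" means the median of per-answer absolute differences between LLM and human scores (this interpretation is an assumption; the paper defines the exact computation).

```python
import numpy as np

def agreement_metrics(llm_scores, human_scores):
    """Median absolute deviation and root mean square deviation between
    LLM-assigned and human-assigned scores, taken over per-answer differences."""
    llm = np.asarray(llm_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    diff = llm - human
    mad = float(np.median(np.abs(diff)))       # median absolute deviation from human scores
    rmsd = float(np.sqrt(np.mean(diff ** 2)))  # root mean square deviation
    return mad, rmsd

# Toy example (not the paper's data):
mad, rmsd = agreement_metrics([7.0, 8.5, 6.0], [8.0, 8.0, 5.0])
```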
What is the application?
Who is the user?
What age?
Why use AI?
Study design
