Evaluating LLMs for Automated Scoring in Formative Assessments

Authors
Pedro C. Mendona,
Filipe Quintal,
F‡bio Mendona
Date
Publisher
Applied Sciences
The increasing complexity and scale of modern education have revealed the shortcomings of traditional grading methods in providing consistent and scalable assessments. Advancements in artificial intelligence have positioned Large Language Models (LLMs) as robust solutions for automating grading tasks. This study systematically compared the grading performance of an open-source LLM (LLaMA 3.2) and a premium LLM (OpenAI GPT-4o) against human evaluators across diverse question types in the context of a computer programming subject. Using detailed rubrics, the study assessed the alignment between LLM-generated and human-assigned grades. Results revealed that while both LLMs align closely with human grading, equivalence testing demonstrated that the premium LLM achieves statistically and practically similar grading patterns, particularly for code-based questions, suggesting its potential as a reliable tool for educational assessments. These findings underscore the ability of LLMs to enhance grading consistency, reduce educator workload, and address scalability challenges in programming-focused assessments.
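The abstract reports that equivalence testing was used to judge whether LLM-generated grades are statistically and practically similar to human-assigned grades. As an illustration only (the paper's exact statistical procedure is not given here), the sketch below shows a paired two one-sided tests (TOST) equivalence check in Python; the grade arrays, equivalence margin, and significance threshold are hypothetical.

import numpy as np
from scipy import stats

# Hypothetical paired grades on a 0-10 scale; not data from the study.
human = np.array([8.0, 6.5, 9.0, 7.0, 5.5, 8.5, 7.5, 6.0])   # human evaluator
llm = np.array([7.5, 6.5, 9.0, 7.5, 5.0, 8.0, 7.5, 6.5])     # LLM-assigned grades
margin = 1.0  # assumed equivalence margin in grade points

diff = llm - human
# Two one-sided tests: equivalence holds if the mean difference lies inside (-margin, +margin).
p_lower = stats.ttest_1samp(diff, -margin, alternative="greater").pvalue
p_upper = stats.ttest_1samp(diff, margin, alternative="less").pvalue
p_tost = max(p_lower, p_upper)  # both one-sided nulls must be rejected

print(f"TOST p-value: {p_tost:.3f}")
print("Grades statistically equivalent" if p_tost < 0.05 else "Equivalence not demonstrated")

A low TOST p-value would support the study's conclusion that LLM grading patterns fall within a practically negligible distance of human grading; the choice of margin is a domain judgment, not a statistical one.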
What is the application?
Who is the user?
What age?
Why use AI?