Date
Publisher
arXiv
Evaluating open-ended written examination responses from students is an
essential yet time-intensive task for educators, requiring a high degree of
effort, consistency, and precision. Recent developments in Large Language
Models (LLMs) present a promising opportunity to balance the need for thorough
evaluation with efficient use of educators' time. In our study, we explore the
effectiveness of the LLMs ChatGPT-3.5, ChatGPT-4, Claude-3, and Mistral-Large in
assessing university students' open-ended answers to questions about
reference material they have studied. Each model was instructed to evaluate 54
answers repeatedly under two conditions: 10 times (10-shot) with a temperature
setting of 0.0 and 10 times with a temperature of 0.5, yielding a total of
1,080 evaluations per model and 4,320 evaluations across all models. A
Retrieval Augmented Generation (RAG) framework was used to have the LLMs
process the evaluation of the answers. As of spring 2024, our
analysis revealed notable variations in consistency and the grading outcomes
provided by the studied LLMs. There is a need to understand the strengths and
weaknesses of LLMs in educational settings for evaluating open-ended written
responses. Further comparative research is essential to determine the accuracy
and cost-effectiveness of using LLMs for educational assessments.
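The evaluation design described in the abstract reduces to a fixed grid: 54 answers, 10 repetitions, and two temperature settings per model, across four models. The sketch below illustrates that loop and the resulting evaluation counts; the function name grade_answer, the model-name strings, and the loop structure are illustrative assumptions, not the authors' actual RAG pipeline or any vendor API.

```python
# Minimal sketch of the repeated-evaluation design (assumed structure, not
# the authors' implementation).
from itertools import product

MODELS = ["ChatGPT-3.5", "ChatGPT-4", "Claude-3", "Mistral-Large"]
TEMPERATURES = [0.0, 0.5]
REPETITIONS = 10
NUM_ANSWERS = 54


def grade_answer(model: str, answer_id: int, temperature: float) -> float:
    """Placeholder for a RAG-backed LLM call: in the study, the prompt would
    include the student's answer together with retrieved passages from the
    reference material, and the model would return a grade."""
    return 0.0  # stub


def run_study() -> int:
    """Run every model over every (temperature, repetition, answer) combination
    and count the evaluations performed."""
    evaluations = 0
    for model in MODELS:
        for temperature, _rep, answer_id in product(
            TEMPERATURES, range(REPETITIONS), range(NUM_ANSWERS)
        ):
            grade_answer(model, answer_id, temperature)
            evaluations += 1
    return evaluations


if __name__ == "__main__":
    total = run_study()
    per_model = total // len(MODELS)
    # 54 answers x 10 repetitions x 2 temperatures = 1,080 per model; x 4 models = 4,320
    print(f"{per_model} evaluations per model, {total} in total")
```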
What is the application?
Who is the user?
What age?
Why use AI?
Study design
