Date
Publisher
arXiv
Research to improve Automated Short Answer Grading (ASAG) has recently focused on
Large Language Models (LLMs) with prompt engineering and zero- or few-shot
prompting to achieve the best results. This contrasts with the fine-tuning
approach, which has historically required large-scale compute clusters
inaccessible to most users. New closed-model approaches such as OpenAI's
fine-tuning service promise results with as few as 100 examples, while
open-weight methods such as quantized low-rank adaptation (QLoRA) can be used to
fine-tune models on consumer GPUs. We evaluate both of these fine-tuning
methods, measuring their interaction with few-shot prompting for ASAG with
structured (JSON) outputs. Our results show that fine-tuning with small amounts
of data has limited utility for Llama open-weight models, but that fine-tuning
can outperform few-shot prompting of baseline instruction-tuned LLMs for
OpenAI's closed models. While our evaluation set is limited, we find some
evidence that the observed benefits of fine-tuning may depend on the domain
subject matter. Lastly, we observed dramatic improvement with the Llama 3.1
8B-Instruct open-weight model by seeding the initial training examples with a
significant amount of cheaply generated synthetic training data.
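As context for the few-shot prompting baseline the abstract describes, below is a minimal sketch of few-shot ASAG prompting with structured (JSON) outputs, assuming the OpenAI Python SDK. The model name, rubric, and few-shot example here are illustrative placeholders, not the paper's actual prompts or data.

```python
# Minimal sketch of few-shot ASAG prompting with structured (JSON) output.
# The model name, rubric, and examples are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = (
    "You are a short-answer grader. Given a question, a reference answer, "
    'and a student answer, return JSON: {"score": 0-3, "rationale": "..."}.'
)

# Few-shot examples: each user/assistant pair demonstrates the expected JSON.
FEW_SHOT = [
    {"role": "user", "content": json.dumps({
        "question": "What does photosynthesis produce?",
        "reference": "Glucose and oxygen.",
        "student": "Oxygen and sugar.",
    })},
    {"role": "assistant", "content": json.dumps(
        {"score": 3, "rationale": "Both products correctly identified."}
    )},
]

def grade(question: str, reference: str, student: str) -> dict:
    """Grade one student answer; returns the parsed JSON verdict."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper's model may differ
        response_format={"type": "json_object"},  # force valid JSON output
        messages=[{"role": "system", "content": SYSTEM}]
        + FEW_SHOT
        + [{"role": "user", "content": json.dumps({
            "question": question,
            "reference": reference,
            "student": student,
        })}],
    )
    return json.loads(resp.choices[0].message.content)

print(grade("Name a noble gas.", "Helium, neon, argon, etc.", "Neon"))
```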
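Likewise, the open-weight fine-tuning route could look roughly like the following QLoRA sketch, assuming recent versions of the Hugging Face transformers, peft, bitsandbytes, datasets, and trl libraries. The hyperparameters, dataset file, and field names are assumptions for illustration, not the paper's configuration.

```python
# Minimal QLoRA fine-tuning sketch for an open-weight Llama model on a
# consumer GPU. Hyperparameters and dataset fields are illustrative.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit NF4 quantization: the "Q" in QLoRA. The frozen base weights become
# small enough to fit in consumer-GPU memory.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=bnb, device_map="auto"
)

# Low-rank adapters are the only trainable parameters.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Assumed JSONL file with a single "text" field holding the grading prompt
# followed by the target JSON grade.
train = load_dataset("json", data_files="asag_train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=train,
    peft_config=lora,
    args=SFTConfig(
        output_dir="qlora-asag",
        dataset_text_field="text",
        per_device_train_batch_size=2,
        num_train_epochs=3,
    ),
)
trainer.train()
trainer.save_model("qlora-asag")  # saves only the small adapter weights
```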
What is the application?
Who is the user?
What age?
Why use AI?
Study design
