Date
Publisher: arXiv
Frontier large language models (LLMs) such as ChatGPT and Gemini can decipher
cryptic compiler errors for novice programmers, but their computational scale,
cost, and tendency to over-assist make them problematic for widespread
pedagogical adoption. This work demonstrates that smaller, specialised language
models, enhanced via Supervised Fine-Tuning (SFT), present a more viable
alternative for educational tools. We utilise a new dataset of 40,000 C
compiler error explanations, derived from programming errors made by real
introductory programming (CS1/2) students, to fine-tune three open-source
models: Qwen3-4B, Llama-3.1-8B, and Qwen3-32B. We performed a dual evaluation,
combining expert human reviews with a large-scale automated analysis of 8,000
responses using a validated LLM-as-judge ensemble. Our results show that SFT
significantly boosts the pedagogical quality of smaller models, achieving
performance comparable to that of much larger models. We analyse the
trade-offs between model size and quality, confirming that fine-tuning compact,
efficient models on high-quality, domain-specific data is a potent strategy for
creating specialised models to drive educational tools. We provide a replicable
methodology to foster broader access to generative AI capabilities in
educational contexts.
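The core method in the abstract is SFT of compact open models on compiler-error explanation data. Below is a minimal sketch of such a fine-tuning run, assuming a Hugging Face Transformers stack; the dataset file, field names, prompt format, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Sketch of supervised fine-tuning (SFT) a small open model on C compiler
# error explanations. Paths, field names, and hyperparameters are assumed.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

MODEL_NAME = "Qwen/Qwen3-4B"                      # smallest model in the study
DATA_FILE = "compiler_error_explanations.jsonl"   # hypothetical local dataset

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Each record is assumed to hold a student's compiler error and a pedagogical
# explanation; join them into one training text per example.
def to_text(example):
    return {"text": f"Compiler error:\n{example['error']}\n\n"
                    f"Explanation:\n{example['explanation']}"}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

dataset = (load_dataset("json", data_files=DATA_FILE, split="train")
           .map(to_text)
           .map(tokenize, remove_columns=["error", "explanation", "text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sft-compiler-errors",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=2,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=dataset,
    # Causal-LM collator: pads batches and copies input_ids into labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```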
What is the application?
Who is the user?
What age?
Why use AI?
Study design
