Date:
Publisher: arXiv
Motivation: Students learning to program often become stuck and unable to make
forward progress. An automatically generated next-step hint can help them move
forward and support their learning. It is important to know what distinguishes
a good hint from a bad one, and how to generate good hints automatically in
novice programming tools, for example using Large Language Models (LLMs).
Method and participants: We recruited 44 Java educators from around the world
to participate in an online study. We used a set of real student code states as
hint-generation scenarios. Participants used a technique known as comparative
judgement to rank a set of candidate next-step Java hints, which were generated
by LLMs and by five experienced human educators. Participants ranked the hints
without being told how they were generated.
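
In comparative judgement, judges repeatedly pick the better of two items, and a
ranking is then fitted from the pairwise outcomes; a Bradley-Terry model is a
common way to do that fitting. Below is a minimal Python sketch of the fitting
step, assuming Bradley-Terry and made-up judgement data; the paper's exact
ranking procedure is not given in the abstract.

    from collections import defaultdict

    def bradley_terry(pairs, n_items, iters=200):
        # Fit Bradley-Terry strengths from pairwise wins using the
        # standard MM update: p_i = wins_i / sum_j [n_ij / (p_i + p_j)].
        wins = defaultdict(int)   # total wins per item
        meet = defaultdict(int)   # comparison count per unordered pair
        for w, l in pairs:
            wins[w] += 1
            meet[frozenset((w, l))] += 1
        p = [1.0] * n_items       # initial strengths
        for _ in range(iters):
            new_p = []
            for i in range(n_items):
                denom = sum(meet[frozenset((i, j))] / (p[i] + p[j])
                            for j in range(n_items) if j != i)
                new_p.append(wins[i] / denom if denom else p[i])
            total = sum(new_p)    # renormalise so strengths stay bounded
            p = [x * n_items / total for x in new_p]
        return p

    # Hypothetical judgements: (winner, loser) over three candidate hints.
    judgements = [(0, 1), (0, 2), (1, 2), (0, 1), (2, 1)]
    strengths = bradley_terry(judgements, n_items=3)
    print(sorted(range(3), key=lambda i: -strengths[i]))  # best to worst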
Findings: We found that LLMs varied considerably in their ability to generate
high-quality next-step hints for programming novices, with GPT-4 outperforming
the other models tested. When used with a well-designed prompt, GPT-4
outperformed human experts in generating pedagogically valuable hints, and a
multi-stage prompt was the most effective prompt design. The two most important
factors in a good hint were length (80--160 words being best) and reading level
(US grade 9 or below being best). Offering alternative approaches to solving
the problem was viewed negatively, and we found no effect of sentiment.
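
Both reported quality factors are mechanically checkable, so a hint-generation
tool could filter candidate hints before showing them to students. Below is a
sketch of such a filter, assuming the standard Flesch-Kincaid grade formula and
a crude vowel-group syllable heuristic; the paper's own readability measure is
not stated in the abstract.

    import re

    def fk_grade(text):
        # Flesch-Kincaid grade level:
        # 0.39 * (words/sentences) + 11.8 * (syllables/word) - 15.59
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        n = max(1, len(words))
        syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                        for w in words)
        return 0.39 * (n / sentences) + 11.8 * (syllables / n) - 15.59

    def hint_ok(hint):
        # Apply the two criteria reported above: 80-160 words long,
        # and a reading level of US grade 9 or below.
        n_words = len(re.findall(r"[A-Za-z']+", hint))
        return 80 <= n_words <= 160 and fk_grade(hint) <= 9.0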
Conclusions: Automatic generation of these hints is immediately viable, given
that LLMs outperformed humans -- even when the students' task is unknown. The
fact that only the best prompts achieve this outcome suggests that students on
their own are unlikely to be able to produce the same benefit. The prompting
task, therefore, should be embedded in an expert-designed tool.
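
The abstract does not describe the winning multi-stage prompt, but the general
pattern splits hint generation into separate model calls, e.g. first diagnose
the student's code, then write a hint constrained by the criteria above. Below
is a hypothetical sketch using the OpenAI chat API; the model name, the two
stages, and all prompt wording are assumptions, not the paper's prompt.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(prompt):
        # One chat-completion call; "gpt-4" is an assumed model name.
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content

    def next_step_hint(student_code):
        # Stage 1: diagnose what is blocking the student.
        diagnosis = ask("A novice Java student is stuck. Identify the "
                        "single most important problem in this code:\n\n"
                        + student_code)
        # Stage 2: turn the diagnosis into a next-step hint, constrained
        # to the length and reading level found effective above.
        return ask("Problem identified: " + diagnosis + "\n\n"
                   "Write a next-step hint for the student in 80-160 "
                   "words, readable at US grade 9 or below. Do not give "
                   "the full solution or suggest alternative approaches.")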
What is the application?
Who is the user?
What age?
Why use AI?
Study design