Date
Publisher
arXiv
This study explores automated item generation (AIG) using language models to
create multiple-choice questions (MCQs) for morphological assessment, aiming to
reduce the cost and inconsistency of manual test development. The study used a
two-fold approach. First, we compared a fine-tuned mid-sized model (Gemma, 2B)
with a larger, untuned one (GPT-3.5, 175B). Second, we evaluated seven
structured prompting strategies, including zero-shot, few-shot,
chain-of-thought, role-based, sequential, and their combinations. Generated items
were assessed using automated metrics and expert scoring across five
dimensions. We also used GPT-4.1, trained on expert-rated samples, to simulate
human scoring at scale. Results show that structured prompting, especially
strategies combining chain-of-thought and sequential design, significantly
improved Gemma's outputs. Gemma generally produced more construct-aligned and
instructionally appropriate items than GPT-3.5's zero-shot responses, with
prompt design playing a key role in mid-sized model performance. This study
demonstrates that structured prompting and efficient fine-tuning can enhance
mid-sized models for AIG under limited-data conditions. We highlight the value
of combining automated metrics, expert judgment, and large-model simulation to
ensure alignment with assessment goals. The proposed workflow offers a
practical and scalable way to develop and validate language assessment items
for K-12.
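
The abstract does not include the authors' prompt templates, so the sketch below is only a rough illustration of how a combined chain-of-thought and sequential strategy for morphological MCQ generation could be wired together. The step breakdown, the prompt wording, and the `generate_mcq` / `LLMFn` names are assumptions made for illustration, not the paper's implementation; any text-generation backend (for example, a Gemma 2B instruct checkpoint served through a Hugging Face text-generation pipeline, or a GPT-3.5 endpoint) could be passed in as the `generate` callable.

```python
from typing import Callable

# Type alias for any text-generation backend: prompt in, completion out.
# (Assumed interface for illustration; not from the paper.)
LLMFn = Callable[[str], str]

# Step 1 (chain of thought): reason explicitly about the target word's morphology.
ANALYSIS_PROMPT = (
    "You are an expert in K-12 language assessment.\n"
    "Think step by step about the morphological structure of the word "
    "'{word}': identify its root, its affixes, and the meaning each part contributes."
)

# Step 2 (sequential): turn the analysis into a question stem plus the correct answer.
STEM_PROMPT = (
    "Using the analysis below, write one multiple-choice question stem that "
    "assesses a student's understanding of the morphology of '{word}', and "
    "state the correct answer.\n\nAnalysis:\n{analysis}"
)

# Step 3 (sequential): add plausible distractors reflecting common misconceptions.
DISTRACTOR_PROMPT = (
    "Write three plausible but incorrect options (distractors) for the item "
    "below. Each distractor should reflect a common morphological misconception.\n\n"
    "Item:\n{item}"
)


def generate_mcq(generate: LLMFn, word: str) -> dict:
    """Run the chained prompts in order, feeding each output into the next step."""
    analysis = generate(ANALYSIS_PROMPT.format(word=word))
    item = generate(STEM_PROMPT.format(word=word, analysis=analysis))
    distractors = generate(DISTRACTOR_PROMPT.format(item=item))
    return {"analysis": analysis, "item": item, "distractors": distractors}


if __name__ == "__main__":
    def echo(prompt: str) -> str:
        # Dummy backend for a dry run; swap in a real model call in practice.
        return f"<model output for: {prompt[:40]}...>"

    print(generate_mcq(echo, "unbreakable"))
```

In the study's setup, a scaffold like this would presumably be filled with the authors' actual prompt wording and run against both the fine-tuned Gemma 2B model and the GPT-3.5 baseline, with the resulting items then passed to the automated metrics, expert raters, and GPT-4.1 scoring described above.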
What is the application?
Who is the user?
Why use AI?
Study design
