Publisher: arXiv
Using LLMs to give students educational feedback on their assignments has
attracted much attention in the AI in Education field. Yet, there is currently
no large-scale open-source dataset of student assignments that includes
detailed assignment descriptions, rubrics, and student submissions across
various courses. As a result, research on generalisable methodology for
automatic generation of effective and responsible educational feedback remains
limited. In the current study, we constructed a large-scale dataset of
Synthetic Computer science Assignments for LLM-generated Educational Feedback
research (SCALEFeedback). We proposed a Sophisticated Assignment Mimicry (SAM)
framework that generates the synthetic dataset through one-to-one LLM-based
imitation of real assignment descriptions and student submissions. Our
open-source dataset contains 10,000 synthetic student
submissions spanning 155 assignments across 59 university-level computer
science courses. Our synthetic submissions achieved a BERTScore F1 of 0.84 and
Pearson correlation coefficients (PCC) of 0.62 for assignment marks and 0.85
for submission length relative to the corresponding real-world assignment
dataset, while ensuring complete protection of students' private information.
On all these measures, our SAM framework outperformed a naive mimicry baseline.
LLM-generated feedback for our synthetic assignments was as effective as
feedback generated for the real-world assignment dataset. Our research showed that one-to-one LLM
imitation is a promising method for generating open-source synthetic
educational datasets that preserve the original dataset's semantic meaning and
student data distribution, while protecting student privacy and institutional
copyright. SCALEFeedback enhances our ability to develop LLM-based
generalisable methods for offering high-quality, automated educational feedback
in a scalable way.
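The Pearson correlation coefficient (PCC) reported above measures how well the synthetic submissions track the real ones on a per-pair basis (e.g. assignment marks). As an illustration only, the sketch below computes PCC in pure Python over hypothetical mark pairs; the data values are invented, not from the SCALEFeedback dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical marks for five real submissions and their synthetic counterparts
real_marks = [62, 75, 88, 54, 91]
synthetic_marks = [60, 70, 85, 58, 95]

print(round(pearson_r(real_marks, synthetic_marks), 2))
```

A value near 1 would indicate that the synthetic marks preserve the real mark distribution pairwise, which is the property the paper's PCC metrics are quantifying.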
