The student-generated question (SGQ) strategy is an effective instructional approach for developing students' higher-order cognitive and critical-thinking skills. However, assessing the quality of SGQs is time-consuming and requires domain experts. Previous work on automatic evaluation focused on surface-level features of questions. To overcome this limitation, the state-of-the-art language models GPT-3.5 and GPT-4.0 were used to evaluate 1084 SGQs on topic relevance, clarity of expression, answerability, difficulty level, and cognitive level. Results showed that GPT-4.0 exhibits higher grading consistency with experts than GPT-3.5 on topic relevance, clarity of expression, answerability, and difficulty level, while both models had low consistency with experts on cognitive level. Over three rounds of testing, GPT-4.0 also demonstrated more stable autograding than GPT-3.5. In addition, to validate the effectiveness of GPT in evaluating SGQs from different domains and subjects, we conducted the same experiment on part of the LearningQ dataset. We also discuss the attitudes of teachers and students toward automatic grading by GPT models. The findings underscore the potential of GPT-4.0 to assist teachers in evaluating the quality of SGQs; nevertheless, the cognitive-level assessment of SGQs still requires manual examination by teachers.
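As a rough illustration of the kind of pipeline the abstract describes, the sketch below prompts a GPT model to rate a single SGQ on the five criteria and then quantifies model-expert agreement. The paper does not publish its prompts, rating scale, model snapshot, or agreement statistic here, so the 1-5 scale, the "gpt-4" model name, the OpenAI Python client calls, the quadratic-weighted Cohen's kappa from scikit-learn, and the example score lists are all assumptions for demonstration only.

```python
"""Illustrative sketch only; prompt wording, scale, model, and metric are assumed."""
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITERIA = ("topic relevance, clarity of expression, answerability, "
            "difficulty level, cognitive level")

def rate_question(question: str, model: str = "gpt-4") -> str:
    """Ask the model to rate one student-generated question on the five criteria."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce run-to-run variation across grading rounds
        messages=[
            {"role": "system",
             "content": f"You grade student-generated questions. Rate the question "
                        f"on each of: {CRITERIA}, using an integer scale of 1-5. "
                        f"Reply as 'criterion: score' lines only."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Hypothetical per-question scores on one criterion, used only to show how
# model-expert grading consistency could be quantified.
expert_scores = [3, 4, 2, 5, 4, 3]
model_scores  = [3, 4, 3, 5, 4, 2]
kappa = cohen_kappa_score(expert_scores, model_scores, weights="quadratic")
print(f"Quadratic-weighted kappa (model vs. expert): {kappa:.2f}")
```

Pinning temperature to 0 is one simple way to probe the run-to-run grading stability that the abstract reports over three rounds of testing.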
Can Autograding of Student-Generated Questions Quality by ChatGPT Match Human Experts?
Date
Publisher
IEEE
Study design
Comparative evaluation: GPT-3.5 and GPT-4.0 graded 1084 SGQs on five criteria over three rounds, and their ratings were compared with expert ratings for consistency; the experiment was repeated on part of the LearningQ dataset for cross-domain validation, and teacher and student attitudes toward GPT autograding were also examined.
Who is the user?
Teachers who assess the quality of student-generated questions.
Who benefits?
Teachers, whose grading workload is reduced, and indirectly the students whose questions are evaluated.
What is the application?
Automatic grading of student-generated question quality (topic relevance, clarity of expression, answerability, difficulty level, and cognitive level) using GPT-3.5 and GPT-4.0.
Why use AI?
Manual assessment of SGQ quality is time-consuming and requires domain experts; GPT-4.0 shows grading consistency with experts on most criteria, though cognitive level still needs human review.