Date
Publisher
arXiv
We explore the effectiveness and reliability of an artificial intelligence
(AI)-based grading system for a handwritten general chemistry exam, comparing
AI-assigned scores to human grading across various types of questions. Exam
pages and grading rubrics were uploaded as images to account for chemical
reaction equations, short and long open-ended answers, numerical and symbolic
answer derivations, drawing, and sketching in pencil-and-paper format. Using
linear regression analyses and psychometric evaluations, the investigation
reveals high agreement between AI and human graders for textual and chemical
reaction questions, while highlighting lower reliability for numerical and
graphical tasks. The findings emphasize the need for human oversight, guided
by selective filtering, to ensure grading accuracy. The results indicate
promising applications for AI in routine assessment tasks, though careful
consideration must be given to student perceptions of fairness and trust in
integrating AI-based grading into educational practice.
What is the application?
Who is the user?
What age?
Why use AI?
Study design
