Date
Publisher: arXiv
This shared task aimed to assess the pedagogical abilities of AI tutors
powered by large language models (LLMs), focusing on evaluating the quality of
tutor responses aimed at remediating students' mistakes within educational
dialogues. The task consisted of five tracks: four designed to automatically
evaluate an AI tutor's performance along the key dimensions of mistake
identification, precise location of the mistake, providing guidance, and
feedback actionability, grounded in learning science principles that define
good and effective tutor responses, and a fifth track focused on detecting the
tutor's identity. The task attracted over 50 international teams across all
tracks. The submitted models were evaluated against gold-standard human
annotations, and the results, while promising, show that there is still
significant room for improvement in this domain: the best results for the four
pedagogical ability assessment tracks range between macro F1 scores of 58.34
(for providing guidance) and 71.81 (for mistake identification) on three-class
problems, while the best F1 score in the tutor identification track reaches
96.98 on a nine-class task. In this paper, we present an overview of the main
findings of the shared task, discuss the approaches taken by the participating
teams, and analyze their performance. All resources associated with this task
are made publicly available to support future research in this critical domain.
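As a point of reference for the reported numbers, the sketch below shows how a macro F1 score could be computed for one of the three-class tracks. The label names and the gold/predicted values are purely illustrative assumptions, not the task's actual data or evaluation script.

```python
# Minimal sketch: macro F1 for a hypothetical three-class track.
# Labels and predictions below are made up for illustration only.
from sklearn.metrics import f1_score, accuracy_score

# Hypothetical gold annotations and system predictions
gold = ["Yes", "No", "To some extent", "Yes", "No", "Yes"]
pred = ["Yes", "No", "Yes", "Yes", "To some extent", "Yes"]

# Macro F1 averages the per-class F1 scores, so each class counts
# equally regardless of how frequent it is in the gold data.
macro_f1 = f1_score(gold, pred, average="macro")
accuracy = accuracy_score(gold, pred)

print(f"macro F1: {macro_f1:.4f}, accuracy: {accuracy:.4f}")
```

Macro averaging is what makes the lower scores on tracks such as providing guidance meaningful: a system cannot reach a high score by predicting only the majority class.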
What is the application?
Who is the user?
Why use AI?
Study design
