Publisher
arXiv
This paper describes the results of the first shared task on the generation
of teacher responses in educational dialogues. The goal of the task was to
benchmark the ability of generative language models to act as AI teachers,
replying to a student in a teacher-student dialogue. Eight teams participated
in the competition hosted on CodaLab. They experimented with a wide variety of
state-of-the-art models, including Alpaca, Bloom, DialoGPT, DistilGPT-2,
Flan-T5, GPT-2, GPT-3, GPT-4, LLaMA, OPT-2.7B, and T5-base. Their submissions
were automatically scored using BERTScore and DialogRPT metrics, and the top
three among them were further manually evaluated for pedagogical ability,
following Tack and Piech (2022). The NAISTeacher system, which ranked
first in both automated and human evaluation, generated responses with GPT-3.5
using an ensemble of prompts and a DialogRPT-based ranking of responses for
given dialogue contexts. Despite the promising achievements of the
participating teams, the results also highlight the need for evaluation metrics
better suited to educational contexts.
