Date
Publisher
arXiv
As large language models are increasingly integrated into education, virtual
student agents are becoming vital for classroom simulation and teacher
training. Yet their classroom-oriented subjective abilities remain largely
unassessed, limiting understanding of model boundaries and hindering
trustworthy deployment. We present EduPersona, a large-scale benchmark spanning
two languages, three subjects, and ten persona types based on the Big Five
theory. The dataset contains 1,308 authentic classroom dialogue rounds,
corresponding to 12,814 teacher-student Q&A turns, and is further expanded
through persona stylization into roughly 10 times larger scale (128k turns),
providing a solid foundation for evaluation. Building on this resource, we
decompose hard-to-quantify subjective performance into three progressive tasks:
TASK1 basic coherence (whether behavior, emotion, expression, and voice align
with classroom context), TASK2 student realism, and TASK3 long-term persona
consistency, thereby establishing an evaluation framework grounded in
educational theory and research value. We conduct systematic experiments on
three representative LLMs, comparing their original versions with ten
persona-fine-tuned variants trained on EduPersona. Results show consistent and
significant average improvements across all tasks: TASK1 +33.6%, TASK2 +30.6%,
and TASK3 +14.9%. These improvements highlight the dataset's effectiveness and
research value, while also revealing the heterogeneous difficulty of persona
modeling. In summary, EduPersona delivers the first classroom benchmark
centered on subjective abilities, establishes a decoupled and verifiable
research paradigm, and we will open-source both the dataset and the framework
to support the broader research community in advancing trustworthy and
human-like AI for education.
What is the application?
Who is the user?
Who age?
Why use AI?
Study design
