Date:
Publisher: arXiv
Fostering students' ability to integrate and transfer knowledge in complex
problem-solving scenarios is a core objective of modern education, and
interdisciplinary STEM education is a key pathway to achieving this, yet it
requires expert guidance that is difficult to scale. While LLMs offer potential in this regard,
their true capability for guided instruction remains unclear due to the lack of
an effective evaluation benchmark. To address this, we introduce SID, the first
benchmark designed to systematically evaluate the higher-order guidance
capabilities of LLMs in multi-turn, interdisciplinary Socratic dialogues. Our
contributions include a large-scale dataset of 10,000 dialogue turns across 48
complex STEM projects, a novel annotation schema for capturing deep pedagogical
features, and a new suite of evaluation metrics (e.g., X-SRG). Baseline
experiments confirm that even state-of-the-art LLMs struggle to execute
effective guided dialogues that lead students to achieve knowledge integration
and transfer. This highlights the critical value of our benchmark in driving
the development of more pedagogically aware LLMs.
What is the application?
Who is the user?
What age is the user?
Why use AI?
Study design
