Publisher: arXiv
Benchmarks like Massive Multitask Language Understanding (MMLU) have played a
pivotal role in evaluating AI's knowledge and abilities across diverse domains.
However, existing benchmarks predominantly focus on content knowledge, leaving
a critical gap in assessing models' understanding of pedagogy, the method and
practice of teaching. This paper introduces The Pedagogy Benchmark, a novel
dataset designed to evaluate large language models on their Cross-Domain
Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND)
pedagogical knowledge. These benchmarks are built on a carefully curated set of
questions sourced from professional development exams for teachers, which cover
a range of pedagogical subdomains such as teaching strategies and assessment
methods. Here we outline the methodology and development of these benchmarks.
We report results for 97 models, with accuracies ranging from 28% to 89% on
the pedagogical knowledge questions. We consider the relationship
between cost and accuracy and chart the progression of the Pareto value
frontier over time. We provide online leaderboards at
https://rebrand.ly/pedagogy, which are updated with new models and allow
interactive exploration and filtering by model properties such as cost per
token and open versus closed weights, as well as performance in different
subjects.

LLMs and generative AI have tremendous potential to influence education and
help address the global learning crisis.
Education-focused benchmarks are crucial for measuring models' capacity to
understand pedagogical concepts, respond appropriately to learners' needs, and
support effective teaching practices across diverse contexts. They are needed
to inform the responsible, evidence-based deployment of LLMs and LLM-based
tools in educational settings, and to guide both development and policy
decisions.
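
As a rough illustration of the cost-accuracy analysis described above, the
sketch below computes a Pareto value frontier from per-model (cost, accuracy)
pairs: a model stays on the frontier only if no other model is both cheaper and
at least as accurate. The function, model names, prices, and accuracies are
hypothetical and are not taken from the paper or its leaderboards.

```python
from typing import Dict, List, Tuple


def pareto_frontier(models: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of models on the cost-accuracy Pareto frontier.

    `models` maps a model name to (cost_per_million_tokens_usd, accuracy).
    A model is on the frontier if no other model is both cheaper and at
    least as accurate.
    """
    frontier: List[str] = []
    best_accuracy = float("-inf")
    # Walk models from cheapest to most expensive (ties broken by higher
    # accuracy first); keep each model that beats the best accuracy seen so
    # far, i.e. is not dominated by a cheaper model.
    for name, (cost, acc) in sorted(
        models.items(), key=lambda kv: (kv[1][0], -kv[1][1])
    ):
        if acc > best_accuracy:
            frontier.append(name)
            best_accuracy = acc
    return frontier


# Illustrative, made-up figures (not results from the paper):
example = {
    "cheap-model": (0.50, 0.61),
    "mid-model": (1.20, 0.58),  # dominated: pricier and less accurate than cheap-model
    "frontier-model": (3.00, 0.85),
}
print(pareto_frontier(example))  # ['cheap-model', 'frontier-model']
```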
What is the application?
Who is the user?
Why use AI?
Study design
