Publisher
arXiv
Artificial intelligence (AI) is poised to transform education, but the
research community lacks a robust, general benchmark to evaluate AI models for
learning. To assess state-of-the-art support for educational use cases, we ran
an "arena for learning" where educators and pedagogy experts conducted blind,
head-to-head, multi-turn comparisons of leading AI models. In particular, $N =
189$ educators drew from their experience to role-play realistic learning use
cases, interacting with two models sequentially, after which $N = 206$ experts
judged which model better supported the user's learning goals. The arena
evaluated a slate of state-of-the-art models: Gemini 2.5 Pro, Claude 3.7
Sonnet, GPT-4o, and OpenAI o3. Excluding ties, experts preferred Gemini 2.5 Pro
in 73.2% of these match-ups, ranking it first overall in the arena. Gemini
2.5 Pro also demonstrated markedly higher performance across key principles of
good pedagogy. Altogether, these results position Gemini 2.5 Pro as a leading
model for learning.
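
To make the "excluding ties" figure concrete, the following is a minimal sketch of how a tie-excluded preference rate could be computed from pairwise expert judgments. It is an illustration under stated assumptions, not the study's actual analysis: the function name `win_rate_excluding_ties`, the `(model_a, model_b, winner)` record layout, and the toy judgments are all hypothetical and do not come from the paper's data.

```python
import math


def win_rate_excluding_ties(outcomes, model):
    """Share of decisive match-ups involving `model` that `model` won.

    `outcomes` is an iterable of (model_a, model_b, winner) tuples, where
    `winner` is one of the two model names, or None for a tie.
    This record layout is an assumption made for illustration only.
    """
    wins = 0
    losses = 0
    for model_a, model_b, winner in outcomes:
        if model not in (model_a, model_b):
            continue  # match-up does not involve this model
        if winner is None:
            continue  # ties are excluded from the denominator
        if winner == model:
            wins += 1
        else:
            losses += 1
    decisive = wins + losses
    return wins / decisive if decisive else math.nan


# Toy judgments with placeholder model names (NOT data from the study).
toy_outcomes = [
    ("model_a", "model_b", "model_a"),
    ("model_a", "model_c", None),       # tie, dropped
    ("model_b", "model_a", "model_a"),
    ("model_a", "model_c", "model_c"),
    ("model_b", "model_c", "model_b"),
]

rate_a = win_rate_excluding_ties(toy_outcomes, "model_a")
print(f"model_a preferred in {rate_a:.1%} of decisive match-ups")
```

In this toy set, `model_a` appears in four match-ups, one of which is a tie, so its rate is computed over the three decisive ones. Whether the reported 73.2% was pooled over all opponents in this way is not stated in the abstract.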
