Date
Publisher
Center for Educational Data Science and Innovation
The Gates Foundation, the Walton Family Foundation, and the Chan Zuckerberg Initiative have
launched a series of collaborative investments in building large-scale datasets that can support and
accelerate data infrastructure for AI R&D efforts in education. In partnership with researchers
from Harvard University and Stanford University, the Center for Educational Data Science and
Innovation (EDSI) at the University of Maryland is leading an unprecedented effort to build a
benchmark classroom dataset between 2025 and 2027. This dataset will be collected and processed
in a way that enables model training, benchmarking, tool-building and deeper research into teaching and learning processes.
Our work builds on and extends prior large-scale classroom data collection efforts in the field
of educational research. For example, supported by the Gates Foundation, the Measures of Effective Teaching (MET) project collected around 20,000 videotaped lessons from 3,000 teacher
volunteers in six urban districts in 2012-2013. During the same period, the Institute of Education Sciences also supported the National Center for Teacher Effectiveness (NCTE) at Harvard
University to conduct a three-year data collection effort that captured classroom recordings from
approximately 50 schools and 300 classrooms in four districts. These rich datasets, along with related smaller-scale efforts, have helped spur extensive research on teachers and
teaching and have advanced the field of educational research significantly.
With all of these existing datasets, why do we need more education data? And what makes collecting data for R&D in AI and education different from a typical data collection effort for educational or social science research purposes? Just half a year after ChatGPT was released, a convening
at Stanford brought together a group of leading researchers, industry professionals, and education
practitioners to discuss how language technologies, broadly defined, can be used to support educators. A strong consensus among this diverse group of stakeholders was that collecting high-quality,
open-source education data is one of the highest priorities for the field so AI can fulfill its promise in education.3 Prior datasets, although successful in advancing educational research, lack key
features that can meet the needs of R&D in the age of AI. For example, prior datasets often lack
high-quality audio that captures student speech clearly, limiting the ability to study how students
engage in classroom discourse and how they reason. Such datasets also rarely offer transcripts of lessons, a key data source for AI model training and for research that focuses on classroom
interactions. Because of the sensitive nature of the personal data in these datasets, the data often
cannot be shared broadly and can only be shared through highly secure channels. For example,
accessing MET project data requires a complex application and approval process at the University
of Michigan.
To contribute to the corpus of large-scale classroom datasets and provide more high-quality data
for AI R&D, our research team plans to prioritize a few key parameters in our design, including
i) high-quality multimodal data that include audio, video, student and teacher survey data, administrative data, and classroom artifacts, all linked together; ii) naturalistic data that capture
the nuances of student-teacher interactions, with a specific focus on maximizing student speech
quality and student speaker identification, which will allow researchers to connect students' classroom
contributions over time and to survey and administrative data; iii) instruction that is rooted
in high-quality instructional materials, to allow for sufficient observation of high-leverage teaching
practices; and iv) a data pipeline that fully anonymizes personally identifiable information and enables
convenient access for researchers and solution providers. We will also focus on 4th-8th grade mathematics classrooms and set a goal of reaching a sample of 300 teachers who represent a range of
localities and student bodies.
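As an illustration of the linking idea in parameters ii) and iv), the sketch below shows one common way to connect a student's classroom contributions across lessons and to survey records without storing the original identifier. It is a minimal sketch under stated assumptions, not a description of the project's actual pipeline: the record layout, the field names, the `pseudonymize` and `link_records` helpers, and the secret key are all hypothetical. A keyed hash of this kind only pseudonymizes identifiers; names spoken in a transcript or visible in video would need separate handling.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical project secret; in practice it would be stored securely
# and never released alongside the data.
SECRET_KEY = b"replace-with-a-project-secret"


def pseudonymize(raw_id: str, key: bytes = SECRET_KEY) -> str:
    """Map a raw identifier to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]  # shortened for readability in this sketch


def link_records(utterances, surveys, key: bytes = SECRET_KEY):
    """Group utterances and survey rows under one pseudonym per student."""
    linked = defaultdict(lambda: {"survey": None, "utterances": []})
    for row in surveys:
        pid = pseudonymize(row["student_id"], key)
        # Keep every survey field except the raw identifier.
        linked[pid]["survey"] = {k: v for k, v in row.items() if k != "student_id"}
    for row in utterances:
        pid = pseudonymize(row["student_id"], key)
        linked[pid]["utterances"].append(
            {"lesson": row["lesson"], "text": row["text"]}
        )
    return dict(linked)


# Made-up example records, for illustration only.
utterances = [
    {"student_id": "S-001", "lesson": 1, "text": "I think it is three fourths."},
    {"student_id": "S-002", "lesson": 1, "text": "Why not six eighths?"},
    {"student_id": "S-001", "lesson": 2, "text": "They are equivalent."},
]
surveys = [
    {"student_id": "S-001", "grade": 5},
    {"student_id": "S-002", "grade": 5},
]

linked = link_records(utterances, surveys)
for pid, record in linked.items():
    print(pid, record["survey"], len(record["utterances"]))
```

Because the same raw identifier always maps to the same pseudonym under a given key, a student's utterances from different lessons end up in one record next to that student's survey data, while the raw identifier itself does not appear in the linked output.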
To inform this work and maximize the impact of the dataset we are building, EDSI partnered
with researchers from the Research Partnership for Professional Learning (RPPL)4 to interview 22
experts spanning industry, data science, social sciences, and educational research. In conducting and
synthesizing these interviews, we aim to achieve three goals. The first is to gather best practices on topics
like survey design, privacy, and dissemination, and to ensure our data collection meets the field's
needs. The second is to understand the full R&D potential of a dataset like ours by learning how experts
might use it, thereby facilitating effective dissemination. The third is to share the resulting insights with
the broader community, collectively advancing the field at the intersection of AI, education, and
data science.
What is the application?
Who age?
Why use AI?
Study design
