A Framework for Building High-Quality Education Data for R&D in the Age of AI: The EDSI Dataset and Expert Insights

Authors

Jing Liu,

Brendon Krall,

Sarah Montana,

Ting-Yu Ariel Chung,

Heather Hill

Date

10/2025

Publisher

Center for Educational Data Science and Innovation

Link

https://edsi.umd.edu/publications/framework-building-high-quality-education-dat…

The Gates Foundation, the Walton Family Foundation, and the Chan Zuckerberg Initiative have launched a series of collaborative investments in building large-scale datasets that can support and accelerate data infrastructure for AI R&D efforts in education. In partnership with researchers from Harvard University and Stanford University, the Center for Educational Data Science and Innovation (EDSI) at the University of Maryland is leading an unprecedented effort to build a benchmark classroom dataset between 2025 and 2027. This dataset will be collected and processed in a way that enables model training, benchmarking, tool-building and deeper research into teaching and learning processes. Our work builds on and extends prior large-scale classroom data collection efforts in the field of educational research. For example, supported by the Gates Foundation, the Measures of Effective Teaching (MET) project collected around 20,000 videotaped lessons from 3,000 teacher volunteers in six urban districts in 2012-2013. During the same period, the Institute of Education Sciences also supported the National Center for Teacher Effectiveness (NCTE) at Harvard University to conduct a three-year data collection effort that captured classroom recordings from approximately 50 schools and 300 classrooms in four districts. These rich datasets, along with other related efforts that might be at a smaller scale, have helped spur extensive research on teachers and teaching and advanced the field of educational research significantly. With all of these existing datasets, why do we need more education data? And what makes collecting data for R&D in AI and education different from a typical data collection effort for educational or social science research purposes? 
Just half a year after ChatGPT was released, a convening at Stanford brought together a group of leading researchers, industry professionals, and education practitioners to discuss how language technologies, broadly defined, can be used to support educators. A strong consensus among this diverse group of stakeholders was that collecting high-quality, open-source education data is one of the highest priorities for the field so AI can fulfill its promise in education.3 Prior datasets, although successful in advancing educational research, lack key features needed for R&D in the age of AI. For example, prior datasets often lack high-quality audio that captures student speech clearly, limiting the ability to study how students engage in classroom discourse and their reasoning processes. Such datasets also rarely offer transcripts of lessons, a key data source for AI model training and for research that focuses on classroom interactions. Because of the sensitive nature of the personal data in these datasets, the data often cannot be shared broadly and can only be shared through highly secure channels. For example, accessing MET project data requires a complex application and approval process at the University of Michigan.
To contribute to the corpus of large-scale classroom datasets and provide more high-quality data for AI R&D, our research team plans to prioritize a few key parameters in our design, including i) high-quality multimodal data that include audio, video, student and teacher survey data, administrative data, and classroom artifacts that are all linked together; ii) naturalistic data that capture the nuances of student-teacher interactions, with a specific focus on maximizing student speech quality and student speaker identification, which will allow researchers to connect students' classroom contributions over time and to survey and administrative data; iii) instruction that is rooted in high-quality instructional materials to allow for sufficient observation of high-leverage teaching practices; iv) a data pipeline that fully anonymizes personally identifiable information and enables convenient access for researchers and solution providers. We will also focus on 4th-8th grade mathematics classrooms and set a goal of achieving a sample of 300 teachers that captures a range of localities and student bodies. To inform this work and maximize the impact of the dataset we are building, EDSI partnered with researchers from the Research Partnership for Professional Learning (RPPL)4 to interview 22 experts spanning industry, data science, social sciences, and educational research. In conducting and synthesizing these interviews, we aim to achieve three goals. The first is to gather best practices on topics like survey design, privacy, and dissemination, and to ensure our data collection meets the field's needs. The second is to understand the full R&D potential of a dataset like ours by learning how experts might use it, thereby facilitating effective dissemination. The third is to share the resulting insights with the broader community, collectively advancing the field at the intersection of AI, education, and data science.

What is the application?

Teaching – Instructional Materials,

Teaching – Assessment and Feedback,

Teaching – Professional Learning,

Learning – Student Support,

Communicating / Social Tools

What age?

Why use AI?