Data
How can we collaborate on education analytics when we work in closed data environments?
An open-source community for education data analysts can seem like an oxymoron: most of the student-level data we work with is private, and data formats are specific to each agency. We hope to tackle both problems by drawing on the OpenSDP community to help create realistic synthetic data that can be used to develop, share, and present analyses, and by embracing data standards wherever possible.
The repositories below show OpenSDP’s preliminary attempts to tackle the synthetic data challenge. We encourage you to explore the open-source data engine and the Faketucky and Fake County synthetic datasets. We also encourage you to use publicly available data. Contact OpenSDP at opensdp@gse.harvard.edu to discuss ways to contribute to the evolution of open-source synthetic data.
Synthetic Data Engine
OpenSDPsynthR is not a dataset; it is a data simulation package written in R. Generating synthetic data by simulation has advantages: the data can become richer and more complex over time as the simulation code is tuned and extended. Eventually, simulation might draw on publicly available data to generate a synthetic version of any school district on demand.
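To make the simulation idea concrete, here is a minimal sketch in plain Python. It does not use OpenSDPsynthR or reproduce its API; the function name, field names, and every parameter value (the low-income share, the size of the school effects) are hypothetical choices made only for illustration.

```python
import random


def simulate_students(n_students, n_schools, seed):
    """Simulate a toy student-level table. All fields and parameters are hypothetical."""
    rng = random.Random(seed)  # a fixed seed makes the synthetic data reproducible
    # Give each school its own average score so that students cluster within schools.
    school_means = [rng.gauss(0.0, 0.5) for _ in range(n_schools)]
    students = []
    for sid in range(1, n_students + 1):
        school = rng.randrange(n_schools)
        frpl = rng.random() < 0.45  # assumed share of low-income students
        # Score = school effect + assumed low-income gap + individual noise.
        score = school_means[school] - (0.3 if frpl else 0.0) + rng.gauss(0.0, 1.0)
        students.append({
            "sid": sid,
            "school_id": school + 1,
            "frpl": frpl,
            "math_score": round(score, 3),
        })
    return students


students = simulate_students(n_students=500, n_schools=3, seed=213)
```

Because the data come entirely from code, the table can be made richer by extending the function (more years, more outcomes, more realistic relationships) without ever touching a real student record.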
OpenSDPsynthR
Author: Strategic Data Project
Programming Language: R
OpenSDPsynthR Templates
Author: Strategic Data Project
Programming Language: R and Stata
Static Synthetic Data
Using machine learning routines to mimic real data could work well for analysts who want to demonstrate, on OpenSDP, code they have written for their own school systems. For example, SDP’s “Faketucky” is a synthetic dataset based on real student data. It was developed as an offshoot of the Strategic Data Project’s college-going diagnostic for Kentucky, using the R package synthpop. “Fake County” is a synthetic teacher dataset resulting from SDP’s human capital diagnostic work.
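As a rough illustration of how a routine can mimic real data, the sketch below synthesizes records one column at a time, drawing each value from real records that share the value already synthesized for the preceding column. This is a deliberately simplified stand-in written in plain Python: it is not synthpop's implementation, it is not how Faketucky was built, and it offers no privacy guarantee. The input records and column names are invented for the example.

```python
import random


def synthesize(records, columns, seed):
    """Toy sequential synthesis: draw each column conditional on the previous one."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(len(records)):
        row = {}
        prev = None
        for col in columns:
            if prev is None:
                donors = records  # first column: sample from the observed values
            else:
                # Later columns: sample only from real records that match
                # the value just synthesized for the preceding column.
                donors = [r for r in records if r[prev] == row[prev]]
            row[col] = rng.choice(donors)[col]
            prev = col
        synthetic.append(row)
    return synthetic


# Invented records standing in for a confidential student file.
real = [
    {"frpl": "yes", "hs_grad": "yes", "college": "no"},
    {"frpl": "yes", "hs_grad": "no", "college": "no"},
    {"frpl": "no", "hs_grad": "yes", "college": "yes"},
    {"frpl": "no", "hs_grad": "yes", "college": "no"},
    {"frpl": "yes", "hs_grad": "yes", "college": "yes"},
    {"frpl": "no", "hs_grad": "no", "college": "no"},
]
columns = ["frpl", "hs_grad", "college"]
synthetic = synthesize(real, columns, seed=1)
```

The synthetic rows preserve the relationships between neighboring columns, so analysis code written against the real file should run unchanged on the synthetic one, while individual rows are no longer tied to a specific real student.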
Faketucky
Author: Strategic Data Project
Programming Language: Stata and R
Fake County
Author: Strategic Data Project
Programming Language: Stata
Publicly Available Data
OpenSDP encourages community members to use publicly available datasets and to share their results, along with guidance for replicating the work. For example, the Nevada Report Card package from Data Insight Partners extracts and prepares Nevada school- and district-level data and provides analysis-ready datasets.
nevadaReportCardR
Author: Data Insight Partners
Programming Language: R
Some useful public datasets include:
College Scorecard Data: These data provide insights into the performance of colleges eligible to receive federal financial aid, and offer a look at the outcomes of students at those schools.
Ed Data Inventory: A description of data reported to the Department of Education as part of grant activities, along with administrative and statistical data assembled and maintained by the Department.
Stanford CEPA data: Data files covering a range of detailed information on educational conditions, contexts, and outcomes in schools and school districts across the United States.