How can we collaborate on education analytics when we work in closed data environments?

An open-source community for education data analysts seems like an oxymoron: most of the student-level data we work with is private, and data formats are specific to individual agencies. We hope to tackle these two problems by drawing on the OpenSDP community for help making realistic synthetic data that can be used to develop, share, and present analyses, and by embracing data standards whenever possible.

The repositories below show OpenSDP’s preliminary attempts to tackle the synthetic data challenge, in a tale of three datasets. We encourage you to explore the open-source data engine and the Faketucky dataset, and to consider using publicly available data. Contact OpenSDP at opensdp@gse.harvard.edu to discuss ways to contribute to the evolution of open-source synthetic data.

Synthetic Data Engine

OpenSDPsynthR is not actually a dataset; it is a data simulation package written in R. Using simulation to generate synthetic data has advantages: the data can become richer and more complex over time as the simulation code is tuned and extended. Eventually, simulation might draw on publicly available data to generate a synthetic version of any school district on demand.

OpenSDPsynthR

This simulation package generates fully synthetic data with inputs that can be varied to make an infinite number of simulated student populations with different sizes and characteristics. It is used to generate sample data for OpenSDP's College-Going Pathways analysis guides and for our R and Stata analysis template repositories.
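As a rough illustration of what "inputs that can be varied" means in a simulation approach, here is a minimal sketch in Python. It is not OpenSDPsynthR code and does not reflect that package's API (which is written in R); the function name, the parameters, and the assumed relationship between income status, test scores, and college enrollment are all hypothetical and chosen only to show the idea.

```python
import random
from dataclasses import dataclass


@dataclass
class Student:
    student_id: int
    low_income: bool
    math_score: float          # standardized score, hypothetical scale
    enrolled_in_college: bool


def simulate_students(n_students, low_income_rate=0.4,
                      base_enroll_rate=0.6, seed=None):
    """Generate a fully synthetic student population.

    All parameters and relationships here are illustrative assumptions,
    not values taken from OpenSDPsynthR or from real data.
    """
    rng = random.Random(seed)  # a seeded generator makes runs reproducible
    students = []
    for student_id in range(1, n_students + 1):
        low_income = rng.random() < low_income_rate
        # Assumed: scores are normally distributed, with a shifted mean by group.
        score = rng.gauss(-0.3 if low_income else 0.2, 1.0)
        # Assumed: enrollment probability rises with score, clamped to [0, 1].
        p_enroll = min(max(base_enroll_rate + 0.15 * score, 0.0), 1.0)
        enrolled = rng.random() < p_enroll
        students.append(Student(student_id, low_income, round(score, 3), enrolled))
    return students


# Varying the inputs yields populations of different sizes and characteristics.
small_district = simulate_students(500, low_income_rate=0.7, seed=1)
large_district = simulate_students(5000, low_income_rate=0.2, seed=1)
```

Because every record is drawn from the simulation rather than copied or perturbed from real records, no real student appears in the output; the trade-off is that the data is only as realistic as the assumptions coded into it.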


OpenSDPsynthR Templates

These R and Stata template repositories contain files, folders, and sample web guide templates to make it easier to get started with analyses using the OpenSDPsynthR data. Clone or download the repositories to begin.


Static Synthetic Data

Using machine learning routines to mimic real data could work well for analysts who want to demonstrate, on OpenSDP, code that they have written for their own school systems. For example, SDP’s “Faketucky” is a synthetic dataset based on real student data. It was developed as an offshoot of the Strategic Data Project’s college-going diagnostic for Kentucky, using the R machine learning package synthpop. This dataset is surprisingly realistic, but it can’t be extended or altered.

Faketucky

The Faketucky synthetic college-going analysis file contains high school and college outcome data for two graduating cohorts of approximately 40,000 students. There are no real children in the dataset, but it mirrors the relationships between variables present in real data.


Publicly Available Data

OpenSDP encourages community members to use publicly available datasets and share the results (as well as guidance for replicating the analyses). Some useful datasets include:

College Scorecard Data: These data provide insights into the performance of colleges eligible to receive federal financial aid, and offer a look at the outcomes of students at those schools.

Ed Data Inventory: This inventory describes data reported to the Department of Education as part of grant activities, along with administrative and statistical data assembled and maintained by the Department.

Stanford CEPA Data: These data files cover a range of detailed information on educational conditions, contexts, and outcomes in schools and school districts across the United States.