How can we collaborate on education analytics when we work in closed data environments?

An open-source community for education data analysts can seem like an oxymoron: most of the student-level data that we work with is private, and data formats are specific to individual agencies. We hope to tackle these two problems by drawing on the OpenSDP community for help making realistic synthetic data that can be used to develop, share, and present analyses, and by embracing data standards whenever possible.

The repositories below show OpenSDP’s preliminary attempts to tackle the synthetic data challenge. We encourage you to explore the open-source data engine and the Faketucky and Fake County synthetic datasets. We also encourage you to use publicly available data. Contact OpenSDP at opensdp@gse.harvard.edu to discuss ways to contribute to the evolution of open-source synthetic data.

Synthetic Data Engine

OpenSDPsynthR is not actually a dataset; it is a data simulation package written in R. There are advantages to using simulation to generate synthetic data. The data can become richer and more complex over time as the simulation code is tuned and extended. Eventually, simulation could draw on publicly available data to generate a synthetic version of any school district on demand.
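The general idea of simulation-driven synthetic data can be sketched in a few lines of Python. This is an illustrative toy, not OpenSDPsynthR's code or interface: the `SimConfig` fields, the `simulate_students` function, and every rate shown are hypothetical assumptions chosen only to show how changing the inputs changes the simulated population.

```python
import random
from dataclasses import dataclass


@dataclass
class SimConfig:
    """Hypothetical simulation inputs (not OpenSDPsynthR's actual parameters)."""
    n_students: int = 500
    n_schools: int = 3
    low_income_rate: float = 0.4   # assumed share of low-income students
    base_grad_rate: float = 0.85   # assumed graduation rate for other students
    grad_gap: float = 0.10         # assumed gap for low-income students


def simulate_students(config, seed=0):
    """Return a list of fully simulated student records."""
    rng = random.Random(seed)  # seeded so the same inputs give the same data
    students = []
    for student_id in range(1, config.n_students + 1):
        low_income = rng.random() < config.low_income_rate
        p_grad = config.base_grad_rate - (config.grad_gap if low_income else 0.0)
        students.append({
            "student_id": student_id,
            "school_id": rng.randint(1, config.n_schools),
            "low_income": low_income,
            "graduated": rng.random() < p_grad,
        })
    return students


# Varying the inputs yields populations of different sizes and characteristics.
small = simulate_students(SimConfig(n_students=200, n_schools=2), seed=42)
large = simulate_students(
    SimConfig(n_students=5000, n_schools=10, low_income_rate=0.6), seed=42
)
print(len(small), len(large))
```

Because no real records are read at any point, every row is fully synthetic; realism comes only from how carefully the rates and relationships in the configuration are tuned.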

OpenSDPsynthR

This simulation package generates fully synthetic data with inputs that can be varied to make a practically unlimited number of simulated student populations with different sizes and characteristics. It is used to generate sample data for OpenSDP's College-Going Pathways analysis guides and for our R and Stata analysis template repositories.

Go to Repository

OpenSDPsynthR Templates

These R and Stata template repositories contain files, folders, and sample web guide templates to make it easier to get started with analyses using the OpenSDPsynthR data. Clone or download the repositories to begin.

Template in R

Template in Stata

Static Synthetic Data

Using machine learning routines to mimic real data could work well for analysts who want to demonstrate, on OpenSDP, code that they have written for their own school systems. For example, SDP’s “Faketucky” is a synthetic dataset based on real student data. It was developed as an offshoot of the Strategic Data Project’s college-going diagnostic for Kentucky, using the R package synthpop. “Fake County” is a synthetic teacher dataset resulting from SDP’s human capital diagnostic work.
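To give a feel for how a synthetic dataset can be based on real data without copying any individual record, here is a deliberately simplified Python sketch. It is not a description of how synthpop works internally or of how Faketucky was built; the `synthesize` function, the variable names, and the toy input rows are all made up for illustration. The sketch samples one variable from its observed distribution and then samples a second variable conditional on the first, so the relationship between the two carries over.

```python
import random
from collections import Counter, defaultdict


def synthesize(real_rows, n, seed=0):
    """Draw n synthetic (first, second) pairs modeled on real_rows.

    The first variable is sampled from its observed distribution. The second
    is sampled from its observed distribution among real rows that share the
    sampled first value, which preserves the association between the two.
    """
    rng = random.Random(seed)
    marginal = Counter(first for first, _ in real_rows)
    conditional = defaultdict(Counter)
    for first, second in real_rows:
        conditional[first][second] += 1

    first_values = list(marginal)
    first_weights = [marginal[value] for value in first_values]

    synthetic = []
    for _ in range(n):
        first = rng.choices(first_values, weights=first_weights)[0]
        second_counts = conditional[first]
        second = rng.choices(
            list(second_counts), weights=list(second_counts.values())
        )[0]
        synthetic.append((first, second))
    return synthetic


# Made-up toy input standing in for confidential records.
toy_rows = (
    [("high_gpa", "enrolled")] * 80
    + [("high_gpa", "not_enrolled")] * 20
    + [("low_gpa", "enrolled")] * 30
    + [("low_gpa", "not_enrolled")] * 70
)
fake_rows = synthesize(toy_rows, n=1000, seed=1)
```

With only two categorical variables this is little more than resampling; production tools handle many variables of mixed types and fit statistical models at each step rather than using raw frequency tables.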

Faketucky

The Faketucky synthetic college-going analysis file contains high school and college outcome data for two graduating cohorts of approximately 40,000 students. There are no real children in the dataset, but it mirrors the relationships between variables present in real data.

Go to Repository

Fake County

Fake County is a synthetic panel dataset comprising four years of teacher data with roughly 10,000 teachers per year. It includes information about teacher demographics, assignments, salary, credentials, experience, evaluation scores, and hiring and retention status, as well as school characteristics. There are no real teachers in the dataset, but it is based on real data.

Go to Repository

Publicly Available Data

OpenSDP encourages community members to use publicly available datasets and to share their results, along with guidance for replicating the work. For example, the Nevada Report Card package from Data Insight Partners extracts and prepares Nevada school- and district-level data and provides analysis-ready datasets.

nevadaReportCardr

nevadaReportCardr is an open-source R package that provides functions to connect to the publicly available API for NevadaReportCard.com. The package also contains pre-formatted datasets covering most of the data available on Nevada Report Card.

Go to Repository

Some useful public datasets include:

College Scorecard data provide insights into the performance of colleges eligible to receive federal financial aid and into the outcomes of students at those schools.

The Ed Data Inventory describes data reported to the Department of Education as part of grant activities, along with administrative and statistical data assembled and maintained by the Department.

Stanford CEPA data files provide detailed information on educational conditions, contexts, and outcomes in schools and school districts across the United States.