College-Going Pathways R Version
In this guide you will be able to visualize the progress of students and student subgroups through important milestones from ninth grade through the second year of college
The College-Going Pathways series is a set of guides, code, and sample data about policy-relevant college-going topics. Browse this and other guides in the series for ideas about ways to investigate student pathways through high school and college. Each guide includes several analyses in the form of charts together with Stata analysis and graphing code to generate each chart.
Once you’ve identified analyses that you want to try to replicate or modify, click the “Download” buttons to download Stata code and sample data. You can make changes to the charts using the code and sample data, or modify the code to work with your own data. If you’re familiar with Github, you can click “Go to Repository” and clone the entire College-Going Pathways repository to your own computer. Go to the Participate page to read about more ways to engage with the OpenSDP community.
The data visualizations in the College-Going Pathways series use a synthetically generated college-going analysis sample data file which has one record per student. Each high school student is assigned to a ninth-grade cohort, and each student record includes demographic and program participation information, annual GPA and on-track status, high school graduation outcomes, and college enrollment information. The Connect guide (coming soon) will provide guidance and example code which will help you build a college-going analysis file using data from your own school system.
This guide takes advantage of the OpenSDP synthetic dataset.
library(tidyverse) # main suite of R packages to ease data analysis
library(magrittr) # allows for some easier pipelines of data
library(tidyr) #
library(ggplot2) # to plot
library(scales) # to format
library(grid)
library(gridExtra) # to plot
# Read in some R functions that are convenience wrappers
source("../R/functions.R")
pkgTest("devtools")
pkgTest("OpenSDPsynthR")
The analyses in this guide summarize student attainment from ninth grade through college using three milestones: 1) on-time high school completion, 2) seamless college transition, and 3) persistence to the second year of college.
Through these analyses, you identify drop-offs along the education pipeline for students as a group and as subgroups. For different subgroups, these analyses illuminate disparities in college attainment by race, family income, high school attended, and academic achievement. A steep decline in college enrollment from high school completion date for specific subgroups may indicate barriers to college access. On the other hand, a steep decline from initial college enrollment to second-year persistence might suggest students were not prepared for rigorous college coursework during high school.
One of the most important decisions in running each analysis is defining the sample. Each analysis corresponds to a different part of the education pipeline and as a result requires different cohorts of students.
If you are using the synthetic data we have provided, the sample restrictions have been predefined and are included below. If you run this code using your own agency data, change the sample restrictions based on your data. Note that you will have to run these sample restrictions at the beginning of your do file so they will feed into the rest of your code.
# Read in global variables for sample restriction
# Agency name
agency_name <- "Agency"
# Ninth grade cohorts you can observe persisting to the second year of college
chrt_ninth_begin_persist_yr2 = 2004
chrt_ninth_end_persist_yr2 = 2006
# Ninth grade cohorts you can observe graduating high school on time
chrt_ninth_begin_grad = 2004
chrt_ninth_end_grad = 2006
# Ninth grade cohorts you can observe graduating high school one year late
chrt_ninth_begin_grad_late = 2004
chrt_ninth_end_grad_late = 2006
# High school graduation cohorts you can observe enrolling in college the fall after graduation
chrt_grad_begin = 2008
chrt_grad_end = 2010
# High school graduation cohorts you can observe enrolling in college two years after hs graduation
chrt_grad_begin_delayed = 2008
chrt_grad_end_delayed = 2010
# In RStudio these variables will appear in the Environment pane under "Values"
Based on the sample data, you will have three cohorts (sometimes only two) for analysis. If you are using your own agency data, you may decide to aggregate results for more or fewer cohorts to report your results. This decision depends on 1) how much historical data you have available and 2) what balance to strike between reliability and averaging away information on recent trends. We suggest you average results for the last three cohorts to take advantage of larger sample sizes and improve reliability. However, if you have data for more than three cohorts, you may decide to not average data out for fear of losing information about trends and recent changes in your agency.
This guide is an open-source document hosted on Github and generated using the Stata Webdoc package. We welcome feedback, corrections, additions, and updates. Please visit the OpenSDP college-going pathways repository to read our contributor guidelines.
Purpose: This analysis tracks the overall percent of ninth graders who complete high school on-time, seamlessly enroll in college, and persist to the second year of college. To examine the range of attainment at each milestone, the minimum and maximum values of any high school are shown.
Required Analysis File Variables:
sid
chrt_ninth
first_hs_name
ontime_grad
enrl_1oct_ninth_yr1_any
enrl_1oct_ninth_yr2_any
Analysis-Specific Sample Restrictions: Keep students in ninth grade cohorts for which persistence to the second year of college can be reported.
Ask Yourself
Analytic Technique: Calculate the proportion of first-time ninth graders that progress to each step along the education pipeline.
# Step 1: Keep students in ninth grade cohorts you can observe persisting to the
# second year of college
plotdf <- filter(cgdata, chrt_ninth >= chrt_ninth_begin_persist_yr2 &
chrt_ninth <= chrt_ninth_end_persist_yr2)
# Step 4: Create variables for the outcomes "regular diploma recipients",
# "seamless transitioners" and "second year persisters"
plotdf$grad <- ifelse(!is.na(plotdf$chrt_grad) & plotdf$ontime_grad ==1, 1, 0)
plotdf$seamless_transitioners_any <- as.numeric(plotdf$enrl_1oct_ninth_yr1_any == 1 &
plotdf$ontime_grad == 1)
plotdf$second_year_persisters = as.numeric(plotdf$enrl_1oct_ninth_yr1_any == 1 &
plotdf$enrl_1oct_ninth_yr2_any == 1 &
plotdf$ontime_grad == 1)
# Step 4: Create agency-level average outcomes
# 2. Calculate the mean of each outcome variable by agency
agencyData <- plotdf %>%
summarize(grad = mean(grad),
seamless_transitioners_any = mean(seamless_transitioners_any, na.rm=TRUE),
second_year_persisters = mean(second_year_persisters, na.rm=TRUE),
N = n())
agencyData$school_name <- "AGENCY AVERAGE"
# 2. Calculate the mean of each outcome variable by first high school attended
schoolData <- plotdf %>% group_by(first_hs_name) %>%
summarize(grad = mean(grad),
seamless_transitioners_any = mean(seamless_transitioners_any,
na.rm=TRUE),
second_year_persisters = mean(second_year_persisters, na.rm=TRUE),
N = n())
# 1. Create a variable school_name that takes on the value of students’ first
## high school attended
names(schoolData)[1] <- "school_name"
# 3. Identify the agency maximum values for each of the three outcome variables
maxSchool <- schoolData %>% summarize_all(.funs = funs("max"))
maxSchool$school_name <- "AGENCY MAX HS"
# 4. Identify the agency minimum values for each of the three outcome variables
minSchool <- schoolData %>% summarize_all(.funs = funs("min"))
minSchool$school_name <- "AGENCY MIN HS"
# 5. Append the three tempfiles to the school-level file loaded into R
schoolData <- bind_rows(schoolData, agencyData,
minSchool, maxSchool)
rm(agencyData, minSchool, maxSchool)
# Step 6: Prepare to graph the results
schoolData$cohort <- 1
schoolData <- schoolData %>% gather(key = outcome,
value = measure, -N, -school_name)
schoolData$subset <- grepl("AGENCY", schoolData$school_name)
schoolData$outcome[schoolData$outcome == "cohort"] <- "Ninth Graders"
schoolData$outcome[schoolData$outcome == "grad"] <- "On-time Graduates"
schoolData$outcome[schoolData$outcome == "seamless_transitioners_any"] <-
"Seamless College Transitioner"
schoolData$outcome[schoolData$outcome == "second_year_persisters"] <-
"Second Year Persisters"
# Figure caption
figureCaption <- paste0("Sample: ", chrt_ninth_begin_persist_yr2-1, "-",
chrt_ninth_begin_persist_yr2, " to ",
chrt_ninth_end_persist_yr2-1, "-", chrt_ninth_end_persist_yr2, " ",
agency_name, " high school graduates. \n",
"Postsecondary enrollment outcomes from NSC matched records. \n",
"All other data from ", agency_name, " administrative records.")
## Step 7: Graph the results
ggplot(schoolData[schoolData$subset,],
aes(x = outcome, y = measure, group = school_name,
color = school_name, linetype = school_name)) +
geom_line(size = 1.1) + geom_point(aes(group = 1), color = I("black")) +
geom_text(aes(label = round(measure * 100, 1)), vjust = -0.8, hjust = -0.25,
color = I("black")) +
scale_y_continuous(limits = c(0, 1), label = percent) +
theme_bw() + theme(legend.position = c(0.825, 0.825)) +
guides(color = guide_legend("", keywidth = 6,
label.theme = element_text(face = "bold",
size = 8,
angle = 0)),
linetype = "none") +
labs(y = "Percent of Ninth Graders",
title = "Student Progression from 9th Grade Through College",
subtitle = "Agency Average", x = "",
caption = figureCaption)
Purpose: This analysis tracks the percent of ninth graders of different races/ethnicities who complete high school on-time, seamlessly enroll in college, and persist to the second year of college.
Required Analysis File Variables:
sid
race_ethnicity
chrt_ninth
ontime_grad
enrl_1oct_ninth_yr1_any
enrl_1oct_ninth_yr2_any
Analysis-Specific Sample Restrictions:
Ask Yourself
Analytic Technique: Calculate the proportion of first-time ninth graders that progress to each step along the education pipeline.
# Step 1: Keep students in ninth grade cohorts you can observe persisting to
## the second year of college
plotdf <- filter(cgdata, chrt_ninth >= chrt_ninth_begin_persist_yr2 &
chrt_ninth <= chrt_ninth_end_persist_yr2)
# Step 2: Create variables for the outcomes "regular diploma recipients",
## "seamless transitioners" and "second year persisters"
plotdf$grad <- ifelse(!is.na(plotdf$chrt_grad) & plotdf$ontime_grad ==1, 1, 0)
plotdf$seamless_transitioners_any <- as.numeric(plotdf$enrl_1oct_ninth_yr1_any == 1 &
plotdf$ontime_grad == 1)
plotdf$second_year_persisters = as.numeric(plotdf$enrl_1oct_ninth_yr1_any == 1 &
plotdf$enrl_1oct_ninth_yr2_any == 1 &
plotdf$ontime_grad == 1)
# Step 3: Create agency-level average outcomes
progressRace <- plotdf %>% group_by(race_ethnicity) %>%
summarize(grad = mean(grad),
seamless_transitioners_any = mean(seamless_transitioners_any, na.rm=TRUE),
second_year_persisters = mean(second_year_persisters, na.rm=TRUE),
N = n())
# Step 4: Reformat the data for plotting
progressRace$cohort <- 1
progressRace <- progressRace %>% gather(key = outcome,
value = measure, -N, -race_ethnicity)
# Step 5: Recode variables for plot-friendly labels
progressRace$outcome[progressRace$outcome == "cohort"] <- "Ninth Graders"
progressRace$outcome[progressRace$outcome == "grad"] <- "On-time Graduates"
progressRace$outcome[progressRace$outcome == "seamless_transitioners_any"] <-
"Seamless College Transitioner"
progressRace$outcome[progressRace$outcome == "second_year_persisters"] <-
"Second Year Persisters"
progressRace$subset <- ifelse(as.numeric(progressRace$race_ethnicity) %in% c(1, 3, 2, 5),
TRUE, FALSE)
progressRace$race_ethnicity[progressRace$race_ethnicity == 1] <- "Black"
progressRace$race_ethnicity[progressRace$race_ethnicity == 2] <- "Asian"
progressRace$race_ethnicity[progressRace$race_ethnicity == 3] <- "Hispanic"
progressRace$race_ethnicity[progressRace$race_ethnicity == 4] <- "Native American"
progressRace$race_ethnicity[progressRace$race_ethnicity == 5] <- "White"
progressRace$race_ethnicity[progressRace$race_ethnicity == 6] <- "Multiple/Other"
# Step 6: Graph the results
ggplot(progressRace[progressRace$subset,],
aes(x = outcome, y = measure, group = race_ethnicity,
color = race_ethnicity, linetype = race_ethnicity)) +
geom_line(size = 1.1) + geom_point(aes(group = 1), color = I("black")) +
geom_text(aes(label = round(measure * 100, 1)), vjust = -0.8, hjust = -0.25,
color = I("black")) +
scale_y_continuous(limits = c(0, 1), label = percent) +
theme_bw() + theme(legend.position = c(0.825, 0.825)) +
guides(color = guide_legend("", keywidth = 6,
label.theme =
element_text(face = "bold", size = 8,
angle = 0)), linetype = "none") +
labs(y = "Percent of Ninth Graders",
title = "Student Progression from 9th Grade Through College",
subtitle = "By Student Race/Ethnicity", x = "",
caption = figureCaption)
Purpose: This analysis tracks the percent of ninth graders of different races/ethnicities who ever qualified for free or reduce price lunch who complete high school on-time, seamlessly enroll in college, and persist to the second year of college.
Required Analysis File Variables:
sid
race_ethnicity
frpl_ever
chrt_ninth
ontime_grad
enrl_1oct_ninth_yr1_any
enrl_1oct_ninth_yr2_any
Analysis-Specific Sample Restrictions:
Ask Yourself
Analytic Technique: Calculate the proportion of first-time ninth graders that progress to each step along the education pipeline.
## Step 1: Keep students in ninth grade cohorts you can observe persisting to
## the second year of college AND are ever FRPL-eligible
plotdf <- filter(cgdata, chrt_ninth >= chrt_ninth_begin_persist_yr2 &
chrt_ninth <= chrt_ninth_end_persist_yr2) %>%
filter(frpl_ever > 0)
# Step 2: Create variables for the outcomes "regular diploma recipients",
## "seamless transitioners" and "second year persisters"
plotdf$grad <- ifelse(!is.na(plotdf$chrt_grad) & plotdf$ontime_grad == 1, 1, 0)
plotdf$seamless_transitioners_any <- ifelse(plotdf$enrl_1oct_ninth_yr1_any == 1 &
plotdf$ontime_grad == 1, 1, 0)
plotdf$second_year_persisters = ifelse(plotdf$enrl_1oct_ninth_yr1_any == 1 &
plotdf$enrl_1oct_ninth_yr2_any == 1 &
plotdf$ontime_grad == 1, 1, 0)
# Step 4: Create agency-level average outcomes
# Calculate the mean of each outcome variable by race/ethnicity
progressRaceFRL <- plotdf %>% group_by(race_ethnicity) %>%
summarize(grad = mean(grad),
seamless_transitioners_any = mean(seamless_transitioners_any, na.rm=TRUE),
second_year_persisters = mean(second_year_persisters, na.rm=TRUE),
N = n())
# Step 5: Reformat the data file so that one variable contains all the
# outcomes of interest
progressRaceFRL %<>% filter(N >= 20)
# Step 6: Prepare to graph the results
## Reshape the data
progressRaceFRL$cohort <- 1
progressRaceFRL <- progressRaceFRL %>% gather(key = outcome,
value = measure, -N, -race_ethnicity)
## Recode the variables for plot friendly labels
progressRaceFRL$outcome[progressRaceFRL$outcome == "cohort"] <-
"Ninth Graders"
progressRaceFRL$outcome[progressRaceFRL$outcome == "grad"] <-
"On-time Graduates"
progressRaceFRL$outcome[progressRaceFRL$outcome == "seamless_transitioners_any"] <-
"Seamless College Transitioner"
progressRaceFRL$outcome[progressRaceFRL$outcome == "second_year_persisters"] <-
"Second Year Persisters"
progressRaceFRL$subset <- ifelse(as.numeric(progressRaceFRL$race_ethnicity) %in% c(1, 3, 5),
TRUE, FALSE)
progressRaceFRL$race_ethnicity[progressRaceFRL$race_ethnicity == 1] <- "Black"
progressRaceFRL$race_ethnicity[progressRaceFRL$race_ethnicity == 2] <- "Asian"
progressRaceFRL$race_ethnicity[progressRaceFRL$race_ethnicity == 3] <- "Hispanic"
progressRaceFRL$race_ethnicity[progressRaceFRL$race_ethnicity == 4] <- "Native American"
progressRaceFRL$race_ethnicity[progressRaceFRL$race_ethnicity == 5] <- "White"
progressRaceFRL$race_ethnicity[progressRaceFRL$race_ethnicity == 6] <- "Multiple/Other"
ggplot(progressRaceFRL[progressRaceFRL$subset,],
aes(x = outcome, y = measure, group = race_ethnicity,
color = race_ethnicity, linetype = race_ethnicity)) +
geom_line(size = 1.1) + geom_point(aes(group = 1), color = I("black")) +
geom_text(aes(label = round(measure * 100, 1)), vjust = -0.8, hjust = -0.25,
color = I("black")) +
scale_y_continuous(limits = c(0, 1), label = percent) +
theme_bw() + theme(legend.position = c(0.825, 0.825)) +
guides(color = guide_legend("", keywidth = 6,
label.theme = element_text(face = "bold",
size = 8,
angle = 0)),
linetype = "none") +
labs(y = "Percent of Ninth Graders",
title = "Student Progression from 9th Grade Through College",
subtitle = paste0(c(
"Among Students Qualifying for Free or Reduced Price Lunch \n",
"By Student Race/Ethnicity")),
x = "",
caption = figureCaption)
Purpose: This analysis tracks the percent of ninth graders at different levels of being on-track for graduation who complete high school on-time, seamlessly enroll in college, and then persist to the second year of college.
Required Analysis File Variables:
sid
chrt_ninth
ontrack_sample
ontrack_end_yr*
cum_gpa_yr*
ontime_grad
enrl_1oct_ninth_yr1_any
enrl_1oct_ninth_yr2_any
Analysis-Specific Sample Restrictions:
Ask Yourself
Analytic Technique: Calculate the proportion of first-time ninth graders that progressed along the education pipeline.
# // Step 1: Keep students in ninth grade cohorts you can observe persisting
# to the second year of college AND are included in the on-track analysis sample
plotdf <- filter(cgdata, chrt_ninth >= chrt_ninth_begin_persist_yr2 &
chrt_ninth <= chrt_ninth_end_persist_yr2) %>%
filter(ontrack_sample == 1) %>%
mutate(ontrack_endyr1 = as.numeric(ontrack_endyr1) -1)
# // Step 2: Create variables for the outcomes "regular diploma recipients",
# "seamless transitioners" and "second year persisters"
plotdf$grad <- ifelse(!is.na(plotdf$chrt_grad) & plotdf$ontime_grad ==1, 1, 0)
plotdf$seamless_transitioners_any <- as.numeric(plotdf$enrl_1oct_ninth_yr1_any == 1 &
plotdf$ontime_grad == 1)
plotdf$second_year_persisters = as.numeric(plotdf$enrl_1oct_ninth_yr1_any == 1 &
plotdf$enrl_1oct_ninth_yr2_any == 1 &
plotdf$ontime_grad == 1)
# // Step 3: Generate on track indicators that take into account students’ GPAs
# upon completion of their first year in high school
# TODO: Check for NA values in how ontrack_endyr is coded
plotdf$ot <- NA
plotdf$ot[plotdf$ontrack_endyr1 == 0] <- "Off-Track to Graduate"
# Check for correctness
plotdf$ot[plotdf$ontrack_endyr1 == 1 & plotdf$cum_gpa_yr1 < 3 &
!is.na(plotdf$cum_gpa_yr1)] <- "On-Track to Graduate, GPA < 3.0"
plotdf$ot[plotdf$ontrack_endyr1 == 1 & plotdf$cum_gpa_yr1 >= 3 &
!is.na(plotdf$cum_gpa_yr1)] <- "On-Track to Graduate, GPA >= 3.0"
# // Step 4: Calculate aggregates for the Agency by on track status
progressTrack <- plotdf %>% group_by(ot) %>%
summarize(grad = mean(grad),
seamless_transitioners_any = mean(seamless_transitioners_any, na.rm=TRUE),
second_year_persisters = mean(second_year_persisters, na.rm=TRUE),
N = n())
# // Step 5: Reformat the data file so that one variable contains all the outcomes
# of interest
progressTrack$cohort <- 1
progressTrack <- progressTrack %>% gather(key = outcome,
value = measure, -N, -ot)
progressTrack$outcome[progressTrack$outcome == "cohort"] <- "Ninth Graders"
progressTrack$outcome[progressTrack$outcome == "grad"] <- "On-time Graduates"
progressTrack$outcome[progressTrack$outcome == "seamless_transitioners_any"] <-
"Seamless College Transitioner"
progressTrack$outcome[progressTrack$outcome == "second_year_persisters"] <-
"Second Year Persisters"
# Annotate for direct labels, find endpoints
annotation_pos <- c(progressTrack$measure[progressTrack$outcome == "Second Year Persisters" &
progressTrack$ot == "Off-Track to Graduate"] - 0.25,
progressTrack$measure[progressTrack$outcome == "Second Year Persisters" &
progressTrack$ot == "On-Track to Graduate, GPA < 3.0"] + 0.07,
progressTrack$measure[progressTrack$outcome == "Second Year Persisters" &
progressTrack$ot == "On-Track to Graduate, GPA >= 3.0"]+0.05)
ann_txt <- data.frame(outcome = rep("Second Year Persisters", 3),
measure = annotation_pos + 0.1,
textlabel = c("Off-Track \nto Graduate",
"On-Track to Graduate,\n GPA < 3.0",
"On-Track to Graduate,\n GPA >= 3.0"))
ann_txt$ot <- ann_txt$textlabel
ggplot(progressTrack,
aes(x = outcome, y = measure, group = ot,
color = ot, linetype = ot)) +
geom_line(size = 1.1) + geom_point(aes(group = 1), color = I("black")) +
geom_text(aes(label = round(measure * 100, 1)), vjust = -0.8, hjust = -0.25,
color = I("black")) +
geom_text(data = ann_txt, aes(label = textlabel)) +
scale_y_continuous(limits = c(0, 1), label = percent) +
theme_bw() + theme(legend.position = c(0.825, 0.825)) +
scale_color_brewer(type = "qual", palette = 2) +
guides(color = "none",
linetype = "none") +
labs(y = "Percent of Ninth Graders",
title = "Student Progression from 9th Grade Through College",
subtitle = "By Course Credits and GPA after First High School Year", x = "",
caption = figureCaption)