College-Going Pathways R Version
In this guide you will be able to visualize the share of students who enroll and then persist into a second year of college based on high school, college type, and top-enrolling colleges.
The College-Going Pathways series is a set of guides, code, and sample data about policy-relevant college-going topics. Browse this and other guides in the series for ideas about ways to investigate student pathways through high school and college. Each guide includes several analyses in the form of charts, together with the R analysis and graphing code that generates each chart.
Once you’ve identified analyses that you want to replicate or modify, click the “Download” buttons to download the R code and sample data. You can make changes to the charts using the code and sample data, or modify the code to work with your own data. If you’re familiar with GitHub, you can click “Go to Repository” and clone the entire College-Going Pathways repository to your own computer. Go to the Participate page to read about more ways to engage with the OpenSDP community.
The data visualizations in the College-Going Pathways series use a synthetically generated college-going analysis sample data file which has one record per student. Each high school student is assigned to a ninth-grade cohort, and each student record includes demographic and program participation information, annual GPA and on-track status, high school graduation outcomes, and college enrollment information. The Connect guide (coming soon) will provide guidance and example code which will help you build a college-going analysis file using data from your own school system.
This guide takes advantage of the OpenSDP synthetic dataset.
library(tidyverse) # main suite of R packages to ease data analysis
library(magrittr) # allows for some easier pipelines of data
library(tidyr) # reshape and tidy data
library(ggplot2) # to plot
library(scales) # to format
library(grid)
library(gridExtra) # to plot
# Read in some R functions that are convenience wrappers
source("../R/functions.R")
pkgTest("devtools")
pkgTest("OpenSDPsynthR")
For many high school graduates, college enrollment is just the first of many hurdles on the road to postsecondary success. While considerable attention has been paid to challenges that surround college preparedness, access, and enrollment, only recently has conversation expanded to consider barriers to degree completion. These barriers must be understood and addressed at both the secondary and postsecondary levels for college attainment rates to increase. In the last section of the education pipeline, you examine patterns of persistence to the second year of college to identify early indications of student progress towards degree attainment.
One of the most important decisions in running each analysis is defining the sample. Each analysis corresponds to a different part of the education pipeline and as a result requires different cohorts of students.
If you are using the synthetic data we have provided, the sample restrictions have been predefined and are included below. If you run this code using your own agency data, change the sample restrictions based on your data. Note that you will have to run these sample restrictions at the beginning of your R script so they feed into the rest of your code.
# Read in global variables for sample restriction
# Agency name
agency_name <- "Agency"
# Ninth grade cohorts you can observe persisting to the second year of college
chrt_ninth_begin_persist_yr2 <- 2004
chrt_ninth_end_persist_yr2 <- 2006
# Ninth grade cohorts you can observe graduating high school on time
chrt_ninth_begin_grad <- 2004
chrt_ninth_end_grad <- 2006
# Ninth grade cohorts you can observe graduating high school one year late
chrt_ninth_begin_grad_late <- 2004
chrt_ninth_end_grad_late <- 2006
# High school graduation cohorts you can observe enrolling in college the fall after graduation
chrt_grad_begin <- 2008
chrt_grad_end <- 2010
# High school graduation cohorts you can observe enrolling in college two years after hs graduation
chrt_grad_begin_delayed <- 2008
chrt_grad_end_delayed <- 2010
# In RStudio these variables will appear in the Environment pane under "Values"
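As a minimal sketch of how these globals restrict a sample, the example below filters a toy data frame to the graduation cohort window. The data frame `toy` and its values are hypothetical and stand in for the analysis file; it uses base R only so it runs without extra packages.

```r
# Toy data: one row per student with a high school graduation cohort year
# (hypothetical values, for illustration only)
toy <- data.frame(
  sid = 1:6,
  chrt_grad = c(2007, 2008, 2009, 2010, 2011, NA)
)

# Same window as the globals defined above
chrt_grad_begin <- 2008
chrt_grad_end <- 2010

# Keep students inside the window; which() also drops rows with a missing cohort
in_window <- which(toy$chrt_grad >= chrt_grad_begin &
                     toy$chrt_grad <= chrt_grad_end)
toy_sample <- toy[in_window, ]

toy_sample$sid
```

Wrapping the condition in `which()` avoids the rows of `NA` that logical subsetting would otherwise return for students with a missing cohort.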
Based on the sample data, you will have three cohorts (sometimes only two) for analysis. If you are using your own agency data, you may decide to aggregate results for more or fewer cohorts to report your results. This decision depends on 1) how much historical data you have available and 2) what balance to strike between reliability and averaging away information on recent trends. We suggest you average results for the last three cohorts to take advantage of larger sample sizes and improve reliability. However, if you have data for more than three cohorts, you may decide to not average data out for fear of losing information about trends and recent changes in your agency.
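The trade-off between pooling cohorts and reporting them separately can be illustrated with a small sketch. The data frame `toy` and the column `persisted` are hypothetical; the point is only to show the two calculations side by side.

```r
# Toy data: ten students across three graduation cohorts
# (hypothetical values, for illustration only)
toy <- data.frame(
  chrt_grad = c(2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009, 2010, 2010),
  persisted = c(1, 1, 0, 0, 1, 1, 1, 0, 1, 1)
)

# Rate within each cohort: shows the trend, but each cell is small
by_cohort <- tapply(toy$persisted, toy$chrt_grad, mean)

# Pooled rate across all three cohorts: larger sample, trend averaged away
pooled <- mean(toy$persisted)

by_cohort
pooled
```

In this toy example the per-cohort rates rise from year to year, while the pooled rate reports a single number that hides that change.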
This guide is an open-source document hosted on GitHub and generated using the Stata Webdoc package. We welcome feedback, corrections, additions, and updates. Please visit the OpenSDP college-going pathways repository to read our contributor guidelines.
Purpose: Initial enrollment decisions can dramatically affect higher education trajectories and the likelihood of degree attainment. This analysis provides a snapshot of persistence to the second year of college by examining persistence rates across high schools in the system. The analysis illuminates differences in persistence by level of college first attended (two-year vs. four-year). Given another year of sample data, the analysis could also be conducted by time of initial entry (seamless vs. delayed enrollment).
Required Analysis File Variables:
sid
enrl_1oct_grad_yr1_any
enrl_1oct_grad_yr1_4yr
enrl_1oct_grad_yr1_2yr
enrl_grad_persist_any
enrl_grad_persist_4yr
enrl_grad_persist_2yr
last_hs_code
last_hs_name
enrl_ever_w2_grad_any
Analysis-Specific Sample Restrictions:
Ask Yourself
Possible Next Steps or Action Plans: Consider establishing MOUs with local community colleges to obtain detailed data on graduates’ postsecondary pursuits at two-year colleges (course enrollment and transcript data). These data would allow agencies to explore persistence rates by assignment to remediation coursework.
Analytic Technique: Calculate the proportion of seamless enrollers who persist to the second year of college, by type of college first attended and by the high school those students last attended (last_hs_name).
# // Step 1: Keep students in high school graduation cohorts you can observe
# enrolling in college the fall after graduation
plotdf <- cgdata %>% filter(chrt_grad >= chrt_grad_begin &
chrt_grad <= chrt_grad_end) %>%
select(sid, chrt_grad, enrl_1oct_grad_yr1_2yr, enrl_1oct_grad_yr1_4yr,
enrl_1oct_grad_yr1_any, enrl_grad_persist_any,
enrl_grad_persist_2yr, enrl_grad_persist_4yr, last_hs_name,
enrl_ever_w2_grad_any)
# // Step 2: Rename and recode for simplicity
plotdf$groupVar <- NA
plotdf$groupVar[plotdf$enrl_1oct_grad_yr1_2yr == 1] <- "2-year College"
plotdf$groupVar[plotdf$enrl_1oct_grad_yr1_4yr == 1] <- "4-year College"
# // Step 3: Obtain the agency-level average for persistence and enrollment
agencyData <- plotdf %>% group_by(groupVar) %>%
summarize(persistCount = sum(enrl_grad_persist_any, na.rm=TRUE),
totalCount = n()) %>%
ungroup %>%
mutate(total = sum(persistCount)) %>%
mutate(persistRate = persistCount / totalCount,
last_hs_name = "Agency AVERAGE")
# // Step 4: Obtain the school-level average for persistence and enrollment
schoolData <- plotdf %>% group_by(groupVar, last_hs_name) %>%
summarize(persistCount = sum(enrl_grad_persist_any, na.rm=TRUE),
totalCount = n()) %>%
ungroup %>% group_by(last_hs_name) %>%
mutate(total = sum(persistCount)) %>%
mutate(persistRate = persistCount / totalCount)
# Combine for chart
chartData <- bind_rows(agencyData, schoolData)
# // Step 5: Recode variables for plotting
chartData$last_hs_name <- gsub(" High School", "", chartData$last_hs_name)
# // Step 6: Filter rows out with missing values or small cell sizes
chartData <- na.omit(chartData)
chartData <- filter(chartData, totalCount > 20)
# // Step 7: Calculate rank for plot order
# Make ranks the same for 2 and 4 year colleges
chartData <- chartData %>% group_by(last_hs_name) %>%
mutate(order = mean(persistRate)) %>%
ungroup() %>%
mutate(order = min_rank(order)) %>%
arrange(last_hs_name)
# Convert to a factor and order for ggplot purposes
chartData$groupVar <- factor(chartData$groupVar)
chartData$groupVar <- relevel(chartData$groupVar, ref = "4-year College")
# Figure caption
figureCaption <- paste0("Sample: ", chrt_grad_begin-1, "-", chrt_grad_begin,
" to ",chrt_grad_end-1, "-", chrt_grad_end, " ",
agency_name, " high school graduates. \n",
"Postsecondary enrollment outcomes from NSC matched records. \n",
"All other data from ", agency_name, " administrative records.")
# // Step 8: Plot
ggplot(chartData, aes(x = reorder(last_hs_name, -order),
group = groupVar,
y = persistRate, fill = groupVar,
color = I("black"))) +
geom_bar(stat = 'identity', position = 'dodge') +
geom_text(aes(label = round(persistRate * 100, 0)),
position = position_dodge(0.9), vjust = -0.4) +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2),
name = "% of Seamless Enrollers",
expand = c(0,0), labels = percent) +
theme_classic() +
guides(fill = guide_legend("", keywidth = 3, nrow = 2)) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5, color = "black"),
legend.position = c(0.15, 0.2), axis.ticks.x = element_blank(),
legend.key = element_rect(color = "black")) +
scale_fill_brewer(type = "div", palette = 2) +
labs(x = "",
title = "College Persistence by High School, at Any College",
subtitle = "Seamless Enrollers by Type of College",
caption = figureCaption)
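The ordering logic in Step 7 can be checked on a toy table. This is a sketch in base R, where `ave()` and `rank(ties.method = "min")` stand in for the grouped `mutate()` and `min_rank()` calls; the school names and rates are hypothetical.

```r
# Toy chart data: two rows per school, one for each college type
# (hypothetical values, for illustration only)
toy <- data.frame(
  last_hs_name = c("A", "A", "B", "B", "C", "C"),
  groupVar = rep(c("2-year College", "4-year College"), times = 3),
  persistRate = c(0.5, 0.9, 0.6, 0.7, 0.4, 0.6),
  stringsAsFactors = FALSE
)

# Average the two rates within each school so both bars share one sort key
toy$order <- ave(toy$persistRate, toy$last_hs_name, FUN = mean)

# Rank the shared key; ties get the minimum rank, as with dplyr::min_rank()
toy$order <- rank(toy$order, ties.method = "min")

toy[, c("last_hs_name", "groupVar", "order")]
```

Because both rows for a school carry the same rank, `reorder(last_hs_name, -order)` in the plot keeps each pair of bars together and places the school with the highest average rate first.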
Purpose: This analysis provides a snapshot of persistence to the second year of college from one type of college to another for different high schools in the system. The left chart shows how seamless enrollers in 4-year colleges either persist at a 4-year college or switch to a 2-year college. The right chart shows how seamless enrollers in 2-year colleges either persist at a 2-year college or switch to a 4-year college.
Required Analysis File Variables:
sid
enrl_1oct_grad_yr1_any
enrl_1oct_grad_yr1_4yr
enrl_1oct_grad_yr1_2yr
enrl_grad_persist_any
enrl_grad_persist_4yr
enrl_grad_persist_2yr
last_hs_code
last_hs_name
Analysis-Specific Sample Restrictions:
Ask Yourself
Possible Next Steps or Action Plans: Create individual school-level reports for administrators and college counselors to communicate which postsecondary institutions are associated with greater rates of persistence. Additionally, conduct similar analyses that include more detailed institutional information that may be associated with students’ prospects of persisting (e.g. cost of tuition and room/board, financial aid, etc.).
Analytic Technique: Calculate the proportion of seamless enrollers who persist at the same type of college, or switch to the other type, in the second year of college, by high school and by type of college first attended.
# // Step 1: Keep students in high school graduation cohorts you can observe
# enrolling in college the fall after graduation
plotdf <- cgdata %>% filter(chrt_grad >= chrt_grad_begin &
chrt_grad <= chrt_grad_end) %>%
select(sid, chrt_grad, enrl_1oct_grad_yr1_2yr, enrl_1oct_grad_yr1_4yr,
enrl_1oct_grad_yr1_any, enrl_1oct_grad_yr2_2yr, enrl_1oct_grad_yr2_4yr,
enrl_1oct_grad_yr2_any, enrl_grad_persist_any,
enrl_grad_persist_2yr, enrl_grad_persist_4yr, last_hs_name)
# Clean up missing data for binary recoding
plotdf$enrl_grad_persist_4yr <- zeroNA(plotdf$enrl_grad_persist_4yr)
plotdf$enrl_grad_persist_2yr <- zeroNA(plotdf$enrl_grad_persist_2yr)
plotdf$enrl_1oct_grad_yr1_2yr <- zeroNA(plotdf$enrl_1oct_grad_yr1_2yr)
plotdf$enrl_1oct_grad_yr1_4yr <- zeroNA(plotdf$enrl_1oct_grad_yr1_4yr)
# // Step 2: Create binary outcomes for enrollers who switch from 4-yr to 2-yr,
# or vice versa and recode variables
plotdf$persist_pattern <- "Not persisting"
plotdf$persist_pattern[plotdf$enrl_grad_persist_4yr == 1 &
!is.na(plotdf$chrt_grad)] <- "Persisted at 4-Year College"
plotdf$persist_pattern[plotdf$enrl_grad_persist_2yr ==1 &
!is.na(plotdf$chrt_grad)] <- "Persisted at 2-Year College"
plotdf$persist_pattern[plotdf$enrl_1oct_grad_yr1_4yr == 1 &
plotdf$enrl_1oct_grad_yr2_2yr == 1 &
!is.na(plotdf$chrt_grad)] <- "Switched to 2-Year College"
plotdf$persist_pattern[plotdf$enrl_1oct_grad_yr1_2yr == 1 &
plotdf$enrl_1oct_grad_yr2_4yr == 1 &
!is.na(plotdf$chrt_grad)] <- "Switched to 4-Year College"
plotdf$groupVar <- NA
plotdf$groupVar[plotdf$enrl_1oct_grad_yr1_2yr == 1] <- "2-year College"
plotdf$groupVar[plotdf$enrl_1oct_grad_yr1_4yr == 1] <- "4-year College"
# Drop NA
plotdf %<>% filter(!is.na(groupVar))
# // Step 3: Obtain agency and school level average for persistence outcomes
chartData <- plotdf %>%
group_by(last_hs_name, groupVar, persist_pattern) %>%
summarize(tally = n()) %>% # count students in each persist_pattern
ungroup %>%
group_by(last_hs_name, groupVar) %>% # regroup by grouping variable and school
mutate(denominator = sum(tally)) %>% # sum all levels of persist_pattern
mutate(persistRate = tally / denominator) %>% # calculate rate
filter(persist_pattern != "Not persisting") %>%
mutate(rankRate = sum(persistRate))
agencyData <- plotdf %>%
group_by(groupVar, persist_pattern) %>%
summarize(tally = n(),
last_hs_name = "Agency AVERAGE") %>%
ungroup %>%
group_by(last_hs_name, groupVar) %>%
mutate(denominator = sum(tally)) %>%
mutate(persistRate = tally / denominator) %>%
filter(persist_pattern != "Not persisting") %>%
mutate(rankRate = sum(persistRate))
chartData <- bind_rows(chartData, agencyData)
# // Step 4: Recode variable names, sort data frame, and code labels for plot
chartData$last_hs_name <- gsub(" High School", "", chartData$last_hs_name)
chartData$last_hs_name <- gsub(" ", "\n", chartData$last_hs_name)
# chartData %<>% filter(persist_pattern != "Not persisting")
chartData %<>% arrange(persist_pattern)
chartData <- as.data.frame(chartData)
chartData$persist_pattern <- factor(as.character(chartData$persist_pattern),
ordered = TRUE,
levels = c("Switched to 4-Year College",
"Switched to 2-Year College",
"Persisted at 2-Year College",
"Persisted at 4-Year College"))
# // Step 5: Prepare plot for 2-year colleges
p1 <- ggplot(chartData[chartData$groupVar == "2-year College",],
aes(x = reorder(last_hs_name, rankRate),
y = persistRate, group = persist_pattern,
fill = persist_pattern)) +
scale_y_continuous(limits = c(0, 1.25), expand = c(0, 0),
labels = percent, breaks = seq(0, 1, 0.2)) +
geom_bar(stat = 'identity', position = 'stack',
color = I("black")) +
geom_text(aes(label = round(persistRate * 100, 0)),
position = position_stack(vjust = 0.5)) +
geom_text(aes(label = round(rankRate * 100, 0), y = rankRate), vjust = -0.7) +
guides(fill = guide_legend("", keywidth = 2, nrow = 2)) +
scale_fill_brewer(type = "qual", palette = 1) +
labs(x = "", y = "Percent of Seamless Enrollers") +
theme_classic() + theme(axis.text.x = element_text(angle = 30, vjust = 0.2),
axis.ticks.x = element_blank(),
legend.position = c(0.225, 0.925),
plot.caption = element_text(hjust = 0, size = 7)) +
labs(subtitle = "Seamless Enrollers at 2-year Colleges",
caption = figureCaption)
# // Step 6: Prepare plot for 4-year colleges by replacing data in plot
# above with 4 year data
p2 <- p1 %+% chartData[chartData$groupVar == "4-year College",] +
labs(subtitle = "Seamless Enrollers at 4-year Colleges")
# // Step 7: Print out plots with labels
grid.arrange(grobs= list(p2, p1), nrow=1,
top = "College Persistence by High School")
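The recoding in Step 2 and the rate calculation in Step 3 can be sketched on a handful of toy students. The flag values below are hypothetical, and the sketch assumes the persistence flags mark students who stay at the same type of college in year two, which may not match how your own analysis file defines them.

```r
# Toy data: five seamless enrollers with year-1, year-2, and persistence flags
# (hypothetical values, for illustration only)
toy <- data.frame(
  sid = 1:5,
  enrl_1oct_grad_yr1_4yr = c(1, 1, 0, 0, 1),
  enrl_1oct_grad_yr1_2yr = c(0, 0, 1, 1, 0),
  enrl_1oct_grad_yr2_4yr = c(1, 0, 0, 1, 0),
  enrl_1oct_grad_yr2_2yr = c(0, 1, 1, 0, 0),
  enrl_grad_persist_4yr  = c(1, 0, 0, 0, 0),
  enrl_grad_persist_2yr  = c(0, 0, 1, 0, 0)
)

# Start from the default, then overwrite in the same order as Step 2
toy$persist_pattern <- "Not persisting"
toy$persist_pattern[toy$enrl_grad_persist_4yr == 1] <- "Persisted at 4-Year College"
toy$persist_pattern[toy$enrl_grad_persist_2yr == 1] <- "Persisted at 2-Year College"
toy$persist_pattern[toy$enrl_1oct_grad_yr1_4yr == 1 &
                      toy$enrl_1oct_grad_yr2_2yr == 1] <- "Switched to 2-Year College"
toy$persist_pattern[toy$enrl_1oct_grad_yr1_2yr == 1 &
                      toy$enrl_1oct_grad_yr2_4yr == 1] <- "Switched to 4-Year College"

# Type of college first attended (every toy student enrolled in exactly one type)
toy$groupVar <- ifelse(toy$enrl_1oct_grad_yr1_4yr == 1,
                       "4-year College", "2-year College")

# Share of each pattern within the type of college first attended
rates <- prop.table(table(toy$groupVar, toy$persist_pattern), margin = 1)
round(rates, 2)
```

Each row of `rates` sums to one, which mirrors the `tally / denominator` step in the guide before the "Not persisting" rows are filtered out for plotting.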
Purpose: This analysis reports enrollment and persistence rates among the top-enrolling two- and four-year institutions attended by graduates. It illuminates differences in persistence rates to the second year of college among those institutions. Agency staff who advise students during their senior year may find this information useful when meeting to weigh college options.
Required Analysis File Variables:
sid
enrl_1oct_grad_yr1_any
enrl_1oct_grad_yr1_4yr
enrl_1oct_grad_yr1_2yr
enrl_grad_persist_any
enrl_grad_persist_4yr
enrl_grad_persist_2yr
first_college_name_any
first_college_name_2yr
first_college_name_4yr
Analysis-Specific Sample Restrictions:
Ask Yourself
Analytic Technique: Calculate the proportion of college-goers attending top-enrolling 2- and 4-year institutions, as well as the proportion of seamless enrollers who persist to the second year of any college, by the postsecondary institution graduates first attended.
# // Step 1: Keep students in high school graduation cohorts you can observe
# enrolling in college the fall after graduation
plotdf <- cgdata %>% filter(chrt_grad >= chrt_grad_begin &
chrt_grad <= chrt_grad_end) %>%
select(sid, chrt_grad, enrl_1oct_grad_yr1_2yr, enrl_1oct_grad_yr1_4yr,
enrl_1oct_grad_yr1_any, enrl_1oct_grad_yr2_2yr, enrl_1oct_grad_yr2_4yr,
enrl_1oct_grad_yr2_any, enrl_grad_persist_any,
enrl_grad_persist_2yr, enrl_grad_persist_4yr,
first_college_name_any, first_college_name_2yr, first_college_name_4yr)
# // Step 2: Indicate the number of institutions you would like listed
num_inst <- 5
# // Step 3: Calculate the number and % of students enrolled in each college
# the fall after graduation, and the number and % of students persisting, by
# college type
chart4year <- bind_rows(
plotdf %>% group_by(first_college_name_4yr) %>%
summarize(enrolled = sum(enrl_1oct_grad_yr1_4yr, na.rm=TRUE),
persisted = sum(enrl_grad_persist_4yr, na.rm=TRUE)) %>%
ungroup %>%
mutate(total_enrolled = sum(enrolled)) %>%
mutate(perEnroll = round(100 * enrolled/total_enrolled, 1),
perPersist = round(100 * persisted/enrolled, 1)),
plotdf %>%
summarize(enrolled = sum(enrl_1oct_grad_yr1_4yr, na.rm=TRUE),
persisted = sum(enrl_grad_persist_4yr, na.rm=TRUE),
first_college_name_4yr = "All 4-Year Colleges") %>%
ungroup %>%
mutate(total_enrolled = sum(enrolled)) %>%
mutate(perEnroll = round(100 * enrolled/total_enrolled, 1),
perPersist = round(100 * persisted/enrolled, 1))
)
chart2year <- bind_rows(plotdf %>% group_by(first_college_name_2yr) %>%
summarize(enrolled = sum(enrl_1oct_grad_yr1_2yr, na.rm=TRUE),
persisted = sum(enrl_grad_persist_2yr, na.rm=TRUE)) %>%
ungroup %>%
mutate(total_enrolled = sum(enrolled)) %>%
mutate(perEnroll = round(100 * enrolled/total_enrolled, 1),
perPersist = round(100 * persisted/enrolled, 1)),
plotdf %>%
summarize(enrolled = sum(enrl_1oct_grad_yr1_2yr, na.rm=TRUE),
persisted = sum(enrl_grad_persist_2yr, na.rm=TRUE),
first_college_name_2yr = "All 2-Year Colleges") %>%
ungroup %>%
mutate(total_enrolled = sum(enrolled)) %>%
mutate(perEnroll = round(100 * enrolled/total_enrolled, 1),
perPersist = round(100 * persisted/enrolled, 1))
)
# // Step 4: Create tables
chart4year %>% arrange(-enrolled) %>%
select(first_college_name_4yr, enrolled, perEnroll, persisted, perPersist) %>%
head(num_inst) %>%
knitr::kable(., col.names = c("Name", "Number Enrolled",
"% Enrolled", "Number Persisted",
"% Persisted"))
Name | Number Enrolled | % Enrolled | Number Persisted | % Persisted |
---|---|---|---|---|
All 4-Year Colleges | 233 | 100.0 | 2269 | 973.8 |
University of South Carolina-Ups… | 51 | 21.9 | 535 | 1049.0 |
University of Portland | 47 | 20.2 | 382 | 812.8 |
Louisiana State University-Alexa… | 29 | 12.4 | 249 | 858.6 |
Notre Dame College | 22 | 9.4 | 217 | 986.4 |
chart2year %>% arrange(-enrolled) %>%
select(first_college_name_2yr, enrolled, perEnroll, persisted, perPersist) %>%
head(num_inst) %>%
knitr::kable(., col.names = c("Name", "Number Enrolled",
"% Enrolled", "Number Persisted",
"% Persisted"))
Name | Number Enrolled | % Enrolled | Number Persisted | % Persisted |
---|---|---|---|---|
All 2-Year Colleges | 631 | 100.0 | 5831 | 924.1 |
Keiser University-Ft Lauderdale | 168 | 26.6 | 1723 | 1025.6 |
Asheville-Buncombe Technical Com… | 91 | 14.4 | 693 | 761.5 |
Cochise College | 55 | 8.7 | 444 | 807.3 |
Atlanta Metropolitan State College | 38 | 6.0 | 343 | 902.6 |
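The table-building logic in Steps 3 and 4 can be sketched on toy data. This variant makes two assumptions that are not part of the guide's code: it counts persistence only among students who enrolled at that type of college, so the persistence percentage cannot exceed 100, and it selects the top `num_inst` institutions before adding the summary row, so the summary row does not take one of the slots. The college names and values are hypothetical.

```r
# Toy data: first 4-year college attended plus enrollment and persistence flags
# (hypothetical values, for illustration only)
toy <- data.frame(
  first_college_name_4yr = c("College A", "College A", "College A",
                             "College B", "College B", "College C", NA),
  enrl_1oct_grad_yr1_4yr = c(1, 1, 1, 1, 1, 1, 0),
  enrl_grad_persist_4yr  = c(1, 1, 0, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)
num_inst <- 2

# Assumption: restrict to seamless 4-year enrollers before counting persistence
enrollers <- toy[which(toy$enrl_1oct_grad_yr1_4yr == 1), ]
enrolled <- tapply(enrollers$enrl_1oct_grad_yr1_4yr,
                   enrollers$first_college_name_4yr, sum)
persisted <- tapply(enrollers$enrl_grad_persist_4yr,
                    enrollers$first_college_name_4yr, sum)

inst <- data.frame(name = names(enrolled),
                   enrolled = as.vector(enrolled),
                   persisted = as.vector(persisted),
                   stringsAsFactors = FALSE)

# Rank institutions first, then put the summary row on top
top <- head(inst[order(-inst$enrolled), ], num_inst)
all_row <- data.frame(name = "All 4-Year Colleges",
                      enrolled = sum(inst$enrolled),
                      persisted = sum(inst$persisted),
                      stringsAsFactors = FALSE)
tab <- rbind(all_row, top)

tab$perEnroll <- round(100 * tab$enrolled / sum(inst$enrolled), 1)
tab$perPersist <- round(100 * tab$persisted / tab$enrolled, 1)
tab
```

A quick check such as `all(tab$persisted <= tab$enrolled)` is a cheap way to confirm that the enrollment and persistence counts share the same denominator before publishing a table.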