Animating Spatio-Temporal COVID-19 Data
I present an update to my previous posts #1 and #2. This update can also be found on a public GitHub repository.
Starting May 17, 2020 the DC Mayoral Office began releasing testing information by neighborhood on their coronavirus data portal. Molly Tolzmann zmotoly added this feature to her publicly available Google Sheet presented at the DC health planning neighborhood level.
Here, I present the daily 5-day rolling average test positivity rate (TPR) for COVID-19 cases from May 25 to June 7, 2020 in an animated Graphics Interchange Format (GIF) using the open-source software language R. A TPR is the proportion of positive tests for every administered test. For example, the World Health Organization recommends a test positivity rate below 5% (1 out of 20 tests, TPR = 0.05) before reopening. A 5-day rolling average is the mean value of a daily TPR and the TPR of the four days prior.
Important Note: There are missing data for certain dates. Case information is missing for May 22nd and 27th. Testing information is missing for May 22nd, 23rd, 24th, and 27th. Missing data are considered NA
but included in the 5-day rolling averages. Therefore, 5-day rolling averages that include these dates are more unstable. Because reporting case information at the DC health planning neighborhood level became available on May 7th, daily incident case information is available starting May 8th and 5-day rolling average cases are available starting May 12th. Because testing information became available on May 20th, daily testing information is available starting May 21st and 5-day averages are available starting May 25th.
Necessary packages and settings for the exercise include:
# Packages
loadedPackages <- c("broom", "dplyr", "geojsonio", "gganimate", "ggplot2", "googlesheets4", "sp", "tidyr", "transformr")
invisible(lapply(loadedPackages, require, character.only = T))
# Settings
googlesheets4::gs4_deauth() # no Google authorization necessary because we are not reading a public repo
Data Importation
I merge the District of Columbia Health Planning Neighborhoods boundaries and Molly Tolzmann’s zmotoly collation of the cumulative cases and tests from start of the SARS-CoV-2 outbreak. I clean up variable names that are easier for R to use.
# District of Columbia Health Planning Neighborhoods
gis_path <- "https://opendata.arcgis.com/datasets/de63a68eb7674548ae0ac01867123f7e_13.geojson"
dc <- geojsonio::geojson_read(gis_path, what = "sp")
# District of Columbia SAR-CoV-2 Data prepared by @zmotoly
covid_path <- "https://docs.google.com/spreadsheets/d/1u-FlJe2B1rYV0obEosHBks9utkU30-C2TSkHka6AVS8/edit#gid=1923705378"
covid <- googlesheets4::read_sheet(ss = covid_path,
sheet = 2, # second sheet
skip = 1) # skip 1st row of annotation
# Fix column names
names(covid) <- sub("\n", "", names(covid)) # remove extra line in column names
names(covid) <- gsub(" ", "_", names(covid)) # replace spaces with underscore
names(covid) <- gsub("Total_cases", "Cumulative", names(covid)) # Change to one word
names(covid) <- gsub("Total_tests", "Tested_", names(covid)) # Change to one word
names(covid) <- gsub("Cases_per_1000", "Case.Rate", names(covid)) # Change to one word
names(covid) <- gsub("Tests_per_1000", "Test.Rate_", names(covid)) # Change to one word
names(covid) <- gsub("__", "_", names(covid)) # replace double underscores
# Merge
dc_covid <- sp::merge(dc, covid, by.x = "CODE", by.y = "NB_Code")
# Spatial Projection
## UTM zone 18N (Washington, DC)
dc_covid_proj <- sp::spTransform(dc_covid, CRS("+init=EPSG:32618"))
Data Management
The 5-day rolling average test positivity rate is not provided in Molly Tolzmann’s zmotoly Google Sheet and I calculate it for every day data are available. First, I find the daily incident cases from the reported cumulative cases.
# Uses ggplot2 package
## helpful material: https://cengel.github.io/rspatial/4_Mapping.nb.html
## Data preparation, ggplot2 requires a data.frame
dc_covid_df <- broom::tidy(dc_covid_proj) # convert to tidy data frame
dc_covid_proj$polyID <- sapply(slot(dc_covid_proj, "polygons"), function(x) slot(x, "ID")) # preserve polygon id
CoV_DC_df <- sp::merge(dc_covid_df, dc_covid_proj, by.x = "id", by.y="polyID") # merge data
rm(dc_covid_proj, dc_covid_df, dc_covid, covid, dc) # conserve memory
# Step 1) Daily incident cases
CoV_DC_loop <- CoV_DC_df %>% dplyr::select(starts_with("cumulative"))
CoV_DC_loop <- CoV_DC_loop[,rev(1:length(CoV_DC_loop))] # reorder dates in ascending order
CoV_DC_loop$Cumulative_May_22 <- NA # missing data
CoV_DC_loop$Cumulative_May_27 <- NA # missing data
CoV_DC_loop <- CoV_DC_loop[,c(1:15, 31, 16:19, 32, 20:30)] # reorder for consistent dates
i <- NULL # initiate indexing
# Empty matrices
mat_inc <- matrix(ncol = length(CoV_DC_loop)-1, nrow = nrow(CoV_DC_loop)) # for values
col_lab <- vector(mode = "character", length = length(CoV_DC_loop)-1) # for names
for (i in 1:length(CoV_DC_loop)-1) {
mat_inc[ , i] <- ifelse(CoV_DC_loop[ , i+1] - CoV_DC_loop[ , i] < 0, NA, CoV_DC_loop[ , i+1] - CoV_DC_loop[ , i])
col_lab[i] <- paste(sub("Cumulative", names(CoV_DC_loop[i+1]), replacement = "incident"))
if(i == length(CoV_DC_loop))
mat_inc <- data.frame(mat_inc)
colnames(mat_inc) <- col_lab
}
CoV_DC_df <- cbind(CoV_DC_df, mat_inc) # merge with original dataset
Then I calculate 5-day rolling average cases (rate included, too).
# Step 2) 5-Day Average Incident Case and Case Rate
CoV_DC_loop <- CoV_DC_df %>% dplyr::select(starts_with("incident"))
i <- NULL # reinitiate indexing
# Empty matrices
mat_5d <- matrix(ncol = length(CoV_DC_loop)-4, nrow = nrow(CoV_DC_loop)) # for values
mat_5dr <- matrix(ncol = length(CoV_DC_loop)-4, nrow = nrow(CoV_DC_loop)) # for values
col_lab <- vector(mode = "character", length = length(CoV_DC_loop)-4) # for names
col_labr <- vector(mode = "character", length = length(CoV_DC_loop)-4) # for names
for (i in 1:(length(CoV_DC_loop)-4)) {
mat_5d[ , i] <- rowMeans(CoV_DC_loop[ , i:(i+4)], na.rm = T)
mat_5dr[ , i] <- mat_5d[ , i] / CoV_DC_df$`Population_(2018_ACS)` * 1000
col_lab[i] <- paste(sub("incident", names(CoV_DC_loop[(i+4)]), replacement = "average"))
col_labr[i] <- paste(sub("incident", names(CoV_DC_loop[(i+4)]), replacement = "average.rate"))
if(i == (length(CoV_DC_loop)-4))
mat_5d <- data.frame(mat_5d)
mat_5dr <- data.frame(mat_5dr)
colnames(mat_5d) <- col_lab
colnames(mat_5dr) <- col_labr
out <- cbind(mat_5d, mat_5dr)
}
CoV_DC_df <- cbind(CoV_DC_df, out) # merge with original dataset
I also find the daily tests administered from the reported cumulative tests (daily test positivity rate included, too), and hen I calculate a 5-day rolling average tests.
# Step 3) Daily Tests
CoV_DC_loop <- CoV_DC_df %>% dplyr::select(starts_with("Tested"), starts_with("incident"))
CoV_DC_loop <- CoV_DC_loop[,c(rev(1:15),28:46)] # reorder dates in ascending order
CoV_DC_loop$Tested_May_22 <- NA # missing data
CoV_DC_loop$Tested_May_23 <- NA # missing data
CoV_DC_loop$Tested_May_24 <- NA # missing data
CoV_DC_loop$Tested_May_27 <- NA # missing data
CoV_DC_loop <- CoV_DC_loop[,c(1:2,35:37,3:4,38,5:34)] # reorder for consistent dates
i <- NULL # reinitiate indexing
# Empty matrices
mat_inc <- matrix(ncol = length(CoV_DC_loop)/2-1, nrow = nrow(CoV_DC_loop)) # for values
mat_inc2 <- matrix(ncol = length(CoV_DC_loop)/2-1, nrow = nrow(CoV_DC_loop)) # for values
col_lab <- vector(mode = "character", length = length(CoV_DC_loop)/2-1) # for names
col_lab2 <- vector(mode = "character", length = length(CoV_DC_loop)/2-1) # for names
for (i in 1:(length(CoV_DC_loop)/2-1)) {
mat_inc[ , i] <- ifelse(CoV_DC_loop[ , (i+1)] - CoV_DC_loop[ , i] < 0, NA, CoV_DC_loop[ , (i+1)] - CoV_DC_loop[ , i])
mat_inc2[ , i] <- CoV_DC_loop[ , (length(CoV_DC_loop)/2+i)] / mat_inc[ , i]
col_lab[i] <- paste(sub("Tested", names(CoV_DC_loop[i+1]), replacement = "testing"))
col_lab2[i] <- paste(sub("Tested", names(CoV_DC_loop[i+1]), replacement = "positivity"))
if(i == (length(CoV_DC_loop)/2-1))
mat_inc <- data.frame(mat_inc)
mat_inc2 <- data.frame(mat_inc2)
colnames(mat_inc) <- col_lab
colnames(mat_inc2) <- col_lab2
out <- cbind(mat_inc, mat_inc2)
}
CoV_DC_df <- cbind(CoV_DC_df, out) # merge with original dataset
A 5-day rolling average test positivity is calculated by dividing the 5-day rolling average cases by the 5-day rolling average tests.
# Step 4) 5-day average testing and positivity
CoV_DC_loop <- CoV_DC_df %>% dplyr::select(starts_with("testing_"),starts_with("average_"))
i <- NULL # reinitiate indexing
# Empty matrices
mat_5d <- matrix(ncol = 18-4, nrow = nrow(CoV_DC_loop)) # for values
mat_5dr <- matrix(ncol = 18-4, nrow = nrow(CoV_DC_loop)) # for values
col_lab <- vector(mode = "character", length = 18-4) # for names
col_labr <- vector(mode = "character", length = 18-4) # for names
for (i in 1:(18-4)) {
mat_5d[ , i] <- rowMeans(CoV_DC_loop[ , i:(i+4)], na.rm = T)
mat_5dr[ , i] <- CoV_DC_loop[ , 28+i] / mat_5d[ , i]
col_lab[i] <- paste(sub("testing", names(CoV_DC_loop[(i+4)]), replacement = "test.avg"))
col_labr[i] <- paste(sub("testing", names(CoV_DC_loop[(i+4)]), replacement = "posit.avg"))
if(i == (18-4))
mat_5d <- data.frame(mat_5d)
mat_5dr <- data.frame(mat_5dr)
colnames(mat_5d) <- col_lab
colnames(mat_5dr) <- col_labr
out <- cbind(mat_5d, mat_5dr)
}
CoV_DC_df <- cbind(CoV_DC_df, out) # merge with original dataset
# Conserve memory
rm(out, col_lab, col_labr, CoV_DC_loop, mat_5d, mat_5dr, mat_inc, mat_inc2)
I use the ggplot2 and gganimate packages to create a GIF of the daily 5-day rolling average TPR for COVID-19 cases from May 25 to June 7, 2020. The packages require converting the data from wide- to long-format.
# Convert to long format
CoV_DC_5dtest <- CoV_DC_df %>%
dplyr::select(1:25,starts_with("posit.avg")) %>%
pivot_longer(cols = starts_with("posit.avg"),
values_to="posit.avg",
names_to=c("Posit.Avg","month", "day"),
names_sep = "_") %>%
tidyr::unite("date_reported", month:day) %>%
dplyr::select(-Posit.Avg) %>%
dplyr::mutate(date_reported = as.Date(date_reported, format="%B_%d"))
Finally, some 5-day rolling average TPR are invalid for two possible reasons. If any district had zero (0) administered tests, then the TPR (a ratio) would be undefined. Also, if there were more reported positive cases than administered tests (e.g., reporting backlog) then the TPR would be above one (1.0). In these cases, I present these instances as NA values.
# Clean-up
## Any Inf values (zero tests in denominator) set as NA
CoV_DC_5dtest$posit.avg <- ifelse(is.infinite(CoV_DC_5dtest$posit.avg), NA, CoV_DC_5dtest$posit.avg)
## Any values above 1.0 (more positive tests than werer administered) set as NA
CoV_DC_5dtest$posit.avg <- ifelse(CoV_DC_5dtest$posit.avg > 1, NA, CoV_DC_5dtest$posit.avg)
Animated Map
The following is the code to create a GIF of the daily 5-day rolling average TPR for COVID-19 cases from May 25 to June 7, 2020. I chose non-alarmist colors similar to previous posts #1 and #2. Grey-colored areas have missing (NA
) or no information.
# 5-day average test positivity
g1 <- CoV_DC_5dtest %>% # data
ggplot2::ggplot() + # initial plot
ggplot2::geom_polygon(ggplot2::aes(x = long, y = lat,
group = group, fill = posit.avg),
color = NA) + # add polygons
ggplot2::scale_fill_gradient(name = "5-day average rate", # color fill
low = "lavenderblush",
high = "navyblue",
na.value = "grey80",
breaks = range(CoV_DC_5dtest$posit.avg, na.rm = T)) +
ggplot2::theme(line = ggplot2::element_blank(), # remove axis lines
axis.text = ggplot2::element_blank(), # remove tickmarks
axis.title = ggplot2::element_blank(), # remove axis labels
panel.background = ggplot2::element_blank(), # remove gridlines
panel.border = ggplot2::element_blank(), # remove border
legend.position = "bottom", # legend position
text = ggplot2::element_text(size = 15)) + # set font size
ggplot2::coord_equal() + # force equal dimensions
gganimate::transition_time(date_reported) + # animate by date
ggplot2::labs(title = "5-Day Average SARS-CoV-2 Test Positivity\nDate: {frame_time}") # add title
gganimate::animate(g1, end_pause = 30) # animate
Overall, my qualitative assessment of the daily 5-day rolling average TPR for COVID-19 cases from May 25 to June 7, 2020 is a decrease in TPR overtime, especially after June 3rd. Ward 4 (Northern-most section) appears to have consistently high 5-day rolling average TPR overtime.
The provided maps are not intended to inform decision-making. Instead, I provide the the open-source code to download, manage, and visualize publicly available data. Future steps (in addition to one listed in my previous post) include, displaying TPR above or below the the 5% WHO recommendation and conducting spatial statistical analysis such as, for example, assessing the presence of spatio-temporal clustering. Other values (e.g., daily TPR, incident cases per 1,000) can be animated by modifying the above code, which is also provided in my public GitHub repository.
Thanks, again, to Molly Tolzmann zmotoly for the data collation, management, and custodianship.