Recently, the TidyTuesday social project proposed a dataset documenting the R&D spending of the US government. The data were originally collected by the American Association for the Advancement of Science.
This tidy dataset stores R&D spending by the US government from 1976 to 2016, divided by agency. It’s available here.
1 Load Packages and Download Files
Let’s explore this dataset.
First we load the packages that we need.
library(tidyverse)
library(scico)
And then the data.
# Get data ---------------------------------------------------
rd_path <- "_data/2-07-funding.Rdata"
rd_url <- paste0("https://raw.githubusercontent.com/",
                 "rfordatascience/tidytuesday/master/data/",
                 "2019/2019-02-12/fed_r_d_spending.csv")

if (!file.exists(rd_path)) {
  rd <- read_csv(rd_url)
  save(rd, file = rd_path)
} else {
  load(rd_path)
}
The rd dataset looks like this.
rd %>% print()
## # A tibble: 588 x 6
##    department  year rd_budget total_outlays discretionary_outl…     gdp
##    <chr>      <int>     <dbl>         <dbl>               <dbl>   <dbl>
##  1 DOD         1976   3.57e10  371800000000        175600000000 1.79e12
##  2 NASA        1976   1.25e10  371800000000        175600000000 1.79e12
##  3 DOE         1976   1.09e10  371800000000        175600000000 1.79e12
##  4 HHS         1976   9.23e 9  371800000000        175600000000 1.79e12
##  5 NIH         1976   8.02e 9  371800000000        175600000000 1.79e12
##  6 NSF         1976   2.37e 9  371800000000        175600000000 1.79e12
##  7 USDA        1976   1.84e 9  371800000000        175600000000 1.79e12
##  8 Interior    1976   1.15e 9  371800000000        175600000000 1.79e12
##  9 DOT         1976   1.14e 9  371800000000        175600000000 1.79e12
## 10 EPA         1976   9.68e 8  371800000000        175600000000 1.79e12
## # ... with 578 more rows
We are interested in the spending values, which, as stated in the data dictionary, are stored in the variable rd_budget. Their unit is inflation-adjusted dollars.
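As a quick sanity check (a minimal sketch of mine, assuming rd is loaded as above), we can confirm the coverage of the dataset:

# how many agencies, and which years are covered?
rd %>%
  summarise(n_agencies = n_distinct(department),
            first_year = min(year),
            last_year  = max(year))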
2 Explore - Which Agencies Get Similar Funding?
Which agencies get similar funding? Does funding to certain groups of research agencies correlate? And what could we learn from that? Could we guess at policies?
2.1 Correlation between funding patterns
This was one of the first things that I explored, and I got lucky: it seems that there are two groups of correlated agencies.
To test correlation, we must switch to the wide data format, because the cor() function requires each variable to be in its own column.
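To illustrate what cor() expects, here is a toy data frame with made-up numbers (not from our dataset): cor() treats each column as a variable and returns the matrix of pairwise correlations.

# toy example: cor() correlates the columns of a data frame
toy <- data.frame(a = c(1, 2, 3, 4),
                  b = c(2, 4, 6, 8),
                  c = c(4, 3, 2, 1))
cor(toy)  # 3 x 3 matrix of pairwise correlations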
# Wide format
rd_wide <-
  rd %>%
  select(department, year, rd_budget) %>%
  tidyr::spread(key = department,
                value = rd_budget)

# correlation of everything against everything
rd_wide %>%
  column_to_rownames("year") %>%
  # Test correlation
  cor() %>%
  # visualize
  as_tibble(rownames = "agency") %>%
  gather(DHS:VA, key = "agency_2", value = "corr") %>%
  ggplot(aes(x = agency,
             y = agency_2,
             fill = corr)) +
  geom_tile() +
  scale_fill_viridis_c() +
  theme(aspect.ratio = 1,
        axis.text.x = element_text(angle = 90, hjust = .5))
Luckily, it looks like there are two well-defined groups of correlated agencies. Could we capture this behaviour with clustering?
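To make the groups concrete, we could also list the most correlated pairs of agencies directly. A minimal sketch (the 0.8 cutoff is an arbitrary choice of mine):

# list agency pairs with high positive correlation
rd_wide %>%
  column_to_rownames("year") %>%
  cor() %>%
  as_tibble(rownames = "agency") %>%
  gather(DHS:VA, key = "agency_2", value = "corr") %>%
  filter(agency < agency_2,  # keep each pair only once
         corr > 0.8) %>%     # arbitrary cutoff
  arrange(desc(corr))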
2.2 Clusters of funding patterns?
With clusters, we want to capture the behaviour of our variables: how one variable behaves compared to another. When one variable increases in value, does the other do the same?
So, to make the variables comparable, we must scale them.
Probably the most common scaling is by z-scores. But z-scores are centered at 0, which introduces negative numbers, while the spending of R&D agencies is (hopefully ;) ) never negative.
Thus, I decided to rescale the spending of each agency between 0 and 1. This makes the agencies comparable and lets us present them in a more intuitive way. Moreover, this kind of scaling preserves 0 as an anchor point.
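To see the difference on a made-up spending vector (a toy sketch, not part of the analysis):

x <- c(0, 5, 10, 20)

# z-scores: centered on the mean, so values below it become negative
as.numeric(scale(x))

# rescaling anchored at 0: 0 stays 0 and the maximum becomes 1
scales::rescale(x, to = c(0, 1), from = c(0, max(x)))
## [1] 0.00 0.25 0.50 1.00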
# scale each variable between 0 and 1
rd_wide_01 <-
  rd_wide %>%
  mutate_at(vars(-contains("year")),
            ~scales::rescale(., to = c(0, 1),
                             from = c(0, max(.))))

# this is how it looks after scaling
rd_wide_01 %>% print()
## # A tibble: 42 x 15
##     year   DHS   DOC   DOD   DOE   DOT   EPA   HHS Interior  NASA   NIH
##    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>
##  1  1976     0 0.449 0.378 0.695 0.935 0.824 0.250    0.980 0.852 0.227
##  2  1977     0 0.459 0.403 0.877 0.897 0.822 0.257    0.920 0.855 0.233
##  3  1978     0 0.478 0.392 1     0.947 1     0.285    0.957 0.853 0.249
##  4  1979     0 0.522 0.394 0.997 0.822 0.938 0.274    1     0.891 0.262
##  5  1980     0 0.518 0.392 0.972 0.858 0.769 0.272    0.920 0.943 0.257
##  6  1981     0 0.454 0.442 0.945 0.801 0.767 0.261    0.842 0.904 0.243
##  7  1982     0 0.380 0.497 0.776 0.575 0.546 0.240    0.736 0.703 0.235
##  8  1983     0 0.384 0.541 0.711 0.655 0.426 0.255    0.710 0.401 0.247
##  9  1984     0 0.414 0.616 0.749 0.834 0.441 0.274    0.644 0.424 0.272
## 10  1985     0 0.433 0.683 0.769 0.725 0.523 0.300    0.666 0.489 0.300
## # ... with 32 more rows, and 4 more variables: NSF <dbl>, Other <dbl>,
## #   USDA <dbl>, VA <dbl>
After scaling, I tried hierarchical clustering. But clustering functions such as hclust() or kmeans(), by default, cluster the rows of a dataset: we must modify the format of the data again. This time we transpose the data with t(), to exchange rows and columns.
# For clustering we need to transpose the data,
# because dist() computes distances between the rows
# (and we want to cluster the agencies)
rd_for_clust <-
  rd_wide_01 %>%
  column_to_rownames("year") %>%
  t()

rd_hclust <-
  rd_for_clust %>%
  dist() %>%
  hclust()
The results of hierarchical clustering are a bit more complex:
- DHS stays by itself, since it was established late, in the ’00s.
- VA, HHS and NIH group together, since they have lately seen an increase in funding.
- EPA is also pretty much by itself, since its funding kept decreasing.
rd_hclust %>% plot()
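If we wanted explicit group labels rather than a dendrogram, we could cut the tree with cutree(); a minimal sketch (the choice of 4 clusters is mine):

# assign each agency to one of k clusters
rd_hclust %>%
  cutree(k = 4) %>%
  sort()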
2.3 Visualize the clusters with heatmaps
In biology, small numbers of clustered observations are often visualized with a combination of heatmaps and dendrograms.
You can plot those together with the superheat package.
rd_for_clust %>%
  superheat::superheat(row.dendrogram = TRUE,
                       left.label.text.size = 3,
                       bottom.label.text.angle = 90,
                       bottom.label.text.size = 3,
                       grid.hline.col = "grey40",
                       grid.vline.col = "grey40")
3 Other Visualizations
Heatmaps are great for the tight space that scientific journals reserve for figures.
But here, because I have no such limitations, and for fun, I wanted to try other visualizations.
The one below uses more white space, but it might be more pleasing to the eye than a highly compressed heatmap, and it conveys additional information, such as the maximum absolute spending of each department.
# I restarted from the original data,
# to have both scaled and absolute measurements
# in the same dataset
rd_to_plot <-
  rd %>%
  select(department, year, rd_budget) %>%
  # rescale 0 to 1
  group_by(department) %>%
  mutate(budget_01 = rd_budget %>%
           {scales::rescale(., to = c(0, 1),
                            from = c(0, max(.)))}) %>%
  ungroup()
I would like to prepare a faceted plot in which the facets are ordered based on a dendrogram, with the dendrogram itself on the side.
I can order the facets according to the hierarchical clustering by turning department into a factor whose levels follow the dendrogram order. (But I have no idea how to show the dendrogram itself; I sketch one possibility at the end of the post.)
rd_to_plot <-
  rd_to_plot %>%
  mutate(department = department %>%
           factor(levels = rd_hclust$labels[rd_hclust$order]))

line_col <- "grey70"
text_col <- "grey40"
line_size <- .2
label_spacer <- .4

p <-
  rd_to_plot %>%
  ggplot(aes(x = year,
             y = budget_01,
             colour = budget_01)) +
  geom_line(colour = line_col,
            size = line_size) +
  geom_segment(data = . %>%
                 group_by(department) %>%
                 filter(rd_budget == max(rd_budget)),
               aes(y = budget_01,
                   yend = budget_01 - label_spacer,
                   xend = year),
               size = line_size, colour = line_col) +
  geom_point(aes(size = rd_budget)) +
  geom_text(data = . %>%
              group_by(department) %>%
              filter(rd_budget == max(rd_budget)),
            aes(y = budget_01 - label_spacer - .1,
                label = rd_budget %>% scales::scientific(digits = 0)),
            size = 4) +
  facet_grid(department ~ .) +
  theme_minimal() +
  scale_color_scico(palette = "devon",
                    guide = FALSE,
                    direction = -1,
                    begin = .2) +
  lims(y = c(0, 1.1)) +
  labs(y = "Government R&D Spending") +
  scale_size(range = c(.1, 3), guide = FALSE) +
  theme(text = element_text(family = "sans",
                            colour = text_col),
        axis.text.y = element_blank(),
        panel.grid = element_blank(),
        panel.grid.major.x = element_line(colour = line_col,
                                          size = line_size / 2),
        legend.position = "top",
        strip.text.y = element_text(family = "sans",
                                    angle = 0,
                                    colour = text_col))
p %>% print()
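As for the dendrogram that I could not show: one untested option would be to rebuild it with the ggdendro package and glue it to the plot with patchwork (both packages are my assumption here; they are not loaded above).

library(ggdendro)   # assumed available: extracts plotting data from hclust objects
library(patchwork)  # assumed available: combines ggplots side by side

rd_dendro <- ggdendro::dendro_data(rd_hclust)

p_dendro <-
  ggplot(ggdendro::segment(rd_dendro)) +
  geom_segment(aes(x = x, y = y, xend = xend, yend = yend),
               colour = line_col) +
  coord_flip() +  # make the tree vertical, like the facets
  theme_void()

# dendrogram on the left, faceted spending plot on the right
p_dendro + p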
What do you think?