Explore US Government R&D Spendings

TidyTuesday 7 - Source AAAS

Recently the TidyTuesday social project proposed a dataset documenting R&D spending of the US government. This dataset was originally collected by the American Association for the Advancement of Science.

This tidy dataset stores R&D spending by the US government from 1976 to 2016, devided by agency. It’s available here.

1 Load Packages and Download Files

Let’s explore this dataset.

First we load the packages that we need.

library(tidyverse)
library(scico)

And then the data.

# Get data ---------------------------------------------------

rd_path <- "_data/2-07-funding.Rdata"
rd_url <- paste0("https://raw.githubusercontent.com/",
                 "rfordatascience/tidytuesday/master/data/",
                 "2019/2019-02-12/fed_r_d_spending.csv")

if(!file.exists(rd_path)) {
  rd <- 
    read_csv(rd_url)
  
  save(rd, file = rd_path)
  
} else {
  load(rd_path)
}

The rd dataset has this aspect.

rd %>% print()
## # A tibble: 588 x 6
##    department  year  rd_budget total_outlays discretionary_outl…       gdp
##    <chr>      <int>      <dbl>         <dbl>               <dbl>     <dbl>
##  1 DOD         1976    3.57e10  371800000000        175600000000   1.79e12
##  2 NASA        1976    1.25e10  371800000000        175600000000   1.79e12
##  3 DOE         1976    1.09e10  371800000000        175600000000   1.79e12
##  4 HHS         1976    9.23e 9  371800000000        175600000000   1.79e12
##  5 NIH         1976    8.02e 9  371800000000        175600000000   1.79e12
##  6 NSF         1976    2.37e 9  371800000000        175600000000   1.79e12
##  7 USDA        1976    1.84e 9  371800000000        175600000000   1.79e12
##  8 Interior    1976    1.15e 9  371800000000        175600000000   1.79e12
##  9 DOT         1976    1.14e 9  371800000000        175600000000   1.79e12
## 10 EPA         1976    9.68e 8  371800000000        175600000000   1.79e12
## # ... with 578 more rows

We are interested in the spending, which, as stated in the data dictionary are stored in the variable rd_budget. Their unit is inflation adjusted dollars.

2 Explore - Which Agencies Get Similar Funding?

Which agencies get similar fundings? Do funding to certain groups of research agencies correlate? And what could we learn from that, could we guess policies?

2.1 Correlation between funding pattern

This was one of the first things that I explored, and I got lucky. It seems that there are two groups of correlation.

To test correlation, we must switch to the wide data format, because the cor() function requires that each variable is in its own column.

# Wide format
rd_wide <- 
  rd %>% 
  select(department, year, rd_budget) %>% 
  tidyr::spread(key = department,
                value = rd_budget)

# correlation of everything against everything
rd_wide %>%
  column_to_rownames("year") %>%
  # Test correlation
  cor() %>% 
  # visualize
  as_tibble(rownames = "agency") %>% 
  gather(DHS:VA, key = "agency_2", value = "corr") %>% 
  ggplot(aes(x = agency,
             y = agency_2,
             fill = corr)) +
  geom_tile() + 
  scale_fill_viridis_c() +
  theme(aspect.ratio = 1,
        axis.text.x = element_text(angle = 90, hjust = .5))

Lucky, it looks there are two defined groups of correlation. Could we capture this behaviour in clusters?

2.2 Clusters of funding patterns?

With clusters, we want to capture the behaviour of our variables. Meaning that we want to capture how one variable behaves compared to another: when one variable increases in value, does the other do the same?

So, to make variables comparable, we must scale them.

Probably, the most common scaling is by z-scores. But, because z-scores are centered at 0. This introduces negative numbers, but spendings for R&D agencies are (hopefully ;) ) never negative.

Thus, I decided to rescale spendings for each agency between 0 and 1, in this way we can make them comparable and present them in a more intuitive way. Moreover, this kind of scaling preserves 0 as anchored point.

# scale each variable between 0 and 1
rd_wide_01 <- 
  rd_wide %>%
  mutate_at(vars(-contains("year")),
            ~scales::rescale(., to = c(0,1),
                             from = c(0, max(.))))


# this is how it looks after scaling
rd_wide_01 %>% print()
## # A tibble: 42 x 15
##     year   DHS   DOC   DOD   DOE   DOT   EPA   HHS Interior  NASA   NIH
##    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>
##  1  1976     0 0.449 0.378 0.695 0.935 0.824 0.250    0.980 0.852 0.227
##  2  1977     0 0.459 0.403 0.877 0.897 0.822 0.257    0.920 0.855 0.233
##  3  1978     0 0.478 0.392 1     0.947 1     0.285    0.957 0.853 0.249
##  4  1979     0 0.522 0.394 0.997 0.822 0.938 0.274    1     0.891 0.262
##  5  1980     0 0.518 0.392 0.972 0.858 0.769 0.272    0.920 0.943 0.257
##  6  1981     0 0.454 0.442 0.945 0.801 0.767 0.261    0.842 0.904 0.243
##  7  1982     0 0.380 0.497 0.776 0.575 0.546 0.240    0.736 0.703 0.235
##  8  1983     0 0.384 0.541 0.711 0.655 0.426 0.255    0.710 0.401 0.247
##  9  1984     0 0.414 0.616 0.749 0.834 0.441 0.274    0.644 0.424 0.272
## 10  1985     0 0.433 0.683 0.769 0.725 0.523 0.300    0.666 0.489 0.300
## # ... with 32 more rows, and 4 more variables: NSF <dbl>, Other <dbl>,
## #   USDA <dbl>, VA <dbl>

After scaling, I tried hierachical clustering. But, clustering functions such as hclust() or kmeans(), by default, cluster the rows of a dataset: we must to modify the format of the data again.

This time we transpose the data with t(), to exchange rows and columns.

# For clustering
# we need to transpose the data
# because dist() %>% clusters the columns
rd_for_clust <-   
  rd_wide_01 %>%
  column_to_rownames("year") %>%
  t() 

rd_hclust <-
  rd_for_clust %>% 
  dist() %>%
  hclust() 

The results of hierarchical clustering are a bit more complex:

  • The DHS stays by itself, since it was established late, in the ’00s.
  • VA, HHS, NIH and HHS group together, since lately, they have seen an increase in fundings,
  • EPA is also pretty much by itself, since it’s fundings kept decreasing.
rd_hclust %>% plot()

2.3 Visualize the clusters with heatmaps

In biology, small numbers of clustered observation are often visualized with a combination of heatmaps and dendrograms.

You can plot those together with the packages superheat

rd_for_clust %>% 
  superheat::superheat(row.dendrogram = TRUE,
                       left.label.text.size = 3,
                       bottom.label.text.angle = 90,
                       bottom.label.text.size = 3,
                       grid.hline.col = "grey40",
                       grid.vline.col = "grey40")

3 Other visualizations.

Heatmaps are great for the tight space that scientific journals reserve to figures.

But, here, because I don’t have limitations and for fun, I wanted to try other visualization.

The one below has more white space, but might be more pleasurable to the eye than an highly compressed heatmap, and it conveys additional information, such as maximum absolute spending for each department

# I restarted from the original data,
# to have both scaled and absolute measurements
#  in the same dataset
rd_to_plot <- 
  rd %>% 
  select(department, year, rd_budget) %>% 
  # rescale 0 to 1
  group_by(department) %>% 
  mutate(budget_01 = rd_budget %>% 
           {scales::rescale(., to = c(0, 1),
                            from = c(0, max(.)))}) %>% 
  ungroup() 

I would like to prepare a faceted plot, in which facets are ordered based on a dendrogram, and with the dendrogram itself on the side.

I can order the facets with the dendrogram from hierarchical clustering using a factor. (But I have no idea how to show the dendrogram itself.)

rd_to_plot <- 
  rd_to_plot %>% 
  mutate(department = department %>% 
           factor(levels = rd_hclust$labels[rd_hclust$order]))
line_col <- "grey70"
text_col <- "grey40"
line_size <- .2
label_spacer <- .4

p <- 
  rd_to_plot %>% 
  ggplot(aes(x = year,
             y = budget_01,
             colour = budget_01)) +
  geom_line(colour = line_col,
            size = line_size) +
  geom_segment(data = . %>% 
              group_by(department) %>% 
              filter(rd_budget == max(rd_budget)),
            aes(y = budget_01,
                yend = budget_01 -label_spacer,
                xend = year),
            size = line_size, colour = line_col) +
  geom_point(aes(size = rd_budget)) +
  geom_text(data = . %>% 
              group_by(department) %>% 
              filter(rd_budget == max(rd_budget)),
            aes(y = budget_01 - label_spacer - .1,
                label = rd_budget %>% scales::scientific(digits = 0)),
            size = 4) +
  facet_grid(department ~ .) +
  theme_minimal() + 
  scale_color_scico(palette = "devon",
    guide = FALSE, 
    direction = -1,
    begin = .2) +
  lims(y = c(0, 1.1)) +
  labs(y = "Government R&D Spending") +
  scale_size(range = c(.1, 3), guide = FALSE) +
  theme(text = element_text(family = "sans",
                            colour = text_col),
        axis.text.y = element_blank(),
        panel.grid = element_blank(),
        panel.grid.major.x = element_line(colour = line_col,
                                          size = line_size/2),
        legend.position = "top",
        strip.text.y = element_text(family = "sans",
                                    angle = 0,
                                    colour = text_col))


p %>% print()

What do you think?

Avatar
Otho Mantegazza
Biologist, Data Scientist
comments powered by Disqus