Intermediate programming with R

Summarizing with dplyr

Learning Objectives

  • Create new columns using mutate
  • Summarize data using summarize
  • Count number of observations using n()
  • Group data by variable(s) with group_by

At this point we have only used dplyr to subset and organize our data, but we can also use it to create new columns. And the true power of dplyr is revealed when we perform these operations by groups.

Create new columns with mutate

To create a new column in the data frame, we use mutate. Let’s create a new column that is the number of weeks since the article was published.

research <- mutate(research,
                   weeksSincePublished = daysSincePublished / 7)

We can immediately reference newly created columns, even within the same call to mutate. For example, we can create a variable yearsSincePublished that references the newly created weeksSincePublished.

research <- mutate(research,
                   weeksSincePublished = daysSincePublished / 7,
                   yearsSincePublished = weeksSincePublished / 52)
select(research, contains("Since")) %>% slice(1:10)
   daysSincePublished weeksSincePublished yearsSincePublished
1                2628            375.4286            7.219780
2                2593            370.4286            7.123626
3                2684            383.4286            7.373626
4                2684            383.4286            7.373626
5                2628            375.4286            7.219780
6                2628            375.4286            7.219780
7                2656            379.4286            7.296703
8                2656            379.4286            7.296703
9                2628            375.4286            7.219780
10               2628            375.4286            7.219780

Summarize data using summarize

We use mutate when the result has the same number of rows as the original data. When we instead need to reduce the data to a single summary statistic, we use summarize. For example, let's calculate the mean number of PLOS comments.

summarize(research, plos_mean = mean(plosCommentCount))
  plos_mean
1 0.2642681

And we can add additional statistics, like the standard deviation:

summarize(research, plos_mean = mean(plosCommentCount),
          plos_sd = sd(plosCommentCount))
  plos_mean  plos_sd
1 0.2642681 1.230676

Notice that this creates a second column in the data frame result.

And of course we can pipe input to summarize. Let’s calculate these statistics specifically for the articles in PLOS One published in 2007.

research %>% filter(journal == "pone", year == 2007) %>%
  summarize(plos_mean = mean(plosCommentCount),
            plos_sd = sd(plosCommentCount))
  plos_mean  plos_sd
1 0.8315704 2.033141
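The pipe simply passes its left-hand side as the first argument to the next function, so the same result could be obtained by nesting the calls. A less readable but equivalent sketch:

summarize(filter(research, journal == "pone", year == 2007),
          plos_mean = mean(plosCommentCount),
          plos_sd = sd(plosCommentCount))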

Lastly, since it is often useful to know how many observations, in this case articles, are present in a given subset, dplyr provides the convenience function n().

research %>% filter(journal == "pone", year == 2007) %>%
  summarize(plos_mean = mean(plosCommentCount),
            plos_sd = sd(plosCommentCount),
            num = n())
  plos_mean  plos_sd  num
1 0.8315704 2.033141 1229

Summarizing per group with group_by

The function summarize is most powerful when applied to groupings of the data, and compared to writing loops by hand, dplyr makes the code much easier to write, understand, and extend.

Recall the function we wrote earlier to calculate the mean of a metric for each level of a factor.

mean_metric_per_var <- function(metric, variable) {
  result <- numeric(length = length(levels(variable)))
  names(result) <- levels(variable)
  for (v in levels(variable)) {
    result[v] <- mean(metric[variable == v])
  }
  return(result)
}

We ran it as follows:

mean_metric_per_var(research$backtweetsCount, research$journal)
      pbio       pcbi       pgen       pmed       pntd       pone 
0.05811321 0.12657291 0.06547251 0.31104199 0.02576490 0.49303878 
      ppat 
0.02604524 

We can perform the same operation by combining summarize with group_by:

research %>% group_by(journal) %>%
  summarize(tweets_mean = mean(backtweetsCount))
Source: local data frame [7 x 2]

  journal tweets_mean
   <fctr>       <dbl>
1    pbio  0.05811321
2    pcbi  0.12657291
3    pgen  0.06547251
4    pmed  0.31104199
5    pntd  0.02576490
6    pone  0.49303878
7    ppat  0.02604524

Conveniently, it returns the result as a data frame. And if we want to further group by another factor, we can simply add it to the group_by call.

research %>% group_by(journal, year) %>%
  summarize(tweets_mean = mean(backtweetsCount))
Source: local data frame [42 x 3]
Groups: journal [?]

   journal  year tweets_mean
    <fctr> <int>       <dbl>
1     pbio  2003 0.000000000
2     pbio  2004 0.000000000
3     pbio  2005 0.011363636
4     pbio  2006 0.010869565
5     pbio  2007 0.004926108
6     pbio  2008 0.030456853
7     pbio  2009 0.005524862
8     pbio  2010 0.367231638
9     pcbi  2005 0.000000000
10    pcbi  2006 0.000000000
..     ...   ...         ...

To do this with the function we wrote manually, we would have had to add another for loop!
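One caveat worth noting: the result above is still grouped by journal, because each call to summarize removes only the last grouping variable (hence the "Groups: journal" line in the output). If subsequent operations should treat the rows independently, we can drop the remaining grouping with ungroup; a minimal sketch:

research %>% group_by(journal, year) %>%
  summarize(tweets_mean = mean(backtweetsCount)) %>%
  ungroup()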

Challenges

Summarizing the number of tweets per journal

Create a new data frame, tweets_per_journal, that for each journal contains the total number of articles, the mean number of tweets (backtweetsCount) received by articles in that journal, and the standard error of the mean (SEM) of the number of tweets. The SEM is the standard deviation divided by the square root of the sample size (i.e. the number of articles).
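One possible approach (the column names num, mean, and sem here are illustrative choices, not required by the exercise) is to combine n(), mean, and sd inside a single summarize:

tweets_per_journal <- research %>%
  group_by(journal) %>%
  summarize(num = n(),
            mean = mean(backtweetsCount),
            sem = sd(backtweetsCount) / sqrt(n()))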