Intermediate programming with R

Chaining commands with dplyr

Learning Objectives

  • Chain commands together using %>%
  • Sort rows using arrange

The Unix design philosophy is to create small tools that each do one thing well and can be chained together to perform more complex operations. In an earlier lesson on the Unix shell, we reviewed how to chain multiple Unix commands together using the pipe operator |. dplyr provides similar functionality in R through the pipe operator %>%, which is implemented in the magrittr package.

How to use the pipe %>%

In the previous lesson, we learned how to subset rows and columns using filter and select, respectively. Instead of performing these operations separately, we can combine them into one expression. Below we subset the data to include only the Facebook columns for the articles published in 2006.

facebook_2006 <- research %>% filter(year == 2006) %>%
  select(contains("facebook"))
head(facebook_2006)
  facebookShareCount facebookLikeCount facebookCommentCount
1                  0                 0                    0
2                  0                 0                    0
3                  0                 0                    0
4                  0                 0                    0
5                  0                 0                    0
6                  0                 0                    0
  facebookClickCount
1                  0
2                  0
3                  0
4                  0
5                  0
6                  0

This is equivalent to the following:

research_2006 <- filter(research, year == 2006)
facebook_2006 <- select(research_2006, contains("facebook"))

Comparing the more verbose version to the version with pipes, we can see how %>% passes the output of one function to the next function: the output from the previous function becomes the first positional argument to the next function. Thus research %>% filter(year == 2006) is converted to filter(research, year == 2006).
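The rewriting rule described above can be checked on a small example. This is only a sketch: the articles data frame is made up for illustration and stands in for research, and it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for the research data frame
articles <- data.frame(
  year = c(2005, 2006, 2006, 2007),
  facebookLikeCount = c(1, 0, 3, 2)
)

# Piped form: articles becomes the first argument to filter
piped <- articles %>% filter(year == 2006)

# Equivalent explicit call
nested <- filter(articles, year == 2006)

# Both forms make the same call, so the results match
print(identical(piped, nested))
```

Because both forms make the same call to filter, the comparison should report that the two results are identical.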

This feature is not limited to dplyr functions. We can pipe the output to other R functions as well. For example, instead of saving the output as a new data frame and then inspecting it with head, we can pipe the output directly to head.

research %>% filter(year == 2006) %>% select(contains("facebook")) %>% head
  facebookShareCount facebookLikeCount facebookCommentCount
1                  0                 0                    0
2                  0                 0                    0
3                  0                 0                    0
4                  0                 0                    0
5                  0                 0                    0
6                  0                 0                    0
  facebookClickCount
1                  0
2                  0
3                  0
4                  0
5                  0
6                  0

This is especially useful for providing quick feedback while iteratively developing code.
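As a further sketch of piping into base R functions, the example below passes an extra argument to head and counts rows with nrow. The articles data frame is again hypothetical; only the piped-in data frame is supplied by %>%, and any other arguments are written as usual.

```r
library(dplyr)

# Hypothetical toy data standing in for the research data frame
articles <- data.frame(
  year = c(2005, 2006, 2006, 2006, 2007),
  facebookLikeCount = c(1, 0, 3, 7, 2)
)

# head receives the filtered data frame as its first argument;
# n = 2 is passed along as an ordinary named argument
first_two <- articles %>%
  filter(year == 2006) %>%
  head(n = 2)

# nrow is a base R function, not part of dplyr
n_2006 <- articles %>%
  filter(year == 2006) %>%
  nrow()

print(first_two)
print(n_2006)
```

Checks like these are cheap to append to the end of a chain and remove again once the result looks right.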

Sort rows using arrange

To practice using %>%, we’ll use an additional dplyr function, arrange. It sorts the rows by the values in the specified columns, using each subsequent column to break ties in the previous one. This is similar to the base R function order. For example, here are the first 10 rows after sorting by the number of authors and then by the 2011 citation count. Since these commands are starting to get longer, we’ll put each function on its own line.

research %>%
  arrange(authorsCount, wosCountThru2011) %>%
  select(authorsCount, wosCountThru2011) %>%
  slice(1:10)
   authorsCount wosCountThru2011
1             1                0
2             1                0
3             1                0
4             1                0
5             1                0
6             1                0
7             1                0
8             1                0
9             1                0
10            1                0

This isn’t very interesting because it sorts from lowest to highest. We can reverse this sorting using the function desc, for descending.

research %>%
  arrange(desc(authorsCount), desc(wosCountThru2011)) %>%
  select(authorsCount, wosCountThru2011) %>%
  slice(1:10)
   authorsCount wosCountThru2011
1           158              196
2           144                0
3           120                7
4           117              300
5           114              119
6            82                6
7            80                3
8            74                5
9            71               25
10           67               16
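To make the tie-breaking behaviour and the relationship to order easier to see, here is a minimal sketch on a made-up four-row data frame. The toy data frame is hypothetical; its column names are borrowed from research only for readability.

```r
library(dplyr)

# Hypothetical toy data with ties in authorsCount
toy <- data.frame(
  authorsCount = c(2, 1, 2, 1),
  wosCountThru2011 = c(5, 9, 3, 4)
)

# Ascending sort: ties in authorsCount are broken by wosCountThru2011
ascending <- toy %>%
  arrange(authorsCount, wosCountThru2011)

# A comparable base R approach using order
with_order <- toy[order(toy$authorsCount, toy$wosCountThru2011), ]

# Descending sort on both columns using desc
descending <- toy %>%
  arrange(desc(authorsCount), desc(wosCountThru2011))

print(ascending)
print(descending)
```

In the ascending result, the two single-author rows come first, ordered by citation count (4, then 9), followed by the two-author rows (3, then 5). The descending result reverses both orderings.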

Challenges

Titles of most cited articles

Using a chain of pipes, output the titles of the three research articles with the largest 2011 citation count (wosCountThru2011).

Lots of authors

Using a chain of pipes, output the author count (authorsCount), title, journal, and subject tags (plosSubjectTags) of the three research articles with the largest number of authors.