Intermediate programming with R
Chaining commands with dplyr
Learning Objectives
- Chain commands together using %>%
- Sort rows using arrange
The Unix design philosophy is to create small tools that do one thing well and can be chained together to perform more complex operations. In an earlier lesson on the Unix shell, we reviewed how to chain multiple Unix commands together using the pipe operator |. dplyr provides similar functionality in R through the pipe operator %>%, which is implemented in the magrittr package.
How to use the pipe %>%
In the previous lesson, we learned how to subset rows and columns using filter and select, respectively. Instead of performing these operations separately, we can combine them into one expression. Below we subset to include only the Facebook data for all the articles published in 2006.
facebook_2006 <- research %>% filter(year == 2006) %>%
  select(contains("facebook"))
head(facebook_2006)
facebookShareCount facebookLikeCount facebookCommentCount
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
facebookClickCount
1 0
2 0
3 0
4 0
5 0
6 0
This is equivalent to the following:
research_2006 <- filter(research, year == 2006)
facebook_2006 <- select(research_2006, contains("facebook"))
Comparing the more verbose version to the version with pipes, we can see how %>% passes the output of one function to the next function: the output from the previous function becomes the first positional argument to the next function. Thus research %>% filter(year == 2006) is converted to filter(research, year == 2006).
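To make the rewrite rule concrete, here is a minimal sketch that checks the piped and nested forms against each other. The data frame `toy` and its values are made up for illustration (a hypothetical stand-in for `research`, which is not reproduced here).

```r
library(dplyr)

# Hypothetical stand-in for the research data frame (values are invented)
toy <- data.frame(
  year = c(2006, 2007, 2006, 2008),
  facebookShareCount = c(0, 3, 1, 5),
  facebookLikeCount = c(0, 2, 0, 7),
  title = c("a", "b", "c", "d")
)

# Piped version: each left-hand side becomes the first argument
# of the function on the right
piped <- toy %>%
  filter(year == 2006) %>%
  select(contains("facebook"))

# Nested version: the same two calls written inside out
nested <- select(filter(toy, year == 2006), contains("facebook"))

# Both forms should produce the same data frame
identical(piped, nested)
```

The nested form has to be read from the inside out, while the piped form reads top to bottom in the order the operations run, which is the main reason to prefer it for longer chains.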
This feature is not limited to dplyr functions. We can pipe the output to other R functions as well. For example, instead of saving the output as a new data frame and then inspecting it with head, we can pipe the output directly to head.
research %>% filter(year == 2006) %>% select(contains("facebook")) %>% head
facebookShareCount facebookLikeCount facebookCommentCount
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
facebookClickCount
1 0
2 0
3 0
4 0
5 0
6 0
This is especially useful for providing quick feedback while iteratively developing code.
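As a further sketch of piping into base R functions, the example below ends one chain with nrow and another with head plus an extra argument. The data frame `toy` is again a hypothetical stand-in with invented values, not the lesson's data.

```r
library(dplyr)

# Hypothetical stand-in for the research data frame (values are invented)
toy <- data.frame(
  year = c(2006, 2007, 2006, 2008),
  facebookShareCount = c(0, 3, 1, 5)
)

# End the chain with base R's nrow to count the matching rows
n_2006 <- toy %>%
  filter(year == 2006) %>%
  nrow()

# Extra arguments follow the piped-in first argument,
# so this call is equivalent to head(<filtered data>, n = 1)
first_row <- toy %>%
  filter(year == 2006) %>%
  head(n = 1)

n_2006
first_row
```

With %>%, writing head and head() at the end of a chain is equivalent; including the parentheses makes it easier to add arguments such as n later.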
Sort rows using arrange
To practice using %>%, we’ll use an additional dplyr function, arrange. It sorts the rows by the values in the specified columns, using each subsequent column to break ties in the previous columns. This is similar to the R function order. For example, here are the first 10 rows after sorting by number of authors and the 2011 citation count. Since these commands are starting to get longer, we’ll put each function on its own line.
research %>%
  arrange(authorsCount, wosCountThru2011) %>%
  select(authorsCount, wosCountThru2011) %>%
  slice(1:10)
authorsCount wosCountThru2011
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
10 1 0
This isn’t very interesting because arrange sorts from lowest to highest by default. We can reverse the order by wrapping a column in the function desc, short for descending.
research %>%
  arrange(desc(authorsCount), desc(wosCountThru2011)) %>%
  select(authorsCount, wosCountThru2011) %>%
  slice(1:10)
authorsCount wosCountThru2011
1 158 196
2 144 0
3 120 7
4 117 300
5 114 119
6 82 6
7 80 3
8 74 5
9 71 25
10 67 16
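Because the real data set is large, tie-breaking and desc can be easier to follow on a tiny example. The sketch below uses a four-row data frame `toy` with invented values (a hypothetical stand-in for `research`) and also shows one way the ascending sort could be written with the base R function order.

```r
library(dplyr)

# Hypothetical stand-in for the research data frame (values are invented)
toy <- data.frame(
  authorsCount = c(2, 1, 2, 1),
  wosCountThru2011 = c(5, 9, 7, 3)
)

# Ascending sort: rows tied on authorsCount are ordered by wosCountThru2011
asc <- toy %>%
  arrange(authorsCount, wosCountThru2011)

# A base R version of the same sort, indexing rows with order
asc_base <- toy[order(toy$authorsCount, toy$wosCountThru2011), ]

# Descending sort on both columns using desc
dsc <- toy %>%
  arrange(desc(authorsCount), desc(wosCountThru2011))

asc
dsc
```

In `asc`, the two single-author rows come first, with the citation count of 3 ahead of 9; in `dsc`, the two-author rows come first, with 7 ahead of 5.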
Challenges
Titles of most cited articles
Using a chain of pipes, output the titles of the three research articles with the largest 2011 citation count (wosCountThru2011).
Lots of authors
Using a chain of pipes, output the author count (authorsCount), title, journal, and subject tags (plosSubjectTags) of the three research articles with the largest number of authors.