Intermediate programming with R

Creating different plots with geoms

Learning Objectives

Map a categorical (discrete) variable to an axis
Create a boxplot and barplot
Add error bars and text annotations

To this point we have mainly focused on a scatter plot created using geom_point. ggplot2 offers many different geoms to create a wide variety of plots. In this lesson, we’ll explore a few more geoms.

Mapping categorical (discrete) variables

The scatter plot we made mapped continuous variables to the x and y axes.

p <- ggplot(research, aes(x = log10(pdfDownloadsCount + 1),
                          y = log10(wosCountThru2011 + 1))) +
  geom_point(aes(color = journal)) +
  geom_smooth() +
  scale_x_continuous(breaks = c(1, 3), labels = c(10, 1000)) +
  scale_y_continuous(breaks = c(1, 3), labels = c(10, 1000))
p

geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

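We can also map a categorical (also known as discrete) variable to an axis. Here we map journal to the x-axis and use geom_boxplot to summarize the distribution of citations for each journal.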

p_box <- ggplot(research, aes(x = journal,
                              y = log10(wosCountThru2011 + 1))) +
  geom_boxplot() +
  scale_y_continuous(breaks = c(1, 3), labels = c(10, 1000))
p_box

Creating a barplot

A common figure in scientific publications is the barplot, where the height of each bar represents the mean and the error bars represent the standard error of the mean (SEM). Recall from the challenge in the dplyr section that we calculated the mean and SEM for the number of article tweets per journal.

tweets_per_journal <- research %>%
  group_by(journal) %>%
  summarize(num = n(),
            mean = mean(backtweetsCount),
            sem = sd(backtweetsCount) / sqrt(num))
tweets_per_journal

Source: local data frame [7 x 4]

  journal   num       mean         sem
1    pbio  1325 0.05811321 0.020153395
2    pcbi  1351 0.12657291 0.052177184
3    pgen  1619 0.06547251 0.020408525
4    pmed   643 0.31104199 0.187868371
5    pntd   621 0.02576490 0.009057697
6    pone 14078 0.49303878 0.034484187
7    ppat  1459 0.02604524 0.008807428
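
As a minimal sketch of what the summarize call computes for each journal, the same three quantities can be calculated for a single vector in base R. The vector tweets and its values are made up for illustration and are not taken from the research data.

```r
# Hypothetical tweet counts for one journal (illustrative values only)
tweets <- c(0, 0, 1, 0, 3, 0, 0, 2)

n <- length(tweets)                 # sample size
tweet_mean <- mean(tweets)          # arithmetic mean
tweet_sem <- sd(tweets) / sqrt(n)   # SEM: standard deviation / sqrt(sample size)

c(n = n, mean = tweet_mean, sem = tweet_sem)
```

Within summarize, the arguments are evaluated in order, which is why num can be reused in the expression for sem.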

Let’s create a barplot to display this result using geom_bar.

tweets_bar <- ggplot(tweets_per_journal, aes(x = journal, y = mean)) +
  geom_bar(stat = "identity")
tweets_bar

We have to set the argument stat = "identity" because by default geom_bar tabulates the number of occurrences of the variable mapped to x and maps this frequency to y.
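
The difference between the default counting behavior and stat = "identity" can be sketched with two small data frames. The data frames articles and summary_df are hypothetical and exist only for this illustration. The last plot uses geom_col, which (assuming ggplot2 version 2.2.0 or later is installed) is a shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Hypothetical raw data: one row per article
articles <- data.frame(journal = c("pbio", "pbio", "pone", "pone", "pone"))

# Default stat: geom_bar counts the rows for each value of x,
# so the bar heights are 2 (pbio) and 3 (pone)
p_count <- ggplot(articles, aes(x = journal)) +
  geom_bar()

# Hypothetical pre-summarized data: one row per journal
summary_df <- data.frame(journal = c("pbio", "pone"),
                         mean = c(0.06, 0.49))

# stat = "identity": the bar heights are the y values exactly as given
p_identity <- ggplot(summary_df, aes(x = journal, y = mean)) +
  geom_bar(stat = "identity")

# Assumes ggplot2 >= 2.2.0: geom_col draws the same bars as p_identity
p_col <- ggplot(summary_df, aes(x = journal, y = mean)) +
  geom_col()
```

Printing p_count, p_identity, or p_col at the console displays the corresponding plot.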

Now let’s add the error bars using geom_errorbar. We pass it values for ymin and ymax, which define the range of the error bars.

tweets_bar <- ggplot(tweets_per_journal, aes(x = journal, y = mean)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem))
tweets_bar

By default the error bars are as wide as the bars. Let’s make them narrower using the argument width.

tweets_bar <- ggplot(tweets_per_journal, aes(x = journal, y = mean)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem), width = 0.1)
tweets_bar

Since we also calculated the sample size per journal, let’s add the number of articles to the plot using geom_text.

tweets_bar <- ggplot(tweets_per_journal, aes(x = journal, y = mean)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem), width = 0.1) +
  geom_text(aes(label = num))
tweets_bar

These labels are difficult to read because each text box is centered on the coordinate that marks the top of its bar. The position of the text box relative to that coordinate is controlled by the arguments hjust and vjust, for horizontal and vertical justification. By default both are set to 0.5. We can imagine the text box as running from 0 to 1 from left to right for hjust and from 0 to 1 from bottom to top for vjust; the values we choose determine which part of the box sits on the coordinate. We’ll set both to 0, which places the bottom left corner of the text box at that point, so the text appears above and to the right of the top center of each bar.

tweets_bar <- ggplot(tweets_per_journal, aes(x = journal, y = mean)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem), width = 0.1) +
  geom_text(aes(label = num), hjust = 0, vjust = 0)
tweets_bar
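
To get a feel for how hjust and vjust move a label, it can help to try several settings on a single labeled point. This is only a sketch: the data frame pt and its label are hypothetical and are not part of the lesson data.

```r
library(ggplot2)

# Hypothetical data: a single point with a label
pt <- data.frame(x = 1, y = 1, label = "n = 10")

base <- ggplot(pt, aes(x = x, y = y, label = label)) +
  geom_point()

# Defaults (hjust = 0.5, vjust = 0.5): text centered on the point
p_center <- base + geom_text()

# Bottom left corner of the text on the point: text above and to the right
p_above_right <- base + geom_text(hjust = 0, vjust = 0)

# Middle of the bottom edge on the point: text directly above
p_above <- base + geom_text(hjust = 0.5, vjust = 0)

# Top right corner of the text on the point: text below and to the left
p_below_left <- base + geom_text(hjust = 1, vjust = 1)
```

Printing each plot in turn shows the label shifting around the fixed point.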

To make it easier to see this central point, and to show an alternative to a boxplot, we can easily switch from a barplot to using points.

tweets_point <- ggplot(tweets_per_journal, aes(x = journal, y = mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem), width = 0.1) +
  geom_text(aes(label = num), hjust = 0, vjust = 0)
tweets_point

Using multiple geoms, we were able to create a plot with multiple layers that displays several pieces of information at once: the mean, the SEM, and the sample size.

Challenges

Mean number of tweets per journal per year

Modify the dplyr code above to calculate the mean, SEM, and sample size of the number of article tweets per journal and per year. Use facet_wrap to make a separate subplot per year.

Visualizing a single distribution

The geoms geom_histogram and geom_density can be used to create histograms and density plots, respectively. Using these geoms, visualize the distribution of 2011 citations (wosCountThru2011). Compare the raw distribution to log10 and square root transformations.