{:title "Making Figures with R: common things I need to look up and tricks" :date "2019-01-13" :updated "{{{time(%Y-%m-%d %a)}}}" :tags ["R" "science" "graphics"]}

Making figures with R: common things I need to look up and tricks

First published: 2019-01-13 Sun
Last updated: 2019-01-13 Sun

I use R for most of my data analysis and especially for making figures. There are a few things I constantly need to look up when it comes to fine-grained figure editing and graphics, as well as some best practices I picked up which have made my life easier. I'll present them below in the form of an example problem.

Disclaimer! I am mostly self taught when it comes to R and statistics. So if I misuse any statistics or mangle any R code, please let me know (email listed on my home page) and I will fix it!

Making the data

In the spirit of making a reproducible example that you can follow along with, we will make the data from scratch.

For my PhD thesis project, I work on investigating neuron-glia interactions. This is fake data, but I recently wrote my comprehensive exam so this data is modeled after what I was working with for my project. Basically, I was measuring part of the neuron called the Axon initial segment and wanted to compare two different treatments.

To model this problem, I will just use two normal distributions with different means with separate labels. In this case, the names will be purposely awful so that we can rename them later. I'll also add other grouping categories so that we can segment the data in different ways.

# make two normal distributions with different means and n's
set.seed(42) # reproducibility
vals <- c(rnorm(mean=22,sd = 3, n=120), rnorm(mean=18,sd=3, n=120))
# now labels
names <- c(rep("untreated", times=120), rep("treated", times=120))
samples <- rep(c(rep("a1", times=40),rep("a2", times=40), rep("a3", times=40)), times=2)
data <- data.frame(names, vals, samples)
head(data)
      names     vals samples
1 untreated 26.11288      a1
2 untreated 20.30591      a1
3 untreated 23.08939      a1
4 untreated 23.89859      a1
5 untreated 23.21280      a1
6 untreated 21.68163      a1

Useful libraries and standardizing settings

Now that we made the data, let's start playing with it. Here are the libraries I commonly use for figures.

library(dplyr) # essential!
library(ggplot2) # plots
library(ggpubr) # signif levels on plots
library(cowplot) # prettier plots
library(latex2exp) # annotating plots

You'll see what we use these for as we go, but briefly:

  • dplyr is essential for organizing data for plotting.
  • ggpubr for adding statistical significance and testing directly to plots.
  • latex2exp adding \LaTeX{} symbols to your plots

You want the figures you make to have a standard look and feel to them. I use ggplot2 along with cowplot, which provides nice defaults on top of ggplot for scientific plotting.

However, I might make a lot of drafts of plots before I decide on a common style. I like to set my defaults at the top of my script and then add it to all my plots as I go. That way, if I want to change a color or font size, I can change it in one place and ensure consistency.

base_file_save <- "~/personal_projects/website-clj/resourc/img/"
theme_and_axis_size <- theme(legend.position = "None", 
                             text=element_text(size=16), 
                             axis.text = element_text(size=16)) 

color2_standard <- scale_color_manual(values=c("black", "red"))
fill2_standard <- scale_fill_manual(values=c("black", "red"))
base_pallett <- "Paired" # BuGn is also good and BrBG 

statsize <- 6
starsize <- 9

You'll see how we use these variables will be explained as we go, but the idea is that we just add them to our plots, and then we can change them all at once as we go through.

base_file_save stores the base path where all my files will be saved. We will use this inside of ggsave(file.path(base_file_save, "this-plot-name.png")) later on to simplify and standardize save locations for a particular project (in this case for the website).
theme_and_axis_size stores a ggplot2 standard size and legend options. This one is really important for having the same text size for your plots. I add this to every plot, and if I want a particular option changed (i.e. I do want a legend) I can specifically add that below.

Since I am only plotting two classes for most of this example, I used black and red as my color options in scale_color_manual and scale_fill_manual. I would otherwise use http://colorbrewer2.org/ to pick a colorblind-safe and pretty palette like Paired or PRGn. Storing it as a variable makes it easy to change for all your plots.
statsize and starsize are for the significance plots.

renaming variables, re-ordering columns, and renaming columns

The data looks like so:

head(data)
      names     vals samples
1 untreated 26.11288      a1
2 untreated 20.30591      a1
3 untreated 23.08939      a1
4 untreated 23.89859      a1
5 untreated 23.21280      a1
6 untreated 21.68163      a1

If we make a quick plot of it, say a boxplot:

ggplot(data, aes(x=names, y=vals, color=names)) +
    geom_boxplot() +
    theme_and_axis_size +
    color2_standard

wrong.png

Option 1: Renaming variables with if_else() and case_when()

ggplot2 orders variables in alphabetical order, so our untreated (aka control) is shown before our treated (aka experimental). Not ideal. One way to fix this is to rename the variables. This can be done like so:

data  %>%
    mutate(names = if_else(names=="untreated", "control", "treated"))  %>%
    head()
    names     vals samples
1 control 26.11288      a1
2 control 20.30591      a1
3 control 23.08939      a1
4 control 23.89859      a1
5 control 23.21280      a1
6 control 21.68163      a1

if_else()works great for dichotomous variables. However, if you have a bunch and you want to rename them all, use case_when().

data  %>%
    mutate(names = case_when(names == "untreated" ~ "control",
                             names ==  "treated" ~ "experimental"))  %>% # and so on for more cases
    head()
    names     vals samples
1 control 26.11288      a1
2 control 20.30591      a1
3 control 23.08939      a1
4 control 23.89859      a1
5 control 23.21280      a1
6 control 21.68163      a1

if_else and case_when() would solve the problem. But a more general, less destructive solution would be to re-level the factors. I will use forcats to demonstrate this.

Option 2: Reordering factors

Note: in the code below I am not going to import the entire forcats library, because I only need one function. Instead, I will use 'inline import' to grab the one function I need. This is useful if you only need one function and don't want to load the whole library, or if you think that two libraries have functions with the same name and you aren't sure which you loaded first. In R, you inline import like so: libraryName::functionName. Read it as, "from libraryName use functionName". You can do this with any function from any library, including base R. This is actually great to do because it is more explicit.

fct_relevel is the function we need (docs).

data$names <- forcats::fct_relevel(data$names, "untreated")

Now plot it again:

ggplot(data, aes(x=names, y=vals, color=names)) +
    geom_boxplot() +
    theme_and_axis_size +
    color2_standard

releveled.png

We just re-ordered the variables without re-naming them. Note that fct_relevel accepts a vector, so from our data frame, we selected the column, then just put the variable we wanted first as the next argument. Then we assigned it back to the original column name.

Renaming columns

data  %>%
    rename(NewNames = names)  %>%
    head()
   NewNames     vals samples
1 untreated 26.11288      a1
2 untreated 20.30591      a1
3 untreated 23.08939      a1
4 untreated 23.89859      a1
5 untreated 23.21280      a1
6 untreated 21.68163      a1

rename is from dplyr. The argument order is NewColumnName = OldColumnName

Paired plots

We have two grouping variables in this dataset. Let's say measurements were paired, and we wanted to show both the paired differences and the overall boxplot.

Summarizing and making a paired plot

We could first summarize like so:

summarized_data <- data  %>%
    group_by(names, samples)  %>%
    summarize(mean_val = mean(vals),
              sd_vals = sd(vals), n = n())  %>%
    mutate(sem_vals = sd_vals/sqrt(n))
summarized_data
# A tibble: 6 x 6
# Groups:   names [2]
      names samples mean_val  sd_vals     n  sem_vals
     <fctr>  <fctr>    <dbl>    <dbl> <int>     <dbl>
1 untreated      a1 21.88139 3.667164    40 0.5798295
2 untreated      a2 22.23953 2.748078    40 0.4345093
3 untreated      a3 22.14594 2.904321    40 0.4592135
4   treated      a1 17.22540 2.556320    40 0.4041897
5   treated      a2 18.09546 2.638866    40 0.4172414
6   treated      a3 17.71718 2.811301    40 0.4445057

We made a summary of the data in two steps. First, we grouped by both the treatment group and the individual samples. Then, used dplyr::summarize to make some summary vars. The mutate step adds the standard error of the mean, a measure of the spread of our sample mean around the population mean. The formula is \(SEM=\dfrac{s}{\sqrt{n}}\). Where \(s\) is the standard deviation.

Using these data, let's make a summary boxplot.

ggplot(summarized_data, aes(x=names, y=mean_val, color=names)) +
    geom_boxplot() +
    geom_errorbar(width=0.05, aes(ymin=mean_val - sem_vals,
                                  ymax=mean_val + sem_vals, alpha=0.4)) +
    geom_line(inherit.aes = FALSE, aes(x=names, y=mean_val, group=samples)) +
    color2_standard +
    theme_and_axis_size +
    labs(x="", y=TeX("Length $\\mu{}m$"))

paired-boxplot.png

Note the use of TeX() in the axis label.

Significance test with R

let's do a two-tailed t-test to see whether we can conclude that the difference between the groups is unlikely to occur by chance (significance arbitrarily set to \(\alpha{}=0.05\)).
We will use the R formula interface.

t.test(mean_val~names, data=summarized_data, paired=TRUE)
	Paired t-test

data:  mean_val by names
t = 29.777, df = 2, p-value = 0.001126
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 3.772432 5.046781
sample estimates:
mean of the differences 
               4.409607

We can reject the null hypothesis that the true difference in the means is equal to 0 with \(\alpha{}=0.05\).

Be careful when interpreting p-values! Below are my favorite papers on this contentious subject:

Significance stars and stats with ggpubr

Using ggpubr, we can add this same information to our plot.

ggplot(summarized_data, aes(x=names, y=mean_val, color=names)) +
    geom_boxplot() +
    geom_errorbar(width=0.05, aes(ymin=mean_val - sem_vals,
                                  ymax=mean_val + sem_vals, alpha=0.4)) +
    geom_line(inherit.aes = FALSE, aes(x=names, y=mean_val, group=samples)) +
    color2_standard +
    theme_and_axis_size +
    stat_compare_means(method="t.test", paired=TRUE, label="p.signif", size=starsize) + # NEW!
    labs(x="", y=TeX("Length $\\mu{}m$"))

paired-boxplot-signif.png

See the docs for ggpubr for more options (types of tests, pairing, etc.). This is a really awesome library.
But this looks ok, however it could use some tweaking. Let's move the stars around and add the p-value and test name

ggplot(summarized_data, aes(x=names, y=mean_val, color=names)) +
    geom_boxplot() +
    geom_errorbar(width=0.05, aes(ymin=mean_val - sem_vals,
                                  ymax=mean_val + sem_vals, alpha=0.4)) +
    geom_line(inherit.aes = FALSE, aes(x=names, y=mean_val, group=samples)) +
    color2_standard +
    theme_and_axis_size +
    stat_compare_means(method="t.test", paired=TRUE, label="p.signif", # edited
                       label.x = 1.97, label.y=23, size=starsize) +
    stat_compare_means(method="t.test", paired=TRUE, size=statsize, # New!
                       label.x=2.05, label.y=23.5) +
    labs(x="", y=TeX("Length $\\mu{}m$"))

paired-signif2.png
We added a new call to ggpubr to add the test name, and we moved both labels so they looked nicer.

Stats within ggplot2 and custom legend positions

Let's say we wanted to make a plot of the cumulative distribution for all the data. The cumulative distribution function (CDF) maps a value to the probability that a random variable is less than or equal to that value (you can also say, the function maps a value to its percentile rank. See Allen Downey's book Think Stats for an excellent, simple explanation http://www.greenteapress.com/thinkstats/ and wikipedia). You can approximate the true CDF by calculating the empirical CDF (ECDF) with R using the base function stats::ecdf().
However, ggplot2 also provides a number of methods for calculating and plotting data summaries like the ECDF with the stats_* layers. Let's use stats_ecdf to plot the ECDF.

Plotting the ecdf with ggplot2

ggplot(data, aes(vals, color=names)) +
    stat_ecdf(geom="step", pad=TRUE) +
    color2_standard +
    theme_and_axis_size +
    labs(x=TeX("Length ($\\mu{}m$)"), y="Probability")

cdf-raw.png

Custom legend positions

We previously removed the legend with our theme_and_axis_size presets. Here, we can add it back.

ggplot(data, aes(vals, color=names)) +
    stat_ecdf(geom="step", pad=TRUE) +
    color2_standard +
    theme_and_axis_size +
    theme(legend.position="right")+
    labs(x=TeX("Length ($\\mu{}m$)"), y="Probability")

cdf-with-legend.png

Looks ok, but I want to remove the title and move it to the left more.

ggplot(data, aes(vals, color=names)) +
    stat_ecdf(geom="step", pad=TRUE) +
    color2_standard +
    theme_and_axis_size +
    theme(legend.position=c(0.65, 0.5), legend.title = element_blank())+
    labs(x=TeX("Length ($\\mu{}m$)"), y="Probability")

cdf-with-legend-moved.png

legend.position accepts coordinates, which are between 0 and 1, and relative to the bottom left origin (0,0) of the plot (legend position is well explained here).

Another great resource for legends and all other things R is the r cookbook website.

Kolmogorov-Smirnov Test

Want to compare the distributions with a Kolmogorov-Smirnov Test?

test_vals <- filter(data, names == "treated")$vals
control_vals <- filter(data, names == "untreated")$vals
ks.test(control_vals, test_vals)

	Two-sample Kolmogorov-Smirnov test

data:  control_vals and test_vals
D = 0.6, p-value < 2.2e-16
alternative hypothesis: two-sided

This is a work in progress. As I come across other problems, I will add them here!