{:title "Making Figures with R: common things I need to look up and tricks" :date "2019-01-13" :updated "{{{time(%Y-%m-%d %a)}}}" :tags ["R" "science" "graphics"]}

*First published: 2019-01-13 Sun*

*Last updated: 2019-01-13 Sun*

I use R for most of my data analysis and especially for making figures. There are a few things I constantly need to look up when it comes to fine-grained figure editing and graphics, as well as some best practices I picked up which have made my life easier. I'll present them below in the form of an example problem.

**Disclaimer!** I am mostly self taught when it comes to R and statistics. So if I misuse any statistics or mangle any R code, please let me know (email listed on my home page) and I will fix it!

In the spirit of making a reproducible example that you can follow along with, we will make the data from scratch.

For my PhD thesis project, I work on investigating neuron-glia interactions. This is fake data, but I recently wrote my comprehensive exam so this data is modeled after what I was working with for my project. Basically, I was measuring part of the neuron called the Axon initial segment and wanted to compare two different treatments.

To model this problem, I will just use two normal distributions with different means with separate labels. In this case, the names will be purposely awful so that we can rename them later. I'll also add other grouping categories so that we can segment the data in different ways.

# make two normal distributions with different means and n's set.seed(42) # reproducibility vals <- c(rnorm(mean=22,sd = 3, n=120), rnorm(mean=18,sd=3, n=120)) # now labels names <- c(rep("untreated", times=120), rep("treated", times=120)) samples <- rep(c(rep("a1", times=40),rep("a2", times=40), rep("a3", times=40)), times=2) data <- data.frame(names, vals, samples) head(data)

names vals samples 1 untreated 26.11288 a1 2 untreated 20.30591 a1 3 untreated 23.08939 a1 4 untreated 23.89859 a1 5 untreated 23.21280 a1 6 untreated 21.68163 a1

Now that we made the data, let's start playing with it. Here are the libraries I commonly use for figures.

library(dplyr) # essential! library(ggplot2) # plots library(ggpubr) # signif levels on plots library(cowplot) # prettier plots library(latex2exp) # annotating plots

You'll see what we use these for as we go, but briefly:

- dplyr is essential for organizing data for plotting.
- ggpubr for adding statistical significance and testing directly to plots.
- latex2exp adding \LaTeX{} symbols to your plots

You want the figures you make to have a standard look and feel to them. I use ggplot2 along with cowplot, which provides nice defaults on top of ggplot for scientific plotting.

However, I might make a lot of drafts of plots before I decide on a common style. I like to set my defaults at the top of my script and then add it to all my plots as I go. That way, if I want to change a color or font size, I can change it in one place and ensure consistency.

base_file_save <- "~/personal_projects/website-clj/resourc/img/" theme_and_axis_size <- theme(legend.position = "None", text=element_text(size=16), axis.text = element_text(size=16)) color2_standard <- scale_color_manual(values=c("black", "red")) fill2_standard <- scale_fill_manual(values=c("black", "red")) base_pallett <- "Paired" # BuGn is also good and BrBG statsize <- 6 starsize <- 9

You'll see how we use these variables will be explained as we go, but the idea is that we just add them to our plots, and then we can change them all at once as we go through.

`base_file_save`

stores the base path where all my files will be saved. We will use this inside of `ggsave(file.path(base_file_save, "this-plot-name.png"))`

later on to simplify and standardize save locations for a particular project (in this case for the website).

`theme_and_axis_size`

stores a `ggplot2`

standard size and legend options. This one is really important for having the same text size for your plots. I add this to every plot, and if I want a particular option changed (i.e. I do want a legend) I can specifically add that below.

Since I am only plotting two classes for most of this example, I used black and red as my color options in `scale_color_manual`

and `scale_fill_manual`

. I would otherwise use http://colorbrewer2.org/ to pick a colorblind-safe and pretty palette like Paired or PRGn. Storing it as a variable makes it easy to change for all your plots.

`statsize`

and `starsize`

are for the significance plots.

The data looks like so:

head(data)

names vals samples 1 untreated 26.11288 a1 2 untreated 20.30591 a1 3 untreated 23.08939 a1 4 untreated 23.89859 a1 5 untreated 23.21280 a1 6 untreated 21.68163 a1

If we make a quick plot of it, say a boxplot:

ggplot(data, aes(x=names, y=vals, color=names)) + geom_boxplot() + theme_and_axis_size + color2_standard

`if_else()`

and `case_when()`

ggplot2 orders variables in alphabetical order, so our `untreated`

(aka control) is shown before our `treated`

(aka experimental). Not ideal. One way to fix this is to **rename the variables.** This can be done like so:

data %>% mutate(names = if_else(names=="untreated", "control", "treated")) %>% head()

names vals samples 1 control 26.11288 a1 2 control 20.30591 a1 3 control 23.08939 a1 4 control 23.89859 a1 5 control 23.21280 a1 6 control 21.68163 a1

if_else()works great for dichotomous variables. However, if you have a bunch and you want to rename them all, use case_when().

data %>% mutate(names = case_when(names == "untreated" ~ "control", names == "treated" ~ "experimental")) %>% # and so on for more cases head()

names vals samples 1 control 26.11288 a1 2 control 20.30591 a1 3 control 23.08939 a1 4 control 23.89859 a1 5 control 23.21280 a1 6 control 21.68163 a1

`if_else`

and `case_when()`

would solve the problem. But a more general, less destructive solution would be to re-level the factors. I will use forcats to demonstrate this.

**Note**: in the code below I am not going to import the entire `forcats`

library, because I only need one function. Instead, I will use 'inline import' to grab the one function I need. This is useful if you only need one function and don't want to load the whole library, or if you think that two libraries have functions with the same name and you aren't sure which you loaded first. In R, you inline import like so: `libraryName::functionName`

. Read it as, "from `libraryName`

use `functionName`

". You can do this with any function from any library, including base R. This is actually great to do because it is more explicit.

`fct_relevel`

is the function we need (docs).

data$names <- forcats::fct_relevel(data$names, "untreated")

Now plot it again:

We just re-ordered the variables without re-naming them. Note that `fct_relevel`

accepts a vector, so from our data frame, we selected the column, then just put the variable we wanted first as the next argument. Then we assigned it back to the original column name.

data %>% rename(NewNames = names) %>% head()

NewNames vals samples 1 untreated 26.11288 a1 2 untreated 20.30591 a1 3 untreated 23.08939 a1 4 untreated 23.89859 a1 5 untreated 23.21280 a1 6 untreated 21.68163 a1

rename is from `dplyr`

. The argument order is `NewColumnName`

= `OldColumnName`

We have two grouping variables in this dataset. Let's say measurements were paired, and we wanted to show both the paired differences and the overall boxplot.

We could first summarize like so:

summarized_data <- data %>% group_by(names, samples) %>% summarize(mean_val = mean(vals), sd_vals = sd(vals), n = n()) %>% mutate(sem_vals = sd_vals/sqrt(n)) summarized_data

# A tibble: 6 x 6 # Groups: names [2] names samples mean_val sd_vals n sem_vals <fctr> <fctr> <dbl> <dbl> <int> <dbl> 1 untreated a1 21.88139 3.667164 40 0.5798295 2 untreated a2 22.23953 2.748078 40 0.4345093 3 untreated a3 22.14594 2.904321 40 0.4592135 4 treated a1 17.22540 2.556320 40 0.4041897 5 treated a2 18.09546 2.638866 40 0.4172414 6 treated a3 17.71718 2.811301 40 0.4445057

We made a summary of the data in two steps. First, we grouped by both the treatment group and the individual samples. Then, used dplyr::summarize to make some summary vars. The `mutate`

step adds the standard error of the mean, a measure of the spread of our sample mean around the population mean. The formula is \(SEM=\dfrac{s}{\sqrt{n}}\). Where \(s\) is the standard deviation.

Using these data, let's make a summary boxplot.

ggplot(summarized_data, aes(x=names, y=mean_val, color=names)) + geom_boxplot() + geom_errorbar(width=0.05, aes(ymin=mean_val - sem_vals, ymax=mean_val + sem_vals, alpha=0.4)) + geom_line(inherit.aes = FALSE, aes(x=names, y=mean_val, group=samples)) + color2_standard + theme_and_axis_size + labs(x="", y=TeX("Length $\\mu{}m$"))

Note the use of `TeX()`

in the axis label.

let's do a two-tailed *t*-test to see whether we can conclude that the difference between the groups is unlikely to occur by chance (significance arbitrarily set to \(\alpha{}=0.05\)).

We will use the R formula interface.

```
t.test(mean_val~names, data=summarized_data, paired=TRUE)
```

Paired t-test data: mean_val by names t = 29.777, df = 2, p-value = 0.001126 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 3.772432 5.046781 sample estimates: mean of the differences 4.409607

We can reject the null hypothesis that the true difference in the means is equal to 0 with \(\alpha{}=0.05\).

**Be careful when interpreting p-values!** Below are my favorite papers on this contentious subject:

- Erroneous analysis of interactions in neuroscience: a problem of significance
- Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations
- The Difference Between "Significant" and "Not Significant" is not Itself Statistically Significant (Paywall)
- Nice explanation of
*p*-values http://statisticsbyjim.com/hypothesis-testing/interpreting-p-values/

Using ggpubr, we can add this same information to our plot.

ggplot(summarized_data, aes(x=names, y=mean_val, color=names)) + geom_boxplot() + geom_errorbar(width=0.05, aes(ymin=mean_val - sem_vals, ymax=mean_val + sem_vals, alpha=0.4)) + geom_line(inherit.aes = FALSE, aes(x=names, y=mean_val, group=samples)) + color2_standard + theme_and_axis_size + stat_compare_means(method="t.test", paired=TRUE, label="p.signif", size=starsize) + # NEW! labs(x="", y=TeX("Length $\\mu{}m$"))

See the docs for ggpubr for more options (types of tests, pairing, etc.). This is a really awesome library.

But this looks ok, however it could use some tweaking. Let's move the stars around and add the p-value and test name

ggplot(summarized_data, aes(x=names, y=mean_val, color=names)) + geom_boxplot() + geom_errorbar(width=0.05, aes(ymin=mean_val - sem_vals, ymax=mean_val + sem_vals, alpha=0.4)) + geom_line(inherit.aes = FALSE, aes(x=names, y=mean_val, group=samples)) + color2_standard + theme_and_axis_size + stat_compare_means(method="t.test", paired=TRUE, label="p.signif", # edited label.x = 1.97, label.y=23, size=starsize) + stat_compare_means(method="t.test", paired=TRUE, size=statsize, # New! label.x=2.05, label.y=23.5) + labs(x="", y=TeX("Length $\\mu{}m$"))

We added a new call to `ggpubr`

to add the test name, and we moved both labels so they looked nicer.

Let's say we wanted to make a plot of the cumulative distribution for all the data. The cumulative distribution function (CDF) maps a value to the probability that a random variable is less than or equal to that value (you can also say, the function maps a value to its percentile rank. See Allen Downey's book *Think Stats* for an excellent, simple explanation http://www.greenteapress.com/thinkstats/ and wikipedia). You can approximate the true CDF by calculating the *empirical* CDF (ECDF) with R using the base function stats::ecdf().

However, `ggplot2`

also provides a number of methods for calculating *and* plotting data summaries like the ECDF with the stats_* layers. Let's use stats_ecdf to plot the ECDF.

ggplot(data, aes(vals, color=names)) + stat_ecdf(geom="step", pad=TRUE) + color2_standard + theme_and_axis_size + labs(x=TeX("Length ($\\mu{}m$)"), y="Probability")

We previously removed the legend with our `theme_and_axis_size`

presets. Here, we can add it back.

ggplot(data, aes(vals, color=names)) + stat_ecdf(geom="step", pad=TRUE) + color2_standard + theme_and_axis_size + theme(legend.position="right")+ labs(x=TeX("Length ($\\mu{}m$)"), y="Probability")

Looks ok, but I want to remove the title and move it to the left more.

ggplot(data, aes(vals, color=names)) + stat_ecdf(geom="step", pad=TRUE) + color2_standard + theme_and_axis_size + theme(legend.position=c(0.65, 0.5), legend.title = element_blank())+ labs(x=TeX("Length ($\\mu{}m$)"), y="Probability")

`legend.position`

accepts coordinates, which are between 0 and 1, and relative to the bottom left origin (0,0) of the plot (legend position is well explained here).

Another great resource for legends and all other things R is the r cookbook website.

Want to compare the distributions with a Kolmogorov-Smirnov Test?

test_vals <- filter(data, names == "treated")$vals control_vals <- filter(data, names == "untreated")$vals ks.test(control_vals, test_vals)

Two-sample Kolmogorov-Smirnov test data: control_vals and test_vals D = 0.6, p-value < 2.2e-16 alternative hypothesis: two-sided

*This is a work in progress. As I come across other problems, I will add them here!*