1 Introduction

You have already explored the Iris dataset by determining differences in petal and sepal size of different types of these plants. In this tutorial you will learn how to visualize these difference using different kinds of plots with an external package called ggplot2.

ggplot2 is an R package created by Hadley Wickham in 2005. It can highly improve the quality and aesthetic of your graphs. As you’ve seen in the last days, R also has some built-in plotting functions, but today we will be focusing on ggplot2.

Scroll through http://www.r-graph-gallery.com/portfolio/ggplot2-package/ to see some examples of the plots you can make. Clicking on a plot shows the code that was used to make it.


First, we have to load the ggplot2 package using the library() function.



2 Plotting

2.1 Easy example

Now we have loaded the ggplot2 library lots of new methods are available. The first function you always need when making a ggplot is ggplot(). Let’s try the basic function ggplot() and geom_point() to generate a scatter plot.

The first argument in ggplot() is the dataset you want to use to generate the plot, in this case we’re using the iris dataset again. The second argument is the aes() or aesthetic function, where you can set which variables of the dataset are used for the available visual properties. In the case of this scatterplot we’re only interested in the x and y position of each data point.


Just running the command above will generate an empty plot. This is because ggplot2 doesn’t know what kind of plot you wish to make. The function geom_point() will tell ggplot2 to make a scatterplot. Combining the two functions ggplot(...) + geom_point() will result in a scatterplot:


Q1: Make a scatterplot between the Petal Length and Petal Width. How many clusters do you see?



2.2 Histrograms

One of the first things to check when you are visualy exploring a dataset is the distribution of its variables. A common way to do this is by plotting a histogram. With ggplot you can do this easily by using geom_histogram(). For instance we can look at the distribution of petal lengths with the following command:


As you can see, aes() function now only takes one argument (x) since we are only interested in the distribution of one variable.

The length of every bin is the number of observations within the range (width) of each bin. You can customize the width of each bin with the binwidth argument. This might be useful if the dataset contains many more observations:


Q2

  • Make a histogram of the Sepal length
  • Use binwidth = 0.05
  • By looking at the plot: what sepal length is most prevalent?



2.3 Boxplots

A box plot is another quick way of examining one or more sets of data graphically. The box shows where the majority of the data points are located, the band in the box is the median and the whiskers shows the complete range of values. Box plots may seem more primitive than a histogram but they do have some advantages. They take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data.

Since the iris dataset includes different species we can use boxplots to compare the distribution of values between these groups.

Q3:

  • Make a boxplot of the Sepal length per species
  • Use + geom_boxplot()
  • Specify that the aesthetic for x-axis are Species, and for y-axis the sepal width
  • What species has the largest median sepal width?



2.4 Saving plots

In ggplot2 it is possible to add extra options to a plot by simply on top of the ggplot() function. Therefore it can be convenient to store the result of ggplot() inside a variable. For example:


Notice that this won’t generate any graphical output. Like with any assignment, this will not immediately print the content of the variable plot3.

Q4 Consecutively type plot1, plot2, plot3. What is added in each plot?


Q5

  • Make a boxplot of sepal length
  • Save it under the name sepal.length.boxplot



3 Improving plots

3.1 Adding shapes

Now we have stored the plot, we can add shapes to the plot by adding to the sepal.length.boxplot variable. The functions geom_hline() and geom_vline() are used to add horizontal and vertical lines to a plot. For instance we can draw a horizontal line over the plot to indicate what the mean is of sepal lengths for the complete dataset. The only argument these functions need is the interception of the x- or y-axis.

Q6

  • Calculate the mean Sepal length across all species
  • Save this mean as sepal.length.mean
  • Add the mean as a line to sepal.length.boxplot using geom_hline(yintercept = sepal.length.mean)
  • Make the line stand out by adding color = "red" within the brackets of geom_hline()
  • Do you think the mean is representative of all species? Does it overlap with the ‘whiskers’ of the boxplot?



3.2 Adding text

Previously we have determined if there was a significant differences in petal lengths using a t-test. The p-value yielded by this test is very informative and can be added to plot easily.

Q7

  • Run a t-test to determine the difference in means of sepal length between Iris virginica and Iris versicolor
  • Store the p-value in a variable called t.test.p.value
  • What is the p-value?


To annotate the boxplot with the p-value of the t-test, we can use the annotate() function. We need to provide what type of annotation we want, the x and y position and the actual text of the annotation. But first we have to format the label properly.

Q8

  • Round the 3 p-value to 3 significant digits using signif() and save the new value as t.test.p.value again
  • Create a label using paste("p-value =", t.test.p.value)
  • Save the label as p.value.label
  • Add the annotation to the boxplot: + annotate("text", x = 1.5, y = 8, label = p.value.label)
  • Adjust the position of the p.value.label to the right lower Hint: play around with the values of x and y in annotate() until the label is placed where you like it



3.3 Adding layers

While ggplot2 makes good plots right out off the box, they are also highly adjustable.

Basically every line, colour, shape or text in the plot can be adjusted. Let’s enhance our boxplot a bit by adding the actual data points as dots by adding geom_point().


Because we are plotting in one dimension, the above statement will draw all dots on a single line. Since some measurements might overlap it is a good idea to add some random noise to the ‘x’ position of each point. We can adjust the x position by using the position argument of the geom_point() function. The random noise can be added using "jitter" as value. To move the points a bit to the background we can also adjust the size, color and transparency using the size, color, alpha arguments.

Q9

  • Make a new boxplot of Sepal length called sepal.length.boxplot.jitter
  • In the brackets of geom_point(), add the following: position = "jitter" size = 1 alpha = 0.5 color = c("COLORNAME")
  • Fill in a color you like, mabye "blue" or "cyan"
  • Check out all kinds of color names in R in this [overview] (http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf)
  • If you make the same plot a few times in a row, are they exactly the same? Why (not)?



3.4 Changing axis labels

Sometimes it is useful to change the sizes of the labels. For instance when you want to use your plots for a presentation or for a poster you want to increase the font size of the labels, so that the people in the back of the room also have an idea what you are talking about. We can do this by using the theme() function. Type ?theme in your console to see what arguments can be used when calling this function.

As you can see there are a lot of arguments available, we can use the axis.title argument to adjust both axes at once. To change them separately we could use axis.title.x and axis.title.y:


The text of the label on the y axis “Sepal.Length” is not despriptive enough. All the measurements of the iris dataset are in centimeters, so we should add the unit here. To change the text of the y label we can use the ylab() function and add it to plot like we are used to.

Q10

  • Start with your saves sepal.length.boxplot.jitter
  • Change the axis labels to be larger (see example code above)
  • Change the y-axis label text to include the units: "Sepal length (cm)"
  • Store the new plot as sepal.length.boxplot.labels


The size of the text of the species beneath the boxplots can also be increased by adding more arguments to the theme() function. The argument we need to change the text of axis is axis.text.x.

Q11

  • Start with your saves sepal.length.boxplot.labels
  • Change the x-axis species text size using theme(axis.text.x = element_text(size = 3))
  • What’s wrong?
  • Adjust the mistake and save the result as sepal.length.boxplot.sizes


To make a beter distinction between the two boxplot we could also color the boxes:


As you can see the boxplot is drawn on top of geom_point, this is because we have added the boxplot as another layer on top of the previous plot.

Q12:

  • Recreate the coloured boxplot above so that the points lay on top of the boxplot
  • Call the new boxplot sepal.length.boxplot.new
  • Change the colour of the dots if you do not like the combination with the box plot colours


Not everyone likes the grey background. An easy way to remove these (and to make the plot look less like a default ggplot) is to add theme_bw():



4 Grouped plots

4.1 Melting data

There are many situations where data is presented in a format that is not ready to dive straight to exploratory data analysis or to use a desired statistical method. The reshape2 package for R provides useful functionality to avoid having to hack data around in a spreadsheet prior to import into R. More recently, it has become the norm to use the dplyr package for reformatting data, and although I highly recommend that you use dplyr if you are going to work more in R, it does require a little getting used to. Therefore, we will stick to reshape2 for this tutorial.

The melt() function takes data formatted as a matrix with a set of columns, like our data, and formats it into a single column. For some application of GGplot we need this format. To make use of the function we need to specify a data frame, the id variables (which will be left at their settings) and the measured variables (columns of data) to be stacked. The default assumption on measured variables is that it is all columns which are not specified as id variables.

Load the reshape2 library, so we can use the melt() function, and print the head() of iris again.

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa


Our id variable is Species, so we use this the value of the id.vars argument of the melt function. All the other columns are variables we would like to use and therefore we do not have to specify the measure.vars argument.


Q13

  • Explore the new reformatted data frame iris.melt using head(), dim() and str()
  • What does each of the columns indicate?
  • How is this different from the initial dataset?



4.2 Facets

Now that we have melted our data we can use the facet_wrap() and facet_grid() functions. The facet approach partitions a plot into a matrix of panels. Each panel shows a different subset of the data.

We can plot the molten data frame by plotting the distribution of every variable as boxplots:


As you can see, these boxplots combine the values of all three iris species for the width and length of sepal and patel.

If we want to plot the distribution of every species separately we can use the facet_grid() function. You have to specify a formula with the rows (of the plot matrix) on the left hand side of ~ and the columns (of the plot matrix) on the right hand side of ~: facet_grid(ROWVARIABLE~COLUMNVARIABLE). You can keep one of the two variables empty as well by adding .:


Q14

  • Make the same plot as above, but this time put the species as row variable
  • What changes?
  • Which plot makes it easier to compare across species?


It’s probably more informative to group the variables instead of the species so you can compare the sepal lengths across species.

Q15 Use the facet_grid() function so that that the boxplots are grouped by the measured variables. The plot should look like this:



4.3 Violin plots

A violin plot is another method of plotting numerical data. The violin plot is similar to box plots, except that they also show the probability density of the data at different values. A violin plot is more informative than a plain boxplot: while a boxplot only shows summary statistics: mean/median, interquartile ranges and outliers, the violin plot shows the full distribution of the data. We can make violin plots using the geom_violin() function from ggplot2.

Q16

  • Use the geom_violin() function to create a similar plot like the last boxplot (grouped by the measured variables)
  • Color the boxes by species and draw actual data points on top of the violin plot
  • Add + theme(axis.text.x=element_text(angle=45, hjust=1)) to rotate the x-axis labels so they don’t overlap
  • The final plot should look like below
  • What do you notice in the distribution of the petal length across the three species?



5 Using plots to find anomalies

5.1 Coloring points above a threshold

Sometimes you want to quickly see which points are above or below a certain threshold. Coloring these points can be done by first creating a vector with which points you want to color.

You can use the function ifelse() to make a vector of outliers vs. non-outliers. ifelse() is a function to determine the outcome depending on a condition.

The first argument in the function is what you want to evaluate. For example: for every value in iris$Sepal.Length: is it higher than 7.6 (True) or lower than 7.6 (False)?

The second parameter is the value that it should return if True

The third parameter is the value it should return if False

Here is the example we described:


The outlier vector can then be given as a vector to the color parameter. Using geom_text() we also add a label to those points.

Q17

  • Run the code below
  • Do you think this cut-off really identifies outliers?



6 Simpsons’ paradox

One great way of showing the power of good visualization is Simpsons’ paradox.

Look again at the Iris correlations using cor(iris[1:4]) and specifically look at the correlation between sepal length and width. As we discussed yesterday, epal width is negatively correlated with sepal length when correlating for all species together, but when correlating using the separate species they are all positive.

Q18


Q19



7 Saving plots to file

Finally, when you are finished with your plots you will want to save them to a file so that you can add them to your presentation or poster. When using ggplot() the easiest way to save is ggsave():


When making figures for posters or presentations, be sure to only use either eps or pdf as these do not lose quality when resizing (read https://thepoliticalmethodologist.com/2013/11/25/making-high-resolution-graphics-for-academic-publishing/ if you want to know more).

You can save a specific plot stored in a variable by adding the parameter plot=NAME. In the example below, you are saving plot1 instead of plot2, which would be the case if you didn’t specify that the plot you wanted to save now was plot=plot1.


You will often want to change the width, size and resolution of the plot depending on its use. For presentations when using one figure per slide you want it a bit wider than high, so you can add width and height. For posters you will probably want to use 300 dpi (dots per inch) as your resolution to ensure the plot looks great when printed at high quality.

Q20