You have already explored the Iris dataset by determining differences in petal and sepal size of different types of these plants. In this tutorial you will learn how to visualize these difference using different kinds of plots with an external package called ggplot2.
ggplot2 is an R package created by Hadley Wickham in 2005. It can highly improve the quality and aesthetic of your graphs. As you’ve seen in the last days, R also has some built-in plotting functions, but today we will be focusing on ggplot2.
Scroll through http://www.r-graph-gallery.com/portfolio/ggplot2-package/ to see some examples of the plots you can make. Clicking on a plot shows the code that was used to make it.
First, we have to load the ggplot2 package using the library() function.
Now we have loaded the ggplot2 library lots of new methods are available. The first function you always need when making a ggplot is ggplot(). Let’s try the basic function ggplot() and geom_point() to generate a scatter plot.
The first argument in ggplot() is the dataset you want to use to generate the plot, in this case we’re using the iris dataset again. The second argument is the aes() or aesthetic function, where you can set which variables of the dataset are used for the available visual properties. In the case of this scatterplot we’re only interested in the x and y position of each data point.
Just running the command above will generate an empty plot. This is because ggplot2 doesn’t know what kind of plot you wish to make. The function geom_point() will tell ggplot2 to make a scatterplot. Combining the two functions ggplot(...) + geom_point() will result in a scatterplot:
Q1: Make a scatterplot between the Petal Length and Petal Width. How many clusters do you see?
One of the first things to check when you are visualy exploring a dataset is the distribution of its variables. A common way to do this is by plotting a histogram. With ggplot you can do this easily by using geom_histogram(). For instance we can look at the distribution of petal lengths with the following command:
As you can see, aes() function now only takes one argument (x) since we are only interested in the distribution of one variable.
The length of every bin is the number of observations within the range (width) of each bin. You can customize the width of each bin with the binwidth argument. This might be useful if the dataset contains many more observations:
Q2
binwidth = 0.05A box plot is another quick way of examining one or more sets of data graphically. The box shows where the majority of the data points are located, the band in the box is the median and the whiskers shows the complete range of values. Box plots may seem more primitive than a histogram but they do have some advantages. They take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data.
Since the iris dataset includes different species we can use boxplots to compare the distribution of values between these groups.
Q3:
+ geom_boxplot()Species, and for y-axis the sepal widthIn ggplot2 it is possible to add extra options to a plot by simply on top of the ggplot() function. Therefore it can be convenient to store the result of ggplot() inside a variable. For example:
plot1 <- ggplot(iris, aes(x=Species, y=Sepal.Width))
plot2 <- plot1 + geom_boxplot()
plot3 <- plot2 + theme_bw()
Notice that this won’t generate any graphical output. Like with any assignment, this will not immediately print the content of the variable plot3.
Q4 Consecutively type plot1, plot2, plot3. What is added in each plot?
Q5
sepal.length.boxplotNow we have stored the plot, we can add shapes to the plot by adding to the sepal.length.boxplot variable. The functions geom_hline() and geom_vline() are used to add horizontal and vertical lines to a plot. For instance we can draw a horizontal line over the plot to indicate what the mean is of sepal lengths for the complete dataset. The only argument these functions need is the interception of the x- or y-axis.
Q6
sepal.length.meangeom_hline(yintercept = sepal.length.mean)color = "red" within the brackets of geom_hline()Previously we have determined if there was a significant differences in petal lengths using a t-test. The p-value yielded by this test is very informative and can be added to plot easily.
Q7
t.test.p.valueTo annotate the boxplot with the p-value of the t-test, we can use the annotate() function. We need to provide what type of annotation we want, the x and y position and the actual text of the annotation. But first we have to format the label properly.
Q8
signif() and save the new value as t.test.p.value againpaste("p-value =", t.test.p.value)p.value.label+ annotate("text", x = 1.5, y = 8, label = p.value.label)p.value.label to the right lower Hint: play around with the values of x and y in annotate() until the label is placed where you like itWhile ggplot2 makes good plots right out off the box, they are also highly adjustable.
Basically every line, colour, shape or text in the plot can be adjusted. Let’s enhance our boxplot a bit by adding the actual data points as dots by adding geom_point().
Because we are plotting in one dimension, the above statement will draw all dots on a single line. Since some measurements might overlap it is a good idea to add some random noise to the ‘x’ position of each point. We can adjust the x position by using the position argument of the geom_point() function. The random noise can be added using "jitter" as value. To move the points a bit to the background we can also adjust the size, color and transparency using the size, color, alpha arguments.
Q9
sepal.length.boxplot.jittergeom_point(), add the following: position = "jitter" size = 1 alpha = 0.5 color = c("COLORNAME")"blue" or "cyan"Sometimes it is useful to change the sizes of the labels. For instance when you want to use your plots for a presentation or for a poster you want to increase the font size of the labels, so that the people in the back of the room also have an idea what you are talking about. We can do this by using the theme() function. Type ?theme in your console to see what arguments can be used when calling this function.
As you can see there are a lot of arguments available, we can use the axis.title argument to adjust both axes at once. To change them separately we could use axis.title.x and axis.title.y:
The text of the label on the y axis “Sepal.Length” is not despriptive enough. All the measurements of the iris dataset are in centimeters, so we should add the unit here. To change the text of the y label we can use the ylab() function and add it to plot like we are used to.
Q10
sepal.length.boxplot.jitter"Sepal length (cm)"sepal.length.boxplot.labelsThe size of the text of the species beneath the boxplots can also be increased by adding more arguments to the theme() function. The argument we need to change the text of axis is axis.text.x.
Q11
sepal.length.boxplot.labelstheme(axis.text.x = element_text(size = 3))sepal.length.boxplot.sizes
To make a beter distinction between the two boxplot we could also color the boxes:
As you can see the boxplot is drawn on top of geom_point, this is because we have added the boxplot as another layer on top of the previous plot.
Q12:
sepal.length.boxplot.newNot everyone likes the grey background. An easy way to remove these (and to make the plot look less like a default ggplot) is to add theme_bw():
There are many situations where data is presented in a format that is not ready to dive straight to exploratory data analysis or to use a desired statistical method. The reshape2 package for R provides useful functionality to avoid having to hack data around in a spreadsheet prior to import into R. More recently, it has become the norm to use the dplyr package for reformatting data, and although I highly recommend that you use dplyr if you are going to work more in R, it does require a little getting used to. Therefore, we will stick to reshape2 for this tutorial.
The melt() function takes data formatted as a matrix with a set of columns, like our data, and formats it into a single column. For some application of GGplot we need this format. To make use of the function we need to specify a data frame, the id variables (which will be left at their settings) and the measured variables (columns of data) to be stacked. The default assumption on measured variables is that it is all columns which are not specified as id variables.
Load the reshape2 library, so we can use the melt() function, and print the head() of iris again.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Our id variable is Species, so we use this the value of the id.vars argument of the melt function. All the other columns are variables we would like to use and therefore we do not have to specify the measure.vars argument.
Q13
iris.melt using head(), dim() and str()Now that we have melted our data we can use the facet_wrap() and facet_grid() functions. The facet approach partitions a plot into a matrix of panels. Each panel shows a different subset of the data.
We can plot the molten data frame by plotting the distribution of every variable as boxplots:
As you can see, these boxplots combine the values of all three iris species for the width and length of sepal and patel.
If we want to plot the distribution of every species separately we can use the facet_grid() function. You have to specify a formula with the rows (of the plot matrix) on the left hand side of ~ and the columns (of the plot matrix) on the right hand side of ~: facet_grid(ROWVARIABLE~COLUMNVARIABLE). You can keep one of the two variables empty as well by adding .:
Q14
It’s probably more informative to group the variables instead of the species so you can compare the sepal lengths across species.
Q15 Use the facet_grid() function so that that the boxplots are grouped by the measured variables. The plot should look like this:
A violin plot is another method of plotting numerical data. The violin plot is similar to box plots, except that they also show the probability density of the data at different values. A violin plot is more informative than a plain boxplot: while a boxplot only shows summary statistics: mean/median, interquartile ranges and outliers, the violin plot shows the full distribution of the data. We can make violin plots using the geom_violin() function from ggplot2.
Q16
geom_violin() function to create a similar plot like the last boxplot (grouped by the measured variables)+ theme(axis.text.x=element_text(angle=45, hjust=1)) to rotate the x-axis labels so they don’t overlap
Sometimes you want to quickly see which points are above or below a certain threshold. Coloring these points can be done by first creating a vector with which points you want to color.
You can use the function ifelse() to make a vector of outliers vs. non-outliers. ifelse() is a function to determine the outcome depending on a condition.
The first argument in the function is what you want to evaluate. For example: for every value in iris$Sepal.Length: is it higher than 7.6 (True) or lower than 7.6 (False)?
The second parameter is the value that it should return if True
The third parameter is the value it should return if False
Here is the example we described:
The outlier vector can then be given as a vector to the color parameter. Using geom_text() we also add a label to those points.
Q17
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, color=outlier)) +
geom_point() +
geom_text(aes(label = Sepal.Width), data = iris[iris$Sepal.Length>7.6,], color="black")One great way of showing the power of good visualization is Simpsons’ paradox.
Look again at the Iris correlations using cor(iris[1:4]) and specifically look at the correlation between sepal length and width. As we discussed yesterday, epal width is negatively correlated with sepal length when correlating for all species together, but when correlating using the separate species they are all positive.
Q18
+ geom_smooth(method = "lm")Q19
facet_grid() to split for each speciesscales = "free_x" as parameter to the facet_grid() functionFinally, when you are finished with your plots you will want to save them to a file so that you can add them to your presentation or poster. When using ggplot() the easiest way to save is ggsave():
plot1 <- ggplot(data = iris, aes(Sepal.Length, Sepal.Width)) +
geom_point()
# Running ggsave() will save the last plot you made
ggsave('filename.png')
# The file type is based on the extension you use.
ggsave('filename.pdf')When making figures for posters or presentations, be sure to only use either eps or pdf as these do not lose quality when resizing (read https://thepoliticalmethodologist.com/2013/11/25/making-high-resolution-graphics-for-academic-publishing/ if you want to know more).
You can save a specific plot stored in a variable by adding the parameter plot=NAME. In the example below, you are saving plot1 instead of plot2, which would be the case if you didn’t specify that the plot you wanted to save now was plot=plot1.
plot2 <- ggplot(data = iris, aes(Sepal.Length, Sepal.Width)) +
geom_boxplot()
ggsave('filename.pdf', plot=plot1)You will often want to change the width, size and resolution of the plot depending on its use. For presentations when using one figure per slide you want it a bit wider than high, so you can add width and height. For posters you will probably want to use 300 dpi (dots per inch) as your resolution to ensure the plot looks great when printed at high quality.
Q20