1 Introduction

Today, we will continue the R course with some basic statistics. We will use two datasets to go through a number of exercises and questions today.
First we will use a dataset of flower petal measurements called iris, this dataset is inherent to R and meant for practice. You do not need to load or import the data, you can get started immediately. You can find many examples of people using this dataset online, so it is useful if you have an idea of what it looks like.
The next dataset is gene expression data from an actual experiment on a B cell line derived from publicly available sample called ‘GM12878’.

We will explore these datasets and apply some of the statistics you have learned about previously.

1.1 Exploring the data

The following commands are useful for exploring data:

head() #prints first 10 rows of a dataset
dim() #gives the dimensions (rows and columns) of a dataset
colnames() #prints the column names of a dataset
rownames() #prints the row names of a dataset
View() #opens the dataset in the top window of RStudio
str() #prints an overview of the 'structure' of the data, e.g. the column names, their data types
summary() #gives a summary of each column, specific output depends on data types in each column
table() #gives a count table about a particular variable

Q1: As iris is a default dataset that is included with R, you can start exploring the data without loading it first:

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

What measurements does iris contain?
What information does each of the columns describe? Hint Find more information about sepal and petal
What is the unit of measurement? Hint Here is some basic information about the dataset: iris
How many flower measurements are recoreded?

Q2: What species of iris are tested? And how many observations are there in each species? Access the relevant column using iris$Species and use one of the commands mentioned in the overview above.

2 Differences between 2 groups

We would like to know if the petal length of Iris virginica is longer than the petal length of Iris versicolor. When asking these questions usually it refers to the averages of the groups. So first, look at the mean of both groups:

mean(iris[iris$Species == 'virginica',]$Petal.Length)

## [1] 5.552

mean(iris[iris$Species == 'versicolor',]$Petal.Length)

## [1] 4.26

Q3: What does this part of the code above do: iris[iris$Species == 'virginica',]?

Save the differences between the two means in variable diff:

# Save the difference in the *variable* diff
diff <- mean(iris[iris$Species == 'virginica',]$Petal.Length) - mean(iris[iris$Species == 'versicolor',]$Petal.Length)

Q4: What is the value of diff? Why would we need P-values and confidence intervals if we can already see that there is a difference in petal length?

2.1 Parametric: two-sample t-test

We have already used iris[iris$Species == 'virginica',]$Petal.length a few times, but these long statements are not very readable. Save the petal lengths for virginica in a variable called virg_pl:

virg_pl <- iris[iris$Species == 'virginica',]$Petal.Length

Do the same for versicolor (vers_pl) and setosa (set_pl).

You may have covered how to manually calculate the P-value from a t-test using the mean and standard deviation in your program, but here we show how to simply do it in R using the function t.test().

A t-test is a parametric test, and therefore is assumes that the length of the petals is approximatly normally distributed for the population. To assess this assumption, it would be good to plot the data and have a look at it. We will go into that tomorrow.

Q5: Run a t-test to compare the petal length between virginica and versicolor:

t.test(virg_pl, vers_pl)

What is our null hypothesis, and given our t-test result, can we reject it?

You can save the result of the t-test by assigning it to a variable using variable <- t.test(), like we did a number of times in yesterdays practical.

Q6:

Run a t-test to compare the petal length between virginica and setosa
Assign the result to t_test_virg_set
Use t_test_virg_set$p.value to find the exact p-value of this test
Use t_test_virg_set$estimate to view the means and t_test_virg_set$conf.int to view the 95% confidence interval values
Can we reject the null hypothesis of this test? Is the difference in means larger when comparing virginica with versicolor or with setosa?

2.2 Non-paramteric: Wilcoxon test

We have seen how to perform a t-test to compare if the mean of two groups is the same. However, we assumed that our data was normally distributed. Large outliers can heavily influence the sample mean and standard deviation, which would influence the t-test result. But what if during recording of the petal lengths the wrong values had been added?

Let’s change three values to simulate such outliers:

vers_pl[1] <- 15
vers_pl[2] <- 17
vers_pl[3] <- 14

Now run the t-test again:

t.test(virg_pl,vers_pl)

## 
##  Welch Two Sample t-test
## 
## data:  virg_pl and vers_pl
## t = 1.6666, df = 53.03, p-value = 0.1015
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1330538  1.4410538
## sample estimates:
## mean of x mean of y 
##     5.552     4.898

The t-test is no longer significant.

Q7:

Run the wilcoxon test on the same data (wilcox.test, type ?wilcox.test to see examples of how to use it)
Is the p-value lower or higher? (type ?wilcox.test for examples how to use it)

The Wilcoxon test merges the data together, ranks each point from lowest to highest values, separates the ranks back to the two groups, and using the sum or average rank calculates the test-statistics.

Q8:

Make a plot of the values (including the changes you just added) using the code below
Read the stripchart help page
When would you typically use a stripchart?

stripchart(list(virg_pl, vers_pl),
           vertical=TRUE,
           ylab="Observations",
           pch=21,
           bg=1)

As you can see in your plot, there is a big gap between the outlier points that we just introduced and the normal points.

Q9:

Make two plots of the real and ranked values using the code below
Why is the wilcoxon test better for data with outliers?

# par() is a function that allows multiple plots to be combined in one figure
# 1 = the number of rows, 2 = the number of columns
par(mfrow=c(1,2))

xrank <- rank(c(virg_pl,vers_pl))[seq(along=virg_pl)]
yrank <- rank(c(virg_pl,vers_pl))[-seq(along=vers_pl)]
# plot the previous plot for comparison
stripchart(list(virg_pl, vers_pl),
           vertical=TRUE,
           ylab="Observations",
           pch=21,
           bg=1)
stripchart(list(xrank,yrank),
           vertical=TRUE,
           ylab="Ranks",
           pch=21,bg=1,
           cex=1.25)

3 Differences between more than 2 groups

Sometimes, we want to know the difference between not just 2 groups, but between more than 2 groups. For example, in the iris data, there are 3 species of flower that we can investigate. Of course we can run a t-test or Wilcoxon test for each of the combinations, but we can also use an analysis of variance (ANOVA) or the non-parametric Krukal-Wallis test.

3.1 Parametric: ANOVA

If you want to compare the difference of mean values among multiple groups, you can use the ANOVA test. We want to examine whether the mean value of Sepal.Length differs among the different species of iris. We use the function aov() and we assign the result to model1.
The tilde (~) must be read as the phrase ‘is modeled as a function of’. In this example, we would say: The length of iris sepals is modeled as a function of the iris species.

model1 <- aov(Sepal.Length~Species, data = iris)

In general, the basic notation for a linear model or function notation in R is:

response_variable ~ explanatory_variables

Q10:

Use the function summary() to have a look at model1
The p-value is in the column Pr(>F)
Is the p-value significant for this test?
What is the significance code for this test, and what does that mean?
The degrees of freedom are in the column Df
How many degrees of freedom are there for the Species?

3.2 Non-parametric: Kruskal-Wallis test

The same test without the assumption of normality (i.e. non-parametric test) is called the Kruskal-Wallis test.

Q11:

Use the function kruskal.test() to calculate the differences between the length of the sepals across the three species
Save the output in model2
Use the function kruskal.test() to calculate the differences between the width of the sepals across the three species
Save the output in model3
Is there a difference in significance between model1 and model2?
What is the p-value for model3?
What can you conclude about the relationship between sepal length and iris species? Is the species important for the sepal length? How about for the sepal width?

4 Differences between paired samples

In other instances, you might be interested to understand the differences between samples that are linked (or ‘paired’) to one another. For example, if you have an experiment with two measurements before and after a treatment from the same sample. Because you are looking at data from the same individual, the measurements before and after will be ‘paired’.

4.1 Parametric: paired t-test

We will use gene expression data for this part of the tutorial. The data describes an experiment where cells from a B cell line (from one individual, GM12878) are stimulated with the cytokine IL21. The same cells are sequenced before treatment, after 3 hours with cytokine stimulation and after 3 hours without cytokine stimulation.

Q12:

Copy the following commands to import and examine the data
Fill in the missing parts:
The dataset contains gene expression for the sample GM12878 on ? genes
- Hint: the ‘ENSG….’ numbers are gene identifiers
There are ? columns (experiments)
The time points used in this experiment are t=0 and t=180: TRUE/FALSE
The experiment has been done in triplo: TRUE/FALSE

#load the necessary packages
library(stringr)

# read in the table containing the read counts of all samples that we will use for the analysis
exp <- read.csv("DIRECTORY WHERE YOU SAVED THE DATA/BigData_backup_GM12878-IL21.csv", header=TRUE)
# to continue with the table, we need to create rownames of the column named "probes" in which the gene_IDs are located
rownames(exp) <- exp$X
exp <- exp[,-1]
head(exp)
dim(exp)

Q13 Using this dataset, we can look for specific and a-specific changes in gene expression as a consequence of stimulation by comparing the following data points:

Unstimulated & 180 min without cytokine
180 min without cytokine & 180 min with cytokine
Unstimulated & 180 min with cytokine

Which comparison will be most reliable and informative if we want to know the effects of IL21 stimulation?

Q14 Run a paired t-test to see if there is a significant difference between the stimulated and unstimulated gene expression after 3 hours in sample A:

t.test(exp$GM12878_A_t180unstim, exp$GM12878_A_t180, paired = TRUE)

Now test the difference between unstimulated at the start and unstimulated after 3 hours. Are the paired t-tests significant? What is the biggest difference?

4.2 Non-parametric: Wilcoxon signed rank test

The non-parametric version of the paired t-test is called the Wilcoxon signed rank test.

Q15:

Run a Wilcoxon signed rank test using wilcox.test() on GM12878_A_t0unstim and GM12878_A_t180unstim
What happens if you do not tell R that the sample information is paired?

5 Association between variables

Until now we have discussed how to test if there is a statistically significant difference between two groups, multiple groups or paired samples within a group. Statistical analyses also often revolve around finding association between variables. For example, in the iris dataset, we could wonder whether the length of the petals is associated to the width of the petals, and whether knowledge about one of those two measurements is predictive for the other.

In this section, we will discuss association between discrete variables, correlation, and linear models to identify and examine the association between variables.

5.1 Association between discrete variables

If you want to know the relationship between two categorical variables (also known as discrete data), you can use the Chi-square test or the Fisher’s exact test. Categorical variables, as the name implies, are variables that indicate membership of a category (gender: M/F, Iris species: virginica / versicolor / setosa). Categorical variables are called factors in R (see yesterday’s info on data types).

Imagine that you have some data on the food preferences (vegetarian vs. non-vegetarian) and favourite sport to play (soccer vs. volleyball) of a group of 50 people. You can think of this data as a 2x2 table (also called a contingency table) with counts for each of the preferences for both variables:

##             food
## sport        non-veg veg
##   soccer          30   7
##   volleyball       8   5

You might want to figure out whether soccer players are more likely to be vegetarian, and as you can appreciate in the table, that means you compare the proportion of vegetarian soccer players to the proportion of vegetarion volleyball players.

5.1.1 Chi-square test

The Chi-square test allows you to test whether there is a relationship between two variables, but it does not tell you about the direction and size of the relationship.
Null Hypothesis: There is no relationship between the two variables
When you reject the null hypothesis with a Chi-Square, you are saying that there is a relationship between the two variables.

The Chi-square test is used to find if two nominal or ordinal (both categorical) variables are independent. R has a built-in dataset called survey with student smoking and exercise habits. Load in the dataset:

library(MASS) 
head(survey)

##      Sex Wr.Hnd NW.Hnd W.Hnd    Fold Pulse    Clap Exer Smoke Height
## 1 Female   18.5   18.0 Right  R on L    92    Left Some Never 173.00
## 2   Male   19.5   20.5  Left  R on L   104    Left None Regul 177.80
## 3   Male   18.0   13.3 Right  L on R    87 Neither None Occas     NA
## 4   Male   18.8   18.9 Right  R on L    NA Neither None Never 160.00
## 5   Male   20.0   20.0 Right Neither    35   Right Some Never 165.00
## 6 Female   18.0   17.7 Right  L on R    64   Right Some Never 172.72
##        M.I    Age
## 1   Metric 18.250
## 2 Imperial 17.583
## 3     <NA> 16.917
## 4   Metric 20.333
## 5   Metric 23.667
## 6 Imperial 21.000

Q16

Construct a contingency table for smoking and exercise by filling in the ???: table(survey$???, survey$???)
Save the result as tblSmoke
How many individuals are in the contingency table?
Run a Chi-square test on this sample using chisq.test
What is the null hypothesis and do you reject it?

Q17

Use ?survey to see the description of the columns
Run a chi-square test to test if which hand is on top after clapping and exercise levels are independent
What is the null hypothesis and do you reject it (at 0.05 significance level)?
Do you think this is correct?

5.1.2 Fisher’s exact test

When you ran the Chi-square test, you may have seen this warning:
Warning message: In chisq.test(tblClap) : Chi-squared approximation may be incorrect

The Chi-square test gives an estimation of the p-value, so the approximation could be incorrect. The Fisher’s exact test gives (as the name implies) the exact value of the p-value. There is some discussion about whether it is better to use a Chi-square or Fisher’s exact test. Traditionally, it was not possible to calculate the exact value (by hand), so the latter has only been used regularly since statistics are done on computers. When the numbers in the contingency table are small (i.e. one of the cells has a value of < 5), it is often advised to use Fisher’s exact test.

Q18: Redo the Chi-square exercises but instead use the Fisher Exact test. What are the differences? Hint: see ?fisher.test for information how to use the function.

5.2 Correlation

A correlation is a statistic that indicates how two variables are related to one another. In the iris dataset, we hyopthesize that the wider a plants petals are, the longer they get. We would like to test whether we have evidence for this hypothesis. Often plotting the two variables will already tell you if they correlate.

5.2.1 Parametric: Pearson correlation

Make a scatterplot between Sepal length and Sepal width:

plot(iris$Sepal.Length, iris$Sepal.Width)

This plot indicates there might be a slight negative correlation between the Sepal length and Sepal width.

Q19:

What is the correlation coefficient belonging to the plot above? Use the function cor()
What is the possible range of correlation coefficients (i.e. what is the maximum and minimum possible correlation between any pair of variables)?

Q20:

What correlation coefficient do you get if you correlate Sepal length of versicolor with the Sepal width of versicolor?
Make a scatterplot using plot()
Why is the correlation coefficient so different (size and direction)?
Use the function cor.test() to test whether the correlation is significantly different from 0
Is the correlation significant?
What is the associated p-value?

Q21:

Instead of calculating the correlation between two columns at a time, it is also possible to give a whole dataframe input for the correlation function
Select the columns with numeric values from the iris dataframe by using iris[1:4]
Get the correlation coefficients for all numeric columns in the iris dataframe and call it cormat
The output of this command is called a ‘correlation matrix’. Why are some values in this matrix 1?

Q22:

Make a scatterplot for the 4 characteristics using plot() again (the result should look like below)
Add colours to each of the species with with col=iris$Species
Compare the correlation matrix and the plots
Do you think the negative correlation coefficients that you see in the matrix are a good reflection of the relationships between the variables (e.g. between Petal length and Sepal width)?

5.2.2 Non-parametric: Spearman rank correlation

The non-parametric version of the Pearson correlation is called the Spearman rank correlation. Like the Wilcoxon test, this function first ranks all observations and then calculates the correlations across the ranks, rather than across the real values.
The default of cor() is to use the parametric Pearson correlation, but you can change the method of correlation by adding method = 'spearman' to your command.

Q23: Answer Q20 again, but this time using the Spearman rank correlation

5.3 Linear models

Now that we have seen that the petal width a petal length is strongly correlated we can use regression analysis to predict values that we have no measured. Regression finds a relationship between the predictor variable (measured in experiments) and response variable (derived from predictor variable). If there is a linear relationship between these two variables they can be represented by a line. The regression formula is

y = ax + b

where y is the response variable, x is the predictor variable, and a and b are the coefficients. We want to predict petal length using the petal width.

A linear model is fitted using the function lm() as follows:

modelname <- lm(formula, data=mydata)

In the case of the iris dataset, this would for example look like this (where m1 stands for model 1):

m1 <- lm(Petal.Length ~ Petal.Width, data=iris)

Q24: Using the results from m1, write down the mathematical equation to calculate petal length given the petal width (fill in the values of a and b)

Q25: Examine the model output by using summary(m1). This command will show you the formula, the quartiles of the residual values, the coefficient estimates, standard errors and p-values, the residual standard error, the explained variance and the F-statistic. How much of the variation in petal length is explained by the variation in petal width?

Q26:

Given this strong association, you can predict the petal length from the petal width using predict.lm()
Imagine you have measured another iris flower, and its petal width is 4 cm
Make a new data frame with one measurement: measured <- data.frame(Petal.Width=4)
Predict what the petal length would be for this measurement using predict.lm(???, ???)
The first argument is the model name
The second argument is the name of the data frame with the measurement
What is the predicted petal length?
Save the predicted length as predicted

We want to plot the observed points and regression line to have a look at how well it fits:

# Plot the chart
m1 <- lm(Petal.Length ~ Petal.Width, data=iris)
plot(iris$Petal.Width,iris$Petal.Length,
  abline(m1))

Q27:

Run the same plot again, but this time add specific ranges for the x- and y-axes by adding:
- xlim=c(0,4)
- ylim=c(0,11)
Add the extra (predicted) point to the plot using points(4,predicted, col="red")
Hint: this last line has to be outside of the brackets of plot

Q28: Try the same thing but without xlim and ylim in the plot function. What happens?

6 Dimensionality reduction

Most of the statistics we have dealt with so far are suitable for both small and large datasets. However, when dealing with larger datasets, there are specific methods used to cluster and summarise the data. Typically we call data big data when a dataset has many samples and many observations per sample, like for example when we have information on the expression of thousands of genes per individual. Data with information on many variables is sometimes called high dimensional data. In the last part of today’s tutorial we will return to the gene expression dataset to calculate principal components and run some clustering.

6.1 Normalization and data visualization

In order to perform certain downstream analyses, such as principal component analysis (PCA), or clustering, we need to properly normalise the expression counts first.

Let’s start by saving some information regarding the gene expression dataset as colData:

# Use the information in the column names to generate a data.frame comprising the sample information.
library(stringr)
names = colnames(exp)
colData <- data.frame(row.names = names,
                      type = str_split_fixed(names, "_", 3)[,1],
                      sample = str_split_fixed(names, "_", 3)[,2],
                      stimulation = str_split_fixed(names, "_", 3)[,3])

# Add library size, which is the total ammount of gene reads per sample
colData$libSize <- colSums(exp)

# Make sure that the colnames of your count table are matching with the row names of your colData
all(rownames(colData) %in% colnames(exp))

## [1] TRUE

Q29: What information can you find in the newly made column libSize? What’s the minimum and maximum value of libSize?

Install and load the following packages: DESeq2, ggplot2, ggsci, ggrepel, RColorBrewer, pheatmap

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GenomeInfoDbData")

library(GenomeInfoDbData)

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")

#or try
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")

There are multiple ways in which we can approximate our counts to a normal-like distribution. The DESeq2 package provides two built-in functions to normalise your data:

Regularized logarithm, or rlog
Variance stabilizing transformations, or VST

These two transformations can accurately corect for the different conditions for which the cells have been treated. Select blind = FALSE to not include the experimental design information.

dds_stim <- DESeqDataSetFromMatrix(countData = exp,
                              colData = colData,
                              design = ~ stimulation+sample)

vst_stim <- assay(vst(dds_stim, blind=FALSE))

6.2 Principal component analysis (PCA)

PCA will help assess which samples are more closely related to each other, and how much variation is found between samples.

Follow the next steps to include the observations at t=0 unstimulated, t=180 unstimulated and t=180 stimulated in 1 PCA plot.

Q30:

Calculate PCs using prcomp(t(vst_stim))
Save these PCs as pcData
The importance of each component is displayed in summary(pcData)
What proportion of Variance does the first PC explain?
Save the importance of components as pcVar

Q31:

Extract the importance of PC1 using pcVar$importance[2,1]
Save this number as varPC1
Do the same for PC2
Make a data frame for plotting: pcPlotData <- data.frame(pcData$x[,1:2], colData[rownames(pcData$x),])
Examine pcPlotData using head()
What information do you have in this data frame?

Make the plot using this code:

pcaPlot_stim <- ggplot(pcPlotData, aes(x=PC1 , y=PC2 , color=stimulation))+
                  geom_jitter(alpha=0.6)+
                  xlab(paste0("PC1 explained variance = ", varPC1*100, "%"))+
                  ylab(paste0("PC2 explained variance = ", varPC2*100, "%"))+
                  scale_color_aaas()+
                  theme_bw()+
                  theme(legend.position = "bottom")+
                  guides(col = guide_legend(ncol = 8))

# display the pca plot in "Plots"
pcaPlot_stim

Use the following code to save your plot:

# to save your file to pdf format in your working directory do
pdf("DIRECTORY WHERE YOU SAVED THE DATA/PCA_GM12878.pdf", height = 7.5, width=7.5)
pcaPlot_stim
dev.off()

Clearly each of the duplicated of the three conditions cluster together, and there is quite some difference between t0unstim and t180unstim. However, the difference between the t180 and t180unstim samples is still much greater. This is the stimulation effect, which is what we are interested in.

Q32 (bonus): Make the PC plot, this time color the dots by the sample instead of the stimulation. Can you also observe a pattern here?

7 Unsupervised clustering

Alternatively, it’s possible to assess the similarity between samples using an unsupervised clustering approach using the whole trancriptome to generate a distance matrix.

Q33: Follow the next steps to make clusters and show a heatmap:

# Again we need to transpose our matrix to calculate the distance between each of the samples.
sampleDists <- dist(t(vst_stim))
sampleDistMatrix <- as.matrix(sampleDists)

# By using brewer.pal() we can generate a palette of colors, for more colors check (http://colorbrewer2.org/)
colors <- colorRampPalette(brewer.pal(9, "GnBu"))(255)

pheatmap(sampleDistMatrix, main = "Stimulation",
         show_colnames = FALSE,
         annotation = colData[,c("stimulation","sample")],
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)

What is the biggest difference in these experiments: the sample (A/B), the time (t=0 / t=180) or the stimulation (stimulated / unstimulated)? Hint: Look at the dendogram on the left of the graph.

R course day 2: Statistics

Niek de Klein, Annique Claringbould