Today, we will continue the R course with some basic statistics. We will use two datasets to go through a number of exercises and questions today.
First we will use a dataset of flower petal measurements called iris, this dataset is inherent to R and meant for practice. You do not need to load or import the data, you can get started immediately. You can find many examples of people using this dataset online, so it is useful if you have an idea of what it looks like.
The next dataset is gene expression data from an actual experiment on a B cell line derived from publicly available sample called ‘GM12878’.
We will explore these datasets and apply some of the statistics you have learned about previously.
The following commands are useful for exploring data:
head() #prints first 10 rows of a dataset
dim() #gives the dimensions (rows and columns) of a dataset
colnames() #prints the column names of a dataset
rownames() #prints the row names of a dataset
View() #opens the dataset in the top window of RStudio
str() #prints an overview of the 'structure' of the data, e.g. the column names, their data types
summary() #gives a summary of each column, specific output depends on data types in each column
table() #gives a count table about a particular variableQ1: As iris is a default dataset that is included with R, you can start exploring the data without loading it first:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Q2: What species of iris are tested? And how many observations are there in each species? Access the relevant column using iris$Species and use one of the commands mentioned in the overview above.
We would like to know if the petal length of Iris virginica is longer than the petal length of Iris versicolor. When asking these questions usually it refers to the averages of the groups. So first, look at the mean of both groups:
## [1] 5.552
## [1] 4.26
Q3: What does this part of the code above do: iris[iris$Species == 'virginica',]?
Save the differences between the two means in variable diff:
# Save the difference in the *variable* diff
diff <- mean(iris[iris$Species == 'virginica',]$Petal.Length) - mean(iris[iris$Species == 'versicolor',]$Petal.Length)Q4: What is the value of diff? Why would we need P-values and confidence intervals if we can already see that there is a difference in petal length?
We have already used iris[iris$Species == 'virginica',]$Petal.length a few times, but these long statements are not very readable. Save the petal lengths for virginica in a variable called virg_pl:
Do the same for versicolor (vers_pl) and setosa (set_pl).
You may have covered how to manually calculate the P-value from a t-test using the mean and standard deviation in your program, but here we show how to simply do it in R using the function t.test().
A t-test is a parametric test, and therefore is assumes that the length of the petals is approximatly normally distributed for the population. To assess this assumption, it would be good to plot the data and have a look at it. We will go into that tomorrow.
Q5: Run a t-test to compare the petal length between virginica and versicolor:
What is our null hypothesis, and given our t-test result, can we reject it?
You can save the result of the t-test by assigning it to a variable using variable <- t.test(), like we did a number of times in yesterdays practical.
Q6:
t_test_virg_sett_test_virg_set$p.value to find the exact p-value of this testt_test_virg_set$estimate to view the means and t_test_virg_set$conf.int to view the 95% confidence interval valuesWe have seen how to perform a t-test to compare if the mean of two groups is the same. However, we assumed that our data was normally distributed. Large outliers can heavily influence the sample mean and standard deviation, which would influence the t-test result. But what if during recording of the petal lengths the wrong values had been added?
Let’s change three values to simulate such outliers:
Now run the t-test again:
##
## Welch Two Sample t-test
##
## data: virg_pl and vers_pl
## t = 1.6666, df = 53.03, p-value = 0.1015
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1330538 1.4410538
## sample estimates:
## mean of x mean of y
## 5.552 4.898
The t-test is no longer significant.
Q7:
wilcox.test, type ?wilcox.test to see examples of how to use it)?wilcox.test for examples how to use it)The Wilcoxon test merges the data together, ranks each point from lowest to highest values, separates the ranks back to the two groups, and using the sum or average rank calculates the test-statistics.
Q8:
As you can see in your plot, there is a big gap between the outlier points that we just introduced and the normal points.
Q9:
# par() is a function that allows multiple plots to be combined in one figure
# 1 = the number of rows, 2 = the number of columns
par(mfrow=c(1,2))
xrank <- rank(c(virg_pl,vers_pl))[seq(along=virg_pl)]
yrank <- rank(c(virg_pl,vers_pl))[-seq(along=vers_pl)]
# plot the previous plot for comparison
stripchart(list(virg_pl, vers_pl),
vertical=TRUE,
ylab="Observations",
pch=21,
bg=1)
stripchart(list(xrank,yrank),
vertical=TRUE,
ylab="Ranks",
pch=21,bg=1,
cex=1.25)Sometimes, we want to know the difference between not just 2 groups, but between more than 2 groups. For example, in the iris data, there are 3 species of flower that we can investigate. Of course we can run a t-test or Wilcoxon test for each of the combinations, but we can also use an analysis of variance (ANOVA) or the non-parametric Krukal-Wallis test.
If you want to compare the difference of mean values among multiple groups, you can use the ANOVA test. We want to examine whether the mean value of Sepal.Length differs among the different species of iris. We use the function aov() and we assign the result to model1.
The tilde (~) must be read as the phrase ‘is modeled as a function of’. In this example, we would say: The length of iris sepals is modeled as a function of the iris species.
In general, the basic notation for a linear model or function notation in R is:
Q10:
summary() to have a look at model1Pr(>F)DfSpecies?The same test without the assumption of normality (i.e. non-parametric test) is called the Kruskal-Wallis test.
Q11:
kruskal.test() to calculate the differences between the length of the sepals across the three speciesmodel2kruskal.test() to calculate the differences between the width of the sepals across the three speciesmodel3model1 and model2?model3?In other instances, you might be interested to understand the differences between samples that are linked (or ‘paired’) to one another. For example, if you have an experiment with two measurements before and after a treatment from the same sample. Because you are looking at data from the same individual, the measurements before and after will be ‘paired’.
We will use gene expression data for this part of the tutorial. The data describes an experiment where cells from a B cell line (from one individual, GM12878) are stimulated with the cytokine IL21. The same cells are sequenced before treatment, after 3 hours with cytokine stimulation and after 3 hours without cytokine stimulation.
Q12:
#load the necessary packages
library(stringr)
# read in the table containing the read counts of all samples that we will use for the analysis
exp <- read.csv("DIRECTORY WHERE YOU SAVED THE DATA/BigData_backup_GM12878-IL21.csv", header=TRUE)
# to continue with the table, we need to create rownames of the column named "probes" in which the gene_IDs are located
rownames(exp) <- exp$X
exp <- exp[,-1]
head(exp)
dim(exp)Q13 Using this dataset, we can look for specific and a-specific changes in gene expression as a consequence of stimulation by comparing the following data points:
Which comparison will be most reliable and informative if we want to know the effects of IL21 stimulation?
Q14 Run a paired t-test to see if there is a significant difference between the stimulated and unstimulated gene expression after 3 hours in sample A:
Now test the difference between unstimulated at the start and unstimulated after 3 hours. Are the paired t-tests significant? What is the biggest difference?
The non-parametric version of the paired t-test is called the Wilcoxon signed rank test.
Q15:
wilcox.test() on GM12878_A_t0unstim and GM12878_A_t180unstimUntil now we have discussed how to test if there is a statistically significant difference between two groups, multiple groups or paired samples within a group. Statistical analyses also often revolve around finding association between variables. For example, in the iris dataset, we could wonder whether the length of the petals is associated to the width of the petals, and whether knowledge about one of those two measurements is predictive for the other.
In this section, we will discuss association between discrete variables, correlation, and linear models to identify and examine the association between variables.
If you want to know the relationship between two categorical variables (also known as discrete data), you can use the Chi-square test or the Fisher’s exact test. Categorical variables, as the name implies, are variables that indicate membership of a category (gender: M/F, Iris species: virginica / versicolor / setosa). Categorical variables are called factors in R (see yesterday’s info on data types).
Imagine that you have some data on the food preferences (vegetarian vs. non-vegetarian) and favourite sport to play (soccer vs. volleyball) of a group of 50 people. You can think of this data as a 2x2 table (also called a contingency table) with counts for each of the preferences for both variables:
## food
## sport non-veg veg
## soccer 30 7
## volleyball 8 5
You might want to figure out whether soccer players are more likely to be vegetarian, and as you can appreciate in the table, that means you compare the proportion of vegetarian soccer players to the proportion of vegetarion volleyball players.
The Chi-square test allows you to test whether there is a relationship between two variables, but it does not tell you about the direction and size of the relationship.
Null Hypothesis: There is no relationship between the two variables
When you reject the null hypothesis with a Chi-Square, you are saying that there is a relationship between the two variables.
The Chi-square test is used to find if two nominal or ordinal (both categorical) variables are independent. R has a built-in dataset called survey with student smoking and exercise habits. Load in the dataset:
## Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height
## 1 Female 18.5 18.0 Right R on L 92 Left Some Never 173.00
## 2 Male 19.5 20.5 Left R on L 104 Left None Regul 177.80
## 3 Male 18.0 13.3 Right L on R 87 Neither None Occas NA
## 4 Male 18.8 18.9 Right R on L NA Neither None Never 160.00
## 5 Male 20.0 20.0 Right Neither 35 Right Some Never 165.00
## 6 Female 18.0 17.7 Right L on R 64 Right Some Never 172.72
## M.I Age
## 1 Metric 18.250
## 2 Imperial 17.583
## 3 <NA> 16.917
## 4 Metric 20.333
## 5 Metric 23.667
## 6 Imperial 21.000
Q16
table(survey$???, survey$???)tblSmokechisq.testQ17
?survey to see the description of the columnsWhen you ran the Chi-square test, you may have seen this warning:
Warning message: In chisq.test(tblClap) : Chi-squared approximation may be incorrect
The Chi-square test gives an estimation of the p-value, so the approximation could be incorrect. The Fisher’s exact test gives (as the name implies) the exact value of the p-value. There is some discussion about whether it is better to use a Chi-square or Fisher’s exact test. Traditionally, it was not possible to calculate the exact value (by hand), so the latter has only been used regularly since statistics are done on computers. When the numbers in the contingency table are small (i.e. one of the cells has a value of < 5), it is often advised to use Fisher’s exact test.
Q18: Redo the Chi-square exercises but instead use the Fisher Exact test. What are the differences? Hint: see ?fisher.test for information how to use the function.
A correlation is a statistic that indicates how two variables are related to one another. In the iris dataset, we hyopthesize that the wider a plants petals are, the longer they get. We would like to test whether we have evidence for this hypothesis. Often plotting the two variables will already tell you if they correlate.
Make a scatterplot between Sepal length and Sepal width:
This plot indicates there might be a slight negative correlation between the Sepal length and Sepal width.
Q19:
cor()Q20:
plot()cor.test() to test whether the correlation is significantly different from 0Q21:
iris[1:4]cormatQ22:
plot() again (the result should look like below)col=iris$SpeciesThe non-parametric version of the Pearson correlation is called the Spearman rank correlation. Like the Wilcoxon test, this function first ranks all observations and then calculates the correlations across the ranks, rather than across the real values.
The default of cor() is to use the parametric Pearson correlation, but you can change the method of correlation by adding method = 'spearman' to your command.
Q23: Answer Q20 again, but this time using the Spearman rank correlation
Now that we have seen that the petal width a petal length is strongly correlated we can use regression analysis to predict values that we have no measured. Regression finds a relationship between the predictor variable (measured in experiments) and response variable (derived from predictor variable). If there is a linear relationship between these two variables they can be represented by a line. The regression formula is
y = ax + b
where y is the response variable, x is the predictor variable, and a and b are the coefficients. We want to predict petal length using the petal width.
A linear model is fitted using the function lm() as follows:
In the case of the iris dataset, this would for example look like this (where m1 stands for model 1):
Q24: Using the results from m1, write down the mathematical equation to calculate petal length given the petal width (fill in the values of a and b)
Q25: Examine the model output by using summary(m1). This command will show you the formula, the quartiles of the residual values, the coefficient estimates, standard errors and p-values, the residual standard error, the explained variance and the F-statistic. How much of the variation in petal length is explained by the variation in petal width?
Q26:
predict.lm()measured <- data.frame(Petal.Width=4)predict.lm(???, ???)predictedWe want to plot the observed points and regression line to have a look at how well it fits:
# Plot the chart
m1 <- lm(Petal.Length ~ Petal.Width, data=iris)
plot(iris$Petal.Width,iris$Petal.Length,
abline(m1))
Q27:
xlim=c(0,4)ylim=c(0,11)points(4,predicted, col="red")plotQ28: Try the same thing but without xlim and ylim in the plot function. What happens?
Most of the statistics we have dealt with so far are suitable for both small and large datasets. However, when dealing with larger datasets, there are specific methods used to cluster and summarise the data. Typically we call data big data when a dataset has many samples and many observations per sample, like for example when we have information on the expression of thousands of genes per individual. Data with information on many variables is sometimes called high dimensional data. In the last part of today’s tutorial we will return to the gene expression dataset to calculate principal components and run some clustering.
In order to perform certain downstream analyses, such as principal component analysis (PCA), or clustering, we need to properly normalise the expression counts first.
Let’s start by saving some information regarding the gene expression dataset as colData:
# Use the information in the column names to generate a data.frame comprising the sample information.
library(stringr)
names = colnames(exp)
colData <- data.frame(row.names = names,
type = str_split_fixed(names, "_", 3)[,1],
sample = str_split_fixed(names, "_", 3)[,2],
stimulation = str_split_fixed(names, "_", 3)[,3])
# Add library size, which is the total ammount of gene reads per sample
colData$libSize <- colSums(exp)
# Make sure that the colnames of your count table are matching with the row names of your colData
all(rownames(colData) %in% colnames(exp))## [1] TRUE
Q29: What information can you find in the newly made column libSize? What’s the minimum and maximum value of libSize?
Install and load the following packages: DESeq2, ggplot2, ggsci, ggrepel, RColorBrewer, pheatmap
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GenomeInfoDbData")
library(GenomeInfoDbData)
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
#or try
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")There are multiple ways in which we can approximate our counts to a normal-like distribution. The DESeq2 package provides two built-in functions to normalise your data:
These two transformations can accurately corect for the different conditions for which the cells have been treated. Select blind = FALSE to not include the experimental design information.
dds_stim <- DESeqDataSetFromMatrix(countData = exp,
colData = colData,
design = ~ stimulation+sample)
vst_stim <- assay(vst(dds_stim, blind=FALSE))PCA will help assess which samples are more closely related to each other, and how much variation is found between samples.
Follow the next steps to include the observations at t=0 unstimulated, t=180 unstimulated and t=180 stimulated in 1 PCA plot.
Q30:
prcomp(t(vst_stim))pcDatasummary(pcData)pcVarQ31:
pcVar$importance[2,1]varPC1pcPlotData <- data.frame(pcData$x[,1:2], colData[rownames(pcData$x),])pcPlotData using head()Make the plot using this code:
pcaPlot_stim <- ggplot(pcPlotData, aes(x=PC1 , y=PC2 , color=stimulation))+
geom_jitter(alpha=0.6)+
xlab(paste0("PC1 explained variance = ", varPC1*100, "%"))+
ylab(paste0("PC2 explained variance = ", varPC2*100, "%"))+
scale_color_aaas()+
theme_bw()+
theme(legend.position = "bottom")+
guides(col = guide_legend(ncol = 8))
# display the pca plot in "Plots"
pcaPlot_stim
Use the following code to save your plot:
# to save your file to pdf format in your working directory do
pdf("DIRECTORY WHERE YOU SAVED THE DATA/PCA_GM12878.pdf", height = 7.5, width=7.5)
pcaPlot_stim
dev.off()Clearly each of the duplicated of the three conditions cluster together, and there is quite some difference between t0unstim and t180unstim. However, the difference between the t180 and t180unstim samples is still much greater. This is the stimulation effect, which is what we are interested in.
Q32 (bonus): Make the PC plot, this time color the dots by the sample instead of the stimulation. Can you also observe a pattern here?
Alternatively, it’s possible to assess the similarity between samples using an unsupervised clustering approach using the whole trancriptome to generate a distance matrix.
Q33: Follow the next steps to make clusters and show a heatmap:
# Again we need to transpose our matrix to calculate the distance between each of the samples.
sampleDists <- dist(t(vst_stim))
sampleDistMatrix <- as.matrix(sampleDists)
# By using brewer.pal() we can generate a palette of colors, for more colors check (http://colorbrewer2.org/)
colors <- colorRampPalette(brewer.pal(9, "GnBu"))(255)
pheatmap(sampleDistMatrix, main = "Stimulation",
show_colnames = FALSE,
annotation = colData[,c("stimulation","sample")],
clustering_distance_rows=sampleDists,
clustering_distance_cols=sampleDists,
col=colors)What is the biggest difference in these experiments: the sample (A/B), the time (t=0 / t=180) or the stimulation (stimulated / unstimulated)? Hint: Look at the dendogram on the left of the graph.