In this practical session of RNASeq analysis, we will go through the most basic steps to perform a differential expression analysis. Due to time constraints we will skip the alignment and counting of reads per gene and start with a publically available expression matrix.
In this practical session, the differential expression analysis will be performed using the DESeq2 package from Love, et al. Other packages in R could also provide similar or comparable results such as: edgeR or limma.
There are several repositories from which we can browse and retrieve RNASeq and microarray based expression data. The most commonly used ones are:
Both GEO and ArrayExpress and routinely synchronized with each other, and are the repositories were processed data can also be stored. On the other hand, ENA covers raw sequencing data, sequence assembly information and functional annotation, which could potentially be processed to generate gene counts.
For this practical session we will make use of a dataset comprised of 74 transcriptome profiles of CD4+ T cells from celiac disease patients and healthy controls, which have also been subjected to two different immune stimulation with CD3 and PMA. You can find the dataset here: GSE69549. Download the supplementary data (GSE69549_RAW.tar).
Q1 Read the abstract from the paper this data is from. What is the most upregulated gene in CD4+ T cells from CD patients compared to controls?
Q2 What is PMA stimulation?
Q3 Is CD3 stimulation more or less specific for T-cells, and why would it be important to have a T-cell specific stimulation for this study?
## Install packages from CRAN
install.packages("ggsci")
install.packages("ggplot2")
install.packages("gridExtra")
install.packages("ggrepel")
install.packages("RColorBrewer")
install.packages("pheatmap")
install.packages("VennDiagram")
install.packages("gridExtra")
install.packages("BiocManager")
BiocManager::install("DESeq2")
BiocManager::install("biomaRt")
### Or with R version < 3.5
# source("https://bioconductor.org/biocLite.R")
# biocLite("DESeq2")
# biocLite("biomaRt")Let’s create a new working directory in order to save our data and results
# Create the new directory
dir.create("~/RNASeqPracticalSesion/")
# Change the working directoy
dir.create("~/RNASeqPracticalSesion/Data/")Once you have created this new folder RNASeqPracticalSesion and the Data folder within, copy the ´GSE69549_RAW.tar´ file to ~/RNASeqPracticalSesion/Data/ and unpack it and unzip the resulting .txt.gz files.
We have downloaded 74 trascriptome profiles from GEO, we need to load and merge these individual transcriptomes into one single table, which should have genes by rows and samples by columns.
setwd("~/RNASeqPracticalSesion/")
# Get all file names with a certain pateern from a directary
fileList <- list.files("./Data/", pattern = ".txt", full.names = TRUE)
# For each of the components of the fileList, read them into R, and arrange them in a list.
countList <- lapply(fileList, read.delim, sep="\t", header=TRUE, row.names=1)
# Take each the loaded matrix and merge the by it's column (cbind)
countTable <- do.call(cbind, countList)
# Check for the dimentions of the count table. ROWS x COLUMS
dim(countTable)## [1] 23228 74
## Coeliac_1_CD3 Coeliac_1_PMA Coeliac_1_UNS Coeliac_2_CD3
## 1/2-SBSRNA4 18 27 26 23
## A1BG 102 100 140 86
## A1BG-AS1 16 24 31 29
## A1CF 1 5 2 0
## A2LD1 47 33 54 55
## Coeliac_2_PMA
## 1/2-SBSRNA4 31
## A1BG 65
## A1BG-AS1 16
## A1CF 0
## A2LD1 58
We now have a matrix with 23,228 genes (rows) and 74 transcriptomes (columns).
Since differential expression analysis is the comparison of gene expression between one or more conditions. Therefore, we need to generate a data.frame object that contains the experiment design, and can match this information to our countTable.
## [1] "Coeliac_1_CD3" "Coeliac_1_PMA" "Coeliac_1_UNS" "Coeliac_2_CD3"
## [5] "Coeliac_2_PMA" "Coeliac_3_CD3"
# Use the information in the column names to generate a data.frame comprising the sample information.
colData <- data.frame(row.names = colnames(countTable),
type= unlist(lapply(colnames(countTable),
function(x){unlist(strsplit(x, split = "_"))[1]})),
sample= unlist(lapply(colnames(countTable),
function(x){unlist(strsplit(x, split = "_"))[2]})),
stimulation= unlist(lapply(colnames(countTable),
function(x){unlist(strsplit(x, split = "_"))[3]}))
)
# Add library size, which is the total ammount of gene reads per sample
colData$libSize <- colSums(countTable)
#First 10 rowns of the colData object
head(colData)## type sample stimulation libSize
## Coeliac_1_CD3 Coeliac 1 CD3 21956844
## Coeliac_1_PMA Coeliac 1 PMA 34729927
## Coeliac_1_UNS Coeliac 1 UNS 21982148
## Coeliac_2_CD3 Coeliac 2 CD3 31240248
## Coeliac_2_PMA Coeliac 2 PMA 41313390
## Coeliac_3_CD3 Coeliac 3 CD3 56773967
# Make sure that the colnames of your count table are matching with the row names of your colData
all(rownames(colData) %in% colnames(countTable))## [1] TRUE
# Save your countTable and colData
write.csv(countTable, file="countTable.csv")
write.csv(colData, file="colData.csv")Lets separate the cohort into 3 different set of samples depending on the stimulation status, hence we will have 3 count tables, and 4 colData objects: 1. Unstimulated samples 2. Stimulated with CD3 3. Stimulated with PMA
Using the function which, we are can get back the index and colnames from count table using the colData object.
# table generates a table of frequencies given a vector, we can check how many samples we should have per stimulation
table(colData$stimulation)##
## CD3 PMA UNS
## 26 25 23
# It's possible to include two vectors and count frequencies given two conditions.
table(colData$stimulation,colData$type)##
## Coeliac Control
## CD3 15 11
## PMA 14 11
## UNS 13 10
unstim_sampleNames <- rownames(colData)[which(colData$stimulation == "UNS")]
unstim_countTable <- countTable[,unstim_sampleNames]
unstim_colData <- colData[unstim_sampleNames,] #colData has sample names by row
cd3_sampleNames <- rownames(colData)[which(colData$stimulation == "CD3")]
cd3_countTable <- countTable[,cd3_sampleNames]
cd3_colData <- colData[cd3_sampleNames,] #colData has sample names by row
pma_sampleNames <- rownames(colData)[which(colData$stimulation == "PMA")]
pma_countTable <- countTable[,pma_sampleNames]
pma_colData <- colData[pma_sampleNames,] #colData has sample names by rowThe definition of a differentially expressed gene usually depends on whether or not a particular gene is significantly (p-value = 0.05, after multiple testing correction) over or under expressed when compared to another class (can be different treatment, control, tissue etc,). Nevertheless, we can be bit more strict in this definition, we could require a lower p-value (such as 0.01, or 0.001), and require a certain difference of expression between the two conditions. This difference between two conditions can be interpreted as the log2 fold change ( log2(mean expression condition 1 / mean expression condition 2) ).
Q4 The next step will be selecting differentially expressed genes between unstimulated vs control, CD3+ stimulation vs control, and pma stimulation versus . Before running this, which of these do you expect to have the least amount of differentially epressed genes and why?
# Load
library(DESeq2)
library(ggplot2)
library(ggsci)
library(ggrepel)
# To properly compare control versus celiac, we need to define that in ourt colData objects,
# therefore we need to set the colData$type as a factor, and the firs level should be control
unstim_colData$type <- factor(unstim_colData$type, levels= c("Control", "Coeliac"))
cd3_colData$type <- factor(cd3_colData$type, levels= c("Control", "Coeliac"))
pma_colData$type <- factor(pma_colData$type, levels= c("Control", "Coeliac"))
# We now generate the new DESeq objects and run the differential expression analysis.
dds_unstim <- DESeqDataSetFromMatrix(countData = unstim_countTable,
colData = unstim_colData,
design = ~ type)
dds_unstim <- DESeq(dds_unstim)
res_unstim <- results(dds_unstim)
head(res_unstim)## log2 fold change (MLE): type Coeliac vs Control
## Wald test p-value: type Coeliac vs Control
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## 1/2-SBSRNA4 18.9700647640932 -0.0489463944764025 0.247778142525225
## A1BG 184.208818373133 1.26566732903999 0.347182008638059
## A1BG-AS1 32.8469866775803 -0.325575498896567 0.336817346566393
## A1CF 3.44637518824066 -1.30027843783477 0.530066845376109
## A2LD1 86.8325731964753 -0.404173756534075 0.37043424953508
## A2M 69.5629916801679 -0.465870048575705 0.458484625861709
## stat pvalue padj
## <numeric> <numeric> <numeric>
## 1/2-SBSRNA4 -0.197541211575672 0.843404034580425 0.962101036527923
## A1BG 3.64554411677324 0.000266826633193313 0.0208518660825498
## A1BG-AS1 -0.966623311464127 0.333732376695595 0.731518903960386
## A1CF -2.45304615668267 0.0141652161342231 0.18545852445342
## A2LD1 -1.09108095982307 0.275237260106493 0.689395792871344
## A2M -1.01610833231347 0.309577801644254 0.716873331844766
# Given the conditions we declared for differential expression analysis we can subset our list of
de_unstim <- res_unstim [which(res_unstim$padj <= 0.05),]
nrow(de_unstim)## [1] 401
dds_cd3 <- DESeqDataSetFromMatrix(countData = cd3_countTable,
colData = cd3_colData,
design = ~ type)
dds_cd3 <- DESeq(dds_cd3)
res_cd3 <- results(dds_cd3)
de_cd3 <- res_unstim [which(res_cd3$padj <= 0.05),]
dds_pma <- DESeqDataSetFromMatrix(countData = pma_countTable,
colData = pma_colData,
design = ~ type)
dds_pma <- DESeq(dds_pma)
res_pma <- results(dds_pma)
de_pma <- res_unstim [which(res_pma$padj <= 0.05),]
nrow(de_pma)## [1] 5696
Q5 In the above code, DESeq normalized the counts using Median of ratios method. Unlike RPK/RPKM, this does not correct for read length. Why do you think this is not necesarry?
Q6 Compare the overlap in differentially epressed genes between the groups, e.g. using a venn diagram. Was your answer for Q4 correct? Are the results as you expected?
Q7 How many genes are found differentially expressed in all conditions? Do you think these genes would be interesting to look into or not? Why (not)?
The volcano Plot is a great way to visualize your differentially expressed genes.
By definition a volcano plot has in it’s X-axis the log2(fold change) per gene, whilst on the Y-axis the -log10(p Value) per gene. We already have all this information in our “res” objects from the DESeq2.
library(gridExtra)
pData <- as.data.frame(res_unstim[which(!is.na(res_unstim$padj)),])
pData$top10label <- NA
pData$top10label[order(pData$padj)[1:10]] <- rownames(pData)[order(pData$padj)[1:10]]
unstim_Volcano <- ggplot(pData, aes(x=log2FoldChange, y= -log10(padj)))+
geom_point(aes(color= padj <= 0.05))+
geom_hline(yintercept = 0, lwd=1, alpha= 0.6)+
geom_vline(xintercept = 0, lwd=1, alpha= 0.6)+
scale_color_d3()+
ggtitle("Unstimulated")+
geom_text_repel(aes(label=top10label))+ ##add the lables in the top 10
theme_bw()+
theme(legend.position = "bottom")Q8 Do the same for the CD3+ and PMA stimulated data. Do you find the same results as was reported in the manuscript (see Q1)?
Q9 In the abstract the authors mention BACH2. What is the p-value and log-fold change of BACH2 in the CD3+ differential group? Is it in the top most differentially expressed genes? If not, do you expect it to be?
Heatmaps are color coded, graphical representations of a matrix. The rows and columns of this matrix can be arranged in a certain way to showcase the similarities between columns and rows. This arrangement of columns and rows can be using an unsupervised approach such as hierarchical clustering.
We firstly need normalize the expression levels to compare the expression levels between samples.
library(pheatmap)
vst_unstim <- assay(vst(dds_unstim))
deGenes <- vst_unstim[rownames(de_unstim),]
pheatmap(deGenes, scale = "row",show_rownames = FALSE, main = "Differential expressed genes unstimulated")Q10 Do the same for CD3+ and PMA stimulation. What do you think are the advantages of a heatmap over a volcano plot?
Finally, write the expression data for CD3 to a file, as we will use it for the pathway analysis.
# Lets create a new directory to save our results
dir.create("./Results/")
# The write.csv function will generate an Excel "friendly" file.
unstim_fileName <- "./Results/diffExpGenes_unstim_all.csv"
write.csv(res_unstim, unstim_fileName)
cd3_fileName <- "./Results/diffExpGenes_cd3_all.csv"
write.csv(res_cd3, cd3_fileName)
pma_fileName <- "./Results/diffExpGenes_pma_all.csv"
write.csv(res_pma, pma_fileName)