1 Introduction

In this practical session of RNASeq analysis, we will go through the most basic steps to perform a differential expression analysis. Due to time constraints we will skip the alignment and counting of reads per gene and start with a publically available expression matrix.

In this practical session, the differential expression analysis will be performed using the DESeq2 package from Love, et al. Other packages in R could also provide similar or comparable results such as: edgeR or limma.

1.1 Public data

There are several repositories from which we can browse and retrieve RNASeq and microarray based expression data. The most commonly used ones are:

Both GEO and ArrayExpress and routinely synchronized with each other, and are the repositories were processed data can also be stored. On the other hand, ENA covers raw sequencing data, sequence assembly information and functional annotation, which could potentially be processed to generate gene counts.

1.1.1 Case study

For this practical session we will make use of a dataset comprised of 74 transcriptome profiles of CD4+ T cells from celiac disease patients and healthy controls, which have also been subjected to two different immune stimulation with CD3 and PMA. You can find the dataset here: GSE69549. Download the supplementary data (GSE69549_RAW.tar).

Q1 Read the abstract from the paper this data is from. What is the most upregulated gene in CD4+ T cells from CD patients compared to controls?

Q2 What is PMA stimulation?

Q3 Is CD3 stimulation more or less specific for T-cells, and why would it be important to have a T-cell specific stimulation for this study?

1.1.3 Set up working directory

Let’s create a new working directory in order to save our data and results

Once you have created this new folder RNASeqPracticalSesion and the Data folder within, copy the ´GSE69549_RAW.tar´ file to ~/RNASeqPracticalSesion/Data/ and unpack it and unzip the resulting .txt.gz files.

1.1.4 Create a count table

We have downloaded 74 trascriptome profiles from GEO, we need to load and merge these individual transcriptomes into one single table, which should have genes by rows and samples by columns.

## [1] 23228    74
##             Coeliac_1_CD3 Coeliac_1_PMA Coeliac_1_UNS Coeliac_2_CD3
## 1/2-SBSRNA4            18            27            26            23
## A1BG                  102           100           140            86
## A1BG-AS1               16            24            31            29
## A1CF                    1             5             2             0
## A2LD1                  47            33            54            55
##             Coeliac_2_PMA
## 1/2-SBSRNA4            31
## A1BG                   65
## A1BG-AS1               16
## A1CF                    0
## A2LD1                  58

We now have a matrix with 23,228 genes (rows) and 74 transcriptomes (columns).

1.1.5 Column data

Since differential expression analysis is the comparison of gene expression between one or more conditions. Therefore, we need to generate a data.frame object that contains the experiment design, and can match this information to our countTable.

## [1] "Coeliac_1_CD3" "Coeliac_1_PMA" "Coeliac_1_UNS" "Coeliac_2_CD3"
## [5] "Coeliac_2_PMA" "Coeliac_3_CD3"
##                  type sample stimulation  libSize
## Coeliac_1_CD3 Coeliac      1         CD3 21956844
## Coeliac_1_PMA Coeliac      1         PMA 34729927
## Coeliac_1_UNS Coeliac      1         UNS 21982148
## Coeliac_2_CD3 Coeliac      2         CD3 31240248
## Coeliac_2_PMA Coeliac      2         PMA 41313390
## Coeliac_3_CD3 Coeliac      3         CD3 56773967
## [1] TRUE

Lets separate the cohort into 3 different set of samples depending on the stimulation status, hence we will have 3 count tables, and 4 colData objects: 1. Unstimulated samples 2. Stimulated with CD3 3. Stimulated with PMA

Using the function which, we are can get back the index and colnames from count table using the colData object.

## 
## CD3 PMA UNS 
##  26  25  23
##      
##       Coeliac Control
##   CD3      15      11
##   PMA      14      11
##   UNS      13      10

2 Differential expression analysis

2.1 Definition of differential expression

The definition of a differentially expressed gene usually depends on whether or not a particular gene is significantly (p-value = 0.05, after multiple testing correction) over or under expressed when compared to another class (can be different treatment, control, tissue etc,). Nevertheless, we can be bit more strict in this definition, we could require a lower p-value (such as 0.01, or 0.001), and require a certain difference of expression between the two conditions. This difference between two conditions can be interpreted as the log2 fold change ( log2(mean expression condition 1 / mean expression condition 2) ).

Q4 The next step will be selecting differentially expressed genes between unstimulated vs control, CD3+ stimulation vs control, and pma stimulation versus . Before running this, which of these do you expect to have the least amount of differentially epressed genes and why?

## log2 fold change (MLE): type Coeliac vs Control 
## Wald test p-value: type Coeliac vs Control 
## DataFrame with 6 rows and 6 columns
##                     baseMean      log2FoldChange             lfcSE
##                    <numeric>           <numeric>         <numeric>
## 1/2-SBSRNA4 18.9700647640932 -0.0489463944764025 0.247778142525225
## A1BG        184.208818373133    1.26566732903999 0.347182008638059
## A1BG-AS1    32.8469866775803  -0.325575498896567 0.336817346566393
## A1CF        3.44637518824066   -1.30027843783477 0.530066845376109
## A2LD1       86.8325731964753  -0.404173756534075  0.37043424953508
## A2M         69.5629916801679  -0.465870048575705 0.458484625861709
##                           stat               pvalue               padj
##                      <numeric>            <numeric>          <numeric>
## 1/2-SBSRNA4 -0.197541211575672    0.843404034580425  0.962101036527923
## A1BG          3.64554411677324 0.000266826633193313 0.0208518660825498
## A1BG-AS1    -0.966623311464127    0.333732376695595  0.731518903960386
## A1CF         -2.45304615668267   0.0141652161342231   0.18545852445342
## A2LD1        -1.09108095982307    0.275237260106493  0.689395792871344
## A2M          -1.01610833231347    0.309577801644254  0.716873331844766
## [1] 401
## [1] 5696

Q5 In the above code, DESeq normalized the counts using Median of ratios method. Unlike RPK/RPKM, this does not correct for read length. Why do you think this is not necesarry?

Q6 Compare the overlap in differentially epressed genes between the groups, e.g. using a venn diagram. Was your answer for Q4 correct? Are the results as you expected?

Q7 How many genes are found differentially expressed in all conditions? Do you think these genes would be interesting to look into or not? Why (not)?

2.2 Visualization of Differentially expressed genes

2.2.1 Volcano Plot

The volcano Plot is a great way to visualize your differentially expressed genes.

By definition a volcano plot has in it’s X-axis the log2(fold change) per gene, whilst on the Y-axis the -log10(p Value) per gene. We already have all this information in our “res” objects from the DESeq2.

Q8 Do the same for the CD3+ and PMA stimulated data. Do you find the same results as was reported in the manuscript (see Q1)?

Q9 In the abstract the authors mention BACH2. What is the p-value and log-fold change of BACH2 in the CD3+ differential group? Is it in the top most differentially expressed genes? If not, do you expect it to be?

2.2.2 Heatmap of differentially expressed genes

Heatmaps are color coded, graphical representations of a matrix. The rows and columns of this matrix can be arranged in a certain way to showcase the similarities between columns and rows. This arrangement of columns and rows can be using an unsupervised approach such as hierarchical clustering.

We firstly need normalize the expression levels to compare the expression levels between samples.

Q10 Do the same for CD3+ and PMA stimulation. What do you think are the advantages of a heatmap over a volcano plot?

Finally, write the expression data for CD3 to a file, as we will use it for the pathway analysis.