In this part of the course we will look into basic steps of the pathway enrichment analysis. There are many different methods, implementations and tools for performing this.

Some up-to-date web-based enrichment tools are, for example:

Here we will focus on ORA (overrepresentation analysis) and GSEA (Gene Set Enrichment Analysis), as implemented in ClusterProfiler R package [@Yu2012].

1 Preparation

1.4 Set seed for the R session

Some of the commands we use in this practical use random processes, such as random permutations for GSEA, FDR calculations and rendering of the network graphs. This means that each time the script is ran, the outcome may differ very slightly, just by chance. To make your code reproducible it is good idea to define the number “seed” in the beginning of the script. This way the script gives exactly the same output every time it is ran.

2 Prepare the data

2.1 Read in the data from RNA-seq differential analysis

We will read in the results of differential expression analysis from previous practical. We will concentrate on anti-CD3-stimulated dataset in this practical.

2.2 Prepare data for ORA and GSEA

Next we will do some data cleaning and convert gene IDs into usable format.

We will remove the genes with mean base expression level 0, as those were not expressed in our data.

As you may see from the R message, ~10% of HGNC symbols were not connected with any ENTREZ ID. This is due to the unstable nature of HGNC annotation. In real-life science project you should opt for more stable annotation scheme (e.g. use one ENSEMBL/UCSC annotation database version throughout project).

For the sake of brevity and simplicity, in this practical we remove unlinked gene IDs from following steps and proceed with remaining 90% of genes.

We will prepare the data for ORA. In ORA we will use genes which are significantly differentically expressed (FDR<0.05, absolute fold-change>2). Fold change is often used as additional filter for identifying most relevant genes for interpretation.

Next we will prepare data for GSEA analysis. For that we need to order all genes based on their correlation strenght with the phenotype or effect size. In our case we will order those based on \(log_2(FC)\). We will also extract the vector of all tested genes which can be used as “gene universe”.

3 Over-representation analyses (ORA)

3.1 Run KEGG over-representation analysis

In this part we run enrichment test (hypergeometric test) for all differentially expressed genes (FDR<0.05, FC>2).

  • We use all genes tested in the RNA-profiling study as a background set or “gene universe”. This list was already constructed in the previous step.

  • We will query for all enrichment results and write out 25 most differentially expressed, not accounting the significance.

  • The default multiple testing is done by Benjamini-Hochberg method, flag pvalueCutoff can be used for filtering only significant results (<0.05). Additionally, qvalueCutoff flag is used to filter results based on another popular method of FDR estimation (Storey q-value).

  • The defaults of the command also apply restrictions on the sizes of gene sets tested in the analysis, those can be seen with command ?enrichKEGG.

We will accustom ourselves with the data structure of the analysis output.

The resulting object is more complex R S4 data structure which consists separate “containers” for different results, analysis settings, etc. You can access to those containers by using @.

Now we look at the results by printing out 25 first rows of the result table. View function in RStudio makes it very comfortable.

Q1: How many KEGG pathways are significant if we consider Benjamini-Hochberg P<0.05?

Q2: How many KEGG pathways are significant if we consider Storey Q<0.05?

Q3: Authors of the original paper [@Quinn2015, [link](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0140049)] have already applied the KEGG overrepresentation test to the same data. Find the results from the publication and compare to your results- are those similar? If not exactly, speculate, what could be the reason(s) of observed differences?

Q4: Use Fisher’s exact test to test the over-representation of the fifth most significant gene set enrichment result.

  • Google “Fisher’s Exact test R”

  • Construct the contigency table- you can use this code snippet to replace “NAs” with the correct numbers from the output table:

  • Tip: you can use command addmargins on your GeneMatrixFisher to see if your contigency table adds up.

  • Use the GeneMatrixFisher to run Fisher’s Exact Test.

Q4.1: Report the used command, Fisher’s exact test P-value and odds ratio.

Q4.2: Does Fisher’s exact test give similar/same result as ClusterProfiler?

3.1.1 Visualize the overall KEGG enrichment analysis results

We can visualize our results on a barplot. Length of the bar shows significance (\(-log_{10}(P)\)) and the color shows number of genes overlapping with gene set. From now on, we will save the plots as separate files in the work directory, that enables us to do closer inspection.

Next we visualize five most enriched pathways on the network graph. Size of the pathway node shows statistical significance and links indicate gene membership in the pathway. This gives the static graph as a output. If it is necessary to investigate further, you can use flag fixed = FALSE to see interactive version.

Q5: What are the most up- and down-regulated genes from 5 most enriched pathways, based on visual inspection?

Q6: Is there any of top pathways which shows coordinated up- or down-regulation for the majority (or all) of its members?

3.1.2 Visualize the most significantly enriched KEGG pathway

For KEGG pathways it is possible to visualize the location of differentially expressed genes on the pathway, as well as their magnitude of differential expression. This kind of visualization might help to identify most relevant genes for given condition and generate new research hypotheses.

Q7: Same figure is also reported in the original publication [@Quinn2015, [link](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0140049)]. Does it look similar or do you see any difference? Speculate, where can it come from?

3.2 Run Gene Ontology over-representation analysis

Next we run the overrepresentation test for Gene Ontologies.

  • For the sake of brevity, we test only the ontologies from Biological Process category. By setting the ont flag, is also possible to test Molecular Function (MF), Cellular Component (CC) or all three categories combined (ALL).

  • For visualization purposes we apply this time significance thresholds (Benjamini-Hochberg FDR<0.05 and Storey FDR<0.05).

  • We specify that we test the enrichment for the gene sets with sizes between 10-10000 genes (flags minGSSize and maxGSSize)

3.2.1 Visualize the GO enrichment analysis results

To ease the interpretation and see the hierarchical relationships between most enriched GO terms, we visualize those on graph structure. Square boxes are significantly enriched terms and color depicts the significance (P-value). We will visualize 15 most significant terms.

Q8: What is the most significant GO term?

Q9: What are the main general themes coming out of the analysis?

3.3 Run Disease Ontology over-representation analysis

Clusterprofiler allows to do also ORA for human diseases, making use of several databases of curated gene-disease relationships. As our trait of interest is Coeliac disease, it would be interesting to see differentially expressed genes also show enrichment of some human diseases and whether those are could be related with Coeliac disease or any other autoimmune disease.

Q10: If you see that the genes differentially expressed in your disease-of-interest (in our case, Coeliac disease) are enriched by genes known to be associated with some other disease (e.g. inflammatory bowel disease, another autoimmune disease), what is the biological conclusion you make? Could this observation be useful for the search of the treatment and how would you proceed with that?

Q11: Does resulting significant disease list make sense in relation to Coeliac disease? Are there any diseases which could show similar expression profile as Coeliac disease? If not, speculate, what could be the reason of ambigous results?

4 Gene set enrichment analysis (GSEA)

Next we will use GSEA [@Subramanian2005] on the diffrential analysis results. Unlike ORA, GSEA uses all the results from differential expression analysis, not only significant ones.

4.1 Run GSEA (Gene Set Enrichment Analysis) for KEGG pathways

For the sake of speed and brevity, we use recommended minimal number of 1000 permutations for this analysis (to get more stable and precise results, your could increase the number of permutations to e.g. 10000 in real scientific work). Also, we test only KEGG pathways which have more than 50 known gene members.

Q10: How many pathways are significant if we consider Benjamini-Hochberg FDR?

Q11: Are the results in line with ORA results?

5 Compare the biological themes of different stimulations

It is possible to easily compare the results several ORA analyses, using the functionality from compareCluster results. Next we read in and preprocess differentially expressed genes from all three conditions (UN-CeD vs control, unstimulated; CD3-CeD vs control, anti-CD3 stimulation; PMA-CeD vs control, PMA stimulation). We will run the KEGG ORA, using default settings and all genes tested in the study as an background.

We use dplyr and ggplot2 functionality to compare the 20 most significant pathways for each group. Here, the size of the dot indicates significance (\(-log_{10}(P)\)) and red dots indicate pathways withstanding multiple testing correction.

Q12: Are there any KEGG pathways overlapping between any of the conditions?

Q13: What condition has the largest number of significantly enriched pathways?

Q14: Are the per-cohort results in line with the ORA results reported in the paper?

6 Report the session information

Finally, when wrapping up your analysis, it is always a good idea to save the settings you have used. This involves software versions and the date of the analysis, making it possible to replicate or debug the results by different person.

Command sink is useful for saving any output from the screen into file.

Investigate the resulting file and note what information is saved.