6 Session 6: RNA-seq part 2

The aim of this session is to complete the analysis of the RNA-seq data using the limma and edgeR packages. With the results of the analysis, we want to be able to produce publication quality figures and write out the results to a file for sharing or publication.

Learning Objectives

fit a linear model to the data
construct design and contrast matrices for the linear model
model the mean-variance relationship of the data using the voom() function
perform statistical testing to identify differentially expressed genes
visualise the results using an MA plot
write out the results to a file

# load required packages
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(edgeR)

Loading required package: limma

library(limma)

group <- parse_factor(
  c("LP", "ML", "Basal", "Basal", "ML", "LP", "Basal", "ML", "LP"),
  levels = c("Basal", "LP", "ML")
)

samplenames <- c(
  "10_6_5_11", "9_6_5_11", "purep53", "JMS8-2", "JMS8-3",
  "JMS8-4", "JMS8-5", "JMS9-P7c", "JMS9-P8c"
)

# create DGEList object
dge <- readDGE(
  dir(path = "data/counts", pattern = "GSM*"),
  path = "data/counts",
  columns = c(1, 3),
  group = group,
  labels = samplenames
)

# add gene annotation information
dge$genes <- read_tsv("data/Ses3_geneAnnot.tsv")

Rows: 27179 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): SYMBOL, TXCHROM
dbl (1): ENTREZID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# filter lowly expressed genes
keep <- filterByExpr(dge)
dge <- dge[keep, ]

# normalise the data for library size and RNA composition
dge <- calcNormFactors(dge, method = "TMM")

6.1 Linear modelling

To identify differentially expressed genes from our normalised gene expression data, we will use the limma package to fit linear models to the genes. A linear model is a broad class of statistical models that fit the value of an response (or dependent) variable as a linear function of one or more explanatory (or independent) variables (also called covariates).

The general form of a linear model looks like this:

\(Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2}... + \epsilon\)

This equation is saying that a response variable of interest \(Y\) is equal to a constant (\(\beta_{0}\)) plus the sum of the covariates (\(X_{i}\)) each multiplied by a constant coefficient (\(\beta_{i}\)) and some error term (\(\epsilon\)). The error term is the difference between the observed value and the predicted value of the model and is assumed to have a normal distribution with mean 0 and some variance.

Our experiment is quite simple, since there is only a single covariate, the cell type. The true benefit of using linear models is in its ability to accommodate more complex designs including multiple covariates.

To fit the linear models in the limma-voom framework we need two objects in addition to our data:

A design matrix, representing the covariates.
A contrast matrix, representing the specific comparison we wish to make.

6.1.1 Design matrix

The first step to fitting a linear model is to specify a design matrix. The design matrix specifies the values of the covariates for each sample, and is represented as a matrix due to the mathematical convenience.

To generate a design matrix. We use the function model.matrix(), with the expression ~0 + group. This returns a matrix representing the design where there is no intercept term and group is the only covariate. If we omit the 0 then there would be an intercept in the model, and if we included more covariates then more columns would be generated.

design <- model.matrix(~0 + group, data = dge$samples)
design

          groupBasal groupLP groupML
10_6_5_11          0       1       0
9_6_5_11           0       0       1
purep53            1       0       0
JMS8-2             1       0       0
JMS8-3             0       0       1
JMS8-4             0       1       0
JMS8-5             1       0       0
JMS9-P7c           0       0       1
JMS9-P8c           0       1       0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"

There are 9 rows, one for each sample. Along the columns are the names of the groups. The values in the cells denote membership of the particular sample for a particular group, as our groups in this case are mutually exclusive, each row contains only a single 1 to denote membership in a single group.

6.1.2 Contrasts

Contrast matrices are companions to design matrices. They are used to specify the comparison of interest. In our case, we have three experimental groups: Basal, LP and ML. So if we are to perform differential expression analysis, we are most likely interested in the differences between only two of the groups at a time. Contrast matrices let use specify which comparison we’d like to make, and are also represented as a matrix just like the design.

A contrast matrix can be made using the makeContrasts() function. Within this function, we specify the name of each specific contrast and the formula for that contrast. For example, the BasalvsLP contrasts compares the difference between the Basal and LP groups. Note that the name of the phenotype groups must be written exactly as they are in the column names of our design matrix (see above).

In addition to the individual contrasts, the function must know about the design of the model. This is passed through the levels argument, which either accepts a matrix with the column names corresponding to levels of your experimental groups, or the levels themselves as a character vector. Here we first simplify the column names of our design matrix to make it easier to read. Then we create the contrast matrix using the makeContrasts() function.

colnames(design) <- c("Basal", "LP", "ML")

contr_matrix <- makeContrasts(
  BasalvsLP = "Basal - LP",
  BasalvsML = "Basal - ML",
  LPvsML = "LP - ML",
  levels = design # alternatively 'levels = colnames(design)'
)

contr_matrix

       Contrasts
Levels  BasalvsLP BasalvsML LPvsML
  Basal         1         1      0
  LP           -1         0      1
  ML            0        -1     -1

There are two things to note about a design matrix. First, the sum of the numbers in each row is equal to 0. Second, the way that you set up the equation in the matrix will determine the interpretation of the log-fold-change calculated later, as well as the meaning of up-regulated and down-regulated genes. For example, the contrast Basal - LP will give a positive log-fold-change for genes that are up-regulated in the Basal group compared to the LP group. If we had set up the contrast as LP - Basal, then the opposite would be true, and we would get a negative log-fold-change for genes that are up-regulated in the Basal group compared to the LP group.

6.1.3 Variance modelling with voom

We are now ready to fit our linear models. Limma fits linear models to the data with the assumption that the underlying data is normally distributed. Count data is generally not normally distributed, but log transforming count data gives it a roughly normal distribution sufficient for linear models to work well. To do this, limma transforms the raw count data to log-cpm using library sizes and the normalisation factors we calculated previously.

In addition to the normalisation steps, the limma-voom pipeline uses the voom() function to generate weights for the individual genes based on a modelled mean-variance relationship. This modelling allows use to get more information out of small sample sizes as the weights prevent our model from being more heavily influenced by more variable data points. The weights will then allow the linear model to be less influenced by genes with high variability, which is important for small sample sizes. The voom function also generates a mean-variance plot, which is useful for visualising the data.

The voom() function takes as arguments, our DGEList object and our design matrix. It also optionally outputs a plot of the mean-variance relationship of our data, called the ‘voom-plot’.

v <- voom(dge, design, plot = TRUE)

The output of voom() (our variable v) is an EList object which contains the following elements:

genes - a data frame of gene annotation data.
targets - data frame of sample data.
E - numeric matrix of normalised log-cpm values.
weights - numeric matrix of precision weights.
design - the design matrix.

The MA plot shows the mean-variance relationship of the data. The x-axis is the average log2 expression of each gene, and the y-axis is the log2 standard deviation of each gene. The idea is that the variance of a gene can be estimated as a function of its main expression, and that this estimate is more accurate than the direct calculation of the variance.

6.1.4 Fitting the linear model

We are now ready to fit our linear model with lmFit(), which calculates coefficients we defined in our design matrix design. The resulting object, vfit is a MArrayLM object. It contains a information about our genes (the same data frame as genes from our EList object v above), the design matrix and a number of statistical outputs. Of most interest to us is the coefficients, stored in an element called coefficients. The first rows of this matrix is shown below. Each gene is row and is labelled using the EntrezID. Each column gives coefficients for each of our phenotype groups. These coefficients are weighted averages of the log-cpm of each gene in each group.

vfit <- lmFit(v)
head(vfit$coefficients)

            Basal        LP        ML
497097  3.0238407 -4.490714 -3.944753
20671   0.2678024 -2.489068 -2.025190
27395   4.3267906  3.900754  4.365115
18777   5.2066347  4.975762  5.653810
21399   5.2105491  4.901520  4.876117
58175  -1.9300214  3.581004  3.133716

We can then use contrasts.fit() to calculate coefficients for each contrast (or ‘comparison’) we specified in our contr_matrix. The output is also an object of the class MArrayLM (also known as an MArrayLM object). When we inspect the coefficients element now, we can see that each column is a contrast that we specified in our contrast matrix.

vfit <- contrasts.fit(vfit, contrasts = contr_matrix)
head(vfit$coefficients)

        Contrasts
          BasalvsLP   BasalvsML      LPvsML
  497097  7.5145551  6.96859344 -0.54596165
  20671   2.7568708  2.29299235 -0.46387848
  27395   0.4260362 -0.03832424 -0.46436046
  18777   0.2308732 -0.44717502 -0.67804818
  21399   0.3090295  0.33443212  0.02540261
  58175  -5.5110259 -5.06373764  0.44728824

With these values we essentially want to know whether or not the difference in values is significantly different from 0. If the difference is not significantly different from 0, then we have failed to establish a difference in the expression of a particularly gene between two groups. If the difference is significantly greater than 0, then we have up-regulation in the first group of the contrast, and if the difference is significantly less than 0, then we have down-regulation in the first group of the contrast.

6.2 Statistical testing

To actually test if the values are significantly different from 0, we use the eBayes() function to compute moderated t-statistics, moderated F-statistics and log-odds of differential expression for each gene, given a fitted linear model. ‘Moderated’ refers to empirical Bayes moderation, which borrows information across genes to obtain more accurate measures of variability for each gene. This also increases our power to detect differentially expressed genes.

efit <- eBayes(vfit)

We can now look at the number of differentially expressed genes using the decideTests() function. The output of this function is a matrix where each column is a contrast (comparison of interest) and each row is a gene. The numbers 1, -1 and 0 mean up-regulated, down-regulated or not significantly differentially expressed, respectively.

Note that decideTests() also accounts for multiple testing. The default method is Benjamini and Hochberg¹ but several others are also available.

dt <- decideTests(efit)
dt

TestResults matrix
        Contrasts
         BasalvsLP BasalvsML LPvsML
  497097         1         1      0
  20671          1         1      0
  27395          0         0      0
  18777          0        -1     -1
  21399          0         1      0
16619 more rows ...

To obtain the total number of differentially expressed genes for each comparison, we can add the function summary():

summary(dt)

       BasalvsLP BasalvsML LPvsML
Down        4501      4850   2701
NotSig      7306      6996  11821
Up          4817      4778   2102

The function topTable() can be used to obtain more information on the differentially expressed genes for each contrast. topTable() takes as arguments the MArrayLM object output by eBayes() (efit), the contrast name of interest and the number of top differentially expressed genes to output. Note that the contrast name must be given in quotes and must be exactly as written in the contrast matrix contr_matrix.

It outputs a data frame with the following information:

Gene details - gene information, from the gene element of the MArrayLM object (efit).
logFC - the log2 fold change of the contrast.
AveExpr - the average log2 expression of that gene.
t - moderated t-statistic.
P.Value - p value.
adj.P.Val - adjusted p value.
B - log-odds that the gene is differentially expressed.

top <- topTable(efit, coef = "BasalvsLP", n = Inf)
head(top)

       ENTREZID   SYMBOL TXCHROM     logFC  AveExpr         t      P.Value
12521     12521     Cd82    chr2 -4.095480 7.069340 -35.47147 3.381469e-12
22249     22249   Unc13b    chr4 -4.350553 5.662874 -32.38385 8.656449e-12
16324     16324    Inhbb    chr1 -4.721420 6.460624 -30.81124 1.446147e-11
14245     14245    Lpin1   chr12 -3.768976 6.293719 -29.95387 1.934041e-11
218518   218518 Marveld2   chr13 -5.215230 4.929711 -30.87474 1.415794e-11
12759     12759      Clu   chr14 -5.306850 8.856284 -29.55444 2.220734e-11
          adj.P.Val        B
12521  4.655035e-08 18.41399
22249  4.655035e-08 17.42593
16324  4.655035e-08 17.06368
14245  4.655035e-08 16.86037
218518 4.655035e-08 16.76898
12759  4.655035e-08 16.70270

6.3 MA Plot

The MA plot is a plot of log-fold-change (M-values) against log-expression averages (A-values), this is a common plot in RNA sequencing analysis to visualise the result of differential expression tests. It can be created using the plotMA() from the limma package. Creating this plot requires 3 pieces of information:

object = efit: The the fitted object containing the log-fold-change and log-expression averages
coef = 1: The column number of the contrast to plot since there are 3 different contrasts fitted within the object.
status = dt[, 1]: A vector of numerics denoting whether a gene is up-regulated or down-regulated.

plotMA(efit, coef = 1, status = dt[, "BasalvsLP"])

As before, we can try to re-create this plot using ggplot2 to give us more flexibility in plotting. The information we’ll need is the log-fold-change, the average log-expression and the differential expression status of each gene. We can extract this information from the efit and top data frame we created above, as well as gene names for annotation purposes.

ma_plot_data <- tibble(
  logFC = efit$coefficients[, "BasalvsLP"],
  AvgExpr = efit$Amean,
  status = factor(as.numeric(dt[, "BasalvsLP"])),
  gene = efit$genes$SYMBOL,
  p_value = efit$p.value[, "BasalvsLP"]
)

With our tibble prepared, we can use ggplot2 to create the plot as we have done before

ma_plot_data %>%
  ggplot(aes(x = AvgExpr, y = logFC, col = status)) +
  geom_point()

While the coordinates of the points are the same, there is still quite a bit of work to do to make this plot look like the one produced by plotMA(). First we have the fix up the colours of the points. We also have to make it such that the non-significant points are plotted below the other points. To do this we would need to make it so that the non-significant points are plotted first. We can do this by reordering the levels of the status factor so that the non-significant points are plotted first. Labels can also be added using ggrepel for the top 15 most DE genes.

library(ggrepel)

ma_plot_data_fct <- ma_plot_data %>%
  mutate(
    status = fct_recode(
      status,
      "Not DE" = "0",
      "Down" = "-1",
      "Up" = "1"
    )
  ) %>%
  mutate(status = fct_relevel(status, "Not DE", "Down", "Up"))

top_de_genes <- ma_plot_data_fct %>%
  arrange(p_value) %>%
  slice_head(n = 15)

scale_cols <- c(
  "Not DE" = "black",
  "Down" = "blue",
  "Up" = "red"
)

ma_plot <- ma_plot_data_fct %>%
  ggplot(aes(x = AvgExpr, y = logFC, col = status)) +
  geom_point(data = ma_plot_data_fct %>% filter(status == "Not DE"), size = 0.1) +
  geom_point(data = ma_plot_data_fct %>% filter(status != "Not DE"), size = 1) +
  geom_label_repel(
    data = top_de_genes,
    aes(label = gene, col = NULL),
    max.overlaps = 20,
    show.legend = FALSE
  ) +
  scale_colour_manual(values = scale_cols)

6.4 Volcano Plot

Since we have the tibble of the data prepared, we can also choose to create an alternative plot called a volcano plot. Which is also very popular in RNA-seq analysis. The volcano plot is a scatter plot of the negative log10 p-value against the log2 fold-change. This plot is useful for visualising the significance of the differentially expressed genes.

volcano_plot <- ma_plot_data_fct %>%
  ggplot(aes(x = logFC, y = -log10(p_value), col = status)) +
  geom_point(data = ma_plot_data_fct %>% filter(status == "Not DE"), size = 0.1) +
  geom_point(data = ma_plot_data_fct %>% filter(status != "Not DE"), size = 1) +
  geom_label_repel(
    data = top_de_genes,
    aes(label = gene, col = NULL),
    max.overlaps = 20,
    show.legend = FALSE
  ) +
  scale_colour_manual(values = c("black", "blue", "red"))

volcano_plot

6.5 Heatmap

Finally we can create a heatmap of the differentially expressed genes. The heatmap is a useful way to visualise the expression of the differentially expressed genes across the samples. We can use the pheatmap package to create the heatmap.

First we need to install the package if you haven’t already

install.packages("pheatmap")

The data we will use for the heatmap is going to be the log2 cpm expression values of the top differentially expressed genes. We can use the topTable() function to get the top differentially expressed genes for each contrast. We can then use the cpm() function to get the expression values of these genes. We will take the top 50 genes from each contrast and combine them into a single data frame. We will then use extract the expression values for these genes from the full expression matrix. For more interpretable labels, we will use the gene symbols as the row names of the expression matrix.

Note that by default the ENTREZID column is encoded as a numeric value, so to use it to subset our expression matrix we need to convert it to a character vector in order to select the matching IDs rather than trying to select the numerical rows of the data.

top_genes <- bind_rows(
  topTable(efit, coef = "BasalvsLP", n = 50),
  topTable(efit, coef = "BasalvsML", n = 50),
  topTable(efit, coef = "LPvsML", n = 50)
) %>%
  select(ENTREZID, SYMBOL, TXCHROM) %>%
  distinct()

top_gene_expr <- cpm(dge, log = TRUE)[as.character(top_genes$ENTREZID), ]
rownames(top_gene_expr) <- top_genes$SYMBOL

To create the heatmap we will use the pheatmap() function with the expression matrix. We will add in arguments to provide our own colour scale, to scale the data along the rows with a z-score, adding in sample annotation and clustering the data by rows and columns. We will also add in the option to show the row names and column names, and set the font size of the row names. You can experiment with removing each of these options to see how they affect the heatmap, as well as investigate additional options in the pheatmap() documentation.

library(pheatmap)

sample_anno <- dge$samples %>%
  select(group)

heatmap <- pheatmap(
  top_gene_expr,
  color = colorRampPalette(c("blue", "white", "red"))(100),
  scale = "row",
  annotation_col = sample_anno,
  border_color = NA,
  cluster_rows = TRUE,
  cluster_cols = TRUE,
  show_rownames = TRUE,
  show_colnames = TRUE,
  fontsize_row = 6
)

6.6 Saving all the results

Now that we have produced a series of plots and the top genes table, we can save these files to a directory. These figures may form the basis for publication figures, and the results table can partially presented as a result, or the full table can be included as a supplementary file.

dir.create("output")

write_csv(top, "output/top_genes_BasalvsLP.csv")
ggsave("output/ma_plot_BasalvsLP.png", ma_plot)

Saving 10 x 5 in image

ggsave("output/volcano_plot_BasalvsLP.png", volcano_plot)

Saving 10 x 5 in image

ggsave("output/heatmap_BasalvsLP.png", heatmap$gtable)

Saving 10 x 5 in image

With that we have come to the conclusion of the RNA-seq analysis and this course. You have now learned how to use various tidyverse packages to operate on data and create publication quality figures. You have also learned how to use the limma and edgeR packages to perform differential expression analysis on RNA-seq data. You should now be able to apply these skills to perform analysis on your own data.

6.7 References

Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B: Statistical Methodology [Internet]. 1995 Jan 1;57(1):289–300. Available from: http://dx.doi.org/10.1111/j.2517-6161.1995.tb02031.x