Analyze Silac data from MaxQuant

Prerequisites

It is recommended to execute the following exercises using RStudio (even though any R console will work). For the plots you will need the package “ggplot2”. In case it’s not installed yet, execute install.packages(‘ggplot2’).

Load “ggplot2”:

library('ggplot2')

1. Load MaxQuant (MQ) results

Load the results from MQ together with it’s headers. You will have to adapt result_dir:

result_dir <- "/Users/admin/Documents/presentations/proteomics_SIB_2018/Course_II_advanced/R_exercise"
result_file_name <- "proteinGroups_R_exp14_M1.txt"
result_file <- paste(result_dir, result_file_name, sep="/")

res <- read.csv(result_file, sep="\t")

2. Filter

Remove all proteins marked as + in the column Reverse:

res_2 <- subset(res, Reverse != "+")

Remove all contaminants. The table proteinGroups.txt from MQ has a column Contaminant. In this example we added an additional column named Contaminant_2 to mark further proteins we manually identified as contaminants:

res_3 <- subset(res_2, Contaminant != "+" & Contaminant_2 != "+")

Now we should have 4989 protein groups left. Check the number of proteins with nrow(res_3).

3. Analysis 1

3.1 Histogram of H/L ratios

There are 6 different columns containing H/L ratios. Plot a histogram for each of them individually by adapting the variable H_L_Ratio_column:

  H_L_Ratio_column <- "Ratio.H.L.exp14.rep1.20h"
  p <- ggplot(res_3, aes_string(x=H_L_Ratio_column)) +
        geom_histogram() +
        xlim(0,5)
  print(p)

You can also generate all plots automatically by first selecting the 6 columns containing the term “Ratio.H.L” and then loop through each of them:

H_L_Ratio_columns <- grep("Ratio.H.L", colnames(res_3), value=TRUE)

for(column_name in H_L_Ratio_columns){
  p <- ggplot(res_3, aes_string(x=column_name)) +
        geom_histogram() +
        xlim(0,5)
  print(p)
}

3.2 Summary statistics

To get a basic summary statistics of our results table res_3 we can simply call:

  summary(res_3)

But we might want to call it only on the columns containing H/L Ratios:

  H_L_Ratio_columns <- grep("Ratio.H.L", colnames(res_3), value=TRUE)
  summary(res_3[,H_L_Ratio_columns])

3.3 Scatterplot of H/L ratios

Let’s create a scatterplot of H/L ratios for Ratio.H.L.Normalized.exp14.rep1.20h and Ratio.H.L.Normalized.exp14.rep1.20h:

  H_L_Ratio_column <- "Ratio.H.L.Normalized.exp14.rep1.20h" # adapt this variable to plot the second replicate
  p <- ggplot(res_3, aes_string(x="id", y=H_L_Ratio_column)) +
        geom_point(alpha=0.3)
  print(p)

Note the distribution of ratios. Is it easy to evaluete the data?

4. LOG transformation

Make log2 transformation of the normalized H/L Ratios columns:

  H_L_Ratio_columns <- grep('Ratio.H.L.Normalized', colnames(res_3), value=TRUE)

  log2_H_L_Ratios <- log2(res_3[,H_L_Ratio_columns])
  
  # add a "log2" to the column names and add them back to the table
  colnames(log2_H_L_Ratios) <- paste0(colnames(log2_H_L_Ratios), ".log2")
  res_4 <- cbind(res_3, log2_H_L_Ratios)

5. Analysis 2

5.1 Scatterplot of log2

Create the same scatterplot as before, but this time on the log2 data:

  H_L_Ratio_column <- "Ratio.H.L.Normalized.exp14.rep1.20h.log2"
  p <- ggplot(res_4, aes_string(x="id", y=H_L_Ratio_column)) +
        geom_hline(yintercept = 0) +
        geom_point(alpha=0.3)
  print(p)

Select proteins above a certain threshold (e.g. 1) in replicate 1 and look if they’re also high in the other replicates:

  res_above_threshold <- subset(res_4, Ratio.H.L.Normalized.exp14.rep1.20h.log2 > 1)

  H_L_Ratio_column <- "Ratio.H.L.Normalized.exp14.rep1.20h.log2"
  
  p <- ggplot(res_4, aes_string(x="id", y=H_L_Ratio_column)) +
        geom_hline(yintercept = 0) +
        geom_point(alpha=0.3) +
        geom_point(data=res_above_threshold, aes_string(x="id", y=H_L_Ratio_column), col="red")  
  print(p)

Now let’s plot the corresponding H/L ratios for the second replicate:

  H_L_Ratio_column <- "Ratio.H.L.Normalized.exp14.rep2.20h.log2"
  
  p <- ggplot(res_4, aes_string(x="id", y=H_L_Ratio_column)) +
        geom_hline(yintercept = 0) +
        geom_point(alpha=0.3) +
        geom_point(data=res_above_threshold, aes_string(x="id", y=H_L_Ratio_column), col="red")  
  print(p)

5.2 Multiscatterplot and Pearson correlation

We can look at the correlations between the different replicates using a multiscatterplot:

H_L_Ratio_columns <- grep("Ratio.H.L.*.log2", colnames(res_4), value=TRUE)

pairs(res_4[,H_L_Ratio_columns])

The Pearson correlation can be computed with the following command:

cor(res_4[,H_L_Ratio_columns], method="pearson", use="pairwise.complete.obs")

6. Filter more

We filter on the number of peptides observed per protein groups. We’re keeping only proteins which have > 2 proteins:

  res_5 <- subset(res_4, Peptides..seq. > 2)

And we make again a multi scatterplot and compute the Pearson correlation:

# multi scatterplot
pairs(res_5[,H_L_Ratio_columns])

# Pearson correlation
cor(res_5[,H_L_Ratio_columns], method="pearson", use="pairwise.complete.obs")

Is the correlation better now?

7. Statistical tests

Compute the t-test for the normalized log2 ratios:

p_vals <- apply(res_5[,H_L_Ratio_columns], 1, function(x) {
    # we have to wrap the test into a tryCatch, since the t-test is not working when there are too many NA's
    tryCatch({
      one_t_test <- t.test(x, mu=0, var.equal = FALSE)
      one_t_test$p.value
    }, error=function(cond){
      NA
    })
  })

# number of significant p-values
sum(p_vals <= 0.05, na.rm=TRUE)

But we have to adjust for multiple testing (e.g. with Benjamini-Hochberg):

p_vals_adj <- p.adjust(p_vals, method="BH")

How many proteins pass t-test with p-value < 0.05 now?

Generation of this document

This file was generated from “analyze_silac_MQ.Rmd” using knitr. From RStudio you can generate it by clicking on “Knit”. If you don’t have the package “knitr” installed, you will have to execute the following command first install.packages('knitr').