It is recommended to execute the following exercises using RStudio (even though any R console will work). For the plots you will need the package “ggplot2”. In case it’s not installed yet, execute install.packages(‘ggplot2’)
.
Load “ggplot2”:
library('ggplot2')
Load the results from MQ together with it’s headers. You will have to adapt result_dir
:
result_dir <- "/Users/admin/Documents/presentations/proteomics_SIB_2018/Course_II_advanced/R_exercise"
result_file_name <- "proteinGroups_R_exp14_M1.txt"
result_file <- paste(result_dir, result_file_name, sep="/")
res <- read.csv(result_file, sep="\t")
Remove all proteins marked as +
in the column Reverse:
res_2 <- subset(res, Reverse != "+")
Remove all contaminants. The table proteinGroups.txt from MQ has a column Contaminant. In this example we added an additional column named Contaminant_2 to mark further proteins we manually identified as contaminants:
res_3 <- subset(res_2, Contaminant != "+" & Contaminant_2 != "+")
Now we should have 4989
protein groups left. Check the number of proteins with nrow(res_3)
.
There are 6 different columns containing H/L ratios. Plot a histogram for each of them individually by adapting the variable H_L_Ratio_column
:
H_L_Ratio_column <- "Ratio.H.L.exp14.rep1.20h"
p <- ggplot(res_3, aes_string(x=H_L_Ratio_column)) +
geom_histogram() +
xlim(0,5)
print(p)
You can also generate all plots automatically by first selecting the 6 columns containing the term “Ratio.H.L” and then loop through each of them:
H_L_Ratio_columns <- grep("Ratio.H.L", colnames(res_3), value=TRUE)
for(column_name in H_L_Ratio_columns){
p <- ggplot(res_3, aes_string(x=column_name)) +
geom_histogram() +
xlim(0,5)
print(p)
}
To get a basic summary statistics of our results table res_3
we can simply call:
summary(res_3)
But we might want to call it only on the columns containing H/L Ratios:
H_L_Ratio_columns <- grep("Ratio.H.L", colnames(res_3), value=TRUE)
summary(res_3[,H_L_Ratio_columns])
Let’s create a scatterplot of H/L ratios for Ratio.H.L.Normalized.exp14.rep1.20h
and Ratio.H.L.Normalized.exp14.rep1.20h
:
H_L_Ratio_column <- "Ratio.H.L.Normalized.exp14.rep1.20h" # adapt this variable to plot the second replicate
p <- ggplot(res_3, aes_string(x="id", y=H_L_Ratio_column)) +
geom_point(alpha=0.3)
print(p)
Note the distribution of ratios. Is it easy to evaluete the data?
Make log2 transformation of the normalized H/L Ratios columns:
H_L_Ratio_columns <- grep('Ratio.H.L.Normalized', colnames(res_3), value=TRUE)
log2_H_L_Ratios <- log2(res_3[,H_L_Ratio_columns])
# add a "log2" to the column names and add them back to the table
colnames(log2_H_L_Ratios) <- paste0(colnames(log2_H_L_Ratios), ".log2")
res_4 <- cbind(res_3, log2_H_L_Ratios)
Create the same scatterplot as before, but this time on the log2 data:
H_L_Ratio_column <- "Ratio.H.L.Normalized.exp14.rep1.20h.log2"
p <- ggplot(res_4, aes_string(x="id", y=H_L_Ratio_column)) +
geom_hline(yintercept = 0) +
geom_point(alpha=0.3)
print(p)
Select proteins above a certain threshold (e.g. 1) in replicate 1 and look if they’re also high in the other replicates:
res_above_threshold <- subset(res_4, Ratio.H.L.Normalized.exp14.rep1.20h.log2 > 1)
H_L_Ratio_column <- "Ratio.H.L.Normalized.exp14.rep1.20h.log2"
p <- ggplot(res_4, aes_string(x="id", y=H_L_Ratio_column)) +
geom_hline(yintercept = 0) +
geom_point(alpha=0.3) +
geom_point(data=res_above_threshold, aes_string(x="id", y=H_L_Ratio_column), col="red")
print(p)
Now let’s plot the corresponding H/L ratios for the second replicate:
H_L_Ratio_column <- "Ratio.H.L.Normalized.exp14.rep2.20h.log2"
p <- ggplot(res_4, aes_string(x="id", y=H_L_Ratio_column)) +
geom_hline(yintercept = 0) +
geom_point(alpha=0.3) +
geom_point(data=res_above_threshold, aes_string(x="id", y=H_L_Ratio_column), col="red")
print(p)
We can look at the correlations between the different replicates using a multiscatterplot:
H_L_Ratio_columns <- grep("Ratio.H.L.*.log2", colnames(res_4), value=TRUE)
pairs(res_4[,H_L_Ratio_columns])
The Pearson correlation can be computed with the following command:
cor(res_4[,H_L_Ratio_columns], method="pearson", use="pairwise.complete.obs")
We filter on the number of peptides observed per protein groups. We’re keeping only proteins which have > 2 proteins:
res_5 <- subset(res_4, Peptides..seq. > 2)
And we make again a multi scatterplot and compute the Pearson correlation:
# multi scatterplot
pairs(res_5[,H_L_Ratio_columns])
# Pearson correlation
cor(res_5[,H_L_Ratio_columns], method="pearson", use="pairwise.complete.obs")
Is the correlation better now?
Compute the t-test for the normalized log2 ratios:
p_vals <- apply(res_5[,H_L_Ratio_columns], 1, function(x) {
# we have to wrap the test into a tryCatch, since the t-test is not working when there are too many NA's
tryCatch({
one_t_test <- t.test(x, mu=0, var.equal = FALSE)
one_t_test$p.value
}, error=function(cond){
NA
})
})
# number of significant p-values
sum(p_vals <= 0.05, na.rm=TRUE)
But we have to adjust for multiple testing (e.g. with Benjamini-Hochberg):
p_vals_adj <- p.adjust(p_vals, method="BH")
How many proteins pass t-test with p-value < 0.05 now?
This file was generated from “analyze_silac_MQ.Rmd” using knitr. From RStudio you can generate it by clicking on “Knit”. If you don’t have the package “knitr” installed, you will have to execute the following command first install.packages('knitr')
.