Orthologs and paralogs in R
If you are not using the Virtual Machine, for each library you need to first install it from bioconductor or another server as specified:
source("http://bioconductor.org/biocLite.R")
biocLite("LibraryName")
library(LibraryName)
You need to download and source
our collection of functions functions_MRR_TB.R.
Exercise 1: calculating tissue-specificity
First install the package BgeeDB
, and open its documentation from bioconductor. Then obtain all RNA-seq libraries from zebrafish Danio rerio covering more than 10 anatomical structures:
use Bgee$new
to get zebrafish RNA-seq
use getAnnotation()
to obtain the corresponding annotations, then take the subset $experiment.annotation
finally, select on $Organ.count
, and show the Experiment.ID
.
Use getData()
to obtain the expression data from that experiment ID. Subset the columns 4, 6 and 12 (Gene.ID, Anatomical.entity.name, and TPM), then use the following command to obtain a nice expression matrix:
reshape(data_sub, idvar = "Gene.ID", timevar = "Anatomical.entity.name", v.names = "TPM", direction = "wide")
# change data_sub according to your variable name
Remove missing values with na.omit.
Make a temporary numbers only matrix x by removing the first column, set TPM values <1 to 1 in x, and replace values in your matrix from reshape by log2(x)
.
Use apply to run the function fTau on the resulting matrix, columns with numbers only. Store the result in a new variable in your data frame of expression. This should look like:
matrix_name$Tau <- apply(matrix_name[2:13], 1, fTau) # change matrix_name to the name of your matrix
Further use apply to functions from our collection will be similar, and will not be spelled out each time.
Similarly, apply fmean
to the same data (note: we are using log transformed values).
Plot histograms of the Tau and mean distributions, and the relation between them.
Exercise 2: expression of duplicate and singleton zebrafish genes
Use read.table
to import the gene lists of the first practical (from Roux et al 2017) of the day into R.
Change the colnames
to Gene.ID
to be consistent with the data obtained from Bgee, and merge
each gene list with the corresponding expression values by = "Gene.ID"
.
Compare the distributions of mean and Tau of expression among the three gene categories.
Exercise 3: Comparing orthologs and paralogs on gene trees
You will need the following libraries:treeio
, phytools
, ape
, dplyr
, magrittr
, ggplot2
Some functions will be slow, to optimize them it is better to know the number of cores on your computer:
cores <- detectCores()
We have gene trees from Genomicus, which need to be parsed to be useful. As the parsing is quite long, directly load
into the R the rda file from here.
To check one tree, use
plotTree(trees.all[[n]]@phylo,mar=c(5,2,2,2)) # replace n with a number between 1 and 20000 to visualize a tree
data.tree<-trees.all[[n]]@data
nodelabels(col="blue",frame="none")
axisPhylo()
To add branch lengths, apply the function tree.nodedepth
to tree.all
; use mclapply
with mc.cores = cores
, instead of apply
, for faster results.
To gain time, we are providing some precompted files with Tau values for different fish species.
zebrafish_tau.txt, spotted_gar_tau.txt, atlantic_cod_tau.txt, medaka_tau.txt, cavefish_tau.txt
Download them and import them into R, into one dataframe.
Add Tau data to the trees; this takes 5-10 minutes:
tau.annotated.trees<- mclapply(trees.all.modified,
function(tree){
tree@data %<>% left_join(Tau.species.data, by = c("label"))
return(tree)},
mc.cores=cores)
Prune the gene trees by applying drop.leaf
, and remove pruned trees with NA data.
To distinguish speciation and duplication nodes and time calibrate gene trees, we need to modify the clade labels; apply modify.label
.
We need to calibrate trees based on available speciation time points, with dates from TimeTree:
Calibration.time <- data.frame(
label = c("Otophysi","Clupeocephala","Neopterygii","Acanthomorphata"),
Mya = c(153, 230, 315, 148))
Calibration.time <- Calibration.time[order(-Calibration.time$Mya),]
Time calibrate your trees according to the these calibration time points:
Calibrated.timetrees <- lapply(trees.pruned2,
tree.calibrate,
timeframe=Calibration.time,
model="relaxed")
Remove the is.na
and NULL
time trees.
We need to remodify the gene tree clade labels to the initial ones, for PIC calculation as well as for node age storage. Apply remodify.label
.
Assign node age to each node by applying node.age
.
Now we will calculate the phylogenetic independent contrast (PIC) for our trees. Apply contrast.calculation
to the calibrated trees with the right labels.
Summarize all results of "@data" slot of trees in a dataframe for easy calculation:
summary_all_tree<-bind_rows(lapply(pic.Calibrated,
function(tree){
gene_data<-tree@data
drops <- c("B","SIS") # B and SIS have inconsistent types in binding rows, so we remove those columns
tree@data<-gene_data[ , !(names(gene_data) %in% drops)]
return(tree@data)})) # change variable names as needed
nodes_contrast <- summary_all_tree[which(!is.na(summary_all_tree$pic.abs)),]
To avoid biases due to very long branches of old duplication events, remove all events older than the oldest speciation, i.e. with node age > 315.
Finally, we can compare the PICs of speciation and duplication! Compare them using a Wilcoxon test. Examine the results, and plot them as a boxplot.
Exercise 4: Pairwise comparison (optional)
Apply the function pairwise.processing
to the calibrated PICs.
Collect a dataframe of correlation results for different time points of speciation and duplication, defining a correlation function as cor.test(x, y,method="pearson")
and using group_by(events,label)
.
Merge the correlation and calibration time data frames by=c("label")
, and order them by Mya.
Add to this data frame a factor levels=c( "speciation", "duplication" )
.
Restrict again to events <= 315 Mya.
Test for differences between duplication and speciation using an ANCOVA according to the factor duplication or speciation and to node age.
Finally, you can nicely plot the results with this code:
scatterplot(empirical.pairwise,Ancova.result)+
geom_point(aes(size=R),alpha=1,show.legend=F)+
geom_smooth(method = "lm", formula = y ~ x, size =0.5,se=F) # modify variable names as needed
That's all folks!