compgen0918: Orthologs and paralogs in R

This exercise will be done in R. If you have the latest version of R (3.5.0) on your laptop, you can use it directly. Otherwise, please use the Virtual Machine provided (see course homepage).

If you are not using the Virtual Machine, for each library you need to first install it from bioconductor or another server as specified:

source("http://bioconductor.org/biocLite.R")

biocLite("LibraryName")

Whether on your installation of R or the Virtual Machine, to use each library run

library(LibraryName)

You need to download and source our collection of functions functions_MRR_TB.R.

Exercise 1: calculating tissue-specificity

First install the package BgeeDB, and open its documentation from bioconductor. Then obtain all RNA-seq libraries from zebrafish Danio rerio covering more than 10 anatomical structures:

use Bgee$new to get zebrafish RNA-seq

use getAnnotation() to obtain the corresponding annotations, then take the subset $experiment.annotation

finally, select on $Organ.count, and show the Experiment.ID.

Use getData() to obtain the expression data from that experiment ID. Subset the columns 4, 6 and 12 (Gene.ID, Anatomical.entity.name, and TPM), then use the following command to obtain a nice expression matrix:

reshape(data_sub, idvar = "Gene.ID", timevar = "Anatomical.entity.name", v.names = "TPM", direction = "wide")

# change data_sub according to your variable name

Remove missing values with na.omit.

Make a temporary numbers only matrix x by removing the first column, set TPM values <1 to 1 in x, and replace values in your matrix from reshape by log2(x).

Use apply to run the function fTau on the resulting matrix, columns with numbers only. Store the result in a new variable in your data frame of expression. This should look like:

matrix_name$Tau <- apply(matrix_name[2:13], 1, fTau) # change matrix_name to the name of your matrix

Further use apply to functions from our collection will be similar, and will not be spelled out each time.

Similarly, apply fmean to the same data (note: we are using log transformed values).

Plot histograms of the Tau and mean distributions, and the relation between them.

Exercise 2: expression of duplicate and singleton zebrafish genes

Use read.table to import the gene lists of the first practical (from Roux et al 2017) of the day into R.

Change the colnames to Gene.ID to be consistent with the data obtained from Bgee, and merge each gene list with the corresponding expression values by = "Gene.ID".

Compare the distributions of mean and Tau of expression among the three gene categories.

Exercise 3: Comparing orthologs and paralogs on gene trees

You will need the following libraries:treeio, phytools, ape, dplyr, magrittr, ggplot2

Some functions will be slow, to optimize them it is better to know the number of cores on your computer:

cores <- detectCores()

We have gene trees from Genomicus, which need to be parsed to be useful. As the parsing is quite long, directly load into the R the rda file from here.

To check one tree, use

plotTree(trees.all[[n]]@phylo,mar=c(5,2,2,2)) # replace n with a number between 1 and 20000 to visualize a tree

data.tree<-trees.all[[n]]@data

nodelabels(col="blue",frame="none")

axisPhylo()

To add branch lengths, apply the function tree.nodedepth to tree.all; use mclapply with mc.cores = cores, instead of apply, for faster results.

To gain time, we are providing some precompted files with Tau values for different fish species.

zebrafish_tau.txt, spotted_gar_tau.txt, atlantic_cod_tau.txt, medaka_tau.txt, cavefish_tau.txt

Download them and import them into R, into one dataframe.

Add Tau data to the trees; this takes 5-10 minutes:

tau.annotated.trees<- mclapply(trees.all.modified, function(tree){

tree@data %<>% left_join(Tau.species.data, by = c("label")) return(tree)},

mc.cores=cores)

Prune the gene trees by applying drop.leaf, and remove pruned trees with NA data.

To distinguish speciation and duplication nodes and time calibrate gene trees, we need to modify the clade labels; apply modify.label.

We need to calibrate trees based on available speciation time points, with dates from TimeTree:

Calibration.time <- data.frame(

     label = c("Otophysi","Clupeocephala","Neopterygii","Acanthomorphata"),

     Mya = c(153, 230, 315, 148))

Calibration.time <- Calibration.time[order(-Calibration.time$Mya),]

Time calibrate your trees according to the these calibration time points:

Calibrated.timetrees <- lapply(trees.pruned2, 

                                      tree.calibrate,

                                      timeframe=Calibration.time,

                                      model="relaxed")

Remove the is.na and NULL time trees.

We need to remodify the gene tree clade labels to the initial ones, for PIC calculation as well as for node age storage. Apply remodify.label.

Assign node age to each node by applying node.age.

Now we will calculate the phylogenetic independent contrast (PIC) for our trees. Apply contrast.calculationto the calibrated trees with the right labels.

Summarize all results of "@data" slot of trees in a dataframe for easy calculation:

summary_all_tree<-bind_rows(lapply(pic.Calibrated,

     function(tree){

     gene_data<-tree@data 

     drops <- c("B","SIS")  # B and SIS have inconsistent types in binding rows, so we remove those columns

     tree@data<-gene_data[ , !(names(gene_data) %in% drops)]

     return(tree@data)})) # change variable names as needed

nodes_contrast <- summary_all_tree[which(!is.na(summary_all_tree$pic.abs)),]

To avoid biases due to very long branches of old duplication events, remove all events older than the oldest speciation, i.e. with node age > 315.

Finally, we can compare the PICs of speciation and duplication! Compare them using a Wilcoxon test. Examine the results, and plot them as a boxplot.

Exercise 4: Pairwise comparison (optional)

Apply the function pairwise.processing to the calibrated PICs.

Collect a dataframe of correlation results for different time points of speciation and duplication, defining a correlation function as cor.test(x, y,method="pearson") and using group_by(events,label).

Merge the correlation and calibration time data frames by=c("label"), and order them by Mya.

Add to this data frame a factor levels=c( "speciation", "duplication" ).

Restrict again to events <= 315 Mya.

Test for differences between duplication and speciation using an ANCOVA according to the factor duplication or speciation and to node age.

Finally, you can nicely plot the results with this code:

scatterplot(empirical.pairwise,Ancova.result)+

     geom_point(aes(size=R),alpha=1,show.legend=F)+

     geom_smooth(method = "lm", formula = y ~ x, size =0.5,se=F) # modify variable names as needed

That's all folks!

Last modified: Tuesday, 4 September 2018, 9:03 PM