Machine learning for bioinformatics and computational biology
Lausanne, 23-27 February 2015
This course introduces the theoretical basis of several important machine learning algorithms used in bioinformatics and illustrates them with example applications in genomics, signalling networks, population genomics and text mining.
Upon completion of this course, you will understand the statistical components and theory of machine learning algorithms. You will also know how to evaluate machine learning parameters and how to apply these tools to biological problems.
Recommended background
Knowledge requirements: basic mathematical background, knowledge of R and of one scripting language (Python or Perl, for example).
Technical requirements: laptop with R version 3.1.1 and Matlab installed, 3 GB of free disk space, sbv Improver account (register at https://sbvimprover.com/). Some data files will have to be downloaded before the course; precise instructions will follow later. If your university does not provide Matlab licenses, please contact us at firstname.lastname@example.org.
The registration fees for academics are 200 CHF. This includes course content material and coffee breaks. Participants from non-academic institutions should contact us before application.
The deadline for registration and free-of-charge cancellation is 9 February 2015.
We recommend 1 ECTS credit for this course, provided the exam at the end of the session is passed.
You are welcome to subscribe to the SIB courses mailing list to be informed of all future courses and workshops, as well as all important deadlines, using the form here.
For more information, please contact email@example.com.
13h30 to 17h - Fréderic Schutz (SIB and UniL)
In this session, we will cover the basics needed to work with classification and machine learning methods. Topics include supervised vs unsupervised methods, learning sets vs testing sets, overfitting, false positives and other performance measures, ROC curves, etc.
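As an illustration of one of these measures, here is a minimal Python sketch of how a ROC curve and its area under the curve (AUC) can be computed from classifier scores. The function names and the toy scores and labels are made up for this example and are not course material.

```python
from typing import List, Tuple


def roc_curve(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    labels are 1 (positive) or 0 (negative); a higher score means
    "more likely positive". One point is emitted per distinct score.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Lower the decision threshold step by step, from the highest score down.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy test set: scores from a hypothetical classifier and the true classes.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_auc = auc(roc_curve(toy_scores, toy_labels))
print("AUC on the toy test set:", toy_auc)
```

An AUC of 1 corresponds to a perfect ranking of positives above negatives, and 0.5 to a random one. To detect overfitting, the scores should come from a testing set that was not used for learning.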
9h to 17h - Manfred Claassen (SIB and ETHZ) - Machine learning methods
We will present how parameters are learned from data for various machine learning methods. We will focus on the corresponding optimization problems, both convex and non-convex. This survey should illustrate how all (or most) machine learning techniques share a lot of conceptual similarity when it comes to learning their parameters from data. Examples and exercises will cover learning signaling network parameters from single cell time course data.
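To make the idea of parameter learning as an optimization problem concrete, the following Python sketch fits a straight line by gradient descent on the mean squared error, which is a convex objective. The data, function name, learning rate and number of steps are assumptions chosen for this toy example; they are unrelated to the signaling network exercises of the course.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float],
             learning_rate: float = 0.05, steps: int = 2000) -> Tuple[float, float]:
    """Learn slope a and intercept b of y = a*x + b by gradient descent.

    The objective is the mean squared error, a convex function of (a, b),
    so gradient descent with a small enough step reaches the global optimum.
    """
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of mean((a*x + b - y)^2) with respect to a and b.
        grad_a = sum(2.0 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * (a * x + b - y) for x, y in zip(xs, ys)) / n
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Toy data generated from y = 2x + 1 without noise.
toy_xs = [0.0, 1.0, 2.0, 3.0, 4.0]
toy_ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(toy_xs, toy_ys)
print("slope:", slope, "intercept:", intercept)
```

For non-convex objectives the same update rule still applies, but the result may depend on the starting point, since the procedure can stop in a local optimum.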
9h to 12h30 - Philipp Bucher, René Dreos, Giovanna Ambrosini and Kumar Sunil (SIB and EPFL) - Machine learning applications in clinical diagnosis
Basic concepts and methods of machine learning will be illustrated with a real-life application. As an example, we will use the Lung Cancer Diagnostic Signature Challenge organized by sbv Improver, see https://sbvimprover.com/challenge-1/challenge/lung-cancer. During the practical, course participants will have the opportunity to test various combinations of feature selection methods, data reduction techniques, training algorithms and classifier types using the data provided by this challenge.
Reading material: Tarca et al. (2013). Strengths and limitations of microarray-based phenotype prediction: Lessons learned from the IMPROVER Diagnostic Signature Challenge. Bioinformatics. 29, 2892-2899.
Software needed: R version 3.1.1 (version 3.1.2 is not supported) with the following packages installed: affy; affyio; gcrma; limma; GEOquery; hgu133plus2.db; pROC; nnet and maPredictDSC
13h30 to 17h - Olivier François (Imag, Grenoble, France) - Inference of population structure and local adaptation using population genomic data
We will present a survey of methods that estimate population structure from multilocus genotypes, including recent developments based on machine learning techniques. We will also present new approaches to genome-wide ecological association studies, performing tests for association between genetic polymorphisms and ecological variables. We will show how the understanding of population structure helps to control the false discovery rate when using multiple tests. The practicals will be based on the R package LEA.
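For readers unfamiliar with false discovery rate control, here is a minimal Python sketch of the Benjamini-Hochberg step-up procedure, a standard way to control the false discovery rate across many tests. It is shown only to illustrate the concept; it is not claimed to be the method implemented in LEA, and the p-values below are invented.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each test, whether it is rejected at FDR level alpha.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is <= (k / m) * alpha, then reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])

    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            k_max = rank

    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Invented p-values, e.g. one association test per genetic polymorphism.
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
toy_rejected = benjamini_hochberg(toy_pvalues, alpha=0.05)
print("number of discoveries:", sum(toy_rejected))
```

The procedure assumes well-calibrated p-values. Unaccounted population structure can inflate the test statistics, which is one reason why modelling it matters before applying such a correction.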
9h to 17h - Patrick Ruch and Julien Gobeill (SIB and Unige) - Machine learning for text mining and data curation
We are going to introduce the main types of text mining applications (ad hoc retrieval, automatic text classification, information extraction...) and how they can be combined and assessed to build text mining pipelines. Some of the core normalization layers and data/software resources (terminologies, stemming, feed-back...) supporting all these tasks will be introduced.
The practical session will be based on the most recent gene ontology task of the BioCreative IV challenge (2013). We will see two subtasks: (1) filtering full-text articles in order to predict relevant passages for curation and (2) predicting functional annotations from the selected sentences. For both subtasks we will have a look at the data, then select and implement the best learning algorithm, and finally see how to evaluate our results.
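As a rough idea of what evaluating such results can look like, the Python sketch below computes precision, recall and F1 for a set of predicted sentences against a reference set. The sentence identifiers and the function name are hypothetical, and the official BioCreative IV evaluation may use different or additional measures.

```python
from typing import Iterable, Tuple


def precision_recall_f1(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Compare predicted items with reference (gold) items.

    precision = correct predictions / all predictions
    recall    = correct predictions / all reference items
    F1        = harmonic mean of precision and recall
    """
    predicted_set, gold_set = set(predicted), set(gold)
    true_positives = len(predicted_set & gold_set)

    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical sentence identifiers: what a system selected vs. what curators marked.
system_sentences = ["s1", "s2", "s3", "s4"]
curated_sentences = ["s2", "s4", "s5"]
p, r, f1 = precision_recall_f1(system_sentences, curated_sentences)
print("precision:", p, "recall:", r, "F1:", f1)
```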
Reading material: Overview of the gene ontology task at BioCreative IV.
9h to 11h - Optional: exam session for those willing to validate 1 ECTS.