Short Course 2

Multivariate Dimension Reduction for Biological Data Integration

Room: TBD (Sunday, 8 July 2017 from 9:00 am – 5:00 pm)


  • Kim-Anh Lê Cao (The University of Melbourne) Melbourne, Australia
  • Sébastien Déjean (Institute of Mathematics, Paul Sabatier University) Toulouse, France


With the advent of high-throughput sequencing technologies, multivariate dimension reduction methods offer powerful statistical analyses for gaining a first understanding of large and complex data sets. They provide insightful visualisations, are efficient on large data sets and make few assumptions about the distribution of the data. In addition, they are highly flexible, as both unsupervised (exploratory) and supervised (classification) analyses can be performed. The latest innovative developments in this exciting and fast-moving area of research include the integration of different types of data sets and variable selection.

Data integration is often required in a systems biology context when experiments are performed on the same individuals but at different molecular levels. Combining heterogeneous ‘omics data (transcriptomics for the study of transcripts, proteomics for proteins, metabolomics for metabolites, metagenomics for micro-organisms, etc.) can lead to important biological insights, provided that relevant information (variables) can be identified during the integration process.

This hands-on course will introduce key concepts in multivariate dimension reduction, first with Principal Component Analysis, and then by introducing innovative approaches for statistical integration of multiple data sets to select biological features. Each methodology will be illustrated on real biological studies.


The case studies will use our R package mixOmics. (Instructors will alternate theory and application.)

  • Morning Session – Multivariate analysis of one biological data set
    • Introduction to Principal Component Analysis.
    • Useful graphical outputs to visualise data.
    • LASSO penalisation for feature selection.
  • Afternoon Session – Multivariate integration of two data sets and more
    • Introduction to Canonical Correlation Analysis and Projection to Latent Structures models to integrate two data sets.
    • Ridge and LASSO penalisations to analyse large data sets.
    • A generalised framework to integrate more than two data sets. 

Learning Outcomes

At the end of the course, participants should:

  1. Understand the fundamental principles of Principal Component Analysis as a first multivariate, projection-based dimension reduction technique.
  2. Perform statistical integration and feature selection with multivariate methodologies.
  3. Apply those methods to high-throughput biological studies.


Good R programming skills are necessary to make the most of this hands-on course. Participants should also have basic knowledge of linear regression, statistical learning and matrix algebra.

About the Instructors

Kim-Anh Lê Cao (The University of Melbourne, Melbourne, Australia) was awarded her PhD in 2008 at the Université de Toulouse, France. She then moved to Australia as a postdoctoral fellow at the University of Queensland, Brisbane, Australia. Since the beginning of her career, Kim-Anh has initiated a wide range of valuable collaborative and research opportunities in both statistics and molecular biology. Her research interests are multidisciplinary: they focus on the statistical characterisation of molecular biological systems, and she is interested in developing sound statistical methods to answer new biological questions arising from these frontier molecular technologies. Her main research focus is on variable selection for biological data (‘omics’ data) coming from different functional levels by means of multivariate dimension reduction approaches.


Since 2009, her team has been developing the R toolkit mixOmics, dedicated to the exploration and integration of ‘omics’ data. mixOmics attracts a growing number of users worldwide (>21K CRAN downloads from unique IP addresses in 2016). Kim-Anh is a senior lecturer at the University of Melbourne and regularly runs statistical training workshops and short seminar series, as well as mixOmics multi-day workshops (10 mixOmics workshops, totalling 18 days since 2014).

Sébastien Déjean was awarded his PhD in Applied Statistics in 2002 at the Université de Toulouse, France. Previously, he spent four years in the Biometry lab at the French National Institute for Agricultural Research.

He is a research engineer at the Toulouse Mathematics Institute (Université de Toulouse, France) and works in close collaboration with researchers across several disciplines, including high-throughput biology, chemistry and information retrieval. Sébastien is an expert in statistical data analysis, and he contributes to the development of several R packages, including mixOmics as a core member. Sébastien teaches introductory statistics and scientific software training workshops to scientific and administrative staff and has been teaching the mixOmics workshops in English.