Short Course 5

Compositional Data Analysis (CoDa Course)

Room: TBD (Sunday, 8 July 2017, 9:00 am – 5:00 pm)

Presenters

  • Josep-Antoni Martín-Fernández (University of Girona) Girona, Spain
  • Jan Graffelman (Technical University of Catalonia) Barcelona, Spain

Summary of Course

The aim of this course is to raise awareness of the peculiarities of compositional data (CoDa) and to help practitioners avoid common pitfalls in the analysis of CoDa. The topic is timely, given the upsurge of high-dimensional CoDa sets in the molecular sciences (e.g. omics and microbiome data).

The course will provide an introduction to theoretical and practical aspects of the statistical analysis of CoDa. It will provide mathematical background and an informal discussion forum on more advanced modeling methods. CoDa are vectors whose components show the relative importance of the parts of a whole. Typical examples are data expressed as percentages, ppm, ppb or the like. This type of data appears in many applications, and the importance of a consistent statistical methodology for its analysis should not be underestimated. The log-ratio approach to CoDa was introduced in the 1980s. Since then, steady progress has been made in understanding the geometry peculiar to the compositional sample space, the D-part simplex.

The course will consist of lectures and exercises. Exercises are done with the freeware package CoDaPack (http://ima.udg.edu/codapack/). Some datasets and their particular problems will be presented, analysed and discussed interactively. Visit http://www.compositionaldata.com for further information.
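To make the idea of "relative importance of parts of a whole" concrete, the following minimal Python sketch (not part of the course material, which uses CoDaPack) rescales a made-up 3-part composition to percentages and to ppm and checks that a log-ratio between two parts is the same in both units. The function names and the example values are illustrative assumptions.

```python
import math

def closure(parts, total=1.0):
    """Rescale strictly positive parts so that they sum to `total`."""
    s = sum(parts)
    return [total * p / s for p in parts]

def pairwise_log_ratio(parts, i, j):
    """Natural log of the ratio between part i and part j."""
    return math.log(parts[i] / parts[j])

# Hypothetical raw amounts of three parts (illustrative values only).
raw = [2.0, 6.0, 12.0]

as_percent = closure(raw, total=100.0)  # same composition in percent
as_ppm = closure(raw, total=1e6)        # same composition in ppm

# The log-ratio depends only on the ratio of the parts, not on the unit.
lr_percent = pairwise_log_ratio(as_percent, 0, 1)
lr_ppm = pairwise_log_ratio(as_ppm, 0, 1)
print(lr_percent, lr_ppm)
```

Because only ratios between parts enter the calculation, the choice of closure constant (1, 100 or 10^6) does not affect the result.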

Outline

This course will introduce the current state of the art in CoDa analysis and will cover the following topics:

  1. Hypotheses underlying statistical data analysis (simplex sample space and scale).
  2. The Aitchison geometry: Euclidean space, inner product, norm and distance.
  3. Log-ratio coordinate representation; distributions on the simplex.
  4. Exploratory analysis (centering, variation array, biplot and balances-dendrogram).
  5. Pre-processing irregular data: missing data and zero values.
  6. Introduction to available software: CoDaPack. 
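As a rough illustration of topics 2 and 3, the sketch below computes one common log-ratio representation, the centered log-ratio (clr), and the Aitchison distance as the Euclidean distance between clr vectors. It is written in plain Python for brevity and is an assumption-laden example, not the course's CoDaPack workflow; the function names and the two example compositions are hypothetical, and all parts are assumed strictly positive (zeros need the pre-processing mentioned in topic 5).

```python
import math

def clr(parts):
    """Centered log-ratio: log of each part minus the mean of the logs."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr vectors of x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

# Two hypothetical 3-part compositions (illustrative values only).
x = [0.1, 0.3, 0.6]
y = [0.2, 0.3, 0.5]

d = aitchison_distance(x, y)

# Rescaling x (e.g. proportions to percentages) leaves the distance unchanged.
d_rescaled = aitchison_distance([100 * v for v in x], y)
print(d, d_rescaled)
```

The clr vector always sums to zero, so it is a constrained representation; other log-ratio coordinates (for example those based on balances) avoid that constraint.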

Learning Outcomes

At the end of the course, participants should:

  1. Be able to recognize datasets that are compositional in nature and explore such data with appropriate numerical and graphical summaries.
  2. Be able to represent the data in log-ratio coordinates, and use these coordinates in subsequent modeling and multivariate analysis.
  3. Be able to make biplots of compositional data sets and interpret them appropriately.
  4. Know how to handle irregular data in CoDa sets with specialized software (CoDaPack and compositional packages in the R environment).

Prerequisites

It is recommended that attendees have taken at least some first-semester courses in statistics, algebra and calculus. Basic knowledge of multivariate statistics may also be helpful.

About the Instructors

Josep-Antoni Martín-Fernández has a degree in Mathematics. He received his PhD from the Technical University of Catalonia for work on ‘Measurements of difference and non-parametric classification of CoDa’. He is currently Associate Professor at the Department of Computer Science, Applied Mathematics and Statistics of the University of Girona, Spain. His interests lie primarily in the statistical analysis of compositional data, with more than 50 publications related to the topic. His research focuses on ‘Cluster Analysis of Compositional Data’ and ‘Rounded Zeros and Missing Data’. For many years he has also served as principal investigator of a publicly funded research project on CoDa. He has taught many CoDa courses: the CoDa course held in Girona in the first week of July (since 2012), and the one-day introductory courses at the CoDa workshops in 2005 and 2008 (University of Girona), 2011 (Sant Feliu de Guíxols, Spain) and 2013 (Vorau, Austria). More information can be found at http://ima.udg.edu/~jamf/.

Jan Graffelman holds a doctorate degree in Biology (University of Groningen, Netherlands) and a PhD in Statistics from the Technical University of Catalonia. He is Associate Professor at the Department of Statistics and Operations Research of the Technical University of Catalonia in Barcelona, Spain. His main research interests are in statistical genetics, multivariate analysis and compositional data analysis. He is principal investigator of a publicly funded research project on compositional data, in collaboration with the group led by Martín-Fernández in Girona. He also taught in the CoDa course at the University of Girona in 2016. More information can be found at http://www-eio.upc.edu/~jan.

Textbook

There are recent textbooks on compositional data analysis, but these are not required for the course. A syllabus with the slides of the course, computer exercises, web references and bibliography will be provided for course participants.