Elective course: Data Analysis and Visualization in R

Attention: This year, the lecture is held in Python.

Module IN2339

Credit: 6 ECTS.

Moodle: https://www.moodle.tum.de/course/view.php?id=110723

Lecture Script: https://gagneurlab.github.io/dataviz/

FAQ: See W2526 frequently asked questions 

Contact:  teaching-gagneurlab(at)in.tum.de
 

When and where?

This lecture is given in the winter term.

Lectures take place every Tuesday from 14:00 to 16:00, starting on Tuesday, Oct 21, in room 5416.01.004 (Hörsaal 1, Jürgen-Manchot-Hörsaal), Lichtenbergstr. 2b, Garching. Additionally, lecture recordings from previous years will be made available.

Exercises:

The exercises consist of two sessions per week.

- Tutorials are held in person only: 90-minute sessions in which students solve exercises with interactive support from tutors. (Multiple sessions are offered at different times throughout the week; each student attends one.)

- The central exercise session is held in person: in this plenary session on Monday, 14:00 to 16:00 (“Interims I”, Hörsaal 2, 5620.01.102, Boltzmannstr. 5), the solutions to the homework are presented and discussed.

Make sure that you register twice in TUMonline: once for the lecture and once for the separate exercise.

Teaching material will be shared via Moodle. To access Moodle, you must be registered for the course in TUMonline.

Description

This module teaches methodologies and good practices of data science using R. The lecture is structured into three main parts, covering the major steps of data analysis:

1. Get the data: how to fetch and manipulate real-world datasets, and how to structure them ("tidy data") so that they are convenient to work with.

2. Look at the data: basic and advanced visualization techniques (grammar of graphics, unsupervised learning) will allow students to navigate and identify interesting signals in large and complex datasets and formulate hypotheses.

3. Conclude: concepts of statistical testing will allow conclusions to be drawn about the hypotheses raised. Also, methods from supervised learning will allow us to model data and build accurate predictors.

Each week, the lecture is accompanied by exercises. During the exercises, combining the concepts seen in the lecture will allow students to perform more involved data analysis tasks.
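As a small illustration of the "tidy data" idea from step 1, the following sketch reshapes a "wide" table (one column per year) into a tidy one (one observation per row). The table contents and the helper name `to_tidy` are made up for this example and are not taken from the lecture script; it uses only the Python standard library.

```python
# Hypothetical "wide" table: one row per city, one column per year.
wide = [
    {"city": "Munich", "2021": 14.2, "2022": 15.1},
    {"city": "Garching", "2021": 13.8, "2022": 14.9},
]


def to_tidy(rows, id_key, var_name, value_name):
    """Return one record per (id, variable) pair, i.e. one observation per row."""
    tidy_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # the identifier is repeated on every tidy row
            tidy_rows.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy_rows


tidy = to_tidy(wide, id_key="city", var_name="year", value_name="temperature")
for record in tidy:
    print(record)
```

In the tidy form, every variable (city, year, temperature) has its own column, which is the layout that plotting and modeling tools typically expect.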

Required background and Computer setup

Prerequisites

The theoretical aspects of data analysis are kept to a minimum in this module. However, basic knowledge of probability is required, e.g. Discrete Probability Theory (IN0018) or an equivalent lecture in probability and statistics. Chapters 13-15 ("Introduction to Statistics with R", "Probability" and "Random variables") of the book "Introduction to Data Science" (https://rafalab.github.io/dsbook/) make a good refresher. Make sure all concepts are familiar to you. Check your knowledge by trying the exercises.

Coding interface

We recommend Google Colab, which runs in a browser and requires no installation on your computer. It is the most convenient option, but you will need a Google account. If you prefer a local installation, we recommend using Visual Studio Code. You can find many tutorials online on how to set this up. Note that we do not provide support for local installations. You will have to maintain your environment and packages yourself.

Who can attend

The module is an elective module for many study programs. Among others, it is in the catalogue of:

BSc and MSc Bioinformatics

MSc Informatics

MSc Information Systems

MSc Data Engineering and Analytics

Medicine students

BSc and MSc 'Management and technology'

Recommended reading

Lecture Script: https://gagneurlab.github.io/dataviz/

R for Data Science, by Garrett Grolemund and Hadley Wickham

Introduction to Data Science, by Rafael A. Irizarry. 

Topics

R programming basics, report generation with R Markdown

Importing, cleaning and organizing data (tidy data)

Plotting and grammar of graphics

Unsupervised learning (hierarchical clustering, k-means, PCA)

Drawing robust interpretations (empirical testing by sampling, classical statistical tests)

Supervised learning (regression, classification, cross-validation)
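To give a flavor of "empirical testing by sampling", here is a minimal sketch of a two-sided permutation test for a difference in group means. The data, the function name `permutation_test` and its parameters are assumptions made for this illustration and are not taken from the course material; only the Python standard library is used.

```python
import random


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in means of a and b."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled values simulates the null hypothesis
        # that group labels are unrelated to the measurements.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # The +1 correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up measurements for two groups.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [11.0, 12.0, 13.0, 14.0, 15.0]
p_value = permutation_test(group_a, group_b)
print(f"estimated p-value: {p_value:.4f}")
```

The p-value is the fraction of label shufflings that produce a difference at least as large as the observed one, so no distributional assumption is needed.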

Evaluation

The final mark will be determined by a 1.5-hour written exam. The exact exam date has not yet been set. It will be set centrally by the School in due time. 

Teaching team

This lecture is given by a team of scientists with long experience in high-dimensional data analysis in the field of genomics: Prof. Julien Gagneur and members of his lab. 

For questions *not covered in the FAQ*, please contact us by email at teaching-gagneurlab@in.tum.de.

If you cannot register for the exam, please contact the secretary of your study program. We cannot register students.