Introduction to module 1
Read and summarize your data
This first module is dedicated to the understanding and handling of your data.
You will first learn to import data, then to compute summary metrics from it.
The basis of any statistical analysis is the underlying data.
A dataset is typically presented as a file containing information formatted as a table:
each line corresponds to an observation (individual, sample, ...)
each column corresponds to a measured variable (height, sex, gene expression, ...)
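As a minimal sketch of this layout, the table below is built directly in Python rather than read from a file. The variable names echo the examples above, but the values and the row labels (individual_1, ...) are invented purely for illustration.

```python
import pandas as pd

# Hypothetical values, invented only to illustrate the table layout:
# one row per observation, one column per measured variable.
df = pd.DataFrame(
    {
        "height": [172, 165, 180],
        "sex": ["female", "male", "female"],
    },
    index=["individual_1", "individual_2", "individual_3"],
)

# shape is (number of rows, number of columns)
n_observations, n_variables = df.shape

print(df)
print(n_observations, "observations,", n_variables, "variables")
```

Here df.shape gives the number of observations first and the number of variables second, which is a quick way to check that a table has the layout you expect.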
To read data files and manipulate the data, you will rely on Pandas.
Pandas is a "high-level" module designed for statistics and exploratory analysis. One of its great strengths is the DataFrame, which emulates much of the convenient behavior and syntax of its namesake in the R language.
First, make sure that Pandas is correctly installed on your machine. You will find step-by-step instructions at this link: Pandas.
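One quick way to check the installation, offered here as a suggestion rather than as part of the official instructions, is to import the module and print its version from the Python environment you plan to use for the course.

```python
import pandas as pd

# If this import raises ModuleNotFoundError, pandas is not installed
# in the Python environment you are currently running.
print(pd.__version__)
```

The alias pd is the usual convention for Pandas.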
In this course, we will demonstrate reading operations on several datasets to showcase the different real-life situations you may be confronted with.
For the practicals we will use the file heartData_simplified.csv, which gathers various measurements on patients with or without heart disease. It is a cleaned and simplified version of the UCI heart disease dataset.
Description of the columns of heartData_simplified.csv
age: Patient age in years
sex: Patient sex
chol: Cholesterol level in mg/dl.
thalach: Maximum heart rate during the stress test
oldpeak: Depression of the ST segment induced by exercise, relative to rest.
ca: Number of main blood vessels coloured by the radioactive dye. The number ranges from 0 to 3.
thal: Results of the blood flow observed via the radioactive dye.
fixed -> fixed defect (no blood flow in some part of the heart)
normal -> normal blood flow
reversible -> reversible defect (a blood flow is observed but it is not normal)
target: Whether the patient has a heart disease or not
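To give a first idea of how such a table can be read and summarized with Pandas, here is a minimal sketch. It does not use the real file: the column names are taken from the description above, but every value is invented, and the comma separator and the way sex and target are encoded are assumptions that may differ from heartData_simplified.csv.

```python
import io

import pandas as pd

# Stand-in for heartData_simplified.csv. Column names follow the
# description above; all values are invented for illustration, and the
# comma separator and the sex/target encodings are assumptions.
csv_text = """age,sex,chol,thalach,oldpeak,ca,thal,target
63,male,233,150,2.3,0,fixed,no
41,female,204,172,1.4,0,normal,no
57,male,289,124,1.0,3,reversible,yes
"""

# read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

# On your machine you would pass the path to the file instead, e.g.
# df = pd.read_csv("heartData_simplified.csv")

print(df.dtypes)                  # type inferred for each column
print(df.describe())              # summary metrics for numeric columns
print(df["thal"].value_counts())  # number of rows per category
```

describe() only summarizes numeric columns by default, so a categorical column such as thal is better inspected with value_counts().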
Module 1 is made up of Task 1 and Task 2. The tasks contain two kinds of practice: formative quizzes, and steps of the activity, which let you practice the code on sample data representative of the data you will use in your professional activity.
The activity (STEPS) runs throughout the course using the same data. These steps are the practice you are required to do on your own machine to ensure that you can apply the code to your own data later on in your professional activity. They are easily recognizable by their blue background color.
The quizzes embedded in the course do not use the same data as the activity. They are true-or-false or multiple-choice questions that highlight important code sequences. They have an orange background color.
You will also find theoretical information, which has a beige background color.