Exercises
Visualizing a dataset
The file students.csv contains information collected from biology students at the University of Lausanne. You can load the file in R using the command:
students <- read.csv("students.csv")
You can view the content of the dataset using the command:
View(students)
Look at the different variables, and think about how you would visualize each of them separately.
As a quick reminder about R commands:
- you can select a variable by using the "$" operator: students$height
Then, think about how best to display them so that you can see whether each variable is distributed differently between males and females.
In particular:
- draw a scatterplot of students' height vs weight, specifying a title and using different colours according to sex.
- plot two histograms comparing the distribution of heights for both sexes. Which issues do you have to solve?
You can try this with both base R and ggplot2.
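As one possible starting point, here is a minimal ggplot2 sketch on simulated data. The column names "weight" and "sex", the units, and all the values are assumptions made for illustration; only "height" is known from the text above, so adapt the names to what you see in View(students).

```r
library(ggplot2)

# Simulated data standing in for students.csv. The columns "weight" and "sex"
# are assumed names and may differ in the real file.
set.seed(1)
toy_students <- data.frame(
  sex    = rep(c("F", "M"), each = 50),
  height = c(rnorm(50, mean = 166, sd = 6), rnorm(50, mean = 179, sd = 7))
)
toy_students$weight <- 0.9 * toy_students$height - 90 + rnorm(100, sd = 7)

# Scatterplot with a title and one colour per sex
p_scatter <- ggplot(toy_students, aes(x = height, y = weight, colour = sex)) +
  geom_point() +
  labs(title = "Height vs weight", x = "Height", y = "Weight")

# Two histograms on the same axes: identical bins for both sexes,
# no stacking (position = "identity") and transparency so both stay visible
p_hist <- ggplot(toy_students, aes(x = height, fill = sex)) +
  geom_histogram(position = "identity", alpha = 0.5, binwidth = 2.5)

print(p_scatter)
print(p_hist)
```

The histogram layer hints at the usual issues: both groups need the same bins and the same axis range, and overlapping bars hide each other unless they are transparent or drawn in separate panels (for example with facet_wrap(~ sex)).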
Survey on graphics
The file quiz.csv provides the average of the scores that you gave for the "utility" and "aesthetic" of the graphs that were shown to you. You can load it in R using the command:
quiz <- read.csv("quiz.csv")
How would you visualize this data? Propose different ways to do this.
Note: if you want to use ggplot2 and split the data according to the type of score (utility vs aesthetic, for example), you will need to convert the data to the "long" format (only one value per row, and "type of score" becomes a separate variable). You can do this with the melt() command from the reshape2 package:
library(reshape2)
quiz_long <- melt(quiz)
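To see what the conversion does, here is a small sketch on a made-up table. The column names (graph, utility, aesthetic) and the scores are assumptions and may not match quiz.csv.

```r
library(reshape2)

# Made-up "wide" table: one row per graph, one column per type of score.
# Column names and values are illustrative assumptions.
toy_quiz <- data.frame(
  graph     = c("A", "B", "C"),
  utility   = c(3.2, 4.1, 2.5),
  aesthetic = c(2.8, 3.9, 4.4)
)

# id.vars names the columns that identify a row; every other column is
# stacked into a pair of columns (type of score, score)
toy_long <- melt(toy_quiz, id.vars = "graph",
                 variable.name = "score_type", value.name = "score")
print(toy_long)
```

When id.vars is omitted, as in melt(quiz), the function guesses the identifier columns from the non-numeric ones and names the new columns "variable" and "value".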
Additional question. On the Moodle website, you will also find a file with the results of the same survey obtained from different people. How would you combine the information from these two files in order to show how differently the two groups of participants answered this survey?
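One way to approach the combination, sketched on made-up tables: label each table with its origin, then stack them so that the group becomes an ordinary variable that can be mapped to a colour or a panel. The table contents, column names and group labels below are all hypothetical.

```r
# Two made-up tables standing in for the two survey files
quiz_a <- data.frame(graph = c("A", "B"), utility = c(3.2, 4.1), aesthetic = c(2.8, 3.9))
quiz_b <- data.frame(graph = c("A", "B"), utility = c(3.6, 3.0), aesthetic = c(3.1, 4.2))

# Add a column recording where each row comes from, then stack the tables
quiz_a$group <- "group 1"
quiz_b$group <- "group 2"
quiz_both <- rbind(quiz_a, quiz_b)
print(quiz_both)
```

This assumes that both files have the same columns; rbind() stops with an error if the column names differ.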
Timecourse experiment
The timecourse.csv file (which can be read in the same way as the previous files) contains measurements on 10 animals (5 wild-type, 5 knock-out), taken at 5 time points.
How would you represent the data?
How would you represent the data if we are interested in seeing both the individual data points and the average per group?
Note: you will probably need to convert the data to the "long" format, as described above.
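A possible sketch for showing individual animals together with the group average, using simulated data that is already in the long format. The column names (animal, time, genotype, value) and the values are assumptions; the linewidth argument assumes ggplot2 3.4.0 or later.

```r
library(ggplot2)

# Simulated long-format data: 10 animals x 5 time points.
# Column names and values are illustrative assumptions.
set.seed(2)
toy_tc <- expand.grid(animal = paste0("animal", 1:10), time = 1:5)
toy_tc$genotype <- ifelse(toy_tc$animal %in% paste0("animal", 1:5), "WT", "KO")
toy_tc$value <- rnorm(50, mean = ifelse(toy_tc$genotype == "KO", 1.5, 1) * toy_tc$time)

p_tc <- ggplot(toy_tc, aes(x = time, y = value, colour = genotype)) +
  # one thin line and one set of points per animal
  geom_line(aes(group = animal), alpha = 0.4) +
  geom_point(alpha = 0.6) +
  # one thick line per genotype: the mean at each time point
  stat_summary(aes(group = genotype), fun = mean, geom = "line", linewidth = 1.2)

print(p_tc)
```

Drawing the raw trajectories faintly and the summary boldly keeps both levels readable; error bars could be added with a second stat_summary() layer.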
Country data
The package gapminder contains a gapminder data frame, with information about
- the income in different countries
- the life expectancy in these countries
- the population in these countries
- the region (continent) these countries belong to (5 continents in the packaged data)
How would you plot:
- the link between income and life expectancy, also showing the population and the region, for a recent year (e.g. 2007)
- the changes observed in income and life expectancy since the beginning of the measurements?
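A minimal sketch of both plots with ggplot2, using the columns of the packaged gapminder data frame (gdpPercap for income, lifeExp, pop, continent, year). The logarithmic income axis and the choice of one path per country are design assumptions, not the only reasonable answers.

```r
library(ggplot2)
library(gapminder)

# One recent year: position shows income and life expectancy,
# point size shows population, colour shows continent
gap2007 <- subset(gapminder, year == 2007)
p_bubble <- ggplot(gap2007, aes(x = gdpPercap, y = lifeExp,
                                size = pop, colour = continent)) +
  geom_point(alpha = 0.6) +
  scale_x_log10() +
  labs(title = "Income and life expectancy, 2007",
       x = "GDP per capita (log scale)", y = "Life expectancy (years)")

# All years: one path per country, followed in the order of the rows,
# which are sorted by year within each country
p_change <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp,
                                  group = country, colour = continent)) +
  geom_path(alpha = 0.4) +
  scale_x_log10()

print(p_bubble)
print(p_change)
```

With this many countries the paths overlap heavily; faceting by continent or highlighting a few countries are possible next steps.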
Geneva official chestnut tree
In Geneva, a chestnut tree is observed at the beginning of every year; the opening of the first leaf is recorded and announced publicly as marking the beginning of spring. The series of records starts in 1818 and is an interesting record of climate change.
The file chestnut_tree.csv contains all the measurements, with the following information:
- the year of the measurement
- the tree (over more than 200 years, several trees have been used)
- the date of the opening of the first leaf
- the number of days between the beginning of the year and that date.
Plot this data to show whether it indicates climate warming (e.g. whether the first leaf tends to appear earlier over the years), distinguishing the different trees used. How would you indicate the trend on the graph?
Once you've created your own graph, you can have a look at the corresponding Wikipedia page, and either try to reproduce some of the features of the graph displayed there, or make suggestions to improve it.
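For the trend question, one option is to overlay a smoothed curve on the raw points. The sketch below uses simulated values, not the real record: the column names (year, tree, day_of_year), the periods assigned to each tree and the downward drift are all invented for illustration.

```r
library(ggplot2)

# Simulated series, NOT the real measurements. Column names, tree periods
# and the drift are invented for illustration.
set.seed(3)
toy_tree <- data.frame(year = 1818:2020)
toy_tree$tree <- cut(toy_tree$year,
                     breaks = c(1817, 1870, 1920, 1970, 2020),
                     labels = paste("tree", 1:4))
toy_tree$day_of_year <- 100 - 0.3 * (toy_tree$year - 1818) +
  rnorm(nrow(toy_tree), sd = 12)

p_tree <- ggplot(toy_tree, aes(x = year, y = day_of_year)) +
  # raw observations, coloured by the tree that was observed
  geom_point(aes(colour = tree)) +
  # a single smoothed trend across all trees
  geom_smooth(method = "loess", formula = y ~ x, colour = "black") +
  labs(x = "Year", y = "Day of the year of the first leaf")

print(p_tree)
```

Replacing method = "loess" with method = "lm" would show a straight-line trend instead; which one is more appropriate depends on whether the change looks steady over the whole period.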
Visualizing gene expression data
The file gene_expression.csv contains information obtained from a transcriptomics experiment (one condition compared to another). A collaborator wants to select genes that have an adjusted p-value below 0.0001, as well as a differential expression (logFC) of at least 1 (in absolute value).
How would you represent the data in a way that shows which genes are selected by these filters and which ones are not?
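A common choice for this kind of data is a "volcano plot" (logFC on the x axis, -log10 of the adjusted p-value on the y axis), with the two thresholds drawn as lines. The sketch below uses simulated values; the column name adj_pval is an assumption and may differ in gene_expression.csv.

```r
library(ggplot2)

# Simulated results, not real data. "adj_pval" is an assumed column name.
set.seed(4)
toy_genes <- data.frame(logFC = rnorm(2000, sd = 1.2))
toy_genes$adj_pval <- runif(2000)^(1 + 3 * abs(toy_genes$logFC))

# The collaborator's two filters combined into one logical column
toy_genes$selected <- toy_genes$adj_pval < 0.0001 & abs(toy_genes$logFC) >= 1

p_volcano <- ggplot(toy_genes, aes(x = logFC, y = -log10(adj_pval),
                                   colour = selected)) +
  geom_point(alpha = 0.5) +
  # dashed lines mark the thresholds used by the filters
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  geom_hline(yintercept = -log10(0.0001), linetype = "dashed") +
  scale_colour_manual(values = c("FALSE" = "grey60", "TRUE" = "firebrick"))

print(p_volcano)
```

Storing the filter as a column keeps the selection rule in one place, so the colours and the threshold lines cannot drift apart if the cut-offs change.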