This document contain exercises on generalized linear models. You will use several packages. If a package is not found, you will need to install it. For the faraway package for example:
install.packages("faraway")
To get help for a function, use the help function in R
help(glm)
Here, we will study a data set containing food ratings for a number of restaurants, as well as information about whether or not they are included in the Michelin guide. The data set (called MichelinFood.txt) can be downloaded from the webpage of the book A Modern Approach to Regression with R by Simon J Sheather. Download the file and read it into R
michelin <- read.delim("MichelinFood.txt",header=TRUE,sep="\t",as.is=TRUE)
michelin
## Food InMichelin NotInMichelin mi proportion
## 1 15 0 1 1 0.00
## 2 16 0 1 1 0.00
## 3 17 0 8 8 0.00
## 4 18 2 13 15 0.13
## 5 19 5 13 18 0.28
## 6 20 8 25 33 0.24
## 7 21 15 11 26 0.58
## 8 22 4 8 12 0.33
## 9 23 12 6 18 0.67
## 10 24 6 1 7 0.86
## 11 25 11 1 12 0.92
## 12 26 1 1 2 0.50
## 13 27 6 1 7 0.86
## 14 28 4 0 4 1.00
The Food column represents the ranking of the food. The columns InMichelin and NotInMichelin contain the number of restaurants with the given food ranking that are/are not listed in the Michelin guide, respectively. The mi column contains the total number of examined restaurants with the given food ranking. Finally, the proportion column contains the fraction of examined restaurants with the given food ranking that were included in the Michelin guide.
Start by graphically exploring the data
Fit a GLM using a binomial model for the response, using the food ranking as the predictor.
Predict the probabilities for a number of potential food rankings xnew, and plot a smooth function
xnew <- data.frame(Food = seq(from = 14, to = 30, length.out = 50))
Check the model by looking at the residual deviance, other residuals and especially the quantile residuals
Check the model for potential influencial observations.
The next data set comes from Venables and Ripley (page 190) and encodes the number of moths that died after exposure to different doses of trans-cypermethrin. 20 male and 20 female moths were exposed to each dose, and the number of deaths in each group was recorded. We generate a data frame containing the observed values. Since the doses are powers of two, we use log2(dose) as the predictor rather than the actual dose.
moth <- data.frame(sex = rep(c("male", "female"), each = 6),
dose = log2(rep(c(1, 2, 4, 8, 16, 32), 2)),
numdead = c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16))
moth$numalive <- 20 - moth$numdead
Fit a GLM using the sex and dose as predictors. Do we need to include an interaction term in the model?
Does the model fit well? Perform an analysis of deviance
predict the probabilities for different doses and include a smooth line in the plots
xnew <- data.frame(sex = rep(c("male", "female"), each = 30),
dose = rep(seq(from = 0, to = 5, length.out = 30), 2))
The third example of a binomial response considers an experiment where beetles were exposed to carbon disulphide at various concentrations, and the number of beetles who died within five hours were recorded. The data set is studied in the book by Dobson (2002), and comes originally from Bliss (1935).
beetles <- data.frame(dose = c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113,1.8369, 1.8610, 1.8839),
dead = c(6, 13, 18, 28, 52, 53, 61, 60),
alive = c(51, 47, 44, 28, 11, 6, 1, 0))
new_beetle <- data.frame(dose = seq(from = 1.5, to = 1.9, length.out = 100))
The following exercises was taken from faraway, 2006, extending linear models.
The national institute of Diabetes and digestive and kidney diseases conducted a study on 768 adult female Pima Indians living near Phoenix. the purpose of the study was to investigate factors related to diabetes.
library(faraway)
data("pima")
new_pima <- data.frame(pregnant=1,glucose=99,diastolic=64,triceps=22,insulin=76,bmi=27,diabetes=0.25,age=25)
The data for this exercise study infant respiratory disease, namely the proportions of children developing bronchitis or pneumonia in their first year of life by type of feeding, and sex. Data may be found in Payne (1987) and Faraway (2006)
library(faraway)
data(babyfood)
The dvisits data comes from the Australian Health Survey of 1977-1978 and consist of 5190 single adults where young and old have been oversampled. They are available as:
library(faraway)
data("dvisits")
library(faraway)
data("salmonella")
the salmonella data was collected in a salmonella reverse mutagenicity assay. The predictor is the dose level of quinoline and the response is the numbers of revertant colonies of TA98 salmonella observed on each of 3 replicate plates.
Here we will consider a data set used by Erling B. Andersen in 1977, and containing the number of lung cancer cases recorded in different age categories in four different Danish cities. The data set is available from the ISwR package.
library(ISwR)
data(eba1977)
eba1977
## city age pop cases
## 1 Fredericia 40-54 3059 11
## 2 Horsens 40-54 2879 13
## 3 Kolding 40-54 3142 4
## 4 Vejle 40-54 2520 5
## 5 Fredericia 55-59 800 11
## 6 Horsens 55-59 1083 6
## 7 Kolding 55-59 1050 8
## 8 Vejle 55-59 878 7
## 9 Fredericia 60-64 710 11
## 10 Horsens 60-64 923 15
## 11 Kolding 60-64 895 7
## 12 Vejle 60-64 839 10
## 13 Fredericia 65-69 581 10
## 14 Horsens 65-69 834 10
## 15 Kolding 65-69 702 11
## 16 Vejle 65-69 631 14
## 17 Fredericia 70-74 509 11
## 18 Horsens 70-74 634 12
## 19 Kolding 70-74 535 9
## 20 Vejle 70-74 539 8
## 21 Fredericia 75+ 605 10
## 22 Horsens 75+ 782 2
## 23 Kolding 75+ 659 12
## 24 Vejle 75+ 619 7
The cases column contains the number of lung cancer cases for each city and age category.
Model this count using the city and age category as predictors. Fit a Poisson GLM to the data. Is the fit appropriate?
In the previous model, we are not considering the number of potential cases in each group (ie the population size). Modify the model by using an offset which takes the population size into account.
Fit a binomial model to the data by considering success as being lung cancer cases and failures as being (\(population size-number of cases\)).
Compare with the Poisson model.
Here we will study a data set from a cross-sectional study of patients with malignant melanoma. For 400 patients, the site of the tumor and its histological type was recorded. The data are analyzed by Dobson (2002).
mel <- matrix(c(22, 16, 19, 11, 2, 54, 33, 17, 10, 115, 73, 28),nrow = 4, ncol = 3)
colnames(mel) <- c("headneck", "trunk", "extrm")
rownames(mel) <- c("hutch", "superf", "nodular", "indet")
mel
## headneck trunk extrm
## hutch 22 2 10
## superf 16 54 115
## nodular 19 33 73
## indet 11 17 28
The question of interest is whether there is any association between the tumor type and the site (that is, if one tumor type is more or less common than expected in a given site). We can test the hypothesis of no association by a conventional chi-square test:
chisq.test(mel)
##
## Pearson's Chi-squared test
##
## data: mel
## X-squared = 65.813, df = 6, p-value = 2.943e-12
If there is no association between the two variables, the expected number of patients in a given cell is proportional to the product of the marginal totals for the row and column categories under consideration. On a log-scale, this would correspond to summing the site and tumor type effects. We can thus test the association between the tumor type and the site by comparing a model with only the main effects of site and tumor type to a model including also the interaction between the two. First, we need to reformat the data into so called “long” format:
require(reshape2)
mel.long <- melt(mel, varnames = c("tumtype", "site"), value.name = "freq")
Now we can fit the two models using Poisson regression with a log link.
mel.main <- glm(freq ~ tumtype + site, family = poisson,data = mel.long)
mel.int <- glm(freq ~ tumtype*site, family = poisson,data = mel.long)
We can compare the fit with the two models using e.g. the AIC criterion, or by checking the significance of the interaction term in the last model.
AIC(mel.main, mel.int)
## df AIC
## mel.main 6 122.9064
## mel.int 12 83.1114
anova(mel.int, test = "Chisq")
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: freq
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 11 295.203
## tumtype 3 145.106 8 150.097 < 2.2e-16 ***
## site 2 98.302 6 51.795 < 2.2e-16 ***
## tumtype:site 6 51.795 0 0.000 2.05e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the two models, we can also directly get the expected number of cases in each cell under the assumption of no association (from the model without the interaction), as well as the observed number (from the model with the interaction, which fits the data exactly).
(exp.count <- predict(mel.main, type = "response"))
## 1 2 3 4 5 6 7 8 9
## 5.780 31.450 21.250 9.520 9.010 49.025 33.125 14.840 19.210
## 10 11 12
## 104.525 70.625 31.640
(obs.count <- predict(mel.int, type = "response"))
## 1 2 3 4 5 6 7 8 9 10 11 12
## 22 16 19 11 2 54 33 17 10 115 73 28
The sum of the squares of the Pearson residuals of the model without interaction is equal to the chi-square statistic from the conventional test.
sum((residuals(mel.main, type = "pearson"))^2)
## [1] 65.81293
library(faraway)
data(africa)
The dataset africa gives information about the number of military coups in subSaharan Africa and various political and geographical information. Develop a simple but well- fitting model for the number of coups. Give an interpretation of the effect of the variables you include in your model on the response.
Below is a number of data sets and questions that can be addressed using GLMs. For each of them, determine what could be a suitable distribution, fit a model and analyze the fitted model.
The ACF1 data set from the DAAG package contains the number of aberrant crypt foci (ACF) in a colon section in each of 22 rats subjected to a carcinogen. The rats were sacrificed at three different times. Is there a difference between the number of ACF in the three groups of rats? Does the Poisson assumption seem to hold?
require(DAAG)
head(ACF1)
## count endtime
## 1 1 6
## 2 3 6
## 3 5 6
## 4 1 6
## 5 2 6
## 6 1 6
dim(ACF1)
## [1] 22 2
The dengue data set from the DAAG package contains data for 2,000. For each region, it is recorded whether or not dengue was recorded any time between 1961 and 1990 (the NoYes variable). We also have information about other variables, such as temperature and humidity during these years. Which variables seem to influence the occurrence of dengue?
require(DAAG)
data(dengue)
head(dengue)
## humid humid90 temp temp90 h10pix h10pix90 trees trees90
## 1 0.6713889 4.416667 2.037500 8.470835 17.35653 17.80861 0 1.5
## 2 7.6483340 8.167500 12.325000 14.925000 10.98361 11.69167 0 1.0
## 3 6.9790556 9.563058 6.925000 14.591660 17.50833 17.62528 0 1.2
## 4 1.1104163 1.825361 4.641665 6.046669 17.41763 17.51694 0 0.6
## 5 9.0270555 9.742751 18.175000 19.710000 13.84306 13.84306 0 0.0
## 6 8.9141113 9.516778 11.900000 16.643341 11.69167 11.69167 0 0.2
## NoYes Xmin Xmax Ymin Ymax
## 1 0 70.5 74.5 38.0 35.5
## 2 0 62.5 64.5 35.5 34.5
## 3 0 68.5 69.5 36.0 35.0
## 4 0 67.0 68.0 35.0 34.0
## 5 0 61.0 64.5 33.5 32.0
## 6 0 64.5 65.5 36.5 35.0
dim(dengue)
## [1] 2000 13
The leafshape17 data set has 61 observations of 8 variables, representing leaf length, width and petiole measurements from several sites in Australia. Is the leaf architecture (binary variable) associated with any of the other variables?
require(DAAG)
data(leafshape17)
head(leafshape17)
## bladelen petiole bladewid latitude logwid logpet loglen arch
## 186 6.93 0.449064 3.10 17.1 1.1314021 -0.8005899 1.935860 0
## 187 7.27 0.261720 2.50 17.1 0.9162907 -1.3404800 1.983756 0
## 188 7.43 0.766033 2.98 17.1 1.0919233 -0.2665300 2.005526 0
## 189 8.00 0.592800 3.06 17.1 1.1184149 -0.5228982 2.079442 0
## 190 8.08 0.492900 3.03 17.1 1.1085626 -0.7074490 2.089392 0
## 191 8.48 0.819168 4.33 17.1 1.4655675 -0.1994661 2.137710 0
dim(leafshape17)
## [1] 61 8