GLM exercises

Linda Dib, Frederic Schutz

August 2018

This document contain exercises on generalized linear models. You will use several packages. If a package is not found, you will need to install it. For the faraway package for example:

install.packages("faraway")

To get help for a function, use the help function in R

help(glm)

Binomial data

Exercise 1: Michelin food
Exercise 2: Moth death
Exercise 3: Beetle
Exercise 4: Pima
Exercise 5: babyfood

Count data

Exercise 6: dvisits
Exercise 7: salmonella
Exercise 8: lung cancer
Exercise 9: melanoma
Exercise 10: Africa

Extra exercises

Exercise 11: ACF1 data
Exercise 12: dengue data
Exercise 13: leafshape17 data

Binomial data

Exercise 1: Michelin food

(Back to top)

Here, we will study a data set containing food ratings for a number of restaurants, as well as information about whether or not they are included in the Michelin guide. The data set (called MichelinFood.txt) can be downloaded from the webpage of the book A Modern Approach to Regression with R by Simon J Sheather. Download the file and read it into R

michelin <- read.delim("MichelinFood.txt",header=TRUE,sep="\t",as.is=TRUE)
michelin

##    Food InMichelin NotInMichelin mi proportion
## 1    15          0             1  1       0.00
## 2    16          0             1  1       0.00
## 3    17          0             8  8       0.00
## 4    18          2            13 15       0.13
## 5    19          5            13 18       0.28
## 6    20          8            25 33       0.24
## 7    21         15            11 26       0.58
## 8    22          4             8 12       0.33
## 9    23         12             6 18       0.67
## 10   24          6             1  7       0.86
## 11   25         11             1 12       0.92
## 12   26          1             1  2       0.50
## 13   27          6             1  7       0.86
## 14   28          4             0  4       1.00

The Food column represents the ranking of the food. The columns InMichelin and NotInMichelin contain the number of restaurants with the given food ranking that are/are not listed in the Michelin guide, respectively. The mi column contains the total number of examined restaurants with the given food ranking. Finally, the proportion column contains the fraction of examined restaurants with the given food ranking that were included in the Michelin guide.

Start by graphically exploring the data
Fit a GLM using a binomial model for the response, using the food ranking as the predictor.
Predict the probabilities for a number of potential food rankings xnew, and plot a smooth function

xnew <- data.frame(Food = seq(from = 14, to = 30, length.out = 50))

Check the model by looking at the residual deviance, other residuals and especially the quantile residuals
Check the model for potential influencial observations.

Exercise 2: moth death

(Back to top)

The next data set comes from Venables and Ripley (page 190) and encodes the number of moths that died after exposure to different doses of trans-cypermethrin. 20 male and 20 female moths were exposed to each dose, and the number of deaths in each group was recorded. We generate a data frame containing the observed values. Since the doses are powers of two, we use log2(dose) as the predictor rather than the actual dose.

moth <- data.frame(sex = rep(c("male", "female"), each = 6),
                   dose = log2(rep(c(1, 2, 4, 8, 16, 32), 2)),
                   numdead = c(1, 4, 9, 13, 18, 20, 0, 2, 6, 10, 12, 16))
moth$numalive <- 20 - moth$numdead

Fit a GLM using the sex and dose as predictors. Do we need to include an interaction term in the model?
Does the model fit well? Perform an analysis of deviance
predict the probabilities for different doses and include a smooth line in the plots

xnew <- data.frame(sex = rep(c("male", "female"), each = 30),
                   dose = rep(seq(from = 0, to = 5, length.out = 30), 2))

Exercise 3: beetle

(Back to top)

The third example of a binomial response considers an experiment where beetles were exposed to carbon disulphide at various concentrations, and the number of beetles who died within five hours were recorded. The data set is studied in the book by Dobson (2002), and comes originally from Bliss (1935).

beetles <- data.frame(dose = c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113,1.8369, 1.8610, 1.8839),
                      dead = c(6, 13, 18, 28, 52, 53, 61, 60),
                      alive = c(51, 47, 44, 28, 11, 6, 1, 0))

fit a logistic regression to the data using dose as a predictor
fit another logistic regression using the log-log link. Compare the two fits.
Compare the prediction with each of the model.

new_beetle <- data.frame(dose = seq(from = 1.5, to = 1.9, length.out = 100))

Exercise 4: pima, Exercise for credits

(Back to top)

The following exercises was taken from faraway, 2006, extending linear models.

The national institute of Diabetes and digestive and kidney diseases conducted a study on 768 adult female Pima Indians living near Phoenix. the purpose of the study was to investigate factors related to diabetes.

load the data

library(faraway)
data("pima")

Perform simple graphical and numerical summaries of the data. Can you find any obvious irregularities in the data? If you do, take appropriate steps to correct the problems
Fit a model with the result of the diabetes test as the response and all the other variables as predictors. Can you tell whether this model fits the data?
What is the difference in the odds of testing positive for diabetes for a woman with a BMI at the first quartile compared with a woman at the third quartile, assuming that all other factors held constant? Give a confidence interval for this difference.
Do women who test positive have higher diastolic blood pressures? Is the diastolic blood pressure signifcant in the regression model? Explain the distinction between the two questions and discuss why the answers are only apparently contradictory.
Perform diagnostics on the regression model, reporting any potential violations and any suggested improvements to the model
Predict the outcome for a woman with the following predictor values:

new_pima <- data.frame(pregnant=1,glucose=99,diastolic=64,triceps=22,insulin=76,bmi=27,diabetes=0.25,age=25)

Give a confidence interval for your prediction.

Exercise 5: baby food

(Back to top)

The data for this exercise study infant respiratory disease, namely the proportions of children developing bronchitis or pneumonia in their first year of life by type of feeding, and sex. Data may be found in Payne (1987) and Faraway (2006)

library(faraway)
data(babyfood)

Explore the data
What are the proportions of Boys/Girls in the different food categories?
Fit a logistic regression to explain the probability of disease by sex and food.
What is the impact of breast feeding on the odds of respiratory disease compared to bottle feeding? Give a confidence interval for this value.

Count data

(Back to top)

Exercise 6: dvisits

(Back to top)

The dvisits data comes from the Australian Health Survey of 1977-1978 and consist of 5190 single adults where young and old have been oversampled. They are available as:

library(faraway)
data("dvisits")

Explore the dataset
Build a Poisson regression model with doctorco as the response and sex, age, agesq, income, levyplus, freepoor, freerepa, illness, actdays, hscore, chcond1 and chcond2 as possible predictor variables. Considering the deviance of this model, does this model fit the data?
Plot the residuals and the fitted values. Why are there lines of observations on the plot?
Use backward eliminiation with a critical p-value of 5% to reduce the model as much as possible. Report your model.
What sort of person would be predicted to visit the doctor the most under your selected model
For the last person in the dataset, compute the predicted probability distribution for their visits to the doctor, i.e., give the probability they visit 0,1,2,… times.
fit a comparable (Gaussian) linear model and graphically compare the fits. Describe how they differ.

Exercise 7: salmonella

(Back to top)

library(faraway)
data("salmonella")

the salmonella data was collected in a salmonella reverse mutagenicity assay. The predictor is the dose level of quinoline and the response is the numbers of revertant colonies of TA98 salmonella observed on each of 3 replicate plates.

Show that a poisson GLM is inadequate and that some overdispersion must be allowed for. Do not forget to check out other reasons for a high deviance.

Exercise 8: lung cancer

(Back to top)

Here we will consider a data set used by Erling B. Andersen in 1977, and containing the number of lung cancer cases recorded in different age categories in four different Danish cities. The data set is available from the ISwR package.

library(ISwR)
data(eba1977)
eba1977

##          city   age  pop cases
## 1  Fredericia 40-54 3059    11
## 2     Horsens 40-54 2879    13
## 3     Kolding 40-54 3142     4
## 4       Vejle 40-54 2520     5
## 5  Fredericia 55-59  800    11
## 6     Horsens 55-59 1083     6
## 7     Kolding 55-59 1050     8
## 8       Vejle 55-59  878     7
## 9  Fredericia 60-64  710    11
## 10    Horsens 60-64  923    15
## 11    Kolding 60-64  895     7
## 12      Vejle 60-64  839    10
## 13 Fredericia 65-69  581    10
## 14    Horsens 65-69  834    10
## 15    Kolding 65-69  702    11
## 16      Vejle 65-69  631    14
## 17 Fredericia 70-74  509    11
## 18    Horsens 70-74  634    12
## 19    Kolding 70-74  535     9
## 20      Vejle 70-74  539     8
## 21 Fredericia   75+  605    10
## 22    Horsens   75+  782     2
## 23    Kolding   75+  659    12
## 24      Vejle   75+  619     7

The cases column contains the number of lung cancer cases for each city and age category.

Model this count using the city and age category as predictors. Fit a Poisson GLM to the data. Is the fit appropriate?
In the previous model, we are not considering the number of potential cases in each group (ie the population size). Modify the model by using an offset which takes the population size into account.
Fit a binomial model to the data by considering success as being lung cancer cases and failures as being (\(population size-number of cases\)).
Compare with the Poisson model.

Exercise 9: melanoma

(Back to top)

Here we will study a data set from a cross-sectional study of patients with malignant melanoma. For 400 patients, the site of the tumor and its histological type was recorded. The data are analyzed by Dobson (2002).

mel <- matrix(c(22, 16, 19, 11, 2, 54, 33, 17, 10, 115, 73, 28),nrow = 4, ncol = 3)
colnames(mel) <- c("headneck", "trunk", "extrm")
rownames(mel) <- c("hutch", "superf", "nodular", "indet")
mel

##         headneck trunk extrm
## hutch         22     2    10
## superf        16    54   115
## nodular       19    33    73
## indet         11    17    28

The question of interest is whether there is any association between the tumor type and the site (that is, if one tumor type is more or less common than expected in a given site). We can test the hypothesis of no association by a conventional chi-square test:

chisq.test(mel)

## 
##  Pearson's Chi-squared test
## 
## data:  mel
## X-squared = 65.813, df = 6, p-value = 2.943e-12

If there is no association between the two variables, the expected number of patients in a given cell is proportional to the product of the marginal totals for the row and column categories under consideration. On a log-scale, this would correspond to summing the site and tumor type effects. We can thus test the association between the tumor type and the site by comparing a model with only the main effects of site and tumor type to a model including also the interaction between the two. First, we need to reformat the data into so called “long” format:

require(reshape2)
mel.long <- melt(mel, varnames = c("tumtype", "site"), value.name = "freq")

Now we can fit the two models using Poisson regression with a log link.

mel.main <- glm(freq ~ tumtype + site, family = poisson,data = mel.long)
mel.int <- glm(freq ~ tumtype*site, family = poisson,data = mel.long)

We can compare the fit with the two models using e.g. the AIC criterion, or by checking the significance of the interaction term in the last model.

AIC(mel.main, mel.int)

##          df      AIC
## mel.main  6 122.9064
## mel.int  12  83.1114

anova(mel.int, test = "Chisq")

## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: freq
## 
## Terms added sequentially (first to last)
## 
## 
##              Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                            11    295.203              
## tumtype       3  145.106         8    150.097 < 2.2e-16 ***
## site          2   98.302         6     51.795 < 2.2e-16 ***
## tumtype:site  6   51.795         0      0.000  2.05e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the two models, we can also directly get the expected number of cases in each cell under the assumption of no association (from the model without the interaction), as well as the observed number (from the model with the interaction, which fits the data exactly).

(exp.count <- predict(mel.main, type = "response"))

##       1       2       3       4       5       6       7       8       9 
##   5.780  31.450  21.250   9.520   9.010  49.025  33.125  14.840  19.210 
##      10      11      12 
## 104.525  70.625  31.640

(obs.count <- predict(mel.int, type = "response"))

##   1   2   3   4   5   6   7   8   9  10  11  12 
##  22  16  19  11   2  54  33  17  10 115  73  28

The sum of the squares of the Pearson residuals of the model without interaction is equal to the chi-square statistic from the conventional test.

sum((residuals(mel.main, type = "pearson"))^2)

## [1] 65.81293

Exercise 10: Africa

(Back to top)

library(faraway)
data(africa)

The dataset africa gives information about the number of military coups in subSaharan Africa and various political and geographical information. Develop a simple but well- fitting model for the number of coups. Give an interpretation of the effect of the variables you include in your model on the response.

Extra exercises

(Back to top)

Below is a number of data sets and questions that can be addressed using GLMs. For each of them, determine what could be a suitable distribution, fit a model and analyze the fitted model.

ACF1 data set

(Back to top)

The ACF1 data set from the DAAG package contains the number of aberrant crypt foci (ACF) in a colon section in each of 22 rats subjected to a carcinogen. The rats were sacrificed at three different times. Is there a difference between the number of ACF in the three groups of rats? Does the Poisson assumption seem to hold?

require(DAAG)
head(ACF1)

##   count endtime
## 1     1       6
## 2     3       6
## 3     5       6
## 4     1       6
## 5     2       6
## 6     1       6

dim(ACF1)

## [1] 22  2

Dengue data set

(Back to top)

The dengue data set from the DAAG package contains data for 2,000. For each region, it is recorded whether or not dengue was recorded any time between 1961 and 1990 (the NoYes variable). We also have information about other variables, such as temperature and humidity during these years. Which variables seem to influence the occurrence of dengue?

require(DAAG)
data(dengue)
head(dengue)

##       humid  humid90      temp    temp90   h10pix h10pix90 trees trees90
## 1 0.6713889 4.416667  2.037500  8.470835 17.35653 17.80861     0     1.5
## 2 7.6483340 8.167500 12.325000 14.925000 10.98361 11.69167     0     1.0
## 3 6.9790556 9.563058  6.925000 14.591660 17.50833 17.62528     0     1.2
## 4 1.1104163 1.825361  4.641665  6.046669 17.41763 17.51694     0     0.6
## 5 9.0270555 9.742751 18.175000 19.710000 13.84306 13.84306     0     0.0
## 6 8.9141113 9.516778 11.900000 16.643341 11.69167 11.69167     0     0.2
##   NoYes Xmin Xmax Ymin Ymax
## 1     0 70.5 74.5 38.0 35.5
## 2     0 62.5 64.5 35.5 34.5
## 3     0 68.5 69.5 36.0 35.0
## 4     0 67.0 68.0 35.0 34.0
## 5     0 61.0 64.5 33.5 32.0
## 6     0 64.5 65.5 36.5 35.0

dim(dengue)

## [1] 2000   13

leafshape17 data set

(Back to top)

The leafshape17 data set has 61 observations of 8 variables, representing leaf length, width and petiole measurements from several sites in Australia. Is the leaf architecture (binary variable) associated with any of the other variables?

require(DAAG)
data(leafshape17)
head(leafshape17)

##     bladelen  petiole bladewid latitude    logwid     logpet   loglen arch
## 186     6.93 0.449064     3.10     17.1 1.1314021 -0.8005899 1.935860    0
## 187     7.27 0.261720     2.50     17.1 0.9162907 -1.3404800 1.983756    0
## 188     7.43 0.766033     2.98     17.1 1.0919233 -0.2665300 2.005526    0
## 189     8.00 0.592800     3.06     17.1 1.1184149 -0.5228982 2.079442    0
## 190     8.08 0.492900     3.03     17.1 1.1085626 -0.7074490 2.089392    0
## 191     8.48 0.819168     4.33     17.1 1.4655675 -0.1994661 2.137710    0

dim(leafshape17)

## [1] 61  8