
Learn R

R is a free software environment for statistical computing and graphics.
http://www.r-project.org/
For more information, please contact Paul Geissler (Paul_Geissler@usgs.gov).

Topic 9: Generalized Linear Models (GLM)

Maindonald and Braun (2007) Chapter 8

Generalized Linear Models (GLM)
Don't confuse generalized linear models with general linear models, which include regression, analysis of variance, and analysis of covariance.
GLMs
* relate the expected response to the linear predictor through a link function and
* do not require residuals to be normally distributed.
For a description, see http://en.wikipedia.org/wiki/Generalized_linear_models
Examples include
* logistic regression, where the response variable is 1 or 0 (yes or no) and the predictors can be quantitative or factors as in regression or anova.
* Poisson and quasi-Poisson regression where the response is a count.
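As a sketch of what these calls look like in R (using simulated data, since no dataset has been introduced yet), both cases use the same glm() function and differ only in the family argument:
set.seed(1)
d <- data.frame(x = rnorm(100)) # simulated predictor
d$y <- rbinom(100, 1, plogis(-0.5 + 1.2*d$x)) # simulated 0/1 response
d$count <- rpois(100, exp(0.3 + 0.5*d$x)) # simulated counts
glm(y ~ x, family=binomial(link="logit"), data=d) # logistic regression
glm(count ~ x, family=poisson(link="log"), data=d) # Poisson regression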

Logistic Regression
The response variable takes on values of 0 (failure, no) or 1 (success or yes).
The data are distributed according to the binomial distribution, not the normal distribution as in regression and analysis of variance.
The predicted responses must be between 0% and 100%.
Responses <10% or >90% are probably not a linear function of the predictors.
A logit transformation of the response variable is used: logit = log(p/(1-p)), the log of the odds.
Plot the logit transformation to see how proportions near 0 and 1 are stretched out:
p<-(1:999)/1000
logit<-log(p/(1-p))
plot(p,logit,xlab="proportion",ylab="logit",type="l")
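R's built-in qlogis() and plogis() functions compute the logit and its inverse, so the hand-coded transformation above can be checked directly:
all.equal(qlogis(p), logit) # qlogis() is the logit function
plot(logit, plogis(logit), type="l", xlab="logit", ylab="proportion") # plogis() is the inverse logit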

Example
rm(list=ls()) # clear workspace
data(anesthetic, package="DAAG")
attach(anesthetic)
?anesthetic
anesthetic[1:10,]
tab=table(move,conc)
t=apply(tab,2,sum) # calculates the sums for columns (dimension 2), total for each concentration
c=as.numeric(colnames(tab)) # concentrations
p=tab["0",]/t # proportions
plot(c,p,xlab="concentration")
m1<-glm(nomove ~ conc, family=binomial(link="logit")) # using original data
m2<-glm(p ~ c, family=binomial(link="logit"), weights=t) # using calculated proportions
summary(m1)
summary(m1)$coefficients
summary(m2)$coefficients
plot(c,p,xlab="concentration")
lines(c, predict(m1, newdata=data.frame(conc=c), type="response")) # fitted probabilities; submit with line above
# (abline(m1) would draw the fitted line on the logit scale, which does not match this probability plot)
The output looks like the regression output from lm, except that the sum of squares is replaced by deviance.
The deviance is -2 log(likelihood), so minimizing the deviance maximizes the likelihood.
If data are normally distributed, the deviance and sum of squares are equivalent.
In the glm function, binomial was specified as the distribution family because the data are binomial (0, 1) data.
The link specifies the logit transformation of the response variable.
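Both statements can be checked in R. For 0/1 binomial data the saturated model has log-likelihood 0, so the residual deviance is exactly -2 log(likelihood); for the gaussian family the deviance equals the residual sum of squares (illustrated here with the built-in cars data):
deviance(m1)
-2*as.numeric(logLik(m1)) # equals the deviance for 0/1 binomial data
mg <- glm(dist ~ speed, family=gaussian, data=cars) # a gaussian glm is ordinary regression
all.equal(deviance(mg), sum(resid(lm(dist ~ speed, data=cars))^2)) # deviance = residual sum of squares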

The model can be run from the R Commander menu. Select "Statistics" > "fit models" > "Generalized linear models".

Logistic multiple regression
rm(list=ls()) # clear workspace
data(frogs, package="DAAG")
attach(frogs)
?frogs # detectability is not considered, but it should be!
frogs[1:10,]
summary(frogs)
plot(northing ~ easting, pch=c(1:2)[pres.abs+1], xlab="East", ylab="North", main="Location of Sites", sub="present at triangles")
library(car) # provides scatterplot.matrix (called scatterplotMatrix in newer versions of car)
scatterplot.matrix(~altitude+avrain+distance+meanmax+meanmin+NoOfPools+NoOfSites, reg.line=lm, smooth=TRUE, span=0.5,
          diagonal = 'density', data=frogs) # from menu
# The relationships between altitude, avrain, meanmin and meanmax are close to linear.
# distance, NoOfSites and NoOfPools are skewed, so we consider transformations.
par(mfrow=c(3,3))
plot(density(distance),main="",xlab="distance") # submit with line above
plot(density(sqrt(distance)),main="",xlab="sqrt(distance)") # submit with line above
plot(density(log(distance)),main="",xlab="log(distance)") # submit with line above
plot(density(NoOfPools),main="",xlab="NoOfPools") # submit with line above
plot(density(sqrt(NoOfPools)),main="",xlab="sqrt(NoOfPools)") # submit with line above
plot(density(log(NoOfPools)),main="",xlab="log(NoOfPools)") # submit with line above
plot(density(NoOfSites),main="",xlab="NoOfSites") # submit with line above
plot(density(sqrt(NoOfSites)),main="",xlab="sqrt(NoOfSites)") # submit with line above
plot(density(log(NoOfSites+1)),main="",xlab="log(NoOfSites+1)") # submit with line above
# we will use a log transform for distance and NoOfPools, but not transform NoOfSites.
scatterplot.matrix(~altitude+avrain+log(distance)+meanmax+meanmin+log(NoOfPools)+NoOfSites, reg.line=lm, smooth=TRUE, span=0.5,
          diagonal = 'density', data=frogs) # substitute transform in above
m1<-glm(pres.abs ~ altitude + log(distance) + log(NoOfPools) + NoOfSites + avrain + meanmin + meanmax, family = binomial)
summary(m1)
# Note problems with data. No account has been taken of spatial clustering or detectability, so much caution is required.
m2=step(m1)
par(mfrow=c(2,2))
plot(m2) # submit with line above
par(mfrow=c(2,2))
termplot(m2) # submit with line above
# Note the scale: meanmin has the greatest effect, with the probability of frogs increasing with meanmin.
# The effect for meanmax goes in the other direction.
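As a rough check on predictive performance (not a fix for the detectability and spatial-clustering problems noted above), DAAG provides cross-validated accuracy for binary GLMs; a minimal sketch, assuming CVbinary() is available in your version of DAAG:
library(DAAG)
CVbinary(m2) # cross-validated proportion of correct predictions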

Logistic models for categorical data
rm(list=ls()) # clear workspace
data(UCBAdmissions, package="datasets")
# note that UCBAdmissions is a 3-way table, not a data frame, so attach() does not apply here
?UCBAdmissions
UCBAdmissions
# create a data frame from the table
dimnames(UCBAdmissions)
UCB=as.data.frame.table(UCBAdmissions["Admitted", , ])
names(UCB)[3]="admit"
UCB$reject<-as.data.frame.table(UCBAdmissions["Rejected", , ])$Freq
UCB$total<-UCB$admit + UCB$reject
UCB$p<-UCB$admit / UCB$total
attach(UCB)
UCB
m1=glm(p~ Dept*Gender, family=binomial, weights=total)
# important to fit Dept before Gender to adjust for different admission rates in different departments
anova(m1, test="Chisq")
# The significant interaction suggests that there are department specific gender biases,
# which average out to reduce the gender main effect
levels(Gender) # 1=male, 2=female
cbind(Gender,Dept,model.matrix(m1))
summary(m1)$coef
# The Dept coefficients relate to males.
# The Gender effect is highly significant and positive for Dept A.
# The significant interactions show that it is negative for the other departments.
UCB
tab=data.frame(Dept=UCB[Gender=="Male","Dept"], Male=UCB[Gender=="Male","p"], Female=UCB[Gender=="Female","p"])
tab$dif<-tab$Male-tab$Female
tab
interaction.plot(UCB$Gender,UCB$Dept,UCB$p)
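These data are the classic example of Simpson's paradox: aggregated over departments, males appear to be admitted at a higher rate, even though within most departments the female rate is as high or higher. The aggregated rates can be checked directly from the original table:
tot <- margin.table(UCBAdmissions, c(1,2)) # Admit x Gender, summed over Dept
tot["Admitted", ] / colSums(tot) # overall admission rate by gender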

rm(list=ls()) # clear workspace
# John Fox, 2002, An R and S-PLUS Companion to Applied Regression, pages 163-166
# Script copied from John Fox's web site http://socserv.mcmaster.ca/jfox/Books/Companion/Ch5-script.txt
# It would have been easier to enter the data into an Excel spreadsheet
library(lattice) # for dotplot()
library(car) # for Anova()
closeness <- factor(rep(c('one.sided', 'close'), c(3,3)), levels=c('one.sided', 'close'))
preference <- ordered(rep(c('weak', 'medium', 'strong'), 2), levels=c('weak', 'medium', 'strong'))
voted <- c(91, 121, 64, 214, 284, 201)
did.not.vote <- c(39, 49, 24, 87, 76, 25)
logit.turnout <- log(voted/did.not.vote)
p <- voted/(voted+did.not.vote)
x=data.frame(closeness, preference, voted, did.not.vote, logit=logit.turnout, p)
attach(x)
dotplot(p ~ preference | closeness)
contrasts(closeness)=contr.sum(levels(closeness))
contrasts(preference)=contr.poly(levels(preference))
m1<-glm(cbind(voted, did.not.vote) ~ closeness * preference, family = binomial)
# response is a matrix with counts voted and did.not.vote
summary(m1)
# model used all degrees of freedom so there is 0 df for the residual
anova(m1)
Anova(m1) # note capital A: car's Anova() gives Type II tests, unlike the sequential anova()

?anova
?Anova # different, note capital A

Polytomous (multiple response) data
John Fox, 2002, An R and S-PLUS Companion to Applied Regression, pages 167-172
rm(list=ls()) # clear workspace
data(Womenlf, package="car")
attach(Womenlf)
?Womenlf
Womenlf[1:15,]
summary(Womenlf)
library(car) # recode() and Anova() come from the car package
working <- recode(partic, " 'not.work' = 'no'; else = 'yes' ") # nested dichotomies: first, working vs. not working
fulltime <- recode(partic, " 'fulltime'='yes'; 'parttime' = 'no'; 'not.work'=NA ") # then, among those working, full-time vs. part-time
mWorking <- glm(working ~ hincome + children + region, family=binomial)
summary(mWorking)
mFulltime <- glm(fulltime ~ hincome + children + region, family=binomial)
summary(mFulltime)
Anova(mWorking)
Anova(mFulltime)
# remove region which is not significant
mWorking1 <- glm(working ~ hincome + children, family=binomial)
summary(mWorking1)
mFulltime1 <- glm(fulltime ~ hincome + children, family=binomial)
summary(mFulltime1)
Anova(mWorking1)
Anova(mFulltime1)
predictors <- expand.grid(hincome=1:45, children=c('absent', 'present')) # create data frame for predictions
predictors[1:10,]
p.work <- predict(mWorking1, predictors, type="response") # predicted probability of working
p.fulltime <- predict(mFulltime1, predictors, type="response") # predicted probability of working full time given working
p.full <- p.work * p.fulltime # predicted probability of working full-time
p.part <- p.work * (1 - p.fulltime) # predicted probability of working part-time
p.not <- 1 - p.work # predicted probability of not working
p <- data.frame(predictors, p.work, p.fulltime, p.full, p.part, p.not)
p[1:10,] # predicted probabilities
par(mfrow=c(1,2))
plot(c(1,45), c(0,1),type='n', xlab="Husband's Income", ylab='Fitted Probability', main="Children Absent")
lines(1:45, p.not[1:45], lty=1, lwd=3)
lines(1:45, p.part[1:45], lty=2, lwd=3)
lines(1:45, p.full[1:45], lty=3, lwd=3)
legend(locator(1), lty=1:3, lwd=3, legend=c('not working', 'part-time', 'full-time')) # click on the plot to place the legend
plot(c(1,45), c(0,1), type='n', xlab="Husband's Income", ylab='Fitted Probability', main="Children Present")
lines(1:45, p.not[46:90], lty=1, lwd=3)
lines(1:45, p.part[46:90], lty=2, lwd=3)
lines(1:45, p.full[46:90], lty=3, lwd=3)
legend(locator(1), lty=1:3, lwd=3, legend=c('not working', 'part-time', 'full-time')) # click on the plot to place the legend
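An alternative to fitting nested dichotomies is a single multinomial logit model; a sketch using multinom() from the nnet package (note that its coefficients are parameterized differently from the two binary fits above):
library(nnet)
mMultinom <- multinom(partic ~ hincome + children, data=Womenlf)
summary(mMultinom)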

Poisson and quasi-Poisson regression
A Poisson model is appropriate for independent counts.
Use a quasi-Poisson model when there is clustering, lack of independence.
In the example, the "sites" with ACF (aberrant crypt foci) in rat colons are probably not independent because some rats are more susceptible than others.
rm(list=ls()) # clear workspace
data(ACF1, package="DAAG")
attach(ACF1)
?ACF1
ACF1[1:10,]
plot(count ~ endtime, log="x", pch=16)
m1<-glm(count ~ endtime, family=poisson)
summary(m1)
m2<-glm(count ~ endtime + I(endtime^2), family=poisson) # add quadratic effect
# I() protects endtime^2 so it is computed arithmetically rather than interpreted as a formula operator.
summary(m2)
24.515/19 # 1.290263 = residual deviance / residual df, which should be close to 1.00 for a true Poisson
# This is an estimate of the dispersion, the increase of the variance relative to the expected variance for the Poisson.
# A better estimate is
disp=sum(resid(m2,type="pearson")^2)/19 # 1.254068
sqrt(disp)
# Standard errors should be increased, or the t-statistics reduced, by a factor of sqrt(1.254068)=1.119852.
# A quasi-Poisson model makes this adjustment.
m3<-glm(count ~ endtime + I(endtime^2), family=quasipoisson) # quadratic is not significant
summary(m3)
m4<-glm(count ~ endtime, family=quasipoisson)
summary(m4)
# The fitted values are:
unique(fitted(m4))
# We check whether the residual variance is about the same for all endtimes.
sapply(split(residuals(m4),endtime), var) # the variance for endtime=18 seems small, but the sample size is small
fligner.test(residuals(m4) ~ factor(endtime)) # no significant variance differences
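The dispersion that summary() reports for a quasi-Poisson fit is this same Pearson-based estimate, so the hand calculation can be verified against it:
summary(m3)$dispersion # should match the disp value (about 1.254) computed above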

rm(list=ls()) # clear workspace
data(moths, package="DAAG")
attach(moths)
?moths
moths[1:10,]
library(lattice)
mothsA=data.frame(habitat,density=A/meters,species="A")
mothsP=data.frame(habitat,density=P/meters,species="P")
mothsAP=rbind(mothsA,mothsP)
dotplot(density~species|habitat,data=mothsAP)
m1<-glm(A ~ habitat + log(meters), family=quasipoisson)
summary(m1) # model log(count) = habitat + log(meters) => count = (habitat density) * (length of transect)
# What went wrong? Standard errors are huge and AIC is NA.
# the model used bank habitat as the reference level because it is first alphabetically.
moths[habitat=="Bank",] # one observation with zero count for species A
# GLM approximations to the standard errors break down when the fitted value for the reference level is zero!
aggregate(A,by=list(habitat), FUN=mean)
par(mfrow=c(2,2))
plot(m1)
moths$habitat<-relevel(moths$habitat, ref="Lowerside")
attach(moths) # must attach after change
levels(habitat)
m2<-glm(A ~ habitat + log(meters), family=quasipoisson)
summary(m2)
aggregate(A,by=list(habitat), FUN=mean)
par(mfrow=c(2,2))
plot(m2)
# note that the dispersion parameter is 2.7, compared to 1.0 for a Poisson distribution,
# indicating some clustering with an average cluster size of 2.7.
# Standard errors and p-values from the Poisson model would be misleading;
# here the standard errors have been multiplied by sqrt(2.7).
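An alternative way to express the count = density * transect-length interpretation is to fix the coefficient of log(meters) at 1 with an offset; a sketch:
m2o<-glm(A ~ habitat + offset(log(meters)), family=quasipoisson)
summary(m2o) # habitat coefficients now describe log(density) directly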

# Comparison between Bank and Disturbed habitats - procedure to use when a fitted count is near zero.
habitatD<-habitat
habitatD[habitatD=="Bank"]<-"Disturbed" # combine Bank and Disturbed habitats
m3<-glm(A ~ habitatD + log(meters), family=quasipoisson)
anova(m2,m3, test="F" ) # note F test for comparing models
# Comparison between Bank and NWsoak habitats - same procedure, combining levels.
habitatNW<-habitat
habitatNW[habitatNW=="Bank"]<-"NWsoak" # combine Bank and NWsoak habitats
m4<-glm(A ~ habitatNW + log(meters), family=quasipoisson)
anova(m2,m4, test="F") # note F test for comparing models

Log-linear models for contingency tables
rm(list=ls()) # clear workspace
# John Fox, 2002, An R and S-PLUS Companion to Applied Regression, pages 181-185
# Script copied from John Fox's web site http://socserv.mcmaster.ca/jfox/Books/Companion/Ch5-script.txt
# It would have been easier to enter the data into an Excel spreadsheet
# These same data on election turnout were analyzed above using a binomial logit model.
counts <- c(91, 39, 121, 49, 64, 24, 214, 87, 284, 76, 201, 25)
Campbell <- expand.grid(turnout=c('voted','did.not.vote'), preference=c('weak', 'medium', 'strong'), closeness=c('one.sided', 'close'))
Campbell=data.frame(Campbell, counts)
contrasts(Campbell$closeness)=contr.sum(levels(Campbell$closeness))
Campbell$preference=ordered(Campbell$preference, levels=c("weak","medium","strong"))
contrasts(Campbell$preference)=contr.poly(levels(Campbell$preference))
attach(Campbell) # attach once, after the contrasts are set on the data frame itself
m1 <- glm(counts ~ closeness * preference * turnout, data=Campbell, family=poisson)
# treat counts in a contingency tables as poisson counts
summary(m1) # saturated model
Anova(m1) # capital A: car's Anova(), not the sequential anova()
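In the saturated log-linear model, the analogue of the binomial model's closeness-by-preference interaction for turnout is the three-way interaction; it can be tested by dropping it from the saturated model, for example:
m0 <- update(m1, . ~ . - closeness:preference:turnout)
anova(m0, m1, test="Chisq") # tests the closeness x preference x turnout interaction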

