Evaluation of Performance
Model performance
refers to the quality of the predictions from the regression
model. For linear models, the most frequently used measure is
the explained variation, denoted R². R² is defined as:

R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²,

where ŷᵢ is the prediction for subject i and ȳ is the mean of
the observed outcomes.
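As a concrete illustration, R² can be computed directly from observed outcomes and model predictions. The following minimal Python sketch uses made-up numbers purely for illustration; any fitted linear model's predictions could stand in for y_hat.

```python
import numpy as np

# Hypothetical observed outcomes y and predictions y_hat from some
# fitted linear regression model (values are illustrative only).
y = np.array([2.0, 3.5, 4.0, 5.5, 7.0])
y_hat = np.array([2.2, 3.1, 4.4, 5.2, 6.9])

# R^2 = 1 - sum((y_i - y_hat_i)^2) / sum((y_i - mean(y))^2)
ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # -> 0.969
```

An R² near 1 means the model's predictions explain almost all of the variation in the observed outcomes; an R² near 0 means the model does no better than predicting the mean.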
For generalized
linear models, such as logistic or Cox
regression models, several proposals have been made regarding
how to calculate explained variation (Mittlbock
and Schemper, 1996; Schemper and Stare, 1996).
Instead of focusing on overall
performance, one usually concentrates on two aspects of model
performance in generalized linear models:
- Calibration, and
- Discrimination.
Calibration
refers to whether the predicted probabilities agree with the
observed probabilities. First, we may study "calibration in
the large"; i.e. whether predictions agree on average with observed
probabilities. Second, we may study calibration in more detail
by looking at the correspondence between predictions and observations
over the whole range of predictions. The graph below illustrates
the four combinations of calibration in the large and calibration
in more detail.
Figure 7.2: Calibration
Illustration
of calibration: actual versus predicted probability.
The thick line illustrates ideal calibration: the actual probabilities
correspond exactly to the predicted probabilities. In a regression context,
the intercept a is zero and the slope b is 1. The dotted line illustrates
the situation in which predictions on average correspond
to observations ("calibration-in-the-large"
is OK, a=0), but low predictions are systematically
too low and high predictions systematically too high (b=1.5). The other
two lines (-.-.- and ---) represent situations where
calibration in the large is not OK; the predicted
odds (probability/(1-probability)) are on average
2 times too high (ln(2) on the logit scale). An example of severe miscalibration
is indicated by the --- line, where predictions are especially too high for
higher predictions. For example, when the predicted probability was around
80%, the actual probability was around 50%. Note that the three miscalibrated
lines can easily be corrected by adjusting the intercept a and/or slope b.
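The calibration intercept a and slope b described above can be estimated by refitting a logistic regression of the observed outcomes on the logit of the predicted probabilities. The sketch below, with a hypothetical `calibration_fit` helper and simulated data, illustrates this under the assumption that the predicted odds are systematically twice the true odds (the ln(2) shift mentioned in the caption), so the fitted intercept should come out near −ln(2) and the slope near 1.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def calibration_fit(y, p_pred, n_iter=25):
    """Fit logit(P(y=1)) = a + b * logit(p_pred) by Newton-Raphson.
    a near 0 and b near 1 indicate good calibration."""
    x = logit(p_pred)
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))       # fitted probabilities
        W = mu * (1 - mu)                 # iteratively reweighted LS weights
        grad = X.T @ (y - mu)
        hess = X.T @ (X * W[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta                           # (a, b)

# Simulated example: true probabilities, observed 0/1 outcomes, and
# predictions whose odds are twice the true odds (ln(2) on the logit scale).
rng = np.random.default_rng(0)
p_true = rng.uniform(0.1, 0.9, 5000)
y = rng.binomial(1, p_true)
p_pred = 1 / (1 + np.exp(-(logit(p_true) + np.log(2))))

a, b = calibration_fit(y, p_pred)
```

Here a should be close to −ln(2) ≈ −0.69 (calibration-in-the-large is off by a factor of 2 on the odds scale) while b stays close to 1, matching the "-.-.-" line in the figure.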
Correspondence
of predictions from a logistic regression model with observations
(or "goodness-of-fit")
is often assessed by the Hosmer-Lemeshow test. A cross-table
is created of observed and expected values, usually by decile
of the predicted probability. The H-L test, however, has limited
power to detect model mis-specification (Hosmer
et al., 1997).
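The decile-based cross-table behind the Hosmer-Lemeshow test can be sketched as follows. The function name and grouping scheme are illustrative, and the data are simulated; the conventional chi-square reference distribution with (groups − 2) degrees of freedom is assumed.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_pred, groups=10):
    """Group subjects by decile of predicted probability, then compare
    observed and expected event counts with a chi-square-type statistic."""
    order = np.argsort(p_pred)
    y_s, p_s = y[order], p_pred[order]
    stat = 0.0
    for g in np.array_split(np.arange(len(y)), groups):
        obs = y_s[g].sum()              # observed events in the group
        exp = p_s[g].sum()              # expected events in the group
        n_g, p_bar = len(g), p_s[g].mean()
        stat += (obs - exp) ** 2 / (n_g * p_bar * (1 - p_bar))
    df = groups - 2                     # conventional degrees of freedom
    return stat, chi2.sf(stat, df)

# Simulated comparison: well-calibrated predictions versus predictions
# whose odds are systematically 2x too high.
rng = np.random.default_rng(1)
p_true = rng.uniform(0.1, 0.9, 5000)
y = rng.binomial(1, p_true)
logit = lambda p: np.log(p / (1 - p))
p_miscal = 1 / (1 + np.exp(-(logit(p_true) + np.log(2))))

stat_ok, pval_ok = hosmer_lemeshow(y, p_true)
stat_bad, pval_bad = hosmer_lemeshow(y, p_miscal)
```

In this simulation the miscalibrated predictions yield a much larger statistic (and a small p-value) than the calibrated ones, but, as noted above, a non-significant H-L test is weak evidence that the model is well specified.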