Evaluation of Performance
Model performance
refers to the quality of the predictions from the regression
model. For linear models, the most frequently used measure is
the explained variation, denoted R². R² is defined as:

R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²,

where ŷᵢ is the prediction for subject i and ȳ is the mean of
the observed outcomes.
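As a concrete illustration, R² can be computed directly from observed outcomes and model predictions. The following minimal Python sketch uses made-up numbers purely for illustration; any fitted linear model's predictions could stand in for y_hat.

```python
import numpy as np

# Hypothetical observed outcomes y and predictions y_hat from some
# fitted linear regression model (values are illustrative only).
y = np.array([2.0, 3.5, 4.0, 5.5, 7.0])
y_hat = np.array([2.2, 3.1, 4.4, 5.2, 6.9])

# R^2 = 1 - sum((y_i - y_hat_i)^2) / sum((y_i - mean(y))^2)
ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
ss_tot = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # -> 0.969
```

An R² near 1 means the model's predictions explain almost all of the variation in the observed outcomes; an R² near 0 means the model does no better than predicting the mean.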
For generalized
linear models, such as logistic or Cox
regression models, several proposals have been made regarding
how to calculate explained variation (Mittlbock
and Schemper, 1996; Schemper and Stare, 1996).
Instead of focusing on overall
performance, one usually concentrates on two aspects of model
performance in generalized linear models:
- Calibration, and
- Discrimination.
Calibration
refers to whether the predicted probabilities agree with the
observed probabilities. First, we may study "calibration in
the large"; i.e. whether predictions agree on average with observed
probabilities. Second, we may study calibration in more detail
by looking at the correspondence between predictions and observations
over the whole range of predictions. The graph below illustrates
the four combinations of calibration in the large and calibration
in more detail.
Figure 7.2: Calibration
Illustration
of calibration: actual versus predicted probability.
The thick line illustrates ideal calibration: the actual probabilities
correspond exactly to the predicted probabilities. In a regression context,
the intercept a is zero and the slope b is 1. The dotted line illustrates
the situation in which predictions on average correspond
to observations ("calibration-in-the-large"
is OK, a=0), but low predictions are systematically
too low and high predictions systematically too high (b=1.5). The other
two lines (-.-.- and ---) represent situations where
calibration in the large is not OK; the predicted
odds (probability/(1-probability)) are on average
2 times too high (ln(2) on the logit scale). An example of severe miscalibration
is indicated by the --- line, where predictions are especially too high for
higher predictions. For example, when the predicted probability was around
80%, the actual probability was around 50%. Note that the three miscalibrated
lines can easily be corrected by adjusting the intercept a and/or slope b.
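The calibration intercept a and slope b described above can be estimated by refitting a logistic regression of the observed outcomes on the logit of the predicted probabilities. The sketch below, with a hypothetical `calibration_fit` helper and simulated data, illustrates this under the assumption that the predicted odds are systematically twice the true odds (the ln(2) shift mentioned in the caption), so the fitted intercept should come out near −ln(2) and the slope near 1.

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def calibration_fit(y, p_pred, n_iter=25):
    """Fit logit(P(y=1)) = a + b * logit(p_pred) by Newton-Raphson.
    a near 0 and b near 1 indicate good calibration."""
    x = logit(p_pred)
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))       # fitted probabilities
        W = mu * (1 - mu)                 # iteratively reweighted LS weights
        grad = X.T @ (y - mu)
        hess = X.T @ (X * W[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta                           # (a, b)

# Simulated example: true probabilities, observed 0/1 outcomes, and
# predictions whose odds are twice the true odds (ln(2) on the logit scale).
rng = np.random.default_rng(0)
p_true = rng.uniform(0.1, 0.9, 5000)
y = rng.binomial(1, p_true)
p_pred = 1 / (1 + np.exp(-(logit(p_true) + np.log(2))))

a, b = calibration_fit(y, p_pred)
```

Here a should be close to −ln(2) ≈ −0.69 (calibration-in-the-large is off by a factor of 2 on the odds scale) while b stays close to 1, matching the "-.-.-" line in the figure.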
Correspondence
of predictions from a logistic regression model with observations
(or "goodness-of-fit")
is often assessed by the Hosmer-Lemeshow test. A cross-table
is created of observed and expected values, usually by decile
of the predicted probability. The H-L test, however, has limited
power to detect model mis-specification (Hosmer
et al., 1997).
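The decile-based cross-table behind the Hosmer-Lemeshow test can be sketched as follows. The function name and grouping scheme are illustrative, and the data are simulated; the conventional chi-square reference distribution with (groups − 2) degrees of freedom is assumed.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_pred, groups=10):
    """Group subjects by decile of predicted probability, then compare
    observed and expected event counts with a chi-square-type statistic."""
    order = np.argsort(p_pred)
    y_s, p_s = y[order], p_pred[order]
    stat = 0.0
    for g in np.array_split(np.arange(len(y)), groups):
        obs = y_s[g].sum()              # observed events in the group
        exp = p_s[g].sum()              # expected events in the group
        n_g, p_bar = len(g), p_s[g].mean()
        stat += (obs - exp) ** 2 / (n_g * p_bar * (1 - p_bar))
    df = groups - 2                     # conventional degrees of freedom
    return stat, chi2.sf(stat, df)

# Simulated comparison: well-calibrated predictions versus predictions
# whose odds are systematically 2x too high.
rng = np.random.default_rng(1)
p_true = rng.uniform(0.1, 0.9, 5000)
y = rng.binomial(1, p_true)
logit = lambda p: np.log(p / (1 - p))
p_miscal = 1 / (1 + np.exp(-(logit(p_true) + np.log(2))))

stat_ok, pval_ok = hosmer_lemeshow(y, p_true)
stat_bad, pval_bad = hosmer_lemeshow(y, p_miscal)
```

In this simulation the miscalibrated predictions yield a much larger statistic (and a small p-value) than the calibrated ones, but, as noted above, a non-significant H-L test is weak evidence that the model is well specified.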