Internal and external validation of predictive models: A simulation study of bias and precision in small samples
Introduction
Optimism is a well-known problem of predictive models: their performance in new patients is often worse than expected from the performance estimated in the development data set (“apparent performance”) [1], [2], [3]. The extent of optimism of pre-specified models can be estimated for similar patient populations with internal validation techniques such as bootstrapping [4], [5]. Predictive models are, however, usually not pre-specified but constructed iteratively. Model specification may include decisions on the coding of variables (e.g., [re-]grouping categorized or continuous variables) and on the inclusion of main effect, nonlinear, and interaction terms in the final model. When the specification process, such as stepwise selection of predictor variables, can be formulated as a systematic rule, however, it can be replayed in its entirety in every bootstrap sample [6]. Such a procedure should provide an honest estimate of the optimism of the final model [3], [7]. Because empirical evidence for this claim is limited, our first aim was to study the accuracy of the bootstrap estimate of optimism of a prediction model developed with variable selection techniques.
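The crux of the procedure is that the entire selection strategy, not just the refitting of a fixed model, is repeated inside each bootstrap sample. The sketch below illustrates this in Python; it is a minimal sketch, not the analysis code of this study. The data frame, column names, and the reduction of the paper's univariable-plus-multivariable stepwise procedure to backward elimination on Wald p-values are assumptions made for illustration.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

ALPHA = 0.05  # retention threshold, assumed for illustration

def backward_select(X, y):
    """Backward elimination: repeatedly drop the predictor with the
    largest Wald p-value until all remaining p-values are < ALPHA."""
    cols = list(X.columns)
    while cols:
        fit = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < ALPHA:
            return cols, fit
        cols.remove(worst)
    return [], None  # no predictor survived selection

def model_auc(fit, cols, X, y):
    """ROC area of a fitted model applied to data (X, y)."""
    return roc_auc_score(y, fit.predict(sm.add_constant(X[cols])))

def bootstrap_optimism(X, y, n_boot=1000, seed=1):
    """Optimism bootstrap with the selection strategy replayed in
    every sample; assumes at least one predictor is retained on the
    original data. Returns (apparent, optimism, corrected) AUC."""
    rng = np.random.default_rng(seed)
    cols0, fit0 = backward_select(X, y)        # selection on original data
    apparent = model_auc(fit0, cols0, X, y)    # apparent performance
    optimism, n = [], len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)            # draw n patients with replacement
        Xb = X.iloc[idx].reset_index(drop=True)
        yb = y.iloc[idx].reset_index(drop=True)
        try:
            cols_b, fit_b = backward_select(Xb, yb)   # replay the whole strategy
            if not cols_b:
                continue                              # empty model: skip
            optimism.append(model_auc(fit_b, cols_b, Xb, yb)   # bootstrap apparent
                            - model_auc(fit_b, cols_b, X, y))  # tested on original
        except Exception:
            continue  # e.g., perfect separation in a bootstrap sample
    opt = float(np.mean(optimism))
    return apparent, opt, apparent - opt       # optimism-corrected AUC

# Usage (hypothetical column names):
# apparent, opt, corrected = bootstrap_optimism(df[predictors], df["infection"])
```

Replaying the selection inside each sample is what distinguishes this from naively bootstrapping a fixed model: the latter misses the optimism contributed by the selection process itself.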
Of more interest than internal validity is external validity, or generalizability [8]. External validity is typically studied in independent validation samples with patients from a different but “plausibly related” population [9]. In a previous study, a diagnostic model developed to estimate the presence of a serious bacterial infection in children with fever without apparent source showed surprisingly poor external validity in another sample of 179 children [10]. Although a sample of 179 subjects is not uncommon in diagnostic (validation) research, the finding raises the question of how large a validation set needs to be. Our second aim was therefore to study the precision of performance estimates in relatively small validation samples and to explore the consequences for the power of validation studies that compare model performance between development and validation sets.
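The sampling variability at issue can be made concrete with a small simulation: draw repeated validation sets of a given size, score the fixed development model in each, and inspect the spread of the resulting ROC areas. The sketch below is illustrative only; the function name, the use of resampling with replacement as a stand-in for drawing patients from a plausibly related population, and the default n_val = 179 are our assumptions, not the study's protocol.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_sampling_distribution(p_hat, y, n_val=179, n_rep=1000, seed=1):
    """Empirical sampling distribution of the ROC area when a fixed
    model (predictions p_hat) is validated in samples of size n_val
    drawn with replacement from the source data (outcomes y)."""
    rng = np.random.default_rng(seed)
    p_hat, y = np.asarray(p_hat), np.asarray(y)
    aucs = []
    for _ in range(n_rep):
        idx = rng.integers(0, len(y), n_val)   # one simulated validation set
        if y[idx].min() == y[idx].max():
            continue                           # one outcome class only: AUC undefined
        aucs.append(roc_auc_score(y[idx], p_hat[idx]))
    aucs = np.asarray(aucs)
    return aucs.mean(), aucs.std(ddof=1)       # mean and empirical SE of the AUC
```

The empirical SE obtained this way indicates roughly how large a genuine drop in discrimination must be before a validation study of the chosen size can reliably distinguish it from sampling noise.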
Section snippets
Patients
We combined two previously described data sets of children presenting with fever without apparent source: a development set from Rotterdam, The Netherlands, diagnosed between 1988 and 1992 (n = 376), and a validation set from Rotterdam and The Hague, diagnosed between 1997 and 1998 (n = 179) [10]. Of these 555 children, 120 (22%) had a serious bacterial infection, which was defined as the presence of bacterial meningitis, sepsis or bacteremia, pneumonia, urinary tract infection, bacterial
Expected optimism in the full data set
In the full data set of 555 children, four statistically significant predictors were selected after univariable and multivariable stepwise analyses: duration of fever at presentation (days), presence of chest-wall retractions, poor peripheral circulation, and presence of crepitations. The apparent ROC area was 0.727, the R2 was 15.7%, and the calibration slope was unity (Table 1).
According to 1000 bootstrap samples, the expected optimism was 0.056 for the ROC area (0.761−0.706) and 9.8% for R2
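For orientation, applying the usual correction (apparent performance minus expected optimism) to the figures quoted above gives the internally validated estimates for the ROC area and R2; these values are derived here from the quoted numbers rather than taken from the study's tables:

$$
\mathrm{AUC}_{\text{corrected}} = 0.727 - 0.056 = 0.671, \qquad
R^2_{\text{corrected}} = 15.7\% - 9.8\% = 5.9\%.
$$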
Discussion
We found that internally validated estimates of model performance could be obtained accurately with bootstrapping when a stepwise selection strategy was followed in constructing the predictive model, provided that this strategy was replayed in every bootstrap sample. The expected optimism was close to that observed in independent random validation samples for a number of performance measures, including the ROC area. However, the variability (SEs) of these performance
Acknowledgements
This study was inspired by the comments of an anonymous reviewer regarding the sampling variability of external validation studies. We gratefully acknowledge the contributions of all medical students and clinicians involved in data collection, especially Dr. G. Derksen-Lubsen (Juliana Children's Hospital, The Hague, The Netherlands). This work was supported by a fellowship from the Royal Netherlands Academy of Arts and Sciences (EWS).
References (25)
- et al. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol (2001)
- et al. Asymptotic stability of the bootstrap sampled mean. Stoch Proc Appl (2002)
- et al. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol (1999)
- et al. Regression modelling strategies for improved prognostic prediction. Stat Med (1984)
- et al. Predictive value of statistical models. Stat Med (1990)
- et al. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med (1996)
- et al. An introduction to the bootstrap
- et al. Bootstrap investigation of the stability of a Cox regression model. Stat Med (1989)
- Model uncertainty, data mining and statistical inference. J Royal Stat Soc A (1995)
- et al. What do we mean by validating a prognostic model? Stat Med (2000)
- Assessing the generalizability of prognostic information. Ann Intern Med