Internal and external validation of predictive models: A simulation study of bias and precision in small samples
Introduction
Optimism is a well-known problem of predictive models: their performance in new patients is often worse than expected from the performance estimated in the development data set (“apparent performance”) [1], [2], [3]. The extent of optimism of pre-specified models can be estimated for similar patient populations with internal validation techniques such as bootstrapping [4], [5]. Predictive models are, however, usually not pre-specified but constructed iteratively. Model specification may include decisions on the coding of variables (e.g., [re-]grouping categorized or continuous variables) and on the inclusion of main effect, nonlinear, and interaction terms in the final model. When the specification process, such as stepwise selection of predictor variables, can be formulated as a systematic rule, however, it can be replayed in its entirety in every bootstrap sample [6]. Such a procedure should provide an honest estimate of the optimism of the final model [3], [7]. Because empirical evidence for this claim is limited, our first aim was to study the accuracy of the bootstrap estimate of optimism of a prediction model developed with variable selection techniques.
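The crux of the procedure is that the entire selection strategy, not just the refitting of a fixed model, is repeated inside each bootstrap sample. The sketch below illustrates this in Python; it is a minimal sketch, not the analysis code of this study. The data frame, column names, and the reduction of the paper's univariable-plus-multivariable stepwise procedure to backward elimination on Wald p-values are assumptions made for illustration.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

ALPHA = 0.05  # retention threshold, assumed for illustration

def backward_select(X, y):
    """Backward elimination: repeatedly drop the predictor with the
    largest Wald p-value until all remaining p-values are < ALPHA."""
    cols = list(X.columns)
    while cols:
        fit = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < ALPHA:
            return cols, fit
        cols.remove(worst)
    return [], None  # no predictor survived selection

def model_auc(fit, cols, X, y):
    """ROC area of a fitted model applied to data (X, y)."""
    return roc_auc_score(y, fit.predict(sm.add_constant(X[cols])))

def bootstrap_optimism(X, y, n_boot=1000, seed=1):
    """Optimism bootstrap with the selection strategy replayed in
    every sample; assumes at least one predictor is retained on the
    original data. Returns (apparent, optimism, corrected) AUC."""
    rng = np.random.default_rng(seed)
    cols0, fit0 = backward_select(X, y)        # selection on original data
    apparent = model_auc(fit0, cols0, X, y)    # apparent performance
    optimism, n = [], len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)            # draw n patients with replacement
        Xb = X.iloc[idx].reset_index(drop=True)
        yb = y.iloc[idx].reset_index(drop=True)
        try:
            cols_b, fit_b = backward_select(Xb, yb)   # replay the whole strategy
            if not cols_b:
                continue                              # empty model: skip
            optimism.append(model_auc(fit_b, cols_b, Xb, yb)   # bootstrap apparent
                            - model_auc(fit_b, cols_b, X, y))  # tested on original
        except Exception:
            continue  # e.g., perfect separation in a bootstrap sample
    opt = float(np.mean(optimism))
    return apparent, opt, apparent - opt       # optimism-corrected AUC

# Usage (hypothetical column names):
# apparent, opt, corrected = bootstrap_optimism(df[predictors], df["infection"])
```

Replaying the selection inside each sample is what distinguishes this from naively bootstrapping a fixed model: the latter misses the optimism contributed by the selection process itself.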
Of more interest than internal validity is external validity, or generalizability [8]. External validity is typically studied in independent validation samples with patients from a different but “plausibly related” population [9]. In a previous study, a diagnostic model developed to estimate the presence of a serious bacterial infection in children with fever without apparent source showed surprisingly poor external validity in another sample of 179 children [10]. Although a sample of 179 subjects is not uncommon in diagnostic (validation) research, the finding raises the question of how large a validation set needs to be. Our second aim was therefore to study the precision of performance estimates in relatively small validation samples and to explore the consequences for the power of validation studies that compare model performance between development and validation sets.
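The sampling variability at issue can be made concrete with a small simulation: draw repeated validation sets of a given size, score the fixed development model in each, and inspect the spread of the resulting ROC areas. The sketch below is illustrative only; the function name, the use of resampling with replacement as a stand-in for drawing patients from a plausibly related population, and the default n_val = 179 are our assumptions, not the study's protocol.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_sampling_distribution(p_hat, y, n_val=179, n_rep=1000, seed=1):
    """Empirical sampling distribution of the ROC area when a fixed
    model (predictions p_hat) is validated in samples of size n_val
    drawn with replacement from the source data (outcomes y)."""
    rng = np.random.default_rng(seed)
    p_hat, y = np.asarray(p_hat), np.asarray(y)
    aucs = []
    for _ in range(n_rep):
        idx = rng.integers(0, len(y), n_val)   # one simulated validation set
        if y[idx].min() == y[idx].max():
            continue                           # one outcome class only: AUC undefined
        aucs.append(roc_auc_score(y[idx], p_hat[idx]))
    aucs = np.asarray(aucs)
    return aucs.mean(), aucs.std(ddof=1)       # mean and empirical SE of the AUC
```

The empirical SE obtained this way indicates roughly how large a genuine drop in discrimination must be before a validation study of the chosen size can reliably distinguish it from sampling noise.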
Section snippets
Patients
We combined two previously described data sets of children presenting with fever without apparent source: a development set from Rotterdam, The Netherlands, diagnosed between 1988 and 1992 (n = 376), and a validation set from Rotterdam and The Hague, diagnosed between 1997 and 1998 (n = 179) [10]. Of these 555 children, 120 (22%) had a serious bacterial infection, which was defined as the presence of bacterial meningitis, sepsis or bacteremia, pneumonia, urinary tract infection, bacterial
Expected optimism in the full data set
In the full data set of 555 children, four statistically significant predictors were selected after univariable and multivariable stepwise analyses: duration of fever at presentation (days), presence of chest-wall retractions, poor peripheral circulation, and presence of crepitations. The apparent ROC area was 0.727, the R2 was 15.7%, and the calibration slope was unity (Table 1).
According to 1000 bootstrap samples, the expected optimism was 0.056 for the ROC area (0.761−0.706) and 9.8% for R2
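For orientation, applying the usual correction (apparent performance minus expected optimism) to the figures quoted above gives the internally validated estimates for the ROC area and R2; these values are derived here from the quoted numbers rather than taken from the study's tables:

$$
\mathrm{AUC}_{\text{corrected}} = 0.727 - 0.056 = 0.671, \qquad
R^2_{\text{corrected}} = 15.7\% - 9.8\% = 5.9\%.
$$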
Discussion
We found that internally validated estimates of model performance could be obtained accurately with bootstrapping when a stepwise selection strategy was followed in constructing the predictive model, provided that this strategy was replayed in every bootstrap sample. The expected optimism was close to that observed in independent random validation samples for a number of performance measures, including the ROC area. However, the variability (SEs) of these performance
Acknowledgements
This study was inspired by the comments of an anonymous reviewer regarding the sampling variability of external validation studies. We gratefully acknowledge the contributions of all medical students and clinicians involved in data collection, especially Dr. G. Derksen-Lubsen (Juliana Children's Hospital, The Hague, The Netherlands). This work was supported by a fellowship from the Royal Netherlands Academy of Arts and Sciences (EWS).
References (25)
- et al. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol (2001)
- et al. Asymptotic stability of the bootstrap sampled mean. Stoch Proc Appl (2002)
- et al. Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis. J Clin Epidemiol (1999)
- et al. Regression modelling strategies for improved prognostic prediction. Stat Med (1984)
- et al. Predictive value of statistical models. Stat Med (1990)
- et al. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med (1996)
- et al. An introduction to the bootstrap
- et al. Bootstrap investigation of the stability of a Cox regression model. Stat Med (1989)
- Model uncertainty, data mining and statistical inference. J Royal Stat Soc A (1995)
- et al. What do we mean by validating a prognostic model? Stat Med (2000)
- Assessing the generalizability of prognostic information. Ann Intern Med