
Assessing eligibility for lung cancer screening using parsimonious ensemble machine learning models: A development and validation study

  • Thomas Callender,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Validation, Writing – original draft, Writing – review & editing

    t.callender@ucl.ac.uk

    Affiliation Department of Respiratory Medicine, University College London, London, United Kingdom

  • Fergus Imrie,

    Roles Methodology, Writing – review & editing

    Affiliation Department of Electrical and Computer Engineering, University of California, Los Angeles, California, United States of America

  • Bogdan Cebere,

    Roles Software

    Affiliation Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, United Kingdom

  • Nora Pashayan,

    Roles Methodology, Supervision, Writing – review & editing

    Affiliation Department of Applied Health Research, University College London, London, United Kingdom

  • Neal Navani,

    Roles Writing – review & editing

    Affiliation Department of Respiratory Medicine, University College London, London, United Kingdom

  • Mihaela van der Schaar,

    Roles Methodology, Software, Supervision, Writing – review & editing

    ‡ These authors are joint senior authors on this work.

    Affiliations Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, United Kingdom, Cambridge Centre for AI in Medicine, University of Cambridge, Cambridge, United Kingdom, Alan Turing Institute, London, United Kingdom

  • Sam M. Janes

    Roles Conceptualization, Supervision, Writing – review & editing

    ‡ These authors are joint senior authors on this work.

    Affiliation Department of Respiratory Medicine, University College London, London, United Kingdom

Abstract

Background

Risk-based screening for lung cancer is currently being considered in several countries; however, the optimal approach to determining eligibility remains unclear. Ensemble machine learning could support the development of highly parsimonious prediction models that maintain the performance of more complex models while maximising simplicity and generalisability, thereby supporting the widespread adoption of personalised screening. In this work, we aimed to develop and validate ensemble machine learning models to determine eligibility for risk-based lung cancer screening.

Methods and findings

For model development, we used data from 216,714 ever-smokers recruited between 2006 and 2010 to the UK Biobank prospective cohort and 26,616 high-risk ever-smokers recruited between 2002 and 2004 to the control arm of the US National Lung Screening Trial (NLST) randomised controlled trial. The NLST randomised high-risk smokers from 33 US centres with at least a 30 pack-year smoking history and fewer than 15 quit-years to annual CT or chest radiography screening for lung cancer. We externally validated our models among 40,593 participants in the chest radiography arm and all 80,659 ever-smoking participants in the US Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. The PLCO trial, which recruited from 1993 to 2001, compared chest radiography against no screening for lung cancer. We primarily validated in the PLCO chest radiography arm so that we could benchmark against comparator models developed within the PLCO control arm. Models were developed to predict the risk of 2 outcomes within 5 years from baseline: diagnosis of lung cancer and death from lung cancer. We assessed model discrimination (area under the receiver operating characteristic curve, AUC), calibration (calibration curves and expected/observed ratio), overall performance (Brier scores), and net benefit with decision curve analysis.

Models predicting lung cancer death (UCL-D) and incidence (UCL-I) using 3 variables—age, smoking duration, and pack-years—achieved or exceeded parity in discrimination, overall performance, and net benefit with comparators currently in use, despite requiring only one-quarter of the predictors. In external validation in the PLCO trial, UCL-D had an AUC of 0.803 (95% CI: 0.783, 0.824) and was well calibrated with an expected/observed (E/O) ratio of 1.05 (95% CI: 0.95, 1.19). UCL-I had an AUC of 0.787 (95% CI: 0.771, 0.802) and an E/O ratio of 1.0 (95% CI: 0.92, 1.07). The sensitivity of UCL-D was 85.5% and that of UCL-I was 83.9%, at 5-year risk thresholds of 0.68% and 1.17%, respectively, 7.9% and 6.2% higher than the USPSTF-2021 criteria at the same specificity. The main limitation of this study is that the models have not been validated outside of UK and US cohorts.

Conclusions

We present parsimonious ensemble machine learning models to predict the risk of lung cancer in ever-smokers, demonstrating a novel approach that could simplify the implementation of risk-based lung cancer screening in multiple settings.

Author summary

Why was this study done?

  • Screening and disease prevention programmes are increasingly bespoke; however, delivering them simultaneously at a population scale presents considerable challenges.
  • Lung cancer is the most common cause of cancer death worldwide, with poor survival in the absence of early detection.
  • Screening for lung cancer among those at high risk could reduce lung cancer-specific mortality by 20% to 24% among those screened, but the ideal way to determine whether someone is at high risk remains uncertain, and existing approaches are resource intensive.

What did the researchers do and find?

  • We used data from the UK Biobank and US National Lung Screening Trial to develop novel, parsimonious models to simplify the prediction of lung cancer risk and selection into lung cancer screening programmes.
  • Using ensemble machine learning and 3 predictors—age, smoking duration, and pack-years—we found our models achieved or exceeded parity in performance with leading comparators despite requiring one-third of the variables.
  • Our models were externally validated in the US Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial and benchmarked against models that are either in use or have performed strongly in previous analyses.

What do these findings mean?

  • Risk assessment for lung cancer screening can be simplified without reducing performance, potentially improving the uptake and effectiveness of national lung cancer screening programmes, and therefore contributing to reducing deaths from lung cancer.
  • Future research should focus on the application of this modelling approach to other conditions such as cardiovascular disease, diabetes, and chronic kidney disease to support the implementation at scale of multiple concurrent risk-stratified prevention and early detection programmes for major causes of morbidity and mortality.
  • A key limitation of this study is its reliance on retrospective data from the US and UK; prospective evaluation in other countries and regions should be considered.

Introduction

Screening, early detection, and disease prevention programmes are increasingly bespoke, with risk prediction algorithms determining an individual’s eligibility and management [1–3]. Such personalisation promises to improve the benefit-to-harm profile of these interventions and ultimately health outcomes [4–6]. However, delivering these programmes at a population scale imposes 2 requirements on risk prediction models: that they generalise well to contexts where there are insufficient data for model development, retraining, or validation; and that the trade-off between model complexity and implementation feasibility is considered.

Screening for lung cancer—the foremost cause of death from cancer worldwide [7]—with low-dose computed tomography (LDCT) has been associated with a 20% to 24% reduction in lung cancer-specific mortality among those at high risk who are screened [8,9]. However, the ideal method to identify those at high risk remains unresolved. The US Preventive Services Taskforce (USPSTF) recommends the use of dichotomous criteria—age 50 to 80, ≥20 pack-years smoked, and <15 quit-years for former smokers—to select screening participants [10]. Nevertheless, identifying individuals for lung cancer screening based on risk prediction models has been shown to offer both better benefit-to-harm profiles and better cost-effectiveness than dichotomous risk factor-based criteria alone [11–14], leading to risk-model-based selection criteria in European lung cancer screening pilots [15].

To date, most externally validated prediction models for lung cancer have been developed in United States datasets [12,16–21], reflecting the relatively limited availability of suitable cohorts with long-term follow-up for prediction modelling. This implies that most global healthcare systems that implement risk-based lung cancer screening will use prediction models developed in a US population, often using variables such as ethnicity, whose categorisation varies between countries and individual datasets, and academic qualifications that differ both over time and between jurisdictions. In the United Kingdom, existing models have been shown to underperform in specific groups, such as the more socioeconomically deprived, where underestimation of risk could lead to a screening programme systematically widening health inequalities [22].

Furthermore, the risk models currently in use are a challenge to implement. In the UK, eligibility for lung cancer screening pilots is based on the PLCOm2012 and Liverpool Lung Project risk models, requiring 17 unique variables, few of which are routinely available [23]. Collecting these variables from a potentially eligible individual and explaining the results currently averages between 5 and 10 minutes. Determining the screening eligibility of 1 million people would therefore require between 48 and 95 full-time staff for a year. In the UK, there are an estimated 6.8 million ever-smokers aged 55 to 74 who are potentially eligible for lung cancer screening, with another 500,000 turning 55 on average each year [24,25]. As lung cancer screening is just one of several risk-based programmes that are either in development or in use, in their current form, these assessments present a formidable obstacle to the effective implementation of national screening programmes.

In this study, we hypothesised that, by using ensemble machine learning with training data spanning different geographic regions, populations, and average risk levels, we could develop predictive models for lung cancer screening that use a minimal number of features yet have broad applicability. In so doing, we aimed to combine the simplicity of risk-factor-based criteria with the improved predictive performance of risk models, while maintaining generalisability to new settings.

Methods

Ethical statement

The University College London Research Ethics Committee gave ethical approval for this study (reference: 19131/001).

Data sources and study population

Development and internal validation datasets.

For model development, we first used data on 216,714 ever-smokers without a prior history of lung cancer from the UK Biobank [26] before creating a multicountry dataset that combined UK Biobank and US National Lung Screening Trial (NLST) [8] data (n = 26,616) (Fig 1; participant flow diagrams in Figs A and B in S1 Appendix). The UK Biobank is a large prospective cohort recruited between 2006 and 2010 from 22 British centres that combines phenotypical data with ongoing linkage to hospital and registry data [27]. During this timeframe, the UK did not have a systematic screening programme for lung cancer. The NLST was a randomised controlled trial of lung cancer screening comparing computed tomography (CT) against chest radiography, recruiting from 33 US centres between 2002 and 2004 with follow-up through 2009 [28]. Participation in the NLST was restricted to those considered at high risk of developing lung cancer: at least a 30 pack-year smoking history and, for former smokers, having quit within 15 years of enrolment [28].

Fig 1. Developing the UCL models to determine lung cancer screening eligibility.

A multicountry dataset comprising the UK Biobank and NLST was used to develop new models before external validation in the PLCO Trial chest radiography arm (allowing benchmark comparison with existing models developed in the PLCO control arm) and the full PLCO cohort (a). The ensemble modelling approach involves optimising individual modelling pipelines before combining their results as a single prediction for each individual. Panel (b) shows details of the UCL-D model, including the weights attributed to each pipeline in generating a single prediction of the five-year risk of lung cancer death for any individual. Panel (c) shows the contribution of different variables to overall predictions, as well as interactions between predictors, analysed using Shapley Additive Explanations (SHAP) in the UK Biobank [35]. The first subfigure in (c) shows that smoking duration was the most important variable when predicting an individual’s risk of dying from lung cancer, followed by pack-years smoked, and finally age. The 3 subsequent dependence subplots show the relationship between each predictor (x-axis) and its SHAP value (y-axis), that is, the importance of knowing that predictor’s value when making a prediction. The vertical dispersion shows the degree of interaction effects present, while the colour corresponds to a second variable. The plots show that smoking for less than approximately 35 years had relatively little impact on model predictions, with a steep inflection and increasing interaction between smoking duration and pack-years beyond this point. This relationship mirrors that seen in the first subfigure, with duration outweighing quantity of cigarettes smoked unless both are high. In other words, individuals who smoke for short periods have a lower predicted risk, even if they smoke relatively large quantities; this reflects our understanding of lung biology and the ability of the lung to repair itself if an individual stops smoking [56]. The final subfigure of (c) shows that age has relatively limited impact on model predictions under the age of 60. NLST, National Lung Screening Trial; PLCO, Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial.

https://doi.org/10.1371/journal.pmed.1004287.g001

We selected the NLST because it is geographically distinct, includes a higher risk cohort, and has greater ethnic diversity than the UK Biobank. By combining NLST data with the UK Biobank, which by contrast is known to represent a cohort with lower mortality risks than the UK general population [29], our prediction models would be trained on a wider range of participants, with potentially improved model performance.

External validation datasets.

For model validation, we used data from 40,593 ever-smokers without a prior history of lung cancer from the chest radiography arm of the US Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial [30] (Fig C in S1 Appendix). This allowed benchmarking against comparator models that were developed in the control arm of the PLCO trial. Chest radiography was found to have no impact on lung cancer mortality and no statistically significant impact on lung cancer incidence [30]. In secondary analyses presented in S1 Appendix, we report model performance in both arms of the full PLCO dataset together (n = 80,659).

Missing data

We used multiple imputation by chained equations (MICE) with predictive mean matching to generate imputed development and validation datasets [31]. We generated 10 imputed sets of the UK Biobank, based on an average missingness among candidate predictors in the UK Biobank of 11%. As missingness was <5% for all relevant variables in the PLCO and NLST, we created 5 imputed PLCO and NLST datasets. See Table A and Figs D–F in S1 Appendix for further details.
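As a concrete illustration of this step, the sketch below generates multiple stochastically imputed copies of a dataset. The authors used the miceforest package [31] (MICE with predictive mean matching); here, scikit-learn's IterativeImputer, which implements the same chained-equations idea, stands in for illustration, and the DataFrame `df` of candidate predictors is an assumption.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def impute_m_datasets(df: pd.DataFrame, m: int = 10, seed: int = 0) -> list:
    """Return m stochastically imputed copies of df via chained equations."""
    imputed = []
    for i in range(m):
        # sample_posterior=True draws imputations from the predictive
        # distribution, giving the between-imputation variability MICE requires
        imp = IterativeImputer(sample_posterior=True, random_state=seed + i)
        imputed.append(pd.DataFrame(imp.fit_transform(df), columns=df.columns))
    return imputed
```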

Outcomes

We developed models to predict the absolute cumulative risk of 2 outcomes within 5 years from baseline: diagnosis of lung cancer and death from lung cancer. Lung cancer status and primary cause of death in the UK Biobank were determined by linked national cancer registry and Office for National Statistics data [26]. In the NLST and PLCO, primary cause of death was confirmed by independent review of death certificates [28,30]. In the NLST, lung cancer diagnoses were ascertained through medical record abstraction, and in the PLCO, through mailings and telephone calls to participants [8,30].

Model development

We developed ensembles of machine learning pipelines using AutoPrognosis, open-source automated machine learning software [32,33]. In this analysis, AutoPrognosis was used to optimise pipelines consisting of a variable preprocessing step followed by model selection and training. These optimised pipelines were subsequently combined, and a single prediction for any individual was generated as a weighted combination of the predictions made independently by each of the 4 pipelines, with weights set by Bayesian model averaging (Fig 1) [32]. We trialled model algorithms including logistic regression, random forests, and gradient boosting approaches (see subsection Model Development in S1 Appendix for further details). Throughout, pipelines were trained and selected to maximise model discrimination, measured with the area under the receiver operating characteristic curve (AUC).
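To illustrate the ensemble structure (not the AutoPrognosis implementation itself), the sketch below fits several scikit-learn pipelines and combines their predicted probabilities with a weight vector. In AutoPrognosis, the pipelines are discovered by Bayesian optimisation and the weights by Bayesian model averaging, so the components and weights shown here are illustrative placeholders.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier

# Candidate pipelines: a preprocessing step followed by a classifier
pipelines = [
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    make_pipeline(StandardScaler(), LinearDiscriminantAnalysis()),
    make_pipeline(AdaBoostClassifier()),
]
weights = np.array([0.4, 0.3, 0.3])  # placeholder; AutoPrognosis derives these by BMA

def fit_ensemble(X_train, y_train):
    for pipe in pipelines:
        pipe.fit(X_train, y_train)

def ensemble_risk(X_new):
    """Weighted average of each pipeline's predicted probability of the event."""
    preds = np.vstack([pipe.predict_proba(X_new)[:, 1] for pipe in pipelines])
    return np.average(preds, axis=0, weights=weights)
```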

Model explanation

We used the Kernel Shapley Additive Explanations (SHAP) [34] algorithm for model explanation and analysis of predictor interactions (Fig 1). Kernel SHAP is a permutation-based method theoretically grounded in coalitional game theory. In summary, each variable is passed to the model one by one, and the resulting change in predictions is attributed to that variable [35,36]. Further details are available in subsection Variable Importance and Interaction in S1 Appendix.
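The sketch below shows how Kernel SHAP might be applied to a fitted model; the fitted `model`, the feature DataFrame `X`, and the column name "smoking_duration" are assumptions for illustration.

```python
import shap

# Kernel SHAP needs a background sample to represent "absent" features
background = shap.sample(X, 100)
explainer = shap.KernelExplainer(lambda d: model.predict_proba(d)[:, 1], background)
shap_values = explainer.shap_values(X.iloc[:500])

shap.summary_plot(shap_values, X.iloc[:500])  # global variable importance (cf. Fig 1c)
shap.dependence_plot("smoking_duration", shap_values, X.iloc[:500])  # interaction effects
```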

Variable selection

For pragmatic reasons, we considered candidate predictors from the UK Biobank that were also present in the NLST and PLCO (Table B in S1 Appendix). We settled on our final list of predictors based on the literature, domain expertise, variable distributions, generalisability to multiple settings, and model discrimination in the UK Biobank. This led first to the development of full models using all selected predictors. Reviewing feature importance, we found that age, smoking duration, and pack-years were driving predictions, leading us to develop models using only these 3 predictors (Fig G in S1 Appendix).

Statistical analysis

We considered a model’s overall performance with the Brier score [37], discrimination with the AUC, calibration with calibration curves and the ratio of expected-to-observed cases, and clinical usefulness with decision curve analysis [38]. Calibration curves were calculated by splitting individuals into 10 deciles of predicted risk before comparing mean predicted probability against observed risk, the latter calculated using a Kaplan–Meier model. For a measure of clinical utility, we considered the net benefit of models across a range of risk thresholds [38]. We compared model discrimination with a two-tailed bootstrap test using the methods of Hanley and McNeil, modified by Robin and colleagues [39,40]. To determine potential risk thresholds for our models, we used a fixed population strategy, matching the number of individuals eligible for screening in the entire PLCO external validation dataset to the number eligible under the USPSTF-2021 criteria.
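As a concrete reading of two of these metrics, the sketch below computes decile-based calibration and the net benefit formula of Vickers and Elkin [38] from assumed arrays `p` (predicted 5-year risks) and `y` (binary outcomes). It is a simplification in that the paper's observed risks are Kaplan–Meier estimates that account for censoring.

```python
import numpy as np
import pandas as pd

def decile_calibration(p, y):
    """Mean predicted vs. observed risk within 10 deciles of predicted risk."""
    df = pd.DataFrame({"p": p, "y": y})
    df["decile"] = pd.qcut(df["p"], 10, labels=False, duplicates="drop")
    return df.groupby("decile").agg(predicted=("p", "mean"), observed=("y", "mean"))

def net_benefit(p, y, threshold):
    """Net benefit at a risk threshold: TP/n - FP/n * odds(threshold)."""
    n = len(y)
    screen = p >= threshold
    true_pos = np.sum(screen & (y == 1)) / n
    false_pos = np.sum(screen & (y == 0)) / n
    return true_pos - false_pos * threshold / (1 - threshold)
```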

In both internal and external validation, we generated 1,000 bootstrap resamples with replacement for all analyses; central estimates and 95% confidence intervals were calculated with the percentile method. We used optimism-corrected metrics for internal validation. All analyses were conducted with R [41] and Python [42].
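A minimal sketch of the percentile bootstrap for the AUC, assuming the same `p` and `y` arrays as above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(p, y, n_boot=1000, seed=0):
    """Percentile-method central estimate and 95% CI for the AUC."""
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))  # resample with replacement
        if len(np.unique(y[idx])) < 2:         # an AUC needs both classes present
            continue
        aucs.append(roc_auc_score(y[idx], p[idx]))
    centre, lo, hi = np.percentile(aucs, [50, 2.5, 97.5])
    return centre, (lo, hi)
```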

Model comparisons

For benchmark comparisons, we compared our new models to the USPSTF-2021 criteria (age 50 to 80, ≥20 pack-year smoking history, and quit within the last 15 years if a former smoker) [10], as well as to existing risk models that are either in use (PLCOm2012 [18] and Liverpool Lung Project (LLP) version 2 [43]) or have been externally validated and consistently shown to outperform other risk models (the Lung Cancer Death Risk Assessment Tool [LCDRAT] and Lung Cancer Risk Assessment Tool [LCRAT] [19]) (Table C in S1 Appendix) [13,22,44,45]. All comparator models predict the five-year risk of death (LCDRAT) or of developing lung cancer (LCRAT, LLP), except the PLCOm2012, which predicts the six-year risk of lung cancer occurrence. A third, recalibrated version of the LLP has been developed; because it is not currently in use, we present full comparative analyses in S1 Appendix, noting that, as it uses the same predictors and coefficients as LLP version 2, its discrimination is equivalent. We also compared against Cox models developed using the same dataset (see Methods in S1 Appendix) and against the constrained versions of the LCDRAT, LCRAT, and PLCOm2012 models.

All variables were available for comparator models except the LLP. For the LLP, in the UK Biobank, data were not available on the age at which a family member developed lung cancer. Following ten Haaf and colleagues [44], and reflecting UK lung cancer epidemiology [46], we assumed that all affected family members developed lung cancer over the age of 60. In the PLCO dataset, asbestos exposure and prior history of pneumonia were not available and were set to zero. We used the lcmodels package in R to calculate predictions for the PLCOm2012, LCRAT, and LCDRAT models [47].

Results

The descriptive characteristics of the UK Biobank and NLST development datasets and the PLCO external validation dataset are presented in Table 1. Characteristics by outcome are presented in Tables D–G in S1 Appendix. The number of cancers diagnosed and deaths from lung cancer are presented by follow-up period in Table H in S1 Appendix.

Table 1. Descriptive characteristics of the development and validation cohorts.

https://doi.org/10.1371/journal.pmed.1004287.t001

We found that age, smoking duration (years), and pack-years of smoking drove most predictions. This led us to focus our analyses on developing 2 models, UCL-D and UCL-I, that use just these 3 variables. UCL-D predicts the five-year risk of dying from lung cancer and is a weighted ensemble of 4 modelling algorithms: AdaBoost [48,49], LightGBM [50], logistic regression, and linear discriminant analysis. UCL-I predicts the five-year risk of developing lung cancer and includes AdaBoost [48,49], LightGBM [50], bagging, and CatBoost [51] algorithms. Details of the ensemble pipelines, their weightings, and algorithm hyperparameters are presented in Figs H and I and Tables I and J in S1 Appendix.

UCL models

In internal and external validation, UCL-D and UCL-I showed good discrimination (Tables 2 and 3 and Table K in S1 Appendix), overall performance (Tables L and M in S1 Appendix), and calibration (Fig 2, Table 4, and Table N in S1 Appendix), both overall and across subgroups. In external validation in the PLCO radiography arm, UCL-D had an AUC of 0.803 (95% CI: 0.783, 0.824), an expected/observed (E/O) ratio of 1.05 (95% CI: 0.95, 1.19), and a Brier score of 0.0084 (95% CI: 0.0075, 0.0093). UCL-I had an AUC of 0.787 (95% CI: 0.771, 0.802), an E/O ratio of 1.0 (95% CI: 0.92, 1.07), and a Brier score of 0.0153 (95% CI: 0.0142, 0.0164).

Fig 2. Calibration curves.

Calibration curves showing UCL and comparator models in the UK Biobank (dark blue dashed lines) and the US PLCO Cancer Screening Trial chest radiography arm (light blue lines). The 45-degree grey lines indicate perfect calibration. Curves were generated by splitting individuals into 10 deciles of predicted risk. Each curve shows the mean predicted risk against the observed risk by risk decile. Observed risk was calculated using a Kaplan–Meier estimator. The UCL models showed good calibration in external validation in the PLCO intervention arm, particularly at predicted risks between 1% and 2%, at which risk thresholds are commonly set. At these thresholds, there was modest underprediction with the LCDRAT, LCRAT, and PLCOm2012 models in the PLCO intervention arm. All models modestly overpredicted risk in the UK Biobank, with the exception of the Liverpool Lung Project version 2 (LLPv2) model, which strongly overpredicted risk. LCDRAT, Lung Cancer Death Risk Assessment Tool; LCRAT, Lung Cancer Risk Assessment Tool. UCL-D predicts lung cancer death; UCL-I predicts occurrence of lung cancer.

https://doi.org/10.1371/journal.pmed.1004287.g002

Table 2. Discriminative accuracy (AUC) overall and by subgroup in the PLCO chest radiography cohort.

https://doi.org/10.1371/journal.pmed.1004287.t002

Table 3. Discrimination of UCL-D, Cox models, and the constrained LCDRAT, LCRAT, and PLCOm2012 models.

https://doi.org/10.1371/journal.pmed.1004287.t003

Table 4. Calibration overall and by subgroup in the PLCO chest radiography cohort.

https://doi.org/10.1371/journal.pmed.1004287.t004

Discrimination

Despite using approximately one-quarter of the variables, UCL-D achieved parity in discrimination with the LCDRAT (AUC: 0.811, 95% CI: 0.793, 0.829; p = 0.18 for difference with UCL-D). UCL-I achieved parity with the PLCOm2012 (AUC: 0.792, 95% CI: 0.779, 0.808; p = 0.15 for difference in AUCs) and showed greater discrimination than LLP versions 2 and 3 (p < 0.001).

Calibration

The UCL models were well calibrated across risk thresholds at which eligibility for screening is typically set, tending modestly towards underprediction in the highest risk decile in the PLCO radiography arm (Fig 2). By contrast, PLCOm2012 and LCRAT tended modestly towards underprediction at deciles corresponding to observed risks of 1% to 4%, which is more clinically disadvantageous than overprediction. As the PLCOm2012, LCDRAT, and LCRAT models were developed in the control arm of the PLCO trial, the strong relative performance of the UCL models is notable. All models modestly overpredicted risk in the UK Biobank cohort, with the extent of overprediction most notable for the LLP version 2.

Overall performance

When considering Brier scores, an overall measure of model performance comparing the closeness of predicted probabilities and observed outcomes [37], there was little or no distinction between the models in the UK Biobank and the PLCO radiography arm (Tables L and M in S1 Appendix). In the PLCO radiography arm, both models predicting the five-year risk of death, UCL-D and LCDRAT, had a Brier score of 0.0084 (95% CI: 0.0075, 0.0093). Brier scores vary with prevalence; consequently, models predicting the risk of developing lung cancer had higher scores. Nevertheless, the same pattern was observed: UCL-I had a Brier score of 0.0153 (95% CI: 0.0142, 0.0164), LCRAT a score of 0.0152 (95% CI: 0.0143, 0.0164), and LLP version 2 a score of 0.0153 (95% CI: 0.0143, 0.0165).

Risk thresholds to select individuals for screening

Using the USPSTF-2021 criteria, 34,654 (43.0%) of the entire PLCO dataset would be eligible for lung cancer screening. All UCL models had higher sensitivity than the USPSTF-2021 at an equivalent specificity, with the gains in sensitivity higher when predicting five-year risk of death from lung cancer (Table O in S1 Appendix). For UCL-I at a five-year risk threshold of 1.17%, the gains in sensitivity were 6.2% relative to the USPSTF-2021 criteria (83.9% [95% CI: 82.0, 86.1%] versus 77.7% [95% CI: 75.8, 80.2%]). By contrast, UCL-D at a five-year risk threshold of 0.68% would lead to a 7.9% increase in sensitivity (85.5% [95% CI: 82.8, 88.2%] versus 77.5% [95% CI: 74.6, 80.9%]) for the same specificity.
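These comparisons follow the fixed-population strategy described in the Methods. A minimal sketch of the computation, assuming arrays `p` (model risks), `y` (outcomes), and a boolean array `uspstf_eligible` for the same cohort:

```python
import numpy as np

def matched_threshold(p, uspstf_eligible):
    """Risk cut-off selecting as many people as the USPSTF-2021 criteria."""
    n_eligible = int(np.sum(uspstf_eligible))
    return np.sort(p)[::-1][n_eligible - 1]  # n-th largest predicted risk

def sens_spec(selected, y):
    sensitivity = np.sum(selected & (y == 1)) / np.sum(y == 1)
    specificity = np.sum(~selected & (y == 0)) / np.sum(y == 0)
    return sensitivity, specificity

t = matched_threshold(p, uspstf_eligible)
print(sens_spec(p >= t, y))             # model at the matched threshold
print(sens_spec(uspstf_eligible, y))    # USPSTF-2021 criteria for comparison
```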

At the aforementioned risk cut-offs, 96.2% of individuals selected by UCL-D would also have been eligible for screening with UCL-I. By 10 years of follow-up, those selected for screening by UCL-D but not UCL-I tended towards a greater risk of developing and dying from lung cancer than those selected by UCL-I but not UCL-D, though this trend was not statistically significant (Fig J in S1 Appendix; log-rank test: p = 0.15 for differences in lung cancer deaths and p = 0.41 for differences in lung cancers).

Using decision curve analysis, at all risk thresholds, the net benefit of the UCL models is greater than that of screening using the USPSTF-2021 criteria (Fig 3 and Fig K in S1 Appendix). At suggested risk thresholds, the net benefits of the compared risk models other than the LLP are equivalent.

Fig 3.

Decision curves of selected models in the PLCO validation cohort. Net benefit across a range of thresholds for models predicting the five-year risk of death from lung cancer (A) and of developing lung cancer (B), compared against the US Preventive Services Taskforce (USPSTF) 2021 screening eligibility criteria in the PLCO Cancer Screening Trial chest radiography arm validation dataset. The PLCOm2012 model predicts the six-year risk of lung cancer; as its performance over a five-year timeframe was similar to that over six years, predictions over a five-year timeframe are shown here for comparability. All models studied except the Liverpool Lung Project version 2 (LLPv2) had a greater net clinical benefit than using the USPSTF-2021 criteria for screening eligibility across all risk thresholds. All other risk models had a comparable net benefit to each other. LCDRAT, Lung Cancer Death Risk Assessment Tool; LCRAT, Lung Cancer Risk Assessment Tool. UCL-D predicts lung cancer death; UCL-I predicts occurrence of lung cancer.

https://doi.org/10.1371/journal.pmed.1004287.g003

Discussion

We have developed parsimonious models for lung cancer screening that combine the simplicity of existing risk factor-based criteria with the predictive performance of complex risk prediction models. Furthermore, we show in benchmarking comparisons that ensemble machine learning models with 3 predictors—age, smoking duration, and smoking pack-years—have equivalent predictive performance and clinical usefulness to existing models requiring 11 predictors.

In this analysis, we used ensemble machine learning to leverage the predictions of several optimised model pipelines. Ensemble modelling is based on the concept that different models make different types of mistake, and their errors partially cancel each other out, such that a combination of statistical models can be expected to improve on the performance of any single one [52]. By iteratively trialling and optimising a wide range of modelling approaches before creating ensembles of them, AutoPrognosis helps ensure that the strongest-performing model for a given dataset is derived, and supports reproducibility by transparently recording how models were selected. This avoids the need to develop multiple independent models.

In the UK, eligibility for National Health Service screening pilots is based on meeting either a five-year absolute risk of lung cancer of ≥2.5% with the LLP risk score or a six-year absolute risk of ≥1.51% with the PLCOm2012 [23]. The use of 2 risk scores whose eligibility thresholds differ by more than a percentage point in predicted absolute risk, and where a higher risk is tolerated over a five-year period than over a six-year period, highlights the policy challenge in adopting the optimal risk-based approach for a particular setting. This approach requires the collection of 17 unique predictors, as well as the mapping of US educational levels and US ethnicity categorisations to the UK. With an estimated 7 million current smokers in the UK [25]—even ignoring former smokers—the time and resource requirements to determine screening eligibility at a population scale will be challenging. Requiring only 3 unambiguous variables while offering equivalent or improved performance, the UCL models could more easily be completed online or in primary healthcare, simplifying the implementation of lung cancer screening.

A potential risk of using only a few predictors is that the models will underperform in particular subgroups. Across all major subgroups, the performance of the UCL models was equivalent to that of existing models; importantly, this included all 4 ethnic groups available in this analysis. The UCL models were also well calibrated in different subgroups, with no differences by sex, smoking status, or age. However, there was some undercalibration in 2 groups whose characteristics are used as predictors in comparator models: those with a history of COPD and those with a family history of lung cancer. Furthermore, all models were undercalibrated in at least 1 ethnic group. As discrimination was good for most models, this suggests that lower thresholds could be considered, particularly in black populations, analogous to the relaxation in the age and smoking intensity criteria made between the USPSTF-2013 and USPSTF-2021 screening recommendations [53]. More work is required to improve calibration among different ethnicities, although it is notable that simply including this predictor in models had limited impact.

In keeping with Katki and colleagues [19], we found that UCL-D, predicting the risk of death from lung cancer, had greater discrimination than models predicting lung cancer occurrence. In these analyses, there was >96% overlap between UCL-D and UCL-I in terms of those selected for screening, with those selected by UCL-D but not UCL-I showing a trend towards a greater risk of death from lung cancer with longer follow-up (Fig J in S1 Appendix). In microsimulation modelling, overall outcomes differed little between a model predicting death from lung cancer and models predicting the development of lung cancer [13]. Given this, UCL-D would be the more appropriate model to consider for implementation.

This study has several strengths. We used large prospective cohorts for model development and validation. Our external validation cohort is both temporally and geographically distinct. We used robust methods for model development and internal validation while externally validating our models extensively, using multiple approaches and in a wide range of subgroups. Further, we benchmarked our models against leading comparators. Moreover, by using few, unambiguous variables, our models could be widely applied after further validation and, where necessary, recalibration. Finally, we have made our models openly available for independent assessment.

This study has several limitations. We have used retrospective data, so findings may differ if the models are used to prospectively determine screening eligibility. However, both the PLCOm2012 and the LLP models have been studied in prospective settings, establishing the benefits of risk-model-based over risk-factor-based screening; by benchmarking against these models, we can be more confident in the performance of our models in a screening programme. To confirm the generalisability of our models, validation in datasets from beyond the US and UK will be the subject of further work. Our analyses were performed in research cohorts rather than in routinely collected electronic health records, which may better reflect the broader population. In keeping with the models used as benchmark comparators in this work, the UCL models may not perform as well in routinely collected electronic health records, where smoking data are not usually present in the same depth [54]. Nevertheless, screening programmes are unlikely to rely on existing electronic health records given known challenges of missing and inaccurately coded predictors [55]. Finally, our risk models exclude never-smokers. To date, no risk model has been able to identify never-smokers with sufficient risk to meet existing criteria for lung cancer screening.

In conclusion, we used ensemble machine learning to explicitly maximise model parsimony, an approach that holds promise in multiple disease areas. Our prediction models to determine lung cancer screening eligibility require only 3 variables—age, smoking duration, and pack-years—and perform at or above parity with existing risk models in use. Further validation in alternative datasets as well as prospective implementation should be considered.

Supporting information

Acknowledgments

This research has been conducted using the UK Biobank Resource under application number 68073 and we wish to thank all participants in the included studies, as well as the National Cancer Institute for access to NCI’s data collected by the National Lung Screening Trial (NLST) and Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial (PLCO). We also wish to thank Arjun Nair and Sujin Kang for their feedback on earlier versions of this project, as well as Stephen Duffy for his comments on this work.

The statements contained herein are solely those of the authors and do not represent or imply concurrence or endorsement by NCI.

References

1. Oudkerk M, Devaraj A, Vliegenthart R, Henzler T, Prosch H, Heussel CP, et al. European position statement on lung cancer screening. Lancet Oncol. 2017;18:e754–e766. pmid:29208441
2. Lee A, Mavaddat N, Wilcox AN, Cunningham AP, Carver T, Hartley S, et al. BOADICEA: a comprehensive breast cancer risk prediction model incorporating genetic and nongenetic risk factors. Genet Med. 2019;0:1–11. pmid:30643217
3. Hippisley-Cox J, Coupland C, Brindle P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ. 2017;357. pmid:28536104
4. Pashayan N, Antoniou AC, Ivanus U, Esserman LJ, Easton DF, French D, et al. Personalized early detection and prevention of breast cancer: ENVISION consensus statement. Nat Rev Clin Oncol. 2020;1–19. pmid:32555420
5. Fitzgerald RC, Antoniou AC, Fruk L, Rosenfeld N. The future of early cancer detection. Nat Med. 2022;28:666–677. pmid:35440720
6. The Lancet Public Health. Next generation public health: towards precision and fairness. Lancet Public Health. 2019;4:e209. pmid:31054633
7. World Health Organization. The Global Cancer Observatory. [cited 2021 May 24]. Available from: https://gco.iarc.fr/.
8. National Lung Screening Trial Research Team, Aberle DR, Adams AM, Berg CD, Black WC, Clapp JD, et al. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med. 2011;365:395–409. pmid:21714641
9. de Koning HJ, van der Aalst CM, de Jong PA, Scholten ET, Nackaerts K, Heuvelmans MA, et al. Reduced Lung-Cancer Mortality with Volume CT Screening in a Randomized Trial. N Engl J Med. 2020;382:503–513. pmid:31995683
10. US Preventive Services Task Force, Krist AH, Davidson KW, Mangione CM, Barry MJ, Cabana M, et al. Screening for Lung Cancer: US Preventive Services Task Force Recommendation Statement. JAMA. 2021;325:962–970. pmid:33687470
11. Meza R, Jeon J, Toumazis I, Ten Haaf K, Cao P, Bastani M, et al. Evaluation of the Benefits and Harms of Lung Cancer Screening With Low-Dose Computed Tomography: Modeling Study for the US Preventive Services Task Force. JAMA. 2021;325:988–997. pmid:33687469
12. Toumazis I, Bastani M, Han SS, Plevritis SK. Risk-Based lung cancer screening: A systematic review. Lung Cancer. 2020;147:154–186. pmid:32721652
13. ten Haaf K, Bastani M, Cao P, Jeon J, Toumazis I, Han SS, et al. A Comparative Modeling Analysis of Risk-Based Lung Cancer Screening Strategies. J Natl Cancer Inst. 2020;112:466–479. pmid:31566216
14. Landy R, Young CD, Skarzynski M, Cheung LC, Berg CD, Rivera MP, et al. Using Prediction Models to Reduce Persistent Racial and Ethnic Disparities in the Draft 2020 USPSTF Lung Cancer Screening Guidelines. J Natl Cancer Inst. 2021;113:1590–1594. pmid:33399825
15. Kauczor H-U, Baird A-M, Blum TG, Bonomo L, Bostantzoglou C, Burghuber O, et al. ESR/ERS statement paper on lung cancer screening. Eur Radiol. 2020;30:3277–3294. pmid:32052170
16. Bach PB, Kattan MW, Thornquist MD, Kris MG, Tate RC, Barnett MJ, et al. Variations in lung cancer risk among smokers. J Natl Cancer Inst. 2003;95:470–478. pmid:12644540
17. Spitz MR, Hong WK, Amos CI, Wu X, Schabath MB, Dong Q, et al. A risk model for prediction of lung cancer. J Natl Cancer Inst. 2007;99:715–726. pmid:17470739
18. Tammemägi MC, Katki HA, Hocking WG, Church TR, Caporaso N, Kvale PA, et al. Selection criteria for lung-cancer screening. N Engl J Med. 2013;368:728–736. pmid:23425165
19. Katki HA, Kovalchik SA, Berg CD, Cheung LC, Chaturvedi AK. Development and Validation of Risk Models to Select Ever-Smokers for CT Lung Cancer Screening. JAMA. 2016;315:2300–2311. pmid:27179989
20. Wilson DO, Weissfeld J. A simple model for predicting lung cancer occurrence in a lung cancer screening program: The Pittsburgh Predictor. Lung Cancer. 2015;89:31–37. pmid:25863905
21. Cheung LC, Berg CD, Castle PE, Katki HA, Chaturvedi AK. Life-Gained-Based Versus Risk-Based Selection of Smokers for Lung Cancer Screening. Ann Intern Med. 2019;171:623–632. pmid:31634914
22. Robbins HA, Alcala K, Swerdlow AJ, Schoemaker MJ, Wareham N, Travis RC, et al. Comparative performance of lung cancer risk models to define lung screening eligibility in the United Kingdom. Br J Cancer. 2021;124:2026–2034. pmid:33846525
23. NHS England. Targeted Screening for Lung Cancer with Low Radiation Dose Computed Tomography: Standard Protocol prepared for the Targeted Lung Health Checks Programme. Jan 2019 [cited 2022 Jun 13]. Available from: https://www.england.nhs.uk/wp-content/uploads/2019/02/B1646-standard-protocol-targeted-lung-health-checks-programme-v2.pdf.
24. Office for National Statistics. Estimates of the population for the UK, England, Wales, Scotland and Northern Ireland. Office for National Statistics; 21 Dec 2022 [cited 2023 May 24]. Available from: https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/populationestimatesforukenglandandwalesscotlandandnorthernireland.
25. Office for National Statistics. Adult smoking habits in the UK—2019. Office for National Statistics; 6 Jul 2020 [cited 2022 May 13]. Available from: https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/healthandlifeexpectancies/bulletins/adultsmokinghabitsingreatbritain/2019.
26. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562:203–209. pmid:30305743
27. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12:e1001779. pmid:25826379
28. National Lung Screening Trial Research Team, Aberle DR, Berg CD, Black WC, Church TR, Fagerstrom RM, et al. The National Lung Screening Trial: overview and study design. Radiology. 2011;258:243–253. pmid:21045183
29. Fry A, Littlejohns TJ, Sudlow C, Doherty N, Adamska L, Sprosen T, et al. Comparison of Sociodemographic and Health-Related Characteristics of UK Biobank Participants With Those of the General Population. Am J Epidemiol. 2017;186:1026–1034. pmid:28641372
30. Oken MM, Hocking WG, Kvale PA, Andriole GL, Buys SS, Church TR, et al. Screening by Chest Radiograph and Lung Cancer Mortality. JAMA. 2011;306:1865. pmid:22031728
31. Wilson S. Miceforest. [cited 2022 Feb 24]. Available from: https://github.com/AnotherSamWilson/miceforest.
32. Alaa A, van der Schaar M. AutoPrognosis: Automated Clinical Prognostic Modeling via Bayesian Optimization with Structured Kernel Learning. Proceedings of the 35th International Conference on Machine Learning. PMLR. 2018;139–148. Available from: https://proceedings.mlr.press/v80/alaa18b.html.
33. Imrie F, Cebere B, McKinney EF, van der Schaar M. AutoPrognosis 2.0: Democratizing diagnostic and prognostic modeling in healthcare with automated machine learning. PLoS Digit Health. 2023;2:e0000276. pmid:37347752
34. Lundberg S. SHAP Package. [cited 2022 Jun 8]. Available from: https://shap-lrjball.readthedocs.io/en/latest/.
35. Lundberg SM, Lee S-I. A Unified Approach to Interpreting Model Predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017;4768–4777. https://doi.org/10.5555/3295222.3295230
36. Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From Local Explanations to Global Understanding with Explainable AI for Trees. Nat Mach Intell. 2020;2:56–67. pmid:32607472
37. Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78:1–3.
38. Vickers AJ, Elkin EB. Decision Curve Analysis: A Novel Method for Evaluating Prediction Models. Med Decis Making. 2006;26:565–574. pmid:17099194
39. Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology. 1983;148:839–843. pmid:6878708
40. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77. pmid:21414208
41. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria; 2021. Available from: https://www.R-project.org/.
42. Python Software Foundation. Python. Available from: https://www.python.org/.
43. Field JK, Vulkan D, Davies MPA, Duffy SW, Gabe R. Liverpool Lung Project lung cancer risk stratification model: calibration and prospective validation. Thorax. 2021;76:161–168. pmid:33082166
44. ten Haaf K, Jeon J, Tammemägi MC, Han SS, Kong CY, Plevritis SK, et al. Risk prediction models for selection of lung cancer screening candidates: A retrospective validation study. PLoS Med. 2017;14:e1002277. pmid:28376113
45. Katki HA, Kovalchik SA, Petito LC, Cheung LC, Jacobs E, Jemal A, et al. Implications of Nine Risk Prediction Models for Selecting Ever-Smokers for Computed Tomography Lung Cancer Screening. Ann Intern Med. 2018;169:10–19. pmid:29800127
46. Cancer Research UK. Lung cancer incidence statistics. [cited 2022 Jun 13]. Available from: https://www.cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/lung-cancer/incidence.
47. Cheung L, Kovalchik SA, Hormuzd KA. R Package for Individual Risks of Lung Cancer and Lung Cancer Death. [cited 2022 Aug 22]. Available from: https://dceg.cancer.gov/tools/risk-assessment/lcmodels.
48. Freund Y, Schapire RE. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J Comput System Sci. 1997;55:119–139.
49. Scikit-learn. An AdaBoost Classifier. [cited 2023 Jan 10]. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier.
50. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: A highly efficient gradient boosting decision tree. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017;3149–3157.
51. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: Unbiased Boosting with Categorical Features. Proceedings of the 32nd International Conference on Neural Information Processing Systems. 2018;6639–6649.
52. Breiman L. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Stat Sci. 2001;16:199–231.
53. Pu CY, Lusk CM, Neslund-Dudas C, Gadgeel S, Soubani AO, Schwartz AG. Comparison Between the 2021 USPSTF Lung Cancer Screening Criteria and Other Lung Cancer Screening Criteria for Racial Disparity in Eligibility. JAMA Oncol. 2022;8:374–382. pmid:35024781
54. O’Dowd EL, Ten Haaf K, Kaur J, Duffy SW, Hamilton W, Hubbard RB, et al. Selection of eligible participants for screening for lung cancer using primary care data. Thorax. 2021. pmid:34716280
55. Dickson JL, Hall H, Horst C, Tisi S, Verghese P, Worboys S, et al. Utilisation of primary care electronic patient records for identification and targeted invitation of individuals to a lung cancer screening programme. Lung Cancer. 2022;173:94–100. pmid:36179541
56. Teixeira VH, Pipinikas CP, Pennycuick A, Lee-Six H, Chandrasekharan D, Beane J, et al. Deciphering the genomic, epigenomic, and transcriptomic landscapes of pre-invasive lung cancer lesions. Nat Med. 2019;25:517–525. pmid:30664780