Discussion
We set out to predict the need for TKR at 2 and 5 years, using predictive variables that represent routinely collected data. With the exception of the DTs, the ML models produced were able to predict the need for a TKR with clinically acceptable performance when validated on an independent external test set. The GBM model achieved the highest predictive power at 2 years (AUC 0.913 (95% CI 0.876 to 0.951)) and 5 years (AUC 0.873 (95% CI 0.839 to 0.907)). The key features driving the predictions of this best-performing model were KLG, JSN, physical score features (WOMAC and SF-12 scores) and educational attainment. This is the first study to validate predictive models for TKR externally, and the lack of reliance on MRI, with its associated costs and limited accessibility, facilitates the wider application of our tools through ease of interpretability, implementation and scalability to various clinical settings.
In terms of other non-imaging-reliant models, Wang et al7 used OAI data to develop an LR model using selected demographic and clinical information, with an AUC of 0.77±0.02, which is lower than the internal and external dataset performance of all our top-performing models. Another study shared our novelty in using non-MRI-based features to predict TKR within 4 years,17 although it did not report AUC, instead reporting the total percentage of correctly predicted knees as 80% (69%–89%). However, that model was not externally validated and was evaluated on only a subset of subjects, as only the 165 patients receiving TKR were analysed. That study also used an artificial neural network, which carries advantages in terms of information processing, fault and noise tolerance compared with our ML models,17 but such networks function as a ‘black box’, and this lack of transparency may limit doctor and patient confidence in the model’s predictions.18 19
In terms of imaging-dependent models, Tolpadi et al used direct imaging to predict TKR at 5 years for OAI subjects with varying OA severity.5 The paper evaluated six models: raw imaging-based, non-imaging-based and integrated (both), for radiographs and for MRI, concluding that the model integrating MRI and non-imaging features outperformed the others. Interestingly, our AUC (GBM, internal test; 0.945 at 5 years) exceeded those of all six of their models evaluated on their internal test data: 0.868 (non-imaging), 0.848 (radiographic images only), 0.890 (integrated radiographic model), 0.886 (MRI images only) and 0.834 (integrated MRI model). Jamshidi et al,6 in a study that predicted TKR and time to TKR, also used quantitative MRI data from OAI, developing a model with an AUC of 0.86, which did not outperform our model. It should be noted that none of these previous studies validated their results using an external dataset, and so the real-world performance of their models remains uncertain.
Of note, the sensitivities of Tolpadi et al’s models exceeded those suggested by our F1-scores. This is important to consider because, while the AUC reflects the models’ ability to assess both negative and positive cases, the F1-score combines precision (positive predictive value) with recall (sensitivity), and so reflects the model’s ability to predict TKR cases. Our scores for the prediction of positive cases at both timepoints were <0.3, reflecting a lower sensitivity than that of Tolpadi et al. This suggests a bias of the AUC evaluation towards the majority class (negative cases), revealing that our models were better able to predict negative cases than positive cases. The explanation for our lower positive predictive ability in comparison with Tolpadi et al’s potentially lies in their use of deep learning, such as convolutional neural networks, which use more advanced feature extraction to better manage the complex prognostic features that determine TKR risk,5 20 21 thus strengthening their positive predictive power. A distinct advantage of our models, however, was their simplicity and thus transparency, as well as their reliance on more obtainable data, particularly considering the higher costs and reduced availability of MRI.8 Recent statistics estimate that a single MRI scan costs as much as US$1430 in the USA and £450 in the UK.22
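The contrast between AUC and F1-score under class imbalance can be illustrated with a minimal sketch. The data below are entirely hypothetical (95 negative and 5 positive cases with made-up risk scores) and are not drawn from OAI or any study dataset; the 0.5 decision threshold and all function names are likewise assumptions made for illustration.

```python
def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall_f1(labels, scores, threshold=0.5):
    """Precision, recall (sensitivity) and F1 at a fixed decision threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical imbalanced cohort: 95 knees without TKR, 5 with TKR.
labels = [0] * 95 + [1] * 5
# Made-up scores: most negatives score low, a few score high;
# most positives rank above the bulk of negatives but below the threshold.
scores = [0.10] * 85 + [0.60] * 10 + [0.70] + [0.40] * 4

auc_value = auc(labels, scores)
precision, recall, f1 = precision_recall_f1(labels, scores)
print(f"AUC={auc_value:.3f} precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

In this toy cohort the ranking is good (AUC of about 0.92), yet only one of the five positive cases crosses the threshold, so recall is 0.2 and the F1-score is 0.125.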
The transparency of our ML models also allowed us to examine the key predictors used by our most accurate model (GBM), and reassuringly they mostly align with previous literature findings.12 23 A study that used RF modelling of the OAI dataset to explore TKR incidence over 2 years selected the same predictive variables used by our model, that is, KLG, WOMAC and SF-12.23 While that study was performed on OAI, and similarities with our findings are thus expected, the external validation in our study confirms the importance of these variables across different datasets. Elsewhere in the literature, a prospective Canadian population-based study identified WOMAC summary scores as key predictors of TKR risk, supporting our findings.12 A further advantage of knowing which variables are most important is that data collection can be targeted, thus reducing the paperwork burden for both patients and physicians.
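How the key predictors of a fitted model can be inspected is not detailed in this section. One common model-agnostic approach is permutation importance, sketched below purely as an illustration and not as the method used in this study. The toy data, the stand-in `model` function and the two feature positions (an informative severity-like score and an unrelated column) are all hypothetical.

```python
import random


def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(model, rows, labels, n_features, repeats=10, seed=0):
    """Mean drop in AUC when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = auc(labels, [model(r) for r in rows])
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - auc(labels, [model(r) for r in shuffled]))
        importances.append(sum(drops) / repeats)
    return importances


# Hypothetical data: feature 0 tracks the outcome, feature 1 is unrelated.
rows = [(i, (i * 7) % 5) for i in range(20)]
labels = [1 if i >= 15 else 0 for i in range(20)]


def model(row):
    """Stand-in for a fitted model: scores risk from feature 0 only (assumption)."""
    return row[0]


importances = permutation_importance(model, rows, labels, n_features=2)
print(importances)
```

A feature the model relies on shows a positive drop in AUC when shuffled, whereas a feature the model ignores shows a drop of zero.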
Interestingly, our models also identified education as a key predictor. While low socioeconomic status is well recognised as one of the strongest predictors of morbidity and mortality from many chronic diseases, there are few data regarding its impact within KOA.24 One paper’s analysis of the socioeconomic effect on KOA found that educational attainment was associated with decreased KOA prevalence in initial analyses; however, this association was lost after adjustment for confounders.24 Our finding may reflect higher rates of manual work, which is associated with an increased risk of OA, among groups with lower educational attainment. Indeed, a study of pain disparities in underserved populations within OAI identified more severe OA in lower socioeconomic groups (inclusive of education) in addition to disparities in pain, and this was not accounted for by objective OA measures.25 Alternatively, it may reflect the US insurance-based healthcare system, with education serving as a proxy for income and timely access to healthcare intervention. Further exploration of educational attainment in relation to OA and TKR may be merited.
The clinical relevance of our tool lies in its ability to use routinely collected data and transparent ML techniques to predict TKR with a clinically acceptable accuracy that surpasses that of previous models. Our model’s independence from MRI scanning is important because it resolves many of the issues of cost and accessibility, and in doing so increases its potential for use in both developed and developing countries. Our tool has the potential to facilitate targeted non-operative management efforts to modify risks for patients, particularly those predicted to require a TKR in 5 years’ time, with the aim of improving their QOL and potentially delaying the need for TKR. For patients predicted to require a TKR in 2 years, in addition to supporting risk factor modification, the tool may assist with planning care so that these patients are closely monitored and the ideal time to intervene surgically is identified. Knowledge regarding the likelihood of requiring a TKR will empower and motivate patients, and facilitate informed shared decision making with their clinicians. It also has clear potential benefits for health economists tasked with planning future resource allocation.
A limitation of our study is the class imbalance in the dataset, with the majority of included patients not progressing to TKR during the study period. This is reflected in the low F1-scores, which suggest that our models were better at predicting negative cases, that is, patients not requiring a TKR at 2 or 5 years. Another limitation is the demographic imbalance in the OAI primary data, which are biased towards older patients and include a higher proportion of female and white patients. Additionally, both datasets used were US based, and further studies are required to confirm that the models are applicable outside the USA.
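One simple way to rebalance classes before model training is random oversampling of the minority class, sketched below under stated assumptions. It is not the approach used in this study, the `random_oversample` helper and the toy cohort are hypothetical, and more advanced techniques exist. Any resampling of this kind should be applied to the training split only, so that test performance still reflects the true class distribution.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Sample with replacement to make up the shortfall for this class.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical training split: 95 knees without TKR, 5 with TKR.
train_rows = [(i,) for i in range(100)]
train_labels = [0] * 95 + [1] * 5

balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
print(Counter(balanced_labels))
```

After resampling, both classes contain 95 rows; the added positive rows are exact copies of existing ones, which is why this basic approach can encourage overfitting to the few observed positive cases.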
This study presents the first externally validated ML model that uses simple and routinely available patient data, while delivering clinically acceptable levels of predictive power, to forecast a patient’s need for TKR at 2 and 5 years. The simplicity and transparency of our models in terms of design and input, with no reliance on MRI, increase the likelihood of their adoption as a treatment decision aid, identifying patients who are more likely to benefit from non-operative management and risk factor modification. Sharing this information with patients would also be expected to facilitate shared decision making and empower them to play an active role in their KOA management. Future research will explore the accuracy of our models in non-US populations and the use of advanced sampling techniques to address the class imbalance.