Original Research | Published: 15 February 2023

Predicting total knee replacement at 2 and 5 years in osteoarthritis patients using machine learning


This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license.


Objectives Knee osteoarthritis is a major cause of physical disability and reduced quality of life, with end-stage disease often treated by total knee replacement (TKR). We set out to develop and externally validate a machine learning model capable of predicting the need for a TKR in 2 and 5 years’ time using routinely collected health data.

Design A prospective study using the Osteoarthritis Initiative (OAI) and the Multicentre Osteoarthritis Study (MOST) datasets. OAI data were used to train the models while MOST data formed the external test set. Feature selection during preprocessing yielded 45 candidate features covering demographics, medical history, imaging assessments, history of intervention and outcome.

Setting The study was conducted using two multicentre USA-based datasets of participants with or at high risk of knee OA.

Participants The study excluded participants with at least one existing TKR. The OAI dataset included participants aged 45–79 years, of whom 3234 were used for training and 809 for internal testing, while MOST involved participants aged 50–79 years, of whom 2248 were used for external testing.

Main outcome measures The primary outcome of this study was prediction of TKR onset at 2 and 5 years. Performance was evaluated using area under the curve (AUC) and F1-score and key predictors identified.

Results For the best performing model (gradient boosting machine), the AUC at 2 years was 0.913 (95% CI 0.876 to 0.951), and at 5 years 0.873 (95% CI 0.839 to 0.907). Radiographic-derived features, questionnaire-based assessments alongside the patient’s educational attainment were key predictors for these models.

Conclusions Our approach suggests that routinely collected patient data are sufficient to drive a predictive model with a clinically acceptable level of accuracy (AUC>0.7) and is the first such tool to be externally validated. This level of accuracy is higher than previously published models utilising MRI data, which is not routinely collected.

What is already known on this topic

  • The demand for total knee replacement (TKR) has increased exponentially in recent years, exerting a pressure on patients, surgeons and hospitals to decide on the timing of surgery.

  • Machine learning has the potential to forecast the need for TKR.

What this study adds

  • This study is the first to develop machine learning models using routinely collected accessible data and test these models using an external dataset and provides evidence that an externally validated machine learning model can predict the need for TKR with an acceptable level of accuracy.

How this study might affect research, practice or policy

  • Potential adoption of our tool provides early knee osteoarthritis patients with useful information regarding their likelihood of requiring TKR surgery over the next 2–5 years thus empowering them to make treatment decisions as well as lifestyle changes to reduce this risk.

  • The information would also assist health economists to understand and meet the future demand for knee replacement surgery.


Osteoarthritis (OA) is the most common degenerative joint disease and a major cause of physical disability, pain and reduced quality of life (QOL) for patients, with increasing global prevalence due to ageing populations and obesity.1 The resultant global socioeconomic burden of OA is estimated to cost in excess of £4.2 billion.2 Total knee replacement (TKR) is an effective treatment for end-stage knee OA (KOA),1 and in line with increasing disease prevalence, its use in the UK alone is expected to rise significantly from 70 000 per year at present, to at least 119 000 per year by 2035.3

A tool to evaluate the likelihood of a patient requiring a TKR over the next 5 years has much appeal. It would allow informed decision making by patients, both in terms of non-operative treatment such as lifestyle modification, and the timing of any surgical intervention. For clinicians and health economists, a better understanding of the likely case-load over a period of 2–5 years would allow for appropriate planning to meet demand.

Predictive modelling of the need for a TKR using machine learning (ML) has been explored. Among the earliest TKR prediction tools was a population-based study using patient-reported risk factors to predict 10-year TKR risk.4 The tool, however, was restricted to older female patients, limiting generalisability. Further studies have since been conducted using more complex ML strategies including deep learning. These studies depend strongly on MRI input and have predicted TKR risk at 2, 4 and 5 years5 6 with predictive performances of up to area under the curve (AUC) 0.87±0.02.7 Such studies have made strides in predicting TKR; however, MRI is costly and not routinely performed,8 and the deep learning strategies used are not well understood and require significant computational power.6 7 Additionally, despite promising predictive abilities, none of the published ML models have been externally validated to date, which is a significant limitation to their general applicability.

To address these limitations, we set out to develop and validate a tool that predicts which patients, with or at high risk of KOA, will likely require a TKR in 2 and 5 years’ time, using patient information collected during routine clinical practice. Six different ML classification models were evaluated: multivariable logistic regression (LR), LASSO, RIDGE, decision tree (DT), random forest (RF) and gradient boosting machine (GBM). A number of factors may be considered when selecting ML models, including understandability and complexity: while a complex model can identify more interesting patterns in the data, it is also harder to maintain and explain. Six of the simplest and most readily explained ML models were thus selected.9


A summary of the methodology is found in figure 1.

Figure 1

A summary of the methodology, based on subject and feature disposition. The flow chart demonstrates the initial cohort, exclusion, approaches implemented at each stage, and resulting subjects and features included in analysis. The shaded section reflects the separation of the external dataset throughout. OAI, Osteoarthritis Initiative; TKR, total knee replacement.


Data source and exclusion criteria

This study used data from two multicentre USA-based prospective cohort studies of patients with, or at high risk of, KOA: the Osteoarthritis Initiative (OAI) and the Multicentre Osteoarthritis Study (MOST).10 11 The OAI study enrolled 4976 subjects (ages 45–79 years) between February 2004 and May 2006 at four clinical sites (Baltimore, Maryland; Columbus, Ohio; Pittsburgh, Pennsylvania; and Pawtucket, Rhode Island) and MOST enrolled 3026 subjects (ages 50–79 years) from April 2003 to April 2005 at two sites (Birmingham, Alabama and Iowa City, Iowa). Eligibility for OAI included subjects with, or at risk for, symptomatic femoral-tibial KOA, a cohort defined by the presence of both osteophytes and frequent symptoms in one or both knees, or frequent knee symptoms without radiographic changes in one or both knees. For MOST, similar eligibility was used to select subjects but with a reliance on MRI rather than radiographs. Subjects with unilateral or bilateral TKR at baseline were excluded.

Data pre-processing

Feature selection

The OAI and MOST databases included 96 and 103 features, respectively. Those representing possible risk factors for progression of KOA were identified based on literature and expert knowledge.12 13 Forty-five relevant features present in both datasets were then selected (summarised in online supplemental table 1). Of note, the criterion for the feature ‘steroid injection history’ differed between the datasets, being recorded over the previous 12 months in OAI and over the previous 6 months in MOST.

Feature extraction

Selected features were categorised into the following domains: demographic, medical history, imaging assessments, history of intervention and outcome, with 39 non-imaging features and six image-based features. Medical history comprised both clinical examination and patient-reported outcomes. Image-based variables were quantitative radiographic measures: Kellgren-Lawrence grade (KLG) and joint space narrowing (JSN).

The MOST protocol imputed random numbers for missing feature responses, and we applied the same approach to any missing features in the OAI dataset (online supplemental table 2).

Data split

The dataset was divided into three for the purposes of analysis (figure 1): 80% of the OAI dataset was used to develop and optimise the models (training set) with the remaining 20% of the dataset used for internal evaluation (internal test set). The MOST dataset was used for external validation. The OAI training and test datasets were randomly stratified in R to contain similar proportions of positive (having had a TKR) and negative (no TKR) cases.
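The stratified 80/20 split described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function name `stratified_split`, the seed and the toy labels are assumptions made for the example, not taken from the paper.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split indices so each part keeps a similar share of every class.

    Hypothetical helper for illustration; the study performed this step in R.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random assignment within each class
        cut = round(len(members) * train_fraction)
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels (1 = TKR, 0 = no TKR); not study data.
labels = [1] * 50 + [0] * 950
train_idx, test_idx = stratified_split(labels)
train_pos = sum(labels[i] for i in train_idx)
test_pos = sum(labels[i] for i in test_idx)
```

Splitting within each class separately is what keeps the proportion of positive cases the same in the training and internal test sets, which matters when positive cases are rare.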

Data output

Our study outcome variable of TKR was a binary ‘yes’ or ‘no’ for each patient case at 2 and 5 years.


Model development and training

Model configuration and optimisation

Supervised ML models were used to predict the outcome, categorising new probabilistic observations into the predefined categories of ‘yes’ or ‘no’ TKR at 2 and 5 years. ML software packages were used in R V.3.6.3 (the packages used are detailed in online supplemental table 3 for reproducibility). The following ML classification models were selected: multivariable LR, LASSO, RIDGE, DT, RF and GBM. For each model, a number of tuneable knobs (parameters and hyperparameters) were adjusted to optimise performance (see online supplemental material ‘model optimisation’).

Evaluation metrics

Model performance on all three data sets was evaluated with the area under the receiver operating characteristic (ROC) curve (AUC) for discrimination, with focused reporting on the internal test and external test sets. We considered AUC >0.7 to provide a clinically acceptable performance.14 F1-scores, the harmonic mean of precision and recall (sensitivity)15 16 and a measure of positive predictive power, were calculated for the best performing models. Key predictors in the best-performing model, at 2 and 5 years, were identified using variable importance evaluation functions of the ML models.
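To make the two metrics concrete, the sketch below computes AUC as the probability that a randomly chosen positive case is scored above a randomly chosen negative case, and F1 as the harmonic mean of precision and recall. It is an illustrative Python implementation on synthetic values, not the R routines used in the study; the function names and the 0.5 cut-off in the example are assumptions.

```python
def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic example (1 = TKR, 0 = no TKR); not study data.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.5, 0.9]
auc = roc_auc(y_true, scores)                 # 7 of 8 pairs ranked correctly
y_pred = [1 if s >= 0.5 else 0 for s in scores]
f1 = f1_score(y_true, y_pred)                 # precision 2/3, recall 1
```

AUC depends only on how cases are ranked, whereas F1 depends on the chosen cut-off and ignores true negatives, so a model can combine a high AUC with a low F1 when positive cases are rare.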

Model calibration

The optimal threshold for calibration, in line with the variation in numbers of positive and negative cases within datasets,12 was determined using the F1-score in order to optimise positive predictive ability (online supplemental figure 1).
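One way to choose a probability threshold by F1-score is to scan candidate cut-offs and keep the one with the highest F1. The sketch below is a hedged Python illustration on synthetic probabilities; the exact procedure the authors followed is described in their supplemental material, and the function names here are hypothetical.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 for the positive class when predicting 1 if prob >= threshold."""
    tp = fp = fn = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, probs):
    """Scan every distinct predicted probability as a candidate threshold."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(probs)):
        f1 = f1_at_threshold(y_true, probs, t)
        if f1 > best_f1:  # keep the lowest threshold among ties
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Synthetic, imbalanced example (3 positives in 10 cases); not study data.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30, 0.40]
best_t, best_f1 = best_f1_threshold(y_true, probs)
f1_default = f1_at_threshold(y_true, probs, 0.5)
```

In this toy example no predicted probability reaches 0.5, so a default cut-off of 0.5 labels every case negative and gives an F1 of zero, while a lower cut-off recovers the positive cases. With rare outcomes, predicted probabilities tend to be small, which is why the threshold needs tuning.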


Data distribution

The distribution of key candidate features is displayed in table 1. The training set comprised 3234 patients, of whom 41.6% were male, and 43.3% and 41.4% had radiographic, moderate or severe left KOA and right KOA, respectively. The internal test set consisted of 809 patients, of whom 42.5% were male, and 45.1% and 43.2% had radiographic, moderate or severe left KOA and right KOA, respectively. The external test set included 2248 patients, of whom 42.1% were male, and 34.3% and 37.1% had radiographic, moderate or severe left KOA and right KOA, respectively. Correlation between features within the primary dataset is visualised as a correlation heatmap (online supplemental figure 2).

Table 1
Data were split into training, internal test set and external test set as displayed

Training and internal test performance

Optimised predictive abilities for each model applied to the training and internal test sets are detailed in table 2. The best performing model at 2 years was GBM with an AUC of 0.945 (95% CI 0.901 to 0.988), and at 5 years RIDGE with an AUC of 0.869 (95% CI 0.803 to 0.935). The worst performing model at 2 years was LR with an AUC of 0.730 (95% CI 0.496 to 0.965), and at 5 years DT with an AUC of 0.688 (95% CI 0.608 to 0.768). The DT model was unable to categorise any cases at 2 years because the uniform probability threshold selected for model calibration was not optimal. Performance on the external dataset (table 3) revealed that GBM models were best at both time points, with an AUC of 0.913 (95% CI 0.876 to 0.951) and 0.873 (95% CI 0.839 to 0.907) for 2 and 5 years, respectively. When applied to the external test set, low positive predictive ability is evident at both time points, as denoted by low F1-scores.

Table 2
Displaying AUC for all five models predicting TKR at 2 years and 5 years when applied to training and internal test sets
Table 3
Displaying the performance of three models when applied to the external test set (MOST), as evaluated by AUC and F1-score

Overall, the best performing models, based on performance on the internal test set, were GBM, RIDGE and LASSO. TKR prediction at 2 years was also consistently more accurate than at 5 years for the three best performing models.

External test performance

ROC curves for the three best performing models are shown in figure 2A,C when applied to the internal test set and in figure 2B,D when applied to the external test set, at both time points. Performance across models is slightly reduced on the external test set at both time points but remains comparable with the internal test set, with the exception of GBM, which exceeds its internal test performance on the external test set at 5 years (AUC 0.873 compared with 0.855). GBM is consistently the best performing model (AUC-2years=0.913, AUC-5years=0.873).

Figure 2

Comparison of the top three performing ML models’ performance as receiver operating characteristic (ROC) curves for TKR prediction at 2 and 5 years. (A) and (C) show ROC curves on the internal test set only; (B) and (D) show ROC curves on the external test set (MOST), with the test set curves overlain as dashed lines to allow direct comparison. In all curves, the black line signifies the performance of a random classifier (area under the curve, AUC=0.500). The legends in the subplots indicate the AUC of the models with 95% CIs. GBM, gradient boosting machine; MOST, Multicentre Osteoarthritis Study; TKR, total knee replacement.

Model predictors

Relative influence is ranked in table 4 to show the order of importance of features in training the model. For instance, a relative influence of 21.54 means that the feature accounts for 21.54% of the reduction in the loss function given this set of features, as opposed to 21.54% of variance. The radiographic feature KLG was the strongest predictor in the best performing model (GBM) across both prediction years, followed by the less important features: the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), Short-Form-12 (physical and mental components) and educational attainment (table 4). JSN also had a relatively high influence at 5 years despite not being notable at 2 years. A number of the remaining features had a relative influence of ‘0’ and were thus ‘unnecessary’ in predicting TKR at 2 and 5 years under the GBM model.
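The interpretation of relative influence as a percentage share of loss reduction can be illustrated with a short sketch. It assumes that each feature has a raw, non-negative total loss reduction accumulated over the splits that use it; the feature names and values below are placeholders invented for the example and are not results from the study.

```python
def relative_influence(loss_reduction):
    """Rescale per-feature loss reductions so they sum to 100, ranked high to low."""
    total = sum(loss_reduction.values())
    if total <= 0:
        raise ValueError("total loss reduction must be positive")
    scaled = {name: 100.0 * value / total for name, value in loss_reduction.items()}
    # Rank features from most to least influential.
    return dict(sorted(scaled.items(), key=lambda kv: kv[1], reverse=True))


# Placeholder values for illustration only.
raw = {"feature_c": 1.0, "feature_a": 6.0, "feature_d": 0.0, "feature_b": 3.0}
ranked = relative_influence(raw)
# feature_a -> 60.0, feature_b -> 30.0, feature_c -> 10.0, feature_d -> 0.0
```

A feature that is never used in a split contributes no loss reduction and so receives a relative influence of 0, which corresponds to the ‘unnecessary’ features described above.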

Table 4
Denotes the largest predictors for the best-performing model (GBM) alongside their relative influence at 2 years and 5 years. *The Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC).


We set out to predict the need for TKR at 2 and 5 years, using predictive variables that represent routinely collected data. With the exception of the DTs, the ML models produced were able to predict the need for a TKR with a clinically acceptable performance, using an independent external test set for validation. The GBM model achieved the highest predictive power at 2 years (AUC 0.913 (95% CI 0.876 to 0.951)) and at 5 years (AUC 0.873 (95% CI 0.839 to 0.907)). The key features driving the predictions of this best performing model were KLG, JSN, the physical score features (WOMAC and SF-12 scores) and educational attainment. This is the first study to validate predictive models externally, and the lack of reliance on MRI, with its associated costs and limited accessibility, facilitates the wider application of our tools through ease of interpretability, implementation and scalability to various clinical settings.

In terms of other non-imaging-reliant models, Wang et al7 used OAI data to develop an LR model using selected demographic and clinical information, with an AUC of 0.77±0.02, which is lower than the internal and external dataset performance of all our top-performing models. Another study shared our novelty in using non-MRI-based features to predict TKR within 4 years,17 although its evaluation metrics did not include AUC; instead, it reported the total percentage of correctly predicted knees as 80% (69%–89%). However, this model was not externally validated and was evaluated on a restricted sample, as only the 165 patients receiving TKR were analysed. This study also used an artificial neural network, which carries advantages in terms of information processing, fault and noise tolerance compared with our ML models,17 but such networks function as a ‘black-box’ and this lack of transparency may limit doctor and patient confidence in the model’s predictions.18 19

In terms of imaging-dependent models, Tolpadi et al used direct imaging to predict TKR at 5 years for OAI subjects with varying OA severity.5 The paper evaluated six models: raw imaging-based, non-imaging-based and integrated (both), for radiographic and MRI imaging, concluding that the model integrating MRI and non-imaging features outperformed the others. Interestingly, our AUC (GBM, internal test; 0.945 at 5 years) exceeded all six of their models performed on their internal test data: 0.868 (non-imaging), 0.848 (radiographic images only), 0.890 (integrated radiographic model), 0.886 (MRI images only) and 0.834 (integrated MRI model). Jamshidi et al,6 a study that predicted TKR and time to TKR, also used MRI quantitative imaging data from OAI, developing a model with an AUC of 0.86, although this did not outperform our model. It should be noted that none of these previous studies validated their results using an external dataset, and so the real-world performance of their models remains uncertain.

Of note, Tolpadi et al’s model sensitivities exceeded those suggested by our F1-scores. This is important to consider because, while the AUC reflects the models’ ability to assess both negative and positive cases, the F1-score incorporates precision, a measure of positive predictive power, that is, the model’s ability to predict TKR cases. Prediction of positive cases at both timepoints was <0.3, reflective of a lower sensitivity than Tolpadi et al’s. This suggests a bias of the AUC evaluation towards the majority class (negative cases), revealing that our models were better able to predict negative cases than positive ones. The explanation for our lower positive predictive abilities in comparison to Tolpadi et al’s potentially lies in their use of deep learning, such as convolutional neural networks, which use more advanced feature extraction to better manage the complex prognostic features that determine TKR risk,5 20 21 thus strengthening their positive predictive power. A distinct advantage of our models, however, was their simplicity and thus transparency, as well as their reliance on more obtainable data, particularly considering the higher costs and reduced availability of MRI.8 Recent statistics estimate a single MRI scan to cost as much as US$1430 and £450 in the USA and UK, respectively.22

The transparency of our ML models also allowed us to examine the key predictors used by our most accurate model (GBM), and reassuringly they mostly align with previous literature findings.12 23 A study which used RF modelling of the OAI dataset to explore TKR incidence over 2 years, selected the predictive variables used by our model that is, KLG, WOMAC and SF-12.23 While this study was performed on OAI and thus, similarities with our findings are expected, the external validation of our study confirms the importance of these variables across different datasets. Elsewhere in the literature, a prospective Canadian population-based study identified WOMAC summary scores as key predictors for TKR risk, supporting our findings.12 The other advantage of knowing which variables are most important is that data collection can be targeted, thus reducing the paperwork burden for both patients and physicians.

Interestingly, our models also identified education as a key predictor. While low socioeconomic status is well recognised as one of the strongest predictors of morbidity and mortality from many chronic diseases, there are few data regarding its impact within KOA.24 One paper’s analysis of the socioeconomic effect on KOA found that educational attainment was associated with decreased KOA prevalence in initial analyses; however, this association was lost after confounder adjustment.24 Our finding may reflect higher rates of manual work, which is associated with increased risk of OA, among lower educational groups. Indeed, a study of pain disparities in underserved populations within OAI identified more severe OA in lower socioeconomic groups (inclusive of education) in addition to disparities in pain, and this was not accounted for by objective OA measures.25 Alternatively, it may reflect the US insurance-based healthcare system, with education serving as a proxy for income and for timely access to early healthcare intervention. Further exploration of educational attainment in relation to OA and TKR may be merited.

The clinical relevance of our tool is dictated by its ability to use routinely collected data and transparent ML techniques to predict TKR with a clinically acceptable accuracy which surpasses previous models. Our model’s independence from MRI scanning is important because it resolves many of the issues of cost and accessibility and in doing so increases its potential for use in both developed and developing countries. Our tool has the potential to facilitate targeted non-operative management efforts to modify risks for patients, particularly those predicted to require a TKR in 5 years’ time, with the aim of improving their QOL and potentially delaying the need for TKR. For patients predicted to require a TKR in 2 years, as well as modifying risk factors, this may assist with planning of care to closely monitor these patients and identify the ideal time to intervene surgically. Knowledge regarding the likelihood of requiring a TKR will empower and motivate patients, and facilitate informed shared decision making with their clinicians. It also has clear potential benefits for health economists tasked with planning future resource allocation.

A limitation of our study is the class imbalance in the dataset with the majority of patients included not progressing to have a TKR during the studied period. This is reflected in the low F1-scores, which suggest that our models were better at predicting negative cases, that is, patients not requiring a TKR at 2 or 5 years. Another limitation is the demographic imbalance in the OAI primary data, which has a bias towards older patients, as well as a higher proportion of female and white patients. Additionally, both datasets used were USA based, and further studies are required to confirm that the models are applicable outside of the USA.

This study presents the first externally validated ML model using simple and routinely available patient data, while delivering clinically acceptable levels of predictive power, to forecast a patient’s need for TKR at 2 and 5 years. The simplicity and transparency of our models in terms of design and input, with no reliance on MRI, increase the likelihood of their adoption as a treatment decision aid, identifying patients who are more likely to benefit from non-operative management and risk factor modification. Sharing this information with patients would also be expected to facilitate shared decision making and empower them to play an active role in their KOA management. Future research will explore the accuracy of our models in non-US populations and the use of advanced sampling techniques to address the class imbalance.