
Long-term active surveillance of implantable medical devices: an analysis of factors determining whether current registries are adequate to expose safety and efficacy problems
  1. Samprit Banerjee1,
  2. Bruce Campbell2,
  3. Josh Rising3,
  4. Allan Coukell3 and
  5. Art Sedrakyan1
  1. 1 Healthcare Policy and Research, Weill Cornell Medical College, New York City, New York, USA
  2. 2 University of Exeter Medical School, Exeter, UK
  3. 3 Health Care Programs, The Pew Charitable Trusts, Washington, DC, USA
  1. Correspondence to Dr Samprit Banerjee; sab2028{at}



Ensuring the long-term safety and effectiveness of medical devices is critical to public health. Collecting outcomes through registries has exposed some high-profile failures of implanted devices, such as metal-on-metal hip prostheses.1 Well-established registries exist in many countries and more are being developed. Recently, the US Food and Drug Administration (FDA) began supporting registry-based efforts to expand the current infrastructure for recording the real-world performance of medical devices.2–5 However, the utility of registry-based active surveillance for detecting safety signals and efficacy concerns is by no means certain. Do they collect the key outcomes most likely to identify problems? Do they accrue sufficient numbers of patients to detect underperforming devices? Do they receive and review patient outcomes at appropriate intervals after implantation of devices? They need to produce relevant information in a timely way to help regulators and clinicians recognize devices with safety problems quickly. The data they provide can also help device manufacturers develop better next-generation products.

In this Analysis, we focus on four high-risk and widely used implanted medical devices: total hip replacement (THR) and total knee replacement (TKR) devices, stent grafts for endovascular aneurysm repair (EVAR), and surgical mesh implanted for pelvic organ prolapse (POP). We review evidence on device performance and analyze the likely numbers of patients whose data need to be recorded to detect a device which is performing significantly worse than expected. We review the current capacity of registries’ active surveillance and the supporting infrastructure to conduct these analyses.

Long-term active device surveillance and the role of registries

Many questions about the long-term safety and effectiveness of medical devices remain unanswered at the time of a device’s approval by a regulatory agency, because of the time required to gather sufficient data (especially data on long-term performance) and the high costs associated with data collection. An agreed plan for real-world long-term active surveillance can reduce preapproval data collection burden and give patients earlier access to new and safe technologies. The balance between preapproval and postapproval data collection required by regulatory agencies (eg, FDA) is gradually shifting.6 Traditionally, long-term active surveillance of device safety has relied on adverse event reports that physicians, healthcare institutions, manufacturers, and patients submit to the FDA and to regulators in other countries, and on regulator-required postapproval studies that manufacturers conduct. However, previous research has demonstrated the shortcomings of postapproval studies7 and adverse event reports: these include under-reporting and lack of denominator data to conduct population-level adverse event estimates.7 8

Registries offer a promising alternative or adjunct: they have exposed serious device-related problems in the past and recent reports have highlighted their potential.9–11 Linkage of registries and creation of coordinated registry networks (CRN) are likely to become an important method for tracking patient outcomes and assessing the performance of devices.

Focus on particular high-risk implanted devices

As test cases, we chose four types of implanted devices that are in frequent use and for which registries and CRNs are being developed to track their performance:

  • THR and TKR—as two separate device categories: Total joint replacement (TJR), both hip and knee, is the fastest growing elective device-based surgery worldwide. In the USA, over 400 000 hip and 610 000 knee replacements are performed annually, with combined numbers projected to reach 6 million by 2030.12–14

  • EVAR: Abdominal aortic aneurysm (AAA) repair was performed more frequently in the USA (278 921 procedures) than in the UK (29 300) during 2005–2012, yet aneurysm-related death is three times more likely in the UK than in the USA.15 Aneurysms were traditionally treated by inserting a synthetic graft during major open surgery, but the most common approach is now a significantly less invasive endovascular procedure using a stent graft: this is EVAR. In 2012, over 30 000 EVAR procedures were performed in the USA.16

  • Surgical mesh for POP: Surgical mesh is often used in POP repair. In 2010, an estimated 300 000 POP repairs were completed in the USA.17 There are current concerns about the safety of these procedures: 1 in 11 women experienced problems with vaginal mesh implants in the UK according to National Health Service (NHS) data on 92 000 women from Hospital Episodes Statistics.18

For each type of device, we selected three primary outcome measures to assess its performance. The choice of outcomes was based on the most commonly collected key data items related to safety, effectiveness and/or patient-reported outcomes; these are summarized in table 1.

Table 1

Summary and background for three measures chosen for each of the four devices

Do registries enroll sufficient numbers of patients?

The number needed to follow (NNF) is the number of patients a registry needs for a specific brand/type/class of device to detect statistically significantly (p<0.05) worse performance compared with a prespecified threshold. The choice of threshold and of follow-up time for each device and outcome combination used in our analysis is described in more detail below. Operationally, the threshold for binary outcomes was a gold standard event rate, and for continuous outcomes a mean baseline or preoperative score. The probability of detecting a significant departure from this gold standard (the statistical power) was then calculated. For binary outcomes, this was a one-sided test of proportions using the formula:

$$1-\beta = \Phi\left(\frac{(p_1-p_0)\sqrt{n} - z_{1-\alpha}\sqrt{p_0(1-p_0)}}{\sqrt{p_1(1-p_1)}}\right)$$

where n = NNF, p1 is the device performance rate, p0 is the gold standard performance rate, Φ is the standard normal cumulative distribution function, z1−α is the (1−α) quantile of a standard normal distribution, α is the significance level and 1−β is the power. Power for a generic binary outcome (table 2) was calculated for 1.5 times increased odds of underperforming compared with a gold standard performance rate, per recommendations.19 For continuous outcomes, power was computed for a one-sided alternative in a generalized estimating equations framework with a quasiscore-based test statistic,20 assuming a single group, a linear effect of time, and a range (0.05–0.50) of the intraclass correlation coefficient, a parameter that accounts for correlation among repeated measures. Power for a generic continuous outcome (table 3) was calculated for small to moderate departures (0.3–0.5 SD) on the scale of Cohen's d, as recommended for quality and patient-reported outcomes.21 22
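The one-sided test of proportions described above can be sketched in Python. This is a minimal illustration, not the authors' software: the function names and the odds-ratio-to-rate conversion step are our own, and only the standard library is used.

```python
from math import sqrt
from statistics import NormalDist

def or_to_rate(p0: float, odds_ratio: float) -> float:
    """Device event rate p1 implied by a gold standard rate p0 and an
    odds ratio for underperformance (eg, the 1.5x odds used in the text)."""
    odds1 = odds_ratio * p0 / (1 - p0)
    return odds1 / (1 + odds1)

def power_binary(n: int, p0: float, p1: float, alpha: float = 0.05) -> float:
    """Power of a one-sided test of proportions against the gold standard
    rate p0, when the true device rate is p1 > p0."""
    z = NormalDist().inv_cdf(1 - alpha)  # (1 - alpha) normal quantile
    numerator = (p1 - p0) * sqrt(n) - z * sqrt(p0 * (1 - p0))
    return NormalDist().cdf(numerator / sqrt(p1 * (1 - p1)))

def nnf_binary(p0: float, p1: float, target: float = 0.90,
               alpha: float = 0.05) -> int:
    """Smallest cohort size (NNF) giving at least the target power."""
    n = 1
    while power_binary(n, p0, p1, alpha) < target:
        n += 1
    return n
```

For example, with a 2% gold standard 30-day mortality rate and 1.5 times increased odds, p1 is roughly 2.97%, and the resulting NNF is close to the ~2200 cited later in the text (small differences reflect rounding and the exact quantiles used).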

Table 2

Power for a generic binary outcome.

Table 3

Sample size table for generic continuous outcome.

Of note, these computations do not reflect missed follow-up measurements, the rates of which vary widely across procedures and outcomes. We recommend inflating the NNF by an appropriate, context-dependent expected missingness rate at the planning stage and, at the analysis stage, correcting for ascertainment bias due to missing data with robust or doubly robust methods (eg, inverse probability weighting or augmented inverse probability weighting) or multiple imputation.23 Another important caveat is that our calculations do not account for subgroup analysis requirements (eg, performance of hip implants in men and women separately); the NNF needs to be adjusted if such subgroup comparisons are planned in advance.
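The planning-stage adjustment amounts to simple inflation of the NNF by the expected loss to follow-up; a sketch (the function name is our own illustration):

```python
from math import ceil

def inflate_nnf(nnf: int, expected_missing_rate: float) -> int:
    """Enrollment target such that, after the expected fraction of patients
    is lost to follow-up, the originally computed NNF is still observed."""
    if not 0 <= expected_missing_rate < 1:
        raise ValueError("expected_missing_rate must be in [0, 1)")
    return ceil(nnf / (1 - expected_missing_rate))
```

For instance, a planned NNF of 2500 with 20% expected loss to follow-up implies enrolling 3125 patients.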

At what time points should registries evaluate device performance?

The number of times a device is evaluated is conceptualized in two ways: 'one look', where performance is evaluated at the end of a specified time period (eg, 30 days for 30-day mortality); and 'multiple looks', where performance is evaluated repeatedly at several time intervals (eg, 1, 5 and 10 years). While the choice of time intervals was guided by a narrative review of the literature, the intervals are presented for illustrative purposes only, and our methodology can be applied to any choice of time points. Critically, however, the overall type I error (alpha) should in every situation be controlled at 5% and adjusted for multiple looks using a simple and conservative method such as Bonferroni's (as we use here) or other alpha-spending methods such as O'Brien-Fleming, Pocock or Lan-DeMets.
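The Bonferroni adjustment used here splits the overall alpha equally across looks, which tightens the per-look critical value; a minimal sketch (function name ours):

```python
from statistics import NormalDist

def bonferroni_looks(overall_alpha: float, n_looks: int) -> tuple[float, float]:
    """Per-look significance level and one-sided critical z value when the
    overall type I error is split equally across n_looks (Bonferroni)."""
    per_look_alpha = overall_alpha / n_looks
    critical_z = NormalDist().inv_cdf(1 - per_look_alpha)
    return per_look_alpha, critical_z
```

With three looks (eg, years 1, 5 and 10) and an overall alpha of 5%, each look is tested at roughly alpha = 1.67% (one-sided critical z of about 2.13) rather than at 5% (z of about 1.64), which in turn raises the NNF relative to a 'one look' design.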

What indicates that a device is not performing as expected?

The expected performance of devices at each follow-up time point was guided by a narrative review of literature (see online supplementary tables 1-4). When available (eg, endoleak endpoint after AAA) we gave priority to estimated thresholds from studies that conducted meta-analysis or meta-regression. In other instances, our estimated performance thresholds are conservative and a formal systematic review or meta-analysis is unlikely to change the results substantially.

Supplemental material

Hip and knee replacement

Key outcomes were scores of physical function, quality of life (QoL) measures and revision rates. Our assessment of Harris Hip Score (HHS) showed that following a cohort of 341 patients will provide >90% power to detect an underperforming device with a clinically meaningful change of 4.5 HHS points (or d=0.3) after 5 and 10 years (table 4). Similarly, our assessment for Knee Society Score showed that a cohort of 341 patients is required to provide >90% power to detect an outlier device with a clinically meaningful change of 4.5 points (or d=0.3) at the end of 5 and 10 years.

Table 4

Number needed to follow (NNF) of three measures of four devices and a power of >90%

Postoperative Short Form 12 (SF-12) is the most commonly used measure of general QoL. Using SF-12 we found that a cohort of 181 (66) patients is needed to identify a small (moderate) change of 1.4 (2.4) points for THR or 1.6 (2.7) points for TKR, for ‘one look’ at the end of 1 year with 90% statistical power.

Based on National Institute for Health and Care Excellence guidelines,24 a revision surgery rate of <5% at 10 years is a metric of good performance for joint replacements, and revision is commonly assessed at years 2, 5 and 10.25 Following a cohort of 2500 patients for 5 years would provide >90% power to detect 1.5 times increased odds of revision compared with the annualized revision rate of 0.5%.

Endovascular aneurysm repair

Three key outcomes are 30-day mortality, occurrence of endoleaks, and the need for secondary interventions (for endoleaks and other complications associated with the stent graft). Thirty-day all-cause mortality varies between 0.5% and 3.6% (see online supplementary table 3) with a best estimate around 2%. Following up a cohort of 2200 patients would provide >90% power to detect 1.5 times increased odds of 30-day all-cause mortality compared with the performance goal of 2% mortality.

A recent meta-regression26 estimated the cumulative rate of endoleak occurrence (excluding type II) to be 5.67% at 2 years, or an annualized rate of 2.84%. A cohort of 1400 participants would provide >90% power to detect 1.5 times higher odds of endoleak occurrence at the end of 2 and 5 years. By comparison, our NNF estimate to detect 1.75 times higher odds with 80% power is 525, similar to that estimated in Kent et al's study.26
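The link between a cumulative event rate and its annualized equivalent can be made explicit. The sketch below assumes a constant per-year event probability, a simplification (the quoted 2.84% simply halves the 2-year figure, which gives nearly the same answer at these low rates); the function names are ours:

```python
def annualize(cumulative_rate: float, years: float) -> float:
    """Constant annual event rate implied by a cumulative rate over `years`."""
    return 1 - (1 - cumulative_rate) ** (1 / years)

def cumulate(annual_rate: float, years: float) -> float:
    """Cumulative event rate over `years` from a constant annual rate."""
    return 1 - (1 - annual_rate) ** years
```

A 5.67% cumulative endoleak rate at 2 years corresponds to a constant annual rate of roughly 2.9%, close to the 2.84% obtained by simple division.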

Secondary vascular interventions are estimated to occur at an annual rate of 4% (online supplementary table 3). With this performance goal, the NNF is 450 at 4 and 6 years (see table 4) to have >90% power to detect 1.5 times increased odds of secondary vascular intervention.

Surgical mesh for POP

Key outcomes are QoL measures, mesh erosion and need for reoperation. Effectiveness outcomes in published reports on mesh repair are typically based on QoL scores (P-QoL) before and after the procedure.27–29 Following up a cohort of 181 patients has >90% power to identify a clinically meaningful 8.6 point change in P-QoL for ‘one look’ at 1 year postoperatively.

Mesh erosion is an important safety endpoint, and a cohort of 800 women followed up for 1 year would provide >94% power to detect 1.5 times higher odds for erosion compared with an overall erosion rate of 6% (see online supplementary table 4).

The risk of reoperation following POP repair with mesh due to postoperative complication or prolapse recurrence was estimated to be 3%–4% within 1 year.30–32 A cohort of 1500 patients would provide >90% power to detect 1.5 times increased odds of reoperation.

Are current registries sufficient?

Our analyses suggest that registries do not need to follow unrealistically large cohorts to identify outlier performance of types of devices (eg, metal-on-metal hip implants). We evaluated the capacity of current registry infrastructure to identify deviations from the expected performance of the devices considered, using three objective criteria: whether the registry contains enough patient records, captures the relevant measures, and conducts sufficient longitudinal follow-up. We found that most registries either already contain sufficient numbers of patients or are expected to do so in the future.

Specifically for hip and knee replacements, the American Joint Replacement Registry, Function and Outcomes Research for Comparative Effectiveness in Total Joint Replacement registry, the Kaiser-Permanente National Total Joint Replacement Registry, and the Michigan Arthroplasty Registry together have data on over 1 million TJRs annually but lack robust capture of functional and quality measures (eg, HHS, QoL). The US Vascular Quality Initiative (VQI), launched by the Society for Vascular Surgery in 2011, has over 350 participating centers across 46 US states and Ontario, and has collected data on over 32 200 AAA repairs and adequately captures key performance measures. The Pelvic Floor Disorders Registry, founded to support US FDA recommendations for increased monitoring of transvaginal mesh use, began collecting data in late 2015 and is currently too new to conduct robust analyses.33

Of note, it is unclear if the current registries are sufficiently large for each device brand (eg, DePuy metal-on-metal implant) because such details are not always recorded and future efforts to harmonize definitions need to be undertaken. In such scenarios, device classes could be studied by combining multiple brands and types (eg, compare metal-on-metal with metal-on-ceramic hip replacement devices).

What are the obstacles to progress?

Conducting long-term active surveillance is not easy, and incomplete follow-up can limit the usefulness of registry data. This problem could be addressed by linking registries with other data sources (eg, insurance claims data in private health systems) that collect patient outcomes over extended periods of time (eg, efforts are underway to link the VQI with claims data resources,34 providing longer follow-up to evaluate secondary vascular intervention and possibly endoleak occurrence after device use). In the USA, the FDA has proposed a new National Evaluation System for Health Technology,3 which will integrate registries with claims data to provide long-term follow-up. This approach is better suited to clinically measured safety or effectiveness outcomes (eg, revision surgery for joint replacement) than to patient-reported outcomes (eg, QoL), which require direct data collection (eg, telephone interviews). In the UK, linkages with routinely collected NHS statistics, national mortality data and the Clinical Practice Research Datalink35 are important for long-term data acquisition.

Incorporation of patient-reported outcomes into long-term active device surveillance will remain a challenge, because registries, claims data, or electronic health records do not currently capture such outcomes. Registry organizers, manufacturers, and healthcare providers will all have a part to play in developing mechanisms to collect patient-reported outcomes.

The future

Advancing technology will enable increasing amounts of useful data to be collected by device registries. Electronic health records, smartphone apps and wearable devices will enable a range of information, including patient-reported outcome measures, to be captured. Increasing capacity for data linkage offers huge potential, but will need careful attention to governance. International linkages will increase the power of information gathering, but will require attention to differing national data legislation. Manufacturers' registries straddle national boundaries, but need assurances of transparency and unbiased oversight to provide confidence in their data. Clear plans for the use of registries when new devices are presented for regulatory approval will help shift the required balance between premarket and postmarket evidence, allowing earlier access to new products while providing assurance of a mechanism to monitor performance. The data from registries will help manufacturers to develop ever safer and more effective devices.




  • Collaborators Jialin Mao; Emma Briggs; Anqi Lu.

  • Contributors SB and AS were responsible for the study concept and design and reviewed the literature. SB was responsible for the statistical analysis and drafting the manuscript. All authors interpreted the data and contributed to writing the manuscript. All authors critically revised the manuscript for important intellectual content. BC played a major role in providing a global perspective to the study. AS supervised the study and was the guarantor.

  • Funding This project was partially supported by Pew Charitable Trusts and by the US Food and Drug Administration through grant 1U01FD005478.

  • Competing interests AS and SB received funding through Pew Charitable Trusts and the US Food and Drug Administration through grant 1U01FD005478.

  • Patient consent for publication Not required.

  • Provenance and peer review Commissioned; internally peer reviewed.