Discussion
Comprehensive, criterion-referenced, and authentic assessment of intraoperative performance is key to evaluating surgeons’ competency and the quality of surgical care—and a cornerstone of CBME.1 3 4 We introduce a multistep approach to operationalizing the universal framework of intraoperative performance,14 defining respective performance indicators and implementing them in surgeons’ performance assessment using computer-assisted metrics. Moreover, we demonstrate its feasibility and report initial evidence for its validity within automated performance assessment in a simulated procedure. Our findings thus contribute in several ways to the current surgical knowledge base and educational practice.
First, we describe a methodology for applying the surgical performance domain constructs proposed in the universal framework of intraoperative performance.14 After delineating the demands and characteristics of a specific procedure (VP), experts defined meaningful criterion-referenced performance indicators founded on task analysis, anatomical constraints, and empirical data. Furthermore, we demonstrate how interactive annotation tools can be used to obtain case-specific definitions based on expert knowledge where such data are unavailable. Our approach can therefore be readily transferred to other surgical procedures beyond the field of minimally invasive spine surgery. Our findings thus serve as a first example of adapting this widely acknowledged framework and its performance domains to actual surgical interventions and assessments. Wider adoption of the framework would further help to standardize assessments and limit the current inflation of assessment approaches and constructs.19
Second, we demonstrate how adoption of the framework can be achieved through automated performance assessments. We developed computer-assisted metrics using preoperative planning and intraoperative performance data in conjunction with procedural and case-specific anatomical characteristics and experts’ annotation data. Our findings suggest that in the era of CBME, computer-assisted assessments represent the next step toward objective analysis of surgical performance that will drastically advance the way surgical trainees learn and are assessed.44 In addition to objective and reliable scoring, a particular strength of our assessment approach is that scores relate directly to procedure-specific and patient case-specific characteristics. This supports meaningful criterion-referenced interpretation of assessment results. Additionally, our approach considers a comprehensive range of competencies within the continuum of technical and non-technical performance. It thus contrasts with previous computer-assisted assessments in surgery, which were almost exclusively limited to norm-referenced assessment of psychomotor aspects of surgical technique.11 12 19
Third, we show how computer-assisted authentic assessment of intraoperative surgical competency can be applied in SWBA settings, particularly through highly contextualized OR simulation. To our knowledge, this is the first study to investigate objective, computer-assisted assessment in a simulated workplace that is functionally aligned with a full-scale OR setting. We furthermore included a multiprofessional OR team, in contrast to previous decontextualized, single-user benchtop models or VR simulators.10 This is of particular relevance to CBME with its strong focus on assessments mimicking real surgical tasks conducted in authentic settings ‘in the trenches’.3(p362) Empirical research has shown that the stimulus format is the paramount factor determining the validity of assessments.45 For integrated competency assessment, simulations that closely approximate real surgical performance are thus essential.45 Furthermore, high contextualization minimizes surgeons’ distortion of the naturalistic, context-sensitive responses they need to develop for real surgical situations.46 Our mixed-reality approach allows for a simulated representation of the procedure with little change to the environment and resulted in functionally active involvement of all OR members. It includes many of the elements identified as central to the authenticity of simulated environments,47 specifically: content drawn from real life (ie, patient-specific simulation using real patient data), interaction and feedback (ie, natural and dynamic interaction with the patient and the team), performance expectations (ie, full performance of the procedure lasting as long as in real life), preparation of the environment (ie, real, functional equipment and devices), presence of a patient manikin, a logical and adaptive scenario, and sociological fidelity (ie, including all members of the interprofessional team).
Accordingly, surveyed surgeons endorsed the authenticity of our simulation in general, and of the procedural workflow and interaction with team members in particular.
Fourth, we establish first empirical evidence for the validity of assessment based on the universal framework of intraoperative performance.14 We evaluated our approach in the light of all five sources of evidence for validity.22 23 Our findings further yield, for the first time, empirical insights into the interactions between the performance domains and their relation to surgeons’ experience, competency, and surgical outcome. Interdomain associations of performance scores showed no overly strong relationships. This suggests that, as intended, we measured conceptually different performance types.14 Notwithstanding, we observed meaningful associations between some domains. This observation warrants further research into potential overlaps and similarities during intraoperative practice, for example, the role of PMS for PR. Regarding surgeons’ experience, we identified a substantial association with ACS. Given our limited sample size and the sample’s intermediate level of experience, however, this finding should be interpreted with caution. Perhaps even more relevant to surgical practice, we found a considerable association between domain scores (PMS, DK, PR, and ACS) and observational technical-skills assessment (OSATS). Post hoc, we assume that the framework’s performance scores representing technical aspects of competency are well appraised by expert raters through observational assessments; this association was particularly pronounced for ACS. The consequences of using our assessment approach to classify surgeons’ performance as competent or non-competent are favourably supported by the absence of false negatives and false positives in comparison with expert-based OSATS pass/fail judgments in our sample.
Finally, we found that our ACS metric is central to surgical expertise, patient safety, and outcome.14 We obtained significant group differences in ACS between successful and unsuccessful performance outcomes (ie, OSATS pass/fail). We also observed moderate yet non-significant associations between ACS and non-technical skills, which may tentatively confirm the key role of ACS in encoding higher cognitive functions during intraoperative task performance and surgical teamwork.14 Moreover, high correlations between our SP score and surgeons’ experience, technical performance, and non-technical performance support its applicability and validity as a global outcome assessment. Together, these findings suggest that ACS and SP should serve as central markers of outcome-relevant competency to guide decisions in summative assessment. Immediate assessment outcomes could inform decisions on whether a corresponding competency milestone has been achieved or whether entrustable professional activities (EPAs) can be granted to junior surgeons.4
Regarding potential formative assessment in the future, our automated assessment approach facilitates immediate feedback and fulfils the requirement to provide individualized, meaningful, and case-specific guidance.14 This may help to design curricula ‘that target and deliberately train non-experts to think and behave like experts’14(p263); for example, through granular and immediate performance feedback in case of insufficient technical or non-technical performance. A particular benefit emerging from the framework used is that learner feedback can be specifically directed to performance domains and skills requiring special developmental guidance.
Limitations
Our approach has some limitations that should be acknowledged. We tailored our performance metrics specifically to minimally invasive spine surgery. Although several of our metrics are generic to surgical performance, such as accuracy, precision, pace, and use of intraoperative imaging, further investigation into additional performance metrics is necessary, for example, for open procedures. Although we used a systematic approach to obtain expert consensus on our performance indicators and associated metrics, we cannot infer how specialists in other surgical domains might appraise specific performance outcomes. Given the vast advancements and adoption of technology in surgical performance assessment, we acknowledge that future computer-assisted assessments will incorporate further performance indicators, for example, intraoperative stress measured through ambulatory assessment. Our simulation environment was based on a rigorous, in-depth development process to authentically mimic an OR setting as well as multiprofessional surgical practice.27 Yet surgeons might not have performed to their fullest potential in the simulation, for example, owing to lack of familiarity with the setting or hesitation at being observed. Participants other than the surgeon were confederates of the study team. While this helped to standardize assessment, it also added to the costs and effort of preparation and simulation planning. Further limitations include our convenience sampling approach and limited sample size. Future investigations should include a larger number of participants and a differentiated analysis of experience levels and subgroups (eg, interns, residents, attendings). Finally, the interpersonal communication (IPC) domain suggested in the original framework14 was not considered.
Future studies need to further apply all performance domains and gather evidence for validity of assessment across different surgical procedures and specialties beyond spine surgery. Particular focus should be devoted to validity evidence in terms of learners’ operative performance and patient outcomes in the long term, that is, functional or morbidity outcomes.
Implications for research and surgical practice
Our approach may inform further research in several ways: first, future investigations should scrutinize the utility of the performance domain model in other surgical procedures or conditions (ie, varying patient factors such as high acuity, high body mass index, pediatric vs adult, or unusual anatomy). Moreover, our empirical findings concerning the key role of ACS should be corroborated with particular attention to implications for patient safety and surgical outcomes.14 Second, the range of skills covered should be broadened further toward non-technical aspects. The domain of IPC should be incorporated using computer-assisted assessment, for example, employing machine learning techniques,48 and its association with non-technical observational assessment scores (eg, OTAS) should be investigated. Such assessments need to provide criterion-referenced indicators relating to competencies such as communication, teamwork, or leadership. Moreover, investigations should address how automated assessment results can best be formatted and fed back to support interpretation and provide idiosyncratic guidance, that is, feedback which is ‘individualized, meaningful, and case-specific’.14(p263) Third, assessments and validity evidence should be further extended to cover the entire surgical team. Of particular interest is how intra-team coordination and cooperation can be automatically and objectively assessed and interpreted with regard to procedural and case-specific demands.14 Assessments therefore need to include team members other than the surgeon as assessed participants, and criterion-referenced performance indicators have to be developed to capture their intraoperative performance as well.
Fourth, operationalization of the universal framework of intraoperative performance should be implemented in traditional observational work-based assessments in real OR settings, eventually reducing the current inflation of assessment tools19 for different specialties and aspects of performance (ie, technical and non-technical). Moreover, the consistency of participants’ intraoperative performance should be investigated in both the real OR and the contextualized simulation setting. For summative assessment, the correlation of assessments in both settings is of particular interest as part of the validity argument.5 Regarding formative assessment, investigations into the individual as well as joint effects of assessment and feedback in contextualized simulation settings on patient outcomes and surgical performance should be of particular interest (ie, using a prospective pre-intervention–post-intervention design).
Concerning implications for surgical practice, our approach draws on the principles of CBME and advocates objective, criterion-referenced assessment of a comprehensive range of competencies in authentic and contextualized simulated surgical tasks as a complement to traditional workplace-based assessments. Our computer-assisted assessment approach may complement rater-based observational assessment, which is time-consuming, inherently subjective, and therefore prone to various biases.15 Yet it is not intended as a replacement. Some degree of subjective professional judgment may actually be considered a necessary element of assessment, as it can provide valuable feedback and add to a more authentic and holistic appraisal of learners’ competence, enhancing the validity of assessments.49
Computer-assisted assessment in simulated workplaces allows standardization of many of the factors that affect assessments in real workplace environments, for example, effects of raters, the specific patient, or the scenario, and concurrently enables efficient assessment of key competencies in an OR context.50 Standardization supports the scoring and generalisation arguments; contextualization strengthens the extrapolation argument of assessment score validity.50 Standardization is of particular interest in formative assessment, where alignment with the individual developmental needs of surgical trainees is required. Here, our approach can help to establish required case numbers and cover variation in operative conditions in a structured way.16 In summative assessment, our approach contributes to establishing a balance between standardization and contextualization of assessment criteria.50 51