Formative evaluation of the accuracy of a clinical decision support system for cervical cancer screening
- Kavishwar Balwant Wagholikar1,
- Kathy L MacLaughlin2,
- Thomas M Kastner3,
- Petra M Casey3,
- Michael Henry4,
- Robert A Greenes5,6,
- Hongfang Liu1,
- Rajeev Chaudhry7
- 1Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota, USA
- 2Division of Family Medicine, Mayo Clinic, Rochester, Minnesota, USA
- 3Division of Obstetrics–Gynecology, Mayo Clinic, Rochester, Minnesota, USA
- 4Division of Anatomic Pathology, Mayo Clinic, Rochester, Minnesota, USA
- 5Department of Biomedical Informatics, Arizona State University, Phoenix, Arizona, USA
- 6Department of Health Science Research, Mayo Clinic, Scottsdale, Arizona, USA
- 7Division of Primary Care Internal Medicine, Center for Innovation, Mayo Clinic, Rochester, Minnesota, USA
- Correspondence to Dr Kavishwar Wagholikar, Division of Biomedical Statistics and Informatics, Mayo Clinic, 200 First Street SW, Rochester, MN 55901, USA
- Received 1 January 2013
- Revised 16 February 2013
- Accepted 6 March 2013
- Published Online First 5 April 2013
Objectives We previously developed and reported on a prototype clinical decision support system (CDSS) for cervical cancer screening. Because the system is based on multiple guidelines and free-text processing, it is complex and therefore susceptible to failures. This report describes a formative evaluation of the system, a necessary step to ensure its readiness for deployment.
Materials and methods Care providers who are potential end-users of the CDSS were invited to provide their recommendations for a random set of patients that represented diverse decision scenarios. The recommendations of the care providers and those generated by the CDSS were compared. Mismatched recommendations were reviewed by two independent experts.
Results A total of 25 users participated in this study and provided recommendations for 175 cases. The CDSS had an accuracy of 87% and 12 types of CDSS errors were identified, which were mainly due to deficiencies in the system's guideline rules. When the deficiencies were rectified, the CDSS generated optimal recommendations for all failure cases, except one with incomplete documentation.
Discussion and conclusions The crowd-sourcing approach for construction of the reference set, coupled with the expert review of mismatched recommendations, facilitated an effective evaluation and enhancement of the system, by identifying decision scenarios that were missed by the system's developers. The described methodology will be useful for other researchers who seek rapidly to evaluate and enhance the deployment readiness of complex decision support systems.
- Uterine Cervical Neoplasms
- Decision Support Systems, Clinical
- Guideline Adherence
- Validation Studies as Topic
- Vaginal Smears
Although cervical cancer can be largely prevented with screening, it remains a major cause of female cancer-related deaths.1 Several national organizations have released guidelines for cervical cancer screening and surveillance.2–5 However, the guidelines are complex and are based on a multitude of factors. Consequently, they cannot be easily recalled by care providers and many patients do not receive the optimal screening.6–9
As a potential solution we have previously developed and reported a prototype clinical decision support system (CDSS), which automatically analyzes patient data in the electronic health record (EHR), and suggests the guideline-based recommendation to care providers.10 However, the system is susceptible to failures due to its complexity as it is based on multiple guidelines and free-text processing. Another shortcoming of the prototype was that only a single guideline expert was involved in its development. Therefore, further evaluation was necessary to ensure the readiness of the system for deployment in clinical practice. This paper reports the methodology used to evaluate and improve the CDSS with participation of multiple users and experts, before clinical deployment. In contrast to the widely published summative evaluations that determine the post-deployment effectiveness/impact, the aim of this work is to perform a formative evaluation before deployment, in order to ensure the system's post-deployment effectiveness.
Cervical cancer screening
Worldwide, cervical cancer was diagnosed in approximately 530 000 women and resulted in approximately 275 000 deaths in 2008.11 Despite the confirmed effectiveness of routine screening, the American Cancer Society estimates 12 170 cases of cervical cancer and 4220 deaths in the USA in 2012.1 A meta-analysis of 42 multinational studies reported that over half of the women diagnosed with cervical cancer had inadequate screening or no screening, and that lack of appropriate follow-up of abnormal tests contributed to 12% of diagnoses.12
Cervical cancer screening/surveillance involves an evaluation of cervical cells (cytology) through a liquid-based specimen or Papanicolaou (Pap) smear. Human papilloma virus (HPV) testing may be additionally performed to detect the presence of high-risk strains of HPV (the cause of cervical pre-cancer and cancer). Several national organizations including the American Cancer Society, US Preventive Services Task Force, American College of Obstetricians and Gynecologists and the American Society for Colposcopy and Cervical Pathology have released guidelines for cervical cancer screening and/or management of abnormal screening tests.2–5 However, the guidelines are complex and are based on a multitude of factors including age, risk factors for cervical cancer and previous screening test results. Therefore, recalling and following the evidence-based guidelines is challenging for care providers, as a result of which many patients do not receive optimal screening.6–9
Apart from efforts to improve guideline adherence of the providers, several other interventions focused on patients have been investigated in the past two decades.13 The interventions to improve screening rates are adjuvant to strategies for reducing the risk factors for HPV infection.14 They can be broadly categorized as educational,15 reminders,16 interactive voice response17 or telephone call,18 counseling19 and economic incentives.20 Reminders and educational interventions have been found to be most effective.21–23 With the growing use of EHR in the USA, the use of decision support systems such as ours to implement reminders for providers and patients has a high potential for improving the screening and surveillance rates.24 The following subsection provides an overview of the challenges for the utilization of such systems.
Clinical decision support
CDSS25,26 have been developed for a variety of decision problems including preventive services,27,28 therapeutic management,29 prevention of adverse events,30 diagnosis,31,32 risk estimation,33 and chronic disease management.34 CDSS have been found to improve health service delivery across diverse settings, but there is sparse evidence for their impact on clinical outcomes.35 The potential positive impact of CDSS on the quality of care is not always realized, because the systems are not always utilized or are not implemented effectively.26 Some of the possible reasons for ineffective implementations are alert fatigue,36 lack of accuracy,37 lack of integration with workflow,38 and prolonged response time.39
Formative evaluations to ensure the acceptable levels of the above performance parameters may play a crucial role for effective implementation.40 In contrast to the widely published summative evaluations that determine the impact/effectiveness of the system, the aim of formative evaluations is to address the factors that will determine the effectiveness, during the development phase itself.41 Formative evaluations have been emphasized as critical components of EHR implementation42 and health information technology projects in general.43 Formative evaluations to rectify failure points of a CDSS before deployment may enhance the effectiveness of deployment in the clinical setting.
Our CDSS is particularly prone to multiple points of failure, because it is based on a complex model synthesized from multiple guidelines, it requires highly accurate natural language processing (NLP), which can be a challenging task, and it utilizes data from a multitude of information sources10 (see figures 1 and 2). Moreover, the CDSS aims to be comprehensive, that is, to generate screening and surveillance recommendations for all female primary care patients in the institution, which is a major advancement over current systems.44–46 Therefore, a rigorous validation is required for our system to ensure user acceptability and clinical impact. This paper reports the methodology used to evaluate and improve the CDSS with participation of multiple users and experts, before clinical deployment. The objective is to ensure that the recommendations of the CDSS are of sufficient accuracy to be acceptable and useful to the providers. Testing for usability and workflow integration is excluded from the scope of the current study.
The recommendations of potential end-users for a random sample of patients were recorded and compared to the recommendations generated by the CDSS. Mismatched recommendations were resolved by independent experts, and an error analysis was performed to improve the CDSS. The study was conducted using a web-based application. The detailed methodology is as follows.
Overview of CDSS architecture
As shown in figure 1, the CDSS has three modules: data module, guideline engine, and NLP module. The latter two modules contain respective rulebases, viz a guideline rulebase for representing the screening and management guidelines and a NLP rulebase for interpreting cervical cytology (Pap) reports. When the CDSS is initiated for a particular patient, the guideline engine parses the guideline rules (figure 2) and queries the data module for the required patient parameters. The data module in turn interfaces with the EHR to retrieve the patient information and when the data involves free-text information, for example, a cytology report, the data module calls the NLP module to extract the relevant variables. Based on its constituent rules the guideline engine continues to seek patient parameters, until it has sufficient data to compute the recommendation. The architecture of the CDSS is elaborated elsewhere.10
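The interaction between the three modules can be illustrated with a minimal Python sketch. It is not the CDSS implementation, which is described elsewhere;10 the class names, method names, the single toy rule, and the sample record are all hypothetical and serve only to show the guideline engine requesting parameters on demand and the data module delegating free text to the NLP module.

```python
from typing import Dict


class NLPModule:
    """Toy stand-in for the NLP rulebase that interprets cytology reports."""

    def extract_cytology(self, report_text: str) -> str:
        text = report_text.lower()
        if "ascus" in text or "atypical squamous cells" in text:
            return "ASCUS"
        if "negative for intraepithelial lesion" in text:
            return "normal"
        return "unknown"


class DataModule:
    """Retrieves patient parameters; calls the NLP module for free text."""

    def __init__(self, ehr_record: Dict[str, object], nlp: NLPModule) -> None:
        self.ehr_record = ehr_record  # hypothetical stand-in for the EHR interface
        self.nlp = nlp

    def get(self, parameter: str) -> object:
        if parameter == "cytology":
            report = str(self.ehr_record["cytology_report"])
            return self.nlp.extract_cytology(report)
        return self.ehr_record[parameter]


class GuidelineEngine:
    """Requests parameters one at a time until a recommendation is reached."""

    def __init__(self, data: DataModule) -> None:
        self.data = data

    def recommend(self) -> str:
        # Illustrative rules only; they do not reproduce the model in figure 2.
        if self.data.get("hysterectomy"):
            return "no screening (illustrative)"
        if self.data.get("cytology") == "ASCUS":
            return "follow-up (illustrative)"
        return "routine screening (illustrative)"


record = {
    "hysterectomy": False,
    "cytology_report": "Atypical squamous cells of undetermined significance",
}
engine = GuidelineEngine(DataModule(record, NLPModule()))
print(engine.recommend())
```

In this sketch the cytology report is parsed only if the hysterectomy check does not already settle the recommendation, which mirrors the idea that the engine seeks parameters until it has sufficient data.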
Expert review of guideline model
Before initiating this study, the guideline model (rulebase implemented in the system) was reviewed and approved by several experts who did not participate in the development of the CDSS prototype. Figure 2 shows the flowchart representation of the system's guideline model.
Construction of test set
We randomly selected 6033 patients who had visited Mayo Clinic Rochester in March 2012 and had consented to make their medical records available for research. The CDSS was run to compute the screening and surveillance recommendations for these patients. Based on the recommendations the patients were mapped to the branches in the guideline flowchart for cervical cancer screening/management (figure 2). This flowchart was developed before the 2012 updates in the national guidelines.2–5 Each pathway in the flowchart corresponds to a distinct combination of patient variables, and it represents a unique decision scenario. As some decision scenarios occur more frequently during practice than others, a randomly selected test set can be biased towards the frequent decision scenarios. Therefore, to ensure that the evaluation was not biased to the frequent scenarios, we performed stratified random sampling, restricting the selection to a maximum of 14 cases per decision scenario. The total number of cases in the test set was 196.
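The capped stratified sampling described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the function name, the `(case_id, scenario)` input format, and the fixed seed are hypothetical, and only the cap of 14 cases per decision scenario comes from the text.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def capped_stratified_sample(
    cases: Sequence[Tuple[str, Hashable]],
    max_per_scenario: int = 14,
    seed: int = 0,
) -> List[str]:
    """Pick at most max_per_scenario random cases from each decision scenario.

    cases holds (case_id, scenario) pairs, where scenario identifies the
    flowchart pathway that the prototype CDSS assigned to the patient.
    """
    rng = random.Random(seed)
    by_scenario: Dict[Hashable, List[str]] = defaultdict(list)
    for case_id, scenario in cases:
        by_scenario[scenario].append(case_id)

    selected: List[str] = []
    for scenario in sorted(by_scenario, key=str):
        members = by_scenario[scenario]
        k = min(max_per_scenario, len(members))
        selected.extend(rng.sample(members, k))
    return selected


# Toy data: one frequent scenario and one rare scenario.
toy_cases = [(f"a{i}", "frequent") for i in range(100)]
toy_cases += [(f"b{i}", "rare") for i in range(5)]
toy_sample = capped_stratified_sample(toy_cases)
print(len(toy_sample))
```

With the cap in place, the frequent scenario contributes no more than 14 cases, while every case of a scenario with fewer than 14 members is retained.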
We invited 89 potential users of the CDSS to participate in this study. The recruitment was done by sending mass emails as well as by specifically contacting potential users. The participants were of diverse background and training. They included staff consultants, residents and nurse practitioners from the institution's departments of family medicine, internal medicine and obstetrics and gynecology. We created a web-based application to collect the recommendations of the healthcare providers for the test set (figure 3). The web application was deployed on the institution's internal network.
Collection of provider recommendations
The web system was available from 12 April 2012 to 4 May 2012. When a participant logged into the system, a 1-min training video was presented. Subsequent to the video presentation, the web system randomly selected (without repetition) a case number from the test set and presented it to the participants. The participants assessed the information for the presented case by chart review using the EHR system, and recorded the most appropriate guideline-based recommendation for the case, by selecting the appropriate options in the web system's interface (figure 3). In addition to the template recommendation options, a free-text box was provided, to allow the participants to input recommendations that were not covered in the template options. Each participant completed seven different cases. The web system also recorded the time taken by the providers to input their recommendations.
The care providers’ recommendations were compared with those of the CDSS (figures 4 and 5). When there was a mismatch in the recommendations, the case was reviewed by one of two experts who did not participate in the development of the prototype, to decide if the CDSS or the provider recommendation was more accurate/optimal. If the CDSS was found to be less optimal, an error analysis was performed to identify the fault in the CDSS. The CDSS was then improved to correct the identified errors.
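The comparison step can be sketched in Python as shown here. The function names, the text normalisation, the use of `None` for a case where the CDSS produced no recommendation, and the example recommendation strings are all assumptions made for illustration; the study itself recorded recommendations through template options in the web interface.

```python
from typing import Dict, List, Optional


def normalise(recommendation: str) -> str:
    """Ignore case and extra whitespace so trivial differences do not count."""
    return " ".join(recommendation.lower().split())


def triage_cases(
    provider: Dict[str, str],
    cdss: Dict[str, Optional[str]],
) -> Dict[str, List[str]]:
    """Split annotated cases into flagged, matched, and mismatched groups."""
    result: Dict[str, List[str]] = {"flagged": [], "matched": [], "mismatched": []}
    for case_id, provider_rec in provider.items():
        cdss_rec = cdss.get(case_id)
        if cdss_rec is None:
            # Hypothetical convention: the CDSS could not compute a recommendation.
            result["flagged"].append(case_id)
        elif normalise(cdss_rec) == normalise(provider_rec):
            result["matched"].append(case_id)
        else:
            # Only these cases would be sent for expert review.
            result["mismatched"].append(case_id)
    return result


provider_recs = {
    "c1": "Repeat Pap in 1 year",
    "c2": "Colposcopy",
    "c3": "Repeat Pap in 3 years",
}
cdss_recs = {"c1": "repeat pap in 1 year ", "c2": "Repeat Pap in 1 year", "c3": None}
groups = triage_cases(provider_recs, cdss_recs)
print(groups)
```

Restricting expert review to the mismatched group is the design choice described in the text; matched cases are accepted without further review.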
Projection of CDSS impact on clinical practice
The CDSS was modified and re-evaluated on the test set, in order to ensure that the errors identified in the above analysis were rectified. Finally, we compared the recommendations of the corrected CDSS with those of the providers to identify provider errors. These cases were analyzed to identify the decision scenarios that were difficult for the providers, in order to project the potential of the CDSS to assist with the decisions. The average time taken by the providers to make the recommendations was computed, after excluding outliers.
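The text does not state how outliers were identified when computing the average time. The sketch below assumes one common convention, Tukey's 1.5 × interquartile range fences, purely for illustration; the function name and the sample timings are hypothetical and are not study data.

```python
import statistics
from typing import Sequence


def mean_without_outliers(times_s: Sequence[float], k: float = 1.5) -> float:
    """Mean of the values lying within k * IQR of the quartiles.

    Assumption: outliers are defined by Tukey's fences; the study does not
    specify the rule that was actually applied.
    """
    q1, _, q3 = statistics.quantiles(times_s, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [t for t in times_s if low <= t <= high]
    return statistics.mean(kept)


# Made-up timings in seconds; the last value mimics an interrupted session.
toy_times = [80, 90, 95, 100, 105, 110, 120, 3600]
print(mean_without_outliers(toy_times))
```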
Figure 5 summarizes the results of the CDSS evaluation. Of the 89 providers who were invited to participate in the study, 28 agreed to participate, and finally 25 completed the exercise of annotating the test cases with their recommendation. A total of 175 cases was annotated by the participants. The CDSS was found to generate an error flag for six cases because it could not obtain the pathology reports due to bugs in the interface to the EHR system. In the remaining 169 cases, the recommendations by the healthcare providers did not match the recommendation made by the CDSS for 75 cases.
The mismatch cases were presented to one of two experts (who co-authored this paper). The experts reviewed the recommendations and decided on the final optimal recommendation for the patient. The experts were blinded to the identity of the healthcare provider who made the recommendation for the individual cases. The CDSS was found to be suboptimal compared to the provider in 22 cases. Therefore, the accuracy computed to 147/169=87.0% (figure 5 and table 1).
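The accuracy figure follows directly from the counts reported above; the short Python check below reproduces the arithmetic. The variable names are ours, and all numbers are taken from the text.

```python
annotated = 175        # cases annotated by the participants
error_flagged = 6      # CDSS error flag (pathology report not retrieved)
evaluated = annotated - error_flagged

cdss_suboptimal = 22   # mismatches in which the experts favoured the provider
cdss_optimal = evaluated - cdss_suboptimal

accuracy = cdss_optimal / evaluated
print(f"{cdss_optimal}/{evaluated} = {accuracy:.1%}")  # 147/169 = 87.0%
```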
CDSS error analysis
Analysis of the 22 CDSS failure cases led to the identification of 12 errors/failure points in the CDSS (table 2 and figure 6). The errors were classified as modeling errors and programming errors. Modeling errors are due to deficiencies in the system's guideline rulebase/model, for example, missing a decision scenario, or incorrect logic. Programming errors include errors/bugs in the developed software, for example, incorrect rounding for age cut-off. The CDSS was robust in extracting the patient information from the EHR, except for history of hysterectomy. A summary of the errors is as follows (figure 6):
The upper age limit for screening recommendation was not set, because the approach was to err on the side of caution and let the provider overrule the system's recommendation for stopping screening (errors 1, 2, 4 and 9). This has now been rectified by considering the high-risk status of the patients.
Some of the error cases were due to the system stopping screening after the patient's 65th birthday. In these cases the age limit was applied after rounding the age (error 2). Therefore, to define the age explicitly and avoid rounding, the guideline model has been changed to the condition of <66 instead of ≤65 as defined earlier.
History of hysterectomy was missed when it was reported in the problem list. This was a programming error that was resolved (error 6). In one case, hysterectomy was not mentioned in the problem list but occurred in the clinical notes, which are not searched by the system. This case was resolved after concepts that implied hysterectomy, for example, ‘vaginal wall prolapse after hysterectomy’ were included for determining history of hysterectomy, as this concept was present in the patient's problem list (error 10).
The scenario of atypical squamous cells of undetermined significance (ASCUS) cytology with HPV not performed was not anticipated. This has now been included in the corrected model (error 7). A report of inadequate endocervical transformation zone is now ignored for high-risk patients, because it does not affect their management: they already undergo annual screening (error 12).
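The age cut-off problem (error 2) can be illustrated with a small Python sketch. It assumes that age was held as a fractional number of years and rounded to the nearest integer before the ≤65 test; the text says only that the limit was applied after rounding, so the exact rounding behaviour and both function names are hypothetical.

```python
def eligible_rounded(age_years: float) -> bool:
    """Faulty variant: rounds the age before applying the <=65 cut-off."""
    return round(age_years) <= 65


def eligible_explicit(age_years: float) -> bool:
    """Corrected variant: compares the unrounded age against <66."""
    return age_years < 66


for age in (65.2, 65.7, 66.0):
    print(age, eligible_rounded(age), eligible_explicit(age))
```

Under this assumption a patient aged 65.7 years is rounded to 66 and wrongly excluded by the first variant, whereas the explicit <66 condition keeps her eligible until her 66th birthday.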
After the errors were rectified in the CDSS, it was found to generate optimal recommendations for all but one failure case. The one case that could not be resolved was due to the inability of the CDSS to identify history of hysterectomy in a patient, when both the problem list and patient annual questionnaire database had no documentation about the patient's hysterectomy. The experts inferred that the patient had undergone hysterectomy from the clinical notes. The CDSS failed because it was not designed to perform NLP on clinical notes to extract this information.
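The fix for error 10, in which problem-list concepts that imply hysterectomy were added, can be sketched as a simple lookup. The term lists, the exact-match strategy, and the function name are hypothetical; the only concept taken from the text is 'vaginal wall prolapse after hysterectomy', and the actual system may well match coded concepts rather than strings.

```python
from typing import FrozenSet, Iterable

# Hypothetical term lists; the system's real concept vocabulary is not given.
DIRECT_TERMS: FrozenSet[str] = frozenset(
    {"hysterectomy", "history of hysterectomy"}
)
IMPLYING_TERMS: FrozenSet[str] = frozenset(
    {"vaginal wall prolapse after hysterectomy"}
)


def has_hysterectomy_history(
    problem_list: Iterable[str],
    include_implying: bool = True,
) -> bool:
    """Return True if any problem-list entry is a known hysterectomy concept."""
    terms = (DIRECT_TERMS | IMPLYING_TERMS) if include_implying else DIRECT_TERMS
    return any(entry.strip().lower() in terms for entry in problem_list)


toy_problem_list = ["Hypertension", "Vaginal wall prolapse after hysterectomy"]
print(has_hysterectomy_history(toy_problem_list, include_implying=False))
print(has_hysterectomy_history(toy_problem_list))
```

A lookup of this kind cannot help in the unresolved case, where the hysterectomy was documented only in the clinical notes, which the system does not search.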
Provider errors analysis
After the recommendations of the corrected CDSS were compared with those recorded by the providers, the providers were found to have given suboptimal recommendations in 56 of the 169 cases (33.1%), that is, 34 more cases (20.1% of the 169) than the CDSS. Several of these patients had abnormal screening reports such as abnormal (other than ASCUS) cytology, ASCUS cytology, positive HPV or inadequate endocervical transformation zone. Some of the provider errors were due to incorrect determination of the risk status of the patient, due to boundary conditions such as age cut-offs. The mean time taken by the providers to make the recommendation was 1 min 39 s.
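The percentages in this comparison can be verified with a few lines of Python. The variable names are ours; the counts come from the text, and the difference of 34 cases is consistent with the 22 CDSS failure cases reported in the evaluation above.

```python
evaluated = 169
provider_suboptimal = 56
cdss_suboptimal = 22   # CDSS failure cases reported in the evaluation

provider_rate = provider_suboptimal / evaluated
excess_cases = provider_suboptimal - cdss_suboptimal
excess_rate = excess_cases / evaluated

print(f"provider suboptimal: {provider_rate:.1%}")            # 33.1%
print(f"excess over CDSS: {excess_cases} ({excess_rate:.1%})")  # 34 (20.1%)
```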
The study facilitated a comprehensive evaluation of the CDSS on a large and diverse set of patients that covered nearly all possible decision scenarios. The CDSS was evaluated to have a fair accuracy, and by performing the error analysis of failure cases the CDSS was considerably improved.
The formative evaluation based on the reference set annotated by the care providers led to the identification of several failure points in the system. Several logical steps necessary to apply the national guidelines were missed when the guideline model was inspected by the experts before the study. The use of representative cases and their decision annotations by the care providers in this study helped draw attention to the particular scenarios in which the logical steps were missed. The task of modeling the free-text guidelines as rules is challenging due to ambiguity of the natural language used in the guidelines, and due to the difficulty in envisioning decision scenarios that can occur in clinical practice.47,48 Our results indicate that guideline models based on abstraction from textual guidelines need to be tested with consistency checks on real-life cases. This finding is consistent with earlier research that demonstrates the critical importance of carefully analyzing the reasons for practising clinician disagreements with decision support, in order to improve CDSS design and effectiveness.49
The analysis identified situations/factors when the CDSS was prone to make errors, for example, hysterectomy cases. It also identified guideline areas in which the care providers need decision support. The providers were found to have difficulties in decision making for cases with abnormal findings, as reported by other studies.12,50 Lack of follow-up referral after a positive screening test has also previously been documented in the context of colorectal cancer screening.51,52 As the patients with abnormal screening reports are especially at risk of developing cancer, the screening/surveillance recommendations made by the providers can have far-reaching consequences for the patients. The CDSS was notably found to perform consistently well for such patients, and its deployment can be expected to improve the quality of the screening services considerably. Moreover, the CDSS can lead to provider time savings of 1 min 39 s per patient consultation, as determined in this study.
An alternative approach to evaluate the CDSS before deployment in clinical practice is to conduct a pilot study with a subset of potential end-users, who will verify the system's recommendation and provide feedback for improving the system. There are several disadvantages to this approach: the evaluation will be biased towards frequently occurring decision scenarios unless a special effort is made to identify the less frequent but high impact scenarios in the evaluation; and there will be a risk of missing validation for rare but important decision scenarios. Our approach of identifying distinct decision scenarios for the evaluation by using the prototype CDSS helped avoid bias towards the frequent decision scenarios, and allowed for an efficient utilization of the efforts of the participating providers and experts.
Similarly, our approach to blind the users to the CDSS recommendation has an advantage over seeking user feedback after deployment, because in the post-deployment setting, the user's judgment can be influenced by knowledge of the output of the CDSS.49 Consequently, in the latter approach some of the failure points may be missed. Moreover, it may not be possible to project the clinical impact of the system, due to the modification of user behavior. With the current approach the decision scenarios that were difficult for the users were identified, and the usefulness of the system after deployment could be projected. Another advantage is that the end-users are not directly exposed to the CDSS before the formative evaluation; therefore, there is no loss of user confidence.53
A difficulty in performing CDSS evaluation is that it is often not feasible to involve a large number of users in system evaluation. The crowd-sourcing approach used in this study allowed a large number of users to participate, which in turn facilitated the construction of a large reference dataset of real-life decision scenarios. Consequently, the CDSS could be evaluated comprehensively for a wide variety of scenarios.
Literature on CDSS mainly consists of summative evaluations measuring impact on service and clinical outcomes.54,55 Studies on performance aspects of the CDSS are rare, which suggests a lack of effort to ensure effective implementation. Our results demonstrate that such studies may be increasingly needed as complex CDSS that have an increased risk of failures are developed. Furthermore, research into developing efficient and practically feasible methods for pre-deployment evaluation of CDSS is called for. We believe that the approach described will be useful for developing complex systems that support wider and more complex domains of care.28,56 The formative evaluation to ensure that the decision model itself is accurate will facilitate subsequent enquiries after deployment for quantifying guideline adherence of the providers, and for measuring clinical impact.
Crowd-sourcing can be useful for the development and validation of decision support applications. McCoy et al57 have earlier used crowd-sourcing for building a knowledge base of problem–medication pairs. In their institution it was mandatory for clinicians to link prescriptions to patient problems, and McCoy et al57 leveraged the resulting database as a resource to construct their knowledge base. On the other hand, our approach was to seek volunteer effort from the care providers for creating a gold standard for validating the CDSS.
Our analysis identified that the incompleteness of problem list and patient-provided information for hysterectomy is a challenge to accurate working of the CDSS for the subset of patients with hysterectomy. We plan to extend the NLP module of the CDSS to identify history of hysterectomy from clinical notes, if more such patients are encountered in the future.56 Overall, the CDSS has a high level of accuracy, and has the potential to improve providers’ recommendations especially in the high utility areas of the guidelines, and can thereby significantly advance the quality of screening. However, the corrected CDSS was not tested with new cases, which would be of benefit to determine whether further discrepancies in recommendations need to be addressed. We expect that the majority of the errors have been identified in the current analysis, and we plan to perform additional evaluations with a different set of cases to ensure system accuracy before deployment.58
We restricted the scope of the evaluation to accuracy and did not test the usability and integration with workflow, which are also major factors that determine utilization and clinical impact of the CDSS. These will be tested separately with pilot studies. Nonetheless, we expect that elimination (or at least minimization) of the issue of delivering the correct recommendations will facilitate the subsequent pilots.
The use of an unfamiliar interface may have induced participants’ mistakes, although we had provided a training video and designed a simple interface to record the participants’ recommendations. On the other hand, the participants were focused on the task of making screening decisions, and their performance can be expected to be better than target users who will have other tasks during the patient visit. As a result of these factors, further research is necessary to determine the usefulness of our approach to quantify provider errors. Nevertheless, our results indicate that the methodology is useful to identify qualitatively the areas for decision making that are difficult for the providers.
Updated cervical cancer screening guidelines were published at the end of our evaluation period.59,60 It is possible that some of the participating providers were aware of the forthcoming change in the guideline and provided recommendations in accordance with the anticipated guideline.
We limited the expert review to cases in which there was a mismatch in recommendations of the CDSS and the providers, because the proportion of errors is expected to be high in this subset of cases. Consequently, there is a chance of missing erroneous decisions, when the recommendations of both the provider and CDSS are not optimal. However, such cases are expected to be small in number and are likely to have a representation in the mismatch group. The strategy of focusing on the mismatch group facilitates a judicious use of the expert reviewers’ efforts.
Double blinding of reviewers was not done. It may be useful to blind the expert reviewers as to whether the source of the recommendations was the care provider or CDSS.
Our case study demonstrates that the approach to crowd-source the construction of the reference recommendations set, coupled with the expert review of mismatched decisions, can facilitate an effective evaluation of the accuracy of a CDSS. It is especially useful to identify decision scenarios that may be missed by the system's developers. The methodology will be useful for researchers who seek rapidly to evaluate and enhance the deployment readiness of next generation decision support systems that are based on complex guidelines.
The authors are thankful to the medical residents, nurse practitioners and consultants from Mayo Clinic Rochester, who contributed to this project. The authors are also grateful to the anonymous reviewers for their insightful suggestions.
Contributors KBW led the design, implementation and analysis of the study. KLM coordinated the participation of care providers in the study, and made major contributions to the design and analysis. TMK and PMC performed the expert reviews in this study. MH, RAG, HL and RC participated in the design and analysis. HL and RC supervised the project. All authors contributed to the manuscript and approved the final version.
Competing interests None.
Ethics approval This study was approved by the institutional review board at Mayo Clinic, Rochester.
Provenance and peer review Not commissioned; externally peer reviewed.
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/3.0/