A Text Mining Approach to the Prediction of Disease Status from Clinical Discharge Summaries
- Correspondence: Goran Nenadic, Manchester Interdisciplinary Biocentre, University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; e-mail: < >.
- Received 7 December 2008
- Accepted 7 April 2009
Objective The authors present a system developed for the Challenge in Natural Language Processing for Clinical Data—the i2b2 obesity challenge, whose aim was to automatically identify the status of obesity and 15 related co-morbidities in patients using their clinical discharge summaries. The challenge consisted of two tasks, textual and intuitive. The textual task was to identify explicit references to the diseases, whereas the intuitive task focused on the prediction of the disease status when the evidence was not explicitly asserted.
Design The authors assembled a set of resources to lexically and semantically profile the diseases and their associated symptoms, treatments, etc. These features were explored in a hybrid text mining approach, which combined dictionary look-up, rule-based, and machine-learning methods.
Measurements The methods were applied to a set of 507 previously unseen discharge summaries, and the predictions were evaluated against a manually prepared gold standard. The overall ranking of the participating teams was primarily based on the macro-averaged F-measure.
Results The implemented method achieved the macro-averaged F-measure of 81% for the textual task (which was the highest achieved in the challenge) and 63% for the intuitive task (ranked 7th out of 28 teams—the highest was 66%). The micro-averaged F-measure showed an average accuracy of 97% for textual and 96% for intuitive annotations.
Conclusions The performance achieved was in line with the agreement between human annotators, indicating the potential of text mining for accurate and efficient prediction of disease statuses from clinical discharge summaries.
The objective of the 2008 i2b2 obesity challenge1 in natural language processing (NLP) for clinical data was to evaluate NLP systems on their performance in identifying patient obesity and associated co-morbidities based on hospital discharge summaries. Fifteen related diseases were considered: Diabetes mellitus (DM), Hypercholesterolemia, Hypertriglyceridemia, Hypertension (HTN), Atherosclerotic CV disease (CAD), Heart failure (CHF), Peripheral vascular disease (PVD), Venous insufficiency, Osteoarthritis (OA), Obstructive sleep apnea (OSA), Asthma, GERD, Gallstones/Cholecystectomy, Depression, and Gout. The aim was to label each document with disease/co-morbidity status, indicating whether:
a patient was diagnosed with a disease/co-morbidity (Y—yes, disease present),
a patient was diagnosed with not having a disease/co-morbidity (N—no, disease absent),
it was uncertain whether a patient had a disease/co-morbidity or not (Q—questionable), or
a disease/co-morbidity status was not mentioned in the discharge summary (U—unmentioned).
The challenge consisted of two tasks, textual and intuitive. The textual task was to identify explicit references to the diseases in the narrative text. Each hospital report was to be labeled using one of four possible disease status labels (Y, N, Q, or U). The intuitive task focused on inferring the disease status even when the evidence was not explicitly asserted. Possible intuitive labels were Y, N, and Q for each disease. The organizers provided a training set with 730 hospital discharge summaries manually annotated with more than 22,000 labels.
We implemented a hybrid approach that combined three types of features: lexical, terminological and semantic, exploited by dictionary look-up, rule-based and machine-learning methods. We assembled a set of resources to lexically and semantically profile the diseases and their associated symptoms, treatments, etc. The methods were applied to a set of 507 previously unseen discharge summaries, and the predictions were evaluated against the manually prepared gold standard. In the textual task, a macro-averaged F-measure (81%) for our approach was the highest achieved in the challenge. In the intuitive task, we achieved the macro-averaged F-measure of 63%. The micro-averaged F-measure showed an average accuracy of 97% for textual annotation and 96% for intuitive annotation, indicating the potential of text mining techniques to accurately extract the disease status from hospital discharge summaries.
The general idea underlying our approach was to identify sentences that contained evidence to support a judgment for a given disease, and then to integrate evidence gathered at the sentence level to make a prediction at the document level. The system workflow consisted of three major steps: report pre-processing, textual prediction and intuitive prediction, with the final integration of the textual and intuitive results (see Fig 1). The prediction steps were applied for each of the 16 diseases/co-morbidities separately.
The report pre-processing involved basic textual processing of input discharge narratives. In the textual prediction step, explicit evidence was identified and combined to derive textual predictions. The intuitive prediction module focused on capturing intuitive clues that could associate the report with the disease. The final intuitive judgments were combined with the textual ones. Figure 2 depicts a detailed architecture of the system. In the following sections we describe each module and the basic steps performed (for further details see a JAMIA online data supplement at http://www.jamia.org).
Report Pre-processing Module
Input discharge summaries were first split into sections using a set of flexible lexical matching rules that identified section titles and classified them into six predefined categories: “Diagnosis”, “Past or Present History of Illness”, “Social/Family History”, “Physical or Laboratory Examination”, “Medication/Disposition”, and “Other”. Section titles were recognized by matching the most frequent title keywords collected semi-automatically from the training dataset. In addition, each section type was assigned a weight reflecting its predictive capacity for a given disease (see the Training Data Analyses section). The sections were decomposed into sentences using LingPipe.2 Part of Speech (POS) tagging and shallow parsing were performed using the GeniaTagger, which is specifically tuned for biomedical text.3
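The paper does not give the section splitter in code, but its rule-based flavour can be sketched as follows. The keyword lists, regular expression, and function names are illustrative assumptions, not the actual implementation (the real title keywords were collected semi-automatically from the training data):

```python
import re

# Hypothetical keyword lists per section category (illustrative only).
SECTION_KEYWORDS = {
    "Diagnosis": ["diagnosis", "principal diagnoses"],
    "Past or Present History of Illness": ["history of present illness", "past medical history"],
    "Social/Family History": ["social history", "family history"],
    "Physical or Laboratory Examination": ["physical examination", "laboratory data"],
    "Medication/Disposition": ["medications", "discharge medications", "disposition"],
}

def classify_section_title(title: str) -> str:
    """Map a section title to one of the six predefined categories."""
    t = title.lower()
    for category, keywords in SECTION_KEYWORDS.items():
        if any(k in t for k in keywords):
            return category
    return "Other"

def split_into_sections(report: str):
    """Split a discharge summary on lines that look like section titles
    (short lines ending in a colon), and classify each section."""
    sections, current_cat, current_lines = [], "Other", []
    for line in report.splitlines():
        m = re.match(r"^\s*([A-Za-z /]{3,60}):\s*(.*)$", line)
        if m:
            if current_lines:
                sections.append((current_cat, "\n".join(current_lines)))
            current_cat, current_lines = classify_section_title(m.group(1)), []
            if m.group(2):
                current_lines.append(m.group(2))
        else:
            current_lines.append(line)
    if current_lines:
        sections.append((current_cat, "\n".join(current_lines)))
    return sections
```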
Textual Prediction Module
The main objective of this module was to identify sentences that, given a disease, explicitly mentioned the disease itself and/or associated clinical terms. We lexically profiled each disease by collecting (1) its name and synonyms from public resources including the UMLS4, (2) disease sub-classes (e.g., diabetes type II) and their synonyms, (3) disease superclasses (e.g., reflux for GERD and arthritis for OA) and their synonyms, and (4) clinical terms closely related to the disease (e.g., associated symptoms and treatments), imported from public medical resources or selected from the training dataset based on their occurrence statistics. All clinical terms collected were assigned confidence levels taking into account the quality of the prediction results obtained from the training dataset (available as an online data supplement at www.jamia.org).
Initially, the sentences that contained any term from the lexical profile were labeled with Y, and, in the subsequent steps, the evidence was challenged and potentially reversed to N, Q, or U based on the context in which they were used. The sentence-based predictions were then combined at the document level. The four processing steps in this module are described briefly below (further details are given in the online supplement).
Step T1: Term matching. To cater for terminological variation, terms that characterize a disease were matched against the text approximately, taking into account morphological variants, and if necessary ignoring word order and tolerating the distance between the words within a term (e.g., both “stent placement” and “placement of coronary stent” referred to the same treatment for CAD).
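A minimal sketch of this kind of order-tolerant approximate matching, using the paper's own "stent placement" example; the crude suffix stripper and the gap threshold are our own illustrative choices, not the system's actual matcher:

```python
import re

def stem(word: str) -> str:
    """Crude suffix stripping as a stand-in for proper morphological
    normalisation (illustrative only)."""
    for suffix in ("ment", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_matches(term: str, sentence: str, max_gap: int = 4) -> bool:
    """True if every word of `term` occurs in `sentence` (after stemming),
    in any order, with at most `max_gap` intervening tokens between the
    first and last matched word."""
    tokens = [stem(w.lower()) for w in re.findall(r"[a-zA-Z]+", sentence)]
    positions = []
    for w in term.lower().split():
        sw = stem(w)
        if sw not in tokens:
            return False
        positions.append(tokens.index(sw))
    span = max(positions) - min(positions) + 1
    return span - len(positions) <= max_gap
```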
Step T2: Sentence filtering. Sentences that did not mention a disease-related term were filtered out. We also discarded sentences from the sections deemed less important for the textual task (namely “Social/Family History” and “Other”), sentences that potentially referred to family members, and sentences containing ambiguous disease terms.
Step T3: Sentence labeling. After filtering, the remaining sentences were initially considered to support the judgment of disease presence (Y). We then applied a set of lexico-semantic patterns (see Table 1 for examples) to potentially re-label them with N, Q, or U judgments, using a pattern matching algorithm similar to NegEx.5 The patterns generalized the structure of manually collected examples that indicated negative, questionable or unmentioned status of diseases. If any of these patterns was matched successfully, the disease status was changed using the label associated with the pattern.
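The NegEx-style re-labeling can be sketched as below. The three patterns are invented examples in the spirit of Table 1, not the system's actual pattern set:

```python
import re

# Illustrative lexico-semantic patterns; each re-labels a sentence when it
# fires around a disease mention (marked <DIS> after substitution).
PATTERNS = [
    (re.compile(r"\b(no|denies|negative for|without)\b[^.]{0,40}<DIS>"), "N"),
    (re.compile(r"\b(rule out|possible|probable)\s*<DIS>"), "Q"),
    (re.compile(r"family history[^.]{0,40}<DIS>"), "U"),
]

def label_sentence(sentence: str, disease_term: str) -> str:
    """Start from Y (disease term present) and re-label to N/Q/U if a
    negation/uncertainty/exclusion pattern matches around the term."""
    s = sentence.lower().replace(disease_term.lower(), "<DIS>")
    for pattern, label in PATTERNS:
        if pattern.search(s):
            return label
    return "Y"
```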
Step T4: Result integration. When a report contained multiple sentences with conflicting labels, we employed a weighted voting scheme. The score for each disease status label was obtained by collecting all sentences with the given label, and adding up the weights associated with the container sections. The highest-scored label was suggested as the final annotation, with potential tie cases labeled as Q.
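The weighted voting scheme amounts to a few lines of code; the section weights below are hypothetical placeholders (the paper estimated them per disease from the training data):

```python
from collections import defaultdict

# Hypothetical section weights reflecting predictive capacity.
SECTION_WEIGHTS = {
    "Diagnosis": 1.0,
    "Past or Present History of Illness": 0.8,
    "Physical or Laboratory Examination": 0.5,
    "Medication/Disposition": 0.4,
}

def integrate(sentence_labels):
    """sentence_labels: list of (label, section) pairs for one disease.
    Sum section weights per label; ties are labeled Q."""
    scores = defaultdict(float)
    for label, section in sentence_labels:
        scores[label] += SECTION_WEIGHTS.get(section, 0.1)
    best = max(scores.values())
    winners = [lab for lab, s in scores.items() if s == best]
    return winners[0] if len(winners) == 1 else "Q"
```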
We submitted the results of two runs for the textual task: in run 1, all clinical terms from the associated lexical profiles were used, whereas clinical terms with lower confidence were excluded in run 2.
Intuitive Prediction Module
The intuitive task focused on the prediction of the disease status (Y, N, and Q) based on both explicit and implicit textual assertions. We relied on a combination of term- and clinical inference rule-matching to extract disease information at the sentence level, and a supervised learning method for disease status classification at the document level. The module consisted of five steps, described briefly here (further details are given in the online supplement).
Step I1: Candidate sentence identification. In the first step, the system identified potential evidence sentences (labeled Y initially) by looking for any of the following three evidence types within the sentences:
Terms referring to the disease symptoms (e.g., RCA occlusion for CAD). The first two intuitive runs differed in the predictive capacity of the symptom terms used (all terms in run 1 vs. only the most important ones in run 2; see the Training Data Analyses section).
Important clinical facts or conditions related to the disease (e.g., weight > 200 lbs; systolic blood pressure > 135). Around 20 manually designed inference rules were used.
Medications typically used to treat the disease and/or symptoms (appearing within the “Medication/Disposition” sections).
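Two of the roughly 20 clinical inference rules can be sketched from the numeric examples given above; the regular expressions and threshold handling are our own illustrative reconstruction, not the rules actually deployed:

```python
import re

def check_inference_rules(sentence: str):
    """Return diseases suggested by simple numeric inference rules,
    mirroring the examples in the text (weight > 200 lbs for obesity;
    systolic blood pressure > 135 for hypertension)."""
    findings = []
    m = re.search(r"weight[^0-9]{0,20}(\d+)\s*(lbs?|pounds)", sentence, re.I)
    if m and int(m.group(1)) > 200:
        findings.append("Obesity")
    m = re.search(r"(?:systolic blood pressure|bp)[^0-9]{0,20}(\d+)", sentence, re.I)
    if m and int(m.group(1)) > 135:
        findings.append("Hypertension")
    return findings
```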
Step I2: Sentence labeling. This step was analogous to textual prediction (step T3).
Step I3: Sentence-level result integration. Similarly to textual predictions (step T4), the integration of sentence-level predictions was performed when some sentences had different labels attached for the same disease. Three factors were considered: (a) the confidence level of disease symptom terms found in the sentences; (b) the weight of the section where the evidence appeared; and (c) the significance of the three types of sentence evidence (step I1) for the given disease.
Step I4: Document-level labeling. This was an optional step with only one run submitted (run 3, see below). We applied a support vector machine (SVM) classifier to assign disease labels at the document level. Phrases recognized in the pre-processing stage by GeniaTagger were mapped to the UMLS concepts using approximate string matching. Concepts mentioned in a negative context were identified using a negation module similar to NegEx. The weight assigned to a feature was calculated as the difference between the numbers of positive and negative mentions of the corresponding concept. Finding questionable evidence at a document level was considered unfeasible (there were too few examples for machine learning), so we trained a binary SVM classifier that differentiated between potential Y and N labels only.
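The feature construction for the classifier can be sketched as follows (the UMLS concept identifier in the usage example is arbitrary; the resulting vectors would be fed to the binary Y/N SVM):

```python
from collections import Counter

def build_feature_vector(concept_mentions):
    """concept_mentions: list of (umls_concept_id, negated) pairs
    recognised in one document. The feature weight for each concept is
    (#positive mentions) - (#negative mentions)."""
    weights = Counter()
    for concept, negated in concept_mentions:
        weights[concept] += -1 if negated else 1
    return dict(weights)
```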
Step I5: Final result integration. Textual Y and N predictions were given high confidence and were recycled as intuitive predictions (see the section Training Data Analyses). Only Q and U textual judgments were adjusted in cases where intuitive evidence suggested different labels. More precisely, when new implicit evidence was established for a previously assigned textual Q or U judgment, then it was changed to an intuitive Y, N, or Q label based on the procedure described in steps I1–I3. If no new sentence-level implicit evidence was established for a Q or U textual judgment, then the SVM-based document classification was taken into account. If the classifier produced a highly confident Y label, then the final intuitive label for the disease would be amended to Y. Otherwise, a textual Q judgment would be kept unchanged, whereas a textual U judgment would change to N in the final intuitive annotation. This approach was used to provide the intuitive run 3.
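The decision flow of this final integration step can be condensed into a single function; the argument names and the confidence flag are our own framing of the procedure described above:

```python
def final_intuitive_label(textual, sentence_level, svm_label, svm_confident):
    """textual: Y/N/Q/U textual judgment; sentence_level: intuitive label
    from steps I1-I3, or None if no implicit evidence was found;
    svm_label/svm_confident: document-level classifier output."""
    if textual in ("Y", "N"):          # high-confidence textual labels are recycled
        return textual
    if sentence_level is not None:     # new implicit evidence overrides Q/U
        return sentence_level
    if svm_label == "Y" and svm_confident:
        return "Y"
    return "Q" if textual == "Q" else "N"  # textual U defaults to intuitive N
```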
Experiments and Results
The training and testing data for the challenge were collected from the Research Patient Data Repository of Partners Health Care (see Table 2 for the distribution of the annotations provided manually by two experts).
Training Data Analyses
We compared textual and intuitive annotations assigned to each document-disease pair (see Table 3). Intuitive annotations largely agreed with the textual ones for Y and N labels, and differed primarily for textual Q and U labels. This observation motivated our integration strategy: the intuitive results "inherited" all textual Y and N predictions, and only Q and U textual labels were considered eligible for re-annotation in the intuitive part.
The training data were further analyzed to estimate the relevance of certain features and their predictive capacity. We first analyzed the relevance of six section types. Relative relevance weights were assigned to each section type based on the ratio between the number of sentences in the given section type whose labels were consistent with the expert-generated judgments (at the document level) and the total number of evidence sentences that supported the correct annotations. This gave us relative predictive capacity of the section types to enable inference of the document label. Similar distributional analyses were performed for other features (see the online supplement for further details).
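The section-weight estimate described above reduces to a simple ratio; a sketch under the assumption that each training evidence sentence is tagged with its section type and whether its label agreed with the expert document-level judgment:

```python
def estimate_section_weights(evidence):
    """evidence: list of (section_type, consistent) pairs, where
    `consistent` means the sentence-level label agreed with the expert
    document-level judgment. Weight of a section type =
    consistent sentences in that section / all consistent sentences."""
    total_correct = sum(1 for _, ok in evidence if ok)
    if total_correct == 0:
        return {}
    counts = {}
    for section, ok in evidence:
        if ok:
            counts[section] = counts.get(section, 0) + 1
    return {s: n / total_correct for s, n in counts.items()}
```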
Testing Environment and Results
Each of the 28 teams taking part in the challenge was allowed to submit the results of up to three system runs. The system performance was measured using a set of three standard measures: recall (R), precision (P) and F-measure. The results were micro- and macro-averaged across the status labels for each of the diseases considered. The overall performance was measured in the same way for all diseases taken together. The participating teams were primarily ranked based on the macro-averaged F-measure. Hereafter, we only report a single averaged score for the micro values as the values for P-Micro, R-Micro and F-Micro were identical.
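The difference between the two averaging schemes explains much of the results that follow: macro-averaging gives rare labels such as Q the same weight as the dominant Y and N labels, whereas micro-averaging pools the counts. A small sketch with invented counts:

```python
def f_measure(tp, fp, fn):
    """Standard F1 from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f(per_label_counts):
    """per_label_counts: {label: (tp, fp, fn)}.
    Macro-F averages per-label F-measures; micro-F pools the counts
    (and equals P and R when each document gets exactly one label)."""
    macro = sum(f_measure(*c) for c in per_label_counts.values()) / len(per_label_counts)
    tp = sum(c[0] for c in per_label_counts.values())
    fp = sum(c[1] for c in per_label_counts.values())
    fn = sum(c[2] for c in per_label_counts.values())
    return macro, f_measure(tp, fp, fn)
```

With invented counts such as `{"Y": (90, 5, 5), "Q": (1, 4, 4)}`, the micro-F stays high while the poorly predicted Q label drags the macro-F down, which mirrors the gap between the 96-97% micro and the 63-81% macro scores reported here.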
The results of two textual runs were submitted (see Table 4). Run 2 improved the results, but only by a small margin. The macro-averaged F-measure for run 2 was the highest one achieved in the challenge and was substantially better than the mean average of all participating teams (81 versus 56%). Similarly, the micro-averaged F-measure was high (97%), compared to the mean average calculated for all participating teams (91%). A detailed analysis of the results is available in the online supplement.
The results of three runs were submitted for the intuitive task (see Table 5): run 1 was the best run with the macro-averaged F-measure of 63% (ranked 7th) and the micro-averaged F-measure of 96% (ranked 5th overall). A detailed analysis of the results is available in the online supplement.
Table 6 shows the detailed evaluation of the results for the individual diseases. In the textual task, the micro-averaged F-measure ranged from 92% (CAD) to 100% (hypertriglyceridemia), whereas for the intuitive task it ranged from 89% (depression) to 99% (OSA). The micro-averaged values were more consistent across different diseases, whereas there were substantial differences in the macro-evaluated metrics. A detailed analysis and full discussion of the results are available in the online supplement.
The system implementing the methodology described achieved excellent results with an average micro accuracy of 97% for the textual task and 96% for the intuitive task. The macro-averaged F-measure of 81% for the textual task was the highest achieved in the challenge, and the macro-averaged F-measure of 63% (the highest was 66%) for the intuitive task was ranked 7th out of 28. The macro-averaged measures showed that prediction of questionable labels was most challenging, in particular in the intuitive task.
The system's performance may be improved in several ways. More work is required to expand the set of clinical inference rules and match them reliably in textual narratives. Dynamic expansion of abbreviations that could correctly map ambiguous abbreviations to corresponding medical terms should improve identification of key clinical findings. Finally, the estimation of discriminative power of medications used to treat specific diseases should improve intuitive predictions.
Overall, the performance of our system and most of the other systems developed for the i2b2 obesity challenge was comparable to that of a human expert, indicating that text mining techniques have substantial potential to accurately and efficiently extract the disease status from hospital discharge summaries. However, more research is required to investigate whether the methodologies used can be easily ported between different areas of medical practice. For our system, the infrastructure developed is general enough to be re-used across the clinical domain. However, a few details require knowledge elicitation from domain experts or medical resources, as well as manual changes to the system (e.g., to the clinical inference rules). Still, a major bottleneck faced by medical text mining systems in general is the provision of the training data, which need to be analyzed manually and statistically to identify clues to be exploited in both rule-based and machine-learning approaches.
This work was partially supported by the UK BBSRC project “Mining Term Associations from Literature to Support Knowledge Discovery in Biology”. Irena Spasic gratefully acknowledges the support of the BBSRC and EPSRC via “The Manchester Centre for Integrative Systems Biology” grant.
Dr. Yang is currently with the Department of Computing, Open University, UK.