A Reliability Study for Evaluating Information Extraction from Radiology Reports
- Affiliations of the authors: Columbia University, New York, New York (GH, CF, DFH); Partners Healthcare Systems, Boston, Massachusetts (GJK); Queens College CUNY, New York, New York (CF)
- Correspondence and reprints: George Hripcsak, MD, 161 Fort Washington Avenue, DAP-1310, New York, NY 10032; e-mail: hripcsak@columbia.edu
- Received 6 August 1998
- Accepted 5 November 1998
Abstract
Goal: To assess the reliability of a reference standard for an information extraction task.
Setting: Twenty-four physician raters from two sites and two specialties judged whether clinical conditions were present based on reading chest radiograph reports.
Methods: Variance components, generalizability (reliability) coefficients, and the number of expert raters needed to generate a reliable reference standard were estimated.
Results: Per-rater reliability averaged across conditions was 0.80 (95% CI, 0.79-0.81). Reliability for the nine individual conditions varied from 0.67 to 0.97, with central line presence and pneumothorax the most reliable, and pleural effusion (excluding CHF) and pneumonia the least reliable. One to two raters were needed to achieve a reliability of 0.70, and six raters, on average, were required to achieve a reliability of 0.95. The per-rater reliability in this study was far higher than a previously published per-rater reliability of 0.19 for a more complex task. Differences between sites were attributable to changes to the condition definitions.
Conclusion: In these evaluations, physician raters were able to judge very reliably the presence of clinical conditions based on text reports. Once the reliability of a specific rater is confirmed, it would be possible for that rater to create a reference standard reliable enough to assess aggregate measures on a system. Six raters would be needed to create a reference standard sufficient to assess a system on a case-by-case basis. These results should help evaluators design future information extraction studies for natural language processors and other knowledge-based systems.
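The relationship between per-rater reliability and the number of raters needed can be illustrated with a short calculation. The sketch below assumes the Spearman-Brown relationship, which holds for a simple design in which raters are the only facet being averaged over; the abstract does not state the exact formula the study used, so this is an illustration and not the authors' method. The function names `pooled_reliability` and `raters_needed` are hypothetical.

```python
import math


def pooled_reliability(per_rater: float, n_raters: int) -> float:
    """Reliability of the mean judgment of n_raters raters.

    Assumes the Spearman-Brown relationship (an assumption of this
    sketch, not a formula quoted from the paper).
    """
    return n_raters * per_rater / (1 + (n_raters - 1) * per_rater)


def raters_needed(per_rater: float, target: float) -> int:
    """Smallest whole number of raters whose pooled reliability reaches target."""
    if not (0 < per_rater < 1 and 0 < target < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    # Invert the Spearman-Brown relationship for the rater count.
    exact = target * (1 - per_rater) / (per_rater * (1 - target))
    # Small tolerance guards against floating-point noise near integers.
    return max(1, math.ceil(exact - 1e-9))


if __name__ == "__main__":
    # Per-rater reliabilities reported in the abstract: lowest, average, highest.
    for r in (0.67, 0.80, 0.97):
        print(r, raters_needed(r, 0.70), raters_needed(r, 0.95))
```

Under this assumption, the reported range of per-rater reliabilities (0.67 to 0.97) gives one to two raters for a target of 0.70, in line with the abstract. For a target of 0.95, the sketch gives between one and ten raters depending on the condition. The abstract's figure of six is an average across conditions, which need not equal the count obtained from the average reliability of 0.80.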
Footnotes
- This work was supported by grants R29-LM05627, R29-LM05397, and R01-LM06274 from the National Library of Medicine; grant R01-HS08927 from the Agency for Health Care Policy and Research; and a Center for Advanced Technology grant from the New York State Science and Technology Foundation.
- * The term “generalizability”14 has been coined as a replacement for “reliability” because it reflects generalizability theory's emphasis on explicitly defining the universe to which one is allowed to generalize. Nevertheless, the generalizability coefficient is a legitimate estimate of reliability as defined in this paper, and generalizability theory authors continue to use both words.15 In this paper, we use the word “reliability” throughout.








