J Am Med Inform Assoc 1999;6:143-150 doi:10.1136/jamia.1999.0060143
  • Original Investigation
  • Research Paper

A Reliability Study for Evaluating Information Extraction from Radiology Reports

  1. George Hripcsak,
  2. Gilad J Kuperman,
  3. Carol Friedman,
  4. Daniel F Heitjan
  1. Affiliations of the authors: Columbia University, New York, New York (GH, CF, DFH); Partners Healthcare Systems, Boston, Massachusetts (GJK); Queens College CUNY, New York, New York (CF)
  1. Correspondence and reprints: George Hripcsak, MD, 161 Fort Washington Avenue, DAP-1310, New York, NY 10032; e-mail: hripcsak@columbia.edu
  • Received 6 August 1998
  • Accepted 5 November 1998

Abstract

Goal To assess the reliability of a reference standard for an information extraction task.

Setting Twenty-four physician raters from two sites and two specialities judged whether clinical conditions were present based on reading chest radiograph reports.

Methods Variance components, generalizability (reliability) coefficients, and the number of expert raters needed to generate a reliable reference standard were estimated.

Results Per-rater reliability averaged across conditions was 0.80 (95% CI, 0.79-0.81). Reliability for the nine individual conditions varied from 0.67 to 0.97, with central line presence and pneumothorax the most reliable, and pleural effusion (excluding CHF) and pneumonia the least reliable. One to two raters were needed to achieve a reliability of 0.70, and six raters, on average, were required to achieve a reliability of 0.95. The per-rater reliability was far higher than the previously published per-rater reliability of 0.19 for a more complex task. Differences between sites were attributable to changes to the condition definitions.

Conclusion In these evaluations, physician raters were able to judge very reliably the presence of clinical conditions based on text reports. Once the reliability of a specific rater is confirmed, it would be possible for that rater to create a reference standard reliable enough to assess aggregate measures on a system. Six raters would be needed to create a reference standard sufficient to assess a system on a case-by-case basis. These results should help evaluators design future information extraction studies for natural language processors and other knowledge-based systems.
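The link between per-rater reliability and the number of raters needed can be illustrated with the Spearman-Brown formula, which gives the reliability of the average of n interchangeable raters. The abstract does not state the exact computation used, so the following is only a minimal sketch under that assumption; the function names `pooled_reliability` and `raters_needed` are hypothetical and are not taken from the paper.

```python
import math


def pooled_reliability(per_rater: float, n_raters: int) -> float:
    """Spearman-Brown: reliability of the mean of n_raters interchangeable raters."""
    return n_raters * per_rater / (1 + (n_raters - 1) * per_rater)


def raters_needed(per_rater: float, target: float) -> int:
    """Smallest number of raters whose pooled reliability reaches the target.

    Obtained by solving the Spearman-Brown formula for n and rounding up.
    Assumes all raters share the same per-rater reliability.
    """
    if not (0 < per_rater < 1 and 0 < target < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    n = target * (1 - per_rater) / (per_rater * (1 - target))
    return max(1, math.ceil(n))


if __name__ == "__main__":
    # Per-rater reliabilities quoted in the abstract: lowest, average, highest.
    for r in (0.67, 0.80, 0.97):
        print(r, raters_needed(r, 0.70), raters_needed(r, 0.95))
```

Under this assumption, the lowest per-condition reliability (0.67) needs two raters to reach 0.70 and the average (0.80) needs one, which is consistent with the "one to two raters" reported. Applying the formula to the averaged reliability of 0.80 gives five raters for a target of 0.95; the reported figure of six is described as an average, presumably of the per-condition requirements, so the two numbers need not coincide.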

Footnotes

  • This work was supported by grants R29-LM05627, R29-LM05397, and R01-LM06274 from the National Library of Medicine; grant R01-HS08927 from the Agency for Health Care Policy and Research; and a Center for Advanced Technology grant from the New York State Science and Technology Foundation.

  • * The term “generalizability” [14] has been coined as a replacement for “reliability” because it reflects generalizability theory's emphasis on explicitly defining the universe to which one is allowed to generalize. Nevertheless, the generalizability coefficient is a legitimate estimate of reliability as defined in this paper, and generalizability theory authors continue to use both words [15]. In this paper, we use the word “reliability” throughout.
