J Am Med Inform Assoc 1999;6:76-87 doi:10.1136/jamia.1999.0060076
  • Original Investigation
  • Research Paper

Representing Information in Patient Reports Using Natural Language Processing and the Extensible Markup Language

  1. Carol Friedman,
  2. George Hripcsak,
  3. Lyuda Shagina,
  4. Hongfang Liu
  1. Affiliations of the authors: Columbia University, New York (CF, GH, LS); Queens College City University New York, New York (CF, HL)
  1. Correspondence and reprints: Carol Friedman, PhD, Department of Medical Informatics, Columbia University, 161 Fort Washington Avenue, AP-1310, New York, NY 10032. e-mail: friedma{at}flux.cpmc.columbia.edu
  • Received 18 May 1998
  • Accepted 16 September 1998

Abstract

Objective To design a document model that provides reliable and efficient access to clinical information in patient reports for a broad range of clinical applications, and to implement an automated method using natural language processing that maps textual reports to a form consistent with the model.

Methods A document model that encodes structured clinical information in patient reports while retaining the original contents was designed using the Extensible Markup Language (XML), and a document type definition (DTD) was created. An existing natural language processing (NLP) system was modified to generate output consistent with the model. Two hundred reports were processed using the modified NLP system, and the resulting XML output was checked against the DTD with a validating XML parser.

Results The modified NLP system successfully processed all 200 reports. The output of one report was invalid, and 199 reports were valid XML forms consistent with the DTD.

Conclusions Natural language processing can be used to automatically create an enriched document that contains a structured component whose elements are linked to portions of the original textual report. This integrated document model provides a representation where documents containing specific information can be accurately and efficiently retrieved by querying the structured components. If manual review of the documents is desired, the salient information in the original reports can also be identified and highlighted. Using an XML model of tagging provides an additional benefit in that software tools that manipulate XML documents are readily available.
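
The linkage described in the conclusions can be illustrated with a minimal Python sketch. It assumes a toy enriched report in which each structured finding carries a reference to a tagged phrase of the original text. The element and attribute names (`report`, `structured`, `finding`, `phr`, `idref`), the sample sentences, and the helper `find_source_text` are hypothetical and chosen for illustration only; they are not the DTD or the NLP output from the study.

```python
import xml.etree.ElementTree as ET

# Hypothetical enriched report. All element and attribute names here are
# illustrative assumptions, not the document model defined in the paper.
ENRICHED_REPORT = """\
<report>
  <structured>
    <finding value="opacity" certainty="high" idref="p1"/>
    <finding value="effusion" certainty="no" idref="p2"/>
  </structured>
  <text>
    <phr id="p1">Patchy opacity in the right lower lobe.</phr>
    <phr id="p2">No pleural effusion.</phr>
  </text>
</report>
"""


def find_source_text(root, value):
    """Return the original phrases linked to findings with the given value."""
    # Index the tagged phrases of the original text by their id attribute.
    phrases = {p.get("id"): p.text for p in root.iter("phr")}
    # Query the structured component, then follow each link back to the text.
    return [
        phrases[f.get("idref")]
        for f in root.iter("finding")
        if f.get("value") == value and f.get("idref") in phrases
    ]


root = ET.fromstring(ENRICHED_REPORT)
print(find_source_text(root, "opacity"))
```

The query runs against the structured component only, and the identifier link is what allows the matching passage of the original report to be located and highlighted for manual review. Note that `xml.etree.ElementTree` checks well-formedness but does not validate against a DTD; a validating parser would be needed for that step.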

Footnotes

  • This work was supported in part by grants LM06274 and LM05627 from the National Library of Medicine and by the Columbia Center for Advanced Technology (CAT) in High-performance Computing and Communications in Healthcare with funding from the New York State Science and Technology Foundation.
