UMLS Concept Indexing for Production Databases
A Feasibility Study
- Correspondence and reprints: Prakash M. Nadkarni, MD, Center for Medical Informatics, Yale University School of Medicine, P.O. Box 208009, New Haven, CT 06520-8009; e-mail: <Prakash.Nadkarni{at}yale.edu>
- Received 29 February 2000
- Accepted 31 July 2000
Abstract
Objectives To explore the feasibility of using the National Library of Medicine's Unified Medical Language System (UMLS) Metathesaurus as the basis for a computational strategy to identify concepts in medical narrative text preparatory to indexing. To quantitatively evaluate this strategy in terms of true positives, false positives (spuriously identified concepts) and false negatives (concepts missed by the identification process).
Methods Using the 1999 UMLS Metathesaurus, the authors processed a training set of 100 documents (50 discharge summaries, 50 surgical notes) with a concept-identification program, whose output was manually analyzed. They flagged concepts that were erroneously identified and added new concepts that were not identified by the program, recording the reason for failure in such cases. After several refinements to both their algorithm and the UMLS subset on which it operated, they deployed the program on a test set of 24 documents (12 of each kind).
Results Of 8,745 matches in the training set, 7,227 (82.6 percent) were true positives, whereas of 1,701 matches in the test set, 1,298 (76.3 percent) were true positives. Matches other than true positives indicated potential problems in production-mode concept indexing. Examples of causes of problems were redundant concepts in the UMLS, homonyms, acronyms, abbreviations and elisions, concepts that were missing from the UMLS, proper names, and spelling errors.
Conclusions The error rate was too high for concept indexing to be the only production-mode means of preprocessing medical narrative. Considerable curation needs to be performed to define a UMLS subset that is suitable for concept matching.
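The proportions reported under Results can be rechecked from the raw counts. The following is a minimal sketch, not part of the study's software; the function name `true_positive_rate` and the `counts` dictionary are illustrative assumptions, and only the four counts themselves come from the abstract.

```python
def true_positive_rate(true_positives: int, total_matches: int) -> float:
    """Return true positives as a percentage of all matches."""
    if total_matches <= 0:
        raise ValueError("total_matches must be positive")
    return 100.0 * true_positives / total_matches


# Counts taken from the Results paragraph: (true positives, total matches).
counts = {
    "training": (7227, 8745),
    "test": (1298, 1701),
}

for name, (tp, total) in counts.items():
    rate = true_positive_rate(tp, total)
    # "other" covers every match that was not a true positive.
    other = total - tp
    print(f"{name}: {rate:.1f} percent true positives, {other} other matches")
```

Run as written, this reproduces the 82.6 percent and 76.3 percent figures and shows that 1,518 training-set matches and 403 test-set matches were not true positives.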
Footnotes
- This work was supported in part by National Institutes of Health grants R01 LM06843-01 from the National Library of Medicine and U01 CA78266-02 from the National Cancer Institute.
- The authors will provide the relational UMLS schema used in this study (plus UMLS data from sources that have not imposed restrictions on distribution) and the Microsoft Access front end (which includes Concept Locator) to anyone who makes a written request.
- * This process, termed indexing, is described in Appendix 1, which appears as supplemental material to this article in JAMIA Online, at www.jamia.org.
- † An overview of the UMLS schema is provided in Appendix 2, which also appears as supplemental material to this article in JAMIA Online.
- ‡ Sentence-based approaches are discussed in Appendix 3, which appears as supplemental material to this article in JAMIA Online, at www.jamia.org.
- § Part of a discharge summary is illustrated in Appendix 4, Figure 1, which appears as supplemental material to this article in JAMIA Online, at www.jamia.org.
- ¶ Details of each step, including the concept-matching algorithm, are also provided in Appendix 4, which appears online.









