J Am Med Inform Assoc 9:612-620 doi:10.1197/jamia.M1139
  • Original Investigation
  • Research Paper

Creating an Online Dictionary of Abbreviations from MEDLINE

  1. Jeffrey T Chang,
  2. Hinrich Schütze,
  3. Russ B Altman
  1. Affiliations of the authors: Department of Genetics, Stanford Medical Informatics, Stanford, California (JTC, RBA); Novation Biosciences, Stanford, California (HS)
  1. Correspondence and reprints: Russ B. Altman, MD, PhD, Department of Genetics, Stanford Medical Informatics, Stanford School of Medicine, Medical School Office Building, X-215, 251 Campus Dr., Stanford, CA 94305; e-mail: <russ.altman@stanford.edu>
  • Received 30 March 2002
  • Accepted 26 June 2002

Abstract

Objective The growth of the biomedical literature presents special challenges for both human readers and automatic algorithms. One such challenge derives from the common and uncontrolled use of abbreviations in the literature. Each additional abbreviation increases the effective size of the vocabulary for a field. Therefore, to create an automatically generated and maintained lexicon of abbreviations, we have developed an algorithm to match abbreviations in text with their expansions.

Design Our method uses a statistical learning algorithm, logistic regression, to score abbreviation expansions based on their resemblance to a training set of human-annotated abbreviations. We applied it to Medstract, a corpus of MEDLINE abstracts in which abbreviations and their expansions have been manually annotated. We then ran the algorithm on all abstracts in MEDLINE, creating a dictionary of biomedical abbreviations. To test the coverage of the database, we used an independently created list of abbreviations from the China Medical Tribune.
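
As an illustration of the scoring step described above, the following minimal sketch scores a candidate expansion with a logistic function over a few features. The feature definitions, the function names (`features`, `score`), and the weights in `WEIGHTS` are assumptions made for this example; they are not the features or fitted coefficients used in the paper, where the weights would be learned from the human-annotated training set.

```python
import math

# Placeholder weights (bias, letter coverage, word coverage).
# Assumption: in practice these would be fitted by logistic regression
# on annotated abbreviation/expansion pairs, not set by hand.
WEIGHTS = [-4.0, 5.0, 3.0]


def features(abbrev, expansion):
    """Hypothetical features describing how well `expansion` explains `abbrev`."""
    words = expansion.lower().split()
    initials = [w[0] for w in words]
    letters = [c for c in abbrev.lower() if c.isalnum()]

    # Match abbreviation letters to word-initial letters, left to right.
    # A letter with no remaining match is skipped without consuming words.
    pos = 0
    matched = 0
    for c in letters:
        try:
            j = initials.index(c, pos)
        except ValueError:
            continue
        matched += 1
        pos = j + 1

    frac_letters = matched / len(letters) if letters else 0.0
    frac_words = matched / len(words) if words else 0.0
    return [1.0, frac_letters, frac_words]  # leading 1.0 is the bias term


def score(abbrev, expansion, weights=WEIGHTS):
    """Logistic score in (0, 1): sigmoid of the weighted feature sum."""
    z = sum(w * x for w, x in zip(weights, features(abbrev, expansion)))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    print(score("MRI", "magnetic resonance imaging"))
    print(score("MRI", "the patient was scanned"))
```

With these placeholder weights, a candidate whose word initials spell out the abbreviation scores close to 1, and an unrelated phrase scores close to 0. Thresholding the score is what would trade recall against precision.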

Measurements We measured the recall and precision of the algorithm in identifying abbreviations from the Medstract corpus. We also measured the recall when searching for abbreviations from the China Medical Tribune against the database.
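
For reference, recall and precision over a set of abbreviation/expansion pairs can be computed as in the sketch below. The function name `precision_recall` and the example pairs are illustrative assumptions; exact-match comparison of pairs is also an assumption and may differ from the matching criterion used in the evaluation.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for predicted pairs against gold pairs.

    Both arguments are iterables of (abbreviation, expansion) tuples.
    Pairs are compared by exact equality (an assumption of this sketch).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Illustrative data only, not taken from Medstract.
    gold = {
        ("MRI", "magnetic resonance imaging"),
        ("PCR", "polymerase chain reaction"),
        ("CT", "computed tomography"),
    }
    predicted = {
        ("MRI", "magnetic resonance imaging"),
        ("CT", "clinical trial"),
    }
    print(precision_recall(predicted, gold))
```

Precision is the fraction of extracted pairs that are correct; recall is the fraction of annotated pairs that were extracted.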

Results On the Medstract corpus, our algorithm achieves up to 83% recall at 80% precision. Applying the algorithm to all of MEDLINE yielded a database of 781,632 high-scoring abbreviations. Of all the abbreviations in the list from the China Medical Tribune, 88% were in the database.

Conclusion We have developed an algorithm to identify abbreviations from text. We are making this available as a public abbreviation server at http://abbreviation.stanford.edu/.

Footnotes

  • This work was supported by NIH LM 06244 and GM61374, NSF DBI-9600637, and a grant from the Burroughs-Wellcome Foundation.
