J Am Med Inform Assoc 12:576-586 doi:10.1197/jamia.M1757
  • Original Investigation
  • Research Paper

ALICE: An Algorithm to Extract Abbreviations from MEDLINE

  1. Hiroko Ao,
  2. Toshihisa Takagi
  • Affiliations of the authors: Department of Computational Biology, University of Tokyo, Chiba, Japan (HA, TT); Basic Research Laboratory, Kanebo Cosmetics, Inc., Kanagawa, Japan (HA)
  • Correspondence and reprints: Hiroko Ao, MSc, Department of Computational Biology, University of Tokyo CB01, 5-1-5, Kashiwanoha, Kashiwa-shi, Chiba, 277-8561, Japan; e-mail: <aohiroko{at}>
  • Received 2 December 2004
  • Accepted 23 April 2005


Objective To help biomedical researchers recognize dynamically introduced abbreviations in biomedical literature, such as gene and protein names, we have constructed a support system called ALICE (Abbreviation LIfter using Corpus-based Extraction). ALICE aims to extract all types of abbreviations with their expansions from a target paper on the fly.

Methods ALICE extracts an abbreviation and its expansion from the literature by using heuristic pattern-matching rules. The system consists of three phases and can potentially identify 320 valid abbreviation-expansion patterns as combinations of the rules.
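
To make the idea of heuristic pattern matching concrete, the following is a minimal sketch of a single, deliberately simple rule: a parenthesized token is taken as a candidate abbreviation, and it is accepted only if its characters match the initials of the words immediately before the parenthesis. The regular expression, the function name `find_pairs`, and the initial-letter rule are all assumptions made for illustration; they are not ALICE's actual rules or phases, which the abstract does not spell out.

```python
import re

# Assumed candidate pattern (illustrative only): a parenthesized token of
# 2 to 10 characters that starts with an uppercase letter.
PAREN = re.compile(r"\(([A-Z][A-Za-z0-9]{1,9})\)")


def find_pairs(text):
    """Return (abbreviation, expansion) pairs found by one toy heuristic.

    The heuristic: take as many words before the parenthesis as the
    abbreviation has characters, and accept the pair only if each word
    starts with the corresponding character (case-insensitive).
    """
    pairs = []
    for match in PAREN.finditer(text):
        abbr = match.group(1)
        # Words preceding the parenthesis; hyphenated words are split here,
        # so the returned expansion joins the parts with spaces.
        words = re.findall(r"[A-Za-z0-9]+", text[:match.start()])
        candidate = words[-len(abbr):]
        if len(candidate) != len(abbr):
            continue  # not enough preceding words to align with
        if all(w[0].lower() == c.lower() for w, c in zip(candidate, abbr)):
            pairs.append((abbr, " ".join(candidate)))
    return pairs


if __name__ == "__main__":
    sentence = "The epidermal growth factor receptor (EGFR) is overexpressed."
    print(find_pairs(sentence))
```

A single rule like this misses many real cases (abbreviations that skip words, use inner letters, or contain digits and symbols), which is presumably why a system of this kind combines many rules rather than relying on one.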

Results It achieved 95% recall and 97% precision on randomly selected titles and abstracts from the MEDLINE database.
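
For readers less familiar with the two measures, the sketch below computes precision and recall under their standard definitions: precision is the fraction of extracted pairs that are correct, and recall is the fraction of correct pairs that were extracted. The function name `precision_recall` and the toy sets are hypothetical and are not data from the study; the abstract does not state the exact counting procedure used in the evaluation.

```python
def precision_recall(extracted, gold):
    """Compute (precision, recall) for a set of extracted items.

    extracted: items produced by the system
    gold:      items judged correct by a human annotator
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up toy sets, purely to show the arithmetic.
    system_output = {"a", "b", "c", "d"}
    reference = {"a", "b", "c", "e", "f"}
    print(precision_recall(system_output, reference))
```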

Conclusion ALICE extracted abbreviations and their expansions from the literature efficiently. The subtly compiled heuristics enabled it to extract abbreviations with high recall without significantly reducing precision. ALICE not only facilitates recognition of an undefined abbreviation in a paper by constructing an abbreviation database or dictionary, but also makes biomedical literature retrieval more accurate. This system is freely available at


  • This work was partly supported by a grant from the Grant-in-Aid for Scientific Research in Priority Areas Genome Information Science, Japanese Ministry of Education, Culture, Sports, and Technology. The authors thank the staff of the Department of Computational Biology, University of Tokyo, and the staff of the Basic Research Laboratory, Kanebo Cosmetics, Inc., for their contribution to this study. The authors are also grateful to Yasunori Yamamoto of the Department of Computer Science, University of Tokyo, for editing the manuscript.

  • * In this paper, the term digits refers to one or more digits, and the following terms are used similarly: alphanumeric characters, alphabetic characters, hyphens, spaces, under-bars, periods, primes, commas, uppercase letters, and slashes.
