JAMIA 2009;16:247-255 doi:10.1197/jamia.M2844
  • Original Investigation
  • Research Paper

BioTagger-GM: A Gene/Protein Name Recognition System

  1. Manabu Torii (a),
  2. Zhangzhi Hu (b),
  3. Cathy H Wu (c),
  4. Hongfang Liu (d)
  1. (a) The Imaging Science and Information Systems Center, Georgetown University Medical Center, Washington, DC
  2. (b) Department of Oncology, Georgetown University Medical Center, Washington, DC
  3. (c) Protein Information Resource, Georgetown University Medical Center, Washington, DC
  4. (d) Department of Biostatistics, Bioinformatics, and Biomathematics, Georgetown University Medical Center, Washington, DC
  1. Correspondence: Manabu Torii, The Imaging Science and Information Systems Center, Georgetown University Medical Center, 2115 Wisconsin Avenue NW, Washington, DC 20057; e-mail: <torii{at}isis.georgetown.edu>
  • Received 30 April 2008
  • Accepted 5 December 2008

Abstract

Objectives Biomedical named entity recognition (BNER) is a critical component of automated systems that mine biomedical knowledge from free text. Among the entity types in the domain, gene/protein names are arguably the most studied in BNER. Our goal is to develop a gene/protein name recognition system, BioTagger-GM, that exploits the rich information in terminology sources using powerful machine learning frameworks and system combination.

Design BioTagger-GM consists of four main components: (1) dictionary lookup—gene/protein names in BioThesaurus and biomedical terms in UMLS Metathesaurus are tagged in text, (2) machine learning—machine learning systems are trained using dictionary lookup results as one type of feature, (3) post-processing—heuristic rules are used to correct recognition errors, and (4) system combination—a voting scheme is used to combine recognition results from multiple systems.
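Two of the components above, dictionary lookup and system combination, can be illustrated with a minimal sketch. It is not the BioTagger-GM implementation. The function names `dictionary_tags` and `vote_mentions`, the toy lexicon (standing in for BioThesaurus and UMLS Metathesaurus), the greedy longest-match strategy, and the `min_votes` threshold are all assumptions made for illustration. The abstract does not specify how the lookup or the voting scheme is actually carried out.

```python
from collections import Counter


def dictionary_tags(tokens, lexicon, max_len=5):
    """Tag tokens with B/I/O labels by greedy longest match against a lexicon.

    `lexicon` is a set of lowercase token tuples (a toy stand-in for a real
    terminology source). The resulting tags could be passed to a machine
    learning tagger as one type of feature.
    """
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in lexicon:
                matched = n
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


def vote_mentions(system_outputs, min_votes):
    """Keep each (start, end) span proposed by at least `min_votes` systems."""
    counts = Counter()
    for spans in system_outputs:
        counts.update(set(spans))  # each system votes at most once per span
    return sorted(span for span, n in counts.items() if n >= min_votes)


# Hypothetical example data.
tokens = ["The", "tumor", "necrosis", "factor", "alpha", "gene"]
lexicon = {
    ("tumor", "necrosis", "factor"),
    ("tumor", "necrosis", "factor", "alpha"),
}
lookup_features = dictionary_tags(tokens, lexicon)

outputs = [
    [(0, 2), (5, 6)],  # spans from hypothetical system A
    [(0, 2), (8, 9)],  # spans from hypothetical system B
    [(0, 2), (5, 6)],  # spans from hypothetical system C
]
combined = vote_mentions(outputs, min_votes=2)

print(lookup_features)
print(combined)
```

In this sketch, a span survives only if enough systems agree on its exact boundaries, so raising `min_votes` trades recall for precision.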

Measurements The BioCreAtIvE II Gene Mention (GM) corpus was used to evaluate the proposed method. To test its general applicability, the method was also evaluated on the JNLPBA corpus modified for gene/protein name recognition. The performance of the systems was evaluated through cross-validation tests and measured using precision, recall, and F-Measure.
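For reference, the three measures can be computed from sets of gold and predicted mentions as sketched below. The function name `precision_recall_f` and the exact-span matching rule are assumptions. The official challenge scorers may apply different matching criteria, and the balanced F-Measure (harmonic mean of precision and recall) is assumed here.

```python
def precision_recall_f(gold, predicted):
    """Compute precision, recall, and balanced F-Measure over mention spans.

    A predicted span counts as a true positive only if it matches a gold
    span exactly (an assumption of this sketch).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical gold and predicted (start, end) spans.
gold_spans = {(0, 2), (5, 6), (8, 9), (12, 13)}
predicted_spans = {(0, 2), (5, 6), (10, 11)}

p, r, f = precision_recall_f(gold_spans, predicted_spans)
print(f"precision={p:.4f} recall={r:.4f} F={f:.4f}")
```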

Results BioTagger-GM achieved an F-Measure of 0.8887 on the BioCreAtIvE II GM corpus, which is higher than that of the first-place system in the BioCreAtIvE II challenge. The applicability of the method was also confirmed on the modified JNLPBA corpus.

Conclusion The results suggest that terminology sources, powerful machine learning frameworks, and system combination can be integrated to build an effective BNER system.

Footnotes

  • Supported by IIS-0639062 from the National Science Foundation.

The Journal of the American Medical Informatics Association is published for the American Medical Informatics Association by BMJ Publishing Group Ltd.