JAMIA 2004;11:174-178 doi:10.1197/jamia.M1453
  • Original Investigation
  • Research Paper

A Simple and Practical Dictionary-based Approach for Identification of Proteins in Medline Abstracts

  1. Sergei Egorov,
  2. Anton Yuryev,
  3. Nikolai Daraselia
  • Affiliation of the authors: Ariadne Genomics, Inc., Rockville, MD
  • Correspondence and reprints: Nikolai Daraselia, PhD, Ariadne Genomics, Inc., 9700 Great Seneca Highway, Rockville, MD 20850; e-mail: <nikolai{at}ariadnegenomics.com>
  • Received 8 September 2003
  • Accepted 11 January 2004

Abstract

Objective The aim of this study was to develop a practical and efficient protein identification system for biomedical corpora.

Design The system, called ProtScan, combines a carefully constructed dictionary of mammalian proteins with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts; it also takes advantage of the Medline “Name-of-Substance” (NOS) annotation. The ProtScan dictionaries were constructed semi-automatically from various public-domain sequence databases, followed by intensive expert curation.
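The general idea of dictionary lookup over normalized tokens can be illustrated with a minimal sketch. This is not the ProtScan implementation: the toy dictionary, the tokenization rule (splitting into lowercase letter runs and digit runs), the greedy longest-match strategy, and the names `TOY_DICTIONARY`, `tokenize`, and `tag_proteins` are all assumptions made for illustration only.

```python
import re

# Hypothetical toy dictionary: normalized token tuples -> canonical name.
# A real dictionary would be built from sequence databases and curated.
TOY_DICTIONARY = {
    ("il", "2"): "IL2",
    ("interleukin", "2"): "IL2",
    ("p", "53"): "TP53",
    ("insulin",): "INS",
}
MAX_LEN = max(len(key) for key in TOY_DICTIONARY)


def tokenize(text):
    """Split text into lowercase letter runs and digit runs.

    Assumed normalization: 'IL-2', 'IL2' and 'IL 2' all become ['il', '2'],
    so spelling variants map to the same dictionary key.
    """
    return [m.group(0).lower() for m in re.finditer(r"[A-Za-z]+|[0-9]+", text)]


def tag_proteins(text):
    """Return (matched tokens, canonical name) pairs using greedy longest match."""
    tokens = tokenize(text)
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in TOY_DICTIONARY:
                hits.append((" ".join(key), TOY_DICTIONARY[key]))
                i += n
                break
        else:
            # No dictionary entry starts at this token; move on.
            i += 1
    return hits


if __name__ == "__main__":
    print(tag_proteins("IL-2 and interleukin 2 induce p53, not insulin."))
```

Normalizing before lookup lets one dictionary entry cover several surface spellings; the price is that very short or ambiguous keys can match non-protein text, which is presumably one reason a curated dictionary matters.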

Measurements The recall and precision of the system were determined using 1,000 randomly selected, hand-tagged Medline abstracts.
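For readers unfamiliar with these measures, the sketch below shows how precision and recall are conventionally computed by comparing system output against hand-tagged annotations. The function name `precision_recall`, the representation of a mention as an (abstract ID, protein name) pair, and the toy data are illustrative assumptions; they are not taken from the study, and the authors' exact matching criteria are not described here.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall from two sets of tagged mentions.

    gold:      mentions tagged by a human annotator
    predicted: mentions tagged by the system
    """
    true_positives = len(gold & predicted)
    # Precision: fraction of system tags that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of hand-tagged mentions that the system found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up toy data: (abstract ID, protein name) pairs.
    gold = {(1, "IL2"), (1, "TP53"), (2, "INS"), (3, "EGFR")}
    predicted = {(1, "IL2"), (2, "INS"), (3, "AKT1")}
    p, r = precision_recall(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f}")
```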

Results The system identifies protein occurrences in Medline abstracts with 98% precision and 88% recall and processes approximately 300 abstracts per second. Without NOS annotation, precision and recall were 98.5% and 84%, respectively.

Conclusion The system appears to be well suited for protein-based Medline indexing and can help to improve biomedical information retrieval. Further approaches to improving ProtScan's recall are also discussed.

Footnotes

  • Supported in part by NIH 1R43-GM067276. Some of the authors of the article are stockholders of Ariadne Genomics, Inc., a small private company whose business model is the development and sale of bioinformatics software, in particular pathway analysis software. The described algorithm for identification of protein names is one component of the protein function information extraction pipeline, which, in turn, is part of PathwayAssist, the software product for pathway analysis (Bioinformatics. 2003;19:2155–7). The authors do not directly receive any royalties or fees from the sales of the described algorithm; however, sales of the entire software product provide a significant portion of the company's income.

The Journal of the American Medical Informatics Association is published for the American Medical Informatics Association by BMJ Publishing Group Ltd.