A Simple and Practical Dictionary-based Approach for Identification of Proteins in Medline Abstracts
- Correspondence and reprints: Nikolai Daraselia, PhD, Ariadne Genomics, Inc., 9700 Great Seneca Highway, Rockville, MD 20850; e-mail: <nikolai{at}ariadnegenomics.com>
- Received 8 September 2003
- Accepted 11 January 2004
Abstract
Objective: The aim of this study was to develop a practical and efficient protein identification system for biomedical corpora.
Design: The system, called ProtScan, combines a carefully constructed dictionary of mammalian proteins with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts. It also takes advantage of the Medline “Name-of-Substance” (NOS) annotation. The ProtScan dictionaries were constructed semi-automatically from various public-domain sequence databases, followed by intensive expert curation.
Measurements: The recall and precision of the system were determined using 1,000 randomly selected, hand-tagged Medline abstracts.
Results: The system identifies protein occurrences in Medline abstracts with 98% precision and 88% recall, and processes approximately 300 abstracts per second. Without NOS annotation, precision and recall were 98.5% and 84%, respectively.
Conclusion: The system appears to be well suited for protein-based Medline indexing and can help improve biomedical information retrieval. Further approaches to improving ProtScan's recall are also discussed.
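The dictionary-plus-tokenization idea described under Design can be illustrated with a minimal sketch. This is not the ProtScan implementation, whose dictionaries and tokenization rules are not given in this abstract. The toy name list, the tokenization rule (splitting letter runs from digit runs and ignoring punctuation), and the function names `tokenize`, `build_index`, and `tag_proteins` are all assumptions made for illustration. The sketch shows one plausible way a normalizing tokenizer lets a single dictionary entry match spelling variants such as "IL-2" and "IL2", with a greedy longest-match scan over the token stream.

```python
import re

# Hypothetical toy dictionary; the real ProtScan dictionaries are far larger
# and expert-curated, and are not reproduced here.
PROTEIN_NAMES = [
    "p53",
    "tumor necrosis factor alpha",
    "TNF-alpha",
    "interleukin 2",
    "IL-2",
]

# Assumed tokenization rule: runs of letters and runs of digits are separate
# tokens; hyphens, spaces, and other punctuation are dropped.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+")


def tokenize(text):
    """Return a list of (lowercased token, start offset, end offset)."""
    return [(m.group().lower(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def build_index(names):
    """Map each name's token tuple to the dictionary name it came from."""
    index = {}
    for name in names:
        key = tuple(tok for tok, _, _ in tokenize(name))
        if key:
            index[key] = name
    return index


def tag_proteins(text, index):
    """Greedy longest-match scan; returns (start, end, dictionary name) hits."""
    tokens = tokenize(text)
    max_len = max((len(key) for key in index), default=0)
    hits = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in index:
                hits.append((tokens[i][1], tokens[i + n - 1][2], index[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return hits


if __name__ == "__main__":
    sample = "IL2 and tumor necrosis factor alpha induce p53."
    for start, end, name in tag_proteins(sample, build_index(PROTEIN_NAMES)):
        print(sample[start:end], "->", name)
```

Because matching is done on normalized token tuples rather than raw strings, the cost per abstract grows with the number of tokens times the longest dictionary entry, which is consistent with the kind of throughput a dictionary approach can reach. A rule this permissive would also produce false positives on real text, which is where curation of the dictionary matters.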
Footnotes
- Supported in part by NIH 1R43-GM067276. Some of the authors of the article are stockholders of Ariadne Genomics, Inc., a small private company whose business model is the development and sale of bioinformatics software, in particular pathway analysis software. The described algorithm for identification of protein names is one component of the protein function information extraction pipeline, which, in turn, is part of PathwayAssist, the software product for pathway analysis (Bioinformatics. 2003;19:2155–7). The authors do not directly receive any royalties or fees from sales of the described algorithm; however, sales of the entire software product provide a significant portion of the company's income.








