Using Citation Data to Improve Retrieval from MEDLINE
- Elmer V Bernstam,
- Jorge R Herskovic,
- Yindalon Aphinyanaphongs,
- Constantin F Aliferis,
- Madurai G Sriram,
- William R Hersh
- Affiliations of the authors: School of Health Information Sciences, The University of Texas Health Science Center at Houston, Houston, TX (EVB, JRH, MGS); Department of Biomedical Informatics, Vanderbilt University, Nashville, TN (YA, CFA); Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, Portland, OR (WRH)
- Correspondence and reprints: Elmer Bernstam, MD, School of Health Information Sciences, The University of Texas Health Science Center at Houston, 7000 Fannin Street, Suite 600, Houston, TX 77030; e-mail: <elmer.v.bernstam{at}uth.tmc.edu>
- Received 12 July 2005
- Accepted 16 September 2005
Abstract
Objective: To determine whether algorithms developed for the World Wide Web can be applied to the biomedical literature in order to identify articles that are important as well as relevant.
Design and Measurements: A direct comparison of eight algorithms: simple PubMed queries, clinical queries (sensitive and specific versions), vector cosine comparison, citation count, journal impact factor, PageRank, and machine learning based on polynomial support vector machines. The objective was to prioritize important articles, defined as being included in a pre-existing bibliography of important literature in surgical oncology.
Results: Citation-based algorithms were more effective than noncitation-based algorithms at identifying important articles. The most effective strategies were simple citation count and PageRank, which on average identified over six important articles in the first 100 results compared to 0.85 for the best noncitation-based algorithm (p < 0.001). The authors saw similar differences between citation-based and noncitation-based algorithms at 10, 20, 50, 200, 500, and 1,000 results (p < 0.001). Citation lag affects performance of PageRank more than simple citation count. However, in spite of citation lag, citation-based algorithms remain more effective than noncitation-based algorithms.
Conclusion: Algorithms that have proved successful on the World Wide Web can be applied to biomedical information retrieval. Citation-based algorithms can help identify important articles within large sets of relevant results. Further studies are needed to determine whether citation-based algorithms can effectively meet actual user information needs.
Footnotes
- Supported in part by NLM grant 5 K22 LM008306 and a training fellowship from the W. M. Keck Foundation to the Gulf Coast Consortia through the Keck Center for Computational and Structural Biology.
- The authors are also grateful to Thomson-ISI for granting use of the Science Citation Index for research purposes.
- * Recall = percentage of relevant articles contained in the database that are retrieved.
- † Precision = percentage of retrieved articles that are relevant.
- ‡ Results are actually reported in reverse order of entry into the PubMed database. In some cases, this is not exactly the same as reverse chronological order.
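The recall and precision definitions in the footnotes above can be illustrated with a minimal sketch. The function name `precision_recall` and the example article identifiers are hypothetical and chosen only for illustration; they do not come from the study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as percentages.

    retrieved: iterable of article IDs returned by a search
    relevant:  iterable of article IDs judged relevant in the database
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set  # retrieved articles that are relevant

    # Precision: percentage of retrieved articles that are relevant.
    precision = 100.0 * len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: percentage of relevant articles in the database that are retrieved.
    recall = 100.0 * len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: 4 articles retrieved, 3 relevant articles exist,
# and 2 of the retrieved articles are among the relevant ones.
example_precision, example_recall = precision_recall(
    retrieved=[1, 2, 3, 4],
    relevant=[2, 4, 5],
)
```

In this made-up example, two of the four retrieved articles are relevant, and two of the three relevant articles are retrieved.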
Term frequency (TF) = number of times that a specific term occurs in a given document. Therefore, TF will be large when the term occurs multiple times in a document. Document frequency = number of documents that contain the specific term. Therefore, the inverse document frequency (IDF) will be large for terms that occur in a small number of documents.
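The TF and document frequency definitions above can be sketched in a few lines of code. The source does not state which IDF formula was used, so the logarithmic form log(N / DF) below is an assumption (it is one common choice), and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def term_frequency(term, doc_tokens):
    """Number of times `term` occurs in one tokenized document."""
    return Counter(doc_tokens)[term]


def document_frequency(term, corpus):
    """Number of documents in `corpus` that contain `term` at least once."""
    return sum(1 for doc_tokens in corpus if term in doc_tokens)


def inverse_document_frequency(term, corpus):
    """Assumed form: log(N / DF). Returns 0.0 if the term never occurs."""
    df = document_frequency(term, corpus)
    if df == 0:
        return 0.0
    return math.log(len(corpus) / df)


# Toy corpus of three tokenized "documents" (made-up data).
corpus = [
    ["sarcoma", "surgery", "sarcoma"],
    ["surgery", "outcome"],
    ["surgery", "melanoma"],
]

tf_sarcoma = term_frequency("sarcoma", corpus[0])            # occurs twice in doc 0
idf_sarcoma = inverse_document_frequency("sarcoma", corpus)  # rare term, larger IDF
idf_surgery = inverse_document_frequency("surgery", corpus)  # in every doc, IDF of 0
```

Under this assumed formula, a term that appears in every document ("surgery") gets an IDF of zero, while a term confined to one document ("sarcoma") gets a larger IDF, matching the behavior described in the footnote.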








