Enhancing Text Categorization with Semantic-enriched Representation and Training Data Augmentation
- Affiliations of the authors: Department of Biostatistics, Bioinformatics and Epidemiology (XL, BZ), Medical University of South Carolina, Charleston, SC; Department of Electrical and Computer Engineering (AV), University of Illinois at Urbana-Champaign, Urbana, IL; Department of Computer Science (CXZ), University of Illinois at Urbana-Champaign, Urbana, IL
- Correspondence and reprints: Xinghua Lu, MD, PhD, Department of Biostatistics, Bioinformatics and Epidemiology, 135 Cannon St, Suite 303, Charleston, SC 29425; e-mail: <lux{at}musc.edu>
- Received 9 January 2006
- Accepted 6 June 2006
Abstract
Objective Acquiring and representing biomedical knowledge is an increasingly important component of contemporary bioinformatics. A critical step in this process is to efficiently identify and retrieve relevant documents from the vast volume of modern biomedical literature. In practice, many information retrieval tasks are difficult because of high data dimensionality and a lack of annotated examples with which to train a retrieval algorithm. Under these conditions, the performance of information retrieval algorithms is often unsatisfactory, and improvements are needed.
Design We studied two approaches that enhance text categorization performance on sparse, high-dimensional data: (1) semantic-preserving dimension reduction by representing text with semantic-enriched features; and (2) augmenting training data with semi-supervised learning. A probabilistic topic model was applied to extract the major semantic topics from a text corpus of interest. The representation of each document was then projected from the high-dimensional vocabulary space onto a semantic topic space of reduced dimensionality. A graph-based semi-supervised learning algorithm was applied to identify potential positive training cases, which were then used to augment the training data. The effects of data transformation and augmentation on text categorization by support vector machine (SVM) were evaluated.
Results and Conclusion Semantic-enriched data transformation and training data augmented with pseudo-positive cases both enhance the efficiency and performance of text categorization by SVM.
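The augmentation step described under Design can be illustrated with a minimal sketch. The abstract does not name the specific graph-based algorithm, so the code below substitutes a simple label propagation over a cosine-similarity graph; it should not be read as the authors' method. The function names, the 0.5 starting score, the fixed iteration count, and the 0.6 threshold are all assumptions made for illustration. Documents are assumed to be already projected onto a low-dimensional topic space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic-space vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate_labels(docs, labels, iterations=50):
    """Spread labels over a fully connected similarity graph.

    docs   : list of topic-space vectors (assumed already projected)
    labels : 1.0 for positive, 0.0 for negative, None for unlabeled
    Returns one score in [0, 1] per document.
    """
    n = len(docs)
    # Edge weights are pairwise cosine similarities; no self-loops.
    weights = [[cosine(docs[i], docs[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    # Unlabeled documents start at a neutral score (illustrative choice).
    scores = [0.5 if y is None else y for y in labels]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(labels[i])  # labeled documents stay clamped
                continue
            total = sum(weights[i])
            if total:
                # Weighted average of the neighbors' current scores.
                updated.append(
                    sum(w * s for w, s in zip(weights[i], scores)) / total)
            else:
                updated.append(scores[i])
        scores = updated
    return scores


def pseudo_positives(docs, labels, threshold=0.6):
    """Indices of unlabeled documents whose propagated score meets the threshold."""
    scores = propagate_labels(docs, labels)
    return [i for i, y in enumerate(labels)
            if y is None and scores[i] >= threshold]
```

In a pipeline like the one outlined in the abstract, the indices returned by `pseudo_positives` would be added to the positive training set before the SVM is trained. The threshold controls how aggressively unlabeled documents are promoted and would need to be tuned on held-out data.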
Footnotes
- Xinghua Lu is partially supported by the NIH grants 5P20 RR016434-04 and T15 LM07438-02. ChengXiang Zhai is supported by NSF grants IIS-0347933 and IIS-0428472. Atulya Velivelli is partially supported by the NSF Grant CCF 04-26627. The authors thank Dr. John Lafferty for insightful discussions, Dr. Thorsten Joachims for making SVMlight available, and the organizers of the TREC genomics track for preparing the data set. The authors also thank the anonymous reviewers for their constructive critiques and suggestions.








