JAMIA 2004;11:320-331 doi:10.1197/jamia.M1533
  • Original Investigation
  • Research Paper

A Multi-aspect Comparison Study of Supervised Word Sense Disambiguation

  1. Hongfang Liu,
  2. Virginia Teller,
  3. Carol Friedman
  • Affiliations of the authors: Department of Information Systems, University of Maryland at Baltimore County, Baltimore, MD (HL); Department of Computer Science, Hunter College, City University of New York, New York, NY (VT); Department of Biomedical Informatics, Columbia University, New York, NY (CF)
  • Correspondence and reprints: Hongfang Liu, PhD, Department of Information Systems, University of Maryland at Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250; e-mail: <hfliu{at}umbc.edu>
  • Received 9 January 2004
  • Accepted 16 March 2004

Abstract

Objective The aim of this study was to investigate relations among different aspects in supervised word sense disambiguation (WSD; supervised machine learning for disambiguating the sense of a term in a context) and compare supervised WSD in the biomedical domain with that in the general English domain.

Methods The study involves three data sets (a biomedical abbreviation data set, a general biomedical term data set, and a general English data set). The authors implemented three machine-learning algorithms, including (1) naïve Bayes (NBL) and decision lists (TDLL), (2) their adaptation of decision lists (ODLL), and (3) their mixed supervised learning (MSL). There were six feature representations (various combinations of collocations, bag of words, oriented bag of words, etc.) and five window sizes (2, 4, 6, 8, and 10).
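To make the ingredients named in the Methods concrete, the following is a minimal sketch of a bag-of-words naïve Bayes sense classifier over a fixed context window. It is not the authors' implementation. Several details are assumptions made for illustration only: the window size is taken to mean the number of tokens on each side of the target, add-one smoothing is used, and the names `window_features` and `NaiveBayesWSD`, together with the toy sentences for the ambiguous word "cold", are hypothetical.

```python
import math
from collections import Counter, defaultdict


def window_features(tokens, target_index, window_size):
    """Bag of words from up to `window_size` tokens on each side of the target.

    Assumption: the window is counted per side; the abstract does not say.
    """
    left = tokens[max(0, target_index - window_size):target_index]
    right = tokens[target_index + 1:target_index + 1 + window_size]
    return [t.lower() for t in left + right]


class NaiveBayesWSD:
    """Multinomial naive Bayes over context words, with add-one smoothing."""

    def __init__(self):
        self.sense_counts = Counter()            # instances per sense
        self.word_counts = defaultdict(Counter)  # word counts per sense
        self.vocab = set()

    def train(self, instances):
        """`instances` is an iterable of (features, sense) pairs."""
        for features, sense in instances:
            self.sense_counts[sense] += 1
            for word in features:
                self.word_counts[sense][word] += 1
                self.vocab.add(word)

    def predict(self, features):
        total = sum(self.sense_counts.values())
        vocab_size = len(self.vocab)
        best_sense, best_score = None, float("-inf")
        for sense, count in self.sense_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[sense].values()) + vocab_size
            for word in features:
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[sense][word] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense


# Invented toy data: (sentence, index of the ambiguous word "cold", sense).
TOY_DATA = [
    ("the patient caught a cold and has a cough", 4, "illness"),
    ("she has a bad cold with fever", 4, "illness"),
    ("the cold weather froze the lake", 1, "temperature"),
    ("a cold wind blew in winter", 1, "temperature"),
]


def train_toy_classifier(window_size=4):
    clf = NaiveBayesWSD()
    clf.train(
        (window_features(text.split(), idx, window_size), sense)
        for text, idx, sense in TOY_DATA
    )
    return clf
```

A classifier trained this way can be queried with `train_toy_classifier().predict(window_features(tokens, index, 4))`. Varying `window_size` over 2, 4, 6, 8, and 10 gives a rough analogue of the window sizes compared in the study.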

Results Supervised WSD is suitable only when there are enough sense-tagged instances, with at least a few dozen instances for each sense. Collocations combined with neighboring words are appropriate selections for the context. For terms with unrelated biomedical senses, a large window size such as the whole paragraph should be used, while for general English words a moderate window size between 4 and 10 should be used. The performance of the authors' implementation of decision list classifiers for abbreviations was better than that of traditional decision list classifiers. However, the opposite held for the other two sets. Also, the authors' mixed supervised learning was stable and generally better than others for all sets.

Conclusion From this study, it was found that different aspects of supervised WSD depend on each other. The experiment method presented in the study can be used to select the best supervised WSD classifier for each ambiguous term.
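One way to read the per-term selection idea is as a small model-selection loop: evaluate each candidate classifier on the sense-tagged instances of a single ambiguous term and keep the one that scores best. The sketch below assumes k-fold cross-validation with accuracy as the criterion, which the abstract does not specify. The two candidates (`majority_sense` and `overlap_classifier`) are deliberately simple, hypothetical stand-ins rather than the classifiers evaluated in the study, and the instances are invented.

```python
from collections import Counter


def majority_sense(train):
    """Baseline: always predict the most frequent sense seen in training."""
    top = Counter(sense for _, sense in train).most_common(1)[0][0]
    return lambda features: top


def overlap_classifier(train):
    """Toy classifier: choose the sense whose training words overlap most
    with the context, falling back to the majority sense when nothing overlaps."""
    words_by_sense = {}
    for features, sense in train:
        words_by_sense.setdefault(sense, set()).update(features)
    fallback = majority_sense(train)

    def predict(features):
        context = set(features)
        scores = {s: len(ws & context) for s, ws in words_by_sense.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else fallback(features)

    return predict


def cross_val_accuracy(make_classifier, instances, k):
    """Accuracy over k folds; fold i holds out every k-th instance starting at i."""
    correct = 0
    for fold in range(k):
        test = instances[fold::k]
        train = [x for i, x in enumerate(instances) if i % k != fold]
        predict = make_classifier(train)
        correct += sum(predict(features) == sense for features, sense in test)
    return correct / len(instances)


def select_best(candidates, instances, k=3):
    """Return the name of the best-scoring candidate and all scores."""
    scores = {
        name: cross_val_accuracy(make, instances, k)
        for name, make in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Invented (features, sense) instances for one ambiguous term.
TOY_INSTANCES = [
    (["fever", "cough"], "illness"),
    (["wind", "winter"], "temperature"),
    (["cough", "flu"], "illness"),
    (["winter", "ice"], "temperature"),
    (["fever", "flu"], "illness"),
    (["wind", "ice"], "temperature"),
]

CANDIDATES = {"majority": majority_sense, "overlap": overlap_classifier}
```

Calling `select_best(CANDIDATES, TOY_INSTANCES)` once per ambiguous term yields a separate choice for each term. In practice the candidate set would span the algorithm, feature representation, and window size combinations of interest.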

Footnotes

  • Supported in part by NLM grant LM06274 and NSF grant NSF 0312250.

The Journal of the American Medical Informatics Association is published for the American Medical Informatics Association by BMJ Publishing Group Ltd.