J Am Med Inform Assoc 2008;15:36-39 doi:10.1197/jamia.M2442
  • Focus On Medical Record Identification of Smoking Status
  • Case Report

Identifying Smokers with a Medical Extraction System

  1. Cheryl Clark (a),
  2. Kathleen Good (b),
  3. Lesley Jezierny (b),
  4. Melissa Macpherson (b),
  5. Brian Wilson (b),
  6. Urszula Chajewska (b)
  1. (a) The MITRE Corporation, Bedford, MA
  2. (b) Dictaphone Healthcare Solutions, Nuance Communications, Inc., Burlington, MA
  1. Correspondence: Cheryl Clark, The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730; e-mail: <cclark{at}mitre.org>
  • Received 16 March 2007
  • Accepted 3 October 2007

Abstract

The Clinical Language Understanding group at Nuance Communications has developed a medical information extraction system that combines a rule-based extraction engine with machine learning algorithms to identify and categorize references to patient smoking in clinical reports. The extraction engine first identifies smoking references; documents that contain none are classified as UNKNOWN. For the remaining documents, the engine uses linguistic analysis to associate features such as status and time with each smoking mention, and a machine learning model then classifies the documents based on these features. This approach achieves overall accuracy in the 90% range on all data sets used. Classification using both engine-generated and word-based features outperforms classification using word-based features alone on every data set, although the gap narrows as data set size increases. These techniques could be applied to identify other risk factors, such as drug and alcohol use or a family history of a disease.
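The two-stage flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the cue patterns, the feature names (`negated`, `past`), the class labels other than UNKNOWN, and the hand-written `toy_classifier` that stands in for the learned model are all assumptions made for illustration only.

```python
import re

# Hypothetical cue patterns; the actual extraction rules are not given in the abstract.
SMOKING = re.compile(r"\b(smok\w*|tobacco|cigarettes?)\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(no|not|never|denies|denied|non)\b", re.IGNORECASE)
PAST = re.compile(r"\b(quit|former|stopped|ex|ago)\b", re.IGNORECASE)


def extract_features(document):
    """Stage 1: return one feature dict per sentence that mentions smoking."""
    features = []
    for sentence in re.split(r"(?<=[.!?])\s+", document):
        if SMOKING.search(sentence):
            features.append({
                "negated": bool(NEGATION.search(sentence)),  # crude status cue
                "past": bool(PAST.search(sentence)),         # crude time cue
            })
    return features


def toy_classifier(features):
    """Stage 2 stand-in: a hand-written vote, NOT a trained model.

    The labels returned here are assumed for illustration.
    """
    if any(f["past"] for f in features):
        return "PAST SMOKER"
    if all(f["negated"] for f in features):
        return "NON-SMOKER"
    return "SMOKER"


def classify_document(document, classifier=toy_classifier):
    """Documents with no smoking mention are UNKNOWN; the rest go to the classifier."""
    features = extract_features(document)
    if not features:
        return "UNKNOWN"
    return classifier(features)


if __name__ == "__main__":
    for text in ["Patient has hypertension.", "He quit smoking 10 years ago."]:
        print(text, "->", classify_document(text))
```

In a real system the `classifier` argument would be a model trained on the engine-generated and word-based features; it is kept as a plain callable here so the UNKNOWN short-circuit stays separate from the learned step.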

Footnotes

  • a A description of related work is included in the full paper published as the online data supplement at www.jamia.org.

  • b Details of Smoking Challenge results are included in the full paper published as the online supplement at www.jamia.org.

  • c The learning models for the challenge and our own experiments were created using the Waikato Environment for Knowledge Analysis (WEKA) system [10].

  • d Please see Table 1, available in the full paper published as a JAMIA online data supplement at www.jamia.org.

  • e A discussion of per-class accuracy and class confusions is available in the full paper published as the online data supplement at www.jamia.org.
