Identifying Patient Smoking Status from Medical Discharge Records
- Özlem Uzuner (a, b)
- Ira Goldstein (a)
- Yuan Luo (a)
- Isaac Kohane (c)
- (a) University at Albany, State University of New York, Albany, NY
- (b) Massachusetts Institute of Technology, Boston, MA
- (c) Children’s Hospital and Harvard Medical School, Boston, MA
- Correspondence: Özlem Uzuner, PhD, University at Albany, SUNY, Draper 114A, 135 Western Avenue, Albany, NY 12222; e-mail: ouzuner{at}albany.edu
- Received 21 February 2007
- Accepted 30 June 2007
Introduction
Clinical narrative records contain much useful information. However, most clinical narratives are in the form of fragmented English free text, showing the characteristics of a clinical sublanguage. This makes their linguistic processing, search, and retrieval challenging [1]. Traditional natural language processing (NLP) tools are not designed for the fragmented free text found in narrative clinical records; therefore, they do not perform well on this type of data [2]. Limited access to clinical records has been a barrier to the widespread development of medical language processing (MLP) technologies. In the absence of a standardized, publicly available ground truth that encourages the development of MLP systems and allows their head-to-head comparison, successful MLP efforts have been limited, e.g., MedLEE [3] and Symtxt [4]. A few MLP systems have been developed [5], and such efforts have successfully shown the usefulness of MLP in clinical settings [6, 7, 8].
To improve the availability of clinical records and to contribute to the advancement of the state of the art in MLP, within the i2b2 (Informatics for Integrating Biology to the Bedside) project, the authors de-identified and released a set of clinical records from Partners HealthCare. These records provided the basis for the development of ground truth for two challenge questions:
1. Automatic de-identification of clinical data, i.e., the de-identification challenge.
2. Automatic evaluation of the smoking status of patients based on medical records, i.e., the smoking challenge.
Representative teams from the MLP community participated in the two challenges and met at a workshop organized by the authors to discuss the results of the challenges. The workshop was co-sponsored by the American Medical Informatics Association and met in conjunction with its Fall Symposium in November 2006. This article provides an overview of the smoking challenge and the findings of the workshop. An overview of the de-identification challenge can …









