Evaluating the State-of-the-Art in Automatic De-identification
- Affiliations of the authors: University at Albany, SUNY (OU, YL), Albany, NY; MIT CSAIL (PS), Cambridge, MA
- Correspondence and reprints: Özlem Uzuner, PhD, University at Albany, SUNY, Draper 114A, 135 Western Ave., Albany, NY 12222; email: <ouzuner{at}albany.edu>
- Received 19 March 2007
- Accepted 15 June 2007
Abstract
To facilitate and survey studies in automatic de-identification, the authors organized a Natural Language Processing (NLP) challenge on automatically removing private health information (PHI) from medical discharge records, as a part of the i2b2 (Informatics for Integrating Biology to the Bedside) project. This manuscript provides an overview of this de-identification challenge, describes the data and the annotation process, explains the evaluation metrics, discusses the nature of the systems that addressed the challenge, analyzes the results of received system runs, and identifies directions for future research. The de-identification challenge data consisted of discharge summaries drawn from the Partners Healthcare system. The authors prepared this data for the challenge by replacing authentic PHI with synthesized surrogates. To focus the challenge on non-dictionary-based de-identification methods, the data was enriched with out-of-vocabulary PHI surrogates, i.e., made-up names. The data also included some PHI surrogates that were ambiguous with medical non-PHI terms. A total of seven teams participated in the challenge. Each team submitted up to three system runs, for a total of sixteen submissions. The authors used precision, recall, and F-measure to evaluate the submitted system runs based on their token-level and instance-level performance on the ground truth. The systems with the best performance scored above 98% in F-measure for all categories of PHI. Most out-of-vocabulary PHI could be identified accurately. However, identifying ambiguous PHI proved challenging. The performance of systems on the test data set is encouraging. Future evaluations of these systems will involve larger data sets from more heterogeneous sources.
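As a rough illustration of the token-level metrics named in the abstract, the sketch below computes precision, recall, and F-measure over PHI tokens. It is a minimal sketch under stated assumptions, not the challenge's scoring script: the function name `precision_recall_f`, the `(document id, token index)` keys, the category labels, and the rule that a token counts as correct only when its category also matches are all illustrative choices.

```python
from typing import Dict, Tuple

# Assumption: a token is identified by (document id, token index),
# and the mapped value is its PHI category label.
Token = Tuple[str, int]


def precision_recall_f(
    gold: Dict[Token, str],
    predicted: Dict[Token, str],
    beta: float = 1.0,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) over PHI tokens.

    A predicted token is counted as correct only if the gold standard
    marks the same token with the same category (an assumption here).
    """
    true_positives = sum(
        1 for token, label in predicted.items() if gold.get(token) == label
    )
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Hypothetical toy data; the labels are placeholders.
gold = {
    ("doc1", 3): "PATIENT",
    ("doc1", 4): "PATIENT",
    ("doc1", 10): "DATE",
    ("doc2", 7): "HOSPITAL",
}
pred = {
    ("doc1", 3): "PATIENT",   # correct
    ("doc1", 10): "DATE",     # correct
    ("doc2", 7): "DOCTOR",    # right token, wrong category
    ("doc2", 9): "DATE",      # not PHI in the gold standard
}
p, r, f = precision_recall_f(gold, pred)
```

An instance-level variant would use whole PHI spans, rather than single tokens, as the dictionary keys.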
Footnotes
- This work was supported in part by the National Institutes of Health through research grants 1 RO1 EB001659 from the National Institute of Biomedical Imaging and Bioengineering and through the NIH Roadmap for Medical Research, Grant U54LM008748. IRB approval has been granted for the studies presented in this manuscript. The authors thank all participating teams for their contribution to the challenge. The authors also thank AMIA for their support in the organization of the workshop that accompanied this challenge. We thank Stephen deLong for his feedback on this article.
- b. For each unit, flip a coin to decide whether to exchange system responses for it.
- c. In this paper, we refer to systems with the last name of the first author and a submission id.
- d. Throughout this paper, we describe regular expressions using Perl syntax.
- e. Rules are captured by regular expression templates. In this manuscript, we use the terms “rules,” “regular expression templates,” and “format templates” interchangeably.
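The coin-flip step in footnote b is the shuffling step of a randomization-style significance test. The sketch below shows one common way to build such a test around that step, offered as an illustration only. The function name `approximate_randomization`, the choice of unit, the `statistic` callable, the number of trials, and the smoothed p-value estimate are assumptions that do not come from this page.

```python
import random
from statistics import mean
from typing import Callable, Sequence


def approximate_randomization(
    responses_a: Sequence[float],
    responses_b: Sequence[float],
    statistic: Callable[[Sequence[float]], float],
    trials: int = 1000,
    seed: int = 0,
) -> float:
    """Estimate a p-value for the difference between two systems.

    responses_a[i] and responses_b[i] are the two systems' responses
    (here, per-unit scores) for the same unit i.
    """
    rng = random.Random(seed)
    observed = abs(statistic(responses_a) - statistic(responses_b))
    at_least_as_large = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for unit_a, unit_b in zip(responses_a, responses_b):
            # Footnote b: flip a coin to decide whether to exchange
            # the two systems' responses for this unit.
            if rng.random() < 0.5:
                unit_a, unit_b = unit_b, unit_a
            shuffled_a.append(unit_a)
            shuffled_b.append(unit_b)
        difference = abs(statistic(shuffled_a) - statistic(shuffled_b))
        if difference >= observed:
            at_least_as_large += 1
    # Assumption: add-one smoothing of the estimated p-value.
    return (at_least_as_large + 1) / (trials + 1)


# Hypothetical per-unit scores for two systems.
same = approximate_randomization([1, 0, 1], [1, 0, 1], mean)
different = approximate_randomization([1.0] * 20, [0.0] * 20, mean)
```

With identical responses, every shuffle reproduces the observed difference of zero, so the estimate is 1.0. With consistently different responses, very few shuffles match the observed difference, so the estimate is small.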









