Rapidly Retargetable Approaches to De-identification in Medical Records
- Ben Wellner,
- Matt Huyck,
- Scott Mardis,
- John Aberdeen,
- Alex Morgan,
- Leonid Peshkin,
- Alex Yeh,
- Janet Hitzeman,
- Lynette Hirschman
- Affiliations of the authors: The MITRE Corporation (BW, SM, JA, AM, AY, JH, LH), Bedford, MA; Center for Biomedical Informatics, Harvard Medical School (MH, LP), Boston, MA; Department of Computer Science, Brandeis University (BW), Waltham, MA; Stanford Biomedical Informatics (AM), Palo Alto, CA
- Correspondence and reprints: John Aberdeen, 202 Burlington Road, Bedford, MA 01730; e-mail: <aberdeen{at}mitre.org>
- Received 13 March 2007
- Accepted 11 June 2007
Abstract
Objective This paper describes a successful approach to de-identification that was developed for a recent AMIA-sponsored challenge evaluation.
Method Our approach focused on rapidly adapting two existing named entity recognition toolkits, Carafe and LingPipe, to the de-identification task.
Results The “out of the box” Carafe system achieved a very good score (phrase F-measure of 0.9664) with only four hours of work to adapt it to the de-identification task. With further tuning, we were able to reduce the token-level error term by over 36% through task-specific feature engineering and the introduction of a lexicon, achieving a phrase F-measure of 0.9736.
Conclusions We were able to achieve good performance on the de-identification task by the rapid retargeting of existing toolkits. For the Carafe system, we developed a method for tuning the balance of recall vs. precision, as well as a confidence score that correlated well with the measured F-score.
Footnotes
- a. The gradient of the log-likelihood is a vector where each component, corresponding to a particular feature, is the difference between the observed frequency of that feature in the training data and the expected frequency of that feature according to the current model (i.e., the current set of weights). These feature expectations can be computed efficiently using the forward-backward algorithm. See Sha and Pereira, 2003 (reference 8) for details.
- b. We carried out a number of experiments with different Gaussian prior values and noticed remarkably little difference in the results with different values on these data.
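The gradient described in footnote a can be written out as a short worked equation. This is a sketch in notation assumed here for illustration, not taken from the paper: $\lambda_k$ is the weight of feature $f_k$, $(x^{(i)}, y^{(i)})$ is the $i$-th of $N$ training sequences, and $p_\lambda$ is the model under the current weights. The final term assumes the usual zero-mean Gaussian prior with variance $\sigma^2$, which is the quantity varied in footnote b; the paper's exact parameterization is not stated in this section.

```latex
\frac{\partial \mathcal{L}}{\partial \lambda_k}
  = \underbrace{\sum_{i=1}^{N} \sum_{t} f_k\bigl(y^{(i)}_{t-1},\, y^{(i)}_{t},\, x^{(i)},\, t\bigr)}_{\text{observed count of } f_k}
  \;-\;
  \underbrace{\sum_{i=1}^{N} \sum_{t} \sum_{y',\,y}
      p_\lambda\bigl(y_{t-1}=y',\, y_t=y \mid x^{(i)}\bigr)\,
      f_k\bigl(y',\, y,\, x^{(i)},\, t\bigr)}_{\text{expected count of } f_k \text{ under the current model}}
  \;-\; \frac{\lambda_k}{\sigma^2}
```

The pairwise marginals $p_\lambda(y_{t-1}=y', y_t=y \mid x^{(i)})$ in the second term are the quantities that the forward-backward algorithm supplies. At the optimum of the unregularized objective the first two terms cancel, so the model's expected feature counts match the counts observed in the training data.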









