State-of-the-art Anonymization of Medical Records Using an Iterative Machine Learning Framework
- Affiliations of the authors: Department of Informatics, University of Szeged (GS); Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged (RF, RB-F), Szeged, Hungary
- Correspondence and reprints: György Szarvas, University of Szeged, Department of Informatics, 6720, Szeged, Árpád tér 2., Hungary; e-mail: <szarvas{at}inf.u-szeged.hu>
- Received 16 March 2007
- Accepted 11 June 2007
Abstract
Objective The anonymization of medical records is of great importance in the human life sciences because de-identified text can be made publicly available to researchers outside the hospital, facilitating research on human diseases. The authors developed a de-identification model that removes personal health information (PHI) from discharge records so that they conform to the guidelines of the Health Insurance Portability and Accountability Act.
Design We introduce a novel, machine learning-based, iterative Named Entity Recognition approach intended for use on semi-structured documents such as discharge records. Our method identifies PHI in several steps. First, it labels all entities whose tags can be inferred from the structure of the text; it then uses this information to find further PHI phrases in the running-text parts of the document.
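The two-step idea described above can be illustrated with a minimal sketch. It is not the authors' model: the learned classifier is replaced here by a regular expression and a plain dictionary lookup, and the header field names (`PATIENT`, `ATTENDING`), their label mapping, and the function names are hypothetical choices made only for illustration.

```python
import re

# Hypothetical mapping from structured header fields to PHI labels.
# Real discharge records use different fields; this is an assumption.
HEADER_FIELDS = {"PATIENT": "PATIENT", "ATTENDING": "DOCTOR"}

HEADER_LINE = re.compile(r"^([A-Z ]+):\s*(.+)$")


def first_pass(lines):
    """Step 1: label values whose tag follows from the document structure."""
    found = {}
    for line in lines:
        match = HEADER_LINE.match(line)
        if match and match.group(1).strip() in HEADER_FIELDS:
            phrase = match.group(2).strip()
            found[phrase] = HEADER_FIELDS[match.group(1).strip()]
    return found


def second_pass(lines, found):
    """Step 2: reuse the phrases from step 1 to tag mentions in any line.

    Returns (line_index, start, end, label) tuples.
    """
    hits = []
    for index, line in enumerate(lines):
        for phrase, label in found.items():
            for match in re.finditer(re.escape(phrase), line):
                hits.append((index, match.start(), match.end(), label))
    return hits
```

Calling `second_pass(lines, first_pass(lines))` on a record returns the character spans of every phrase first seen in a structured header, including its later mentions in the running text. In the approach described above this information feeds a machine learning model; the exact string matching used here is only a stand-in.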
Measurements Following the standard evaluation method of the first Workshop on Challenges in Natural Language Processing for Clinical Data, we used token-level precision, recall, and F measure (β = 1) for evaluation.
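As a rough illustration of these metrics, the sketch below computes token-level precision, recall, and F measure from two aligned label sequences. It assumes a single micro-averaged score over all PHI categories and an `"O"` label for non-PHI tokens; the challenge's official scoring may differ in such details, and the function name `token_prf` is hypothetical.

```python
def token_prf(gold, pred, outside="O", beta=1.0):
    """Token-level precision, recall and F-beta over PHI labels.

    gold and pred are equal-length lists holding one label per token;
    tokens labelled `outside` are treated as non-PHI (an assumption).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # A token is a true positive if it is predicted as PHI with the gold label.
    true_pos = sum(1 for g, p in zip(gold, pred) if p != outside and p == g)
    predicted = sum(1 for p in pred if p != outside)
    actual = sum(1 for g in gold if g != outside)

    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0

    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score
```

With β = 1 the F measure reduces to the harmonic mean of precision and recall, 2PR / (P + R).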
Results On the standard evaluation dataset of the de-identification challenge, our best submitted model achieved an F measure of 99.7534%.
Conclusion Our system is competitive with the current state-of-the-art solutions, and we describe several techniques that can be beneficial in other tasks that need to handle structured documents such as clinical records.
Footnotes
- Supported in part by the Computer and Automation Research Institute of the Hungarian Academy of Sciences and by NKFP-2/051/2004.
- The authors thank the task organizers for organizing the challenge and for their help, and the anonymous reviewers for their valuable comments.
- a The best performing system at the workshop (Wellner et al., 2006) [12] was also an adaptation of an existing NER model to clinical data. Our system came second, with the difference in performance between the two systems below the level of significance. These facts show the feasibility of adapting an NER system to anonymization.
- b The system of Guo et al. (2006) [14] used only a subset of the available training data because of the higher time complexity of SVMs.
- c This model was a similar boosted decision tree classifier, but without regular expression features, document heading information, or the iterative learning process.
- d Personal health information had to be concealed from the challenge participants. To achieve this, the organizers removed all PHI from the corpus and replaced it with artificially generated, realistic substitutes. For more information, see Uzuner et al. (2007) [1].
- e ITR2_VOTE is an out-of-competition result, as we did not have time to prepare all three second-iteration systems during the evaluation period of the competition. The differences between the three best systems are only marginal, however.
- f The higher values in the first column of Table 2 indicate that our model achieves better precision than recall.









