A Globally Optimal k-Anonymity Method for the De-Identification of Health Data
- Khaled El Emam, PhD (a, b)
- Fida Kamal Dankar, PhD (a)
- Romeo Issa, MS (d)
- Elizabeth Jonker, BA (a)
- Daniel Amyot, PhD (c)
- Elise Cogo, ND (a)
- Jean-Pierre Corriveau, PhD (d)
- Mark Walker, MS, MD (e)
- Sadrul Chowdhury, MS (c)
- Regis Vaillancourt, BPharm, PharmD (a)
- Tyson Roffey, BA (a)
- Jim Bottomley, BScH, MHA (a)
- (a) Children's Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada
- (b) Pediatrics, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
- (c) School of Information Technology and Engineering, University of Ottawa, Ottawa, Ontario, Canada
- (d) School of Computer Science, Carleton University, Ottawa, Ontario, Canada
- (e) Ottawa Hospital Research Institute, Ottawa, Ontario, Canada
- Correspondence: Khaled El Emam, CHEO Research Institute, 401 Smyth Road, Ottawa, ON K1H 8L1, Canada (Email: kelemam{at}uottawa.ca).
- Received 19 January 2009
- Accepted 2 June 2009
Abstract
Background Explicit patient consent requirements in privacy laws can have a negative impact on health research, leading to selection bias and reduced recruitment. Legislative requirements to obtain consent are often waived if the information collected or disclosed is de-identified.
Objective The authors developed and empirically evaluated a new globally optimal de-identification algorithm that satisfies the k-anonymity criterion and that is suitable for health datasets.
Design The authors compared OLA (Optimal Lattice Anonymization) empirically with three existing k-anonymity algorithms (Datafly, Samarati, and Incognito) on six public, hospital, and registry datasets for different values of k and suppression limits.
Measurement Three information loss metrics were used for the comparison: precision, discernability metric, and non-uniform entropy. The execution speed of each algorithm was also evaluated.
Results The Datafly and Samarati algorithms had higher information loss than OLA and Incognito; OLA was consistently faster than Incognito in finding the globally optimal de-identification solution.
Conclusions For the de-identification of health datasets, OLA is an improvement on existing k-anonymity algorithms in terms of information loss and performance.
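For readers unfamiliar with the k-anonymity criterion named in the Objective, the following is a minimal illustrative sketch in Python and not the OLA algorithm itself. It assumes the standard definition: every combination of quasi-identifier values must be shared by at least k records. The function names, the field names (`age`, `sex`, `diagnosis`), and the toy records are hypothetical and chosen only for illustration.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every equivalence class contains at least k records."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(n >= k for n in sizes.values())


def records_to_suppress(records, quasi_identifiers, k):
    """Count the records in classes smaller than k (candidates for suppression)."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return sum(n for n in sizes.values() if n < k)


# Hypothetical toy dataset: age is already generalized into 10-year bands.
records = [
    {"age": "30-39", "sex": "F", "diagnosis": "asthma"},
    {"age": "30-39", "sex": "F", "diagnosis": "flu"},
    {"age": "40-49", "sex": "M", "diagnosis": "asthma"},
    {"age": "40-49", "sex": "M", "diagnosis": "diabetes"},
    {"age": "50-59", "sex": "F", "diagnosis": "flu"},
]
quasi_identifiers = ["age", "sex"]

# The single record in the ("50-59", "F") class violates 2-anonymity.
violates = not is_k_anonymous(records, quasi_identifiers, k=2)
to_suppress = records_to_suppress(records, quasi_identifiers, k=2)
```

In this sketch, a suppression limit such as those varied in the Design would be compared against `to_suppress / len(records)` to decide whether suppressing the small classes is acceptable or whether further generalization is needed.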









