An Evaluation of the Current State of Genomic Data Privacy Protection Technology and a Roadmap for the Future
- Affiliation of the author: Data Privacy Laboratory, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
- Correspondence and reprints: Bradley Malin, MS, MPhil, Carnegie Mellon University, School of Computer Science, Institute for Software Research International, Wean Hall Room 1320 B, Pittsburgh, PA 15213-3890; e-mail: <malin{at}cs.cmu.edu>
- Received 9 April 2004
- Accepted 21 August 2004
Abstract
The incorporation of genomic data into personal medical records poses many challenges to patient privacy. In response, various systems for preserving patient privacy in shared genomic data have been developed and deployed. Although these systems de-identify the data by removing explicit identifiers (e.g., name, address, or Social Security number) and incorporate sound security design principles, they suffer from a lack of formal modeling of inferences learnable from shared data. This report evaluates the extent to which current protection systems are capable of withstanding a range of re-identification methods, including genotype–phenotype inferences, location–visit patterns, family structures, and dictionary attacks. For a comparative re-identification analysis, the systems are mapped to a common formalism. Although there is variation in susceptibility, each system is deficient in its protection capacity. The author discovers patterns of protection failure and discusses several of the reasons why these systems are susceptible. The analyses and discussion within provide guideposts for the development of next-generation protection methods amenable to formal proofs.
Footnotes
- Supported by a National Science Foundation IGERT Program grant and the Data Privacy Laboratory in the Institute for Software Research International, a department in the School of Computer Science at Carnegie Mellon University. The opinions expressed in this research are solely those of the author and do not necessarily reflect those of the National Science Foundation.
- The author thanks Alessandro Acquisti, Michael Shamos, Latanya Sweeney, Jean Wylie, and the members of the Data Privacy Laboratory at Carnegie Mellon University for their insightful comments and discussion.
- * For example, the PopSet database at the National Center for Biotechnology Information contains publicly available DNA sequence data, which are not subject to oversight by an Institutional Review Board.
- † The reader is directed toward references 9–12 for examples of computational data privacy in the biomedical community.
- ‡ Details on the second protocol and its mapping to this paper's formalism are available in reference 19.
- § The attribute set Additional Demographic Features corresponds to demographic attributes deemed useful by deCode.
- ∥ Details of the combinatorics for more complex combinations of family-disease structures are provided in reference 15.
- ¶ At the time of writing, the website http://www.rat.de/kuijsten/navigator/ provided links to a number of genealogical resources.








