JAMIA 2009;16:256-266 doi:10.1197/jamia.M2902
  • Original Investigation
  • Model Formulation

Evaluating Predictors of Geographic Area Population Size Cut-offs to Manage Re-identification Risk

  1. Khaled El Emam (a, b)
  2. Ann Brown (a)
  3. Philip AbdelMalik (c)

  a. Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada
  b. Pediatrics, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada
  c. GIS Infrastructure, Office of Public Health Practice, Public Health Agency of Canada, Ottawa, ON, Canada

  • Correspondence: Khaled El Emam, CHEO Research Institute, 401 Smyth Road, Ottawa, ON K1H 8L1, Canada; e-mail: <kelemam{at}uottawa.ca>
  • Received 18 June 2008
  • Accepted 30 November 2008

Abstract

Objective In public health and health services research, the inclusion of geographic information in data sets is critical. Because of concerns over the re-identification of patients, data from small geographic areas are either suppressed or the geographic areas are aggregated into larger ones. Our objective is to estimate the population size cut-off at which a geographic area is sufficiently large so that no data suppression or further aggregation is necessary.

Design The 2001 Canadian census data were used to conduct a simulation to model the relationship between geographic area population size and uniqueness for some common demographic variables. Cut-offs were computed for geographic area population size, and prediction models were developed to estimate the appropriate cut-offs.

Measurements Re-identification risk was measured using uniqueness. Geographic area population size cut-offs were estimated using the maximum number of possible values in the data set and a traditional entropy measure.
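
As an illustration of the quantities named above, the following minimal sketch computes uniqueness, the maximum number of possible values, and a Shannon entropy for one geographic area. The abstract does not give the exact definitions used in the study, so the function names (`uniqueness`, `max_combinations`, `shannon_entropy`), the toy records, and the choice of Shannon entropy in bits are assumptions made for illustration only.

```python
from collections import Counter
from math import log2, prod


def uniqueness(records):
    """Fraction of records whose quasi-identifier combination occurs exactly once.

    Assumed definition: a record is 'unique' if no other record in the same
    geographic area shares all of its demographic values.
    """
    if not records:
        return 0.0
    counts = Counter(records)
    unique = sum(1 for r in records if counts[r] == 1)
    return unique / len(records)


def max_combinations(domain_sizes):
    """Maximum number of possible value combinations, taken here as the
    product of the number of categories of each variable (an assumption)."""
    return prod(domain_sizes)


def shannon_entropy(records):
    """Shannon entropy (in bits) of the observed value combinations."""
    counts = Counter(records)
    n = len(records)
    return -sum((c / n) * log2(c / n) for c in counts.values())


# Hypothetical area with four residents described by (sex, year of birth).
area = [("F", 1970), ("F", 1970), ("M", 1970), ("M", 1985)]

print(uniqueness(area))          # two of four records are unique
print(max_combinations([2, 3]))  # e.g. 2 sexes x 3 birth-year bands
print(shannon_entropy(area))
```

Under these assumptions, an area would be considered large enough once its uniqueness falls below whatever risk threshold is chosen; that threshold is not specified in the abstract.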

Results The model that predicted population cut-offs using the maximum number of possible values in the data set had R² values around 0.9, and relative error of prediction less than 0.02 across all regions of Canada. The models were then applied to assess the appropriate geographic area size for the prescription records provided by retail and hospital pharmacies to commercial research and analysis firms.

Conclusions To manage re-identification risk, the prediction models can be used by public health professionals, health researchers, and research ethics boards to decide when the geographic area population size is sufficiently large.

Footnotes

  • This work was approved by the research ethics board of The Children's Hospital of Eastern Ontario Research Institute.

The Journal of the American Medical Informatics Association is published for the American Medical Informatics Association by BMJ Publishing Group Ltd.