Protecting Privacy Using k-Anonymity
 ^{a}Children's Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada
 ^{b}Pediatrics, Faculty of Medicine, University of Ottawa, Ottawa, Ontario, Canada
 Correspondence: Khaled El Emam, Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, Ontario K1J 8L1, Canada (Email: <kelemam{at}uottawa.ca>)
 Received 9 January 2008
 Accepted 21 May 2008
Abstract
Objective There is increasing pressure to share health information and even make it publicly available. However, such disclosures of personal health information raise serious privacy concerns. To alleviate such concerns, it is possible to anonymize the data before disclosure. One popular anonymization approach is k-anonymity. There have been no evaluations of the actual reidentification probability of k-anonymized data sets.
Design Through a simulation, we evaluated the reidentification risk of k-anonymization and three different improvements on three large data sets.
Measurement Reidentification probability is measured under two different reidentification scenarios. Information loss is measured by the commonly used discernability metric.
Results For one of the reidentification scenarios, k-anonymity consistently over-anonymizes data sets, with this over-anonymization being most pronounced with small sampling fractions. Over-anonymization results in excessive distortions to the data (i.e., high information loss), making the data less useful for subsequent analysis. We found that a hypothesis testing approach provided the best control over reidentification risk and reduced the extent of information loss compared to baseline k-anonymity.
Conclusion Guidelines are provided on when to use the hypothesis testing approach instead of baseline k-anonymity.
Introduction
The sharing of raw research data is believed to have many benefits, including making it easier for the research community to confirm published results, ensuring the availability of original data for meta-analysis, facilitating additional innovative analysis on the same data sets, getting feedback to improve data quality for ongoing data collection efforts, achieving cost savings from not having to collect the same data multiple times by different research groups, minimizing the need for research participants to provide data repeatedly, facilitating linkage of research data sets with administrative records, and making data available for instruction and education.1 2 3 4 5 6 7 8 9 10 11 12 13 14 Consequently, there are pressures to make such research data more generally available.8 15 16 For example, in January 2004 Canada was a signatory to the OECD Declaration on Access to Research Data from Public Funding.17 This is intended to ensure that data generated through public funds are publicly accessible for researchers as much as possible.18 To the extent that this is implemented, potentially more personal health data about Canadians will be made available to researchers worldwide. The European Commission has passed a regulation facilitating the sharing with external researchers of data collected by Community government agencies.19 There is interest from the pharmaceutical industry and academia in sharing raw data from clinical trials.16 20
Researchers in the future may have to disclose their data. The Canadian Medical Association Journal has recently contemplated requiring authors to make the full data set from their published studies available publicly online.3 Similar calls for depositing raw data with published manuscripts have been made recently.2 5 7 20 21 22 The Canadian Institutes of Health Research (CIHR) has a policy, effective on 1^{st} January 2008, that requires making some data available with publications.23 The UK MRC policy on data sharing sets the expectation that data from their funded projects will be made publicly available.24 The UK Economic and Social Research Council requires its funded projects to deposit data sets in the UK Data Archive (such projects generate health and lifestyle data on, for example, diet, reproduction, pain, and mental health).25 The European Research Council considers it essential that raw data be made available preferably immediately after publication, but not later than six months after publication.26 The NIH in the US expects investigators seeking more than $500,000 per year in funding to include a data sharing plan (or explain why that is not possible).27 Courts, in criminal and civil cases, may compel disclosure of research data.11 28
Such broad disclosures of health data pose significant privacy risks.38 The risks are real, as demonstrated by recent successful reidentifications of individuals in publicly disclosed data sets (see the examples in Table 1). One approach for protecting the identity of individuals when releasing or sharing sensitive health data is to anonymize it.19
A popular approach for data anonymization is k-anonymity.39 40 41 42 With k-anonymity an original data set containing personal health information can be transformed so that it is difficult for an intruder to determine the identity of the individuals in that data set. A k-anonymized data set has the property that each record is similar to at least k − 1 other records on the potentially identifying variables. For example, if k = 5 and the potentially identifying variables are age and gender, then a k-anonymized data set has at least 5 records for each value combination of age and gender that appears in it. The most common implementations of k-anonymity use transformation techniques such as generalization, global recoding, and suppression.39 40 42 43 44 45
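The k-anonymity property described above can be checked mechanically by counting equivalence class sizes. The following is a minimal sketch only; the function names and the toy records are hypothetical and are not taken from the study.

```python
from collections import Counter

def equivalence_class_sizes(records, quasi_identifiers):
    """Count the records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class present in the data has at least k records."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(n >= k for n in sizes.values())

# Hypothetical toy data: one class of 5 records and one class of 6 records.
records = (
    [{"age": "50-59", "gender": "M"}] * 5
    + [{"age": "40-49", "gender": "F"}] * 6
)

print(is_k_anonymous(records, ["age", "gender"], k=5))  # True
print(is_k_anonymous(records, ["age", "gender"], k=6))  # False
```

The smallest class determines the outcome, which is why the second call fails: the class of 5 records is below k = 6.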
Any record in a k-anonymized data set has a maximum probability 1/k of being reidentified.44 In practice, a data custodian would select a value of k commensurate with the reidentification probability they are willing to tolerate, known as the threshold risk. Higher values of k imply a lower probability of reidentification, but also more distortion to the data, and hence greater information loss due to k-anonymization. In general, excessive anonymization can make the disclosed data less useful to the recipients because some analysis becomes impossible or the analysis produces biased and incorrect results.46 47 48 49 50 51
Thus far there has been no empirical examination of how close the actual reidentification probability is to this maximum. Ideally, the actual reidentification probability of a k-anonymized data set would be close to 1/k since that balances the data custodian's risk tolerance with the extent of distortion that is introduced due to k-anonymization. However, if the actual probability is much lower than 1/k then k-anonymity may be overprotective, and hence results in unnecessarily excessive distortions to the data.
In this paper we make explicit the two reidentification scenarios that k-anonymity protects against, and show that the actual probability of reidentification with k-anonymity is much lower than 1/k for one of these scenarios, resulting in excessive information loss. To address that problem, we evaluate three different modifications to k-anonymity and identify one that ensures that the actual risk is close to the threshold risk and that also reduces information loss considerably. The paper concludes with guidelines for deciding when to use the baseline versus the modified k-anonymity procedure. Following these guidelines will ensure that reidentification risk is controlled with minimal information loss when using k-anonymity.
Background
The Two Reidentification Scenarios for a k-Anonymized Data Set
The concern of k-anonymity is with the reidentification of a single individual in an anonymized data set.44 There are two reidentification scenarios for a single individual:52 53 54

Reidentify a specific individual (known as the prosecutor reidentification scenario). The intruder (e.g., a prosecutor) would know that a particular individual (e.g., a defendant) exists in an anonymized database and wishes to find out which record belongs to that individual.

Reidentify an arbitrary individual (known as the journalist reidentification scenario). The intruder does not care which individual is being reidentified, but is only interested in being able to claim that it can be done. In this case the intruder wishes to reidentify a single individual to discredit the organization disclosing the data.
Reidentification Risk under the Prosecutor Scenario
The set of patients in the file to be disclosed is denoted by s. Before the file about s can be disclosed, it must be anonymized. Some of the records in the file will be suppressed during anonymization, so a different subset of patients, s′, will be represented in the anonymized version of this file. Let the anonymized file be denoted by ζ. There is a one-to-one mapping between the records in ζ and the individuals in s′.
Under the prosecutor scenario, a specific individual is being reidentified, say, a VIP. The intruder will match the VIP with the records in ζ on quasi-identifiers. Variables such as gender, date of birth, postal code, and race are commonly used quasi-identifiers. Records in ζ that have the same values on the quasi-identifiers are called an equivalence class.55
Let the number of records in ζ that have exactly the same quasi-identifier values as the VIP be f. The reidentification risk for the VIP is then 1/f. For example, if the individual being reidentified is a 50 year old male, then f is the number of records on 50 year old males in ζ, and the intruder has a probability 1/f of getting a correct match.
Since the data custodian does not know, a priori, which equivalence class a VIP will match against, one can assume the worst-case scenario. Under the worst-case scenario, the intruder will have a VIP who matches with the smallest equivalence class in ζ, which in a k-anonymized data set will have a size of at least k. Hence the reidentification probability will be at most 1/k.
Therefore, under the prosecutor reidentification scenario k-anonymity can ensure that the reidentification risk is approximately equal to the threshold risk, as intended by the data custodian. This, however, is not the case under the journalist reidentification scenario.
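To make the worst-case prosecutor calculation concrete, the following sketch computes 1/f for the smallest equivalence class in an anonymized file. The function name and the toy records are hypothetical illustrations, not part of the study.

```python
from collections import Counter

def prosecutor_risk(anonymized_records, quasi_identifiers):
    """Worst-case prosecutor risk: 1 / (size of the smallest equivalence class)."""
    sizes = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in anonymized_records
    )
    return 1.0 / min(sizes.values())

# Hypothetical 5-anonymized file: classes of size 5 and 8.
anonymized = (
    [{"age": "50-59", "gender": "M"}] * 5
    + [{"age": "30-39", "gender": "F"}] * 8
)

risk = prosecutor_risk(anonymized, ["age", "gender"])
print(risk)  # 0.2
```

Because the smallest class has exactly k = 5 records, the worst-case risk equals the threshold 1/k.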
Reidentification Risk under the Journalist Scenario
We assume that there exists a large finite population of patients denoted by the set U. We then have s′ ⊆ s ⊆ U. An intruder would have access to an identification database about the population U, and uses this identification database to match against the patients in ζ. The identification database is denoted by Z, and the records in Z have a one-to-one mapping to the individuals in U.
In the example of Figure 1 we have a data set about 14 individuals that needs to be disclosed. This data set is 2-anonymized to produce the anonymized data set, ζ. After 2-anonymization, there are only 11 records left in ζ since three had to be suppressed.
An intruder gets hold of an identification database with 31 records. This is the Z database. The intruder then attempts reidentification by matching an arbitrary record against the records in ζ on year of birth and gender. In our example, once an arbitrary individual is reidentified, the intruder will have that individual's test result.
The discrete variable formed by cross-classifying all values on the quasi-identifiers in ζ can take on J distinct values. Let X_{ζ,i} denote the value of record i in the ζ data set. For example, if we have two quasi-identifiers, such as gender and age, then X_{ζ,i} might be (male, 50) for one record, a different gender and age combination for another, and so on. Similarly, let X_{Z,i} denote the value of record i in the Z data set.
The sizes of the different equivalence classes are given by f_{j} = Σ_{i ∈ s′} I(X_{ζ,i} = j), for j = 1, …, J, where f_{j} is the size of a ζ equivalence class and I(·) is the indicator function. Similarly, we have F_{j} = Σ_{i ∈ U} I(X_{Z,i} = j), where F_{j} is the size of an equivalence class in Z.
Under the journalist reidentification scenario, the probability of reidentification of a record in an equivalence class j is 1/F_{j}.56 57 However, a smart intruder would focus on the records in equivalence classes with the highest probability of reidentification. Equivalence classes with the smallest value for F_{j} have the highest probability of being reidentified, and therefore we assume that a smart intruder will focus on these. The probability of reidentification of an arbitrary individual by a smart intruder is then given by:
Θ_{max} = max_{j} (1/F_{j}) = 1 / min_{j} F_{j}, where j ranges over the equivalence classes that appear in ζ.
If we consider Figure 1 again, the 2-anonymized file had the age converted into 10 year intervals. In that example we can see that Θ_{max} = 0.25 because the smallest equivalence class in Z has 4 records (ID numbers 1, 4, 12, and 27). With 2-anonymization the data custodian was using a threshold risk of 0.5, but the actual risk of reidentification, Θ_{max}, was half of that. This conservatism may seem like a good idea, but in fact it has a large negative impact on data quality. In our example, 2-anonymization resulted in converting age into ten year intervals and the suppression of more than one fifth of the records that had to be disclosed (3 of 14 records had to be suppressed). By most standards, losing one fifth of a data set due to anonymization would be considered extensive information loss.
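The gap between the threshold risk and the actual journalist risk can be illustrated with a short sketch. It computes the reciprocal of the smallest identification-database class among the classes present in the disclosed file. The records below are hypothetical and only mimic the proportions of the example (they are not the Figure 1 data), and the function and field names are assumptions made for illustration.

```python
from collections import Counter

def journalist_risk(disclosed, identification_db, quasi_identifiers):
    """Return 1 / min F_j over the equivalence classes present in the disclosed file."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    population_sizes = Counter(key(r) for r in identification_db)  # F_j
    classes_in_disclosed = {key(r) for r in disclosed}
    missing = [j for j in classes_in_disclosed if population_sizes[j] == 0]
    if missing:
        # Every disclosed individual is assumed to be in the population.
        raise ValueError(f"classes absent from identification database: {missing}")
    return 1.0 / min(population_sizes[j] for j in classes_in_disclosed)

qi = ["decade_of_birth", "gender"]
# Hypothetical 2-anonymized file: classes of size 2 and 3.
disclosed = (
    [{"decade_of_birth": 1960, "gender": "M"}] * 2
    + [{"decade_of_birth": 1970, "gender": "F"}] * 3
)
# Hypothetical identification database: the same classes have sizes 4 and 10.
identification_db = (
    [{"decade_of_birth": 1960, "gender": "M"}] * 4
    + [{"decade_of_birth": 1970, "gender": "F"}] * 10
)

threshold = 1 / 2
actual = journalist_risk(disclosed, identification_db, qi)
print(threshold, actual)  # 0.5 0.25
```

In this toy case the custodian targeted a risk of 0.5 but the intruder's best chance is 0.25, the same kind of over-protection described in the text.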
Now consider another approach: k-map. With k-map it is assumed that the data custodian can k-anonymize the identification database itself (and hence directly control the F_{j} values). Let's say that the Z identification database is k-anonymized to produce Z′. The k-map property states that each record in ζ is similar to at least k records in Z′.34 41 This is illustrated in Figure 2. Here, the data custodian 2-anonymizes the identification database directly, and then implements the transformations to the data set to be disclosed. In this example Θ_{max} = 0.5 because the smallest equivalence classes in Z′ for records 1 to 14 have two records. Also, the extent of information loss is reduced significantly: there are no records suppressed in the disclosed data set and the age is converted into 5 year intervals rather than 10 year intervals. By using the k-map property we have ensured that the actual reidentification risk is what the data custodian intended and we have simultaneously reduced information loss.
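The difference between the k-map and k-anonymity properties can be sketched as two checks over the same disclosed file. This is an illustrative sketch with hypothetical function names and data; it assumes the custodian has the (transformed) identification database available, which the k-map model requires.

```python
from collections import Counter

def _key(record, quasi_identifiers):
    return tuple(record[q] for q in quasi_identifiers)

def satisfies_k_map(disclosed, identification_db, quasi_identifiers, k):
    """True if every disclosed record matches at least k identification records."""
    population_sizes = Counter(_key(r, quasi_identifiers) for r in identification_db)
    return all(population_sizes[_key(r, quasi_identifiers)] >= k for r in disclosed)

def is_k_anonymous(disclosed, quasi_identifiers, k):
    """True if every equivalence class in the disclosed file has at least k records."""
    sizes = Counter(_key(r, quasi_identifiers) for r in disclosed)
    return all(n >= k for n in sizes.values())

qi = ["age_band", "gender"]
# Hypothetical disclosed file with a class of size 1 (so it is not 2-anonymous).
disclosed = [
    {"age_band": "50-54", "gender": "M"},
    {"age_band": "55-59", "gender": "F"},
    {"age_band": "55-59", "gender": "F"},
]
# Hypothetical transformed identification database: every class has >= 2 records.
identification_db = (
    [{"age_band": "50-54", "gender": "M"}] * 2
    + [{"age_band": "55-59", "gender": "F"}] * 3
)

print(is_k_anonymous(disclosed, qi, k=2))                      # False
print(satisfies_k_map(disclosed, identification_db, qi, k=2))  # True
```

A file can therefore satisfy k-map without suppressing its small classes, which is where the reduction in information loss comes from.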
In practice, the k-map model is not used because it is assumed that the data custodian does not have access to an identification database, but that an intruder does.34 41 Therefore, the k-anonymity model is used instead.
There are good reasons why the data custodian would not have an identification database. Often, a population database is expensive to get hold of. Plus, it is likely that the data custodian will have to protect multiple populations, hence multiplying the expense. For example, the construction of a single profession-specific database using semi-public registries that can be used for reidentification attacks in Canada costs between $150,000 and $188,000.58 Commercial databases can be comparatively costly. Furthermore, an intruder may commit illegal acts to get access to population registries. For example, privacy legislation and the Elections Act in Canada restrict the use of voter lists to running and supporting election activities.58 There is at least one known case where a charity allegedly supporting a terrorist group has been able to obtain Canadian voter lists for fund raising.59 60 61 A legitimate data custodian would not engage in such acts.
However, a number of methods have been developed in the statistical disclosure control literature to estimate the size of the equivalence classes in Z from a sample. If these estimates are accurate, then they can be used to approximate k-map. Approximating k-map will ensure that the actual risk is close to the threshold risk, and consequently that there will be less information loss. Three such methods are considered below.
Proposed Improvements to k-Anonymity under the Journalist Reidentification Scenario
We consider three alternative approaches to reduce the extent of over-anonymization under the journalist reidentification scenario. These three approaches extend k-anonymity to approximate k-map. The details of the proposed approaches are provided in Appendix A (Risk Estimates) (available as a JAMIA online-only data supplement at http://www.jamia.org).
Baseline (D1)
Baseline k-anonymization algorithms apply transformations, such as generalization, global recoding, and suppression, until all equivalence classes in ζ are of size k or more.
Individual Risk Estimation (D2)
The actual reidentification risk for each equivalence class in ζ, 1/F_{j}, can be directly estimated.
Subsequently, the k-anonymization algorithm should ensure that all equivalence classes meet the following condition: the estimate of 1/F_{j} is no greater than the threshold risk 1/k. One estimator for 1/F_{j} has been studied extensively57 62 63 64 65 66 and was also incorporated in the mu-argus tool (which was developed by the Netherlands national statistical agency and used by many other national statistical agencies for disclosure control purposes),67 68 69 70 but it has never been evaluated in the context of k-anonymity. To the extent that this individual risk estimator is accurate, it can ensure that the actual risk is as close as possible to the threshold risk.
Hypothesis Testing Using a Poisson Distribution (D3)
One can use a hypothesis testing approach for determining if F_{j} ≥ k.56 71 If we assume that the sizes of the sample equivalence classes f_{j} follow a Poisson distribution, we can construct the null Poisson distribution for H_{0} : F_{j} < k and determine which observed value of f_{j} will reject the null hypothesis at α = 0.1. Let's denote this value as k′. Then the k-anonymity algorithm should ensure that the condition f_{j} ≥ min(k, k′) is met for all equivalence classes.
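The exact construction of the null distribution is given in Appendix A, which is not reproduced here. The sketch below is therefore only one plausible reading, under a loudly stated assumption: the sample class size is taken to be Poisson with mean λ = p × k at the boundary of the null hypothesis, where p is the sampling fraction, and the critical value is the smallest observed size whose upper tail probability falls below α. The function names and the choice of λ are assumptions of this sketch, not the authors' stated method.

```python
import math

def poisson_upper_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam), with x a non-negative integer."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(x))
    return 1.0 - below

def k_prime_poisson(k, sampling_fraction, alpha=0.1):
    """Smallest sample class size that rejects H0: F_j < k at level alpha.

    Assumption (not from the paper): lam = sampling_fraction * k.
    """
    lam = sampling_fraction * k
    x = 0
    while poisson_upper_tail(x, lam) >= alpha:
        x += 1
    return x

k = 5
for p in (0.1, 0.5, 0.9):
    kp = k_prime_poisson(k, p)
    # Minimum sample class size enforced by the algorithm: min(k, k').
    print(p, kp, min(k, kp))
```

Under this assumption, a sampling fraction of 0.1 gives a critical value of 2, well below k = 5, while at a sampling fraction of 0.9 the critical value exceeds k and the condition reduces to baseline k-anonymity. This is consistent with the convergence of the approaches at high sampling fractions.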
Hypothesis Testing Using a Truncated-at-Zero Poisson Distribution (D4)
In practice, we ignore equivalence classes that do not appear in the sample; therefore, the value of f_{j} cannot be equal to zero. An improvement on the hypothesis testing approach above would then be to use a truncated-at-zero Poisson distribution72 73 to determine the value of k′. The k-anonymity algorithm should ensure that the condition f_{j} ≥ min(k, k′) is met for all equivalence classes.
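The effect of truncating at zero can be sketched by conditioning the Poisson tail on the class being observed at all. As with any reading of the test that does not rely on Appendix A, this is an illustration under an assumption: the mean λ = p × k (p being the sampling fraction) is chosen for this sketch and is not taken from the paper, and the function names are hypothetical.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def truncated_upper_tail(x, lam):
    """P(X >= x | X > 0) for X ~ Poisson(lam) and integer x >= 1."""
    p_zero = math.exp(-lam)
    lower = sum(poisson_pmf(i, lam) for i in range(1, x))  # P(1 <= X < x)
    return 1.0 - lower / (1.0 - p_zero)

def k_prime_truncated(k, sampling_fraction, alpha=0.1):
    """Smallest observed class size (>= 1) that rejects H0 at level alpha.

    Assumption (not from the paper): lam = sampling_fraction * k.
    """
    lam = sampling_fraction * k
    x = 1
    while truncated_upper_tail(x, lam) >= alpha:
        x += 1
    return x

k = 5
kp = k_prime_truncated(k, 0.1)
print(kp, min(k, kp))  # 3 3
```

Conditioning on f_{j} > 0 inflates the tail probabilities, so the critical value is at least as large as under the untruncated Poisson (3 rather than 2 in this example). That makes the truncated test the more conservative of the two hypothesis testing variants.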
Methods
Our objective was to evaluate the three methods described above, and compare their performance to the baseline k-anonymity approach. We performed a simulation study to evaluate (a) the actual reidentification probability for k-anonymized data sets under the journalist reidentification scenario, and (b) the information loss due to this k-anonymization. We use values of k = 5, 10, and 15. Even though a minimum k value of 3 is often suggested,54 74 a common recommendation in practice is to ensure that there are at least five similar observations (k = 5).75 76 77 78 79 80 It is uncommon for data custodians to use values of k above 5, and quite rare that values of k greater than 15 are used in practice.
Data Sets
For our simulation we used 3 data sets which served as our populations. The first is the list of physicians and their basic demographics from the College of Physicians and Surgeons of Ontario with 23,590 observations.81 The quasi-identifiers we used were: postal code (5,349 unique values), graduation year (70 unique values), and gender (2 unique values). The second is a data set from the Paralyzed Veterans Association on veterans with spinal cord injuries or disease with 95,412 observations.82 The quasi-identifiers we used were: zip code (10,909 unique values), date of birth (901 unique values), and gender (2 unique values). The third data set is the fatal crash information database from the department of transportation with 101,034 observations.83 The quasi-identifiers used were age (98 unique values), gender (2 unique values), race (19 unique values), and date of death (386 unique values).
The quasi-identifiers we used in our three data sets are ones known to make it easy to link with publicly available information in Canada and the US.32 58 84 85
k-Anonymization
One thousand simple random samples were drawn from each data set at nine different sampling fractions (0.1 to 0.9 in increments of 0.1). Any identifying variables were removed and each sample was k-anonymized.
An existing global optimization algorithm44 was implemented to k-anonymize the samples. This algorithm uses a cost function to guide the k-anonymization process (the objective is to minimize this cost). A commonly used cost function to achieve baseline k-anonymity is the discernability metric.44 86 87 88 89 90 91 In Appendix A (Risk Estimates) we describe how this cost function is adjusted to implement the approaches D2, D3, and D4 within the same global optimization algorithm.
Note that records with missing values on the quasiidentifiers were removed from our analysis.
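The discernability metric is not defined in the text. The sketch below assumes the form commonly given in the k-anonymity literature: each retained record is charged the size of its equivalence class, and each suppressed record is charged the size of the whole data set. The function name and the example class sizes are hypothetical and are not results from the study.

```python
def discernability_metric(class_sizes, n_suppressed):
    """Discernability metric under the assumed standard definition.

    class_sizes: sizes of the retained equivalence classes.
    n_suppressed: number of records suppressed during anonymization.
    """
    n_total = sum(class_sizes) + n_suppressed
    retained_cost = sum(size * size for size in class_sizes)
    suppressed_cost = n_total * n_suppressed
    return retained_cost + suppressed_cost

# Hypothetical 14-record file: heavy generalization with 3 suppressed records,
# versus finer classes with nothing suppressed.
coarse = discernability_metric([5, 6], n_suppressed=3)
fine = discernability_metric([2, 3, 4, 5], n_suppressed=0)

print(coarse, fine)             # 103 54
print(round(fine / coarse, 2))  # 0.52
```

The final ratio illustrates the kind of normalization by a baseline value that makes information loss comparable across approaches.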
Evaluation
For each k-anonymized data set the actual risk is measured as Θ_{max} and the information loss is measured in terms of the discernability metric. Averages were calculated for each sampling fraction across the 1000 samples.
The results are presented in the form of three sets of graphs:
Risk. This shows the value of Θ_{max} against sampling fraction for each of the four approaches.
Information Loss. Because the discernability metric is affected by the sample size (and hence makes it difficult to compare across differing sampling fractions), we normalize it for D2, D3, and D4 by the baseline value. For example, a value of 0.8 (or 80%) for D2 means that the information loss for D2 is 80% of that for the baseline k-anonymity approach. The graph shows the normalized discernability metric for these three approaches against the sampling fraction. The value for D1 will by definition always be 1 (or 100%).
Suppression. Because the extent of suppression is an important indicator of data quality by itself, we show graphs of the percentage of suppressed records against sampling fraction for the four approaches.
Results
We will only present the results for k = 5 (i.e., a risk threshold = 0.2) here, with the remaining graphs for k = 10 and k = 15 provided in Appendix B (Results for k = 10 and k = 15) (available as a JAMIA online-only data supplement at http://www.jamia.org). The conclusions for k = 10 and k = 15 support the k = 5 results.
The actual reidentification risk Θ_{max} is shown using the four approaches in Figure 3. The actual risk for the baseline approach (D1), which is current practice, is quite low and exhibits a wide gap below the 0.2 risk threshold at k = 5. This gap is quite marked for small sampling fractions and disappears for large sampling fractions. At higher sampling fractions there is no difference between the baseline approach and the other three in terms of actual risk (they all converge to 0.2 as the sampling fraction approaches 1).
The individual risk estimation approach (D2) results in particularly large actual risk values for sampling fractions as high as 0.6. Even though individual risk estimates may be relatively accurate on average across many equivalence classes, their use will not result in a reasonable level of protection against a smart intruder who will focus only on the smallest equivalence class.
Approach D3 is better, but still results in actual risk values above the threshold quite often, and by a wide gap for sampling fractions of 0.3 and less. The fact that D3 cannot maintain risk below the threshold for sampling fractions as high as 0.3 makes it unsuitable for practical use.
The best approach is D4, which maintains the actual risk closest to the threshold risk of 0.2. Compared to D1, its actual risk is higher, but because it is below or very close to the threshold, its behaviour is consistent with what a data custodian would expect.
Figure 4 shows the normalized information loss in terms of the discernability metric. As expected, the D1 approach has the largest information loss among all four approaches, especially at lower sampling fractions (this is evidenced by the normalized discernability metric always having values below 100%). At higher sampling fractions all approaches tend to converge.
The hypothesis testing approach D4 has higher information loss than D3, but in many cases that difference is not very pronounced. But D4 is a significant improvement on D1, especially for low sampling fractions. For example, for the CPSO data set, D4 has 45% of the information loss of D1 at a sampling fraction of 0.1. Approach D2 has the lowest information loss. This is to be expected since the actual reidentification risk for D2 is often very high (as we saw in Figure 3).
While suppression is accounted for within the discernability metric, it is informative to consider the proportion of records suppressed by itself under each approach (Figure 5). The baseline approach results in a significant amount of suppression for small samples; in some cases as much as 50% of the records are suppressed. The D4 approach does reduce that percentage quite considerably, especially for small sampling fractions.
Discussion
We made explicit the two reidentification scenarios that k-anonymity was designed to protect against, known as the prosecutor and journalist scenarios. The baseline k-anonymity model, which represents current practice, would work well for protecting against the prosecutor reidentification scenario. However, our empirical results show that the baseline k-anonymity model is very conservative in terms of reidentification risk under the journalist reidentification scenario. This conservatism results in extensive information loss. The information loss is exacerbated for small sampling fractions.
The reason for these results is that the appropriate disclosure control criterion for the journalist scenario is k-map, not k-anonymity. We then evaluated three methods that extend k-anonymity to approximate k-map. These can potentially ensure that the actual risk is close to the threshold risk. A hypothesis testing method based on the truncated-at-zero Poisson distribution ensures that the actual risk is quite close to the threshold risk, even for small sampling fractions, and therefore is a good approximation of k-map. It is a considerable improvement over the baseline k-anonymity approach because it provides good control of risk consistent with the expectations of a data custodian. Furthermore, this hypothesis testing approach always results in significantly less information loss than the baseline k-anonymity approach. This is an important benefit because we have shown that a significant percentage of records would be suppressed using the baseline approach.
Suppression results in discarding data that was costly to collect and can result in a considerable loss of statistical power in any subsequent analysis. Furthermore, unless record suppression is completely random, it will bias analysis results.92 If we take a simple example of a single quasi-identifier, records will be suppressed for the rare and extreme values on that variable. Therefore, by definition, the pattern of suppression will not be completely random.
Some k-anonymization algorithms suppress individual cells rather than full records. In practice, this may not have as much of a positive impact on the ability to do data analysis as one would hope. One common approach to deal with suppressed cells is complete case analysis (CCA), whereby only records without suppressed values are included in an analysis. Deletion of full records with any suppressed values is the default approach in most statistical packages92 and is common practice in epidemiologic analysis.93 It is known that CCA can result in discarding large proportions of a data set. For example, with only 2% of the values missing at random in each of 10 variables, one would lose 18.3% of the observations on average using CCA, and with 5 variables having 10% of their values missing at random, 41% of the observations would be lost with CCA, on average.94 Another popular approach is available case analysis (ACA), whereby the records with complete values on the variables used in a particular analysis are used. For example, in constructing a correlation matrix different records are used for each pair of variables depending on the availability of both values. This, however, can produce nonsense results.94 Both CCA and ACA are only appropriate under the strong assumption that suppression is completely at random,92 93 and we have noted above that with k-anonymity this will not be the case by definition. Therefore, full record or individual cell suppression are both detrimental to the quality of a data set.
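The CCA loss figures quoted above follow from a simple calculation if missingness is assumed to be independent across variables: a record survives only if none of its values is missing. The sketch below reproduces the arithmetic; the function name is hypothetical and the independence assumption is made for this illustration.

```python
def expected_cca_loss(missing_rate, n_variables):
    """Expected fraction of records dropped by complete case analysis,
    assuming each variable is missing independently at the same rate."""
    return 1.0 - (1.0 - missing_rate) ** n_variables

print(f"{expected_cca_loss(0.02, 10):.1%}")  # 18.3%
print(f"{expected_cca_loss(0.10, 5):.1%}")   # 41.0%
```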
Guidelines for Applying k-Anonymity
The way in which k-anonymity would be applied depends on the reidentification scenario one is protecting against. To protect against the prosecutor reidentification scenario, k-anonymity should be used. If the prosecutor scenario is not applicable, then k-anonymity is not recommended, and k-map should be used instead (or our approximation of it using the hypothesis testing approach D4). If both scenarios are plausible, then k-anonymity should be used because this is the most protective. Therefore, being able to decide whether the prosecutor scenario is applicable is important.
If we assume a threshold risk of 0.2, then under the prosecutor scenario the data custodian would just k-anonymize with k = 5. Under the journalist scenario the data custodian would determine k′ using the hypothesis testing approach (D4) and then k-anonymize with k = min(k′, 5).
An intruder would only pursue a prosecutor reidentification scenario if s/he has certainty that the VIP has a record in ζ. There are three ways in which an intruder can have such certainty:79 95

The disclosed data set represents the whole population (e.g., a population registry) or has a large sampling fraction. If the whole population is being disclosed then the intruder would have certainty that the VIP is in the disclosed data set. Also, a large sampling fraction means that the VIP is very likely to be in the disclosed data set.

It can be easily determined who is in the disclosed sample. For example, the sample may be a data set from an interview survey conducted in a company and it is generally known who participated in these interviews because the participants missed half a day of work. In such a case it is known within the company, and to an internal intruder, who is in the disclosed data set.

The individuals in the disclosed data set self-reveal that they are part of the sample. For example, subjects in clinical trials do generally inform their family, friends, and even acquaintances that they are participating in a trial. One of the acquaintances may attempt to reidentify one of these self-revealing subjects. However, it is not always the case that individuals do know that their data is in a data set. For example, for studies where consent has been waived or where patients provide broad authorization for their data or tissue samples to be used in research, the patients may not know that their data is in a specific data set, providing no opportunity for self-revealing their inclusion.
If any of the above conditions apply, then protecting against the prosecutor scenario is required. However, many epidemiologic and health services research studies, including secondary use studies, would not meet the criteria set out above. In such a case, protection against the journalist scenario with the D4 approach is recommended.
Relationship to Other Work
Reidentification risk is sometimes measured or estimated as the proportion of records that are unique in the population. Such uniqueness is then used as a proxy for reidentification risk. One approach for estimating population uniqueness from a sample uses the Poisson–gamma model with the α and β parameters estimated by the method of moments,96 97 but it overestimates with small sampling fractions and underestimates as the sampling fraction increases.98 Another method that uses subsampling performs well for larger sampling fractions.99 100 101 More recent work developed probability models and estimators for two attack-based reidentification risk measures.102 However, uniqueness measures of risk will by definition give an answer of zero for any k-anonymized data set, and therefore are inappropriate in this context.
Limitations
We limited our simulations to quasi-identifiers that have been demonstrated to be useful for reidentification attacks using public and semi-public data sources. There is evidence that information loss becomes unacceptably large as the number of quasi-identifiers increases, even for small values of k.103 Therefore, had we used more quasi-identifiers, the information loss effects that we have shown would have been more pronounced.
There are other approaches that have been proposed for achieving k-anonymity that we did not consider, for example, local recoding.104 105 106 107 With local recoding, observations may have different and overlapping response intervals. For instance, one observation may have an age of 27 recoded to the interval 20–29, and another observation may have an age of 27 recoded to the interval 25–35. This makes any data analysis of the k-anonymized data set more complex than having the same recoding intervals for all observations, and precludes the use of common and generally accepted statistical modeling techniques. Our implementation of k-anonymity used global recoding instead, and this ensures that response intervals are the same across all observations.
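Global recoding can be illustrated with a single recoding function applied uniformly to every observation, so that equal ages always land in the same interval. This is a minimal sketch; the function name and the fixed-width interval scheme are assumptions for illustration and do not describe the generalization hierarchy used in the simulations.

```python
def recode_age(age, width):
    """Map an age to a fixed-width interval label, identically for all records."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

ages = [27, 31, 27, 45]
print([recode_age(a, 10) for a in ages])  # ['20-29', '30-39', '20-29', '40-49']
print([recode_age(a, 5) for a in ages])   # ['25-29', '30-34', '25-29', '45-49']
```

Both observations with an age of 27 receive the same interval at either width, which is the property that distinguishes global recoding from local recoding.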
Conclusions
There is increasing pressure to disclose health research data, and this is especially true when the data has been collected using public funds. However, the disclosure of such data raises serious privacy concerns. For example, consider an individual who participated in a clinical trial having all of their clinical and lab data published on a journal web site accompanying the article on the trial. If it was possible to reidentify the records of that individual from this public data it would be a breach of privacy. Such an incident could result in fewer people participating in research studies because of privacy concerns and, if it happened in Canada, would break privacy laws.
It is therefore important to understand precisely the types of reidentification attacks that can be launched on a data set and the different ways to properly anonymize the data before it is disclosed.
Anonymization techniques result in distortions to the data. Excessive anonymization may reduce the quality of the data making it unsuitable for some analysis, and possibly result in incorrect or biased results. Therefore, it is important to balance the amount of anonymization being performed against the amount of information loss.
In this paper we focused on k-anonymity, which is a popular approach for protecting privacy. We considered the two reidentification scenarios that k-anonymity is intended to protect against. For one of the scenarios, we showed that actual reidentification risk under the baseline k-anonymity is much lower than the threshold risk that the data custodian assumes, and that this results in an excessive amount of information loss, especially at small sampling fractions. We then evaluated three alternative approaches and found that one of them consistently ensures that the actual reidentification risk is quite close to the threshold risk, and always has lower information loss than the baseline approach.
It is recommended that data custodians determine which reidentification scenarios apply on a case-by-case basis, and anonymize the data before disclosure using the baseline k-anonymity model or our modified k-anonymity model accordingly.
Acknowledgments
The authors thank Bradley Malin, Vanderbilt University, and Jean-Louis Tambay, Statistics Canada, for their detailed feedback and suggestions on earlier versions of this paper. The authors also thank the anonymous reviewers for many helpful suggestions.