Privacypreserving heterogeneous health data sharing
 ^{1}Department of Computer Science and Software Engineering, Concordia University, Montreal, Quebec, Canada
 ^{2}Division of Biomedical Informatics, University of California, San Diego, California, USA
 Correspondence to Noman Mohammed, Department of Computer Science and Software Engineering, Concordia University, 1455 De Maisonneuve Blvd West, Montreal, QC H3G 1M8, Canada; no_moham{at}cse.concordia.ca
 Received 20 April 2012
 Revised 13 April 2012
 Accepted 1 November 2012
 Published Online First 13 December 2012
Abstract
Objective Privacypreserving data publishing addresses the problem of disclosing sensitive data when mining for useful information. Among existing privacy models, εdifferential privacy provides one of the strongest privacy guarantees and makes no assumptions about an adversary's background knowledge. All existing solutions that ensure εdifferential privacy handle the problem of disclosing relational and setvalued data in a privacypreserving manner separately. In this paper, we propose an algorithm that considers both relational and setvalued data in differentially private disclosure of healthcare data.
Methods The proposed approach makes a simple yet fundamental switch in differentially private algorithm design: instead of listing all possible records (ie, a contingency table) for noise addition, records are generalized before noise addition. The algorithm first generalizes the raw data in a probabilistic way, and then adds noise to guarantee εdifferential privacy.
Results We showed that the disclosed data could be used effectively to build a decision tree induction classifier. Experimental results demonstrated that the proposed algorithm is scalable and performs better than existing solutions for classification analysis.
Limitation The resulting utility may degrade when the output domain size is very large, making it potentially inappropriate to generate synthetic data for large health databases.
Conclusions Unlike existing techniques, the proposed algorithm allows the disclosure of health data containing both relational and setvalued data in a differentially private manner, and can retain essential information for discriminative analysis.
Introduction
With the wide deployment of electronic health record systems, health data are being collected at an unprecedented rate. The need for sharing health data among multiple parties has become evident in several applications ,1 such as decision support, policy development, and data mining. Meanwhile, major concerns have been raised about individual privacy in health data sharing. The current practice of privacy protection primarily relies on policies and guidelines, for example, the Health Insurance Portability and Accountability Act (HIPAA)2 in the USA. HIPAA defines two approaches to achieve deidentification: the first is Expert Determination, which requires that an expert certify that the reidentification risk inherent in the data is sufficiently low; the second is Safe Harbor, which requires the removal and suppression of a list of attributes.3 Safe Harbor requires data disclosers to follow a checklist4 to remove specific information to deidentify the records.
However, there are numerous controversies on both sides of the privacy debate regarding these HIPAA privacy rules.5 Some think that the protections provided in the deidentified data are not sufficient.6 Others contend that these privacy safeguards hamper biomedical research, and that observing them may preclude meaningful studies of medical data that depend on suppressed attributes, for example, finegrained epidemiology studies in areas with fewer than 20 000 residents or geriatric studies requiring detailed ages in those over 89.3 There are concerns that privacy rules will erode the efficiencies that computerized health records may create, and in some cases, interfere with law enforcement.5 Recently, the Institute of Medicine Committee on Health Research and the Privacy of Health Information concluded that the privacy rules do not adequately safeguard privacy and also significantly impede highquality research.7 The result is that patients’ health records are not well protected at the same time that researchers cannot effectively use them for discoveries.8 Technical efforts are highly encouraged to make published health data both privacypreserving and useful.9
Anonymizing health data is a challenging task due to inherent heterogeneity. Modern health data are typically composed of different types, for example relational data (eg, demographics) and setvalued data (eg, diagnostic codes and laboratory tests). In relational data (eg, gender, age, body mass index), records contain only one value for each attribute. On the other hand, setvalued data (eg, diagnostic codes and laboratory tests) contain one or more values (cells) for each attribute. For example, the attributevalue {1*, 2*} of the diagnostic code contains two separate cells: {1*} and {2*}. For many medical problems, different types of data need to be published simultaneously so that the correlation between different data types can be preserved. Such an emerging heterogeneous datapublishing scenario, however, is seldom addressed in the existing literature on privacy technology. Current techniques primarily focus on a single type of data10 and therefore are unable to thwart privacy attacks caused by inferences involving different data types. In this article, we propose an algorithm so that heterogeneous health data can be published yet retain essential information for supporting data mining tasks in a differentially private manner. The following reallife scenario further illustrates the privacy threats resulting from heterogeneous health data sharing.
Example 1 Consider the raw patient data in table 1 (the attribute ID is just for the purposes of illustration). Each row in the table represents information from a patient.
The attributes Sex, Age, and Diagnostic code are categorical, numerical, and setvalued, respectively. Suppose that the data owner needs to release table 1 for the purpose of classification analysis on the class attribute, which has two values, Y and , indicating whether or not the patient is deceased. If a record in the table is too specific such that not many patients can match it, releasing the data may lead to the reidentification of a patient. For example, Loukides et al11 demonstrated that for the International Classification of Diseases (ICD), Ninth Revision (ICD9) codes (or ‘diagnostic codes’ for brevity), one source of setvalued data could be used by an adversary for linkage to patients’ identities. Needless to say, the knowledge of both relational and setvalued data about a victim makes the privacy attack easier for an adversary. Suppose that the adversary knows that the target patient is female and her diagnostic codes contain {11}. Then, record #4 can be uniquely identified, since she is the only Female with diagnostic codes {11,12} in the raw data. Thus, identifying her record results in disclosure that she also has {12}. Note that we do not make any assumption about the adversary's background knowledge. An adversary may have partial or full information about the setvalued data and can try to use any background knowledge to identify the victim.
To prevent such linking attacks, a number of partitionbased privacy models have been proposed.12–16 However, recent research has indicated that these models are vulnerable to various privacy attacks17–20 and provide insufficient privacy protection. In this article, we employ differential privacy,21 a privacy model that provides provable privacy guarantees and that is, by definition, immune against all aforementioned attacks. Differential privacy makes no assumption about an adversary's background knowledge. A differentially private mechanism ensures that the probability of any output (released data) is almost equally likely from all nearly identical input data sets and thus guarantees that all outputs are insensitive to any single individual's data. In other words, an individual's privacy is not at risk because of inclusion in the disclosed data set.
Motivation
Existing algorithms that provide differential privacy guarantees are based on two approaches: interactive and noninteractive. In an interactive framework, a data miner can pose aggregate queries through a private mechanism, and a database owner answers these queries in response. Most of the proposed methods for ensuring differential privacy are based on an interactive framework.22–26 In a noninteractive framework the database owner first anonymizes the raw data and then releases the anonymized version for public use. In this article, we adopt the noninteractive framework as it has a number of advantages for data mining.10 Current techniques that adopt the noninteractive approach publish contingency tables or marginals of the raw data.27–30 The general structure of these approaches is to first derive a frequency matrix of the raw data over the database domain.
For example, table 2 shows the contingency table of table 3. After that, noise is added to each count to satisfy the privacy requirement. Finally, the noisy frequency matrix is published. However, this approach is not suitable for highdimensional data with a large domain because when the added noise is relatively large compared to the count, the utility of the data is significantly destroyed. We also confirm this point in the ‘Experimental description’ section. Our proposed solution instead first probabilistically generates a generalized contingency table and then adds noise to the counts. For example, table 4 is a generalized contingency table of table 3. Thus the count of each partition is typically much larger than the added noise.
Contributions
We propose a novel technique for publishing heterogeneous health data that provides an εdifferential privacy guarantee. While protecting privacy is a critical element in data publishing, it is equally important to preserve the utility of the published data, since this is the primary reason for data release. Taking the decision tree induction classifier as an example, we show that our sanitization algorithm can be effectively tailored for preserving information in data mining tasks. The contributions of this article are:

To our knowledge, a differentially private data disclosure algorithm that simultaneously handles both relational and setvalued data has not been previously developed. The proposed differentially private data algorithm is based on a generalization technique and preserves information for classification analysis. Previous work31 suggests that deterministic generalization techniques cannot be used to achieve εdifferential privacy, as they depend heavily on the data to be disseminated. Yet, we show that differentially private data can be released through the addition of uncertainty in the generalization procedure.

The proposed algorithm can also handle numerical attributes. Unlike existing methods,30 it does not require the numerical attributes to be prediscretized. The algorithm adaptively determines the split points for numerical attributes and partitions the data based on the workload, while guaranteeing εdifferential privacy. This is an essential requirement for obtaining accurate classification, as we show in the ‘Discussion’ section. Moreover, the algorithm is computationally efficient.

It is well acknowledged that εdifferential privacy provides a strong privacy guarantee. However, the utility of data disclosed by differentially private algorithms has received much less study. Does an interactive approach offer better data mining results than a noninteractive approach? Does differentially private data disclosure provide less utility than disclosure based on kanonymous data? Experimental results demonstrate that our algorithm outperforms the recently proposed differentially private interactive algorithm for building a classifier26 and the topdown specialization (TDS) approach32 that publishes kanonymous data for classification analysis.
This article is organized as follows. The ‘Preliminaries’ section provides an overview of the generalization technique and presents the problem statement. Our anonymization algorithm is explained in ‘The algorithm’ section. In the ‘Experimental description’ section, we experimentally evaluate the performance of our solution, and we summarize our main findings in the ‘Discussion’ section.
Preliminaries
In this section, we introduce the notion of generalization in the context of data publishing, followed by a problem statement.
Generalization
Let be a multiset of records, where each record represents the information of an individual with attributes . We represent the data set in a tabular form and use the terms ‘data set’ and ‘data table’ interchangeably. We assume that each attribute has a finite domain, denoted by . The domain of is defined as . To generalize a data set , we replace the value of an attribute with a more general value. The exact general value is determined according to the attribute partition.
Definition 2.1 (Partition) The partitions of a numerical attribute are the intervals in such that
For categorical and setvalued attributes, partitions are defined by a set of nodes from the taxonomy tree such that it covers the whole tree, and each leaf node belongs to exactly one partition.
For example, is the general value of according to the taxonomy tree of in figure 1. Similarly, 23 and 11 can be represented by the interval and the code 1*, respectively. For numerical attributes, these intervals are determined adaptively from the entire data.
Definition 2.2 (Generalization) Generalization is defined by a function , where maps each value to a .
Clearly, given a data set over a set of attributes , many alternative generalization functions are feasible. Each generalization function partitions the attribute domains differently. To satisfy εdifferential privacy, the algorithm must determine a generalization function that is insensitive to the underlying data. More formally, for any two data sets and , where (ie, they differ on at most one record), the algorithm must ensure that the ratio of and is bounded, where is a randomized algorithm (see online supplementary appendix A1). One naive solution satisfying εdifferential privacy is to have a fixed generalization function, irrespective of the input data set (ie, by definition zerodifferentially private but useless). However, the proper choice of a generalization function is crucial since the data mining result varies significantly for different choices of partitioning. In the ‘Experimental description’ section, we present an efficient algorithm for determining an adaptive partitioning technique for classification analysis while guaranteeing εdifferential privacy. Online supplementary appendix A1 presents an overview of εdifferential privacy and the core mechanisms to achieve εdifferential privacy.
Problem statement
Suppose a data owner wants to release a deidentified data table where the symbols and correspond to predictor attributes and the class attribute, respectively, for release to the public for classification analysis. The attributes in are classified into three categories: (1) an identifier attribute that explicitly identifies an individual, such as SSN (social security number), and Name. These attributes are removed before releasing the data as per the HIPAA Privacy Rule; (2) a class attribute that contains the class value; the goal of the data miner is to build a classifier to accurately predict the value of this attribute; and (3) a set of predictor attributes , whose values are used to predict the binary label of the class attribute.
We require the class attribute to be categorical, and the predictor attribute can be categorical, numerical, or setvalued. Further, we assume that for each categorical or setvalued attribute , a taxonomy tree is provided. The taxonomy tree of an attribute specifies the hierarchy among the values. Our problem statement can be written as: given a data table and the privacy parameter , our objective is to generate a deidentified data table such that (1) satisfies εdifferential privacy, and (2) preserves as much information as possible for classification analysis.
The algorithm
In this section, we present an overview of our Differentially private algorithm based on Generalization (DiffGen). We elaborate the key steps, and prove that the algorithm is εdifferentially private in online supplementary appendix A2. In addition, we present the implementation details and analyze the complexity of the algorithm in online supplementary appendix A3.
Algorithm 1 DiffGen

Input: Raw data set , privacy budget ε, and number of specializations Output: Generalized data set

1: Initialize every value in to the topmost value (see figure 2 for details);

2: Initialize to include the topmost value (see figure 2 for details);

3: Set a privacy budget for specification of predictors ;

4: Determine the split value for each with probability ;

5: Compute the score for each candidate (see online supplementary appendix A2 for details);

6: for to do

7: Select with probability ;

8: Specialize on and update ;

9: Determine the split value for each new with probability

;

10: Update score for ;

11: end for

12: return each group with count , where denotes the probability density function of Laplacian distribution.
Algorithm 1 first generalizes the predictor attributes and divides the raw data into several equivalence groups, where all the records within a group have the same attribute values. Then, the algorithm publishes the noisy counts of the groups. The general idea is to sanitize the raw data by a sequence of specializations, starting from the topmost general state as shown in figure 2. A specialization, written as , where denotes the set of child values of , replaces the parent value with a child value. The specialization process can be viewed as pushing the ‘cut’ of each taxonomy tree downwards. A cut of the taxonomy tree for an attribute , denoted by , contains exactly one value on each roottoleaf path. The value of the setvalued attribute of a record can be generalized to a cut if every item in the record can be generalized to a node in the cut and every node in the cut generalizes some items in the record. For example, the value can be generalized to the hierarchy cuts and , but not . Figure 2 shows a solution cut indicated by the dashed curve representing table 5, which corresponds to deidentified data to be disseminated.
Initially, DiffGen creates a single partition by generalizing all values in to the topmost value in their taxonomy trees (line 1). The contains the topmost value for each attribute (line 2). The specialization starts from the topmost cut and pushes down the cut iteratively by specializing some value in the current cut. At each iteration, DiffGen uses an exponential mechanism33 (see online supplementary appendix A1) to select a candidate for specialization (line 7). Candidates are selected based on their score values (see online supplementary appendix A2), and different heuristics (eg, information gain and max frequency) can be used to determine the score of the candidates. Then, the algorithm specializes and updates (line 8). As taxonomy trees for the numerical attributes are not given, DiffGen again uses the exponential mechanism to determine the split value dynamically for each numerical candidate (lines 4 and 9). DiffGen specializes by recursively distributing the records from the parent partition into disjoint child partitions with more specific values based on the taxonomy tree. For setvalued attributes, the algorithm computes the noisy count of each child partition to determine whether it is empty or not. Only ‘nonempty’ partitions are considered for further split in the next iteration. We provide additional details for candidate selection and the split value determination steps in online supplementary appendix A2. DiffGen also calculates the score for each new candidate due to the specialization (line 10). The algorithm terminates after a given number of specializations. Finally, the algorithm adds Laplace noise (see online supplementary appendix A1) to each equivalence group of the leaf partition to construct the sanitized data table . We use the following example to facilitate understanding of how to use score functions, which are based on heuristics (eg, information gain and max frequency), for specification.
Example 2 Consider table 1 with ε=1 and h=2. Initially, the algorithm creates one root partition containing all the records that are generalized to . includes . To find the first specialization among the candidates in , we compute the scores of , [18–65), and . We show how to compute the information gain (InfoGain) and maximum frequency (Max) scores of for the specialization . Details of these two utility functions are discussed in online supplementary appendix 2.

Information gain:

2. Max:
.
Let the first specialization be . The algorithm then creates three child partitions with the child values , , and , respectively, by replacing the node with different combinations of its children, leading , , , and to the child partition and , , , and to the child partition . Suppose that the noisy counts indicate that these two child partitions are ‘nonempty’. Then further splits are needed for them. There is no need to explore the child partition any more, as it is considered ‘empty’. is updated to . Suppose that the next specialization is , which creates further specialized partitions. Finally, the algorithm outputs the equivalence groups of each leaf partition along with their noisy counts as shown in figure 2 under the dotted line.
Please refer to online supplementary appendix A2 for privacy analysis of DiffGen and to online supplementary appendix A3 for implementation details.
Experimental description
In this section our objectives are to study the impact of enforcing differential privacy on the data quality in terms of classification accuracy (CA), and to evaluate the scalability of the proposed algorithm for handling large data sets. We also compare DiffGen with DiffPC4.5,26 a differentially private interactive algorithm for building a classifier, and with the TDS approach32 that publishes kanonymous data for classification analysis. All experiments were conducted on an Intel Core i7 2.7GHz PC with 12GB RAM.
We employed two reallife data sets: MIMIC and Adult. We retrieved the MIMIC data set from the Multiparameter Intelligent Monitoring in Intensive Care II research database,34 which contains over 36 000 intensive care unit episodes. Specifically, we picked eight features (ie, marital status, gender, ethnicity, payment description, religion description, admission type, admission source, and ICD9 code) and a target variable (ie, mortality). Of the eight features, the first seven are categorical attributes while the last one is a setvalued attribute. The publicly available Adult35 data set is the 1994 US census data set that has been widely used for testing many sanitization algorithms. Adult has 45 222 census records with six numerical attributes, eight categorical attributes, and a binary class column representing two income levels, ≤US$50 K or >US$50 K. Please refer to Fung et al32 for the description of attributes.
To evaluate the impact on classification quality, we divided the data into training and testing sets. First, we applied our algorithm to sanitize the training set and determine the . Then, the same was applied to the testing set to produce a generalized test set. Next, we built a classifier on the sanitized training set and measured the CA on the generalized records of the test set. For classification models, we used the wellknown C4.5 classifier.36 For each experiment, we executed 10 runs and averaged the results over the runs.
MIMIC data set
We applied DiffGen to the MIMIC data set for both utility functions (ie, Max and InfoGain). Figure 3 shows the CA for Max and InfoGain, where the privacy budget ε=0.1, 0.25, 0.5, 1, and the number of specializations h=5. We used two thirds of the records to build the classifier and measured the accuracy on the remaining third of the records. Both utility functions have similar performance, where CA spans from 86% to 89% under different privacy budgets. The experimental result suggests that the proposed algorithm can achieve good CA on heterogeneous health data. We could not directly compare our method with others for the MIMIC data set because we are not aware of an approach that can sanitize heterogeneous data while ensuring εdifferential privacy.
Adult data set
To better visualize the cost and benefit of our approach, we provide additional measures: baseline accuracy (BA) is the CA measured on the raw data without sanitization. BA−CA represents the cost in terms of classification quality for achieving a given εdifferential privacy requirement. At the other extreme, we measure lowerbound accuracy (LA), which is the accuracy on the raw data with all attributes (except for the class attribute) removed. CA−LA represents the benefit of our method over the naive nondisclosure approach.
Figure 4A depicts the CA for the utility function Max, where the privacy budget ε=0.1, 0.25, 0.5, 1, and the number of specializations 4≤h≤16. The BA and LA are 85.3% and 75.5%, respectively, as shown in the figure by the dotted lines. For ε=1 and h=10, BA−CA is around 3% and CA−LA is 6.74%. For ε=0.5, BA−CA spans from 3.57% to 4.8%, and CA−LA spans from 5% to 6.23%. However, when ε decreases to 0.1, CA quickly decreases to about 78% (highest point), the cost increases to about 7%, and the benefit decreases to about 3%. These results suggest that for an acceptable privacy budget such as , the cost for achieving εdifferential privacy is small, while the benefit of our method over the naive method is large. Figure 4B depicts the CA for the utility function InfoGain. The performance of InfoGain is not as good as that of Max because the difference between the scores of a good and a bad attribute is much smaller for InfoGain as compared to Max. Therefore, the exponential mechanism does not work as effectively in the case of InfoGain as it does for Max.
Figure 5A shows the CA of DiffGen, DiffPC4.5, and TDS. For DiffGen, we use utility function Max and fix the number of specializations h=15. DiffPC4.5 also uses the Adult data set and all the results of the DiffPC4.5 are taken from the paper by Friedman and Schuster.26 For TDS we fixed the anonymity threshold k=5 and conducted the experiment ourselves. Following the same setting,26 we executed 10 runs of 10fold crossvalidation to measure the CA.
The accuracy of DiffGen is clearly better than that of DiffPC4.5 for privacy budgets ε≤2. Note that the privacy budget should be typically smaller than 1.21 Even for a higher budget, the accuracy of DiffGen is comparable to that of DiffPC4.5. The major advantage of our algorithm is that we publish data and the data miner has much better flexibility to perform the required data analysis. On the other hand, in DiffPC4.5 the classifier is built through interactive queries; therefore, the database has to be permanently shut down to satisfy the privacy requirement after generating only one classifier. The experimental result also shows that DiffGen performs better than TDS. For a higher anonymity threshold k, the accuracy of TDS will be lower. One advantage of DiffGen is that, unlike TDS, it does not need to ensure that every equivalence group contains k records; therefore, DiffGen is able to provide more detailed information than TDS. This result demonstrates for the first time that, if designed properly, a differentially private algorithm can provide better utility than a partitionbased approach.
All previous experiments can finish the sanitization process within 30 s. We further study the scalability of our algorithm over large data sets. We generate different data sets of different sizes by randomly adding records to the Adult data set. For each original record r, we create α−1 variations of the record by replacing some of the attribute values randomly from the same domain. Here α is the blowup scale and thus the total number of records is α×45 222 after adding random records. Figure 5B depicts the runtime for 200 000 to million records for h=15 and ε=1.
Summary
We observed two general trends from the experiments. First, the privacy budget has a direct impact on the CA. A higher budget results in better accuracy since it ensures better attribute partitioning and lowers the magnitude of noise that is added to the count of each equivalence group. Second, the CA initially increases with the increase in the number of specializations, but decreases after a certain threshold. This is an interesting observation. The number of equivalence groups increases quite rapidly with an increase in the number of specializations, resulting in a smaller count per group. Up to a certain threshold it has a positive impact due to more precise values; however, the influence of the Laplace noise gets stronger as the number of specializations grows. Note that if the noise is as large as the count, then the disclosed data are useless. This confirms that listing all possible combinations of values (ie, contingency table) and then adding noise to their counts is not a good approach for highdimensional data since the noise will be as big as the count. Since this is a noninteractive approach, the data owner can try different values of h to find the threshold and then release the sanitized data. Determining a good value of h adaptively, given the data set and the privacy budget, is an interesting approach for future work that we plan to investigate.
Discussion
Does the algorithm yield a globally optimal solution? Can the algorithm be easily modified to anonymize sequential data? Are noise addition and generalizationbased techniques desirable for the users of medical data sets? In this section, we provide answers to these questions.
Globally optimal
The proposed algorithm does not yield an optimal solution cut, rather it is suboptimal. The algorithm uses an exponential mechanism which probabilistically chooses a candidate with a high score. Thus, it is possible that a different solution cut may provide better utility. However, it is important to note that maximizing the overall sum of the Score for specializations in the training data does not guarantee having the lowest classification error in the testing data.
Sequential data
While setvalued data (not considering the order) are useful for many data analysis tasks, we acknowledge that in some other analysis tasks the order of items could provide extra useful information and may pose new reidentification risks. The proposed anonymization algorithm can be extended to handle sequential data as well. In order to anonymize sequential data, we need to preprocess each item in the sequence by its order and then consider the item–order pair as the new ‘item’. Then, these new item sets could be used as an input to our algorithm, which will then prevent reidentification attacks based on the order of items. However, this is not currently implemented in the paper.
Usefulness of our approach
One of the inputs of our algorithm is the number of generalizations. Data users could specify the desired degree of generalizations by setting a proper value for h. Concerning the negative impact of noise (ie, on data utility) added to satisfy differential privacy, we expect the sanitized data from our algorithm to be useful for a few tasks (eg, classification as illustrated in this paper), but not all data analysis tasks, especially those that focus on attribute values of individuals. This is an inherent limitation of differential privacy methodology, and it is the price we pay to achieve a provable privacy guarantee. We acknowledge this limitation and will seek for better solutions in future work.
In summary, the generalization technique used in DiffGen might result in loss of information, which leads to a tradeoff problem between data utility and privacy protection, like many other privacy enhancement algorithms. The generalization is also context dependent. For example, if we want to generalize diseases (ie, asthma with respiratory disease and psoriasis with dermatological disease), resultant outputs might be adequate to answer some scientific questions (ie, classification tasks determined upfront) but may be insufficient to answer others, (eg, immunologic diseases, which could encompass both asthma and psoriasis). This is a fundamental limitation of generalization techniques and the type of lumping must be used carefully.
Conclusion
This paper presents a new anonymization algorithm that achieves εdifferential privacy and supports effective classification analysis for heterogeneous health data. The proposed solution connects the classical generalization technique with output perturbation to effectively sanitize raw data. Experimental results suggest that the proposed solution provides better utility than a pure differentially private interactive approach or an approach to simply produce kanonymous data.
Acknowledgments
We thank Dr Shuang Wang for helping with the experiments and useful discussions.
Footnotes

Contributors The authors are ranked according to their contributions. The authorship credit is based on meeting all the following criteria: (1) substantial contributions to conception and design, acquisition of data, or analysis and interpretation of data; (2) drafting the article or revising it critically for important intellectual content; and (3) final approval of the version to be published.

Funding BCMF, NM, and RC are supported in part by Discovery Grants (3560652008) and Canada Graduate Scholarships from the Natural Sciences and Engineering Research Council of Canada (NSERC). XJ and LOM were funded in part by NIH grants 1K99LM 01139201, R01LM009520, U54 HL108460, R01HS019913, and UL1RR031980.

Competing interests None.

Provenance and peer review Not commissioned; externally peer reviewed.