Grid Binary LOgistic REgression (GLORE): building shared models without sharing data
 Division of Biomedical Informatics, Department of Medicine, University of California San Diego, La Jolla, California, USA
 Correspondence to Dr Yuan Wu, Division of Biomedical Informatics, Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA; y6wu{at}ucsd.edu

Contributors YW and LOM contributed equally to the writing of this article. The other authors are ranked according to their contributions.
 Received 19 January 2012
 Accepted 19 March 2012
 Published Online First 17 April 2012
Abstract
Objective The classification of complex or rare patterns in clinical and genomic data requires the availability of a large, labeled patient set. While methods that operate on large, centralized data sources have been extensively used, little attention has been paid to understanding whether models such as binary logistic regression (LR) can be developed in a distributed manner, allowing researchers to share models without necessarily sharing patient data.
Material and methods Instead of bringing data to a central repository for computation, we bring computation to the data. The Grid Binary LOgistic REgression (GLORE) model integrates decomposable partial elements or nonprivacy sensitive prediction values to obtain model coefficients, the variancecovariance matrix, the goodnessoffit test statistic, and the area under the receiver operating characteristic (ROC) curve.
Results We conducted experiments on both simulated and clinically relevant data, and compared the computational costs of GLORE with those of a traditional LR model estimated using the combined data. We showed that our results are the same as those of LR to a 10^{−15} precision. In addition, GLORE is computationally efficient.
Limitation In GLORE, the calculation of coefficient gradients must be synchronized at different sites, which involves some effort to ensure the integrity of communication. Ensuring that the predictors have the same format and meaning across the data sets is necessary.
Conclusion The results suggest that GLORE performs as well as LR and allows data to remain protected at their original sites.
 Logistic regression
 grid computing
 HL test
 Area under the ROC curve
 data mining
 machine learning
 statistical learning
 predictive modeling
 statistical learning
 privacy technology
Introduction
In biomedical, translational, and clinical research, it is important to share data to obtain sample sizes that are meaningful and potentially accelerate discoveries.1 This is necessary to expedite pattern recognition related to relatively rare events or conditions, such as complications from invasive procedures, adverse events associated with new medications, association of disease with a rare gene variant, and many others. Although electronic data networks have been established for this purpose, in the form of disease registries, clinical data warehouses for quality improvement and cohort discovery related to clinical trial recruitment, etc, many of these initiatives are based on federated models in which the actual data never leave the institution of origin, for example, as in the model used at the Clinical Evaluative Sciences in Ontario (ICES), Manitoba Centre for Health Policy (MCHP).2 However, the statistics and predictive models that can be developed in these distributed networks have been very limited, often consisting of simple counts (which still need to be somewhat obfuscated to preserve the privacy of individuals).3 ,4 Many clinical pattern recognition tasks5–8 are highly complex, involving multiple factors. To support human decision making in complex situations, numerous prediction models9–16 have been developed and applied in a clinical context. Recently, various systems were developed for assisting with tasks as diverse as automatically discovering drug treatment patterns in electronic health records,17 improving patient safety via automated laboratorybased adverse event grading,18 predicting the outcome of renal transplantation,19 guiding the treatment of hypercholesterolemia,11 making prognoses for patients undergoing surgical procedures,20 ,21 and estimating the success of assisted reproduction techniques.22 Multiple risk calculators for cardiovascular disease prediction are based on the Framingham study.13 Among the most popular prediction models, the logistic regression (LR)23 model is widely adopted in biomedical research, such as the Model for Endstage Liver Disease (MELD)24 and many other clinical applications,25–27 owing largely to its simplicity and the interpretability of the estimated parameters. In an LR model, the independent variables constitute a vector X of several variables that help classify a case into positive or negative as represented by the dependent binary variable Y. In order to do this, the LR model estimates coefficients for each of the dependent variables. For example, the classification of temperature (independent variable X) into ‘fever’ (dependent variable Y) may be done by an LR model and sufficient examples, such that the model ‘discovers’ that temperatures above 38°C (100.4°F) are associated with ‘fever’. The LR is based on a simple sigmoid function (see figure 1) and is backed by information theory,28 which provides a theoretical justification for its power.
The classic LR model, however, has limitations in operating on federated data sets, or distributed data, since the training phase (ie, learning of parameters) involves looking at all the data, which are usually located at a central repository. Data that are distributed across institutions have to be combined for the classic LR algorithm to work. Although sharing and dissemination can largely improve the power of the analysis,29 it is often not possible to combine distributed data due to concerns related to the privacy of individuals30 ,31 or the privacy of institutions.32–34 Such a scenario brings new challenges to the classic learning problem of the LR model. Although we, among others, have shown that certain machine learning models like boosting35 and support vector machines36 can be trained in a distributed fashion37–40 and produce accurate models, extending this advantage to LR requires a specialized strategy. A recent work to compute LR with Mapreduce41 is most relevant to our work, but its focus lies in parallelization for computational speed rather than for privacypreserving data analysis. Furthermore, it does not elaborate on how to provide evaluation indices for these models (eg, HosmerLemeshow goodnessoffit test or areas under the receiver operating characteristic (ROC) curve (AUCs)) in a distributed fashion. Researchers often refine their models for inclusion or exclusion of some predictors, variable transformations, and other preprocessing steps. Without evaluation strategies that can be privacypreserving, the value of performing LR in a distributed fashion is very limited. Previous work by other authors in privacypreserving distributed linear regression was based on vertical partitions of data: multiple data owners each had different attributes for the same observation.42 Our previous work in distributed support vector machines is also related to vertical partitions.40 In this manuscript, we propose a new algorithm, Grid Binary LOgistic REgression (GLORE), to fit a LR model in a distributed fashion using information from locally hosted databases containing different observations that share the same attributes (ie, horizontal partitions of data—stackable sets of patient records), without sharing the sensitive original patient data from these databases, as shown in figure 2. This is not a trivial task: distributed linear regression is much easier to implement than distributed logistic regression, since there is a closed form solution for the former but not for the latter.
Specifically, GLORE estimates the coefficients of an ordinary LR model by integrating decomposable intermediary results (and not the actual patient data) to calculate model parameters and test statistics. The resulting model is calculated in a privacypreserving manner and performs as well as classic LR. In the Methodology section, we will discuss details for estimating model coefficients, estimating the variancecovariance matrix of coefficients, performing a goodnessoffit test, and calculating the AUC in a distributed fashion. Commonly reported statistics for LR, including CIs, Z test statistics and their p values, and ORs can be obtained by using these estimated coefficients and their variancecovariance matrix.
Methodology
The LR model is an instance of a generalized linear model with a logit (ie, ) link function (illustrated in figure 1).
where , and denote probabilities of an event Y to be 1 (ie, ‘fever’) or 0 (ie, absence of ‘fever’), conditioned on observation of a feature vector X (eg, 101°F), respectively. The parameter vector β corresponds to the set of weights or coefficients that need to be estimated and that will be multiplied by the values for the features X (ie, Xβ) to make predictions.
Estimating model coefficients
In order to explain how GLORE works, it is important to remind readers how traditional LR works. To estimate the final value for the parameter vector β, an LR model iteratively maximizes the likelihood of obtaining X given an initial β. At each iteration, the algorithm produces , which is based on the previous . The initial β can start with all coefficients set to zero, eg, the estimated probability of is based on observations of a binary response Y and a feature vector X (ie, a set of predictors) from each of the data sites. To compute the maximum likelihood of getting the observed data, the LR estimation algorithm first finds the derivative of the log likelihood function, then applies the NewtonRaphson method43 to find its maximum, that is, the value of β for which the derivative function equals zero. For details of the log likelihood function, please refer to section 1 of the online supplementary appendix. We also explain how the NewtonRaphson method works in section 4 of the online supplementary appendix.
In each NewtonRaphson iteration for the LR parameter estimation problem, the first and second derivatives of log likelihood function are both decomposable, that is, they can be calculated separately for a subset of observations, and then combined, with the same result as if they were calculated on the complete set. Hence, a GLORE update can be finished by combining intermediary results from across all local sites. Please refer to the online appendix (sections 1 and 2) for technical details on model coefficient estimation.
Because intermediary results from individual sites do not lose any information, GLORE guarantees accurate estimation of parameters through summation. Note that the information exchanged consists of partially aggregated intermediary results rather than the raw data, hence they are more privacypreserving than would be the case if we transmitted all patient data to a central site.
Furthermore, since the calculations can be done in parallel, each step takes only as long as the maximum time for the sites (ie, the slowest site will determine how long each step takes). The time to transmit the intermediary results is usually negligible, as only one vector of coefficient adjustments needs to be sent. After setting the same initial values, at each iteration, GLORE uses the summation of partial intermediary results from multiple sites to update the coefficients and sends them back to the sites for another iteration.
A central engine is efficient in terms of computation, but the process could occur via peertopeer transmittal of intermediary results. Assuming that there are m features (ie, variables) available over multiple sites, at each iteration, intermediary results of a (m+1)×(m+1) matrix (ie, the variancecovariance matrix of coefficients) and a (m+1)dimensional vector (ie, the gradients for coefficient adjustment) from each site must be transmitted to the central engine or to all other sites, depending on the design choice. The GLORE framework with a central computing node can reduce the probability of network delays when compared to the GLORE framework based on a peertopeer architecture.
Estimating the variancecovariance matrix
Besides model coefficients, variancecovariance matrix estimation is also important, as it is necessary for the computation of the CIs of individual predictions.9 Like the model coefficient estimation, it can be done by integrating decomposable partial elements. Please refer to the online supplementary appendix (section 2) for technical details.
Estimating goodnessoffit test statistics
The Hosmer and Lemeshow (HL) test23 ,44 ,45 is commonly used to check model fit for LR. This section discusses how to perform an HL test to check for GLORE's fit, without sharing patient data. The null and alternative hypotheses of the HL test are that (1) ‘the model provides an adequate fit,’ and (2) ‘the model does not fit the data,’ respectively. That is, when the p values for this test are below 0.05, we can reject the hypothesis that the model fits the data well, meaning that the model is not well calibrated. To calculate the HL statistic, we have to sort cases by their predictions and create groups from which we establish a proxy for the ‘true probability’ (ie, the fraction of positive cases in the group). Let us denote O_{j} as the number of positive observations in the jth group and E_{j} as the sum of predictions in the jth group, respectively. The parameter n_{j} refers to the number of records in the jth group. In the box ‘Algorithm 1’ below, we introduce a procedure to obtain the HL test statistic for GLORE (ie, the C version of this test, that is based on percentiles to determine the groups, which is more robust than the H version of the test that is based on fixed thresholds to determine the groups46) without sharing patient data (ie, individual patient features or individual patient outcomes). In the following algorithm, class labels are not shared with all other sites or the central engine. Instead, only the total number of labeled records with outcome ‘1’ per group from a site needs to be transmitted to the central engine.
Computing the HL statistic for GLORE (C version)
Step 1. Each site transmits probability estimates, that is, 's for their observations to the central engine.
Step 2. The central engine sorts all 's and evenly groups the sorted 's into deciles*, and computes the estimated probability for each decile as the sum of predictions in that decile.
Step 3. The central engine sends the indices of predictions in each decile to their original sites.
Step 4. Each site finds the number of records labeled ‘1’ in each decile among its own records based on the indices from the central engine, and transmits this number to the central engine.
Step 5. The central engine combines the numbers of records labeled ‘1’ from all sites to obtain the total number of records labeled ‘1’ in each category, and, together with the results from step 2, computes the HL statistic.
*Deciles are commonly used in the HL C test, but other types of percentiles can be used depending on the size of the data set.
Estimating the area under the ROC curve (AUC)
The AUC47 is widely used to evaluate the discrimination performance of predictive models. To calculate the AUC, it is necessary to estimate the total number of pairs for which positive observations (ie, ‘one’labeled records) rank higher than negative observations (ie, ‘zero’labeled records). The box ‘Algorithm 2’ below shows the details of how AUCs are estimated without sharing individual patient features or patient outcomes. Besides transmissions between the central engine and all sites, this algorithm requires peertopeer communication to keep patient outcomes protected.
Computing the AUC using GLORE
Step 1. Each site transmits probability estimates, that is, ’s of their observations to all other sites.
Step 2. Each site finds the ranking* of each predicted probability transmitted from all other sites among the zerolabeled records in this site, and sends the rankings of these probabilities back to their original sites.
Step 3. Each site calculates the rank sum for each of its onelabeled records using the ranks sent back from all other sites.
Step 4. Each site finds the ranking* of each of its onelabeled records among the zerolabeled records in this site.
Step 5. Each site computes the sums of the ranks regarding its own onelabeled records using the intermediary results from steps 3 and 4.
Step 6. Each site sends rank sums from step 5 and counts of their own onelabeled and zero labeled records to the central engine.
Step 7. The central engine computes the AUC as the summation of all rank sums from step 6 divided by the product of the total onelabeled and zerolabeled records.
*Ranking is the sum of the number of zerolabeled records with smaller predictions and half the number of zerolabeled records with equal predictions.
In figure 3 we use a simple artificial example with only two sites (A and B) to explain how algorithm 2 works. and O^{A} indicate predicted probabilities and class labels from site A, and and O^{B} are predicted probabilities and class labels from site B. Note that in procedures 1 and 5, only predicted probabilities (ie, no class labels) are sent to the other site, as in our previous strategy for HL tests. The AUC, which is equivalent to the cindex,48 can be calculated in three steps: (1) count all onelabeled records that have predictions that are larger than the predictions across all zerolabeled records; (2) count all onelabeled records that have predictions that are equal to the predictions across all zerolabeled records; and (3) sum the counts of step (1) and half the counts of step (2) and divide this number by the total number of pairs formed by zerolabeled and onelabeled records.47
Sometimes, we might also want to display an entire ROC curve instead of calculating a single AUC score. In this case, using as threshold each prediction , a full contingency table (ie, true positive, false positive, true negative, and false negative) can be calculated for each threshold. The ROC curve results from the points (1−specificity, sensitivity) that are calculated from each of these contingency tables. Please refer to Zou et al49 and a review article by Lasko et al50 for details on ROC curves. In GLORE, one site needs to send all predictions and their corresponding contingency tables to the central engine. The central engine then needs to merge the information to compute the sensitivity and specificity at all thresholds. It is worth noting that, although this procedure is straightforward, it may lead to more privacy leak than algorithm 1, since class labels (ie, patient outcomes) associated with predicted probabilities can be recovered by the central or peer site, depending on the strategy selected.
Remark
We verified in both simulated and clinical data experiments that the proposed GLORE will produce the same estimated coefficients, that is, with precision 0 (10^{−15}), together with accurate variancecovariance matrix estimation, goodnessoffit statistics, and AUC, when compared to the classic LR trained in a centralized manner. It is also worth noting that, although the GLORE coefficient estimation process needs to transmit intermediary results in each NewtonRaphson iteration, usually a small number (<15) of iterations is necessary to achieve high precision such as O(10^{−6}). After the parameter estimation is done, only a onetime data transmission is needed for estimating the variancecovariance matrix, computing the model fit statistic, and computing the AUC.
Experiments
We used the statistic computing language R to conduct our experiments with simulated data in which the true generating model is known, and also on clinical data to validate GLORE.
Simulation study
In a simulation study, we compared twosite GLORE (assuming data are evenly partitioned between two sites) and ordinary LR (combining all data for computation). We used a total sample size of 1000 (500 for each site) and nine features (ie, variables). First, we simulated all features from a standard normal distribution, then simulated the response from a binomial distribution assuming that the log odds of the response being 1 was a linear function of features (ie, all coefficients were set to 1). We conducted the study on 100 runs to compare coefficient estimation difference between GLORE and LR for the same simulated data. This simple study shows that the number of NewtonRaphson iterations to convergence is always six when 10^{−6} precision is set for the iteration stop criterion.
Table 1 shows the mean absolute difference between twosite GLORE and LR estimations for all 10 coefficients (nine features plus one intercept) at each iteration, where the mean is calculated for 100 runs. There are no substantial differences between estimations from GLORE and LR for all coefficients at all iterations. We also pick one of the 100 runs to graphically present the convergence paths of the estimations for three coefficients (intercept, X1, and X2) in figures 3 and 4, showing that there is no difference between two convergence paths for these three coefficients.
Experiments using a clinical data set
Our clinical data set is related to myocardial infarction at Edinburgh, UK,51 which has one binary outcome, 48 features and 1253 records, and was used to illustrate GLORE with two sites. Records are evenly partitioned (627 vs 626) between the two sites. We picked nine nonredundant features in this data set using methods described in Kennedy et al51 for GLORE fitting. We also used another data set,49 which contains two cancer biomarkers (CA19 and CA125). This data set has 141 records, one binary outcome denoting the presence or absence of cancer, and two features denoting the two cancer markers. The 141 records were split into 71 and 70 for two sites. Tables 2 and 3 show coefficient estimates and their standard errors, Z test statistics and p values for the Edinburgh data and for the CA19 and CA125 cancer marker data, respectively. Using algorithm 1, the HL test statistic equals 8.036 with a p value 0.430 for the Edinburgh data, and the HL test statistic equals 3.510 with a p value 0.898 for the CA19 and CA125 data, which are no different from the results obtained from traditional centralized LR models. Seven and 12 NewtonRaphson iterations were needed for convergence with 10^{−6} precision for the Edinburgh data and the CA19/CA125 data, respectively. In addition, using algorithm 2, we found that the AUCs were 0.965 and 0.891 for the Edinburgh data and for the CA19 and CA125 data, respectively, which are no different from the results obtained from traditional centralized LR models.
Discussion
Our study shows that the proposed GLORE framework can use intermediary results that do not contain patient data from multiple local sites to build an accurate prediction model without increasing the computation cost. This is important in situations in which data cannot leave a particular institution, but the institution still wants to contribute to the development of a predictive model. For the study of rare diseases and many other situations in which polling data from multiple sources has the potential to improve the power of statistical analyses and generalizability of predictive models, developing techniques that allow shared analyses without necessarily allowing collaborating parties to see each other's data is very important. Many authors have reported on federated queries across distributed clinical data warehouses,52 ,53 and how the results of some of these queries need to be transformed to protect the privacy of the individuals in these data sets.3 Sharing intermediary results calculated at each site is less prone to privacy compromise, although further studies of the privacy risk involved with the use of this strategy are certainly warranted.
In any modelbuilding task, significant preprocessing of data needs to be performed. GLORE should follow the same general guidelines as those recommended for ordinary LR. For example, redundant predictors should be removed; categorical predictors need to be converted to a set of dummy variables based on the number of categorical values (this can be done by statistical software or manually). Furthermore, preprocessing operations done at all sites need to ensure that the resulting data are comparable across the sites.
The data need to be harmonized (at the syntactic and semantic levels) before GLORE can be successfully applied. Data harmonization can be challenging even within a single institution. Creating new data models to suit the purposes of a particular application may be tempting, but researchers should consider the mapping of concepts to an information model and standardized terminology to make the model extensible and the effort in data harmonization reusable. The rate of missing values in a particular site has to stay below an agreed level, and data imputation techniques need to be jointly discussed. Researchers should agree on a minimal set of variables that can be easily mapped and are expected to have sufficient predictive ability or usefulness in adjusting for confounders (as per automated multivariate feature selection procedures and/or expert opinion). It is advisable that the investigators exchange descriptive statistics and test out their environments using artificial data. Expert opinion should be sought as to whether it makes sense to combine data if the data from different institutions were collected under different circumstances and employed different assumptions. A trivial case, for example, occurs when one institution had a binary variable named ‘outofcontrol diabetes,’ and another has actual continuous HbA1C values: it is important to determine how ‘outofcontrol’ was determined before trying to turn the continuous value for HbA1C into a binary variable. It may not make sense to combine the data if the determination of ‘outofcontrol’ was not clear. The security settings need to be agreed upon. Firewalls, authentication protocols, and other security aspects need to be extensively tested before implementation. The robustness of network connections and intermediary result transmission needs to be tested for realtime computation, or an asynchronous process needs to be established.
The combination of data from different sites is often desirable in surveillance systems, in which low signals need to be detected from noisy data with the help of a large number of samples. The proposed system is applicable when the institutions are collaborating in such systems but cannot share individual level data. As noted above, the method is not applicable when it is not possible to harmonize the data across institutions, when the data have too many missing values, when the descriptive statistics suggest strong sitespecific patterns that cannot be adjusted for, or when the security and network infrastructures do not allow for reliable realtime computation (although an asynchronous version of the proposed infrastructure could be used in this case).
Limitations
We have not compared the added value of models derived from the combination of data from different sites with models derived from a single site. The combination will only add value if the data from the additional site are representative of the population on which the model will be applied. Additionally, there are communication costs at each iteration, and unreliable connections may lead to interrupted analyses that can only be resumed when connections are restored. Furthermore, the privacy preserving qualities of GLORE have not been fully investigated in a rigorous framework such as differential privacy.54 It is possible that, for very small data sets at a particular institution, privacy can be compromised by releasing partial elements or prediction values obtained at those institutions. This problem would be more exacerbated in the distributed calculation of the HL test, and even worse for ROC curve plotting, as discussed before.
Conclusion
We showed that a LR model performed in a distributed fashion provides the same results as a conventional LR model performed centrally. This has implications in terms of preservation of individual privacy and may facilitate construction of predictive models across institutions that have limited ability to actually share patient data. GLORE is not a panacea for multiple obstacles that exist for researchers to collaborate, but provides a reliable solution for the problem of having too few cases to construct and evaluate a predictive model at a single institution.
Acknowledgments
We thank Dr. Hamish Fraser and Dr. Kelly Zou for providing the clinical data, and Mr. Kiltesh Patel for helpful discussions. We thank two anonymous reviewers for their feedback, which helped us improve the original manuscript.
Footnotes

Funding The authors were funded in part by NIH grants R01LM009520, U54HL108460, R01HS019913, and UL1RR031980.

Competing interests None.

Provenance and peer review Not commissioned; externally peer reviewed.
This is an openaccess article distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/bync/2.0/ and http://creativecommons.org/licenses/bync/2.0/legalcode.