Transcriptional network predicts viral set point during acute HIV-1 infection
- Hsun-Hsien Chang1,
- Kelly Soderberg2,
- Jason A Skinner3,
- Jacques Banchereau3,
- Damien Chaussabel3,
- Barton F Haynes2,
- Marco Ramoni1,
- Norman L Letvin4,5
- 1Children's Hospital Informatics Program, Harvard-MIT Division of Health Sciences and Technology, Harvard Medical School, Boston, Massachusetts, USA
- 2Duke Human Vaccine Institute, Duke University, Durham, North Carolina, USA
- 3Baylor Institute for Immunology Research, Dallas, Texas, USA
- 4Department of Medicine, Harvard Medical School, Boston, Massachusetts, USA
- 5Division of Viral Pathogenesis, Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA
- Correspondence to Dr Hsun-Hsien Chang, Children's Hospital Informatics Program, Harvard Medical School, 300 Longwood Ave, Enders 144, Boston, MA 02115, USA;
Contributors HHC designed the method, conducted the analysis, and prepared the manuscript; KAS collected Malawian samples; JAS, JB, and DC collected samples for building up transcriptional network models; BFH, MFR, and NLL directed the study and prepared the manuscript.
- Received 20 January 2012
- Accepted 14 May 2012
- Published Online First 14 June 2012
Background HIV-1-infected individuals with higher viral set points progress to AIDS more rapidly than those with lower set points. Predicting viral set point early following infection can contribute to our understanding of early control of HIV-1 replication, to predicting long-term clinical outcomes, and to the choice of optimal therapeutic regimens.
Methods In a longitudinal study of 10 untreated HIV-1-infected patients, we used gene expression profiling of peripheral blood mononuclear cells to identify transcriptional networks for viral set point prediction. At each sampling time, a statistical analysis inferred the optimal transcriptional network that best predicted viral set point. We then assessed the accuracy of this transcriptional model by predicting viral set point in an independent cohort of 10 untreated HIV-1-infected patients from Malawi.
Results The gene network inferred at time of enrollment predicted viral set point 24 weeks later in the independent Malawian cohort with an accuracy of 87.5%. As expected, the predictive accuracy of the networks inferred at later time points was even greater, exceeding 90% after week 4. The composition of the inferred networks was largely conserved between time points. The 12 genes comprising this dynamic signature of viral set point implicated the involvement of two major canonical pathways: interferon signaling (p<0.0003) and membrane fraction (p<0.02). An in silico knockout study showed that HLA-DRB1 and C4BPA may contribute to restricting HIV-1 replication.
Conclusions Longitudinal gene expression profiling of peripheral blood mononuclear cells from patients with acute HIV-1 infection can be used to create transcriptional network models that predict viral set point early and with a high degree of accuracy.
The best established parameter for predicting the rate of clinical progression in HIV-1-infected individuals is the set point plasma virus RNA level,1 usually called viral set point. By a few weeks after infection, peak virus replication is reached in most infected patients. As immune responses are mobilized, partial containment of HIV-1 occurs and a steady-state level of virus replication is reached. This viral set point is associated with the rate of disease progression.1 A high viral set point is associated with rapid disease progression, while a lower viral set point is associated with slower disease progression.
A delineation of the mechanisms that contribute to establishing the set point virus load in HIV-1-infected individuals would add important information to our understanding of how HIV-1 replication is contained. Studies in HIV-1-infected humans and simian immunodeficiency virus (SIV)-infected rhesus monkeys have implicated virus-specific CD8+ cytotoxic T lymphocytes in the early control of AIDS virus replication.2–4 Genome-wide association studies have underscored the importance of this cellular immune response in early HIV-1 control by demonstrating the contribution of MHC genes in determining viral set point.5,6 However, these genome-wide association studies also suggest that other, as yet undefined, factors contribute to early HIV-1 control.
The set point plasma HIV-1 RNA level can provide a useful clinical tool for determining the timing for initiating anti-retroviral therapy for infected individuals. For example, patients having high set point values can be started on aggressive anti-retroviral therapy and patients having low set point values can be monitored without initiating therapy. However, if an acutely-infected patient presents to a physician before set point virus replication is reached, the appropriate therapy for an infected individual can be difficult to determine. Therefore, a means of predicting set point plasma virus RNA levels in a recently infected patient would provide a useful tool for establishing a treatment strategy for that individual.
RNA microarray technology provides both a powerful tool for exploring mechanisms underlying biologic phenomena and a means of categorizing those phenomena into groups with differing clinical outcomes.7 Therefore, the whole blood RNA transcriptional profile of individuals during the period of acute HIV-1 infection may provide a gene expression signature that may be associated with particular clinical sequelae. The present study was done to determine computationally whether the expression of a limited network of genes by peripheral blood mononuclear cells sampled from HIV-1-infected individuals early after exposure to the virus can predict viral set point.
Transcriptional network predictors of viral set point
Viral set point predictors were developed from the longitudinal gene expression data from the first cohort of untreated, acutely HIV-1-infected individuals. We collected the peripheral blood mononuclear cells of the patients over a course of 24 weeks: at enrollment and at weeks 1, 2, 4, 12, and 24. At each time point, we used a multivariate statistical analysis, known as Bayesian networks, to infer a transcriptional network comprising viral load and transcripts, and then identified the dependency pattern between the network and viral set point. Figure 1 shows the inferred networks at each time, where an arrow represents the target node's dependency on its source node. The network models were used to predict viral set point by substituting the viral load and the expression levels of the genes in the networks. Table 1 reports the predicted viral set points of individual patients obtained from leave-one-out cross-validation calculations. Table 2 summarizes the predictive accuracy, and shows that the transcriptional network models can achieve at least 90% accuracy.
In addition to this leave-one-out cross-validation, we applied the network models to predict viral set points of an independent Malawian cohort. Table 1 reports the predicted viral set points of individual subjects, and table 2 presents the predictive accuracy at specified times. The independent validation achieves 87.5% predictive accuracy in the early weeks and over 90% accuracy in the later weeks following infection.
Signature genes of transcriptional network predictor
Table 3 lists the signature genes associated with the set point viral load prediction at specified times. Only a small number of genes, in association with viral load, predicted the set point viral load. There were more signature genes at early times following infection (from enrollment to week 2) than from week 4 to week 24. Therefore, in the early weeks following infection, when viral load was a less dominant predictor of set point, more transcripts interacting with the virus had to be taken into account to achieve an accurate prediction of set point viral load.
There are 12 genes in these networks (C4BPA, CCNB2, CYP1B1, HLA-DRB1, IFI27, IFIT1, LOC649210, MMP9, OAS1, OSBP2, OTOF, TYMS), 6 of which (C4BPA, HLA-DRB1, IFI27, MMP9, OSBP2, TYMS) appear at more than one time point. For example, C4BPA appears 3 times (weeks 1, 2, and 12). The pathway analyses performed using DAVID Bioinformatics Resources8 and Ingenuity Pathways Analysis (Ingenuity® Systems, http://www.ingenuity.com) show that associations with interferon signaling (p<0.0003) and the membrane fraction (p<0.02) are over-represented in this collection of signature genes. Furthermore, 9 of the 12 signature genes (C4BPA, CCNB2, CYP1B1, HLA-DRB1, IFI27, IFIT1, MMP9, OAS1, TYMS) are related to each other through 4 hubs: hydrogen peroxide, IFNG, TGFB1, and TNF (figure 2). The crucial mechanisms associated with these genes are cell-cell signaling and interaction (p<0.001), cell cycle (p<0.001), molecular transport (p<0.001), immune cell trafficking (p<0.001), and hematological system development and function (p<0.01).
In silico knockout study
The viral set point predictors allowed us to conduct a computational knockout study. Knocking out a gene was computationally equivalent to setting the expression level of the gene to zero in the model. We knocked out one signature gene at a time, and then predicted viral set point. Figure 3 illustrates the results, where each curve depicts the distribution of predicted viral set points when the corresponding gene was eliminated. When the predicted viral set point is greater than the actual value, it implies that the gene knockout inhibits HIV-1 replication; similarly, a predicted viral set point smaller than the actual value indicates that the gene knockout promotes HIV-1 replication. The results at early times (ie, enrollment and week 1) show that HLA-DRB1 and C4BPA are able to decrease HIV viremia; this finding is consistent with the ability of HLA genes to restrict AIDS progression.9,10
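The knockout procedure can be sketched in a few lines. The example below is only an illustration under stated assumptions: it stands in a simple linear predictor for the network model, and the names `predict_set_point`, `knock_out`, `GENE_A`, and all coefficient values are hypothetical and do not come from the study.

```python
def predict_set_point(intercept, coefficients, profile):
    """Predict viral set point from a linear model (stand-in for the network).

    coefficients and profile are dicts keyed by variable name; values are
    assumed to be on the log10 scale. All numbers used here are hypothetical.
    """
    return intercept + sum(coefficients[name] * profile[name]
                           for name in coefficients)


def knock_out(profile, gene):
    """Return a copy of the profile with one gene's expression set to zero."""
    if gene not in profile:
        raise KeyError(gene)
    knocked = dict(profile)  # copy so the original profile is left untouched
    knocked[gene] = 0.0
    return knocked


if __name__ == "__main__":
    # Hypothetical model and patient profile, for illustration only.
    coefficients = {"viral_load": 0.5, "GENE_A": -0.2}
    profile = {"viral_load": 4.0, "GENE_A": 2.0}
    baseline = predict_set_point(1.0, coefficients, profile)
    after = predict_set_point(1.0, coefficients, knock_out(profile, "GENE_A"))
    print("baseline:", baseline, "after knockout:", after)
```

Comparing the prediction before and after the knockout, one gene at a time, gives the kind of shift in predicted set point that figure 3 summarizes as a distribution.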
Comparison between transcriptional network and viral load alone models
A viral load measurement correlated with viral set point with increasing accuracy as the time of that measurement began to approximate the time of the steady-state level of viral replication in an individual.11 We created a univariate regression model that uses viral load alone to predict set point. A graphical representation of the regression model in which viral load is the single variable determining viral set point is shown in figure 1G. Table 2 also reports the predictive accuracy for viral set point generated using the regression models. At early times following infection (ie, enrollment, and weeks 1 and 2), the transcriptional network models predicted set point with 5% greater accuracy than the viral load alone model (p<0.05). As expected, when the time following infection more closely approximated chronic infection (weeks 4, 12, and 24), the predictive accuracy of the two types of models was comparable.
Comparison between transcriptional network and correlating predictors
We further contrasted our transcriptional network with correlating predictors. Correlating predictors, or correlation-based predictors, are conventional techniques that search for the transcripts whose expression levels correlate statistically significantly with a continuous outcome and are able to predict that outcome. We used a statistical package in Matlab (MathWorks, Natick, Massachusetts, USA) to find the correlating transcripts of viral set point at each time. The results are summarized in table 4. We noted that only three transcripts (IFI27, HLA-DRB5, MMP9) were selected for viral set point prediction. However, the correlating predictors have worse predictive accuracy than the transcriptional network model (p<0.005).
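As a rough illustration of what a correlation-based predictor search does, the sketch below ranks transcripts by the absolute Pearson correlation between their expression and the outcome. It is not the Matlab procedure used in the study; the function names `pearson` and `top_correlated`, the transcript labels, and the toy numbers are assumptions made for this example, and no significance test is included.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def top_correlated(expression, outcome, k=3):
    """Return the k transcript names with the largest |correlation| to outcome.

    expression maps a transcript name to its per-subject expression values.
    """
    scores = {name: pearson(values, outcome)
              for name, values in expression.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:k]


if __name__ == "__main__":
    # Toy data with hypothetical transcript labels T1..T3.
    expression = {
        "T1": [1.0, 2.0, 3.0, 4.0],
        "T2": [4.0, 3.0, 2.0, 1.5],
        "T3": [1.0, 3.0, 2.0, 2.0],
    }
    set_points = [2.0, 4.0, 6.0, 8.0]
    print(top_correlated(expression, set_points, k=2))
```

A ranking of this kind treats each transcript independently, which is the main contrast with the network approach, where transcripts and viral load are modeled jointly.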
Various strategies for predicting viral set point in HIV-1-infected humans and SIV-infected or simian HIV (SHIV)-infected rhesus monkeys have been explored. The peak plasma SIV RNA levels during primary infection are associated with set point levels in rhesus monkeys, with high peak values predicting high set point values.12 The magnitude of a vaccine-elicited CD8+ cytotoxic T lymphocyte response has also been shown to predict the post-challenge set point plasma virus RNA level in rhesus monkeys following a pathogenic CXCR4-tropic SHIV challenge.13 The utility of these predictive values in a clinical setting is, however, limited because of the difficulty of ascertaining when an individual became infected with HIV-1 and determining the magnitude of a cytotoxic T lymphocyte population.
The early containment of an AIDS virus infection is mediated to a significant extent by CD8+ cytotoxic T lymphocytes. This has been shown most dramatically in SIV-infected rhesus monkeys, in which depletion of CD8+ lymphocytes by infusion of a monoclonal anti-CD8 antibody eliminates early virus containment, and animals die before they reach a set point level of virus replication.14 It is therefore likely that many of the signature gene transcripts are expressed in CD8+ T lymphocytes or cells that regulate these effector lymphocytes.
A number of the 12 transcripts associated with viral control in the present study have been implicated in HIV-1-human interactions. Some (IFI27, IFIT1, MMP9, and OAS1) have been shown to be induced by HIV-1 infection of cell lines in vitro.15–17 Some have been reported to interact directly with HIV-1 proteins: CCNB2 interacts with HIV-1 Vpr in the induction of cell cycle arrest; the expression of IFI27, MMP9 and OAS1 is upregulated by Tat.16,18–20 The group of transcripts also includes interferon inducible genes with well documented antiviral activity: IFI27, IFIT1 and OAS1.15–17 Interestingly, HLA-DRB1*1303 has been associated with decreased virus load as well as strong, polyfunctional mucosal CD4+ T cell responses in HIV-1-infected individuals.21,22 Others of these transcripts have not previously been associated with HIV-1 biology or infections, and two have no previously reported functions.
A microarray assay focused on the signature transcripts defined in this study might be used in a clinical setting. Rather than employing an array that monitors 44 000 gene transcripts like the one used in this study, a microarray assay focused on the limited number of signature genes defined in the present study could be devised. Such an assay might then be used in association with other clinical data to determine whether a newly diagnosed HIV-1-infected patient should receive immediate treatment or be monitored until the progression of disease warrants instituting treatment.
Materials and methods
A cohort of 10 acutely HIV-1-infected patients was enrolled from 2 USA and 6 African sites. In this cohort, there were 2 Caucasians, 1 African American, 2 African Blacks, 2 African Chewas, 1 African Ngoni, 1 African Lomwe, and 1 African Tumbuka. At enrollment they were verified as acute Fiebig stages 4 to 6 (plasma RNA+, third generation EIA+, Western blot indeterminate or +).23 Follow-up samples were collected from patients at 1, 2, 4, 12, and 24 weeks post-enrollment. All patients were untreated throughout the 24-week period of study.
A second cohort of 10 untreated Malawians with acute HIV-1 infection in Fiebig stages 4–6 was enrolled in the study as an independent validation set. Blood samples from these patients were collected at enrollment and at the same intervals as the first cohort (at 1, 2, 4, 12, and 24 weeks after enrollment).
Whole blood sample collection and microarray hybridization
Whole blood was collected using standardized conditions into Tempus vacutainer tubes. Total RNA was isolated from lysates of the whole blood.24 All samples passing quality control were then amplified and labeled using the Illumina TotalPrep-96 RNA amplification kit. Amplified RNA was then hybridized to Illumina HT-12 V3 beadchips and scanned on an Illumina Beadstation 500 according to the protocol detailed by the manufacturer (http://www.illumina.com). The gene expression of the first cohort was assessed at Baylor Institute for Immunology Research, and the second cohort was assessed at Duke University.
Viral set point measurement
Viral set point, the steady-state viral load after acute infection, was determined for all subjects by an experienced infectious-disease clinician following the standard protocol.5
Statistical analysis of transcriptional network
To model the interactions among transcripts, viral load, and viral set point, we carried out a multivariate dependency analysis using dynamic Bayesian networks.25,26 Such an analysis has already been applied to several types of genomic data including gene regulation,27 protein-protein interactions,28 SNPs29,30 and pedigrees.31 A Bayesian network is a directed acyclic graph in which nodes represent random variables and arcs define directed dependencies quantified by conditional probability distributions. Besides inferring transcriptional interactions, Bayesian networks can be exploited for prediction. Thus, we reasoned that a network capturing the relationship between transcripts and viral set point might be used to compute the most probable quantity for viral set point when given the levels of an individual's expression of these transcripts.
Our analysis began by selecting gene transcripts with at least a twofold change in expression with respect to enrollment over the entire 24 weeks of the study. We then log10-transformed all data, including gene expression, viral loads and viral set points. Finally, we analyzed the data using a Bayesian network method. Our aim was to search for the most probable network of gene-gene dependency for each time point. To find such a network, the analysis explored a space of different network models, scored each model by its posterior probability conditional on the available data, and returned the model with maximum posterior probability. This probability was computed by Bayes theorem as $p(M|D) \propto p(D|M)\,p(M)$, where p(D|M) is the probability that the observed data D are generated from the network model M and p(M) is the prior probability encoding knowledge about the model M before seeing any data. We assumed that all models were equally likely a priori, so p(M) is uniform and p(M|D) becomes proportional to p(D|M), a quantity known as marginal likelihood. The marginal likelihood averages the likelihood functions for different parameter values and is calculated as $p(D|M) = \int p(D|\theta)\,p(\theta)\,d\theta$, where p(D|θ) is the traditional likelihood function and p(θ) is the prior probability density of parameters. When we assumed the log-transformed data to be Gaussian, p(D|M) has a closed-form solution.25 We used a greedy search algorithm25 with order permutation to identify the most probable network model with highest marginal likelihood p(D|M).
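The two preprocessing steps that precede the network search (fold-change filtering and log10 transformation) can be sketched as follows. This is a minimal illustration, not the study's code. It assumes that "twofold change" means a ratio to the enrollment value of at least 2 or at most 0.5 at any later time point, and that each transcript is represented by one series of positive expression values ordered by time; the function names and toy values are hypothetical.

```python
import math


def has_twofold_change(series, fold=2.0):
    """True if any later value differs from the first (enrollment) value
    by at least `fold` in either direction (assumed interpretation)."""
    if len(series) < 2:
        raise ValueError("need enrollment plus at least one later value")
    if any(value <= 0 for value in series):
        raise ValueError("expression values must be positive")
    baseline = series[0]
    return any(value / baseline >= fold or value / baseline <= 1.0 / fold
               for value in series[1:])


def select_and_transform(expression, fold=2.0):
    """Keep transcripts passing the fold-change filter and log10-transform them.

    expression maps a transcript name to its time-ordered expression values.
    """
    return {name: [math.log10(value) for value in series]
            for name, series in expression.items()
            if has_twofold_change(series, fold)}


if __name__ == "__main__":
    # Toy values with hypothetical transcript labels.
    expression = {
        "A": [100.0, 110.0, 250.0],  # rises 2.5-fold: kept
        "B": [100.0, 120.0, 90.0],   # stays within twofold: dropped
        "C": [100.0, 40.0, 100.0],   # falls to 0.4-fold: kept
    }
    print(sorted(select_and_transform(expression)))
```

Filtering before the search shrinks the space of candidate networks that the greedy algorithm has to score.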
Evaluation of predictive accuracy
Viral set point is a continuous quantity that we want to predict. The popular method to evaluate the prediction of continuous variables is based on root-mean-square-error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2}$, where N is the number of subjects, and $\hat{y}_i$ and $y_i$ are the predicted and actual values, respectively, for subject i. However, this method does not explain the degree of error/accuracy. When two distinct cohorts share the same predictive RMSE, the cohort with higher actual values has greater predictive accuracy than the other. To better evaluate the predictive accuracy, we define the accuracy A in this paper as follows: $A = 1 - \mathrm{NRMSE}$, where NRMSE is the normalized RMSE. Unlike RMSE, the normalized RMSE encodes the error rate, and helps us better quantify the predictive accuracy.
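A small sketch may make the accuracy measure concrete. It rests on two assumptions that the text does not spell out: accuracy is taken as one minus the normalized RMSE, and the RMSE is normalized by the mean of the actual values. Other normalizations (for example by the range) are also common, so the numbers below are illustrative only.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between paired predicted and actual values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))


def accuracy(predicted, actual):
    """Accuracy as 1 - NRMSE.

    ASSUMPTION: NRMSE is the RMSE divided by the mean of the actual values.
    """
    mean_actual = sum(actual) / len(actual)
    if mean_actual == 0:
        raise ValueError("mean of actual values must be non-zero")
    return 1.0 - rmse(predicted, actual) / mean_actual


if __name__ == "__main__":
    # Two toy cohorts with the same RMSE (0.5) but different actual levels.
    print(accuracy([4.5, 5.5], [5.0, 5.0]))
    print(accuracy([3.5, 2.5], [3.0, 3.0]))
```

The two toy cohorts share an RMSE of 0.5, yet the cohort with higher actual values scores higher, which is the behavior the paragraph above argues for.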
We used the first data set to infer a transcriptional network model. To assess the robustness of the transcriptional network for sampling variability, we used a leave-one-out cross-validation strategy: a single observation from the data was treated as the validation sample, the remaining observations were used to re-estimate the parameters of the network model, and the newly parameterized model was used to predict the viral set point of the validation sample. This process was repeated until each observation in the data set was used once as the validation sample. We then evaluated the predicted set points by mean error rates deviating from the true values.
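The leave-one-out loop described above can be sketched as follows. To keep the example short, a univariate least-squares line stands in for the network model, so the sketch re-estimates only a slope and an intercept at each fold; the function names and toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def leave_one_out_predictions(xs, ys):
    """Predict each observation from a model refit on all the others."""
    predictions = []
    for i in range(len(xs)):
        # Hold out observation i and re-estimate parameters on the rest.
        train_x = list(xs[:i]) + list(xs[i + 1:])
        train_y = list(ys[:i]) + list(ys[i + 1:])
        intercept, slope = fit_line(train_x, train_y)
        predictions.append(intercept + slope * xs[i])
    return predictions


if __name__ == "__main__":
    # Toy log10 viral loads and set points lying exactly on y = 1 + 2x.
    loads = [1.0, 2.0, 3.0, 4.0]
    set_points = [3.0, 5.0, 7.0, 9.0]
    print(leave_one_out_predictions(loads, set_points))
```

Each prediction comes from a model that never saw the held-out subject, so the resulting errors estimate how the model would behave on new patients from the same population.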
To confirm the results, we used the transcriptional network models learned from the first data set to predict the viral set points of 10 Malawian patients who were not included in the network learning process. The validation was performed by quantile-normalizing the Malawian data to the first data, followed by log10-transformation and viral load prediction. The measure of predictive accuracy was the same as in the leave-one-out cross-validation.
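Quantile normalization to a reference can be sketched as a rank-based substitution. The example assumes that the reference is a single vector of values from the first data set with the same number of probes as the sample being normalized, and that ties are broken by position; the study's actual reference distribution and tie handling are not specified here, so treat this as one plausible reading.

```python
def quantile_normalize_to_reference(sample, reference):
    """Replace each sample value with the reference value of the same rank.

    ASSUMPTIONS: sample and reference have equal length; ties in the sample
    are broken by original position (Python's sort is stable).
    """
    if len(sample) != len(reference):
        raise ValueError("sample and reference must have the same length")
    reference_sorted = sorted(reference)
    # Indices of the sample ordered from smallest to largest value.
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    normalized = [0.0] * len(sample)
    for rank, index in enumerate(order):
        normalized[index] = reference_sorted[rank]
    return normalized


if __name__ == "__main__":
    # Toy values: the sample keeps its ordering but takes reference values.
    print(quantile_normalize_to_reference([5.0, 1.0, 3.0], [10.0, 30.0, 20.0]))
```

After this step the validation sample has the same value distribution as the reference, which makes expression levels from the two cohorts comparable before the model is applied.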
Funding This work was supported by NIH grant 5U19A1067854-05.
Competing interests None.
Patient consent Obtained.
Ethics approval Ethics approval was provided by Duke Human Vaccine Institute, Duke University.
Provenance and peer review Not commissioned; externally peer reviewed.