Multiplex meta-analysis of RNA expression to identify genes with variants associated with immune dysfunction
- 1Biomedical Informatics Graduate Training Program and Division of Systems Medicine, Department of Pediatrics, Stanford University, Stanford, California, USA
- 2Laboratory of Clinical Infectious Diseases, National Institute of Allergy and Infectious Diseases, Bethesda, Maryland, USA
- 3Division of Immunology, Department of Pediatrics, Stanford University, Stanford, California, USA
- Correspondence to Dr Atul Janardhan Butte, Division of Systems Medicine, Department of Pediatrics, Stanford University, 1265 Welch Road X-163 MS-5415, CA 94305-5415, USA;
- Received 22 October 2011
- Accepted 29 December 2011
Objective We demonstrate a genome-wide method for the integration of many studies of gene expression of phenotypically similar disease processes, a method of multiplex meta-analysis. We use immune dysfunction as an example disease process.
Design We use a heterogeneous collection of datasets across human and mice samples from a range of tissues and different forms of immunodeficiency. We developed a method integrating Tibshirani's modified t-test (SAM) is used to interrogate differential expression within a study and Fisher's method for omnibus meta-analysis to identify differentially expressed genes across studies. The ability of this overall gene expression profile to prioritize disease associated genes is evaluated by comparing against the results of a recent genome wide association study for common variable immunodeficiency (CVID).
Results Our approach is able to prioritize genes associated with immunodeficiency in general (area under the ROC curve = 0.713) and CVID in particular (area under the ROC curve = 0.643).
Conclusions This approach may be used to investigate a larger range of failures of the immune system. Our method may be extended to other disease processes, using RNA levels to prioritize genes likely to contain disease associated DNA variants.
One of the major goals of translational bioinformatics is to equip clinical medicine with the ability to use information about a patient's genome for diagnosis and decision-making. Several commercial companies provide disease risk information according to individual genotype,1 and other approaches have been developed and used to analyze and interpret high-depth patient sequence data to guide treatment2 3; these are all instances of personalized medicine approaches that integrate genomics. Genotyping using arrays is a well-established commercial service, and the technology exists to provide high-depth sequencing and coverage at a cost comparable to many commonly used diagnostic tests4; basic techniques exist for using these data in a medically relevant fashion.5 6 However, genomic medicine relies on knowledge of the genetic basis of disease, and although methods such as genome-wide association studies using genotyping arrays have become the gold standard for discovering and exploring genetic variations, they have had a relatively poor success rate in explaining the chief genetic contributions to the heritability of many major, common diseases; this is known as the ‘missing heritability’ problem.7–9 We need additional tools to accelerate the process of uncovering the causal variations giving rise to pathology.10 Unfortunately, targeted candidate gene association studies have had a notoriously poor rate of replication in contrast to the much less biased genome-wide approaches.11 12 An approach that can prioritize targeted portions of the genome implicated in disease association in an unbiased, genome-wide manner, based on data-driven, functional properties would provide a powerful tool for future medical genomics. In this paper, we describe such a method and demonstrate its ability to prioritize the genes with variants implicated in a specific form of immunodeficiency, ‘common variable immunodeficiency’ (CVID), characterized by patient inability to produce sufficient antibodies.
Many common, multifactorial diseases include an infectious, autoimmune or inflammatory component; the mammalian immune response is a very finely tuned, highly complex system with hundreds of signaling molecules, dozens of different cell types, and the involvement of multiple tissue types and organs.13 It features all the motifs and elements of the most complex biological circuits and control systems, with many interacting feedback and feed-forward elements.14 Dysfunction can arise from variation in many different components of this complex, highly inter-connected system, and the phenotypic changes in the immune system to the range of genomic variations possible is only beginning to be understood. However, characterization of the variations that lead to serious immune failure can help in the treatment of affected patients and also, hopefully, provide insight into the range of human immune response as influenced by genetics.
Although large knowledge bases on immune function have been constructed,15 we also have access to genome-wide functional data in the form of gene expression information. Large repositories like the NIH NCBI Gene Expression Omnibus16 provide access to tens of thousands of different highly parallelized gene expression measurements. We and others have previously described methods that look across multiple studies which each provide many gene expression measurements (ie, multiplex) in what we call ‘multiplex meta-analysis’17–20 to obtain an overall picture of gene expression across studies. Taking advantage of the central dogma that DNA codes for RNA, which codes for protein, studying RNA at the gene expression level in specific phenotypes has provided potential insight into functional processes in the phenotype and suggests genes (DNA) that may contain variants associated with that disease.21 22 In this study, we extend this idea and describe a method for integrating gene expression data across multiple gene expression studies for clinically/phenotypically similar presentations of a range of immune deficiencies across species and tissue types to create an integrated multiplex (parallelized) meta-profile of gene expression. This meta-expression profile allows powerful prioritization of the genes involved in general immune deficiency but also shows predictive power over a recent genome-wide association study of CVID (figure 1). We suggest that this method could be used to investigate the genetic and molecular pathology of other forms of immune dysfunction.
We collected data from the NIH NCBI Gene Expression Omnibus for 16 different studies of immunodeficiency (table 1). Importantly, these gene expression measurements are from many different forms of immune dysfunction and span multiple species and different tissue types. The gene expression samples in each study were hand annotated and further divided into 37 different experimental comparison subgroups (such as different mouse backgrounds used in the same study). Gene expression levels were compared between immune deficient samples and controls (normal immune function samples) in each subgroup using the modified t test proposed by Tusher et al37 and incorporated into the significance analysis of microarrays.38 The annotations of all samples are available upon request.
We calculate the mean log fold change38 of each oligonucleotide in each array (m) in each experimental class (k, either i for immunodeficiency or c for control) and each subgroup (g), with nk,g indicating the total number of arrays of class k in subgroup g.(1)
These mean log fold changes are then used in the modified t test (equation 2) introduced by Tusher et al37 and explained in detail in Witten and Tibshirani,38 using the standard deviation (σg) and a value so,g (scaling factor) selected to minimize the coefficient of variation of Tg across all oligonucleotides.(2)
The modified t statistic Tg is then bootstrapped to calculate a p value (pg) for each gene. The oligonucleotides on the array are mapped to genes using AILUN39 and across species using HomoloGene groups.40 Using a method proposed by Fisher41 and based on the fact that p values selected at random should be uniformly distributed, we can look for deviations by the χ2 test (equation 3), with pg the p value for that gene in the subgroup g and ng the number of subgroups measuring that gene.(3)
Importantly, this method will freely mix different directions of variation across studies, up or down, using only the significance of the test. The final meta p values may then be used to rank the significance of expression differences between immune deficient and normal controls across studies. We predict that genes that are significantly differentially expressed across studies identified through our method are more likely to be involved in immunodeficiency; the lower the p value, the greater the priority given.
To establish a baseline and identify genes involved with the immune system in general, we also took lists of genes annotated for involvement in acquired immunity, innate immunity, and inflammation taken from the Molecular Signatures Database.42 The genes with an annotation in each respective category were taken as predictions for having variants associated with disease risk, and those genes lacking annotations were taken to be predictions for those genes not being closely coupled to variants associated with increased disease risk. For example, one strategy to prioritize CVID related genes would be to first investigate all those genes annotated for being involved in the biological process of ‘acquired immunity’, using the presence of that annotation on a gene as the predictor for involvement with CVID.
To evaluate our ability to prioritize genes with variants associated with immunodeficiency, we compiled a manually expertly curated list of 131 genes known to have variations associated with immunodeficiency in general. However, as that list might be biased toward genes discovered through differential expression, we also used the results of a recent genome-wide association study of 363 patients with CVID and 3031 healthy controls. We took the results of this study and identified 31 genes linked to variations associated with CVID,43 prominent among them the genes of the major histocompatibility complex (MHC).
We evaluated the ability to prioritize the genes with variants implicated in immunodeficiency using the receiver operating characteristic (ROC) curve. The area under the ROC curve summarizes the power of a prediction method into a single numerical value, in this case predicting the association of genes likely to contain disease-associated variants. An area of 0.5, or a curve along the diagonal indicates no predictive power. A larger area under the curve (AUC) implies better discrimination ability. We obtain a p value by comparing the observed AUC against 1000 randomizations.
In figure 1, we show the ability of our multiplex meta-analysis derived gene expression profile for immunodeficiency to identify and prioritize genes with disease-associated variants. In panel A of figure 1, we see that our approach provides an AUC of 0.713, much better than a random sampling of the genome (diagonal) or selecting and prioritizing genes annotated for immune system involvement (other solid lines).
In figure 1B, we can see very strong prioritization of genes identified through an unbiased, genome-wide investigation of the genetic causes of CVID. Even though CVID represents a collection of different, related syndromes of dysfunction in antibody production, our gene expression analysis drawn from an even more heterogeneous collection of diseases provides predictive power with an AUC of 0.643.
The top 20 genes from our gene expression meta-analysis for immunodeficiency are shown in table 2. Our approach suggests that these genes should be further investigated in relation to their role in immune dysfunction. Although many of these genes have been poorly characterized for function, some are already known to have an important role in immunity, suggesting that although looking just at genes annotated for immunity does not really help selection, some key immune genes have nonetheless been found using microarray meta-analysis, such as PECAM1 which codes for CD31 (a protein involved in leukocyte migration) and STAT1 which is a very important immune regulator involved in response to signaling by interferons.
Our hypothesis in this study was that by integrating RNA repeatedly implicated in gene expression microarray experiments related to immunology, we could identify genes (DNA) recognized to contain variants or mutations well known to be associated with immune dysfunction. To test this, we compared sets of genes found through meta-analysis with genes selected using prior knowledge of immunology pathways, comparing against a manually curated gold standard list of genes known to have variants associated with immunodeficiency.
The important result here is that prior annotations of immune function actually provide no particular insight, as the resulting curves diverge very little from the diagonal (∼0.5 in both panels of figure 1). Our study of genes by their annotation for involvement in immune processes is analogous to previously commonly used candidate gene approaches; genes which were believed to be involved in a pathway or process involved in the phenotype of interest (such as the disease) were examined for genetic variation enriched in affected individuals relative to healthy controls. Although such knowledge-driven approaches have suggested many genes and variants that might influence disease risk, the variants found by such approaches have typically failed to be replicated in larger or subsequent studies.12 44 Our findings lead to a similar result. The previous knowledge-based approach provides little value in prioritization.
There are many potential biases in the identification of genes with disease-associated variation, and it has been suggested that the best ways to address this are to interrogate the genome as widely as possible and to combine studies whenever possible to demonstrate reproducibility and consistency of effect.45–48 In statistics, previous beliefs that lead to bias may be cast in a more formal way, for example as a Bayesian prior, which is a type of statistical bias. Other types of biases may be the result of sample collection having a particular structure that skews the results, such as patients and controls in a clinical trial selected in a non-random way. Overall, bias is not meant here as a pejorative term, merely descriptive. Selecting genes using previous knowledge would represent a form of bias. We suggest that our meta-analysis approach is relatively unbiased compared to using annotations of individual genes, in that it relies on the single assumption that the datasets synthesized relate to the phenotype under study, and is not biased on particular genes or pathways having been previously studied to a greater extent than others.
In our results, investigating this type of immune disease using biased, prior knowledge of the molecular genetics of immune function provides little advantage over a random selection of genes. A study looking for genetic association only in the genes annotated for immune function, would apparently not do much better than random. However, our data-driven approach offers a much better way to prioritize genes for genetic investigation. The meta-analysis derived expression profiles preferentially select genes with disease-associated variants without making any assumptions of function other than differential expression in a related disease.
The vastness of the human genome provides a huge challenge as we seek to investigate the genetic underpinnings of disease; however, we want to interrogate not only a huge range of variants at different loci but also systems with potentially many interacting components. Tackling such a task will require integration over many different experimental modalities. We have demonstrated a method for using raw functional data in the form of multiplexed gene expression to propose genes associated with disease. Importantly, our results shown in figure 1 suggest that this unbiased, data-driven approach is superior to using highly biased, functional annotations.
Although our method may be of particular use in further investigations of immunodeficiency and CVID in particular, it does not need to be confined to immunology. Our approach may work in other disease domains. Immune dysfunction is just one example of many disease phenotypes that derive from the interaction of many genes, proteins, and environmental factors. Future extensions may include information on sequence conservation10 or direct interactions between genes and proteins.49
Special thanks to Rong Chen and Alex Skrenchuk for the development of database resources.
Funding This work was supported in part by the March of Dimes, the Lucile Packard Foundation for Children's Health, the Hewlett Packard Foundation, and the National Library of Medicine through direct research funding (R01 LM009719) and a Biomedical Informatics training grant (T15 LM007033). This research was also supported, in part, by the Intramural Research Program of the NIH, NIAID.
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.