Clinical utility of sequence-based genotype compared with that derivable from genotyping arrays
- 1Department of Biochemistry, Stanford Genome Technology Center, Stanford University, Stanford, California, USA
- 2Biomedical Informatics Graduate Training Program, Division of Systems Medicine, Department of Pediatrics, Stanford University, Stanford, California, USA
- 3Lucile Packard Children's Hospital, Palo Alto, California, USA
- Correspondence to Dr Alexander A Morgan, Department of Biochemistry, Stanford Genome Technology Center, 855 South California Avenue, Palo Alto, CA 94304, USA
- Received 30 November 2011
- Accepted 19 April 2012
Objective We investigated the common-disease relevant information obtained from sequencing compared with that reported from genotyping arrays.
Materials and methods Using 187 publicly available individual human genomes, we constructed genomic disease risk summaries based on 55 common diseases with reported gene–disease associations in the research literature using two different risk models, one based on the product of likelihood ratios and the other on the allelic variant with the maximum associated disease risk. We also constructed risk profiles based on the single nucleotide polymorphisms (SNPs) of these individuals that could be measured or imputed from two common genotyping array platforms.
Results We show that the model risk predictions derived from sequencing differ substantially from those obtained from the SNPs measured on commercially available genotyping arrays for several different non-monogenic diseases, although high density genotyping arrays give identical results for many diseases.
Conclusions Our approach may be used to compare the ability of different platforms to probe known genetic risks disease by disease.
- genomic medicine
- personalized medicine
- modeling physiologic and disease processes
- linking the genotype and phenotype
- identifying genome and protein structure and function
- visualization of data and knowledge
Access to a patient's genome offers amazing possibilities in clinical medicine. Researchers are racing to develop the biomedical knowledge and tools capable of turning the mass of individual sequence data into clinically actionable findings. Recent work has contrasted different direct-to-consumer large-scale genotyping services using single nucleotide polymorphism (SNP) arrays,1 and many individuals are receiving health related information through this route. At the same time, whole genome sequencing has been used to directly implicate particular genetic variants in disease,2 ,3 while others have reported on methods for clinical use of a full human genome sequence.4 ,5 As the price of large-scale sequencing continues to drop, it is worthwhile to determine if the findings of whole genome sequencing substantially differ from those available through commercially available genotyping arrays and to what extent the results depend on the diseases being investigated.
Large-scale sequencing provides more genetic information than even the highest density arrays, and the cost difference between technologies is easy to calculate. However, determining the relative benefits for any individual of a genetic-based test for disease prognosis is substantially more difficult. For any particular disease of interest, disease prediction relies on the interplay of many factors including disease prevalence, the predictive power of disease associated variants, the frequency of the disease associated genetic variations (eg, allele frequency), the history of environmental exposures that may alter disease risk, the existence or absence of treatments for the disease, and many other factors. We cannot address all of these issues; however, we can provide some core insight into the differences in information provided by separate technologies. We can compare risk models derived from genotyping arrays with those obtained from more complete sequencing. If genotyping arrays and sequencing provide substantially different risk predictions, then sequencing has an information advantage over genotyping arrays at the level of individual diseases. If the reported risks are not substantially different, then full sequencing may not currently provide much useful clinical information over genotyping arrays. This is a relatively trivial exercise for monogenic (Mendelian) diseases: either a particular technology measures the variant with accuracy or it does not. However, complex diseases, with many different loci associated with disease risk, are the topic of this paper.
Because many different SNPs at different locations in the genome have been associated with separate diseases and each is associated with different variations in disease risk, it is important to consider how to combine the results of genotyping at multiple disease associated loci for each disease/condition. We use two different models for integrating multiple disease associated genetic variants. We have previously suggested using likelihood ratios and modeling each measured disease–variant association as an independent test for the disease, using the product of the likelihood ratios to combine results, as described in greater detail in previous publications.4,5 This has some advantage over other approaches that model the independence of risk contributions, particularly those in which odds ratios (ORs) are used to approximate risk ratios,6 where combining large numbers of variants can yield extremely large, nonsensical values. However, this is not the only model of genetic risk of disease. Instead of treating each genetically independent variant as an independent test for disease, we can assume that a single variant establishes most of the genetic risk in each individual, but that the particular genetic locus driving risk varies from individual to individual. This model uses the maximum likelihood ratio for the variants measured. This is analogous to the idea of a physical chain being only as strong as its weakest link; in our model the genetic risk is approximated by the risk conferred by the single allele with the largest effect size, that is, the most damaging allele. The variant with the single largest likelihood ratio is assumed to confer all the risk, and any genetic contribution from any other variants is ignored. This is biased toward the larger effect sizes and more deleterious alleles by definition. Of course, neither of these models should be taken as a perfect predictor of genetic risk; however, they provide parsimonious models.
They make no assumptions about epistatic interactions that have not yet been validated. Importantly, they do not incorporate any other clinical covariate that may affect the likelihood of a particular disease diagnosis, such as environmental or demographic features. However, one advantage of these approaches, both of which are fundamentally Bayesian in formulation, is that additional diagnostic features, particularly those independent of the genetic association with disease, may be easily integrated into a more complex version of the model.7,8
Only a handful of individuals have been fully sequenced in great depth. However, the 1000 Genomes Project9 provides greater sequence data with better informed imputation across a number of individuals of different ethnic backgrounds using a variety of sequencing technologies at different research centers. We can use the preliminary release of these data as a source of sequences to generate clinical risk reports based on our two models, the product of likelihood ratios and maximum likelihood ratio. At the same time, we can also use the subset of variants reported from common genotyping arrays, either through direct measurement or imputed from the SNPs measured in arrays, and compare them with clinical reports from the fuller sequencing data. For our analysis, we focus only on two commercially available genotyping platforms. It is certainly possible to design a custom array around specific variants of interest, including rare variants of clinical interest, or to laboriously employ targeted use of PCR to examine all the relevant variants desired, a strategy followed by many of the companies carrying out personal genomic testing.10 This has the advantage of providing a clinical profile based on variants vetted through expert curation (eg, Hsu et al),11 but has the disadvantage of some loss of flexibility if there are differences in opinion concerning which variants to include in the analysis.
We have found that even if a large amount of data can be imputed from a very high-density genotyping array, the clinical risk assessment using both the product of the likelihood ratios model and the maximum likelihood ratio differs substantially between the results obtained using two common genotyping arrays (Illumina Omni and Affymetrix 500k) and the fuller sequence data from 1000 Genomes for a number of diseases. This suggests that clinical interpretation of the results of genotyping using an ‘off-the-shelf’ array is likely to lack important information relevant to a patient's health.
As explained in Ashley et al4 and their associated supplementary materials, we have compiled an extensive database of published genetic associations with disease from over 2800 research publications. We filter out any reported association that is not significant: in a candidate gene study the p value must be <0.05 and in a genome-wide study, the p value must be <0.00001. Reported associations can also be filtered out for a variety of quality control reasons, including the fact that many do not report the actual risk associated allele, or do not report enough information to calculate a likelihood ratio.
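The significance filter described above can be sketched as follows. This is a minimal illustration, not the curation pipeline itself: the `Association` record, its field names, and `passes_filter` are hypothetical, and only the two p value thresholds and the requirement for a reported risk allele come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds stated in the text; all names below are illustrative assumptions.
CANDIDATE_GENE_P = 0.05
GENOME_WIDE_P = 1e-5


@dataclass
class Association:
    snp: str                    # hypothetical SNP identifier
    disease: str
    p_value: float
    genome_wide: bool           # True for a genome-wide study, False for a candidate gene study
    risk_allele: Optional[str]  # None when the publication does not report it


def passes_filter(a: Association) -> bool:
    """Keep an association only if it is significant and names its risk allele."""
    threshold = GENOME_WIDE_P if a.genome_wide else CANDIDATE_GENE_P
    return a.p_value < threshold and a.risk_allele is not None
```

The same p value can therefore pass for a candidate gene study and fail for a genome-wide study, which reflects the stricter threshold applied to the larger hypothesis space.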
To satisfy the independence assumptions needed in our approach as closely as possible, we merge results reported for multiple SNPs for the same condition that are in strong linkage disequilibrium with one another in the same haploblock, as defined by the CEU population in HapMap, and then take the most significant. This could introduce some bias toward larger effect sizes, but it is necessary in order to combine studies that use slightly different genotyping technology and are indirectly measuring the same variation, as we do not want to double count closely linked SNPs. Here we operate under the assumption that the most significant of a set of linked SNPs is more closely associated with some underlying, perhaps directly causal, variant. This assumption is common to many genome-wide association studies, which report the most strongly associated variant when a group of linked variants is associated with a disease. For this study we have focused on genetic associations for 55 important diseases, as many of the conditions in our database are either not at the right level of specificity (eg, genetic risk associations for cancers in general), not particularly clinically relevant (eg, eye color), or related to medication response, something which can be dealt with more comprehensively by experts in pharmacogenomics.
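The merging step can be illustrated with a short sketch, assuming each reported association has already been assigned to a haploblock. The tuple layout and the function name `most_significant_per_block` are assumptions made for this example.

```python
def most_significant_per_block(records):
    """Keep one SNP per (disease, haploblock): the one with the smallest p value.

    records: iterable of (disease, haploblock, snp, p_value) tuples.
    Returns {(disease, haploblock): (snp, p_value)}.
    """
    best = {}
    for disease, block, snp, p_value in records:
        key = (disease, block)
        # Replace the stored SNP only when the new one is more significant.
        if key not in best or p_value < best[key][1]:
            best[key] = (snp, p_value)
    return best
```

Linked SNPs reported by different studies collapse to a single entry, so a haploblock contributes at most one likelihood ratio per disease.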
The 187 genomes were taken from the 1000 Genomes Project pilot 1 and pilot 2 studies,9 downloaded on July 1, 2010. These genomes represent a variety of ethnic groups and sequencing technologies, and include related family members, all factors that can influence the results of any comparative analysis. At the same time, although these are extensive genetic sequences, they have not been analyzed to the depth reported for other individuals.12,13 However, this is a unique resource of sequencing data for individuals, and although genetic association studies are biased toward particular populations (eg, CEU), we use it as the standard of comparison.
The 187 sequences we have collected are derived from a variety of high throughput sequencing technologies and centers.9 Although we do not have access to basic genotyping array studies performed on these individuals, we can infer the results of a theoretical array from the more complete sequence information. We can examine all the SNPs measured by the array platform of interest and use the sequence calls from the human sequence as the reported genotypes from a theoretical genotyping array. This approach also automatically excludes any experimental discrepancies in which a genotyping array call would not match the sequence data. We have chosen to compare the set of variants measured and potentially imputable by the very commonly used Affymetrix GeneChip Human Mapping 500K Array Set (Affymetrix 500k), which measures approximately 500 000 SNPs, and the very high coverage Human Omni1-Quad Beadchip (Illumina Omni), which measures over one million SNPs.
Genotyping arrays are frequently used for more than the variants that they directly measure, as variations are often in strong linkage disequilibrium with one another and one can be accurately imputed from measuring another. To increase the coverage of our theoretical genotyping arrays, we also use the reported sequence genotype for any variant that can be imputed from those measured on the array platform with >75% accuracy using the state-of-the-art MACH imputation software with CEU phasing data in the HapMap as the reference.14 This is not ideal for non-Caucasian individuals, but our analysis is not meant to provide perfect predictions, just illustrative results.
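The construction of a theoretical array can be sketched as a simple selection over the sequence calls. This is an illustrative outline only: the function and argument names are hypothetical, and the per-SNP imputation accuracies are assumed to have been computed beforehand (the sketch does not run MACH).

```python
IMPUTATION_ACCURACY_CUTOFF = 0.75  # the >75% accuracy criterion stated in the text


def theoretical_array_genotypes(sequence_calls, measured_snps, imputation_accuracy):
    """Return the subset of sequence calls that a given array could report.

    sequence_calls:      {snp: genotype} taken from the sequence data
    measured_snps:       SNPs directly measured on the array platform
    imputation_accuracy: {snp: accuracy} for SNPs imputable from the array
    """
    imputable = {
        snp for snp, acc in imputation_accuracy.items()
        if acc > IMPUTATION_ACCURACY_CUTOFF
    }
    usable = set(measured_snps) | imputable
    # The sequence call stands in for the array call, so platform error is excluded.
    return {snp: gt for snp, gt in sequence_calls.items() if snp in usable}
```

Because the genotypes are copied from the sequence, any difference between the array-derived and sequence-derived profiles reflects coverage alone.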
To model potential interactions between disease associated alleles, we view each associated variant (one per haploblock, as above) as providing an independent genetic test for the associated disease. For each variant we compute a likelihood ratio (equation 1).
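For reference, the standard diagnostic-test definition of a likelihood ratio, applied to an observed genotype at a single variant, can be written as below. The symbols $g$ (genotype), $D$ (disease) and $\bar{D}$ (no disease) are notation assumed here rather than taken from the text.

```latex
\[
\mathrm{LR}(g) \;=\; \frac{P(g \mid D)}{P(g \mid \bar{D})}
\]
```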
This enables the creation of a flexible likelihood ratio for any conceivable genotype combination of homozygote or heterozygote alleles, in contrast with many risk models which compare only two sets of genotypes, grouping heterozygotes with one of the homozygote genotypes. The genetic variants are free from genetic direct linkage, as we look only at SNPs in distinct and independent haploblocks and then assume that they are independent tests for disease. We can then model the overall likelihood ratio of disease simply as the product of likelihood ratios from each individual disease associated variant to provide the product model for likelihood ratios, depicted graphically in previous work as a nomogram.7 Genetic measurements at variants that are not associated with increased or decreased likelihood of disease are assumed to be uninformative tests, with a likelihood ratio of unity, and are therefore excluded.
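A minimal sketch of the product model follows, assuming a lookup table that maps each (SNP, genotype) pair to its likelihood ratio. The table layout and the name `product_lr` are illustrative assumptions.

```python
import math


def product_lr(genotypes, lr_table):
    """Combine per-variant likelihood ratios as independent tests.

    genotypes: {snp: genotype} for one individual
    lr_table:  {(snp, genotype): likelihood_ratio} for one disease
    Variants with no known association contribute a ratio of 1.0.
    """
    # Summing logs is equivalent to multiplying ratios and is numerically safer.
    log_lr = 0.0
    for snp, genotype in genotypes.items():
        log_lr += math.log(lr_table.get((snp, genotype), 1.0))
    return math.exp(log_lr)
```

Risk and protective genotypes offset one another in this model: ratios of 2.0 and 0.5 at two independent variants combine to 1.0.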
In our alternative maximum likelihood ratio model, we assume that the significant contribution to disease risk arises from only the variant with the largest effect size. A likelihood ratio is computed as described above in equation 1 for each genotype at each relevant locus for an individual and the single largest association with disease is used, the maximum likelihood ratio. This is roughly proportional, although not exactly equivalent, to taking the maximum OR or RR.
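The maximum likelihood ratio model can be sketched in the same style, again assuming a hypothetical table keyed by (SNP, genotype). Returning 1.0 when no associated variant is measured is how this sketch represents the absence of information.

```python
def max_lr(genotypes, lr_table):
    """Return the single largest likelihood ratio among the measured variants.

    genotypes: {snp: genotype} for one individual
    lr_table:  {(snp, genotype): likelihood_ratio} for one disease
    """
    ratios = [
        lr_table[(snp, genotype)]
        for snp, genotype in genotypes.items()
        if (snp, genotype) in lr_table
    ]
    # With no informative variant, the test is uninformative (ratio of 1.0).
    return max(ratios, default=1.0)
```

If every measured genotype is protective, the maximum is itself below 1.0, so this model can report a reduced risk even though it selects the largest ratio.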
To compare genotyping array based results with the sequence derived likelihood ratios, we compute the differences in the natural logarithm of likelihood ratio for each of the 55 diseases, for each of the 187 individuals. The results of these comparisons are shown in figures 1 and 2. The boxplots show the difference between the risk profiles derived from the genotyping arrays and the risk profiles derived from the genome sequence for each disease for each of the 187 individuals. Wider boxes correspond to greater variation in the difference, and the median difference is indicated by a vertical bar. To allow comparison on amount of genetic information available for each disease, the number of SNPs contributing to the risk profile is also shown.
As is apparent from the lists of numbers in the figures, the number of SNPs used in each model is not always a whole number; instead, an average number of usable SNPs across individuals is reported. This fractional number can occur for several reasons, including a lack of accurate reporting for all possible genotypes from the genetic association studies used, or a missing or ambiguous sequence call at a key SNP. In such cases, the variant is taken as an uninformative or inconclusive test for disease association, and a likelihood ratio of 1.0 is assumed for that particular variant–disease association.
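The per-individual comparison and the fractional SNP counts can be illustrated as follows. This is a sketch under stated assumptions: the sign convention (array minus sequence) is chosen for the example, a missing or ambiguous call is represented as `None`, and the function names are hypothetical.

```python
import math


def log_lr_difference(lr_array, lr_sequence):
    """Difference in natural-log likelihood ratio (array minus sequence, by assumption)."""
    return math.log(lr_array) - math.log(lr_sequence)


def mean_usable_snps(calls_per_individual, lr_table):
    """Average number of SNPs that actually contribute a likelihood ratio.

    calls_per_individual: list of {snp: genotype or None}, one dict per individual
    lr_table:             {(snp, genotype): likelihood_ratio} for one disease
    """
    counts = [
        sum(
            1 for snp, genotype in calls.items()
            if genotype is not None and (snp, genotype) in lr_table
        )
        for calls in calls_per_individual
    ]
    return sum(counts) / len(counts)
```

A SNP with a missing call in one of two individuals contributes to only one of them, which is how an average such as 1.5 usable SNPs arises.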
The root mean square (RMS) difference in the log likelihood ratios within a disease averaged across the 55 diseases is shown in table 1. This gives a quantitative, overall summary of variability between platforms and shows the relative contribution of imputation. These 55 diseases were chosen to represent a wide range of conditions, but the 187 individuals are not entirely representative of any particular population.
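The summary statistic described here, an RMS computed within each disease and then averaged across diseases, can be sketched as below. The input layout and function names are assumptions for illustration.

```python
import math


def rms(values):
    """Root mean square of a sequence of numbers."""
    values = list(values)
    return math.sqrt(sum(v * v for v in values) / len(values))


def mean_rms_across_diseases(diffs_by_disease):
    """Average of the per-disease RMS log likelihood ratio differences.

    diffs_by_disease: {disease: [log LR difference for each individual]}
    """
    per_disease = [rms(diffs) for diffs in diffs_by_disease.values()]
    return sum(per_disease) / len(per_disease)
```

Computing the RMS within a disease first means each disease carries equal weight in the average, regardless of how widely its differences are spread.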
Some potential biases are observable in the figures. In the model that is derived from the maximum (reported) likelihood ratio, there is a strong skew toward a result suggesting increased risk compared with the product of likelihood ratios model which combines risk and protective alleles, attenuating the derived likelihood ratio. In other words, the median tends to be higher in the max model compared with the product model. However, as can be seen in the plots on the right of each figure, this is not universally the case, and it may seem counter-intuitive that the model using the maximum likelihood ratio can show consistently lower reported risk relative to the product of likelihood ratios for a disease. This is because a genotyping array may not measure any of the SNPs associated with that disease and thus all individuals will have a likelihood ratio of 1.0, or no information. However, most or all of the individuals sequenced may actually have the protective allele, and thus when full sequence information is available, the risk derived from the maximum likelihood ratio may actually be less. This may be anticipated as an artifact of our patient population, as we know that the subjects who have been sequenced are biased toward older individuals, free from disease, and perhaps selectively biased toward possessors of alleles that have protected them from disease into middle age. In general, meta-analyses, such as our synthesis of published association studies, are subject to issues in reporting bias toward positive findings over negative ones and toward over-inflated effect size estimates.15 ,16
Although the absolute differences in the clinical profile are important, a key parameter is the relative variability in disease likelihood between that reported based on the genotyping arrays and that derived from the more extensive sequence data. In tables 2 and 3 we show these relative differences (equation 2) for each of the 55 diseases, for the SNPs derived from the genotyping arrays, as explained for figures 1 and 2. For each disease, we show the RMS of the difference in log likelihood ratio between what is derived from the genotyping array SNPs over the RMS of the log likelihood ratio derived from the variants in the sequence data (equation 1). For each disease, d, and each patient, p, in the set of patients P, we compare the RMS difference in the log likelihood ratio derived from sequence data, LLRseq, and that derived from the variants measured or imputable from the genotyping array, LLRgen.
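One way to write the ratio described in this paragraph is given below. It is a reconstruction from the verbal description, not the published form of equation 2; the symbol $R_d$ and the function-style arguments $(d,p)$ are notation assumed here.

```latex
\[
R_d \;=\;
\frac{\sqrt{\dfrac{1}{|P|}\displaystyle\sum_{p \in P}
      \bigl(\mathrm{LLR}_{gen}(d,p) - \mathrm{LLR}_{seq}(d,p)\bigr)^{2}}}
     {\sqrt{\dfrac{1}{|P|}\displaystyle\sum_{p \in P}
      \mathrm{LLR}_{seq}(d,p)^{2}}}
\]
```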
A value of >1.0 in tables 2 and 3 indicates that the variability in the clinical profile for that disease as reported from the SNPs on the genotyping array relative to that from sequencing is greater than the variability between individuals. We can see that for many diseases, the variability between the results provided by sequencing and genotyping SNPs is a large fraction of the variability between individuals, and depending on the model and genotyping array, may actually exceed the variability between individuals.
When using a richer model that allows many individual variants to contribute to adjustments in the risk, in this case the product model, the medically relevant results from the genotyping arrays differ substantially from those derived from sequencing across many diseases, even with the very high coverage of the Illumina Omni platform and a liberal allowance for imputation. For diseases like Alzheimer's and type II diabetes with many associated variants not covered by the genotyping array, the overall likelihood ratios can vary dramatically by as much as a factor of 20 (shown as a difference in logs of three in figures 1 and 2).
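The factor of 20 and the log difference of three describe the same quantity on two scales, since the comparisons use natural logarithms:

```latex
\[
\ln 20 \approx 3.00, \qquad e^{3} \approx 20.1
\]
```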
There are some limitations to our work. We focus only on the disease associations published in the literature from findings that have been reported as statistically significant, from studies primarily relying on relatively common genetic variants. Undoubtedly, many genetic variants that contribute to disease risk remain to be discovered, and these will help explain the heritability of many common diseases.17 ,18 However, our results demonstrate that whole genome sequencing shows markedly different results when only the currently known gene–disease associations are studied, and before consideration of the even greater advantage provided by examining unique variations that may be associated with other deleterious changes, including the introduction of early stop codons or amino acid substitutions predictive of deleterious effects.19 In this work, we considered two different, plausible, models of combining disease-associated variants, and they give similar results, but many other models are possible. Our work further highlights the need for more comprehensive, multivariate models of disease risk, including possible epistatic interactions,20 and these more complex models will be developed as the research community continues to investigate the contributions of genetic variations in disease.
The clinical use of genomic data is in its infancy. However, our results suggest that even with the currently limited knowledge of gene–disease associations, genome sequencing provides a substantially different, medically relevant risk profile than that available from common genotyping arrays. Although custom genotyping arrays can certainly be designed around known disease associated variants, it is likely that the continual discovery of new associations between variations and disease will outpace their design cycle. Indeed, an initial hypothesis of this work was that the heavy use of genotyping arrays in genome-wide association studies would skew the results toward highlighting only genetic variants already measured on genotyping arrays and thus very little difference would be apparent. However, variants outside of the scope of genotyping array technology will continue to be discovered, as even in genome-wide association studies using arrays, targeted deep sequencing often identifies likely ‘causal’ variants, often following the discovery phase.
Researchers investigating the association between genetics and specific diseases can use these results to compare the relative power of sequencing versus genotyping arrays to capture currently understood risk before they search for novel associations or replicate previously reported results. In addition, improved prior information can help inform difficult cost–benefit analyses and our approach and the results reported here may help the design of experiments.
In this analysis, we very deliberately focused on an empirical evaluation of the differences in medically relevant genetic variant coverage between sequencing and commonly used genotyping arrays. Our analysis excludes many types of additional variation between the calls reported by arrays and next generation sequencing technologies, all of which would likely increase the differences we report. Just as genotyping platforms differ in which SNPs they measure, sequencing technologies can vary considerably not only in the underlying method of sequencing but also in the results they report, and these methods can differ one from another on DNA from the same individual.21–23 At the same time, there are a variety of different bioinformatics techniques that can be used to interpret sequencing data and make actual base calls, and there can be substantial differences in the results reported by different methods,24 another potential source of variation. In our work we also gave ‘the benefit of the doubt’ to imputation methods, although we know that accuracy with these approaches is often far from ideal.25 ,26 Our exclusion of these other sources of variation was intentional to allow us to limit the scope of our investigation, but we believe that future studies will show even larger differences.
Genotyping arrays still have an important role in the discovery phase of genome-wide disease association studies, as the potential space for hypotheses examining associations is incredibly large using a full genome sequence. Targeted approaches that focus on key genes in disease associated pathways, or using known features that enrich for disease association such as expression variation,27 ,28 may allow the construction of targeted genotyping arrays that aid in disease association discovery. In addition, future development and use of genotyping arrays specifically designed around known human disease associated SNPs may reduce the prognosis gap.
In summary, many important, relatively common, currently known disease associated variants are not measurable or imputable from commercial genotyping arrays. These variants are common enough and have an effect size (likelihood ratio) large enough to influence our assumptions of risk for many diseases for many individuals. Individuals or their healthcare providers deciding between different technologies to elucidate potential disease risks encoded in the genome should take into account these differences of coverage as well as which disease conditions are most relevant to them. Also, researchers attempting to discover novel gene–disease associations or investigate previously reported associations should consider these differences in coverage across individuals as they balance cost versus benefit in their research planning.
We thank Optra Systems of Pune, India for gene–disease variant data curation.
Funding This work was funded in part by the Lucile Packard Foundation for Children's Health, the Hewlett Packard Foundation, and the National Library of Medicine (R01 LM009719). AAM was supported by NLM's Biomedical Informatics Training Grant (T15 LM007033).
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.