The coming age of data-driven medicine: translational bioinformatics' next frontier
- 1Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California, USA
- 2Duke Translational Medicine Institute, Duke University, Durham, North Carolina, USA
- Correspondence to Dr Nigam H Shah, Stanford University School of Medicine, 1265 Welch Road, Room X-229, Stanford, CA 94305, USA
- Accepted 26 March 2012
- Collaborative technologies
- applications that link biomedical knowledge from diverse primary sources (includes automated indexing)
- knowledge representations
- statistical analysis of large datasets
- methods for integration of information from disparate sources
- resource discovery
Last year, in 2011, we argued that biomedical informatics stands ready to revolutionize human health and healthcare using large-scale measurements on a large number of individuals.1 We anticipated that, with the coming changes in the amount and diversity of datasets, data-centric approaches that compute on massive amounts of data (often called ‘Big Data’2 ,3) to discover patterns and to make clinically relevant predictions would be increasingly common in translational bioinformatics.
Given these trends, we programmed the 2012 Summit on Translational Bioinformatics to focus on research that takes us from base pairs to the bedside,4 with a particular emphasis on clinical implications of mining massive datasets, and bridging the latest multimodal measurement technologies with the large amounts of electronic healthcare data that are increasingly available.
The year that followed did turn out to be the year of Big Data for the Summit, with multiple submissions on managing and interpreting large datasets (figure 1). Among the 35 full paper submissions to the Summit, four stood out for their innovation, and hence the authors were invited to expand the work for this special issue of JAMIA, adding to the growing presence of translational bioinformatics in the journal.5–9
Liu et al10 demonstrated how the ability to predict adverse drug reactions can be increased by integrating chemical, biological, and phenotypic properties of drugs. They demonstrated that prediction accuracy increased from 0.9054 (when only chemical structures were used) to 0.9524 (when chemical structures along with biological and phenotypic features were used). They conclude that data fusion approaches are promising for large-scale adverse drug reaction predictions in both preclinical and post-marketing phases.
Bhavnani et al11 assert that existing methods to analyze ancestral informative single-nucleotide polymorphisms (SNPs) (ie, SNPs that have large differences in genotype frequencies between two or more ancestral populations) identify a parsimonious set of SNPs that can identify distinct population clusters. However, existing methods do not directly visualize which clusters of subjects are related to which clusters of SNPs, or allow visualization of the genotypes that determine the cluster memberships. In an attempt to reveal such hidden relationships, they used three bipartite analytical representations (a bipartite network, a heat map with dendrograms, and a Circos ideogram) to simultaneously visualize clusters of subjects, SNPs, and the attributes that cause them to cluster.
Seeking to maximize the utility of the abundance of available genome-wide association study (GWAS) data, Russu et al12 introduced a novel Bayesian model search algorithm, binary outcome stochastic search, for model selection when the number of predictors (eg, SNPs) far exceeds the number of observations. They propose an innovative stochastic model search technique where the relationship between the observed responses and the available predictors is described by a latent variable model with a probit link. They compare binary outcome stochastic search with three established methods (stepwise regression, logistic lasso, and elastic net) in a simulation study and in two real-world studies, and demonstrate that it identifies SNPs associated with the observed outcome with higher precision than the established methods while preserving recall.
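To make the precision and recall comparison above concrete, the following is a minimal sketch of how the two measures could be computed for a set of SNPs selected by a model search method, assuming the truly associated SNPs are known (as they would be in a simulation). The function name and the SNP identifiers are hypothetical and are not taken from Russu et al.

```python
def precision_recall(selected, truly_associated):
    """Return (precision, recall) for a set of selected predictors.

    selected: SNP identifiers chosen by a model selection method.
    truly_associated: SNP identifiers known to be associated with the
        outcome (known only in a simulation setting).
    """
    selected = set(selected)
    truly_associated = set(truly_associated)
    true_positives = len(selected & truly_associated)
    # Precision: fraction of selected SNPs that are truly associated.
    precision = true_positives / len(selected) if selected else 0.0
    # Recall: fraction of truly associated SNPs that were selected.
    recall = true_positives / len(truly_associated) if truly_associated else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
truth = {"snp1", "snp2", "snp3", "snp4"}
method_a = {"snp1", "snp2", "snp3", "snp9", "snp10", "snp11"}
method_b = {"snp1", "snp2", "snp3"}

print(precision_recall(method_a, truth))
print(precision_recall(method_b, truth))
```

In this toy example both methods recover the same three true SNPs, so recall is equal, but the second method selects fewer spurious SNPs and therefore has higher precision.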
Morgan et al,13 recipient of the Marco Ramoni Best Paper Award, constructed genomic disease risk summaries for 55 common diseases using reported gene–disease associations in the research literature. They constructed risk profiles based on the SNPs as well as on 187 whole-genome sequences and show that risk predictions derived from sequencing differ substantially from those obtained from the SNPs for several different non-monogenic diseases. When a large fraction of associated variants for a given disease is not covered by the genotyping array, the overall risk predictions can vary dramatically, by as much as a factor of 20 in some instances.
Beyond this year's conference papers, in the larger informatics community, researchers have demonstrated that GWAS can now be performed by leveraging large amounts of electronic medical record (EMR) data. For example, Kho et al showed that, by using commonly available data from five different EMRs, it is possible to accurately identify type 2 diabetes cases and controls for genetic study across multiple institutions.14 In addition, genomic sequencing has moved out of the research realm and established itself in the clinic. For example, at the Medical College of Wisconsin, Dr Howard Jacob's team used genome sequencing to identify a novel causal mutation that led to successful treatment of a 6-year-old boy with an extreme form of inflammatory bowel disease.15 ,16
Currently, the discussion of Big Data in translational informatics often connotes next-generation sequencing data.3 ,17 ,18 However, this is beginning to change: in 2011, the use of large public datasets of various kinds increased dramatically. The research activity around data mining for predicting adverse drug events (ADEs) using public data is an excellent example.19 Drug safety surveillance is currently based on spontaneous reporting systems, which contain reports of suspected ADEs seen in clinical practice. In the USA, the primary database for such reports is the Adverse Event Reporting System (AERS) database at the Food and Drug Administration. This resource has been successfully mined using ‘disproportionality measures’, which quantify the magnitude of difference between observed and expected rates of particular drug–ADE pairs.20 ,21
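As an illustration of a disproportionality measure, the sketch below computes the proportional reporting ratio (PRR), one commonly used measure, from a 2×2 table of report counts. The choice of PRR, the function name, and the counts are assumptions made for illustration; the cited studies may use other measures and thresholds.

```python
from math import exp, log, sqrt


def proportional_reporting_ratio(a, b, c, d):
    """Compute the PRR and an approximate 95% confidence interval.

    a: reports mentioning the drug and the event of interest
    b: reports mentioning the drug and any other event
    c: reports mentioning other drugs and the event of interest
    d: reports mentioning other drugs and any other event
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four counts must be positive")
    # Observed event rate for the drug versus the rate for all other drugs.
    prr = (a / (a + b)) / (c / (c + d))
    # Standard error of ln(PRR), as for a relative risk.
    se = sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = exp(log(prr) - 1.96 * se)
    upper = exp(log(prr) + 1.96 * se)
    return prr, lower, upper


# Hypothetical counts, not taken from AERS.
prr, lower, upper = proportional_reporting_ratio(a=20, b=980, c=100, d=98900)
print(f"PRR = {prr:.1f} (95% CI {lower:.1f} to {upper:.1f})")
```

A PRR well above 1, with a confidence interval that excludes 1, suggests that the event is reported with the drug more often than would be expected from the rest of the database. Such a signal calls for further review and does not by itself establish causation.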
Given the amount of data available in AERS,22 researchers are developing methods for detecting new or latent multi-drug adverse events. Examples include using side effect profiles from AERS reports to infer the presence of unreported adverse events,23–25 and creating a network of known drug–ADE relationships to predict as yet unknown ADEs before they are found in post-market evidence.26
Going beyond reported adverse events and making use of molecular level data, Pouliot et al27 generated logistic regression models to correlate and predict post-marketing ADEs based on screening data from PubChem, a public database of chemical structures of small organic molecules along with information about their biological activities. In a related effort, Vilar et al28 devised a way to enhance existing data-mining algorithms with chemical information using molecular fingerprints (which represent molecules through a bit vector that codifies the existence of particular structural features or functional groups) to strengthen ADE signals generated from adverse event reports. There have been increasing efforts to use other data sources, such as EMRs, for the purpose of detecting ADEs29–31 and to discover multi-drug ADEs.32 Researchers have also used billing and claims data for active drug safety surveillance33–35 and applied literature mining for drug safety.36 Recently, Chee et al37 explored the use of online health forums as a source of data to identify drugs for further scrutiny. They aggregated individuals' opinions of drugs in roughly 12 million personal health messages using natural language processing and were able to identify drugs withdrawn from the market based on messages discussing them before their removal.
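The molecular fingerprints mentioned above in connection with Vilar et al can be illustrated with a small sketch: each molecule is encoded as a bit vector in which a set bit marks the presence of a structural feature, and two molecules are compared by the overlap of their set bits. The Tanimoto coefficient used here is a common similarity measure for fingerprints, but its use is an assumption for illustration, as are the function names and the feature indices; real fingerprints are produced by cheminformatics software and have hundreds or thousands of bits.

```python
def fingerprint(features_present, n_bits):
    """Encode a molecule as a bit vector stored in a Python int.

    features_present: indices of the structural features found in the molecule.
    n_bits: length of the fingerprint.
    """
    bits = 0
    for index in features_present:
        if not 0 <= index < n_bits:
            raise ValueError(f"feature index {index} is outside 0..{n_bits - 1}")
        bits |= 1 << index
    return bits


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared set bits divided by all set bits."""
    shared = bin(fp_a & fp_b).count("1")
    total = bin(fp_a | fp_b).count("1")
    return shared / total if total else 0.0


# Hypothetical 8-bit fingerprints for two drugs.
drug_a = fingerprint([0, 2, 5, 7], n_bits=8)
drug_b = fingerprint([0, 2, 6, 7], n_bits=8)
similarity = tanimoto(drug_a, drug_b)
print(f"{drug_a:08b} {drug_b:08b} similarity = {similarity:.2f}")
```

The intuition is that a drug whose fingerprint is highly similar to that of a drug known to cause an ADE may deserve more attention when a weak signal for the same ADE appears in adverse event reports.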
Looking ahead, we believe that Big Data in biomedical informatics will be far more than genome sequence data.38–40 We argue that Big Data must be considered in a comprehensive manner, including both large amounts of ‘molecular measurements’ on a person (eg, sequencing) and small amounts of ‘routine measurements’ on a large number of people (eg, clinical notes, laboratory measurements, claims data and adverse event reports). In contrast with the buzz around genomic-data-in-the-clinic or adverse event predictions, consider the example by Frankovich et al.41 When the existing literature and a survey of colleagues were insufficient to guide the clinical care of a patient, Frankovich et al applied trend analysis to the EMR data from 98 patients to ‘learn’ a data-driven guideline on how to provide care for a 13-year-old girl with systemic lupus erythematosus.41 Such data-centric approaches are particularly useful when derivation of a formal guideline is not feasible from a practical standpoint.
It is tantalizing to imagine how scientific inquiry would be performed differently if we collect and share access to lots of data, both genomic and ‘routine’. How will the kinds of questions we ask change when we cross a certain data threshold?42 ,43 For example, researchers at Carnegie Mellon University built a scene completion tool by scraping millions of images from public sources on the web. Once the system had accumulated a corpus of millions of photos, the completed scenes were indistinguishable from real photographs to the naked eye. The case for Big Data analytics has already won over the legal domain in at least one application, replacing armies of lawyers with computer algorithms designed for ‘e-discovery’, that is, retrieval of relevant materials for a legal case.44 Even the liberal arts are embracing Big Data: capitalizing on Google's efforts to digitize books, researchers in the humanities are blazing new trails in ‘culturomics’ by analyzing how the word combinations that occur in millions of digitized books change over time.45
In 2013, we will have the sixth Summit on Translational Bioinformatics and the third year of the AMIA Joint Summits on Translational Science. Translational research has become integral to the biomedical research enterprise, as evidenced by the creation of a National Center for Advancing Translational Sciences at the NIH. The Joint Summits continue to be a venue to facilitate dramatic changes that are underway to deliver quality, personalized healthcare in the USA without increasing spending at a rate exceeding the growth of the GDP.46 Reflecting this priority, the 2013 TBI Summit will have new tracks that will showcase the ways in which the translational sciences are having a significant impact on the way clinical care, biomedical research, and drug discovery are performed.
We believe that the time is ripe for medicine to embrace Big Data, to usher in the age of data-driven medicine, and to truly enable proactive, predictive, preventive, participatory, and patient-centered health.47 Data-driven medicine will enable the discovery of new treatment options based on multimodal molecular measurements on patients and on learning from the trends hidden among the diagnoses, prescriptions, and discharge summaries of millions of patient encounters logged by clinical practitioners.48 ,49 The increasing synergy between the Translational Bioinformatics Summit and the Clinical Research Informatics Summit is an indication of this impending convergence. This is an exciting time when medicine begins utilizing massive amounts of data to discover patterns and trends and to make predictions in a manner that is a mainstay of web-scale computing.42
Funding NHS is funded by the US National Institutes of Health Roadmap (U54 HG004028 and U54 LM008748). JDT is funded by a Clinical and Translational Science Award (UL1 RR024128) and a gift from David H Murdock.
Competing interests None.
Provenance and peer review Commissioned; internally peer reviewed.
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.