JAMIA 2009;16:759-767 doi:10.1197/jamia.M2780
  • Perspectives on Informatics
  • Viewpoint Paper

Large Datasets in Biomedicine: A Discussion of Salient Analytic Issues

  1. Anshu Sinha,
  2. George Hripcsak,
  3. Marianthi Markatou
  1. Affiliations of the authors: Department of Biomedical Informatics (AS, GH), Department of Biostatistics (MM), Columbia University, New York, NY
  1. Correspondence: Marianthi Markatou, 722 West 168 Street, Floor 6, Room 632, New York, NY 10032. Email: <mm168{at}columbia.edu>
  • Received 3 March 2008
  • Accepted 2 August 2009

Abstract

Advances in high-throughput and mass-storage technologies have led to an information explosion in both biology and medicine, presenting novel challenges for analysis and modeling. With regard to multivariate analysis techniques such as clustering, classification, and regression, large datasets present unique and often misunderstood challenges. The authors' goal is to discuss the salient problems encountered in the analysis of large datasets as they relate to modeling and inference, in order to inform principled and generalizable analysis and to highlight the interdisciplinary nature of these challenges. The authors present a detailed study of germane issues, including high dimensionality, multiple testing, scientific significance, dependence, information measurement, and information management, with a focus on the methodologies available to address these concerns. A firm understanding of the challenges and the statistical technology involved ultimately contributes to better science. The authors further suggest that the community consider facilitating discussion through interdisciplinary panels, invited papers, and curriculum enhancement to establish guidelines for analysis and reporting.

Footnotes

  • Dr. Markatou's research was funded by NSF DMS-0504957 and OBE/CBER/FDA. Dr. Hripcsak is supported by NLM R01 LM06910 “Discovering and applying knowledge in clinical databases”. Anshu Sinha is supported by the NLM Informatics Research Training Program (T15 LM007079-17).

The Journal of the American Medical Informatics Association is published for the American Medical Informatics Association by BMJ Publishing Group Ltd.