JAMIA 1999;6:478-493 doi:10.1136/jamia.1999.0060478
  • The Practice of Informatics
  • Application of Information Technology

Organization of Heterogeneous Scientific Data Using the EAV/CR Representation

  1. Prakash M Nadkarni,
  2. Luis Marenco,
  3. Roland Chen,
  4. Emmanouil Skoufos,
  5. Gordon Shepherd,
  6. Perry Miller
  1. Affiliation of the authors: Yale University, New Haven, Connecticut
  1. Correspondence and reprints: Prakash M. Nadkarni, MD, Center for Medical Informatics, Yale University School of Medicine, P.O. Box 208009, New Haven, CT 06520-8009. e-mail: Prakash.Nadkarni@yale.edu
  • Received 26 March 1999
  • Accepted 10 June 1999

Abstract

Entity-attribute-value (EAV) representation is a means of organizing highly heterogeneous data using a relatively simple physical database schema. EAV representation is widely used in the medical domain, most notably in the storage of data related to clinical patient records. Its potential strengths suggest its use in other biomedical areas, in particular research databases whose schemas are complex as well as constantly changing to reflect evolving knowledge in rapidly advancing scientific domains. When deployed for such purposes, the basic EAV representation needs to be augmented significantly to handle the modeling of complex objects (classes) as well as to manage interobject relationships. The authors refer to their modification of the basic EAV paradigm as EAV/CR (EAV with classes and relationships). They describe EAV/CR representation with examples from two biomedical databases that use it.
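
To make the basic idea concrete, the following is a minimal sketch of a plain EAV layout, not the authors' EAV/CR schema. It assumes an in-memory SQLite database, and the table names (`attribute`, `eav`), the helper functions, and the sample attribute names are all hypothetical, chosen only for illustration. Every fact about an entity is stored as one row of (entity, attribute, value), so a new kind of attribute can be recorded without altering the physical schema. The class and relationship extensions that EAV/CR adds are not shown.

```python
import sqlite3

# Hypothetical two-table EAV layout: attribute definitions plus one
# generic table of (entity, attribute, value) rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attribute (
    attribute_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE
);
CREATE TABLE eav (
    entity_id    INTEGER NOT NULL,
    attribute_id INTEGER NOT NULL REFERENCES attribute(attribute_id),
    value        TEXT
);
""")


def set_value(entity_id, attr_name, value):
    """Record one fact about an entity, registering the attribute if new."""
    conn.execute(
        "INSERT OR IGNORE INTO attribute (name) VALUES (?)", (attr_name,)
    )
    (attr_id,) = conn.execute(
        "SELECT attribute_id FROM attribute WHERE name = ?", (attr_name,)
    ).fetchone()
    # Values are stored as text here for simplicity (an assumption of
    # this sketch, not a statement about the authors' design).
    conn.execute(
        "INSERT INTO eav (entity_id, attribute_id, value) VALUES (?, ?, ?)",
        (entity_id, attr_id, str(value)),
    )


def get_entity(entity_id):
    """Reassemble all stored facts about one entity into a dictionary."""
    rows = conn.execute(
        "SELECT a.name, e.value FROM eav e "
        "JOIN attribute a ON a.attribute_id = e.attribute_id "
        "WHERE e.entity_id = ?",
        (entity_id,),
    ).fetchall()
    return dict(rows)


# Two entities with different attribute sets share the same two tables.
set_value(1, "name", "entity one")
set_value(1, "length_nm", 42)
set_value(2, "name", "entity two")
set_value(2, "species", "rat")

record_1 = get_entity(1)
record_2 = get_entity(2)
print(record_1)
print(record_2)
```

Note that adding the `species` attribute for the second entity required no change to the table definitions, which is the property that makes this style of storage attractive for schemas that change often.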

Footnotes

  • This work was supported by NIH grants U01-CA78266 from the National Cancer Institute, R01-DC03972 from the National Institute of Mental Health, and G01-LM05583 and T15-LM07056 from the National Library of Medicine.

  • * The maximum length of a “short string” depends on the database engine: 2,000 characters in Oracle, 8,000 in Microsoft SQL Server 7.0, and 255 in Microsoft SQL Server 6.5 and Sybase. Short strings can be indexed, whereas long strings (which have arbitrary length) cannot. Also, the characters of a short string are stored contiguously on disk, whereas a long string is stored as separate “blocks,” possibly on different disk sectors, that are chained together.

  • The history of public genome-related databases also supports our experience. Both NCBI's Entrez and the Human Genome Database provide object-at-a-time access through their Web front ends, and their obvious success suggests that most users do not need complex queries. (On the other hand, for the small minority of scientists doing intensive research on large subsets of the data, complex queries are essential.)

The Journal of the American Medical Informatics Association is published for the American Medical Informatics Association by BMJ Publishing Group Ltd.