Key Points
- Research data concerning the genetic basis of health and disease is accumulating rapidly, as modern high-throughput experimental techniques deliver ever larger data sets.
- Data integration efforts in the field face numerous challenges, including increasing data size and complexity, quality control, data sensitivity and personal privacy, data access, and publication bias.
- Traditional approaches of gathering data into centralized repositories and publishing results in static paper journals, which have proved successful in the past, will not be sufficient to address the emerging and future needs of the field.
- A partially centralized and partially federated model has been proposed as an alternative. This will entail a distributed, decentralized network of interconnected information sources and analysis services, the first incarnations of which are now starting to appear. A central requirement of this model is far greater use of standardization, both in data models and exchange formats and in the deployment of existing and emerging software components and network protocols.
- Community adoption of new database technologies, and the development of robust data standards, will be vital to achieving the global integration of genotype-to-phenotype (G2P) data in the future. This might also help to address other challenges, such as accrediting and rewarding data submitters and database managers, as we move towards the emergence of a universal G2P 'knowledge environment'.
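The federated model described in the key points can be illustrated with a minimal sketch. It assumes that every participating source returns records in one shared exchange format and answers the same kind of query. The names `G2PRecord`, `make_source` and `federated_query`, the source names and the identifiers are all hypothetical and are not taken from the Review; the in-memory tables stand in for remote database services.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class G2PRecord:
    """Hypothetical shared exchange format for one genotype-phenotype association."""
    variant_id: str  # illustrative variant identifier
    phenotype: str   # ideally a term from a shared ontology
    source: str      # provenance: which database supplied the record


# Every federated source exposes the same interface: variant ID in, records out.
Source = Callable[[str], Iterable[G2PRecord]]


def make_source(name: str, table: Dict[str, List[str]]) -> Source:
    """Wrap an in-memory table as a stand-in for a remote database service."""
    def query(variant_id: str) -> List[G2PRecord]:
        return [G2PRecord(variant_id, p, name) for p in table.get(variant_id, [])]
    return query


def federated_query(variant_id: str, sources: Iterable[Source]) -> List[G2PRecord]:
    """Send the same query to each source and merge the answers, keeping provenance."""
    results: List[G2PRecord] = []
    for query in sources:
        results.extend(query(variant_id))
    return results


# Illustrative data only; these are not real variants or findings.
locus_db = make_source("locus_db", {"var0001": ["phenotype A"]})
association_db = make_source(
    "association_db",
    {"var0001": ["phenotype A", "phenotype B"], "var0002": ["phenotype C"]},
)
hits = federated_query("var0001", [locus_db, association_db])
for record in hits:
    print(record.source, record.variant_id, record.phenotype)
```

Because both sources emit the same record type, the merging step needs no source-specific code, which is the practical benefit of the standardized data models and exchange formats that the model calls for.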
Abstract
The flow of research data concerning the genetic basis of health and disease is rapidly increasing in speed and complexity. In response, many projects are seeking to ensure that there are appropriate informatics tools, systems and databases available to manage and exploit this flood of information. Previous solutions, such as central databases, journal-based publication and manually intensive data curation, are now being enhanced with new systems for federated databases, database publication, and more automated management of data flows and quality control. Along with emerging technologies that enhance connectivity and data retrieval, these advances should help to create a powerful knowledge environment for genotype–phenotype information.
Acknowledgements
The authors acknowledge the valuable ideas, advice and funding provided by the GEN2PHEN project as part of the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 200754, which enabled the preparation of this Review.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Related links
FURTHER INFORMATION
Biobanking and Biomolecular Resources Research Infrastructure (BBMRI)
Cancer Biomedical Informatics Grid (caBIG)
Coordination and Sustainability of International Mouse Informatics Resources (CASIMIR)
Enabling Grids for E-sciencE (EGEE)
European Advanced Translational Research Infrastructure in Medicine (EATRIS)
European Biobanking and Biomolecular Resources Research Infrastructure (BBMRI)
European Clinical Research Infrastructures Network (ECRIN)
European Genotype Archive (EGA)
European Life Sciences Infrastructure for Biological Information (ELIXIR)
European Model for Bioinformatics Research and Community Education (EMBRACE)
European Network of Genomic and Genetic Epidemiology (ENGAGE)
European Strategy Forum on Research Infrastructures (ESFRI)
Generic Model Organism Database (GMOD)
Genes, Environment and Health Initiative
Genetic Association Database (GAD)
Human Gene Mutation Database (HGMD)
Human Genome Epidemiology Network (HuGENet)
Human Genome Variation Society
Human Genomics and Proteomics journal
International Nucleotide Sequence Database Collaboration (INSDC)
Minimum Information for Biological and Biomedical Investigations (MIBBI)
Minimum Information for QTLs and Association Studies specification (MIQAS)
Online Mendelian Inheritance in Man (OMIM)
Open Biomedical Ontologies (OBO)
Persistent Uniform Resource Locator (PURL)
Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB)
Phenotype and Genotype Experiment Object Model (PaGE-OM)
Public Population Project in Genomics (P3G)
Public Population Project in Genomics observatory
Resource Description Framework (RDF)
Service-oriented architecture (SOA)
Glossary
- Screen-scraping: The automated process of extracting data from web pages intended for human viewing.
- Genotype-to-phenotype (G2P): The relationship between genetic variation in an organism and how this affects its observable characteristics.
- Genome-wide association study (GWA study): Examination of DNA variation (typically SNPs) across the whole genome in a large number of individuals who have been matched for population ancestry and assessed for a disease or trait of interest. Correlations between variants and the trait are used to locate genetic risk factors.
- Knowledge representation: Structured presentation of information that facilitates the drawing of inferences or conclusions, often giving predictive abilities.
- ENCODE (Encyclopedia of DNA Elements): An international research project to identify all functional elements in the human genome.
- Biobanking: Assembling large collections of biosamples and associated information, for the purpose of biomedical investigation.
- Syntax: The syntax of information is concerned with how the data is organized, ordered and structured.
- Semantics: The semantics of information is concerned with the meaning of the data elements, such as words.
- Semantic web: An extension of the World Wide Web that embeds semantics, or meaning, in documents, in links between documents and in descriptions of web services, thereby enabling navigation and reasoning by automated agents.
- Genetic association database: A catalogue of reported genetic associations between genotype and phenotype.
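The semantic web entry in the glossary can be made concrete with a small sketch in plain Python. It assumes statements are stored as subject-predicate-object triples, in the spirit of RDF but without using any RDF library. The identifiers, the predicate names and the helper functions `match` and `phenotypes_for_gene` are hypothetical and serve only as illustration.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical statements; identifiers are illustrative, not real data.
TRIPLES: List[Triple] = [
    ("variant:V1", "locatedIn", "gene:G1"),
    ("variant:V1", "associatedWith", "phenotype:P1"),
    ("variant:V2", "locatedIn", "gene:G1"),
    ("variant:V2", "associatedWith", "phenotype:P2"),
    ("variant:V3", "locatedIn", "gene:G2"),
]


def match(triples: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def phenotypes_for_gene(triples: List[Triple], gene: str) -> List[str]:
    """Follow two links (gene <- variant -> phenotype), as an automated agent might."""
    found = set()
    for variant, _, _ in match(triples, p="locatedIn", o=gene):
        for _, _, phenotype in match(triples, s=variant, p="associatedWith"):
            found.add(phenotype)
    return sorted(found)


print(phenotypes_for_gene(TRIPLES, "gene:G1"))
```

The predicates carry the meaning of each link, so a program can chain statements from different sources without knowing in advance how each source lays out its data.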
Cite this article
Thorisson, G., Muilu, J. & Brookes, A. Genotype–phenotype databases: challenges and solutions for the post-genomic era. Nat Rev Genet 10, 9–18 (2009). https://doi.org/10.1038/nrg2483
DOI: https://doi.org/10.1038/nrg2483