Genotype–phenotype databases: challenges and solutions for the post-genomic era

Thorisson, Gudmundur A.; Muilu, Juha; Brookes, Anthony J.

doi:10.1038/nrg2483

Review Article
Published: January 2009

Gudmundur A. Thorisson¹,
Juha Muilu² &
Anthony J. Brookes¹

Nature Reviews Genetics volume 10, pages 9–18 (2009)

Key Points

Research data concerning the genetic basis of health and disease is accumulating rapidly, as modern, high-throughput experimental techniques deliver increasingly large data sets.
Data integration efforts in the field face numerous challenges, including the increased data size and complexity, quality control, data sensitivity and personal privacy, data access and publication bias.
Traditional approaches of gathering data into centralized repositories and publishing results in static paper journals, which have proved successful in the past, will not be sufficient to address the emerging and future needs of the field.
The alternative of a partially centralized and partially federated model has been proposed to solve this problem. This will entail a distributed, decentralized network of interconnected information sources and analysis services, the first incarnations of which are now starting to appear. A central requirement of this model is the far greater use of standardization for data models and exchange formats, and in the deployment of existing and emerging software components and network protocols.
Community adoption of new database technologies, and the development of robust data standards, will be vital to achieving the global integration of G2P data in the future. This might also help to address other challenges, such as accrediting and rewarding data submitters and database managers, as we move towards the emergence of a universal G2P 'knowledge environment'.

Abstract

The flow of research data concerning the genetic basis of health and disease is rapidly increasing in speed and complexity. In response, many projects are seeking to ensure that there are appropriate informatics tools, systems and databases available to manage and exploit this flood of information. Previous solutions, such as central databases, journal-based publication and manually intensive data curation, are now being enhanced with new systems for federated databases, database publication, and more automated management of data flows and quality control. Along with emerging technologies that enhance connectivity and data retrieval, these advances should help to create a powerful knowledge environment for genotype–phenotype information.

**Figure 1: Extreme models for database integration.**

**Figure 2: Databases and database networks.**

**Figure 3: Success depends upon recognition and reward.**

Acknowledgements

The authors acknowledge the valuable ideas, advice and funding provided by the GEN2PHEN project as part of the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 200754, which enabled the preparation of this Review.

Author information

Authors and Affiliations

Department of Genetics, University of Leicester, University Road, Leicester, LE1 7RH, UK
Gudmundur A. Thorisson & Anthony J. Brookes
Institute for Molecular Medicine Finland, University of Helsinki, Haartmaninkatu 8, Helsinki, FIN-00290, Finland
Juha Muilu

Corresponding author

Correspondence to Anthony J. Brookes.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Glossary

Screen-scraping: The automated process of extracting data from web pages intended for human viewing.
Genotype-to-phenotype (G2P): The relationship between genetic variation in an organism and how this affects its observable characteristics.
Genome-wide association study (GWA study): Examination of DNA variation (typically SNPs) across the whole genome in a large number of individuals who have been matched for population ancestry and assessed for a disease or trait of interest. Correlations between variants and the trait are used to locate genetic risk factors.
Knowledge representation: Structured presentation of information that facilitates the drawing of inferences or conclusions, often giving predictive abilities.
ENCODE (Encyclopedia of DNA Elements): An international research project to identify all functional elements in the human genome.
Biobanking: Assembling large collections of biosamples and associated information, for the purpose of biomedical investigation.
Syntax: The syntax of information is concerned with how the data is organized, ordered and structured.
Semantics: The semantics of information is concerned with the meaning of the data elements, such as words.
Semantic web: An extension of the World Wide Web that embeds semantics, or meaning, in documents, in links between documents and in descriptions of web services, thereby enabling navigation and reasoning by automated agents.
Genetic association database: A catalogue of reported genetic associations between genotype and phenotype.

About this article

Cite this article

Thorisson, G., Muilu, J. & Brookes, A. Genotype–phenotype databases: challenges and solutions for the post-genomic era. Nat Rev Genet 10, 9–18 (2009). https://doi.org/10.1038/nrg2483

Issue Date: January 2009
DOI: https://doi.org/10.1038/nrg2483
