Assessing and managing risk when sharing aggregate genetic variant data

Craig, David W.; Goor, Robert M.; Wang, Zhenyuan; Paschall, Justin; Ostell, Jim; Feolo, Michael; Sherry, Stephen T.; Manolio, Teri A.

doi:10.1038/nrg3067

Opinion
Published: 16 September 2011

Nature Reviews Genetics volume 12, pages 730–736 (2011)

An Erratum to this article was published on 27 September 2011

This article has been updated

Abstract

Access to genetic data across studies is an important aspect of identifying new genetic associations through genome-wide association studies (GWASs). Meta-analysis across multiple GWASs with combined cohort sizes of tens of thousands of individuals often uncovers many more genome-wide associated loci than the original individual studies; this emphasizes the importance of tools and mechanisms for data sharing. However, even sharing summary-level data, such as allele frequencies, inherently carries some degree of privacy risk to study participants. Here we discuss mechanisms and resources for sharing data from GWASs, particularly focusing on approaches for assessing and quantifying the privacy risks to participants that result from the sharing of summary-level data.

**Figure 1: Sharing 5,000 SNPs at different prevalence or prior probabilities.**

Change history

27 September 2011
In the above article, an incorrect link was provided for GWAS Central. The correct link should have been http://www.gwascentral.org. In the Further Information Box, the link to http://gwas.nih.gov was incorrectly described as 'GWAS Central (includes policy)'. The editors apologize for this error.

Acknowledgements

This manuscript represents the views and opinions of its authors and does not necessarily represent the views or policies of the NIH or the US Department of Health and Human Services. This research was supported in part by the Intramural Research Program of the NIH National Library of Medicine. D.W.C. would like to acknowledge support from the US National Heart, Lung and Blood Institute (NHLBI), award U01 HL086528. The authors thank I. Marpuri, S. Buchholtz and L. Gyi for their support in coordinating the development of this work.

Author information

Authors and Affiliations

David W. Craig is at the Translational Genomics Research Institute (TGen), Phoenix, Arizona 85004, USA.
Robert M. Goor, Zhenyuan Wang, Justin Paschall, Jim Ostell, Michael Feolo and Stephen T. Sherry are at the National Center for Biotechnology Information (NCBI), Bethesda, Maryland 20892, USA.
Teri A. Manolio is at the National Human Genome Research Institute (NHGRI), Bethesda, Maryland 20892, USA.

Corresponding author

Correspondence to David W. Craig.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Glossary

Allele frequency: The frequency of the less-common allele of a polymorphism. It has a value between 0 and 0.5 and can vary between populations.
Bayesian: A statistical framework for evaluating a hypothesis. The Bayesian approach assesses the probability of a hypothesis being correct by incorporating the prior probability of the hypothesis.
Discrimination threshold: The significance threshold for rejecting the null hypothesis in a statistical test.
Frequentist: A statistical framework for evaluating a hypothesis. The frequentist approach tests a hypothesis as being correct given the strength of a data set.
Imputation: A method for inferring untyped variants from neighbouring variants, based on linkage disequilibrium and haplotype structure.
Linear regression: The estimation of a first-order relationship between two variables, which involves fitting a line of best fit to the data.
Missingness: The percentage of samples that do not receive a genotype call for a SNP in a genome-wide association study.
Neyman–Pearson lemma: A theorem that assures the optimality of a likelihood ratio test between simple hypotheses at a given threshold.
Prevalence: The prior probability that a person is in a data set of interest. Alternatively, the term can refer to the fraction of individuals in a data set out of the total number of individuals that could be in the data set.
Reference data set: A data set of samples from individuals who are from the same population that was sampled in the summary-level data set of interest.
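
The Bayesian, Prevalence and Neyman–Pearson entries above can be tied together with a small numerical sketch. The Python below is illustrative only and is not taken from the article. It assumes independent SNPs, genotypes coded as 0, 1 or 2 copies of an allele, and a simplified per-SNP likelihood ratio that compares the summary-level (pool) allele frequency with a reference allele frequency. The function names `log_likelihood_ratio` and `posterior_membership` are hypothetical.

```python
import math


def log_likelihood_ratio(genotypes, pool_freqs, ref_freqs):
    """Sum per-SNP log likelihood ratios for one individual.

    Assumption: each genotype is Binomial(2, f), with f taken from the pool
    under the 'member' hypothesis and from the reference data set under the
    'non-member' hypothesis. The binomial coefficient cancels in the ratio.
    """
    total = 0.0
    for g, p, q in zip(genotypes, pool_freqs, ref_freqs):
        total += g * math.log(p / q) + (2 - g) * math.log((1 - p) / (1 - q))
    return total


def posterior_membership(log_lr, prevalence):
    """Bayes' rule in odds form: posterior odds = likelihood ratio x prior odds.

    `prevalence` is the prior probability that the person is in the data set.
    """
    log_odds = log_lr + math.log(prevalence / (1 - prevalence))
    # Numerically stable logistic transform of the posterior log odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


if __name__ == "__main__":
    # The same evidence (likelihood ratio of 9) under two different priors.
    for prior in (0.5, 0.1):
        print(prior, posterior_membership(math.log(9), prior))
```

Under these assumptions, identical genotype evidence yields a lower posterior probability of membership when the prevalence is lower, which is why the prior matters when judging how much risk a shared set of allele frequencies carries.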

About this article

Cite this article

Craig, D., Goor, R., Wang, Z. et al. Assessing and managing risk when sharing aggregate genetic variant data. Nat Rev Genet 12, 730–736 (2011). https://doi.org/10.1038/nrg3067

Issue Date: October 2011
DOI: https://doi.org/10.1038/nrg3067
