Abstract
Access to genetic data across studies is an important aspect of identifying new genetic associations through genome-wide association studies (GWASs). Meta-analysis across multiple GWASs with combined cohort sizes of tens of thousands of individuals often uncovers many more genome-wide associated loci than the original individual studies; this emphasizes the importance of tools and mechanisms for data sharing. However, even sharing summary-level data, such as allele frequencies, inherently carries some degree of privacy risk to study participants. Here we discuss mechanisms and resources for sharing data from GWASs, particularly focusing on approaches for assessing and quantifying the privacy risks to participants that result from the sharing of summary-level data.
This is a preview of subscription content, access via your institution
Access options
Subscribe to this journal
Receive 12 print issues and online access
$189.00 per year
only $15.75 per issue
Buy this article
- Purchase on Springer Link
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
Change history
27 September 2011
In the above article, the incorrect link was provided for GWAS Central. The correct link should have been http://www.gwascentral.org. In the Further Information Box, the link to http://gwas.nih.gov was incorrectly described as 'GWAS Central (includes policy)'.The editors apologize for this error.
References
Hirschhorn, J. N. & Daly, M. J. Genome-wide association studies for common diseases and complex traits. Nature Rev. Genet. 6, 95–108 (2005).
Klein, R. J. et al. Complement factor H polymorphism in age-related macular degeneration. Science 308, 385–389 (2005).
Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
Zhernakova, A. et al. Meta-analysis of genome-wide association studies in celiac disease and rheumatoid arthritis identifies fourteen non-HLA shared loci. PLoS Genet. 7, e1002004 (2011).
Hollingworth, P. et al. Common variants at ABCA7, MS4A6A/MS4A4E, EPHA1, CD33 and CD2AP are associated with Alzheimer's disease. Nature Genet. 43, 429–435 (2011).
Schunkert, H. et al. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nature Genet. 43, 333–338 (2011).
Teslovich, T. M. et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–713 (2010).
Kho, A. N. et al. Electronic medical records for genetic research: results of the eMERGE consortium. Sci. Transl. Med. 3, 79re1 (2011).
Skol, A. D., Scott, L. J., Abecasis, G. R. & Boehnke, M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nature Genet. 38, 209–213 (2006).
Marchini, J. & Howie, B. Genotype imputation for genome-wide association studies. Nature Rev. Genet. 11, 499–511 (2010).
Durbin, R. M. et al. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
Zheng, S. L. et al. Cumulative association of five genetic variants with prostate cancer. N. Engl. J. Med. 358, 910–919 (2008).
Vacic, V. et al. Duplications of the neuropeptide receptor gene VIPR2 confer significant risk for schizophrenia. Nature 471, 499–503 (2011).
Heeney, C., Hawkins, N., de Vries, J., Boddington, P. & Kaye, J. Assessing the privacy risks of data sharing in genomics. Public Health Genomics 14, 17–25 (2011).
Church, G. et al. Public access to genome-wide data: five views on balancing research with privacy and protection. PLoS Genet. 5, e1000665 (2009).
Preuss, M. et al. Design of the Coronary ARtery DIsease Genome-Wide Replication And Meta-Analysis (CARDIoGRAM) Study: a genome-wide association meta-analysis involving more than 22 000 cases and 60 000 controls. Circ. Cardiovasc. Genet. 3, 475–483 (2010).
Speliotes, E. K. et al. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nature Genet. 42, 937–948 (2010).
Cornelis, M. C. et al. The gene, environment association studies consortium (GENEVA): maximizing the knowledge obtained from GWAS by collaboration across studies of multiple conditions. Genet. Epidemiol. 34, 364–372 (2010).
The Psychiatric GWAS Consortium Steering Committee. A framework for interpreting genome-wide association studies of psychiatric disorders. Mol. Psychiatry 14, 10–17 (2009).
Nelson, M. R. et al. The Population Reference Sample, POPRES: a resource for population, disease, and pharmacological genetics research. Am. J. Hum. Genet. 83, 347–358 (2008).
The International HapMap Consortium. The International HapMap Project. Nature 426, 789–796 (2003).
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
Mailman, M. D. et al. The NCBI dbGaP database of genotypes and phenotypes. Nature Genet. 39, 1181–1186 (2007).
Leinonen, R. et al. The European Nucleotide Archive. Nucleic Acids Res. 39, D28–D31 (2011).
Yu, W., Gwinn, M., Clyne, M., Yesupriya, A. & Khoury, M. J. A navigator for human genome epidemiology. Nature Genet. 40, 124–125 (2008).
Thorisson, G. A. et al. HGVbaseG2P: a central genetic association database. Nucleic Acids Res. 37, D797–D802 (2009).
Hirakawa, M. et al. JSNP: a database of common gene variations in the Japanese population. Nucleic Acids Res. 30, 158–162 (2002).
Hindorff, L. A. et al. PheGenI: an integrated resource for browsing genetic association data. Proc. of the 2011 AMIA Summit on Translational Bioinformatics [online], (2011).
Homer, N. et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 4, e1000167 (2008).
Sankararaman, S., Obozinski, G., Jordan, M. I. & Halperin, E. Genomic privacy and limits of individual detection in a pool. Nature Genet. 41, 965–967 (2009).
Jacobs, K. B. et al. A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies. Nature Genet. 41, 1253–1257 (2009).
Neyman, J. & Pearson, E. On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. R. Soc. Lond. A 231, 289–337 (1933).
Braun, R., Rowe, W., Schaefer, C., Zhang, J. & Buetow, K. Needles in the haystack: identifying individuals present in pooled genomic data. PLoS Genet. 5, e1000668 (2009).
Wang, R., Li, Y. F., Wang, X., Tang, H. & Zhou, X. Learning your identity and disease from research papers: information leaks in genome wide association study. Proc. of the 16th ACM Conf. on Computer and Communications Security, 534–544 (2009).
Visscher, P. M. & Hill, W. G. The limits of individual identification from sample allele frequencies: theory and statistical analysis. PLoS Genet. 5, e1000628 (2009).
Clayton, D. On inferring presence of an individual in a mixture: a Bayesian approach. Biostatistics 11, 661–673 (2010).
Sampson, J. & Zhao, H. Identifying individuals in a complex mixture of DNA with unknown ancestry. Stat. Appl. Genet. Mol. Biol. 8, 37 (2009).
Zerhouni, E. A. & Nabel, E. G. Protecting aggregate genomic data. Science 322, 44 (2008).
Krawczak, M., Goebel, J. W. & Cooper, D. N. Is the NIH policy for sharing GWAS data running the risk of being counterproductive? Investig. Genet. 1, 3 (2010).
Haga, S. B. & O'Daniel, J. Public perspectives regarding data-sharing practices in genomics research. Public Health Genomics 24 Mar 2011 (doi:10.1159/000324705).
Malin, B., Karp, D. & Scheuermann, R. H. Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. J. Investig. Med. 58, 11–18 (2010).
Elias-Sonnenschein, L. S., Viechtbauer, W., Ramakers, I. H., Verhey, F. R. & Visser, P. J. Predictive value of APOE-ɛ4 allele for progression from MCI to AD-type dementia: a meta-analysis. J. Neurol. Neurosurg. Psychiatry 14 Apr 2011 (doi:10.1136/jnnp.2010.231555).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Acknowledgements
This manuscript represents the views and opinions of its authors and does not necessarily represent the views or policies of the NIH or the US Department of Health and Human Services. This research was supported in part by the Intramural Research Program of the NIH National Library of Medicine. D.W.C. would like to acknowledge support from the US National Heart, Lung and Blood Institute (NHLBI), award U01 HL086528. The authors thank I. Marpuri, S. Buchholtz and L. Gyi for their support in coordinating the development of this work.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Related links
Related links
FURTHER INFORMATION
23andMe personal genomics company
Electronic Medical Records and Genomics (eMERGE) Network
Database of Genotypes and Phenotypes (dbGaP)
European Genome–Phenome Archive
Genome Medicine Database of Japan (GeMDBJ)
Nature Reviews Genetics series on Genome-Wide Association Studies
Phenotype–Genotype Integrator (PheGenI)
Public Population Project in Genomics (P3G)
SNP Health Association Resource (SHARe)
US National Human Genome Research Institute catalogue of published GWASs
Glossary
- Allele frequency
-
The frequency of the less-common allele of a polymorphism. It has a value between 0 and 0.5 and can vary between populations.
- Bayesian
-
A statistical framework for evaluating a hypothesis. The Bayesian approach assesses the probability of a hypothesis being correct by incorporating the prior probability of the hypothesis.
- Discrimination threshold
-
The significance threshold for rejecting the null hypothesis in a statistical test.
- Frequentist
-
A statistical framework for evaluating a hypothesis. The frequentist approach tests a hypothesis as being correct given the strength of a data set.
- Imputation
-
A method for inferring untyped variants from neighbouring variants, based on linkage disequilibrium and haplotype structure.
- Linear regression
-
The estimation of a first-order relationship between two variables, which involves fitting a line of best fit to the data.
- Missingness
-
The percentage of samples that do not receive a genotype call for a SNP in a genome-wide association study.
- Neyman–Pearson lemma
-
A theorem that assures the optimality of a likelihood ratio test between simple hypotheses at a given threshold.
- Prevalence
-
The prior probability that a person is in a data set of interest. Alternatively, the term can refer to the fraction of individuals in a data set out of the total number of individuals that could be in the data set.
- Reference data set
-
A data set of samples from individuals who are from the same population that was sampled in the summary-level data set of interest.
Rights and permissions
About this article
Cite this article
Craig, D., Goor, R., Wang, Z. et al. Assessing and managing risk when sharing aggregate genetic variant data. Nat Rev Genet 12, 730–736 (2011). https://doi.org/10.1038/nrg3067
Published:
Issue Date:
DOI: https://doi.org/10.1038/nrg3067
This article is cited by
-
Sociotechnical safeguards for genomic data privacy
Nature Reviews Genetics (2022)
-
Reconstructing SNP allele and genotype frequencies from GWAS summary statistics
Scientific Reports (2022)
-
Registered access: authorizing data access
European Journal of Human Genetics (2018)
-
Balancing the local and the universal in maintaining ethical access to a genomics biobank
BMC Medical Ethics (2017)
-
Beneficial effect of chronic Staphylococcus aureus infection in a model of multiple sclerosis is mediated through the secretion of extracellular adherence protein
Journal of Neuroinflammation (2015)