Abstract
Tens of millions of base pairs of euchromatic human genome sequence, including many protein-coding genes, have no known location in the human genome. We describe an approach for localizing the human genome's missing pieces using the patterns of genome sequence variation created by population admixture. We mapped the locations of 70 scaffolds spanning 4 million base pairs of the human genome's unplaced euchromatic sequence, including more than a dozen protein-coding genes, and identified 8 new large interchromosomal segmental duplications. We find that most of these sequences are hidden in the genome's heterochromatin, particularly its pericentromeric regions. Many cryptic, pericentromeric genes are expressed at the RNA level and have been maintained intact for millions of years while their expression patterns diverged from those of paralogous genes elsewhere in the genome. We describe how knowledge of the locations of these sequences can inform disease association and genome biology studies.
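The core idea of localizing a sequence through admixture can be illustrated with a toy calculation. In an admixed population, local ancestry varies along the genome from person to person, so a variant on an unplaced scaffold whose allele frequency differs between the ancestral populations should track local ancestry most closely near its true position. The sketch below is only an illustration under that assumption, not the authors' pipeline: the squared-correlation score, the function names `pearson` and `rank_loci`, the locus labels and all the numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_loci(scaffold_dosage, local_ancestry):
    """Rank candidate loci by squared correlation between the scaffold
    variant's allele dosage and local ancestry dosage, best first."""
    scores = {
        locus: pearson(scaffold_dosage, ancestry) ** 2
        for locus, ancestry in local_ancestry.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data for 8 individuals: copies (0, 1 or 2) of one ancestry
# at three made-up candidate loci.
local_ancestry = {
    "chr1:145Mb": [2, 1, 0, 2, 1, 0, 1, 2],
    "chr9:40Mb": [0, 1, 2, 1, 0, 2, 2, 1],
    "chr16:35Mb": [1, 1, 1, 0, 2, 2, 0, 1],
}

# Hypothetical allele dosage at a variant on an unplaced scaffold.
scaffold_dosage = [2, 1, 0, 2, 1, 0, 1, 1]

ranked = rank_loci(scaffold_dosage, local_ancestry)
best_locus, best_r2 = ranked[0]
print(best_locus, round(best_r2, 3))
```

In this toy example the scaffold variant tracks ancestry at the first candidate locus far more closely than at the other two, so that locus would be the proposed placement. A real analysis would scan the whole genome, use inferred rather than known local ancestry, and require a significance threshold before accepting a placement.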
Acknowledgements
This study was supported by grants RC1 GM091332-01 (S.A.M. and J.G.W.), R01 HG006855 (S.A.M.) and R01DK54931 (G.G. and M.R.P.) from the US National Institutes of Health and by a Smith Family Foundation Award for Excellence in Biomedical Research (S.A.M.).
The Jackson Heart Study is supported and conducted in collaboration with Jackson State University (N01-HC-95170), University of Mississippi Medical Center (N01-HC-95171) and Tougaloo College (N01-HC-95172) contracts from the National Heart, Lung, and Blood Institute (NHLBI) and the National Institute for Minority Health and Health Disparities (NIMHD), with additional support from the National Institute on Biomedical Imaging and Bioengineering (NIBIB).
The Atherosclerosis Risk in Communities Study is carried out as a collaborative study supported by NHLBI contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C and HHSN268201100012C).
The Coronary Artery Risk Development in Young Adults Study (CARDIA) is conducted and supported by the NHLBI in collaboration with the University of Alabama at Birmingham (N01-HC95095 and N01-HC48047), the University of Minnesota (N01-HC48048), Northwestern University (N01-HC48049) and the Kaiser Foundation Research Institute (N01-HC48050).
MESA, MESA Family and the MESA SHARe project are conducted and supported by the NHLBI in collaboration with the MESA investigators. Support for MESA is provided by contracts N01-HC-95159 through N01-HC-95169 and RR-024156. Funding for MESA Family is provided by grants R01-HL-071051, R01-HL-071205, R01-HL-071250, R01-HL-071251, R01-HL-071252, R01-HL-071258 and R01-HL-071259. MESA Air is funded by the US Environmental Protection Agency (EPA)–Science to Achieve Results (STAR) Program Grant RD831697. Funding for genotyping was provided by NHLBI contracts N02-HL-6-4278 and N01-HC-65226.
This manuscript does not necessarily reflect the opinions or views of ARIC, CARDIA, JHS, MESA or the NHLBI.
Author information
Contributions
G.G. and S.A.M. conceived the project, designed the analyses and wrote the manuscript. G.G. performed the analysis of the CARe, ICDB, JHS and BodyMap 2.0 data sets. R.E.H. performed the sequence read depth analysis of selected regions. H.L. performed the alignments of HuRef scaffolds and GenBank clones. N.A. contributed the analysis of the HuRef unplaced scaffolds. A.M.L. performed the FISH experiments. K.C. organized and contributed to the design of the Sequenom experiment. B.P., A.L.P. and D.R. provided advice for the local ancestry inference. C.C.M. participated in the interpretation of the FISH experiments. M.R.P. participated in planning discussions for the linkage analysis. J.G.W. participated in planning discussions, coordinated interactions with JHS and edited the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Note, Supplementary Tables 1–13 and Supplementary Figures 1–24 (PDF 1973 kb)
About this article
Cite this article
Genovese, G., Handsaker, R., Li, H. et al. Using population admixture to help complete maps of the human genome. Nat Genet 45, 406–414 (2013). https://doi.org/10.1038/ng.2565
DOI: https://doi.org/10.1038/ng.2565