Reference epigenomes enable comprehensive annotations of dynamic non-coding regulatory and transcribed elements across hundreds of human cell types and tissues
Reference epigenome mapping across tissues and cell types
Integrative analysis of 111 reference human epigenomes
Roadmap Epigenomics Consortium et al. Nature 10.1038/nature14248
We jointly processed and analyzed our 111 reference epigenomes with 16 additional epigenomes from ENCODE9,23. We generated genome-wide normalized coverage tracks, peaks and broad enriched domains for ChIP-seq and DNase-seq7,32, normalized gene expression values for RNA-seq33, and fractional methylation levels for each CpG site31,34,35.
The resulting datasets provide global views of the epigenomic landscape in a wide range of human cell and tissue types (Fig. 3), including: the largest and most diverse collection to date of chromatin state annotations (Fig. 3a); some of the deepest surveys of individual cell types using diverse epigenomic assays (with 21–31 distinct epigenomic marks for seven deeply-profiled epigenomes, Fig. 3b); and some of the broadest surveys of individual epigenomic marks across multiple cell types (Fig. 3c). These datasets enable genome-wide epigenomic analyses across multiple dimensions (Fig. 3d). All datasets, standards and protocols are publicly available from web portals, linked from the main consortium homepage (http://www.roadmapepigenomics.org), including the supplementary website for this paper (http://compbio.mit.edu/roadmap).
Integrative chromatin state annotations across cell types and tissues
Integrative analysis of 111 reference human epigenomes
Roadmap Epigenomics Consortium et al. Nature 10.1038/nature14248
As a foundation for integrative analysis, we learned a common set of combinatorial chromatin states40 across all 111 epigenomes, plus 16 additional epigenomes generated by the ENCODE project (127 epigenomes in total), using the core set of five histone modification marks that were common to all. We learned a 15-state model (Fig. 4a,b, Table S3a) consisting of 8 active states and 7 repressed states (Fig. 4c) that were recurrently recovered (Extended Data 2a), and showed distinct levels of DNA methylation (Fig. 4d), DNA accessibility (Fig. 4e), regulator binding (Extended Data 2b, Fig. S2), and evolutionary conservation (Fig. 4f, Fig. S3). The active states (associated with expressed genes) consist of active transcription start site (TSS)-proximal promoter states (TssA, TssAFlnk), a transcribed state at the 5′ and 3′ end of genes showing both promoter and enhancer signatures (TxFlnk), actively-transcribed states (Tx, TxWk), enhancer states (Enh, EnhG), and a state associated with zinc finger protein genes (ZNF/Rpts). The inactive states consist of constitutive heterochromatin (Het), bivalent regulatory states (TssBiv, BivFlnk, EnhBiv), repressed Polycomb states (ReprPC, ReprPCWk), and a quiescent state (Quies) which covers on average 68% of each reference epigenome. Enhancer and promoter states cover approximately 5% of each reference epigenome on average, and show enrichment for evolutionarily-conserved non-exonic regions41.
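The grouping of the 15 states into active and inactive classes can be written down as a small lookup table. The following is only an illustrative sketch: the state mnemonics are taken from the paragraph above, while the names `STATE_GROUPS` and `is_active` are hypothetical and not part of any consortium software.

```python
# Mnemonics of the 15-state model as listed in the text, grouped by activity.
# STATE_GROUPS and is_active are illustrative names, not consortium code.
STATE_GROUPS = {
    "active": ["TssA", "TssAFlnk", "TxFlnk", "Tx", "TxWk", "Enh", "EnhG", "ZNF/Rpts"],
    "inactive": ["Het", "TssBiv", "BivFlnk", "EnhBiv", "ReprPC", "ReprPCWk", "Quies"],
}


def is_active(state: str) -> bool:
    """Return True if the mnemonic belongs to the active group."""
    if state in STATE_GROUPS["active"]:
        return True
    if state in STATE_GROUPS["inactive"]:
        return False
    raise KeyError(f"unknown chromatin state: {state}")


# Sanity check against the counts stated in the text (8 active + 7 inactive).
n_active = len(STATE_GROUPS["active"])
n_inactive = len(STATE_GROUPS["inactive"])
print(n_active, n_inactive, n_active + n_inactive)
```

A table like this makes it easy to collapse per-bin state calls into an active/inactive track before any downstream comparison.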
To capture the greater complexity afforded by additional marks, we learned additional chromatin state models in subsets of cell types. In the subset of 98 reference epigenomes that also included H3K27ac data, we also learned an 18-state model (Extended Data 2c, Table S3b), enabling us to distinguish enhancer states containing strong H3K27ac signal (EnhA1, EnhA2), which showed higher DNA accessibility (Extended Data 3a), lower methylation (Extended Data 3b), and higher TF binding (Extended Data 2c) than enhancers lacking H3K27ac.
In a subset of 7 epigenomes with an average of 24 epigenomic marks, we learned separate 50-state chromatin state models based on all the available histone marks and DNA accessibility in each epigenome (Fig. S4), which additionally distinguished: a DNase-state with distinct TF binding enrichments (Fig. S4f), including for mediator/cohesin components42 (even though CTCF was not included as an input track to learn the model) and repressor NRSF; transcribed states showing H3K79me1 and H3K79me2 and associated with the 5′ ends of genes and introns; and a large number of putative regulatory and neighboring regions showing diverse acetylation marks even in absence of the H3K4 methylation signatures characteristic of enhancer and promoter regions.
Dynamics of chromatin states and chromosomal domains across cell types and tissues
Integrative analysis of 111 reference human epigenomes
Roadmap Epigenomics Consortium et al. Nature 10.1038/nature14248
We next sought to characterize the overall variability of each chromatin state across the full range of cell and tissue types. We first evaluated the observed consistency of each chromatin state at any given genomic position across all 127 epigenomes (Fig. 5a). We found that H3K4me1-associated states (including TxFlnk, EnhG, EnhBiv, Enh) are the most tissue-specific, with 90% of instances present in at most 5–10 epigenomes, followed by bivalent promoters (TssBiv), and repressed states (ReprPC, Het). In contrast, active promoters (TssA) and transcribed states (Tx, TxWk) were highly constitutive, with 90% of regions marked in as many as 60–75 epigenomes, and quiescent regions (Quies) were the most constitutive, with 90% of Quies regions consistently marked as Quies in most of the 127 epigenomes. These results held in the 18-state chromatin state model (Extended Data 5a), and in the subset of highest-quality epigenomes (Fig. S6a,b).
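One way to read the "90% of instances present in at most k epigenomes" statistic is sketched below on toy data. This is an assumption about how such a summary could be computed, not the consortium's actual procedure: for every genomic bin that carries a given state in at least one epigenome, count the epigenomes carrying it, then report the count below which 90% of those bins fall. The function name `epigenomes_covering_fraction` and the toy labels are hypothetical.

```python
import math


def epigenomes_covering_fraction(labels, state, fraction=0.9):
    """labels: one list per genomic bin, holding one state label per epigenome.

    Returns the smallest k such that at least `fraction` of the bins that carry
    `state` in any epigenome carry it in at most k epigenomes.
    """
    counts = sorted(row.count(state) for row in labels if state in row)
    if not counts:
        raise ValueError(f"state {state!r} never occurs")
    index = math.ceil(fraction * len(counts)) - 1
    return counts[index]


# Toy example: 5 bins (rows) x 4 epigenomes (columns).
toy_labels = [
    ["Enh", "Quies", "Quies", "Quies"],
    ["Quies", "Enh", "Quies", "Quies"],
    ["TssA", "TssA", "TssA", "TssA"],
    ["TssA", "TssA", "TssA", "Quies"],
    ["Quies", "Quies", "Quies", "Quies"],
]

enh_k = epigenomes_covering_fraction(toy_labels, "Enh")
tssa_k = epigenomes_covering_fraction(toy_labels, "TssA")
print(enh_k, tssa_k)
```

In this toy input the enhancer state behaves as tissue-specific (small k) and the active promoter state as constitutive (k close to the number of epigenomes), mirroring the qualitative contrast described in the text.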
Adjusting for the overall coverage and variability of each state, we then studied differences in the relative fraction of the genome annotated to each chromatin state between cell types (Fig. 5b, Extended Data 5b, S6c-e). Hematopoietic stem cells and immune cells show a consistent and previously unrecognized depletion of active and bivalent promoters (TssA, TssBiv) and weakly transcribed states (TxWk), which may be related to their capacity to generate sub-lineages and enter quiescence (reversible G0 phase). ESCs and iPSCs show enrichment of TssBiv, consistent with previous studies57, and a depletion of ReprPCWk (defined by weak H3K27me3), possibly due to restriction of H3K27me3-establishing Polycomb proteins to promoter regions. Surprisingly, IMR90 fetal lung fibroblasts, which were previously used as a somatic reference cell type58, are in fact a strong outlier in multiple ways, showing higher levels of Het, ReprPC and EnhG, and a depletion of Quies chromatin states.
We next studied the relative frequency with which different chromatin states switch to other states across different tissues and cell types (Fig. 5c), relative to switching across samples of the same tissue or cell type (Fig. S7a,b). This revealed a relative switching enrichment between active states and repressed states, consistent with activation and repression of regulatory regions. The only exception was significant switching between transcribed states and active promoter and enhancer states, possibly due to alternative usage of promoters22 and enhancers59 embedded within transcribed elements. These chromatin state switching properties were also found in the 18-state model incorporating H3K27ac marks (Extended Data 5c) and in the subset of 16 ENCODE reference epigenomes using both models (Fig. S7c,d). We found that enhancers and promoters maintained their identity, except for a small subset of regions switching between enhancer signatures and promoter signatures60. Luciferase assays showed that these regions indeed possess both enhancer and promoter activity60, consistent with their epigenomic marks.
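The idea of comparing state switching between tissues against switching between samples of the same tissue can be sketched as follows. This is a simplified illustration under stated assumptions: segmentations are aligned lists of per-bin state labels, switching is measured as the fraction of bins changing from one state to another, and a small pseudocount (an arbitrary choice made here) avoids division by zero. The function names and sample labels are hypothetical, and the normalization used in the paper may differ.

```python
from collections import Counter


def switch_frequencies(seg_a, seg_b):
    """Fraction of bins showing each ordered (state in a, state in b) switch."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segmentations must cover the same bins")
    counts = Counter((x, y) for x, y in zip(seg_a, seg_b) if x != y)
    n_bins = len(seg_a)
    return {pair: c / n_bins for pair, c in counts.items()}


def relative_switching(between, within, pseudocount=1e-6):
    """Ratio of between-tissue to within-tissue switching for every observed pair."""
    pairs = set(between) | set(within)
    return {
        p: (between.get(p, 0.0) + pseudocount) / (within.get(p, 0.0) + pseudocount)
        for p in pairs
    }


# Toy segmentations over four bins (hypothetical samples).
liver_1 = ["Enh", "Quies", "TssA", "Tx"]
liver_2 = ["Enh", "Quies", "TssA", "TxWk"]
brain_1 = ["Quies", "Quies", "TssA", "Tx"]

between = switch_frequencies(liver_1, brain_1)  # different tissues
within = switch_frequencies(liver_1, liver_2)   # same tissue
ratios = relative_switching(between, within)
print(ratios)
```

Ratios well above 1 flag switches that are characteristic of tissue differences, while ratios below 1 flag switches that mostly reflect sample-to-sample variation.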
While our chromatin state analysis focused at the nucleosome resolution (200-bp), we also studied the overall co-occurrence of chromatin states across tissues at a larger 2Mb resolution to recognize higher-order properties (Fig. 5d). This analysis revealed that 2Mb segments rich in active enhancers are constrained to approximately 40% of the genome (clusters c1-c6), with the remainder marked predominantly by inactive regions (c7-c11), consistent with the identification of two large chromatin conformation compartments12,61. However, both compartments can be further subdivided by their chromatin state composition: inactive regions separate into predominantly quiescent (40%; c9, c11), heterochromatic (10%, c10), or bivalent (10%, c7-c8) marked regions; and active regions separate into regions rich in multiple marks (c3 and c6, showing a large diversity of active, ReprPC, and bivalent states), weakly-transcribed regions (c5, showing primarily Enh and TxWk states), and regions of intermediate activity (c1, c2, c4). As these subdivisions are based on average state density across a large diversity of cell types, we expected them to be stable chromosomal features, and indeed, they showed strong differences in gene density, CpG island occupancy, lamina association62,63 and cytogenetic bands (Fig. 5d, Extended Data 5d).
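Moving from 200-bp bins to 2-Mb segments amounts to summarizing 10,000 consecutive bins per window. The sketch below computes the per-window state density for a single epigenome; averaging these densities across epigenomes and clustering the windows, as the text describes, is not shown. The function name `window_state_density` is hypothetical, and the toy call uses a window of 4 bins purely to keep the example readable.

```python
from collections import Counter

BIN_BP = 200
WINDOW_BP = 2_000_000
BINS_PER_WINDOW = WINDOW_BP // BIN_BP  # 10,000 bins of 200 bp per 2-Mb window


def window_state_density(states, bins_per_window=BINS_PER_WINDOW):
    """states: per-bin state labels along one chromosome in one epigenome.

    Returns one dict per window mapping each state to the fraction of bins it covers.
    The last window may be shorter than the others.
    """
    densities = []
    for start in range(0, len(states), bins_per_window):
        chunk = states[start:start + bins_per_window]
        counts = Counter(chunk)
        densities.append({s: c / len(chunk) for s, c in counts.items()})
    return densities


toy_states = ["Enh", "Enh", "Quies", "Quies", "Quies", "Quies", "Quies", "Het"]
toy_density = window_state_density(toy_states, bins_per_window=4)
print(toy_density)
```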
Modular chromatin state dynamics of high-resolution chromatin-accessible regulatory elements
Integrative analysis of 111 reference human epigenomes
Roadmap Epigenomics Consortium et al. Nature 10.1038/nature14248
We next exploited the dynamics of epigenomic modifications at cis-regulatory elements to gain insights into gene regulation. We focused on 2.3M regions (12.6% of the genome) showing DNA accessibility in any reference epigenome and regulatory (promoter or enhancer) chromatin states, considering enhancer-only, promoter-only, or enhancer-promoter alternating states separately (Fig. S11). We clustered enhancer-only elements (Enh, EnhBiv, EnhG) into 226 enhancer modules of coordinated activity (Fig. 7a), promoter-only elements into 82 promoter modules (Fig. S11a) and promoter/enhancer 'dyadic' elements into 129 modules (Fig. S11b), enabling us to distinguish ubiquitously-active, lineage-restricted, and tissue-specific modules for each group. Focusing on the enhancer-only clusters, we found that the neighboring genes of enhancers in the same module showed significant enrichment for common functions65 (Fig. 7b, Fig. S11c,d), common genotype-phenotype associations65 (Fig. 7c), and common expression in their mouse orthologs (Fig. S12), each annotation type showing strong consistency with the known biology of the corresponding tissues. For example, stem-cell enhancers are enriched near developmental patterning genes, immune cell enhancers near immune response genes, and brain enhancers near learning and memory genes (Fig. 7b). Sub-clustering of individual modules continued to reveal distinct enrichment patterns of individual sub-modules (Fig. S11e), suggesting increased diversity of regulatory processes beyond the 226 modules used here.
Alzheimer's-disease-associated regulatory regions help interpret non-coding genetic variation
Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer's disease
Gjoneska, E. et al. Nature 10.1038/nature14252
We next utilized the epigenomic annotations of increased-activity enhancer orthologs to gain insights into AD-associated loci (Supplementary Table S7). Among the 20 genome-wide significant AD-associated loci4, 11 contain no protein-altering SNPs in linkage disequilibrium (LD), indicating they may play non-coding roles. Of these, 5 localize within increased-level enhancer orthologs, including two well-established GWAS loci (PICALM, BIN1), and three loci (INPP5D, CELF1/SPI1, PTK2B) only recently recognized as significant by combining all AD cohorts.
For INPP5D (Fig. 3a), a known regulator of inflammation28, the most significant variants localize within an increased-level enhancer ortholog, which also shows CD14+ enhancer activity. In the CELF1 locus (Fig. 3b) a large region of association spans several genes, but the strongest genetic signal (p=2×10−6) localizes upstream of SPI1 (PU.1), and specifically within an increased-level enhancer ortholog that is also active in immune cells. We confirmed that the AD-associated C-T substitution, rs1377416, in the SPI1 enhancer leads to increased in vivo enhancer activity in murine BV-2 microglia cells using a luciferase reporter assay (Fig. 3d). In addition, the AD-associated SNP rs55876153 near SPI1, which overlaps an increased-level mouse enhancer ortholog, is in strong linkage disequilibrium (LD=0.89, see Methods) with a known SPI1 eQTL, rs1083869825, even though it did not significantly alter enhancer activity in the luciferase assay.
Outside known GWAS loci, an additional 22 weakly-associated regions (3.9 fold, p<4.9×10−7) contain variants within increased-level enhancer orthologs (Supplementary Table S7), of which 17 lack protein-altering variants in LD (R2<0.4), providing strong candidates for directed experiments. One such example includes ABCA1 (p=6.9×10−5, Fig. 3c), a paralog of AD-associated ABCA7 and encoding a glial-expressed transporter that influences APOE metabolism in the central nervous system29. The region lacks protein-altering variants and all five SNPs in the cluster of association lie specifically within an increased-enhancer ortholog, which is also active in CD14+ immune cells and, to a lesser extent, in human hippocampus and fetal brain.
Causal variants map to discretely regulated elements within super-enhancers
Genetic and epigenetic fine mapping of causal autoimmune disease variants
Farh, K. K.-H. et al. Nature 10.1038/nature13835
Genomic loci that encode cellular identity genes frequently contain large regions with clustered or contiguous enhancers bound by transcriptional co-activators and marked by H3K27ac. Recent studies showed that such 'super-enhancer' regions are enriched for GWAS catalogue SNPs, including those related to autoimmunity18,19. Consistently, we find that PICS SNPs are 7.5-fold enriched in CD4+ T-cell super-enhancers, relative to random SNPs from the genome. We therefore parsed the topography of super-enhancers in immune cells using our genetic and epigenetic data.
The IL2RA locus exemplifies the complex landscape of enhancer regulation. IL2RA encodes a receptor with key roles in T-cell stimulation and Treg function15. The super-enhancer in this locus comprises a cluster of elements recognizable as distinct H3K27ac peaks (Fig. 4a). Although the region meets the super-enhancer definition in multiple CD4+ T-cell types18, sub-elements are preferentially acetylated in Treg, TH17 and/or THStim T-cells, consistent with differential regulation. Some sub-elements appear bound by T-cell master regulators, including FOXP3 in Tregs, T-BET (also known as TBX21) in TH1 cells, and GATA3 in TH2 cells. A systematic analysis indicates PICS SNPs are most enriched at distinct stimulus-dependent H3K27ac peaks within super-enhancer regions (Extended Data Fig. 7).
PICS SNPs for eight autoimmune diseases map to distinct segments of the IL2RA super-enhancer. For example, Immunochip data identify a candidate causal SNP for multiple sclerosis that has no effect on autoimmune thyroiditis disease risk. Conversely, a candidate causal SNP for autoimmune thyroiditis has no effect on multiple sclerosis risk, despite the proximity of the two SNPs within the super-enhancer (Fig. 4b). Furthermore, index SNPs for multiple other diseases are not in LD, suggesting that multiple sites of nucleotide variation in the locus have separable disease associations (Fig. 4c). The distribution of PICS SNPs and the partially discordant regulation of sub-regions suggest that super-enhancers may comprise multiple discrete units with distinct regulatory signals, functions and phenotypic associations.
Integrative analyses of epigenomic profiles across 28 human tissue types
Integrative analysis of haplotype-resolved epigenomes across human tissues
Leung, D. et al. Nature 10.1038/nature14217
We performed ChIP-seq experiments to generate extensive datasets profiling 6 histone modifications across 16 human tissue-types from four individual donors (181 datasets). Combining with previously published datasets, we conducted in-depth analyses across 28 cell/tissue-types, covering a wide spectrum of developmental states, including embryonic stem cells, early embryonic lineages and somatic primary tissue-types representing all three germ layers (Fig. 1a).
[...]
We systematically identified cis-regulatory elements by employing a random-forest based algorithm (RFECS), predicting a total of 292,495 enhancers (including 175,912 strong enhancers with high H3K27ac enrichment) across representative samples of all tissue types (Supplementary Table 1). We additionally identified 24,462 highly active promoters with strong H3K4me3 enrichment (see Supplementary Information) (Supplementary Table 2). Subsequently, we defined tissue-restricted promoters (n=10,396) and enhancers (n=115,222) (Extended Data Fig. 1a).
[...]
Intriguingly, 15.2% (n=3,717) of strong promoters were also predicted as enhancers in other tissues, analogous to observations in mice, where intragenic enhancers act as promoters to produce cell-type specific transcripts19. These sites possessed histone modification signatures of active enhancers in some tissue/cell-types but were enriched with active promoter marks in others. We termed these sequences cis-Regulatory Elements with Dynamic Signatures (cREDS). For example, cREDS enhancers showed enrichment of H3K27ac and H3K4me1 and a striking depletion of H3K4me3 in lung (Fig. 1b and c, Supplementary Table 3). However, the signature shifted to that of active promoters in other tissues (Fig. 1b and c). cREDS are also found in other cell/tissue-types (Extended Data Fig. 4a).
[...]
We defined genes with allelically biased expression by mapping the RNA-seq reads in each tissue sample to the two haploid genomes of the donor. We observed extensive allelically biased gene expression, ranging from 4% to 13% of all informative genes (>10 allelic read counts) in each tissue sample (FDR=5%, Extended Data Fig. 7a-b). Comparatively, the proportion of allelically biased genes in individual tissue donors ranged from 6% to 23% of all informative genes, giving a combined total of 2,570 allelically biased genes (Fig. 2b, Supplementary Table 7).
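A call of allelic bias from haplotype-resolved read counts can be sketched as below. The excerpt states only the informativeness filter (>10 allelic reads) and the FDR threshold (5%); the use of an exact two-sided binomial test against a 50:50 expectation and of Benjamini-Hochberg correction is an assumption made here for illustration, and the function names and gene labels are hypothetical.

```python
from math import comb


def binom_two_sided_p(k, n):
    """Exact two-sided binomial p-value for k of n reads on allele 1, under p = 0.5."""
    probs = [comb(n, i) / 2**n for i in range(n + 1)]
    observed = probs[k]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


def benjamini_hochberg(pvals, alpha=0.05):
    """Return booleans marking which p-values are significant at FDR `alpha`."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            cutoff_rank = rank
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[i] = True
    return significant


def allelic_bias_calls(counts, min_reads=10, alpha=0.05):
    """counts: {gene: (allele1_reads, allele2_reads)}; keeps genes with > min_reads reads."""
    informative = {g: c for g, c in counts.items() if sum(c) > min_reads}
    genes = list(informative)
    pvals = [binom_two_sided_p(informative[g][0], sum(informative[g])) for g in genes]
    return dict(zip(genes, benjamini_hochberg(pvals, alpha)))


toy_counts = {"geneA": (30, 2), "geneB": (12, 10), "geneC": (3, 2)}
calls = allelic_bias_calls(toy_counts)
print(calls)
```

In the toy input, `geneC` is dropped as uninformative, `geneA` is called biased and `geneB` is not.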
[...]
As natural genetic variations can affect enhancer selection and function in mammalian cells, we hypothesized that polymorphisms at cis-regulatory sequences underlie the widespread allelic transcriptional biases.
[...]
We generated additional H3K27ac ChIP-seq datasets with deeper coverage and longer sequencing reads (for better delineation of alleles) for 14 of the previously analyzed tissue samples and an additional 6 samples from independent donors (Supplementary Table 7). Of the informative enhancers (with >10 polymorphism-bearing sequence reads), 11.6% (n=11,714, FDR=1%) showed significant allelically biased H3K27ac enrichment in any tissue type (Fig. 3c, and Supplementary Table 8). Interestingly, identical genotypes often yielded the same direction of biases in allelic enhancer activities (Fig. 3d).
Tissue- and cell-type-specific long non-coding RNA
Epigenetic and transcriptional determinants of the human breast
Gascard, P et al. Nature Communications 10.1038/ncomms7351
Noncoding RNAs are key regulators of diverse cellular processes16 that can interact directly with the epigenetic machinery and may be prognostic in breast cancer17. We identified 936 unique miRNAs expressed at similar distributions across the 5 mammary derived cell types, including a core set of 29 which were highly expressed (>1000 RPM) across myoepithelial, luminal epithelial and stem-like cell types (Supplementary Figure 12b and Supplementary Table 5). Hierarchical clustering demonstrated expected cell type relationships (Supplementary Figure 12c) and cell type-specific miRNAs were identified with a majority being expressed in vHMECs (Figure 2d). We also identified 1,870 expressed lincRNAs (Supplementary Figure 13 and Supplementary Table 6) and 82 cell type-specific lincRNAs across the mammary cell types with myoepithelial cells showing the smallest number of cell type-specific events (Figure 2e and Supplementary Table 7). Restricting our comparison to myoepithelial and luminal cells, we identified 206 DE non-coding RNAs, including 130 lincRNAs and 76 antisense transcripts. Among the differentially expressed lincRNAs, MALAT (NEAT2), a critical regulator of metastasis in epithelial cancers18, was overexpressed in normal luminal cells suggesting that its expression is not solely restricted to metastatic potential in epithelial lineages. An imprinted region of 14q32.3, that encodes maternally expressed noncoding MEG3 and MEG8 transcripts and 54 miRNAs expressed from the maternally inherited homolog, was transcriptionally silenced in luminal cells (Supplementary Figure 14). Loss of expression of the MEG3 cluster through LOH and promoter hypermethylation is frequent in epithelial cancers19. Our results suggest that MEG3 transcriptional repression is associated with normal epithelial differentiation and provide a novel intergenic differentially methylated region that may be responsible for its cell type-specific regulation (Supplementary Figure 14).
Epigenomic footprints across 111 reference epigenomes reveal tissue-specific epigenetic regulation of lincRNAs
Amin, V et al. Nature Communications 10.1038/ncomms7370
Long noncoding RNAs (lncRNAs) are implicated in an increasing number of cellular processes including mammalian cellular differentiation1. Their role in repressing lineage-specific genes during early development was demonstrated by knockdown experiments in mouse embryonic stem cells2. Lineage-specific role of specific lncRNAs has now been established in cardiac3,4, epidermal5, neuronal6, mammary gland development7,8, and in T-cells9. Striking tissue specific transcription of lncRNAs10,11 is consistent with their role in developmental regulation and presents a possible inroad into understanding their biology. The intergenic lncRNAs (lincRNAs) are a major class of lncRNAs that are particularly convenient to study computationally and experimentally because of their lack of overlap with protein coding genes. Despite their relative accessibility, lincRNAs are experimentally less tractable than protein coding genes because of the lack of information about their potential function and associated phenotypes. We here address this knowledge gap by determining their tissue-specific epigenetic regulation, thus complementing the current knowledge about their tissue-specific transcription.
By analyzing 111 reference epigenomes from the NIH Roadmap Epigenomics project, we report that at least 3,753 (69% examined) lincRNAs show exquisitely tissue-specific epigenomic footprints and strongly associate with cell- and tissue-specific pathways, suggesting developmental or tissue-specific function for this newly discovered class of genes.
Skin cell-type-specific differentially DNA methylated regions
Regulatory network decoded from epigenomes of surface ectoderm-derived cell types
Lowdon, R. F. et al. Nature Communications 10.1038/ncomms6442
We identified 12,892 500 bp regions encompassing 193,202 CpGs with a DNA methylation status unique to one of the three most common skin cell types (fibroblasts, melanocytes, and keratinocytes) (Fig. 2a). The majority of these skin cell type-specific differentially DNA methylated regions (DMRs) were hypomethylated (Fig. 2a), suggesting potential cell type-specific regulatory activity at these regions4, 12, 13. Forty to 46% of the DMRs were intergenic and 5–9% were associated with RefSeq-annotated gene promoters (Supplementary Fig. 5); non-CpG island promoters were enriched among cell type-specific DMRs (Supplementary Note 4 and Supplementary Table 2). Eighty to 91% of hypomethylated cell type-specific DMRs overlapped with regulatory element-associated histone modifications in the same cell type (Fig. 2b). Accordingly, hypomethylation of cell type-specific DMRs at gene promoters correlated with increased gene expression relative to the other two cell types where the DMR was hypermethylated (Fig. 2c). Gene Ontology (GO) analysis using the GREAT (ref. 14) tool on hypomethylated cell type-specific DMRs showed strong enrichment for biological processes relevant to each cell type (for example, 'extracellular matrix organization' for fibroblasts (P−value=9.05E−45) and 'pigmentation' for melanocytes (P−value=2.43E−06); Fig. 2d). These data suggest skin cell type-specific DMRs occur primarily at distal enhancers and regulate genes relevant to each cell type.
Intermediate DNA methylation is a conserved signature of genome regulation
Elliott, G et al. Nature Communications 10.1038/ncomms7363
The bimodal pattern of DNA methylation implies a binary control over gene expression, yet a significant number of loci throughout the genome have an intermediate level of DNA methylation. To comprehensively identify regions of intermediate methylation (IM) and their quantitative relationship with gene activity, integrative and comparative analysis was applied to 25 human cell and tissue epigenomes. These analyses identified 18,452 IM regions located near 36% of genes. CpGs in IM regions had a mean methylation of 57% using whole-genome bisulfite sequencing. IM regions were enriched at enhancers and exons and exhibit a quantitative relationship with enhancer signals and exon inclusion, respectively (Figure 2c,d,e). These associations were equally strong in tissue, unsorted peripheral blood and 6 highly purified cell types. Significant interspecies conservation of IM status at orthologous loci, and conservation among different individuals, further suggests an important function, and potentially a shared mechanism for their establishment and maintenance. The data is consistent with the hypothesis that IM is a distinct epigenetic signature of evolutionarily conserved, gene context-dependent function.
Meta-epigenomic structure of purified human stem cells
The meta-epigenomic structure of purified human stem cell populations is defined at cis-regulatory sequences
Wijetunga, N et al. Nature Communications 10.1038/ncomms6195
To determine whether epigenetic variability was occurring at regulatory sites with possible functional consequences, we took advantage of public chromatin mapping data for CD34+ HSPCs generated by the Roadmap Epigenomics programme (Supplementary Table 5). The DNase hypersensitivity and ChIP-seq data create combinatorial patterns that have previously been exploited to define functional elements in the genome. We processed the Roadmap data using an adaptation of an imaging signal processing algorithm, to define the locations of chromatin constituents with minimal data transformation (Supplementary Fig. 4). These chromatin constituent locations were then used to generate a self-organizing map (SOM), and to map candidate regulatory elements using the Segway algorithm (Supplementary Fig. 5). The individual Segway features were then overlaid as contour plots onto the SOM, which clusters in two-dimensional space loci with similar genomic characteristics, allowing intuitive visualization of the major contributors to each feature (Fig. 2a and Supplementary Fig. 6). Of the multiple chromatin states for which each feature is enriched, feature 6 has the H3K4me3 enrichment, indicating promoter function, features 4 and 5 both have marks indicative of enhancer function (H3K4me1 and H3K27ac, respectively), features 1–3 have the H3K36me3 enrichment typical of transcribed sequences, while feature 0 is enriched for heterochromatic marks (H3K9me3 and H3K27me3).
We also created a metaplot of these new annotations relative to all RefSeq genes in the genome (Supplementary Fig. 7), showing that Segway feature 6 is strikingly enriched at transcription start sites (TSSs), flanked by enrichment for feature 4 and, to a lesser degree, feature 5 (Fig. 2b). Features 1–3 are enriched in gene bodies and feature 0 at intergenic sequences. Statistical testing of the enrichment of features 4 and 6 in their windows of peak frequencies compared with their distributions over all RefSeq genes and flanking regions showed significance (P<0.001 for each). CpG islands and their immediate flanking sequences have previously been related to 'stochastic' DNA methylation variability8 and gene expression regulation31. The Segway annotations demonstrate that although the bodies of CpG islands are enriched for the candidate promoter (feature 6) sequences, the ±2 kb flanking region, generally described as its 'shore', is strikingly enriched for feature 4 (Fig. 2c). Both achieve statistical significance (P<0.001) when compared with their distributions over all CpG islands (feature 6) or flanking regions (feature 4). Finally, stratifying the RefSeq genes by expression quartile in CD34+ HSPCs reveals the transcriptional dependencies of the Segway annotations (Fig. 3). We conclude that the Segway annotations define candidate promoters (feature 6), enhancers (features 4 and 5), transcribed regions (features 1–3) and repressed chromatin (feature 0) for CD34+ HSPCs.
Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues
Ernst, J. & Kellis, M. Nature Biotechnology 10.1038/nbt.3157
We applied ChromImpute to a compendium of 127 reference epigenomes, including 111 profiled by the NIH Roadmap Epigenomics project10 and 16 profiled by the ENCODE project2,3 (Fig. 1a). These span diverse tissues and cell types, including Embryonic Stem Cells (ESCs), induced Pluripotent Stem Cells (iPSC), ESC-derived cells, blood and immune cells, skin, brain, adipose, muscle, heart, smooth muscle, digestive, liver, lung and others.
Only 5 'core' histone modification marks were experimentally profiled in all 127 reference epigenomes. These are promoter-associated H3K4me3, enhancer-associated H3K4me1, Polycomb repression-associated H3K27me3, transcription-associated H3K36me3 and heterochromatin-associated H3K9me3. Varying subsets of 34 marks were profiled in different epigenomes, including 30 histone modifications (11 histone methylation marks, 18 histone acetylation marks, and H3T11ph), histone variant H2A.Z, DNA accessibility, DNA methylation data, and RNA-seq data.
Based on these experimentally-profiled ('observed') datasets, we imputed the 31 marks observed in at least two epigenomes in all 127 epigenomes, and the three marks mapped in only one epigenome in the remaining 126 epigenomes. In total we generated 4,315 datasets based on imputation, of which only 1,122 (26%) were also experimentally mapped and 3,193 (74%) are only available as imputed data. Signal tracks for all marks were imputed at 25 base pair resolution (121 million predictions per track) except for DNA methylation, which was imputed at single-nucleotide resolution for each of 28 million CpGs. Across all marks, samples, and positions, we generated a total of 526 billion predicted signal values.
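The dataset counts in the paragraph above follow from simple arithmetic, which the short check below reproduces. All numbers are taken from the text; the variable names are illustrative only.

```python
# Counts stated in the text.
N_EPIGENOMES = 127
MARKS_IN_TWO_OR_MORE = 31  # imputed in all 127 epigenomes
MARKS_IN_ONE = 3           # imputed only in the remaining 126 epigenomes

imputed_total = MARKS_IN_TWO_OR_MORE * N_EPIGENOMES + MARKS_IN_ONE * (N_EPIGENOMES - 1)

also_observed = 1122
imputed_only = imputed_total - also_observed
pct_observed = round(100 * also_observed / imputed_total)
pct_imputed_only = round(100 * imputed_only / imputed_total)

print(imputed_total, imputed_only, pct_observed, pct_imputed_only)
# prints: 4315 3193 26 74
```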
We first learned a 25-state model jointly3 across all 127 samples (Fig. 6b,c) using all Tier-1 and 2 marks. This captured multiple types of promoter, enhancer, open chromatin, transcribed, and repressed states and shows specific DNA methylation and RNA-seq enrichments (Fig. 6b,c, S33). Compared to the 15-state chromatin state model based on observed data in the 127 samples (Fig. S33), the 12-mark model better distinguished active vs. poised enhancer states (using H3K27ac and H3K9ac), and captured novel states (e.g. state 19_DNase showing DNA accessibility but lacking enhancer/promoter marks and state 5_Tx5′ associated with 5′ ends of transcripts and based on H3K79me2). Benefiting from the increased stability and robustness of imputed data, imputation-based chromatin states showed more consistent genome coverage across tissue/samples (Fig. S34), better agreement with annotated gene bodies and transcription start sites, both for all transcripts (Fig. S35a,b) and for the set of transcripts expressed in a given tissue (Fig. S35c,d), and better discrimination of evolutionarily-conserved elements (Fig. S36)38. Additionally we saw better recovery of samples that were not included in any of our training data (e.g. an osteoblast DNA accessibility dataset39, Fig. S37), while capturing major cell type specific differences in chromatin states (e.g. ESC/iPSC cell types showing consistently more abundant bivalent promoter states40, Fig. S38), with cell type specific differences even more pronounced than for chromatin states based on observed data (Fig. S38).
Cite this article
1. Annotation of the non-coding genome. Nature (2015). https://doi.org/10.1038/nature14309