Reference epigenome mapping across tissues and cell types

Integrative analysis of 111 reference human epigenomes

Roadmap Epigenomics Consortium et al. Nature 10.1038/nature14248

We jointly processed and analyzed our 111 reference epigenomes with 16 additional epigenomes from ENCODE9,23. We generated genome-wide normalized coverage tracks, peaks and broad enriched domains for ChIP-seq and DNase-seq7,32, normalized gene expression values for RNA-seq33, and fractional methylation levels for each CpG site31,34,35.

The resulting datasets provide global views of the epigenomic landscape in a wide range of human cell and tissue types (Fig. 3), including: the largest and most diverse collection to date of chromatin state annotations (Fig. 3a); some of the deepest surveys of individual cell types using diverse epigenomic assays (with 21–31 distinct epigenomic marks for seven deeply-profiled epigenomes, Fig. 3b); and some of the broadest surveys of individual epigenomic marks across multiple cell types (Fig. 3c). These datasets enable genome-wide epigenomic analyses across multiple dimensions (Fig. 3d). All datasets, standards and protocols are publicly available from web portals, linked from the main consortium homepage (http://www.roadmapepigenomics.org), including the supplementary website for this paper (http://compbio.mit.edu/roadmap).

Integrative chromatin state annotations across cell types and tissues

Integrative analysis of 111 reference human epigenomes

Roadmap Epigenomics Consortium et al. Nature 10.1038/nature14248

As a foundation for integrative analysis, we learned a common set of combinatorial chromatin states40 across all 111 epigenomes, plus 16 additional epigenomes generated by the ENCODE project (127 epigenomes in total), using the core set of five histone modification marks that were common to all. We learned a 15-state model (Fig. 4a,b, Table S3a) consisting of 8 active states and 7 repressed states (Fig. 4c) that were recurrently recovered (Extended Data 2a), and showed distinct levels of DNA methylation (Fig. 4d), DNA accessibility (Fig. 4e), regulator binding (Extended Data 2b, Fig. S2), and evolutionary conservation (Fig. 4f, Fig. S3). The active states (associated with expressed genes) consist of active transcription start site (TSS)-proximal promoter states (TssA, TssAFlnk), a transcribed state at the 5′ and 3′ end of genes showing both promoter and enhancer signatures (TxFlnk), actively-transcribed states (Tx, TxWk), enhancer states (Enh, EnhG), and a state associated with zinc finger protein genes (ZNF/Rpts). The inactive states consist of constitutive heterochromatin (Het), bivalent regulatory states (TssBiv, BivFlnk, EnhBiv), repressed Polycomb states (ReprPC, ReprPCWk), and a quiescent state (Quies) which covers on average 68% of each reference epigenome. Enhancer and promoter states cover approximately 5% of each reference epigenome on average, and show enrichment for evolutionarily-conserved non-exonic regions41.
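As a rough illustration of how per-state genome coverage (such as the average figure quoted for Quies) could be tallied from a chromatin state segmentation, the sketch below sums segment lengths per state. The tuple layout, the function name `state_coverage` and the toy coordinates are assumptions made for illustration; they are not consortium data or code.

```python
from collections import defaultdict

def state_coverage(segments):
    """Return the fraction of annotated bases assigned to each state.

    segments: iterable of (chrom, start, end, state) tuples with
    half-open coordinates, in the style of a BED segmentation file
    (hypothetical toy input).
    """
    bases = defaultdict(int)
    for _chrom, start, end, state in segments:
        bases[state] += end - start
    total = sum(bases.values())
    return {state: n / total for state, n in bases.items()}

# Toy segmentation of a 4,000-bp region (made-up values).
toy_segments = [
    ("chr1", 0, 200, "TssA"),
    ("chr1", 200, 1000, "Tx"),
    ("chr1", 1000, 4000, "Quies"),
]
coverage = state_coverage(toy_segments)
```

Averaging such per-epigenome fractions across all epigenomes would give a per-state mean coverage of the kind reported in the text.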

To capture the greater complexity afforded by additional marks, we learned additional chromatin state models in subsets of cell types. In the subset of 98 reference epigenomes that also included H3K27ac data, we also learned an 18-state model (Extended Data 2c, Table S3b), enabling us to distinguish enhancer states containing strong H3K27ac signal (EnhA1, EnhA2), which showed higher DNA accessibility (Extended Data 3a), lower methylation (Extended Data 3b), and higher TF binding (Extended Data 2c) than enhancers lacking H3K27ac.

In a subset of 7 epigenomes with an average of 24 epigenomic marks, we learned separate 50-state chromatin state models based on all the available histone marks and DNA accessibility in each epigenome (Fig. S4), which additionally distinguished: a DNase-state with distinct TF binding enrichments (Fig. S4f), including for mediator/cohesin components42 (even though CTCF was not included as an input track to learn the model) and repressor NRSF; transcribed states showing H3K79me1 and H3K79me2 and associated with the 5′ ends of genes and introns; and a large number of putative regulatory and neighboring regions showing diverse acetylation marks even in absence of the H3K4 methylation signatures characteristic of enhancer and promoter regions.

Dynamics of chromatin states and chromosomal domains across cell types and tissues

Integrative analysis of 111 reference human epigenomes

Roadmap Epigenomics Consortium et al. Nature 10.1038/nature14248

We next sought to characterize the overall variability of each chromatin state across the full range of cell and tissue types. We first evaluated the observed consistency of each chromatin state at any given genomic position across all 127 epigenomes (Fig. 5a). We found that H3K4me1-associated states (including TxFlnk, EnhG, EnhBiv, Enh) are the most tissue-specific, with 90% of instances present in at most 5–10 epigenomes, followed by bivalent promoters (TssBiv), and repressed states (ReprPC, Het). In contrast, active promoters (TssA) and transcribed states (Tx, TxWk) were highly constitutive, with 90% of regions marked in as many as 60–75 epigenomes, and quiescent regions (Quies) were the most constitutive, with 90% of Quies regions consistently marked as Quies in most of the 127 epigenomes. These results held in the 18-state chromatin state model (Extended Data 5a), and in the subset of highest-quality epigenomes (Fig. S6a,b).
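One way to picture the consistency measure described above is to count, for every genomic bin, how many epigenomes assign it a given state, and then summarize those counts per state. The following is a minimal sketch under that assumption; the data layout, the function names and the nearest-rank percentile are illustrative choices, not the consortium's procedure.

```python
from collections import Counter, defaultdict
import math

def state_recurrence(calls):
    """Count, per state, how many epigenomes share that state at each bin.

    calls: list of per-epigenome state lists, all the same length,
    one entry per genomic bin (hypothetical toy layout).
    Returns {state: sorted list of counts}, with one count for every
    bin in which the state occurs at least once.
    """
    recurrence = defaultdict(list)
    for b in range(len(calls[0])):
        counts = Counter(epigenome[b] for epigenome in calls)
        for state, n in counts.items():
            recurrence[state].append(n)
    return {state: sorted(values) for state, values in recurrence.items()}

def percentile_90(sorted_values):
    """Nearest-rank 90th percentile of an ascending list."""
    rank = math.ceil(0.9 * len(sorted_values))
    return sorted_values[rank - 1]

# Four toy epigenomes, three bins each (made-up calls).
toy_calls = [
    ["TssA", "Enh", "Quies"],
    ["TssA", "Quies", "Quies"],
    ["TssA", "Quies", "Quies"],
    ["TssA", "Quies", "Enh"],
]
recurrence = state_recurrence(toy_calls)
```

In this toy input the promoter state recurs in all four epigenomes while each enhancer call is unique to one, which mirrors the constitutive versus tissue-specific contrast in miniature.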

Adjusting for the overall coverage and variability of each state, we then studied differences in the relative fraction of the genome annotated to each chromatin state between cell types (Fig. 5b, Extended Data 5b, S6c-e). Hematopoietic stem cells and immune cells show a consistent and previously unrecognized depletion of active and bivalent promoters (TssA, TssBiv) and weakly transcribed states (TxWk), which may be related to their capacity to generate sub-lineages and enter quiescence (reversible G0 phase). ESCs and iPSCs show enrichment of TssBiv, consistent with previous studies57, and a depletion of ReprPCWk (defined by weak H3K27me3), possibly due to restriction of H3K27me3-establishing Polycomb proteins to promoter regions. Surprisingly, IMR90 fetal lung fibroblasts, which were previously used as a somatic reference cell type58, are in fact a strong outlier in multiple ways, showing higher levels of Het, ReprPC and EnhG, and a depletion of Quies chromatin states.

We next studied the relative frequency with which different chromatin states switch to other states across different tissues and cell types (Fig. 5c), relative to switching across samples of the same tissue or cell type (Fig. S7a,b). This revealed a relative switching enrichment between active states and repressed states, consistent with activation and repression of regulatory regions. The only exception was significant switching between transcribed states and active promoter and enhancer states, possibly due to alternative usage of promoters22 and enhancers59 embedded within transcribed elements. These chromatin state switching properties were also found in the 18-state model incorporating H3K27ac marks (Extended Data 5c) and in the subset of 16 ENCODE reference epigenomes using both models (Fig. S7c,d). We found that enhancers and promoters maintained their identity, except for a small subset of regions switching between enhancer signatures and promoter signatures60. Luciferase assays showed that these regions indeed possess both enhancer and promoter activity60, consistent with their epigenomic marks.
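The comparison of switching across tissues against switching within a tissue can be sketched as a ratio of two frequencies. The code below is only an illustration under stated assumptions: sample names, the `tissue_of` mapping, the unordered treatment of state pairs and the small pseudocount are all hypothetical, and the published analysis may normalize differently.

```python
from collections import Counter
from itertools import combinations

def switch_frequency(calls, pairs):
    """Frequency of each unordered state pair among all compared bins."""
    counts = Counter()
    compared = 0
    for first, second in pairs:
        for a, b in zip(calls[first], calls[second]):
            compared += 1
            if a != b:
                counts[frozenset((a, b))] += 1
    return {pair: n / compared for pair, n in counts.items()} if compared else {}

def switch_enrichment(calls, tissue_of, pseudo=1e-6):
    """Between-tissue switching frequency divided by within-tissue frequency.

    The pseudocount avoids division by zero for pairs never seen
    within a tissue (an illustrative choice).
    """
    samples = sorted(calls)
    between = [(a, b) for a, b in combinations(samples, 2) if tissue_of[a] != tissue_of[b]]
    within = [(a, b) for a, b in combinations(samples, 2) if tissue_of[a] == tissue_of[b]]
    f_between = switch_frequency(calls, between)
    f_within = switch_frequency(calls, within)
    pairs = set(f_between) | set(f_within)
    return {p: (f_between.get(p, 0.0) + pseudo) / (f_within.get(p, 0.0) + pseudo) for p in pairs}

# Made-up calls for four samples over four bins.
toy_calls = {
    "brain_1": ["Enh", "Quies", "TssA", "Tx"],
    "brain_2": ["Enh", "Quies", "TssA", "TxWk"],
    "liver_1": ["Quies", "Enh", "TssA", "Tx"],
    "liver_2": ["Quies", "Enh", "TssA", "Tx"],
}
toy_tissue = {"brain_1": "brain", "brain_2": "brain", "liver_1": "liver", "liver_2": "liver"}
enrichment = switch_enrichment(toy_calls, toy_tissue)
```

In the toy data, Enh/Quies switching occurs only between tissues and so scores far above 1, while Tx/TxWk switching is as common within a tissue as between tissues and scores 1.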

While our chromatin state analysis focused on nucleosome resolution (200 bp), we also studied the overall co-occurrence of chromatin states across tissues at a larger 2Mb resolution to recognize higher-order properties (Fig. 5d). This analysis revealed that 2Mb segments rich in active enhancers are constrained to approximately 40% of the genome (clusters c1-c6), with the remainder marked predominantly by inactive regions (c7-c11), consistent with the identification of two large chromatin conformation compartments12,61. However, both compartments can be further subdivided by their chromatin state composition: inactive regions separate into predominantly quiescent (40%; c9, c11), heterochromatic (10%, c10), or bivalent (10%, c7-c8) marked regions; and active regions separate into regions rich in multiple marks (c3 and c6, showing a large diversity of active, ReprPC, and bivalent states), weakly-transcribed regions (c5, showing primarily Enh and TxWk states), and regions of intermediate activity (c1, c2, c4). As these subdivisions are based on average state density across a large diversity of cell types, we expected them to be stable chromosomal features, and indeed, they showed strong differences in gene density, CpG island occupancy, lamina association62,63 and cytogenetic bands (Fig. 5d, Extended Data 5d).
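Moving from 200-bp calls to 2-Mb segments amounts to grouping 10,000 consecutive bins and recording the share of each state in the group. The sketch below covers only that aggregation step for a single epigenome, with a hypothetical function name and toy input; averaging across epigenomes and clustering the resulting windows are left out.

```python
from collections import Counter

BIN_SIZE = 200                           # bp per chromatin state call
WINDOW_SIZE = 2_000_000                  # bp per higher-order window
BINS_PER_WINDOW = WINDOW_SIZE // BIN_SIZE

def window_state_density(calls, bins_per_window=BINS_PER_WINDOW):
    """Return one {state: fraction} dict per window of consecutive bins.

    calls: list of state labels, one per bin along a chromosome
    (hypothetical toy layout). The last window may be shorter.
    """
    densities = []
    for start in range(0, len(calls), bins_per_window):
        chunk = calls[start:start + bins_per_window]
        counts = Counter(chunk)
        densities.append({state: n / len(chunk) for state, n in counts.items()})
    return densities

# Toy example with 4 bins per window instead of 10,000.
toy_calls = ["Enh", "Enh", "Tx", "Quies", "Quies", "Quies", "Quies", "Het"]
toy_density = window_state_density(toy_calls, bins_per_window=4)
```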

Modular chromatin state dynamics of high-resolution chromatin-accessible regulatory elements

Integrative analysis of 111 reference human epigenomes

Roadmap Epigenomics Consortium et al. Nature 10.1038/nature14248

We next exploited the dynamics of epigenomic modifications at cis-regulatory elements to gain insights into gene regulation. We focused on 2.3M regions (12.6% of the genome) showing DNA accessibility in any reference epigenome and regulatory (promoter or enhancer) chromatin states, considering enhancer-only, promoter-only, or enhancer-promoter alternating states separately (Fig. S11). We clustered enhancer-only elements (Enh, EnhBiv, EnhG) into 226 enhancer modules of coordinated activity (Fig. 7a), promoter-only elements into 82 promoter modules (Fig. S11a) and promoter/enhancer 'dyadic' elements into 129 modules (Fig. S11b), enabling us to distinguish ubiquitously-active, lineage-restricted, and tissue-specific modules for each group. Focusing on the enhancer-only clusters, we found that the neighboring genes of enhancers in the same module showed significant enrichment for common functions65 (Fig. 7b, Fig. S11c,d), common genotype-phenotype associations65 (Fig. 7c), and common expression in their mouse orthologs (Fig. S12), each annotation type showing strong consistency with the known biology of the corresponding tissues. For example, stem-cell enhancers are enriched near developmental patterning genes, immune cell enhancers near immune response genes, and brain enhancers near learning and memory genes (Fig. 7b). Sub-clustering of individual modules continued to reveal distinct enrichment patterns of individual sub-modules (Fig. S11e), suggesting increased diversity of regulatory processes beyond the 226 modules used here.

Alzheimer's-disease-associated regulatory regions help interpret non-coding genetic variation

Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer's disease

Gjoneska, E. et al. Nature 10.1038/nature14252

We next utilized the epigenomic annotations of increased-activity enhancer orthologs to gain insights into AD-associated loci (Supplementary Table S7). Among the 20 genome-wide significant AD-associated loci4, 11 contain no protein-altering SNPs in linkage disequilibrium (LD), indicating they may play non-coding roles. Of these, 5 localize within increased-level enhancer orthologs, including two well-established GWAS loci (PICALM, BIN1), and three loci (INPP5D, CELF1/SPI1, PTK2B) only recently recognized as significant by combining all AD cohorts.

For INPP5D (Fig. 3a), a known regulator of inflammation28, the most significant variants localize within an increased-level enhancer ortholog, which also shows CD14+ enhancer activity. In the CELF1 locus (Fig. 3b), a large region of association spans several genes, but the strongest genetic signal (p=2×10⁻⁶) localizes upstream of SPI1 (PU.1), and specifically within an increased-level enhancer ortholog that is also active in immune cells. We confirmed that the AD-associated C-T substitution, rs1377416, in the SPI1 enhancer leads to increased in vivo enhancer activity in murine BV-2 microglia cells using a luciferase reporter assay (Fig. 3d). In addition, the AD-associated SNP rs55876153 near SPI1, which overlaps an increased-level mouse enhancer ortholog, is in strong linkage disequilibrium (LD=0.89, see Methods) with a known SPI1 eQTL, rs1083869825, even though it did not significantly alter enhancer activity in the luciferase assay.

Outside known GWAS loci, an additional 22 weakly-associated regions (3.9 fold, p<4.9×10⁻⁷) contain variants within increased-level enhancer orthologs (Supplementary Table S7), of which 17 lack protein-altering variants in LD (R²<0.4), providing strong candidates for directed experiments. One such example is ABCA1 (p=6.9×10⁻⁵, Fig. 3c), a paralog of AD-associated ABCA7 that encodes a glial-expressed transporter influencing APOE metabolism in the central nervous system29. The region lacks protein-altering variants, and all five SNPs in the cluster of association lie specifically within an increased-level enhancer ortholog, which is also active in CD14+ immune cells and, to a lesser extent, in human hippocampus and fetal brain.

Causal variants map to discretely regulated elements within super-enhancers

Genetic and epigenetic fine mapping of causal autoimmune disease variants

Farh, K. K.-H. et al. Nature 10.1038/nature13835

Genomic loci that encode cellular identity genes frequently contain large regions with clustered or contiguous enhancers bound by transcriptional co-activators and marked by H3K27ac. Recent studies showed that such 'super-enhancer' regions are enriched for GWAS catalogue SNPs, including those related to autoimmunity18,19. Consistently, we find that PICS SNPs are 7.5-fold enriched in CD4+ T-cell super-enhancers, relative to random SNPs from the genome. We therefore parsed the topography of super-enhancers in immune cells using our genetic and epigenetic data.
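A fold enrichment of this kind compares the share of candidate SNPs that fall inside a set of regions with the share of background SNPs that do. The sketch below works on one chromosome with sorted, non-overlapping half-open intervals; the function names, positions and intervals are hypothetical, and the way the paper draws its random SNPs is not reproduced here.

```python
from bisect import bisect_right

def in_any_interval(pos, starts, ends):
    """True if pos lies in one of the sorted, non-overlapping [start, end) intervals."""
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ends[i]

def fold_enrichment(candidates, background, intervals):
    """Fraction of candidates inside the intervals divided by the background fraction.

    Raises ZeroDivisionError if no background SNP falls inside an interval.
    """
    intervals = sorted(intervals)
    starts = [start for start, _end in intervals]
    ends = [end for _start, end in intervals]

    def fraction_inside(positions):
        hits = sum(in_any_interval(p, starts, ends) for p in positions)
        return hits / len(positions)

    return fraction_inside(candidates) / fraction_inside(background)

# Made-up regions and SNP positions on a single chromosome.
toy_regions = [(100, 200), (500, 600)]
toy_candidates = [150, 550, 700, 300]                             # 2 of 4 inside
toy_background = [10, 150, 300, 400, 800, 900, 950, 1000, 50, 250]  # 1 of 10 inside
fold = fold_enrichment(toy_candidates, toy_background, toy_regions)
```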

The IL2RA locus exemplifies the complex landscape of enhancer regulation. IL2RA encodes a receptor with key roles in T-cell stimulation and Treg function15. The super-enhancer in this locus comprises a cluster of elements recognizable as distinct H3K27ac peaks (Fig. 4a). Although the region meets the super-enhancer definition in multiple CD4+ T-cell types18, sub-elements are preferentially acetylated in Treg, TH17 and/or THStim T-cells, consistent with differential regulation. Some sub-elements appear bound by T-cell master regulators, including FOXP3 in Tregs, T-BET (also known as TBX21) in TH1 cells, and GATA3 in TH2 cells. A systematic analysis indicates PICS SNPs are most enriched at distinct stimulus-dependent H3K27ac peaks within super-enhancer regions (Extended Data Fig. 7).

PICS SNPs for eight autoimmune diseases map to distinct segments of the IL2RA super-enhancer. For example, Immunochip data identify a candidate causal SNP for multiple sclerosis that has no effect on autoimmune thyroiditis disease risk. Conversely, a candidate causal SNP for autoimmune thyroiditis has no effect on multiple sclerosis risk, despite the proximity of the two SNPs within the super-enhancer (Fig. 4b). Furthermore, index SNPs for multiple other diseases are not in LD, suggesting that multiple sites of nucleotide variation in the locus have separable disease associations (Fig. 4c). The distribution of PICS SNPs and the partially discordant regulation of sub-regions suggest that super-enhancers may comprise multiple discrete units with distinct regulatory signals, functions and phenotypic associations.

Integrative analyses of epigenomic profiles across 28 human tissue types

Integrative analysis of haplotype-resolved epigenomes across human tissues

Leung, D. et al. Nature 10.1038/nature14217

We performed ChIP-seq experiments to generate extensive datasets profiling 6 histone modifications across 16 human tissue-types from four individual donors (181 datasets). Combining with previously published datasets, we conducted in-depth analyses across 28 cell/tissue-types, covering a wide spectrum of developmental states, including embryonic stem cells, early embryonic lineages and somatic primary tissue-types representing all three germ layers (Fig. 1a).

[...]

We systematically identified cis-regulatory elements by employing a random-forest based algorithm (RFECS), predicting a total of 292,495 enhancers (including 175,912 strong enhancers with high H3K27ac enrichment) across representative samples of all tissue types (Supplementary table 1). We additionally identified 24,462 highly active promoters with strong H3K4me3 enrichment (see Supplementary Information) (Supplementary table 2). Subsequently, we defined tissue-restricted promoters (n=10,396) and enhancers (n=115,222) (Extended Data Fig. 1a).

[...]

Intriguingly, 15.2% (n=3,717) of strong promoters were also predicted as enhancers in other tissues, analogous to observations in mice, where intragenic enhancers act as promoters to produce cell-type specific transcripts19. These sites possessed histone modification signatures of active enhancers in some tissue/cell-types but were enriched with active promoter marks in others. We termed these sequences cis-Regulatory Elements with Dynamic Signatures (cREDS). For example, cREDS enhancers showed enrichment of H3K27ac and H3K4me1 and a striking depletion of H3K4me3 in lung (Fig. 1b and c, Supplementary table 3). However, the signature shifted to that of active promoters in other tissues (Fig. 1b and c). cREDS are also found in other cell/tissue-types (Extended Data Fig. 4a).

[...]

We defined genes with allelically biased expression by mapping the RNA-seq reads in each tissue sample to the two haploid genomes of the donor. We observed extensive allelically biased gene expression, ranging from 4% to 13% of all informative genes (>10 allelic read counts) in each tissue sample (FDR=5%, Extended Data Fig. 7a-b). Comparatively, the proportion of allelically biased genes in individual tissue donors ranged from 6% to 23% of all informative genes, giving a combined total of 2,570 allelically biased genes (Fig. 2b, Supplementary Table 7).
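Calling allelic bias from read counts can be sketched as a per-gene test of the two allele counts against an even split, followed by false discovery rate control. The excerpt states only the informativeness filter (>10 allelic reads) and the 5% FDR; the exact binomial test and the Benjamini-Hochberg procedure used below are assumptions chosen for illustration, as are the gene names and counts.

```python
from math import comb

def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of outcomes
    no more likely than the observed count k."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = probs[k] * (1 + 1e-9)   # tolerance for floating-point ties
    return min(1.0, sum(q for q in probs if q <= threshold))

def benjamini_hochberg(pvalues, fdr=0.05):
    """Return one boolean per p-value marking rejected hypotheses."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= fdr * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= cutoff_rank
    return rejected

def biased_genes(allele_counts, min_reads=10, fdr=0.05):
    """Genes with more than min_reads allelic reads and significant imbalance."""
    informative = {g: c for g, c in allele_counts.items() if sum(c) > min_reads}
    genes = sorted(informative)
    pvalues = [binom_two_sided(informative[g][0], sum(informative[g])) for g in genes]
    flags = benjamini_hochberg(pvalues, fdr)
    return [g for g, flagged in zip(genes, flags) if flagged]

# Hypothetical (allele 1, allele 2) read counts.
toy_counts = {"GENE_A": (40, 5), "GENE_B": (12, 11), "GENE_C": (3, 2)}
biased = biased_genes(toy_counts)
```

Here GENE_C is dropped as uninformative, GENE_B is balanced, and only GENE_A is reported as biased.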

[...]

As natural genetic variations can affect enhancer selection and function in mammalian cells, we hypothesized that polymorphisms at cis-regulatory sequences underlie the widespread allelic transcriptional biases.

[...]

We generated additional H3K27ac ChIP-seq datasets with deeper coverage and longer sequencing reads (for better delineation of alleles) for 14 of the previously analyzed tissue samples and an additional 6 samples from independent donors (Supplementary Table 7). Of the informative enhancers (with >10 polymorphism-bearing sequence reads), 11.6% (n=11,714, FDR=1%) showed significant allelically biased H3K27ac enrichment in at least one tissue type (Fig. 3c, and Supplementary table 8). Interestingly, identical genotypes often yielded the same direction of biases in allelic enhancer activities (Fig. 3d).

Tissue- and cell-type-specific long non-coding RNA

Epigenetic and transcriptional determinants of the human breast

Gascard, P. et al. Nature Communications 10.1038/ncomms7351

Noncoding RNAs are key regulators of diverse cellular processes16 that can interact directly with the epigenetic machinery and may be prognostic in breast cancer17. We identified 936 unique miRNAs expressed at similar distributions across the 5 mammary derived cell types, including a core set of 29 which were highly expressed (>1000 RPM) across myoepithelial, luminal epithelial and stem-like cell types (Supplementary Figure 12b and Supplementary Table 5). Hierarchical clustering demonstrated expected cell type relationships (Supplementary Figure 12c) and cell type-specific miRNAs were identified with a majority being expressed in vHMECs (Figure 2d). We also identified 1,870 expressed lincRNAs (Supplementary Figure 13 and Supplementary Table 6) and 82 cell type-specific lincRNAs across the mammary cell types with myoepithelial cells showing the smallest number of cell type-specific events (Figure 2e and Supplementary Table 7). Restricting our comparison to myoepithelial and luminal cells, we identified 206 DE non-coding RNAs, including 130 lincRNAs and 76 antisense transcripts. Among the differentially expressed lincRNAs, MALAT1 (NEAT2), a critical regulator of metastasis in epithelial cancers18, was overexpressed in normal luminal cells, suggesting that its expression is not solely restricted to metastatic potential in epithelial lineages. An imprinted region of 14q32.3 that encodes maternally expressed noncoding MEG3 and MEG8 transcripts and 54 miRNAs expressed from the maternally inherited homolog was transcriptionally silenced in luminal cells (Supplementary Figure 14). Loss of expression of the MEG3 cluster through LOH and promoter hypermethylation is frequent in epithelial cancers19. Our results suggest that MEG3 transcriptional repression is associated with normal epithelial differentiation and provide a novel intergenic differentially methylated region that may be responsible for its cell type-specific regulation (Supplementary Figure 14).

Epigenomic footprints across 111 reference epigenomes reveal tissue-specific epigenetic regulation of lincRNAs

Amin, V. et al. Nature Communications 10.1038/ncomms7370

Long noncoding RNAs (lncRNAs) are implicated in an increasing number of cellular processes including mammalian cellular differentiation1. Their role in repressing lineage-specific genes during early development was demonstrated by knockdown experiments in mouse embryonic stem cells2. Lineage-specific role of specific lncRNAs has now been established in cardiac3,4, epidermal5, neuronal6, mammary gland development7,8, and in T-cells9. Striking tissue specific transcription of lncRNAs10,11 is consistent with their role in developmental regulation and presents a possible inroad into understanding their biology. The intergenic lncRNAs (lincRNAs) are a major class of lncRNAs that are particularly convenient to study computationally and experimentally because of their lack of overlap with protein coding genes. Despite their relative accessibility, lincRNAs are experimentally less tractable than protein coding genes because of the lack of information about their potential function and associated phenotypes. We here address this knowledge gap by determining their tissue-specific epigenetic regulation, thus complementing the current knowledge about their tissue-specific transcription.

By analyzing 111 reference epigenomes from the NIH Roadmap Epigenomics project, we report that at least 3,753 (69% examined) lincRNAs show exquisitely tissue-specific epigenomic footprints and strongly associate with cell- and tissue-specific pathways, suggesting developmental or tissue-specific function for this newly discovered class of genes.

Skin cell-type-specific differentially DNA methylated regions

Regulatory network decoded from epigenomes of surface ectoderm-derived cell types

Lowdon, R. F. et al. Nature Communications 10.1038/ncomms6442

We identified 12,892 regions of 500 bp encompassing 193,202 CpGs with a DNA methylation status unique to one of the three most common skin cell types (fibroblasts, melanocytes, and keratinocytes) (Fig. 2a). The majority of these skin cell type-specific differentially DNA methylated regions (DMRs) were hypomethylated (Fig. 2a), suggesting potential cell type-specific regulatory activity at these regions4, 12, 13. Forty to 46% of the DMRs were intergenic and 5–9% were associated with RefSeq-annotated gene promoters (Supplementary Fig. 5); non-CpG island promoters were enriched among cell type-specific DMRs (Supplementary Note 4 and Supplementary Table 2). Eighty to 91% of hypomethylated cell type-specific DMRs overlapped with regulatory element-associated histone modifications in the same cell type (Fig. 2b). Accordingly, hypomethylation of cell type-specific DMRs at gene promoters correlated with increased gene expression relative to the other two cell types where the DMR was hypermethylated (Fig. 2c). Gene Ontology (GO) analysis using the GREAT (ref. 14) tool on hypomethylated cell type-specific DMRs showed strong enrichment for biological processes relevant to each cell type (for example, 'extracellular matrix organization' for fibroblasts (P−value=9.05E−45) and 'pigmentation' for melanocytes (P−value=2.43E−06); Fig. 2d). These data suggest skin cell type-specific DMRs occur primarily at distal enhancers and regulate genes relevant to each cell type.

Intermediate DNA methylation is a conserved signature of genome regulation

Elliott, G. et al. Nature Communications 10.1038/ncomms7363

The bimodal pattern of DNA methylation implies a binary control over gene expression, yet a significant number of loci throughout the genome have an intermediate level of DNA methylation. To comprehensively identify regions of intermediate methylation (IM) and their quantitative relationship with gene activity, integrative and comparative analysis was applied to 25 human cell and tissue epigenomes. These analyses identified 18,452 IM regions located near 36% of genes. CpGs in IM regions had a mean methylation of 57% using whole-genome bisulfite sequencing. IM regions were enriched at enhancers and exons and exhibit a quantitative relationship with enhancer signals and exon inclusion, respectively (Figure 2c,d,e). These associations were equally strong in tissue, unsorted peripheral blood and 6 highly purified cell types. Significant interspecies conservation of IM status at orthologous loci, and conservation among different individuals, further suggests an important function, and potentially a shared mechanism for their establishment and maintenance. The data are consistent with the hypothesis that IM is a distinct epigenetic signature of evolutionarily conserved, gene context-dependent function.

Meta-epigenomic structure of purified human stem cells

The meta-epigenomic structure of purified human stem cell populations is defined at cis-regulatory sequences

Wijetunga, N. et al. Nature Communications 10.1038/ncomms6195

To determine whether epigenetic variability was occurring at regulatory sites with possible functional consequences, we took advantage of public chromatin mapping data for CD34+ HSPCs generated by the Roadmap Epigenomics programme (Supplementary Table 5). The DNase hypersensitivity and ChIP-seq data create combinatorial patterns that have previously been exploited to define functional elements in the genome. We processed the Roadmap data using an adaptation of an imaging signal processing algorithm, to define the locations of chromatin constituents with minimal data transformation (Supplementary Fig. 4). These chromatin constituent locations were then used to generate a self-organizing map (SOM), and to map candidate regulatory elements using the Segway algorithm (Supplementary Fig. 5). The individual Segway features were then overlaid as contour plots onto the SOM, which clusters in two-dimensional space loci with similar genomic characteristics, allowing intuitive visualization of the major contributors to each feature (Fig. 2a and Supplementary Fig. 6). Of the multiple chromatin states for which each feature is enriched, feature 6 has the H3K4me3 enrichment, indicating promoter function, features 4 and 5 both have marks indicative of enhancer function (H3K4me1 and H3K27ac, respectively), features 1–3 have the H3K36me3 enrichment typical of transcribed sequences, while feature 0 is enriched for heterochromatic marks (H3K9me3 and H3K27me3).

We also created a metaplot of these new annotations relative to all RefSeq genes in the genome (Supplementary Fig. 7), showing that Segway feature 6 is strikingly enriched at transcription start sites (TSSs), flanked by enrichment for feature 4 and, to a lesser degree, feature 5 (Fig. 2b). Features 1–3 are enriched in gene bodies and feature 0 at intergenic sequences. Statistical testing of the enrichment of features 4 and 6 in their windows of peak frequencies compared with their distributions over all RefSeq genes and flanking regions showed significance (P<0.001 for each). CpG islands and their immediate flanking sequences have previously been related to 'stochastic' DNA methylation variability8 and gene expression regulation31. The Segway annotations demonstrate that although the bodies of CpG islands are enriched for the candidate promoter (feature 6) sequences, the ±2 kb flanking region, generally described as its 'shore', is strikingly enriched for feature 4 (Fig. 2c). Both achieve statistical significance (P<0.001) when compared with their distributions over all CpG islands (feature 6) or flanking regions (feature 4). Finally, stratifying the RefSeq genes by expression quartile in CD34+ HSPCs reveals the transcriptional dependencies of the Segway annotations (Fig. 3). We conclude that the Segway annotations define candidate promoters (feature 6), enhancers (features 4 and 5), transcribed regions (features 1–3) and repressed chromatin (feature 0) for CD34+ HSPCs.

Large-scale imputation of epigenomic datasets for systematic annotation of diverse human tissues

Ernst, J. & Kellis, M. Nature Biotechnology 10.1038/nbt.3157

We applied ChromImpute to a compendium of 127 reference epigenomes, including 111 profiled by the NIH Roadmap Epigenomics project10 and 16 profiled by the ENCODE project2,3 (Fig. 1a). These span diverse tissues and cell types, including Embryonic Stem Cells (ESCs), induced Pluripotent Stem Cells (iPSC), ESC-derived cells, blood and immune cells, skin, brain, adipose, muscle, heart, smooth muscle, digestive, liver, lung and others.

Only 5 'core' histone modification marks were experimentally profiled in all 127 reference epigenomes. These are promoter-associated H3K4me3, enhancer-associated H3K4me1, Polycomb repression-associated H3K27me3, transcription-associated H3K36me3 and heterochromatin-associated H3K9me3. Varying subsets of 34 marks were profiled in different epigenomes, including 30 histone modifications (11 histone methylation marks, 18 histone acetylation marks, and H3T11ph), histone variant H2A.Z, DNA accessibility, DNA methylation data, and RNA-seq data.

Based on these experimentally-profiled ('observed') datasets, we imputed the 31 marks observed in at least two epigenomes in all 127 epigenomes, and the three marks mapped in only one epigenome in the remaining 126 epigenomes. In total we generated 4,315 datasets based on imputation, of which only 1,122 (26%) were also experimentally mapped and 3,193 (74%) are only available as imputed data. Signal tracks for all marks were imputed at 25 base pair resolution (121 million predictions per track) except for DNA methylation, which was imputed at single-nucleotide resolution for each of 28 million CpGs. Across all marks, samples, and positions, we generated a total of 526 billion predicted signal values.
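The dataset counts in the paragraph above can be checked with a few lines of arithmetic: marks observed in at least two epigenomes are imputed in all 127, while marks observed in only one are imputed in the other 126. The variable names below are illustrative; the numbers are those quoted in the text.

```python
N_EPIGENOMES = 127
MARKS_IN_TWO_OR_MORE = 31   # imputed in every epigenome
MARKS_IN_ONE = 3            # imputed in the remaining 126 epigenomes

# Total number of imputed datasets.
imputed_total = (MARKS_IN_TWO_OR_MORE * N_EPIGENOMES
                 + MARKS_IN_ONE * (N_EPIGENOMES - 1))

# Split between datasets that also exist experimentally and imputed-only ones.
also_observed = 1122
imputed_only = imputed_total - also_observed
observed_percent = round(100 * also_observed / imputed_total)

# Genome span implied by 121 million predictions at 25-bp resolution.
implied_span_bp = 121_000_000 * 25
```

The total comes to 4,315 datasets, of which 3,193 are imputed only, matching the 26% and 74% split reported; the implied span of about 3.0 billion bp is consistent with a human genome-wide track.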

We first learned a 25-state model jointly3 across all 127 samples (Fig. 6b,c) using all Tier-1 and 2 marks. This captured multiple types of promoter, enhancer, open chromatin, transcribed, and repressed states and shows specific DNA methylation and RNA-seq enrichments (Fig. 6b,c, S33). Compared to the 15-state chromatin state model based on observed data in the 127 samples (Fig. S33), the 12-mark model better distinguished active vs. poised enhancer states (using H3K27ac and H3K9ac), and captured novel states (e.g. state 19_DNase showing DNA accessibility but lacking enhancer/promoter marks and state 5_Tx5′ associated with 5′ ends of transcripts and based on H3K79me2). Benefiting from the increased stability and robustness of imputed data, imputation-based chromatin states showed more consistent genome coverage across tissue/samples (Fig. S34), better agreement with annotated gene bodies and transcription start sites, both for all transcripts (Fig. S35a,b) and for the set of transcripts expressed in a given tissue (Fig. S35c,d), and better discrimination of evolutionarily-conserved elements (Fig. S36)38. Additionally, we saw better recovery of samples that were not included in any of our training data (e.g. an osteoblast DNA accessibility dataset39, Fig. S37), while capturing major cell type specific differences in chromatin states (e.g. ESC/iPSC cell types showing consistently more abundant bivalent promoter states40, Fig. S38), with cell type specific differences even more pronounced than for chromatin states based on observed data (Fig. S38).

Figure 1: Epigenomic information across tissues and marks.

a. Chromatin state annotations across 127 reference epigenomes (rows, Fig. 2) in a 3.5Mb region on chromosome 9. Promoters are primarily constitutive (red vertical lines), while enhancers are highly dynamic (dispersed yellow regions). b. Signal tracks for IMR90 showing RNA-seq, a total of 28 histone modification marks, whole-genome bisulfite DNA methylation, DNA accessibility, Digital Genomic Footprints (DGF), input DNA, and chromatin conformation information71. c. Individual epigenomic marks across all epigenomes in which they are available. d. Relationship of figure panels highlights dataset dimensions.

Figure 2: Chromatin states and DNA methylation dynamics.

a. Chromatin state definitions, abbreviations, and histone mark probabilities. b. Average genome coverage. Genomic annotation enrichments in H1-ESC. c. Active and inactive gene enrichments in H1-ESC (see Extended Data 2b for GM12878). d. DNA methylation. e. DNA accessibility. d-e. Whiskers show 1.5× the interquartile range. Circles are individual outliers. f. Average overlap fold enrichment for GERP evolutionarily conserved non-coding regions. Bars denote standard deviation. g. DNA methylation (WGBS) density (color, ln scale) across cell types. Red=max ln(density+1). Left column indicates tissue groupings; full list shown in Extended Data 4f. h. DNA methylation levels (left) and TF enrichment (right) during ESC differentiation. i. Chromatin mark changes during cardiac muscle differentiation. Heatmap=average normalized mark signal in Enh. C5 cluster enrichment54.

Figure 3: Chromatin state model robustness and enrichments.

a. Chromatin state model robustness. Clustering of 15-state “core” chromatin state model learned jointly across reference epigenomes (Fig. 4a) with chromatin state models learned independently in 111 reference epigenomes. We applied ChromHMM to learn a 15-state ChromHMM model using the five core marks in each of the 111 reference epigenomes generated by the Roadmap Epigenomics program, and clustered the resulting 1,665 state emission probability vectors (leaves of the tree) with the 15 states from the joint model (indicated by arrows). We found that the vast majority of states learned across cell types clustered into 15 clusters, corresponding to the joint model states, validating the robustness of chromatin states across cell types. This analysis revealed two new clusters (red crosses) which are not represented in the 15 states of the jointly-learned model: “HetWk”, a cluster showing weak enrichment for H3K9me3; and “Rpts”, a cluster showing H3K9me3 along with a diversity of other marks, and enriched in specific types of repetitive elements (satellite repeats) in each cell type, which may be due to mapping artifacts. This joint clustering also revealed subtle variations in the relative intensity of H3K4me1 in states TxFlnk, Enh, and TssBiv, and H3K27me3 in state TssBiv. Overall, this analysis confirms that the 15-state chromatin state model based on the core set of five marks provides a robust framework for interpreting epigenomic complexity across tissues and cell types. b. Enrichments for 15-state model based on five histone modification marks. Top Left: TF binding site overlap enrichments of 15 states in H1-ESC from the “core” model for transcription factor binding sites (TFBS) based on ChIP-seq data in H1-ESC. TF binding coverage for other cell-types based on matched TF ChIP-seq data is shown in Fig. S2. Top Right: Enrichments for expressed and non-expressed genes in H1-ESC and GM12878.
Bottom: Positional enrichments at the transcription start site (TSS) and transcription end site (TES) of expressed (expr.) and repressed (repr.) genes in H1-ESC. Transition probabilities show frequency of co-occurrence of each pair of chromatin states in neighboring 200-bp bins. c. Definition and enrichments for 18-state 'expanded' model that also includes H3K27ac associated with active enhancer and active promoter regions, but which was only available for 98 of the 127 reference epigenomes. Inclusion of H3K27ac distinguishes active enhancers and active promoters. Top: TFBS enrichments in H1-ESC (E003) chromatin states using ENCODE TF ChIP-seq data in H1-ESC. Bottom: Positional enrichments in H1-ESC for genomic annotations, expressed and repressed genes, TSS and TES, and state transitions as in Extended Data 2b and Fig. 4a-c. Right: Average (colored bars) and standard deviation (black line) across 98 reference epigenomes (Fig. S3d) of the fold enrichment for conserved non-coding genomic segments (GERP) in each chromatin state (rows) in the 18-state model. Even after excluding protein-coding exons (see Fig. S3b vs. Fig. S3d), the TSS-proximal states show the highest levels of conservation, followed by EnhBiv and the three non-transcribed enhancer states. In contrast, Tx and TxWk elements are weakly depleted for conserved regions, and ZNF/Rpts and Het are strongly depleted for conserved elements.

Figure 4: Cell type differences in chromatin states.

a. Chromatin state variability, based on genome coverage fraction consistently labeled with each state. b. Relative chromatin state frequency for each reference epigenome. c. Chromatin state switching log10 relative frequency (inter-cell-type vs. inter-replicate). d. Clustering of 2Mb intervals (columns) based on relative chromatin state frequency (fold enrichment), averaged across reference epigenomes. LaminB1 occupancy profiled in ESCs. Red lines show cluster average.

Figure 5: Chromatin state variation, coverage and over-/under-representation across cell types.

a,b. Variability of chromatin states across subset of highest-quality epigenomes. Chromatin state variability (similar to Fig. 5a), based on the fraction of the genomic coverage (y-axis) of each state (color) that is consistently labeled with that state in at most N (ranging from 1 to 43) reference epigenomes for: (a) 43 highest-quality non-redundant epigenomes using the 15-state model learned on 5 core marks; (b) 34 highest-quality non-redundant epigenomes that had H3K27ac data using the 18-state model learned on 6 marks (the 5 core marks plus H3K27ac). c,d. Fractions of the genome occupied by each chromatin state in the 15-state model across 127 epigenomes (panel c) and in the 18-state model across 98 epigenomes (panel d). Percentages at the bottom show the average genome occupancy for each chromatin state across the 127 epigenomes, and each point shows over- or under-representation for that chromatin state in that epigenome relative to the vertical line representing the average occupancy. Grey shaded area shows +2/−2 standard deviations. e. Left: Number of standard deviations away from the mean for the coverage of each state in the 15-state model across 127 epigenomes for all cases greater than 1 standard deviation above (red) or below (blue) the mean. Right: Actual coverage values for all cases that deviate by at least 1 standard deviation.

Figure 6: Chromatin state switching across cell types and across replicates/individuals.

a. Intra-tissue state switching probability for 15-state core model across samples (replicates or individuals) from same tissue/cell (Table S1 - Sheet VariationAnalysis). b. Inter-tissue state switching probability for 15-state core model across 43 high-quality, non-redundant Roadmap epigenomes (Table S1 - Sheet VariationAnalysis). The relative switching frequencies (log-ratio) shown in Fig. 5c correspond to the ratio of panel “b” frequencies to panel “a” frequencies. c. Log10 ratio of inter-tissue state switching probabilities relative to intra-tissue switching probabilities for ENCODE samples, for the 15-state core model (analogous to Fig. 5c). d. Log10 ratio of inter-tissue chromatin state switching probabilities relative to intra-tissue switching probabilities for ENCODE samples, for the 18-state expanded model that includes H3K27ac (analogous to Extended Data 5c).
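The log-ratio described in this caption can be illustrated with a minimal sketch: given an inter-tissue and an intra-tissue switching matrix, take the element-wise log10 of their ratio. The function name, the toy matrices, and the pseudocount guard are assumptions made for illustration; the caption does not describe how zero probabilities were handled.

```python
import math


def log10_switch_ratio(inter, intra, pseudocount=1e-6):
    """Return the element-wise log10(inter / intra) of two switching matrices.

    inter, intra: square lists of lists of switching probabilities.
    pseudocount:  hypothetical guard against division by zero (an assumption).
    """
    return [
        [math.log10((p_inter + pseudocount) / (p_intra + pseudocount))
         for p_inter, p_intra in zip(row_inter, row_intra)]
        for row_inter, row_intra in zip(inter, intra)
    ]


# Made-up 2-state example: rows are the source state, columns the target state.
inter_tissue = [[0.90, 0.10], [0.20, 0.80]]
intra_tissue = [[0.99, 0.01], [0.02, 0.98]]
for row in log10_switch_ratio(inter_tissue, intra_tissue):
    print([round(value, 2) for value in row])
```

Positive values mark state pairs that switch more often between tissues than between replicates or individuals of the same tissue.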

Figure 7: Chromatin state variability, switching, and genomic coverage.

a. Variability level for 18-state model. Chromatin state variability (similar to Fig. 5a), quantified based on the fraction of the genomic coverage (y-axis) of each state (color) that is consistently labeled with that state in at most N (ranging from 1 to 98) reference epigenomes, using the 18-state model learned based on 6 chromatin marks, including H3K27ac. b. Chromatin state over- and under-representation for 18-state expanded model. c. Log-ratio (log10) of chromatin state switching probabilities for the 18-state expanded model across 34 high-quality, non-redundant epigenomes that have H3K27ac data, relative to intra-tissue switching probabilities across replicates or samples from multiple individuals. d. Chromatin state coverage grouped by epigenomic domains. Top: Chromosome “painting” of 11 clusters shown in Fig. 5d and discovered based on chromatin state co-occurrence at the 2Mb scale across reference epigenomes. Bottom: Enrichment of CpG islands in each cluster, clearly showing higher CpG density in “active” clusters 3 and 6 compared to passive clusters 9-11. Each box plot shows the distribution of total CpG occupancy in 2Mb bins in each cluster. Box boundaries indicate the 25th and 75th percentiles, and the whiskers extend to the most extreme data points that the algorithm does not consider outliers. Points are drawn as outliers if they are larger than Q3+W(Q3-Q1) or smaller than Q1-W(Q3-Q1), where Q1 and Q3 are the 25th and 75th percentiles, respectively.
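The outlier rule in panel d can be sketched as follows. The caption does not give the value of W or the percentile method; the default of 1.5 and the "inclusive" quantile method used here are assumptions, as are the function name and the toy data.

```python
import statistics


def split_outliers(values, w=1.5):
    """Split values into (inliers, outliers) using the Q1/Q3 fence rule.

    w is the whisker multiplier W from the caption; the default of 1.5 is
    an assumption, since the caption does not state its value.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - w * iqr, q3 + w * iqr
    inliers = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return inliers, outliers


# Made-up CpG occupancy values for one cluster of 2Mb bins.
inliers, outliers = split_outliers([1, 2, 3, 4, 5, 100])
print(outliers)  # [100]
```

The whiskers of the box plot would then extend to `min(inliers)` and `max(inliers)`.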

Figure 8: Regulatory modules from epigenome dynamics.

a. Enhancer modules by activity-based clustering of 2.3 million DNase-accessible regions classified as Enh, EnhG or EnhBiv (color) across 111 reference epigenomes. Vertical lines separate 226 modules. Broadly-active enhancers shown first. Module IDs shown in Fig. S11c. b-c. Proximal gene enrichments54 for each module using gene ontology (GO) biological process (panel b) and human phenotypes (panel c). Rectangles pinpoint enrichments for selected modules. Representative gene set names (left) selected using bag-of-words enrichment.

Figure 9: Enhancer/promoter module clustering and gene set enrichment analysis.

a. Promoter modules. Clustering of 81,232 DNaseI-accessible promoter-marked regions into 82 promoter modules (1.44% of the genome). b. Promoter/enhancer 'dyadic' modules. Clustering of 129,960 DNaseI-accessible promoter/enhancer regions into 129 dyadic modules (0.99% of the genome). c. Enhancer modules. Clustering of 2,328,936 DNaseI-accessible enhancer-marked regions into 226 enhancer modules (12.64% of the genome). Same as Fig. 7a, with cluster names shown. d. Gene Ontology Biological Processes gene set enrichments (y-axis) for genes proximal to enhancer regions in each of the 226 activity-based clusters (x-axis). Colors indicate level of statistical significance (white: p > 0.01, yellow: 0.001 < p < 0.01, orange: 0.0001 < p < 0.001, red: p < 0.0001). Cluster sizes in terms of number of enhancers are shown at the top, along with cluster reference numbers. e. Subclustering of cluster c98 (indicated by an arrow and asterisk in (a)) into 9 subclusters (top), each having a distinct Gene Ontology Biological Processes gene set enrichment pattern (bottom).

Figure 10: Additional gene set enrichment analysis results.

Top. Additional gene set enrichment analyses. Cluster sizes in terms of number of enhancers are shown at the top, along with cluster reference numbers as in Fig. S11. a. Disease Ontology Database120 results. Rectangles indicate areas of interest, pointing to groups of cell type-restricted clusters clearly enriched for specific gene sets. b. Mouse Genome Informatics121 database enrichments, indicating clusters of enhancer regions associated with orthologous mouse genes with particular anatomical regional expression patterns.

Figure 11: Increasing enhancer orthologs help interpret AD-associated non-coding loci.

Overlap of disease-associated SNPs (top) with increasing enhancers (2nd row, red) and immune enhancers in human (CD14+ primary cells) is shown for genome-wide significant (INPP5D and SPI1/CELF1; panels a and b) and below-significance (ABCA1; panel d) AD GWAS loci. Roadmap chromatin state annotations are shown for immune cells (CD14+ primary; E029), hippocampus (E071), and fetal brain (E081), with colors as shown in the legend. Light red highlight denotes increasing enhancer regions tested in luciferase assay. c, AD-associated SNP rs1377416 amplifies in vitro luciferase activity of putative enhancer region 38,313 - 37,359 bp upstream of SPI1 (PU.1) gene in BV-2 cells (n=3, one-way ANOVA p<0.0001, Tukey's test p<0.05). ns, nonsignificant.

Figure 12: Disease variants map to discrete elements in super-enhancers.

a, Candidate causal SNPs for autoimmune diseases are displayed along with H3K27ac, RNA-seq and TF binding profiles for the IL2RA locus, which contains a super-enhancer (pink shade). b, For all SNPs in the IL2RA locus, scatter plot compares strength of association with MS versus autoimmune thyroiditis. Immunochip data resolve rs706779 (red) as the lead SNP for autoimmune thyroiditis and rs2104286 (blue) as the lead SNP for MS. c, LD matrix displaying r² between lead SNPs for different diseases at the IL2RA locus confirms distinct and independent genetic associations within the super-enhancer.

Figure 13: Epigenome profiles of tissues reveal cREDS with dynamic histone modification signatures.

a) Schematic of the cell/tissue-types profiled and their progression along developmental lineages. Samples include embryonic stem cells (H1), early embryonic lineages (mesendoderm cells (MES), neural progenitor cells (NPC), trophoblast-like cells (TRO) and mesenchymal stem cells (MSC)) and somatic primary tissues, representative of all three germ layers (Ectoderm: hippocampus (HIP), anterior caudate (AC), cingulate gyrus (CG), inferior temporal lobe (ITL) and mid-frontal lobe (MFL); Endoderm: lung (LG), small bowel (SB), thymus (TH), sigmoid colon (SG), pancreas (PA), liver (LIV) and IMR-90 fibroblasts; Mesoderm: duodenum smooth muscle (DUO), spleen (SX), psoas (PO), gastric tissue (GA), right heart ventricle (RV), right heart atrium (RA), left heart ventricle (LV), aorta (AO), ovary (OV) and adrenal gland (AD)). b) Heatmaps show H3K27ac, H3K4me3 and H3K4me1 enrichment (RPKM) at predicted lung enhancers (n=1,321), which are defined as promoters in other tissues, across all 28 samples. Red box highlights the signatures in lung. c) A UCSC genome browser snapshot of a region on chromosome 20, showing the chromatin states of a cREDS element (gray shading) predicted as a promoter in psoas and an enhancer in lung. d) A boxplot of RNA-seq signals (RPKM) overlapping ±1kb of cREDS enhancers, cREDS promoters, non-cREDS control enhancers and non-cREDS control promoters (p-value<10e-142, Wilcoxon test). e) RNA-seq and chromatin states of a cREDS element (gray shading) are shown for a region on chromosome 17 in H1 and IMR-90. Arrow indicates an alternate exon incorporated in IMR-90.

Figure 14: Widespread, individual-specific allelic bias in gene expression.

b) Proportion of genes with allelically biased expression among informative genes and the number of tissue samples derived from each donor (ntissue) are described.

Figure 15: Characterization of allele bias in chromatin states at cis-regulatory elements.

c) Proportion of allelic (n=11,714) and non-allelic (n=89,599) enhancers among all informative enhancers (n=101,313) across 20 tissues. d) A snapshot showing a SNP (rs138143205) with H3K27ac bias towards the G allele in both LV donors (Left). Bar chart illustrates the number of H3K27ac reads corresponding to the P1 versus P2 alleles in both donors (Right) (p-value<10e-19, binomial test).
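To illustrate the kind of binomial test named in panel d, here is a minimal sketch of an exact two-sided test on allele-specific read counts. The caption only states that a binomial test was used; the null proportion of 0.5, the two-sided alternative, the function name, and the read counts below are all assumptions for illustration.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided binomial p-value for k of n reads under p = 0.5.

    Sums the probabilities of all outcomes at least as far from n/2 as k.
    """
    observed_distance = abs(2 * k - n)
    tail = sum(comb(n, i) for i in range(n + 1)
               if abs(2 * i - n) >= observed_distance)
    return tail / 2 ** n


# Made-up read counts: 18 H3K27ac reads on one allele, 2 on the other.
print(binomial_two_sided_p(18, 20))
```

Integer arithmetic is used for the tail sum so that the result stays exact until the final division, which keeps the sketch usable for read depths where floating-point binomial terms would underflow.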

Figure 16

Tissue-specific regulatory regions and associated phenotypes. (a) Lineage-specific regulatory regions were determined by comparing epigenomes within a cluster against the epigenomes outside the cluster. Pie chart shows the percentage of regions that harbor cluster-specific marks (some unique to the cluster, some shared by two or three related subtrees) and highlights regulatory regions that are less specifically modified (grey). (b) Distribution of regulatory regions that are unique and shared for each cluster. (c) Enrichment of mouse phenotype terms associated with lineage-specific regulators, calculated using the GREAT tool (lineages representing all three germ layers were selected). We have developed an online tutorial (link: http://genboree.org/theCommons/projects/aminv-natcomm-2015/wiki) on how to use online tools integrated within the Genboree Workbench to carry out the types of analyses reported in this Figure.

Figure 17: Identification and characterization of skin cell type-specific DMRs.

(a) Hypomethylation and hypermethylation percentages for each set of skin cell type-specific DMRs defined by comparison against the other two skin cell types. The total number for each set of cell type-specific DMRs is listed above the pie chart. DMRs are 500 bp windows. (b) Histone modification patterns at skin cell type-specific hypomethylated DMRs. (c) Skin cell-type RNA expression levels for genes with hypomethylated cell type-specific DMRs in their promoter regions. Each panel depicts expression values for a set of cell type-specific DMR-associated genes. Plotted values are RNA-seq RPKM values over exons, averaged (mean) over three biological replicates. For each box plot, the middle line indicates the median value, top and bottom box edges are the third- and first-quartile boundaries, respectively. The upper whisker is the highest data value within 1.5 times the interquartile range; the lower whisker indicates the lowest value within 1.5 times the interquartile range. The interquartile range is the distance between the first and third quartiles. Points indicate data beyond whiskers. Logarithmic scale transformations were applied before box plot statistics were computed. RPKM distributions for a given set of cell type-specific DMR-associated genes in the specified cell type compared with other cell types were statistically significant (Wilcoxon ranked test, paired, P-value0.003, Keratinocyte-DMRs n=602, Fibroblast-DMRs n=108, Melanocyte-DMRs n=74; K, keratinocytes; F, fibroblasts; M, melanocytes; Supplementary Tables 3–5). (d) Heat map depicting selected GO terms enriched for keratinocyte, fibroblast and melanocyte-hypomethylated cell type-specific DMRs. F, fibroblasts; K, keratinocytes; M, melanocytes. Colour intensity represents the negative log10 transformed P-value of enrichment of a given cell type-specific DMR set for association with the listed GO term. Full data sets are in Supplementary Data 3.

Figure 18: Empirical annotation of the CD34+ HSPC genome based on chromatin features reveals candidate cis-regulatory element locations.

(a) A contour plot of the regions within the SOM where Segway features 4 (above) and 6 (below) enrich, showing feature 4 to be composed of loci where H3K4me1 and H3K27me3 occur, while the loci composing feature 6 contain the H3K4me3 and H3K27ac modifications. Consistent with these findings, b shows feature 6 (red) to be enriched at the TSS for a metaplot (top) and a heat map (below) of all RefSeq genes, indicating promoter characteristics, while feature 4 (yellow) flanks this region and is consistent with enhancers in a poised state. In c, similar metaplot (top) and heat map (below) representations of the 2-kb regions flanking CpG islands demonstrate strong enrichment in feature 4, indicating that these 'CpG island shores' in fact represent candidate enhancers in this cell type.

Figure 19: Transcriptional relationships of Segway features.

A RefSeq metaplot for the Segway features divided by expression quantile shows that features 1-3 enrich in the bodies of genes as transcription increases, at the expense of feature 0, which appears to represent repressed chromatin. Feature 6 is strongly enriched at canonical TSSs, flanked by an enrichment of feature 4 and, to a lesser extent, feature 5, which have chromatin signatures indicative of enhancer function.