Introduction

A central goal of biological studies is the identification and characterization of proteins that act in a common cellular pathway. Efforts toward this goal have been greatly aided by large-scale perturbation analyses coupled with whole-transcriptome profiling, in which each gene’s transcriptional response to a perturbation is measured. If a sufficient database of expression profiles exists, then a pathway affected by an uncharacterized perturbation such as a gene mutation, drug treatment or growth condition—can be described by matching the resultant profile to a known profile1. For the yeast Saccharomyces cerevisiae, the expression profiles of a large number of individual yeast deletion mutants have been established and used to infer protein complexes and networks2,3,4. Maximizing the utility of expression profiling approaches for inference of physical and genetic interactions requires ever larger such datasets. However, standard techniques, such as pairwise correlation, do not fully capture the variation available to link gene function as more dimensions are added from larger scale experiments. Therefore, techniques that reduce dimensionality of the data while maintaining relationships between genes are imperative for the inference of physical and genetic interactions in very large gene expression datasets.

Dimensionality reduction methods capture variability in a limited number of random variables to facilitate 2- or 3D-visualization of datasets with tens to thousands of dimensions. This approach is recognizable in the commonly used method of principal component analysis (PCA), which uses linear combinations of variables to generate orthogonal axes that efficiently capture the variation present in the data with fewer variables. Another approach, t-Distributed Stochastic Neighbor Embedding (t-SNE), carries out dimensionality reduction by analyzing similarity of points using a Gaussian distance in high-dimensional space and projecting these data into a low-dimensional space5. A more recent method, Uniform Manifold Approximation and Projection (UMAP), estimates a topology of the high-dimensional data and uses this information to construct a low-dimensional representation that preserves relationships present in the data6. UMAP has been particularly useful to precisely define cell types in mixed populations based on data from single-cell RNA-seq experiments7,8,9,10,11,12,13; it also performs well on other gold-standard datasets6,14. Because UMAP is better able to preserve elements of the data structure from high-dimensional space than similar outputs from t-SNE, it captures local relationships within groups of transcriptomes in addition to global relationships between distinct groups14. This feature is especially useful in the inference of gene relationships, which can be due to physical interaction, overlapping gene function, or coordinated contributions to a larger cellular process. Here, we show that the use of dimensionality reduction by UMAP on bulk expression profiling data of 1484 single-gene mutants of S. cerevisiae links gene function in clusters at increasingly finer scales, corresponding to broad cellular activities, pathways, protein complexes and individual protein-protein interactions.

Results

UMAP groups deletion mutants with shared protein function

We assigned groups, or clusters, to deletion mutants with similar transcriptional responses using the Louvain community detection algorithm in low-dimensional UMAP space9. While many single-cell transcriptomic studies use expression values from genes with the highest dispersion across individual cells, we took advantage of the completeness of bulk microarray data generated by Kemmeren et al.3 and used expression values for all 6170 genes measured in each of the 1484 single-gene deletion strains to make a UMAP projection for subsequent clustering. This approach resolved 50 main clusters, with the number of deletion backgrounds assigned to each cluster ranging from 4 to 298 (median of 11). Clusters with >25 strains were subsequently sub-clustered using similar parameters to define groups. The final dataset contains 171 clusters with a median of 8 strains per cluster.

A total of 194 characterized yeast complexes have at least two of their corresponding genes in the dataset of single deletions. For 40% of these complexes (78/194), we could assign two or more genes to the same cluster (examples of complexes in the initial set of 50 clusters in Fig. 1a, additional complexes were separated in the sub-clustered set (Fig. 1b)). For example, the sub-clustering of the original cluster 2, which is characterized by cell cycle and chromosome organization genes, resulted in more distinctly separating the Isw2-Itc1 chromatin remodeling complex, the Csm3-Tof1 S-phase checkpoint complex and the Oca S-phase histone activation complex (Fig. 1b). Within this sub-clustered set, multiple complexes could be found among genes within a single cluster, suggesting that these complexes may cooperatively contribute to chromosome cohesion and recombination (Fig. 1b).

Fig. 1: UMAP clusters single-gene deletion transcriptomes according to shared function.
figure 1

a UMAP coordinates of 1484 single-gene deletion strains clustered by similarity in transcriptional effects. The initial 50 individual clusters are each shown in a different color. Strains that comprise protein complexes are indicated alongside a bar colored according to cluster identity. Each complex is represented as a fraction: the number of complex members found in the cluster over the number of complex members in the set of 1484 mutants. Clusters with coordinates far from the main group are shown in boxes. Clusters without a known complex are marked as “unknown,” along with an arbitrary cluster number; these clusters are annotated with a broad GO term enriched in that cluster. b Cluster 2 shows more distinct groupings when re-clustered separately. Annotations as in a. Cluster 2 as a whole was enriched for cell cycle and chromosome organization, with individual clusters corresponding to parts of this process. c The tRNA wobble uridine pathway, captured entirely within the cluster containing the Elongator complex (boxed green cluster in a). Complex members within this cluster are annotated with orange boxes, while new members are annotated in blue. One pathway member, Nfs1, was not present in the single-gene deletion dataset. The heatmap represents fine-scale distances between each pair of points within the cluster. Darker shades of red indicate points nearer in UMAP space; hierarchical clustering was applied on this distance metric to group proteins within this pathway. Heterodimeric interactions, such as Ncs6-Ncs2 (bottom-right corner of heatmap), are nearer to each other than other members of the pathway. Novel members of this pathway (blue text) are grouped with other members based on their similarity of UMAP distance, and these new interactions are indicated with gray lines in the pathway diagram.

In some cases, members of individual complexes were assigned to separate clusters, suggesting sub-functionalization of components. For example, the 13-member mediator complex was found in three clusters (numbers 16, 34, and 41) containing 3, 6, and 4 members of mediator, respectively (Fig. 1a). Cluster 16 also contains members of SAGA and SWI/SNF complexes, and loss of mediator subunits in this cluster alters the transcription of amino acid metabolism genes and glucose transmembrane transporters (Supplementary Data 1); cluster 34 contains galactose-responsive subunits of mediator; and cluster 41 contains transcriptional initiation-related mediator subunits. Here, UMAP preserves global relationships between clusters in addition to resolving proximal cluster members. For example, most chromatin remodeling complexes grouped in UMAP space, despite being present in separate clusters and containing unique local topologies (Fig. 1a).

UMAP clustering identified the components of the pathway for tRNA wobble uridine modification (Fig. 1c), which requires the URM1 pathway for 2-thiolation and the Elongator complex for side chain formation at U34 of tRNA15. The clustering revealed two additional members that are likely to link metabolism and cell cycle to this process. One of these, Met18, has a human ortholog (MMS19) that functions in maturation of Fe-S cluster-containing proteins; the conserved yeast and human Elongator component Elp3 is one of these Fe-S proteins16. The other new member, the PP2A phosphatase Sit4, is implicated in dephosphorylation of Elongator; its absence leads to tRNA modification defects17.

Comparison of UMAP distance to other protein interaction metrics

To assess whether distance in UMAP space captured known interactions as well as pairwise correlation, we used three gold-standard interaction datasets: (1) protein interactions determined by co-immunoprecipitation followed by mass spectrometry2; (2) gene interactions from stringDB18, which are derived from a probabilistic metric based on multiple evidence channels including yeast two-hybrid, pathway annotations, and co-expression; and (3) interactions from CellMAP19, which are derived from an experimental screen for synthetic genetic interactions. The UMAP distance metric captured protein complexes more sensitively and with more precision than previous pairwise correlation-based metrics (AUC pairwise correlation = 0.73, AUC UMAP = 0.84, Fig. 2a). UMAP distance also captured known interacting pairs better than distance in high-dimensional space (AUC = 0.56) and distance in PCA space (AUC = 0.70), suggesting that the UMAP dimensionality reduction itself adds value in the identification of interactions (Fig. 2a, Supplementary Fig. 1a). Across each gold-standard interaction dataset, UMAP distance performed better than several other standard approaches for analyzing the transcriptome data, including PCA, random orthonormal projections20, and tSNE (Supplementary Fig. 1a, b).

Fig. 2: UMAP distance identifies protein-protein interactions more effectively than previous methods.
figure 2

a A receiver-operator curve showing the ability of UMAP distance to capture known protein-protein interactions (sensitivity) as a function of its false positive detection. UMAP distance (blue) performs better than pairwise correlation (green), PCA distance (dark gray), and high-dimensional distance (light gray) in identifying interactions. b For each protein-protein interaction, the distance between points in UMAP space was plotted against the pairwise correlation of that pair of transcriptomes. The density of points is indicated with blue lines. Inset in the upper right shows a zoomed-in portion of the x-axis; points with UMAP distance in this range are highly enriched for true interactions that are not captured by pairwise correlation.

Performing clustering in UMAP space ought to produce clusters containing more true interactions than distance in other spaces. To test whether similar results were obtained without UMAP dimensionality reduction, we clustered the data in PCA space. Clustering in PCA space identified 8/50 clusters with perfect overlap to UMAP clusters, and 34/50 that overlap by at least 50% (Supplementary Fig. 1c).

To compare pairwise correlation with the UMAP approach, we calculated for each known interacting pair (1) the Pearson correlation of their deletion transcriptomes; and (2) the distance of those two genes in the UMAP space generated by using all deletion transcriptomes. Among these interacting pairs, UMAP distance and pairwise correlation are negatively correlated (Fig. 2b). However, the increased sensitivity of UMAP distance to detect known interactions suggests that the discrepancies between UMAP distance and pairwise correlation might represent interactions that were previously overlooked. Based on a UMAP distance cutoff corresponding to a 5% FDR of known complex members (Inset, Fig. 2b), we were able to identify 176 putative interactions that would not have been confidently called by previous approaches using pairwise correlations (PCC < 0.5); these interactions contain 86 unique genes, of which 77 show co-IP or yeast two-hybrid evidence for membership among 31 protein complexes, while the remaining 9 genes had no such evidence.

Since proximity in UMAP space tends to capture known interactions and shared function, distance in UMAP space could serve as a useful tool to investigate evolutionary questions about gene divergence. We calculated UMAP distance between 151 paralogous gene pairs in yeast and used this distance to characterize the functional divergence between each pair (Supplementary Fig. 2a). Proximity of paralog pairs in UMAP space did not correspond to previous estimations of paralog divergence (Supplementary Fig. 2b, c) based on synthetic genetic interaction (R = 0.018) or Gene Ontology relationships (R = 0.035)19. When paralogs show a negative genetic interaction—that is, deletion of both genes leads to lower fitness than expected—it is assumed that the two genes retain redundant functions. However, 11 paralog pairs whose negative genetic interactions suggested redundant function showed distinct downstream effects on gene expression when each gene was deleted (Supplementary Fig. 2b, d); these genes may have distinct effects on fitness in different environments21. In these cases, a gene may retain the capacity to complement the essential function of its paralogous partner, while diverging sufficiently in function as revealed by the UMAP-based transcriptome analysis.

Despite successful clustering of many protein complexes and pathways of yeast, the UMAP approach nevertheless identified several clusters that did not obviously correspond to a complex or pathway. We used GO enrichment of differentially expressed genes in these clusters to interrogate their function: cluster 26 showed enriched terms for cell cycle, non-membrane-bound organelles, and prions; cluster 13 showed enrichment for mitochondrial function; cluster 46 showed enrichment for TOR signaling and aerobic respiration; cluster 32 showed enrichment for protein folding; and cluster 11 showed enrichment for heme binding. Differential expression analysis produced significant gene sets for all main and sub-clusters (Supplementary Data 1).

Discussion

Because of its greater sensitivity than other approaches, as well as its ability to capture both local and global relationships, UMAP-based association of gene function adds value in the identification of protein complexes, pathways, and novel interactions in transcriptomic datasets. However, the utility of this method is dependent on the availability of high-quality profiling data from large-scale environmental or genetic perturbation experiments. As more datasets of this type become available, we expect that this approach, or similar dimensionality reduction techniques, will become increasingly useful in mapping protein complexes and pathways both within and across other species. The recent appearance of single-cell expression profiling data paired with CRISPR-induced mutations will be an especially useful source of data of this type, as these experiments include increasingly larger numbers of mutations22. While many of the most useful applications of dimensionality reduction tend to arise from single-cell genomics, for which typical datasets necessitate approaches like UMAP to define relationships between cells, these approaches may also prove useful in visualizing the spatial relationships of biomolecules in tissues23, genetic interactions, or relationships between human populations24.

Methods

Yeast single-gene deletion transcriptome data

Growth-rate adjusted microarray expression values derived from limma modeling by Kemmeren et al.3. were used as input data. All 1484 single-gene deletion strains from this dataset were used for subsequent dimensionality reduction.

Dimensionality reduction and clustering

To project single-gene deletion strains into two dimensions we performed dimensionality reduction with the UMAP algorithm6 using the wrapper function in Monocle 3 (v2.99.3)9 to project single-gene deletion strains into two dimensions and subsequently used Louvain clustering25 on strains in 2D UMAP space using default parameters (except, reduceDimension: reduction_method = UMAP, metric = cosine, n_neighbors = 10, min_dist = 0.05; clusterCells: method = louvain, res = 1e-4, k = 3). Prior to dimensionality reduction, expression values from all 6170 yeast genes were given as input to Principal Component Analysis (PCA) using the Monocle 3 wrapper function “preprocess_cds”. The top 100 principal components were then used as input to UMAP for generating 2D projections of the data. For subclustering, main clusters 1–10 were each individually processed using top 25 principal components in the subset data as input to UMAP dimensionality reduction and Louvain clustering (resolution = 1e-4).

Alternative dimensionality reduction with tSNE was performed using the Monocle 3 function reduceDimension with default parameters (reduction_method = tSNE). Dimensionality reduction using random projection, based on the Johnson-Lindenstrauss lemma, was performed using the RandPro (v0.2.0) R package.

Differentially expressed genes per cluster

Gene expression values for single-gene deletions within a cluster were compared to the background set of all deletions. Differentially expressed genes for each cluster were calculated using the differentialGeneTest() function in Monocle 3. Because the expression datasets were microarray-derived rather than count-based RNA-seq data, the “gaussianff” expression family was used; significance values were corrected for genomic inflation factors using lamba gc26.

Benchmarking with known interacting pairs

To test the ability of UMAP distance, and other distance metrics, to capture known interactions, we used a curated consensus set of protein complexes derived from two large, high-throughput mass spectrometry datasets and GO interactions2. The consensus set was transformed into a pairwise Boolean interaction matrix based on whether or not each pair had been observed together in the known complex set. Using the subset of pairs that were found in the set of 1484 single-gene deletion transcriptome datasets, for each gene pair, we calculated Euclidean distance in UMAP space. We then used these distances, along with labels for true and false interacting gene pairs derived from gold standard interaction datasets, to generate receiver operating characteristic (ROC) and precision/recall curves with the PRROC package in R27.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.