Introduction

Cancer is a heterogeneous disease with diverse pathogeneses and clinical features that can develop in different tissues and cell types1. A cancer subtype is a subcategory of a specific cancer; for example, cervical cancer can be further grouped into adenocarcinomas and squamous cell carcinomas. Multiple subtypes are distinguishable based on molecular profiles, histology, or, in some cases, specific mutations. Personalized medicine aims to provide patient-specific rather than generic treatment. Therefore, for effective treatment of any cancer, it is crucial to identify the appropriate cancer subtype in order to provide an accurate prognosis2.

Nowadays, with the advancement of technologies, it has become relatively easy to generate high-dimensional multi-omics data for an individual. Multi-omics data include miRNA and mRNA expression, DNA methylation, reverse-phase protein arrays, and others. These datasets are publicly available in databases such as The Cancer Genome Atlas (TCGA)3. The accumulation of omics data opens up the opportunity to develop novel computational methods that integrate the tremendous amount of multi-view information available for cancer subtype identification. The usual practice for identifying cancer subtypes is to cluster cancer patient data. By grouping cancer patients based on their genetic profiles, one can better understand the pathogenic mechanisms behind the disease, which in turn supports the development of subtype-specific anticancer treatments. However, several challenges exist in grouping cancer patients and integrating multi-omics data.

Multi-view omics data integration and clustering of cancer patients are relatively new research areas, and only a few algorithms have been developed to address the challenges associated with them. A decade ago, researchers used single-omics data to cluster cancer subtypes. Several studies used only gene expression data4,5,6, DNA methylation data7, or copy number data8 to identify cancer subtypes. These algorithms cluster the samples to capture the homogeneity among patients based on the expression levels of a specific biomarker. Since acquiring cancer hallmarks requires molecular alterations at multiple levels, such algorithms fail to establish causal relationships between molecular signatures. This biological phenomenon indicates the need for algorithms that integrate multi-omics data to identify cancer subtypes. In this regard, integrative clustering-based approaches have proven helpful for capturing the underlying molecular mechanisms of the disease. These algorithms can be categorized into two groups. The first group identifies clusters from each omic dataset separately and later combines the clustering results into a global cluster structure that represents the cancer subtypes9,10,11,12. Such algorithms are known as Consensus Clustering (CC). Most CC algorithms perform the final clustering on the individual clusters obtained from the different omic datasets using a voting mechanism, and different voting mechanisms generate different clustering solutions. The second group of integrative clustering-based approaches first integrates the multi-view omics data and then applies clustering to obtain cancer subtypes13,14,15,16. Sometimes the multi-view data are simply concatenated or stacked together before clustering; however, data concatenation may lead to information loss and amplifies the curse of dimensionality16. To overcome these limitations, another set of algorithms extracts an informative subspace from each omic dataset and then performs clustering on the integrated representation14,15,16,17,18,19.

Clustering multi-view genomics data is a challenging task. One of the critical steps is selecting relevant information from all the available information sources and judiciously integrating it to obtain a better clustering solution. The views in multi-omics data differ in variance, scale, and unit. If the integration step is not performed correctly, the fused information may be biased towards the most variant omic view. Therefore, it is essential to first capture the variation present in each view and then integrate the views. Some existing methods first model the variation of each view with the help of similarity graphs and then integrate them to identify clusters13,19,20,21. The challenge is finding the best possible way to integrate the different types of genomic information available for the same set of samples while capturing the essence of all the views. The research area devoted to this type of problem is multi-view learning22,23,24,25,26,27.

In this study, a novel algorithm named RISynG (Recursive Integration of Synergised Graph-representations) is presented. The proposed approach treats multi-omics data clustering as multi-view clustering, where information from multiple omics platforms is integrated to identify clinically important sub-groups within cancer. To judiciously capture the variation present across the multi-omics dataset, the proposed approach works in three steps. In the first step, for each view, two sample-similarity matrices are computed using graph representation matrices, namely the Gramian matrix and the Laplacian matrix. This step acknowledges the statistical diversity of the multi-view omics data, which directly influences the quantification of similarity between samples. The representation matrices of each omic view are then integrated using a parameterized combination function to generate synergy matrices. In the second step, the variation captured by the synergy matrices of the individual omic views is fused: the proposed approach first arranges all the synergy matrices by their relevance, and a recursive function then merges the synergy matrices one by one so that less relevant matrices have only a slight influence on the final cluster structure. At the end of this process, the final accretive basis of the accretive subspace is obtained, whose first k eigenvectors hold the cluster structure. Finally, k-means clustering is applied to the rows of the accretive basis matrix to generate cluster labels. The efficacy of the proposed algorithm is extensively studied on five multi-omics cancer datasets and compared with existing multi-view clustering approaches for cancer subtype identification.

Proposed approach for cancer subtype identification

This section describes the novel algorithm designed in this study to integrate multi-omics data for cancer subtype identification. The proposed method integrates multi-view data using a recursive multi-kernel integration function. It uses graph representations to capture sample similarities from each omic view and exploits each view's statistical properties. The schematic workflow of RISynG is presented in Fig. 1. Before moving to the steps of the proposed algorithm, the required analytical formulations are discussed.

Figure 1

Schematic flow diagram of the proposed approach for cancer subtypes identification.

Gramian matrix and kernel trick

The Gramian matrix \(G=[g_{ij}]_{n\times n}\) is a Hermitian matrix in which each element is a pairwise Hermitian inner product of the vectors in a Hausdorff pre-Hilbert space, V = \(\{{v_{1},v_{2},v_{3}, \ldots ,v_{n}}\}\).

$$\begin{aligned} G(v_{1},\dots ,v_{n})= \begin{bmatrix} \langle v_{1},v_{1}\rangle & \dots & \langle v_{1},v_{n}\rangle \\ \langle v_{2},v_{1}\rangle & \dots & \langle v_{2},v_{n}\rangle \\ \vdots & \ddots & \vdots \\ \langle v_{n},v_{1}\rangle & \dots & \langle v_{n},v_{n}\rangle \end{bmatrix}, \quad v_{i}\in {\mathbb {R}}^d. \end{aligned}$$

The Hermitian inner product space is accompanied by the geometric notions associated with vectors, such as length and the angle between two vectors. Since G is a Hermitian matrix, it inherits all the properties of a Hermitian matrix. A few of the relevant properties are listed below28.

Property 1

All the eigenvalues of G are real.

Proof

Eigenvalues of a matrix are the roots of its characteristic equation. The characteristic equation for matrix G is written as:

$$\begin{aligned} \det {(\lambda I-G)}=0. \end{aligned}$$
(1)

Let the root be some complex number \(\lambda = a+ib\), \(a,b\in {\mathbb {R}}\), \(b\ne 0\), and let I be the identity matrix of the same order. Since, at this value of \(\lambda \), the matrix \(\lambda I-G\) has a non-trivial kernel, there must exist a vector \(u=x+iy\), \(x,y\in {\mathbb {R}}^n\), such that:

$$\begin{aligned} {Gu=\lambda u}, \end{aligned}$$
(2)

or,

$$\begin{aligned} {G(x+iy)=(a+ib)(x+iy)}. \end{aligned}$$
(3)

Taking the complex conjugate of this equation, we get

$$\begin{aligned} {G(x-iy)=(a-ib)(x-iy)}. \end{aligned}$$
(4)

If \(x+iy\) and \(x-iy\) were eigenvectors of G corresponding to two different eigenvalues, then their inner product \(\Vert x\Vert ^2+\Vert y\Vert ^2\) would have to be 0, because eigenvectors corresponding to distinct eigenvalues of a Hermitian matrix are mutually orthogonal. That is impossible unless x and y are both 0, in which case (3) and (4) would coincide. The only way out is to contradict the initial assumption and allow b to be 0 for every eigenvalue. Hence, it is proved that all the eigenvalues of G are real. \(\square \)

Property 2

G is a symmetric and positive semi-definite matrix.

Proof

Since \(v_{i}\in {\mathbb {R}}^d\), the following holds for any vector \(x\in {\mathbb {R}}^n\).

$$\begin{aligned} {x^{\textsf {T}}{G} x=\sum _{i,j}x_{i}x_{j}\left\langle v_{i},v_{j}\right\rangle =\sum _{i,j}\left\langle x_{i}v_{i},x_{j}v_{j}\right\rangle }. \end{aligned}$$
(5)

According to the elementary property of inner products,

\({\displaystyle \langle x+y,x+y\rangle =\langle x,x\rangle +\langle x,y\rangle +\langle y,x\rangle +\langle y,y\rangle \,.}\) It implies that the sum of inner products in (5) can be rewritten as

$$\begin{aligned} {\left\langle \sum _{i}x_{i}v_{i}, \sum _{j}x_{j}v_{j}\right\rangle =\left\| \sum _{i}x_{i}v_{i} \right\| ^{2}\ge 0.} \end{aligned}$$
(6)

Therefore, G is a positive semi-definite matrix. Its symmetry follows directly from the symmetry of the real inner product, since \(\langle v_i,v_j\rangle =\langle v_j,v_i\rangle \). \(\square \)

Property 3

All the eigenvalues of G are non-negative.

Proof

Property 2 implies \(x^{\textsf {T}}{G} x\ge 0\). Substituting \(Gx=\lambda x\) from (2),

$$\begin{aligned} {x^{\textsf {T}}{G} x=\lambda x^{\textsf {T}}x}\ge 0. \end{aligned}$$
(7)

Since \(x^{\textsf {T}}x\) is positive for every eigenvector x, it follows that \(\lambda \ge 0\). Hence proved. \(\square \)
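These properties can be sanity-checked numerically. The following is a minimal sketch (assuming NumPy; the vectors are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((10, 5))  # ten vectors v_i in R^5, stored as rows

G = V @ V.T                       # Gramian: G[i, j] = <v_i, v_j>

print(np.allclose(G, G.T))        # Property 2: G is symmetric -> True
eigvals = np.linalg.eigvalsh(G)   # eigvalsh assumes symmetry and returns real eigenvalues
print(np.all(eigvals >= -1e-10))  # Properties 1 and 3: real and non-negative (up to round-off)
```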

The previously described premise is often used in various methods of dimensionality reduction. Algorithms like Principal Component Analysis (PCA) and its variants utilize the kernel trick to map the observations into a higher dimension and make the data linearly separable. This is equivalent to projecting the mean-centered data onto the subspace on which its variance is maximum29. Schölkopf et al.30 showed that algorithms like KPCA use a kernel function \(\varvec{\kappa }\) to implicitly learn a mapping function \(\phi \) from the input space \({\mathbb {R}}^n\) into a high-dimensional Hilbert space \(\mathbf{F}\), called the feature space. The process is demonstrated in (8) and (9).

$$\begin{aligned} {\phi :{\mathbb {R}}^n \rightarrow \mathbf{F}}. \end{aligned}$$
(8)

Therefore, for a data point \(v=(x_1,\dots ,x_n)\), \(x_i \in {\mathbb {R}}\), the mapping into the feature space \({\mathbb {R}}^{n+k}\) is given by

$$\begin{aligned} {\phi (v)=(x_1,\dots ,x_n,p_1,\dots ,p_k)\in {\mathbb {R}}^{n+k}}, \end{aligned}$$
(9)

where the value of \(p_i\) depends upon the kernel used for the mapping. However, kernels do not explicitly project the data into that high-dimensional feature space; rather, they generate a Gramian matrix G of the mapped data in the aforementioned feature space \(\mathbf{F}\). The generated Gramian matrix enables the input data to be operated on in that high-dimensional feature space31. Let \(X=(x_1\dots x_n)\), \(x_i\in {\mathbb {R}}^{d}\), represent the input data. The corresponding Gramian matrix is given by

$$\begin{aligned} {[G]_{ij}=\kappa ({x_i, x_j}) = \langle \phi ({x_i}), \phi ({x_j})\rangle , \quad {x_i},{x_j}\in X}. \end{aligned}$$
(10)

Let \(G=U\Sigma U^T\) represent the eigen decomposition of G, where U is a matrix containing the eigenvectors of matrix G, arranged column-wise in descending order of their corresponding eigenvalues, which are present in the same fashion in the diagonal matrix \(\Sigma \) as shown in (11) and (12).

$$\begin{aligned} U=[u_1,\dots ,u_n], \end{aligned}$$
(11)
$$\begin{aligned} \Sigma =diag(\lambda _1,\lambda _2,\dots ,\lambda _n). \end{aligned}$$
(12)

Here, \(\lambda _1\ge \dots \ge \lambda _n\ge 0\) (see Property 3 of the Gramian matrix), \(u_i^Tu_i=1\) for \(i\in \{1,2,\dots ,n\}\), and \(Gu_i=\lambda _i u_i\). Note that, in the context of PCA, principal components refer to the projections of the input data points onto the principal directions, along which the variance of the data is maximum. For PCA, the projection is given by \(y_i=U_k^Tx_i\) for all \(i\in \{1,2,\dots ,n\}\), where \(U_k\) is the matrix of the first k eigenvectors of G. In the case of KPCA, however, the spectrum of G itself gives the projection of X32. Note that when \(\phi (v)=v\), the Gramian matrix transforms into the covariance matrix. Generalising both, if \(U_k\) represents the k principal axes, the algorithm finds a basis of an optimal low-dimensional subspace in which the \( L_2\)-norm of the reconstruction error is minimum33. That is, for a test sample x

$$\begin{aligned} {\underset{{U_k}}{\mathrm{arg}\,\mathrm{min} }}\, \Vert \phi (x)-U_kU_k^T\phi (x)\Vert ^2. \end{aligned}$$
(13)

In addition to dimensionality reduction, principal component analysis can also be used for k-clustering via a heuristic k-means algorithm. This is done by performing k-means clustering in the projected space, as in the heuristic k-means algorithm described in34.
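As a rough illustration of this pipeline, the sketch below (assuming NumPy, SciPy, and scikit-learn; the function name and parameters are illustrative, not the paper's implementation) builds an RBF-style Gramian, takes the eigenvectors of its k largest eigenvalues as the projection, and clusters the projected samples with k-means:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def kernel_projection_kmeans(X, k, n_clusters):
    """Project samples (rows of X) via the spectrum of an RBF-style Gramian, then run k-means."""
    dist = squareform(pdist(X, metric="sqeuclidean"))
    G = np.exp(-dist / dist.max())        # Gramian of the implicitly mapped data
    eigvals, eigvecs = np.linalg.eigh(G)  # eigenvalues in ascending order
    U_k = eigvecs[:, -k:]                 # eigenvectors of the k largest eigenvalues
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U_k)
```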

Graph Laplacian

In a clustering pipeline, any set of observations naturally exhibits the properties of a graph. Therefore, given a set of data points \(X=(x_1,x_2,\dots ,x_n) \in {\mathbb {R}}^{d\times n}\) and a notion of similarity between any two points \(x_i\),\(x_j\in X\), an undirected similarity graph \(S=(V,E)\) can be constructed such that each vertex \(v_i\in V\) represents a data point \(x_i\), and \((v_i,v_j)\in E\) represents the edge between vertices \(v_i\) and \(v_j\). With each edge is associated an edge weight \(e_{ij}\) that represents the similarity between the corresponding data points. Let the similarity matrix be \(W=[e_{ij}]_{n\times n}\). The degree \(d(v_i)\) associated with each node \(v_i\) is given by

$$\begin{aligned} d(v_i)=|\{v_j\in V|\{v_j,v_i\}\in E\text { or }\{v_i,v_j\}\in E\}|=\sum _{j=1}^ne_{ij}. \end{aligned}$$
(14)

The degrees of all the nodes can be collected in a diagonal matrix, as shown in (15).

$$\begin{aligned} D=diag(d_1,d_2,\dots ,d_n). \end{aligned}$$
(15)

These matrices act as precursors for constructing a matrix of algebraic importance, called the Laplacian matrix. The graph Laplacian can be seen as a discrete analogue, on graphs, of the Laplace operator defined on continuous representations of the data such as vector spaces or Riemannian manifolds. The Laplacian matrix has many variants, so much so that, depending on the problem and the available data, authors devise their own versions of the graph Laplacian matrix35. The simplest graph Laplacian is given by \(D-W\) and is called the unnormalised graph Laplacian matrix. In the proposed algorithm, however, the normalised graph Laplacian matrix is used. That is,

$$\begin{aligned} {\mathscr {L}}=D^{-1/2}(D-W)D^{-1/2}=I-D^{-1/2}WD^{-1/2}, \end{aligned}$$
(16)

where \(D^{-1/2}=diag(d_1^{-1/2},d_2^{-1/2}, \dots ,d_n^{-1/2})\) and I is the identity matrix of appropriate order. Considering that a similarity matrix is a Gramian matrix, it is apparent that the Gramian and the Laplacian are not very different: the Laplacian can be characterised as the Gramian normalised by the degree matrix. The distinction between the unnormalised and the normalised graph Laplacian is best seen in light of spectral clustering. Consider a strongly connected graph \(S=(V,E)\). The purpose of clustering is to come up with subsets of points according to their similarity, such that similar points lie in the same subset. This is equivalent to finding a partition of the graph such that the edges between different parts have minimum total weight. For two disjoint subsets \(A, B\subset V\) corresponding to two different parts, the cut size is given by

$$\begin{aligned} cut(A,B)=\sum _{i\in A,j\in B}e_{ij}. \end{aligned}$$
(17)

Let there be k clusters in the data. The aim of clustering is to find k partitions \({\mathbf{A}=(A_1,A_2,\dots ,A_k)}\) such that the total size of the cuts in (17) over all the partitions is minimum. That is

$$\begin{aligned} \underset{A_1,\dots ,A_k}{\min }{{\text {cut}}} (A_i:1\le i\le k):=\sum _{i=1}^k{cut(A_i,\bar{A_i})}, \end{aligned}$$
(18)

where \(\bar{A_i}\) is the complement of \(A_i\). This is called the mincut problem. However, solving (18) alone does not achieve reliable clustering results. For example, for \(k=2\), separating a single vertex from the rest of the graph can also be a valid solution as per mincut. In clustering, each cluster needs to accommodate a reasonably large partition to be considered credible. Therefore, the objective function is redefined in the following two ways

$$\begin{aligned} \underset{A_1,\dots ,A_k}{\min }{{\text {RatioCut}}} (A_i:1\le i\le k):=\sum _{i=1}^{k} {\frac{cut(A_i,\bar{A_i})}{|A_i|}}, \end{aligned}$$
(19)
$$\begin{aligned} \underset{A_1,\dots ,A_k}{\min }{{\text {NCut}}} (A_i:1\le i\le k):=\sum _{i=1}^{k} {\frac{cut(A_i,\bar{A_i})}{vol(A_i)}}, \end{aligned}$$
(20)

where \(|A_i|\) represents the number of vertices in partition \(A_i\) and \(vol(A_i)=\sum _{v_j\in A_i}{d_j}\).

However, solving these minimisation problems is NP-hard. The Laplacian matrix is a utility that can be used to approximate them: the unnormalised Laplacian serves in the approximation of RatioCut minimisation, while the normalised Laplacian serves in the approximation of NCut minimisation. The approximated objective function using the normalised Laplacian is given by (21).

$$\begin{aligned} \underset{U_k}{\min }{{\text {tr}}} (U_k^T{\mathscr {L}}U_k), \text { subjected to }U_k^TU_k=I. \end{aligned}$$
(21)

The above expression is minimised when \(U_k\in {\mathbb {R}}^{n\times k}\) is the matrix containing the eigenvectors corresponding to the k smallest non-zero eigenvalues of \({\mathscr {L}}\). This matrix is used to embed the data into a k-dimensional Euclidean space spanned by the vectors in \(U_k\), in which grouping the data points is arguably easy even with simpler techniques like k-means. The described practice is known as Laplacian embedding. The embedded data are then subjected to the k-means clustering algorithm for cluster discovery, as in the normalised spectral clustering presented in Ref.36. For a strongly connected graph with a single component, the eigenvector corresponding to the trivial solution (i.e. \(\lambda =0\)) of the eigenvalue problem of matrix \({\mathscr {L}}\) is the column vector of n ones. Therefore, \({\mathscr {L}}{} \mathbf{1}_{n}=0\), where \(\mathbf{1}_{n}=(1,\dots ,1)^T\). If the graph happens to have more than one component, then the multiplicity k of eigenvalue 0 is equal to the number of connected components in the graph. Nonetheless, with respect to clustering, the eigenvector(s) corresponding to eigenvalue 0 should be omitted while performing the Laplacian embedding. This can be done by introducing a minor change in the matrix.

$$\begin{aligned} L={\mathscr {L}}+{\frac{2}{n}}{(1_n 1_n^T)}. \end{aligned}$$
(22)

If the eigenpairs of \({\mathscr {L}}\) are given by

$$\begin{aligned} {\varvec{\Gamma }}({\mathscr {L}})=\{(\lambda _1,f_1), (\lambda _2,f_2),\dots ,(\lambda _n,f_n)\} \end{aligned}$$

then, the eigenpairs of (22) are given by

$$\begin{aligned} {\varvec{\Gamma }}(L)=\{(\lambda _2,f_2),(\lambda _3,f_3),\dots ,(\lambda _n,f_n),(\lambda _1+2,f_1)\},\\ \text { where } 0=\lambda _1<\lambda _2\le \dots \le \lambda _n\le 2\text { and }f_1=\mathbf{1}_n. \end{aligned}$$

Hence, the new eigenvalue problem becomes

$$\begin{aligned} Lv={\mathscr {L}}v+{\frac{2}{n}}(1_n1_n^T)v=\lambda v. \end{aligned}$$
(23)

By modifying the matrix to L, the first k eigenvectors can be taken right away. This trick works because, for all the pairs in \({\varvec{\Gamma }}({\mathscr {L}})\) except \((\lambda _1,f_1)\), the matrix L acts exactly like \({\mathscr {L}}\): since \({\mathscr {L}}\) is symmetric, every eigenvector \(f_i\) with \(i\ge 2\) is orthogonal to \(f_1=\mathbf{1}_n\), so the added term \((2/n)(\mathbf{1}_n\mathbf{1}_n^T)f_i\) vanishes. Hence, the set \({\varvec{\Gamma }}(L)\) contains all the eigenpairs of \({\varvec{\Gamma }({\mathscr {L}})}\) except \((\lambda _1,f_1)\). At \(v=f_1=\mathbf{1}_{n}\),

$$\begin{aligned} L\mathbf{1}_{n}={\mathscr {L}}{} \mathbf{1}_{n}+{\frac{2}{n}}(\mathbf{1}_n1_n^T) \mathbf{1}_{n}=\lambda _1\mathbf{1}_{n}+2\mathbf{1}_{n}=(\lambda _1+2)\mathbf{1}_{n}. \end{aligned}$$
(24)

Therefore, in the new set \(\varvec{\Gamma }(L)\), the rank of every eigenvalue greater than \(\lambda _1\) decreases by one, and \(\mathbf{1}_{n}\) becomes the eigenvector corresponding to the largest eigenvalue. The Laplacian matrix has certain properties that are exploited by many clustering techniques, like the one shown above. Some of the relevant properties are as follows.

Property 1

For every vector \(f\in {\mathbb {R}}^n\), \({\mathscr {L}}\) satisfies the following condition

$$\begin{aligned} f^\prime {\mathscr {L}}f={\frac{1}{2}} \sum _{i,j=1}^n e_{ij} \left( {\frac{f_i}{\sqrt{d_i}}}-{\frac{f_j}{\sqrt{d_j}}} \right) ^2 \end{aligned}$$
(25)

Proof

By the definition of degree, \(d_i=\sum _{j=1}^ne_{ij}\). Therefore,

$$\begin{aligned} f^{\prime }{\mathscr {L}}f&= f^{\prime }(I-D^{-1/2}WD^{-1/2})f \\&=\sum _{i=1}^nf_i^2-\sum _{i,j=1}^n{\frac{f_i}{\sqrt{d_i}}}{\frac{f_j}{\sqrt{d_j}}}e_{ij} \\&={\frac{1}{2}}\left( \sum _{i=1}^n{\frac{f_i^2}{d_i}}d_i+\sum _{j=1}^n{\frac{f_j^2}{d_j}}d_j-2\sum _{i,j=1}^n{\frac{f_i}{\sqrt{d_i}}}{\frac{f_j}{\sqrt{d_j}}}e_{ij}\right) \\&={\frac{1}{2}}\sum _{i,j=1}^n\left( {\frac{f_i^2}{d_i}}e_{ij}+{\frac{f_j^2}{d_j}}e_{ij}-2{\frac{f_i}{\sqrt{d_i}}}{\frac{f_j}{\sqrt{d_j}}}e_{ij}\right) \\&={\frac{1}{2}}\sum _{i,j=1}^ne_{ij}\left( {\frac{f_i}{\sqrt{d_i}}}-{\frac{f_j}{\sqrt{d_j}}}\right) ^2. \end{aligned}$$

Hence proved.\(\square \)

Property 2

\({\mathscr {L}}\) is a symmetric and positive semi-definite matrix.

Proof

From (16), the symmetry of the matrix is fairly evident. Also, from Property 1, \({f^\prime {\mathscr {L}}f}\ge 0\) for all \(f\in {\mathbb {R}}^n\). Hence, it is proved that \({\mathscr {L}}\) is a symmetric and positive semi-definite matrix. \(\square \)

Property 3

All eigenvalues of \({\mathscr {L}}\) are non-negative.

Proof

Property 1 implies \({f^\prime {\mathscr {L}}f}\ge 0\). Substituting \({\mathscr {L}}f=\lambda f\), we get \({{f^\prime {\mathscr {L}}f}=\lambda f^\prime f}\ge 0\). Since \(f^\prime f\) is positive for all eigenvectors, \(\lambda \ge 0\). Hence proved.\(\square \)
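Putting the pieces together, a minimal sketch of normalised spectral clustering with the eigenvalue-shift trick of (22) might look as follows (assuming NumPy and scikit-learn; W is a precomputed similarity matrix, and the function name is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def normalised_spectral_clustering(W, k):
    """Cluster samples given a similarity matrix W, using the shifted normalised Laplacian."""
    n = W.shape[0]
    d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    L_sym = np.eye(n) - d_inv_sqrt @ W @ d_inv_sqrt  # Eq. (16)
    L = L_sym + (2.0 / n) * np.ones((n, n))          # Eq. (22): demotes the trivial eigenpair
    _, eigvecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    U_k = eigvecs[:, :k]                             # k smallest eigenvectors, trivial one excluded
    return KMeans(n_clusters=k, n_init=10).fit_predict(U_k)
```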

RISynG algorithm

For grouping the cancer patients into clusters, each omic view is represented as a graph using two representation matrices, namely the Gramian matrix and the Laplacian matrix. Each representation matrix attributes the similarity network of the samples with a notion of similarity between the samples. Consider a view \(X_m=(x_1,x_2,\dots ,x_n)\), \(x_i\in {\mathbb {R}}^{d_m}\), corresponding to the mth omic source. If \(\rho (x_i,x_j)\) denotes the distance between \(x_i\) and \(x_j\) \(\in X_m\), then the similarity \(w(x_i,x_j)\) between them is given by:

$$\begin{aligned} w(x_i,x_j)=\text {exp} \left\{ -{\frac{\rho (x_i,x_j)}{\sigma }}\right\} , \end{aligned}$$
(26)

where \(\sigma \) is a free parameter adjusted according to the intrinsic properties of the data subjected to the clustering model. For the cancer data used in this study, \(\sigma \) is given by \(\sigma =\max _{x_i,x_j\in X_m}\rho (x_i,x_j)/2\). The proposed method assumes that the views may constitute different cluster manifolds when learnt under a particular similarity measure. Therefore, the predicted clusters would be apparent, and in strong concordance with the clinical clusters, if pairwise sample similarity is computed in a data-dependent multi-kernel fashion. It was found that in some views the correlation distance prominently reflected a cluster manifold that concurred with the natural clusters, while other views showed a proclivity towards the Euclidean distance, and the rest seemed to accommodate parts of both. All things considered, two different graph representation matrices have been formulated, the Gramian matrix and the Laplacian matrix, each with a different measure of similarity. For \(X_m\), let the correlation distance between \(x_i\) and \(x_j\) be \(\varphi _m(x_i,x_j)\) and the squared Euclidean distance be \(\varepsilon _m(x_i,x_j)\). If \(\hat{\varphi }_m\) and \(\hat{\varepsilon }_m\) denote the maximum pairwise correlation distance and squared Euclidean distance, respectively, then the Gramian matrix \(G_m\) and the similarity matrix \(W_m\) are given by

$$\begin{aligned}{}[G_m]_{ij}= w_G(x_i,x_j)=\exp \left\{ -{\frac{\varphi _m(x_i,x_j)}{\hat{\varphi }_m}} \right\} , \quad \text {where } i,j\in \{1,2,\dots ,n\}, \end{aligned}$$
(27)
$$\begin{aligned}{}[W_m]_{ij}= w_L(x_i,x_j)=\exp \left\{ -{\frac{\varepsilon _m(x_i,x_j)}{\hat{\varepsilon }_m}} \right\} , \quad \text {where } i,j\in \{1,2,\dots ,n\}. \end{aligned}$$
(28)

The matrix articulated in (28) is a crucial precursor for the construction of the Laplacian matrix, which is constructed by normalising \(W_m\) by the degree matrix \(D_m\) of its associated graph, as in Eqs. (15) and (16). Hence, the required representation matrices for each view \(X_m\), \(m\in \{1,2,\dots ,M\}\), are given by (27) and (29).

$$\begin{aligned} {\mathscr {L}}_m=D_m^{-1/2}(D_m-W_m)D_m^{-1/2}=I-D_m^{-1/2}W_mD_m^{-1/2}. \end{aligned}$$
(29)

The Laplacian matrix so obtained is then modified as described in Eq. (22):

$$\begin{aligned} L_m={\mathscr {L}}_m+{\frac{2}{n}}(1_n 1_n^T). \end{aligned}$$
(30)
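A sketch of this per-view construction, under the distance choices of (27) and (28), is given below (assuming NumPy and SciPy; the function name is illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def representation_matrices(X):
    """For one omic view X (samples as rows), return the Gramian G_m and shifted Laplacian L_m."""
    n = X.shape[0]
    corr = squareform(pdist(X, metric="correlation"))  # correlation distance, Eq. (27)
    eucl = squareform(pdist(X, metric="sqeuclidean"))  # squared Euclidean distance, Eq. (28)
    G = np.exp(-corr / corr.max())                     # Gramian matrix G_m
    W = np.exp(-eucl / eucl.max())                     # similarity matrix W_m
    d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    L_norm = np.eye(n) - d_inv_sqrt @ W @ d_inv_sqrt   # normalised Laplacian, Eq. (29)
    L = L_norm + (2.0 / n) * np.ones((n, n))           # shifted Laplacian, Eq. (30)
    return G, L
```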

It is apparent from the discussion under the headings Gramian matrix and kernel trick and Graph Laplacian that the matrix \(U_k\) obtained from the Gramian matrix plays the same role as that obtained from the Laplacian matrix. Therefore, to combine the information encoded in these matrices, a parameterised combination function \({\varvec{\Omega }}(\cdot ,\cdot )\) can be used, yielding a synergy matrix of the representation matrices. If \(G_m\) is the Gramian matrix and \(L_m\) is the Laplacian matrix of omic view \(X_m\), then the synergy matrix is given by:

$$\begin{aligned} {\varvec{\Omega }}(G_m,L_m) = H_m =\beta G_m+(1-\beta )L_m, \quad \text {where } 0\le \beta \le 1. \end{aligned}$$
(31)

Consequently, the corresponding objective functions (13) and (21) also combine into a single optimisation over \(U_k\in {\mathbb {R}}^{n\times k}\).

$$\begin{aligned} \underset{U_k}{\min }\;\beta \Vert X-U_kU_k^TX\Vert _F+(1-\beta )\,{\text {tr}}(U_k^T{\mathscr {L}}U_k), \quad \text { subject to }U_k^TU_k=I. \end{aligned}$$
(32)
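Algorithm 1 learns \(\beta \) by scanning candidate values and keeping the synergy matrix that yields the best provisional clustering. A minimal sketch of that scan (assuming NumPy and scikit-learn, with the silhouette index as the validity criterion described later; function names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def embed_and_cluster(H, k):
    """Embed via the k smallest eigenvectors of a symmetric matrix H, then run k-means."""
    _, eigvecs = np.linalg.eigh(H)
    U_k = eigvecs[:, :k]
    return U_k, KMeans(n_clusters=k, n_init=10).fit_predict(U_k)

def best_synergy_matrix(G, L, k, alpha=0.1):
    """Scan beta over [0, 1] and keep the convex combination of Eq. (31) with the best silhouette."""
    best_score, best_H = -np.inf, None
    for beta in np.arange(0.0, 1.0 + 1e-9, alpha):
        H = beta * G + (1.0 - beta) * L
        U_k, labels = embed_and_cluster(H, k)
        score = silhouette_score(U_k, labels)
        if score > best_score:
            best_score, best_H = score, H
    return best_H
```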

Some of the relevant properties of synergy matrix \(H_m\) are:

Property 1

\(H_m\) is a symmetric and positive semi-definite matrix.

Proof

\(H_m\) is a positive semi-definite matrix if and only if \(v^TH_mv\ge 0\) for all \(v\in {\mathbb {R}}^n\). From the properties of the graph Laplacian and the Gramian, it is evident that both L and G satisfy this condition. Therefore,

$$\begin{aligned} v^TH_mv=\beta v^TGv+(1-\beta )v^TLv\ge 0. \end{aligned}$$
(33)

In addition to that, since \(H_m\) is a summation of symmetric matrices, it is also symmetric. Hence, it is proved that \(H_m\) is a symmetric and positive semi-definite matrix.\(\square \)

Given Property 1, the remaining properties are its direct consequences.

Property 2

All the eigenvalues of \(H_m\) are real.

Property 3

All the eigenvalues of \(H_m\) are non-negative.

Recursive multi-kernel integration

After generating synergy matrices for all the views of the dataset, the next step is to integrate the information obtained from each of them. Before the integration step, however, the proposed approach requires these matrices to be arranged by their relative relevance for cluster discovery. Better views encode the cluster structure better and, as a consequence, yield better cluster validity indices. Therefore, the synergy matrices are sorted based on a cluster validity index, the silhouette index. Let \(\mathbf{H}=\{H_1,\dots , H_M\}\) be the set of synergy matrices of a dataset with M views, and let the sorted set be \(\mathbf{H}^{\prime }=\{^1H, \dots , ^MH\}\), where the superscript i denotes the relevance of the corresponding synergy matrix \(^iH\), with \(^1H\) being the most relevant. Additionally, let every \(^iU_k\) from the set \(\mathbf{U}=\{^1U_k, \dots , ^MU_k\}\) represent the basis of the eigenspace corresponding to the k smallest eigenvalues of matrix \(^iH\).

Next, a combination method is proposed that distills the cluster information from each synergy matrix one by one, in an iterative fashion, while subtly enriching the information coming from the more relevant matrices. Given the way the synergy matrices are constructed, it is their eigenspace bases that bring out the latent cluster structure of the corresponding views. The proposed method uses a recursive function to exploit this fact for both integration and enrichment of the relevant views of the dataset. The recursive formula can be written as:

$$\begin{aligned} \mathbf{k}_{\eta +1}:=\mathbf{k}_{\eta }\otimes {{\mathscr {N}}}(\mathbf{k}_{\eta },\,^{(\eta +1)}U_k), \quad \text {where } \mathbf{k}_1=\,^1H \text { and }\eta =1,\dots ,M-1. \end{aligned}$$
(34)

Here \(\mathbf{k}_{\eta }\) is called the accretive matrix of the \(\eta \)th recursive step. The non-commutative operator \(\otimes \) signifies the integration operation. That is, for \(A\in {\mathbb {R}}^{n\times n}\) with its k smallest eigenvectors in \(V\in {\mathbb {R}}^{n\times k}\), and a basis matrix \(U\in {\mathbb {R}}^{n\times k}\), the expression \(A\otimes U\) evaluates to an accretive matrix \(A^\prime \in {\mathbb {R}}^{n\times n}\) whose k smallest eigenvectors are given by \(V+U\). The other eigenvectors of A are irrelevant to this discussion. Let the basis of the eigenspace of \(A^\prime \) be known as the accretive basis and the associated subspace as the accretive subspace. Also, let the accretive basis corresponding to the k smallest eigenvectors of \(\mathbf{k}_{\eta }\) be given by \(\mathbf{b}_{\eta }\).

In addition, to enrich the relatively relevant views, the proposed method uses an orthogonalising-normalising function \({{\mathscr {N}}}(\cdot ,\cdot )\). To ensure the accumulation of only the essential cluster information, the proposed approach acquires the basis of the projection of the synergy-matrix eigenspace that is orthogonal to the accretive subspace at that recursive step. The idea is similar to the eigenspace updating for integrative clustering performed in Ref.18. This function does not normalise the synergy matrix per se; rather, it normalises the basis of the described projection subspace. The computation starts by instantiating \(\mathbf{k}_{1}=\text { }^1H\), so that \(\mathbf{b}_{1}\) becomes \(^1U_k\). At the (\(\eta +1\))th recursive step (\(\eta \in \{1,\dots ,M-1\}\)), one has the accretive matrix \(\mathbf{k}_{\eta }\) and the eigenspace basis \(^{(\eta +1)}U_k\) of the synergy matrix \(^{(\eta +1)}H\). Subsequently, the orthogonalising-normalising function \({{\mathscr {N}}}(\mathbf{k}_{\eta },^{(\eta +1)}U_k)\) renders the final basis matrix in four steps:

First, computing the basis \({\mathscr {P}}\) of the projection subspace, which is given by:

$$\begin{aligned} {\mathscr {P}}=\mathbf{b}_{\eta }{} \mathbf{b}_{\eta }^T\text { }^{(\eta +1)}U_k. \end{aligned}$$
(35)

Second, computing the residual component of the synergy matrix eigenspace \({\mathscr {Q}}\) which is given by subtracting the above-mentioned projected component from \(^{(\eta +1)}U_k\) as:

$$\begin{aligned} {\mathscr {Q}}=\text { }^{(\eta +1)}U_k-{\mathscr {P}}. \end{aligned}$$
(36)

In the third step, \({\mathscr {Q}}\) is subjected to Gram-Schmidt orthogonalisation to yield the basis \({\mathscr {R}}\). This basis cannot be integrated directly with the eigenspace of the accretive matrix; it first needs to be normalised on the basis of its relevance. So, the fourth step, normalisation, is performed as:

$$\begin{aligned} {{\mathscr {N}}}(\mathbf{k}_{\eta },\,^{(\eta +1)}U_k)=V, \quad \text {where } V=\left[ diag({\mathscr {R}} {\mathscr {R}}^T)^{-{\frac{1}{2}}}\,{\mathscr {R}}\right] ^{(\eta +1)} \end{aligned}$$
(37)

Here the notation \([\cdot ]\) denotes that the enclosed operations are performed element-wise. The resultant matrix V is called the orthogonalised-normalised basis matrix. At the end of the process, the final accretive matrix \(\mathbf{k}_{M}\) is obtained, whose first k eigenvectors, collected in the matrix \(\mathbf{b}_{M}\in {\mathbb {R}}^{n\times k}\), hold the cluster structure. Hence, performing k-means on the rows of \(\mathbf{b}_{M}\) returns the cluster labels for each sample. The proposed algorithm is described in Algorithm 1; a loose sketch of the integration step follows it.

Algorithm 1
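The sketch below gives one loose reading of the integration step, Eqs. (34)-(37) (assuming NumPy and scikit-learn; H_sorted is assumed to be the relevance-sorted list of synergy matrices, most relevant first, and QR factorisation stands in for Gram-Schmidt orthogonalisation):

```python
import numpy as np
from sklearn.cluster import KMeans

def smallest_eigvecs(H, k):
    """Eigenvectors of the k smallest eigenvalues of a symmetric matrix H."""
    _, eigvecs = np.linalg.eigh(H)
    return eigvecs[:, :k]

def risyng_integrate(H_sorted, k):
    """Recursively integrate relevance-sorted synergy matrices and return cluster labels."""
    b = smallest_eigvecs(H_sorted[0], k)         # accretive basis b_1 from the most relevant view
    for eta, H in enumerate(H_sorted[1:], start=1):
        U = smallest_eigvecs(H, k)               # eigenspace basis of the next synergy matrix
        P = b @ (b.T @ U)                        # projection onto the accretive subspace, Eq. (35)
        Q = U - P                                # residual component, Eq. (36)
        R, _ = np.linalg.qr(Q)                   # Gram-Schmidt orthogonalisation (via QR)
        scale = 1.0 / np.sqrt(np.diag(R @ R.T))  # row-wise normalisation factors
        V = (scale[:, None] * R) ** (eta + 1)    # element-wise normalisation, Eq. (37)
        b = b + V                                # accretive basis update, Eq. (34)
    return KMeans(n_clusters=k, n_init=10).fit_predict(b)
```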

Computational complexity

For the proposed algorithm, given M similarity matrices and Gramian matrices with n samples under study, the computation starts with constructing the degree matrix \(D_m\) for each of the M views; the complexity of this step is bounded by \(O(n^2)\) per view. In the next step, the Laplacian matrix is constructed with a complexity of \(O(n^3)\). Let the number of iterations (regulated through the parameter \(\beta \)) needed to learn the best composition of the synergy matrix in steps 12 to 16 be \(t_\beta \); for the datasets used in this study, \(t_{\beta }=10\) suffices. Iterating \(\beta \) from 0 to 1 with an increment of 0.1 per iteration can produce an optimal combination ratio for the representation matrices; the increment step is referred to as \(\alpha \) for consistency. Assuming \(t_{max}\) is the largest number of iterations taken by the k-means clustering algorithm, the complexity of the aforesaid steps becomes \(O(t_{\beta }n^3+t_{\beta }t_{max}nk^2+t_{\beta }n)\), where \(t_{\beta }n^3\) comes from the eigenvalue decomposition of the synergy matrix, \(t_{\beta }t_{max}nk^2\) from the k-means clustering step, and \(t_{\beta }n\) from the f-measure calculation. Therefore, the complexity of steps 12 to 16 is bounded by \(O(t_{\beta }n^3)\). Steps 17 to 19 perform the same processing at the optimal value of \(\beta \), and hence are also bounded by \(O(t_{\beta }n^3)\). Summing up all the steps from 9 to 20 for M views, the complexity \(O(Mn^2+Mn^3+Mt_{\beta }n^3)\) reduces to \(O(Mt_{\beta }n^3)\). Sorting can be done in \(O(M\log M)\). After that, the accretive basis is constructed as defined in the function INTEGRATE(\(\mathbf{b},\eta \)). Step 5 consists of the construction of \({\mathscr {P}}\), \({\mathscr {Q}}\), and the orthogonalised-normalised matrix V; its two matrix multiplications are bounded by \(O(n^2k)\), and the combined Gram-Schmidt orthogonalisation and normalisation step has a complexity of \(O(n^2)\). Therefore, step 5 has a complexity of \(O(n^2k)\). Step 6 is a matrix addition with complexity O(nk), which is dominated by step 5. Since the function runs \((M-1)\) times, the complexity of steps 21 to 23 becomes \(O(M\log M+Mn^2k)=O(Mn^2k)\). After the construction of the accretive basis, k-means is performed, which, as explained previously, has time complexity \(O(t_{max}nk^2)\). Considering everything, the overall complexity of RISynG comes out to be \(O(Mt_{\beta }n^3+Mn^2k+t_{max}nk^2) = O(Mt_{\beta }n^3)\).

Significance of proposed algorithm

There are some aspects of the proposed algorithm that enhance its performance and distinguish it from other algorithms designed to identify cancer subtypes. Although each omic view in a cancer dataset has its own distinct cluster structure, cancer biology suggests that no single omics source can dictate the final cancer subtype alone. Instead, all the omics sources collectively manifest the cancer subtype in a sample. Therefore, multi-view integration is critical to sensible and clinically relevant clustering. The proposed approach can be broken down into three operative steps: (1) construction of representation matrices for each view, (2) construction of a synergy matrix for each view, and (3) construction of an accretive basis through recursive multi-kernel integration of the synergy matrices. These steps make the proposed algorithm more effective in the following ways:

  1. Construction of representation matrices To group the cancer patients into clusters, each omic view first has to be represented as a similarity graph. These similarity graphs can be interpreted through various representation matrices, like the Gramian, the Laplacian, and the adjacency matrix. Each representation matrix attributes the samples' similarity network with a notion of similarity between the samples. The proposed method assumes that multiple information sources may constitute different cluster manifolds when learned under a particular similarity measure. Therefore, the predicted clusters would be apparent and in strong concordance with the clinical clusters if pairwise sample similarity is computed in a data-dependent multi-kernel approach37. In some views, the correlation distance prominently reflected a cluster manifold that concurred with the natural clusters, whereas other views showed a proclivity towards the Euclidean distance, and the rest seemed to accommodate both. All things considered, two different graph representation matrices have been formulated, the Gramian matrix and the Laplacian matrix, each with a different measure of similarity.

  2. Construction of synergy matrices The representation matrices so constructed have two noteworthy aspects: (1) \(G_m\) represents a similarity graph formed using a correlation-based distance. In a correlation-based distance, two objects are considered similar if the trends among their elements are highly correlated; the correlation distance between two perfectly correlated samples is 0 even if they are far apart in the Euclidean space of their dimension. It is instinctive to assume that omics data behave like that. (2) The Laplacian, on the other hand, preserves the intrinsic manifold structure of the data cast onto a low-dimensional embedding space. To integrate these representation matrices, a combination function has been devised that takes a convex combination of both matrices. This way of combining matrices rectifies any bias created by the dissimilarity in the distance measures used while constructing the similarity graphs. The combination function defined in (31) utilises the parameter \(\beta \in [0,1]\) to capture the graphs constituted by the Gramian and the Laplacian, making the combination a convex combination of the representation matrices. The optimal value of this parameter is learnt by iterating it from 0 to 1 at some incremental step size \(\alpha \in (0,1)\). The datasets used in this study tend to pick up the optimal value of \(\beta \) at a step size of \(\alpha =0.1\). It is crucial to choose the incremental step size wisely, as the number of iterations \(t_{\beta }\) is directly proportional to the algorithm's time complexity. Because the synergy matrix ultimately affects the cluster assignment, the best way to evaluate the appropriate value of \(\beta \) is to perform a provisional cluster validity test on the synergy matrix constructed with that \(\beta \), using a cluster validity index like the silhouette index. Steps 15 to 19 of Algorithm 1 formulate the described provisional cluster validity test using the silhouette index as the criterion.

  3. Construction of accretive basis After the similarity between the cancer patients is captured in a refined form with the help of the synergy matrices, the next step is to integrate them. Property 1 of the synergy matrix proves that \(H_m\) is a positive semi-definite matrix, which makes the integration of synergy matrices a multi-kernel integration. The proposed algorithm does this through recursive multi-kernel integration, iteratively integrating the relevant subspace of each synergy matrix. Here, the relevant subspace refers to the subspace of the matrix that purely encodes the cluster information, which in the case of a synergy matrix is its eigenspace corresponding to k eigenvalues. Finally, an accretive basis matrix is generated. This accretive matrix is required to carry more cluster information from the relevant views. Therefore, the orthogonalising-normalising function is designed so that, at each recursive step, the accretive basis is less influenced by the less relevant matrices.

Description of datasets

For analysing the efficiency of the proposed algorithm in identifying cancer subtypes, it is applied to five cancer datasets taken from TCGA (https://cancergenome.nih.gov/). The datasets used are Cervical cancer (CESC), Breast cancer (BRCA), Ovarian cancer (OV), Lower-grade glioma (LGG), and Stomach cancer (STAD). Different studies have identified 4 clinically important subtypes for BRCA9 and STAD38, 3 for CESC39 and LGG40, and 2 for OV41. The cancer genome is neither simple nor independent; it is complicated and dysregulated at multiple levels of the biological system: genomic, epigenomic, transcriptomic, and proteomic42. miRNA, as one of the important regulators of gene expression, can be integrated with gene expression to identify selective inhibition of translation or selective degradation43,44,45. Furthermore, in terms of epigenetic regulation, histone modification and DNA methylation can serve to regulate gene expression in cancer46,47. Protein expression data can also be utilized for the diagnostic prognosis of cancer patients48. Therefore, four omic views, namely gene expression (mRNA), microRNA expression (miRNA), DNA methylation (metDNA), and reverse-phase protein arrays (RPPA), are utilized for the CESC, BRCA, and LGG datasets. For the STAD and OV datasets, only mRNA and miRNA expression are considered, because metDNA and RPPA information is not available for most samples. To avoid involving features with too many missing values, features with more than 5% missing values are removed from all of the omic views, and the remaining missing values are replaced with 0. Sequence-based expression data are log-transformed to make the data approximately normally distributed49; the 0 entries in the miRNA and mRNA expression data are therefore replaced with 1 before log-transforming with base 10. For the metDNA datasets, beta values are considered. Finally, variance filtering is applied to the mRNA and metDNA omic views of all cancer datasets, and only the 2000 most variable genes and CpG locations are retained. Table 1 describes the final processed data used in this study. The selected datasets cover a wide range of sample sizes, from 124 in CESC to 474 in OV. TCGA contains several platforms for individual data types; the platforms having the largest number of matching samples across the omics are selected in the present study. The proposed algorithm can be applied to other large-scale multi-omics datasets if available; the run time increases with the sample size and the number of omic views, as shown in Fig. 2. With the increase in sample size from 124 to 474, the runtime increases from 0.22 to 0.47 s. Even though the BRCA dataset has fewer samples (398) than the OV dataset (474), the runtime for BRCA (0.56 s) is higher than for OV (0.47 s) because of the number of omic views involved: 4 for BRCA versus 2 for OV.

Table 1 Datasets description.
Figure 2

Effect of sample size and number of omic-views on the runtime of the proposed algorithm. Values in the parentheses indicate the number of omic-views.
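The preprocessing described above can be sketched as follows (assuming pandas with samples as rows and features as columns; preprocess_expression is an illustrative name, and the metDNA beta-value views would skip the log transform):

```python
import numpy as np
import pandas as pd

def preprocess_expression(df, top_k=2000):
    """Filter missing-heavy features, log-transform, and keep the most variable features."""
    df = df.loc[:, df.isna().mean() <= 0.05]  # drop features with more than 5% missing values
    df = df.fillna(0.0)                       # impute the remaining missing values with 0
    df = np.log10(df.replace(0.0, 1.0))       # replace 0 with 1, then log-transform with base 10
    top = df.var().nlargest(top_k).index      # variance filtering: keep the most variable features
    return df[top]
```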

Experimental results and discussion

The performance of the proposed approach is compared with eleven other algorithms available for cancer subtype identification. Both two-stage clustering approaches and integrative clustering approaches are considered for method comparison. The methods used for comparison are Similarity Network Fusion (SNF)13, Weighted Multi-View Low Rank Representation (WMLRR)50, Consensus Clustering (CC)6,51, Multi-view clustering approach with enhanced consensus (ECMC)52, SNF.CC (SNF merged with CC)53, Cluster of Cluster Assignment (COCA)9,54, Consensus Non-negative Matrix Factorization (CNMF)55, Selective Update of Relevant Eigenspaces (SURE)18, Convex-combination of Approximate Laplacians (CoALa)19, iCluster14, and Multi-manifold Integrative Clustering (MiMIC)56.

Performance analysis on multi-omics cancer datasets

The proposed approach and the above-described methods are applied to five cancer datasets, namely CESC, BRCA, OV, LGG, and STAD, taken from TCGA. The sample clusters identified by these methods are evaluated based on several internal and external cluster evaluation indices. The cancer subtypes identified by these methods are also evaluated for their biological relevance. Next, the detailed comparative analysis of the proposed algorithm is discussed.

Cluster evaluation

The clusters (cancer subtypes) generated by all the methods are evaluated based on several internal and external cluster evaluation indices. These indices indicate how well a method groups the samples into homogeneous clusters: samples belonging to the same cluster should be highly similar, representing a cancer subtype, whereas samples belonging to different clusters should be highly dissimilar. How well an algorithm captures the natural grouping present in the data can be quantified with internal validity indices. The following four internal evaluation indices are calculated in this study (a sketch of their computation is given after the list). Table 3 presents the internal evaluation indices for every method.

  1. Silhouette Index: It measures the consistency within the clusters. The value lies in the range \([-1,1]\): a value nearer to +1 indicates a higher distance between the clusters, a value of 0 indicates that the sample is very close to the boundary between two neighboring clusters, and a negative value indicates misclassification57.

    $$\begin{aligned} {\mathbb {S}}_c = \frac{1}{c} \sum _{k=1}^{c}S(\Upsilon _k), \end{aligned}$$
    (38)

    where \(S(\Upsilon _k)\) represents the silhouette width of the obtained cluster \(\Upsilon _k\) \((k=1, \ldots ,c)\), calculated as \(S(\Upsilon _k)=\frac{1}{n_k}\sum _{x_i\in \Upsilon _k}^{}s(x_i)\), where \(n_k\) is the cardinality of \(\Upsilon _k\) and \(s(x_i)\) is the silhouette width of sample \(x_i\), estimated as \(s(x_i)=\frac{b(i)-a(i)}{max\{a(i),b(i)\}}\). Here, \(a(i)\) is the average dissimilarity of the \(i_{th}\) object to all other objects in the same cluster, and \(b(i)\) is the average dissimilarity of the \(i_{th}\) object to all objects in the closest cluster.

  2. Dunn Index: A higher value represents a better clustering solution58. It is defined as:

    $$\begin{aligned} DI = \underset{1\le i \le c}{{\text {min}}} \Big \{ \underset{1\le j \le c,\, j\ne i}{{\text {min}}} \Big \{ {\frac{\delta (C_i,C_j)}{\underset{1\le k \le c}{{\text {max}}} \small \{\Delta (C_k)\}}} \Big \}\Big \} \end{aligned}$$
    (39)

    Here, \(\delta (C_i,C_j)\) is the distance between clusters \(C_i\) and \(C_j\), and \(\Delta (C_k)\) is the intra-cluster distance within cluster \(C_k\).

  3. Davies-Bouldin Index: It is defined as the ratio of within-cluster dispersion to between-cluster dispersion59. A lower value indicates better clustering.

    $$\begin{aligned} DB = \frac{1}{C} \sum _{i=1}^{C} (D_i) \end{aligned}$$
    (40)

    Here, \( D_{i} = \max _{{j \ne i}} R_{{i,j}} \) and \(R_{i,j} = \frac{S_i+S_j}{M_{ij}}\), where \(M_{i,j}\) is the separation between the ith and the jth clusters, \(S_i\) and \(S_j\) are the within-cluster scatters of clusters i and j, and C is the number of clusters.

  4. Xie-Beni Index: For crisp clustering, the index is estimated as:

    $$\begin{aligned} \text {Xie-Beni} = \frac{1}{N} \frac{WGSS}{\underset{k<k^{\prime }}{{\text {min}}}\; \acute{\delta }(C_k,C_{k^{\prime }})^2} \end{aligned}$$
    (41)

    Here, \(\frac{1}{N} {WGSS}\) represents the average squared distance of all the points to the barycenter of the cluster they belong to, and \(\acute{\delta }\) is a measure of the between-cluster distance60.
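As referenced above, the internal indices can be computed along these lines (a sketch assuming scikit-learn for the silhouette and Davies-Bouldin indices, and a direct NumPy/SciPy implementation of the Dunn index):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_score, davies_bouldin_score

def dunn_index(X, labels):
    """Minimum between-cluster distance divided by maximum cluster diameter, Eq. (39)."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    max_diameter = max(cdist(c, c).max() for c in clusters)
    min_separation = min(cdist(a, b).min()
                         for i, a in enumerate(clusters)
                         for b in clusters[i + 1:])
    return min_separation / max_diameter

# silhouette = silhouette_score(X, labels)
# davies_bouldin = davies_bouldin_score(X, labels)
```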

The class distribution of the cancer datasets used in this study is presented in Table 2. Except for the CESC dataset, all the cancers have imbalanced classes. When clustering is applied to such datasets, there is a chance that most of the samples are grouped into one cluster, leading to good values of the internal indices even though the clustering is, in reality, not efficient. If the ground truth is available, the partitions created in such imbalanced data can be evaluated more reliably with external evaluation indices. In this study, five external evaluation indices are calculated to compare the clustering efficiency of the different algorithms (a sketch of their computation follows the list). Considering a set of n objects \({{\mathbb {X}}}=\{{{\mathscr {X}}}_1, {{\mathscr {X}}}_2, \ldots ,{{\mathscr {X}}}_n\}\), suppose \({{\mathbb {C}}}=\{{{\mathscr {C}}}_1, {{\mathscr {C}}}_2, \ldots ,{{\mathscr {C}}}_R\}\) represents a partition of \({{\mathbb {X}}}\) obtained by a clustering algorithm and \({{\mathbb {K}}}=\{{{\mathscr {K}}}_1, {{\mathscr {K}}}_2,\ldots ,{{\mathscr {K}}}_C\}\) represents the ground truth or class information. A contingency table captures the overlap between the clustering result and the ground truth, where \(n_{ij}=|{{\mathbb {C}}}_{i}\cap {{\mathbb {K}}}_{j}|\) is the number of elements common to cluster \({{\mathbb {C}}}_{i}\) and class \({{\mathbb {K}}}_{j}\), \(n_i\) is the number of elements in \( {{\mathbb {C}}}_{i}\), and \(n_{j}\) is the number of elements in \({{\mathbb {K}}}_{j}\). The external indices are defined as:

  1. F-measure (FM): The ideas of precision and recall from information retrieval are merged to obtain FM. It disregards the unmatched portions of the clusters. It attains values between 0 and 1; a value nearer to 1 represents better clustering61.

    $$\begin{aligned} FM = \sum _{j=1}^{C} \frac{n_j}{n} \, \underset{1\le i\le R}{{\text {max}}}\, \left[ \frac{2 \times \frac{n_{ij}}{n_i} \times \frac{n_{ij}}{n_j}}{\frac{n_{ij}}{n_i}+\frac{n_{ij}}{n_j}}\right] \end{aligned}$$
    (42)
  2. Adjusted Rand Index (ARI): A commonly used variation of the Rand index that takes into account agreements arising by chance under a hypergeometric distribution. The lower bound of ARI, \(-k\), depends on the exact data partitioning62. The closer the value of ARI to 1, the better the clustering.

    $$\begin{aligned} ARI = \frac{\sum _{i=1}^{R} \sum _{j=1}^{C} \left( \begin{array}{c} n_{ij} \\ 2 \end{array}\right) - {\left( \begin{array}{c} n \\ 2 \end{array} \right) }^ {-1} \sum _{i=1}^{R} \left( \begin{array}{c} n_{i} \\ 2 \end{array} \right) \sum _{j=1}^{C} \left( \begin{array}{c} n_{j} \\ 2 \end{array} \right) }{\frac{1}{2} \left[ \sum _{i=1}^{R} \left( \begin{array}{c} n_{i} \\ 2 \end{array}\right) + \sum _{j=1}^{C} \left( \begin{array}{c} n_{j} \\ 2 \end{array} \right) \right] - \left( \begin{array}{c} n \\ 2 \end{array} \right) ^{-1} \sum _{i=1}^{R} \left( \begin{array}{c} n_{i} \\ 2 \end{array}\right) \sum _{j=1}^{C} \left( \begin{array}{c} n_{j} \\ 2 \end{array} \right) } \end{aligned}$$
    (43)
  3. Normalized Mutual Information (NMI): It quantifies the mutual dependence between the clustering solution and the class information. It is estimated as:

    $$\begin{aligned} NMI({\mathbb {C}},{\mathbb {K}})=\frac{{\mathscr {I}}({\mathbb {C}},{\mathbb {K}})}{[{\mathscr {H}}({\mathbb {C}})+{\mathscr {H}}({\mathbb {K}})]/2} \end{aligned}$$
    (44)

    Here, \({\mathscr {I}}\) is the mutual information and \({\mathscr {H}}\) is the entropy. The value ranges from 0 to 1; a value nearer to 1 means better clustering63.

  4. Jaccard Index: It measures the similarity between two sets, here the clustering solution and the class information. It is defined as:

    $$\begin{aligned} J({\mathbb {C}},{\mathbb {K}})= \frac{|{\mathbb {C}} \cap {\mathbb {K}}|}{|{\mathbb {C}} \cup {\mathbb {K}}|} \end{aligned}$$
    (45)

    The higher the value of this index, the better the clustering.

  5. Purity: To estimate Purity, each cluster is first allocated to the class that occurs most frequently in it. The accuracy of this cluster-class allocation is then obtained by dividing the number of correctly assigned objects by the total number of objects63. The equation for calculating Purity is:

    $$\begin{aligned} Purity({\mathbb {C}},{\mathbb {K}})=\frac{1}{n}\sum _{i}\max _{j}|C_i \cap K_j| \end{aligned}$$
    (46)

    Purity ranges from 0 to 1; the closer the value is to 1, the better the clustering.
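As referenced above, most of these external indices are available off the shelf, and Purity reduces to a few lines over the contingency table (a sketch assuming scikit-learn):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    """Allocate each cluster to its majority class and compute the overall accuracy, Eq. (46)."""
    cm = contingency_matrix(labels_true, labels_pred)  # cm[j, i] = |C_i intersect K_j|
    return cm.max(axis=0).sum() / cm.sum()

# ari = adjusted_rand_score(labels_true, labels_pred)
# nmi = normalized_mutual_info_score(labels_true, labels_pred)
```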

Based on these five external evaluation indices, it is observed that the proposed algorithm outperforms the others on the CESC, BRCA, LGG, and STAD datasets. OV is the only case where the proposed approach does not perform as well. When all the datasets are considered together to rank the clustering efficiency of all the algorithms under study across all the external indices, the proposed method stands first, attaining the maximum value 20 times out of 25. The execution times reported in Table 3 show that RISynG is faster than the other algorithms.

Table 2 Cancer subtypes description: actual class distribution.
Table 3 Comparative cluster analysis of proposed and existing approaches.

Importance of multi-omics data integration

The proposed algorithm RISynG iteratively integrates the relevant subspace of each of the synergy matrices, that is, the k eigenvectors of the synergy matrices that hold the cluster structure. To exhibit the significance of this iterative integration and the effectiveness of RISynG, it is compared with spectral clustering performed on the individual omics datasets. The results presented in Table 4 show that the proposed algorithm outperforms the individual omic views on the CESC, BRCA, LGG, and STAD datasets for all the external cluster validity indices. On the OV dataset, RISynG performs best on F-measure, Jaccard, and Purity, while the miRNA view performs better on the ARI and NMI indices. The performance of RISynG is significantly higher than that of the best individual view in the case of the CESC, BRCA, and LGG datasets, irrespective of the index.

To express the cluster-holding capacity of the integrated subspace obtained by the proposed approach, scatter plots of the best k dimensions are plotted, with colours indicating the ground truth (cancer subtypes). Comparative plots in Figs. 3, 4, 5, 6, and 7 show that the integrated subspace obtained by RISynG is more informative than those of other subspace-based integrative-clustering approaches (SNF, SURE, CoALa, iCluster, WMLRR, and MiMIC) for most of the datasets. A comparison with the best individual omic view (CESC: mRNA, BRCA: metDNA, OV: miRNA, LGG: metDNA, and STAD: miRNA) is also presented to establish the significance of the multi-omics data integration performed by the proposed approach. For the proposed approach, the scatter plots show that the clusters are well separated in the case of the CESC (Fig. 3) and LGG (Fig. 6) datasets. There is a slight overlap between two groups in BRCA (Fig. 4), but the separation is still better than with the other methods. For the OV (Fig. 5) and STAD (Fig. 7) datasets, overlap between subtypes is observed in the subspaces obtained by all the methods.

Table 4 Comparative performance analysis of proposed approach and individual omic-view.
Figure 3 Comparative analysis of different integrative sub-spaces for CESC dataset.

Figure 4 Comparative analysis of different integrative sub-spaces for BRCA dataset.

Figure 5 Comparative analysis of different integrative sub-spaces for OV dataset.

Figure 6 Comparative analysis of different integrative sub-spaces for LGG dataset.

Figure 7 Comparative analysis of different integrative sub-spaces for STAD dataset.

Biological analysis

Once the cancer subtypes are obtained, the molecular characteristics of the patient clusters are also evaluated to establish their biological relevance. To understand the varying expression of different biomarkers across subtypes, differential expression analysis (DEA) of miRNAs and mRNAs is performed between the correctly identified groups of patients. A comparative analysis is performed between the true positives and true negatives obtained by all the algorithms. As there are three subtypes in the LGG and CESC datasets, DEA is performed between three pairs (considering all possible pairs). Similarly, for the STAD and BRCA datasets, which have four subtypes, DEA is performed for six pairs; for the OV dataset, which has two subtypes, DEA is performed for one pair. The R package Limma64 is used to perform DEA. miRNAs and mRNAs having a Benjamini-Hochberg false discovery rate adjusted p-value \(< 0.05\) are considered differentially expressed. The numbers of differentially expressed biomarkers obtained from the different groups in the CESC, BRCA, OV, LGG, and STAD datasets are reported in Tables 5, 6, 7, 8, and 9, respectively. To further explore the biological knowledge and process-specific functioning of the identified sets of differentially expressed biomarkers, different types of enrichment analyses are also performed, considering the hundred most differentially expressed biomarkers in each case.
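The DEA itself is run in R with Limma; purely to illustrate the selection criterion, the Python sketch below applies Benjamini-Hochberg adjustment at a 0.05 false discovery rate to a hypothetical vector of per-biomarker p-values.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([1e-4, 0.003, 0.02, 0.04, 0.3, 0.8])  # hypothetical DEA p-values
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

# biomarkers whose BH-adjusted p-value falls below 0.05 are
# treated as differentially expressed
de_biomarkers = np.where(reject)[0]
```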

Biological enrichment analyses

The first analysis is Pathway enrichment analysis (PEA). It explores the mechanistic insight into the set of differentially expressed biomarkers by identifying the biological pathways that are enriched in the set more than expected by chance. The second is Biological process enrichment analysis (BPEA). It characterizes the relationship between genes or miRNAs by annotating them to associated biological processes, identifying the over-represented processes in a list and thereby helping evaluate the biological significance of the obtained cancer subtypes. The third is Disease ontology enrichment analysis (DOEA). Disease Ontology (DO) helps map the cancer subtypes identified from high-throughput data to clinical relevance. In this study, the R package clusterProfiler65 and DIANA Tools mirPath v.366 are used for performing PEA and BPEA for genes and miRNAs, respectively, and the R package DOSE67 is used to perform DOEA for the genes. The top 100 differentially expressed biomarkers are passed to these tools; if fewer than 100 biomarkers are differentially expressed in a case, all of them are used. The KEGG database is selected for PEA68. Only the pathway terms associated with the set of biomarkers at a false discovery rate adjusted p-value \(< 0.05\) (significant pathway terms) are considered. If a differentially expressed biomarker set is not associated with any significant KEGG pathway term, that set is regarded as not biologically relevant with respect to KEGG pathway terms. Similarly, only the biological process (BP) terms associated with the set of biomarkers at a false discovery rate adjusted p-value \(< 0.05\) (significant BP terms) are considered; a differentially expressed biomarker set not associated with any significant BP term is regarded as not biologically relevant with respect to BP terms. In DOEA, semantic similarities between DO terms and genes are calculated, which helps explore the similarities of diseases and gene functions from a disease perspective. The output of DOEA is a set of associated disease terms. A gene set is said to be enriched with DO terms if the terms obtained by its DOEA have a false discovery rate corrected p-value \(< 0.05\).

For the quantification of PEA, BPEA, and DOEA, the respective enrichment scores69 and annotation ratios69 are calculated. The higher the value of these scores, the better the enrichment; hence, the more biologically significant the differentially expressed biomarkers, the better the cancer sub-typing. The equations for these scores are:

$$\begin{aligned} BPES= & {} \frac{1}{T} \sum _{t=1}^{T}-\log _{10}(p\text {-value}_{t}), \end{aligned}$$
(47)
$$\begin{aligned} AR= & {} \frac{1}{T \times G} \sum _{i=1}^{T}g_{i}. \end{aligned}$$
(48)

Here, T denotes the number of significant pathway/BP/DO terms associated with a set of differentially expressed genes or miRNAs between two cancer subtypes identified by any clustering approach. G denotes the total number of genes given to clusterProfiler for the enrichment analysis, and \(g_{i}\) denotes the gene count associated with the i-th pathway/BP/DO term. A comparative analysis of the cancer subtypes obtained by the proposed approach and the other existing algorithms is performed, and the associated quantitative indices are reported in Tables 5, 6, 7, 8, and 9. Some of the differentially expressed miRNA or mRNA sets have no associated significant terms; therefore, the quantitative indices cannot be calculated for them. Also, in some cases, there are no differentially expressed biomarkers at all. All these cases are represented by \(*\) in the tables.
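Given the per-term p-values and gene counts returned by the enrichment tool, both scores reduce to simple averages; the sketch below is a direct transcription of Eqs. (47) and (48), with `p_values`, `gene_counts`, and `G` as hypothetical inputs.

```python
import numpy as np

def enrichment_scores(p_values, gene_counts, G):
    # BPES (Eq. 47): mean -log10 p-value over the T significant terms.
    # AR (Eq. 48): total gene count over the terms, normalised by T * G.
    p_values = np.asarray(p_values, dtype=float)
    gene_counts = np.asarray(gene_counts, dtype=float)
    T = len(p_values)
    bpes = -np.log10(p_values).mean()
    ar = gene_counts.sum() / (T * G)
    return bpes, ar

# hypothetical: three significant terms from 100 input genes
bpes, ar = enrichment_scores([1e-4, 5e-3, 0.02], [12, 8, 5], G=100)
```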

To compare the effectiveness of the proposed approach with the other algorithms in this study, the overall performance of all the methods is also evaluated. When all five cancer datasets are considered together, the proposed approach outperforms the others with respect to both the cluster evaluation indices and the biological enrichment analysis, as shown in Fig. 8. The analysis is performed using the success frequency, that is, the number of times a method scored the highest value for the respective indices across all the cases in all the cancer types. The success frequency shows that the proposed approach leads on the cluster validity indices by scoring the maximum value 21 times, followed by SNF.CC (7), SNF (6), CNMF (5), CC (2), COCA (2), and WMLRR (1). Similarly, when the methods are ranked by success frequency for the quantitative indices calculated for the biological enrichment analysis, the proposed approach again stands first by scoring the maximum value 67 times, followed by SNF (21), SNF.CC (20), CC (12), CoALa (10), CNMF (9), MiMIC (7), SURE (5), WMLRR (5), COCA (4), and iCluster (1). When the cluster validity indices are considered individually, the proposed approach also outperforms with respect to F-measure, ARI, NMI, Jaccard index, and Purity. Considering the indices for biological enrichment individually, the proposed algorithm again outperforms with respect to all the indices except the AR corresponding to BPES for mRNA enrichment, where it stands second.

Figure 8 Method comparison.

Table 5 Comparative biological analysis of CESC dataset.
Table 6 Comparative biological analysis of BRCA dataset.
Table 7 Comparative biological analysis of OV dataset.
Table 8 Comparative biological analysis of LGG dataset.
Table 9 Comparative biological analysis of STAD dataset.

Overlap analysis

The hundred most differentially expressed genes between all the subtype pairs in cervical cancer identified by RISynG and the other methods are explored further for experimental support. The genes are analyzed based on their degree of overlap with experimentally validated cervical cancer genes. The Cervical Cancer Gene Database (CCDB)70, a manually curated catalog of experimentally validated genes involved in the different stages of cervical carcinogenesis, is used for finding the overlap. All the up-regulated and down-regulated genes in cervical cancer with evidence from the published literature available in CCDB are considered for this analysis. CCDB reports 367 genes that are differentially expressed in cervical cancer; 185 of these are among the 2000 genes used for cancer subtype identification in this study. The statistical significance of the overlap analysis is reported in Table 10. In total, 30 of the 222 genes identified by the proposed approach overlap with cervical cancer-related genes, which is the maximum overlap among the compared methods. Fisher's exact test is used to assess the statistical significance of the contingency table created from the overlap analysis in Table 10 for the different algorithms, as sketched below. At 95% confidence, only the genes identified by the proposed approach have a significant overlap with the experimentally validated genes curated from the literature, with a p-value of 0.026. This indicates that the proposed approach has the potential to identify clinically important cancer subtypes that have a characteristic molecular signature.
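For reproducibility, the test can be set up as a standard two-by-two Fisher's exact test; the sketch below uses the counts reported above and assumes the conventional contingency-table layout, so the exact p-value depends on that assumption.

```python
from scipy.stats import fisher_exact

identified, overlap = 222, 30   # genes from the proposed approach; CCDB overlap
known, universe = 185, 2000     # CCDB genes among the 2000 genes considered

table = [
    [overlap, identified - overlap],
    [known - overlap, universe - identified - (known - overlap)],
]
odds_ratio, p_value = fisher_exact(table)  # two-sided by default
# p_value should be close to the reported 0.026 under this layout
```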

Table 10 Overlap with experimentally validated gene-list.

Conclusion

The present study describes a method named RISynG that efficiently identifies cancer subtypes. Cancer subtype identification can facilitate cancer diagnosis and therapy and is one of the vital components of the precision medicine framework. The main contributions of this study are: (1) the development of an integrative clustering method for multi-view omics data; (2) the demonstration of the effectiveness of the proposed method over existing methods; and (3) the establishment of the biological relevance of the obtained results.