Introduction

Cancer is a heterogeneous disease with diverse pathogeneses and clinical features that can develop in different tissues and cell types1. A cancer subtype is a subcategory of a specific cancer; for example, cervical cancer can be further grouped into adenocarcinomas and squamous cell carcinomas. Multiple subtypes are distinguishable based on molecular profiles, histology, or, in some cases, specific mutations. Personalized medicine aims to provide patient-specific rather than generic treatment. Therefore, for effective treatment of any cancer, it is crucial to identify the appropriate cancer subtype in order to provide an accurate prognosis2.

Nowadays, with the advancement of technologies, it has become relatively easy to generate high-dimensional multi-omics data for an individual. Multi-omics data include miRNA and mRNA expression, DNA methylation, reverse-phase protein arrays, and others. These datasets are publicly available in databases such as The Cancer Genome Atlas (TCGA)3. The accumulation of omics data opens up the opportunity to develop novel computational methods that integrate the tremendous amount of multi-view information available for cancer subtype identification. The usual practice for identifying cancer subtypes is to cluster cancer patient data. By grouping cancer patients based on their genetic profiles, one can better understand the pathogenic mechanisms behind the disease, which in turn supports the development of subtype-specific anticancer treatments. However, several challenges exist in grouping cancer patients and integrating multi-omics data.

Multi-view omics data integration and clustering of cancer patients are relatively new research areas, and only a few algorithms have been developed to address the challenges associated with them. A decade ago, researchers used single-omics data to cluster cancer subtypes. Several studies used only gene expression data4,5,6, DNA methylation data7, or copy number data8 to identify cancer subtypes. These algorithms cluster the samples to capture the homogeneity among patients based on the expression levels of a specific biomarker. Since acquiring cancer hallmarks requires molecular alterations at multiple levels, such algorithms fail to establish causal relationships between molecular signatures. This biological phenomenon indicates the need for algorithms that integrate multi-omics data to identify cancer subtypes. In this regard, integrative clustering-based approaches have proven helpful for capturing the underlying molecular mechanisms of the disease. These algorithms can be categorized into two groups. The first group identifies clusters from each omic dataset separately and later combines the clustering results into a global cluster structure that represents the cancer subtypes9,10,11,12. Such algorithms are known as Consensus Clustering (CC). Most CC algorithms perform the final clustering on the individual clusters obtained from the different omic datasets using a voting mechanism, and different voting mechanisms generate different clustering solutions. The second group of integrative clustering-based approaches first integrates the multi-view omics data and then applies clustering to obtain cancer subtypes13,14,15,16. Sometimes the multi-view data are simply concatenated or stacked together before clustering; however, data concatenation may lead to information loss and amplifies the curse of dimensionality16. To overcome these limitations, another set of algorithms extracts an informative subspace from each omic dataset and then performs clustering on the integrated representation14,15,16,17,18,19.

Clustering multi-view genomics data is a challenging task. One of the critical steps is selecting relevant information from all the available information sources and judiciously integrating it to obtain a better clustering solution. The views in multi-omics data differ in variance, scale, and unit. If the integration step is not performed correctly, the fused information may be biased towards the most variant omic view. Therefore, it is essential to first capture the variation present in each view and then integrate the views. Some existing methods first model the variation of each view with the help of similarity graphs and then integrate them to identify clusters13,19,20,21. The challenge is finding the best possible way to integrate the different types of genomic information available for the same set of samples while capturing the essence of all the views. The research area devoted to this type of problem is multi-view learning22,23,24,25,26,27.

In this study, a novel algorithm named RISynG (Recursive Integration of Synergised Graph-representations) is presented. The proposed approach treats multi-omics data clustering as multi-view clustering, where information from multiple omics platforms is integrated to identify clinically important sub-groups within cancer. To judiciously capture the variation present across the multi-omics dataset, the proposed approach works in three steps. In the first step, for each view, two sample-similarity matrices are computed using graph representation matrices, namely the Gramian matrix and the Laplacian matrix. This step acknowledges the statistical diversity of the multi-view omics data, which directly influences the quantification of similarity between samples. The representation matrices of each omic view are then integrated using a parameterized combination function to generate synergy matrices. In the second step, the variation captured by the synergy matrices of the individual omic views is fused: the proposed approach first arranges all the synergy matrices by their relevance, and a recursive function then merges the synergy matrices one by one so that less relevant matrices have only a slight influence on the final cluster structure. At the end of this process, the final accretive basis of the accretive subspace is obtained, whose first k eigenvectors hold the cluster structure. Finally, k-means clustering is applied to the rows of the accretive basis matrix to generate cluster labels. The efficacy of the proposed algorithm is extensively studied on five multi-omics cancer datasets and compared with existing multi-view clustering approaches for cancer subtype identification.

Proposed approach for cancer subtype identification

This section describes the novel algorithm designed in this study to integrate multi-omics data for cancer subtype identification. The proposed method integrates multi-view data using a recursive multi-kernel integration function. It uses graph representations to capture sample similarities from each omic view and exploits each view's statistical properties. The schematic workflow of RISynG is presented in Fig. 1. Before moving to the steps of the proposed algorithm, the required analytical formulations are discussed.

Figure 1

Schematic flow diagram of the proposed approach for cancer subtypes identification.

Gramian matrix and kernel trick

The Gramian matrix \(G=[g_{ij}]_{n\times n}\) is a Hermitian matrix in which each element is a pairwise Hermitian inner product of the vectors in a Hausdorff pre-Hilbert space, V = \(\{{v_{1},v_{2},v_{3}, \ldots ,v_{n}}\}\).

$$\begin{aligned} G(v_{1},\dots ,v_{n})= \begin{bmatrix} \langle v_{1},v_{1}\rangle & \dots & \langle v_{1},v_{n}\rangle \\ \langle v_{2},v_{1}\rangle & \dots & \langle v_{2},v_{n}\rangle \\ \vdots & \ddots & \vdots \\ \langle v_{n},v_{1}\rangle & \dots & \langle v_{n},v_{n}\rangle \end{bmatrix}, \quad v_{i}\in {\mathbb {R}}^d. \end{aligned}$$

The Hermitian inner product space is accompanied by the geometric notions associated with vectors, such as length and the angle between two vectors. Since G is a Hermitian matrix, it inherits all the properties of a Hermitian matrix. A few of the relevant properties are listed below28.

Property 1

All the eigenvalues of G are real.

Proof

Eigenvalues of a matrix are the roots of its characteristic equation. The characteristic equation for matrix G is written as:

$$\begin{aligned} \det {(\lambda I-G)}=0. \end{aligned}$$
(1)

Let the root be some complex number \(\lambda = a+ib\), \(a,b\in {\mathbb {R}}\), \(b\ne 0\), and let I be the identity matrix of the same order. Since, at this value of \(\lambda \), the matrix \(\lambda I-G\) has a non-trivial kernel, there must exist a vector \(u=x+iy\), \(x,y\in {\mathbb {R}}^n\), such that:

$$\begin{aligned} {Gu=\lambda u}, \end{aligned}$$
(2)

or,

$$\begin{aligned} {G(x+iy)=(a+ib)(x+iy)}. \end{aligned}$$
(3)

Taking the complex conjugate of this equation, we get

$$\begin{aligned} {G(x-iy)=(a-ib)(x-iy)}. \end{aligned}$$
(4)

If \(x+iy\) and \(x-iy\) were eigenvectors of G corresponding to two different eigenvalues, then their inner product \(\Vert x\Vert ^2+\Vert y\Vert ^2\) would have to be 0, because eigenvectors corresponding to distinct eigenvalues of a Hermitian matrix are mutually orthogonal. That is impossible unless x and y are both 0, in which case (3) and (4) would coincide. The only way out is to contradict the initial assumption and allow b to be 0 for every eigenvalue. Hence, it is proved that all the eigenvalues of G are real. \(\square \)

Property 2

G is a symmetric and positive semi-definite matrix.

Proof

Since \(v_{i}\in {\mathbb {R}}^d\), the following holds for any vector \(x\in {\mathbb {R}}^n\).

$$\begin{aligned} {x^{\textsf {T}}{G} x=\sum _{i,j}x_{i}x_{j}\left\langle v_{i},v_{j}\right\rangle =\sum _{i,j}\left\langle x_{i}v_{i},x_{j}v_{j}\right\rangle }. \end{aligned}$$
(5)

According to the elementary property of inner products,

\({\displaystyle \langle x+y,x+y\rangle =\langle x,x\rangle +\langle x,y\rangle +\langle y,x\rangle +\langle y,y\rangle \,.}\) It implies that the sum of inner products in (5) can be rewritten as

$$\begin{aligned} {\left\langle \sum _{i}x_{i}v_{i}, \sum _{j}x_{j}v_{j}\right\rangle =\left\| \sum _{i}x_{i}v_{i} \right\| ^{2}\ge 0.} \end{aligned}$$
(6)

Therefore, G is a positive semi-definite matrix. Its symmetry follows directly from the symmetry of the real inner product, since \(\langle v_i,v_j\rangle =\langle v_j,v_i\rangle \). \(\square \)

Property 3

All the eigenvalues of G are non-negative.

Proof

Property 2 implies \(x^{\textsf {T}}{G} x\ge 0\). Substituting \(Gx=\lambda x\) from (2),

$$\begin{aligned} {x^{\textsf {T}}{G} x=\lambda x^{\textsf {T}}x}\ge 0. \end{aligned}$$
(7)

Since \(x^{\textsf {T}}x\) is positive for every eigenvector x, it follows that \(\lambda \ge 0\). Hence proved. \(\square \)
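These properties can be sanity-checked numerically. The following is a minimal sketch (assuming NumPy; the vectors are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((10, 5))  # ten vectors v_i in R^5, stored as rows

G = V @ V.T                       # Gramian: G[i, j] = <v_i, v_j>

print(np.allclose(G, G.T))        # Property 2: G is symmetric -> True
eigvals = np.linalg.eigvalsh(G)   # eigvalsh assumes symmetry and returns real eigenvalues
print(np.all(eigvals >= -1e-10))  # Properties 1 and 3: real and non-negative (up to round-off)
```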

The previously described premise is often used in various methods of dimensionality reduction. Algorithms like Principal Component Analysis (PCA) and its variants utilize the kernel trick to map the observations into a higher dimension and make the data linearly separable. This is equivalent to projecting the mean-centered data onto the subspace on which its variance is maximum29. Schölkopf et al.30 showed that algorithms like KPCA use a kernel function \(\varvec{\kappa }\) to implicitly learn a mapping function \(\phi \) from the input space \({\mathbb {R}}^n\) into a high-dimensional Hilbert space \(\mathbf{F}\), called the feature space. The process is demonstrated in (8) and (9).

$$\begin{aligned} {\phi :{\mathbb {R}}^n \rightarrow \mathbf{F}}. \end{aligned}$$
(8)

Therefore, for a data point \(v=(x_1,\dots ,x_n)\), \(x_i \in {\mathbb {R}}\), the mapping into the feature space \({\mathbb {R}}^{n+k}\) is given by

$$\begin{aligned} {\phi (v)=(x_1,\dots ,x_n,p_1,\dots ,p_k)\in {\mathbb {R}}^{n+k}}, \end{aligned}$$
(9)

where the value of \(p_i\) depends upon the kernel used for the mapping. However, kernels do not explicitly project the data into that high-dimensional feature space; rather, they generate a Gramian matrix G of the mapped data in the aforementioned feature space \(\mathbf{F}\). The generated Gramian matrix enables the input data to be operated on in that high-dimensional feature space31. Let \(X=(x_1\dots x_n)\), \(x_i\in {\mathbb {R}}^{d}\), represent the input data. The corresponding Gramian matrix is given by

$$\begin{aligned} {[G]_{ij}=\kappa ({x_i, x_j}) = \langle \phi ({x_i}), \phi ({x_j})\rangle , \quad {x_i},{x_j}\in X}. \end{aligned}$$
(10)

Let \(G=U\Sigma U^T\) represent the eigen decomposition of G, where U is a matrix containing the eigenvectors of matrix G, arranged column-wise in descending order of their corresponding eigenvalues, which are present in the same fashion in the diagonal matrix \(\Sigma \) as shown in (11) and (12).

$$\begin{aligned} U=[u_1,\dots ,u_n], \end{aligned}$$
(11)
$$\begin{aligned} \Sigma =diag(\lambda _1,\lambda _2,\dots ,\lambda _n). \end{aligned}$$
(12)

Here, \(\lambda _1\ge \dots \ge \lambda _n\ge 0\) (see Property 3 of the Gramian matrix), \(u_i^Tu_i=1\) for \(i\in \{1,2,\dots ,n\}\), and \(Gu_i=\lambda _i u_i\). Note that, in the context of PCA, principal components refer to the projections of the input data points onto the principal directions, along which the variance of the data is maximum. For PCA, the projection is given by \(y_i=U_k^Tx_i\) for all \(i\in \{1,2,\dots ,n\}\), where \(U_k\) is the matrix of the first k eigenvectors of G. In the case of KPCA, however, the spectrum of G itself gives the projection of X32. Note that when \(\phi (v)=v\), the Gramian matrix transforms into the covariance matrix. Generalising both, if \(U_k\) represents the k principal axes, the algorithm finds a basis of an optimal low-dimensional subspace in which the \( L_2\)-norm of the reconstruction error is minimum33. That is, for a test sample x

$$\begin{aligned} {\underset{{U_k}}{\mathrm{arg}\,\mathrm{min} }}\, \Vert \phi (x)-U_kU_k^T\phi (x)\Vert ^2. \end{aligned}$$
(13)

In addition to dimensionality reduction, principal component analysis can also be used for k-clustering via a heuristic k-means algorithm. This is done by performing k-means clustering in the projected space, as in the heuristic k-means algorithm described in34.
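As a rough illustration of this pipeline, the sketch below (assuming NumPy, SciPy, and scikit-learn; the function name and parameters are illustrative, not the paper's implementation) builds an RBF-style Gramian, takes the eigenvectors of its k largest eigenvalues as the projection, and clusters the projected samples with k-means:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def kernel_projection_kmeans(X, k, n_clusters):
    """Project samples (rows of X) via the spectrum of an RBF-style Gramian, then run k-means."""
    dist = squareform(pdist(X, metric="sqeuclidean"))
    G = np.exp(-dist / dist.max())        # Gramian of the implicitly mapped data
    eigvals, eigvecs = np.linalg.eigh(G)  # eigenvalues in ascending order
    U_k = eigvecs[:, -k:]                 # eigenvectors of the k largest eigenvalues
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U_k)
```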

Graph Laplacian

In a clustering pipeline, any set of observations naturally exhibits the properties of a graph. Therefore, given a set of data points \(X=(x_1,x_2,\dots ,x_n) \in {\mathbb {R}}^{d\times n}\) and a notion of similarity between any two points \(x_i\),\(x_j\in X\), an undirected similarity graph \(S=(V,E)\) can be constructed such that each vertex \(v_i\in V\) represents a data point \(x_i\), and \((v_i,v_j)\in E\) represents the edge between vertices \(v_i\) and \(v_j\). With each edge is associated an edge weight \(e_{ij}\) that represents the similarity between the corresponding data points. Let the similarity matrix be \(W=[e_{ij}]_{n\times n}\). The degree \(d(v_i)\) associated with each node \(v_i\) is given by

$$\begin{aligned} d(v_i)=|\{v_j\in V|\{v_j,v_i\}\in E\text { or }\{v_i,v_j\}\in E\}|=\sum _{j=1}^ne_{ij}. \end{aligned}$$
(14)

The degrees of all the nodes can be collected in a diagonal matrix, as shown in (15).

$$\begin{aligned} D=diag(d_1,d_2,\dots ,d_n). \end{aligned}$$
(15)

These matrices act as precursors for constructing a matrix of algebraic importance, called the Laplacian matrix. The graph Laplacian can be seen as a discrete analogue, on graphs, of the Laplace operator defined on continuous representations of the data such as vector spaces or Riemannian manifolds. The Laplacian matrix has many variants, so much so that, depending on the problem and the available data, authors devise their own versions of the graph Laplacian matrix35. The simplest graph Laplacian is given by \(D-W\) and is called the unnormalised graph Laplacian matrix. In the proposed algorithm, however, the normalised graph Laplacian matrix is used. That is,

$$\begin{aligned} {\mathscr {L}}=D^{-1/2}(D-W)D^{-1/2}=I-D^{-1/2}WD^{-1/2}, \end{aligned}$$
(16)

where \(D^{-1/2}=diag(d_1^{-1/2},d_2^{-1/2}, \dots ,d_n^{-1/2})\) and I is the identity matrix of appropriate order. Considering that a similarity matrix is a Gramian matrix, it is apparent that the Gramian and the Laplacian are not very different: the Laplacian can be characterised as the Gramian normalised by the degree matrix. The distinction between the unnormalised and the normalised graph Laplacian is best seen in light of spectral clustering. Consider a strongly connected graph \(S=(V,E)\). The purpose of clustering is to come up with subsets of points according to their similarity, such that similar points lie in the same subset. This is equivalent to finding a partition of the graph such that the edges between different parts have minimum total weight. For two disjoint subsets \(A, B\subset V\) corresponding to two different parts, the cut size is given by

$$\begin{aligned} cut(A,B)=\sum _{i\in A,j\in B}e_{ij}. \end{aligned}$$
(17)

Let there be k clusters in the data. The aim of clustering is to find k partitions \({\mathbf{A}=(A_1,A_2,\dots ,A_k)}\) such that the total size of the cuts in (17) over all the partitions is minimum. That is

$$\begin{aligned} \underset{A_1,\dots ,A_k}{\min }{{\text {cut}}} (A_i:1\le i\le k):=\sum _{i=1}^k{cut(A_i,\bar{A_i})}, \end{aligned}$$
(18)

where \(\bar{A_i}\) is the complement of \(A_i\). This is called the mincut problem. However, solving (18) alone does not achieve reliable clustering results. For example, for \(k=2\), separating a single vertex from the rest of the graph can also be a valid solution as per mincut. In clustering, each cluster needs to accommodate a reasonably large partition to be considered credible. Therefore, the objective function is redefined in the following two ways

$$\begin{aligned} \underset{A_1,\dots ,A_k}{\min }{{\text {RatioCut}}} (A_i:1\le i\le k):=\sum _{i=1}^{k} {\frac{cut(A_i,\bar{A_i})}{|A_i|}}, \end{aligned}$$
(19)
$$\begin{aligned} \underset{A_1,\dots ,A_k}{\min }{{\text {NCut}}} (A_i:1\le i\le k):=\sum _{i=1}^{k} {\frac{cut(A_i,\bar{A_i})}{vol(A_i)}}, \end{aligned}$$
(20)

where \(|A_i|\) represents the number of vertices in partition \(A_i\) and \(vol(A_i)=\sum _{v_j\in A_i}{d_j}\).

However, solving these minimisation problems is NP-hard. The Laplacian matrix is a utility that can be used to approximate them: the unnormalised Laplacian serves in the approximation of RatioCut minimisation, while the normalised Laplacian serves in the approximation of NCut minimisation. The approximated objective function using the normalised Laplacian is given by (21).

$$\begin{aligned} \underset{U_k}{\min }{{\text {tr}}} (U_k^T{\mathscr {L}}U_k), \text { subjected to }U_k^TU_k=I. \end{aligned}$$
(21)

The above expression is minimised when \(U_k\in {\mathbb {R}}^{n\times k}\) is the matrix containing the eigenvectors corresponding to the k smallest non-zero eigenvalues of \({\mathscr {L}}\). This matrix is used to embed the data into a k-dimensional Euclidean space spanned by the vectors in \(U_k\), in which grouping the data points is arguably easy even with simpler techniques like k-means. The described practice is known as Laplacian embedding. The embedded data are then subjected to the k-means clustering algorithm for cluster discovery, as in the normalised spectral clustering presented in Ref.36. For a strongly connected graph with a single component, the eigenvector corresponding to the trivial solution (i.e. \(\lambda =0\)) of the eigenvalue problem of matrix \({\mathscr {L}}\) is the column vector of n ones. Therefore, \({\mathscr {L}}{} \mathbf{1}_{n}=0\), where \(\mathbf{1}_{n}=(1,\dots ,1)^T\). If the graph happens to have more than one component, then the multiplicity k of eigenvalue 0 is equal to the number of connected components in the graph. Nonetheless, with respect to clustering, the eigenvector(s) corresponding to eigenvalue 0 should be omitted while performing the Laplacian embedding. This can be done by introducing a minor change in the matrix.

$$\begin{aligned} L={\mathscr {L}}+{\frac{2}{n}}{(1_n 1_n^T)}. \end{aligned}$$
(22)

If the eigenpairs of \({\mathscr {L}}\) are given by

$$\begin{aligned} {\varvec{\Gamma }}({\mathscr {L}})=\{(\lambda _1,f_1), (\lambda _2,f_2),\dots ,(\lambda _n,f_n)\} \end{aligned}$$

then, the eigenpairs of (22) are given by

$$\begin{aligned} {\varvec{\Gamma }}(L)=\{(\lambda _2,f_2),(\lambda _3,f_3),\dots ,(\lambda _n,f_n),(\lambda _1+2,f_1)\},\\ \text { where } 0=\lambda _1<\lambda _2\le \dots \le \lambda _n\le 2\text { and }f_1=\mathbf{1}_n. \end{aligned}$$

Hence, the new eigenvalue problem becomes

$$\begin{aligned} Lv={\mathscr {L}}v+{\frac{2}{n}}(1_n1_n^T)v=\lambda v. \end{aligned}$$
(23)

By modifying the matrix to L, the first k eigenvectors can be taken right away. This trick works because, for all the pairs in \({\varvec{\Gamma }}({\mathscr {L}})\) except \((\lambda _1,f_1)\), the matrix L acts exactly like \({\mathscr {L}}\): since \({\mathscr {L}}\) is symmetric, every eigenvector \(f_i\) with \(i\ge 2\) is orthogonal to \(f_1=\mathbf{1}_n\), so the added term \((2/n)(\mathbf{1}_n\mathbf{1}_n^T)f_i\) vanishes. Hence, the set \({\varvec{\Gamma }}(L)\) contains all the eigenpairs of \({\varvec{\Gamma }({\mathscr {L}})}\) except \((\lambda _1,f_1)\). At \(v=f_1=\mathbf{1}_{n}\),

$$\begin{aligned} L\mathbf{1}_{n}={\mathscr {L}}{} \mathbf{1}_{n}+{\frac{2}{n}}(\mathbf{1}_n1_n^T) \mathbf{1}_{n}=\lambda _1\mathbf{1}_{n}+2\mathbf{1}_{n}=(\lambda _1+2)\mathbf{1}_{n}. \end{aligned}$$
(24)

Therefore, in the new set \(\varvec{\Gamma }(L)\), the rank of every eigenvalue greater than \(\lambda _1\) decreases by one, and \(\mathbf{1}_{n}\) becomes the eigenvector corresponding to the largest eigenvalue. The Laplacian matrix has certain properties that are exploited by many clustering techniques, like the one shown above. Some of the relevant properties are as follows.

Property 1

For every vector \(f\in {\mathbb {R}}^n\), \({\mathscr {L}}\) satisfies the following condition

$$\begin{aligned} f^\prime {\mathscr {L}}f={\frac{1}{2}} \sum _{i,j=1}^n e_{ij} \left( {\frac{f_i}{\sqrt{d_i}}}-{\frac{f_j}{\sqrt{d_j}}} \right) ^2 \end{aligned}$$
(25)

Proof

By the definition of degree, \(d_i=\sum _{j=1}^ne_{ij}\). Therefore,

$$\begin{aligned} f^{\prime }{\mathscr {L}}f&= f^{\prime }(I-D^{-1/2}WD^{-1/2})f \\&=\sum _{i=1}^nf_i^2-\sum _{i,j=1}^n{\frac{f_i}{\sqrt{d_i}}}{\frac{f_j}{\sqrt{d_j}}}e_{ij} \\&={\frac{1}{2}}\left( \sum _{i=1}^n{\frac{f_i^2}{d_i}}d_i+\sum _{j=1}^n{\frac{f_j^2}{d_j}}d_j-2\sum _{i,j=1}^n{\frac{f_i}{\sqrt{d_i}}}{\frac{f_j}{\sqrt{d_j}}}e_{ij}\right) \\&={\frac{1}{2}}\sum _{i,j=1}^n\left( {\frac{f_i^2}{d_i}}e_{ij}+{\frac{f_j^2}{d_j}}e_{ij}-2{\frac{f_i}{\sqrt{d_i}}}{\frac{f_j}{\sqrt{d_j}}}e_{ij}\right) \\&={\frac{1}{2}}\sum _{i,j=1}^ne_{ij}\left( {\frac{f_i}{\sqrt{d_i}}}-{\frac{f_j}{\sqrt{d_j}}}\right) ^2. \end{aligned}$$

Hence proved.\(\square \)

Property 2

\({\mathscr {L}}\) is a symmetric and positive semi-definite matrix.

Proof

From (16), the symmetry of the matrix is fairly evident. Also, from Property 1, \({f^\prime {\mathscr {L}}f}\ge 0\) for all \(f\in {\mathbb {R}}^n\). Hence, it is proved that \({\mathscr {L}}\) is a symmetric and positive semi-definite matrix. \(\square \)

Property 3

All eigenvalues of \({\mathscr {L}}\) are non-negative.

Proof

Property 1 implies \({f^\prime {\mathscr {L}}f}\ge 0\). Substituting \({\mathscr {L}}f=\lambda f\), we get \({{f^\prime {\mathscr {L}}f}=\lambda f^\prime f}\ge 0\). Since \(f^\prime f\) is positive for all eigenvectors, \(\lambda \ge 0\). Hence proved.\(\square \)
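Putting the pieces together, a minimal sketch of normalised spectral clustering with the eigenvalue-shift trick of (22) might look as follows (assuming NumPy and scikit-learn; W is a precomputed similarity matrix, and the function name is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def normalised_spectral_clustering(W, k):
    """Cluster samples given a similarity matrix W, using the shifted normalised Laplacian."""
    n = W.shape[0]
    d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    L_sym = np.eye(n) - d_inv_sqrt @ W @ d_inv_sqrt  # Eq. (16)
    L = L_sym + (2.0 / n) * np.ones((n, n))          # Eq. (22): demotes the trivial eigenpair
    _, eigvecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    U_k = eigvecs[:, :k]                             # k smallest eigenvectors, trivial one excluded
    return KMeans(n_clusters=k, n_init=10).fit_predict(U_k)
```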

RISynG algorithm

For grouping the cancer patients into clusters, each omic view is represented as a graph using two representation matrices, namely the Gramian matrix and the Laplacian matrix. Each representation matrix attributes the similarity network of the samples with a notion of similarity between the samples. Consider a view \(X_m=(x_1,x_2,\dots ,x_n)\), \(x_i\in {\mathbb {R}}^{d_m}\), corresponding to the mth omic source. If \(\rho (x_i,x_j)\) denotes the distance between \(x_i\) and \(x_j\) \(\in X_m\), then the similarity \(w(x_i,x_j)\) between them is given by:

$$\begin{aligned} w(x_i,x_j)=\text {exp} \left\{ -{\frac{\rho (x_i,x_j)}{\sigma }}\right\} , \end{aligned}$$
(26)

where \(\sigma \) is a free parameter adjusted according to the intrinsic properties of the data subjected to the clustering model. For the cancer data used in this study, \(\sigma \) is given by \(\sigma =\max _{x_i,x_j\in X_m}\rho (x_i,x_j)/2\). The proposed method assumes that the views may constitute different cluster manifolds when learnt under a particular similarity measure. Therefore, the predicted clusters would be apparent, and in strong concordance with the clinical clusters, if pairwise sample similarity is computed in a data-dependent multi-kernel fashion. It was found that in some views the correlation distance prominently reflected a cluster manifold that concurred with the natural clusters, while other views showed a proclivity towards the Euclidean distance, and the rest seemed to accommodate parts of both. All things considered, two different graph representation matrices have been formulated, the Gramian matrix and the Laplacian matrix, each with a different measure of similarity. For \(X_m\), let the correlation distance between \(x_i\) and \(x_j\) be \(\varphi _m(x_i,x_j)\) and the squared Euclidean distance be \(\varepsilon _m(x_i,x_j)\). If \(\hat{\varphi }_m\) and \(\hat{\varepsilon }_m\) denote the maximum pairwise correlation distance and squared Euclidean distance, respectively, then the Gramian matrix \(G_m\) and the similarity matrix \(W_m\) are given by

$$\begin{aligned}{}[G_m]_{ij}= w_G(x_i,x_j)=\exp \left\{ -{\frac{\varphi _m(x_i,x_j)}{\hat{\varphi }_m}} \right\} , \quad \text {where } i,j\in \{1,2,\dots ,n\}, \end{aligned}$$
(27)
$$\begin{aligned}{}[W_m]_{ij}= w_L(x_i,x_j)=\exp \left\{ -{\frac{\varepsilon _m(x_i,x_j)}{\hat{\varepsilon }_m}} \right\} , \quad \text {where } i,j\in \{1,2,\dots ,n\}. \end{aligned}$$
(28)

The matrix articulated in (28) is a crucial precursor for the construction of the Laplacian matrix, which is constructed by normalising \(W_m\) by the degree matrix \(D_m\) of its associated graph, as in Eqs. (15) and (16). Hence, the required representation matrices for each view \(X_m\), \(m\in \{1,2,\dots ,M\}\), are given by (27) and (29).

$$\begin{aligned} {\mathscr {L}}_m=D_m^{-1/2}(D_m-W_m)D_m^{-1/2}=I-D_m^{-1/2}W_mD_m^{-1/2}. \end{aligned}$$
(29)

The Laplacian matrix so obtained is then modified as described in Eq. (22):

$$\begin{aligned} L_m={\mathscr {L}}_m+{\frac{2}{n}}(1_n 1_n^T). \end{aligned}$$
(30)
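A sketch of this per-view construction, under the distance choices of (27) and (28), is given below (assuming NumPy and SciPy; the function name is illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def representation_matrices(X):
    """For one omic view X (samples as rows), return the Gramian G_m and shifted Laplacian L_m."""
    n = X.shape[0]
    corr = squareform(pdist(X, metric="correlation"))  # correlation distance, Eq. (27)
    eucl = squareform(pdist(X, metric="sqeuclidean"))  # squared Euclidean distance, Eq. (28)
    G = np.exp(-corr / corr.max())                     # Gramian matrix G_m
    W = np.exp(-eucl / eucl.max())                     # similarity matrix W_m
    d_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    L_norm = np.eye(n) - d_inv_sqrt @ W @ d_inv_sqrt   # normalised Laplacian, Eq. (29)
    L = L_norm + (2.0 / n) * np.ones((n, n))           # shifted Laplacian, Eq. (30)
    return G, L
```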

It is apparent from the discussion under the headings Gramian matrix and kernel trick and Graph Laplacian that the matrix \(U_k\) obtained from the Gramian matrix plays the same role as that obtained from the Laplacian matrix. Therefore, to combine the information encoded in these matrices, a parameterised combination function \({\varvec{\Omega }}(\cdot ,\cdot )\) can be used, yielding a synergy matrix of the representation matrices. If \(G_m\) is the Gramian matrix and \(L_m\) is the Laplacian matrix of omic view \(X_m\), then the synergy matrix is given by:

$$\begin{aligned} {\varvec{\Omega }}(G_m,L_m) = H_m =\beta G_m+(1-\beta )L_m, \quad \text {where } 0\le \beta \le 1. \end{aligned}$$
(31)

Consequently, the corresponding objective functions (13) and (21) also combine into a single optimisation over \(U_k\in {\mathbb {R}}^{n\times k}\).

$$\begin{aligned} \underset{U_k}{\min }\;\beta \Vert X-U_kU_k^TX\Vert _F+(1-\beta )\,{\text {tr}}(U_k^T{\mathscr {L}}U_k), \quad \text { subject to }U_k^TU_k=I. \end{aligned}$$
(32)
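Algorithm 1 learns \(\beta \) by scanning candidate values and keeping the synergy matrix that yields the best provisional clustering. A minimal sketch of that scan (assuming NumPy and scikit-learn, with the silhouette index as the validity criterion described later; function names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def embed_and_cluster(H, k):
    """Embed via the k smallest eigenvectors of a symmetric matrix H, then run k-means."""
    _, eigvecs = np.linalg.eigh(H)
    U_k = eigvecs[:, :k]
    return U_k, KMeans(n_clusters=k, n_init=10).fit_predict(U_k)

def best_synergy_matrix(G, L, k, alpha=0.1):
    """Scan beta over [0, 1] and keep the convex combination of Eq. (31) with the best silhouette."""
    best_score, best_H = -np.inf, None
    for beta in np.arange(0.0, 1.0 + 1e-9, alpha):
        H = beta * G + (1.0 - beta) * L
        U_k, labels = embed_and_cluster(H, k)
        score = silhouette_score(U_k, labels)
        if score > best_score:
            best_score, best_H = score, H
    return best_H
```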

Some of the relevant properties of synergy matrix \(H_m\) are:

Property 1

\(H_m\) is a symmetric and positive semi-definite matrix.

Proof

\(H_m\) is a positive semi-definite matrix if and only if \(v^TH_mv\ge 0\) for all \(v\in {\mathbb {R}}^n\). From the properties of the graph Laplacian and the Gramian, it is evident that both L and G satisfy this condition. Therefore,

$$\begin{aligned} v^TH_mv=\beta v^TGv+(1-\beta )v^TLv\ge 0. \end{aligned}$$
(33)

In addition to that, since \(H_m\) is a summation of symmetric matrices, it is also symmetric. Hence, it is proved that \(H_m\) is a symmetric and positive semi-definite matrix.\(\square \)

Given Property 1, the remaining properties are its direct consequences.

Property 2

All the eigenvalues of \(H_m\) are real.

Property 3

All the eigenvalues of \(H_m\) are non-negative.

Recursive multi-kernel integration

After generating synergy matrices for all the views of the dataset, the next step is to integrate the information obtained from each of them. Before the integration step, however, the proposed approach requires these matrices to be arranged by their relative relevance for cluster discovery. Better views encode the cluster structure better and, as a consequence, yield better cluster validity indices. Therefore, the synergy matrices are sorted based on a cluster validity index, the silhouette index. Let \(\mathbf{H}=\{H_1,\dots , H_M\}\) be the set of synergy matrices of a dataset with M views, and let the sorted set be \(\mathbf{H}^{\prime }=\{^1H, \dots , ^MH\}\), where the superscript i denotes the relevance of the corresponding synergy matrix \(^iH\), with \(^1H\) being the most relevant. Additionally, let every \(^iU_k\) from the set \(\mathbf{U}=\{^1U_k, \dots , ^MU_k\}\) represent the basis of the eigenspace corresponding to the k smallest eigenvalues of matrix \(^iH\).

Next, a combination method is proposed that distills the cluster information from each synergy matrix one by one, in an iterative fashion, while subtly enriching the information coming from the more relevant matrices. Given the way the synergy matrices are constructed, it is their eigenspace bases that bring out the latent cluster structure of the corresponding views. The proposed method uses a recursive function to exploit this fact for both integration and enrichment of the relevant views of the dataset. The recursive formula can be written as:

$$\begin{aligned} \mathbf{k}_{\eta +1}:=\mathbf{k}_{\eta }\otimes {{\mathscr {N}}}(\mathbf{k}_{\eta },\,^{(\eta +1)}U_k), \quad \text {where } \mathbf{k}_1=\,^1H \text { and }\eta =1,\dots ,M-1. \end{aligned}$$
(34)

Here \(\mathbf{k}_{\eta }\) is called the accretive matrix of the \(\eta \)th recursive step. The non-commutative operator \(\otimes \) signifies the integration operation. That is, for \(A\in {\mathbb {R}}^{n\times n}\) with its k smallest eigenvectors in \(V\in {\mathbb {R}}^{n\times k}\), and a basis matrix \(U\in {\mathbb {R}}^{n\times k}\), the expression \(A\otimes U\) evaluates to an accretive matrix \(A^\prime \in {\mathbb {R}}^{n\times n}\) whose k smallest eigenvectors are given by \(V+U\). The other eigenvectors of A are irrelevant to this discussion. Let the basis of the eigenspace of \(A^\prime \) be known as the accretive basis and the associated subspace as the accretive subspace. Also, let the accretive basis corresponding to the k smallest eigenvectors of \(\mathbf{k}_{\eta }\) be given by \(\mathbf{b}_{\eta }\).

In addition, to enrich the relatively relevant views, the proposed method uses an orthogonalising-normalising function \({{\mathscr {N}}}(\cdot ,\cdot )\). To ensure the accumulation of only the essential cluster information, the proposed approach acquires the basis of the projection of the synergy-matrix eigenspace that is orthogonal to the accretive subspace at that recursive step. The idea is similar to the eigenspace updating for integrative clustering performed in Ref.18. This function does not normalise the synergy matrix per se; rather, it normalises the basis of the described projection subspace. The computation starts by instantiating \(\mathbf{k}_{1}=\text { }^1H\), so that \(\mathbf{b}_{1}\) becomes \(^1U_k\). At the (\(\eta +1\))th recursive step (\(\eta \in \{1,\dots ,M-1\}\)), one has the accretive matrix \(\mathbf{k}_{\eta }\) and the eigenspace basis \(^{(\eta +1)}U_k\) of the synergy matrix \(^{(\eta +1)}H\). Subsequently, the orthogonalising-normalising function \({{\mathscr {N}}}(\mathbf{k}_{\eta },^{(\eta +1)}U_k)\) renders the final basis matrix in four steps:

First, computing the basis \({\mathscr {P}}\) of the projection subspace, which is given by:

$$\begin{aligned} {\mathscr {P}}=\mathbf{b}_{\eta }{} \mathbf{b}_{\eta }^T\text { }^{(\eta +1)}U_k. \end{aligned}$$
(35)

Second, computing the residual component of the synergy matrix eigenspace \({\mathscr {Q}}\) which is given by subtracting the above-mentioned projected component from \(^{(\eta +1)}U_k\) as:

$$\begin{aligned} {\mathscr {Q}}=\text { }^{(\eta +1)}U_k-{\mathscr {P}}. \end{aligned}$$
(36)

In the third step, \({\mathscr {Q}}\) is subjected to Gram-Schmidt orthogonalisation to yield the basis \({\mathscr {R}}\). This basis cannot be integrated directly with the eigenspace of the accretive matrix; it first needs to be normalised on the basis of its relevance. So, the fourth step, normalisation, is performed as:

$$\begin{aligned} {{\mathscr {N}}}(\mathbf{k}_{\eta },\,^{(\eta +1)}U_k)=V, \quad \text {where } V=\left[ diag({\mathscr {R}} {\mathscr {R}}^T)^{-{\frac{1}{2}}}\,{\mathscr {R}}\right] ^{(\eta +1)} \end{aligned}$$
(37)

Here the notation \([\cdot ]\) denotes that the enclosed operations are performed element-wise. The resultant matrix V is called the orthogonalised-normalised basis matrix. At the end of the process, the final accretive matrix \(\mathbf{k}_{M}\) is obtained, whose first k eigenvectors, collected in the matrix \(\mathbf{b}_{M}\in {\mathbb {R}}^{n\times k}\), hold the cluster structure. Hence, performing k-means on the rows of \(\mathbf{b}_{M}\) returns the cluster labels for each sample. The proposed algorithm is described in Algorithm 1; a loose sketch of the integration step follows it.

Algorithm 1
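The sketch below gives one loose reading of the integration step, Eqs. (34)-(37) (assuming NumPy and scikit-learn; H_sorted is assumed to be the relevance-sorted list of synergy matrices, most relevant first, and QR factorisation stands in for Gram-Schmidt orthogonalisation):

```python
import numpy as np
from sklearn.cluster import KMeans

def smallest_eigvecs(H, k):
    """Eigenvectors of the k smallest eigenvalues of a symmetric matrix H."""
    _, eigvecs = np.linalg.eigh(H)
    return eigvecs[:, :k]

def risyng_integrate(H_sorted, k):
    """Recursively integrate relevance-sorted synergy matrices and return cluster labels."""
    b = smallest_eigvecs(H_sorted[0], k)         # accretive basis b_1 from the most relevant view
    for eta, H in enumerate(H_sorted[1:], start=1):
        U = smallest_eigvecs(H, k)               # eigenspace basis of the next synergy matrix
        P = b @ (b.T @ U)                        # projection onto the accretive subspace, Eq. (35)
        Q = U - P                                # residual component, Eq. (36)
        R, _ = np.linalg.qr(Q)                   # Gram-Schmidt orthogonalisation (via QR)
        scale = 1.0 / np.sqrt(np.diag(R @ R.T))  # row-wise normalisation factors
        V = (scale[:, None] * R) ** (eta + 1)    # element-wise normalisation, Eq. (37)
        b = b + V                                # accretive basis update, Eq. (34)
    return KMeans(n_clusters=k, n_init=10).fit_predict(b)
```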

Computational complexity

For the proposed algorithm, given M similarity matrices and Gramian matrices with n samples under study, the computation starts with constructing the degree matrix \(D_m\) for each of the M views; the complexity of this step is bounded by \(O(n^2)\) per view. In the next step, the Laplacian matrix is constructed with a complexity of \(O(n^3)\). Let the number of iterations (regulated through the parameter \(\beta \)) needed to learn the best composition of the synergy matrix in steps 12 to 16 be \(t_\beta \); for the datasets used in this study, \(t_{\beta }=10\) suffices. Iterating \(\beta \) from 0 to 1 with an increment of 0.1 per iteration can produce an optimal combination ratio for the representation matrices; the increment step is referred to as \(\alpha \) for consistency. Assuming \(t_{max}\) is the largest number of iterations taken by the k-means clustering algorithm, the complexity of the aforesaid steps becomes \(O(t_{\beta }n^3+t_{\beta }t_{max}nk^2+t_{\beta }n)\), where \(t_{\beta }n^3\) comes from the eigenvalue decomposition of the synergy matrix, \(t_{\beta }t_{max}nk^2\) from the k-means clustering step, and \(t_{\beta }n\) from the f-measure calculation. Therefore, the complexity of steps 12 to 16 is bounded by \(O(t_{\beta }n^3)\). Steps 17 to 19 perform the same processing at the optimal value of \(\beta \), and hence are also bounded by \(O(t_{\beta }n^3)\). Summing up all the steps from 9 to 20 for M views, the complexity \(O(Mn^2+Mn^3+Mt_{\beta }n^3)\) reduces to \(O(Mt_{\beta }n^3)\). Sorting can be done in \(O(M\log M)\). After that, the accretive basis is constructed as defined in the function INTEGRATE(\(\mathbf{b},\eta \)). Step 5 consists of the construction of \({\mathscr {P}}\), \({\mathscr {Q}}\), and the orthogonalised-normalised matrix V; its two matrix multiplications are bounded by \(O(n^2k)\), and the combined Gram-Schmidt orthogonalisation and normalisation step has a complexity of \(O(n^2)\). Therefore, step 5 has a complexity of \(O(n^2k)\). Step 6 is a matrix addition with complexity O(nk), which is dominated by step 5. Since the function runs \((M-1)\) times, the complexity of steps 21 to 23 becomes \(O(M\log M+Mn^2k)=O(Mn^2k)\). After the construction of the accretive basis, k-means is performed, which, as explained previously, has time complexity \(O(t_{max}nk^2)\). Considering everything, the overall complexity of RISynG comes out to be \(O(Mt_{\beta }n^3+Mn^2k+t_{max}nk^2) = O(Mt_{\beta }n^3)\).

Significance of proposed algorithm

There are some aspects of the proposed algorithm that enhance its performance and distinguish it from other algorithms designed to identify cancer subtypes. Although each omic view in a cancer dataset has its own distinct cluster structure, cancer biology suggests that no single omics source can dictate the final cancer subtype alone. Instead, all the omics sources collectively manifest the cancer subtype in a sample. Therefore, multi-view integration is critical to sensible and clinically relevant clustering. The proposed approach can be broken down into three operative steps: (1) construction of representation matrices for each view, (2) construction of a synergy matrix for each view, and (3) construction of an accretive basis through recursive multi-kernel integration of the synergy matrices. These steps make the proposed algorithm more effective in the following ways:

  1. Construction of representation matrices To group the cancer patients into clusters, each omic view first has to be represented as a similarity graph. These similarity graphs can be interpreted through various representation matrices, like the Gramian, the Laplacian, and the adjacency matrix. Each representation matrix attributes the samples' similarity network with a notion of similarity between the samples. The proposed method assumes that multiple information sources may constitute different cluster manifolds when learned under a particular similarity measure. Therefore, the predicted clusters would be apparent and in strong concordance with the clinical clusters if pairwise sample similarity is computed in a data-dependent multi-kernel approach37. In some views, the correlation distance prominently reflected a cluster manifold that concurred with the natural clusters, whereas other views showed a proclivity towards the Euclidean distance, and the rest seemed to accommodate both. All things considered, two different graph representation matrices have been formulated, the Gramian matrix and the Laplacian matrix, each with a different measure of similarity.

  2. Construction of synergy matrices The representation matrices so constructed have two noteworthy aspects: (1) \(G_m\) represents a similarity graph formed using a correlation-based distance. In a correlation-based distance, two objects are considered similar if the trends among their elements are highly correlated; the correlation distance between two perfectly correlated samples is 0 even if they are far apart in the Euclidean space of their dimension. It is instinctive to assume that omics data behave like that. (2) The Laplacian, on the other hand, preserves the intrinsic manifold structure of the data cast onto a low-dimensional embedding space. To integrate these representation matrices, a combination function has been devised that takes a convex combination of both matrices. This way of combining matrices rectifies any bias created by the dissimilarity in the distance measures used while constructing the similarity graphs. The combination function defined in (31) utilises the parameter \(\beta \in [0,1]\) to capture the graphs constituted by the Gramian and the Laplacian, making the combination a convex combination of the representation matrices. The optimal value of this parameter is learnt by iterating it from 0 to 1 at some incremental step size \(\alpha \in (0,1)\). The datasets used in this study tend to pick up the optimal value of \(\beta \) at a step size of \(\alpha =0.1\). It is crucial to choose the incremental step size wisely, as the number of iterations \(t_{\beta }\) is directly proportional to the algorithm's time complexity. Because the synergy matrix ultimately affects the cluster assignment, the best way to evaluate the appropriate value of \(\beta \) is to perform a provisional cluster validity test on the synergy matrix constructed with that \(\beta \), using a cluster validity index like the silhouette index. Steps 15 to 19 of Algorithm 1 formulate the described provisional cluster validity test using the silhouette index as the criterion.

  3. Construction of accretive basis After the similarity between the cancer patients is captured in a refined form with the help of the synergy matrices, the next step is to integrate them. Property 1 of the synergy matrix proves that \(H_m\) is a positive semi-definite matrix, which makes the integration of synergy matrices a multi-kernel integration. The proposed algorithm does this through recursive multi-kernel integration, iteratively integrating the relevant subspace of each synergy matrix. Here, the relevant subspace refers to the subspace of the matrix that purely encodes the cluster information, which in the case of a synergy matrix is its eigenspace corresponding to k eigenvalues. Finally, an accretive basis matrix is generated. This accretive matrix is required to carry more cluster information from the relevant views. Therefore, the orthogonalising-normalising function is designed so that, at each recursive step, the accretive basis is less influenced by the less relevant matrices.

Description of datasets

For analysing the efficiency of the proposed algorithm in identifying cancer subtypes, it is applied to five cancer datasets taken from TCGA (https://cancergenome.nih.gov/). The datasets used are Cervical cancer (CESC), Breast cancer (BRCA), Ovarian cancer (OV), Lower-grade glioma (LGG), and Stomach cancer (STAD). Different studies have identified 4 clinically important subtypes for BRCA9 and STAD38, 3 for CESC39 and LGG40, and 2 for OV41. The cancer genome is neither simple nor independent; it is complicated and dysregulated at multiple levels of the biological system: genomic, epigenomic, transcriptomic, and proteomic42. miRNA, as one of the important regulators of gene expression, can be integrated with gene expression to identify selective inhibition of translation or selective degradation43,44,45. Furthermore, in terms of epigenetic regulation, histone modification and DNA methylation can serve to regulate gene expression in cancer46,47. Protein expression data can also be utilized for the diagnostic prognosis of cancer patients48. Therefore, four omic views, namely gene expression (mRNA), microRNA expression (miRNA), DNA methylation (metDNA), and reverse-phase protein arrays (RPPA), are utilized for the CESC, BRCA, and LGG datasets. For the STAD and OV datasets, only mRNA and miRNA expression are considered, because metDNA and RPPA information is not available for most samples. To avoid involving features with too many missing values, features with more than 5% missing values are removed from all of the omic views, and the remaining missing values are replaced with 0. Sequence-based expression data are log-transformed to make the data approximately normally distributed49; the 0 entries in the miRNA and mRNA expression data are therefore replaced with 1 before log-transforming with base 10. For the metDNA datasets, beta values are considered. Finally, variance filtering is applied to the mRNA and metDNA omic views of all cancer datasets, and only the 2000 most variable genes and CpG locations are retained. Table 1 describes the final processed data used in this study. The selected datasets cover a wide range of sample sizes, from 124 in CESC to 474 in OV. TCGA contains several platforms for individual data types; the platforms having the largest number of matching samples across the omics are selected in the present study. The proposed algorithm can be applied to other large-scale multi-omics datasets if available; the run time increases with the sample size and the number of omic views, as shown in Fig. 2. With the increase in sample size from 124 to 474, the runtime increases from 0.22 to 0.47 s. Even though the BRCA dataset has fewer samples (398) than the OV dataset (474), the runtime for BRCA (0.56 s) is higher than for OV (0.47 s) because of the number of omic views involved: 4 for BRCA versus 2 for OV.

Table 1 Datasets description.
Figure 2

Effect of sample size and number of omic-views on the runtime of the proposed algorithm. Values in the parentheses indicate the number of omic-views.
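The preprocessing described above can be sketched as follows (assuming pandas with samples as rows and features as columns; preprocess_expression is an illustrative name, and the metDNA beta-value views would skip the log transform):

```python
import numpy as np
import pandas as pd

def preprocess_expression(df, top_k=2000):
    """Filter missing-heavy features, log-transform, and keep the most variable features."""
    df = df.loc[:, df.isna().mean() <= 0.05]  # drop features with more than 5% missing values
    df = df.fillna(0.0)                       # impute the remaining missing values with 0
    df = np.log10(df.replace(0.0, 1.0))       # replace 0 with 1, then log-transform with base 10
    top = df.var().nlargest(top_k).index      # variance filtering: keep the most variable features
    return df[top]
```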

Experimental results and discussion

The performance of the proposed approach is compared with eleven other algorithms available for cancer subtype identification. Both two-stage clustering approaches and integrative clustering approaches are considered for method comparison. The methods used for comparison are Similarity Network Fusion (SNF)13, Weighted Multi-View Low Rank Representation (WMLRR)50, Consensus Clustering (CC)6,51, Multi-view clustering approach with enhanced consensus (ECMC)52, SNF.CC (SNF merged with CC)53, Cluster of Cluster Assignment (COCA)9,54, Consensus Non-negative Matrix Factorization (CNMF)55, Selective Update of Relevant Eigenspaces (SURE)18, Convex-combination of Approximate Laplacians (CoALa)19, iCluster14, and Multi-manifold Integrative Clustering (MiMIC)56.

Performance analysis on multi-omics cancer datasets

The proposed approach and the above-described methods are applied to five cancer datasets, namely CESC, BRCA, OV, LGG, and STAD, taken from TCGA. The sample clusters identified by these methods are evaluated based on several internal and external cluster evaluation indices. The cancer subtypes identified by these methods are also evaluated for their biological relevance. Next, the detailed comparative analysis of the proposed algorithm is discussed.

Cluster evaluation

The clusters (cancer subtypes) generated by all the methods are evaluated based on several internal and external cluster evaluation indices. These indices indicate how well a method groups the samples into homogeneous clusters: samples belonging to the same cluster should be highly similar, representing a cancer subtype, whereas samples belonging to different clusters should be highly dissimilar. How well an algorithm captures the natural grouping present in the data can be quantified with internal validity indices. The following four internal evaluation indices are calculated in this study (a sketch of their computation is given after the list). Table 3 presents the internal evaluation indices for every method.

  1. Silhouette Index: It measures the consistency within the clusters. The value lies in the range \([-1,1]\): a value nearer to +1 indicates a higher distance between the clusters, a value of 0 indicates that the sample is very close to the boundary between two neighboring clusters, and a negative value indicates misclassification57.

    $$\begin{aligned} {\mathbb {S}}_c = \frac{1}{c} \sum _{k=1}^{c}S(\Upsilon _k), \end{aligned}$$
    (38)

    where \(S(\Upsilon _k)\) represents the silhouette width of the obtained cluster \(\Upsilon _k\) \((k=1, \ldots ,c)\), calculated as \(S(\Upsilon _k)=\frac{1}{n_k}\sum _{x_i\in \Upsilon _k}^{}s(x_i)\), where \(n_k\) is the cardinality of \(\Upsilon _k\) and \(s(x_i)\) is the silhouette width of sample \(x_i\), estimated as \(s(x_i)=\frac{b(i)-a(i)}{max\{a(i),b(i)\}}\). Here, \(a(i)\) is the average dissimilarity of the \(i_{th}\) object to all other objects in the same cluster, and \(b(i)\) is the average dissimilarity of the \(i_{th}\) object to all objects in the closest cluster.

  2. Dunn Index: A higher value represents a better clustering solution58. It is defined as:

    $$\begin{aligned} DI = \underset{1\le i \le c}{{\text {min}}} \Big \{ \underset{1\le j \le c,\, j\ne i}{{\text {min}}} \Big \{ {\frac{\delta (C_i,C_j)}{\underset{1\le k \le c}{{\text {max}}} \small \{\Delta (C_k)\}}} \Big \}\Big \} \end{aligned}$$
    (39)

    Here, \(\delta (C_i,C_j)\) is the distance between clusters \(C_i\) and \(C_j\), and \(\Delta (C_k)\) is the intra-cluster distance within cluster \(C_k\).

  3. Davies-Bouldin Index: It is defined as the ratio of within-cluster dispersion to between-cluster dispersion59. A lower value indicates better clustering.

    $$\begin{aligned} DB = \frac{1}{C} \sum _{i=1}^{C} (D_i) \end{aligned}$$
    (40)

    Here, \( D_{i} = \max _{{j \ne i}} R_{{i,j}} \) and \(R_{i,j} = \frac{S_i+S_j}{M_{ij}}\), where \(M_{i,j}\) is the separation between the ith and the jth clusters, \(S_i\) and \(S_j\) are the within-cluster scatters of clusters i and j, and C is the number of clusters.

  4. Xie-Beni Index: For crisp clustering, the index is estimated as:

    $$\begin{aligned} \text {Xie-Beni} = \frac{1}{N} \frac{WGSS}{\underset{k<k^{\prime }}{{\text {min}}}\; \acute{\delta }(C_k,C_{k^{\prime }})^2} \end{aligned}$$
    (41)

    Here, \(\frac{1}{N} {WGSS}\) represents the average squared distance of all the points to the barycenter of the cluster they belong to, and \(\acute{\delta }\) is a measure of the between-cluster distance60.
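As referenced above, the internal indices can be computed along these lines (a sketch assuming scikit-learn for the silhouette and Davies-Bouldin indices, and a direct NumPy/SciPy implementation of the Dunn index):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_score, davies_bouldin_score

def dunn_index(X, labels):
    """Minimum between-cluster distance divided by maximum cluster diameter, Eq. (39)."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    max_diameter = max(cdist(c, c).max() for c in clusters)
    min_separation = min(cdist(a, b).min()
                         for i, a in enumerate(clusters)
                         for b in clusters[i + 1:])
    return min_separation / max_diameter

# silhouette = silhouette_score(X, labels)
# davies_bouldin = davies_bouldin_score(X, labels)
```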

The class distribution of the cancer datasets used in this study is presented in Table 2. Except for the CESC dataset, all the cancers have imbalanced classes. When clustering is applied to such datasets, there is a chance that most of the samples are grouped into one cluster, leading to good values of the internal indices even though the clustering is, in reality, not efficient. If the ground truth is available, the partitions created in such imbalanced data can be evaluated more reliably with external evaluation indices. In this study, five external evaluation indices are calculated to compare the clustering efficiency of the different algorithms (a sketch of their computation follows the list). Considering a set of n objects \({{\mathbb {X}}}=\{{{\mathscr {X}}}_1, {{\mathscr {X}}}_2, \ldots ,{{\mathscr {X}}}_n\}\), suppose \({{\mathbb {C}}}=\{{{\mathscr {C}}}_1, {{\mathscr {C}}}_2, \ldots ,{{\mathscr {C}}}_R\}\) represents a partition of \({{\mathbb {X}}}\) obtained by a clustering algorithm and \({{\mathbb {K}}}=\{{{\mathscr {K}}}_1, {{\mathscr {K}}}_2,\ldots ,{{\mathscr {K}}}_C\}\) represents the ground truth or class information. A contingency table captures the overlap between the clustering result and the ground truth, where \(n_{ij}=|{{\mathbb {C}}}_{i}\cap {{\mathbb {K}}}_{j}|\) is the number of elements common to cluster \({{\mathbb {C}}}_{i}\) and class \({{\mathbb {K}}}_{j}\), \(n_i\) is the number of elements in \( {{\mathbb {C}}}_{i}\), and \(n_{j}\) is the number of elements in \({{\mathbb {K}}}_{j}\). The external indices are defined as:

  1. F-measure (FM): The ideas of precision and recall from information retrieval are merged to obtain FM. It disregards the unmatched portions of the clusters. It attains values between 0 and 1; a value nearer to 1 represents better clustering61.

    $$\begin{aligned} FM = \sum _{j=1}^{C} \frac{n_j}{n} \, \underset{1\le i\le R}{{\text {max}}}\, \left[ \frac{2 \times \frac{n_{ij}}{n_i} \times \frac{n_{ij}}{n_j}}{\frac{n_{ij}}{n_i}+\frac{n_{ij}}{n_j}}\right] \end{aligned}$$
    (42)
  2. Adjusted Rand Index (ARI): A commonly used variation of the Rand index that takes into account agreements arising by chance under a hypergeometric distribution. The lower bound of ARI, \(-k\), depends on the exact data partitioning62. The closer the value of ARI to 1, the better the clustering.

    $$\begin{aligned} ARI = \frac{\sum _{i=1}^{R} \sum _{j=1}^{C} \left( \begin{array}{c} n_{ij} \\ 2 \end{array}\right) - {\left( \begin{array}{c} n \\ 2 \end{array} \right) }^ {-1} \sum _{i=1}^{R} \left( \begin{array}{c} n_{i} \\ 2 \end{array} \right) \sum _{j=1}^{C} \left( \begin{array}{c} n_{j} \\ 2 \end{array} \right) }{\frac{1}{2} \left[ \sum _{i=1}^{R} \left( \begin{array}{c} n_{i} \\ 2 \end{array}\right) + \sum _{j=1}^{C} \left( \begin{array}{c} n_{j} \\ 2 \end{array} \right) \right] - \left( \begin{array}{c} n \\ 2 \end{array} \right) ^{-1} \sum _{i=1}^{R} \left( \begin{array}{c} n_{i} \\ 2 \end{array}\right) \sum _{j=1}^{C} \left( \begin{array}{c} n_{j} \\ 2 \end{array} \right) } \end{aligned}$$
    (43)
  3. Normalized Mutual Information (NMI): It quantifies the mutual dependence between the clustering solution and the class information. It is estimated as:

    $$\begin{aligned} NMI({\mathbb {C}},{\mathbb {K}})=\frac{{\mathscr {I}}({\mathbb {C}},{\mathbb {K}})}{[{\mathscr {H}}({\mathbb {C}})+{\mathscr {H}}({\mathbb {K}})]/2} \end{aligned}$$
    (44)

    Here, \({\mathscr {I}}\) is the mutual information and \({\mathscr {H}}\) is the entropy. The value ranges from 0 to 1; a value nearer to 1 means better clustering63.

  4. Jaccard Index: It measures the similarity between two sets, here the clustering solution and the class information. It is defined as:

    $$\begin{aligned} J({\mathbb {C}},{\mathbb {K}})= \frac{|{\mathbb {C}} \cap {\mathbb {K}}|}{|{\mathbb {C}} \cup {\mathbb {K}}|} \end{aligned}$$
    (45)

    The higher the value of this index, the better the clustering.

  5. Purity: To estimate Purity, each cluster is first allocated to the class that occurs most frequently in it. The accuracy of this cluster-class allocation is then obtained by dividing the number of correctly assigned objects by the total number of objects63. The equation for calculating Purity is:

    $$\begin{aligned} Purity({\mathbb {C}},{\mathbb {K}})=\frac{1}{n}\sum _{i}\max _{j}|C_i \cap K_j| \end{aligned}$$
    (46)

    Purity ranges from 0 to 1; the closer the value is to 1, the better the clustering.
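As referenced above, most of these external indices are available off the shelf, and Purity reduces to a few lines over the contingency table (a sketch assuming scikit-learn):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity(labels_true, labels_pred):
    """Allocate each cluster to its majority class and compute the overall accuracy, Eq. (46)."""
    cm = contingency_matrix(labels_true, labels_pred)  # cm[j, i] = |C_i intersect K_j|
    return cm.max(axis=0).sum() / cm.sum()

# ari = adjusted_rand_score(labels_true, labels_pred)
# nmi = normalized_mutual_info_score(labels_true, labels_pred)
```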

Based on these five external evaluation indices, it is observed that the proposed algorithm outperforms the others on the CESC, BRCA, LGG, and STAD datasets. OV is the only case where the proposed approach does not perform as well. When all the datasets are considered together to rank the clustering efficiency of all the algorithms under study across all the external indices, the proposed method stands first, attaining the maximum value 20 times out of 25. The execution times reported in Table 3 show that RISynG is faster than the other algorithms.

Table 2 Cancer subtypes description: actual class distribution.
Table 3 Comparative cluster analysis of proposed and existing approaches.

Importance of multi-omics data integration

The proposed algorithm RISynG iteratively integrates the relevant subspace of each of the synergy matrices, that is, the k eigenvectors of the synergy matrices that hold the cluster structure. To exhibit the significance of this iterative integration and the effectiveness of RISynG, it is compared with spectral clustering performed on the individual omics datasets. The results presented in Table 4 show that the proposed algorithm outperforms the individual omic views on the CESC, BRCA, LGG, and STAD datasets for all the external cluster validity indices. On the OV dataset, RISynG performs best on F-measure, Jaccard, and Purity, while the miRNA view performs better on the ARI and NMI indices. The performance of RISynG is significantly higher than that of the best individual view in the case of the CESC, BRCA, and LGG datasets, irrespective of the index.

To express the cluster-holding capacity of the integrated subspace obtained by the proposed approach, scatter plots of the best k dimensions are plotted, with colours indicating the ground truth (cancer subtypes). Comparative plots in Figs. 3, 4, 5, 6, and 7 show that the integrated subspace obtained by RISynG is more informative than those of other subspace-based integrative-clustering approaches (SNF, SURE, CoALa, iCluster, WMLRR, and MiMIC) for most of the datasets. A comparison with the best individual omic view (CESC: mRNA, BRCA: metDNA, OV: miRNA, LGG: metDNA, and STAD: miRNA) is also presented to establish the significance of the multi-omics data integration performed by the proposed approach. For the proposed approach, the scatter plots show that the clusters are well separated in the case of the CESC (Fig. 3) and LGG (Fig. 6) datasets. There is a slight overlap between two groups in BRCA (Fig. 4), but the separation is still better than with the other methods. For the OV (Fig. 5) and STAD (Fig. 7) datasets, overlap between subtypes is observed in the subspaces obtained by all the methods.

Table 4 Comparative performance analysis of proposed approach and individual omic-view.
Figure 3 Comparative analysis of different integrative sub-spaces for CESC dataset.

Figure 4 Comparative analysis of different integrative sub-spaces for BRCA dataset.

Figure 5 Comparative analysis of different integrative sub-spaces for OV dataset.

Figure 6 Comparative analysis of different integrative sub-spaces for LGG dataset.

Figure 7 Comparative analysis of different integrative sub-spaces for STAD dataset.

Biological analysis

Once the cancer subtypes are obtained, the molecular characteristics of the patient clusters are also evaluated to establish their biological relevance. To understand the varying expression of different biomarkers across subtypes, differential expression analysis (DEA) of miRNAs and mRNAs is performed between the correctly identified groups of patients. A comparative analysis is performed between the true positives and true negatives obtained by all the algorithms. As there are three subtypes in the LGG and CESC datasets, DEA is performed between three pairs (considering all possible pairs). Similarly, for the STAD and BRCA datasets, which have four subtypes, DEA is performed for six pairs; for the OV dataset, which has two subtypes, DEA is performed for one pair. The R package Limma64 is used to perform DEA. miRNAs and mRNAs having a Benjamini-Hochberg false discovery rate adjusted p-value \(< 0.05\) are considered differentially expressed. The numbers of differentially expressed biomarkers obtained from the different groups in the CESC, BRCA, OV, LGG, and STAD datasets are reported in Tables 5, 6, 7, 8, and 9, respectively. To further explore the biological knowledge and process-specific functioning of the identified sets of differentially expressed biomarkers, different types of enrichment analyses are also performed, considering the hundred most differentially expressed biomarkers in each case.
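The DEA itself is run in R with Limma; purely to illustrate the selection criterion, the Python sketch below applies Benjamini-Hochberg adjustment at a 0.05 false discovery rate to a hypothetical vector of per-biomarker p-values.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

p_values = np.array([1e-4, 0.003, 0.02, 0.04, 0.3, 0.8])  # hypothetical DEA p-values
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

# biomarkers whose BH-adjusted p-value falls below 0.05 are
# treated as differentially expressed
de_biomarkers = np.where(reject)[0]
```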

Biological enrichment analyses

The first analysis is Pathway enrichment analysis (PEA). It explores the mechanistic insight into the set of differentially expressed biomarkers by identifying the biological pathways that are enriched in the set more than expected by chance. The second is Biological process enrichment analysis (BPEA). It characterizes the relationship between genes or miRNAs by annotating them to associated biological processes, identifying the over-represented processes in a list and thereby helping evaluate the biological significance of the obtained cancer subtypes. The third is Disease ontology enrichment analysis (DOEA). Disease Ontology (DO) helps map the cancer subtypes identified from high-throughput data to clinical relevance. In this study, the R package clusterProfiler65 and DIANA Tools mirPath v.366 are used for performing PEA and BPEA for genes and miRNAs, respectively, and the R package DOSE67 is used to perform DOEA for the genes. The top 100 differentially expressed biomarkers are passed to these tools; if fewer than 100 biomarkers are differentially expressed in a case, all of them are used. The KEGG database is selected for PEA68. Only the pathway terms associated with the set of biomarkers at a false discovery rate adjusted p-value \(< 0.05\) (significant pathway terms) are considered. If a differentially expressed biomarker set is not associated with any significant KEGG pathway term, that set is regarded as not biologically relevant with respect to KEGG pathway terms. Similarly, only the biological process (BP) terms associated with the set of biomarkers at a false discovery rate adjusted p-value \(< 0.05\) (significant BP terms) are considered; a differentially expressed biomarker set not associated with any significant BP term is regarded as not biologically relevant with respect to BP terms. In DOEA, semantic similarities between DO terms and genes are calculated, which helps explore the similarities of diseases and gene functions from a disease perspective. The output of DOEA is a set of associated disease terms. A gene set is said to be enriched with DO terms if the terms obtained by its DOEA have a false discovery rate corrected p-value \(< 0.05\).

For the quantification of PEA, BPEA, and DOEA, the respective enrichment scores69 and annotation ratios69 are calculated. The higher the value of these scores, the better the enrichment; hence, the more biologically significant the differentially expressed biomarkers, the better the cancer sub-typing. The equations for these scores are:

$$\begin{aligned} BPES= & {} \frac{1}{T} \sum _{t=1}^{T}-\log _{10}(p\text {-value}_{t}), \end{aligned}$$
(47)
$$\begin{aligned} AR= & {} \frac{1}{T \times G} \sum _{i=1}^{T}g_{i}. \end{aligned}$$
(48)

Here, T denotes the number of significant pathway/BP/DO terms associated with a set of differentially expressed genes or miRNAs between two cancer subtypes identified by any clustering approach. G denotes the total number of genes given to clusterProfiler for the enrichment analysis, and \(g_{i}\) denotes the gene count associated with the i-th pathway/BP/DO term. A comparative analysis of the cancer subtypes obtained by the proposed approach and the other existing algorithms is performed, and the associated quantitative indices are reported in Tables 5, 6, 7, 8, and 9. Some of the differentially expressed miRNA or mRNA sets have no associated significant terms; therefore, the quantitative indices cannot be calculated for them. Also, in some cases, there are no differentially expressed biomarkers at all. All these cases are represented by \(*\) in the tables.
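Given the per-term p-values and gene counts returned by the enrichment tool, both scores reduce to simple averages; the sketch below is a direct transcription of Eqs. (47) and (48), with `p_values`, `gene_counts`, and `G` as hypothetical inputs.

```python
import numpy as np

def enrichment_scores(p_values, gene_counts, G):
    # BPES (Eq. 47): mean -log10 p-value over the T significant terms.
    # AR (Eq. 48): total gene count over the terms, normalised by T * G.
    p_values = np.asarray(p_values, dtype=float)
    gene_counts = np.asarray(gene_counts, dtype=float)
    T = len(p_values)
    bpes = -np.log10(p_values).mean()
    ar = gene_counts.sum() / (T * G)
    return bpes, ar

# hypothetical: three significant terms from 100 input genes
bpes, ar = enrichment_scores([1e-4, 5e-3, 0.02], [12, 8, 5], G=100)
```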

To compare the effectiveness of the proposed approach with the other algorithms in this study, the overall performance of all the methods is also evaluated. When all five cancer datasets are considered together, the proposed approach outperforms the others with respect to both the cluster evaluation indices and the biological enrichment analysis, as shown in Fig. 8. The analysis is performed using the success frequency, that is, the number of times a method scored the highest value for the respective indices across all the cases in all the cancer types. The success frequency shows that the proposed approach leads on the cluster validity indices by scoring the maximum value 21 times, followed by SNF.CC (7), SNF (6), CNMF (5), CC (2), COCA (2), and WMLRR (1). Similarly, when the methods are ranked by success frequency for the quantitative indices calculated for the biological enrichment analysis, the proposed approach again stands first by scoring the maximum value 67 times, followed by SNF (21), SNF.CC (20), CC (12), CoALa (10), CNMF (9), MiMIC (7), SURE (5), WMLRR (5), COCA (4), and iCluster (1). When the cluster validity indices are considered individually, the proposed approach also outperforms with respect to F-measure, ARI, NMI, Jaccard index, and Purity. Considering the indices for biological enrichment individually, the proposed algorithm again outperforms with respect to all the indices except the AR corresponding to BPES for mRNA enrichment, where it stands second.

Figure 8 Method comparison.

Table 5 Comparative biological analysis of CESC dataset.
Table 6 Comparative biological analysis of BRCA dataset.
Table 7 Comparative biological analysis of OV dataset.
Table 8 Comparative biological analysis of LGG dataset.
Table 9 Comparative biological analysis of STAD dataset.

Overlap analysis

The hundred most differentially expressed genes between all the subtype pairs in cervical cancer identified by RISynG and the other methods are explored further for experimental support. The genes are analyzed based on their degree of overlap with experimentally validated cervical cancer genes. The Cervical Cancer Gene Database (CCDB)70, a manually curated catalog of experimentally validated genes involved in the different stages of cervical carcinogenesis, is used for finding the overlap. All the up-regulated and down-regulated genes in cervical cancer with evidence from the published literature available in CCDB are considered for this analysis. CCDB reports 367 genes that are differentially expressed in cervical cancer; 185 of these are among the 2000 genes used for cancer subtype identification in this study. The statistical significance of the overlap analysis is reported in Table 10. In total, 30 of the 222 genes identified by the proposed approach overlap with cervical cancer-related genes, which is the maximum overlap among the compared methods. Fisher's exact test is used to assess the statistical significance of the contingency table created from the overlap analysis in Table 10 for the different algorithms, as sketched below. At 95% confidence, only the genes identified by the proposed approach have a significant overlap with the experimentally validated genes curated from the literature, with a p-value of 0.026. This indicates that the proposed approach has the potential to identify clinically important cancer subtypes that have a characteristic molecular signature.
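For reproducibility, the test can be set up as a standard two-by-two Fisher's exact test; the sketch below uses the counts reported above and assumes the conventional contingency-table layout, so the exact p-value depends on that assumption.

```python
from scipy.stats import fisher_exact

identified, overlap = 222, 30   # genes from the proposed approach; CCDB overlap
known, universe = 185, 2000     # CCDB genes among the 2000 genes considered

table = [
    [overlap, identified - overlap],
    [known - overlap, universe - identified - (known - overlap)],
]
odds_ratio, p_value = fisher_exact(table)  # two-sided by default
# p_value should be close to the reported 0.026 under this layout
```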

Table 10 Overlap with experimentally validated gene-list.

Conclusion

The present study describes a method named RISynG that efficiently identifies cancer subtypes. Cancer subtype identification can facilitate cancer diagnosis and therapy and is one of the vital components of the precision medicine framework. The main contributions of this study are: (1) the development of an integrative clustering method for multi-view omics data; (2) the demonstration of the effectiveness of the proposed method over existing methods; and (3) the establishment of the biological relevance of the obtained results.