Main

Protein-protein interactions (PPIs) are critical for most cellular processes. High-throughput PPI detection technologies have already identified large sets of PPIs for several organisms1 and are being applied to other organisms. Given the rapidly growing volume of public PPI data, analysis of PPI networks has become a major thrust in systems biology research. Analysis efforts to date can be roughly divided into two types, based on the size of the analyzed networks. The first type comprises large-scale studies of an entire species that involve hundreds of thousands of PPIs1, in which the main objective has been to identify network topological characteristics that have biological implications2. The second type comprises small-scale targeted studies of a single protein that identify tens of interactions to extend knowledge of that protein.

There is, however, a paucity of medium-size PPI network studies in which protein interactors for tens or hundreds of proteins are analyzed simultaneously. This scale is interesting because protein lists of this size are being generated or inferred from functional genomics, proteomics and metabolomics studies. Analysis of such networks could reveal interesting hypotheses to be investigated in biological settings, but this potential has not been fully exploited because of the lack of a platform to systematically query, filter, analyze, visualize and manage PPI networks of this scale. Current databases and approaches provide these features in a limited way. The majority of public PPI databases, such as IntAct3, MINT4 and BioGRID5, provide PPI records from literature curation or direct user submissions. Some databases including APID6, STRING7, MiMI8 and UniHI9 integrate information from these curated PPI databases to provide a more comprehensive set of public PPIs. In addition, the STRING database contains predicted PPIs based on functional associations including genomic context, co-expression data and text mining.

The protein interaction network analysis platform (PINA) is an integrated platform for protein interaction network construction, filtering, analysis, visualization and management (Fig. 1). It includes a quarterly updated, nonredundant database based on integration of data from six public PPI databases: IntAct3, MINT4, BioGRID5, DIP10, HPRD11 and MIPS MPact12. PINA has versatile querying capabilities to construct PPI networks, such as queries for single proteins, a list of proteins, a list of protein pairs or two lists of proteins (Supplementary Methods online). Additionally, the query functions for either a list of proteins or protein pairs have been provided as web services to facilitate automatic use of PINA by other programs. Together, these querying features enable more advanced PPI searches than any current PPI database. In addition, PINA provides protein annotations including protein domains and Gene Ontology (GO) terms, and interaction annotations including the experiment information and links to the primary PubMed articles. Generated networks can be downloaded in GraphML, MITAB13 or customized tab-delimited (TSV) formats. The latter two types of files can be loaded into Cytoscape14 to integrate PPIs, for example, with gene expression profiles.
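As an illustration of how a downloaded tab-delimited network might be consumed by another program, the following minimal Python sketch reads MITAB-like text into a set of undirected interactions. It assumes only that the first two columns hold the interactor identifiers in `database:accession` form, as in the PSI-MI tabular format; the exact columns of a PINA export are not verified here, and the function name and the accessions `A`, `B` and `C` are hypothetical placeholders.

```python
import io

def read_mitab_edges(handle):
    """Return a set of undirected edges from MITAB-like, tab-delimited text.

    Assumption: columns 1 and 2 hold interactor identifiers such as
    'uniprotkb:P12345'; all further columns are ignored in this sketch.
    """
    edges = set()
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a usable interaction row
        # Drop the 'database:' prefix if present
        a = fields[0].split(":", 1)[-1]
        b = fields[1].split(":", 1)[-1]
        # frozenset makes A-B and B-A the same interaction
        edges.add(frozenset((a, b)))
    return edges

# Toy input with placeholder accessions (not real data)
toy_text = "uniprotkb:A\tuniprotkb:B\nuniprotkb:B\tuniprotkb:A\nuniprotkb:A\tuniprotkb:C\n"
toy_edges = read_mitab_edges(io.StringIO(toy_text))
```

Storing each interaction as an unordered pair removes one simple kind of redundancy (the same pair listed in both directions); it does not address redundancy arising from different accession numbers for the same protein.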

Figure 1: Schematic of the PINA platform.

Six public external PPI databases (top right, yellow) were integrated into the PINA database, which is accessed through query tools to construct networks annotated with data from externally collected protein annotation databases (middle right, yellow). The user space contains saved networks with comments, and networks are refined with filtering, analysis and visualization tools.

PPI networks from query results can be rapidly analyzed using PINA's built-in GO and graph theoretical tools. For instance, PINA includes a GO tool that uses semantic similarities between the annotated GO terms of an interacting protein pair to generate a confidence score for the interaction (Supplementary Methods). Additionally, a GO enrichment analysis tool can be used to identify significantly enriched GO terms of a PPI network. Graph theoretical tools can be used to either discover basic topology properties of a PPI network or identify topologically important proteins, such as hubs or bottlenecks, based on several centrality measures (Supplementary Methods). All built-in tools can also be executed in an interactive applet (Fig. 2a), which supports visualization and manipulation of a PPI network.
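To make the idea of GO term enrichment concrete, the sketch below computes an upper-tail hypergeometric P value, which is one common way to test whether a term is over-represented in a network. The statistical test that PINA itself uses is described in the Supplementary Methods and is not assumed here; the function name and the toy numbers are illustrative only, and no multiple-testing correction is applied.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """Probability of seeing at least k annotated proteins in a network.

    N: proteins in the background set
    K: background proteins annotated with the GO term
    n: proteins in the network
    k: network proteins annotated with the GO term
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Toy numbers: 10 background proteins, 5 annotated, network of 5, all 5 annotated
p_all_annotated = hypergeom_enrichment_p(10, 5, 5, 5)
# Requiring at least 0 annotated proteins covers every outcome
p_trivial = hypergeom_enrichment_p(10, 5, 5, 0)
```

In practice, a P value would be computed for every GO term annotated in the network and then adjusted for the number of terms tested.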

Figure 2: PINA-generated visualization of the PPI networks for LKB1 and its 14 substrate kinases.

(a) A screenshot of the initial network consisting of 203 interactions based on the query for LKB1 and its 14 substrate kinases (green nodes) acquired with the PINA visualization tool. Nodes indicate proteins and lines indicate interactions. (b) The recurated network has 64 interactions discarded and 31 interactions added (green lines). Yellow nodes indicate proteins that interact with at least two query proteins.

A central and unique feature in PINA is a 'user space' that allows users with a free registered account to save networks from query results, manually upload interaction data and comment on specific interactions. With these features, users can modify networks by manually adding or removing identified PPIs, or use filtering tools to accept or reject interactions to obtain more reliable networks. Saved networks can be used as input for analysis tools and published in PINA for open or restricted access.

We demonstrated the utility of PINA by analyzing two human PPI networks. LKB1 is a tumor suppressor kinase underlying hereditary Peutz-Jeghers polyposis/cancer syndrome and is frequently inactivated in sporadic non-small cell lung adenocarcinomas15. LKB1 also phosphorylates and activates 14 kinases, but it remains unresolved which if any of these are important in tumorigenesis. We investigated whether generating and analyzing a PPI network of LKB1 and its 14 substrate kinases could provide hypotheses on links and pathways critical for tumorigenesis following LKB1 deficiency. Using these 15 proteins, in one PINA query we generated an initial PPI network consisting of 131 proteins and 203 interactions (Fig. 2a). Based on the substantial number of unexpected interactions, we recurated all interactions from the original publications with notes added to the 'comment' field in PINA records. This resulted in the exclusion of 64 PPIs for several reasons, including a lack of any evidence for an interaction other than co-occurrence of gene/protein names or duplication owing to different UniProt accession numbers (Supplementary Table 1 and Supplementary Methods online). After removing these interactions using a filtering tool based on user comments in PINA, we generated a network consisting of 139 interactions. This network, however, lacked several established interactions, especially kinase-substrate interactions, which implies that these typically transient but very important interactions are underrepresented in PPI databases. After we manually added 31 interactions, the curated LKB1 PPI network contained 170 interactions (Fig. 2b), and is available online in 'shared networks' of PINA.
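The curation steps described above amount to set operations on the list of interactions. The sketch below is not PINA's filtering tool; it uses integer placeholders instead of real interactions and assumes that every rejected interaction was in the initial network and that every added interaction was new, which is what the reported counts imply.

```python
def curate(initial, rejected, added):
    """Remove rejected interactions, then add manually curated ones."""
    filtered = set(initial) - set(rejected)
    return filtered, filtered | set(added)

# Integer placeholders standing in for interactions (not real data)
initial_network = range(203)      # 203 interactions from the initial query
rejected_edges = range(64)        # 64 interactions excluded after recuration
added_edges = range(203, 234)     # 31 interactions added manually

filtered_network, curated_network = curate(initial_network, rejected_edges, added_edges)
filtered_count = len(filtered_network)   # 203 - 64
curated_count = len(curated_network)     # 139 + 31
```

With these placeholders the counts reproduce the numbers in the text: 139 interactions after filtering and 170 in the curated network.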

The LKB1 PPI network generated in PINA highlights interactors easily left unnoticed because of, for example, PPIs being reported in supplementary tables of large-scale experiments or incorporated only in one of the source databases of PINA. For instance, the interactions of CDC25C with both MARK3 and BRSK1 suggest that BRSK kinases may also function through stable kinase-substrate complexes involving 14-3-3 (refs. 16, 17). In addition, the generated network linked LKB1 and members of the TGFβ signaling pathway through NUAK2 (Fig. 2b; SMAD2, SMAD4 and TGFβR1). Decreased SMAD2 and TGFβ pathway activity was recently suggested to underlie LKB1 deficiency–mediated tumorigenesis18. Thus, the PPIs suggest that decreased LKB1 signaling could attenuate TGFβ signaling through NUAK2.

The antagonistic biological functions of transcription factors p53 and c-Jun19 could be due to competitive binding of limiting amounts of interacting proteins. Using PINA we rapidly identified, out of 447 interactors of either p53 or c-Jun (Supplementary Fig. 1a online), 39 proteins that interact with both. After recuration (Supplementary Table 1), 36 proteins were retained as common interacting proteins, and we then used these proteins in a single query to identify interactions among them. This revealed 123 additional PPIs, from which we combined 105 recurated interactions with the p53 and c-Jun interactions using the 'network operation' tool in PINA for further analysis (Supplementary Fig. 1b).
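The two steps in this analysis, finding proteins that interact with both query proteins and then collecting the interactions among those proteins, can be sketched with an adjacency map. This is an illustration rather than PINA's 'network operation' tool; the function names are hypothetical and the interactors `X1`, `X2` and `X3` are placeholders, not real proteins.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Map each protein to the set of its interaction partners."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def common_interactors(adj, p, q):
    """Proteins that interact with both p and q, excluding p and q themselves."""
    return (adj[p] & adj[q]) - {p, q}

def interactions_among(edges, proteins):
    """Interactions whose two partners are both in the given protein set."""
    return {(a, b) for a, b in edges if a in proteins and b in proteins}

# Toy network with placeholder interactors
toy_edges = [
    ("TP53", "X1"), ("JUN", "X1"),
    ("TP53", "X2"), ("JUN", "X2"),
    ("TP53", "X3"),
    ("X1", "X2"),
    ("TP53", "JUN"),
]
adj = build_adjacency(toy_edges)
shared = common_interactors(adj, "TP53", "JUN")
among_shared = interactions_among(toy_edges, shared)
```

In the toy network, `X1` and `X2` interact with both query proteins, and the single interaction between them is what a second query restricted to the shared interactors would return.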

The PINA GO enrichment tool revealed significantly enriched terms in the transcription category (Supplementary Fig. 2 online), which were directly related to the regulation of either DNA-binding or transactivation capacity of p53 and c-Jun. Moreover, a graph theoretical analysis tool in PINA ranked SMAD2, SP1, SMAD3 and ESR1 as the top four candidate proteins based on the eigenvector centrality measure (Supplementary Fig. 3 online). High centrality scores for SP1 and SMAD3 may be due to their requirement for c-Jun– and p53–mediated activation of transcription at least in certain conditions20. Taken together, PINA analysis of common interactors of p53 and c-Jun revealed several candidates that could represent a critical limiting factor for the transcriptional regulation of gene expression by either p53 or c-Jun. This analysis provides directions for additional experiments to characterize mechanisms by which these transcription factors antagonize each other's functions in the regulation of cell proliferation.
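Eigenvector centrality scores a protein highly when its interaction partners are themselves highly scored. The sketch below computes it by power iteration on a small toy network; it is a generic textbook approach, not PINA's implementation, and the node names are placeholders. Adding each node's own score at every step (iterating with the adjacency matrix plus the identity) is a choice made here to avoid oscillation on bipartite networks and does not change the resulting ranking.

```python
def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Power iteration on an undirected network given as {node: set of neighbours}.

    Assumes every neighbour also appears as a key of adj.
    """
    nodes = sorted(adj)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Own score plus the scores of all neighbours (A + I)
        new = {n: score[n] + sum(score[m] for m in adj[n]) for n in nodes}
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in nodes) < tol:
            return new
        score = new
    return score

# Toy star network: hub "H" interacts with three leaves
toy_adj = {"H": {"a", "b", "c"}, "a": {"H"}, "b": {"H"}, "c": {"H"}}
centrality = eigenvector_centrality(toy_adj)
top_node = max(centrality, key=centrality.get)
```

For the star network the hub receives the highest score, as expected; on a real network the scores would be sorted to obtain a ranked list of candidate proteins.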

PINA analysis of these two human PPI networks revealed several issues to be addressed in network analysis of PPIs from public databases. First, many such databases lack a substantial portion of PPIs, emphasizing the need to integrate multiple PPI databases. For example, in the LKB1 analysis, the 139 PPIs identified from the PPI databases originated from HPRD (118 PPIs), IntAct (92), BioGRID (63) and MINT (78), demonstrating that at least 15% of PPIs would have been missed using any single database. Also, each of these databases contained PPIs not found in any other database: HPRD (48 PPIs), IntAct (14), BioGRID (11) and MINT (1). HPRD has better coverage in this case because it is dedicated to human PPIs, whereas MIPS MPact is not listed here as it only includes yeast data.
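The 15% figure follows directly from the reported counts: the best-covered database, HPRD, contains 118 of the 139 PPIs, so 21 of 139 (about 15.1%) are absent from it, and every other database misses more. The sketch below reproduces this arithmetic and shows how per-database and database-specific counts could be derived from a mapping of interactions to source databases; the `coverage` function and the toy mapping are hypothetical.

```python
def coverage(edge_sources):
    """Given {interaction: set of source databases}, count per-database
    totals and interactions found in exactly one database."""
    total, unique = {}, {}
    for sources in edge_sources.values():
        for db in sources:
            total[db] = total.get(db, 0) + 1
        if len(sources) == 1:
            (only,) = sources
            unique[only] = unique.get(only, 0) + 1
    return total, unique

# Counts reported in the text for the 139 database-derived LKB1 PPIs
reported_total = 139
reported_found = {"HPRD": 118, "IntAct": 92, "BioGRID": 63, "MINT": 78}
missed_fraction = {db: 1 - n / reported_total for db, n in reported_found.items()}
best_db = min(missed_fraction, key=missed_fraction.get)

# Toy mapping with placeholder interaction names
toy_sources = {"e1": {"HPRD"}, "e2": {"HPRD", "IntAct"}, "e3": {"MINT"}}
toy_total, toy_unique = coverage(toy_sources)
```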

Our analysis also revealed that curated PPI databases contain some mistakes, for instance, through confusion of genes with similar alias names (Supplementary Table 1), demonstrating the importance of being able to manually remove interactions from a network. Similarly, the ability to add interactions is useful for inclusion of new or unrecorded PPIs (Fig. 2b). Manual improvement of the networks may considerably enhance the relevance of results generated by analysis tools in PINA.

To facilitate integrative analysis of PPI data, we encourage curators of current PPI databases to fully adopt the Proteomics Standards Initiative–Molecular Interaction (PSI-MI) standard for exchanging protein interactions13. For example, after network construction, it would be interesting to rank PPIs based on the interaction-detection methods by giving additional weights to PPIs identified by more than one method. Unfortunately some of the PPI databases have not fully adopted the PSI-MI standard in curating interaction detection methods, which limits the usefulness of this ranking strategy. In PINA, users can add and share such information as comments.
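The ranking strategy mentioned above can be sketched as follows, assuming that each record carries a consistently curated detection method, which is exactly what full adoption of the PSI-MI standard would provide. The function name, the placeholder proteins and the method labels in the toy records are illustrative; this is not a feature of PINA described in the text.

```python
from collections import defaultdict

def rank_by_method_support(records):
    """Rank interactions by the number of distinct detection methods.

    records: iterable of (protein_a, protein_b, detection_method) tuples.
    Returns a list of (interaction, methods) sorted by decreasing support.
    """
    methods = defaultdict(set)
    for a, b, method in records:
        methods[frozenset((a, b))].add(method)
    # Ties are broken alphabetically so the order is reproducible
    return sorted(methods.items(), key=lambda kv: (-len(kv[1]), sorted(kv[0])))

# Toy records with placeholder proteins and illustrative method labels
toy_records = [
    ("A", "B", "two hybrid"),
    ("B", "A", "coimmunoprecipitation"),
    ("A", "C", "two hybrid"),
    ("A", "C", "two hybrid"),
]
ranked = rank_by_method_support(toy_records)
```

Here the A-B interaction ranks first because it is supported by two distinct methods, whereas A-C is reported twice by the same method and counts only once.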

Our analysis also demonstrated the utility of PINA in translating fragmented knowledge in PPI databases to testable predictions. The application of PINA will accelerate analysis of PPI networks from the rapidly growing amount of public PPI data, and formulation of hypotheses of protein functions and cellular processes. PINA is freely available at http://csbi.ltdk.helsinki.fi/pina/.

Note: Supplementary information is available on the Nature Methods website.