Introduction

Studies of complex microbial communities have advanced considerably in recent years, due in part to methodological advances such as high-throughput DNA sequencing technologies that yield detailed information on the composition of microbial communities (Sogin et al., 2006). The sequence data are typically derived from sequencing a portion of the small-subunit rRNA gene (Pace, 1997) and a wide variety of techniques can be applied to the analysis of the sequence data in order to describe the composition of microbial communities, their diversity and how communities may change across space, time, or experimental treatments. However, most of the analytical techniques focus on single properties of the communities being studied. For example, studies describing and comparing the structure of microbial communities often focus on the total numbers of taxa or unique lineages found in individual samples (that is, alpha-diversity), the relative abundances of individual taxa or lineages and the extent of phylogenetic or taxonomic overlap between communities or community categories (that is, beta-diversity). Alpha-diversity measures (for example, richness and coverage estimators, rarefaction curves) yield estimates of microbial diversity and its limits in different environments (Hughes et al., 2001; Curtis et al., 2002; Sogin et al., 2006). Likewise, multivariate statistical techniques such as clustering and ordination have allowed microbial ecologists to describe beta-diversity patterns, revealing how biotic and abiotic variables control microbial community composition.
For example, analyses of beta-diversity patterns have revealed how microbial communities are structured across a wide range of natural habitats (Lozupone and Knight, 2007; Auguet et al., 2010; Barberán and Casamayor, 2010, 2011), the spatial and temporal variability of microbial communities on and in the human body (Fierer et al., 2008; Costello et al., 2009), and the factors structuring soil bacterial communities (Lauber et al., 2009). In contrast, there has been far less attention focused on using sequence data to explore the direct or indirect interactions between microbial taxa coexisting in environmental samples. Documenting these interactions between taxa (that is, co-occurrence patterns) across complex and diverse communities may help to ascertain the functional roles or environmental niches occupied by uncultured microorganisms (Ruan et al., 2006; Fuhrman and Steele, 2008; Chaffron et al., 2010). With the ever-increasing accumulation of sequence data from microbial communities, we now have the challenge to begin exploring these interactions, and to extend community analyses beyond the exploration of alpha- and beta-diversity patterns that represent the bulk of most sequence-based microbial community analyses.

Network analysis tools and network thinking (Proulx et al., 2005) have been widely used by biologists, mathematicians, social scientists, and computer scientists to explore interactions between entities, whether those entities are individuals in a school (Moody, 2001), species in a food web (Krause et al., 2003), nodes on a computer network (Pastor-Satorras and Vespignani, 2001), or proteins in metabolic pathways (Guimera and Amaral, 2005). Network analysis is used to explore the mathematical, statistical and structural properties of a set of items (nodes) and the connections between them (edges; Newman, 2003). With a few notable exceptions (for example, Ruan et al., 2006; Chaffron et al., 2010; Freilich et al., 2010), network analysis has not been widely applied to exploring co-occurrence patterns between microbial taxa in complex communities. To detect robust associations between microorganisms within and between habitats using network analysis, it is essential to have fairly detailed information on the microbial taxa found across relatively large numbers of samples, as without sufficiently large sample sets it will be difficult to determine whether or not co-occurrence patterns are statistically significant. Ideally, sample sets should cover spatial or temporal gradients in environmental conditions so that there is sufficient variability in taxon abundances to resolve co-occurrence patterns. As shown in recent studies that have used barcoded pyrosequencing to survey microbial communities in large numbers of samples (for example, Fierer et al., 2008; Costello et al., 2009; Galand et al., 2009; Lauber et al., 2009), it is now possible to generate microbial datasets that can take full advantage of network analysis approaches, and we can apply them to even highly diverse communities, like those found in soils, to explore co-occurrence patterns.

Network analysis of taxon co-occurrence patterns offers new insight into the structure of complex microbial communities, insight that complements and expands on the information provided by the more standard suite of analytical approaches. First, inter-taxa associations may help reveal the niche spaces shared by community members (even members of different domains of life, such as Bacteria and Archaea) or, perhaps, more direct symbioses between community members. Such information is particularly valuable in environments, such as soil, where the basic ecology and life history strategies of many microbial taxa remain unknown (Janssen, 2006). Exploring co-occurrence patterns between soil microorganisms can help identify potential biotic interactions, habitat affinities, or shared physiologies that could guide more focused studies or experimental settings. More generally, network analysis represents an approach for exploring and identifying patterns in large, complex datasets, patterns that may be more difficult to detect using the standard alpha/beta diversity metrics widely used in microbial ecology (Proulx et al., 2005).

Here we used network analyses to explore associations between prokaryotic taxa in soil, one of the most complex and taxon-rich microbial habitats on Earth. We analyzed over 160 000 bacterial and archaeal 16S rRNA gene sequences from 151 soil samples from a wide variety of ecosystem types in order to demonstrate the utility of network analyses and to address the following questions: (i) Do soil microorganisms tend to co-occur more than expected by chance? (ii) Can the lack of agreement between observed and random intra-phyla co-occurrence be used as a proxy of niche differentiation? and (iii) Which taxa are generalists (broadly distributed across soil habitats) or specialists (restricted to certain habitats but locally abundant), and how do these ecological categories shape network structure?

Materials and methods

Soil description and molecular methods

The dataset consisted of 151 soil samples distributed across North and South America, and Antarctica. The collected soils came from a broad range of ecosystems, climates and soil types (Supplementary Table 1). Soil collection protocol and methods for edaphic and environmental properties have been described previously (Fierer and Jackson, 2006; Bates et al., 2010).

Preparation of extracted DNA for pyrosequencing followed the protocol described in detail in Fierer et al. (2008) and Bates et al. (2010). In brief, a region of the 16S rRNA gene (250 nucleotides) was amplified with the primers F515 (5′-GTGCCAGCMGCCGCGGTAA-3′) and R806 (5′-GGACTACVSGGGTATCTAAT-3′), which should amplify nearly all bacteria and archaea with few biases against particular groups (Bates et al., 2010). The resulting barcoded PCR product was normalized in equimolar amounts and sequenced on a Roche GS-FLX 454 automated pyrosequencer (Roche Applied Science, Branford, CT, USA) at the Environmental Genomics Core Facility (Engencore) at the University of South Carolina.

Sequence processing

Raw sequence data generated from pyrosequencing were processed in QIIME (Caporaso et al., 2010). Briefly, sequences were quality trimmed and clustered into operational taxonomic units (OTUs) using a 90% identity threshold with uclust (Edgar, 2010). A 90% identity threshold, which corresponds approximately to the taxonomic level of Family for bacteria (Konstantinidis and Tiedje, 2007), was used to generate consistent OTUs with high abundances for subsequent analyses based on correlations and to circumvent potential taxonomic misclassifications due to sequencing anomalies. If we were to use the more standard ‘species-level’ OTU cutoff (97% sequence identity), the resulting OTU table would be far larger, making data visualization and analysis more difficult. At the 90% identity level, the final OTU table consisted of 160 469 sequences (average of 1063 sequences per sample) distributed into 4088 OTUs, of which 2798 were represented by more than one sequence. Taxonomic assignment was carried out with the RDP Classifier (Wang et al., 2007), and manually curated by BLAST searches against the GenBank non-redundant nucleotide database (nt).

Data analysis

Non-random co-occurrence patterns were tested with the checkerboard score (C-score) under a null model preserving site frequencies (Stone and Roberts, 1990). A checkerboard unit is a 2 × 2 matrix where both OTUs occur once but on different sites. For network inference, we calculated all possible Spearman's rank correlations between OTUs with more than five sequences (1577 OTUs). This preliminary filtering step removed poorly represented OTUs and reduced network complexity, facilitating the determination of the core soil community. We considered a co-occurrence event to be valid (that is, a robust correlation) if the Spearman's correlation coefficient (ρ) was both >0.6 and statistically significant (P-value <0.01; Junker and Schreiber, 2008). The nodes in the reconstructed network represent the OTUs at 90% identity, whereas the edges (that is, connections) correspond to a strong and significant correlation between nodes (see Supplementary File for the resulting network in GRAPHML format). In order to describe the topology of the resulting network, a set of measures (that is, average node connectivity, average path length, diameter, cumulative degree distribution, clustering coefficient and modularity) was calculated (Newman, 2003). All statistical analyses were carried out in the R environment (http://www.r-project.org) using the vegan (Oksanen et al., 2007) and igraph (Csárdi and Nepusz, 2006) packages. Networks were explored and visualized with the interactive platform gephi (Bastian et al., 2009).
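
The checkerboard score described above can be illustrated with a minimal Python sketch. It is not the authors' code (the analyses were run in R), and the function names, the toy presence/absence matrix and the number of randomizations are hypothetical. In particular, the null model here shuffles occurrences within each site so that the number of OTUs per site is preserved, which is only one possible reading of "preserving site frequencies".

```python
import random
from itertools import combinations

def c_score(presence):
    """Mean number of checkerboard units per OTU pair (Stone and Roberts, 1990).

    presence: list of rows, one per OTU, each holding 0/1 values per site.
    """
    site_sets = [{s for s, v in enumerate(row) if v} for row in presence]
    pairs = list(combinations(site_sets, 2))
    if not pairs:
        raise ValueError("need at least two OTUs")
    total = 0
    for a, b in pairs:
        shared = len(a & b)
        # Sites holding only OTU a, times sites holding only OTU b.
        total += (len(a) - shared) * (len(b) - shared)
    return total / len(pairs)

def null_c_scores(presence, n_iter=999, seed=0):
    """C-scores of randomized matrices (ASSUMED null: site totals kept fixed)."""
    rng = random.Random(seed)
    n_otus, n_sites = len(presence), len(presence[0])
    scores = []
    for _ in range(n_iter):
        columns = []
        for s in range(n_sites):
            column = [presence[o][s] for o in range(n_otus)]
            rng.shuffle(column)  # reassign which OTUs occur at this site
            columns.append(column)
        shuffled = [[columns[s][o] for s in range(n_sites)] for o in range(n_otus)]
        scores.append(c_score(shuffled))
    return scores

# Toy matrix: 3 OTUs (rows) x 4 sites (columns); values are hypothetical.
example = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]
observed = c_score(example)
null = null_c_scores(example, n_iter=199)
# Share of randomized matrices at least as segregated as the observed one.
p_value = (1 + sum(s >= observed for s in null)) / (len(null) + 1)
print(observed, p_value)
```

A higher observed C-score than in the randomized matrices indicates more segregation between taxa than expected by chance; with real data, the matrix would hold one row per OTU and one column per soil sample.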

Results and discussion

General co-occurrence patterns

Soils are heterogeneous environments that harbor enormously diverse prokaryotic communities (Torsvik et al., 1990; Curtis et al., 2002). Previous work has explored soil microbial diversity from various perspectives, including the estimation of species richness levels in individual samples (for example, Fierer et al., 2007; Roesch et al., 2007; Youssef and Elshahed, 2009), assessment of the abiotic variables that control the diversity and composition of communities (for example, McCaig et al., 2001; Fierer and Jackson, 2006; Lauber et al., 2009), and the assessment of how specific abiotic factors influence specific taxa (for example, Jones et al., 2009). The relationships between microbial taxa also shape the structure of microbial communities (Prosser et al., 2007), and thus, it can be expected that non-random co-occurrence patterns and significant inter-taxa relationships should occur.

After quality filtering and OTU clustering at the 90% identity level, we obtained 2798 OTUs represented by more than one sequence distributed across the 151 soil samples included in this study (Supplementary Table 1). In order to assess non-random co-occurrence patterns, we first used an ecological measure based on checkerboard units (C-score; Stone and Roberts, 1990). Overall, we observed a non-random co-occurrence pattern using the whole dataset (C-score=46.56, P-value <0.01). Restricting the analysis to only those OTUs showing significant relationships (the ones appearing in Figure 1), the measure increased to C-score=185.03 and P-value <0.01. A recent meta-analysis showed similar patterns of co-occurrence for microorganisms and macroorganisms, suggesting that non-random community assembly may be a general characteristic across all life domains (Horner-Devine et al., 2007). This finding that there are significant non-random co-occurrence patterns is not surprising, given that we have known for some time that many bacterial taxa exhibit predictable biogeographical patterns (Prosser et al., 2007). Moreover, documenting non-random co-occurrence patterns is far different from actually identifying the causal mechanisms structuring the communities. However, non-random assembly patterns do indicate the dominance of deterministic processes including competitive interactions, non-overlapping niches or historical effects in shaping community composition (Horner-Devine et al., 2007). Overall, the approach allowed us to conclude that soil microorganisms tended to co-occur more than expected by chance.

Figure 1

Network of co-occurring 90% cutoff OTUs based on correlation analysis. A connection stands for a strong (Spearman's ρ>0.6) and significant (P-value <0.01) correlation. The size of each node is proportional to the number of connections (that is, degree). Left panel: OTUs colored by taxonomy. Right panel: OTUs colored by abundance and occupancy (generalists and specialists).

Network description

Having established that the soil microbial assemblage patterns were non-random, we further explored co-occurrence patterns using network inference based on strong and significant correlations (using the non-parametric Spearman's rank correlation; Junker and Schreiber, 2008). Correlation networks of co-occurring microorganisms permit the visual summary of large amounts of information (Chaffron et al., 2010) and have been successfully applied to discern associations between marine microorganisms and their environment (Ruan et al., 2006).
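
As a rough illustration of this network inference step, the following Python sketch keeps an edge between two OTUs when their Spearman's ρ exceeds 0.6 and the correlation is significant at P <0.01, after dropping OTUs with five or fewer sequences. It is a sketch under stated assumptions, not the authors' R workflow: the function names and the toy counts are hypothetical, and the P-value is obtained from a one-sided permutation test that is swapped in here because the source does not state how significance was computed.

```python
import math
import random
from itertools import combinations

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the (tie-averaged) ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def permutation_p(x, y, n_perm=999, seed=0):
    """One-sided permutation P-value for a positive rho (ASSUMED test)."""
    rng = random.Random(seed)
    rx, ry = average_ranks(x), average_ranks(y)
    observed = pearson(rx, ry)
    shuffled = ry[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson(rx, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def cooccurrence_edges(counts, min_reads=5, rho_min=0.6, alpha=0.01):
    """counts: dict mapping OTU id -> per-sample read counts (same order)."""
    kept = {otu: c for otu, c in counts.items() if sum(c) > min_reads}
    edges = []
    for a, b in combinations(sorted(kept), 2):
        rho = spearman(kept[a], kept[b])
        if rho > rho_min and permutation_p(kept[a], kept[b]) < alpha:
            edges.append((a, b, rho))
    return edges

# Hypothetical read counts for four OTUs across eight samples.
toy_counts = {
    "otu_a": [1, 2, 3, 4, 5, 6, 7, 8],
    "otu_b": [2, 4, 6, 8, 10, 12, 14, 16],
    "otu_c": [8, 1, 6, 3, 5, 2, 7, 4],
    "otu_rare": [1, 0, 0, 0, 0, 0, 0, 1],  # removed by the abundance filter
}
toy_edges = cooccurrence_edges(toy_counts)
print(toy_edges)
```

With the 1577 OTUs retained in this study there are more than a million pairwise tests, so a real analysis would need a vectorized implementation; the sketch does not apply any multiple-testing correction.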

The resulting soil microbial network (Figure 1 and Supplementary Figure 1) consisted of 296 nodes (OTUs) and 679 edges (average degree or node connectivity 4.59; see Supplementary Figure 1 for a cumulative degree distribution). Some topological properties commonly used in network analysis were calculated to describe the complex pattern of inter-relationships between OTUs (Newman, 2003). The average network distance between all pairs of nodes (average path length) was 5.53 edges with a diameter (longest distance) of 18 edges. The clustering coefficient (that is, how nodes are embedded in their neighborhood and, thus, the degree to which they tend to cluster together) was 0.33 and the modularity index was 0.77 (values >0.4 suggest that the network has a modular structure; Newman, 2006). Overall, the soil microbial network was composed of highly connected OTUs (about 5 edges per node on average) structured among densely connected groups of nodes (that is, modules) and forming a clustered topology (as expected for real-world networks, which are significantly more clustered than random graphs). These structural properties offer the potential for quick and easy comparisons among complex datasets from different ecosystem types in order to explore how the general traits of a certain habitat type may influence the assembly of microbial communities.
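
To make these topological measures concrete, the sketch below computes average degree, a clustering coefficient, average path length and diameter for a small undirected graph in plain Python. The function names and the toy edge list are hypothetical. The clustering coefficient shown is the global transitivity (closed triplets over all connected triplets), which is an assumption, since the text does not say which variant was used. Modularity is omitted because it depends on the community-detection method chosen.

```python
from collections import deque

def build_graph(edges):
    """Undirected adjacency sets from a list of (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def average_degree(graph):
    """Mean number of edges per node, equal to 2 * edges / nodes."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)

def transitivity(graph):
    """Global clustering coefficient (ASSUMED variant): closed / all triplets."""
    closed = 0
    triplets = 0
    for nbrs in graph.values():
        k = len(nbrs)
        triplets += k * (k - 1) // 2
        ordered = sorted(nbrs)
        for i, u in enumerate(ordered):
            for v in ordered[i + 1:]:
                if v in graph[u]:
                    closed += 1
    return closed / triplets if triplets else 0.0

def path_stats(graph):
    """Average shortest path length and diameter over reachable node pairs."""
    total = 0
    count = 0
    diameter = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for target, d in dist.items():
            if target != source:
                total += d
                count += 1
                diameter = max(diameter, d)
    return (total / count if count else 0.0), diameter

# Toy graph: a triangle (a, b, c) with one pendant node d.
toy_graph = build_graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
toy_mean_path, toy_diameter = path_stats(toy_graph)
print(average_degree(toy_graph), transitivity(toy_graph), toy_mean_path, toy_diameter)

# Consistency check on the reported network: 296 nodes and 679 edges.
reported_average_degree = 2 * 679 / 296
print(round(reported_average_degree, 2))
```

The last two lines reproduce the reported average degree from the node and edge counts, since each edge contributes to the degree of two nodes.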

The structural analysis also showed that OTUs from the same phyla tended to co-occur more (33%) than expected when considering observed phyla frequencies and random association (11%; Figure 1, left panel). Thus, the magnitude of the disagreement between the observed intra-phyla percentage of significant co-occurrences and the percentage expected under random association may be used as an index of niche preferences or synergistic relationships. This index may also capture differences among habitats (for example, aquatic vs soil) that may shed light on the ecological rules guiding microbial community composition. In all likelihood, most of these co-occurrence patterns are derived from taxa sharing similar ecological niches, not direct symbioses; however, these data alone do not allow us to separate these two possibilities. Some of the co-occurrence patterns reveal or confirm interesting ecological patterns for taxa that have not been well studied. For instance, members of the verrucomicrobial phylum tended to co-occur more (1.8%) than expected by chance (0.3%), suggesting that, despite being abundant and ubiquitous in soils (Bergmann et al., 2011), they share a specific (and as yet undefined) niche (Figure 4). Another abundant and cosmopolitan phylum that also showed a higher incidence of co-occurrence than expected by random association was Acidobacteria (2.4% expected at random vs 9.4% observed; Figure 4). In this case, the pattern is most likely driven by the previously observed phenomenon that soil pH largely governs the distributions of many soil acidobacterial taxa (Jones et al., 2009). Other examples were the Chloroflexi and Deinococcus groups, which were neither abundant nor generalists/cosmopolitan (Figure 4), but instead, appeared to be mostly restricted to desert soils.
Several OTUs shared the same habitat preferences and thus appeared to be highly interconnected (1.9% expected at random vs 6.2% observed for Chloroflexi; 0.01% expected at random vs 1% observed for Deinococcus). The degree of disagreement between observed and random co-occurrence may therefore provide further insights into the niche differentiation of the different populations sharing a common phylogeny at different levels of relatedness. Overall, these findings suggest that environmental filtering effects and niche differentiation are evident at broad taxonomic levels, as noted elsewhere (Philippot et al., 2010).
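
The comparison between observed and expected intra-phylum co-occurrence can be sketched as follows. The text does not specify exactly how the random expectation was derived from phylum frequencies, so this Python example assumes one simple null: the probability that two distinct nodes drawn at random from the network belong to the same phylum. The function name, node labels and phylum assignments are hypothetical.

```python
from collections import Counter

def intra_phylum_fractions(node_phylum, edges):
    """Return (observed, expected) fractions of same-phylum edges.

    observed: share of edges whose two endpoints belong to the same phylum.
    expected: chance that two distinct randomly drawn nodes share a phylum
              (ASSUMED null model, based on node frequencies per phylum).
    """
    same = sum(1 for a, b in edges if node_phylum[a] == node_phylum[b])
    observed = same / len(edges)
    sizes = Counter(node_phylum.values())
    n = len(node_phylum)
    expected = sum(k * (k - 1) for k in sizes.values()) / (n * (n - 1))
    return observed, expected

# Hypothetical network: six OTUs from three phyla and four edges.
toy_phyla = {
    "a1": "Acidobacteria", "a2": "Acidobacteria", "a3": "Acidobacteria",
    "v1": "Verrucomicrobia", "v2": "Verrucomicrobia",
    "c1": "Chloroflexi",
}
toy_links = [("a1", "a2"), ("a2", "a3"), ("v1", "v2"), ("a1", "v1")]
toy_observed, toy_expected = intra_phylum_fractions(toy_phyla, toy_links)
print(toy_observed, round(toy_expected, 3))
```

The gap between the two fractions plays the role of the index discussed above: the larger the excess of observed over expected same-phylum edges, the stronger the suggestion of shared niche preferences within a phylum.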

Habitat generalists and specialists

Each of the OTUs represented by more than one sequence was drawn in the abundance vs occupancy plot (Figure 2) to split the set of taxa into two general categories: soil generalists (that is, broadly distributed microbial taxa, which we operationally define here as present in >80 of the 151 soils) and soil specialists (operationally defined here as those that were locally abundant, representing >2% of the sequences in individual libraries, but only found in <10 soil samples). Under these criteria, 2% of the OTUs fell into the generalist category while 1% fell into the specialist category. Although we used a high-throughput DNA sequencing method with higher sequencing depth than traditional methodologies, we may have missed very low abundance taxa that potentially could be habitat generalists. However, dividing taxa from ecological communities into these two categories, which we admittedly defined somewhat arbitrarily, is useful for defining ecological categories/strategies that offer information additional to that provided by phylogeny, taxonomy, or functional capacity (Magurran and Henderson, 2003). Recently, partitioning microbial taxa based on abundance and occupancy has proved useful in the analysis of clinical samples (van der Gast et al., 2011). In general ecology, positive relationships between mean abundance and occupancy have been observed at many spatial scales (Guo et al., 2000; van der Gast et al., 2011). However, we did not observe such a trend in our dataset. Although most of the soil samples analyzed in the present work had their origins in temperate and fertile soils, the environmental variability covered in this study (that is, different habitats and a broad spatial scale) probably altered this relationship.
For example, specialist bacterial OTUs inhabiting extreme environments such as deserts or Antarctic soils had a higher abundance than would be expected from their occupancy across the overall sampling range.

Figure 2

Abundance (y axis) and occupancy (x axis) plot for the 90% cutoff OTUs. Habitat generalist OTUs (in red) are defined as those appearing in >80 soil samples. Habitat specialist OTUs (in blue) are defined as those that are locally abundant (>18 sequences) and appear in <10 soils.

Overall, we observed a consistent separation in the co-occurrence network analysis between generalists and specialists (see significant correlations in Figure 1, right panel, against all the remaining OTUs, and a detail in Figure 3 adding taxonomic information). Specialist OTUs (locally abundant in a few samples) were composed of a diverse range of phylogenetic groups not common in soils from most biomes (that is, Chloroflexi, Deinococcus, Gemmatimonadetes; Figure 3). Generalist OTUs (distributed broadly), in turn, were typical soil members from the Acidobacteria, Proteobacteria (especially of the Alpha subclass) and Verrucomicrobia groups (see Figure 4, top and bottom panels, and Janssen, 2006 for a recent review). The different taxonomic composition and range of distribution probably influenced the network structure observed (Figure 3), indicating that these two ecological categories shaped the network structure differently. The generalist network was less connected and more compartmentalized (19 significant co-occurrences and five compartments) than the specialist network (29 significant co-occurrences and two compartments), probably because the former covered a higher habitat variability whereas the latter was present only in restricted environments. Thus, the two somewhat arbitrary ecological categories that we established allowed us to capture additional information on the structure of community assembly, as previously shown for macroorganisms (Pandit et al., 2009).

Figure 3

Network of co-occurring generalist and specialist 90% cutoff OTUs based on correlation analysis. A connection stands for a strong (Spearman's ρ>0.6) and significant (P-value <0.01) correlation. The size of each node is proportional to the number of connections (that is, degree). Labels according to taxonomic affiliation: Ac, Acidobacteria. A.R, Alphaproteobacteria; Rhizobiales. A.Rh, Alphaproteobacteria; Rhodobacterales. A.S, Alphaproteobacteria; Sphingomonadales. Ba.F, Bacteroidetes; Flavobacteria. Ba.S, Bacteroidetes; Sphingobacteria. Ch, Chloroflexi. Cr, Crenarchaeota. Cy, Cyanobacteria. D, Deltaproteobacteria. De, Deinococcus. G, Gemmatimonadetes. Ga, Gammaproteobacteria. V, Verrucomicrobia.

Figure 4

Relative abundance of different microbial taxonomic groups. Top panel: number of sequences in all soil samples. Middle panel: number of significantly co-occurring OTUs (nodes from Figure 2). Bottom panel: number of cosmopolitan OTUs.

Finally, the significant correlations found between the generalist OTUs and the whole dataset are shown in Table 1. The listed correlations do not include co-occurrence events among members of the same taxonomic group. Generalist OTUs belonged to abundant phyla such as Verrucomicrobia, Acidobacteria, Proteobacteria and Bacteroidetes. Interestingly, generalist OTUs not classified as typical and abundant soil members (such as Deltaproteobacteria and Crenarchaeota) co-occurred with other ubiquitous members. The case of the crenarchaeotal OTU (closely related to Candidatus Nitrososphaera gargensis and previously described as ubiquitous in soil; Bates et al., 2010) is particularly remarkable given our poor understanding of the niches occupied by this taxon, even though it has been proposed that related Crenarchaeota may have an important role in the nitrogen cycle as ammonia oxidizers (Leininger et al., 2006). This particular crenarchaeotal OTU co-occurred with sequences closely related to methane oxidizers (Table 1). This observation may be a first step toward revising the expected functional role of soil Crenarchaeota in the nitrification process because of the high sequence similarity of the enzymes that catalyze ammonia oxidation (ammonia monooxygenase) and methane oxidation (particulate methane monooxygenase; Holmes et al., 1995). This is an example of the potential of the approach proposed in the present work to yield knowledge on elusive but ecologically relevant microorganisms.

Table 1 Taxonomy of generalist OTUs (red nodes in Figure 3) and their significantly (P-value <0.01) co-occurring OTUs

Final remarks

With this work we have demonstrated the utility of including network analysis approaches in the repertoire of statistical approaches available to microbial ecologists. By applying network analyses to a large soil microbial dataset generated by pyrosequencing, we made the exploration of this complex dataset more feasible, and interesting, previously unseen patterns emerged, including non-random association, deterministic processes at different taxonomic levels and unexpected relationships between community members. Different ecological rules guiding microbial community composition may be reflected in differences in network structure among habitats (for example, aquatic vs soil, or early colonized vs late-successional ecosystems), which deserve further research. The next logical step is to go beyond merely describing the patterns revealed by the network analysis and to design more focused experiments, or to study specific environmental gradients and community shifts over time, in order to understand the mechanisms producing patterns of community coexistence, that is, what finally determines which and how many species live together in a community.