Open Access

Toward molecular trait‐based ecology through integration of biogeochemical, geographical and metagenomic data

Jeroen Raes, Ivica Letunic, Takuji Yamada, Lars Juhl Jensen, Peer Bork

Author Affiliations

  1. Jeroen Raes1,2,
  2. Ivica Letunic1,
  3. Takuji Yamada1,
  4. Lars Juhl Jensen1,3 and
  5. Peer Bork*,1,4
  1. 1 Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
  2. 2 Molecular and Cellular Interactions Department, VIB – Vrije Universiteit Brussel, Brussels, Belgium
  3. 3 NNF Center for Protein Research, Copenhagen, Denmark
  4. 4 Max Delbrück Center for Molecular Medicine, Berlin‐Buch, Germany
  1. *Corresponding author. Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstrasse 1, Heidelberg 69117, Germany. Tel.: +49 6 221 387 8526; Fax: +49 6 221 387 8517; E-mail: bork{at}


Using metagenomic ‘parts lists’ to infer global patterns on microbial ecology remains a significant challenge. To deduce important ecological indicators such as environmental adaptation, molecular trait dispersal, diversity variation and primary production from the gene pool of an ecosystem, we integrated 25 ocean metagenomes with geographical, meteorological and geophysicochemical data. We find that climatic factors (temperature, sunlight) are the major determinants of the biomolecular repertoire of each sample and the main limiting factor on functional trait dispersal (absence of biogeographic provincialism). Molecular functional richness and diversity show a distinct latitudinal gradient peaking at 20°N and correlate with primary production. The latter can also be predicted from the molecular functional composition of an environmental sample. Together, our results show that the functional community composition derived from metagenomes is an important quantitative readout for molecular trait‐based biogeography and ecology.

Visual Overview


Metagenomics (shotgun sequencing of pooled DNA of complete microbial communities) is widely used to investigate ecosystem functioning of environmental and clinical samples. However, the nature of this data (usually a gigantic collection of gene fragments of 1000s of organisms) makes it very hard to infer global patterns on microbial ecology of the environment at hand. To address important ecological questions such as ‘How do microbial communities adapt to the environmental conditions?’, ‘What drives the functional variation across the globe and to what extent do genes disperse?’ and ‘What drives variation of CO2 uptake across different locations and communities?’, we integrated 25 ocean metagenomes from the Global Ocean Sampling project with geographical, meteorological and geophysicochemical data. We find that climatic factors (temperature, sunlight) are the major determinants of the functional and phylogenetic composition of an environment and the main limiting factor on whether functions dispersal across the planet. We find a distinct latitudinal gradient in the size and diversity of the functional repertoire of ocean microbial communities, peaking at 20°N, and which correlates with oceanic CO2 uptake. The latter can also be predicted from the molecular functional composition of an environmental sample. Together, our results show that the functional community composition derived from metagenomes can be used as quantitative predictor for molecular trait‐based biogeography and ecology.

  • Climatic factors drive functional and phylogenetic composition of ocean microbial communities.

  • Function dispersal is controlled by environmental conditions.

  • Functional richness has a clear latitudinal gradient and correlates with primary production.

  • Metagenomic data can be used as a predictor for ecosystem processes.

  • To understand the relationship between community composition and environment, functional readouts are the most direct. Metagenomic data enable such trait‐based ecology at the molecular level.


Microbial communities have a central role in global environmental processes and the Earth's biogeochemistry by cycling nutrients and fixing carbon (Falkowski et al, 1998). However, because of the complexity of these communities and the lack of culturability of most of its members, the molecular and ecological details as well as influencing factors of these processes are still poorly understood. Environmental shotgun sequencing (metagenomics) has the potential to start unraveling the underlying complex interspecies ecological interactions and metabolic networks, by quantification of the molecular functions (‘parts lists’) of all microbial communities on Earth (Tringe and Rubin, 2005a; Raes and Bork, 2008). However, despite a wide range of published metagenomics studies (see Liolios et al, 2006; Raes et al, 2007; Wooley et al, 2010, for an overview), our knowledge of the variation, functioning and ecology of complex microbial ecosystems remains limited, mostly because the resulting ‘parts lists’ could not be put into sufficiently detailed environmental context. Although previous studies have shown that the environment has an influence on the parts list of various communities, the extent of this effect and the relative importance of a broad range of different environmental factors (climate, nutrients, physicochemical parameters and so on) is unknown (Tringe et al, 2005b; DeLong et al, 2006; Dinsdale et al, 2008; Kunin et al, 2008; Gianoulis et al, 2009) or was investigated with a focus on single species (Johnson et al, 2006) or specific gene families (Patel et al, 2010). This said, recent models predicting nutritional strategy from metagenomic data show great promise toward the understanding of some of these relationships (Lauro et al, 2009). Also, as microbial biogeography and ecology studies have mostly focused on phylogenetic patterns, little is known about the role of molecular traits (i.e., the genes and their products) in these matters (Martiny et al, 2006; McGill et al, 2006; Green et al, 2008). Likewise, the role of molecular trait variation in important ecosystemic processes such as global primary production is far from clear (Falkowski et al, 1998). To start addressing these issues, we investigated the feasibility of molecular trait‐based ecology by integrating large‐scale marine metagenomics data with geochemical, meteorological and ecological measurements and used this information to investigate (i) the relationship between environment and functional community composition (the metagenome‐derived gene/pathway repertoire of an ecosystem), (ii) the factors influencing functional dispersal (defined here as the functional effects of species dispersal as well as horizontal gene transfer‐ and phage‐mediated gene flow), i.e., the movement of functional traits through geographical space, (iii) the interplay between functional composition and primary production and (iv) the geographic variation in global functional diversity and its consequences. The various correlations we found, despite various imaginable limitations of environmental sequence data (see further), thereby indicate that molecular functional composition, as derived from metagenomes, can serve as a powerful marker and predictor of ecological processes.

Results and discussion

We utilized the Global Ocean Survey (GOS) which is, at the time of writing, the largest published metagenomic study of a single environment (excluding host‐associated communities; Qin et al, 2010), gathering ocean surface samples from a transect around the globe (Rusch et al, 2007). Although it has some drawbacks (e.g., size fractionation excluding eukaryotic plankton; only dominant species are sampled), it still constitutes a unique data set to assess the feasibility of molecular trait‐based ecological studies. We mapped these data onto orthologous groups (OGs) and biochemical pathways (allowing multi‐level functional interpretations and to overcome undersampling issues in gene‐based analyses; Harrington et al, 2007), and linked pathways to species, when possible, in order to interpret correlations in the context of the phylogenetic composition of the community (see Materials and methods). Then, we complemented the metadata gathered in the GOS project (which we studied in Gianoulis et al (2009)) by projecting a broad range of geophysicochemical (e.g., nitrate, phosphate, oxygen measurements, ocean mixing), geographical (latitude, longitude, depth) and meteorological data (e.g., temperature, sunlight) as well as ecological information (primary production) from publically available resources (see Materials and methods and references therein) onto 25 published metagenome sampling points (Rusch et al, 2007) using the sampling time and coordinates (see Materials and methods; Supplementary Table S1 and Supplementary Figure S1).

Functional community composition is mainly driven by climatic factors

Previous studies have shown a clear impact of the environmental conditions on the functional composition of microbial communities (Tringe et al, 2005b; DeLong et al, 2006; Dinsdale et al, 2008; Kunin et al, 2008; Gianoulis et al, 2009). To investigate which environmental conditions are the main drivers in this process, we applied previously established techniques (Gianoulis et al, 2009) to the integrated metagenomic and environmental data. Canonical correlation analysis (CCA; Hotelling, 1936; Gianoulis et al, 2009) shows that the overall correlation between environmental factors and various biochemical pathways is high and that numerous pathways have strong correlations with environmental factors (see Figure 1, Supplementary Tables S2, S3, S4). Generally, temperature, sunlight, oxygen and CO2 concentration have the strongest correlations, as they distribute along the dimension with the highest canonical correlation coefficient (CC=0.944, see Figure 1C), while salinity and nutrients contribute more to the second, less significant, dimension (CC=0.875; only one significant module, Supplementary Table S4; see Supplementary Table S5 for more evaluation metrics). This suggests that, for the ocean surface communities analyzed here, climatic factors such as sunlight and temperature (and correlated dissolved oxygen and CO2 content) are the main determining factors of the functional community composition, whereas nutrient concentrations seem to have less influence, despite their crucial, limiting role in ocean life and productivity (Arrigo, 2005). As phylogenetic composition was reported to be mainly determined by salinity in a survey of various environments (Lozupone and Knight, 2007), we repeated the CCA analysis on the phylogenetic composition of the GOS samples used here (data taken from Biers et al, 2009) but found the phylogenetic results to be in agreement with our functional trends (Supplementary Figure S2). Given the range of salinity in our study was much smaller than in the aforementioned study, it could however be that salinity has a role in more extreme concentrations and outside ‘normal’ oceanic samples as studied here. We should also note that these correlations are based on available monthly average values, so that nutrients could still have a larger role in fast adaptations at much shorter timescales (Gilbert et al, 2009). Data sets that cover a larger and physicochemically more diverse geographic region can be expected in the near future (e.g., from the Tara Oceans project, and will provide a higher resolution to refine our initial observations.

Figure 1.

Correlations between metabolic pathway abundances and environmental conditions deduced from the ocean samples in this study, at various levels of model complexity (see Materials and methods): (A) ‘One‐to‐one’ pairwise correlation (P=0.001) between the abundance of photosystem I genes with average monthly water temperature. (B) ‘One‐to‐many’ linear model of average monthly water temperature, phosphate concentration and hours of sunlight correlating with carbon fixation gene abundance (R2=0.70). (C) ‘Many‐to‐many’ regularized canonical correlation analysis ordination plot showing the correlation between all environmental variables (text labels; see Materials and methods) and pathway modules (colored dots). The distance between two variables on the plot and their distance from the center point indicates the strength of their correlation and their contribution to explaining the global correlation (i.e., their structural correlation in each dimension given on the respective axes: first dimension, vertical, second, horizontal; see Materials and methods). The overall canonical correlation is high (canonical correlation=0.944 in the first dimension), and the two first dimensions explain 62 and 22% of the total environmental and metagenomic variation, respectively, emphasizing the strong correlation between the climatic factors and functional community composition on the first dimension. Module colors indicate their broad functional classes: yellow, amino acid metabolism; orange, central metabolism; red, energy metabolism, dark green, glycan metabolism; cyan, lipid metabolism; purple, metabolism of other molecules; blue, nucleotide metabolism; brown, replication and repair; light green, transcription; pink, translation; gray, transport system. Highlighted modules are described in more detail in the text. mld, mixed layer depth.

Molecular adaptions to environmental conditions

Several correlations between environmental factors and abundance of metabolic pathways as revealed by CCA were further confirmed using pairwise correlation analysis and linear regression (see Materials and methods). Some of these confirm common knowledge of ocean microbial processes. For instance, a strong positive correlation was found for both temperature and hours of sunlight with the abundance of genes encoding the photosynthetic machinery (Figure 1A). Not only the light capturing complexes (PSI, PSII, CytB6) but also modules involved in oxidative phosphorylation and CO2 fixation were positively correlated with temperature, confirming the greater dependence on (photo‐) autotrophic processes in sunny, warm and nutrient‐poor tropical areas (mainly driven by samples from the Gulf of Mexico and, to a lesser extent, the tropical Pacific). The phylogenetic CCA analysis confirms this observation, by showing strong association of photosynthetic groups such as cyanobacteria with temperature (Supplementary Figure S2).

Other examples include a wide range of sulfur‐related processes that are negatively correlated to temperature ranging from enzymes linked to sulfate reduction and/or sulfite detoxification (adenylyl sulfate reductases; Meyer and Kuever, 2007) to pathways involved in methionine degradation and breakdown of sulfur‐containing glycans. The former seem to originate mainly from SAR11‐like species (where they would be reverse‐acting to prevent sulfite accumulation during dimethylsulfoniopropionate assimilation (Meyer and Kuever, 2007)), but are also found in various other bacterial groups. The latter include mainly sulfatases from planktomycetes (see Supplementary Figure S3 for phylomapping results), which are suggested to be involved in the initial breakdown of sulfated heteropolysaccharides in heterotrophic carbon recycling (Woebken et al, 2007). The observed anticorrelation of these processes with temperature suggests a more general dependence on organic sulfur (despite high ambient sulfate concentrations) in the northern, coastal heterotroph‐dominated communities, something which was currently only observed for specific phylogenetic groups (i.e., SAR11; Tripp et al, 2008).

Functional biogeography and dispersal are primarily determined by environmental conditions

Having demonstrated that functional composition can be clearly linked to external conditions, we used functional traits derived from metagenomes to study, as a second application, function dispersal and biogeography in ocean samples. Various rRNA‐based studies over many years remain undecided on the existence and nature micro‐organismal biogeographic patterns (Finlay, 2002; Martiny et al, 2006; Telford et al, 2006). As evidence accumulates that the set of functional traits, not the rRNA genes, are the true ecological determinants of a microbial species (Gevers et al, 2005; McGill et al, 2006; Green et al, 2008; Hunt et al, 2008; Fraser et al, 2009), we investigate the use of molecular functional traits as biogeographic markers. Given rampant horizontal gene transfer and, e.g., the frequent observation of bacterial genes in phages, we here use the term ‘functional biogeography’ to allow for the possibility that the traits themselves disperse irrespective of their original hosts, although it deviates from the strict definition where biogeography is only applicable to the trait‐bearing organisms.

To determine the distribution of function in geographical space and identify its determining factors, we compared the difference in functional (metagenomic) composition (i) between the samples (as measured by the Bray–Curtis distance metric; see Materials and methods), (ii) to the geographical distance between them and (iii) to the difference in environmental conditions (Martiny et al, 2006). As climate is the principle factor influencing the functional composition of the communities studied here, we demonstrate its effect on function dispersal. The functional difference between communities increases with difference in climatological conditions (Figure 2A; Mantel test, P<0.001), and this effect remains even if the difference in environmental conditions caused by geographic distance is subtracted (Figure 2B; partial Mantel test, P=0.01; see Materials and methods and Supplementary Table S6 for details on tests performed). On the other hand, when we tested for biogeographic provincialism (i.e., a ‘distance effect’, implying geographic limitations to function dispersal), no significant correlation between physical distance and functional difference could be found when difference in environmental conditions was taken into account (Figure 2B; partial Mantel test, P=0.1). This implies that the functional traits available to a specific community have no physical constraints on their dispersal, and their abundances are mainly determined by local, contemporary environmental conditions. In other words, a single functional province seems to exist over hundreds to thousands of kilometers in the surface ocean of the Atlantic and Pacific. These observations provide evidence for a functional equivalent of the (organism‐centric) Baas‐Becking theory (Baas Becking, 1934), namely ‘all functions are everywhere, but the environment selects’. One should note that the trend diminishes if not only climatic but also nutritional variables are included (see Materials and methods), suggesting that selection of relevant environmental parameters can lead to the identification of otherwise hidden biogeographic patterns. On the other hand, the same correlation analysis, when performed on the phylogenetic composition (source data from Biers et al, 2009), shows a (weak) distance effect (partial Mantel test, P=0.03, see Supplementary Table S6 and Materials and methods for details) compatible with previous reports of microbial species‐level biogeographic provincialism (Martiny et al, 2006; Telford et al, 2006). However, the phylogenetic composition seems not to be selected for by environmental factors (partial Mantel test, P=0.25, see Supplementary Table S6), suggesting that phylogeny and function are not necessarily coupled. Thus, while for phylogeny more neutral population effects dominate in geographical distinct locations, selection for environmental constraints seems more readily distinguishable in molecular functions.

Figure 2.

The role of environment in the biogeography of functional traits. (A) Coupling of metagenomic distance between samples (measured using KEGG metabolic pathway composition; see Materials and methods) with difference in climatological conditions, identifying climate as a primary determinant of function dispersion. (B) (Partial) Mantel tests (see Materials and methods) showing that this increase is not due to indirect effects, such as the similarity in environmental conditions between geographically close samples.

Functional community composition can predict primary production

The case studies above exemplify a wealth of apparent metabolic adaptations to environmental conditions and prove the ability to extract those from massive amounts of data. This prompted us, in a third application, to explore the potential of functional composition in predicting other global ecological parameters, such as primary production (the amount of CO2 fixed by photosynthetic micro‐organisms that is available for other trophic levels as organic carbon (Lindeman, 1942). Primary production is a crucial parameter in global carbon cycling and climate change and is calculated from, among others, temperature and sun irradiation measurements (Field et al, 1998). We retrieved information on the monthly average primary production for the sampling coordinates (Field et al, 1998) and correlated this information to both the functional composition of the microbial community sampled and the environmental conditions of the sampling site. As expected, primary production was predictable from environmental conditions (data not shown). However, also the abundance of several central, core metabolic pathways from the metagenome strongly correlates with primary production (see Supplementary Table S7). Examples include pathway modules involved in nucleotide (dCTP, UMP) and sugar‐nucleotide (UDP‐glucose/UDP‐galactose) biosynthesis (Spearman's ρ=0.79, −0.69 and −0.69, respectively), photosynthesis (Photosystem I and II; ρ=−0.71 and −0.53, respectively) and oxidative phosphorylation (complex I; ρ=−0.76). The negative correlation of photosynthesis and linked energy generation with primary production, at first sight counterintuitive, can be explained by the filter sizes used in the metagenome sampling, which excludes several of the main contributors to primary production (diatoms, dinoflagellates, green algae). In areas of low nutrient concentration and low global primary production, photosynthetic cyanobacteria thrive because of their specific adaptation to these circumstances (Vaulot et al, 1995). In areas of high nutrient concentration, where the major eukaryotic determinants of primary production thrive, they could outcompete the photosynthetic bacteria, which are the only ones captured in the GOS data sets, as the cell size fractions of the samples used here did not target protist species. On the other hand, the observed strong correlation between many additional core metabolic pathways of only a particular subfraction of the ecosystem population (bacteria/archaea) and the total measured primary production provides evidence for a robust metabolic coupling between these organisms and the carbon fixing eukaryotes in the ocean surface. This hypothesis is reinforced by the fact that linear models consisting of only abundances of the most correlated metabolic Kyoto Encyclopedia of Genes and Genomes (KEGG) maps (Kanehisa et al, 2008) were almost equally successful at predicting primary production as the external environmental measures used (R2env=0.91 versus R2met=0.87; see Supplementary Figure S4 and table S8). This observation confirms a tight coupling of functional composition of the sampled bacterial community and global primary production at that location. Furthermore, this observation suggests that functional properties, identifiable through metagenomic studies can be used to predict various community properties and processes that are hard to measure in environmental settings.

Functional richness and diversity show a distinct latitudinal gradient and are linked to primary production

In addition to the functional and phylogenetic composition of microbial communities, also global estimators of the richness (e.g., number of species) and diversity (richness, corrected for population structure) of an ecosystem are widely used to investigate and understand ecosystem properties (Colinvaux, 1973). For instance, community productivity has been repeatedly correlated to species richness. However, depending on the environmental circumstances, also negative correlations have been observed (Kondoh, 2001; Loreau et al, 2001). In microbial ocean communities, no correlation was found so far (Fuhrman et al, 2008). At the macroscopic functional level, positive correlations between richness in macroscopic traits (e.g., plant functional groups in a field study) and productivity have been reported (Tilman et al, 1997; Loreau et al, 2001). We therefore investigated whether estimators of trait diversity at the molecular level can provide relevant alternatives to study microbial communities, as they consider the direct functional units (genes) that determine the properties inherent to the ecosystem. One comprehensive measure of molecular functional richness is the extrapolated richness in OGs present in a metagenomic sampling of a community, which quantifies the ‘breadth’ of the functional potential of the ecosystem at hand (Raes and Bork, 2008). When also the evenness of the functional distribution can be taken into account, functional diversity can be determined (Raes and Bork, 2008). Applied to the data used here, we observe a clear peak in functional richness at 20 degrees north along the north–south transect of the GOS expedition (for details see Figure 3A and B). Functional diversity shows a similar but less pronounced pattern (see Supplementary Figure S5), and similar results are obtained for gene family richness (see Supplementary Figure S6). These results are in line with previous reports of latitudinal species richness gradients, although in these studies the gradient has a more noisy distribution (Fuhrman et al, 2008) or the position of the peak cannot be observed because of the relatively low number of samples (Pommier et al, 2007).

Figure 3.

Variation of functional richness and diversity, and its coupling to primary production. (A) Global view of primary production for the sampled region (ocean coloring) and functional richness for GOS sampling locations used in this study (3D bars), showing the peak in richness at ±20 degrees North (two outliers were removed for visualization purposes, but are present in (B), showing the full, quantitative data plotted against latitude), (C) functional diversity negatively correlates with primary production (Spearman's ρ=−0.49; P=0.01). Trend lines are Lowess fitted lines with smoothing parameter f=0.7.

We also observe a significant negative correlation between functional diversity and primary production (see Figure 3C; also observed for functional richness, data not shown). These results seem in disagreement with plant macroscopic functional trait richness, which positively correlates with productivity (Tilman et al, 1997; Loreau et al, 2001). In addition, published marine bacterial species richness analyses (Fuhrman et al, 2008), as well as with our own reanalysis of the phylogenetic composition of these samples (see Supplementary Figure S7), show no or at best a very weak (P=0.28) correlation with primary production. Our results are in line with ecological theory though, namely that in areas of high productivity, a narrow functional niche of organisms (low functional richness) impacts the community and drives the majority of productivity. In low‐productivity (nutrition poor) areas, more functional niches are formed because of a higher level of competition for resources, causing a broader global set of functionalities (Krebs, 2001). Given the theoretical support for our results, the weakness of the phylogenetic signal when derived from PCR bias‐free metagenomic sequencing (von Mering et al, 2007a), and in line with the increasing realization that the ecological function of a microbial organism is a true indicator of a species (Gevers et al, 2005; Hunt et al, 2008; Fraser et al, 2009), we conclude that functional diversity estimators (such as richness) and composition might be more relevant and immediate indicators of the link between diversity and ecosystem processes than species‐based measures.

Metagenome‐derived functional composition is an important tool for molecular ecology and biogeography research

Taken together, this study extends previous work (Gianoulis et al, 2009) through a comprehensive integration of metagenomic data with a broad range of quantitative environmental factors extrapolated from a variety of measures independent from the sampling, allowing the identification of climatic factors as the drivers behind functional community composition and functional biogeography. It is further the first attempt to establish the molecular functional repertoire of a metagenomic sample as indicator and predictor of ecological parameters. The metagenomes used here (Rusch et al, 2007) are, of course, only snapshots of the functional potential of the environment and probably still give a biased view on the total biodiversity (e.g., by the sampling method, sequence coverage, filtering and so on). Future dedicated experiments and time series data will greatly improve causal inferences beyond the simple logic used here (environment influences community composition and functioning, which influences ecosystem processes such as primary production). The environmental and ecological variables used are mostly monthly averages of the region, and not exact measurements at the time of sampling. Furthermore, multiple measures had to be taken to avoid parameter overfitting and other testing artifacts (see Materials and methods). However, despite these drawbacks and complications, we see clear, significant trends relating the environment and ecological parameters to metagenomic gene abundances (to the extent of having predictive power), providing first insights into the coupling between the functional traits available to the community, the environmental context and the ecosystem processes that occur. We see this as proof‐of‐principle that molecular functional composition can be used in various other environmental settings such as the human microbiome, where it could be integrated with clinical data to study the molecular ecology and temporo‐spatial variation of the ‘human’ ecosystem.

Materials and methods

GOS data collection and sequence preprocessing

For this study, we filtered the data from the first phase of the GOS expedition to keep only those sites that used a 0.1–0.8 μm filter size (i.e., majority prokaryote samples). In addition, atypical samples from mangroves, estuaries, salt lakes, large depth and so on were not included in the analysis, nor was the sample of Sargasso Sea station 11, because it is suspected of contamination (Mahenthiralingam et al, 2006), and that of Cabo Marshall because of a considerable number of missing environmental data points (see below). For the remaining 25 sites (Supplementary Table S1), protein sequence data were downloaded from CAMERA (Seshadri et al, 2007). Peptides were mapped to sites based on the read‐to‐scaffold and ORF‐to‐scaffold mappings available at the same database, using previously established methods (Tringe et al, 2005b; Kunin et al, 2008; Gianoulis et al, 2009).

Geographical, environmental and ecological metadata mapping

Sampling longitude, latitude, depth, date and time were extracted from the CAMERA database (Seshadri et al, 2007). Average monthly dissolved oxygen, phosphate, nitrate, silicate, temperature and salinity were extracted from the 1 degree gridded data available at the World Ocean Atlas (WOA05) resource (Boyer et al, 2006) for the geographically closest available data points for the selected samples, using the longitude, latitude, month and year of sampling. Gridded dissolved carbon dioxide data from the GLODAP resource (Key et al, 2004), and mixed layer depth data (estimated based upon a 0.5 °C surface water temperature change (mld_pt); Monterey and Levitus, 1997) were extracted using the Ocean Data Viewer software ( Average monthly primary production data, as estimated by the Vertically Generalized Production Model (VGPM) (Behrenfeld et al, 2006), was downloaded from the ocean productivity resource ( Monthly average sunshine fraction (the percentage of time when bright sunshine is recorded during the day) data was obtained from the local climate estimator (LocClim) resource (Grieser, 2002), based upon default interpolation settings. Environmental parameters for which both on‐site (Rusch et al, 2007) as well as monthly averages were available (temperature, salinity) were compared and showed strong consistency among samples (Supplementary Figure S1).

Functional annotation and pathway assignment

The 111 KEGG maps, 141 modules and 191 operons were assigned as in Tringe et al, 2005b; Kunin et al, 2008; Gianoulis et al, 2009. Module definitions were downloaded from KEGG (Kanehisa et al, 2008), and operons were constructed as in von Mering et al, 2003 and Kunin et al, 2008. For clarity, in the remainder of the text, we use the term pathway to refer to all of these levels. Pathway abundance and presence was measured as in Tringe et al, 2005b; Kunin et al, 2008; and Gianoulis et al, 2009. In brief, predicted protein sequences were searched against the extended database of proteins assigned to OGs and pathways in STRING 7.0 (von Mering et al, 2007b), by using BLASTP (Altschul et al, 1990), and an OG/pathway was called present when a hit matching 1 of its proteins occurred (with a BLAST score of at least 60 bits; see Supplementary Table S9 for further annotation statistics). Gene frequencies were determined by counting the number of reads contributing to that gene, standardizing for sample size. Likewise, the OG/pathway frequency for each site was assigned by summing the total number of instances of that OG/pathway (i.e., reads mapping to a gene assigned to that pathway) for a particular site and standardizing by total number of assignments for that site to compensate for sample coverage differences. For all correlation analyses, pathways for which the summed count over all sites constituted equal to or less than 0.01% of the total count were removed to avoid artifacts. Correlation was carried out using the relative counts of genes involved in specific metabolic pathways (or modules/OGs). For ease of reading, this is described as a correlation between an environmental factor and the pathway. All results described were also manually scrutinized, and for all case studies, confirmation was sought at multiple levels of resolution (map‐module‐operon‐OG) to reduce artifacts.

Pairwise correlations, linear regression and canonical correlation analysis

Correlation analysis on the extended set of environmental parameters and ecological variables was performed using the same methodology as used in Gianoulis et al (2009). In brief, we computed pairwise Spearman correlations between pathway frequencies and environmental variables over all samples (controlling the false discovery rate by Benjamini and Hochberg (1995) correction for multiple testing. Linear models were constructed in two directions: (i) in the case of the prediction of primary production the pathway frequencies acted as dependent variables and environmental conditions the response variables, whereas in (ii) the investigation of the effect of environment on community composition, pathway frequency was treated as the response variable and predicted from environmental factors. For both models, we used stepwise regression analysis (SRA; implementation in the R‐stats package) to reduce the number of response variables in the model. To avoid overfitting in (i), we used only the top 15 pathways that showed the highest pairwise correlation (as measured by uncorrected P‐value) with the environmental feature modeled. Linear models were considered significant at P<0.05 for both the total model and the estimate of the variable coefficients. For regressions in both directions, the pathway frequencies were standardized to a mean of 0 and a s.d. of 1. For (i), we used the centered, quantile‐normalized primary production data transformed into percentiles to ensure a truly normal distribution and, thus, accurate P‐values. As the linear model construction procedure did not allow any missing values, we removed all samples with missing environmental data values (21 samples remaining). In (i), a leave‐one‐out cross‐validation procedure was used to assess the behavior of the derived model on samples not used for training. This procedure was used both at the feature selection step as well as the prediction step. First, the set of dependent variables with significant weights in the model were determined for all (n−1) sample combinations using the SRA feature selection approach, and the most frequent set of features was chosen as the final model (see Supplementary Table S8 for results). Next, using these features, parameter estimation was performed on each combination of (n−1) samples and the predicted primary production was compared with the observed value (see Supplementary Figure S4 for results). Regularized CCA was used to identify the set of projections that maximally correlate pathways and environmental variables (Gianoulis et al, 2009). Structural CCs (the correlation between the original variable and the canonical variate) were used to estimate the importance of one variable relative to all of the other in the maximization of the correlation between pathways and factors. To test the statistical significance of structural correlations, 1000 randomized data sets (sample labels permuted) were analyzed in the same way as the original data set. The statistical significance was calculated by comparing the observed structural correlations to the distributions obtained from randomization, and a significance threshold of P<0.05 after Benjamini and Hochberg (1995) correction for multiple testing was applied.

Functional diversity estimates and biogeography

Functional richness and diversity (Raes and Bork, 2008) were estimated from the OG abundance counts (OG richness) using EstimateS ( Richness was calculated using the Chao1 estimator (Chao, 1984) and for functional diversity, Simpson's diversity index was used (Magurran, 1988). We also investigated the use of other indices (e.g., ACE, Michaelis Menten), but this did not affect results very much (data not shown). In addition, gene family richness was determined from gene family mappings of each sample (data from Yooseph et al, 2007).

To investigate the relative importance of environment (climate) and geographic separation, Mantel and partial Mantel tests (based on Pearson's correlations, 10 000 permutations) were performed on environmental similarity, metagenomic similarity and geographic distance matrices, which were calculated using Euclidian distances between the scaled environmental variables, Bray–Curtis distances between pathway frequency matrices and raw haversine distances (in km) between samples, respectively. Calculations were done using the functions in the ‘stats’, ‘ecodist’ and ‘vegan’ libraries in the R‐package (www.r‐; Goslee and Urban, 2007; Oksanen et al, 2008).


Visualization of phylogenetic data was performed using iTol (Letunic and Bork, 2007). Primary production and functional richness data were mapped on the globe using google earth (

Species mapping

Proteins from selected pathways were mapped to nodes of the tree of life (Ciccarelli et al, 2006) using an in‐house perl script, based upon the lowest common ancestor approach (Huson et al, 2007). Input data were BLASTp results of the proteins against the STRING 7 database (von Mering et al, 2007b). Only hits above 60 bits and whose scores lied within 10% of the best score were considered.

Conflict of Interest

The authors declare that they have no conflict of interest.

Supplementary Information

Supplementary Material [msb20116-sup-0001.pdf]

Supplementary Tables S2, S3 and S7 [msb20116-sup-0002.xls]


We thank Chris Creevey and the other Bork laboratory members for their valuable input, as well as four anonymous reviewers for their constructive remarks. This research is supported by the European Union FP7 Program Contract no. HEALTH‐F4–2007‐201052.

Author contributions: JR and PB designed the research. JR, IL, TY, LJJ analyzed the data and/or provided data analysis expertise. JR and PB wrote the paper.


This is an open‐access article distributed under the terms of the Creative Commons Attribution License, which permits distribution, and reproduction in any medium, provided the original author and source are credited. This license does not permit commercial exploitation without specific permission.