Size matters: network inference tackles the genome scale

Boris Hayete, Timothy S Gardner, James J Collins

Author Affiliations

  • Boris Hayete, Bioinformatics Program and Center for BioDynamics, Boston University, Boston, MA, USA
  • Timothy S Gardner, Bioinformatics Program and Center for BioDynamics, Boston University, Boston, MA, USA
  • James J Collins, Bioinformatics Program and Center for BioDynamics, Boston University, Boston, MA, USA

The growing importance of microarray data challenges biologists, and especially the systems biology community, to come up with genome‐scale analysis methods that can convert the large quantity of available high‐throughput data into high‐quality systems‐level insights. One area of systems‐level analysis that has received considerable attention in recent years is that of inferring molecular‐level regulation, with frequent focus on transcriptional regulatory networks (Kholodenko et al, 1997; Tavazoie et al, 1999; Gardner et al, 2003; Segal et al, 2003; Beer and Tavazoie, 2004; Yu et al, 2004; di Bernardo et al, 2005; Gardner and Faith, 2005; Woolf et al, 2005; Margolin et al, 2006; Faith et al, 2007). As microarrays provide a tool for measuring transcript levels of the whole genome, recent interest has shifted to inferring networks on a genome scale. The less‐studied organisms are a natural starting point for such mapping, as it is for these organisms that the rapid, genome‐scale identification of regulatory structure is most needed.

In a recent study, Bonneau et al (2006) apply the Inferelator, their elegant new algorithm for inferring gene networks, to precisely such a little‐studied but important organism. Specifically, the authors focus on Halobacterium NRC‐1, a model archaeon (DasSarma et al, 2006), to show that, at least for a small genome, it is possible to determine a sizeable portion of the transcriptional regulatory network from microarrays without much prior knowledge. This choice of organism has two practical advantages. First, the salt‐loving NRC‐1 is one of a handful of Halobacteria for which transformation techniques have been well studied, allowing in vivo validation of network predictions. Second, NRC‐1's genome is relatively small, and thus its regulation ought to be comparatively easy to reconstruct. Small genome or not, putting high‐throughput profiling technologies to work on the genome scale requires a confluence of robust algorithms, biologically plausible simplifying assumptions, and a robust verification strategy. The work of Bonneau et al (2006) is a good example, using multiple tools in the bioinformatics toolbox to build a credible blueprint of a transcriptional‐regulatory network involving thousands of genes and more than 100 transcription factors.

In order to appreciate the need for a well‐structured approach to regulatory mapping, consider the mathematical and biological scope of this cross‐disciplinary problem. The tiny archaeon Halobacterium NRC‐1 contains about 2400 genes. For each one of these, the goal is to understand the transcriptional regulatory apparatus: that is about 2400 question marks, each with thousands of possible answers in the form of a set of transcriptional regulators. Set that against a typical compendium size of several hundred chips for a given organism, and you get what is known as a ‘small n, large p’ problem, where the number of possible parameters (regulators), p, dwarfs the number of data points (microarrays), n, available to define them. This problem gets considerably worse for complex organisms, where a larger number of available microarrays is more than offset by the vast complexity of large genomes, alternate splice variants, and multiple layers of regulation. For network inference algorithms, ‘small n, large p’ means a dearth of data and very high computational demands.
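A toy calculation may help make the ‘small n, large p’ difficulty concrete. The sketch below uses invented numbers (two arrays, three candidate regulators named r1, r2, and r3 purely for illustration) and a plain linear model, which is an assumption made for simplicity. It shows two contradictory regulatory hypotheses that both reproduce the data exactly, so the data alone cannot choose between them.

```python
# Toy illustration with made-up numbers: n = 2 arrays, p = 3 candidate regulators.
X = [
    [1.0, 0.0, 1.0],  # array 1: expression of hypothetical regulators r1, r2, r3
    [0.0, 1.0, 1.0],  # array 2
]
y = [1.0, 1.0]        # expression of the target gene on each array


def predict(X, coefs):
    """Linear model: target level = sum of coefficient * regulator level."""
    return [sum(c * x for c, x in zip(coefs, row)) for row in X]


model_a = [1.0, 1.0, 0.0]  # hypothesis A: r1 and r2 each activate the target
model_b = [0.0, 0.0, 1.0]  # hypothesis B: r3 alone activates the target

# Residuals (prediction minus observation) for each hypothesis.
residual_a = [p - t for p, t in zip(predict(X, model_a), y)]
residual_b = [p - t for p, t in zip(predict(X, model_b), y)]
print(residual_a, residual_b)
```

Both residual lists are all zeros, which is the ambiguity that extra constraints or extra data must resolve.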

As if this computational complexity were not bad enough, there is the inherent high dimensionality in the biological realm. Regulation happens in the domains of mRNA, proteins, metabolites, kinases, acetylases, and so on, and through a variety of pleiotropic perturbations and influences, such as salinity, temperature, and cell‐wall permeability. As the best high‐throughput data capture only mRNA, one must make simplifying assumptions and skip many important parameters. Bonneau and colleagues’ best simplifying assumption is to focus on predicting the targets of transcription factors in the network, along with some key environmental influences. When only transcription factors are allowed to regulate other genes, the ‘p’ in the ‘small n, large p’ problem is no longer so big. In fact, at 120, it is smaller than the number of chips (268) used in this study.

To further constrain the network learning problem, the Inferelator performs a pre‐processing step of bi‐clustering, that is, organizing experimental data by both genes and conditions. The bi‐clustering algorithm, cMonkey (Reiss et al, 2006), allows further reduction of dimensionality by collapsing genes into conditionally coexpressed modules. cMonkey identified 300 such bi‐clusters, and 159 individual genes that could not be grouped, a nearly six‐fold reduction in dimensionality. Crucially, as the composition of the culture medium used for the microarray‐profiled experiments is known, each bi‐cluster's grouping of genes by experimental condition suggests plausible metabolic or environmental effectors of regulation. The authors exploit this benefit of their approach in one of their verifying experiments. Bi‐clustering, therefore, serves two ends: it limits the number of genes, and thus variables to reconstruct, to fewer than 500 (including only 80 transcription factors and metabolites), and it places each predicted regulatory interaction into an experiment‐specific context.
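The dimensionality reduction described above can be sketched in a few lines. This is not cMonkey: the module assignments are simply given as input, the gene and module names are hypothetical, and profiles are averaged over all conditions, whereas a true bi‐cluster is defined only over a subset of conditions. The sketch only shows how replacing grouped genes by one module profile shrinks the number of variables to model.

```python
from collections import defaultdict


def collapse_to_modules(expression, module_of):
    """Average the profiles of genes that share a module.

    Genes without a module assignment are kept as their own single-gene entry.
    """
    groups = defaultdict(list)
    for gene, profile in expression.items():
        groups[module_of.get(gene, gene)].append(profile)
    return {
        name: [sum(values) / len(values) for values in zip(*profiles)]
        for name, profiles in groups.items()
    }


# Hypothetical expression profiles over three conditions.
expression = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [3.0, 2.0, 1.0],
    "geneC": [0.0, 4.0, 8.0],
    "geneD": [5.0, 5.0, 5.0],  # could not be grouped
}
module_of = {"geneA": "module1", "geneB": "module1", "geneC": "module2"}

reduced = collapse_to_modules(expression, module_of)
print(len(expression), "genes ->", len(reduced), "variables")
```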

The problem now becomes mathematically well‐posed, and the authors solve it using LASSO regression, a sparse regression method designed just for such computationally difficult problems (Tibshirani, 1996). LASSO works by selecting a small set of the most likely regulators of a given gene while simultaneously determining a quantitative influence function relating regulator expression to target expression (Figure 1). In addition, the authors extend the LASSO algorithm beyond its typical linear domain by including piecewise and nonlinear terms in the regression to model saturation effects and pairwise combinatorial regulation. With this approach, the authors construct a model of transcription regulation in Halobacterium that matches 80 transcription factors to 500 predicted gene targets and captures the putative metabolic controllers of these pathways. This is an impressive result, both in size and regulatory complexity, particularly in light of the relatively modest size of the experimental data set (268 microarrays). Moreover, this represents a dramatic leap in our understanding of this little‐studied organism.
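To illustrate how LASSO selects regulators and estimates their influence in one step, here is a minimal sketch of the linear case. It is not the Inferelator's implementation: it assumes the penalized form of the objective, (1/2n) times the residual sum of squares plus lam times the sum of absolute coefficients, and solves it by cyclic coordinate descent, a standard solver chosen here for brevity. The data are invented, with three candidate regulators of which only the first has a strong effect.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(X, y, lam, n_sweeps=200):
    """Minimize (1/2n) * ||y - X b||^2 + lam * sum(|b_j|) by coordinate descent."""
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # correlation of column j with the residual that ignores j
            z = 0.0    # squared norm of column j
            for i in range(n):
                partial = sum(X[i][k] * coefs[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            coefs[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return coefs


# Invented data: 4 arrays, 3 candidate regulators with mutually orthogonal profiles.
X = [
    [1.0, 1.0, 1.0],
    [1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
]
# Target = 2.0 * regulator 1 + 0.1 * regulator 3 (a strong and a very weak effect).
y = [2.1, 1.9, -2.1, -1.9]

coefs = lasso(X, y, lam=0.5)
print(coefs)
```

With this penalty the weak third regulator and the irrelevant second one are set exactly to zero, while the strong first regulator is retained with a coefficient shrunk from 2.0 toward zero. That combination of selection and shrinkage is what makes the method attractive when candidate regulators are many and arrays are few.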

Figure 1.

(A) Schematic diagram of a hypothetical bacterial operon, represented by a single gene Y, which is regulated by a protein X1 and a protein complex X2X3. (B) Within its dynamic range, the level of the transcript y may be modeled as a function of transcripts of the regulatory proteins X1, X2, and X3. The min function captures the notion of cooperativity, and the general form of g incorporates saturation effects. On the genome scale, the initial model for regulation of y would involve all possible transcription factors, and would greatly benefit from parameter shrinkage by LASSO. (C) This table illustrates the representative power of the chosen design matrix. The model can capture AND, OR, and XOR logical functions and saturation effects (not shown). Assigning the shown values to the coefficients from (B) would cause the model to represent the corresponding logical function for the interaction of X2 and X3.
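The claim in (C) can be checked with a small sketch. The coefficient values shown in the figure are not reproduced in this text, so the values below are assumptions, chosen as one set that works for binary inputs. The saturating function g is left out, and the names b2, b3, and b23 are illustrative labels for the weights on X2, X3, and the min(X2, X3) interaction term.

```python
def response(x2, x3, b2, b3, b23):
    """Linear terms for X2 and X3 plus a min() term for their interaction."""
    return b2 * x2 + b3 * x3 + b23 * min(x2, x3)


# Assumed coefficient sets (b2, b3, b23); not taken from the figure.
COEFFICIENTS = {
    "AND": (0, 0, 1),
    "OR": (1, 1, -1),
    "XOR": (1, 1, -2),
}

truth_tables = {}
for name, (b2, b3, b23) in COEFFICIENTS.items():
    # Inputs enumerated in the order (0,0), (0,1), (1,0), (1,1).
    truth_tables[name] = [
        response(x2, x3, b2, b3, b23) for x2 in (0, 1) for x3 in (0, 1)
    ]
    print(name, truth_tables[name])
```

For binary inputs min(X2, X3) equals the AND of the two inputs, which is why a single extra column in the design matrix is enough to express all three logical functions.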

Having obtained the first‐pass transcriptional blueprint, Bonneau and colleagues ask the obligatory next question: how much do we trust this network? In network inference, three broad types of verification are possible: computational verification through cross‐validation, in vivo verification, and literature‐driven curation. To be effective, the last approach should leverage a large data set documenting connectivity known in the literature, such as TransFac (Matys et al, 2003) or RegulonDB (Salgado et al, 2006). As this type of verification is not available for Halobacterium, the authors vigorously pursue the first two, including knockout experimentation and ChIP‐chip analysis, demonstrating that their network can serve as a reliable and useful blueprint of Halobacterium NRC‐1's transcriptional regulation.

Bonneau et al (2006) show the feasibility of mapping a genome‐scale regulatory network from a modestly sized compendium of microarrays, an important success for the systems biology community. As microarray technology continues to improve and costs drop, growing databases of microarrays present an opportunity to infer ever more complex regulatory networks in both microbes and higher organisms. This abundance of data fuels the need for a network inference case study that would clearly map the boundaries of what is possible with today's network mapping algorithms. To this end, we believe that once and future model organisms such as Escherichia coli and Saccharomyces cerevisiae, buoyed by extensive bodies of literature and large databases such as RegulonDB, SGD (Christie et al, 2004), and TransFac, may represent attractive short‐term targets for network inference studies. In addition to the use of curated data sets, it may be possible to seed organisms with small synthetic in vivo networks, the connectivity of which is known by design, and to measure the success of network reconstruction as a whole by the success or failure to reconstruct the seed. We are aware of at least one lab doing such work (Cantone et al, 2006). Biological yardsticks in general will gain in importance, as they supplement in silico testing and usher in algorithms’ transition from design to practical use, and from simple organisms to higher eukaryotes.
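One way such a seeded benchmark might be scored is sketched below. The scoring scheme (precision and recall over regulator-to-target edges among the seeded genes) and all gene names are assumptions for illustration, not a published protocol.

```python
def score_seed_recovery(seed_edges, inferred_edges):
    """Compare inferred regulator->target edges with a network known by design.

    Returns (precision, recall) computed only over edges among seeded genes,
    since the true status of edges involving native genes is unknown.
    """
    seed = set(seed_edges)
    seed_genes = {gene for edge in seed for gene in edge}
    judged = {
        (regulator, target)
        for regulator, target in inferred_edges
        if regulator in seed_genes and target in seed_genes
    }
    recovered = seed & judged
    recall = len(recovered) / len(seed)
    precision = len(recovered) / len(judged) if judged else 0.0
    return precision, recall


# Hypothetical three-gene synthetic loop and a hypothetical inference result.
seed_edges = [("synA", "synB"), ("synB", "synC"), ("synC", "synA")]
inferred_edges = [
    ("synA", "synB"),     # correct
    ("synB", "synC"),     # correct
    ("synA", "synC"),     # wrong direction, counts against precision
    ("nativeX", "synA"),  # involves a native gene, so it is not judged
]

precision, recall = score_seed_recovery(seed_edges, inferred_edges)
print(precision, recall)
```

Restricting the score to edges among seeded genes is a deliberate choice: it avoids penalizing predictions about native genes, whose true regulation is not known by design.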

Challenges remain, but we see the immediate future of network inference as promising and bright. Molecular biologists have long been looking for ways to generate more oomph from their microarrays. Systems biology may have some answers, and we laud Bonneau and colleagues for providing an illuminating step in that direction.