## Abstract

Functional annotation of proteins is a fundamental problem in the post‐genomic era. The recent availability of protein interaction networks for many model species has spurred on the development of computational methods for interpreting such data in order to elucidate protein function. In this review, we describe the current computational approaches for the task, including direct methods, which propagate functional information through the network, and module‐assisted methods, which infer functional modules within the network and use those for the annotation task. Although a broad variety of interesting approaches has been developed, further progress in the field will depend on systematic evaluation of the methods and their dissemination in the biological community.

## Introduction

The past decade has seen a revolution in sequencing technologies, resulting in hundreds of sequenced genomes. A fundamental challenge of the post‐genomic era is the interpretation of this wealth of data to elucidate protein function. To date, even for the most well‐studied organisms such as yeast, about one‐fourth of the proteins remain uncharacterized (Figure 1).

Classical computational approaches to gene annotation collect for each protein a set of features characterizing it, and apply machine‐learning algorithms to infer annotation rules based on those features (Pavlidis *et al*, 2001). The newly available large‐scale networks of molecular interactions within the cell have made it possible to go beyond these one‐dimensional approaches, and study protein function in the context of a network. In particular, novel high‐throughput technologies for protein–protein interaction (PPI) measurements (Aebersold and Mann, 2003; Fields, 2005) have created large‐scale data on protein interaction across human and most model species. These data are commonly represented as networks, with nodes representing proteins and edges representing the detected PPIs.

In this review, we survey the growing body of works on functional annotation of proteins via their network of interactions (summarized in Table I). We distinguish two types of approaches (Figure 2): direct annotation schemes, which infer the function of a protein based on its connections in the network, and module‐assisted schemes, which first identify modules of related proteins and then annotate each module based on the known functions of its members. Naturally, the presented methods and the emphasis on particular ones reflect the opinions of the authors.

## Direct methods

The common principle underlying all direct methods for functional annotation is that proteins that lie closer to one another in the PPI network are more likely to have similar function. As can be seen in Figure 3, there is an evident correlation between network distance and functional distance, that is, the closer the two proteins are in the network the more similar are their functional annotations. The methods described below differ in the way they capture and exploit this correlation. In the following, we denote the PPI network as a graph *G=(V,E)* (see Box 1 for graph‐theoretic definitions).

### Box 1 Graph‐theoretic concepts

A **graph** is a pair *G=*(*V,E*), where *V* is a set of *vertices* (or *nodes*) and *E* is a set of *edges* connecting pairs of vertices. In PPI networks, the vertices represent proteins and the edges represent interactions.

The **distance** between two vertices in a graph is the number of edges on a shortest path between them.

The **diameter** of a graph is the maximum distance between any two of its vertices.

The **neighborhood** of a vertex is the set of vertices connected to it.

The ** n‐neighborhood** of a vertex is the set of vertices whose distance from it is at most

*n*.

A **clique** in a graph is a fully connected subgraph, that is, a subgraph in which every two vertices are connected by an edge.

The **degree** of a vertex is the number of its neighbors.

A **cut** in a graph is a partition of the vertices into two non‐overlapping sets. A **multiway** **cut** is a partition of the vertices into several disjoint sets. The **value** of the cut is the number of edges going between different sets.

** Network flow**: Imagine a graph as a network of interconnected pipes. Suppose water gets into one or more vertices (

*sources*) from the outside, and can exit the network at certain other vertices (

*sinks*). Then, it will spread in the pipes and reach other nodes, until it exits at sinks. The

*capacities*of the edges (i.e., how much the pipe can carry per unit time) and the input at the sources determine the amount of flow along every edge (i.e., how much each pipe actually carries) and the amount exiting at each sink. In the context of a PPI network, by considering proteins that have a certain function as sources and simulating flow in the network, the amount of flow at edges and sinks can be used to annotate additional proteins.

The **density** of a graph is the fraction of edges it actually has out of all possible vertex pairs. Hence, the density of *G*=(*V*,*E*) is 2*∣E∣/*(*∣V∣*(*∣V∣*−1)), and a clique graph has the maximum possible density, that is, 1.

The **clustering coefficient** of a vertex is the density of its neighborhood (Watts and Strogatz, 1998).

A graph is called a ** k‐core** if the minimal degree in it is

*k*.

** Core clustering coefficient**: For a parameter

*k*, the core clustering coefficient of a vertex is the density of the largest

*k*‐core of its immediate neighborhood (Bader and Hogue, 2003).

An **adjacency matrix** of a graph *G*=(*V,E*) is a matrix A_{∣V∣×∣V∣}={*a*_{ij}} where *a*_{ij}*=*1 if and only if *v*_{i} and *v*_{j} are neighbors. As PPI networks usually do not contain loops, in our context *a*_{ii}*=*0.

A graph is called **bipartite** if its vertices can be partitioned into two disjoint sets such that no edge connects two vertices of the same set.

### Neighborhood counting

The simplest and most direct method for function prediction determines the function of a protein based on the known function of proteins lying in its immediate neighborhood. Schwikowski *et al* (2000) predict for a given protein up to three functions that are most common among its neighbors. Although simple and effective, the obvious caveats of this approach are that associations are not assigned any significance values and the full topology of the network is not taken into account in the annotation process.

Hishigaki *et al* (2001) try to tackle the first problem by computing χ^{2}‐like scores for function assignment. In detail, they examine the *n*‐neighborhood of a protein (Box 1). For a protein *p*, each function *f* is assigned a score (*n*_{f}‐*e*_{f})^{2}/*e*_{f}, where *n*_{f} is the number of proteins in the *n*‐neighborhood of *p* that have the function *f* and *e*_{f} is the expectation of this number based on the frequency of *f* among the network's proteins. A shortcoming of this approach is that within the *n*‐neighborhood, proteins at different distances from *p* are treated in the same way. Chua *et al* (2006) try to tackle the second problem by investigating the relation between network distance and functional similarity. They focus on the 1‐ and 2‐neighborhoods of a protein, and devise a functional similarity score that gives different weights to proteins according to their distances from the target protein.

### Graph theoretic methods

As the PPI network is a graph, it is natural to apply graph algorithms for its functional analysis. Two main approaches have been suggested in this context: cut‐based approaches and a flow‐based algorithm. In contrast to the local neighborhood counting methods, these approaches are global and take into account the full topology of the network.

Vazquez *et al* (2003) aim at assigning a function σ_{v} to each unannotated protein *v* so as to maximize the number of edges that connect proteins (unannotated or previously annotated) assigned with the same function. Precisely, they try to maximize

where *E′* is the set of edges incident on two unannotated proteins, δ is a function that equals 1 if *x=y* and 0 otherwise, and *h*_{v}(*f*) denotes the number of neighbors of *v* previously annotated with function *f*. The first term in the optimization criterion concerns unannotated proteins, whereas the second term accounts for interactions between unannotated and previously annotated proteins. This optimization problem, which generalizes the computationally hard problem of minimum multiway cut (Box 1) (Dahlhaus *et al*, 1994), is heuristically solved using simulated annealing.

Karaoz *et al* (2004) use a similar approach but handle one function at a time. Each annotated protein *v* receives a state *s*_{v} that equals +1 if *v* has the function in question and equals −1 otherwise. Next, an assignment of −1 and +1 states to unannotated proteins is sought so as to maximize . The advantage of this formulation is that a partition of the vertices into two sets only is sought. To find a good partition, Karaoz *et al* (2004) apply a local search procedure in which for every vertex in turn (until convergence), the state of the vertex is changed according to the majority of the states of its neighbors. This procedure guarantees a solution with value at least half of the optimum.

A related method suggested by Nabieva *et al* (2005) again formulates the annotation problem as a minimum multiway cut problem, where the goal is to assign a unique function to all unannotated proteins so as to minimize the cost of edges connecting proteins with different assignments. They propose an integer programming reformulation of this problem, which allows them to solve the problem in practice (on the yeast PPI network) by applying a commercial solver (CPLEX).

Finally, Nabieva *et al* (2005) also suggest a flow‐based (Box 1) approach for the annotation problem. As they note that the maximum‐cut‐based approaches described above take into account global properties of the network but do not reward local proximity, they propose a novel method that aims at considering both local and global effects. They handle one function at a time. The basic idea is to treat each protein annotated with the function as the source of a ‘functional flow’. After simulating the spread over time of this functional flow through the network, each unannotated protein is assigned a score for having the function based on the amount of flow it received during the simulation.

### Markov random field

A number of probabilistic approaches to the annotation problem have been suggested, all relying on a Markovian assumption: the function of a protein is independent of all other proteins given the functions of its immediate neighbors. This assumption gives rise to a Markov random field (MRF) model (Box 2), initially proposed by Deng *et al* (2003). Deng *et al* (2003) devise an MRF model in which the probability that a protein *v* is assigned with a certain function that occurs with frequency *f* is a logistic function of , where α and β are model parameters and *N*(*v,1*) and *N*(*v,0*) are the numbers of neighbors of *v* that are assigned or not assigned with the function, respectively. The parameters α and β of the model are first estimated using a quasi‐likelihood method, and then Gibbs sampling is used for inferring the functions of unannotated proteins. Interestingly, as pointed out by Deng *et al* (2004), setting α to 0 and β to 1 yields the exact optimization criterion used by Karaoz *et al* (2004).

#### Box 2 Function prediction using the MRF method

The Markov random field (MRF) model provides a probabilistic framework for simulating the mutual influence of random variables via a neighborhood system. Given a network of influence, the state of any random variable is assumed to be independent of all other random variable states given those of its immediate neighbors. In the function prediction setting, each random variable corresponds to a protein, and its states correspond to certain functional annotations. The joint distribution of the random variables can be shown to factorize over the cliques (Box 1) of the network (Besag, 1974). That is, the probability of a certain assignment of discrete states *x=(x*_{1}*,…,x*_{N}*)* is

where *N* is the total number of variables, *Z* is a normalizing constant, *C* is the set of all cliques in the network, *H*_{c} is a potential function associated with clique *c* and *x*_{c} is the assignment of states to the members of *c*.

Inference in this general model is computationally hard, hence it is common to assign 0 potentials to all cliques of size greater than 2, and further homogenize the model by associating the same potential function with all cliques of the same size. For such a homogeneous second‐order MRF, we have

Deng *et al* (2003) treat one function at a time. To obtain a second‐order MRF model, they assume that the probability of a 0/1 annotation over the entire network is proportional to exp(α*N*_{01}+β*N*_{11}*+N*_{00}), where α,β are parameters for weighting the contributions of the different terms and *N*_{ij} is the number of interacting pairs with assignment *i,j* (unordered). Combining the *a priori* probability of an assignment with *N*_{1} 1 s, which depends on the frequency *f* of the function and is proportional to (*f*/(1 − f))^{N},_{1} they obtain a homogeneous second‐order MRF for which

Hence, the probability that protein *v* is assigned with the function given the annotations of its neighbors *N(v)* is

where *N*(*v,i*) is the number of neighbors of *v* that are assigned with *i*∈{0,1} and *logit* is the logistic function logit(*x*)=1/(1+e^{−x}). Deng *et al* (2003) estimate the two parameters of the model using a quasi‐likelihood method and apply Gibbs sampling to infer the unknown functional annotations.

Letovsky and Kasif (2003) also use an MRF model. Their main assumption is that the number of neighbors of a protein that are annotated with a given term is binomially distributed, where the distribution's parameter depends on whether the protein has that function or not. They employ loopy belief propagation (Murphy *et al*, 1999) to perform inference in their model.

### Integrating multiple information sources

Several authors have integrated data from multiple sources for the annotation task. The approaches differ in the way the sources are combined. The simplest approach treats each data source independently. Such an approach for function prediction was first introduced by Marcotte *et al* (1999) in a different context. Joshi *et al* (2004) integrated PPIs, genetic interactions and coexpression interactions. For each interaction type they estimated the *a priori* probability of functional association given an interaction of this type. By treating all interactions (within‐ and between‐types) as independent of one another, they combined them into a single reliability score for the functional association of a protein given the annotations of its neighbors.

A second approach for combining multiple data types constructs a joint probabilistic model or a prediction function of the different types. Deng *et al* (2004) generalized their earlier MRF approach to allow using multiple networks in the annotation process, and applied the method to annotate yeast proteins using both PPIs and genetic interactions (Deng *et al*, 2004; Lee *et al*, 2006). Lanckriet *et al* (2004) and later Tsuda *et al* (2005) represent each data type using a matrix of kernel similarity values. These matrices are then combined by learning optimal relative weights for the different kernels.

## Module‐assisted methods

The prevalence of the modularity paradigm in molecular cell biology has led to an extensive use of modules in prediction of molecular functions. By *functional module* one typically means a group of cellular components and their interactions that can be attributed to a specific biological function (Hartwell *et al*, 1999). Instead of predicting functions for individual genes, a module‐assisted approach attempts to first identify coherent groups of genes and then assign functions to all the genes in each group. The module‐assisted methods differ mainly in their module detection technique. Once a module is obtained, simple methods are usually used for function prediction within the module. For example, every function shared by the majority of the module's genes is assigned to all the genes in the module. Alternatively, a hypergeometric enrichment *P*‐value is computed for every function:

where *n* is the number of nodes in the PPI network, *f* is the number of genes in the network annotated with the function and *m* is the module size. The functions enriched within the module (i.e. obtaining *P*‐value below some threshold) are then predicted for all the genes in the module.

Module finding algorithms can be divided into methods using solely network topology information and methods that utilize addition data sources, such as gene expression measurements or deletion phenotypes. The algorithms vary in their ability to detect overlapping modules and in the use of interaction reliabilities. The exact definition of a functional module also varies. The purpose of some algorithms is the detection of molecular complexes, but others can also detect sparser structures, such as signaling pathways (Steffen *et al*, 2002; Yeang *et al*, 2005; Scott *et al*, 2006). Here, we focus on the former, as the latter methods are currently not aimed at function prediction.

### Detecting functional modules from network topology

Several works focused on the detection of functional modules based solely on protein interaction data. Most of the works described below decompose the PPI network into subnetworks based on some topological properties.

The molecular complex detection algorithm (MCODE) (Bader and Hogue, 2003) consists of three stages: vertex weighting, complex prediction and an optional post‐processing step. The weighting of nodes is based on the core clustering coefficient (Box 1). Bader and Hogue (2003) propose the use of this coefficient instead of the standard clustering coefficient, as it increases the weights of heavily interconnected graph regions while giving small weights to the less connected vertices, which are abundant in the scale‐free protein interaction networks. Once the weights are computed, the algorithm traverses the weighted graph in a greedy fashion to isolate densely connected regions. The post‐processing step filters or adds proteins based on connectivity criteria. MCODE has been used in several recent publications describing mapping of large‐scale interaction networks (LaCount *et al*, 2005; Rual *et al*, 2005). It is available as a plug‐in for the Cytoscape network visualization software (Shannon *et al*, 2003).

Altaf‐Ul‐Amin *et al* (2006) use a similar approach: they define a *cluster property* of a node *n* with respect to a cluster *C* as the number of edges between *n* and the nodes of *C* divided by that number averaged over the nodes of *C*. Starting from single nodes, clusters are gradually grown as long as the cluster property of the added nodes and the density of the cluster both exceed a certain threshold.

Sharan *et al* (2005) proposed the NetworkBlast algorithm for detecting protein modules in protein interaction networks. Each candidate set of proteins is assigned a likelihood ratio score that measures its fit to a protein complex model versus the chance that its connections arise at random (Box 3). A greedy network search algorithm is subsequently used for the detection of high‐scoring modules. The method can be generalized to identify modules that are conserved over several networks.

#### Box 3 Detecting dense subgraphs using maximum likelihood scoring

Maximum likelihood‐based scoring has been used for detecting molecular complexes in NetworkBlast (Sharan *et al*, 2005) and for integration of heterogeneous data through biclustering in SAMBA (Tanay *et al*, 2002, 2004, 2005). For a given subnetwork *H*, this technique compares the probabilities of two alternatives: (i) *H* has dense structure (the *subnetwork* model) and (ii) *H* is random (*null* model).

When assigning a score to a complex, in the subnetwork model, every possible interaction in the subnetwork exists with some high probability β, independently of other protein pairs. In the null model, every two proteins *u*,*v* are connected with probability *p*_{u,v} that depends on their degrees. *p*_{u,v} can be estimated by generating a collection of random networks preserving the degree of every protein and calculating the fraction of networks in which an interaction between *u* and *v* exists. The log‐likelihood ratio of the two alternatives for a set of proteins *C* with a set of interactions *E*(*C*) is thus

In case edge reliabilities are available, it is possible to incorporate them into the subnetwork scoring model by assessing the probability of the available observations, given that the protein pair *u,v* interacts (*P*(*O*_{u,v})∣*T*_{u,v})) and does not interact (*P*(*O*_{u,v})∣*F*_{u,v})). The log‐likelihood score then becomes

In the SAMBA algorithm (Tanay *et al*, 2002), a bipartite graph represents the genomic data: nodes on side *A* represent genes and nodes on side *B* represent different properties (Figure 4). A likelihood ratio is used to score a potential bicluster (*A,B*) by summing over all the edges between *A* and *B*:

Using these scores, optimization algorithms based on hashing (Tanay *et al*, 2002) and local search (Sharan *et al*, 2005) can be used to detect high‐scoring subnetworks.

### Hierarchical clustering‐based methods

Several works described the use of hierarchical clustering for module detection. A key decision in the use of this approach is the selection of the similarity measure between protein pairs. An intuitive similarity metric is based on pairwise distances (Box 1) between proteins in the network. One of the problems in using it in a hierarchical clustering setting is that the distances between many protein pairs are identical (the *ties in proximity* problem) (Arnau *et al*, 2005). Rives and Galitski (2003) postulated that module members are likely to have similar shortest path distance profiles. They therefore used hierarchical clustering on the all‐pair shortest path distances matrix and outlined modules by manual inspection. Although such an approach is impractical for genome‐scale networks, it has been successfully applied to focused subnetworks, such as the network of interactions between nuclear proteins and a regulatory network of filamentation in *Saccharomyces cerevisiae* (Rives and Galitski, 2003). A more sophisticated version of this approach was recently applied to the gene expression machinery network (Maciag *et al*, 2006). The authors used a special weighted form of mutual clustering coefficient (Goldberg and Roth, 2003) for quantifying similarities between proteins, and a modified version of the *k*‐means algorithm (Hartigan, 1975) was used for cluster detection.

Arnau *et al* (2005) used the shortest path length between proteins as a distance measure and attempted to overcome the ‘ties in proximity’ problem by obtaining multiple, equally valid hierarchical clustering solutions with a random choice when ties are encountered. The fraction of the solutions in which the protein pair was clustered together was then used as a similarity measure for clustering using standard hierarchical algorithms. Additional similarity measures proposed for using in a hierarchical clustering setting include Czekanovski–Dice distance (Brun *et al*, 2003) and the statistical significance of the number of common interaction partners (Samanta and Liang, 2003).

### Graph clustering methods

Numerous graph‐clustering algorithms have been applied to the graph representing the binary interactions. Spirin and Mirny (2003) proposed two such algorithms. The first algorithm is based on superparamagnetic clustering (SPC) (Blatt *et al*, 1996) and the second is a Monte Carlo algorithm maximizing the density of the obtained clusters. Both algorithms require the size of the sought clusters as input. The SPC algorithm is specifically shown to perform well in detecting dense structures loosely connected to other areas of the network. In addition to detecting protein complexes, Spirin and Mirny (2003) also show that their method is capable of detecting sparsely connected functional modules, such as the MAPK signaling cascade.

Przulj *et al* (2004) used the highly connected subgraphs (HCS) algorithm (Hartuv and Shamir, 2000) to determine complexes in PPI data. A highly connected subgraph is defined as a subgraph with *n* nodes such that more than *n/2* edges must be removed in order to disconnect it. It can be shown that this property ensures that the diameter of the subgraph (Box 1) is at most two, and that it is at least half as dense as a clique of the same size. The HCS algorithm finds a minimum cut (Box 1) in the graph and uses it to partition the graph. This process is repeated recursively until highly connected components are reached.

The restricted neighborhood search clustering (RNSC) algorithm proposed by King *et al* (2004) partitions the node set of the network into clusters based on a cost function that is used to evaluate the partitioning. The algorithm starts with a random cluster assignment and proceeds by reassigning nodes, so as to maximize the partition's score. In addition, a tabu list is maintained to avoid cycling back to previously explored partitions. Finally, the clusters are filtered based on their size, density and functional homogeneity.

The Markov clustering (MCL) algorithm (Enright *et al*, 2002) has been recently proposed for complex detection in PPI data (Krogan *et al*, 2006). The algorithm simulates flow on the PPI graph by constructing its adjacency matrix (Box 1) and computing its successive powers to increase the contrast between regions with high flow and regions with a low flow. This process can be shown to converge towards a partition of the graph into high‐flow regions corresponding to protein complexes, separated by regions of no flow.

Additional graph clustering techniques applied include clique percolation (Adamcsek *et al*, 2006), graph flow simulation (Pereira‐Leal *et al*, 2004), edge‐betweenness clustering (Dunn *et al*, 2005) and spectral methods (Bu *et al*, 2003).

### Expansion of complex seeds

In contrast to finding complexes *de novo* in the protein interaction network, several works attempted prediction of new members for partially known protein complexes. The *Complexpander* software (Asthana *et al*, 2004) receives a particular ‘core’ set of proteins and produces a list of candidate proteins, ranked by the probability of membership in the complex. This approach approximates the probability that a given candidate protein is co‐complexed with the core by the probability that a path consisting of stable protein interactions exists between the candidate and a member of the core. The algorithm assigns a weight to each pair of proteins, representing the probability that they interact directly and stably. These weights are then used to estimate the probability that two proteins are connected by a path, by generating a collection of random networks and counting how many of them contain a short path between the protein pair.

For a similar problem, an algorithm called SEEDY was proposed by Bader (2003). SEEDY constructs complexes by adding proteins to a given seed, as long as the reliability of the most reliable path from a candidate to the seed does not fall below a given threshold. The reliability of a path in SEEDY is defined as the product of the reliabilities of its edges.

Wu and Hu (2005) proposed an algorithm that detects ‘community structures’ (Girvan and Newman, 2002) for a given protein seed, in a network without edge reliabilities. The algorithm first identifies a ‘core’ by a heuristic search for a maximum clique containing the seed, and then expands it through a breadth‐first‐search graph traversal (Cormen *et al*, 1990) with additional vertices, such that it will meet a certain ‘community’ criterion.

### Integrated analysis of interactions and expression profiles

Several studies have integrated PPI data with diverse additional sources to infer modular structures. These modular structures can serve as potent function predictors, and also shed light on the interplay between the different sources of information.

Microarray technologies (Schena *et al*, 1995), which measure expression levels on a genome‐wide scale, are currently the largest source of high‐throughput genomic information. A natural question is the relation between transcription pattern similarity of a pair of genes and the existence of a protein interaction between their products. As shown in both simple and complex organisms (Ge *et al*, 2001; Hahn *et al*, 2005), genes of interacting proteins tend to share similar expression patterns. Following this observation, the use of both information sources together for analysis of functional modules has been an appealing concept adopted by many research groups.

One line of works proposed a two‐step approach: extraction of a group of genes that are highly expressed in a certain condition, and then analysis of the topological properties of PPI networks induced by these genes (Luscombe *et al*, 2004; Balazsi *et al*, 2005; de Lichtenberg *et al*, 2005; Wachi *et al*, 2005). Other works used the reverse approach by analyzing the expression coherence of known pathways or complexes (Zien *et al*, 2000; Jansen *et al*, 2002; Tornow and Mewes, 2003; Simonis *et al*, 2004).

Several approaches have been proposed for identifying functional modules by simultaneous analysis of the network and the expression data. Ideker *et al* (2002) introduced a framework for identification of active subnetworks, that is, connected regions of the network that show significant changes in expression over a particular subset of the conditions. This method uses *P*‐values calculated for every measurement in the expression data to derive a statistical score for every candidate subnetwork, and utilizes simulated annealing to search for high‐scoring subnetworks. A similar methodology was recently employed by (Cabusora *et al* (2005), using shortest‐paths algorithms for module finding.

The co‐clustering methodology (Hanisch *et al*, 2002) uses a distance function that combines similarity of gene expression patterns and network topology. The network distance between two nodes is an edge‐weighted version of their topological distance in the network. The expression distance is based on the Pearson correlation between the expression patterns. The two distances are combined into a similarity score using a logistic function, and hierarchical clustering is applied to the matrix of the combined distances.

Segal *et al* (2003) provided a probabilistic formulation, in which a module is a group of genes with high pairwise similarities and with a significant portion of the possible interactions. A probabilistic graphical model was used to extract a pre‐specified number of modules from gene expression measurements combined with a protein interaction data set. This method thus produces groups of genes that are both coexpressed and exhibit a dense interaction pattern.

### Integrated analysis of interactions and diverse genomic data

Kelley and Ideker (2005) described the first module‐assisted integration of protein interaction data with genetic interaction data. Synthetic lethality and synthetic sickness are interactions between two nonessential genes whose combined deletion is lethal or produces a severe growth effect, respectively. Kelley and Ideker (2005) addressed the question of whether genetic interactions occur mostly within or between different pathways by identifying statistically significant modular structures in the combined network of physical (protein–protein and protein–DNA) and genetic interactions. By defining pathways as dense subnetworks in the physical interaction network, Kelley and Ideker (2005) show that genetic interactions occur between pathways much more often than within pathways. The modular structures led to better function prediction than using physical interactions alone.

Another source of rich genomic information that is becoming increasingly available is the collection of phenotypes of strains containing a deletion for a single nonessential gene and exposed to diverse conditions (Brown *et al*, 2006). Haugen *et al* (2004) measured deletion phenotypes along with expression profiles in the arsenic response of *S. cerevisiae*. The data were then integrated with a metabolic network and a physical network using the ActiveModules algorithm (Ideker *et al*, 2002). This type of integration was used to reveal proteins central to the arsenic response and to provide network‐based explanations to the differences between the phenotypically sensitive pathways and the differentially expressed ones.

Tanay *et al* (2004) described an integrative framework allowing the integration of protein interaction data with gene expression, phenotypic sensitivity and transcription factor (TF) binding, using the SAMBA biclustering algorithm (Tanay *et al*, 2002). SAMBA models the genomic data as a bipartite graph (Box 1), where nodes on one side represent genes and that on the other side represent different properties derived from the genomic data (Figure 4). For protein interaction data, a property represents the presence of an interaction with a specific protein. A gene expression experiment is represented by several properties corresponding to different expression level ranges. A TF property represents the binding of the gene promoter by the TF. The bipartite graph is weighted using a maximum likelihood‐based score (Box 3). Heavy subgraphs in the bipartite graph correspond to groups of genes that manifest a common behavior across a large set of heterogeneous experiments. This approach was experimentally shown to provide accurate function prediction, and was later extended to analyze a compendium of some 2000 distinct experiments in *S. cerevisiae*, yielding 1200 statistically significant modules (Tanay *et al*, 2005).

## Performance comparison

The availability of such a wide range of methods calls for a comprehensive comparison among them. Below we summarize some of the key comparisons reported so far.

### Direct methods

For direct methods, no systematic comparison has been reported so far, but some information on the performance of the different methods can be gleaned from comparisons made in the annotation studies reviewed above, one of which we describe below.

Several measures have been suggested to evaluate the quality of a direct annotation method (see e.g., Deng *et al*, 2003; Nabieva *et al*, 2005). All these measures are close variants of the one proposed by in Deng *et al* (2003). The latter is based on measuring the precision and recall of an annotation, computed in a leave‐one‐out setting (i.e., the known annotation of a single protein at a time is hidden and predicted using the network and the annotations of all other proteins), taking into account multiple annotations per protein. Specifically, let *n*_{i} be the number of known functions for protein *i*, let *m*_{i} be the number of predicted functions for the protein when hiding its true annotations and let *k*_{i} the overlap between the two sets. The precision and recall of the predictions are defined as

A discussion of other measures that are applicable if the Gene Ontology (GO) annotation is used and of weighted matches between known and predicted functions based on their positions in the GO hierarchy (Ashburner *et al*, 2000) appears in Deng *et al* (2004).

Chua *et al* (2006) compared several schemes, including neighborhood counting (Schwikowski *et al*, 2000), the χ^{2} method of Hishigaki *et al* (2001), the MRF method of Deng *et al* (2003) and the flow‐based method of Nabieva *et al* (2005). (To avoid possible bias, we do not report here on the method of Chua *et al* (2006), which was also included in the comparison.) Each method was applied to the MIPS interaction set (Mewes *et al*, 2002) using the GO annotation, and was evaluated using the precision and recall measures. The MRF method outperformed the others by a significant margin, whereas the other three methods exhibited similar performance. These results were consistent across the three main MIPS categories: cellular role, biochemical function and subcellular localization. The advantage of MRF is probably owing to the use of a more sophisticated probabilistic model.

### Module‐assisted methods

One of the obstacles to systematic evaluation of the different module‐assisted methods for functional annotation is the lack of agreed upon technique for function prediction within a module. However, if one considers a module as a functional unit whose member proteins have identical functions, then one could evaluate a module‐assisted method by the fit between the produced modules and either the MIPS complexes catalog (Mewes *et al*, 2002) or GO categories (Ashburner *et al*, 2000), using measures similar to those described in the Direct methods section. A systematic quantitative evaluation of four module‐assisted clustering algorithms has been presented recently by Brohee and van Helden (2006): RNSC, SPC, MCODE and MCL. As the authors of this study were not involved in the development of any of the algorithms, the chances of inadvertent evaluation bias are low. Brohee and van Helden (2006) used a test graph constructed by representing 220 known complexes as cliques in the graph and generated 41 altered graphs by random addition or removal of edges in different proportions. The parameters of each of the clustering algorithms were then tuned based on this data set using a statistic combining detection sensitivity and specificity. In addition, the four module‐assisted techniques were applied to six PPI graphs formed based on high‐throughput experiments, and their performance was evaluated by their success in recovering known complexes.

The authors found that the MCL algorithm is remarkably robust to graph alternations. MCL had the best performance on both simulated and real data sets, whereas RNSC was relatively less sensitive to suboptimal parameters. MCODE and SPC were shown to be clearly inferior under most conditions. The comparative analysis also highlighted intrinsic strengths and weaknesses of the algorithms. MCODE performed best on random data, by detecting the fewest false positives, owing to its ability to report complexes covering only part of the network, rather than partitioning all the nodes into complexes. SPC tended to generate ‘mega‐complexes’ of very large size, thus obtaining very high sensitivity but very low specificity. Notably, this comparison used unweighted networks, whereas the MCL and SPC algorithms can deal with weighted graphs and are likely to give better performances if weights are assigned to reflect the reliability of the interactions (Pereira‐Leal *et al*, 2004).

### Comparing direct and module‐assisted methods

To the best of our knowledge, no systematic comparison of network‐based function prediction, covering both direct and module‐assisted methods, has been undertaken to date. In a simplistic comparison of two basic methods (Figure 5), we found that a simple neighbor‐counting method has a higher specificity in predicting functions when compared to the more involved module‐assisted MCODE algorithm. This may be explained by the focus of MCODE on processes imposing subgraphs with a dense interaction pattern. A more comprehensive analysis alongside the development of better prediction techniques will highlight these differences.

## Discussion

Efforts for network‐based function prediction have been going on for over 6 years now, since the introduction of molecular techniques capable of mapping protein interactions on a genome‐wide scale. Despite the large number of techniques suggested for functional annotation using networks, systematic annotation is still mostly based on other data sources, such as sequence homology. Several goals, reviewed below, have to be accomplished in order for the network‐based functional annotation tools to become widely used.

Despite the large number of different algorithms developed for both direct and module‐assisted function prediction, the implementations of only a small fraction of them are publicly accessible. Only a handful, such as MCODE (Bader and Hogue, 2003), PRODISTIN (Baudot *et al*, 2006), CFinder (Adamcsek *et al*, 2006) and NetworkBlast (http://www.pathblast.org/), are currently supported with a graphical interface. Networks are highly visualizable, and as the human eye is better in pattern detection than any computer, a good graphical interface will help make such computational tools widely used by the biological community.

Although the field has advanced considerably in recent years on the methodological side, comprehensive comparisons of the plethora of available annotation methods, similar to that performed by Brohee and van Helden (2006), are greatly in need. Such systematic evaluation efforts were recently performed in other fields, for example, discovery of TF‐binding sites (Tompa *et al*, 2005), biclustering of expression data (Prelic *et al*, 2006) and protein structure prediction (Kryshtafovych *et al*, 2005), which has a long and successful history of community evaluations. Owing to the fundamental differences between the different annotation types, such as biological process and molecular complexes, it is clear that different methods are best suited for different types of annotation. An important prerequisite for a comprehensive comparison is the definition of golden standards for functional annotation. In most of the studies described here, the MIPS complexes catalog and Gene Ontology were used as a benchmark for prediction success. However, both data sets are currently not comprehensive and some annotations are found in one but not in the other.

Although several methods described above use diverse functional genomic data sources, they are still greatly outnumbered by methods utilizing only the network topology. Owing to the increasing accessibility of microarray technology, gene expression measurements have become widely available for diverse conditions across species. As of August 2006, almost 95 000 and 45 000 hybridization samples were available in Gene Expression Omnibus and ArrayExpress databases, respectively. This huge body of data is currently poorly exploited by integrative annotation methods, as most of them focus on expression data derived in a single study. Following the success of several methods integrating expression data from multiple studies (Ihmels *et al*, 2002; Lee *et al*, 2004; Segal *et al*, 2005; Tanay *et al*, 2005), we expect that techniques based on large compendia of expression data and protein interaction networks will significantly increase the accuracy of functional annotation. Additional large‐scale genomic data, such as deletion phenotypes (Brown *et al*, 2006), proteomic measurements (Kislinger *et al*, 2006) and protein cellular localization (Huh *et al*, 2003), can also be used in an integrative framework. Data of a high diversity and dimensionality have been integrated using biclustering (Tanay *et al*, 2005) and kernel‐based methods (Lanckriet *et al*, 2004).

This review focused on methods aimed at genome‐scale functional annotation using network data from a single species. Several additional studies developed methods for detection of functional modules in slightly different contexts. In particular, specific algorithms were developed for detection of molecular complexes from lists of proteins identified in biochemical purification experiments, rather than from binary interaction networks (Krause *et al*, 2003; Hollunder *et al*, 2005; Scholtens *et al*, 2005; Gavin *et al*, 2006). Another set of works attempted to identify evolutionarily conserved functional modules via the integration of networks from multiple organisms (Kelley *et al*, 2003; Sharan *et al*, 2005; Campillos *et al*, 2006; Flannick *et al*, 2006; Gandhi *et al*, 2006; see also the review in Sharan and Ideker, 2006).

Which methods should be used by a newcomer to the field? As mentioned above, the limited information about the comparative performance of the methods presented here makes it difficult to decide which method should be used in a specific setting. When using only PPI data, our initial and limited comparison does seem to indicate that direct methods are currently slightly superior to module‐assisted ones, with MRF and MCL being the leading techniques for direct and module‐assisted function prediction, respectively. New techniques should thus be compared to these methods to prove their superiority. If the goal is actual function prediction rather than methodological improvement, the use is mainly limited to methods that are implemented as a tool with a graphical user interface or available as a web server (Table I). As to methods integrating multiple data sources, no comparative assessment is currently available.

When using interaction networks, whether as a sole information source or in conjunction with other data sources, the current limitations of these data have to be recognized. The currently available protein interaction data are known to be both noisy (von Mering *et al*, 2002) and partial (Hart *et al*, 2006). In addition, as large‐scale interaction mappings are conducted only in a single growth condition or in a single tissue type, interaction data currently lack any spatial or temporal information. Clearly, for some functional annotations, the relevant interactions may occur only under specific conditions in a specific time point. In addition, not every functional aspect of the protein is expected to be manifested in its interaction pattern. Some proteins, such as metabolic enzymes, are most functional on their own without the need for cooperation from other proteins.

Despite these caveats, analysis of interaction networks is a young, promising and very active research area. The utilization of such networks for function prediction is just one of a plethora of possible ways by which this rich source of information can be exploited. Although techniques for network‐based function prediction have been continuously improving, there is still a lot of room for improvement, both in terms of the methodologies and in terms of their evaluation. We expect that improved, more accurate methods that are made readily accessible to the biological community will make interaction networks a prevalent instrument for functional annotation, among their many other important uses.

## Acknowledgements

We thank Ken Chua for providing us with his comparison data. RS was supported by an Alon Fellowship. IU is a fellow of the Edmond J Safra Bioinformatics Program at Tel‐Aviv University. This research was supported by a grant from the Israeli Science Foundation.

## References

- Copyright © 2007 EMBO and Nature Publishing Group