Gene expression is regulated at each step from chromatin remodeling through translation and degradation. Several known RNA‐binding regulatory proteins interact with specific RNA secondary structures in addition to specific nucleotides. To provide a more comprehensive understanding of the regulation of gene expression, we developed an integrative computational approach that leverages functional genomics data and nucleotide sequences to discover RNA secondary structure‐defined cis‐regulatory elements (SCREs). We applied our structural cis‐regulatory element detector (StructRED) to microarray and mRNA sequence data from Saccharomyces cerevisiae, Drosophila melanogaster, and Homo sapiens. We recovered the known specificities of Vts1p in yeast and Smaug in flies. In addition, we discovered six putative SCREs in flies and three in humans. We characterized the SCREs based on their condition‐specific regulatory influences, the annotation of the transcripts that contain them, and their locations within transcripts. Overall, we show that modeling functional genomics data in terms of combined RNA structure and sequence motifs is an effective method for discovering the specificities and regulatory roles of RNA‐binding proteins.
Gene expression is regulated at each step from chromatin remodeling through translation and degradation. Yet, most efforts to understand the regulation of gene expression have been focused on transcription and DNA‐binding regulatory proteins. Although regulatory RNAs have received appreciable attention (Bushati and Cohen, 2007; Coppins et al, 2007), regulatory elements within mRNAs that are recognized by nucleic acid‐binding proteins have been largely ignored until recently (Keene, 2007). This state exists despite observations that suggest changes in mRNA stability may account for half of the changes in mRNA expression in some cells and conditions (Fan et al, 2002; Cheadle et al, 2005). Moreover, it is a mathematical certainty that mRNAs of average stability can only be rapidly downregulated by altering the mRNA decay rate (see Pérez‐Ortín et al, 2007 for derivation). Thus, one way to execute rapid, large‐scale gene expression responses to unpredictable environmental stimuli is through decay‐regulating RNA‐binding proteins (RBPs), whose activity can be rapidly modulated post‐transcriptionally. Early metazoan embryogenesis also requires mRNA stability and translation regulation to orchestrate the activities of maternally deposited transcripts (for review see Vardy and Orr‐Weaver, 2007).
Despite the potential importance of RNA secondary structures as binding sites for regulatory RBPs, computational methods for their discovery have failed to keep pace with current functional genomics technology (e.g. microarrays). Now, well into the era of functional genomics, RNA structure finding algorithms are still sequence‐only methods, having so far failed to use the data‐integrative approaches that are becoming increasingly common for the discovery of DNA‐binding protein specificities (Bussemaker et al, 2001, 2007; Foat et al, 2005, 2006).
In this work, we present a novel, alignment‐free method that discovers secondary structure‐defined cis‐regulatory elements (SCREs) in mRNAs by modeling the effects that their occurrences exert on quantitative measurements of mRNA behavior in the form of microarray data. This process is embodied in a regression‐based algorithm called structural cis‐regulatory element detector (StructRED). The method defines a RNA structure search space, which is small stem–loop structures in this version, and then exhaustively scores all short nucleotide sequences within the structural context for how well their occurrences explain observed microarray measurements. Through an iterative process, a multivariate model consisting of SCRE‐derived mRNA sequence scores is developed to explain the input microarray data. The output of the method is a list of putative SCRE weight matrices and the inferred post‐translational regulatory activities of the unknown trans‐factors across all of the input microarray conditions (trans‐factor activity profiles, TFAPs).
We accurately recover the known stem–loop binding specificities of the RNA‐binding proteins Smaug in Drosophila and Vts1p in S. cerevisiae using mRNA sequences and microarray data. When we inspected the computationally inferred behavior of the Smaug protein across several microarray experiments profiling mRNA levels and translation in developing Drosophila embryos (Pilot et al, 2006; Qin et al, 2007; Tadros et al, 2007; GEO accessions GSE8910, GSE3955, GSE5430), Smaug represses translation of its target mRNAs during the first 2 h of embryogenesis and promotes the degradation of its target transcripts starting at about 2 h of development. Our genome‐wide inferences are consistent with the detailed observations of Smaug destabilizing Hsp83 mRNAs (Semotok et al, 2005) and translationally repressing nanos mRNAs (Dahanukar et al, 1999; Smibert et al, 1999)
In addition to the Smaug SCREs, we discovered six other putative SCREs in Drosophila, which we have labeled Dm1 through Dm6, that have coherent supporting TFAPs and annotation (Figure 4). First, Dm1 and Dm2 were discovered from an mRNA expression microarray time course for Drosophila embryogenesis (Tadros et al, 2007). Those transcripts that contain high‐affinity instances of Dm1 and Dm2 are expressed at decreasing levels as development proceeds, suggesting that they are involved in destabilizing these transcripts at specific developmental stages. The Dm3 and Dm4 specificities were detected using microarray data that compared expression in wild‐type flies and flies lacking the Kep1 RNA‐binding protein (GEO accession GSE6086), suggesting that they may represent the specificity of Kep1. Finally, Dm5 and Dm6 both were detected using polysome association data from the early Drosophila embryo (Qin et al, 2007; GEO accession GSE5430), suggesting that they are involved in regulating translation during embryogenesis.
To answer the question of where the discovered Drosophila SCREs commonly occur in the mRNAs, we scored the occurrences of each SCRE in the 5′ UTR, 3′ UTR, and coding sequence separately and then checked which of these mRNA subsequences performed best at explaining the microarray data. For Dm2, Dm3, Dm4, Dm6, and the Smaug SCREs, the explanatory SCRE occurrences appear primarily in the coding sequences (Figure 5). Dm1, Dm5, and Dm6 still have appreciable signal in the 3′ UTRs, and Dm5 has signal in the 5′ UTR. SCREs frequently appearing in coding sequences provides a strong argument for including whole transcripts when searching for cis‐regulatory elements.
We applied StructRED to human microarray data that measured genome‐wide RBP binding or profiled genome‐wide polysome association. We discovered three SCREs with functionally coherent TFAPs. Occurrences of the Hs1 SCRE correlated with decreased translation in the metastatic colorectal cancer cell line, SW620, versus a non‐metastatic cell line from the same patient, SW480, as measured in a polysome association microarray study (Provenzani et al, 2006; GEO accession GSE2509). Transcripts containing Hs2 SCREs are expressed at a lower level in U937 cells that have been exposed to 12‐myristate 13‐acetate (PMA) and caused to differentiate into a macrophage‐like state (Kitamura et al, 2004; GEO accession GSE1783). Finally, occurrences of Hs3 in mRNAs correlate with increased association with ribosomes in human mammary epithelial cells, regardless of whether translation initiation factor 4F is overexpressed (Larsson et al, 2007; GEO accession GSE6043).
The StructRED algorithm represents a novel method for determining cis‐regulatory RNA structures. Although the current implementation is limited to finding short stem–loop motifs, it may be extensible to other small structures (dsRNA, internal bulges) and perhaps more complex structures. Given its strengths, we expect that StructRED may become the basis of a class of RNA regulatory element search tools that will expand computational and experimental inquiries into post‐transcriptional gene regulation. Overall, we show that structurally defined cis‐regulatory elements can be discovered through integrative modeling of functional genomics and mRNA sequence data.
We present a novel, alignment‐free method (StructRED) that discovers secondary structure‐defined cis‐regulatory elements in mRNAs by modeling the effects that their occurrences exert on quantitative measurements of mRNA behavior in the form of microarray data.
We accurately recover the known stem‐loop binding specificities of the Drosophila RNA‐binding protein Smaug and the S. cerevisiae protein Vts1p from mRNA sequences and microarray data.
We report other putative structure‐sequence specificities for RNA‐binding proteins that likely play diverse roles in Drosophila and humans.
We find that our discovered secondary structure‐defined cis‐regulatory elements exist in coding sequences in addition to untranslated regions.
Gene expression is regulated at each step from chromatin remodeling through translation and degradation. Yet, most efforts to understand the regulation of gene expression have been focused on transcription and DNA‐binding regulatory proteins. Although regulatory RNAs have received appreciable attention (Bushati and Cohen, 2007; Coppins et al, 2007), regulatory elements within mRNAs that are recognized by nucleic acid‐binding proteins have been largely ignored until recently (Keene, 2007). This state exists despite observations that suggest changes in mRNA stability may account for half of the changes in mRNA expression in some cells and conditions (Fan et al, 2002; Cheadle et al, 2005). Moreover, it is a mathematical certainty that mRNAs of average stability can only be rapidly downregulated by altering the mRNA decay rate (see Pérez‐Ortín et al (2007) for derivation). Thus, one way to execute rapid, large‐scale gene expression responses to unpredictable environmental stimuli is through decay‐regulating RNA‐binding proteins (RBPs), whose activity can be rapidly modulated post‐transcriptionally. Early metazoan embryogenesis also requires mRNA stability and translation regulation to orchestrate the activities of maternally deposited transcripts (for review see Vardy and Orr‐Weaver, 2007).
Despite the potential importance of RNA secondary structures as binding sites for regulatory RBPs, computational methods for their discovery have failed to keep pace with current functional genomics technology (e.g. microarrays). Over 20 years ago, Sankoff (1985) described an algorithm for the simultaneous alignment and folding of RNA sequences. The Sankoff algorithm was computationally intractable for even modestly sized datasets. Thus, nearly all subsequent computational approaches to finding RNA structural motifs sought shortcuts to making the RNA structural alignment problem computationally feasible. Now, well into the era of functional genomics, RNA structure finding algorithms are still sequence‐only methods, having so far failed to use the data‐integrative approaches that are becoming increasingly common for the discovery of DNA‐binding protein specificities (Bussemaker et al, 2001, 2007; Foat et al, 2005, 2006). Small regulatory RNA structures likely play a significant role in post‐transcriptional regulation of gene expression. However, due to the lack of computational methods that fully leverage the current wealth of genomics data, we are nearly blind to their existence.
In this work, we present a novel, alignment‐free method that discovers secondary structure‐defined cis‐regulatory elements (SCREs) in mRNAs by modeling the effects that their occurrences exert on quantitative measurements of mRNA behavior in the form of microarray data. This process is embodied in a regression‐based algorithm called structural cis‐regulatory element detector (StructRED). We accurately recover the known stem–loop binding specificities of the Drosophila RBP Smaug and Vts1p in Saccharomyces cerevisiae from mRNA sequences and microarray data. We also provide evidence of the post‐translational regulatory activity of Smaug that is consistent with earlier genetic and biochemical characterization. We report other putative structure‐sequence specificities that likely play diverse roles in Drosophila and humans. Finally, we find that SCREs exist in coding sequences as often as in untranslated regions (UTRs), which presents a caution against the common practice of restricting searches for regulatory elements to non‐coding sequences. Overall, we show that structurally defined cis‐regulatory elements can be discovered through integrative modeling of functional genomics and mRNA sequence data.
Results and discussion
Discovering RNA structural cis‐regulatory elements by observing their effects on mRNA behavior using StructRED
To understand the later biological inference, we must first provide an overview of our StructRED algorithm (Figure 1) and supporting methods. In this work, we develop a way to infer the activities of RBPs and the small RNA structural cis‐regulatory elements to which they bind by modeling how the occurrences of SCREs could give rise to the observed genomic mRNA measurements. We start with the simplest model and pose that the observed microarray value for a particular mRNA Ag is the result of a combination of additive effects of the occurrences Mng of multiple SCREs n. We also allow a base level of signal C and some unexplained signal εg. Each SCRE has a corresponding effect Fn on the measurement that is related to the regulatory activity of a corresponding trans‐factor under the measured condition.
If we had definitions of each of the SCREs n, we could count them in all mRNA sequences and use multiple linear regression to define the global values of all Fn and C. However, as we do not know the SCREs in advance, we define a space of possibilities and test each one to see if its occurrences explain why some mRNAs have different measurements than others. We can then collect each SCRE that is significantly explanatory, in turn, to build an additive model for the measured microarray value of each mRNA. One output of this search process is a list of candidate SCREs n. The sequence information in a discovered SCRE is in the form of a position‐specific affinity matrix (PSAM; Foat et al, 2005), a kind of weight matrix. This means that ‘occurrences’ of an SCRE have values that range between zero and one, depending on the weights of each nucleotide at each position. The other output is the inferred activities Fn of the corresponding trans‐factor in each dataset that we fit with the model. We call the collection of Fn's (divided by their standard errors) for a particular SCRE across multiple microarray experiments the trans‐factor activity profile (TFAP) for that SCRE, as it reflects changes in activity of the corresponding regulator.
For all analysis done for this study, the SCRE search space included all small stem–loops of 9 to 12 nucleotides in length, closed by three hybridizing nucleotides (allowing G‐U pairs), and containing up to seven positions with sequence information. Stem–loops are attractive because, while being true structures, the single‐stranded loop allows for additional specific protein–nucleotide interactions with the unpaired Watson‐Crick edges of the bases. Despite the simplicity of this stem–loop structural constraint, it provides enough additional specificity information for the discovered SCREs to display significant correlations with one or more microarray datasets. To confirm the utility of the structural constraint using the Drosophila SCREs discussed below, we tested only the sequence specificity of each SCRE for its ability to explain the best‐correlating dataset for the structurally constrained SCRE. In all cases, the stem–loop‐constrained SCRE better explained the data than the sequence‐only specificity (Supplementary Table 1).
Modeling gene expression regulation: hold the thresholds
Our equation for the predicted occupancy of an mRNA by a trans‐factor (see Materials and methods) sums over all possible binding site positions in the mRNA. Why is this more rational than other expressions of sequence scores (e.g. maximum score)? As we illustrate using a recently characterized concrete example of mRNA stability regulation, a sequence score that sums over all possible binding sites is a way of representing the in vivo total average occupancy of an mRNA by trans‐factors, and under reasonable assumptions of regulatory mechanisms, increased occupancy could result in increased regulatory strength. When bound to an mRNA, the Smaug RBP recruits the CCR4–NOT complex, promoting deadenylation and decay of the transcript (Semotok et al, 2005). Semotok et al (2008) predicted that the Hsp83 transcript contained eight high‐affinity binding sites for the Smaug RBP. The authors tested a variety of fragments of the Hsp83 transcript for their sufficiency in conferring instability to the mRNA. They found that a fragment containing six predicted Smaug‐binding sites caused the greatest instability, followed by fragments containing three, two, one or no predicted binding sites. Thus, Semotok et al (2008) observed a rough correlation between Smaug binding opportunities on the test transcripts and actual destabilization. One can mechanistically motivate this observation in at least two ways: First, if Smaug is at a sufficiently low concentration that none of the sites are saturated, an mRNA with six binding sites will have, on average, a Smaug protein bound to it six times more often than an mRNA with one binding site, resulting in six times more opportunities for deadenylation. Another possibility is that interactions between Smaug and the CCR4–NOT complex are rare (low CCR4–NOT concentration). Thus, even if Smaug‐binding sites are always saturated, a transcript with six binding sites, again, has six times more opportunities to bind a deadenylase than a transcript with one site and a single bound Smaug protein.
The above discussion leads us to address a common shortfall in regulatory sequence analysis, which is the application of thresholds, either to the sequence score or to functional genomics data. First, why not treat regulation as an on/off, binary variable? If one were to assume that regulation is an on/off event and that an mRNA requires a threshold number of high‐affinity binding sites to be regulated, the appropriate statistical test would be a t‐test between the sample of mRNAs that were above the threshold for predicted binding versus the sample of mRNAs below the threshold. However, it can be shown that the Pearson correlation, when applied to categorical data, reduces to a t‐test (Lev, 1949; Tate, 1954). Thus, using standard regression statistics can detect binary or gradual relationships between the predicted occupancy and microarray data equally well. Also if a threshold was imposed on the sequence score when the relationship was indeed gradual, there would be a cost in sensitivity, as informative covariance was discarded. Supplementary Figure 1 shows scatter plots and regression lines for the correlation between occurrences of our Drosophila SCREs with mRNA expression data. Moreover, plotted are curves that follow the t‐value that would be calculated if each predicted occupancy was used as a threshold to perform a t‐test. The t‐value calculated by the two‐sample t‐test never becomes much more significant than the t‐value of the regression (only slightly higher for one SCRE). Moreover, although a reasonable threshold could be chosen in many cases, the multiple hypothesis correction required to choose a good threshold would greatly reduce the statistical power of the analysis. Upon inspection (Supplementary Figure 1), some SCREs (e.g. Smg‐4, Smg‐5) seem to have a gradual relationship between predicted binding and observed regulation, but others are less clear. Thus, whether regulation of mRNA stability and translation is executed in a binary or gradual manner remains an open question for future investigation. Nevertheless, our regression‐based computational approach based on gradual regulation assumptions is an efficient and powerful method to discover regulatory specificities and activities, which could then serve as starting points for detailed mechanistic studies.
Thresholds on functional genomics data may also discard informative covariance between sequence features and observed mRNA levels. Although the study was performed on transcription factors and not RBPs, Tanay (2006) provided evidence suggesting that transcription factor binding is analog rather than digital, and this variable binding is detectable in corresponding microarray measurements. Such observations combined with our discussion in the last paragraph lead us to question the ultimate sensitivity of common approaches to regulatory sequence analysis where a threshold is placed on the microarray data and then sequence features enriched in the above‐threshold set are analyzed. Rabani et al (2008) recently performed such an analysis to discover RNA structural elements within mRNAs related to the processes measured in a variety of microarray datasets. Their method discovers a candidate structure that can be seen in as many of the input sequences as possible using a stochastic context‐free grammar‐based method. Although their approach is powerful and valid, it is essentially placing thresholds on both the sequence score and the microarray values, possibly discarding usable signal from both. Our method does not discover structures as complex as those found by Rabani et al (2008), but our biophysically motivated regression approach represents a radical conceptual departure from the sequence‐only methods RNA structure‐discovery algorithms that have dominated for 25 years (Sankoff, 1985). Moreover, our method is particularly well suited to discovering small structures (<20 nt), which may be too small to provide sufficient signal to sequence‐only methods.
Annotating SCREs by scoring the responding mRNAs
Given our model, every sequence window in an mRNA contributes to the predicted occupancy of a transcript by the SCRE‐binding regulator. mRNAs that actually contain functional, high‐affinity instances of a SCRE, should have both a high occurrence score based on the mRNA sequence and a high predicted occurrence score based on its microarray measurement. Rather than assume an arbitrary sequence score threshold, a ranking based on a combination of these scores should give a sense for how likely it is that a particular mRNA responds to the SCRE sites that it contains (see Materials and methods). Functional gene annotations can be scored for uneven distribution toward highly ranked genes by using the Mann–Whitney–Wilcoxon test. We scored the Gene Ontology annotations (Ashburner et al, 2000) in this way for S. cerevisiae and Drosophila. In addition, Drosophila genes have been annotated by allele phenotype (Wilson et al, 2008) and in situ expression (Tomancak et al, 2002) using controlled, hierarchical vocabularies similar to Gene Ontology. Thus, these two annotation systems were similarly scored. The various annotations can give valuable insights into the biological roles of the newly discovered SCREs.
Testing StructRED in S. cerevisiae—the specificity of Vts1p
We first turned our attention to analyzing data from the yeast S. cerevisiae. One of the few RBPs that is known to recognize a specific SCRE is the yeast protein Vts1p (Aviv et al, 2003, 2006a, 2006b; Edwards et al, 2006; Johnson and Donaldson, 2006; Oberstrass et al, 2006). Vts1p has a RNA‐binding specificity that is similar to another, better characterized sterile alpha motif (SAM)‐containing protein, Smaug in Drosophila (Aviv et al, 2003). Vts1p recognizes a small stem–loop structure with a four or five nucleotide loop with a ‘G’ at the third position of the loop and complementary bases at positions one and four of the loop (Aviv et al, 2006a, 2006b; Edwards et al, 2006; Johnson and Donaldson, 2006; Oberstrass et al, 2006). The biological role of Vts1p is unknown, although there is some evidence that suggests that Vts1p regulates the poly(A) tail length of mRNAs through interaction with the Ccr4–Pop2–Not complex and thus affects mRNA stability (Aviv et al, 2003; Oberstrass et al, 2006; Rendl et al, 2008). Two groups have performed functional genomics experiments to identify possible Vts1p target mRNAs. Motivated by the observation that a reporter transcript containing three Smaug‐binding sites was less stable in wild‐type yeast than it was in a Δvts1 strain, Oberstrass et al (2006) used microarrays to look for mRNAs that were differentially expressed in a wild‐type versus a Δvts1 strain (Gene Expression Omnibus (GEO) accession GSE3859). They confirmed with Northern blots that a few transcripts were present at different levels between the two strains and contained predicted Vts1p‐binding sites. Aviv et al (2006b) used a pull‐down/microarray approach to identify those mRNAs that were most often associated with Vts1p (GEO accession GSE3741). They also confirmed that some of the mRNAs had lower steady state levels in a wild‐type strain than in a Δvts1 strain.
We applied the StructRED algorithm to search for any stem–loop SCREs in the wild‐type versus Δvts1 (Oberstrass et al, 2006) and the Vts1p pull‐down (Aviv et al, 2006b) microarray data in addition to approximately 6500 other microarray experiments retrieved from the NCBI GEO (Barrett et al, 2007). We confirmed the specificity of Vts1p (Figure 2) using the pull‐down microarray data (Aviv et al, 2006b; Figure 3A). This Vts1p specificity is in good agreement with the Vts1p specificity shown in earlier work (Aviv et al, 2006a, 2006b; Edwards et al, 2006; Johnson and Donaldson, 2006; Oberstrass et al, 2006). Thus, StructRED successfully performs the task for which it was designed—to detect SCREs based on genome‐wide measurements of the effects that their occurrences exert on mRNAs. Those mRNAs that we predict are most likely to contain Vts1p SCREs are enriched for functional categories involving carbohydrate metabolism and transmembrane transport (Supplementary Table 2). However, too little is known about the biological role of Vts1p to draw conclusions from these observations.
Vts1p‐binding site occurrences did not significantly explain genome‐wide mRNA expression in the wild‐type versus Δvts1 microarray dataset (Oberstrass et al, 2006) or any of the other approximately 6500 other microarray experiments that we analyzed (Supplementary information). This suggests at least three possibilities: first, we discovered the Vts1p SCRE using pull‐down microarray data in which both regulated and unregulated Vts1p‐bound mRNAs get pulled down and contribute specificity information. Vts1p may actually regulate only a few genes, and therefore, its specificity may not explain a significant amount of the genome‐wide variation in mRNA expression. The second possible explanation is that none of the microarray experiments tested a physiological condition in which mRNA stability regulation by Vts1p causes a significant global expression difference between the measured samples. The final possibility is that the primary role of Vts1p is to regulate translation and that none of the approximately 6500 microarray experiments contained information relevant to this role. The SAM‐containing protein Smaug has both mRNA stability (Semotok et al, 2005) and translation‐regulating effects (Dahanukar et al, 1999; Smibert et al, 1999), which may suggest that a role in regulating translation is possible for Vts1p.
On to Drosophila melanogaster—the specificity and function of Smaug
Smaug is a multifunctional RBP that regulates the stability of at least one transcript (Semotok et al, 2005) and regulates the translation of at least one other transcript (Dahanukar et al, 1999; Smibert et al, 1999) in Drosophila. Although the experimentally confirmed target mRNAs of Smaug are few, its regulatory activities have been intensely studied due to its role in translationally repressing maternally deposited nanos mRNA everywhere but in the posterior of the early embryo, thus helping to establish the anterior–posterior axis during development (Dahanukar et al, 1999; Smibert et al, 1999; Semotok et al, 2005).
Knowing the importance of post‐transcriptional regulation in early embryogenesis (see Vardy and Orr‐Weaver, 2007 for review), we applied StructRED to microarray time courses that measured mRNA expression and ribosome association in early Drosophila development. We identified the specificity of Smaug (Figure 2), which is strikingly similar to that of Vts1p, considering that Vts1p has little overall homology to Smaug except for 40% sequence identity to the 61 amino‐acid SAM domain. The TFAPs for the Smaug SCREs support the established roles of Smaug as both a translation and mRNA stability regulator (Figure 3B and C). The three microarray time courses in Figure 3B examined gene expression in activated Drosophila eggs or embryos during early embryogenesis (Pilot et al, 2006; Tadros et al, 2007; GEO accessions GSE8910, GSE3955). In both wild‐type time courses, having high‐affinity Smaug‐binding sites correlates with reduced mRNA concentration starting at the 2–4 h time point and T1 (slow phase). This observation is consistent with the timing observed earlier for Smaug destabilizing Hsp83 mRNA (Semotok et al, 2005). As this large‐scale mRNA destabilization is not observed in the time course data for the Δsmg embryos (Figure 3B, left), it is likely that we identified the specificity and activity of Smaug. Tadros et al (2007) likewise noted an enrichment in Smaug‐binding sites in unstable maternal mRNAs.
We also observe the translation‐repressing effects of Smaug. An increasingly common microarray method for gaining insight into translation regulation are polysome association microarrays. In this method, cell lysate is run through a sucrose gradient column. mRNAs associated with different numbers of ribosomes separate into different density fractions in the column, with mRNAs associated with multiple ribosomes, presumably caught while being actively translated, moving into the heavier fractions. Qin et al (2007) performed such microarray profiling of mRNA–ribosome association during a time course of the first 10 h of Drosophila development (GEO accession GSE5430). By examining the TFAPs of the Smaug specificities over two replicates each of a 0–2 h sample and a 4–6 h sample, we see that mRNAs that are bound by Smaug are being specifically excluded from ribosome association during the 0–2 h time point, as indicated by a strong negative correlation between the Smaug specificity and enrichment in the denser gradient fractions (Figure 3C). This presumed translational repression by Smaug is relieved by the 4–6 h time point. The timing of Smaug‐effected translational repression that we observed genome‐wide mirrors the characterized ability of Smaug to translationally repress nanos mRNAs (Dahanukar et al, 1999; Smibert et al, 1999). This observation suggests that the temporal regulation of all embryonic mRNA targets by Smaug is similar to that of nanos transcripts. Our observed change in Smaug's global regulatory activity is likely due to the earlier observed reduction in Smaug expression after the first 3 h of embryogenesis (Dahanukar et al, 1999; Smibert et al, 1999).
We scored the ranking of Smaug target genes (see Materials and methods) against Gene Ontology categories, phenotypes, and in situ expression annotations. The high‐scoring in situ categories ‘Stages 1–3 maternally deposited’ and ‘Stages 4–6 rapidly degraded,’ reflected the degradation function of Smaug. Other significantly enriched categories (P‐value ⩽0.001) for in situ expression included categories involving the early stage development of the female reproductive system and categories relating to later development of the nervous system. Significant Gene Ontology categories and allele phenotype annotations similarly support the role of Smaug in the development of the reproductive and nervous systems (Supplementary Table 2).
Tadros et al (2007) provided data that suggests that Smaug may regulate the stability of about 712 transcripts. In order for StructRED to detect the specificity and regulatory activity of a trans‐factor, it must regulate at least tens or hundreds of transcripts to make the correlation statistically significant. Thus, merely detecting the Smaug specificity and activities from genome‐wide polysome association data suggests that Smaug has a similarly large role in regulating translation during early embryogenesis. As we have microarray data reflecting both the mRNA stability and translational repression activities of Smaug, we have the opportunity to evaluate whether the mRNA targets that are destabilized by Smaug are the same as those that are translationally repressed. First, we took the Δsmg and wild‐type time course data from Tadros et al (2007) and transformed it so the data compared the wild‐type versus Δsmg mRNA levels, thus primarily displaying the Smaug‐dependent changes in mRNA levels. We also took the 0–2 h time point data (polysome fractions 8–12) from the mRNA–ribosome association time course of Qin et al (2007) and subtracted the contribution of all SCREs except those belonging to Smaug, leaving the data depleted for any unrelated translation regulation. Finally, we tested how well having Smaug‐dependent mRNA stability regulation was predicted by Smaug‐enriched translation repression in early development. We saw a significant correlation between Smaug‐dependent destabilization and putatively Smaug‐related translation repression (P‐value <10−11), suggesting that there are at least some transcripts that have both their translation and stability regulated by Smaug.
The predicted specificities and roles of other SCREs in Drosophila development
When we applied StructRED to the Drosophila development time courses and other microarray data, we did not specifically try to find the Smaug specificity. In fact, in addition to the Smaug SCREs, we discovered six other putative SCREs, which we have labeled Dm1 through Dm6, that have coherent supporting TFAPs and annotation (Figure 4). First, Dm1 and Dm2 were discovered from the same mRNA expression microarray time course for Drosophila embryogenesis that we discussed for the Smaug SCREs (Tadros et al, 2007). Those transcripts that contain high‐affinity instances of Dm1 and Dm2 are expressed at decreasing levels as development proceeds, suggesting that they are involved in destabilizing these transcripts at specific developmental stages. Dm1‐ and Dm2‐containing transcripts have weak enrichments for Gene Ontology categories related to development and protein transport (Supplementary Table 2). Transcripts that contain Dm1 or Dm2 display a bias for expression in the developing female reproductive system. Notably, Dm1 and Dm2 occurrences do not correlate with decreasing expression in Δsmg embryos (data not shown), suggesting that the putative destabilizing effect of Dm1 and Dm2 depends on Smaug. However, Dm1 and Dm2 do not show the translational repression activity that we see with the Smaug SCREs.
The Dm3 and Dm4 specificities were detected using microarray data that compared expression in wild‐type flies and flies lacking the Kep1 RBP (GEO accession GSE6086), suggesting that they may represent the specificity of Kep1. There were no significant themes among Gene Ontology, phenotype, or in situ annotations for Dm3 and Dm4, besides that their targets seem to be membrane associated (Supplementary Table 2). Earlier genetic and biochemical characterization of Kep1 suggests that it may be involved in regulating splicing (Fruscio et al, 2003; Robard et al, 2006).
Finally, Dm5 and Dm6 both were detected using polysome association data from the early Drosophila embryo (Qin et al, 2007; GEO accession GSE5430). Both have strong correlations during the 0–2 h time point and have almost no effect by 4–6 h. Dm5 has a strong positive correlation with the lightest fraction, suggesting that transcripts that contain instances of Dm5 are more often without ribosomes or associated with only a partially assembled ribosome, perhaps indicating ribosome pausing in the 5′ UTR. Dm6 increases the likelihood that its transcripts are associated with a moderate number of ribosomes during the 0–2 h time point, suggesting active translation. Annotations for the genes most likely to contain high‐affinity Dm5 or Dm6 sites are similar. Gene Ontology annotations for Dm5 and Dm6 center around oogenesis, nervous system development, and appendage development. Concordantly, phenotype annotations for Dm5 and Dm6 responders involve the nervous system, germ band, wing, and leg. In situ expression for Dm5 and Dm6 responding genes show them in the ectoderm and neurogenic regions in later embryonic stages (Supplementary Table 2).
SCREs are in coding sequences and UTRs
Cis‐regulatory elements in mRNAs are commonly assumed to reside in the 3′ UTR or occasionally the 5′ UTR. Although there are undoubtedly more examples of characterized regulatory elements in UTRs than in coding sequences, this may simply reflect reporting bias. Coding sequences are often assumed only to contain coding information, and as they are under selective pressure to maintain their coding properties, it seems less likely that cis‐regulatory elements would reside there. However, although actively translating ribosomes may reduce the binding of trans‐factors, nothing excludes the possibility of cis‐regulatory elements existing in coding sequences.
Contributing to this problem is the success of cross species comparisons for the identification of putative cis‐regulatory elements. These methods rely on mutations occurring over evolutionary time to lay bare those parts of the genome that are under strong selective pressures. The simplest interpretation for conserved sequences in non‐coding regions is that they serve regulatory functions for adjacent genes. However, if regulatory elements exist in coding sequences, conservation‐based methods need to be more sophisticated to detect them, as they must differentiate between conservation of codons in a coding sequence and conservation of regulatory nucleotide sequence. When looking for cis‐regulatory elements in coding sequences, regression‐based methods like StructRED have a major advantage. As StructRED does not use sequence conservation, all sequences, coding and non‐coding, are treated equally. StructRED detects structural cis‐regulatory elements based on the effect that they have on the mRNAs that contain them, as observed in genome‐wide measurements. The functional signal will be equally strong from a SCRE in a coding sequence as in a UTR.
The SCRE discovery in this work was always performed on approximated full‐length mRNAs. However, to answer the question of where the discovered SCREs commonly occur in the mRNAs, we scored the occurrences of each SCRE in the 5′ UTRs, 3′ UTRs, and coding sequences separately and then checked which of these mRNA subsequences performed best at explaining the microarray data. If most of the functional SCREs are in the 3′ UTRs, as is commonly assumed, then the TFAPs for the 3′ UTRs alone should be strongly significant and appear similar to the TFAPs when the full‐length mRNA sequences are used. For most of the Drosophila SCREs, especially the Smaug SCREs, the occurrences that appear in the coding sequences perform best at explaining the microarray measurements of gene expression and polysome association (Figure 5). Thus, most of the functional sites for Dm2, Dm3, Dm4, Dm6, and the Smaug SCREs reside in coding sequences. Recent characterization of Smaug stability regulation of the Hsp83 transcript showed that all eight predicted binding sites for Smaug do indeed reside in the coding sequence (Semotok et al, 2008). Dm1, Dm5, and Dm6 still have appreciable signal in the 3′ UTRs, and Dm5 has signal in the 5′ UTRs. We also calculated the length‐normalized scores for the UTRs and the coding sequences for each SCRE. Dm3, Dm4, Dm5, Dm6, and the Smaug SCREs had the highest concentration of binding sites in the same regions that strongly predicted expression (Supplementary Figure 4). Only Dm1 and Dm2 were inconsistent, with Dm1 having a higher density of sites in coding sequences, while the scores in the 3′ UTRs were more predictive, and Dm2 having a higher density of sites in the 3′ UTRs, while the scores in the coding sequences were more predictive. SCREs frequently appearing in coding sequences provides a strong argument for including whole transcripts when searching for cis‐regulatory elements.
Explaining published observations
Tadros et al (2007) investigated the role of Smaug in early Drosophila development by measuring genome‐wide mRNA expression in both wild‐type and Δsmg activated eggs. Thus, they identified maternally deposited transcripts that were degraded by both Smaug‐dependent and independent mechanisms. The authors do not provide candidate regulators that affect the degradation of the Smaug‐independent mRNAs. With the hopes of identifying the specificity of such a regulator, we looked for any of our SCREs that negatively correlated with mRNA levels in both the wild‐type and Δsmg activated eggs. We found none. However, as part of the preprocessing steps before the SCRE search, we discovered all of the significantly explanatory single‐stranded mRNA motifs. The intention of this step was to increase the likelihood that any discovered SCREs reflected real structural elements and not a structure‐like reflection of an otherwise single‐stranded regulatory element. Although we do not discuss most of those results here, inspecting them for any motifs that correlated with mRNA degradation in the Δsmg eggs revealed two such motifs (Figure 6). The Dm7 motif and activity profile may belong to a yet uncharacterized RBP in Drosophila. Dm7 has the correct TFAP for a Smaug‐independent regulator of mRNA stability during early embryogenesis, having strong negative correlations with mRNA levels as time elapses to 4–6 h in both wild‐type and Δsmg activated eggs. Dm7 is similar to the UUGUU motifs identified in maternal unstable transcripts by De Renzis et al (2007). Another motif that we label ‘Pum’ is almost certainly the specificity of Pumilio, the founding member of the Puf family of RBPs (Zhang et al, 1997). We confirmed that the Pum motif is the Pumilio specificity by testing how well its sequence scores predicted genome‐wide association of mRNAs with Pumilio, using the microarray data of (Gerber et al, 2006; P‐value <10−25). Strangely, the Pum motif only correlates with the degradation of mRNAs in the Δsmg activated eggs. This observation suggests that there is a more complex relationship between the regulatory activities of Pumilio and Smaug that could benefit from future experimental characterization.
A peek into human mRNAs
With hopes of making similar inferences about regulatory RBPs in humans, we applied StructRED to human microarray data that measured genome‐wide RBP binding or polysome association. However, we had a few handicaps in performing the analysis. First, rather than well‐annotated mRNA sequences as are available for Drosophila, we needed to retrieve EST sequences from NCBI Entrez Nucleotide that were relevant to each microarray platform that we used. Moreover, systematic functional annotation of these sequences do not exist. Thus, we could only discover SCREs and follow their regulatory effects through the analyzed microarray conditions.
We discovered three SCREs with functionally coherent TFAPs (Figure 7A and B). Occurrences of the first SCRE, Hs1, correlated with decreased translation in the metastatic colorectal cancer cell line, SW620, versus a non‐metastatic cell line from the same patient, SW480, as measured in a polysome association microarray study (Provenzani et al, 2006; GEO accession GSE2509). Transcripts containing Hs2 SCREs are expressed at a lower level in U937 cells that have been exposed to 12‐myristate 13‐acetate (PMA) and caused to differentiate into a macrophage‐like state (Kitamura et al, 2004; GEO accession GSE1783). Finally, occurrences of Hs3 in mRNAs correlate with increased association with ribosomes in human mammary epithelial cells, regardless of whether translation initiation factor 4F is overexpressed (Larsson et al, 2007; GEO accession GSE6043).
Although a Smaug homolog exists in the human genome, we did not detect a Smaug/Vts1p‐like specificity in the data that we analyzed. Nevertheless, we could calculate TFAPs for the Drosophila Smaug specificities across the human data to look for Smaug activities that were too weak to detect in the original search. Indeed, there were two RBP pull‐down microarray studies, one for poly‐pyrimidine tract binding protein (PTB; GEO accession GSE6021; Gama‐Carvalho et al, 2006) and one for Staufen1 and Staufen2 (Furic et al, 2008; GEO accessions GSE8437, GSE8438), where pulled‐down mRNAs were enriched for Smaug‐binding sites (Figure 7C). This suggests that Smaug binds to many of the same targets as PTB, Staufen1, and Staufen2. Our observed correlation between the Smaug specificity and Staufen1 binding is consistent with earlier observations that hSmaug is found in cytoplasmic granules with Staufen1 (Baez and Boccaccio, 2005).
Final thoughts and looking ahead
This work has two major themes: first, we showed that functional genomics data such as microarrays can be used to find cis‐regulatory elements in mRNAs defined by both sequence and structure by posing a model for their existence and then using the data to identify which modeled SCREs are supported by the data. We implemented this approach in our algorithm StructRED. We successfully applied StructRED to discover known and putative stem–loop cis‐regulatory elements using only full‐length mRNA sequence data and functional genomics data, without needing to perform any kind of sequence alignment. Second, we applied our StructRED algorithm to data from yeast, flies, and humans both to validate the approach and to contribute significant knowledge to the field of inquiry into post‐transcriptional regulation of gene expression. We accurately recover the specificities of the stem–loop RBPs Vts1p from yeast and Smaug from flies. We also discover and computationally characterize putative SCREs in flies and humans involved in diverse roles. Finally, we show that many of our discovered SCREs in flies are found in coding regions as much or more than in UTRs and they are conserved in multiple species of Drosophila.
The StructRED algorithm represents a novel method for determining cis‐regulatory RNA structures. Although the current implementation is limited to finding short stem–loop motifs, it may be extensible to other small structures (dsRNA, internal bulges) and perhaps more complex structures. Given its strengths, we expect that StructRED may become the basis of a class of RNA regulatory element search tools that will expand computational and experimental inquiries into post‐transcriptional gene regulation.
Materials and methods
We performed the search for SCREs on best estimates of full‐length mRNAs. Unfortunately, the yeast transcriptome had not been completely sequenced at the time of this analysis, so we did not know the transcription start and 3′ processing sites for most genes. However, David et al (2006) measured the mRNA levels for every yeast gene using a genome tiling microarray. Thus, they created a nucleotide‐resolution map of the transcriptome expressed under log‐growth conditions. Also, Miura et al (2006) sequenced two yeast cDNA libraries. Although neither dataset contains measurements for every gene, we averaged the annotated UTR lengths from these two sources to provide real full‐length mRNAs for about half of the genome. We used the mean 5′ and 3′ UTR lengths from the known half to provide approximate mRNA sequences for the unknown half. Drosophila transcript sequences were downloaded from FlyBase (Wilson et al, 2008), and the longest transcript representing each gene in Drosophila melanogaster was used for further analysis. When estimating full‐length mRNAs in other Drosophila species, the lengths of the annotated UTRs from D. melanogaster were used. Human sequences were downloaded from NCBI Entrez Nucleotide based on associated microarray annotations. To prevent errors in the regression procedure, the sequence sets were purged of redundant entries by using Blast (Altschul et al, 1990) to compare each sequence to every other sequence in the same set. One representative sequence was then chosen to represent each group that was hard clustered by having E‐values for similarity <10−10. An example of problematic sequences that this process removes are transposons. As a particular transposon type can have nearly identical sequences in multiple places in the genome and be represented by multiple microarray measurements, it can cause false correlations.
Genome‐wide expression data
Data were downloaded from the NCBI GEO (Barrett et al, 2007). All data were purged of extreme outliers by using the Grubbs’ test (Grubbs, 1969; P‐value ⩽10−9). We used datasets for further analysis if (1) we could automatically resolve the data in terms of systematic ORF identifiers and (2) after removing outliers, the dataset contained data for at least 3000 genes.
RNA StructRED is a data integrative, regression‐based algorithm with similarities to the algorithms represented by Stormo et al (1986), REDUCE (Bussemaker et al, 2001), and MatrixREDUCE (Foat et al, 2005, 2006). As input, StructRED takes one or more FastA sequence files with one or more entries per ORF identifier and one or more microarray datasets indexed by the same ORF identifiers. StructRED then builds a model to explain the observed mRNA levels in terms of small stem–loop motifs contained in the associated sequence data. The output of StructRED is a list of candidate SCREs as well as the inferred activity of the corresponding trans‐factors across the input microarray conditions, a TFAP (Figure 1).
Assume we have spotted cDNA microarray data that provides a log2‐ratio of an experimental condition intensity to a control condition intensity for each gene in the genome. Now let us assume that one or more RBPs are actively regulating under the tested condition and each one binds to a different SCRE. We pose the simplest possible starting model: We assume that binding by each RBP to an mRNA g causes an independent and therefore additive change in the observed log2‐ratio Ag that is proportional by Fn to the number Mng of its specifically bound SCREs n present in that particular mRNA sequence. We also allow for an intercept term C to account for any overall changes in the measured intensities between the two conditions and some experimental noise εg. Thus, we have a linear model that expresses the log2‐ratio of an mRNA in terms of the SCREs that it contains:
Even if the simplifications of this model are questionable, fitting the microarray data and nucleotide sequence to this model will still detect correspondences between the microarray values and SCRE occurrences. The magnitude of Fn is indicative of the regulatory strength of the trans‐factor in that particular tested condition. If we normalize these coefficients by their standard error (resulting in a t‐value) and track them across multiple microarray experiments, we have a TFAP.
If we have a dictionary of possible SCREs, we can discover the real SCREs by iteratively building the model explaining the microarray data through forward parameter selection. First, we score all candidate SCREs in the dictionary by performing ordinary least squares (OLS) regression on the counts of SCREs within mRNA sequences Mng and the respective microarray log2‐ratios Ag. The most explanatory SCRE, the one whose counts best fit the data, is selected from the dictionary, transformed into a matrix (see below), and added to the model. Next, the expression value predicted by the model (which now only contains one SCRE) is subtracted from each expression value, leaving a residual value. Each time using the residuals of the earlier iteration, the process repeats performing OLS on the entire dictionary, adding a SCRE term to the linear model, performing a multivariate fit including all SCRE terms, and calculating the residuals of the current model. When the last SCRE term added to the model does not significantly improve the fit beyond a predefined threshold P‐value (corrected for multiple hypothesis testing), the process is stopped. The SCREs in the resulting model are the best candidates for real, functional SCREs that play a role in mRNA regulation for the process measured by the microarray experiment.
In practice, the model is not developed for a single experiment at a time. At every iteration, the best scoring SCRE in any experiment is added to the model and then scored for its fit to all experiments, resulting in its TFAP. Thus, a single model is built to explain all input experiment data. If a particular SCRE is not relevant to a particular experiment, its coefficient for that experiment will simply be near zero.
The dictionary of SCREs that this version of StructRED searches includes small stem–loop structures with stem and loop sizes that can be selected at the beginning of a run. The user can specify whether to include sequence information on the stem, loop, or both. The size of the stem–loop SCREs is limited only by available computational power. For all runs in this work, we used stems of length three and loops of lengths three to six nucleotides and allowed sequence information in one to seven positions on the stem and loop.
Concise pseudocode for the StructRED algorithm can be found as Supplementary Figure 2. The full software code will be made available upon request.
Converting oligonucleotide motifs into position‐specific affinity matrices
In a PSAM, the definition of the weight wjb for a particular nucleotide b at a particular position j is the ratio of equilibrium association constants Ka for the binding of a trans‐factor to a reference oligonucleotide Sref and an oligonucleotide that has a single base mutated Smut, changing base position j to base b (see Foat et al, 2006 for a derivation):
To understand the later equations, we need to expand on equation (2). Composing Fn is a protein concentration [Pn], an affinity Ka(Pn, Sref) of protein Pn for a reference binding site Sref, and a catch‐all term β that includes unmodeled technology parameters.
In the case of scoring exact oligonucleotide structures, rather than weight matrix structures, Mng just represents the count of occurrences of the oligonucleotide structure n in the sequence of mRNA g, and Ka(Pn, Sref) becomes the Ka for the interaction of protein Pn with the oligonucleotide structure.
If the occurrences of a reference oligonucleotide Sref and a mutant oligonucleotide Smut were fit for genome‐wide correlation with mRNA measurements, they would have two different coefficients Fref and Fmut, respectively. The difference in the magnitude of Fref and Fmut would be due to the different Ka's for these two sequences, as all else is constant. Thus, for a genome large enough to define Fref and Fmut, wjb is the ratio of these coefficients:
StructRED uses this last equation to convert a best scoring oligonucleotide stem–loop motif into a stem–loop matrix SCRE. For each nucleotide in the best scoring stem–loop motif, the wjb value in the matrix equals one. All other values of wjb, are calculated by taking the ratio of the coefficients for each motif differing by a single nucleotide from the best motif versus the coefficient of the best motif. If the ratio is negative, meaning one motif has a positive correlation and one has a negative correlation, it is set to zero, as it suggests that the trans‐factor cannot bind to the site with that particular single base change. Better estimates of the wjb values are obtained by calculating their geometric mean across all microarray experiments in which the best motif correlated nearly as well with expression as in the best experiment (within one standard error of the best t‐value). As only significantly correlating data contributes to the SCRE matrix, a matrix will not worsen if more, unrelated data are included as input.
If the assumption that nucleotides contribute independently to the binding of the trans‐factor is valid, then this calculation loses information from motifs with more than one base change from the best scoring oligonucleotide. However, the above calculation is essentially free when compared with the calculations required in other PSAM‐producing methods (Foat et al, 2005, 2006). This approximation creates both a faster and a simpler algorithm that maintains most of the information provided by more elaborate PSAM calculations.
Once the PSAM is derived, the sequence score Mng of mRNA g for SCRE n is:
where Ni is the number of sequence windows i in mRNA g; Li is the length of the window i; W (j) is the weight in the PSAM at position j for the base Si(j) that appears at position j in the current window sequence Si; and Ωn (Si) is a function that returns one if the sequence Si in window i forms the correct structure and zero otherwise. This form of the model, having separate terms for the RNA structure and the nucleotide specificities, assumes that the RNA structure exists independent of the trans‐factor's binding and that the trans‐factor then recognizes specific nucleotides in a specific three‐dimensional RNA configuration. An ‘induced fit’ scenario for the binding of the trans‐factor would require a more complex relationship between the nucleotide specificities and the structure definition. The current StructRED method with a the binary Ωn(Si) function could be considered a simple approximation of a hypothetical algorithm in which Ωn(Si) would return values between zero and one depending on how likely it is that a particular window will form the correct secondary structure. The potential for such an algorithm is explored in Supplementary Figure 3.
Ensuring the discovered elements are structurally defined
To reduce the possibility that discovered SCREs actually represent single‐stranded RNA‐binding sites that contain inverted repeat sequences, we first fit a model to the microarray data that is composed of all significantly explanatory single‐stranded RNA matrices, using a method similar to that described by Foat et al (2005). We then performed the search for SCREs using the residual microarray values after subtracting the model composed of single‐stranded RNA matrices. This single‐stranded matrix search is not strictly necessary. It would be possible to check the SCREs after the analysis to ensure that they could not be represented equally well without the stem–loop constraint. Factoring out single‐stranded matrices first simply cuts down on false positives later, possibly at the expense of false negatives. However, any SCRE that could be well represented solely by its sequence composition would be discoverable by many existing sequence analysis tools and would not highlight the strengths of our method. To be certain that, the single‐stranded matrix search worked as intended, all Drosophila SCREs were tested to make sure that the sequence specificity alone did not explain the data as well as the combined sequence/structure SCRE specificity (P‐value <0.001; Supplementary Table 1). All discovered Drosophila and human SCREs were also checked to confirm that they did not explain mRNA measurements when the SCREs were scored on the strand complementary to the mRNA sequences. This check supported the premise that they are binding sites within mRNA transcripts and not transcriptional regulatory binding sites in UTRs or coding sequence on the DNA.
Structural cis‐regulatory element responder ranking and annotation
One would expect that for a SCRE‐containing mRNA to be a real, regulated target, it would have two major characteristics: It has one or more predicted high‐affinity instances of a SCRE, and when the SCRE‐binding factor is predicted to have a strong, genome‐wide effect, the mRNA is regulated in the predicted direction. Thus, by combining the sequence score of each mRNA for a SCRE with its microarray measurements for important conditions for the SCRE, all genes in the genome can be ranked by how likely they are to contain real instances of the SCRE.
First, for each significantly correlating (P‐value ⩽0.001) microarray experiment a for a particular SCRE m, we calculate a residual mRNA expression value ega that only contains the effect of the SCRE of interest. This value is the original value Aga minus the model for n−1 of the SCREs, where the excluded SCRE is the SCRE of interest:
We then divide emga by its coefficient Fma to obtain a predicted sequence score mg. All genes are then ranked by mg for each experiment a.
The probability of a particular gene g having a rank as good or better than its rank x is simply its rank divided by the number of genes G:
Assuming independence over experiments a, the probability of the gene g having ranks as good or better than its observed ranks in all significantly correlating experiments Na is:
Taking the 1/Na power, of this probability gives us a mean probability across all a:
Now if we want to have a mean probability that equally weights the probability of the predicted sequence score with the probability of a ranking yg or better for the actual sequence score Mmg, we take another geometric mean:
In the end all we want is a new ranking zg by this combined probability, so the G−1 scaling factor is irrelevant. Thus, we rank all genes by this two‐step, geometric mean of ranks:
The only way for an mRNA to have a high responder rank zg for a particular SCRE m is to have a high actual sequence score Mmg and to be predicted to have a high sequence score for the SCRE by mg, based on the mRNA's behavior in conditions in which the SCRE is correlated with a large genome‐wide effect. The probabilistic motivation of this ranking scheme is similar to that used in the rank product method for microarray analysis (Breitling et al, 2004).
Once all mRNAs in the genome are ranked by their responder scores, functional categories of genes are scored for enriched representation among the top‐ranked mRNAs by using the Mann–Whitney–Wilcoxon test. Thus, the Gene Ontology (Ashburner et al, 2000) annotations for S. cerevisiae (Cherry et al, 1998) and D. melanogaster (Wilson et al, 2008) were used to score all category levels of the ontology trees. In addition, the fly community has created hierarchical controlled vocabularies for fly development and anatomy and have created gene annotations for allele phenotypes (Wilson et al, 2008) and in situ expression (Tomancak et al, 2002) using those controlled vocabularies. Thus, these two annotation systems were scored in a similar manner to the Gene Ontology system. No annotations were available for the human sequences.
SCRE location analysis
For the SCRE discovery and TFAP calculation, full‐length mRNAs were used. However, it is possible to score subsequences to determine where in transcripts particular SCREs tend to occur genome‐wide. Every mRNA was divided up into its 5′ UTR, CDS, and 3′ UTR, and TFAPs were recalculated for each SCRE using only each subsequence individually. Thus, the subsequence TFAP that most resembles the full‐length mRNA TFAP is indicative of in which piece of the transcripts the SCRE tends to occur.
We also calculated the genome‐wide averaged, length‐normalized scores in the CDS and UTRs for each SCRE. Although no inference can be drawn from these calculations, it is interesting to observe if increased predictive power of a sequence region corresponds with increased density of predicted high‐affinity binding sites (Supplementary Figure 4).
Visualizing SCRE position‐specific affinity matrices with LoopLogo
StructRED produces representations of cis‐regulatory elements consisting of both a weight matrix and a RNA structure. We developed LoopLogo, a script that generates the structural logos seen in Figures 1, 2, 4, and 7. In this representation, squares mark positions that contain sequence information. The height of each nucleotide character is proportional to its relative affinity. Nucleotides of weaker affinity are smaller and stacked on the larger higher‐affinity nucleotides. Dark circles represent nucleotide positions that do not contain nucleotide information. Open circles on the 3′ side of the stem were not allowed to contain independent sequence information, as their sequence is largely specified by the sequence on the 5′ side of the stem (with the exception of G‐U pairs). LoopLogo is implemented in Perl and creates Scalable Vector Graphics output.
We thank Craig Smibert, Barak Cohen, Jim Skeath, Yue Zhao, and Ryan Christensen for critical readings of the paper. BCF is a PhRMA Foundation Informatics Fellow and a Washington University School of Medicine, Department of Genetics Fellow. This work was supported by National Institutes of Health grant HG00249 to GDS.
Conflict of interest
The authors declare that they have no conflict of interest.
Supplementary Materials 1
Supplementary Figures and Tables
Supplementary Data 1
Output files of StructRED and other analyses
This is an open‐access article distributed under the terms of the Creative Commons Attribution License, which permits distribution, and reproduction in any medium, provided the original author and source are credited. This license does not permit commercial exploitation without specific permission.
- Copyright © 2009 EMBO and Nature Publishing Group