## Abstract

The cellular environment is abuzz with noise originating from the inherent random motion of reacting molecules in the living cell. In this noisy environment, clonal cell populations show cell‐to‐cell variability that can manifest significant phenotypic differences. Noise‐induced stochastic fluctuations in cellular constituents can be measured and their statistics quantified. We show that these random fluctuations carry within them valuable information about the underlying genetic network. Far from being a nuisance, the ever‐present cellular noise acts as a rich source of excitation that, when processed through a gene network, carries its distinctive fingerprint that encodes a wealth of information about that network. We show that in some cases the analysis of these random fluctuations enables the full identification of network parameters, including those that may otherwise be difficult to measure. This establishes a potentially powerful approach for the identification of gene networks and offers a new window into the workings of these networks.

## Introduction

Computational modeling in biology seeks to reduce complex systems to their essential components and functions, thereby arriving at a deeper understanding of biological phenomena. However, measuring or estimating key model parameters can be difficult when measurement noise corrupts experimental data. Thus, when cellular variability or ‘noise’ (Elowitz *et al*, 2002) leads to measurement fluctuations, it may appear deleterious. However, this is not the case. Just as white noise inputs help to identify dynamical system parameters (Ljung, 1999), so too can characterization of noise dynamics elucidate natural mechanisms. For example, steady state noise characteristics can distinguish between different logical structures, such as AND or OR gates (Warmflash and Dinner, 2008). At the same time, temporal measurements of transient dynamics can aid in the construction of reaction pathways (Arkin *et al*, 1997). In combination, noise and temporal analyses yield powerful tools for parameter identification. For example, the averages of correlations in cell expression at many time points reveal feed‐forward loops in the galactose metabolism genes of *Escherichia coli* (Dunlop *et al*, 2008). Similarly, manipulating certain gene network transcription rates while observing the response of statistical cumulants can help to identify reaction rates for some gene regulatory networks (Raffard *et al*, 2008). In this study, we examine the possibility of identifying system parameters and mechanisms directly from single‐cell distributions, such as those obtainable with flow cytometry, without time‐varying control and at only a handful of different time points. We prove that the analysis of variability provides more information than the mean behavior alone. Furthermore, we illustrate potential of our approach using numerical and experimental analyses of common gene regulatory networks.

## Results and discussion

### Gene expression model

We adopt the gene expression model used in the study carried out by Thattai and van Oudenaarden (2001), which is characterized by random integer numbers of mRNA and protein molecules: *R* and *P*, respectively. Transcription, translation, and degradation events change the system state by altering these numbers. mRNA changes are modeled as random events that occur according to exponentially distributed waiting times that depend on the transcription and degradation rates *k*_{r} and γ_{r}. Thus, given the state of *r* mRNA molecules, the probability that a single mRNA molecule is degraded within the time increment d*t* is given by *r*·(γ_{r}·d*t*). Similarly, translation and degradation of proteins are dictated by rates *k*_{p} and γ_{p}. The resulting stochastic model is represented by a continuous time, discrete state Markov process. The probability of finding the system in a given state (*R*(*t*)=*r*, *P*(*t*)=*p*) is fully characterized by the system's master equation from which the evolution of moments *E*[*R*(*t*), *E*[*P*(*t*), *E*[*R*^{2}(*t*)],… can be described (see Supplementary Section 1).

Our first finding is that all parameters of this model are identifiable from cell population distributions of mRNA/protein measured at least at two time points. In contrast, two time point measurements of mRNA/protein population averages are never sufficient for identifiability. To show this, the use of first and second‐order moments, or equivalently means, variances, and covariances of proteins and mRNAs is sufficient, instead of the use of full distributions. At a given time point, *t*, each such measurement yields a vector: ** v**(

*t*)=(

*E*[

*R*(

*t*)],

*E*[

*P*(

*t*)],

*E*[

*R*(

*t*)

^{2}],

*E*[

*P*(

*t*)

^{2}],

*E*[

*R*(

*t*)

*P*(

*t*)]). Given

**(**

*v**t*

_{0}) and

**(**

*v**t*

_{1}) at two distinct time points

*t*

_{0}<

*t*

_{1}, there generically exists a set of parameters

*k*

_{r},

*k*

_{p}, γ

_{r}, γ

_{p}that uniquely gives these measurements—all other parameter sets yield different measurements (see Figures 1E and 2A). We illustrate this here only for transcription (Supplementary Section 3 provides an implicit expression for the parameters of the full model). Suppose that {μ

_{0}, μ

_{1}} and σ

_{0}

^{2}, σ

_{1}

^{2}represent the measured mRNA mean and variance at two time points

*t*

_{0}<

*t*

_{1}<∞. Then the parameters, {

*k*

_{r}, γ

_{r}} are fully identifiable, and

where τ:=*t*_{1}−*t*_{0}.

Thus, the statistics, {μ_{0}, σ_{0}^{2}, μ_{1}, σ_{1}^{2}} contain sufficient information to identify the model parameters. However, measurement of just the population averages, for example, *E*[*R*], is insufficient for identifiability, and there exists an infinite set of parameters {*k*_{r}, γ_{r}}, that is consistent with the same two mean measurements μ_{0} and μ_{1}.

Although parameters are identifiable from transient moment measurements, we find that it is impossible to identify all parameters from stationary moments. Measuring means, variances, and other statistics after all the transients have died represents a lost opportunity to peek into the cell's inner workings and to recover the network parameters. For example, two different parameter sets may produce very different protein distributions after a short interval time (Figure 1E), but indistinguishable distributions after a longer interval (Figure 1D). Supplementary Section 2 provides a proof that stationary moments of any arbitrary order are insufficient to uniquely identify the model parameters *k*_{r}, *k*_{p}, γ_{r}, γ_{p}. Such stationary distributions will only enable the determination of relative parameter values, but any positive scaling of these values would produce the exact same measurements for *v*_{∞}. We note that stationary correlations, for example, *E*[*R*(*t*)*R*(*t*+τ)] for small time intervals, τ, could also provide the necessary dynamic information (Cinquemani *et al*, 2009), but taking such measurements is more difficult and requires the tracking of individual cells between measurement times.

After having determined that full identification is achievable using two measurements of all first and second order moments, we now explore the effect of partial moment measurements. We consider two new scenarios: (a) only {*E*[*R*], *E*[*P*]} measurements are available; and (b) only {*E*[*P*], *E*[*P*^{2}]} measurements are available. For each scenario, Figure 2A shows the number of measurements needed for parameter identifiability and demonstrates the advantage of using full second order statistics. Furthermore, the performance with partial information depends on which partial information is being used. When protein and mRNA mean measurements alone are used, full parameter identifiability is possible using three measurements. However, with only protein mean and variance measurements, at least, five time measurements are needed. When only protein mean measurements are available, full identifiability is impossible, regardless of the number of measurements (see Supplementary Section 4).

Time measurements of moment dynamics impose nonlinear algebraic constraints on model parameters. The above results can be understood by exploring the number of such constraints that is needed to uniquely solve for the unknown parameters. The gene expression model has four unknown parameters (*p*=4) and five unknown initial conditions (moments at *t*=0). Thus, one would expect that, at least, nine independent measurements are needed to identify these unknowns. The five elements of *v* at *t*_{0} and *t*_{1} provide ten pieces of information and are generally sufficient (see Figure 2A). Conversely, in a model for just the mean values {*E*[*R*(*t*)], *E*[*P*(*t*)]}, there are four parameters (*p*=4) and two initial conditions, and one expects that, at least, six independent pieces of information would be needed for the identification. Indeed, at least three time measurements are required and two measurements are never enough (see Figure 2A). However, for a model that describes only protein mean and variance measurements, at least five time measurements are needed for full parameter identifiability. In this case, the dynamics of {*E*[*P*], *E*[*P*^{2}]} are coupled to those of {*E*[*R*], *E*[*R*^{2}], *E*[*RP*]}, and additional measurements are needed to identify the initial values for these. Finally, we note that in these cases, the number of measurements needed for parameter identification are far fewer than the 2*p*+1 measurements that were shown (Sontag, 2002) to be sufficient for the identification of the *p* unknown parameters of a general nonlinear dynamical system.

The results above establish the principle that transient measurements of full second order moments carry information that allows one to identify all model parameters, at least, assuming noise‐free measurements. If the measurements are corrupted by noise, it is often possible to compensate with a larger number of measurements. To illustrate this, we have conducted 100 simulated identification studies in which the unknown parameters were taken from a broad lognormal distribution (Figure 2B). For these, we supposed that **v**_{j}:=**v**(*t*_{j}) could be measured at *m* equally separated time points {*t*_{0}, ….,*t*_{m−1}}, and that each measurement had unknown errors of ±10% To explore the effect of incomplete measurements, we performed the identification method for the three data scenarios considered earlier: (1) all moments; (2) only the means; and (3) only the protein means and variances. For each scenario, we investigated the impact on parameter identification of using an increasing number of noisy measurements obtained from a different number of independent experiments (with different randomly chosen unknown initial conditions).

As more data were gathered, the effects of measurement error were overcome and the probability of successful identification increased for every strategy (see Figure 2C). Using many measurements, the parameters and the unknown initial conditions of mRNAs and proteins could be resolved even from inaccurate protein data alone—provided that it included information on the protein variance. All of the above numerical experiments were conducted assuming that the initial conditions were unknown; for known or specified initial conditions, we found that the identification was even more successful (see Supplementary Figure 5). We have thus shown that for the simple gene expression model, cellular noise enhances the opportunity for system parameter identification, whereas measurement noise impedes it. The deleterious effects of measurement noise can be overcome by increasing the number of measurements.

### Experimental identification of *lac* induction

Among the most studied gene regulatory elements is the *lac* operon of *Escherichia coli*. This mechanism has been used to construct toggle switches (Gardner *et al*, 2000; Kobayashi *et al*, 2004), genetic oscillators (Elowitz and Leibler, 2000; Atkinson *et al*, 2003) and logical circuits (Weiss, 2001). Despite its ubiquitous use, precise *in vivo* single‐cell quantification of the system remains insufficient. Indeed, most such quantification attempts have come from *in vitro* experiments or population level studies. For example, the *lac* repressor dissociation constant has been estimated to be *K*_{d}=10^{−11}–10^{−9} M (Oehler *et al*, 1990). In an *E. coli* cell with a volume of 10^{−15} l, such dissociation constants mean the occupancy of the *lac* promoter is 94–99.94% when there are ten such molecules. At best, such measurements have only a probabilistic meaning at the level of single cells; at worst, they have no relevance at all as other mechanisms, such as nonspecific binding (Kao‐Huang *et al*, 1977), take on much greater significance.

We used flow cytometry experiments and computational analyses to identify a parameter set to describe the *in vivo* single‐cell dynamics of green fluorescent protein (GFP) controlled by the *lac* operon under isopropyl‐β‐d‐thiogalactoside (IPTG) induction (see Figure 3A and Materials and methods section). We explored the response of the system at several IPTG levels and at multiple time points. Although many mechanistic models may capture the available data, we focused on the simplest consistent model, which consists of diffusion of IPTG into the cell, [IPTG]_{IN}=[IPTG]_{OUT}·(1−exp(−*rt*)), and four basic reactions, *R*_{1}, *R*_{2}, *R*_{3}, and *R*_{4} corresponding to production and degradation of LacI and GFP.

The production of LacI is constant, *w*_{1}=*k*_{L}, corresponding to constitutive expression. However, production of GFP is a nonlinear function of the LacI level:

where *k*_{G} is the unrepressed GFP production rate, α describes LacI occupancy strength, and the Hill coefficient, η, accounts for cooperative binding of LacI. The GFP degradation rate, *w*_{4}=δ_{G}·[GFP], is fixed, but LacI can be degraded or inactivated by IPTG such that the total LacI removal depends on the IPTG concentration and is assumed to have the form *w*_{2}=δ_{L}·[LacI], where δ_{L} = δ_{L}^{(0)} + δ_{L}^{(1)} [IPTG]_{IN}. The model also explicitly characterizes uncertainties in the flow cytometry measurements (see Materials and methods). In total, there are ten unknown positive real parameters for the regulatory system, Λ = {*k*_{L},*k*_{G}, δ_{L}^{(0)}, δ_{L}^{(1)}, δ_{G}, α η, *r*, μ_{GFP}, σ_{GFP}^{2}} ∈ *R*_{+}^{10}.

The measured fluorescence histograms at different times and different IPTG levels (Figure 3) cannot adequately be captured using low order moments. Furthermore, as *w*_{G} is a nonlinear function of LacI, there is no known analytical expression for the statistical moments of GFP. Instead, we used a new method, called finite state projection (FSP), to identify the unknown parameters on the basis of their probability densities (see Materials and methods section). In the identification routine, a parameter search was conducted to find parameter sets such that the total predicted fluorescence distribution was as close as possible to the measured distribution in a least squares sense for all time points and IPTG levels.

Figure 3B shows that the identified model results match the experimentally measured distributions exceptionally well. However, with the full set of ten unknowns in Λ, this identification is not unique, and we found multiple parameter sets that provided equally good fits. However, by utilizing additional information about the system, we could reduce the uncertainty of identification. In particular, assuming that GFP is lost solely to dilution, we could specify the rate δ_{G}=3.8 × 10^{−4} N^{−1} s^{−1}, corresponding to a half‐life of 30 min. The remaining nine parameters could then be identified as:

where *N* refers to molecule number.

As the assumed model represents a simplified description of multiple events (folding dynamics, elongation, etc.), these parameters are best viewed as model‐specific empirical measurements. Even so, it is possible to make some comparisons between the identified parameters and previous analyses. First, the production and degradation rates of LacI yield a mean number of *k*_{L}/ γ_{L}^{(0)} ≈ 5 molecules per cell at steady state in the absence of IPTG, on the same magnitude of wild‐type levels of about ten per cell. Second, the level of LacI required for half occupancy of the *lac* operon is [LacI]_{1/2}=(1/α)^{1/η}=0.012, which compares well to values 0.006–0.6 molecules (10^{−11}–10^{−9} M, Oehler *et al*, 1990). Third, a Hill coefficient of 2.1 is reasonable considering that LacI binds to the operon as a tetramer. Finally, the degradation rate LacI, δ_{L}^{(0)} is close to the dilution rate of 3.8 × 10^{−4} N^{−1} s^{−1}, reflecting the high stability of that protein. In addition to comparing the parameters to values in the literature, we have used the parameter set identified from {5, 10, 20} μM IPTG induction to predict the fluorescence under {40, 100} μM IPTG. Figure 3C shows that these predictions match the subsequent experimental measurements very well despite the vastly different shapes observed at the high induction levels.

Using single‐cell experimental techniques, it has become possible to efficiently measure fluctuations in cell constituents. When properly extracted and processed with rapidly improving computational tools, these measurements contain sufficiently rich information as to enable the unique identification of parameters. We have shown that transient dynamics are important to this effort, and in principle, identification can be accomplished when accurate distributions are measured at only two distinct time points. More time points are needed if the distributions are poorly measured, but the idea remains the same. We have show the potential of our approach by experimentally identifying a predictive model of *lac* regulation from flow cytometry data. Hence, the proposed integration of single‐cell measurements and stochastic analyses establishes a promising approach that offers new windows into the workings of cellular networks.

## Materials and methods

### Medium and reagents

Cells were grown in Luria–Bertani (LB) medium supplemented with 1% tryptone, 0.5% yeast extract, and 0.4% NaCl and containing IPTG at the concentrations noted. To select for plasmid maintenance, antibiotics were used at the following concentrations: 100 μg/ml ampicillin (amp); 40 μg/ml kanamycin (kan); and 12.5 μg/ml tetracycline (tet).

### Bacterial strains and plasmids

The *E. coli* strain used was DL5905—*E. coli* K‐12 (isolate MC4100) containing [F′ proAB lacI_{q}ZM15 Tn10 (Tet^{r})] from strain XL‐1 Blue (Stratagene) and plasmid pDAL812. To construct plasmid pDAL812, GFP(LVA) (Anderdson *et al*, 1998) was PCR amplified from plasmid pRK9 (a gift from John Cronan) using the forward primer (5′‐CAACAAAGATCTATTAAAGAGGAGAAATTAAGCATGAGTAAAGGAGAAGAACTTTTCA‐3′) that includes a *Bgl*II site and removes an *Sph*I site from the original pRK9 sequence, and the reverse primer (5′‐CAACAAGCATGCATTAAGCTACTAAAGCGTAGTTTTCGTCGTTTGC‐3′) that adds an *Sph*I site. This fragment was digested using *Bgl*II and *Sph*I and cloned into *Bgl*II and *Sph*I sites of pLAC33 (Warren *et al*, 2000), removing a portion of the Tet^{R} cassette.

### Fluorescence induction experiments

Twenty‐four separate cell cultures were allowed to grow in LB broth containing the appropriate antibiotics to an approximate OD_{600} of 0.2, and were then induced with {0, 5, 10, 20, 40, 100} μM concentrations of IPTG at 5, 4, 3, and 0 h before flow cytometry measurements. Flow cytometry was carried out using a BD Biosciences FACSAria instrument with a 100‐μm sorting nozzle at low pressure. GFP(LVA) was excited using a 488‐nm blue laser and detected using 530/30‐nm filter. For each sample, 1 000 000 events were collected. To ensure repeatability, experiments were conducted twice, each on a separate day.

### GFP induction model

The stochastic model for the IPTG–GFP induction is composed of four nonlinear production/degradation reactions given in the main text. The rates of these reactions depend on the integer populations of the proteins LacI and GFP, as well as the set of nonnegative parameters, *k*_{L}, *k* _{G}, δ^{(0,1)}, δ*G*, α,*r*,η} ∈ *R*^{8}. For the stochastic system modeled here, the joint (LacI, GFP) probability distributions of both proteins evolve according to the infinite dimensional chemical master equation (CME; van Kampen, 2007). This can in turn be expressed as an infinite set of linear ordinary differential equations—. Unlike in the simple transcription/translation model, the toggle reactions are nonlinear, and the CME has no known exact solution. We use a finite state projection approach (Munsky and Khammash, 2006) that makes it possible to approximate the solution to any degree of accuracy. For any error tolerance ε>0, we systematically find a finite‐dimensional projected system——the solution for which is within the desired tolerance. More precisely,

where the index vector *J* denotes the set of states included in the projection, **P**_{J} is the corresponding probability of those states, and **A**_{J} is the corresponding principle sub‐matrix of **A** (Munsky and Khammash, 2006). The one‐norm measure is used to ensure that absolute sum of the probability density error is guaranteed to lie within the tolerance. The solution of each projected master equation is found using the stiff ode solver *ode23s* in MathWorks Matlab.

### Modeling flow cytometry data

In addition to modeling the regulatory dynamics of the system, one must also account for the inherent uncertainty within measured levels of fluorescence activity. The process used to account for this uncertainty has three components. First, in an effort to remove outliers in cell volume and density, and thereby reduce the effects of unmodeled dynamics, each cell population was gated separately using forward and side scatter data. Specifically, the forward and side scatter measurements were used to form a two‐dimensional joint histogram with 50 × 50 logarithmically distributed bins (see Supplementary Figure 6). The maximum point in this histogram was recorded and then the gating region was chosen to include every bin that had, at least, one third as many counts as the maximal bin. Second, flow cytometry measurements in the absence of IPTG have been used to calibrate the background fluorescence of cell populations at various instances in time, and it has been assumed that the background fluorescence distribution, *f*_{BG}(*x*), is independent of the levels of IPTG, LacI, and GFP. Third, each GFP molecule is assumed to emit a normally distributed random amount of fluorescence with unknown mean, μ_{GFP}, and variance, σ_{GFP}^{2}, both of which are to be identified. Thus, if *p*_{n}=*p*_{n}(*t*, **Λ**, [IPTG]) denotes the probability of having exactly *n*={0, 1, 2, …} molecules of GFP, then the probability density of having exactly *x* arbitrary units of fluorescence because of GFP is computed as:

Finally, the total observable fluorescence is the sum of the GFP florescence plus the background noise, and the distribution of total fluorescence is found using the convolution:

### Identification procedure

With the FSP solution and the computation of the expected fluorescence, the identification procedure is carried out by finding the parameter vector **Λ**^{*} that minimizes the one norm difference between the experimentally measured distribution *f*_{Meas}^{(i)} (*t*,[rm IPTG]) and the numerical solution of that distribution:

where the summation is taken over all of the different experimental conditions of different induction times and IPTG levels, and the weight *q*_{i} specifies the relative importance to each of these measurements. These weights have been chosen such that each IPTG level has the same total importance and so that greater importance is placed on measurements that differ the most from the background fluorescence. The values for these weights are given in Figure 3. The parameter identification is accomplished by starting with an initial parameter guess, **Λ**_{0}, and then this set is updated iteratively using gradient‐based and simulated annealing searches until the computed distribution matches the experimental distribution as closely as possible. The optimization procedure is repeated for multiple, randomly generated initial parameter guesses. An optimal parameter set is regarded as unique if the given solution yields the smallest achieved value for the objective function, and if that parameter has been achieved during many such identification runs each beginning with different parameter guesses.

## Acknowledgements

The authors thank John Little and Hana El‐Samad for thoughtful discussions and David Low for advice, assistance, and materials required in the experimental aspects of this study. This study is funded by the LANL LDRD program and NSF through Grants ECCS‐0835847 and ECCS‐0802008, and by ICB Grant DAAD19‐03‐D‐0004 from the US Army Research Office.

## Conflict of Interest

The authors declare that they have no conflict of interest.

## Supplementary Information

Supplementary text, Supplementary figures S1‐6 [msb200975-sup-0001.pdf]

## References

This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.

- Copyright © 2009 EMBO and Nature Publishing Group