Identifying all essential genomic components is critical for the assembly of minimal artificial life. In the genome‐reduced bacterium Mycoplasma pneumoniae, we found that small ORFs (smORFs; < 100 residues), accounting for 10% of all ORFs, are the most frequently essential genomic components (53%), followed by conventional ORFs (49%). Essentiality of smORFs may be explained by their function as members of protein and/or DNA/RNA complexes. In larger proteins, essentiality applied to individual domains and not entire proteins, a notion we could confirm by expression of truncated domains. The fraction of essential non‐coding RNAs (ncRNAs) that do not overlap with essential genes is 5%, higher than that of non‐transcribed regions (0.9%), pointing to the important functions of the former. We found that the minimal essential genome comprises 33% (269,410 bp) of the M. pneumoniae genome. Our data highlight an unexpected hidden layer of smORFs with essential functions, as well as non‐coding regions, thus changing the focus when aiming to define the minimal essential genome.
A genome essentiality analysis in the genome‐reduced bacterium Mycoplasma pneumoniae reveals that protein essentiality should be considered at the domain level and that small proteins (< 100 aa) and ncRNAs are frequently essential genomic elements.
A genome essentiality analysis is performed using two mini‐transposon mutant libraries of M. pneumoniae.
The results indicate that ORF essentiality should be considered at the protein domain level.
Small ORFs are as essential as conventional ORFs and they can interact with DNA.
Some essential antisense ncRNAs are involved in the regulation of essential ORF expression.
Defining the minimal genome that is required for sustaining life is currently one of the major challenges in biology. The essential genome of an organism, aside from protein‐coding regions (ORFs), comprises regulatory (5′‐UTRs and non‐coding RNAs (ncRNAs)) and structural elements (Gil et al, 2004; Christen et al, 2011). Most of the previous essentiality studies (Glass et al, 2006; Lluch‐Senar et al, 2007; French et al, 2008) made use of conventional genome annotations, which are biased against small proteins (smORFs; < 100 aa) (Samayoa et al, 2011) and regulatory elements such as ncRNAs. However, the accuracy of an essentiality study is limited by the completeness of the genome annotation. Therefore, M. pneumoniae is an ideal organism due to its reduced genome size (816 kb) (Guell et al, 2009; Kuhner et al, 2009; Yus et al, 2009, 2012; Schmidl et al, 2010; Maier et al, 2011; van Noort et al, 2012; Lluch‐Senar et al, 2013) and its detailed genome annotation based on experimental data. The current annotation of the M. pneumoniae genome contains 694 ORFs (32 of which are smORFs), 311 ncRNAs and 43 conventional RNAs (tRNAs, rRNAs, etc.) (Supplementary Table S2); all genes are well supported by transcriptome data, alone or in combination with proteome data [see Supplementary Materials and Methods or http://mycoplasma.crg.eu/ for details (Wodke et al, 2014)]. This fine annotation of M. pneumoniae has been facilitated by the vast “‐omics” datasets collected over the years (Guell et al, 2009; Maier et al, 2011; Yus et al, 2012), providing a better chance to gain an unbiased view of all putative essential elements in a minimal cell.
Results and Discussion
To determine the essentiality map, we used two mini‐transposon mutant libraries (differing in the antibiotic resistance) of M. pneumoniae (Fig 1A and B, Materials and Methods) and high‐throughput insertion tracking by deep sequencing (HITS) (Wong et al, 2011) of cells at different days and serial passages (Fig 1C). We analyzed the day 12 sample since the number of insertions for the essential ORF gold set is close to zero at this time point, while this number for non‐essential genes remains approximately constant (Fig 1C) (Supplementary Table S1). We found a small insertion bias against G/C‐rich quadruplet base sequences, but this does not affect the essentiality of smORFs since they have a composition similar to that of ORFs (Supplementary Materials and Methods). Based on the number of reads per insertion in the essential and non‐essential gold sets, we defined two thresholds to decide whether an insertion was annotated or not (a relaxed one with seven reads per insertion, and a stringent one with 41 reads) (Supplementary Materials and Methods). In the following, unless specified, we used the stringent value.
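The effect of the two read‐count thresholds can be illustrated with a minimal Python sketch. The cutoffs (7 and 41 reads) are taken from the text; the function name, the toy coordinates and read counts, and the assumption that the cutoffs are inclusive are illustrative only and do not reproduce the actual pipeline.

```python
from typing import Dict

# Read-count cutoffs quoted in the text.
RELAXED_MIN_READS = 7
STRINGENT_MIN_READS = 41


def filter_insertions(reads_per_position: Dict[int, int],
                      min_reads: int = STRINGENT_MIN_READS) -> Dict[int, int]:
    """Keep insertion positions supported by at least `min_reads` reads.

    Assumption: the cutoff is inclusive (>=); the text does not say.
    """
    return {pos: n for pos, n in reads_per_position.items() if n >= min_reads}


# Hypothetical toy data: genome coordinate -> reads supporting that insertion.
toy_counts = {1200: 3, 1450: 7, 1873: 40, 2210: 41, 3005: 250}

stringent = filter_insertions(toy_counts)
relaxed = filter_insertions(toy_counts, RELAXED_MIN_READS)

print(sorted(stringent))  # [2210, 3005]
print(sorted(relaxed))    # [1450, 1873, 2210, 3005]
```

With the stringent cutoff, weakly supported insertions are discarded, so a region can appear insertion‐free (and hence essential) even though a few low‐count insertions were sequenced.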
The resulting integrated essentiality map (Supplementary File S1) after 12 days of growth consists of 69,994 unique mini‐transposon insertions with a resolution of ~4 bp for non‐essential genes. Based on the analysis of the gold sets of essential and non‐essential ORFs (Supplementary Table S1), we developed an essentiality probability criterion (Supplementary Materials and Methods; Fig 1D) (Christen et al, 2011). Using this criterion, the 694 annotated ORFs were assigned to three distinct categories: essential (E; 342 ORFs), non‐essential (NE; 259 ORFs) and fitness (F; 93 ORFs) (Supplementary Table S2) (Christen et al, 2011). The robustness of the classification was validated by the ability to isolate 92% of randomly selected F (12 genes) and NE (24 genes) clones, and the lack of success for 90% of E ORFs (28 out of 31, Supplementary Table S2). The three isolated E clones are classified as fitness with the relaxed threshold of seven reads per insertion, suggesting that they are severely affected in their growth. Moreover, when comparing with the predicted set of the minimal protein machinery in mollicutes, which includes 129 genes (Grosjean et al, 2014), we find 92% of them to be essential and 7% fitness. The dependence of the fitness classification on the reads‐per‐insertion cutoff indicates that some fitness genes could be incorrectly classified as essential when the cutoff is too strict. On the other hand, relaxing this cutoff results in some gold set essential genes being classified as fitness. This illustrates the limitation of transposon essentiality studies using deep sequencing for fitness genes.
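As a quick arithmetic check of the numbers above, the following Python snippet recomputes the category fractions and the validation rate from the counts given in the text. Only the variable names are introduced here.

```python
# ORF counts per category, as stated in the text.
counts = {"E": 342, "NE": 259, "F": 93}

total = sum(counts.values())
fractions = {cat: round(100 * n / total, 1) for cat, n in counts.items()}

# Essential ORFs for which no mutant clone could be isolated (28 out of 31).
failed_isolation_pct = round(100 * 28 / 31, 1)

print(total)                 # 694
print(fractions)             # {'E': 49.3, 'NE': 37.3, 'F': 13.4}
print(failed_isolation_pct)  # 90.3
```

The three categories add up to the 694 annotated ORFs, and the essential fraction matches the 49% quoted for conventional ORFs.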
Notably, we found that the insertions were not evenly distributed along entire ORFs, as previously observed in Caulobacter crescentus (Christen et al, 2011). In this respect, it is important to note that our mini‐transposon has an internal promoter that could allow expression of downstream genes or domains if there is a start codon for translation. This hints at the existence of individual domains that mediate the interactions within sub‐complexes. Indeed, we found that multi‐domain proteins involved in protein complexes are more frequently essential than proteins with a single domain and that they are involved in important cellular processes such as transcription and DNA replication (Supplementary Fig S1). Analyzing the essentiality of individual protein domains revealed that in 81 multi‐domain proteins, the essentiality status of individual structural domains differs (Fig 2, Supplementary Table S2, Supplementary Materials and Methods). Furthermore, cloning and expression of some of these structural domains (C‐terminus of MPN241, Fig 2A and N‐terminus of MPN683, Fig 2B) showed autonomous folding since they can be expressed in a soluble manner (Supplementary Fig S2). These results indicate that the use of a single transposon insertion as the criterion for protein essentiality should be revised and that domain essentiality analysis should be routinely applied instead.
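The idea of scoring essentiality per domain rather than per ORF can be sketched as an insertion density computed separately for each domain. This is an illustration under stated assumptions, not the criterion used in the study (which is described in the Supplementary Materials and Methods): the function name, the domain boundaries and the insertion positions are all hypothetical.

```python
from typing import Dict, List, Tuple


def insertions_per_domain(insertions: List[int],
                          domains: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Return insertions per 100 bp for each domain.

    `domains` maps a domain name to inclusive (start, end) coordinates.
    """
    density = {}
    for name, (start, end) in domains.items():
        hits = sum(1 for pos in insertions if start <= pos <= end)
        length = end - start + 1
        density[name] = 100.0 * hits / length
    return density


# Hypothetical two-domain gene: all insertions fall in the C-terminal half.
toy_domains = {"N-terminal": (1, 300), "C-terminal": (301, 600)}
toy_insertions = [310, 342, 377, 401, 455, 520, 588]

density = insertions_per_domain(toy_insertions, toy_domains)
for name, value in density.items():
    print(f"{name}: {value:.2f} insertions per 100 bp")
```

In this toy case a whole‐gene view would record seven insertions and could call the gene non‐essential, whereas the per‐domain view shows an insertion‐free N‐terminal domain.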
Within non‐transcriptionally active sequences of the M. pneumoniae genome, we detected 0.9% of essential intergenic regions (> 100 bp), which may function as structural elements including the origin of replication (oriC) (Fig 3A, Supplementary Fig S3, Supplementary Table S3). In addition, we found that the percentages of essential transcriptionally active 5′‐UTRs and ncRNAs (intergenic and overlapping with non‐essential genes; for those overlapping with essential genes, no essentiality could be assigned) are 26 and 5%, respectively, and for conventional RNAs 82% (Fig 3, Supplementary Table S2, Supplementary Materials and Methods). Strikingly, a large number of the ncRNAs (~95%) overlap with coding genes on the opposite strand, which suggests that they have regulatory roles in gene expression. To gain insight into their functionality, we studied the correlation of expression of ncRNAs with their overlapping ORFs along 10 different time points of the growth curve by RNAseq. Interestingly, ncRNAs that anti‐correlate with the overlapping ORF have higher essentiality coefficients than those that correlate (Fig 3B, Supplementary Table S4). More importantly, the percentage of essential anti‐correlated ORFs is higher than that of correlated ones (Fig 3C; means of percentages: 50% versus 37%, respectively; P = 5.63e‐10 applying Welch's two sample t‐test), suggesting that essential ORFs are down‐regulated by ncRNAs.
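The two statistics mentioned above can be sketched in plain Python: a Pearson correlation between the expression profiles of an ncRNA and its overlapping ORF, and Welch's t statistic with the Welch–Satterthwaite degrees of freedom for comparing two groups with unequal variances. The function names and toy profiles are hypothetical, and the choice of Pearson correlation is an assumption, since the text does not name the correlation measure.

```python
import math
from typing import Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((v - ma) ** 2 for v in a) / (na - 1)  # sample variances
    vb = sum((v - mb) ** 2 for v in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical expression profiles over 10 time points of the growth curve.
ncrna_profile = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
orf_profile = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

r = pearson_r(ncrna_profile, orf_profile)
label = "anti-correlated" if r < 0 else "correlated"
print(label)  # anti-correlated
```

Converting the t statistic into a P value requires the t distribution, which is available in statistical libraries but is left out here to keep the sketch free of dependencies.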
It is possible that some ncRNAs encode smORFs, similar to some long ncRNAs in eukaryotes (Cohen, 2014). In fact, smORFs have been found in bacteria associated with a diverse set of cellular functions (Hobbs et al, 2011; Samayoa et al, 2011). To investigate this, we translated all ncRNAs in the three reading frames and identified the putative ORFs by sequence searches and by combining mass spectrometry (MS) with protein fractionation methodologies. Sequence conservation analysis with other bacterial species predicted eleven possible smORFs (Supplementary Fig S3, Supplementary Table S5, marked with α), of which four were identified by MS. Interesting examples are as follows: MPN391a, a cysteine‐rich peptide predicted to be involved in peroxide resistance (Zimmerman & Herrmann, 2005); MPN347a, part of an anti‐toxin pair (Supplementary Fig S4) (Liu et al, 2008); and MPN155a, which is homologous to a putative RNA‐binding protein, YlxR (Osipiuk et al, 2001), and is found in the same operon (Supplementary Fig S4). Interestingly, each fractionation methodology revealed new smORFs (Fig 4A), extending the number from 32 annotated smORFs (25 detected proteins, mostly ribosomal, 56%) to a total of 67 smORFs (~9% of the total ORFs). Additional fractionation experiments did not further increase the number of smORFs, suggesting that we are close to defining the complete M. pneumoniae small proteome (under the experimental limitation of identifiable peptides by MS for smORFs, Fig 4A). As observed for the conventional ORFs, smORFs are often highly transcribed and essential (53%) (Supplementary Table S5, Fig 3A).
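The three‐frame translation step can be sketched as follows. The sketch uses NCBI translation table 4, in which TGA encodes tryptophan, as is the case in mycoplasmas. Several simplifications are assumptions made for illustration only: ORFs must start with ATG and end with a stop codon, only the three forward frames of the transcript are scanned, and the function and variable names are hypothetical. The actual identification in the study also relied on sequence searches and MS evidence.

```python
from itertools import product
from typing import List, Tuple

BASES = "TCAG"
# NCBI translation table 4 (Mycoplasma/Spiroplasma): TGA encodes Trp (W).
AMINO_ACIDS_TABLE4 = (
    "FFLLSSSSYY**CCWW" "LLLLPPPPHHQQRRRR" "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS_TABLE4)
}


def small_orfs(rna: str, max_residues: int = 99) -> List[Tuple[int, str]]:
    """Scan three forward frames for ATG...stop ORFs of < 100 residues.

    Returns (0-based start offset, peptide) pairs. ORFs lacking an
    in-frame stop codon are ignored (a simplifying assumption).
    """
    seq = rna.upper().replace("U", "T")
    found = []
    for frame in range(3):
        peptide, start = None, None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            aa = CODON_TABLE.get(codon, "X")  # X for ambiguous bases
            if peptide is None:
                if codon == "ATG":  # open a new ORF at the first ATG
                    peptide, start = "M", i
            elif aa == "*":  # stop codon closes the current ORF
                if len(peptide) <= max_residues:
                    found.append((start, peptide))
                peptide = None
            else:
                peptide += aa
    return found


# Toy transcript: ATG TGA AAA TAA; TGA is read as Trp under table 4.
print(small_orfs("ATGTGAAAATAA"))  # [(0, 'MWK')]
```

Under the standard genetic code the same toy sequence would terminate at TGA, which is why the choice of translation table matters when searching for smORFs in this organism.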
In order to get insight into the reasons behind the high essentiality of the smORFs, we first investigated whether they are part of large protein complexes as previously suggested for some smORFs (Gassel et al, 1999). By size‐exclusion chromatography coupled to MS (SEC‐MS), we found that the vast majority (31 out of 34; 11 new) of the detectable smORFs eluted in fractions of significantly higher molecular weight than expected from the size of the individual proteins. This indicates that smORFs are frequently associated within larger protein complexes (Fig 4B, Supplementary Table S6), and this is probably the case for the majority of the smORFs. For example, when overexpressing two smORFs that were not detected in the original SEC‐MS experiments, MPN060a and MPN155a, we find them eluting in high molecular weight fractions (Fig 4B, Supplementary Fig S3B). Second, we used DNA–cellulose (DNAC) affinity chromatography coupled to MS to analyze DNA‐ or RNA‐binding properties (Mai et al, 1998). We found that out of 35 smORFs identified in this experiment (14 previously unknown, Fig 4A), 42% of new smORFs (including the putative RNA‐binding protein MPN155a, YlxR) bind to DNA/RNA, compared to 15% of the conventional ORFs (excluding well‐known DNA and RNA directly binding proteins) in M. pneumoniae (Fig 4C, Supplementary Materials and Methods).
Understanding the minimal set of essential genetic elements is important for several applications, ranging from synthetic biology approaches to drug target identification in pathogenic bacteria (Gallagher et al, 2007). Based on our analysis, we conclude that essentiality should be considered at a protein domain resolution and that smORFs as well as regulatory elements (5′‐UTRs and ncRNAs) are frequently essential genomic elements, considerably increasing the repertoire of building blocks that need to be considered for a minimal genome. Furthermore, we revealed a previously unknown layer of essentiality composed of smORFs that are likely to play important roles in protein complex functionality and DNA transcriptional regulation. Thus, it is crucial to more carefully consider smORFs in genome annotations as they can comprise 9% of the genome ORFs.
Materials and Methods
The mini‐transposon mutant libraries of M. pneumoniae were obtained by transforming cells with the pMT85 and pMTnTetM438 vectors and performing serial passages (Supplementary Materials and Methods) (Pich et al, 2006). Genomic DNAs were collected using the Illustra bacteria genomic kit (GE) and sequenced with the HITS approach (Fig 1A and B) using standard Illumina paired‐end sequencing. Raw reads were filtered by inverted repeats (IR) and then mapped to the M. pneumoniae reference genome (NC_000912, NCBI) using BLAST (Supplementary Table S7).
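One plausible reading of the IR filtering step is that only reads containing the transposon inverted repeat are kept and that the sequence following the IR is the genomic flank to be mapped. The sketch below illustrates that reading only. The IR sequence is a placeholder and not the real transposon sequence, and the function name, the minimum flank length and the toy reads are hypothetical.

```python
from typing import List, Optional

# Placeholder only: NOT the real inverted repeat of the mini-transposon.
HYPOTHETICAL_IR = "ACGTTGCA"


def genomic_flank(read: str, ir: str = HYPOTHETICAL_IR,
                  min_flank: int = 5) -> Optional[str]:
    """Return the sequence downstream of the IR, or None if the read
    lacks the IR or the remaining flank is too short to map."""
    idx = read.find(ir)
    if idx == -1:
        return None
    flank = read[idx + len(ir):]
    return flank if len(flank) >= min_flank else None


# Toy reads: one with IR plus flank, one without IR, one with a short flank.
toy_reads = ["GGACGTTGCATTTAGGCATCA", "TTTTTTTTTTTTTTT", "ACGTTGCAAC"]
flanks: List[str] = [f for f in map(genomic_flank, toy_reads) if f is not None]

print(flanks)  # ['TTTAGGCATCA']
```

The retained flanks would then be aligned to the reference genome to obtain the insertion coordinates.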
Two gold standard sets were manually assembled; one contained 37 protein‐coding genes that are known to be essential, and the other contained 29 NE ORFs (Supplementary Table S1). The two datasets were evaluated using our mini‐transposon library, and a scoring system was then developed that consisted of two parameters, each rounded to two decimals: PE, the probability of a genomic region being essential, and PNE, the probability of it being non‐essential (Supplementary Table S2). This analysis revealed three distinct groups of genes with 99% confidence (Supplementary Table S2): those that are essential (E; PE > 0 and PNE = 0), those that are non‐essential (NE; PE = 0 and PNE > 0) and a third group with an intermediate essentiality score that we define as fitness (F; PE > 0 and PNE > 0, or PE = 0 and PNE = 0). The fitness category includes genes whose essentiality could depend on the growth condition and for which transposon insertions, despite having an impact on growth, do not abolish cell viability.
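The three‐way rule above translates directly into a small decision function. In this sketch the function name is hypothetical and the inputs are assumed to be the PE and PNE values of a region; the rounding to two decimals follows the text. How PE and PNE are computed is described in the Supplementary Materials and Methods and is not reproduced here.

```python
def classify(pe: float, pne: float) -> str:
    """Assign E, NE or F from PE and PNE, both rounded to two decimals."""
    pe, pne = round(pe, 2), round(pne, 2)
    if pe > 0 and pne == 0:
        return "E"   # essential
    if pe == 0 and pne > 0:
        return "NE"  # non-essential
    # Both probabilities positive, or both zero: intermediate category.
    return "F"       # fitness


print(classify(0.97, 0.0))  # E
print(classify(0.0, 0.80))  # NE
print(classify(0.30, 0.20))  # F
print(classify(0.0, 0.0))   # F
```

Because of the rounding, a probability below 0.005 is treated as zero, so a region with PE = 0.004 and PNE = 0.5 would be classified as non‐essential.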
To study whether the protein products of smORFs could be involved in protein complexes, ten smORFs were selected and cloned into vector pMT85‐clpB‐TAPtag SfiI/NotI (Kuhner et al, 2009). After transforming M. pneumoniae, the protein complexes were studied by molecular weight exclusion chromatography coupled to Western blot. Fractions from molecular weight exclusion chromatography were trypsin‐digested and then subjected to MS (Supplementary Materials and Methods). DNA/RNA‐binding proteins were identified by DNA–cellulose (DNAC) affinity chromatography coupled to MS (Supplementary Materials and Methods).
The raw data of transposon libraries and RNAseq have been submitted to the ArrayExpress database (http://www.ebi.ac.uk/arrayexpress) and assigned the identifier E‐MTAB‐3075 and E‐MTAB‐3076, respectively. Additionally, genome re‐annotation and MS data used for identification of smORFs have been submitted to ProteomeXchange via the PRIDE database (http://www.ebi.ac.uk/pride) and assigned the identifier PXD001611.
We thank Dr. Christina Kiel for her comments and the Genomics, Proteomics and Protein Technologies Core Facilities at CRG. We also thank Dr. Marc Güell and Dr. Hinnerk Eilers for fruitful discussions. Besray Unal was co‐funded by Marie Curie Actions. This work was supported by the European Research Council (ERC), the Fundación Marcelino Botin, the Spanish Ministerio de Economía y Competitividad BIO2007‐61762 and the ISCIII (PI10/01702).
LS and PB conceived the study; MLS, JDB and WHC assembled and analyzed the data and wrote the manuscript; PB, LS, JS and ACG revised the manuscript; FJO, TF, MLS and VvN performed experiments of protein complexes; JAHW generated the database of ORFs; VLLR and EBU helped with the analyses of the transcriptome data; EY and SM did DNA‐binding experiments; MLS and AV developed the HITS technique; RJN obtained the DNA samples of transposon libraries at the different passages; AS participated in isolation of M. pneumoniae mutants from the library. All authors have read and approved the manuscript.
Conflict of interest
The authors declare that they have no conflict of interest.
Supplementary Figure S1
Supplementary Figure S2
Supplementary Figure S3
Supplementary Figure S4
Supplementary Figure S5
Supplementary Figure S6
Supplementary Figure S7
Supplementary Figure S8
Supplementary File S1
Supplementary Table S1
Supplementary Table S2
Supplementary Table S3
Supplementary Table S4
Supplementary Table S5
Supplementary Table S6
Supplementary Table S7
Supplementary Table S8
Supplementary Table S9
Supplementary Table S10
Supplementary Table S11
Funding: European Research Council (ERC)
This is an open access article under the terms of the Creative Commons Attribution 4.0 License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2015 The Authors. Published under the terms of the CC BY 4.0 license