Cellular functions are mediated through complex systems of macromolecules and metabolites linked through biochemical and physical interactions, represented in interactome models as ‘nodes’ and ‘edges’, respectively. Better understanding of genotype‐to‐phenotype relationships in human disease will require modeling of how disease‐causing mutations affect systems or interactome properties. Here we investigate how perturbations of interactome networks may differ between complete loss of gene products (‘node removal’) and interaction‐specific or edge‐specific (‘edgetic’) alterations. Global computational analyses of ∼50 000 known causative mutations in human Mendelian disorders revealed clear separations of mutations probably corresponding to those of node removal versus edgetic perturbations. Experimental characterization of mutant alleles in various disorders identified diverse edgetic interaction profiles of mutant proteins, which correlated with distinct structural properties of disease proteins and disease mechanisms. Edgetic perturbations seem to confer distinct functional consequences from node removal because a large fraction of cases in which a single gene is linked to multiple disorders can be modeled by distinguishing edgetic network perturbations. Edgetic network perturbation models might improve both the understanding of dissemination of disease alleles in human populations and the development of molecular therapeutic strategies.
Genotype‐to‐phenotype relationships in human genetic disease are often modeled as: ‘mutation in gene X leads to loss of gene product X, which leads to disease A’. However, single ‘gene‐loss’ models cannot explain the increasingly appreciated prevalence of complex genotype‐to‐phenotype relationships, particularly with instances of allelic or locus heterogeneity (Goh et al, 2007).
Genes and gene products function not in isolation but as components of complex networks of macromolecules (DNA, RNA, or proteins) and metabolites linked through biochemical or physical interactions, often represented in ‘interactome’ network models as ‘nodes’ and ‘edges’, respectively. Here we use network perturbation models to explain molecular dysfunctions underlying human disease in addition to the gene‐loss model.
We hypothesize that different mutations leading to different molecular defects to proteins may cause distinct perturbations of cellular networks, giving rise to distinct phenotypic outcomes (Figure 1). For example, truncations close to the start of an open‐reading frame, or mutations that grossly destabilize a protein structure, can be modeled as removing a protein node from the network (‘node removal’). Alternatively, single amino‐acid substitutions that affect specific binding sites, or truncations that preserve certain domains of a protein, may give rise to partially functional gene products with specific changes in distinct molecular interaction(s) (edge‐specific or ‘edgetic’ perturbations) (Figure 1B).
Taking advantage of the large number of known disease‐causing allelic variations in human Mendelian disorders, we investigated how disease‐associated mutations may cause complete loss of gene products or, alternatively, may cause specific loss or gain of individual molecular interaction(s). We examined ∼50 000 Mendelian disease‐causing alleles, affecting over 1900 protein‐coding genes, altogether associated with more than 2000 human disorders available in the Human Gene Mutation Database (HGMD) (Stenson et al, 2003), that can be subdivided into two subsets: ‘truncating’ alleles (truncations or frameshifts caused by stop codons, out‐of‐frame insertions or deletions, or defective splicing) versus ‘in‐frame’ alleles (missense mutations and in‐frame insertions or deletions). Over 50% (27 919/52 491) of Mendelian alleles in HGMD correspond to ‘in‐frame’ mutations. Our hypothesis is that ‘in‐frame’ alleles may affect specific interactions of a given gene product while leaving most other interactions unperturbed.
Although exceptions may apply, our hypothesis has several predictions. First, ‘truncating’ versus ‘in‐frame’ alleles may distribute differently among autosomal dominant and autosomal recessive diseases, given that dominant mutations are more likely to be edgetic than recessive ones. Indeed, autosomal dominant and autosomal recessive traits annotated in the Online Mendelian Inheritance in Man (OMIM) database (Hamosh et al, 2005) show a clear separation with respect to the associated ‘in‐frame’ versus ‘truncating’ mutations. Among genes affected solely by ‘in‐frame’ mutations, the proportion of dominant diseases is ∼10‐fold higher than that of recessive ones, supporting the idea that ‘in‐frame’ mutations cause molecular defects distinct from those caused by ‘truncating’ mutations.
A proof‐of‐principle characterization of binary protein interaction defects of mutant alleles associated with five genetic disorders supports our hypothesis that ‘in‐frame’ alleles indeed produce mostly functional proteins, preserving many specific protein interactions. As grossly disruptive mutations versus mutations leading to loss or gain of specific interaction(s) probably distribute differently on protein structures, we examined available three‐dimensional structures of all disease proteins. Mutated residues in autosomal dominant disease are significantly more exposed to the surface of the structure than those in autosomal recessive disease, consistent with the idea that disease with distinct modes of inheritance probably involves distinct network perturbations.
A second testable prediction of our edgetic perturbation model is that edgetic perturbation versus gene loss for a given gene product might in some cases cause different diseases. We examined 142 genes associated with two or more distinct diseases in which at least five distinct alleles have been reported for each disease. In ∼30% of the cases, the distribution of ‘in‐frame’ versus ‘truncating’ mutations is significantly different between the two diseases linked to the same gene (P<0.05). Hence, when affecting the same gene, node removal versus edgetic perturbation can confer strikingly different phenotypes.
A third testable prediction is that different edgetic perturbations for a given gene product might cause phenotypically distinguishable diseases (Figure 6). We used predicted Pfam domains (Finn et al, 2006) as surrogates for functional interaction domains, assuming that ‘in‐frame’ mutations located in distinct Pfam domain‐encoding sequences probably alter distinct interactions. Among 169 genes associated with two or more diseases and encoding proteins containing at least two Pfam domains, nine proteins have at least two Pfam domains significantly enriched with ‘in‐frame’ mutations (P<0.05). For each of the nine proteins, we found a striking pattern of near mutual exclusivity, whereby different Pfam domains seem to be specifically affected in distinct disorders (Figure 6B).
We conclude that edgetic alleles probably underlie many complex genotype‐to‐phenotype relationships in human disease, such as incomplete penetrance or variable expressivity, as well as allele‐specific phenotypic variations among patients. Edgetic perturbation models of human inherited disorders might help explain how seemingly devastating alleles have appeared and persisted in human populations.
We present alternative models to explain molecular dysfunctions underlying human inherited disorders based on interaction‐specific or “edgetic” perturbations rather than complete loss of gene products.
We find that a substantial fraction of known genetic variants in human Mendelian disorders likely cause edgetic perturbations.
We find frequent situations where edgetic perturbation models can explain how different mutations in a single gene can cause distinct disorders.
Edgetic perturbation models should provide alternative explanations for complex genotype‐to‐phenotype relationships.
Decades of research into human Mendelian disorders have led to the discovery of a massive number of disease‐associated allelic variations. Most disease‐causing mutations are thought to confer radical changes to proteins (Wang and Moult, 2001; Botstein and Risch, 2003; Yue et al, 2005; Subramanian and Kumar, 2006). Consequently, genotype‐to‐phenotype relationships in human genetic disorders are often modeled as: ‘mutation in gene X leads to loss of gene product X, which leads to disease A’. A single ‘gene‐loss’ model seems pertinent for many diseases (Botstein and Risch, 2003). However, this model cannot be fully reconciled with the increasingly appreciated prevalence of complex genotype‐to‐phenotype associations for even ‘simple’ Mendelian disorders (Goh et al, 2007), particularly those in which: (i) a single gene can be associated with multiple disorders (allelic heterogeneity), (ii) a single disorder can be caused by mutations in any one of several genes (locus heterogeneity), (iii) only a subset of individuals carrying a mutation are affected by the disease (incomplete penetrance), or (iv) not all individuals with a given mutation are affected equally (variable expressivity). More complex models to interpret genotype‐to‐phenotype relationships would probably improve the understanding of human disease.
Genes and gene products function not in isolation but as components of complex networks of macromolecules (DNA, RNA, or proteins) and metabolites linked through biochemical or physical interactions, often represented in ‘interactome’ network models as ‘nodes’ and ‘edges’, respectively. Cellular networks seem to exhibit systems properties underlying phenotypic variations (Goh et al, 2007). Here we propose network‐perturbation models to explain molecular dysfunctions underlying human disease.
We hypothesize that distinct mutations causing distinct molecular defects to proteins may lead to distinct perturbations of cellular networks, giving rise to distinct phenotypic outcomes (Figure 1A). Truncations close to the start of an open‐reading frame, or mutations that grossly destabilize a protein structure, can be modeled as removing a protein node from the network (‘node removal’). Alternatively, single amino‐acid substitutions that affect specific binding sites, or truncations that preserve certain domains of a protein, may give rise to partially functional gene products with specific changes in distinct biophysical or biochemical interaction(s) (edge‐specific genetic perturbation or ‘edgetic’ perturbations; Figure 1B).
Edgetic network perturbations provide alternative molecular explanations for protein dysfunction in addition to gene loss. Taking advantage of the large number of known disease‐causing allelic variations in human Mendelian disorders, we investigated how such mutations may cause complete loss of gene products or, alternatively, cause specific loss or gain of distinct molecular interaction(s). We further tested edgetic perturbation models in cases in which a single gene is associated with multiple disorders. Together, both experimental and computational evidence support edgetic perturbation models in human inherited disorders. Edgetic perturbations probably underlie many complex genotype‐to‐phenotype relationships.
Global distribution of disease‐causing mutations
To investigate possibly differing network perturbations in human inherited disorders, we examined ∼50 000 Mendelian disease‐causing alleles, affecting over 1900 protein‐coding genes, altogether associated with more than 2000 human disorders available in the Human Gene Mutation Database (HGMD) (Stenson et al, 2003). We differentiated all disease alleles into two subsets probably causing different molecular defects to proteins. The first subset (‘truncating’ alleles) comprises all mutations that lead to the synthesis of truncated gene products, including nonsense mutations, out‐of‐frame insertions or deletions, or defective splicing. The second subset (‘in‐frame’ alleles) comprises mutations that probably give rise to nearly full‐length gene products, including missense mutations and in‐frame insertions or deletions. Over 50% (27 919/52 491) of Mendelian alleles in HGMD correspond to ‘in‐frame’ alleles (Figure 2A). Our hypothesis is that ‘truncating’ and ‘in‐frame’ alleles probably cause distinct molecular defects in proteins, and are thus enriched in distinct node removal or edgetic perturbations, respectively. This hypothesis is based on the assumption that ‘truncating’ alleles are less prone to produce stably folded proteins than ‘in‐frame’ alleles. Although exceptions may apply, our hypothesis predicts that ‘truncating’ versus ‘in‐frame’ alleles may distribute differently among diseases involving distinct node removal versus edgetic perturbations.
Given that, with the exception of haploinsufficiency, many established molecular explanations for dominance entail production of a mutated protein that interferes in some way with the function of the product of the normal allele, autosomal dominant disease should be more frequently associated with edgetic perturbation than node removal (Figure 2B). To test the hypothesis that ‘truncating’ versus ‘in‐frame’ alleles are enriched in distinct node removal versus edgetic perturbations, respectively, we retrieved the inheritance information, by manual curation, for each HGMD‐annotated phenotype from the Online Mendelian Inheritance in Man (OMIM) database (Hamosh et al, 2005). ‘Truncating’ versus ‘in‐frame’ alleles distribute differently among autosomal dominant and autosomal recessive traits. Among genes affected solely by ‘in‐frame’ mutations, the proportion of autosomal dominant diseases is ∼10‐fold higher than that of autosomal recessive diseases (Figure 2C). This trend holds even after removing all human predicted orthologs of essential genes from the analysis (Supplementary Figure S1).
We next examined whether distinct distribution of ‘truncating’ versus ‘in‐frame’ alleles can also be found among autosomal dominant traits that are probably caused by different molecular mechanisms. Mutations in cytoskeleton proteins frequently cause dominant‐negative effects, in which incorporation of expressed abnormal molecules into multimeric assemblies of structural proteins disrupts the integrity and function of the complex (Wilkie, 1994). In contrast, germline mutations in transcription factors are more frequently associated with haploinsufficiency (Wilkie, 1994; Seidman and Seidman, 2002) probably because of insufficient activity or production of the remaining wild‐type allele in heterozygotes. Consistent with this distinction, a significantly higher fraction of ‘in‐frame’ mutations was found for autosomal dominant Mendelian disorders associated with structural proteins than with transcription factors (Figure 2D).
Distinct global distributions of ‘truncating’ versus ‘in‐frame’ mutations among diseases with distinct modes of inheritance, and in proteins probably associated with distinct molecular mechanisms of dominance, support our hypothesis that ‘truncating’ versus ‘in‐frame’ alleles are probably enriched in distinct node removal versus edgetic perturbations, respectively. The distinctions observed between autosomal dominant and autosomal recessive mutations may be more pronounced if haploinsufficiency could be separated overall from dominant‐negative and other molecular mechanisms of dominance, but such information is currently unavailable at the global level.
Distinguishing edgetic perturbation from node removal
For a proof‐of‐principle analysis of allele‐specific network perturbations by disease proteins, we used an integrated experimental approach to characterize binary protein interaction defects of disease‐causing mutant alleles. Our approach includes (i) Gateway recombinational cloning of mutations by PCR‐based site‐directed mutagenesis (Suzuki et al, 2005), (ii) high‐throughput mapping of binary protein–protein interactions (Rual et al, 2005), (iii) high‐throughput characterization of protein–protein interaction defects of all cloned disease‐causing mutant proteins, and (iv) integration of network perturbations by disease‐causing mutations with structural or functional information of disease proteins.
We selected disease proteins that have: (i) multiple mutations annotated in HGMD (Stenson et al, 2003), (ii) wild‐type clones available in our human ORFeome collection, hORFeome 3.1 (Lamesch et al, 2007), (iii) structural information available in the Protein Data Bank (PDB, http://www.rcsb.org/pdb), and (iv) two or more interactions reported in our previous binary human interactome map (Rual et al, 2005). We also required that at least one of the interactions observed by yeast two‐hybrid (Y2H) analysis be supported by functional characterization in the literature. Given these criteria, we could apply our allele‐profiling platform to one autosomal recessive disease protein (CBS), and to three autosomal dominant disease proteins with likely dominant‐negative (ACTG1), abnormal activation (CDK4), or haploinsufficiency (PRKAR1A) molecular defects (Figure 3A). We included one additional autosomal recessive disease protein (HGD) that meets all criteria except that no protein–protein interaction data were available (Figure 3A). We carried out a genome‐wide Y2H screen against a set of ∼8100 human open‐reading frames (Rual et al, 2005), and identified three interactions for wild‐type HGD. We cloned disease‐causing mutants annotated in HGMD for these five proteins and profiled each mutant against the corresponding wild‐type interactors.
Profiling interaction defects of 29 alleles associated with five distinct genetic disorders revealed three classes of interaction‐defective alleles (Supplementary information and Figure 3B): (i) five alleles that behaved as null, eliminating all interactions, (ii) 16 edgetic alleles that lost specific interaction(s) while retaining other interactions, and (iii) eight alleles that behaved as ‘pseudo‐wild‐type’, retaining all currently available protein–protein interactions tested here. Null‐like alleles were observed only for two autosomal recessive disease proteins (CBS and HGD) and in a supposed case of dominant haploinsufficiency (PRKAR1A), consistent with differing network perturbations in diseases associated with distinct modes of inheritance (Figure 2B). We propose that many disease‐causing alleles scoring as pseudo‐wild‐type in the assay described here might still be true edgetic alleles. Further analysis with additional physical and biochemical interactors using additional assays should eventually settle that question.
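The three allele classes described above can be expressed as a simple rule over an allele's interaction profile. The following is a minimal sketch, not the authors' pipeline: the function name, the dictionary representation, and the interactor labels in the usage note are all hypothetical.

```python
def classify_allele(profile):
    """Classify one mutant allele from its interaction profile.

    profile maps each wild-type interactor (hypothetical labels) to True if
    the mutant retains that interaction, or False if the interaction is lost.
    """
    if not profile:
        raise ValueError("at least one tested interaction is required")
    retained = sum(1 for kept in profile.values() if kept)
    if retained == 0:
        return "null"              # all tested interactions eliminated
    if retained == len(profile):
        return "pseudo-wild-type"  # all tested interactions retained
    return "edgetic"               # some interactions lost, others retained
```

For example, `classify_allele({"partner_A": True, "partner_B": False})` returns `"edgetic"`. As noted in the text, a "pseudo-wild-type" call only reflects the interactions tested, so it may change as more interactors are assayed.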
We related Y2H interaction profiles of each mutant to structural properties of disease proteins (Supplementary information and Supplementary Figures S2–S6). Grossly disruptive mutations tend to affect buried residues of the protein, whereas mutations leading to loss or gain of specific interaction(s) tend to lie on the surface. Edgetic perturbation of some disease alleles revealed diverse molecular mechanisms of protein dysfunction (Supplementary information). Complex allele‐specific perturbations were also found to be associated with phenotypic variability among patients, such as their response to specific treatments (Supplementary information for CBS).
Structural analyses of disease‐causing mutations
To further investigate the extent to which mutations found in human genetic disorders may grossly disrupt proteins or cause alterations in specific biochemical or biophysical interaction(s), we examined available three‐dimensional structures of all disease proteins. As grossly disruptive mutations versus mutations leading to loss or gain of specific interaction(s) probably distribute differently on protein structures (Figure 4A), we divided missense disease‐causing mutations into three non‐redundant categories: buried residues (<5% of surface accessible to water), exposed residues (⩾30% of surface accessible to water), and residues with intermediate exposure (5–30% of surface accessible to water). Among all 3664 affected residues in 236 proteins for which three‐dimensional X‐ray structures are available, about one‐third of the mutated residues are buried, whereas another one‐third are exposed, probably representing complete loss of gene products versus loss or gain of specific molecular interaction(s), respectively (Supplementary Figure S7). Consistent with differing network perturbations in disease with distinct modes of inheritance (Figure 2B), autosomal dominant versus autosomal recessive disease mutations exhibit significant separation with respect to their solvent‐accessible surface areas (P<3 × 10⁻¹⁰; Figure 4B). About 40% of mutated residues in autosomal dominant disease are exposed (with relative solvent‐accessible surface areas ⩾30%), whereas only 27% of mutated residues in autosomal recessive disease fall in the same category (Figure 4B).
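The three exposure categories can be illustrated with a short sketch that bins residues by relative solvent‐accessible surface area using the thresholds stated above (<5%, 5–30%, ⩾30%). The function names are hypothetical, and the sketch assumes relative accessibility has already been computed as a percentage by some structure analysis tool.

```python
def exposure_category(relative_sasa_percent):
    """Bin a residue by relative solvent-accessible surface area (percent)."""
    if relative_sasa_percent < 0:
        raise ValueError("relative accessibility cannot be negative")
    if relative_sasa_percent < 5:
        return "buried"
    if relative_sasa_percent < 30:
        return "intermediate"
    return "exposed"


def exposure_fractions(values):
    """Return the fraction of residues falling in each exposure category."""
    values = list(values)
    if not values:
        raise ValueError("no residues supplied")
    counts = {"buried": 0, "intermediate": 0, "exposed": 0}
    for value in values:
        counts[exposure_category(value)] += 1
    return {name: n / len(values) for name, n in counts.items()}
```

Applying `exposure_fractions` separately to residues mutated in autosomal dominant and autosomal recessive disease would give the kind of per-category proportions compared in the text.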
Allele‐specific perturbations observed in PRKAR1A (Supplementary Figure S6) indicate that interaction‐specific perturbation by truncations is also possible. As ‘truncating’ alleles outside of protein domains may preserve function of certain domains, giving rise to interaction‐specific perturbations (Figure 4C), we determined the distribution of ‘truncating’ mutations in Pfam domains (Finn et al, 2006). Although disease‐causing ‘truncating’ mutations seem to exhibit a random distribution with respect to Pfam domains (enrichment: 1.0, P=0.2), ‘truncating’ mutations in autosomal dominant disease are slightly depleted in Pfam domains, whereas ‘truncating’ mutations in autosomal recessive disease are slightly enriched in Pfam domains (Figure 4D). This finding is consistent with the hypothesis that different ‘truncating’ mutations may cause distinct node removal versus edgetic perturbations giving rise to disease with distinct modes of inheritance. In agreement with distinct molecular mechanisms of dominance (Figure 2B), we found a depletion of autosomal dominant ‘truncating’ mutations in Pfam domains for structural proteins against an enrichment for transcription factors (Figure 4D), probably associated with dominant‐negative effects versus haploinsufficiency, respectively.
Node removal versus edgetic perturbation in complex gene‐disease associations
The complex patterns of disease mutations noted so far indicate that a substantial fraction of causative alleles in human genetic disorders may cause edgetic perturbations rather than node removal. Distinct network perturbation models, leading to distinct phenotypic outcomes (Figure 1), predict that ‘truncating’ versus ‘in‐frame’ alleles for a given gene product might cause different diseases (Figure 5A). We therefore examined 142 genes associated with two or more diseases for which at least five distinct alleles have been reported for each disease. Among 278 disease pairs, each associated with a single one of these 142 genes, we found 88 pairs (∼30%) for which the proportion of ‘in‐frame’ versus ‘truncating’ mutations is significantly different between the two diseases (P<0.05; Figure 5B and Supplementary Table 2). A noteworthy example involves the four types (I, II, III, and IV) of osteogenesis imperfecta (OI) with COL1A1 ‘in‐frame’ mutations causing strikingly more severe phenotypes (in type II, III, or IV) than ‘truncating’ mutations involved in type I (Hamosh et al, 2005; Figure 5B).
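The comparison of two diseases linked to the same gene amounts to testing a 2 × 2 table of ‘in‐frame’ versus ‘truncating’ allele counts. The text does not name the statistical test used, so the sketch below assumes a two‐sided Fisher's exact test purely for illustration; the function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Rows are the two diseases; columns are 'in-frame' and 'truncating'
    allele counts. Returns the p-value.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed
    # one; the small tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(low, high + 1)
                if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Made-up counts: disease 1 has 8 in-frame and 2 truncating alleles,
# disease 2 has 1 in-frame and 5 truncating alleles.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"p = {p_value:.4f}")
```

With these invented counts the p-value falls below 0.05, so such a pair would count among those with a significantly different allele distribution under this assumed test.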
Among 34 genes that are linked to both autosomal dominant and autosomal recessive disorders, the fraction of ‘in‐frame’ versus ‘truncating’ mutations per gene is significantly higher for autosomal dominant mutations than for autosomal recessive ones (Supplementary Figure S8). This finding further supports our hypothesis that distinct ‘in‐frame’ versus ‘truncating’ mutations probably cause distinct network perturbations giving rise to disease with distinct modes of inheritance (Figure 2).
Edgetic interaction profiles of CBS and PRKAR1A mutant proteins (Figure 3) revealed possible connections between allele‐specific interaction defects and differential treatment responses or phenotypic severity among patients (Supplementary information). In addition to clinical variability, edgetic perturbation models also predict that distinct edgetic perturbations for a given gene product might cause phenotypically distinguishable disorders (Figure 6A). We used predicted Pfam domains as surrogates for functional protein domains (Sammut et al, 2008), assuming that ‘in‐frame’ mutations located in different Pfam domains probably alter protein functions differently. Among 169 genes associated with two or more diseases and encoding proteins containing at least two Pfam domains, 77 had significant enrichment of ‘in‐frame’ mutations in Pfam domains (P<0.05). There were nine proteins with at least two Pfam domains significantly enriched with ‘in‐frame’ mutations (P<0.05). For each of the nine proteins, we found a striking pattern of near mutual exclusivity, whereby different Pfam domains seem to be specifically affected in distinct disorders (Figure 6B and Supplementary Table 3). A compelling example is TP63 (van Bokhoven and Brunner, 2002) in which two clinically distinct developmental disorders, ectrodactyly ectodermal dysplasia (EEC) and ankyloblepharon ectodermal dysplasia (AEC), are caused by mutations in two separate domains, one predicted to bind DNA and the other to mediate protein–protein interaction(s) (Figure 6B). Current information on protein functional domains is incomplete, thus limiting the resolution for distinguishing phenotypes and genotypes. With more detailed structural and biochemical information available, more such allele‐specific edgetic phenotype‐to‐genotype correlations should be uncovered.
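Domain enrichment of ‘in‐frame’ mutations can be sketched as follows. The text does not specify the enrichment test, so this example assumes a one‐sided binomial test in which, under the null hypothesis, each mutation lands in a domain with probability equal to the domain's share of the protein length. All names and inputs are hypothetical.

```python
from math import comb


def enrichment_p_value(hits, total, domain_length, protein_length):
    """One-sided binomial p-value for seeing at least `hits` of `total`
    in-frame mutations inside a domain, assuming mutations are otherwise
    spread uniformly along the protein (an assumption of this sketch)."""
    if not 0 < domain_length <= protein_length:
        raise ValueError("domain length must be within the protein length")
    if not 0 <= hits <= total:
        raise ValueError("hits must lie between 0 and total")
    p = domain_length / protein_length
    return sum(comb(total, i) * p**i * (1 - p)**(total - i)
               for i in range(hits, total + 1))


def enriched_domains(domain_hits, domain_lengths, total, protein_length,
                     alpha=0.05):
    """Return the names of domains whose enrichment p-value is below alpha."""
    return [name for name, hits in domain_hits.items()
            if enrichment_p_value(hits, total, domain_lengths[name],
                                  protein_length) < alpha]
```

A protein would then qualify for the mutual‐exclusivity analysis described above if `enriched_domains` returns at least two domains. No correction for multiple testing is applied in this sketch.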
Some commonalities among disease mutations have been discerned, such as their tendency to occur in highly conserved regions and to confer radical changes to proteins (Wang and Moult, 2001; Botstein and Risch, 2003; Yue et al, 2005; Subramanian and Kumar, 2006), but disease mutations are more complex than these trends suggest, and that complexity should not be overlooked. Here we uncovered both experimental and computational evidence that strongly supports distinct network perturbations in human Mendelian disorders resulting from complete loss of gene products (node removal) or specific alterations in distinct molecular interaction(s) (edgetic perturbation), respectively (Figures 2, 3 and 4). Distinct edgetic network perturbations probably underlie many complex genotype‐to‐phenotype relationships in human genetic disorders (Figures 5 and 6), supporting the idea that edgetic perturbation versus node removal may confer fundamentally different functional consequences.
Edgetic network perturbation models focus on specific alterations in distinct molecular interactions. Although the ‘node‐centered’ gene knockout or knockdown approaches are convenient and useful in determining effects of gross disruption of proteins in model organisms, an ‘edge‐centered’ allele‐profiling approach, as carried out here and elsewhere (Dreze et al, in press), dissects the dynamics and complexities of biological systems, in which different interactions may occur independently, and in which a single protein may carry out different functions with different partners or in different biological contexts. Edgetic alleles with suboptimal but largely preserved molecular interactions may become insufficient when expressed at reduced levels or may become less stable. Such properties of edgetic alleles may be regulated by other genetic or environmental factors. In this regard, functional characterization of edgetic alleles may help explain phenotypic variations among patients, such as incomplete penetrance or variable expressivity, as well as differential clinical treatment responses (e.g. CBS alleles, Supplementary information). In addition, edgetic network perturbation models might improve our understanding of why and how disease alleles have disseminated in human populations.
Just as high‐throughput sequencing technologies are revolutionizing genotyping platforms, and as functional genomics and proteomics are becoming increasingly able to characterize gene products resulting from whole genome sequencing and gene prediction, functional characterization of genetic variations may be applied at large scale to characterize mutations with uncertain pathological consequences.
We considered the effects of disease‐causing mutations on physical protein–protein interactions, perturbation of which has emerged as a characteristic shared by many disease mutations (Ye et al, 2006; Hsu et al, 2007; Schuster‐Böckler and Bateman, 2008). Complete understanding of network perturbations in disease would require comprehensive analysis of disease mutant proteins by integration of data available from multiple functional assays. First, the current interactome network derived from Y2H analysis is probably incomplete. Many biologically relevant interactors remain to be tested and many may not be recovered by Y2H alone or by any other single protein interaction assay (Braun et al, 2009; Venkatesan et al, 2009). Second, Y2H detects binary protein interactions. A positive Y2H readout does not necessarily warrant proper protein complex assembly in vivo. In oligomer assembly, multiple interaction surfaces of the monomer may be utilized. Mutant alleles that disrupt one but not all interaction surfaces may show positive interaction in the Y2H analysis, but may still affect proper oligomerization. Third, Y2H is not quantitative. Subtle alterations in the affinity of protein–protein interactions, which are undetectable by Y2H, may confer phenotypic changes. Finally, disease mutations may affect protein functions by altering biochemical activities or protein–DNA or protein–RNA interactions.
Disease‐associated alleles may also gain new interactions, which is another important potential mechanism for pathogenicity. Gain‐of‐interaction alleles may be discovered by screening for new interactions specific for an individual mutant. Although we can assay only known edges at any given moment, as more physical and biochemical interactions become identified with time, deeper edgetic profiling will become possible. The pilot step taken here will reach its full potential when applied at genome or proteome scale, with the results integrated into extensive molecular networks.
Materials and methods
The lists of genes and associated phenotypes were downloaded from the HGMD website (Stenson et al, 2003) (June 2006). The corresponding gene IDs were retrieved from Entrez Gene (Maglott et al, 2005) (June 2006). By manual annotation we linked phenotypes associated with each mutation, as annotated in HGMD, to the corresponding disease in the OMIM database (Hamosh et al, 2005). The resulting list contains 2269 gene‐to‐OMIM disease ID entries associated with 48 774 distinct mutations. We carried out all analyses on the resulting gene–OMIM disease associations. We obtained the inheritance information for the corresponding disease available in OMIM and separated mutations associated with autosomal dominant or autosomal recessive inheritance. A total of 1777 gene‐to‐OMIM disease entries, which involve 1281 genes, 1466 OMIM disease IDs and 35 154 mutations, are associated with either autosomal dominant or autosomal recessive inheritance.
Fraction of ‘in‐frame’ mutations
We grouped missense mutations and small in‐frame insertions, deletions and indels (types of mutations as defined in HGMD) as ‘in‐frame’ mutations, and grouped nonsense, splicing and small out‐of‐frame insertions, deletions and indels as ‘truncating’ mutations. We calculated the fraction of ‘in‐frame’ mutations as the number of ‘in‐frame’ mutations divided by the total number of mutations in each gene for each mode of inheritance (Figure 2C and D and Supplementary Figures S1 and S8) or for each disease (Figure 5B). To minimize the possibility of any existing trend being obscured by genes with few mutations, we limited our analysis to genes that have five or more mutations associated with each mode of inheritance (Figure 2C and D and Supplementary Figures S1 and S8) or each disease (Figure 5B).
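The grouping and the fraction calculation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the original analysis code: the mutation type labels are hypothetical stand‐ins for the HGMD categories, and the record layout `(gene, group, mutation_type)` is invented for the example.

```python
from collections import defaultdict

# Hypothetical labels standing in for HGMD mutation types.
IN_FRAME = {"missense", "in-frame insertion", "in-frame deletion",
            "in-frame indel"}
TRUNCATING = {"nonsense", "splicing", "out-of-frame insertion",
              "out-of-frame deletion", "out-of-frame indel"}

MIN_MUTATIONS = 5  # threshold stated in the text


def allele_class(mutation_type):
    """Return 'in-frame' or 'truncating' for a mutation type label."""
    if mutation_type in IN_FRAME:
        return "in-frame"
    if mutation_type in TRUNCATING:
        return "truncating"
    raise ValueError(f"unclassified mutation type: {mutation_type!r}")


def in_frame_fractions(records, min_mutations=MIN_MUTATIONS):
    """Compute the fraction of in-frame mutations per (gene, group).

    records is an iterable of (gene, group, mutation_type) tuples, where
    group is a mode of inheritance or a disease. Groups with fewer than
    min_mutations mutations are left out of the result.
    """
    counts = defaultdict(lambda: [0, 0])  # [in-frame count, total count]
    for gene, group, mutation_type in records:
        tally = counts[(gene, group)]
        if allele_class(mutation_type) == "in-frame":
            tally[0] += 1
        tally[1] += 1
    return {key: n_in_frame / total
            for key, (n_in_frame, total) in counts.items()
            if total >= min_mutations}
```

Passing a mode of inheritance as `group` corresponds to the per‐inheritance analysis, and passing a disease identifier corresponds to the per‐disease analysis.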
Essential human genes were estimated from the orthologs of mouse (Goh et al, 2007), fly, worm and yeast essential genes. Fly essential genes were extracted from Flybase (Wilson et al, 2008b; phenotype class: ‘lethal’), yeast essential genes from SGD (Ball et al, 2000; phenotype: ‘inviable’), and worm essential genes from RNAiDB (Gunsalus et al, 2004; phenotypes: ‘lethal’, ‘embryonic lethal’, ‘larval lethal’ and ‘adult lethal’).
Profiling interaction defects of mutant proteins
Disease mutant clones were generated by PCR mutagenesis essentially as described previously (Suzuki et al, 2005). The forward and reverse internal primers used are listed in Supplementary Table S4. All sequence‐confirmed Entry clones of mutant alleles were transferred individually by Gateway recombinational cloning into both pDB‐dest and pAD‐dest‐CYH destination vectors, generating DB–ORF allele and AD–ORF allele fusions (Rual et al, 2005). To test against wild‐type interactors, the DB–ORF and AD–ORF clones for CBS, HGD, ACTG1, CDK4 and PRKAR1A mutant proteins were transformed into MATα MaV203 and MATa MaV103 yeast strains, respectively. Each interaction pair was tested for growth on SC‐His+3AT (synthetic medium without leucine, tryptophan and histidine, containing 20 mM 3‐amino‐1,2,4‐triazole) plates to confirm GAL1::HIS3 transcriptional activity, on yeast extract–peptone–dextrose (YPD) medium to determine GAL1::lacZ transcriptional activity using a β‐galactosidase filter assay, and on SC‐Ura plates (synthetic medium without leucine, tryptophan and uracil) to determine SPAL10::URA3 transcriptional activity. Y2H reporters were scored by comparison with a set of Y2H control strains that contain plasmids expressing pairs of proteins with a spectrum of interaction strengths (Supplementary Figure S9). Activation of at least two of the three reporter genes was taken as a positive interaction. Interaction pairs showing fewer than two positive reporters were scored as ‘−’. Interaction pairs showing the same number of positive reporters as the corresponding wild type were scored as ‘+’. Interactions that lost expression of one reporter but still showed expression of the other two reporters were scored as ‘R’.
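The scoring rules above can be written as a small decision function. The sketch below is illustrative only: the function names and the dictionary representation of reporter read‐outs are hypothetical, the ASCII character "-" stands in for ‘−’, and the handling of a mutant that activates more reporters than the wild type (a case the rules do not cover) is an assumption.

```python
REPORTERS = ("HIS3", "lacZ", "URA3")


def count_positive(readout):
    """Count the reporters marked as active in a {reporter: bool} mapping."""
    return sum(1 for name in REPORTERS if readout.get(name, False))


def score_interaction(wild_type, mutant):
    """Score a mutant interaction pair relative to its wild-type counterpart.

    Returns "+" (retained), "R" (one reporter lost, two still active)
    or "-" (fewer than two active reporters).
    """
    wt_count = count_positive(wild_type)
    mutant_count = count_positive(mutant)
    if wt_count < 2:
        # The rules only apply to pairs that are positive with the wild type.
        raise ValueError("wild-type pair is not a positive interaction")
    if mutant_count < 2:
        return "-"
    if mutant_count >= wt_count:
        # Assumption: a mutant with more active reporters than wild type is "+".
        return "+"
    return "R"


wild_type = {"HIS3": True, "lacZ": True, "URA3": True}
reduced = {"HIS3": True, "lacZ": True, "URA3": False}
print(score_interaction(wild_type, reduced))  # R
```

Under these rules, ‘R’ can arise only when the wild‐type pair activates all three reporters and the mutant pair activates exactly two.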
For immunoblotting, yeast cells with AD–ORF fusions were cultured overnight at 30°C in synthetic medium without tryptophan and then grown in YPD medium to mid‐exponential phase. Cells were collected and treated with 150 mM of NaOH on ice for 15 min and then lysed in 0.8% SDS buffer (0.024 M Tris–HCl (pH 6.8), 10% glycerol, 0.04% bromophenol blue and 0.4% 2‐mercaptoethanol) for 5 min at 95°C. Whole cell lysates were cleared by centrifugation at 14 000 g. Resulting supernatants were separated on NuPAGE acrylamide gels (Invitrogen) and electrophoretically transferred onto a PVDF membrane (Invitrogen). AD fusion proteins were detected by standard immunoblotting techniques using anti‐GAL4 (Activation domain) antibody produced in rabbit (Sigma) as the primary antibody.
For comparison with experimental data, the following structures were used: 1JBQ for CBS (Meier et al, 2001), 1EYB and 1EY2 for HGD (Titus et al, 2000), 2BTF (Schutt et al, 1993), 1HLU (Chik et al, 1996) and 2OAN (Lassing et al, 2007) for bovine β‐actin, 2W9F, 2W9Z, 2W96 and 2W99 (Day et al, 2009) for CDK4, and 1G3N (Jeffrey et al, 2000) for the CDK6–CDKN2C complex. Figures of tertiary structures were generated with PyMOL (http://www.pymol.org). The relative solvent‐accessible surface areas (%ASAs) were calculated with PSAIA (Mihel et al, 2008).
Protein structures were downloaded from the Protein Data Bank website (PDB, http://www.rcsb.org/pdb). Redundant structures were removed using the PISCES server (Wang and Dunbrack, 2005) with the following criteria: X‐ray structures only; no structure with Cα only; resolution ⩽3 Å; R‐factor ⩽0.3; sequence length between 40 and 10 000 amino acids; and a maximum of 90% sequence identity between similar PDB structures. This filtering yielded 249 non‐redundant protein structures corresponding to 236 genes in HGMD. To repair residual mismatches between the residue numbering in PDB files and in HGMD, PDB sequences were aligned against their corresponding cDNA sequences in HGMD using CLUSTALW (Chenna et al, 2003). The relative accessibility of over 91 000 residues in all 249 structures was calculated using PSAIA (Mihel et al, 2008). For multimers, accessibility was computed for each monomer considered independently, and the multiple values obtained for the same residue were averaged. Among the 3664 residues affected by missense mutations, 1590 and 1045 were associated with autosomal recessive and autosomal dominant diseases, respectively.
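The averaging step for multimers can be illustrated as follows. This is a minimal sketch with a hypothetical input layout (one record per chain and residue); it does not reproduce the PSAIA output format, and the function name is an assumption.

```python
from collections import defaultdict


def average_accessibility(per_chain_values):
    """Average relative accessibility over the monomers of one structure.

    per_chain_values: iterable of (chain_id, residue_number, relative_asa),
    where each chain was evaluated independently of the others.
    Returns a {residue_number: mean relative_asa} dictionary.
    """
    grouped = defaultdict(list)
    for _chain_id, residue_number, relative_asa in per_chain_values:
        grouped[residue_number].append(relative_asa)
    return {residue: sum(values) / len(values)
            for residue, values in grouped.items()}


# Hypothetical values for a homodimer with chains A and B.
records = [("A", 10, 20.0), ("B", 10, 40.0), ("A", 11, 5.0)]
print(average_accessibility(records))  # {10: 30.0, 11: 5.0}
```

Residue 11 is resolved in one chain only in this example, so its single value is used unchanged.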
Pfam domain assignment
Pfam domains (Pfam‐A family only) were computed for cDNA sequences provided by HGMD, using InterProScan version 4.3 (http://www.ebi.ac.uk/Tools/InterProScan/). Missense, nonsense, in‐frame and out‐of‐frame small insertions, deletions, and indels were then mapped onto the cDNA sequences and Pfam domains, generating a dataset containing 1348 genes with at least one Pfam‐A domain and 34 964 associated mutations. Among them, a total of 10 904 ‘truncating’ mutations were used for the analysis shown in Figure 4D, including 6212 associated with autosomal dominant diseases and 4692 associated with autosomal recessive diseases. For each mutation type, statistics were computed by comparing the number of mutations that fell inside or outside any Pfam‐A domain of the respective protein with the fraction of the protein sequence covered by Pfam‐A domains.
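The mapping of mutations onto Pfam‐A domains can be sketched as below. The example assumes 1‐based, inclusive domain coordinates and uses hypothetical function names and values; it only illustrates the two quantities compared in the text (mutations inside versus outside domains, and the fraction of the sequence covered by domains).

```python
def domain_coverage(domains, protein_length):
    """Fraction of the protein covered by at least one domain.

    domains: list of (start, end) pairs, assumed 1-based and inclusive.
    Overlapping domains are counted once.
    """
    covered = set()
    for start, end in domains:
        covered.update(range(start, end + 1))
    return len(covered) / protein_length


def count_inside_outside(positions, domains):
    """Return (inside, outside) counts for mutated residue positions."""
    inside = sum(
        1 for position in positions
        if any(start <= position <= end for start, end in domains)
    )
    return inside, len(positions) - inside


# Hypothetical protein of 100 residues with two overlapping domains.
domains = [(10, 20), (15, 30)]
print(domain_coverage(domains, 100))                    # 0.21
print(count_inside_outside([5, 12, 25, 40], domains))   # (2, 2)
```

In this made‐up example, half of the mutations fall within domains that cover 21% of the sequence, which would correspond to an enrichment within domains.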
Transcription factors and structural proteins
Information on genes encoding transcription factors was obtained from Gene Ontology (Harris et al, 2004) annotations (948 genes with the GO term of ‘transcription factor activity’) and predictions in the transcription factor database (DNA Binding Domain, DBD; Wilson et al, 2008a; 1467 genes). A total of 1697 human transcription factor genes were retrieved. Among them, 82 genes associated with autosomal dominant diseases that have at least one mutation in HGMD were used for Pfam analysis (Figure 4D), and 56 genes with five mutations or more were used for analysis of ‘in‐frame’ mutations (Figure 2D). Structural protein coding genes were retrieved from Gene Ontology annotations of ‘cytoskeleton’ (992 genes). Among them, 72 genes with at least one mutation in HGMD were used for Pfam analysis (Figure 4D), and 47 genes with five mutations or more were used for analysis of ‘in‐frame’ mutations (Figure 2D). DBD and Gene Ontology data were downloaded in March 2008.
Error bars represent the s.e.m. values. The significance of the observed difference in the distributions of ‘in‐frame’ versus ‘truncating’ mutations in autosomal dominant and autosomal recessive disease, of the greater proportion of ‘in‐frame’ mutations in structural proteins than in transcription factors, and of the greater accessibility of residues mutated in autosomal dominant versus autosomal recessive diseases was evaluated using the non‐parametric Mann–Whitney U test. Enrichments of disease alleles in Pfam domains were determined using the odds ratio, and their significance was assessed using Fisher's exact test. A fold enrichment higher than one means that Pfam domains contain more mutations than expected at random, whereas an enrichment between zero and one means a depletion in mutations. The differences between proportions of ‘in‐frame’ mutations in each pair of diseases associated with the same gene were assessed by Fisher's exact test. All statistics were computed using R (http://www.r‐project.org/).
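For readers who want to see the enrichment statistics spelled out, the following is a minimal sketch of an odds ratio and a two‐sided Fisher's exact test on a 2×2 table, written in Python rather than R. The layout of the table (for example, mutated versus non‐mutated positions inside versus outside Pfam domains) and the function names are assumptions; the study itself used R.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio of the 2x2 table [[a, b], [c, d]]."""
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        p for p in (probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-7)
    )


# Hypothetical counts, for illustration only.
print(odds_ratio(3, 1, 1, 3))  # 9.0
```

An odds ratio above one corresponds to enrichment and a value between zero and one to depletion, matching the interpretation of fold enrichment given above.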
We thank all members of the Vidal Lab and the Center for Cancer Systems Biology (CCSB), Dr Patricia K Donahoe and Dr Roseann Mulloy for helpful suggestions; Ines M Pinto for help with experiments. This study was supported by the Ellison Foundation and the WM Keck Foundation (MV), NIH grants R01‐HG001715 from NHGRI (MV and F Roth), U01‐CA105423 (PI, S Orkin, project leader, MV), U54‐CA112952 (PI, J Nevins, subcontract, MV) and R33‐CA132073 (MV) from NCI, and by Institute Sponsored Research funds from the Dana‐Farber Cancer Institute Strategic Initiative awarded to CCSB. KV was supported by an NIH NRSA training grant fellowship (T32‐CA09361). BC was supported by the Belgian Program on Interuniversity Attraction Poles initiated by the Federal Office for Scientific, Technical and Cultural Affairs (IAP P6/19 PROFUSA). MV and RB are ‘Honorary Research Associate’ and ‘Research Director’ from the Fonds de la Recherche Scientifique (FRS‐FNRS, French Community of Belgium), respectively.
Conflict of Interest
The authors declare that they have no conflict of interest.
Supplementary Text, Supplementary Figures S1–S9, Supplementary Tables S1–S4
This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.
Copyright © 2009 EMBO and Nature Publishing Group