The impact of disease‐causing defects is often not limited to the products of a mutated gene but, thanks to interactions between the molecular components, may also affect other cellular functions, resulting in potential comorbidity effects. By combining information on cellular interactions, disease–gene associations, and population‐level disease patterns extracted from Medicare data, we find statistically significant correlations between the underlying structure of cellular networks and disease comorbidity patterns in the human population. Our results indicate that such a combination of population‐level data and cellular network information could help build novel hypotheses about disease mechanisms.
Most cellular functions are carried out by a complex network of genes, proteins, and metabolites that interact through biochemical and physical interactions (Gerstein et al, 2002; Barabási and Oltvai, 2004; Albert, 2005; Basso et al, 2005; Almaas, 2007; Alon, 2007; Yildirim et al, 2007). Therefore, disease‐causing defects may initiate cascades of failures that trigger the co‐emergence of multiple diseases in a patient, such as diabetes and obesity. Yet, given the environmental, lifestyle, or treatment‐related factors that all contribute to comorbidity, it is not obvious whether these cellular network‐based interdependencies manifest themselves at the individual or at the population level. Discovering such systematic correlations between cellular networks and disease patterns could potentially open new avenues for understanding the human interactome, and may help uncover hitherto unknown disease mechanisms (Ergun et al, 2007; Loscalzo et al, 2007; Braun et al, 2008).
The possibility that there may be systematic links between hereditary diseases, thanks to their common genetic origins, was postulated recently by Goh et al, who created a Human Disease Network (HDN) by connecting all hereditary diseases that share a disease‐causing gene according to the Online Mendelian Inheritance in Man (OMIM) database (Goh et al, 2007; Feldman et al, 2008). Although some of the diseases connected in the HDN captured well‐known comorbidity patterns, the functional relevance of the links in the network remains to be demonstrated, leaving open the question of whether most diseases connected in the HDN exhibit significant comorbidity. Interestingly, the most disconnected disease class in the HDN is that of metabolic diseases. However, Lee et al recently showed that metabolic diseases can also be organized in a metabolic disease network if the enzymes and their associated diseases are linked through metabolic pathways (Lee et al, 2008). Most importantly, the study found that metabolic diseases connected through shared pathways tend to show significant comorbidity, suggesting that information encoded in the structure of the metabolic network is amplified, becoming discernible at the population level as comorbidity patterns.
Metabolic networks represent only one of the several networks functionally relevant to our understanding of cellular activity. Indeed, when it comes to cellular interactions of potential importance to human diseases, we need to consider protein–protein interaction (PPI) and coexpression networks as well as the links between diseases generated by shared genes. Therefore, earlier research raises an important question: are the cellular‐level relationships encoded by PPIs, coexpression, and shared genes amplified at the population level? That is, should we expect statistically significant comorbidity patterns for disease pairs that share a gene, whose proteins interact, or whose genes show high coexpression patterns? To answer these questions, we analyzed the large‐scale comorbidity pattern extracted from the US Medicare claims database and the gene–disease association network from OMIM (McKusick, 1998). We find that cellular interaction links indeed manifest themselves at the population level, resulting in statistically significant comorbidity patterns. We quantify the relative magnitude of these correlations and discuss the current difficulties in mapping population‐ and cellular‐level data into each other, as well as the benefits of such an approach toward elucidating disease mechanisms.
Results and discussion
The starting point of our study is the Medicare claims database containing the diagnoses that led to the hospitalization of N=13 039 018 elderly patients, each disease or condition identified by an ICD‐9‐CM code. We denote the incidence of disease i with Ii, and the number of patients who were simultaneously diagnosed with diseases i and j with Cij. The comorbid tendency between the two diseases can be quantified using either the relative risk, RR=Cij/Cij*, where Cij*=IiIj/N is the expectation value of Cij when the two diseases are independent, or the ϕ‐correlation, defined as $\phi = (N C_{ij} - I_i I_j)/\sqrt{I_i I_j (N - I_i)(N - I_j)}$. When two diseases co‐occur more frequently than expected by chance, we have RR>1 and ϕ>0. Note, however, that although RR and ϕ are not independent of each other, each carries unique biases that are complementary. Therefore, we use both measures of comorbidity to ensure the robustness of our findings (see Supplementary information (SI) for further details). The disease–gene associations used in the study were obtained from the OMIM database, which contains >4900 such associations as of October 2008. Although the disease–gene record is far from complete, OMIM is currently the most complete repository of all known disease genes and their associated disorders.
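The two comorbidity measures can be illustrated with a minimal Python sketch. It assumes the standard ϕ‐correlation for two binary variables, ϕ=(N·Cij − Ii·Ij)/√(Ii·Ij·(N − Ii)·(N − Ij)); the function names and the counts below are hypothetical and are not taken from the Medicare data.

```python
import math

def relative_risk(c_ij, i_i, i_j, n):
    """RR = C_ij / C_ij*, where C_ij* = I_i * I_j / N is the
    co-occurrence expected if the two diseases were independent."""
    expected = i_i * i_j / n
    return c_ij / expected

def phi_correlation(c_ij, i_i, i_j, n):
    """phi = (N*C_ij - I_i*I_j) / sqrt(I_i * I_j * (N - I_i) * (N - I_j)),
    i.e. the Pearson correlation of the two binary diagnosis indicators."""
    denom = math.sqrt(i_i * i_j * (n - i_i) * (n - i_j))
    return (n * c_ij - i_i * i_j) / denom

# Hypothetical counts, for illustration only.
N = 1_000_000                      # total number of patients
I_i, I_j, C_ij = 20_000, 10_000, 500

rr = relative_risk(C_ij, I_i, I_j, N)     # expected co-occurrence is 200
phi = phi_correlation(C_ij, I_i, I_j, N)
```

With these toy counts the pair co‐occurs 2.5 times more often than expected, so RR>1 and ϕ>0; setting Cij to the expected value of 200 gives RR=1 and ϕ=0.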
It is important to note that disease names used in the Medicare database by the medical and the insurance communities (the ICD‐9‐CM scheme) and those used in the OMIM database by geneticists are not identical. Therefore, we enlisted a professional ICD‐9‐CM coder to manually map the OMIM disease names into ICD‐9‐CM codes and established connections between the genetic associations and the comorbidity measures (see Box 1 and SI sections S1 and S2 for more detail). On account of the discrepancies in disease names and the complex, hierarchical nature of the ICD‐9‐CM scheme, we recognize that the mapping is not perfect, and may contain debatable and occasionally erroneous ICD‐9‐CM‐to‐OMIM correspondence. Therefore, we are providing the mapping used by us in the SI, offering a chance for the community to improve on it in future studies.
Box 1 Didactic Box
Schematic description of the procedure used to connect comorbidity (calculated in the Medicare Layer, top) and genetic associations (given in the OMIM Layer, bottom) between a pair of diseases. Breast Cancer and Bone and Cartilage Cancer are treated as the example here, also presented in Figure 1B. In the Medicare Layer (top), each disease is represented by an ICD‐9‐CM code, a widely used hierarchical disease diagnosis code system. The incidence Ii of each disease (represented by a blue line) is found by counting patients in the Medicare database diagnosed with the corresponding ICD‐9‐CM code and its sub‐level codes (i.e. 174.1 is also counted as an incidence of 174 for breast cancer), while the co‐occurrence Cij (red line) of a disease pair is found by counting patients diagnosed with both codes. The comorbidity measures RR and ϕ can be calculated from these quantities and the total number of patients in the Medicare database (approximately 13 million). The associated genes of each disease are provided in the OMIM Layer (bottom, green lines). Because of differences in the disease‐labeling schemes in the Medicare (ICD‐9‐CM) and the OMIM databases (the codes are as given in Goh et al, 2007), we manually constructed a mapping between the two (grey lines). See Supplementary information for detail.
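The counting step described in Box 1 could be sketched as follows. This is an illustration under simplifying assumptions, not the procedure actually used: each patient is represented as a set of ICD‐9‐CM code strings in dotted form, a sub‐level code is recognized by prefix matching, and all names and patient records are hypothetical.

```python
def is_under(diagnosis, code):
    """True if `diagnosis` equals `code` or is one of its sub-level codes,
    e.g. '174.1' under '174', or '410.91' under '410.9'."""
    if diagnosis == code:
        return True
    # A three-digit category needs the dot so that '1741' is not matched by '174'.
    prefix = code if "." in code else code + "."
    return diagnosis.startswith(prefix)

def has_code(patient_codes, code):
    """True if any of the patient's diagnoses falls under `code`."""
    return any(is_under(d, code) for d in patient_codes)

def incidence(patients, code):
    """I_i: number of patients diagnosed with `code` or a sub-level code."""
    return sum(1 for codes in patients if has_code(codes, code))

def cooccurrence(patients, code_i, code_j):
    """C_ij: number of patients diagnosed with both codes."""
    return sum(1 for codes in patients
               if has_code(codes, code_i) and has_code(codes, code_j))

# Hypothetical patient records (sets of diagnosis codes), not real data.
patients = [
    {"174.1", "170.9"},
    {"174"},
    {"170.9", "250.00"},
    {"401.9"},
]

i_breast = incidence(patients, "174")           # counts '174.1' and '174'
i_bone = incidence(patients, "170.9")
c_both = cooccurrence(patients, "174", "170.9")
```

These counts, together with the total number of patients, are the only inputs needed to compute RR and ϕ for the pair.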
As OMIM comprises the set of hereditary or complex diseases with validated gene–disease associations, it is anticipated that only a subset of ICD‐9‐CM codes would correspond to the diseases in the OMIM. Indeed, we find that, of the >12 000 available ICD‐9‐CM codes, 763 unique ICD‐9‐CM codes can be mapped to OMIM diseases. The fact that our analysis is limited to 5% of possible diagnosis codes contained in the Medicare database could limit our population (patient) coverage. We find, however, that this is not the case: as Figure 1A shows, 90% of patients in the Medicare database are diagnosed with at least one disease whose ICD‐9‐CM code is contained in our mapping to the OMIM database.
We use the following three quantities to capture the cellular network‐level relationship between diseases i and j, as illustrated in Figure 1B for the case of breast cancer (ICD‐9‐CM 174) and cancer of bone and cartilage (ICD‐9‐CM 170.9, see also SI):
(i) nijg, the number of shared genes associated with both diseases i and j, which quantifies the potential common genetic origin of the two diseases (Goh et al, 2007);
(ii) nijp, the number of PPIs between the proteins associated with the two diseases;
(iii) ρij, the average Pearson correlation of coexpression between pairs of genes from each disease, capturing the degree to which the genes associated with the two diseases are coexpressed (Ge et al, 2005).
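The cellular‐level quantities listed above can be sketched in Python as follows. The data structures are assumptions made for illustration: disease genes are sets of gene symbols, PPIs are unordered gene pairs, and coexpression values are stored per unordered gene pair. How self‐pairs and missing coexpression values were treated is not stated here, so the sketch simply skips them; the gene names and values are hypothetical.

```python
from itertools import product
from statistics import mean

def cross_pairs(genes_i, genes_j):
    """Unordered pairs with one gene from each disease, excluding self-pairs."""
    return {frozenset((a, b)) for a, b in product(genes_i, genes_j) if a != b}

def shared_genes(genes_i, genes_j):
    """Number of genes associated with both diseases."""
    return len(genes_i & genes_j)

def ppi_links(genes_i, genes_j, ppi):
    """Number of PPIs linking a gene product of one disease to one of the other.
    `ppi` is a set of frozenset({a, b}) interactions."""
    return len(cross_pairs(genes_i, genes_j) & ppi)

def mean_coexpression(genes_i, genes_j, coexpr):
    """Average coexpression over cross-disease gene pairs that have a value.
    `coexpr` maps frozenset({a, b}) to a Pearson correlation."""
    values = [coexpr[p] for p in cross_pairs(genes_i, genes_j) if p in coexpr]
    return mean(values) if values else None

# Hypothetical gene sets and interaction data, not taken from OMIM.
genes_i = {"G1", "G2"}
genes_j = {"G2", "G3"}
ppi = {frozenset(("G1", "G3"))}
coexpr = {
    frozenset(("G1", "G2")): 0.2,
    frozenset(("G1", "G3")): 0.6,
    frozenset(("G2", "G3")): 0.4,
}

n_g = shared_genes(genes_i, genes_j)
n_p = ppi_links(genes_i, genes_j, ppi)
rho = mean_coexpression(genes_i, genes_j, coexpr)
```

Using unordered pairs avoids counting the same gene pair twice when a gene is associated with both diseases.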
The main question can be formulated as follows: does the existence of these cellular‐level links (i.e., nijg>0, nijp>0, ρij>0) between the two diseases increase the likelihood that individuals simultaneously develop both conditions? We start our investigation by measuring the Pearson correlation between the cellular variables (nijg, nijp, ρij) and comorbidities (RR and ϕ) for 83 924 disease pairs. Of these, 2239 pairs are linked through either shared genes (nijg⩾1) or PPIs (nijp⩾1; 658 with shared genes, and 1873 with PPIs). In Figure 2A and Table I we present the Pearson correlation coefficients (PCCs) between the comorbidity measures and the genetic variables. Although nijg, in general, has the highest correlation with comorbidity, we do observe positive PCC with all three variables.
There are numerous factors that determine whether two diseases co‐occur in a patient, some of which are environmental, lifestyle‐related or treatment‐induced. Our study captures only the role of the cellular network on comorbidity. The small magnitude of the correlations observed by us suggests that the cellular network offers only a small contribution to the observed comorbidity. Note, however, that placing significant emphasis on the magnitude of these correlations is premature, as two known effects limit the correlations observed by us. First, the magnitude of the correlation is limited by the predictive power of specific genetic mutations catalogued in the OMIM database, and the likelihood of a patient developing a particular disease. Indeed, it is known that genetic mutations result in an increase of at most a few percentage points in the likelihood of an individual developing a specific complex disease (Loscalzo, 2007) and the correlations observed by us cannot exceed the known disease–gene correlations. Second, the correlations are further limited by the noise in the mapping between the OMIM diseases and the ICD‐9‐CM codes. As we noted, there is inherent ambiguity both in the mapping and in the process of assigning a particular diagnosis to specific ICD‐9‐CM codes in hospitals. Each instance of such misdiagnosis or mapping ambiguity decreases the magnitude of the observed correlations. Therefore, at this point it is not the magnitude, but the statistical significance of the correlations that we can rely on. As summarized in Table I, the observed correlations are statistically significant.
To quantify the degree of comorbidity caused by the observed correlations, we measured the average comorbidities 〈RR〉 and 〈ϕ〉 for disease pairs that are connected at the cellular network level. Compared with the entire set of 83 924 pairs of hereditary diseases considered in our study, we find (see Figure 2B; Table II) a two‐ to four‐fold increase in the average comorbidity in disease pairs that share genes (nijg⩾1), indicating that if a patient develops a particular disease associated with a gene or multiple genes in the HDN, then they have a two‐fold higher chance of developing another disease mapped to one or more common genes in the HDN, compared with diseases that are not. An increased comorbidity is also observed for disease pairs linked through PPIs (nijp⩾1) and high coexpression (ρij⩾0.5).
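The comparison of average comorbidity between linked pairs and the full set of pairs can be sketched as below. The records are hypothetical, and reporting the standard error of the mean as the uncertainty is an assumption of this sketch; the type of error used for the quoted averages is not specified here.

```python
import math
from statistics import mean, stdev

def mean_with_sem(values):
    """Return (mean, standard error of the mean) for a list of numbers."""
    m = mean(values)
    sem = stdev(values) / math.sqrt(len(values)) if len(values) > 1 else float("nan")
    return m, sem

# Hypothetical (RR, n_g) records for four disease pairs; not real data.
pairs = [(1.0, 0), (2.0, 0), (3.0, 1), (5.0, 2)]

all_rr = [rr for rr, _ in pairs]                   # baseline: every pair
linked_rr = [rr for rr, n_g in pairs if n_g >= 1]  # pairs sharing at least one gene

baseline_mean, baseline_sem = mean_with_sem(all_rr)
linked_mean, linked_sem = mean_with_sem(linked_rr)
fold_increase = linked_mean / baseline_mean
```

The same grouping applies to the other link types by replacing the filter with a threshold on the number of PPIs or on the average coexpression.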
The observed correlations between the cellular links and comorbidities raise a related question: would disease pairs that are more interconnected than others (i.e., have larger nijg, nijp, or ρij) show higher comorbidity? To address this, in Figure 2C we show that comorbidity increases rapidly with the number of shared genes: sharing two or more genes (nijg⩾2) results in nearly a five‐fold increase in comorbidity compared with hereditary disease pairs that do not share genes. An increase in comorbidity is observed with increasing nijp and ρij as well (Figure 2D and E), although the effect is weaker than that observed for nijg, which is not unexpected given the smaller impact that nijp and ρij have on comorbidity in comparison with nijg.
Note that the average comorbidity measured between all pairs of diseases is >1 (Figure 2B), indicating that many patients develop multiple disorders, whether or not the specific diseases are linked at the cellular level. Such correlations have been observed in other studies focused on comorbidity patterns (Rzhetsky et al, 2007; Hidalgo et al, 2009). These overall comorbidity patterns are not particularly surprising considering that the Medicare population is 65 years of age or older, the age at which individuals do develop multiple disorders. Thus, the overall comorbidity represents the baseline against which we can assess the impact of the genetic and cellular networks. It is reassuring, therefore, that hereditary diseases that are linked in the HDN (and thus at the cellular level) show comorbidity higher than the baseline 〈RR〉=1.92±0.01 and 〈ϕ〉=(1.84±0.02) × 10⁻³ observed for the set of all disease pairs.
Despite the significant increase in 〈RR〉 and 〈ϕ〉, there are many disease pairs that share genes yet fail to show significant comorbidity. We hypothesize that this is, in part, because of pleiotropy, which in this context means that different mutations on the same gene can have different pathological effects on a protein (Dudley et al, 2005), thereby predisposing an individual to different disorders. In general, we expect that disease pairs associated with mutations on the same functional domain of the shared protein show higher comorbidity than disease pairs whose mutations occur in different functional domains. To test this hypothesis, we identified the functional domains of disease‐causing mutations on shared genes using the Pfam database (Finn et al, 2006). In agreement with our hypothesis, we find higher 〈RR〉 and 〈ϕ〉 for disease pairs whose mutations are on the same domain of the shared gene, compared with disease pairs whose mutations are in distinct functional domains (Figure 2B).
The observed correlations suggest that a combination of disease data and cellular network information may assist us in identifying new comorbidity patterns alongside their potential genetic origin. Indeed, upon inspection of the 2239 disease pairs that are genetically linked (i.e., nijg⩾1 or nijp⩾1), we find several disease pairs whose comorbidity patterns are already well known to the medical community, such as diabetes and obesity (Evans et al, 2002), or breast cancer and osteosarcoma (Knowling and Basco, 1986). At the same time, due to the aforementioned mismatch between disease names used by clinicians (within the ICD‐9 coding scheme) and by geneticists (within the OMIM tabulation), several highly comorbid disease pairs are readily anticipated (such as diabetes and hypoglycemia, as hypoglycemia is a common side effect of the treatment of diabetes) or are cases in which one disease is a broader version of the other (such as mononeuritis and hereditary peripheral neuropathy). Such mapping limitations notwithstanding, we find several interesting disease pairs that are linked at the cellular level and also show significant comorbidity. For example, consider Alzheimer's disease (ICD‐9‐CM 331) and myocardial infarction (ICD‐9‐CM 410.9), for which earlier comorbidity studies were either inconclusive or contradictory (Bursi et al, 2006). As Figure 3A shows, we not only find statistically significant comorbidity (P≈10⁻⁵) between the two, but the figure suggests that the shared ACE and APOE genes may contribute to the observed effect. Similarly, we observe significant comorbidity (P≈10⁻¹⁴⁸) between autonomic nervous system disorder (ICD‐9‐CM 337.9) and carpal tunnel syndrome (ICD‐9‐CM 354, Figure 3B). A known mechanism is L‐chain amyloidosis, which may affect the autonomic nervous system and causes carpal tunnel syndrome when the amyloid infiltrates the flexor retinaculum of the patient's wrist (Haan and Peters, 1994).
Figure 3B, however, suggests that a PPI between the associated genes of each disorder may also play a role in the observed effect. Although there may be additional possible physiological or social explanations for some of the observed comorbidities (see SI), the method described above has the potential to offer new, testable hypotheses about the biological basis of disease interrelationships. These examples were selected only to demonstrate the potential of the combined investigation of the network and population‐level data in identifying potentially interesting disease pairs worthy of further study. A more detailed description of these disease pairs, along with the complete list of the 2239 genetically linked disease pairs and their genetic associations, is provided in the SI.
The main finding of this paper is that health care and treatment data on a large number of individuals offer information useful to systems biology that can complement the information from the well‐established genomic studies. Indeed, Medicare and insurance databases already collect the health care history of millions of individuals, allowing us to uncover the correlations in the occurrence of various diseases. In parallel, increasing knowledge about the molecular origin of disease indicates that many disorders are rooted in defects in gene products that are part of the same cellular network, raising the possibility that these diseases should co‐occur in the same individual. Admittedly, much of the currently available network data are incomplete and probably noisy. We may be approaching a tipping point, however, where we have acquired sufficient knowledge of human cellular networks to begin understanding the way a disturbance in the networks may contribute to the development of a disease and suggest potential disease‐modifying factors.
To test the validity of this hypothesis, here we correlated cellular level information for human cells, namely data on shared genes, PPIs, and coexpression patterns, with comorbidity data obtained from the Medicare database. Despite the aforementioned limitations of the mapping and the data collection process, we found statistically significant correlations between cellular interactions and comorbidity patterns. We also found that disease pairs with higher correlations tend to be linked more strongly in the cellular network.
Although our work was mainly driven by the desire to uncover evidence that cellular information is amplified in the human population and thus can be detected from patient data, our results point to the potential usefulness of our approach in uncovering disease mechanisms. Indeed, we discuss two disease pairs in which the network‐based information offers a plausible mechanism for statistically significant comorbidity patterns. These results suggest that Medicare and other insurance databases could play an increasing role in future studies of the systems biology of human cells and diseases.
Materials and methods
We used the HDN from Goh et al for disorder–gene associations (Goh et al, 2007), updated based on the version of the Morbid Map from OMIM at the time of the study. The most up‐to‐date version can be found at http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim. The PPI data were taken from Rual et al (2005) and Stelzl et al (2005). The genetic coexpression levels were calculated on the basis of Affymetrix microarray data (Ge et al, 2005) (see www.affymetrix.com). Protein domain information is available on UniProt (http://www.uniprot.org/) and Pfam (http://www.sanger.ac.uk/Software/Pfam/).
Estimating P‐values and errors
The P‐values for the PCCs shown in Figure 2A and Table I were calculated by a Monte Carlo sampling method. In each trial, we generated a randomized sequence of the genetic variables and calculated its PCC with comorbidity. After 2 million randomizations, the P‐value was taken as the fraction of trials that resulted in a PCC larger than the observed one.
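The randomization procedure can be sketched as a simple permutation test. The code below is an illustrative reimplementation under stated assumptions, not the code used in the study: the function names and the toy data are hypothetical, and a small number of trials is used in place of the 2 million randomizations.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(genetic, comorbidity, trials=2000, seed=0):
    """Fraction of shuffles of `genetic` whose PCC with `comorbidity`
    is larger than the observed PCC (one-sided)."""
    rng = random.Random(seed)       # fixed seed so the sketch is reproducible
    observed = pearson(genetic, comorbidity)
    shuffled = list(genetic)
    exceed = 0
    for _ in range(trials):
        rng.shuffle(shuffled)       # randomize the genetic variable across pairs
        if pearson(shuffled, comorbidity) > observed:
            exceed += 1
    return observed, exceed / trials

# Hypothetical values for eight disease pairs (shared-gene counts and RR).
n_shared = [0, 0, 0, 1, 1, 2, 0, 3]
rr_values = [1.0, 1.1, 0.9, 2.0, 2.2, 3.1, 1.0, 4.0]

observed_pcc, p_value = permutation_p_value(n_shared, rr_values)
```

With a finite number of trials the smallest nonzero P‐value that can be resolved is 1/trials, which is one reason a large number of randomizations is needed for small P‐values.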
As RR and ϕ are monotonically increasing functions of Cij, their one‐sided P‐value is equal to the sum of probabilities that the co‐occurrence is larger than the actual value Cij. It can be obtained using standard computational software such as Mathematica (www.wolfram.com) by approximating the binomial distribution generated from the number of patients N and Cij*=Np=IiIj/N as a Poisson distribution, and therefore $P = \sum_{k > C_{ij}} e^{-C_{ij}^*} (C_{ij}^*)^k / k!$.
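A Poisson tail probability of this kind can be evaluated with a short Python sketch. The function name and the `strict` flag are hypothetical, and whether the observed value itself is included in the sum is an assumption: the text above speaks of co‐occurrence "larger than" the actual value, so the default is the strict tail. The sketch is intended for observed counts at or above the expected value; far below the mean the first term can underflow.

```python
import math

def poisson_upper_tail(c_obs, expected, strict=True, tol=1e-18):
    """P(K > c_obs) if strict, else P(K >= c_obs), for K ~ Poisson(expected).

    The first term is computed in log space to avoid overflow in k!,
    and later terms follow from the recurrence p(k+1) = p(k) * mean / (k+1).
    """
    k = c_obs + 1 if strict else c_obs
    log_term = -expected + k * math.log(expected) - math.lgamma(k + 1)
    term = math.exp(log_term)
    total = 0.0
    while term > 0.0:
        total += term
        k += 1
        term *= expected / k
        # Past the mean the terms decay; stop once they no longer matter.
        if k > expected and term < tol * total:
            break
    return total

# Hypothetical counts: expected co-occurrence C_ij* = I_i * I_j / N.
N = 1_000_000
I_i, I_j, C_ij = 20_000, 10_000, 260
expected = I_i * I_j / N            # 200 co-occurrences expected by chance

p_value = poisson_upper_tail(C_ij, expected)
```

Summing the terms directly, rather than subtracting a cumulative probability from 1, keeps very small P‐values from being lost to rounding.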
Acknowledgements
We thank Quan Zhong, Cesar Hidalgo, Nick Blumm, and Marc Vidal for useful discussions. This research was supported by JSMF 220020084, NSF ITR DMR‐0426737, NIH CEGS‐1P50HG4233/CFDA #93.172, NIH U01 A1070499‐01/111620‐2, and NIH U56 CA113004/sub MGH.
Conflict of Interest
The authors declare that they have no conflict of interest.
Supplementary Information and Supplementary Figure S1‐S2
Supplementary Table 1
Mapping between OMIM and ICD‐9‐CM
Supplementary Table 2
Comorbidity between genetically connected diseases
Supplementary Table 3
- Copyright © 2009 EMBO and Nature Publishing Group