Open Access

A simple and efficient algorithm for genome‐wide homozygosity analysis in disease

Wei Liu, Jinhui Ding, Jesse Raphael Gibbs, Sue Jane Wang, John Hardy, Andrew Singleton

Author Affiliations

  1. Wei Liu*,1,2,
  2. Jinhui Ding1,
  3. Jesse Raphael Gibbs1,3,
  4. Sue Jane Wang2,
  5. John Hardy3 and
  6. Andrew Singleton1
  1. 1 Laboratory of Neurogenetics, NIA, Porter Neuroscience Building, NIH Main Campus, Bethesda, MD, USA
  2. 2 Office of Biostatistics, OTS, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD, USA
  3. 3 Department of Molecular Neuroscience and Reta Lila Weston Laboratories, Institute of Neurology, University College London, Queen Square London, UK
  1. *Corresponding author. DB2, Office of Biostatistics, WO 21, Mail Stop 3562, Silver Spring, MD 20993, USA. Tel.: +1 301 796 2427; Fax: +1 301 796 9735; E-mail: Wei.Liu{at}
View Abstract


Here we propose a simple statistical algorithm for rapidly scoring loci associated with disease or traits due to recessive mutations or deletions using genome‐wide single nucleotide polymorphism genotyping case–control data in unrelated individuals. This algorithm identifies loci by defining homozygous segments of the genome present at significantly different frequencies between cases and controls. We found that false positive loci could be effectively removed from the output of this procedure by applying different physical size thresholds for the homozygous segments. This procedure is then conducted iteratively using random sub‐datasets until the number of selected loci converges. We demonstrate this method in a publicly available data set for Alzheimer's disease and identify 26 candidate risk loci in the 22 autosomes. In this data set, these loci can explain 75% of the genetic risk variability of the disease.


Advances in whole‐genome single nucleotide polymorphism (SNP) assay technology have provided a powerful array of tools for simultaneously scoring common genetic variation. However, it is often difficult to identify loci associated with disease because of the large number of tests carried out and the associated conservative multiplicity adjustment, such as Bonferroni method. We are interested in identifying such loci associated with a disease likely due to recessive mutation or gene deletions.

High density SNP analysis readily reveals the presence of large homozygous segments in unrelated subjects (Hinds et al, 2005; Simon‐Sanches et al, 2007; Wang et al, 2007). The probability of a randomly selected SNP locus being homozygous (‘AA’ or ‘BB’) based on data from HapMap is about 0.65 (Hinds et al, 2005; Rabbee and Speed, 2006) and this may lend itself to autozygosity mapping in ostensibly outbred populations; however, traditional autozygosity mapping methods (Lander and Botstein, 1987; Mueller and Bishop, 1993; Gschwend et al, 1996) based on consanguineous relationships are not appropriate for unrelated individuals. To identify loci with possible recessive effects of relatively high penetrance in outbred populations, large sample sizes are needed for genotyping. Some recent studies on homozygosity analysis of SNP assays have been attempted using different approaches (Woods et al, 2004; Lencz et al, 2007; Miyazawa et al, 2007). However, they either have some familial relationship requirements (Woods et al, 2004; Miyazawa et al, 2007) or a high false positive rate (Lencz et al, 2007).

In the context of SNP genotyping, it is often not easy to distinguish heterozygous genomic deletion from homozygosity; thus a segment with all loci genotyped being ‘AA’ or ‘BB’ in a pedigree genotype file could be either a region of genuine homozygosity or effective hemizygosity caused by genomic deletion. We call such a region ‘apparently homozygous region’ (AH). By carrying out an appropriate association analysis on AHs, one can detect not only the possible recessively mutated loci from some common ancestor but also deletions (Hunter, 2005; Klein et al, 2005; Van Eyken et al, 2007).

In this paper, we propose a simple statistical algorithm for genome‐wide AH analysis (GAHA) of case–control data in unrelated subjects. It can robustly identify loci that are associated with disease by efficiently removing false positive loci. We demonstrate this method in a publicly available data set for Alzheimer's disease (AD) (Coon et al, 2007), consisting 502 627 SNP loci genotyped in unrelated 859 cases and 552 neurologically normal controls. A total of 26 loci from the 22 autosomes are identified and they explain 75% of the genetic risk variability of the disease.

Results and discussion

AH size threshold

In the context of the current data, it is not appropriate to use the number of loci as a measure of AH size as previously reported (Lencz et al, 2007) because of its dependence on SNP density. Here we use the number of nucleotide basepairs between the first and last loci of an AH as a measure of AH size.

Let C be a size threshold of AHs. We are interested in identifying loci proportions of which are significantly different between controls and cases in AHs with sizes ⩾C. As seen in Figure 1, for example, there are n1 cases and with a given C we count the proportion of the locus SNP‐1 on AHs p1=(number of AHs containing SNP‐1)/n1. Similarly, for n0 controls, we find the proportion p0 of the same locus. Using p1 and p0, we compute z‐statistic for proportional test as described in Materials and methods. The locus is selected for further screening if ∣z∣⩾z1−α/2, where α is the level of significance. The test statistic z follows a standard normal distribution asymptotically as n0 and n1 increase with each greater than 30.

Figure 1.

Scheme for computing the proportion of a locus on AHs. For a given chromosome of a subject, the symbols (•, ○) represent SNP loci. The shaded segments denote AHs with size greater than or equal to a pre‐selected threshold C. The proportion of a locus on AHs is computed as p= (the number of AHs containing this locus)/(the total number of individuals), for example p1=4/6 for SNP‐1.

We investigated the power for selecting loci based on α, AH percentage difference between cases and controls, and AH size threshold C through simulation. The relationships between z value and AH percentage difference with various C are shown in Supplementary Figure 1. At a significance level α=0.001, the powers to detect candidate loci were computed accordingly. We define that a candidate locus is detectable if the power>0.8. Our results showed that at a significance level α=0.001, we could detect a locus on AHs⩾C with a difference of 30% between cases and controls using C=10 kb, or only of 7% using C=1 Mb.

On the basis of above significance level α and a moderate C value, typically thousands of loci could be selected with a large false positive rate from data of unrelated subjects. A key step is to efficiently remove these falsely associated loci from the candidate list. If we knew the minimum size of risk loci, then we would set it as C and consider only AH⩾C, leading to a lower false positive rate. However, such a C value is unknown. One approach is to use multiple values of C as discussed below. In convention, define C=1 for considering AHs with size ⩾1.

Algorithm for screening risk loci

We propose to use multiple C values for screening risk loci. Suppose we choose C1 and C2, with C1<C2, for selecting candidate loci with ∣z∣⩾z1−α/2. It should be noted that the distance between C1 and C2 must be larger than the minimum distance between loci of the platform and may be chosen by referring to some public genotyping parameters (for example, the average distance between loci is ∼9 kb in Affymetrix 500K GeneChip, and a median distance is ∼3 kb in Illumina HumanHap550 BeadChip according to Gunderson et al, 2005; Steemers and Gunderson, 2007). Let S1 be the set containing the loci selected with C1 and S2 with C2, respectively. As the true AHs with size ⩾C2>C1 will remain using either C1 or C2, the loci, not in S1S2, should be more likely false positives and thus be removed. For example, in the AD data using a significance level α=0.001, among the 25 086 loci on chromosome 1, there were 18 loci selected using C=10 kb and 12 loci using C=30 kb, respectively, with only three being common loci in both sets. In general, we set C={Ci, i=1, 2,…, L} with C1<C2<…<CL to cover a wide range of AHs and let S be the set containing all loci common in adjacent sets S={S1S2, S2S3,…, SL1SL}. This loci‐selecting procedure is called ‘procedure of adjacent‐C‐selection’ (PACS).

The PACS can efficiently remove false positive loci, however, for a real data set in unrelated individuals with large genetic variation, the selected loci usually still contain some false positives, many of which could be removed through further ‘purification’. To achieve this, ideally we should repeat the above steps using an independent data set from the same population to get another candidate set. Then identify the common loci from both sets. This new candidate set contains fewer false positive loci, which could be further removed by repeating above steps iteratively until the number of candidate loci converges. Although it is generally not realistic to do so, we could do the ‘purification’ using random subsets from the full data set as described below.

Let nk*=[f × nk]>30 be the size of a random subset from the full data set of size nk, where k=1 for cases and k=0 for controls, and f be a constant with 0<f<fmax, fmax= (mink (nk) − 1)/mink (nk. The randomly and independently chosen n1* cases and n0* controls form a random case–control sub‐data set for further removing the false positive loci from the candidate set using the same set of C values as applied to the full data set.

Let S be the set containing the selected loci from the full data set and S* be that from the first random sub‐data set. Let S1*=S*S containing the common loci in both sets and N1=∣S1*∣ be the number of loci in S1*. Next we generate a new S* from the second random sub‐data set and let S2*=S1*S* with N2=∣S2*∣. Repeating these steps to update the candidate loci set until the number of Nt, t=1,2,………, converges to a constant integer Nc with Nc =0 if the null hypothesis of no difference between p1 and p0 is true and Nc>0 if the alternative hypothesis p1p0 is true. For a given f, there are

Embedded Image

possible ways for selecting case–control subset, which should be much larger than the number required for reaching convergence at an appropriate level of significance. The above GAHA algorithm is summarized in Box 1.

Box 1 Outline of the GAHA algorithm

(1)  For case–control SNP data with n1 cases and n0 controls, choose a level of significance α, set AH thresholds C={Ci, i=1, 2,…, L} with C1<C2<…<CL, and then find AHs with size Ci, i=1, 2,…, L, for each subject

(2)  Compute z at each locus and select it if ∣z∣⩾z1‐α/2. Perform the PACS and let Sold be the set of selected loci and Nold=∣Sold∣. Chose 0 < f < mink{nk}−1/mink{nk}, and ℓ=0

(3)  Randomly select a case–control sub‐dataset from (1) with n1* = [f × n1] >30 cases and n0* = [f × n0] >30 controls. Find AHs for each subject at given C, then compute z at each locus and select it if ∣z∣⩾ z1‐α/2

(4)  Carry out the PACS and let S* be the set containing all the loci selected from the sub‐dataset. Find Snew=SoldS* Nnew=∣Snew

(5)  Embedded Image

The false positive rate of a locus in the final set should be ⩽α. The false negative rates of loci selection in a random subset were estimated under the same settings for the full data set (Supplementary Table 2).

Application to AD data set

Set C={1, 10 kb, 30 kb, 50 kb, 100 kb, 140 kb, 250 kb, 500 kb, 1 Mb} and α=0.001. We identified 607 loci from 4054 loci whose ∣z∣⩾z1−α/2 (Figure 2A) from the 22 autosomes in the AD data set (Coon et al, 2007).

Figure 2.

The plot of z versus nucleotide basepair of chromosome 19 in the AD data set: (A) before and (B) after the procedure of adjacent‐C‐selection, (C) the most significant region—the peak locus is rs4420638, (D) the most significant region with two loci on APOE (↓).

The most significant AH region was on 19q13.2 (see Figure 2B) with positive z values suggesting significantly more AHs in controls than in cases. This region, covering the whole apolipoprotein E (APOE) gene, contains four loci including rs4420638 (Figure 2C), which is in linkage disequilibrium with APOE (Coon et al, 2007). However, there were no genotypes within APOE in the AD data. We added available genotyping information (Coon et al, 2007) of two loci on APOE, rs429358 and rs7412, to the AD data. The two APOE loci define the ε2/ε3/ε4 genotypes. Figure 2D shows the APOE loci indeed on the AH region where the majority controls have the ε3 genotype, supporting the observation that APOE ε3 is protective against the disease when compared with ε4 (Farrer et al, 1997).

To further reduce the false positive rate within this list, we chose f=0.9 for generating random subsets, each with 773 cases and 497 controls. The use of f=0.9 may not be the statistically optimal choice; it is, however, the best we tried. The convergence of the loci number is shown in Figure 3. There were 26 loci in the final list (Figure 3B) (Table I). Based on a logistic regression model fit, the percent variation of the genetic risk explained by these 26 loci was 75.3%. Model selection removed 10 confounder loci and retained 16 loci (each with P‐value<0.05), including rs4420638, in the reduced model with 74.8% of the genetic risk variation explained (Supplementary Table 3, 4).

Figure 3.

Convergence of the loci number. (A) At a level of significance α=0.001, a total of 607 loci (□) were selected from the 4054 loci for which ∣z∣⩾ z1−α/2 (Δ) by applying the procedure of adjacent‐C‐selection in the AD data set. Random case–control subsets were generated using f=0.9 and used in screening iteration (○). (B) The enlarged plot showing the convergence of selected loci to the number 26.

View this table:
Table 1. List of candidate loci associated with AD from the 22 autosome of the AD SNP genotype data (Coon et al, 2007)

The APOE ε4 was carried by ∼40% of the later‐onset AD cases (Poirier et al, 1993; Laws et al, 2003). Recall that rs4420638 is in linkage disequilibrium with APOE, we found that the percent genetic risk variation explained by this locus alone was 34.2%. However, when rs4420638 was excluded from the reduced model, the percentage genetic risk variation explained by the remaining 15 loci was decreased only by 2.9% (from 74.8% to 71.9%). This suggests these loci explain the genetic risk variation of AD as a group. Several of the 26 loci identified in this screening were also found in homozygous regions identified in an early onset AD study of a consanguineous family (Clarimón et al, 2008), suggesting that one of these regions harbors a recessive genetic lesion causing AD.

The 26 loci are on 20 genes of which 13 are in known functional pathways or networks as revealed from an Ingenuity Pathway Analysis (Ingenuity Systems, (Supplementary Pathway/Network analysis). On the basis of the correlations among the 20 genes and AD status of subjects, we construct an AD genetic network (Supplementary Figure 2).


We propose a statistical method for GAHA of SNP case–control data in unrelated subjects to identify risk loci that are most likely associated with a disease or abnormality due to recessive mutation or deletion. The main novelty of this method over other approaches is to minimize the false positive rate of the risk candidates. We remove the false positive loci by selecting the common loci with different size thresholds of homozygous segments and repeating these steps iteratively using random sub‐data sets until the number of selected loci converges. Furthermore, this method allows selects risk loci from a wider AH size range. By demonstrating of the method using a publicly available AD SNP assay data set, we identified 26 candidate risk loci from the 22 autosomes.

Materials and methods


Suppose there are n SNP loci genotyped on a given chromosome (an autosome). We view the sequences of SNP loci on a chromosome as linked regions either being heterozygous or AHs. Let H be a set such that H={h1, h2,…, hm} where hi denotes the number of AHs containing i consecutive SNP loci genotyped, and m is the maximum number of consecutive SNP loci. The probability of a randomly selected SNP locus on AHs with SNP number being equal to or larger than a predetermined integer k is Embedded Image.


A SNP genotype data set of late‐onset AD(500K Affymetrix) was downloaded from a publicly available website,, to demonstrate our method. This data set consists of 502 627 SNP loci genotyped in unrelated 859 cases and 552 neurologically normal controls.

Proportion test

We are interested in identifying loci at which the proportion of a SNP locus, on AHs with size equal to or larger than a given threshold C, is significantly different between controls and cases. Our null hypothesis is that the SNP at a given locus has the same probability of being on AHs with size ⩾C in the control and case groups. The test statistic in a standard proportion test is

Embedded Image

and follows a Gaussian distribution under the null hypothesis, where the p0 is the proportion of the locus on AHs for the n0 control subjects and the p1 is that for the n1 cases. We define z=0 when both p0=0 and p1=0. For a given level of significance α, a locus is selected if ∣z∣⩾z1−α/2. This test requires large sample size (n0, n1>30).

Logistic regression

In logistic regression using the selected loci as predictor variables, let xij=1 if the ith locus of the jth subject is on an AH with size being equal to or larger than C=10 kb and xij=0 otherwise. Logistic regression is carried out using SAS 9.0.


The views expressed in this article do not represent those of the US Food and Drug Administration.

Conflict of Interest

The authors declare that they have no conflict of interest.

Supplementary Information

Supplementary information

Supplementary figures S1–2, Supplementary tables S1–4, Pathway/Network Analysis [msb200953-sup-0001.doc]


This study was supported by the Intramural Program of the National Institute on Aging, National Institutes of Health and Department of Health and Human Services, project number AG000950‐07. This study used high‐performance computational capabilities of the Biowulf Systems at the National Institutes of Health, Bethesda, MD (


Creative Commons logo

This is an open‐access article distributed under the terms of the Creative Commons Attribution License, which permits distribution, and reproduction in any medium, provided the original author and source are credited. This license does not permit commercial exploitation without specific permission.

View Abstract