Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Linked region detection using high-density SNP genotype data via the minimum recombinant model of pedigree haplotype inference

Linked region detection using high-density SNP genotype data via the minimum recombinant model of... Background: With the rapid development of high-throughput genotyping technologies, efficient methods for identifying linked regions using high-density SNP genotype data have become more and more important. Recently, a deterministic method that works very well on SNP genotyping data has been developed (Lin et al. Bioinformatics 2008, 24(1): 86–93). However, that program can only work on a limited number of family structures. In particular, the results (if any) will be poor when the genotype data for the whole chromosome of one of the parents in a nuclear family is missing. Results: We have developed a software package (LIden) for identifying linked regions using high- density SNP genotype data. We focus on handling the case where the genotype data for the whole chromosome of one of the parents in a nuclear family is missing. We use the minimum recombinant model for haplotype inference in pedigrees. Several local optimization algorithms are used to infer the haplotype of each individual and determine the linked regions based on the inferred haplotype data. We have developed a more flexible method to combine nuclear families to further refine (reduce the length of) the linked regions. Conclusion: Our new package (LIden) is efficient software for linked region detection using high- density SNP genotype data. LIden can handle some important cases where the existing programs do not work well. In particular, the new package can handle many cases where the genotype data of one of the two parents is missing for the entire chromosome. The running time of the program is O(mn), where m is the number of members in the family and n is the number of SNP sites in the chromosome. LIden is specifically suitable for handling big sized families. This research also demonstrates another practical use of the minimum recombinant model for haplotype inference in pedigrees. The software package can be downloaded at http://www.cs.cityu.edu.hk/~lwang/software/Link. understanding of human genomic sequences has been Background With the completion of the human genome sequencing extended dramatically. Due to the development of SNP project and the development of HapMap project [1], our genotyping technology, genotyping of hundreds of thou- Page 1 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 sands of single nucleotide polymorphism (SNP) markers Implementation in a high-throughput format has become a routine job in We use the minimum recombinant model to infer the many labs. haplotype configuration for all the family members. In 2002, Qian and Beckman [7] proposed a model to recon- Compared to classical genotyping methods mainly using struct haplotype configurations from genotype data in a microsatellite markers, SNP genotyping is faster and eas- pedigree under the Mendelian law of inheritance. In this ier. It provides complete coverage of the genome and model, the resulting haplotype configurations should much more information on covered regions. Linkage have the minimum number of recombinants (i.e. recom- analysis is a method to identify genomic regions that bination events). cosegregate with an inherited disease in a family and to Minimum Recombinant Haplotype Configuration facilitate the eventual identification of the mutation in that region causing the disease. Leykin et al. in [2] and Given a pedigree and the genotype information for each Sellick et al. in [3] demonstrated that high-density SNP member, the object is to find a haplotype configuration genotype data, such as that from microarrays, can be used such that the total number of recombinants in the whole for large-scale and cost-effective linkage analysis. The pedigree is minimized. main reason is that there will be a sufficient number of informative markers between any two recombination The problem is called Minimum Recombinant Haplotype points and thus the allele sharing status among the family Configuration (MRHC). The MRHC problem was proved members can be precisely determined. Therefore, efficient to be NP-hard by Li and Jiang [8]. Lots of algorithms have programs are highly demanded for allele sharing determi- been proposed. Some algorithms run in time exponential nation that work on a large number of markers and big in terms of the number of SNP sites and some algorithms sized families. run in time exponential in terms of the number of family members [9]. An integer linear programming approach Classical linkage analysis methods are designed for sparse was proposed to handle incomplete genotype data [10]. microsatellite markers. They are mainly based on two algorithms, the Elston-Steward algorithm that is limited Linkage analysis aims to identify regions whose allele is by the number of total markers used [4] and the Lander- shared by all or most diseased members in a family but by Green algorithm that is limited by the total number of none or few normal family members. In dominant inher- individuals in a family [5]. As a result, they either cannot itance situations, sharing of one mutation allele can cause handle genotype data based on large number of SNPs at a disease phenotype. In recessive cases, sharing of two dis- all or they cannot handle families of a large size, especially ease alleles in that region is necessary for there to be a dis- together with large numbers of genotyped SNPs, due to eased status. We will first design algorithms to infer the memory constraint. allele sharing status with minimum recombinants and then use an algorithm to find the linked regions(regions Recently, a deterministic method that works very well on shared by all or most of the diseased individuals but not SNP genotyping data [6] has been developed. This was shared by any normal individuals) by possibly changing one of a series of efforts to develop software that is partic- the inferred allele sharing status. ularly suitable for SNP genotyping data and runs in time linear to both the number of SNP sites and the number of Algorithm family members. However, the program in [6] can only Our package contains a set of (heuristic) algorithms to work on a limited number of family structures. Here we handle various kinds of situations and sometimes for one use the minimum recombinant model for haplotype case, we use two (local optimization) algorithms to itera- inference in pedigrees and develop a set of algorithms to tively refine the output. Before we discuss the algorithms, minimize the total number of recombinants and produce we give the basic data structures used in the package. a software package that works on a much wider range of family structures. Extensive simulations on Affymetrix The basic data structures Human Mapping 50 K/250 K GeneChips showed that the For each individual in the family, we use five arrays: gen- new package can correctly identify the linked regions on a otype, paternal-h, maternal-h, which-p and which-m to wide range of family structures. In particular, the new store the information. The possible values for each ele- package outperforms the old program in many important ment in the arrays are given in Table 1. The genotype value cases where the genotype data of one of the parents is of individual I at site i is in I. genotype [i], which is {A, missing on the entire chromosome. This research also A}, {A, B} or {B, B}. The haplotype value of individual I demonstrates another practical use of the minimum from the father at site i is in I.paternal-h [i], which is recombinant model for haplotype inference in pedigrees. either A or B. Similarly, the haplotype value of individual I at site i from the mother is in I. maternal-h [i], which is Page 2 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Table 1: The possible values of each element in the five arrays. eased individuals and the open circles/boxes indicate nor- mal individuals. This setting applies to all the figures in Array Name Possible Values of each element this paper. To handle nuclear families with both parents, we use two algorithms, the basic algorithm and the horizon- genotype {AA}, {AB}, {BB} tal local optimization algorithm. paternal-h A, B The basic algorithm In our basic algorithm, we consider a site at a time. Sup- maternal-h A, B pose that paternal [i], maternal [i], which-p [i] and which-m which-p 0, 1 [i] of each individual at site i have been fixed. For site i + 1, there are (at most) 4 different haplotype configurations which-m 0, 1 of the two parents fitting the given genotype data. For each of the 4 haplotype configurations of the two parents, we can fix C .paternal-h [i + 1], C .maternal-h [i + 1], C .which- j j j either A or B. For each individual, there are two (paternal .which-m [i + 1] for every child C (j = 1, 2,..., p [i + 1] and C j j and maternal) copies of haplotypes. We use 0 and 1 to dis- n) such that the number of break points of child C tinguish them. I.which-p [i] can be 0 or 1. between site i and i + 1 is minimized. Note that, for each child C , the number of breakpoints between site i and site I.which-p [i] = 0 indicates that the haplotype I.paternal- i + 1 could be 0, 1 or 2. When two choices are equally h [i] of individual I is from his/her father's 0-th haplotype good for child C , we arbitrarily select one. We then and I.which-p [i] = 1 indicates that the haplotype I.pater- choose one of the 4 haplotype configurations of the par- nal-h [i] of individual I is from his/her father's 1-th hap- ents at site i + 1 such that the total number of break points lotype. Similarly, I.which-m [i] = 0 indicates that the for all the n children between site i and site i + 1 is mini- haplotype I. maternal-h [i] of individual I is from his/her mized. It is worth pointing out that the basic algorithm mother's 0-th haplotype and I.which-m [i] = 1 indicates here considers all the children site by site while the old that the haplotype I. maternal-h [i] of individual I is from algorithm in [6] considers all the sites for every child and his/her mother's 1-th haplotype. The main purpose here is then handles the children of the nuclear family one by to keep a record of which grandparent the allele came one. Clearly, the quality of the solution at site i + 1 heavily from. depends on the quality of the solution at site i. Thus, we will first use a method to select a starting site that gener- The algorithms for nuclear families with data available for ates a good solution and then we use the algorithm to both parents compute the solution to the left and right ends of the Let us consider a nuclear family with two parents and n chromosome, respectively. children C , C ,..., C . The pedigree is shown in Figure 1. 1 2 n Finding a good starting site A box represents a male individual and a circle represents a female individual. The filled circles/boxes indicate dis- Here we try to find a site, where the haplotypes for all indi- viduals are uniquely determined according to the given genotype data. This can be done if both genotypes of the parents are "AB", and every child's genotype is "AA" or "BB". If there is no such site, we look for the "second best" site, where some of the individuals' haplotypes can be par- tially fixed. The second best site is a site where the geno- type of one parent is "AB", and the genotype of each child is "AA" or "BB". In this case, we can uniquely determine the haplotypes for all children, but partially determine the inheritance information, i.e., one parent has genotype "AA" ("BB") and the inherited "A" ("B") of a child from this parent could be any one of "AA" ("BB"). For this case, Figure 1 A family with n children we arbitrarily give a solution which fits the genotype data. A family with n children. A nuclear family with two par- The third best site is a site where both genotypes of father ents and n children C , C ,..., C . A box represents a male indi- 1 2 n and mother are "AB", and the genotypes of some (but not vidual and a circle represents a female individual. The filled all) children are "AA" or "BB". In this case, we can fix the circles/boxes indicate diseased individuals and the open cir- haplotypes of those children with "AA" or "BB" correctly. cles/boxes indicate normal individuals. This setting applies to all the figures in this paper. But for a child with genotype "AB", we have a risk of mak- ing mistakes. Again, in this case, we arbitrarily give a solu- Page 3 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 tion which fits the genotype data. The worst case is that mization algorithm to further improve the solution. The both genotypes of father and mother are "AA" or "BB". In whole algorithm is as follows: this case, we cannot fix the inheritance source for any child. In practice, we can always find a site which is one of 1. Find a good starting point as described. the first three types. 2. Use the basic algorithm to get a solution for all individ- Note that there is no guarantee that we can always find a uals in the nuclear family. starting site with uniquely determined haplotypes. When the solution on the starting site is wrong, our algorithm 3. Identify a short segment with a large number of break- may produce a short segment with many breakpoints. points and apply the basic algorithm to this segment to re- Whenever such a segment is found, our algorithm will re- calculate using the inverse order (thus a different starting calculate the solution of the segment using the reverse point). If we can reduce the total number of break points order and thus another starting point. then we use the new solution for this segment. The horizontal local optimization algorithm 4. Use the horizontal local optimization algorithm for After we obtain a haplotype solution from the basic algo- each child to refine the solution. rithm, we can look at three individuals, two parents and one of their children C , at a time. Assuming that the hap- The algorithms for nuclear families with data available for single parents lotypes of the two parents are fixed, the number of break points in C might be further reduced if we change the Now, we consider the case, where the genotype data of haplotypes of C . This is due to the existence of multiple one of the parents in the nuclear family is unknown over solutions at a site and the fact that the haplotype solution the entire chromosome. To handle this case, the basic idea at site i heavily depends on that of its previous site. Thus, is similar to that for nuclear families with data available we use an algorithm that can give a haplotype of C with for both parents. For the basic algorithm, we guess the minimum number of break points when the haplotypes haplotype of the unknown parent whenever needed. of the two parents are fixed (by the basic algorithm). In Since each individual has two copies of haplotypes on this way, the total number of break points can be reduced. each chromosome, there are four different haplotype con- pq Let D (i) be the minimum number of breakpoints of C figurations at each site. The two haplotypes for an individ- for the first i sites such that at site i the paternal haplotype ual can be AA, AB, BA, and BB. Thus, we can modify the is from the p-th haplotype of the father and the maternal basic algorithm to handle this case, where the genotype haplotype is from the q-th haplotype of the mother, where data for one parent is missing. In the basic algorithm, pq p = 0, 1 and q = 0, 1. Then D (i + 1) can be computed instead of considering at most 4 configurations of the two 00 01 10 11 based on D (i), D (i), D (i) and D (i). For example, parents, we consider at most 8 configurations, where the 00 00 01 the value of D (i + 1) can be one of D (i), D (i) + 1, unknown parent has 4 configurations, and the known 10 11 D (i) + 1 or D (i) + 2. We can check each of the cases parent has at most 2 configurations. Similarly, for each of and see if the genotype of C at site i + 1 can fit each of the the 8 configurations of the two parents, we can fix C .pater- j j four configurations under the Mendelian law of inherit- nal-h [i + 1], C .maternal-h [i + 1], C .which-p [i + 1] and j j ance. Among all the possible configurations, we choose C .which-m [i + 1] for every child C (j = 1, 2,..., n) such that j j the number of breakpoints of child C between site i and i the one corresponding to the minimum value. The run- ning time of the algorithm is O(n), where n is the number + 1 is minimized. The rest of the algorithm remains the of sites in the chromosome. same. The algorithm for a family with three or more generations We apply the horizontal local optimization algorithm to each of the n children in the nuclear family one by one in To handle a family with three generations, we will view an arbitrarily fixed order. (The order among the children the set of all individuals in the first and second genera- does not affect the results.) tions as a nuclear family which is referred to as the main nuclear family. For any child C in the main nuclear fam- The whole algorithm for a nuclear family ily, if C has his/her own children (third generation indi- Consider a nuclear family containing two generations. For viduals), then we will view C as a super child representing a segment from position i to position i + k, we can use the the second generation nuclear family including C , C 's j j basic algorithm in two ways, i.e., from left to right or from spouse and all their children. The basic algorithm for a right to left. We may get different solutions since the start- family with three generations is similar to the basic algo- ing points are different. After we obtain a solution using rithm for a nuclear family. Here we focus on the main the basic algorithm, we can use the horizontal local opti- nuclear family and give some special treatment for super children. Page 4 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 The basic algorithm for a family with three generations Dealing with missing individuals in the second generation Again, we consider a site at a time. Suppose that paternal- In this subsection, we deal with the cases where the geno- h [i], maternal-h [i], which-p [i] and which-m [i] of each type data for one of the parents in a second generation individual in the main nuclear family for site i are fixed. nuclear family is missing. The algorithm is similar to the Let us consider site i + 1. If the genotype data is known for basic algorithm for a family with three generations. Let C both parents, there are (at most) 4 different haplotype be a super child. Two cases arise. configurations of the two parents (first generation indi- viduals) fitting the given genotype data. If the genotype The genotype data for C 's spouse is missing data of a parent is missing, there are at most 8 different For this case, when we try to fix C .paternal-h [i + 1], haplotype configurations of the two parents. Let C , C ,..., C .maternal-h [i + 1], C .which-p [i + 1] and C .which-m [i + 1 2 j j j C be the n children in the second generation and some of 1] such that the number of breakpoints in the second gen- them might be super children. For each possible haplo- eration nuclear family represented by C between site i and type configurations of the two parents (first generation i + 1 is minimized as in A2, we have to guess the haplotyes individuals), we try to fix C .paternal-h [i + 1], C .maternal- of C 's spouse by trying all possible haplotypes AA, AB, BA j j j .which-p [i + 1] and C .which-m [i + 1] for every 's spouse is given, h [i + 1], C and BB. When the genotype data for C j j j child C (j = 1, 2,..., n) as follows: there are two choices. This will slightly slow down the pro- gram. Moreover, if there are more than one second gener- A1: if C is not a super child, we fix C .paternal-h [i + 1], ation nuclear families missing the super child's spouse's j j C .maternal-h [i + 1], C .which-p [i + 1] and C .which-m [i + genotype data, the speed will not be affected too much, j j j 1] such that the number of breakpoints of child C since children in the second generation are processed one between site i and i + 1 is minimized. Note that for each by one. child C , the number of breakpoints between site i and site The genotype data for C is missing i + 1 could be 0, 1 or 2. For this case, when we try to fix C .paternal [i + 1], C .mater- j j nal [i + 1], C .which-p [i + 1] and C .which-m [i + 1] such A2: if C is a super child, we fix C .paternal-h [i + 1], j j j j that the number of breakpoints in the second generation C .maternal-h [i + 1], C .which-p [i + 1] and C .which-m [i + j j j between site i and i + 1 is nuclear family represented by C 1] such that the number of breakpoints B in the second minimized as in A2, we have to guess the haplotyes of C by trying all possible haplotypes AA, AB, BA and BB with- generation nuclear family C represented by C between j j out genotype data. Again, the running time is still O(m × site i and i + 1 is minimized. Here if the genotype data for n) since children in the second generation are processed both C and C 's spouse is given, there are at most two j j one by one. choices for each of C and C 's spouse. We can call the basic j j Remarks algorithm to get B . Note that B could be greater than C C j j For the algorithm in [6], a top-down approach is used to deal with three-generation families. The algorithm proc- esses nuclear families (with two generations) one by one Again, when several choices are equally good for child C , from the top to the bottom. The old approach cannot give we arbitrarily select a choice. good solutions when the size of the main nuclear family is small. The reason is that the quality of solutions heavily Among all the possible haplotype configurations of the depends on the sizes of nuclear families and if the size of parents (first generation individuals) at site i + 1, we select the main nuclear family is small, the obtained haplotypes one such that the total number of breakpoints for all the for the (super) children in the second generation could be individuals in the three-generation family between site i wrong and this wrong information will be passed to the and site i + 1 is minimized. This process can be used recur- processing of the third generation. A better strategy is to sively to handle more than three generations. In fact, for work on big nuclear families first. However, even this our package, there is no limit on the number of genera- strategy is not as good as our approach here since we do tions in the family. Clearly, if the genotype data of both not fix the solution of super children in the second gener- parents in all nuclear families is known, the running time ation. We have observed that the inferred haplotypes for of the basic algorithm is O(mn), where m is the total big sized second generation families such as two parents number of individuals in the whole family and n is the and 5 children could be wrong though the breakpoint number of SNP sites in the chromosome. positions are very accurate. If the wrong haplotyes are used to handle other nuclear families, the whole linked region could be missed. Page 5 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Table 2: The experimental results for nuclear families when the genotype data for both parents is available. 2 children 3 children 4 children 5 children 6 children No. of Breakpoints 11.11 (10.59) 15.63 (15.50) 21.32 (21.28) 26.61 (26.60) 31.66 (31.60) precision 0.47 (0.50) 0.93 (0.98) 0.95 (0.97) 0.98 (0.99) 0.98 (0.99) recall 0.47 (0.48) 0.91 (0.96) 0.94 (0.97) 0.96 (0.98) 0.96 (0.98) length of real linked region (cM) 76.66 59.51 40.82 35.97 29.90 length of found linked region (cM) 78.70 (79.51) 60.79 (60.77) 42.15 (42.13) 37.00 (37.01) 30.97 (30.99) overlap/real 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) overlap/found 0.97 (0.96) 0.97 (0.97) 0.96 (0.96) 0.96 (0.96) 0.95 (0.95) linked region recovery 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) The row "No. of Breakpoints" indicates the number of breakpoints in the whole family. The row "precision" is the ratio of the number of correctly inferred breakpoints to the total number of inferred breakpoints. The row "recall" is the ratio of the number of correctly inferred breakpoints to the number of real breakpoints. The row "length of real linked region (cM)" is the length of the linked region generated in the simulation. The row "length of found linked region (cM)" is the length of the linked region calculated by our program. The row "overlap/real" is the length of the region shared by both the real linked region and the inferred linked region divided by the length of the real linked region. The row "overlap/found" is the length of the region shared by both the real linked region and the inferred linked region divided by the length of the inferred linked region. The row "linked region recovery" is the number of times that the inferred linked region contains the mutation site divided by the total number of experiments. In each cell, the first value is by our program and the one in brackets is by the program in [6]. Table 3: The experimental results for nuclear families when the genotype data of only one parent is available and this parent is diseased. 2 children 3 children 4 children 5 children 6 children No. of Breakpoints 3.68 (3.75) 7.47 (14.59) 10.32 (25.47) 12.90 (34.83) 15.48 (50.38) precision 0.33 (0.39) 0.78 (0.26) 0.89 (0.23) 0.94 (0.24) 0.96 (0.21) recall 0.23 (0.27) 0.72 (0.48) 0.86 (0.55) 0.91 (0.63) 0.92 (0.66) length of real linked region (cM) 83.14 57.98 45.62 35.15 31.01 length of found linked region (cM) 111.1 (91.45) 65.08 (66.32) 48.85 (49.38) 37.40 (38.02) 32.57 (32.80) overlap/real 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) overlap/found 0.86 (0.94) 0.93 (0.94) 0.94 (0.93) 0.93 (0.92) 0.93 (0.92) linked region recovery 1.00 1.00 1.00 1.00 1.00 (255/285) (229/285) (267/285) (276/285) (281/285) The meaning of each line is the same as in Table 2. Page 6 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Table 4: The experimental results for nuclear families when the genotype data of only one parent is available and this parent is normal. 3 children 4 children 5 children 6 children No. of Breakpoints 7.39 (13.47) 10.29 (24.47) 12.94 (35.69) 15.23 (55.84) precision 0.75 (0.31) 0.89 (0.25) 0.94 (0.24) 0.96 (0.19) recall 0.68 (0.53) 0.86 (0.59) 0.90 (0.65) 0.92 (0.66) length of real linked region (cM) 55.56 43.61 36.61 28.45 length of found linked region (cM) 67.66 (59.25) 43.90 (44.14) 36.87 (35.83) 29.23 (28.12) overlap/real 0.99 (0.99) 0.99 (0.98) 0.99 (0.98) 0.99 (0.97) overlap/found 0.85 (0.94) 0.99 (0.98) 0.99 (0.99) 0.98 (0.98) linked region recovery 1.00 1.00 1.00 1.00 (275/285) (274/285) (274/285) (266/285) The meaning of each line is the same as in Table 2. The current version of our program works for any number will simply delete both breakpoints that are within x SNP of generations. It can handle the case, where the genotype sites. We suggest setting the value of x based on the error data for one of a couple is missing. rate. When the error rate is smaller than 0.1%, x = 5. When the error rate is between 0.1% and 0.3%, x = 8. When the Genotype data error correction error rate is between 0.3% and 0.5%, x is set to be 10. For large-scale SNP genotyping, a certain number of When the error rate cannot be estimated, we simply use x experimental errors is unavoidable. We observe that when = 8. genotype data errors occur, the inferred haplotypes con- Identifying mutation regions tain many breakpoints that are close to each other. In order to get the correct allele sharing status, our algorithm After obtaining the inferred haplotype data for all the individuals in the family, we can find possible mutation regions by looking at the SNP sites one by one. The possi- Table 5: The experimental results for pedigrees P1 to P4. P1 P2 P3 P4 No. of Breakpoints 20.49 20.20 14.20 20.38 precision 0.58 0.77 0.49 0.57 recall 0.63 0.72 0.43 0.53 length of real linked region (cM) 34.69 36.42 43.89 41.28 length of found linked region (cM) 45.84 47.28 62.88 43.25 overlap/real 0.99 1 1 1 overlap/found 0.83 0.80 0.75 0.95 Pedi Figure 2 grees P1 to P4 Pedigrees P1 to P4. Pedigrees P1 to P4. A slash on an indi- linked region recovery 1111 vidual indicates that the genotype data for this individual is missing. The meaning of each line is the same as in Table 2. Page 7 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Table 6: The experimental results for pedigrees P5 to P8. P5 P6 P7 P8 No. of Breakpoints 20.81 (31.60) 26.37 (47.27) 38.83 (108.4) 44.65 (145.4) precision 0.86 (0.45) 0.96 (0.41) 0.97 (0.26) 0.90 (0.23) recall 0.82 (0.66) 0.94 (0.71) 0.93 (0.71) 0.92 (0.76) length of real linked region (cM) 38.58 19.77 21.92 12.37 length of found linked region (cM) 41.00 (43.28) 22.98 (23.51) 23.37 (24.57) 16.04 (16.20) overlap/real 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) overlap/found 0.93 (0.93) 0.86 (0.84) 0.91 (0.91) 0.79 (0.78) linked region recovery 1.00 1.00 1.00 1.00 (232/285) (266/285) (228/285) (265/285) The meaning of each line is the same as in Table 2. ble mutation regions should be regions shared by all or The current version of the package reports all the possible most of the diseased family members (considering phen- regions. The users are asked to input the maximum ocopy) but none or few of the normal family members number of normal individuals to be allowed to share the (considering penetrance). Those regions are reported as mutation allele (allowing for penetrance) and the maxi- suspected mutation regions. Due to the existence of mul- mum number of diseased individuals in the family to be tiple solutions and haplotype inference error, it is possible allowed not to share the potential mutation region that the reported regions do not completely overlap the (allowing for phenocopy). true mutation region. In our algorithm, we use a subrou- tine to extend the suspected mutation regions site by site Results and Discussion in both directions, where we can revise the haplotype In this section, we will test our software package using allele such that the allele is shared by diseased family simulated data. We have considered a wide range of pedi- members, but not shared by the normal family members. gree structures. Unlike the program in [6], here we have to extend in both directions. Table 7: The experimental results for pedigrees P9 to P12. P9 P10 P11 P12 No. of Breakpoints 23.43 (41.07) 18.46 (26.38) 37.44 (92.49) 20.68 (31.52) precision 0.95 (0.40) 0.92 (0.49) 0.91 (0.29) 0.94 (0.47) recall 0.91 (0.67) 0.89 (0.68) 0.89 (0.70) 0.91 (0.71) length of real linked region (cM) 29.80 26.06 22.26 22.05 length of found linked region (cM) 32.01 (36.72) 28.26 (31.93) 24.07 (24.53) 29.78 (30.57) overlap/real 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) overlap/found 0.90 (0.81) 0.91 (0.83) 0.90 (0.89) 0.80 (0.78) linked region recovery 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (90/95) The meaning of each line is the same as in Table 2. Page 8 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Table 8: The experimental results for pedigrees P13 to P16. P13 P14 P15 P16 No. of Breakpoints 23.56 (46.66) 44.20 (124.0) 31.37 (48.25) 34.60 (69.26) precision 0.97 (0.37) 0.98 (0.27) 0.96 (0.47) 0.88 (0.38) recall 0.94 (0.71) 0.95 (0.75) 0.93 (0.71) 0.87 (0.75) length of real linked region (cM) 21.36 18.22 21.41 18.40 length of found linked region (cM) 25.61 (25.91) 19.68 (20.42) 25.12 (27.60) 21.81 (22.39) overlap/real 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 0.99 (1.00) overlap/found 0.86 (0.84) 0.90 (0.89) 0.88 (0.81) 0.86 (0.85) linked region recovery 1.00 (84/95) 1.00 (75/95) 1.00 (1.00) 1.00 (1.00) The meaning of each line is the same as in Table 2. Generating haplotype data using the Chi-square model nation with m equals 4 [6,12,13] and according to male/ In order to test the program, we generated haplotype data- female averaged genetic map for chromosome 1 down- sets based on the Chi-square model to see if our program loaded from HapMap [14]. The simulation process is can infer the haplotype data correctly from the corre- identical to that of [6,15]. When disease status was consid- sponding genotype data. The founder haplotypes were ered, a mutation was randomly assigned to be close to a obtained from real datasets (Affymetrix Human Mapping SNP site (called mutation site), and the diseased individu- 50 K/250 K GeneChips [11]), and children haplotypes als were forced to inherit the mutation strand and the nor- were generated through random inheritance of paternal/ mal individuals were forced not to inherit the mutation maternal alleles using the Chi-square model for recombi- strand. This process is done generation by generation. Pedi Figure 3 grees P5 to P8 Pe Figure 4 digrees P9 to P12 Pedigrees P5 to P8. Pedigrees P9 to P12. Page 9 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 A pedig Figure 7 ree for case study A pedigree for case study. point is within 20 SNPs away from the real location. From Table 2, we can see that all the elements in the row "over- lap/real" are 1 indicating that our package can always identify the whole mutation region. The elements in the row "overlap/found" are about 0.97 or 0.96 indicating that the sizes of the reported regions are slightly larger than (almost identical to) that of the real mutation regions. The values of recall and precision are also very Pedi Figure 5 grees P13 to P14 good when there are more than two children. The values Pedigrees P13 to P14. in brackets are by the program in [6]. We can see that both programs perform very well in those cases. Nuclear families with data available for both parents Nuclear families with data available for single parents Let us consider a nuclear family as shown in Figure 1. Let us consider nuclear families with a single parent. In When the genotype data for both parents is given, the this case, the genotype data for the other parent is missing. results are shown in Table 2. Here we test the cases, where About half of the children are diseased. There are two there are 2, 3,..., 6 children in the nuclear family. The cases: (1) the available parent is diseased and (2) the results are listed in column 2, 3,..., 6, respectively. We have available parent is normal. The results are shown in Table selected individuals to form 95 couples as in [6]. We have 3 and Table 4, respectively. Each cell in the tables contains done experiments on 285 datasets (three times for each two values. The first one is by our program and the one in couple) and calculated the average. For all the simula- brackets is by the program in [6]. We consider the cases tions, the number of data sets is a multiple of 95. Each cell where there are 2, 3,..., 6 children. For both cases, we can in the table contains two values. The first one is by our see that for our program, the precision and the recall are program and the one in brackets is by the program in [6]. slightly worse than that in Table 2. The length of the A breakpoint is correctly inferred if the inferred break- inferred region is slightly longer. Most importantly, all the elements in the row "linked region recovery" are 1 for our program. This indicates that our program can always find the linked region that contains the mutation site. We can also see that our new program performs much better than the program in [6] in those cases. Comparing case 1 and case 2, we can see that case 2 is harder to handle. In case 2, the result for 2 children is very poor and this case is not solvable by our program. Complicated pedigrees We did experiments on some complicated pedigrees. First we study some pedigrees, where the genotype data for the super children in the second generation (shared by two Pedi Figure 6 grees P15 to P16 Pedigrees P15 to P16. nuclear families) is missing. See P1–P4 in Figure 2. A slash on an individual indicates that the genotype data for this Page 10 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Th Figure 8 e simulated grand-paternal haplotype allele sharing status among all members, excluding grandmother The simulated grand-paternal haplotype allele sharing status among all members, excluding grandmother. Each individual has two haplotypes, paternal haplotype and maternal haplotype. Note that individual D is the founder of the dis- ease in the family. The two haplotypes of individual D are shown as 1-th haplotype and 0-th haplotype at the top and at the bottom, respectively. Any other individual may inherit alleles from both 0-th haplotype and 1-th haplotype of individual D. Thus, any other individual in the family appears twice (upper part and lower part) in the figure. The upper part shows the seg- ments inherited from the 1-th haplotype of D and the lower part shows the segments inherited from the 0-th haplotype of D. The segments from diseased individuals are red and the segments from normal individuals are blue. The simulated (true) linked region is indicated by the horizontal black double-direction arrow in the middle. individual is missing. This setting applies to all the figures results are listed in Table 6. From the row "linked region in the paper. recovery", we can see that, if the nuclear family of the first generation has a single parent, and the nuclear family of We have done 285 experiments on each of P1–P4. From the second generation has at least three (with both par- Table 5, we can see that the inferred region of our program ents) or four (with single parent) children, we can always can almost always cover the entire real linked region. The find the linked region that contains the mutation site. The length of the inferred linked region is a bit (about 15%) program in [6] missed the mutation site at a rate of about longer than that of the real linked region. From the row 13.07%. Now we consider pedigrees P9–P16 in Figures 4, "linked region recovery", we can see that our program can 5 and 6. The experimental results are shown in Table 7 always precisely find the linked region that contains the and Table 8, respectively. We can see that for common mutation site. Note that the program in [6] cannot handle pedigrees of different structures, our program can always P1–P4 at all. This is a significant improvement in our new find the linked region containing the mutation site and program. Let us consider pedigrees P5–P8 in Figure 3, our program reports the linked region more precisely than where individuals with missing genotype data are not the program in [6]. Note that the program in [6] some- super children (not shared by two nuclear families). We times missed the mutation sites. From all the listed exper- have done 285 experiments for each of the pedigrees. The imental results, we can see that our program has much Page 11 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Th Figure 9 e inferred grand-paternal haplotype allele sharing status among all members, excluding grandmother The inferred grand-paternal haplotype allele sharing status among all members, excluding grandmother. higher precision and recall than the program in [6], which Note that individual D is the founder of the disease in the indicates that our new program can infer the haplotypes family. The two haplotypes of individual D are shown as more precisely. 1-th haplotype and 0-th haplotype at the top and at the bottom, respectively. Any other individual may inherit A Case Study alleles from both 0-th haplotype and 1-th haplotype of We studied a pedigree to see if the program can correctly individual D. Thus, any other individual in the family identify the linked region when some of the family mem- appears twice (upper part and lower part) in the figure. bers are purposely excluded. The pedigree is shown in Fig- The upper part shows the segments inherited from the 1- ure 7. The simulated allele sharing status and the linked th haplotype of D and the lower part shows the segments region are shown in Figure 8. Each individual has two inherited from the 0-th haplotype of D. The segments haplotypes, paternal haplotype and maternal haplotype. from diseased individuals are red and the segments from Table 9: The experimental results for genotype data error correction using Affymetrix 50 K GeneChips data. precision recall linked region recovery overlap/real overlap/found No error 97.74% 95.05% 100% 99.98% 92.46% (69.62%) (86.56%) (100%) (100%) (91.78%) 0.1% error 84.40% 94.94% 100% 99.93% 92.00% (67.19%) (86.19%) (100%) (100%) (91.13%) 0.5% error 79.00% 93.76% 100% 99.84% 92.57% (61.81%) (86.10%) (100%) (100%) (91.94%) Page 12 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Table 10: Comparison with Merlin using Affymetrix 50 K Table 11: Comparison with Merlin using Affymetrix 250 K GeneChips. GeneChips. Run time (s) overlap/real overlap/found Run time (s) overlap/real overlap/found 0+3 16.717 (0.898) 0.941 (0.805) 0.605 (0.862) 0+3 67.680 (2.979) 0.954 (0.975) 0.590 (0.866) 0+4 23.325 (1.126) 0.928 (0.824) 0.594 (0.930) 0+4 97.082 (4.067) 0.965 (0.988) 0.548 (0.955) 0+5 30.634 (1.746) 0.941 (0.901) 0.587 (0.936) 0+5 124.43 (6.558) 0.983 (0.994) 0.607 (0.960) 0+6 39.028 (3.901) 0.987 (0.929) 0.576 (0.969) 0+6 131.71 (12.50) 0.994 (0.992) 0.675 (0.986) 2+3(4 bits) 3.276 (1.340) 1.000 (0.928) 0.960 (1) 2+3(4 bits) 9.454 (7.122) 1 (0.936) 0.995 (1) 2+4(6 bits) 4.596 (1.735) 0.9997 (0.937) 0.967 (0.990) 2+4(6 bits) 11.625 (8.133) 1 (0.989) 0.988 (1) P5(7 bits) 7.714 (2.325) 0.999 (0.933) 0.919 (0.979) P5(7 bits) 16.662 (8.262) 1 (0.892) 0.983 (0.938) P9(9 bits) 7.365 (3.255) 0.999 (0.879) 0.926 (0.958) P9(9 bits) 24.18 (16.65) 1.000 (0.934) 0.973 (1) P10(11 bits) 19.15 (8.829) 0.999 (0.797) 0.912 (0.968) P10(11 bits) 38.54 (31.39) 1.000 (0.797) 0.970 (0.938) P11(12 bits) 11.37 (14.01) 1.000 (0.938) 0.924 (0.979) P11(12 bits) 29.94 (415.18) 1.000 (0.991) 0.972 (1) P12(13 bits) 25.62 (31.57) 1.000 (0.922) 0.842 (0.958) P12(13 bits) 49.28 (1237.7) 1.000 (0.968) 0.871 (1) P7(14 bits) 13.58 (58.14) 1.000 (0.949) 0.933 (0.990) P7(14 bits) 31.611 (NA) 1(NA) 0.960(NA) P15(14 bits) 46.96 (43.15) 1.000 (0.906) 0.894 (0.979) P15(14 bits) 103.269 (NA) 1(NA) 0.927(NA) P13(15 bits) 23.72 (1282.05) 1.000 (0.970) 0.875 (1) P13(15 bits) 54.114 (NA) 1.000(NA) 0.881(NA) P16(15 bits) 37.93 (384.78) 0.988 (0.950) 0.877 (0.990) P16(15 bits) 141.266 (NA) 0.991(NA) 0.922(NA) P14(16 bits) 15.042 (NA) 1.000 (NA) 0.927 (NA) P14(16 bits) 52.911 (NA) 1(NA) 0.979(NA) P6(17 bits) 40.754 (NA) 1.000 (NA) 0.869 (NA) P6(17 bits) 63.984 (NA) 1.000(NA) 0.862(NA) Each cell contains two values. The first one is by our program and the Each cell contains two values. The first one is by our program and the one in brackets is by Merlin. Pedigree 0+i for i = 3, 4, 5 and 6 stands one in brackets is by Merlin. Pedigree 0+i for i = 3, 4, 5 and 6 stands for a nuclear family with i children and the genotype data for parents for a nuclear family with i children and the genotype data for both is not available. Pedigree 2+i for i = 3 and 4 stands for a nuclear family parents is not available. Pedigree 2+i for i = 3 and 4 stands for a with i children and the genotype data for both parents is available. nuclear family with i children and the genotype data for both parents "NA" means that the program failed to execute on this pedigree. is available. "NA" means that the program failed to execute on this pedigree. normal individuals are blue. At any position in the chro- mosome, each second generation individual gets an allele bps). The inferred linked region remains the same if we from D that is either from the 0-th haplotype or 1-th hap- exclude individual E or F. The inferred linked region also lotype. For the third generation individual, say, C1, the remains unchanged if we simultaneously remove E, M allele at a position may or may not be from D. The simu- and N. When we simultaneously remove E, K, N, D1, D2 lated (true) linked region is from 189.61 cM to 211.55 cM and D3, the inferred linked region is from 189.61 cM to (physical position from 169183745 bps to 197201161 225.48 cM, which is only slightly enlarged. When we bps) indicated by the horizontal black double-direction simultaneously remove E, F, M, C1, C2, C3 and C4, the arrow in the middle. The inferred configuration is shown inferred linked region is again enlarged and the region is in Figure 9. We can see that the inferred configuration is from 172.07 cM to 211.63 cM. When F, K, N, D1, D2 and roughly the same as the simulated configuration except D3 are simultaneously removed from the pedigree, the that it is upside down. The inferred linked region is from inferred linked region is from 189.61 cM and 221.90 cM. 189.61 cM(169174855 bps) to 211.63 cM(197295627 Page 13 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Genotype data error correction the genotype data of one parent is missing for the entire To test the effect of genotype data errors on our program, chromosome. we did experiments on the pedigree in Figure 7. We gener- ated genotype errors by randomly changing the genotype Availability and requirements value (which is one of AA, BB and AB) into a different Project Name: CityU 121608 value (which is one of the other two values) at a position with probability 0.1% and 0.5%, respectively. We simu- Project Homepage: http://www.cs.cityu.edu.hk/~lwang/ lated 475 data sets. The length of real linked region ranges software/Link from 0.76 cM to 65.17 cM. The experimental results are shown in Table 9. For each cell, there are two values. The Operating system(s): Platform independent first one is by our program and the one in brackets is by the program in [6]. We can see that our program performs Programming language: Java better and can always recover the real linked regions. Other requirements: Java 1.6.0 or higher Comparison with Merlin We compared our program with Merlin [16]. We did the Licence: None experiments on a PC with a CPU of 3.0 GHz and 1.00 GB memory. The results are shown in Table 10 and Table 11 Any restrictions to use by non-academics: None for Affymetrix 50 K GeneChips and Affymetrix 250 K GeneChips, respectively. We have also considered differ- Authors' contributions ent kinds of pedigrees. LW carried out the algorithm design, provided some of initial pseudo codes, checked part of computer codes, par- Running time ticipated in the design of experiments, and drafted the For small sized families, both programs can generate manuscript. ZW produced the computer program, carried results in a few seconds. When the sizes of the families and out the experiments, drafted experiment results, and par- the number of the markers increase, the running time of ticipated in algorithm design. WY initiated the study, par- our program increases linearly. For large families, Merlin ticipated in the design of experiments, polished the requires really long running time. Most importantly, Mer- manuscript, and was in charge of communication with lin needs large memory space for big sized families and the editor. All the authors read and approved the final cannot successfully complete the computation for some manuscript. pedigrees (see P14 and P6 in Table 10 and Table 11). Acknowledgements We thank the referees for their helpful suggestions. Lusheng Wang is fully Output Quality supported by a grant from the Research Grants Council of the Hong Kong We again use "overlap/real" and "overlap/found" to indi- Special Administrative Region, China [Project No. CityU 121608]. cate the quality of the computational results. Our pro- gram always clearly gives a computed candidate region for References each input. Merlin calculates a LOD score for each marker. 1. The International HapMap Consortium: A second generation We evaluated the segment with the highest LOD score as human haplotype map of over 3.1 million SNPs. Nature 2007, 449:851-861. the output linked region for Merlin. From Table 10 and 2. Leykin I, Hao K, Cheng J, Meyer N, Pollak M, Smith R, Wong W, Rose- Table 11, we can see that the results of our program are now C, Li C: Comparative linkage analysis and visualization of less than optimal when parental data is not available at high-density oligonucleotide SNP array data. BMC Genet 2005, 6(1):7. all. When the family size becomes bigger, our program 3. Sellick G, Longman C, Tolmie J, Newbury-Ecob R, Geenhalgh L, outperforms Merlin. Our program can quickly produce Hughes S, Whiteford M, Garrett C, Houlston R: Genomewide link- accurate linked regions when the family size is big while age searches for mendelian disease loci can be efficiently conducted using high-density SNP genotyping arrays. Nucleic Merlin failed to execute on big sized families. Acids Res 2004, 32:e164. 4. Elston R, Stewart J: A general model for the genetic analysis of pedigree data. Hum Hered 1971, 21(6):523-542. Conclusion 5. Lander E, Green P: Construction of multilocus genetic linkage We have developed a software package that infers the hap- maps in humans. Proc Natl Acad Sci USA 1987, 84:2363-2367. lotype allele sharing status for the members of a pedigree 6. Lin G, Wang Z, Wang L, Lau YL, Yang W: Identification of linked regions using high-density SNP genotype data in linkage based on the minimum recombinants model. The run- analysis. Bioinformatics 2008, 24(1):86-93. ning time of the program is linear in terms of the input 7. Qian D, Beckmann L: Minimum recombinant haplotyping in pedigrees. Am J Hum Genet 2002, 70:1434-1445. size O(mn), where m is the total number of individuals in 8. Li J, Jiang T: Efficient inference of haplotypes from genotypes the whole family and n is the number of SNP sites in the on a pedigree. J Bioinform Comput Bio 2003, 1:41-69. chromosome. The new package can handle a wide range 9. Doi K, Li J, Jiang T: Minimum recombinant haplotype configu- ration on tree pedigrees. Proc. of the 3rd Annual Workshop on Algo- of pedigree structures. It works very well for cases where rithms in Bioinformatics (WABI'03) 2003:339-353. Page 14 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 10. Li J, Jiang T: Computing the minimum recombinant haplotype configuration from incomplete genotype data on a pedigree by integer linear programming. Journal of Computational Biology 2005, 12:719-739. 11. Affymetrix Human Mapping GeneChips [http://www.affyme trix.com/products_services/arrays/specific/100k.affx] 12. Broman K, Weber J: Characterization of human crossover interference. Am J Hum Genet 2000, 66(6):1911-1926. 13. Zhao H, Speed T, McPeek M: Statistical analysis of crossover interference using the chi-square model. Genetics 1995, 139(2):1045-1056. 14. International HapMap Project [http://www.hapmap.org] 15. Yang W, Wang Z, Wang L, Sham P, Huang P, Lau Y: Predicting the number and sizes of IBD regions among family members and evaluating the family size requirement for linkage studies. European Journal of Human Genetics 2008, 16:1535-1543. 16. Abecasis G, Cherny S, Cookson W, Cardon L: Merlin – rapid anal- ysis of dense genetic maps using sparse gene flow trees. Nature Genetics 2002, 30:97-101. Publish with Bio Med Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime." Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 15 of 15 (page number not for citation purposes) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals

Linked region detection using high-density SNP genotype data via the minimum recombinant model of pedigree haplotype inference

BMC Bioinformatics , Volume 10 (1) – Jul 15, 2009

Loading next page...
 
/lp/springer-journals/linked-region-detection-using-high-density-snp-genotype-data-via-the-C2iY5UGiWO

References (19)

Publisher
Springer Journals
Copyright
Copyright © 2009 by Wang et al; licensee BioMed Central Ltd.
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
eISSN
1471-2105
DOI
10.1186/1471-2105-10-216
pmid
19604391
Publisher site
See Article on Publisher Site

Abstract

Background: With the rapid development of high-throughput genotyping technologies, efficient methods for identifying linked regions using high-density SNP genotype data have become more and more important. Recently, a deterministic method that works very well on SNP genotyping data has been developed (Lin et al. Bioinformatics 2008, 24(1): 86–93). However, that program can only work on a limited number of family structures. In particular, the results (if any) will be poor when the genotype data for the whole chromosome of one of the parents in a nuclear family is missing. Results: We have developed a software package (LIden) for identifying linked regions using high- density SNP genotype data. We focus on handling the case where the genotype data for the whole chromosome of one of the parents in a nuclear family is missing. We use the minimum recombinant model for haplotype inference in pedigrees. Several local optimization algorithms are used to infer the haplotype of each individual and determine the linked regions based on the inferred haplotype data. We have developed a more flexible method to combine nuclear families to further refine (reduce the length of) the linked regions. Conclusion: Our new package (LIden) is efficient software for linked region detection using high- density SNP genotype data. LIden can handle some important cases where the existing programs do not work well. In particular, the new package can handle many cases where the genotype data of one of the two parents is missing for the entire chromosome. The running time of the program is O(mn), where m is the number of members in the family and n is the number of SNP sites in the chromosome. LIden is specifically suitable for handling big sized families. This research also demonstrates another practical use of the minimum recombinant model for haplotype inference in pedigrees. The software package can be downloaded at http://www.cs.cityu.edu.hk/~lwang/software/Link. understanding of human genomic sequences has been Background With the completion of the human genome sequencing extended dramatically. Due to the development of SNP project and the development of HapMap project [1], our genotyping technology, genotyping of hundreds of thou- Page 1 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 sands of single nucleotide polymorphism (SNP) markers Implementation in a high-throughput format has become a routine job in We use the minimum recombinant model to infer the many labs. haplotype configuration for all the family members. In 2002, Qian and Beckman [7] proposed a model to recon- Compared to classical genotyping methods mainly using struct haplotype configurations from genotype data in a microsatellite markers, SNP genotyping is faster and eas- pedigree under the Mendelian law of inheritance. In this ier. It provides complete coverage of the genome and model, the resulting haplotype configurations should much more information on covered regions. Linkage have the minimum number of recombinants (i.e. recom- analysis is a method to identify genomic regions that bination events). cosegregate with an inherited disease in a family and to Minimum Recombinant Haplotype Configuration facilitate the eventual identification of the mutation in that region causing the disease. Leykin et al. in [2] and Given a pedigree and the genotype information for each Sellick et al. in [3] demonstrated that high-density SNP member, the object is to find a haplotype configuration genotype data, such as that from microarrays, can be used such that the total number of recombinants in the whole for large-scale and cost-effective linkage analysis. The pedigree is minimized. main reason is that there will be a sufficient number of informative markers between any two recombination The problem is called Minimum Recombinant Haplotype points and thus the allele sharing status among the family Configuration (MRHC). The MRHC problem was proved members can be precisely determined. Therefore, efficient to be NP-hard by Li and Jiang [8]. Lots of algorithms have programs are highly demanded for allele sharing determi- been proposed. Some algorithms run in time exponential nation that work on a large number of markers and big in terms of the number of SNP sites and some algorithms sized families. run in time exponential in terms of the number of family members [9]. An integer linear programming approach Classical linkage analysis methods are designed for sparse was proposed to handle incomplete genotype data [10]. microsatellite markers. They are mainly based on two algorithms, the Elston-Steward algorithm that is limited Linkage analysis aims to identify regions whose allele is by the number of total markers used [4] and the Lander- shared by all or most diseased members in a family but by Green algorithm that is limited by the total number of none or few normal family members. In dominant inher- individuals in a family [5]. As a result, they either cannot itance situations, sharing of one mutation allele can cause handle genotype data based on large number of SNPs at a disease phenotype. In recessive cases, sharing of two dis- all or they cannot handle families of a large size, especially ease alleles in that region is necessary for there to be a dis- together with large numbers of genotyped SNPs, due to eased status. We will first design algorithms to infer the memory constraint. allele sharing status with minimum recombinants and then use an algorithm to find the linked regions(regions Recently, a deterministic method that works very well on shared by all or most of the diseased individuals but not SNP genotyping data [6] has been developed. This was shared by any normal individuals) by possibly changing one of a series of efforts to develop software that is partic- the inferred allele sharing status. ularly suitable for SNP genotyping data and runs in time linear to both the number of SNP sites and the number of Algorithm family members. However, the program in [6] can only Our package contains a set of (heuristic) algorithms to work on a limited number of family structures. Here we handle various kinds of situations and sometimes for one use the minimum recombinant model for haplotype case, we use two (local optimization) algorithms to itera- inference in pedigrees and develop a set of algorithms to tively refine the output. Before we discuss the algorithms, minimize the total number of recombinants and produce we give the basic data structures used in the package. a software package that works on a much wider range of family structures. Extensive simulations on Affymetrix The basic data structures Human Mapping 50 K/250 K GeneChips showed that the For each individual in the family, we use five arrays: gen- new package can correctly identify the linked regions on a otype, paternal-h, maternal-h, which-p and which-m to wide range of family structures. In particular, the new store the information. The possible values for each ele- package outperforms the old program in many important ment in the arrays are given in Table 1. The genotype value cases where the genotype data of one of the parents is of individual I at site i is in I. genotype [i], which is {A, missing on the entire chromosome. This research also A}, {A, B} or {B, B}. The haplotype value of individual I demonstrates another practical use of the minimum from the father at site i is in I.paternal-h [i], which is recombinant model for haplotype inference in pedigrees. either A or B. Similarly, the haplotype value of individual I at site i from the mother is in I. maternal-h [i], which is Page 2 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Table 1: The possible values of each element in the five arrays. eased individuals and the open circles/boxes indicate nor- mal individuals. This setting applies to all the figures in Array Name Possible Values of each element this paper. To handle nuclear families with both parents, we use two algorithms, the basic algorithm and the horizon- genotype {AA}, {AB}, {BB} tal local optimization algorithm. paternal-h A, B The basic algorithm In our basic algorithm, we consider a site at a time. Sup- maternal-h A, B pose that paternal [i], maternal [i], which-p [i] and which-m which-p 0, 1 [i] of each individual at site i have been fixed. For site i + 1, there are (at most) 4 different haplotype configurations which-m 0, 1 of the two parents fitting the given genotype data. For each of the 4 haplotype configurations of the two parents, we can fix C .paternal-h [i + 1], C .maternal-h [i + 1], C .which- j j j either A or B. For each individual, there are two (paternal .which-m [i + 1] for every child C (j = 1, 2,..., p [i + 1] and C j j and maternal) copies of haplotypes. We use 0 and 1 to dis- n) such that the number of break points of child C tinguish them. I.which-p [i] can be 0 or 1. between site i and i + 1 is minimized. Note that, for each child C , the number of breakpoints between site i and site I.which-p [i] = 0 indicates that the haplotype I.paternal- i + 1 could be 0, 1 or 2. When two choices are equally h [i] of individual I is from his/her father's 0-th haplotype good for child C , we arbitrarily select one. We then and I.which-p [i] = 1 indicates that the haplotype I.pater- choose one of the 4 haplotype configurations of the par- nal-h [i] of individual I is from his/her father's 1-th hap- ents at site i + 1 such that the total number of break points lotype. Similarly, I.which-m [i] = 0 indicates that the for all the n children between site i and site i + 1 is mini- haplotype I. maternal-h [i] of individual I is from his/her mized. It is worth pointing out that the basic algorithm mother's 0-th haplotype and I.which-m [i] = 1 indicates here considers all the children site by site while the old that the haplotype I. maternal-h [i] of individual I is from algorithm in [6] considers all the sites for every child and his/her mother's 1-th haplotype. The main purpose here is then handles the children of the nuclear family one by to keep a record of which grandparent the allele came one. Clearly, the quality of the solution at site i + 1 heavily from. depends on the quality of the solution at site i. Thus, we will first use a method to select a starting site that gener- The algorithms for nuclear families with data available for ates a good solution and then we use the algorithm to both parents compute the solution to the left and right ends of the Let us consider a nuclear family with two parents and n chromosome, respectively. children C , C ,..., C . The pedigree is shown in Figure 1. 1 2 n Finding a good starting site A box represents a male individual and a circle represents a female individual. The filled circles/boxes indicate dis- Here we try to find a site, where the haplotypes for all indi- viduals are uniquely determined according to the given genotype data. This can be done if both genotypes of the parents are "AB", and every child's genotype is "AA" or "BB". If there is no such site, we look for the "second best" site, where some of the individuals' haplotypes can be par- tially fixed. The second best site is a site where the geno- type of one parent is "AB", and the genotype of each child is "AA" or "BB". In this case, we can uniquely determine the haplotypes for all children, but partially determine the inheritance information, i.e., one parent has genotype "AA" ("BB") and the inherited "A" ("B") of a child from this parent could be any one of "AA" ("BB"). For this case, Figure 1 A family with n children we arbitrarily give a solution which fits the genotype data. A family with n children. A nuclear family with two par- The third best site is a site where both genotypes of father ents and n children C , C ,..., C . A box represents a male indi- 1 2 n and mother are "AB", and the genotypes of some (but not vidual and a circle represents a female individual. The filled all) children are "AA" or "BB". In this case, we can fix the circles/boxes indicate diseased individuals and the open cir- haplotypes of those children with "AA" or "BB" correctly. cles/boxes indicate normal individuals. This setting applies to all the figures in this paper. But for a child with genotype "AB", we have a risk of mak- ing mistakes. Again, in this case, we arbitrarily give a solu- Page 3 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 tion which fits the genotype data. The worst case is that mization algorithm to further improve the solution. The both genotypes of father and mother are "AA" or "BB". In whole algorithm is as follows: this case, we cannot fix the inheritance source for any child. In practice, we can always find a site which is one of 1. Find a good starting point as described. the first three types. 2. Use the basic algorithm to get a solution for all individ- Note that there is no guarantee that we can always find a uals in the nuclear family. starting site with uniquely determined haplotypes. When the solution on the starting site is wrong, our algorithm 3. Identify a short segment with a large number of break- may produce a short segment with many breakpoints. points and apply the basic algorithm to this segment to re- Whenever such a segment is found, our algorithm will re- calculate using the inverse order (thus a different starting calculate the solution of the segment using the reverse point). If we can reduce the total number of break points order and thus another starting point. then we use the new solution for this segment. The horizontal local optimization algorithm 4. Use the horizontal local optimization algorithm for After we obtain a haplotype solution from the basic algo- each child to refine the solution. rithm, we can look at three individuals, two parents and one of their children C , at a time. Assuming that the hap- The algorithms for nuclear families with data available for single parents lotypes of the two parents are fixed, the number of break points in C might be further reduced if we change the Now, we consider the case, where the genotype data of haplotypes of C . This is due to the existence of multiple one of the parents in the nuclear family is unknown over solutions at a site and the fact that the haplotype solution the entire chromosome. To handle this case, the basic idea at site i heavily depends on that of its previous site. Thus, is similar to that for nuclear families with data available we use an algorithm that can give a haplotype of C with for both parents. For the basic algorithm, we guess the minimum number of break points when the haplotypes haplotype of the unknown parent whenever needed. of the two parents are fixed (by the basic algorithm). In Since each individual has two copies of haplotypes on this way, the total number of break points can be reduced. each chromosome, there are four different haplotype con- pq Let D (i) be the minimum number of breakpoints of C figurations at each site. The two haplotypes for an individ- for the first i sites such that at site i the paternal haplotype ual can be AA, AB, BA, and BB. Thus, we can modify the is from the p-th haplotype of the father and the maternal basic algorithm to handle this case, where the genotype haplotype is from the q-th haplotype of the mother, where data for one parent is missing. In the basic algorithm, pq p = 0, 1 and q = 0, 1. Then D (i + 1) can be computed instead of considering at most 4 configurations of the two 00 01 10 11 based on D (i), D (i), D (i) and D (i). For example, parents, we consider at most 8 configurations, where the 00 00 01 the value of D (i + 1) can be one of D (i), D (i) + 1, unknown parent has 4 configurations, and the known 10 11 D (i) + 1 or D (i) + 2. We can check each of the cases parent has at most 2 configurations. Similarly, for each of and see if the genotype of C at site i + 1 can fit each of the the 8 configurations of the two parents, we can fix C .pater- j j four configurations under the Mendelian law of inherit- nal-h [i + 1], C .maternal-h [i + 1], C .which-p [i + 1] and j j ance. Among all the possible configurations, we choose C .which-m [i + 1] for every child C (j = 1, 2,..., n) such that j j the number of breakpoints of child C between site i and i the one corresponding to the minimum value. The run- ning time of the algorithm is O(n), where n is the number + 1 is minimized. The rest of the algorithm remains the of sites in the chromosome. same. The algorithm for a family with three or more generations We apply the horizontal local optimization algorithm to each of the n children in the nuclear family one by one in To handle a family with three generations, we will view an arbitrarily fixed order. (The order among the children the set of all individuals in the first and second genera- does not affect the results.) tions as a nuclear family which is referred to as the main nuclear family. For any child C in the main nuclear fam- The whole algorithm for a nuclear family ily, if C has his/her own children (third generation indi- Consider a nuclear family containing two generations. For viduals), then we will view C as a super child representing a segment from position i to position i + k, we can use the the second generation nuclear family including C , C 's j j basic algorithm in two ways, i.e., from left to right or from spouse and all their children. The basic algorithm for a right to left. We may get different solutions since the start- family with three generations is similar to the basic algo- ing points are different. After we obtain a solution using rithm for a nuclear family. Here we focus on the main the basic algorithm, we can use the horizontal local opti- nuclear family and give some special treatment for super children. Page 4 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 The basic algorithm for a family with three generations Dealing with missing individuals in the second generation Again, we consider a site at a time. Suppose that paternal- In this subsection, we deal with the cases where the geno- h [i], maternal-h [i], which-p [i] and which-m [i] of each type data for one of the parents in a second generation individual in the main nuclear family for site i are fixed. nuclear family is missing. The algorithm is similar to the Let us consider site i + 1. If the genotype data is known for basic algorithm for a family with three generations. Let C both parents, there are (at most) 4 different haplotype be a super child. Two cases arise. configurations of the two parents (first generation indi- viduals) fitting the given genotype data. If the genotype The genotype data for C 's spouse is missing data of a parent is missing, there are at most 8 different For this case, when we try to fix C .paternal-h [i + 1], haplotype configurations of the two parents. Let C , C ,..., C .maternal-h [i + 1], C .which-p [i + 1] and C .which-m [i + 1 2 j j j C be the n children in the second generation and some of 1] such that the number of breakpoints in the second gen- them might be super children. For each possible haplo- eration nuclear family represented by C between site i and type configurations of the two parents (first generation i + 1 is minimized as in A2, we have to guess the haplotyes individuals), we try to fix C .paternal-h [i + 1], C .maternal- of C 's spouse by trying all possible haplotypes AA, AB, BA j j j .which-p [i + 1] and C .which-m [i + 1] for every 's spouse is given, h [i + 1], C and BB. When the genotype data for C j j j child C (j = 1, 2,..., n) as follows: there are two choices. This will slightly slow down the pro- gram. Moreover, if there are more than one second gener- A1: if C is not a super child, we fix C .paternal-h [i + 1], ation nuclear families missing the super child's spouse's j j C .maternal-h [i + 1], C .which-p [i + 1] and C .which-m [i + genotype data, the speed will not be affected too much, j j j 1] such that the number of breakpoints of child C since children in the second generation are processed one between site i and i + 1 is minimized. Note that for each by one. child C , the number of breakpoints between site i and site The genotype data for C is missing i + 1 could be 0, 1 or 2. For this case, when we try to fix C .paternal [i + 1], C .mater- j j nal [i + 1], C .which-p [i + 1] and C .which-m [i + 1] such A2: if C is a super child, we fix C .paternal-h [i + 1], j j j j that the number of breakpoints in the second generation C .maternal-h [i + 1], C .which-p [i + 1] and C .which-m [i + j j j between site i and i + 1 is nuclear family represented by C 1] such that the number of breakpoints B in the second minimized as in A2, we have to guess the haplotyes of C by trying all possible haplotypes AA, AB, BA and BB with- generation nuclear family C represented by C between j j out genotype data. Again, the running time is still O(m × site i and i + 1 is minimized. Here if the genotype data for n) since children in the second generation are processed both C and C 's spouse is given, there are at most two j j one by one. choices for each of C and C 's spouse. We can call the basic j j Remarks algorithm to get B . Note that B could be greater than C C j j For the algorithm in [6], a top-down approach is used to deal with three-generation families. The algorithm proc- esses nuclear families (with two generations) one by one Again, when several choices are equally good for child C , from the top to the bottom. The old approach cannot give we arbitrarily select a choice. good solutions when the size of the main nuclear family is small. The reason is that the quality of solutions heavily Among all the possible haplotype configurations of the depends on the sizes of nuclear families and if the size of parents (first generation individuals) at site i + 1, we select the main nuclear family is small, the obtained haplotypes one such that the total number of breakpoints for all the for the (super) children in the second generation could be individuals in the three-generation family between site i wrong and this wrong information will be passed to the and site i + 1 is minimized. This process can be used recur- processing of the third generation. A better strategy is to sively to handle more than three generations. In fact, for work on big nuclear families first. However, even this our package, there is no limit on the number of genera- strategy is not as good as our approach here since we do tions in the family. Clearly, if the genotype data of both not fix the solution of super children in the second gener- parents in all nuclear families is known, the running time ation. We have observed that the inferred haplotypes for of the basic algorithm is O(mn), where m is the total big sized second generation families such as two parents number of individuals in the whole family and n is the and 5 children could be wrong though the breakpoint number of SNP sites in the chromosome. positions are very accurate. If the wrong haplotyes are used to handle other nuclear families, the whole linked region could be missed. Page 5 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Table 2: The experimental results for nuclear families when the genotype data for both parents is available. 2 children 3 children 4 children 5 children 6 children No. of Breakpoints 11.11 (10.59) 15.63 (15.50) 21.32 (21.28) 26.61 (26.60) 31.66 (31.60) precision 0.47 (0.50) 0.93 (0.98) 0.95 (0.97) 0.98 (0.99) 0.98 (0.99) recall 0.47 (0.48) 0.91 (0.96) 0.94 (0.97) 0.96 (0.98) 0.96 (0.98) length of real linked region (cM) 76.66 59.51 40.82 35.97 29.90 length of found linked region (cM) 78.70 (79.51) 60.79 (60.77) 42.15 (42.13) 37.00 (37.01) 30.97 (30.99) overlap/real 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) overlap/found 0.97 (0.96) 0.97 (0.97) 0.96 (0.96) 0.96 (0.96) 0.95 (0.95) linked region recovery 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) The row "No. of Breakpoints" indicates the number of breakpoints in the whole family. The row "precision" is the ratio of the number of correctly inferred breakpoints to the total number of inferred breakpoints. The row "recall" is the ratio of the number of correctly inferred breakpoints to the number of real breakpoints. The row "length of real linked region (cM)" is the length of the linked region generated in the simulation. The row "length of found linked region (cM)" is the length of the linked region calculated by our program. The row "overlap/real" is the length of the region shared by both the real linked region and the inferred linked region divided by the length of the real linked region. The row "overlap/found" is the length of the region shared by both the real linked region and the inferred linked region divided by the length of the inferred linked region. The row "linked region recovery" is the number of times that the inferred linked region contains the mutation site divided by the total number of experiments. In each cell, the first value is by our program and the one in brackets is by the program in [6]. Table 3: The experimental results for nuclear families when the genotype data of only one parent is available and this parent is diseased. 2 children 3 children 4 children 5 children 6 children No. of Breakpoints 3.68 (3.75) 7.47 (14.59) 10.32 (25.47) 12.90 (34.83) 15.48 (50.38) precision 0.33 (0.39) 0.78 (0.26) 0.89 (0.23) 0.94 (0.24) 0.96 (0.21) recall 0.23 (0.27) 0.72 (0.48) 0.86 (0.55) 0.91 (0.63) 0.92 (0.66) length of real linked region (cM) 83.14 57.98 45.62 35.15 31.01 length of found linked region (cM) 111.1 (91.45) 65.08 (66.32) 48.85 (49.38) 37.40 (38.02) 32.57 (32.80) overlap/real 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) overlap/found 0.86 (0.94) 0.93 (0.94) 0.94 (0.93) 0.93 (0.92) 0.93 (0.92) linked region recovery 1.00 1.00 1.00 1.00 1.00 (255/285) (229/285) (267/285) (276/285) (281/285) The meaning of each line is the same as in Table 2. Page 6 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Table 4: The experimental results for nuclear families when the genotype data of only one parent is available and this parent is normal. 3 children 4 children 5 children 6 children No. of Breakpoints 7.39 (13.47) 10.29 (24.47) 12.94 (35.69) 15.23 (55.84) precision 0.75 (0.31) 0.89 (0.25) 0.94 (0.24) 0.96 (0.19) recall 0.68 (0.53) 0.86 (0.59) 0.90 (0.65) 0.92 (0.66) length of real linked region (cM) 55.56 43.61 36.61 28.45 length of found linked region (cM) 67.66 (59.25) 43.90 (44.14) 36.87 (35.83) 29.23 (28.12) overlap/real 0.99 (0.99) 0.99 (0.98) 0.99 (0.98) 0.99 (0.97) overlap/found 0.85 (0.94) 0.99 (0.98) 0.99 (0.99) 0.98 (0.98) linked region recovery 1.00 1.00 1.00 1.00 (275/285) (274/285) (274/285) (266/285) The meaning of each line is the same as in Table 2. The current version of our program works for any number will simply delete both breakpoints that are within x SNP of generations. It can handle the case, where the genotype sites. We suggest setting the value of x based on the error data for one of a couple is missing. rate. When the error rate is smaller than 0.1%, x = 5. When the error rate is between 0.1% and 0.3%, x = 8. When the Genotype data error correction error rate is between 0.3% and 0.5%, x is set to be 10. For large-scale SNP genotyping, a certain number of When the error rate cannot be estimated, we simply use x experimental errors is unavoidable. We observe that when = 8. genotype data errors occur, the inferred haplotypes con- Identifying mutation regions tain many breakpoints that are close to each other. In order to get the correct allele sharing status, our algorithm After obtaining the inferred haplotype data for all the individuals in the family, we can find possible mutation regions by looking at the SNP sites one by one. The possi- Table 5: The experimental results for pedigrees P1 to P4. P1 P2 P3 P4 No. of Breakpoints 20.49 20.20 14.20 20.38 precision 0.58 0.77 0.49 0.57 recall 0.63 0.72 0.43 0.53 length of real linked region (cM) 34.69 36.42 43.89 41.28 length of found linked region (cM) 45.84 47.28 62.88 43.25 overlap/real 0.99 1 1 1 overlap/found 0.83 0.80 0.75 0.95 Pedi Figure 2 grees P1 to P4 Pedigrees P1 to P4. Pedigrees P1 to P4. A slash on an indi- linked region recovery 1111 vidual indicates that the genotype data for this individual is missing. The meaning of each line is the same as in Table 2. Page 7 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Table 6: The experimental results for pedigrees P5 to P8. P5 P6 P7 P8 No. of Breakpoints 20.81 (31.60) 26.37 (47.27) 38.83 (108.4) 44.65 (145.4) precision 0.86 (0.45) 0.96 (0.41) 0.97 (0.26) 0.90 (0.23) recall 0.82 (0.66) 0.94 (0.71) 0.93 (0.71) 0.92 (0.76) length of real linked region (cM) 38.58 19.77 21.92 12.37 length of found linked region (cM) 41.00 (43.28) 22.98 (23.51) 23.37 (24.57) 16.04 (16.20) overlap/real 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) overlap/found 0.93 (0.93) 0.86 (0.84) 0.91 (0.91) 0.79 (0.78) linked region recovery 1.00 1.00 1.00 1.00 (232/285) (266/285) (228/285) (265/285) The meaning of each line is the same as in Table 2. ble mutation regions should be regions shared by all or The current version of the package reports all the possible most of the diseased family members (considering phen- regions. The users are asked to input the maximum ocopy) but none or few of the normal family members number of normal individuals to be allowed to share the (considering penetrance). Those regions are reported as mutation allele (allowing for penetrance) and the maxi- suspected mutation regions. Due to the existence of mul- mum number of diseased individuals in the family to be tiple solutions and haplotype inference error, it is possible allowed not to share the potential mutation region that the reported regions do not completely overlap the (allowing for phenocopy). true mutation region. In our algorithm, we use a subrou- tine to extend the suspected mutation regions site by site Results and Discussion in both directions, where we can revise the haplotype In this section, we will test our software package using allele such that the allele is shared by diseased family simulated data. We have considered a wide range of pedi- members, but not shared by the normal family members. gree structures. Unlike the program in [6], here we have to extend in both directions. Table 7: The experimental results for pedigrees P9 to P12. P9 P10 P11 P12 No. of Breakpoints 23.43 (41.07) 18.46 (26.38) 37.44 (92.49) 20.68 (31.52) precision 0.95 (0.40) 0.92 (0.49) 0.91 (0.29) 0.94 (0.47) recall 0.91 (0.67) 0.89 (0.68) 0.89 (0.70) 0.91 (0.71) length of real linked region (cM) 29.80 26.06 22.26 22.05 length of found linked region (cM) 32.01 (36.72) 28.26 (31.93) 24.07 (24.53) 29.78 (30.57) overlap/real 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) overlap/found 0.90 (0.81) 0.91 (0.83) 0.90 (0.89) 0.80 (0.78) linked region recovery 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 1.00 (90/95) The meaning of each line is the same as in Table 2. Page 8 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Table 8: The experimental results for pedigrees P13 to P16. P13 P14 P15 P16 No. of Breakpoints 23.56 (46.66) 44.20 (124.0) 31.37 (48.25) 34.60 (69.26) precision 0.97 (0.37) 0.98 (0.27) 0.96 (0.47) 0.88 (0.38) recall 0.94 (0.71) 0.95 (0.75) 0.93 (0.71) 0.87 (0.75) length of real linked region (cM) 21.36 18.22 21.41 18.40 length of found linked region (cM) 25.61 (25.91) 19.68 (20.42) 25.12 (27.60) 21.81 (22.39) overlap/real 1.00 (1.00) 1.00 (1.00) 1.00 (1.00) 0.99 (1.00) overlap/found 0.86 (0.84) 0.90 (0.89) 0.88 (0.81) 0.86 (0.85) linked region recovery 1.00 (84/95) 1.00 (75/95) 1.00 (1.00) 1.00 (1.00) The meaning of each line is the same as in Table 2. Generating haplotype data using the Chi-square model nation with m equals 4 [6,12,13] and according to male/ In order to test the program, we generated haplotype data- female averaged genetic map for chromosome 1 down- sets based on the Chi-square model to see if our program loaded from HapMap [14]. The simulation process is can infer the haplotype data correctly from the corre- identical to that of [6,15]. When disease status was consid- sponding genotype data. The founder haplotypes were ered, a mutation was randomly assigned to be close to a obtained from real datasets (Affymetrix Human Mapping SNP site (called mutation site), and the diseased individu- 50 K/250 K GeneChips [11]), and children haplotypes als were forced to inherit the mutation strand and the nor- were generated through random inheritance of paternal/ mal individuals were forced not to inherit the mutation maternal alleles using the Chi-square model for recombi- strand. This process is done generation by generation. Pedi Figure 3 grees P5 to P8 Pe Figure 4 digrees P9 to P12 Pedigrees P5 to P8. Pedigrees P9 to P12. Page 9 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 A pedig Figure 7 ree for case study A pedigree for case study. point is within 20 SNPs away from the real location. From Table 2, we can see that all the elements in the row "over- lap/real" are 1 indicating that our package can always identify the whole mutation region. The elements in the row "overlap/found" are about 0.97 or 0.96 indicating that the sizes of the reported regions are slightly larger than (almost identical to) that of the real mutation regions. The values of recall and precision are also very Pedi Figure 5 grees P13 to P14 good when there are more than two children. The values Pedigrees P13 to P14. in brackets are by the program in [6]. We can see that both programs perform very well in those cases. Nuclear families with data available for both parents Nuclear families with data available for single parents Let us consider a nuclear family as shown in Figure 1. Let us consider nuclear families with a single parent. In When the genotype data for both parents is given, the this case, the genotype data for the other parent is missing. results are shown in Table 2. Here we test the cases, where About half of the children are diseased. There are two there are 2, 3,..., 6 children in the nuclear family. The cases: (1) the available parent is diseased and (2) the results are listed in column 2, 3,..., 6, respectively. We have available parent is normal. The results are shown in Table selected individuals to form 95 couples as in [6]. We have 3 and Table 4, respectively. Each cell in the tables contains done experiments on 285 datasets (three times for each two values. The first one is by our program and the one in couple) and calculated the average. For all the simula- brackets is by the program in [6]. We consider the cases tions, the number of data sets is a multiple of 95. Each cell where there are 2, 3,..., 6 children. For both cases, we can in the table contains two values. The first one is by our see that for our program, the precision and the recall are program and the one in brackets is by the program in [6]. slightly worse than that in Table 2. The length of the A breakpoint is correctly inferred if the inferred break- inferred region is slightly longer. Most importantly, all the elements in the row "linked region recovery" are 1 for our program. This indicates that our program can always find the linked region that contains the mutation site. We can also see that our new program performs much better than the program in [6] in those cases. Comparing case 1 and case 2, we can see that case 2 is harder to handle. In case 2, the result for 2 children is very poor and this case is not solvable by our program. Complicated pedigrees We did experiments on some complicated pedigrees. First we study some pedigrees, where the genotype data for the super children in the second generation (shared by two Pedi Figure 6 grees P15 to P16 Pedigrees P15 to P16. nuclear families) is missing. See P1–P4 in Figure 2. A slash on an individual indicates that the genotype data for this Page 10 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Th Figure 8 e simulated grand-paternal haplotype allele sharing status among all members, excluding grandmother The simulated grand-paternal haplotype allele sharing status among all members, excluding grandmother. Each individual has two haplotypes, paternal haplotype and maternal haplotype. Note that individual D is the founder of the dis- ease in the family. The two haplotypes of individual D are shown as 1-th haplotype and 0-th haplotype at the top and at the bottom, respectively. Any other individual may inherit alleles from both 0-th haplotype and 1-th haplotype of individual D. Thus, any other individual in the family appears twice (upper part and lower part) in the figure. The upper part shows the seg- ments inherited from the 1-th haplotype of D and the lower part shows the segments inherited from the 0-th haplotype of D. The segments from diseased individuals are red and the segments from normal individuals are blue. The simulated (true) linked region is indicated by the horizontal black double-direction arrow in the middle. individual is missing. This setting applies to all the figures results are listed in Table 6. From the row "linked region in the paper. recovery", we can see that, if the nuclear family of the first generation has a single parent, and the nuclear family of We have done 285 experiments on each of P1–P4. From the second generation has at least three (with both par- Table 5, we can see that the inferred region of our program ents) or four (with single parent) children, we can always can almost always cover the entire real linked region. The find the linked region that contains the mutation site. The length of the inferred linked region is a bit (about 15%) program in [6] missed the mutation site at a rate of about longer than that of the real linked region. From the row 13.07%. Now we consider pedigrees P9–P16 in Figures 4, "linked region recovery", we can see that our program can 5 and 6. The experimental results are shown in Table 7 always precisely find the linked region that contains the and Table 8, respectively. We can see that for common mutation site. Note that the program in [6] cannot handle pedigrees of different structures, our program can always P1–P4 at all. This is a significant improvement in our new find the linked region containing the mutation site and program. Let us consider pedigrees P5–P8 in Figure 3, our program reports the linked region more precisely than where individuals with missing genotype data are not the program in [6]. Note that the program in [6] some- super children (not shared by two nuclear families). We times missed the mutation sites. From all the listed exper- have done 285 experiments for each of the pedigrees. The imental results, we can see that our program has much Page 11 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Th Figure 9 e inferred grand-paternal haplotype allele sharing status among all members, excluding grandmother The inferred grand-paternal haplotype allele sharing status among all members, excluding grandmother. higher precision and recall than the program in [6], which Note that individual D is the founder of the disease in the indicates that our new program can infer the haplotypes family. The two haplotypes of individual D are shown as more precisely. 1-th haplotype and 0-th haplotype at the top and at the bottom, respectively. Any other individual may inherit A Case Study alleles from both 0-th haplotype and 1-th haplotype of We studied a pedigree to see if the program can correctly individual D. Thus, any other individual in the family identify the linked region when some of the family mem- appears twice (upper part and lower part) in the figure. bers are purposely excluded. The pedigree is shown in Fig- The upper part shows the segments inherited from the 1- ure 7. The simulated allele sharing status and the linked th haplotype of D and the lower part shows the segments region are shown in Figure 8. Each individual has two inherited from the 0-th haplotype of D. The segments haplotypes, paternal haplotype and maternal haplotype. from diseased individuals are red and the segments from Table 9: The experimental results for genotype data error correction using Affymetrix 50 K GeneChips data. precision recall linked region recovery overlap/real overlap/found No error 97.74% 95.05% 100% 99.98% 92.46% (69.62%) (86.56%) (100%) (100%) (91.78%) 0.1% error 84.40% 94.94% 100% 99.93% 92.00% (67.19%) (86.19%) (100%) (100%) (91.13%) 0.5% error 79.00% 93.76% 100% 99.84% 92.57% (61.81%) (86.10%) (100%) (100%) (91.94%) Page 12 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Table 10: Comparison with Merlin using Affymetrix 50 K Table 11: Comparison with Merlin using Affymetrix 250 K GeneChips. GeneChips. Run time (s) overlap/real overlap/found Run time (s) overlap/real overlap/found 0+3 16.717 (0.898) 0.941 (0.805) 0.605 (0.862) 0+3 67.680 (2.979) 0.954 (0.975) 0.590 (0.866) 0+4 23.325 (1.126) 0.928 (0.824) 0.594 (0.930) 0+4 97.082 (4.067) 0.965 (0.988) 0.548 (0.955) 0+5 30.634 (1.746) 0.941 (0.901) 0.587 (0.936) 0+5 124.43 (6.558) 0.983 (0.994) 0.607 (0.960) 0+6 39.028 (3.901) 0.987 (0.929) 0.576 (0.969) 0+6 131.71 (12.50) 0.994 (0.992) 0.675 (0.986) 2+3(4 bits) 3.276 (1.340) 1.000 (0.928) 0.960 (1) 2+3(4 bits) 9.454 (7.122) 1 (0.936) 0.995 (1) 2+4(6 bits) 4.596 (1.735) 0.9997 (0.937) 0.967 (0.990) 2+4(6 bits) 11.625 (8.133) 1 (0.989) 0.988 (1) P5(7 bits) 7.714 (2.325) 0.999 (0.933) 0.919 (0.979) P5(7 bits) 16.662 (8.262) 1 (0.892) 0.983 (0.938) P9(9 bits) 7.365 (3.255) 0.999 (0.879) 0.926 (0.958) P9(9 bits) 24.18 (16.65) 1.000 (0.934) 0.973 (1) P10(11 bits) 19.15 (8.829) 0.999 (0.797) 0.912 (0.968) P10(11 bits) 38.54 (31.39) 1.000 (0.797) 0.970 (0.938) P11(12 bits) 11.37 (14.01) 1.000 (0.938) 0.924 (0.979) P11(12 bits) 29.94 (415.18) 1.000 (0.991) 0.972 (1) P12(13 bits) 25.62 (31.57) 1.000 (0.922) 0.842 (0.958) P12(13 bits) 49.28 (1237.7) 1.000 (0.968) 0.871 (1) P7(14 bits) 13.58 (58.14) 1.000 (0.949) 0.933 (0.990) P7(14 bits) 31.611 (NA) 1(NA) 0.960(NA) P15(14 bits) 46.96 (43.15) 1.000 (0.906) 0.894 (0.979) P15(14 bits) 103.269 (NA) 1(NA) 0.927(NA) P13(15 bits) 23.72 (1282.05) 1.000 (0.970) 0.875 (1) P13(15 bits) 54.114 (NA) 1.000(NA) 0.881(NA) P16(15 bits) 37.93 (384.78) 0.988 (0.950) 0.877 (0.990) P16(15 bits) 141.266 (NA) 0.991(NA) 0.922(NA) P14(16 bits) 15.042 (NA) 1.000 (NA) 0.927 (NA) P14(16 bits) 52.911 (NA) 1(NA) 0.979(NA) P6(17 bits) 40.754 (NA) 1.000 (NA) 0.869 (NA) P6(17 bits) 63.984 (NA) 1.000(NA) 0.862(NA) Each cell contains two values. The first one is by our program and the Each cell contains two values. The first one is by our program and the one in brackets is by Merlin. Pedigree 0+i for i = 3, 4, 5 and 6 stands one in brackets is by Merlin. Pedigree 0+i for i = 3, 4, 5 and 6 stands for a nuclear family with i children and the genotype data for parents for a nuclear family with i children and the genotype data for both is not available. Pedigree 2+i for i = 3 and 4 stands for a nuclear family parents is not available. Pedigree 2+i for i = 3 and 4 stands for a with i children and the genotype data for both parents is available. nuclear family with i children and the genotype data for both parents "NA" means that the program failed to execute on this pedigree. is available. "NA" means that the program failed to execute on this pedigree. normal individuals are blue. At any position in the chro- mosome, each second generation individual gets an allele bps). The inferred linked region remains the same if we from D that is either from the 0-th haplotype or 1-th hap- exclude individual E or F. The inferred linked region also lotype. For the third generation individual, say, C1, the remains unchanged if we simultaneously remove E, M allele at a position may or may not be from D. The simu- and N. When we simultaneously remove E, K, N, D1, D2 lated (true) linked region is from 189.61 cM to 211.55 cM and D3, the inferred linked region is from 189.61 cM to (physical position from 169183745 bps to 197201161 225.48 cM, which is only slightly enlarged. When we bps) indicated by the horizontal black double-direction simultaneously remove E, F, M, C1, C2, C3 and C4, the arrow in the middle. The inferred configuration is shown inferred linked region is again enlarged and the region is in Figure 9. We can see that the inferred configuration is from 172.07 cM to 211.63 cM. When F, K, N, D1, D2 and roughly the same as the simulated configuration except D3 are simultaneously removed from the pedigree, the that it is upside down. The inferred linked region is from inferred linked region is from 189.61 cM and 221.90 cM. 189.61 cM(169174855 bps) to 211.63 cM(197295627 Page 13 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 Genotype data error correction the genotype data of one parent is missing for the entire To test the effect of genotype data errors on our program, chromosome. we did experiments on the pedigree in Figure 7. We gener- ated genotype errors by randomly changing the genotype Availability and requirements value (which is one of AA, BB and AB) into a different Project Name: CityU 121608 value (which is one of the other two values) at a position with probability 0.1% and 0.5%, respectively. We simu- Project Homepage: http://www.cs.cityu.edu.hk/~lwang/ lated 475 data sets. The length of real linked region ranges software/Link from 0.76 cM to 65.17 cM. The experimental results are shown in Table 9. For each cell, there are two values. The Operating system(s): Platform independent first one is by our program and the one in brackets is by the program in [6]. We can see that our program performs Programming language: Java better and can always recover the real linked regions. Other requirements: Java 1.6.0 or higher Comparison with Merlin We compared our program with Merlin [16]. We did the Licence: None experiments on a PC with a CPU of 3.0 GHz and 1.00 GB memory. The results are shown in Table 10 and Table 11 Any restrictions to use by non-academics: None for Affymetrix 50 K GeneChips and Affymetrix 250 K GeneChips, respectively. We have also considered differ- Authors' contributions ent kinds of pedigrees. LW carried out the algorithm design, provided some of initial pseudo codes, checked part of computer codes, par- Running time ticipated in the design of experiments, and drafted the For small sized families, both programs can generate manuscript. ZW produced the computer program, carried results in a few seconds. When the sizes of the families and out the experiments, drafted experiment results, and par- the number of the markers increase, the running time of ticipated in algorithm design. WY initiated the study, par- our program increases linearly. For large families, Merlin ticipated in the design of experiments, polished the requires really long running time. Most importantly, Mer- manuscript, and was in charge of communication with lin needs large memory space for big sized families and the editor. All the authors read and approved the final cannot successfully complete the computation for some manuscript. pedigrees (see P14 and P6 in Table 10 and Table 11). Acknowledgements We thank the referees for their helpful suggestions. Lusheng Wang is fully Output Quality supported by a grant from the Research Grants Council of the Hong Kong We again use "overlap/real" and "overlap/found" to indi- Special Administrative Region, China [Project No. CityU 121608]. cate the quality of the computational results. Our pro- gram always clearly gives a computed candidate region for References each input. Merlin calculates a LOD score for each marker. 1. The International HapMap Consortium: A second generation We evaluated the segment with the highest LOD score as human haplotype map of over 3.1 million SNPs. Nature 2007, 449:851-861. the output linked region for Merlin. From Table 10 and 2. Leykin I, Hao K, Cheng J, Meyer N, Pollak M, Smith R, Wong W, Rose- Table 11, we can see that the results of our program are now C, Li C: Comparative linkage analysis and visualization of less than optimal when parental data is not available at high-density oligonucleotide SNP array data. BMC Genet 2005, 6(1):7. all. When the family size becomes bigger, our program 3. Sellick G, Longman C, Tolmie J, Newbury-Ecob R, Geenhalgh L, outperforms Merlin. Our program can quickly produce Hughes S, Whiteford M, Garrett C, Houlston R: Genomewide link- accurate linked regions when the family size is big while age searches for mendelian disease loci can be efficiently conducted using high-density SNP genotyping arrays. Nucleic Merlin failed to execute on big sized families. Acids Res 2004, 32:e164. 4. Elston R, Stewart J: A general model for the genetic analysis of pedigree data. Hum Hered 1971, 21(6):523-542. Conclusion 5. Lander E, Green P: Construction of multilocus genetic linkage We have developed a software package that infers the hap- maps in humans. Proc Natl Acad Sci USA 1987, 84:2363-2367. lotype allele sharing status for the members of a pedigree 6. Lin G, Wang Z, Wang L, Lau YL, Yang W: Identification of linked regions using high-density SNP genotype data in linkage based on the minimum recombinants model. The run- analysis. Bioinformatics 2008, 24(1):86-93. ning time of the program is linear in terms of the input 7. Qian D, Beckmann L: Minimum recombinant haplotyping in pedigrees. Am J Hum Genet 2002, 70:1434-1445. size O(mn), where m is the total number of individuals in 8. Li J, Jiang T: Efficient inference of haplotypes from genotypes the whole family and n is the number of SNP sites in the on a pedigree. J Bioinform Comput Bio 2003, 1:41-69. chromosome. The new package can handle a wide range 9. Doi K, Li J, Jiang T: Minimum recombinant haplotype configu- ration on tree pedigrees. Proc. of the 3rd Annual Workshop on Algo- of pedigree structures. It works very well for cases where rithms in Bioinformatics (WABI'03) 2003:339-353. Page 14 of 15 (page number not for citation purposes) BMC Bioinformatics 2009, 10:216 http://www.biomedcentral.com/1471-2105/10/216 10. Li J, Jiang T: Computing the minimum recombinant haplotype configuration from incomplete genotype data on a pedigree by integer linear programming. Journal of Computational Biology 2005, 12:719-739. 11. Affymetrix Human Mapping GeneChips [http://www.affyme trix.com/products_services/arrays/specific/100k.affx] 12. Broman K, Weber J: Characterization of human crossover interference. Am J Hum Genet 2000, 66(6):1911-1926. 13. Zhao H, Speed T, McPeek M: Statistical analysis of crossover interference using the chi-square model. Genetics 1995, 139(2):1045-1056. 14. International HapMap Project [http://www.hapmap.org] 15. Yang W, Wang Z, Wang L, Sham P, Huang P, Lau Y: Predicting the number and sizes of IBD regions among family members and evaluating the family size requirement for linkage studies. European Journal of Human Genetics 2008, 16:1535-1543. 16. Abecasis G, Cherny S, Cookson W, Cardon L: Merlin – rapid anal- ysis of dense genetic maps using sparse gene flow trees. Nature Genetics 2002, 30:97-101. Publish with Bio Med Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime." Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 15 of 15 (page number not for citation purposes)

Journal

BMC BioinformaticsSpringer Journals

Published: Jul 15, 2009

There are no references for this article.