Journal of Heredity Advance Access originally published online on June 15, 2007
Journal of Heredity 2007 98(5):461-467; doi:10.1093/jhered/esm027
Analysis of the Unassembled Part of the Dog Genome Sequence: Chromosomal Localization of 115 Genes Inferred from Multispecies Comparative Genomics
From the CNRS UMR6061 Génétique et Développement, Université de Rennes 1, IFR140, 2 Av du Pr Léon Bernard, CS 34317, 35043, Rennes, France
Address correspondence to C. Hitte at the address above, or e-mail: hitte{at}univ-rennes1.fr.
The identification of dog genes and their accurate localization to chromosomes remain a major challenge in the postgenomics era. The 132 annotated canine genes with human orthologs remaining in the unassembled part (chrUnknown) of the dog sequence assembly (CanFam1) are of limited use for candidate gene approaches or comparative mapping studies. We used a two-step comparative analysis to infer a canine chromosomal interval for localization of the chrUn genes. We first constructed a human–dog synteny map, using 14 456 gene-based comparative anchors. We then mapped the 132 chrUn genes onto the reference (human) synteny map and identified the corresponding, orthologous segment on the canine map, based on conserved gene order. Our results show that 110 chrUn genes could be localized to short intervals on 18 dog chromosomes, whereas 22 genes remained assigned to 2 possible intervals. We extended this comparative analysis to multiple species, using the chimpanzee, mouse, and rat genome sequences. This made it possible to narrow down the intervals concerned and to increase the number of canine chrUn genes with an inferred chromosome location to 115. This study demonstrates that dog chromosomal intervals for chrUn genes can be rapidly inferred, using a reference species, and indicates that comparative strategies based on larger numbers of species may be even more effective.
Many large-scale mapping and sequencing projects have been completed in the last 10 years (Lander et al. 2001; Waterston et al. 2002; Gibbs et al. 2004; Hitte et al. 2005; Lindblad-Toh et al. 2005). This has made it possible to compare the genomes of different species and to study evolutionary changes (Hardison 2003; Murphy et al. 2005). The emerging field of comparative genomics has already yielded outstanding results in domains such as speciation and evolutionary studies (Hillier et al. 2004; Jaillon et al. 2004), genome annotation (Ashurst et al. 2005), and the identification of new sets of functional elements within annotated genomes (Dujon et al. 2004). Ongoing low-coverage sequencing projects will also provide additional resources for many model organisms commonly used as human surrogates for research (Margulies et al. 2005, http://www.genome.gov/10002154).
Synteny maps are generated by identifying unambiguous orthologous sequence pairs across species (Pan et al. 2005; Liang and Dandekar 2006). These pairs, known as comparative anchors (Chen et al. 1999), are connected to show regions of conserved synteny and break point regions. Regions of conserved synteny are composed of conserved segments (CS) and conserved ordered segments (CSO) (O'Brien et al. 1993). CS are segments shared by 2 or more species that contain orthologous anchor markers with no notion of order. CSO are conserved segments that run continuously, with the same orientation and the same order of genes, reflecting intrachromosomal rearrangements occurring during evolution (Pevzner and Tesler 2003). The extent of gene-order conservation within CSO depends both on the phylogenetic distance between organisms and on the frequency of species-specific rearrangements since divergence from the last common ancestor (Kirkness et al. 2003). Analyses of CS and CSO gene content across species are commonly carried out in studies of gene family expansion or contraction and to facilitate the identification and annotation of orthologous genes (Fischer et al. 2001; Hardison 2003; Zheng et al. 2005).
We report here a strategy making use of the large CSO revealed by synteny maps and gene adjacency information for the canine and human genomes to refine the localization of the canine gene repertoire. We applied this method to the unassembled part of the canine genome sequence (CanFam1) and were able to localize 115 canine chrUn genes to short chromosomal segments. We extended this approach to a multispecies comparative analysis including chimpanzee, rat, and mouse to refine chrUn gene localization.
Materials and Methods
Gene Data Sets
Orthologous gene data were downloaded from Ensembl v39 with the BioMart tool (http://www.ensembl.org/Multi/martview). Data sets for human–dog, chimpanzee–dog, mouse–dog, and rat–dog pairs were downloaded in turn. Ensembl describes several categories of orthology: one-to-one, one-to-many, and many-to-many. We used this classification to extract one-to-one orthologs. Data sets were stored in a MySQL database. Synteny maps were built and drawn with the AutoGRAPH program (http://genoweb.univ-rennes1.fr/tom_dog/AutoGRAPH/), using one-to-one orthologs as comparative anchors.
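As an illustration of the filtering step, the following minimal Python sketch keeps only the one-to-one pairs from a tab-delimited orthology export. The column names, the sample rows, and the `ortholog_one2one` label are assumptions made for the example; they are not taken from the data sets used in this study.

```python
import csv
import io

# Hypothetical tab-delimited export; column names and rows are illustrative only.
SAMPLE_TSV = (
    "human_gene\tdog_gene\torthology_type\n"
    "H1\tD1\tortholog_one2one\n"
    "H2\tD2\tortholog_one2many\n"
    "H2\tD3\tortholog_one2many\n"
    "H3\tD4\tortholog_one2one\n"
    "H4\tD5\tortholog_many2many\n"
)


def one_to_one_anchors(handle, type_column="orthology_type",
                       wanted="ortholog_one2one"):
    """Return (human_gene, dog_gene) pairs whose orthology type equals `wanted`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["human_gene"], row["dog_gene"])
        for row in reader
        if row[type_column] == wanted
    ]


anchors = one_to_one_anchors(io.StringIO(SAMPLE_TSV))
print(anchors)
```

Pairs flagged as one-to-many or many-to-many are simply dropped, which mirrors the decision to use only unambiguous anchors.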
Colinearity Rate
The colinearity rate was calculated as the proportion of genes from the target species in the same order as the genes of the reference species. It was determined for CSOs containing at least 3 genes. If gene order is perfectly conserved, then the colinearity rate is 100%. The algorithm used by AutoGRAPH makes it possible to relax constraints on colinearity rate. We set a gap penalty threshold of 5, making it possible to include a gene in a CSO, even if gene order is not conserved within a range of 5 positions (http://genoweb.univ-rennes1.fr/tom_dog/AutoGRAPH/Tutorial.php).
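One possible reading of this definition can be sketched in Python as follows. The sketch takes the target-species coordinates of the anchors listed in reference order and reports the share of genes belonging to the longest run that is consistently ordered in either orientation. This is an assumption made for illustration, not a description of the AutoGRAPH algorithm, and the gap penalty is not modeled.

```python
from bisect import bisect_left


def longest_increasing_length(values):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []
    for v in values:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)


def colinearity_rate(target_positions):
    """Percentage of genes in conserved order within one segment.

    target_positions: target-species coordinates of the anchor genes,
    listed in the order in which the genes occur in the reference species.
    Both orientations are tried, since a segment may be inverted as a whole.
    """
    n = len(target_positions)
    if n < 3:
        raise ValueError("the rate is only defined here for segments of at least 3 genes")
    forward = longest_increasing_length(target_positions)
    reverse = longest_increasing_length([-p for p in target_positions])
    return 100.0 * max(forward, reverse) / n


print(colinearity_rate([10, 20, 30, 40]))      # perfectly conserved order
print(colinearity_rate([10, 30, 20, 40, 50]))  # one gene out of order
```

With this definition, a segment inverted as a whole still scores 100%, and a single misplaced gene among five lowers the rate to 80%.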
Inference of Canine Intervals Serving as Candidates for chrUn Gene Localization
Genomic intervals containing a human gene with an ortholog in the canine chrUn pool were defined based on the closest human flanking genes with one-to-one orthologs. Only flanking orthologous pairs in conserved order were used for the inference of canine ortholog intervals as candidates for chrUn gene localization. The same method was applied to 3 other reference species: chimpanzee, mouse, and rat.
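The inference rule can be sketched in Python under simplifying assumptions. Genes are listed in reference (human) order, each carrying the chromosome and a single coordinate of its dog ortholog, or `None` when the ortholog lies in chrUn. The `Anchor` type, the function name, and the example coordinates are hypothetical, and the check that the flanking pairs themselves lie in conserved order is omitted for brevity.

```python
from typing import NamedTuple, Optional, Sequence


class Anchor(NamedTuple):
    gene: str
    dog_chr: Optional[str]  # None when the dog ortholog is in chrUn
    dog_pos: Optional[int]  # coordinate on the dog chromosome (bp)


def infer_interval(ordered: Sequence[Anchor], query: str):
    """Infer a dog interval for `query` from its closest placed neighbours.

    Returns (chromosome, start, end); end is None when only one flank exists
    (interval open toward the chromosome end). Returns None when the flanks
    lie on different dog chromosomes (break point) or no flank is available.
    """
    idx = next(i for i, a in enumerate(ordered) if a.gene == query)
    left = next((a for a in reversed(ordered[:idx]) if a.dog_chr is not None), None)
    right = next((a for a in ordered[idx + 1:] if a.dog_chr is not None), None)
    if left is None and right is None:
        return None
    if left is None or right is None:
        flank = left if left is not None else right
        return (flank.dog_chr, flank.dog_pos, None)
    if left.dog_chr != right.dog_chr:
        return None
    start, end = sorted((left.dog_pos, right.dog_pos))
    return (left.dog_chr, start, end)


human_order = [
    Anchor("A", "CFA28", 1000),
    Anchor("B", "CFA28", 2000),
    Anchor("U", None, None),      # gene whose dog ortholog is in chrUn
    Anchor("C", "CFA28", 3500),
    Anchor("D", "CFA6", 500),
]
print(infer_interval(human_order, "U"))
```

In this toy example the chrUn gene `U` is bracketed by `B` and `C`, so the inferred interval is the stretch of CFA28 between them.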
Repeat Content
Repeat sequence content was calculated as the ratio of cumulative repeat sequence size to the total size of the chromosome sequence (the UCSC repeat content table can be downloaded from http://genome.ucsc.edu/cgi-bin/hgTables).
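The ratio can be illustrated with a short Python sketch. Repeat annotations are given as half-open `(start, end)` intervals; overlapping annotations are merged before summing so that no base is counted twice, which is a choice made for this example and is not stated in the method above. The interval values are invented.

```python
def covered_fraction(intervals, total_length):
    """Fraction of a sequence of `total_length` bp covered by the intervals.

    intervals: iterable of half-open (start, end) pairs in bp.
    Overlapping or touching intervals are merged before summing.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / total_length


repeats = [(0, 100), (50, 150), (300, 350)]  # invented coordinates
print(covered_fraction(repeats, 1000))
```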
CanFam2 Analysis
Canine chrUn gene sequences and their flanking ortholog sequences obtained from the comparative study were aligned with the CanFam2 assembly, using Blat version 33. The results of chromosomal assignment studies and the order of the 3 sequence alignments were used for comparison with the comparative genomics analysis.
Results and Discussion
Genes from the Unassembled Part of the Canine Sequence Assembly (chrUn)
The unassembled part of a genome sequence assembly is usually placed in a "chromosome unknown" pool (chrUn), corresponding to sequences that cannot be effectively localized to particular chromosomal positions or assembled into sizeable contigs. The 132 protein-coding genes with a single human ortholog extracted from the canine chrUn pool account for only 26% of the 506 genes present in this pool. This proportion is significantly smaller than that for Ensembl (Birney et al. 2006) one-to-one orthologs found for all other canine chromosomes (CFA) (82%). The precise reason for this lack of orthology remains unclear, but the sequencing problems encountered in sequence assembly may also be involved in ortholog identification. The high repetitive sequence content of the chrUn pool (43% repeat elements versus only 35% in the assembled CFA; Karolchik et al. 2004) may account for problems with the identification of real orthologs of chrUn genes by sequence alignment or phylogenetic studies. The human orthologs of the 132 chrUn genes are widely dispersed over the entire set of human chromosomes (except HSA4), 4 of which (HSA 7, 10, 17, and 21) contain 45% of the genes mapping in clusters at the ends of chromosomes (Table 1).
Selection of Human/Dog Comparative Anchors and Synteny Map Construction
We downloaded canine protein-coding genes from Ensembl v39 (EnsMart tool, Kasprzyk et al. 2004) and selected the 14 456 annotated genes that had a one-to-one orthologous relationship with a human gene. One-to-many and many-to-many orthologous relationships were excluded from the analysis to maintain data reliability and to avoid uncertainties in the subsequent synteny analysis. To build a human–dog synteny map, we used AutoGRAPH, a web server developed in our laboratory (Derrien et al. 2007, Materials and Methods) that formalizes the construction of comparative maps and makes it possible to relax marker-order conservation criteria through an adjacency penalty value. Based on this synteny map, we were able to identify 222 CSOs, with a mean length of 11.8 Mb, containing a mean of 91 genes (range 3–493) (Figure 1). Break points, the regions separating CSOs, were identified and characterized on the basis of size (mean size 806 kb), as defined by the immediately flanking one-to-one ortholog pairs. Gene-order conservation within CSOs containing at least 3 genes was evaluated by determining the colinearity rate, corresponding to the percentage of genes in the same order in the target (dog) and reference (human) species. Mean colinearity rate was 94%, indicating a high level of gene adjacency conservation within chromosomal segments inherited without rearrangements from the last common ancestor of dogs and humans. The proportion of genes for which order does not appear to be conserved (6%) in the human–dog synteny map may correspond to microrearrangements that have shaped genome evolution or to dubious orthologous relationships associated with the expansion or contraction of gene families (Goodstadt and Ponting 2006), which make it difficult to identify true orthologs. Sequence assembly errors cannot be ruled out either.
Inference of Orthologous Chromosomal Intervals from the Synteny Map
We used one-to-one orthologous gene-based comparative anchors to ensure the construction of a dense and reliable human–dog synteny map (Figure 1A). However, the accuracy of comparative maps may be decreased by errors in the definition of orthology, spurious gene annotations, or dubious genomic coordinates leading to the misinterpretation of gene-order conservation. We assessed the robustness of the conserved gene-order approach (Zheng et al. 2005) by selecting, at random, 1000 genes widely dispersed over all the human chromosomes and using these genes to evaluate the likelihood of identifying the chromosomal position of the dog gene based on the gene coordinates in the reference species (human). We masked the canine genomic localization and mapped the 1000 reference species genes on the human–dog synteny map. We applied the conserved gene-order rule to identify the orthologous segment on the canine map, based on the closest flanking one-to-one orthologs (Figure 1B). We found that 93.2% of the 1000 genes tested were correctly localized on canine chromosomes, within an interval of 347 kb on average, as defined by the closest flanking one-to-one orthologs. For 3.1% of the genes, it was possible to infer position to within 170 kb, based on the interval defined by the closest flanking orthologs. This may be due to short microrearrangements interrupting colinearity. It may also reflect inaccuracy in the coordinates of the gene in the human or dog genome sequence. For 2.4% of the genes, the predicted interval mapped to the junction of 2 CS of the synteny map, thereby precluding assignment to a single canine chromosome (Figure 1C). For the remaining 1.3%, the position on the canine interval was not correctly inferred because they correspond to singleton genes.
Mapping the chrUn Genes on the Synteny Map
When producing the human–dog synteny map, we mapped the human genes orthologous to the 132 canine chrUn genes in the reference species. We then identified the segment defined by the closest flanking genes with a one-to-one canine ortholog (Figure 1B). We applied the conserved gene-adjacency approach to identify the orthologous segment in the canine sequence most likely to contain the chrUn gene.
Considering only segments corresponding to CSOs containing at least 3 genes, we were able to identify a corresponding canine segment for 110 chrUn genes, the remaining 22 mapping to break point regions, preventing their assignment to a single interval on a single canine chromosome. The 110 corresponding canine intervals were distributed over 18 chromosomes, including 4 large segments on chromosomes 6, 18, 28, and 31, with up to 22 chrUn genes predicted on CFA28 (Table 1). Not surprisingly, these 4 segments correspond to genes also mapping in clusters at the ends of 4 human chromosomes (HSA 7, 10, 17, and 21; Table 1). The size of the 110 dog chromosomal segments was then evaluated as the distance between the 2 closest flanking one-to-one orthologs (or one ortholog and the chromosome end) (Table 2 and Supplementary Table 3). We were able to define 44 intervals based on flanking orthologous genes on either side of the interval, and 66 intervals at the end of the chromosome were defined based on a flanking gene on one side and the telomere on the other. The mean length of the defined intervals was 415 kb, markedly longer than the interval size determined in the pilot analysis on 1000 randomly selected genes (347 kb), presumably due to overestimation of the size of intervals defined by an ortholog on one side and the telomere on the other. The large number of predicted intervals mapping to chromosome ends may be accounted for by difficulties in the assembly of these regions, due to problems retaining chromosome ends during construction of the clone library. Sequence coverage is lower for these regions than for other regions, and they are therefore less likely to be assembled.
We searched for gaps in the canine assembly and calculated the cumulative size of gaps, represented as stretches of N in the sequence (Karolchik et al. 2004; http://genome.ucsc.edu/cgi-bin/hgTables). Gap sequences were found to account for 9% of the intervals predicted by inference to contain a chrUn gene. This value is significantly higher than that for random intervals (n = 1000), in which gaps account for 1.5% of the sequence. We also found that 32% of the inferred intervals in dog contained large sequence gaps (arbitrarily set to 1000 bp in the assembly), corresponding to regions at the junction between supercontigs and related only by the physical map (UCSC server). The corresponding proportion was only 0.4% for random studies of 1000 intervals. Thus, intervals predicted, by inference, to contain chrUn genes contained a significantly higher proportion of gap sequences than randomly selected intervals. These findings are consistent with gaps in the inferred interval corresponding to chrUn gene sequences grouped in the chrUn pool.
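The gap measurement described above can be sketched in Python by scanning a sequence for runs of N. The function name and the toy sequence are hypothetical; the 1000 bp threshold for a large gap follows the value given in the text and is exposed as a parameter.

```python
import re


def gap_summary(sequence, large_gap=1000):
    """Summarize assembly gaps, represented as runs of N, in a sequence.

    Returns the fraction of the sequence made of N and whether at least
    one run reaches `large_gap` bp.
    """
    runs = [m.end() - m.start() for m in re.finditer(r"[Nn]+", sequence)]
    return {
        "gap_fraction": sum(runs) / len(sequence) if sequence else 0.0,
        "has_large_gap": any(r >= large_gap for r in runs),
    }


toy_interval = "ACGT" * 10 + "N" * 10 + "ACGT" * 10  # 90 bp with one 10 bp gap
print(gap_summary(toy_interval))
print(gap_summary(toy_interval, large_gap=10))
```

Applying the same summary to inferred intervals and to randomly drawn intervals gives the two quantities compared in this section.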
Multispecies Comparative Analysis
We constructed pairwise synteny maps with the dog as the target genome and the chimpanzee, mouse, and rat as reference genomes. These maps contained 12 204, 14 171, and 13 309 comparative anchors annotated by Ensembl between dog and chimpanzee, mouse, and rat, respectively. We mapped 95% (126/132) of the chrUn genes on the reference species in at least one of the pairwise synteny maps, and 56.8% (75/132) of the chrUn genes could be mapped on all 4 pairwise synteny maps, making use of the one-to-one orthologous relationships described between dog and the 4 species considered (human, chimpanzee, rat, mouse). From all pairwise synteny maps, genomic intervals in the dog sequence were predicted for 115 chrUn genes, 82% (92) of which were identified in at least 2 species. Furthermore, the multispecies analysis made it possible to shorten and refine the interval, using 2 flanking orthologous genes, for 8 and 5 intervals, respectively (Table 2). For 5 of the 22 canine chrUn genes initially assigned to break points, for which a single CFA could not be identified with the human–dog synteny map alone, multispecies analysis led to the identification of a single CFA.
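One simple way to combine per-species predictions is to intersect the dog intervals inferred from each reference species. The Python sketch below illustrates that idea; the combination rule, the function name, and the coordinates are assumptions made for the example, since the text does not specify how intervals from different species were merged. Both interval bounds are assumed to be known.

```python
def combine_predictions(predictions):
    """Intersect dog intervals predicted from several reference species.

    predictions: dict mapping species name to (chromosome, start, end),
    or to None when that species gives no single-chromosome prediction.
    Returns the shared interval, or None if the species disagree on the
    chromosome, the intervals do not overlap, or nothing is usable.
    """
    usable = [p for p in predictions.values() if p is not None]
    if not usable:
        return None
    chromosomes = {p[0] for p in usable}
    if len(chromosomes) != 1:
        return None
    start = max(p[1] for p in usable)
    end = min(p[2] for p in usable)
    if start > end:
        return None
    return (chromosomes.pop(), start, end)


example = {
    "human": ("CFA6", 100, 900),
    "mouse": ("CFA6", 300, 1200),
    "rat": None,  # e.g. the gene falls in a break point on the rat-dog map
}
print(combine_predictions(example))
```

A gene that is ambiguous on one pairwise map (value `None`) can still be placed when another species gives a single interval, and two overlapping predictions yield a shorter interval than either alone.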
The use of the multispecies comparative approach made it possible to examine several sets of one-to-one orthologs differing in both number and nature. These differences may correspond to real biological differences. For example, a one-to-zero relationship for an olfactory receptor between dog and human may correspond to a one-to-one relationship between mouse and dog, both of which have a highly developed sense of smell (Quignon et al. 2005). Differences in orthologous data sets may also arise from inaccuracies in ortholog identification (Goodstadt and Ponting 2006).
Gene Sequence Analysis
Although an updated sequence assembly, CanFam2, has been released (Lindblad-Toh et al. 2005), the corresponding gene annotation set is not yet available from Ensembl database version 41. We have, however, performed sequence alignment analyses on the CanFam2 assembly (see Materials and Methods), using the Blat algorithm (Kent 2002) for the 115 dog chrUn genes and flanking orthologs localized in this study. We detected 78 of these gene sequences on canine chromosomes, with 94.8% (74/78) of the identified chromosomal locations consistent with the canine chromosomal interval determined in this study. The 4 gene sequences not confirmed by sequence alignment analysis may correspond to microrearrangements between dog and human undetectable on the synteny map or the incorrect definition of orthologous relationships. Twenty-five gene sequences were aligned to the chrUn pool, 7 were involved in major contig reassembly in the CanFam2 release, and 5 chrUn genes could not be mapped to any annotated chromosome in CanFam2 with significant sequence alignment. For these sets of genes, comparative genomics approaches are likely to be the most efficient for inferring the most probable location.
RH Mapping Validation
Among the 110 genes with new locations described in this work, a subset of 17 genes had previously been mapped on the canine 10 000-gene radiation hybrid (RH) map built with the 9000-rad panel. Sixteen of the 17 gene placements agreed with the computational approach, and one was slightly discrepant. These results indicate both the accuracy of the computational inference and the power of RH mapping to place chrUn genes.
Conclusion
In this study, we addressed the question of how comparative genomics can be used to map genes currently in the chrUn pool of the dog genome assembly. This approach is based on the construction of dense, accurate synteny maps with one-to-one gene-based comparative anchors and the use of gene-order conservation. It then makes use of the capacity to observe the conservation of gene order to infer orthologous intervals as candidates for the localization of chrUn genes. The evolutionary distance between the target and reference organisms is a critical factor with this method. It is essential to select phylogenetically close species (Margulies et al. 2005), such as pairs of mammals, and to consider multiple species, to maximize analysis efficiency. We constructed several synteny maps (human–dog, chimpanzee–dog, rat–dog, and mouse–dog) to identify, by inference, the canine segments most likely to contain the 115 chrUn genes. We found that the frequency of gap sequences was high in the inferred intervals, supporting our findings, and we validated 95% of the inferred intervals by sequence analysis on the updated release of the dog assembly.
Multispecies comparative analysis should be improved by the use of additional species, including species more phylogenetically distant than chimpanzee and human or mouse and rat. The availability of high coverage (>6x) genome assemblies, such as those for cow, macaque, and opossum, will provide additional data that could be integrated into studies of this type. In contrast, the use of low-coverage (2x) sequence projects is of limited interest for this aspect of the topic, due to the lack of continuity in chromosomal assignation.
Our results demonstrate that the inference of canine chromosomal intervals for chrUn genes from a multispecies comparative genomics approach is efficient and can be rapidly achieved. Similar strategies could be used to localize chrUn genes of any species in mammals with complete sequence assemblies.
Supplementary Material
Supplementary Table 3 is available online at http://genoweb.univ-rennes1.fr/tom_dog/J_Hered_supplementary.html.
Acknowledgments
We thank the GenOuest Bioinformatics Platform for hosting the MySQL database and the AutoGRAPH server. We also thank the French Centre National de la Recherche Scientifique for supporting this work and the Conseil Regional de Bretagne for awarding a fellowship to T.D.
Footnotes
This paper was delivered at the 3rd International Conference on the Advances in Canine and Feline Genomics, School of Veterinary Medicine, University of California, Davis, CA, August 3–5, 2006.
Corresponding Editor: Urs Giger
References
Ashurst JL, Chen CK, Gilbert JG, Jekosch K, Keenan S, Meidl P, Searle SM, Stalker J, Storey R, Trevanion S, et al. The Vertebrate Genome Annotation, Vega database. Nucleic Acids Res (2005) 33:D459–D465.
Birney E, Andrews D, Caccamo M, Chen Y, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, et al. Ensembl 2006. Nucleic Acids Res (2006) 34:D556–D561.
Chen Z-Q, Lautenberger JA, Lyons LA, McKenzie L, O'Brien SJ. A human genome map of comparative anchor tagged sequences. J Hered (1999) 90:477–484.
Derrien T, Andre C, Galibert F, Hitte C. AutoGRAPH: an interactive web server for automating and visualizing comparative genome maps. Bioinformatics (2007) 23(4):498–499.
Dujon B, Sherman D, Fischer G, Durrens P, Casaregola S, Lafontaine I, De Montigny J, Marck C, Neuveglise C, Talla E, et al. Genome evolution in yeasts. Nature (2004) 430:35–44.
Fischer G, Neuveglise C, Durrens P, Gaillardin C, Dujon B. Evolution of gene order in the genomes of two related yeast species. Genome Res (2001) 11(12):2009–19.
Gibbs RA, Weinstock GM, Metzker ML, Muzny DM, Sodergren EJ, Scherer S, Scott G, Steffen D, Worley KC, Burch PE, et al. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature (2004) 428:493–521.
Goodstadt L, Ponting CP. Phylogenetic reconstruction of orthology, paralogy, and conserved synteny for dog and human. PLoS Comput Biol (2006) 2(9):e133.
Hardison RC. Comparative genomics. PLoS Biol (2003) 2:E58.
Hillier LW, Miller W, Birney E, Warren W, Hardison RC, Ponting CP, Bork P, Burt DW, Groenen MA, Delany ME, et al. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature (2004) 432:695–716.
Hitte C, Madeoy J, Kirkness EF, Priat C, Lorentzen TD, Senger F, Thomas D, Derrien T, Ramirez C, Scott C, et al. Facilitating genome navigation: survey sequencing and dense radiation-hybrid gene mapping. Nat Rev Genet (2005) 8:643–648.
Jaillon O, Aury JM, Brunet F, Petit JL, Stange-Thomann N, Mauceli E, Bouneau L, Fischer C, Ozouf-Costaz C, Bernot A, et al. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature (2004) 431:946–957.
Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ. The UCSC Table Browser data retrieval tool. Nucleic Acids Res (2004) 32:D493–D496.
Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E. EnsMart: a generic system for fast and flexible access to biological data. Genome Res (2004) 14(1):160–9.
Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res (2002) 12:656–664.
Kirkness EF, Bafna V, Halpern AL, Levy S, Remington K, Rusch DB, Delcher AL, Pop M, Wang W, Fraser CM, et al. The dog genome: survey sequencing and comparative analysis. Science (2003) 301(5641):1898–1903.
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al. Initial sequencing and analysis of the human genome. Nature (2001) 409:860–921.
Liang C, Dandekar T. InGeno—an integrated genome and ortholog viewer for improved genome to genome comparisons. BMC Bioinformatics (2006) 7:461.
Lindblad-Toh K, Wade CM, Mikkelsen T, Karlsson E, Jaffe DB, Zody MC, Clamp M, Kamal M, Kulbokas EJ, Chang JL, et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature (2005) 438:803–819.
Margulies EH, Vinson JP, Miller W, Jaffe DB, Lindblad-Toh K, Chang JL, Green ED, Lander ES, Mullikin JC, Clamp M, Comparative Sequencing Program NISC. An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. Proc Natl Acad Sci USA (2005) 102:4795–4800.
Murphy WJ, Larkin DM, Everts-van der Wind A, Bourque G, Tesler G, Auvil L, Beever JE, Chowdhary BP, Galibert F, Gatzke L, et al. Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science (2005) 309:613–617.
O'Brien SJ, Womack JE, Lyons LA, Moore KJ, Jenkins NA, Copeland NG. Anchored reference loci for comparative genome mapping in mammals. Nat Genet (1993) 2:103–112.
Pan X, Stein L, Brendel V. SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics (2005) 21:3461–3468.
Pevzner P, Tesler G. Genome rearrangements in mammalian evolution: lessons from human and mouse genomes. Genome Res (2003) 13(1):37–45.
Quignon P, Giraud M, Rimbault M, Lavigne P, Tacher S, Morin E, Retout E, Valin AS, Lindblad-Toh K, Nicolas J, et al. Genome Biol (2005) 6(10):R83.
The Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature (2005) 437:69–87.
Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, et al. Initial sequencing and comparative analysis of the mouse genome. Nature (2002) 420:520–562.
Zheng XH, Lu F, Wang ZY, Zhong F, Hoover J, Mural R. Using shared genomic synteny and shared protein functions to enhance the identification of orthologous gene pairs. Bioinformatics (2005) 6:703–710.
