Journal of Heredity 2003:94(1)
© 2003 The American Genetic Association 94:15-22
A Survey of Canine Expressed Sequence Tags and a Display of Their Annotations Through a Flexible Web-Based Interface
From the Cold Spring Harbor Laboratory, Genome Research Center, 500 Sunnyside Blvd., Woodbury, NY 11797 (Palmer, O'Shaughnessy, Preston, Santos, Balija, Nascimento, Zutavern, and McCombie); the Department of Clinical Studies, University of Pennsylvania, School of Veterinary Medicine, 3900 Delancey St., Philadelphia, PA 19104-6010 (Henthorn); and Cold Spring Harbor Laboratory, 1 Bungtown Rd., Cold Spring Harbor, NY 11724 (Hannon).
Address correspondence to W. R. McCombie at the address above, or e-mail: mccombie{at}cshl.org.
| Abstract |
|---|
We have initially sequenced approximately 8,000 canine expressed sequence tags (ESTs) from several complementary DNA (cDNA) libraries: testes, whole brain, and Madin-Darby canine kidney (MDCK) cells. Analysis of these sequences shows that they provide partial sequence information for about 5%-10% of the canine genes. An analysis pipeline has been created to cluster the ESTs and to map individual ESTs as well as clustered ESTs to both the human genome and the human proteome. Gene ontology (GO) terms have been assigned to the ESTs and clusters based on their top matches to the International Protein Index (IPI) set of human proteins. The data generated are stored in a MySQL relational database for analysis and display. A Web-based Perl script has been written to display the analyzed data to the scientific community.
The availability of the human genome sequence presents a great opportunity for the biomedical community. These data will be beneficial in understanding a variety of human diseases. Yet the ultimate goal of genomics is to correlate the structure of the genome with its function in both disease and normal processes. Analysis of both inter- and intraspecies sequence variation will be critical in achieving this goal.
The broad range of phenotypic variation between the more than 300 breeds of dogs available worldwide offers a unique opportunity for determining the underlying sequence variation. For example, dogs exhibit a broad range of phenotypic variation in traits such as size, conformation, and behavior between breeds (Galibert et al. 1998; Ostrander et al. 2000; Ostrander and Giniger 1997, 1999; Patterson 2000; Patterson et al. 1988). However, within a particular breed there may be little phenotypic variation. In addition, in the process of creating distinct breeds by selective breeding, a number of unobserved traits, such as increased susceptibility to cancer and other diseases, have been isolated in the breeds.
The development of these genetically isolated breeds, coupled with detailed pedigree information, makes purebred dogs a rich source of genetic information. Genomic analysis of the species will allow utilization of this resource. As one important component in this process, we have begun sequencing from canine complementary DNA (cDNA) libraries to generate reference sequences that will facilitate the analysis of sequence variation among canines and between canines and humans (and other mammals). Approximately 8,000 canine expressed sequence tags (ESTs) have initially been sequenced and analyzed. An analysis pipeline has been created to cluster, annotate, and map the EST sequences to the human genome and proteome. We have made these data available to the public on our canine Web site (http://nucleus.cshl.edu/genseq/dogweb/index.htm).
This initial set of canine ESTs will help in establishing a reference sequence data set of canine genes. This data set will be useful for a number of purposes, such as generating probes to sequence full-length genes or for aiding the assembly of the dog genome if that project is undertaken. In addition, a reference set of canine ESTs will also aid in the annotation of the human genome by providing additional evidence for putative transcripts.
| Materials and Methods |
|---|
EST Sequencing
cDNA libraries from canine testes, whole brain, and Madin-Darby canine kidney (MDCK) cells were oligo(dT) primed using the Stratagene ZAP cDNA synthesis kit. Cloned fragments were rescued via Lambda ZAP excision into pBluescript phagemids. The pBluescript phagemids were transformed into XL1-blue electrocompetent cells and plated on selective media. pBluescript plasmids were isolated via a modified SPRI magnetic bead templating method (Hawkins et al. 1994). The plasmids were then sequenced from the 5' end with a 1/16th-volume Big Dye terminator chemistry in a 384-well format. Fragments were separated and analyzed on ABI 3700 capillary sequencers.
EST sequencing reads were trimmed for vector and low-quality sequences using the trim_alt option of PHRED (Ewing and Green 1998; Ewing et al. 1998). If the trimmed sequence contained 100 or more bases, it was submitted to GenBank and used in this study.
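The length criterion described above amounts to a simple filter on the trimmed reads. The following Python sketch is illustrative only: the authors' scripts were written in Perl and are not reproduced in the paper, and the function name and the in-memory dictionary of reads are assumptions made for the example.

```python
MIN_TRIMMED_LENGTH = 100  # bases; threshold stated in the text


def keep_submittable_reads(trimmed_reads, min_length=MIN_TRIMMED_LENGTH):
    """Return the reads whose trimmed sequence has at least min_length bases.

    trimmed_reads maps a read name to its sequence after vector and
    low-quality trimming (hypothetical input; the trimming itself is
    performed by PHRED and is not modeled here).
    """
    return {
        name: seq
        for name, seq in trimmed_reads.items()
        if len(seq) >= min_length
    }
```

A read of exactly 100 bases is kept, matching the "100 or more bases" wording.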
EST Analysis Pipeline
All scripts used in our analysis were written in Perl 5. Individual reads were clustered using the PHRED and PHRAP programs (Ewing and Green 1998; Ewing et al. 1998). The sequences derived from the three cDNA libraries (MDCK, testes, and whole brain) were each individually clustered. In addition, all three libraries were clustered together.
Canine ESTs from other laboratories were also retrieved from GenBank and incorporated into the analysis pipelines. These ESTs were not clustered, nor were they included in the analysis of the sequences we generated. They are available to the public, however, with the associated gene ontology (GO) terms and mapping information through our Web site.
Both the clusters and individual ESTs were mapped to the human genome and proteome. Sequences were mapped to the human genome with the University of California at Santa Cruz (UCSC) BLAT server (http://genome.ucsc.edu) using the April 2002 freeze of the human genome assembly and a cutoff score of 100 to eliminate poor matches. Fifty thousand bases upstream and downstream of where BLAT mapped the sequence in the human genome were retrieved. The SIM4 program (Florea et al. 1998) was then used to map the sequence to the genomic DNA that was retrieved.
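The two-stage mapping above (a BLAT hit, followed by retrieval of a flanking region for SIM4) can be sketched as below. This is a Python illustration under stated assumptions, not the authors' Perl code: the function names are hypothetical, coordinates are taken to be 0-based and half-open, the window is clamped to the chromosome ends, and the score cutoff is treated as inclusive, none of which the text specifies.

```python
FLANK = 50_000        # bases retrieved on each side of the BLAT hit (from the text)
MIN_BLAT_SCORE = 100  # cutoff score stated in the text


def passes_cutoff(score, min_score=MIN_BLAT_SCORE):
    """Keep a BLAT hit only if its score reaches the cutoff (assumed inclusive)."""
    return score >= min_score


def flanking_window(hit_start, hit_end, chrom_length, flank=FLANK):
    """Return (start, end) of the genomic region to pass to the aligner.

    The window extends `flank` bases beyond each end of the hit and is
    clamped so that it never runs off the chromosome (an assumption).
    """
    if not 0 <= hit_start < hit_end <= chrom_length:
        raise ValueError("hit coordinates must lie within the chromosome")
    return max(0, hit_start - flank), min(chrom_length, hit_end + flank)
```

Restricting the second alignment to a window of roughly 100 kb around the hit keeps the spliced alignment step small compared with aligning against a whole chromosome.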
Sequences were mapped to the human proteome by performing a BLASTX (Altschul et al. 1997) search against the International Protein Index (IPI) database of nonredundant human proteins. The top match to an IPI protein containing a GO term was used to assign GO terms to each sequence. The June 2002 data for the IPI and GO databases were used in this study (http://www.ebi.ac.uk/IPI/ and http://www.godatabase.org/dev/database/, respectively).
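The rule for assigning GO terms (use the best-ranked IPI match that actually carries GO terms) can be written compactly. The Python sketch below is an illustration, not the authors' implementation; the function name and the input structures (a ranked list of protein IDs and a dictionary of GO annotations) are assumptions.

```python
def assign_go_terms(ranked_hits, go_annotations):
    """Return (protein_id, go_terms) for the best-ranked hit that has GO terms.

    ranked_hits: protein IDs ordered from best to worst BLASTX match.
    go_annotations: dict mapping protein ID -> list of GO term IDs.
    Returns (None, []) when no hit carries a GO term.
    """
    for protein_id in ranked_hits:
        terms = go_annotations.get(protein_id)
        if terms:  # skip hits that are missing or have an empty annotation list
            return protein_id, list(terms)
    return None, []
```

Note that the protein used for annotation is not necessarily the overall top hit, so it is worth recording which hit supplied the terms.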
EST clusters were mapped to Online Mendelian Inheritance in Man (OMIM) loci by using BLASTX to match the sequences to a database of curated human RefSeq proteins (ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/hs.faa.gz). To get a list of loci that are related to cancer, the OMIM database was searched for records that contained the term "cancer." Only records that contained a single locus were retained. EST clusters were mapped to the OMIM records using the top match to the RefSeq database using BLASTX. RefSeq to OMIM relations were established using the locus link databases loc2ref and mim2loc (ftp://ftp.ncbi.nih.gov/refseq/LocusLink/).
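The chain of lookups described above (cluster to top RefSeq protein, RefSeq protein to locus, locus to OMIM record, keeping only single-locus OMIM records) is essentially a join across three tables. A minimal Python sketch follows; the dictionaries stand in for the parsed BLASTX results and the loc2ref and mim2loc files, whose exact formats are not given in the text, and all names are hypothetical.

```python
from collections import defaultdict


def single_locus_omim(mim_to_loci):
    """Keep OMIM records associated with exactly one locus, as mim -> locus."""
    return {
        mim: next(iter(loci))
        for mim, loci in mim_to_loci.items()
        if len(loci) == 1
    }


def clusters_to_omim(cluster_top_refseq, refseq_to_locus, mim_to_loci):
    """Map each cluster to OMIM records through its top RefSeq match.

    cluster_top_refseq: cluster name -> RefSeq protein ID of the top match.
    refseq_to_locus:    RefSeq protein ID -> locus ID.
    mim_to_loci:        OMIM number -> set of locus IDs.
    Clusters with no linked OMIM record are omitted from the result.
    """
    locus_to_mims = defaultdict(set)
    for mim, locus in single_locus_omim(mim_to_loci).items():
        locus_to_mims[locus].add(mim)

    result = {}
    for cluster, refseq in cluster_top_refseq.items():
        locus = refseq_to_locus.get(refseq)
        mims = locus_to_mims.get(locus, set())
        if mims:
            result[cluster] = set(mims)
    return result
```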
A MySQL relational database was constructed to store the analysis information. Perl-based Web scripts were written to display the data stored in the MySQL database. The Web interface can be found at http://nucleus.cshl.edu/genseq/dogweb/index.htm by selecting the EST Annotations link.
| Results and Discussion |
|---|
We have sequenced approximately 8,000 canine ESTs from several nonnormalized cDNA libraries, including whole brain, testes, and MDCK cells. The ESTs from each library were clustered using PHRED and PHRAP. In addition, the three libraries were clustered together. Table 1 shows the total number of clusters (multi-EST clusters plus singletons), the number of multi-EST clusters, and the number of singletons generated by the clustering process. These data indicate that our sequencing to date has uncovered about 4,000 unique sequences. An exact number of genes is difficult to establish because two separate clusters may in fact represent the same gene, or singletons may not be clustered due to poor sequence quality. However, approximately 2,000 unique human sequences have been matched by the canine ESTs (see below), so we estimate that our EST coverage represents approximately 5%-10% of the estimated 30,000-40,000 member mammalian gene set (International Human Genome Sequencing Consortium, 1999).
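The text does not spell out the arithmetic behind the coverage estimate. As a rough check, the Python sketch below simply tabulates the ratios of the two counts mentioned above (about 2,000 matched human sequences and about 4,000 unique canine sequences) to the two gene set sizes; pairing these particular numbers is an assumption made for illustration, not the authors' stated calculation.

```python
def coverage_fractions(sequence_counts, gene_set_sizes):
    """Return {(count, gene_set_size): count / gene_set_size} for every pair."""
    return {
        (count, size): count / size
        for count in sequence_counts
        for size in gene_set_sizes
    }


fractions = coverage_fractions([2_000, 4_000], [30_000, 40_000])
for (count, size), fraction in sorted(fractions.items()):
    print(f"{count:>5} / {size:>6} = {fraction:.1%}")
```

Against a 40,000-gene set, the two counts give 5% and 10%, which is consistent with the range quoted in the text.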
Each EST and cluster sequence was mapped to the human genome using the BLAT server at UCSC (http://genome.ucsc.edu). The EST and cluster sequences were also mapped to the human proteome by querying the sequence against the IPI set of nonredundant human proteins using BLASTX. In addition, we have assigned GO terms to the sequences based on their matches to the IPI database.
The percentages of clusters that mapped to the human proteome, to the human genome, to both, or to neither are shown in Figure 1. The whole brain library had the largest number of clusters that did not match either the human genome or proteome (31%), while only 20% of the testes clusters did not match the human genome or proteome under the conditions used. Some of the "nonmatches" may be due to the incompleteness of the human genome and/or IPI database, while others may be due to sequencing errors or to the stringency of our cutoffs for matches.
As an indicator of the diversity of proteins matched by our EST libraries, of the 4,594 different EST clusters (from each library clustered separately), 2,306 distinct IPI proteins were found as the top match in a BLASTX search. There was very little overlap between the three libraries. Ninety-two percent of the 2,306 IPI proteins were matched by clusters from only one library. Only 1% of the IPI proteins were matched by clusters from all three libraries. Within each library, more than 95% of the 2,306 IPI proteins were matched by five or fewer ESTs. These percentages will change as sampling is increased. Clustering of ESTs resulted in more than 90% of IPI proteins being matched by only a single cluster, thus providing one estimate of the redundancy in our clustered set. Of the unclustered ESTs, approximately 65%-75% of the matching IPI proteins were matched by only a single sequence (Figure 2).
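Overlap figures of this kind (proteins matched by one, two, or all three libraries) can be derived by counting, for each protein, the number of libraries in which it appears as a top match. The Python sketch below shows one way to do this; the function name and the input layout are assumptions, and the example identifiers in any usage would be placeholders rather than data from the study.

```python
from collections import Counter


def library_overlap(proteins_by_library):
    """Count how many proteins are matched by exactly 1, 2, ... libraries.

    proteins_by_library maps a library name to the collection of protein IDs
    that were top matches for its clusters.
    Returns a Counter keyed by the number of libraries sharing a protein.
    """
    libraries_per_protein = Counter()
    for proteins in proteins_by_library.values():
        # set() ensures a protein is counted once per library
        libraries_per_protein.update(set(proteins))
    return Counter(libraries_per_protein.values())
```

Dividing each count by the total number of distinct proteins gives percentages comparable to those reported above.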
A Web-based interface for the database (found at the EST Annotations link at http://nucleus.cshl.edu/genseq/dogweb/index.htm) has been developed to view the results of the automated analysis pipeline. The interface allows the user to limit searches by various criteria. Users can choose to download sequences, accession numbers, or annotations (step 1). The next step (step 2) is to choose which library (or all libraries) or clusters to search. The remaining steps (3 and 4) are optional. Step 3 allows the user to limit retrieval to sequences that map to a particular human chromosome. In the future we plan to allow the user to limit searches to ESTs that map to a particular region in the chosen chromosome. In the final step (step 4), the user can limit retrieval to sequences that either have specified GO terms or match a particular human gene or protein (see Figure 3 for steps 1-4). To select by GO term, the user can simply select a term in the pull-down list, which contains GO terms two levels down from the biological process, molecular function, or cellular component. In addition, a GO term not found in the list can be entered manually. Finally, one can enter an accession number of a human sequence to find any EST or cluster that has been linked to that accession through the BLASTX searches against the IPI database. For example, one can enter the RefSeq identifications (IDs) listed in Table 2 to retrieve a list of ESTs or clusters that are linked to the RefSeq ID.
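The optional filters in steps 2 through 4 map naturally onto a parameterized database query in which only the criteria the user supplies become conditions. The Python sketch below illustrates the idea using the standard-library `sqlite3` module as a stand-in for MySQL. The table name `annotation`, its columns, and the example rows are all hypothetical; the actual schema is not described in the paper, and the authors' interface was written in Perl.

```python
import sqlite3


def build_query(library=None, chromosome=None, go_term=None, protein_id=None):
    """Return (sql, params) selecting accessions that satisfy the given filters.

    Filters left as None are omitted, mirroring the optional steps of the
    interface. Column names come from a fixed list; values are bound as
    parameters rather than interpolated into the SQL string.
    """
    clauses, params = [], []
    for column, value in (
        ("library", library),
        ("chromosome", chromosome),
        ("go_term", go_term),
        ("protein_id", protein_id),
    ):
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT DISTINCT accession FROM annotation"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY accession", params


def demo():
    """Run one filtered query against a small in-memory table of made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation ("
        "accession TEXT, library TEXT, chromosome TEXT, "
        "go_term TEXT, protein_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?, ?, ?)",
        [
            ("EST_A", "testes", "chr17", "GO:0008150", "PROT_1"),
            ("EST_B", "brain", "chr17", "GO:0003674", "PROT_2"),
            ("EST_C", "testes", "chr2", "GO:0008150", "PROT_3"),
        ],
    )
    sql, params = build_query(library="testes", chromosome="chr17")
    rows = [row[0] for row in conn.execute(sql, params)]
    conn.close()
    return rows
```

With a MySQL driver the placeholder syntax would typically differ (many drivers use `%s` rather than `?`), but the pattern of building the condition list from the supplied filters is the same.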
Figure 4 displays an example of annotations retrieved for both an individual EST and a cluster. The "Human Genome Match" columns include the coordinates the sequence maps to in the human genome and any RefSeq genes that have been mapped nearby. In addition, the top match to the IPI database is listed under "Human Proteome Match."
Selecting the EST accession or cluster name will redirect the user to a more in-depth sequence annotation (Figure 5). This includes the sequence (Figure 5A) as well as additional information regarding the EST sequence, such as library construction (Figure 5B). A probable name derived from IPI hits is shown in the next panel (Figure 5C) along with GO terms that are associated with the protein. Finally, results of BLASTX searches against the IPI (Figure 5D) and NR (Figure 5E) databases are shown. In the IPI panel (Figure 5D), the protein sequence used for annotation of GO terms and probable name is indicated with an asterisk. Because not all sequences in the IPI database have associated GO terms, we relied on the first match with GO terms for assigning the terms to our EST sequences.
The user can graphically view the EST or cluster mapped to the human genome by selecting the View link found under the "Human Genome Match" columns (Figure 4). Selecting this link redirects the user to the UCSC genome browser. Information is sent to the genome browser so that the SIM4 mapping results can be displayed within the browser (Figure 6).
The UCSC browser contains various tracks of information. The top track, labeled DOG_EST, is information that has been sent by our Web-based script. In the example above, the sequence has been mapped to a known RefSeq gene MYL4 (a myosin light chain). The sequence maps to multiple exons within the MYL4 gene. The other tracks displayed (including mouse and fish homology, spliced human ESTs, and messenger RNA [mRNA]) are annotations provided from the UCSC annotation databases.
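The paper does not state the format in which alignment results are sent to the browser. One documented mechanism for displaying user data in the UCSC browser is a custom track in BED format, where a spliced alignment is written as a single 12-column line whose blocks correspond to exons. The Python sketch below formats such a line; it is an assumption about how a track like DOG_EST could be produced, not a description of the authors' script, and the function names are hypothetical.

```python
def bed12_line(chrom, exons, name, strand="+", score=0):
    """Format aligned exon blocks as one BED12 line.

    exons: list of (start, end) genomic intervals, 0-based and half-open.
    Block starts are written relative to the start of the whole feature,
    as the BED format requires.
    """
    exons = sorted(exons)
    chrom_start, chrom_end = exons[0][0], exons[-1][1]
    sizes = ",".join(str(end - start) for start, end in exons)
    starts = ",".join(str(start - chrom_start) for start, _ in exons)
    fields = [
        chrom, chrom_start, chrom_end, name, score, strand,
        chrom_start, chrom_end,  # thickStart, thickEnd
        0,                       # itemRgb
        len(exons), sizes, starts,
    ]
    return "\t".join(str(field) for field in fields)


def custom_track(track_name, lines):
    """Prepend a UCSC 'track' header line to a list of BED lines."""
    header = f'track name={track_name} description="Canine EST alignments"'
    return "\n".join([header, *lines])
```

Writing each alignment as one multi-block feature, rather than one feature per exon, lets the browser draw the exons joined by intron lines, which matches the way an EST spanning multiple exons is described above.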
We next wanted to demonstrate the utility of the data set and tools we have built. We determined which of our EST clusters matched cancer-related genes. This was accomplished by first matching the EST clusters to curated RefSeq proteins using BLASTX. A list of 946 OMIM records that contained the term "cancer" and had a single locus associated with the record was retrieved from the National Center for Biotechnology Information (NCBI). These loci are not necessarily associated with cancer; the only criterion is that the term is mentioned in the OMIM record. The clustered canine ESTs were matched to these OMIM records through several NCBI databases that linked RefSeq accession numbers to OMIM records. Table 3 lists the number of OMIM records that were matched for each clustered library. The MDCK library had the largest proportion of matching OMIM records relative to the number of clusters sequenced. This is not surprising, as MDCK cells are an immortalized cell line. There is little overlap in the OMIM records matched by the three libraries. A total of 113 different OMIM records are matched by at least one of the three libraries (one additional OMIM entry is matched when all three libraries are clustered together). Only two of these cancer-related OMIM loci were matched by all three libraries. These two loci are ribosomal protein L5 (RPL5) and laminin receptor (LAMR). Only eight OMIM records were matched by two libraries. A total of 103 OMIM records were matched by only a single library. Table 2 lists a few examples of OMIM loci that were matched by the canine EST clusters. The table lists the RefSeq ID of the protein that links the cluster to the OMIM record. This ID can be entered in our Web interface, as described above, to retrieve the ESTs and clusters that match the particular RefSeq ID.
Comparative genomics and the correlation of sequence variation with phenotype will both be powerful tools for understanding the human genome. The canine genome represents a valuable resource for both of these approaches. We have developed a public resource that allows flexible extraction of data from a database of EST sequences from the dog. These sequences represent the partial reference sequence of genes, which can provide a starting point for future explorations of the dog genome. We have populated this database with about 4,000 unique ESTs, likely representing approximately 5%-10% of the canine genes.
We are in the process of sequencing additional ESTs from multiple tissue sources. A small number of canine breeds will initially be utilized for creation of cDNA libraries to generate this initial reference sequence set. In the future, using these reference sequences as a starting point, the sequence variation among canine breeds will be determined. Ultimately this will greatly enhance our understanding of the relationship between sequence variation and phenotype in mammals. Our Web site will also be updated with additional tools and resources, such as using BLAST searches as a possible method for retrieving canine ESTs and their annotations.
| Acknowledgments |
|---|
We thank Mark Haskins for providing tissue samples. This work was funded by the Canine Health Foundation and National Institutes of Health grant RR02512. This paper was delivered at the Advances in Canine and Feline Genomics symposium, St. Louis, MO, May 16-19, 2002.
| Footnotes |
|---|
Corresponding Editor: Elaine Ostrander
Received September 1, 2002
Accepted September 26, 2002
| References |
|---|
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ, 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
Ewing B, Green P, 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186-194.
Ewing B, Hillier L, Wendl MC, Green P, 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175-185.
Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W, 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8:967-974.
Galibert F, Andre C, Cheron A, Chuat JC, Hitte C, Jiang Z, Jouquand S, Priat C, Renier C, Vignaux F, 1998. The importance of the canine model in medical genetics. Bull Acad Natl Med. 182:811-821.
Hawkins TL, O'Connor-Morin T, Roy A, Santillan C, 1994. DNA purification and isolation using a solid-phase. Nucleic Acids Res. 22:4543-4544.
International Human Genome Sequencing Consortium, 1999. Initial sequencing and analysis of the human genome. Nature. 409:860-921.
Ostrander EA, Galibert F, Patterson DF, 2000. Canine genetics comes of age. Trends Genet. 16:117-123.
Ostrander EA, Giniger E, 1997. Semper fidelis: what man's best friend can teach us about human biology and disease. Am J Hum Genet. 61:475-480.
Ostrander EA, Giniger E, 1999. Let sleeping dogs lie? Nat Genet. 23:3-4.
Patterson D, 2000. Companion animal medicine in the age of medical genetics. J Vet Intern Med. 14:1-9.
Patterson DF, Haskins ME, Jezyk PF, Giger U, Meyers-Wallen VN, Aguirre G, Fyfe JC, Wolfe JH, 1988. Research on genetic diseases: reciprocal benefits to animals and man. J Am Vet Med Assoc. 193:1131-1144.
Figure caption fragment (beginning missing in the source): ... 100) and proteome (e < 1e-10 in BLASTX search of the IPI database), ESTs that matched neither the genome nor the proteome, and those that matched both the human genome and proteome.