Journal of Heredity Advance Access originally published online on August 31, 2005
Journal of Heredity 2005 96(5):618-622; doi:10.1093/jhered/esi094
Computer Note
PRECISE: Software for Prediction of cis-Acting Regulatory Elements
From the Graduate School of Experimental Plant Sciences, Laboratory of Plant Breeding, Department of Plant Sciences, Wageningen University, P.O. Box 386, 6700 AJ Wageningen, The Netherlands (Trindade, Berloo, and Visser); and Plant Research International, P.O. Box 16, 6700 AA Wageningen, The Netherlands (Fiers)
Address correspondence to Luisa M. Trindade at the address above, or e-mail: luisa.trindade{at}wur.nl.
| Abstract |
|---|
The regulation of gene expression at the transcription initiation level is highly complex and requires the presence of multiple transcription factors. These transcription factors are often proteins or peptides that bind to the so-called cis-acting elements, which are present in the promoter regions and conserved among different species. In order to predict these cis-acting elements, a computer program called PRECISE (Prediction of REgulatory CIS-acting Elements) was developed. The power of the tool lies in its user-friendly interface and in the possibility of using empirical motif frequency tables to filter through the many discovered motifs. The tools to create the empirical motif frequency table (e.g., from a whole genome sequence) are included in the package. In the first case study, the upstream regions of all the genes in the Arabidopsis genome were used to create an empirical motif frequency table and a set of 64 upstream sequences of genes known to be involved in starch metabolism was subjected to analysis by PRECISE. The 20 motifs with the highest specificity in the selected set were analyzed in more detail. Of these 20 motifs, 15 showed a very high or complete homology to the sequences of known cis-acting elements. These cis-acting elements are regulated by light, auxin, and abscisic acid, and confer specific expression in sink organs such as leaves and seeds. All these factors have been shown to play an important role in starch biosynthesis. In the second case study, the upstream regions of 16 genes whose transcription is induced by gibberellins (GA) in Arabidopsis were analyzed with PRECISE and compared to the motifs present in the PLACE database. Among the most promising motifs found by PRECISE were 6 of the 17 known GA motifs. These results indicate the power of the PRECISE software package in the prediction of regulatory elements.
During their development and differentiation, multicellular organisms need to integrate a wide range of tissue, developmental, and environmental signals to control gene expression. A major level at which gene expression is regulated is the initiation of transcription, which is highly complex and often requires the presence of multiple transcription factors (Fickett and Wasserman 2000).
A common approach to uncover regulatory elements entails the construction of a series of deletions or replacements in the upstream intergenic region of a gene, followed by an assay for altered regulation (Bulyk et al. 2001; Daraselia et al. 1996; Liu et al. 1990). An efficient method for predicting the most likely location of regulatory sequences could guide these experiments more quickly to the sought-after elements (e.g., Roth et al. 1998).
Completely sequenced genomes, together with mRNA quantification of the entire genome, open new possibilities to predict regulatory elements. Several authors have tried to find regulatory elements in different groups of genes. One of the most frequently described approaches is to find common cis-acting elements in the promoter regions of genes showing a similar expression profile (Pilpel et al. 2001; Roth et al. 1998; Vilo et al. 2000; Vilo and Kivinen 2001). Bussemaker et al. (2001) proposed a method for the detection of regulatory elements that is based on correlating gene expression values and motif occurrences. Alternatives to clustering by expression profile include retrieving a group of genes with similar functions (Hughes et al. 2000; Zhu and Zhang 2000), sorting genes based on the magnitude of their expression response under certain conditions (Jensen and Knudsen 2000), grouping all the genes affected by single-gene knock-out studies (Hughes et al. 2000), or comparing genes involved in the same metabolic pathway (Brazma et al. 1998; van Helden et al. 1998). Although several methods have been proposed to find putative cis-acting elements, none of those referred to above resulted in a publicly available software program in which users can compare their own set of promoters to their favorite reference sets.
In this article, a simple and fast method for identifying regulatory cis-acting elements is described. A new and versatile software program named PRECISE (Prediction of REgulatory CIS-acting Elements) was developed. This software package can filter through promoter regions of a given set of genes entirely selected by the user in order to identify motifs that are likely to be involved in gene regulation. This program was tested using the Arabidopsis genome and two sets of promoters were used to validate this method.
| Methods |
|---|
The methods used to discover putative cis-acting regulatory sequence motifs can be divided into (1) the creation of a reference set, needed for significance assessment (which needs to be done only once per organism), and (2) the scanning of a selected set of promoter sequences for motifs that appear with a high frequency (which can be repeated many times using different input and settings). The "reference set" is defined as the set of sequences against which the "selected set" of sequences is compared in order to determine the relevance of the motifs found with PRECISE. In our case, the reference set consisted of the upstream sequences of all the Arabidopsis genes.
Creation of a Reference Empirical Motif Frequency Table
The process of creating an empirical motif frequency table was streamlined by the creation of two computer algorithms. The first algorithm takes as input a genome sequence and feature description. The feature description is used to parse the sequence and identify the coding sequences. The sequence regions in between coding sequences (on either strand), excluding introns, are retained.
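The extraction step can be pictured as taking the complement of the annotated gene spans on the genome. The following is a minimal sketch under stated assumptions, not the helper program shipped with PRECISE: the function name `intergenic_regions`, the half-open `(start, end)` coordinate convention, and the toy coordinates are all hypothetical.

```python
def intergenic_regions(genome_length, gene_spans):
    """Return half-open (start, end) intervals not covered by any gene span.

    gene_spans: iterable of (start, end) pairs, 0-based and half-open,
    each covering a whole gene (exons and introns), on either strand.
    """
    regions = []
    cursor = 0  # first position not yet covered by a gene
    for start, end in sorted(gene_spans):
        if start > cursor:
            regions.append((cursor, start))
        cursor = max(cursor, end)  # overlapping genes are merged implicitly
    if cursor < genome_length:
        regions.append((cursor, genome_length))
    return regions


# Toy coordinates, invented for illustration only.
regions = intergenic_regions(100, [(10, 20), (15, 30), (50, 60)])
```

Because whole gene spans are subtracted, introns are excluded automatically, which matches the description above.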
A second script is used to parse all retained sequences and determine the frequencies of all unique motifs with a length of 4 to 11 bases (or any other length range chosen by the user). In the process, every unique motif is awarded a unique identifier to allow fast, indexed lookup of the frequencies at a later stage.
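The counting step described above can be sketched as a sliding window over every retained sequence. This is an illustrative sketch only: `count_motifs` and its parameters are hypothetical names, and skipping windows that contain characters other than A, C, G, and T is an assumption, not documented PRECISE behavior.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def count_motifs(sequences, min_len=4, max_len=11):
    """Count every motif of length min_len..max_len in the given sequences.

    Returns a dict mapping motif length to a Counter of motif -> occurrences.
    """
    counts = {k: Counter() for k in range(min_len, max_len + 1)}
    for seq in sequences:
        seq = seq.upper()
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                word = seq[i:i + k]
                # Assumption: windows with ambiguous bases (e.g. N) are skipped.
                if set(word) <= VALID_BASES:
                    counts[k][word] += 1
    return counts


table = count_motifs(["ACGTACGT"], min_len=4, max_len=5)
```

Keeping one counter per motif length makes it straightforward to compute relative frequencies within each length class later on.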
Formula 1 (Vincens et al. 1998) is used to "translate" every unique motif s into an identifier Y:
[Formula 1]
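Formula 1 itself is not reproduced here. Purely as an illustration of the idea, and not necessarily the mapping of Vincens et al. (1998), a motif can be read as a base-4 number, which yields an integer suitable for indexed lookup. The base codes and function names below are assumptions.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed coding, for illustration


def motif_to_id(motif):
    """Read the motif as a base-4 number, most significant base first."""
    value = 0
    for base in motif.upper():
        value = value * 4 + BASE_CODE[base]
    return value


def id_to_motif(value, length):
    """Invert motif_to_id for a motif of known length."""
    bases = []
    for _ in range(length):
        value, digit = divmod(value, 4)
        bases.append("ACGT"[digit])
    return "".join(reversed(bases))


identifier = motif_to_id("ACGT")
```

With this particular coding, identifiers are unique only among motifs of the same length ("AAAA" and "AAAAA" both map to 0), so a table covering several lengths would need to store the length alongside the identifier.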
Scanning Promoter Sequences for Motifs That Appear With a High Frequency
The PRECISE software package considers all possible sequence motifs of a certain length and performs an analysis of the frequency of these motifs, both within the target set of promoter sequences and in a reference background set (usually an entire genome sequence), and filters out the most promising motifs. Filtering is based on a "minimum number of occurrences" rule, but the resulting set of putative valuable words can still become quite large, which complicates devising an easy statistical test for significance of the putative valuable words.
If a motif occurs at random within the set of promoter sequences, the relative number of occurrences is expected to be very similar to the relative number of occurrences of the motif within the unrelated reference set. This reasoning forms the basis of the algorithm implemented in PRECISE.
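The reasoning above can be made concrete with a small worked example. The sketch below assumes that relative frequency means the motif's count divided by the total count of all motifs of the same length in that set; the function name and the toy counts are invented for illustration.

```python
import math


def relative_frequency_ratio(motif, target_counts, reference_counts):
    """Ratio of a motif's relative frequency in the target set to the reference set.

    Both count arguments map motifs of one fixed length to occurrence counts.
    """
    target_rel = target_counts.get(motif, 0) / sum(target_counts.values())
    reference_rel = reference_counts.get(motif, 0) / sum(reference_counts.values())
    if reference_rel == 0:
        return math.inf  # motif never observed in the reference set
    return target_rel / reference_rel


# Toy counts, invented for illustration only.
target = {"ACGT": 3, "AAAA": 1}
reference = {"ACGT": 10, "AAAA": 90}
ratio = relative_frequency_ratio("ACGT", target, reference)  # 0.75 / 0.10
```

A ratio near 1 is what a randomly occurring motif is expected to show; a ratio well above 1 marks a motif that is enriched in the selected promoters.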
The PRECISE package was developed as a user-friendly Windows-compatible computer program that performs motif frequency analysis, including calculation of frequency statistics and visualization and export of analysis results. PRECISE was written in Borland Delphi (Object Pascal); the software source contains roughly 1200 lines of code (including comments) and runs on all 32-bit Windows platforms. The executable file is around 700 KB and requires approximately 5 MB of system memory (with no data loaded). Depending on the size of the dataset, additional memory may be required; for example, after loading the Arabidopsis reference frequency dataset, PRECISE uses 64 MB of memory. PRECISE is accompanied by helper programs that assist in extracting noncoding sequences from annotated whole genome sequences, and in the construction of a reference frequency dataset. These programs contain up to 300-400 lines of code each.
| The PRECISE Algorithm, Step by Step |
|---|
The first step concerns the empirical frequency table, which has to be loaded into memory. The software then loads the set of (promoter) sequences to be scanned. These can be presented in one or more Fasta-formatted text files. The size range of the motifs to be scanned can be set by the user and the scanning algorithm is started.
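For readers who want to prepare or inspect input outside PRECISE, reading Fasta-formatted text takes only a few lines. This is a generic sketch, not PRECISE code; `read_fasta` is a hypothetical helper that uses the first word of each header line as the sequence name.

```python
def read_fasta(lines):
    """Parse Fasta-formatted lines into a dict of name -> uppercase sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Fall back to a generated name if the header line is empty.
            name = header[0] if header else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


example = [">geneA upstream", "acgt", "ACGT", "", ">geneB", "TTTT"]
promoters = read_fasta(example)
```

Any iterable of lines works, so an open file handle can be passed in place of the list.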
The main algorithm creates a reverse copy of each promoter sequence and adds these to the set. Next, it breaks up the sequences into fragments and creates a list of unique motifs present in the selected set (within the preset size range). One by one the unique motifs are compared with the selected set of sequences and their frequency of occurrence in this set is determined. The motifs with a frequency exceeding a user-defined threshold are reported. This report summarizes the findings and also contains relevant statistics. In most cases, the ratio of the relative observed motif frequency among the selected set of sequences versus the relative motif frequency in the reference table will be the most informative statistic. The report is formatted as a list of motifs and their statistics, and the list can easily be sorted by any parameter. Clicking on a motif reveals relevant details, such as the location in which it was found, in a textual and a graphic form.
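The scanning step can be sketched as follows. Two points are assumptions rather than statements about PRECISE: the "reverse copy" is taken here to be the reverse complement, and the threshold is applied to the number of promoters containing a motif (in line with the "at least 10 of the 64 promoters" criterion of Case Study 1). All names are hypothetical.

```python
from collections import defaultdict

VALID_BASES = set("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Assumed interpretation of the 'reverse copy' of a promoter."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_promoters(promoters, min_len, max_len, min_promoters):
    """Return motif -> (promoters containing it, total occurrences).

    promoters: dict of name -> sequence. Only motifs found in at least
    min_promoters different promoters are reported.
    """
    hits = defaultdict(lambda: [set(), 0])
    for name, seq in promoters.items():
        seq = seq.upper()
        for strand in (seq, reverse_complement(seq)):
            for k in range(min_len, max_len + 1):
                for i in range(len(strand) - k + 1):
                    word = strand[i:i + k]
                    if set(word) <= VALID_BASES:
                        hits[word][0].add(name)
                        hits[word][1] += 1
    return {
        motif: (len(names), total)
        for motif, (names, total) in hits.items()
        if len(names) >= min_promoters
    }


# Toy promoters, invented for illustration only.
toy = {"p1": "ACGTTT", "p2": "GGACGT", "p3": "CCCCCC"}
report = scan_promoters(toy, min_len=4, max_len=4, min_promoters=2)
```

In this sketch a palindromic motif such as ACGT is counted once per strand, so its occurrence total is doubled relative to a single-strand scan.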
| Determining Motif Significance |
|---|
The PRECISE algorithm essentially performs a brute force exhaustive search through all motifs present in the data. Discriminating true putative regulatory elements from artifacts is of vital importance, especially considering the large number of potential motifs. The motif significance test in PRECISE is based on common statistical sampling theory (e.g., Moore and McCabe 1999).
For any random motif M of length L, the relative frequency of occurrence of M within the subset of all motifs of length L has equal expectation in both the selected set of scanned sequences and the reference set of sequence frequencies. These can be considered two independent samples out of the pool of noncoding sequences.
For each motif, the relative frequency in the selected set and the relative frequency in the reference set is considered, and the ratio of these two parameters is calculated. In formal mathematical language:
[Formula: definitions of FTr(M), FRr(M), and their ratio]
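Based on the verbal definitions in the two preceding paragraphs, the quantities can be written out as below. Only the symbols FTr(M) and FRr(M) come from the text; the count symbols n and N are assumed notation, and this is a reading of the description rather than the published formula.

```latex
F_{Tr}(M) = \frac{n_T(M)}{N_T(L)}, \qquad
F_{Rr}(M) = \frac{n_R(M)}{N_R(L)}, \qquad
\mathrm{ratio}(M) = \frac{F_{Tr}(M)}{F_{Rr}(M)}
```

Here n_T(M) and n_R(M) denote the number of occurrences of motif M in the selected (target) set and in the reference set, and N_T(L) and N_R(L) denote the total number of motifs of length L counted in each set.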
The ratio of the two relative frequencies gives a useful criterion to order the results. A formal test on the equality of FTr(M) and FRr(M) can be constructed and, for sufficiently large sample size of the motif sets considered, the distribution of the test statistic can be approximated by a normal distribution (Moore and McCabe 1999).
The PRECISE software calculates and prints the probability under H0: FTr(M) = FRr(M). The significance threshold is corrected for multiple testing using the Bonferroni adjustment. Motifs whose probability under H0 lies below the adjusted significance threshold can be considered putatively interesting motifs and are marked as such.
Other statistics reported by PRECISE are based on knowledge of the (G + C)/(A + T) ratio within the genome considered. This allows a probability to be calculated for every motif, since we consider the motif to be an assembly of individual bases that each have a certain probability of occurrence.
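A test of this kind can be sketched with the standard pooled two-proportion z statistic and a Bonferroni-adjusted threshold. Whether PRECISE uses exactly this pooled form or a two-sided probability is an assumption; the function names and the toy counts are hypothetical.

```python
import math


def two_proportion_z(count_t, total_t, count_r, total_r):
    """Pooled two-proportion z test; returns (z, two-sided p-value).

    count_*: occurrences of the motif; total_*: all motifs of that length,
    in the target (t) and reference (r) sets respectively.
    """
    p_t = count_t / total_t
    p_r = count_r / total_r
    pooled = (count_t + count_r) / (total_t + total_r)
    variance = pooled * (1 - pooled) * (1 / total_t + 1 / total_r)
    if variance == 0:
        raise ValueError("test undefined when the pooled proportion is 0 or 1")
    z = (p_t - p_r) / math.sqrt(variance)
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni adjustment."""
    return alpha / n_tests


# Toy counts, invented for illustration only.
z, p = two_proportion_z(30, 1000, 300, 100000)
is_candidate = p < bonferroni_threshold(0.05, 10706)
```

The number of tests passed to `bonferroni_threshold` would be the number of motifs examined, for example the 10,706 oligomers retained in Case Study 1.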
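The base-composition probability can be sketched as a product over the bases of the motif. The sketch assumes that bases occur independently, that G and C are equally frequent, and that A and T are equally frequent; the function names and the example G + C fraction are hypothetical, not values taken from PRECISE.

```python
def gc_fraction_from_ratio(gc_at_ratio):
    """Convert a (G + C)/(A + T) ratio into the G + C fraction of the genome."""
    return gc_at_ratio / (1 + gc_at_ratio)


def motif_probability(motif, gc_fraction):
    """Probability of the motif at a given position under base independence."""
    p_gc = gc_fraction / 2        # probability of G, and of C (assumed equal)
    p_at = (1 - gc_fraction) / 2  # probability of A, and of T (assumed equal)
    prob = 1.0
    for base in motif.upper():
        if base in "GC":
            prob *= p_gc
        elif base in "AT":
            prob *= p_at
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return prob


uniform = motif_probability("ACGT", 0.5)    # every base has probability 0.25
at_rich = motif_probability("AAAA", 0.36)   # hypothetical AT-rich genome
```

In an AT-rich genome this gives A/T-rich motifs a higher expected probability than G/C-rich motifs of the same length, which is why composition matters when judging how surprising a motif is.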
| Case Study 1 |
|---|
In order to evaluate PRECISE, the promoters of 64 genes involved in starch metabolism were screened for common regulatory elements. The genes involved in the starch metabolic pathway were chosen to test PRECISE because both the genes involved in this pathway and its regulation have been extensively studied. A total of 10,706 oligomers were common to at least 10 of the 64 promoters.
The relative frequency ratio (FTr/FRr) of each of these oligomers was calculated and a histogram with these values is shown in Figure 1. Oligomers with a ratio greater than 7 are rare in the set of 64 promoters. The 20 potential motifs with the highest relative frequency ratio (Table 1) were compared to previously described plant cis-acting elements present in the PLACE database (Higo et al. 1999).
The sequences of at least eight of these oligomers were identical to previously described plant cis-acting elements (Table 1). Some covered the whole identified cis element, and most were homologous to a large part of the described motif. Seven other oligomers were nearly identical to the characterized plant cis-acting element and differed only with regard to the first or the last nucleotide or both. Thus only 5 oligomers out of our selected set of 20 showed little similarity to PLACE cis-acting elements.
In order to illustrate the relationship between the selected set of 64 promoters and the reference set containing all the upstream regions of the Arabidopsis genome, the relative frequencies of the 1,048,576 possible motifs with a length of 10 nucleotides were compared between these two sets (Figure 2).
The probability for the observed difference in relative frequency for each of the 20 oligomers in the reference frequency table and in the selected set of promoters was much smaller than 1% (Table 1), thus all of these differences in frequency are statistically significant. In addition, for the motifs G/TCCACGTG, the relative frequency in the set of promoters was nearly 14-fold higher than in the Arabidopsis noncoding regions.
Concerning the factors affecting the 15 PLACE cis-acting elements (Table 1), they confer tissue-specific expression mainly in sink organs such as leaves, seeds, and roots, and they are regulated by auxin, abscisic acid, and light. Some general transcription factor binding sites, such as for MADS domains and the nonamer motif, have also been identified among the oligomers with higher relative frequency in the set of 64 promoters when compared to the reference frequency table.
Several independent reports showed that both auxin and abscisic acid play an important role in the regulation of starch biosynthesis (Miyazawa et al. 1999; Rinne et al. 1994; Santos et al. 2004; Yang et al. 2002). Light has also been described as a factor that influences starch biosynthesis (Schlüter et al. 2003), and two light-responsive elements were identified among the 20 oligomers with higher FTr/FRr ratios. Finally, five cis-acting elements confer tissue-specific expression in sink organs such as leaves, seeds, endosperm, and roots. It is exactly in these organs where high concentrations of starch can be found in the plant, and they are also the organs where starch biosynthesis occurs.
As a large number of (specific) cis elements are not yet characterized, the remaining five motifs that do not show high similarity to any described cis element are potential candidates for new transcription factor binding sites.
| Case Study 2 |
|---|
In the second case study, the upstream regions of 16 genes known to be induced by gibberellins (GA) were screened for common regulatory elements. PRECISE found a total of 19,751 oligomers present in at least 4 of the 16 promoters.
In the PLACE database, there are in total 17 known GA motifs, and the sequences of these motifs were compared to the sequences of the oligomers present in at least four promoters. Six of the 17 motifs had a relative frequency ratio (FTr/FRr) greater than 10 and were 100% homologous to the motifs in the PLACE database (Table 2).
| Conclusion |
|---|
While this manuscript was under review, an article was published by Tompa et al. (2005) in which the performance of publicly available computational tools to find cis-acting elements was compared. Most of these tools use weight matrices in order to find cis elements, and only a few (e.g., Weeder) use a principle comparable to PRECISE. PRECISE was compared to a number of these tools. The results indicated that several motifs found with PRECISE were also found with several other tools, while others were only found with PRECISE, but as concluded by Tompa et al. (2005), further experimental data are necessary in order to confirm which of the tools perform best.
Finally, the applicability of PRECISE is not restricted to finding cis-acting elements in promoter regions; when complemented with a cluster analysis, it can also be used to map, find, or cluster coregulated genes whose functions have not been previously characterized.
| Availability |
|---|
The PRECISE software package is available for academic users from the Laboratory of Plant Breeding, Wageningen University, The Netherlands, at http://www.dpw.wau.nl/pv/pub/precise/. A users' manual is also available at this location.
| Footnotes |
|---|
Corresponding Editor: Reid Palmer
Received November 4, 2004
Accepted May 25, 2005
| References |
|---|
Brazma A, Jonassen I, Vilo J, and Ukkonen E, 1998. Predicting gene regulatory elements in silico on a genomic scale. Genome Res 8:1202-1215.
Bulyk ML, Huang X, Choo Y, and Church GM, 2001. Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc Natl Acad Sci USA 98:7158-7163.
Bussemaker HJ, Li H, and Siggia ED, 2001. Regulatory element detection using correlation with expression. Nat Genet 27:167-171.
Daraselia ND, Tarchvskaya S, and Narita JO, 1996. The promoter for tomato 3-hydroxy-3-methylglutaryl coenzyme A reductase gene 2 has unusual regulatory elements that direct high-level expression. Plant Physiol 112:727-733.
Fickett JW and Wasserman WW, 2000. Discovery and modeling of transcription regulatory regions. Curr Opin Biotechnol 11:19-24.
Higo K, Ugawa Y, Iwamoto M, and Korenaga T, 1999. Plant cis-acting regulatory DNA elements (PLACE) database: 1999. Nucleic Acids Res 27:297-300.
Hughes JD, Estep PW, Tavazoie S, and Church GM, 2000. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 296:1205-1214.
Jensen LJ and Knudsen S, 2000. Automatic discovery of regulatory patterns in promoter regions based on whole cell expression data and functional annotation. Bioinformatics 16:326-333.
Liu XJ, Prat S, Willmitzer L, and Frommer WB, 1990. Cis-regulatory elements directing tuber-specific and sucrose-inducible expression of a chimeric class I patatin promoter/GUS-gene fusion. Mol Gen Genet 223:401-406.
Miyazawa Y, Sakai A, Miyagishima S, Takano H, Kawano S, and Kuroiwa T, 1999. Auxin and cytokinin have opposite effects on amyloplast development and the expression of starch synthesis gene in cultured yellow-2 tobacco cells. Plant Physiol 121:461-469.
Moore DS and McCabe GP, 1999. Introduction to the practice of statistics, 3rd ed. New York: Freeman; 601-609.
Pilpel Y, Sudarsanam P, and Church GM, 2001. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet 29:153-159.
Rinne P, Tuominen H, and Junttila O, 1994. Seasonal changes in bud dormancy in relation to bud morphology, water and starch content, and abscisic acid concentration in adult trees of Betula pubescens. Tree Physiol 14:549-561.
Roth FP, Hughes JD, Estep PW, and Church GM, 1998. Finding DNA regulatory motifs within unaligned non-coding sequences clustered by whole-genome mRNA quantification. Nat Biotechnol 16:939-945.
Santos HP, Purgatto E, Mercier H, and Buckeridge MS, 2004. The control of storage xyloglucan mobilization in cotyledons of Hymenaea courbaril. Plant Physiol 135:287-299.
Schlüter U, Muschak M, Berger D, and Altmann T, 2003. Photosynthetic performance of an Arabidopsis mutant with elevated stomatal density (sdd1-1) under different light regimes. J Exp Bot 54:867-874.
Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, and Zhu Z, 2005. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23:137-144.
Yang J, Zhang J, Wang Z, Zhu Q, and Liu L, 2002. Abscisic acid and cytokinins in the root exudates and leaves and their relationship to senescence and remobilization of carbon reserves in rice subjected to water stress during grain filling. Planta 215:645-652.
van Helden J, Andre B, and Collado-Vides J, 1998. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 281:827-842.
Vilo J, Brazma A, Jonassen I, Robinson A, and Ukkonen E, 2000. Mining for putative elements in the yeast genome using gene expression data. Proc Int Conf Intell Syst Mol Biol 8:384-394.
Vilo J and Kivinen K, 2001. Regulatory sequence analysis: application to the interpretation of gene expression. Eur Neuropsychopharmacol 11:399-411.
Vincens P, Buffat L, Andre C, Chevrolat JP, Boisvieux JF, and Hazout S, 1998. A strategy for finding regions of similarity in complete genome sequences. Bioinformatics 14:715-725.
Zhu J and Zhang MQ, 2000. Cluster, function and promoter: analysis of yeast expression array. Pac Symp Biocomput 2000:479-90.