Characterization of Structure, Divergence and Regulation Patterns of Plant Promoters

Plant promoters have attracted increasing attention because of their irreplaceable role in modulating the spatio-temporal expression of genes through interaction with transcription factors (TFs). Despite their importance, the basic characteristics of plant promoters are not well understood. To determine sequence diversity within promoter regions, evolutionary divergence of promoters between plant species, and the general structural characteristics of promoter sequences, we downloaded and analyzed 3922 plant promoter sequences from a wide range of plant species. The average plant promoter GC content was lower in dicotyledons than in monocotyledons, which might suggest different evolutionary pressures on promoter sequences in the two clades. Approximately 3.3% of plant promoters harbored minisatellite sequences, and 15.4% of plant promoters harbored microsatellite sequences (also called simple sequence repeats). Very few transposable elements were detected within the plant promoters. The most common transcription factor binding site (TFBS) motif was AGAGAGAGA, followed by TTAGGGTTT and then GCCGCC. Transcribed gene regions with promoters containing the corresponding TFBSs were predicted to be most commonly involved in metabolic processes, biological regulation, and stimulus response in plants. These results reveal some basic structural characteristics of plant promoters and clarify the evolutionary forces shaping plant promoters. These data might facilitate cloning of plant promoter sequences and aid in our understanding of spatio-temporal gene expression patterns in plants.


Introduction
Promoters are sections of DNA sequence that lie upstream of the transcribed sequences and regulate their expression (Hernandez-Garcia et al., 2010). Promoters contain binding sites for transcription factors (TFs), and interact with these TFs to modulate gene expression. RNA polymerase initiates transcription at promoter sequences, and hence binding of RNA polymerase by TFs within promoter sequences regulates spatio-temporal expression of the downstream transcribed sequence (Camp et al., 2003; Halfon & Zhu, 2009; Freeman et al., 2011). Therefore, promoters are critical for priming or halting gene expression (Wolf et al., 2010; Mastroeni et al., 2011), especially in stress signaling and transcriptional activation during pathogen infection (Hwang et al., 2009; Pandey & Somssich, 2009). To date, numerous promoters have been identified in animals (Romania et al., 2011), plants (Wang et al., 2011), viruses (Smith et al., 2011b), and microorganisms (Cooper et al., 2011). Peak promoters switch on transcription in a narrow genomic region, while broad promoters switch on transcription in a wide genomic region (Nozaki et al., 2011). Cap-analysis gene expression data can be used to identify these two types of promoters (Carninci et al., 2006). Peak promoters generally contain TATA-boxes (except in mammals) and regulate tissue-specific transcripts in eukaryotes (Hoskins et al., 2011). For most promoters, gene transcription starts from broad regions that are usually associated with CpG islands. These broad promoters have a wide distribution of transcription start sites (TSSs), usually over a 100-bp region, and start sites that are preferentially composed of pyrimidine/purine dinucleotides (Carninci et al., 2006).
Promoters can be divided into prokaryotic- and eukaryotic-type promoters, which differ mainly in promoter motifs. A typical promoter sequence is thought to comprise certain motifs positioned at specific sites upstream of the TSS. Two hexameric motifs centered at or near the -10 and -35 positions relative to the TSS are observed in a prokaryotic promoter, whilst a TATA box, a CCAAT box, and a GC box are usually observed in eukaryotic promoters (Bansal & Kanhere, 2005). These three types of boxes play a major role in precise initiation of transcription (Molina & Grotewold, 2005). Nevertheless, not every eukaryotic gene promoter has all three motifs (Anish et al., 2009). In addition, some novel motifs in promoter sequences, e.g., AGTTAGG (Abdullah et al., 2010), G-quadruplex (Chowdhury et al., 2010), and TATGAAAAGAATATGAGAA motifs (Wu & Huang, 2004), have been identified. Other promoter motifs, such as GATA (Obara et al., 2005) and AAAAT (Van Oers et al., 2007), are not conserved but are essential for the function of some promoters. Overall, eukaryotic promoters display more complex structures and regulation patterns than prokaryotic promoters (Bansal & Kanhere, 2005).
Promoters undergo mutations such as nucleotide substitutions and small insertions and deletions, in a similar fashion to transcribed sequences (Seliverstov et al., 2009). The evolution and conservation of promoters have been scrutinized through comparative genomics studies in mammals. Previous studies include comparisons between humans and chimpanzees (Deyneko et al., 2010), and between rats, mice, rhesus monkeys, and humans for promoters of hepatic lipase genes (Van Deursen et al., 2007). GC-rich monotone gradients have been observed in eukaryotes while AT-rich monotone gradients have been observed in bacteria, along with strand biases (Calistri et al., 2011).
Each gene can have several promoters that control its spatio-temporal expression. Although promoters are important in investigating patterns of gene expression and for transgenic work, promoters are cloned far less often than transcribed gene sequences. A total of 3922 plant promoters have been collected to date in the Plant Promoter Database (PlantProm DB; http://linux1.softberry.com/berry.phtml). The basic structural and evolutionary characteristics of plant promoters remain poorly understood, making plant promoter sequences hard to identify. To facilitate better characterization of plant promoter sequences, the 3922 available plant promoter sequences were downloaded and analyzed. Basic promoter characteristics were dissected, including the presence of special motifs, minisatellite sequences, microsatellite sequences, and transposable elements (TEs). We present the results of this analysis, and propose mechanisms for promoter divergence and evolution.

Acquisition of Plant Promoter Sequences
All plant promoter sequences from monocotyledons and dicotyledons (the latter mainly from Arabidopsis thaliana) were downloaded from the PlantProm DB (Release 2009.02; http://linux1.softberry.com), an annotated, non-redundant collection of proximal promoter sequences (Shahmuradov et al., 2003). These promoters could potentially be recognized by RNA polymerase II and contained experimentally determined TSSs from diverse plant species (Solovyev et al., 2003). The PlantProm DB contains both the predicted TSSs and the experimentally verified promoter TSSs, identified using approaches such as full-length cDNA/5'EST mapping, cap-analysis gene expression, and serial analysis of gene expression.

Detection of Minisatellite Sequences
Minisatellites, a type of tandem repeat sequence, consist of a short series of 11-100 bp repeat units. Tandem Repeats Finder 4.04 (http://tandem.bu.edu/trf/trf.download.html), developed by Gary Benson of the Bioinformatics Program at Boston University, was used to detect minisatellite sequences (Martin, 2006). Default parameters were used: the alignment parameters were match = 2, mismatch = 7, and indel = 7; the minimum alignment score to report a repeat was 50; and the maximum period size was 100 bp.
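The parameter set above can be written down as a Tandem Repeats Finder command line. The following is a minimal sketch and does not reproduce the exact invocation used in this study: the executable name `trf`, the input file name `promoters.fa`, and the matching and indel probabilities (PM = 80, PI = 10, the values commonly recommended in the TRF documentation) are assumptions, since the text does not report them. The helper `is_minisatellite` is a hypothetical filter that applies the 11-100 bp repeat-unit definition given above.

```python
import shlex


def trf_command(fasta_path, match=2, mismatch=7, indel=7,
                pm=80, pi=10, min_score=50, max_period=100):
    """Build the argument list for a Tandem Repeats Finder run.

    TRF takes positional arguments in the order:
    File Match Mismatch Delta PM PI Minscore MaxPeriod [options].
    pm and pi are ASSUMED values (not stated in the text).
    """
    return ["trf", fasta_path,
            str(match), str(mismatch), str(indel),
            str(pm), str(pi), str(min_score), str(max_period),
            "-d",   # write a plain-text data file
            "-h"]   # suppress HTML output


def is_minisatellite(period):
    """Keep repeats whose unit length falls in the 11-100 bp range."""
    return 11 <= period <= 100


cmd = trf_command("promoters.fa")  # hypothetical input file
print(shlex.join(cmd))
# The list could be handed to subprocess.run(cmd, check=False) once TRF is installed.
```

Building the argument list separately from running it keeps the parameter choices visible and easy to compare with the values reported in the text.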

Detection of Transposable Elements
There are two classes of transposable elements (TEs): DNA transposons and retrotransposons (Zhang et al., 2004). The Long Terminal Repeat (LTR)-Finder 1.05 (http://tlife.fudan.edu.cn/ltr_finder/) was used to detect full-length LTR retrotransposons in genome sequences. The parameters of minimal LTR length, minimal distance between LTRs, and the output threshold score were set to 50, 100, and 3.0, respectively (Gao et al., 2012). The RepeatMasker 3.0SE-AB program (www.repeatmasker.org) was used to detect all types of transposons using the abblast (formerly known as WUBlast) search engine with A. thaliana set as the reference species. Since LTR-type retrotransposons detected by the LTR-Finder tool with default parameters exhibit intact retrotransposon sequence characteristics, LTR-Finder predictions were used instead of LTR retrotransposon predictions from RepeatMasker.

Prediction of Transcription Factor Binding Sites (TFBSs)
The online software NSITE-PL (http://linux1.softberry.com) with default parameters was used to predict transcription factor binding sites by recognition of regulatory motifs of plant promoters.

Functional Annotation by Blast2Go
The sequences of the transcribed gene regions with promoters containing TFBSs were downloaded in a batch from NCBI (http://www.ncbi.nlm.nih.gov/sites/batchentrez). Blast2Go V2.6.0 (Conesa et al., 2005) (http://www.blast2go.org), a functional annotation prediction tool for unknown sequences, was used with default parameters to predict the putative functions of the transcribed gene regions with promoters containing TFBSs. Functional annotations of these genes were carried out for cellular component, biological process and molecular function.

Alignment of Plant Promoter Sequences
All plant promoters underwent all-by-all BlastN analysis using the basic local alignment search tool (BLAST) (http://www.ncbi.nlm.nih.gov/blast) (Cameron & Williams, 2007) with an E-value of less than 1e-10. The alignment results were imported into Cytoscape V2.7.0 (an open source platform for complex network analysis and visualization) (http://www.cytoscape.org) to classify different groups using the 'import network from table' function.
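The step from BLAST hits to a network table can be sketched as follows. This is an illustration under stated assumptions and does not reproduce the script used in this study: it assumes BLAST tabular output in the standard 12-column layout (query, subject, ..., E-value in column 11, bit score in column 12), reads the threshold as 1e-10, and uses invented promoter identifiers (`promA`, `promB`, `promC`) and hit values. Self-hits are dropped and reciprocal hits are merged into one undirected edge.

```python
import csv
import io


def blast_edges(tabular_text, max_evalue=1e-10):
    """Return {(seq1, seq2): best_evalue} for hits with E-value < max_evalue."""
    edges = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if query == subject or evalue >= max_evalue:
            continue  # ignore self-hits and weak hits
        pair = tuple(sorted((query, subject)))  # undirected edge
        edges[pair] = min(evalue, edges.get(pair, evalue))
    return edges


def edge_table(edges):
    """Format the edges as a tab-separated table with a header row."""
    lines = ["source\ttarget\tevalue"]
    for (a, b), evalue in sorted(edges.items()):
        lines.append(f"{a}\t{b}\t{evalue:.0e}")
    return "\n".join(lines)


# Toy hits with hypothetical identifiers and values.
hits = "\n".join([
    "promA\tpromA\t100.0\t250\t0\t0\t1\t250\t1\t250\t1e-130\t460",
    "promA\tpromB\t92.0\t200\t16\t0\t1\t200\t5\t204\t3e-25\t110",
    "promB\tpromA\t92.0\t200\t16\t0\t5\t204\t1\t200\t2e-25\t111",
    "promA\tpromC\t85.0\t40\t6\t0\t10\t49\t3\t42\t0.002\t30",
])
edges = blast_edges(hits)
print(edge_table(edges))
```

A table of this shape (one source column, one target column, one attribute column) is the kind of input that a table-based network import expects; sequences without any retained hit would appear as unconnected nodes.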

Phylogenetic Dendrogram of Plant Promoter Sequences
The Molecular Evolutionary Genetics Analysis (MEGA) 4.0 software (http://www.megasoftware.net) was used to draw the phylogenetic dendrogram of different plant promoter sequence groups using the maximum composite likelihood (MCL) model with the bootstrap value set to 1000 (Kumar et al., 2007).

Plant Promoter Sequence Sets
A total of 3922 plant promoter sequences were downloaded from the PlantProm DB: 98 from monocotyledons and 3824 from dicotyledons. Monocotyledon sequences included 36 plant promoters from Zea, 32 from Hordeum, and 19 from Triticum. Dicotyledon sequences included 3537 plant promoters from Arabidopsis, 49 from Nicotiana, 46 from Solanum, 31 from Glycine and 31 from Pisum. Another 130 plant promoters were acquired from other genera, including Phaseolus (13), Brassica (9), and Avena (4).

Distribution of GC Content of Plant Promoters
GC content was calculated for each plant promoter sequence. The GC content of plant promoters ranged from 13.1% to 72.6%, with an average of 34.6%. The GC content of dicotyledon promoters ranged from 13.1% to 58.6% with an average of 34.1%, whilst the GC content of monocotyledon promoters ranged from 33.0% to 72.6% with an average of 50.5%. The centre of the GC content distribution for most dicotyledon promoters was from 30% to 40% (median 34.26%), whereas the centre of the GC content distribution for most monocotyledon promoters was from 50% to 60% (median 51%) (Figure 1).
Approximately 15% of the analyzed plant promoters (605 out of 3922) contained one or more microsatellites. Of these, 93% (563 out of 605) contained a single microsatellite, 6.5% (39 out of 605) contained two microsatellites, and 0.5% (3 out of 605) contained three microsatellites. Microsatellites with monomer motifs were by far the most common microsatellite type in the promoters (74.92%). Dimeric and trimeric microsatellite motifs were the next most common and accounted for 15.39% and 6.14%, respectively, of the microsatellites found in promoters (Table 1).
Microsatellites with monomer motifs were almost all A/T types (486 out of 487; 99.79%), with a single C monomer motif. A-motifs comprised the majority of the microsatellites with monomer repeats (345 out of 487; 70.84%) and T-motifs the minority (141 out of 487; 28.95%). AG/CT and GA/TC microsatellites comprised 71% of microsatellites with dimer motifs (Table 1).
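The per-sequence statistics in this section (GC content and microsatellite counts) can be illustrated with a short sketch. It does not reproduce the tools used in this study: the minimum repeat numbers in `MIN_REPEATS` are hypothetical thresholds chosen for illustration because the text does not state the detection criteria, only mono-, di- and trinucleotide units are searched, and the example sequence is invented.

```python
import re

# ASSUMED thresholds: minimum number of tandem copies per unit length.
MIN_REPEATS = {1: 10, 2: 6, 3: 5}


def gc_content(seq):
    """Percentage of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def find_microsatellites(seq, min_repeats=MIN_REPEATS):
    """Return (start, unit, copies) for perfect tandem repeats."""
    seq = seq.upper()
    hits = []
    for unit_len, n in sorted(min_repeats.items()):
        # One unit followed by at least n - 1 further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, n - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units such as "AA" or "TTT", which are really monomer repeats.
            if unit_len > 1 and len(set(unit)) == 1:
                continue
            hits.append((m.start(), unit, len(m.group(0)) // unit_len))
    return hits


# Invented toy sequence: an (A)12 run and an (AG)7 run.
toy = "TTGC" + "A" * 12 + "CGT" + "AG" * 7 + "TCC"
print(round(gc_content(toy), 1))
print(find_microsatellites(toy))
```

Counting the hits per promoter and grouping them by unit length would give the kind of tally summarized in Table 1; imperfect repeats are not detected by this simple pattern.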

Detection of Minisatellite Sequences
Approximately 2.24% of promoters (88 out of 3922) contained minisatellite sequences. No minisatellite sequences were found in monocotyledons. The length of the repeat unit ranged from 11 to 116 bp with an average of 24 bp, and the average number of minisatellite repeats was 2.3, ranging from 1.9 to 3.8.

Analysis of TFBSs
We used the online software NSITE-PL to predict 31259 TFBS motifs from 3922 plant promoter sequences. On average, one promoter contained eight TFBS motifs. Motif lengths ranged from 4 to 51 bp (predominantly ≤ 30 bp; 99.9%) with an average length of 11 bp.
TFBSs with 10-bp motifs comprised the highest proportion of TFBSs in the promoters (25.5%), followed by TFBSs with 12-bp motifs (14.2%), then TFBSs with 9-bp motifs (14.18%) (Figure 2). TFBS motif length mostly ranged from 6 to 17 bp (97.8% of all motifs). Up to 50% of the motifs were classified into 545 motif types, demonstrating that some key TFBS motifs are widely distributed in promoters. Most TFBS motifs possessed the characteristics of simple repeat sequences. The TFBS motif with the highest frequency was AGAGAGAGA (1.6%; 495 out of 31259), which has previously been suggested to be a regulatory element for light-responsive regulation of phototransduction in plants (Parida et al., 2009). The second most common TFBS motif was TTAGGGTTT (1.3%; 392 out of 31259); this motif has been shown to interact directly with MYB2-box-like elements in the promoters of osmotic, drought, and ABA-induced genes (Yun et al., 2010). The next most common TFBS motif was GCCGCC (1.1%; 336 out of 31259), which is involved in the cell cycle, jasmonic acid (JA) responsiveness and sugar signaling (Hu et al., 2011) (Figure 3). Together, the three most common motifs accounted for about 4% of all predicted motifs, with the remaining motifs present at lower frequencies. The G+C content of TFBS motifs varied from 0.0% to 100.0%, with an average of 43.35%. Motifs with G+C content ranging from 0.0% to 50.0% accounted for 74.7% of all motifs, suggesting that critical promoter motifs exist in AT-rich regions.
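The motif frequencies reported above are simple proportions of the 31259 predicted motifs. The sketch below illustrates the tally, assuming the predictions are available as a plain list of motif strings; the function name `motif_frequencies` and the toy list are hypothetical, while the three counts in `reported` are taken from the text.

```python
from collections import Counter


def motif_frequencies(motifs):
    """Return (motif, count, percent of all motifs), most common first."""
    counts = Counter(m.upper() for m in motifs)
    total = sum(counts.values())
    return [(motif, n, 100.0 * n / total) for motif, n in counts.most_common()]


# Toy input (hypothetical): six predicted motifs from a handful of promoters.
toy_motifs = ["AGAGAGAGA"] * 3 + ["TTAGGGTTT"] * 2 + ["GCCGCC"]
ranking = motif_frequencies(toy_motifs)
print(ranking[0])

# Re-deriving the reported percentages from the counts given in the text.
TOTAL_MOTIFS = 31259
reported = {"AGAGAGAGA": 495, "TTAGGGTTT": 392, "GCCGCC": 336}
percentages = {m: round(100.0 * n / TOTAL_MOTIFS, 1) for m, n in reported.items()}
print(percentages)
```

The same counter, applied to motif lengths instead of motif sequences, would give a length distribution of the kind shown in Figure 2.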
Figure 3. Mean number distribution of transcription factor binding site (TFBS) motifs in each plant promoter group. A total of 31259 motifs could be classified into 16 groups of motifs identical over their whole length. These groups are arranged by the number of members for each motif, from greatest to least.
Aside from the conserved TFBS motifs mentioned above, different cis-regulatory elements were also found in promoter sequences: 29201 cis-regulatory elements were identified in total. Although the frequency of most regulatory element types was low, some regulatory elements (G-box, GA-box, and ABRE motifs) were found at considerably higher frequencies (Figure 4). Among those three, G-box regulatory elements were the most common, accounting for 7.07% (2065 out of 29201) of the total regulatory elements. GA-box regulatory elements were the second most common at 5.00% (1460 out of 29201), and ABRE regulatory elements were the third most common at 4.81% (1405 out of 29201). Our results show that a small number of motifs with high affinities for binding proteins are widely distributed in promoter sequences.
NSITE-PL was used to recognize TFBSs and to provide information on the transcribed gene regions with promoters containing the corresponding TFBSs. Blast2Go was used to predict the functional annotation of these transcribed gene regions for biological process, molecular function and cellular component. For biological process annotation, the most common involvement was in metabolic processes (27.81%), followed by biological regulation (27.54%) and response to stimulus (17.77%) (Figure 5a). With respect to molecular function, the transcribed gene regions mainly played a role in binding (45.57%), followed by catalytic activity (23.86%) and other unknown molecular functions (17.57%) (Figure 5b). With regard to cellular component, the transcribed gene regions most commonly functioned in the organelles (42.75%), followed by intracellular (22.46%) and cellular (20.53%) components (Figure 5c). Hence, the transcribed gene regions with promoters containing the corresponding TFBSs were mainly involved in metabolic processes, most commonly had a binding function, and most commonly functioned in the organelles.

Analysis of Alignment and Phylogenetic Dendrogram of Plant Promoter Sequences
All-by-all BlastN analysis of the plant promoters did not allow clear classification into different subclasses (Figure 6), indicating that the homology of these plant promoter sequences was relatively low. Nevertheless, according to the structure of the phylogenetic dendrogram, the ancestral lineages produced in MEGA 4 (Figure 7) and the species taxonomy, the plant promoter sequences could be classified into 8 groups containing 1172, 791, 60, 24, 136, 59, 287, and 1393 sequences, respectively (Figure 8). The genetic distance between the 8 groups was 0.19 on average, indicating greater divergence within the plant promoter sequence groups than between groups.
Figure 8. All plant promoter sequences were divided into eight groups according to the standard of uniform ancestral lineage in their phylogenetic dendrogram; the number of sequences per group is shown on the Y-axis.

GC Content and Mutability of Plant Promoter Sequences
In the current study, the GC content of plant promoters was between 30% and 40% in most dicotyledon species, but between 50% and 60% in most monocotyledon species, indicating that the GC content of plant promoters is generally higher in monocotyledons than in dicotyledons. AT-rich regions mutate more readily than GC-rich regions, which generates diversity, and are more prone to insertion of exogenous fragments such as transposons (Gupta et al., 2005). Hence, more complex gene regulation may be required in dicotyledons compared to monocotyledons. AT-rich microsatellite sequences were also very common in the plant promoter sequences, suggesting that the mutability of plant promoters may have an important evolutionary adaptive role in diversification of gene expression. Nevertheless, some transcribed gene regions with GC-rich promoters are expressed more efficiently (Singh et al., 2012), suggesting that balancing selective pressure may exist for retention of GC-rich promoter sequences for genome stability.

Frequency and Possible Functionality of Key Promoter Motifs
According to the results, the length of most plant promoter TFBSs ranged from 6 to 17 bp, with AGAGAGAGA (1.6%; 495 out of 31259), TTAGGGTTT (1.3%; 392 out of 31259), and GCCGCC (1.1%; 336 out of 31259) being the most common. These high-frequency motifs may represent cis-regulatory elements that enhance the expression of sets of related genes. These common motifs may exist in the promoters of genes that have been highly conserved during species evolution, such as genes that play basic roles in plant growth and development. For example, the motif AGAGAGAGA is a known regulatory element participating in light-responsive regulation of phototransduction in plants (Parida et al., 2009). This motif is also present in the promoters of WRKY genes, which encode WRKY proteins, one of the largest families of TFs, regulating processes such as response to biotic and abiotic stresses in plants (Zhang & Wang, 2005; Rushton et al., 2010).
In rice, the WRKY gene family contains over 100 members (Pandey & Somssich, 2009). Likewise, the second most common TFBS motif (TTAGGGTTT) can directly interact with MYB2-box-like elements in the promoters of osmotic, drought, and ABA-induced genes (Yun et al., 2010). In contrast, different organisms may also have organism-specific but genome-wide TFBS motifs. For example, in Actinobacteria, the most significant TFBS motif is TCGAACA (Janky & van Helden, 2008). Similarly, the octamer AAAATTGA motif exists in the predicted core promoters of almost half the Mimivirus genes (Suhre et al., 2005). Therefore, high-frequency TFBS motifs may play multiple and comprehensive roles in many processes occurring in different organisms.
In addition, plant promoter motifs play important roles in accurate initiation of transcription. TFs can bind to DNA at specific cis-regulatory elements to orchestrate transcription (Rombauts et al., 2003). A small number of TFs can also combine with special promoter motifs to regulate expression of large numbers of genes (Smith et al., 2011a). Identification of such promoters may be useful for transgenic breeding, because the combination of these critical motifs with just a few TFs may allow more effectively controlled expression of a batch of downstream transcribed gene regions. Critical promoter motifs with important roles can also be used to construct regulatory sequences that contribute to spatio-temporal expression in transgenic plants. Thus, recombined regulatory sequences could not only accelerate breeding but also help in obtaining special gene products.

Functional Annotation of the Transcribed Gene Regions With Promoters Containing TFBSs
In this study, 31259 TFBS motifs were detected from 3922 plant promoter sequences. On average, one promoter contained eight TFBS motifs. What are the functions of the transcribed gene regions with promoters containing TFBSs? Blast2GO annotation revealed that the transcribed gene regions with TFBS-containing promoters commonly controlled metabolic processes during plant development, mainly had molecular binding functionality, and were operative in the organelles. We may characterize and mine critical TFBSs and promoters from these transcribed gene regions to serve breeding purposes. Promoter cloning and subsequent manipulation of spatio-temporal gene expression offers significant promise as a developing research field in transgenic breeding. Promoter-based transgenic technologies have already been applied to great effect in wheat, where a heat-inducible promoter effectively controlled the spatio-temporal expression of a transgene (Freeman et al., 2011).

Some Microsatellites are Universally Distributed in Plant Promoters
Different species share common, prevalent motifs in promoters. The current study observed that (A)n and (T)n were the predominant mononucleotide microsatellite motifs, and that (AG)n, (GA)n, (CT)n, and (TC)n were the predominant dinucleotide microsatellite motifs. This result suggests that microsatellites with specific motifs survived during natural selection due to positive selective advantages. The monomer microsatellites (almost all A and T motifs) accounted for the highest proportion of the microsatellite-containing promoter sequences. As the A/T-motif microsatellites are easily mutated (Gao et al., 2011), this may indicate a positive selection pressure due to the advantage provided by the extra diversity of gene expression in adapting to the environment and evolving into more complex higher organisms.
In summary, the GC content of plant promoters in monocotyledons appeared to be higher than that in dicotyledons. Most microsatellites and TEs were quite rare in promoter sequences, whereas microsatellites with A and T monomers were very commonly observed and may provide adaptive mutability potential in plant promoter sequences. Most TFBS motifs fell within a narrow range of lengths (6 to 17 bp), and the regulatory elements occurring with high frequency were mostly G-box, GA-box, and ABRE motifs. For biological process annotation, the transcribed gene regions with promoters containing the corresponding TFBSs were mainly involved in metabolic processes; with respect to molecular functionality, the most common function was binding; and with regard to cellular component annotation, the most common functional location was the organelles. The characteristics of higher A/T content, more microsatellites and a small quantity of TEs in plant promoters may play a role in the evolution of plant promoters. The different TFBS motifs in plant promoters are a critical element of spatio-temporal expression of genes. These results are beneficial not only for elucidating the mechanisms of spatio-temporal gene expression and for cloning key plant promoters (or their main motifs), but also for investigating the basic structure of plant promoters and clarifying the evolutionary forces at work in plant promoter diversification.

Figure 1. Proportion of plant promoter sequences with different ranges of GC content in different classes: monocotyledon and dicotyledon.

a The left-hand side motif. b The right-hand side motif (reverse complement of a). The percentage of Type 1 and Type 2 motifs was derived by dividing the number of Type 1 or Type 2 motifs by the subtotal.

Figure 2. Distribution of motif lengths of transcription factor binding sites (TFBSs) in plant promoters. Promoter TFBS motifs were predicted using the software NSITE-PL to process plant promoter sequences.

Figure 4. Number distribution of plant promoter regulatory elements detected in promoters. The values in the brackets represent the number range of all kinds of regulatory elements detected, and the percentages denote the total percentage of promoters in each regulatory element group.

Figure 5. Functional annotation of the transcribed gene regions with promoters containing the corresponding transcription factor binding sites (TFBSs). (a) Biological process; (b) Molecular function; (c) Cellular components.

Figure 6. Classification of different plant promoter sequences. All-by-all BlastN analysis was used to classify different plant promoter sequences into different subclasses. The circles represent different plant promoter sequences, and the lines between the circles denote the homology between the two plant promoter sequences distributed in the two circles.

Figure 7. Phylogenetic dendrogram of the plant promoter sequences of 288 species. All plant promoter sequences were classified into 8 classes. The yellow line represents the demarcation of different classes. The numbers on the right represent the plant promoter sequence groups, with groups marked out using two pink lines.

Table 1. Type and distribution of microsatellites in the collected plant promoter sequences

Table 2. Predictions of presence of different types of transposable elements (TEs) in plant promoter sequences