TOBFAC: The database of tobacco transcription factors

Transcription factors are proteins that bind to DNA and regulate gene expression by activating or repressing transcription. Regulation of gene expression is fundamental to all aspects of biology. Working with Michael Timko, a team of biologists and computer scientists led by Paul Rushton has hand-curated 65 families of transcription factors in tobacco to create the TOBFAC database and web site. The starting materials are the gene space sequences (GSSs) from the NCSU Tobacco Genome Initiative (TGI). The TOBFAC team consists of Paul J. Rushton, Marta T. Bokowiec, Xianfeng Jeff Chen, Thomas W. Laudeman, Jennifer F. Brannock, and Michael P. Timko.

Regulation of gene expression at the level of transcription is a major control point in many biological processes, and plant genomes devote approximately 7% of their coding sequence to transcription factors. Global analysis of transcription factors has been performed for only three seed plants: arabidopsis (Arabidopsis thaliana), poplar (Populus trichocarpa), and rice (Oryza sativa). TOBFAC, the database of tobacco transcription factors, contains a detailed analysis of at least 2,507 tobacco (Nicotiana tabacum) transcription factors, based on a dataset of 1,159,022 GSSs obtained by methylation filtering from the TGI. These GSSs are estimated to represent at least 90% of tobacco open reading frames. TOBFAC contains all of the transcription factor sequences from the tobacco gene space sequence, together with EST data. These sequences can be queried by BLAST searches and downloaded for further analysis. TOBFAC also contains phylogenetic trees for some of the largest transcription factor families, and these are also downloadable.
We aim to update the information regularly so that TOBFAC continues to be one of the most wide-ranging databases of transcription factors in any plant species and a major resource for the study of gene expression in tobacco and the Solanaceae.

Tobacco (Nicotiana tabacum) has been one of the most studied plant species, partly because of its economic importance and partly because it is a convenient plant system for research. Tobacco is a model plant for the Solanaceae and is an amphiploid species (2n=48) with a relatively large genome of approximately 4.5 Gbp, which makes sequencing the tobacco genome difficult. To alleviate some of the difficulties created by the large amounts of repetitive DNA in large genomes, a number of techniques have been developed to isolate the low-copy or hypomethylated regions of the genome for sequencing. One of these techniques is methylation filtration (MF), which preferentially clones the hypomethylated fraction of the genome, effectively reducing the size of the genome to be sequenced. The Tobacco Genome Initiative (TGI) was established to sequence and annotate more than 90% of the open reading frames in the genome of cultivated tobacco using methylation filtration technology.

We used a dataset of 1,159,022 gene space sequences (GSSs) obtained by methylation filtration from the TGI to obtain sequence information from at least 90% of tobacco transcription factors. A consensus amino acid sequence (normally the DNA-binding domain) from each of 64 currently known transcription factor families was used to isolate the sequences that belong to each class of transcription factor. These sequences were assembled into contigs and individually analysed by BLAST searches to verify the identity of each gene sequence. Tobacco contains a minimum of 2,507 transcription factors, a total that is higher than that of either Arabidopsis or rice.
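
The screening step described above (using a family consensus sequence to pull out candidate GSS reads) can be illustrated with a short sketch. This is not the TOBFAC pipeline itself, whose tools and thresholds are not given here. It assumes the consensus sequences were searched against the GSS reads with a BLAST program that writes the standard 12-column tabular output, and the function name, E-value cutoff, query names, and read IDs are all hypothetical.

```python
from collections import defaultdict

# Column order of standard 12-column BLAST tabular output.
FIELDS = ("qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore").split()


def candidate_reads(tabular_lines, max_evalue=1e-5):
    """Group subject read IDs by query, keeping hits with E-value <= max_evalue.

    The cutoff is an illustrative assumption, not a value from TOBFAC.
    """
    by_family = defaultdict(set)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        row = dict(zip(FIELDS, line.split("\t")))
        if float(row["evalue"]) <= max_evalue:
            # A set collapses several hits to the same read into one entry.
            by_family[row["qseqid"]].add(row["sseqid"])
    return dict(by_family)


# Invented example hits; none of these IDs come from the real dataset.
example = [
    "WRKY_consensus\tGSS_000001\t78.3\t60\t13\t0\t1\t60\t12\t191\t2e-25\t98.2",
    "WRKY_consensus\tGSS_000002\t41.0\t39\t23\t0\t5\t43\t30\t146\t0.8\t28.1",
    "NAC_consensus\tGSS_000003\t65.5\t58\t20\t0\t3\t60\t200\t373\t4e-18\t80.5",
    "WRKY_consensus\tGSS_000001\t70.0\t30\t9\t0\t31\t60\t300\t389\t1e-9\t55.0",
]
hits = candidate_reads(example)
```

Reads retained for each family would then go on to contig assembly and the individual BLAST verification described above. A filter like this only narrows the candidate set and does not replace that manual check.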
Our analysis shows that Arabidopsis, poplar and tobacco all contain this core set of 64 transcription factor families and that rice shares 63 of them. This suggests that the evolution of higher plants was not associated with the wholesale gain or loss of transcription factor families but rather with the lineage-specific expansion of transcription factor subfamilies.

Highlights of our work include the discovery of a novel subfamily of NAC transcription factors that we have called TNACs. The TNAC genes make up about 25% of all NAC genes in tobacco but are completely absent from all currently sequenced plant genomes. TNACs are, however, present in tomato, pepper and potato, so this novel subfamily appears to be restricted to the Solanaceae. In addition, we have subjected the tobacco ERF, WRKY, NAC, homeodomain, bZIP, bHLH, R2R3-MYB and MADS box genes to detailed phylogenetic analysis, which facilitates predictions of function based on phylogenetic position.

TOBFAC includes these transcription factor families: ABI, Alfin, AP2, ARF, ARID, AS2, AUX-IAA, BBR-BPC, BES, bHLH, bZIP, C2C2-GATA, C2H2, C3H, CAMTA, CCAAT-Dr1, CCAAT-HAP2, CCAAT-HAP3, CCAAT-HAP5, CONSTANS, CPP, Dof, E2F, EIL, ERF, FHA, GARP-ARR-B, GARP-G2, GeBP, GIF, GRAS, GRF, HMG, Homeodomain, HRT, HSF, JUMONJI, LFY, LIM, LUG, MADS, MBF, MYB-related, NAC, Nin, NZZ, PcG, PHD, PLATZ, R2R3-MYB, S1Fa, SAP, SBP, SRS, TAZ, TCP, Trihelix, TULP, ULT, VOZ, Whirly, WRKY, YABBY, ZF-HD, ZIM.