Gene Identification

HMMgene predicts whole genes, so the predicted exons always splice correctly. It can predict several whole or partial genes in one sequence, so it can be used on whole cosmids or even longer sequences. It can also be used to predict splice sites and start/stop codons. If some features of a sequence are known, such as hits to ESTs, proteins, or repeat elements, these regions can be locked as coding or non-coding, and the program will then find the best gene structure under these constraints.

Gene Discovery Page
The purpose of this page is to serve as a "desktop" area, primarily for the bench scientist with little biocomputing background. It organizes existing search engines in a coherent, stepwise fashion, providing one of the many strategies that may lead to gene discovery. Questions that this page helps to answer are of the type: "Does a particular sequence of DNA code for proteins and what may their function be?" or "Is there a protein in organism A homologous to protein X of organism B?"

FramePlot - protein-coding region prediction tool for high GC-content bacteria
FramePlot is a web-based protein-coding region prediction tool for high GC-content bacteria.
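The window and step size settings listed for this tool can be pictured with a small sliding-window sketch. The source does not say which statistic FramePlot plots; the example below assumes GC content at the third codon position in each of the three forward frames, a signal commonly used for high-GC genomes. The function name `gc3_by_frame` and its parameters are hypothetical and are not part of FramePlot.

```python
def gc3_by_frame(seq, window_codons=40, step_codons=5):
    """Sliding-window GC content at third codon positions, per forward frame.

    Assumption: third-position GC content is the plotted statistic; this is
    an illustration, not FramePlot's actual implementation.
    Returns {frame: [(window_start_in_bases, gc_fraction), ...]}.
    """
    seq = seq.upper()
    profile = {0: [], 1: [], 2: []}
    for frame in range(3):
        # Bases sitting at the third position of each codon in this frame.
        thirds = seq[frame + 2::3]
        # Move the window along the codons by the chosen step size.
        for start in range(0, len(thirds) - window_codons + 1, step_codons):
            chunk = thirds[start:start + window_codons]
            gc = sum(base in "GC" for base in chunk) / window_codons
            profile[frame].append((frame + 3 * start, gc))
    return profile


if __name__ == "__main__":
    demo = "ATG" * 10
    fine = gc3_by_frame(demo, window_codons=4, step_codons=2)
    coarse = gc3_by_frame(demo, window_codons=4, step_codons=4)
    # A larger step yields fewer plotted points for the same sequence.
    print(len(fine[0]), len(coarse[0]))
```

A larger step visits fewer window positions, which is why the run is faster and the resulting plot coarser.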

Clickable Map:

Clicking the graph retrieves the nucleotide and amino acid sequences (FASTA format) of the ORF you are interested in.

BLAST Gateway:

You can submit the sequence directly to the NCBI BLAST server.

Step Size:

The "window" is moved along the sequence by the set "step size". A larger step size makes FramePlot run much faster, but it produces a lower-resolution plot.

Auto Image Width:

"Image width" is adjusted automatically.

tRNAscan-SE Search for transfer RNA genes in genomic sequence
tRNAscan-SE identifies transfer RNA genes in genomic DNA or RNA sequences. It combines the specificity of the Cove probabilistic RNA prediction package (Eddy & Durbin, 1994) with the speed and sensitivity of tRNAscan 1.3 (Fichant & Burks, 1991) plus an implementation of an algorithm described by Pavesi and colleagues (1994) which searches for eukaryotic pol III tRNA promoters (our implementation is referred to as EufindtRNA). tRNAscan and EufindtRNA are used as first-pass prefilters to identify "candidate" tRNA regions of the sequence. These subsequences are then passed to Cove for further analysis, and output if Cove confirms the initial tRNA prediction. In this way, tRNAscan-SE attains the best of both worlds:

a false positive rate of less than one per 15 billion nucleotides of random sequence
the combined sensitivities of tRNAscan and EufindtRNA (detection of 99% of true tRNAs)
search speed 1,000 to 3,000 times faster than Cove analysis and 30 to 90 times faster than the original tRNAscan 1.3 (tRNAscan-SE uses both a code-optimized version of tRNAscan 1.3 which gives a 650-fold increase in speed, and a fast C implementation of the Pavesi et al. algorithm).

NETGENE - Predict splice sites in human genes
NETGENE is a neural network for the prediction of splice site locations in human pre-mRNA. A joint prediction scheme, in which the prediction of transition regions between introns and exons regulates a cutoff level for splice site assignment, predicts splice site locations with confidence levels far better than previously reported in the literature. The prediction of donor and acceptor sites in human genes is hampered by large numbers of false positives. When the method detects 95% of the true donor and acceptor sites, it makes fewer than 0.1% false donor site assignments and fewer than 0.4% false acceptor site assignments.

ORF Finder
The ORF Finder (Open Reading Frame Finder) is a graphical analysis tool which finds all open reading frames of a selectable minimum size in a user's sequence or in a sequence already in the database. This tool identifies all open reading frames using the standard or alternative genetic codes. The deduced amino acid sequence can be saved in various formats and searched against the sequence database using the WWW BLAST server. The ORF Finder should be helpful in preparing complete and accurate sequence submissions.
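The idea of listing every open reading frame above a selectable minimum size can be sketched as follows. This is a minimal illustration, not the ORF Finder's code: it assumes the standard genetic code's stop codons, treats an ORF as running from an ATG to the next in-frame stop, and scans both strands. The names `find_orfs` and `reverse_complement` are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=30):
    """List ORFs as (strand, frame, start, end) tuples.

    Assumptions: an ORF starts at ATG and ends at the first in-frame stop
    codon (stop included); coordinates are 0-based, end-exclusive, and are
    relative to the strand that was scanned.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGCC", min_len=9))
```

Supporting an alternative genetic code would mean swapping the `STOPS` set (and the allowed start codons) for that code's table.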

BCM Gene Finder
This provides a rich set of programs for finding various features in different organisms.

FGENEH / construction of gene model by exon assembling
FEXH / search for potential 5'-, internal and 3'-coding exons
HEXON / search for potential internal exons
HSPL / search for potential splice sites
RNASPL / search for exon-exon junction positions in cDNA
CDSB / search for Protein coding region in E.coli
HBR / recognition of Human and E.coli sequences
POLYAH / Recognition of 3'-end cleavage and polyadenylation region
TSSG / Recognition of human PolII promoter region and start of transcription
TSSW / Recognition of human PolII promoter region and start of transcription
FGENED / construction of gene model by Drosophila exon assembling
FEXD / search for Drosophila potential 5'-, internal and 3'-coding exons
DSPL / search for potential splice sites in Drosophila sequences
FGENEN / construction of gene model by Nematode exon assembling
FEXN / search for Nematode potential 5'-, internal and 3'-coding exons
NSPL / search for potential splice sites in Nematode sequences
FGENEA / construction of gene model by Plant exon assembling
FEXA / search for Plant potential 5'-, internal and 3'-coding exons
ASPL / search for potential splice sites in Plant sequences
FEXY / search for Yeast potential 5'-, internal and 3'-coding exons
YSPL / search for potential splice sites in Yeast sequences

Grail
GRAIL is a suite of tools designed to provide analysis and putative annotation of DNA sequences both interactively and through the use of automated computation.

The coding recognition portion of the system uses a neural network which combines a series of coding prediction algorithms. There are three basic versions of this neural network, GRAIL 1, GRAIL 1a and GRAIL 2.

GRAIL 1 has been in place for about three years. It uses a neural network described in PNAS 88, 11261-11265, which recognizes coding potential within a fixed size (100 base) window. It evaluates coding potential without looking for additional features (information such as splice junctions, etc).

GRAIL 1a is an updated version of GRAIL 1. It uses a fixed-length window to locate the potential coding regions and then evaluates a number of discrete candidates of different lengths around each potential coding region, using information from the two 60-base regions adjacent to that coding region, to find the "best" boundaries for that coding region.

GRAIL 2 uses variable-length windows tailored to each potential exon candidate, defined as an open reading frame bounded by a pair of start/donor, acceptor/donor or acceptor/stop sites. This scheme facilitates the use of more genomic context information (splice junctions, translation starts, non-coding scores of 60-base regions on either side of a putative exon) in the exon recognition process. GRAIL 2 is therefore not appropriate for sequences without genomic context (when the regions adjacent to an exon are not present).

These changes have improved the overall performance compared to GRAIL 1, particularly for short exons.

All three systems have been trained to recognize coding regions in human DNA sequences, although they also work well on a number of other organisms, particularly other mammals.

Genemark
This provides identification of protein coding regions in DNA sequences from prokaryotic and eukaryotic species. The Genemark server accepts a DNA sequence. You may specify a species name and parameters that control the analysis.

The current Genemark version is accurate for finding prokaryotic genes and for analyzing cDNA sequences to discriminate coding regions from non-coding ones such as 5' or 3' UTRs.

This program is able to identify rather long exons (more than 100bp) in eukaryotic genomic DNA. This information can be used for designing RT-PCR or PCR primers.

Genie: A Gene Finder Based on Generalized Hidden Markov Models
Genie uses a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence. Probabilities are assigned to transitions between states in the GHMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized gene data set.

Please note that Genie is currently a prototype that is only set up to find multi-exon genes, and it tries to find exactly one gene on each strand (forward and reverse) of each sequence submitted. If you give Genie a sequence with more than one gene per strand, it will not give satisfactory results. You may submit multiple sequences (in multiple-FASTA format); Genie will predict a gene separately for each sequence.

Genie was trained on human genes, and seems to give good results on other vertebrate sequences as well. It should also give good results for Drosophila and other invertebrate organisms if the "invertebrate" option is selected.

GENSCAN - predict complete gene structures
GENSCAN is a program designed to predict complete gene structures, including exons, introns, promoter and poly-adenylation signals, in genomic sequences. It differs from the majority of existing gene finding algorithms in that it allows for partial genes as well as complete genes and for the occurrence of multiple genes in a single sequence, on either or both DNA strands. The program is based on a probabilistic model of gene structure/compositional properties and does not make use of protein sequence homology information. The text output of the program is a list of one or more (or possibly zero) predicted genes together with the corresponding peptide sequences.

Splice Site Prediction by Neural Network
The output of the neural networks is a list of the 15-base regions and the 41-base regions that the networks judge most likely to be donor and acceptor sites, respectively.

Splice sites are the key signal sequences that determine the boundaries of exons. A method for splice site detection should ideally be based on a thorough understanding of the complex eukaryotic splicing process. We trained a backpropagation feedforward neural network with one layer of hidden units to recognize donor and acceptor sites, using a representative data set. We only consider genes that have the canonical consensus splice sites, i.e., 'GT' for the donor and 'AG' for the acceptor site. The output of the network is a score between 0 and 1 for a potential splice site.
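The restriction to 'GT' donors and 'AG' acceptors, together with the 15-base and 41-base regions mentioned above, suggests a simple candidate-extraction step before any scoring. The sketch below shows only that step. Where the dinucleotide sits inside each window is not stated in the source, so the `offset` values used here (7 bases before GT, 20 bases before AG) are assumptions, and `candidate_windows` is a hypothetical name.

```python
def candidate_windows(seq, motif, width, offset):
    """Return (motif_position, window) for every motif with a full window.

    offset is the number of bases kept in the window before the motif;
    it is an illustrative assumption, not a documented parameter.
    """
    seq = seq.upper()
    out = []
    pos = seq.find(motif)
    while pos != -1:
        start = pos - offset
        # Keep only windows that fit entirely inside the sequence.
        if start >= 0 and start + width <= len(seq):
            out.append((pos, seq[start:start + width]))
        pos = seq.find(motif, pos + 1)
    return out


if __name__ == "__main__":
    demo = "A" * 10 + "GT" + "A" * 10
    donors = candidate_windows(demo, "GT", width=15, offset=7)
    acceptors = candidate_windows(demo, "AG", width=41, offset=20)
    print(donors, acceptors)
```

Each extracted window would then be passed to a trained network that returns a score between 0 and 1; the network itself is outside the scope of this sketch.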

Procrustes
PROCRUSTES is based on the spliced alignment algorithm, which explores all possible exon assemblies and finds the multi-exon structure with the best fit to a related protein.

The distinctive features of Procrustes are:

high reliability of predictions (up to 99%) in the presence of related proteins as compared to 70-80% for most gene recognition programs;
the possibility to analyze extremely long (up to 150,000 bp) sequences with many (up to 20-30) exons;
insensitivity to statistical inhomogeneity of genomes;
the possibility to recognize very short exons;
insensitivity to the presence of several genes in one sequence.

GenePrimer
This software implements an algorithm for experimental gene identification by multiple PCR amplifications. Since current algorithms for gene recognition make mistakes, biologists have to perform experimental gene identification to eliminate errors in predictions. Conventional approaches amount to 'guessing' PCR primers on top of unreliable gene predictions and frequently lead to wasted experimental effort. An algorithm which eliminates the need for gene verification in some cases is the Las Vegas algorithm for gene recognition. This bypasses the unreliable gene prediction step and proposes an approach geared towards experimental gene identification. The algorithm locates a set of PCR primers which cover the exons relatively uniformly and can be used for RT-PCR and further sequencing of (unknown) mRNA.

GenLang
GenLang is a syntactic pattern recognition system, which uses the tools and techniques of computational linguistics to find genes and other higher-order features in biological sequence data. Patterns are specified by means of rule sets called grammars, and a general purpose parser, implemented in the logic programming language Prolog, then performs the search.

MZEF Gene Finder
This page contains software tools designed to predict putative internal protein coding exons in genomic DNA sequences. Human, mouse and Arabidopsis exons are predicted by a program called MZEF (developed by Dr. Michael Zhang). Fission yeast exons are predicted by a program called Pombe (developed by Dr. Tim Chen and Dr. Michael Zhang).

MZEF (Michael Zhang's Exon Finder) is an internal coding exon prediction program. It starts with a potential exon (AG+ORF+GT; the minimum ORF size is currently 18 bp (9 bp for Arabidopsis) and the maximum ORF size is 999 bp (2000 bp for Arabidopsis)), measures 9 (10 for Arabidopsis) discriminant variables, and then calculates its posterior exon probability. If the probability P > 1/2, the candidate is output as a predicted exon. The output contains the following 7 fields:

Coordinates -- exon boundaries (in bp)
P -- Posterior probability (between 0.5 and 1)
Fri -- Frame preference score for the ith frame of the genomic sequence
Orf -- ORF indicator, "011" (or "211") means 2nd and 3rd frames are open
3ss -- Acceptor score
Cds -- Coding preference score
5ss -- Donor score
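The starting point described above, a potential exon of the form AG+ORF+GT within a size range, can be illustrated with a short enumeration. This is a sketch under stated assumptions, not MZEF's code: it takes the ORF size to be the length between the AG and the GT, uses the standard stop codons, and reports open frames in the "011" style of the Orf field (1 = no in-frame stop). The discriminant variables and the posterior probability are not modeled, and the function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def open_frames(exon):
    """Return a 3-character flag string; '1' marks a frame with no stop codon."""
    flags = []
    for frame in range(3):
        codons = (exon[i:i + 3] for i in range(frame, len(exon) - 2, 3))
        flags.append("0" if any(c in STOPS for c in codons) else "1")
    return "".join(flags)


def candidate_exons(seq, min_orf=18, max_orf=999):
    """Enumerate (start, end, orf_flags) for AG + ORF + GT candidates.

    start/end are 0-based, end-exclusive coordinates of the region between
    the acceptor AG and the donor GT. Defaults follow the sizes quoted in
    the text; how MZEF measures ORF size is an assumption here.
    """
    seq = seq.upper()
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    out = []
    for a in acceptors:
        start = a + 2  # first base after the acceptor dinucleotide
        for d in donors:
            if min_orf <= d - start <= max_orf:
                flags = open_frames(seq[start:d])
                if "1" in flags:  # at least one frame must be open
                    out.append((start, d, flags))
    return out


if __name__ == "__main__":
    print(candidate_exons("AG" + "C" * 18 + "GT"))
    print(open_frames("TAACCC"))
```

In a full predictor each surviving candidate would be scored and kept only if its posterior probability exceeds 1/2.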

Webgene - Tools for prediction and analysis of protein-coding gene structure
A wide variety of Gene Identification tools:

Genebuilder - An integrated computing system for protein-coding gene prediction
ORFGene - Gene structure prediction using information on homologous protein sequence
ESTmap - EST mapping
Syncod - CDS prediction based on Silent/Replacement ratio
Repeat - Repeated elements mapping
CpG - CpG islands prediction
SpliceView - Splicing signals prediction
HCpolya - Hamming Clustering Method for Poly-A prediction in Eukaryotic Genes
HCtata - Hamming Clustering Method for TATA signal prediction in Eukaryotic Genes
GenView - A computing system for protein-coding gene prediction

MAR-Finder - Nuclear matrix attachment region prediction
MAR-Finder uses statistical inference to deduce the presence of matrix association regions, or MARs, in DNA sequences. MARs constitute a significant functional block within sequences and facilitate the processes of differential gene expression and DNA replication.

Glimmer bacterial/archaeal gene finder
Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria and archaea. Glimmer (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA. The IMM approach, described in our Nucleic Acids Research paper, uses a combination of Markov models from first through eighth order, weighting each model according to its predictive power. Glimmer's IMM is a 3-periodic nonhomogenous Markov model. Glimmer has been tested on many genomes, including H. influenzae, E. coli, and H. pylori. It was used to find the genes in the recently sequenced genomes of B. burgdorferi (Fraser et al., Nature, Dec. 1997) and T. pallidum (Fraser et al., Science, July 1998), among others.

PipMaker - computes alignments of similar regions in two DNA sequences
PipMaker computes alignments of similar regions in two DNA sequences. The resulting alignments are summarized with a "percent identity plot", or "pip" for short.

PipMaker generates graphical output as a PDF document by default, or optionally as a PostScript document.

Exofish - comparison to fish genome
Exofish is based on the notion that sequences coding for proteins evolve more slowly than non-coding sequences. By comparing DNA of distant vertebrates, it should therefore be possible to identify genes by looking for sequences that have remained similar during evolution. This principle has been successfully exploited in numerous studies involving prokaryotic organisms, as well as eukaryotic organisms such as yeast, fly, worm or mouse in comparison with each other and with the human genome.

When comparing genomic DNA, as opposed to spliced mRNA sequences, results are often difficult to interpret because some non-coding regions may still be conserved, especially between closely related species. In addition, comparing large fractions of eukaryotic genomes at the protein level involves long calculation times, sometimes more than several weeks.

We have developed Exofish to address these problems. Exofish detects exons with a very low background (<2%) in human genomic DNA by comparison with DNA from Tetraodon nigroviridis, a pufferfish distant by approximately 400 million years from human. Comparisons at the protein level are performed using the BLAST algorithm with parameters that enable a high specificity and speed without loss of sensitivity. BLAST is implemented in the Large Scale Sequence Analysis Package (LASSAP) developed by Gene-IT.

TRANSFAC - Transcription Factor database
TRANSFAC is a database on eukaryotic cis-acting regulatory DNA elements and trans-acting factors. It covers the whole range from yeast to human.

The first criterion for a site to be included in TRANSFAC is protein binding, the second is function. Each site is assigned an unambiguous accession number and an identifier. The identifier is composed of a hint at the species (e.g., HS for human), a code for the gene description, and a consecutive number for each entry referring to a particular gene. Thus, HS$BAC_2 refers to the 2nd entry for the human gene for beta-actin.
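The identifier layout described above (species hint, gene code, consecutive entry number) can be split mechanically. The sketch below is based only on the single example HS$BAC_2; the regular expression, and in particular the characters it allows in each part, is an assumption, and `parse_site_identifier` is a hypothetical helper rather than anything supplied by TRANSFAC.

```python
import re
from typing import NamedTuple


class SiteId(NamedTuple):
    species: str  # species hint, e.g. "HS" for human
    gene: str     # code for the gene description, e.g. "BAC"
    entry: int    # consecutive entry number for that gene


# Assumed layout: <species>$<gene>_<number>, letters and digits only.
_PATTERN = re.compile(r"^([A-Za-z0-9]+)\$([A-Za-z0-9]+)_(\d+)$")


def parse_site_identifier(identifier):
    """Split an identifier such as 'HS$BAC_2' into its three parts."""
    match = _PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognized site identifier: {identifier!r}")
    return SiteId(match.group(1), match.group(2), int(match.group(3)))


if __name__ == "__main__":
    print(parse_site_identifier("HS$BAC_2"))
```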

MIRAGE (Molecular Informatics Resource for the Analysis of Gene Expression)
MIRAGE is a web site dedicated to methodologies, tools, and technologies relating to information in the study of gene expression. MIRAGE is an experimental web resource of the Institute for Transcriptional Informatics (IFTI), Pittsburgh.

TransTerm - A Translational Signal Database
A database of sequence contexts around the stop and start codons of many species found in GenBank. It also includes codon usage tables and parameters about the coding sequences such as Nc, GC3 and CAI where applicable.

PLACE - a database of plant cis-acting regulatory DNA elements
PLACE is a database of motifs found in plant cis-acting regulatory DNA elements, all from previously published reports. It covers vascular plants only. In addition to the motifs originally reported, their variations in other genes or in other plant species reported later are also compiled. The PLACE database also contains a brief description of each motif and relevant literature with PubMed ID numbers.

NNPP: Promoter Prediction by Neural Network
NNPP is a method that finds eukaryotic and prokaryotic promoters in a DNA sequence. The function of the promoter as an initiator for transcription is one of the most complex processes in molecular biology. It has been shown that multiple functional sites in the primary DNA are involved in the polymerase binding process. These elements, such as the TATA-box and the transcription start site ("Initiator") for eukaryotes, are known to function as binding sites for Polymerase II, transcription factors, and other proteins that are involved in the transcription initiation process. These promoter elements are present in various combinations separated by various distances in the sequence.

FastM/ModelInspector
FastM and ModelInspector are tools for detection of correlated transcription factor binding sites in DNA sequences (including databases).

FastM builds a so-called model from the information of

which two binding sites are involved
their strand orientation (with respect to each other)
and a distance range between the two binding sites.

ModelInspector then uses this FastM model to scan either a given DNA sequence or a selected GenBank section for matches to the model. This allows fast verification of whether a pair of binding sites is really characteristic of the sequence you are analyzing.
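The three ingredients of such a model (two binding sites, their relative strand orientation, and a distance range) map naturally onto a small data structure. The sketch below is illustrative only: the `Hit` and `PairModel` classes, the way distance is measured (second site position minus first site position), and the example site labels are assumptions and do not reflect the actual FastM or ModelInspector formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    site: str    # label of the binding site that matched
    pos: int     # position of the match in the sequence
    strand: str  # "+" or "-"


@dataclass(frozen=True)
class PairModel:
    first: str         # upstream binding site
    second: str        # downstream binding site
    same_strand: bool  # relative strand orientation of the two sites
    min_dist: int      # allowed distance range, second.pos - first.pos
    max_dist: int


def find_matches(model, hits):
    """Return (first_hit, second_hit) pairs that satisfy the model."""
    matches = []
    for a in hits:
        if a.site != model.first:
            continue
        for b in hits:
            if b is a or b.site != model.second:
                continue
            orientation_ok = (a.strand == b.strand) == model.same_strand
            if orientation_ok and model.min_dist <= b.pos - a.pos <= model.max_dist:
                matches.append((a, b))
    return matches


if __name__ == "__main__":
    hits = [Hit("SP1", 100, "+"), Hit("AP1", 130, "+"),
            Hit("AP1", 300, "+"), Hit("AP1", 120, "-")]
    model = PairModel("SP1", "AP1", same_strand=True, min_dist=10, max_dist=50)
    print(find_matches(model, hits))
```

Only the pair at positions 100 and 130 passes: the hit at 300 is too far away and the hit at 120 lies on the opposite strand.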

TRES - Comparative Promoter Analysis
Using TRES you can simultaneously search up to 20 promoter sequences for known transcription factor binding sites, cis-acting elements, palindromic motifs or conserved k-tuples (phylogenetic footprints). This is useful for comparative promoter sequence analysis to elucidate common themes (modules) in functionally or phylogenetically related promoters.

TFSEARCH
Search for transcription factor binding sites.

MatInd and MatInspector
MatInd is a simple but powerful method to derive a matrix description of a consensus from a number of short sequences on which the definition of an IUPAC code would be based.

A large library (>200 entries) of predefined matrix descriptions for protein binding sites exists and has been tested for accuracy and suitability. Information about the transcription factors connected to these matrices can be retrieved from the TRANSFAC database.

MatInspector is a second software tool that utilizes this library of matrix descriptions to locate matches in sequences of unlimited length. MatInspector is almost as fast as an IUPAC search but has been shown to produce superior results. It assigns a quality rating to matches and thus allows quality-based filtering and selection of matches. MatInspector is able to compare one, several, or all sequences in a sequence file against all or selected subsets of matrices from the library in a single program run. It scans both strands of the sequence simultaneously.
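The general idea of deriving a matrix from short aligned sites and then scanning both strands with a quality threshold can be sketched as below. This is a generic frequency-matrix example, not the scoring scheme used by MatInd or MatInspector, which the source does not describe; the score here is simply the summed column frequencies of a window divided by the best achievable sum. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def build_matrix(sites):
    """Build a per-position base frequency matrix from equal-length sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for col in range(length):
        column = [s[col].upper() for s in sites]
        matrix.append({b: column.count(b) / len(sites) for b in "ACGT"})
    return matrix


def scan(seq, matrix, threshold=0.8):
    """Return (strand, position, score) for windows scoring >= threshold.

    The score is normalized to the range 0..1. Positions on the "-" strand
    are relative to the reverse-complemented sequence.
    """
    seq = seq.upper()
    width = len(matrix)
    best = sum(max(col.values()) for col in matrix)
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - width + 1):
            window = s[i:i + width]
            score = sum(col.get(b, 0.0) for col, b in zip(matrix, window)) / best
            if score >= threshold:
                hits.append((strand, i, round(score, 3)))
    return hits


if __name__ == "__main__":
    matrix = build_matrix(["TATAAA", "TATATA", "TATAAA", "TATAAG"])
    print(scan("GGTATAAAGG", matrix, threshold=0.8))
```

Raising or lowering `threshold` plays the role of the quality-based filtering described above.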

Transcription Element Search Software (TESS)
TESS (Transcription Element Search Software) is a set of software for locating and displaying transcription factor binding sites in DNA sequence.

TESS uses the Transfac database as its store of transcription factors and their binding sites.

CorePromoter (Core-Promoter Prediction Program)
CorePromoter is a Transcriptional Start Site (TSS) prediction program based on a Quadratic Discrimination Analysis of human core-promoters. The input genomic DNA sequence (larger than 240 bp and less than 2,001 bp) is assumed to contain a functional core-promoter (-60,+40) with respect to the TSS.

Gene Express - analysis of genomic regulatory sequences

the database of transcription regulatory regions of eukaryotic genes TRRD that contains the description of regulatory regions of 427 eukaryotic genes, including 2133 transcription factor binding sites, 78 composite regulatory elements, 593 enhancers, and other types of transcription-regulating elements;
the database Activity that comprises the available data on the activity of the sites involved in the regulation of gene expression as well as the physico-chemical, conformational, and statistical DNA/RNA properties significant for the activity of these sites;
the database for gene networks GeneNet that contains the information on groups of the genes functioning coordinately to provide expression of genetic information;
a set of programs for detecting functional sites and predicting their activity;
programs for visualization of the information contained in the listed databases and the results of analysis.

Signal Scan
Find and list homologies of published signal sequences with the input DNA sequence.

PromoterInspector
PromoterInspector is a highly specific detector of promoter regions in large mammalian genomic sequences.

Promoter Scan II
PROMOTER SCAN II is a program developed to recognize and predict pol II promoters in genomic DNA sequences. Presently it is limited to mammalian promoter sequences, and is set to find approximately 60-70% of promoter sequences never before seen by the program, with an expected false positive rate of about 1 in 30,000 single-stranded bases.

Pol3scan
Pol3scan recognizes the eukaryotic internal control regions A box and B box that are typical of tRNA genes and tDNA-derived elements.

The algorithm is based on the statistical analysis of a database of 231 tRNA promoter regions and makes use of weight matrices and weight vectors for scoring. The program discriminates between tRNA genes and related class III elements (e.g. tDNA-derived SINEs) on the basis of the presence of a transcriptional terminator signal and of the base-pairing within the aminoacyl stem.

The accuracy of the prediction was estimated by scanning the eukaryotic nuclear sequences present in the rel. 33 of the EMBL database (65180 entries). The program correctly identified 932 of 940 known tRNA genes (0.85% of false negatives) with a false positive rate of 0.0018%.
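The false negative percentage quoted above follows directly from the two counts. A short check, using only the numbers given in the text (the variable names are illustrative):

```python
known_trna_genes = 940   # known tRNA genes in the test set (from the text)
correctly_found = 932    # genes the program identified (from the text)

missed = known_trna_genes - correctly_found
false_negative_pct = 100 * missed / known_trna_genes

# 8 missed genes out of 940 is roughly 0.85%.
print(missed, round(false_negative_pct, 2))
```

The false positive rate cannot be recomputed in the same way because the text does not give the number of false positive calls.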

TargetFinder - finds candidate target genes of DNA-binding proteins
Performs database searches for candidate target genes of DNA-binding proteins. The program searches a database of annotated sequences for binding sites located in context with other important transcription regulatory signals and regions, such as the TATA element, the transcription start site, and the promoter, thereby greatly reducing the background usually associated with this kind of search.

Orange - Organisational Elements in Gene Regulation
The aim of the project is to supply web services for the analysis of gene regulatory sequences. The identification of virtual gene switches - the binding sites for transcription factors - is a key point in computer-based analysis of sequence data. Transcription factors docking to the binding sites largely control the expression of a gene.

AliBaba 2.1 - context specific identification of binding sites
DESIRE 4EPD - the first promoter search engine

UTR - 5' and 3' Untranslated Regions of Eukaryotic mRNAs
Understanding the basic mechanisms of cell growth, differentiation and response to environmental stimuli, i.e. the program controlling the temporal and spatial order of molecular events, is becoming a real challenge in molecular biology. Indeed, although most of the regulatory elements are thought to be embedded in the non-coding part of the genomes, nucleotide databases are biased by the presence of expressed sequences mostly corresponding to the protein coding portion of the genes.

Among non-coding regions, the 5' and 3' untranslated regions (5'-UTR and 3'-UTR) of eukaryotic mRNAs have often been experimentally demonstrated to contain sequence elements crucial for many aspects of gene regulation and expression.

The UTR Page
These pages compile information from various sources about untranslated elements in messenger RNA.

Principles of Functional Genomics
The biological information in genomic DNA is not intelligible at the primary sequence level. Therefore, functional interpretation of the raw sequence data is a prerequisite for any kind of sequence-based functional genomics.

Introduction to Computer Analysis of Sequences for MAR Prediction
The potentiation and subsequent initiation of DNA transcription are complex biological phenomena. The region of attachment of the chromatin fiber to the nuclear matrix, known as the matrix attachment region (MAR) or scaffold attachment region (SAR), is considered necessary for transcriptional regulation of the eukaryotic genome. Numerous expressed sequences are expected to be contained between the regions bounded by MARs. Consequently, it is important to know whether these bounding regions can be identified from primary sequence data alone and used as experimental markers for transcribed domains.

Gene2EST BLAST Server
This is a BLAST server specialised for querying EST databases with eukaryotic gene-sized queries, which may often be in, say, the 50-100 kb range. This server maps ESTs onto the query; it should not be confused with gene prediction algorithms. Gene2EST is primarily targeted to help the researcher who wishes to examine a few genes in high detail and is not suitable for high throughput analyses.

Gene2EST uses RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html) and Repbase to filter out dispersed repetitive sequences (Alus etc.) from the query. (Since many of these are expressed and are therefore abundant in EST DBs, it is well known that they often confound gene-based queries.)

As well as allowing large queries, useful Gene2EST features include two novel output files parsed from the BLAST output:

1. A multiple alignment of all EST HSPs on the query sequence.

2. An EMBL format file in which the ESTs are represented as mRNA features. This output is for display in a graphical display tool such as Artemis4 (freely available for all platforms from http://www.sanger.ac.uk/Software/Artemis/v4/). In favourable cases (i.e. lots of ESTs), this display can reveal the entire gene structure, including alternative splicing.

Any Comments, Questions? Support@hgmp.mrc.ac.uk
