mGene
Category Genomics>Genetic Data Analysis/Tools
Abstract mGene is an advanced computational tool for the genome-wide prediction of protein coding genes from eukaryotic DNA sequences.
It is based on recent advances in machine learning and uses discriminative training techniques, such as Support Vector Machines (SVMs) and Hidden Semi-Markov Support Vector Machines (HSMSVMs).
Its excellent performance was demonstrated in an objective competition based on the genome of the nematode Caenorhabditis elegans.
The evaluated developmental version of mGene exhibited the best prediction performance (in terms of the average of sensitivity and specificity) for the multiple-genome prediction tasks on all four evaluation levels (nucleotides, exons, transcripts and genes).
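As an illustration of the ranking measure mentioned above, the following minimal Python sketch averages sensitivity and specificity from raw counts. It assumes the convention common in gene-finding evaluations, where specificity means TP / (TP + FP); the abstract does not state the exact formulas used in the competition, and the function names are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of annotated features that were predicted."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp: int, fp: int) -> float:
    """Fraction of predicted features that are annotated.

    Assumption: gene-finding convention, i.e. TP / (TP + FP).
    """
    return tp / (tp + fp) if (tp + fp) else 0.0


def mean_sn_sp(tp: int, fp: int, fn: int) -> float:
    """Average of sensitivity and specificity for one evaluation level."""
    return (sensitivity(tp, fn) + specificity(tp, fp)) / 2


# Made-up counts for one level (e.g. exons): 90 correct, 10 spurious, 30 missed.
score = mean_sn_sp(tp=90, fp=10, fn=30)
```

The same function would be applied separately to nucleotide, exon, transcript and gene counts.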
The manufacturer tackles the 'gene prediction' problem with a two (2)-layered approach.
1) In a first step, state-of-the-art kernel machines are employed to detect signal sequences in genomic DNA (like splice sites or transcription start sites) and to discriminate the content of different DNA sequences (like coding exons, introns, etc.).
2) In a second step their outputs are combined to predict whole gene structures. In this step, the manufacturer uses a discriminative training approach based on HSMSVMs.
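The idea of combining first-layer outputs into one structure can be sketched with a toy dynamic program. This is not the HSMSVM used by mGene: it is an ordinary Viterbi-style decoder over two hypothetical labels ("E" for exonic, "I" for intronic/intergenic), with made-up content scores per position and made-up signal scores that reward switching labels at a position.

```python
def decode(content, signal):
    """Return the best-scoring label string for a toy two-layer model.

    content: dict mapping a one-letter label to per-position scores.
    signal:  per-position score earned when the label changes at that position.
    """
    labels = list(content)
    n = len(signal)
    best = {lab: content[lab][0] for lab in labels}
    back = []
    for i in range(1, n):
        new_best, ptr = {}, {}
        for lab in labels:
            # Staying in the same label adds nothing; switching adds signal[i].
            cands = {
                prev: best[prev] + (0.0 if prev == lab else signal[i])
                for prev in labels
            }
            prev_best = max(cands, key=cands.get)
            new_best[lab] = cands[prev_best] + content[lab][i]
            ptr[lab] = prev_best
        back.append(ptr)
        best = new_best
    # Trace back from the best final label.
    lab = max(best, key=best.get)
    path = [lab]
    for ptr in reversed(back):
        lab = ptr[lab]
        path.append(lab)
    return "".join(reversed(path))


# Hypothetical scores: positions 2-4 look exonic, strong signals at 2 and 5.
content = {
    "E": [-1, -1, 1, 1, 1, -1, -1],
    "I": [1, 1, -1, -1, -1, 1, 1],
}
signal = [0, 0, 2, 0, 0, 2, 0]
structure = decode(content, signal)
```

In mGene the second layer is trained discriminatively and models segment lengths; the sketch only shows how signal and content scores can jointly determine a segmentation.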
The manufacturer offers mGene via two (2) options - 1) Standalone Tools for Training and Prediction with mGene and 2) mGene as a web service (mGene.web).
mGene.web --
mGene.web is a ‘web service’ for the genome-wide prediction of protein coding genes from eukaryotic DNA sequences.
It offers pre-trained models for the recognition of gene structures, including untranslated regions in an increasing number of organisms.
mGene.web additionally allows you to train the system for other organisms at the push of a button, a functionality that greatly accelerates the annotation of newly sequenced genomes.
The system is built in a highly modular way, such that individual components of the framework, like the promoter prediction tool or the splice site predictor, can be used autonomously.
mGene.web is free of charge, and can be used for eukaryotic genomes of small to moderate size (several hundred Mbp).
mGene.web main features/capabilities include:
1) Simple one-step procedure to train an ab initio gene predictor for a new organism based on a FASTA and a GFF3 (or GTF) file.
2) Gene prediction for a growing list of organisms from a given FASTA file using pretrained mGene instances.
3) Easy access to the signal predictions, e.g. for splice sites, transcription start sites, etc.
4) Integration of externally provided signal or content predictions/tracks into the mGene gene finder.
5) High accuracy of mGene's gene and signal predictions.
mGene.web modules --
The web service (mGene.web) currently provides fourteen (14) core modules. They fall into four (4) groups: Data preparation; Signal training and prediction; Content training and prediction; and Gene structure training and prediction.
Each tool requires a set of inputs and provides at least one output. They are managed by the Galaxy system according to their data types.
Data preparation --
GenomeTool - takes a file in FASTA format containing genomic sequences as input and creates a genome object, stored as a Genome Information Object (GIO), for use by other mGene modules.
Additionally, one may create a GIO from an internal database of more than 50 genomes.
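The GIO format is internal to mGene and is not described here. As a rough illustration of the information a FASTA input provides (contig names and sequences), the following Python sketch reads FASTA text into a dictionary; the function name `read_fasta` and the sample data are hypothetical.

```python
import io


def read_fasta(handle):
    """Map each sequence name (first word of the '>' header) to its sequence."""
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Made-up two-contig input.
example = io.StringIO(">chrI test\nACGTAC\nGTAA\n>chrII\nggcc\n")
contigs = read_fasta(example)
lengths = {name: len(seq) for name, seq in contigs.items()}
```

Checking contig names and lengths in this way before upload can help ensure that the names match those used in the accompanying GFF3 or GTF file.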
Signal training and prediction --
Anno2SignalLabel - uses an Annotation Gene Structure (AGS) to collect labeled genomic positions for the selected genomic signal. Possible signals include transcription start and stop sites, translation initiation and termination sites, as well as donor and acceptor splice sites.
It uses the regions covered by annotated features to generate negative examples at all consensus positions unless they were annotated as true sites. The output is a file in signal prediction format (SPF) providing chromosome/contig name, position, strand, and the label of the example.
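The labelling rule can be sketched as follows for donor splice sites on the forward strand. The sketch assumes the canonical GT donor consensus, 0-based positions, labels of +1/-1, and a tab-separated row layout; the actual SPF file layout and Anno2SignalLabel's internals are not documented here, so all names are hypothetical.

```python
def label_donor_sites(contig, seq, true_sites, region):
    """Label every GT consensus position inside an annotated region.

    Positions listed in true_sites become positives (+1);
    every other GT occurrence becomes a negative example (-1).
    """
    start, end = region
    rows = []
    pos = seq.find("GT", start, end)
    while pos != -1:
        label = 1 if pos in true_sites else -1
        rows.append((contig, pos, "+", label))
        pos = seq.find("GT", pos + 1, end)
    return rows


def to_lines(rows):
    """Render rows as tab-separated text (assumed layout, not the real SPF)."""
    return ["\t".join(str(field) for field in row) for row in rows]


# Made-up sequence with GT at positions 2, 6 and 10; only 6 is annotated.
seq = "AAGTCCGTAAGTTT"
rows = label_donor_sites("chrI", seq, true_sites={6}, region=(0, len(seq)))
lines = to_lines(rows)
```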
SignalTrain - trains a signal predictor using SVMs with pre-selected kernels for each signal. Input is a genome information object (GIO) and an SPF file with labeled genomic positions. The output is a trained signal predictor (TSP) that can be used with SignalPredict to perform predictions on genomic sequences.
SignalPredict - uses a GIO and TSP to predict signals on the given DNA sequences. The output is given in signal prediction format (SPF).
SignalEval - takes a label file and a prediction file (both SPF files) as input and computes several accuracy measures for the predictions, including the areas under the Receiver Operating Characteristic (ROC) curve and the Precision-Recall Curve (PRC). This tool is useful for monitoring prediction quality.
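The two area measures can be illustrated with a small pure-Python sketch. It is not SignalEval's implementation: the area under the ROC curve is computed as the fraction of positive/negative pairs ranked correctly, and the area under the PRC is approximated by average precision. Labels are assumed to be +1/-1 and the scores are made up.

```python
def au_roc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l > 0]
    neg = [s for l, s in zip(labels, scores) if l <= 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision, a common estimate of the area under the PRC."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(1 for _, l in ranked if l > 0)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    hits, total = 0, 0.0
    for rank, (_, l) in enumerate(ranked, start=1):
        if l > 0:
            hits += 1
            total += hits / rank
    return total / n_pos


# Made-up labels and predictor outputs for six candidate sites.
labels = [1, 1, -1, 1, -1, -1]
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
roc = au_roc(labels, scores)
prc = average_precision(labels, scores)
```

Because true signal sites are rare compared with consensus positions, the PRC-based measure is usually the more informative of the two for this kind of data.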
Content training and prediction --
Anno2ContentLabel - collects labeled genomic segments for the selected content types, analogous to Anno2SignalLabel. Possible content types include 5' UTR, exonic, intronic, 3' UTR, and intergenic. Any included segment that is not of the specified type is used as a negative example.
The output is a file in content prediction format (CPF) providing chromosome/contig name, start position, end position, strand, and the label of the example.
ContentTrain - is analogous to SignalTrain, with a GIO and a CPF file of labeled segments as inputs and a Trained Content Predictor object (TCP) as output.
ContentPredict - is analogous to SignalPredict, with a GIO and a trained content predictor (TCP) as input and an SPF file as output.
ContentEval - is analogous to SignalEval; it takes a CPF file and an SPF file as input and evaluates the predictions.
Gene structure training and prediction --
GeneTrain - trains the 'second layer' of mGene.web. Based on the GIO, genome-wide predictions for all relevant signals and content types, and a set of annotated genes, GeneTrain learns to predict gene structures from genomic DNA.
The output is an internal data structure containing the Trained Gene Predictor (TGP) that can be used with mGenePredict to predict genes.
GenePredict - uses the TGP (either from the current history or from a list of pre-trained predictors) as well as genome-wide signal and content predictions to predict genes from the provided DNA sequences. The output is provided as a GFF3 (genome annotation and gene prediction) file.
GeneEval - takes two GFF3 files, one containing an annotation, the other the genome-wide gene predictions, and evaluates the prediction performance by comparing the two annotations.
Note that the ‘annotated genes’ used for evaluation should be distinct from the annotated genes used for training; otherwise the reported figures reflect training error rather than performance on unseen data.
Evaluation criteria include sensitivity and specificity on nucleotide, exon, and gene levels.
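An exon-level comparison of two GFF3 files can be sketched as below. The sketch relies only on the standard nine-column, tab-separated GFF3 layout; counting an exon as correct only when contig, start, end and strand all match, and defining specificity as TP / (TP + FP), are assumptions and not necessarily GeneEval's exact criteria. The helper names and coordinates are made up.

```python
def exons_from_gff3(text):
    """Collect (seqid, start, end, strand) for every 'exon' feature."""
    exons = set()
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        exons.add((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return exons


def exon_level_accuracy(annotation, prediction):
    """Return (sensitivity, specificity) using exact exon matches."""
    ann = exons_from_gff3(annotation)
    pred = exons_from_gff3(prediction)
    tp = len(ann & pred)
    sn = tp / len(ann) if ann else 0.0
    sp = tp / len(pred) if pred else 0.0
    return sn, sp


def gff3_exon(seqid, start, end, strand="+"):
    """Build one GFF3 exon line for the example (hypothetical helper)."""
    return "\t".join([seqid, "example", "exon", str(start), str(end), ".", strand, ".", "."])


annotation = "##gff-version 3\n" + "\n".join(
    gff3_exon("chrI", s, e) for s, e in [(100, 200), (300, 400), (500, 600), (700, 800)]
)
prediction = "##gff-version 3\n" + "\n".join(
    gff3_exon("chrI", s, e) for s, e in [(100, 200), (300, 400), (500, 650)]
)
sn, sp = exon_level_accuracy(annotation, prediction)
```

Here two of the four annotated exons are recovered exactly, and two of the three predicted exons are correct.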
ComposeMGenePredictor - bundles all necessary trained signal, content, and gene predictor objects into a trained mGene predictor that can be used with mGenePredict to predict genes.
DecomposeMGenePredictor - decomposes a trained mGene predictor into its components, i.e. the individual predictors.
System Requirements
Web-based; contact manufacturer.
Manufacturer
- Rätsch Lab: Machine Learning in Biology
- Friedrich Miescher Laboratory of the Max Planck Society
- Spemannstraße 39
- 72076 Tübingen, Germany
Manufacturer Web Site mGene
Price Contact manufacturer.
G6G Abstract Number 20470
G6G Manufacturer Number 104095