TopHat
Category Cross-Omics>Next Generation Sequence Analysis/Tools and Genomics>Gene Expression Analysis/Profiling/Tools
Abstract TopHat is a fast splice junction mapper for RNA-Seq reads.
It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
TopHat identifies splice sites ab initio by large-scale mapping of RNA-Seq reads. TopHat maps reads to splice sites in a mammalian genome at a rate of ~2.2 million reads per CPU hour.
Rather than filtering out possible splice sites with a scoring scheme, TopHat aligns all sites, relying on an efficient 2-bit-per-base encoding and a data layout that effectively uses the cache on modern processors.
This strategy works well in practice because TopHat first maps non-junction reads (those contained within exons) using Bowtie, an ultra-fast short-read mapping program.
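The 2-bit-per-base encoding mentioned above can be illustrated with a minimal sketch. This is not TopHat's own C++ code; the function names `pack` and `unpack` and the A=0, C=1, G=2, T=3 code assignment are assumptions chosen for illustration.

```python
# Illustrative 2-bit-per-base packing (hypothetical names; not TopHat's code).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed code assignment
BASE = "ACGT"


def pack(seq):
    """Pack a DNA string into one integer, using 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def unpack(value, length):
    """Recover a DNA string of the given length from a packed integer."""
    bases = []
    for _ in range(length):
        bases.append(BASE[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))


if __name__ == "__main__":
    packed = pack("GATTACA")
    print(packed.bit_length() <= 2 * 7, unpack(packed, 7))
```

Packing four bases into each byte is what keeps candidate junction sequences compact enough to be scanned from cache rather than main memory.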
Bowtie indexes the reference genome using a technique borrowed from data compression, the Burrows-Wheeler transform.
This memory-efficient data structure allows Bowtie to scan reads against a mammalian genome using around 2 GB of memory (within what is commonly available on a standard desktop computer).
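For readers unfamiliar with the Burrows-Wheeler transform, the following is a naive sketch that sorts all rotations of a short string. Bowtie builds a far more memory-efficient index than this; the function name `bwt` and the `$` terminator are illustrative assumptions.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform via sorted rotations.

    Quadratic in memory, so it is suitable only for short demonstration
    strings, not genomes. The terminator must sort before every other
    character in the text.
    """
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    print(bwt("ACAACG"))
```

The transform groups identical characters that share a following context, which is what makes the indexed genome both compressible and searchable.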
Bowtie - Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour.
Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).
The TopHat pipeline --
RNA-Seq reads are mapped against the whole reference genome, and those reads that do not map are set aside as initially unmapped (IUM) reads. An initial consensus of mapped regions is computed by Maq (described below).
Sequences flanking potential donor/acceptor splice sites within neighboring regions are joined to form potential splice junctions. The IUM reads are indexed and aligned to these splice junction sequences.
TopHat Methods --
TopHat finds junctions by mapping reads to the reference in two phases.
In the first phase, the pipeline maps all reads to the reference genome using Bowtie. All reads that do not map to the genome are set aside as ‘initially unmapped reads’, or IUM reads.
Bowtie reports, for each read, one or more alignments containing no more than a few mismatches (two, by default) in the 5'-most s bases of the read.
The remaining portion of the read on the 3' end may have additional mismatches, provided that the Phred-quality-weighted Hamming distance is less than a specified threshold (70 by default).
This policy is based on the empirical observation that the 5' end of a read contains fewer sequencing errors than the 3' end.
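The alignment policy described above can be sketched as a simple acceptance test. This is an illustration of the stated rule, not Bowtie's implementation. The function names are hypothetical, the seed length of 28 is an assumed value for s (the text does not give one), and applying the quality-weighted distance to the whole read is an assumption about how the threshold is evaluated.

```python
def quality_weighted_hamming(read, ref, quals):
    """Sum the Phred qualities at every position where read and ref differ."""
    return sum(q for r, g, q in zip(read, ref, quals) if r != g)


def accept_alignment(read, ref, quals, seed_len=28,
                     max_seed_mismatches=2, max_weighted=70):
    """Return True if an ungapped alignment satisfies the stated policy.

    seed_len=28 is an assumed value for 's'; the other defaults come
    from the text (two seed mismatches, threshold of 70).
    """
    seed_mismatches = sum(
        1 for r, g in zip(read[:seed_len], ref[:seed_len]) if r != g
    )
    if seed_mismatches > max_seed_mismatches:
        return False
    # "Less than" follows the wording of the text above.
    return quality_weighted_hamming(read, ref, quals) < max_weighted


if __name__ == "__main__":
    # Three 3'-end mismatches at quality 20 give a weighted distance of 60.
    print(accept_alignment("ACGTAAAA", "ACGTACGT", [20] * 8, seed_len=4))
```

Weighting mismatches by quality means that a disagreement at a low-confidence base call costs less than one at a high-confidence call.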
TopHat allows Bowtie to report more than one alignment for a read (default=10), and suppresses all alignments for reads that have more than this number.
This policy allows so-called ‘multireads’ from genes with multiple copies to be reported, but excludes alignments to low-complexity sequence, to which failed reads often align.
Low-complexity reads are not included in the set of IUM reads; they are simply discarded.
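The three outcomes for a read after the first mapping phase (reported, set aside as IUM, or discarded) can be sketched as follows. The function name `partition_reads` and the dictionary input format are assumptions for illustration; only the default limit of 10 comes from the text.

```python
def partition_reads(alignments_by_read, max_alignments=10):
    """Split reads into reported multireads, IUM reads, and discarded reads.

    alignments_by_read maps a read ID to its list of alignments
    (a hypothetical input format chosen for this sketch).
    """
    reported, ium, discarded = {}, [], []
    for read_id, hits in alignments_by_read.items():
        if not hits:
            ium.append(read_id)        # no alignment: kept for junction search
        elif len(hits) > max_alignments:
            discarded.append(read_id)  # too many hits: all alignments suppressed
        else:
            reported[read_id] = hits   # unique reads and allowed multireads
    return reported, ium, discarded


if __name__ == "__main__":
    example = {
        "r1": ["chr1:100"],
        "r2": [],
        "r3": ["chr%d:5" % i for i in range(11)],
    }
    print(partition_reads(example))
```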
TopHat then assembles the mapped reads using the assembly module in Maq (described below).
TopHat extracts the sequences for the resulting islands of contiguous sequence from the sparse consensus, inferring them to be putative exons.
To generate the island sequences, TopHat invokes the Maq assemble subcommand (with the -s flag), which produces a compact consensus file containing called bases and the corresponding reference bases.
Because the consensus may include incorrect base calls due to sequencing errors in low-coverage regions, TopHat treats each island as a ‘pseudoconsensus’: at any low-coverage or low-quality position, it uses the reference genome to call the base.
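A minimal sketch of the pseudoconsensus idea follows. The text does not state the coverage or quality cutoffs, so `min_coverage` and `min_quality` below are hypothetical values, as is the function name.

```python
def pseudoconsensus(called, reference, coverage, quality,
                    min_coverage=2, min_quality=20):
    """Fall back to the reference base wherever the consensus call is weak.

    min_coverage and min_quality are hypothetical thresholds chosen for
    illustration; the text does not specify the actual cutoffs.
    """
    bases = []
    for call, ref, cov, qual in zip(called, reference, coverage, quality):
        if cov < min_coverage or qual < min_quality:
            bases.append(ref)   # weak evidence: trust the reference
        else:
            bases.append(call)  # strong evidence: keep the called base
    return "".join(bases)


if __name__ == "__main__":
    print(pseudoconsensus("AGTTA", "ACGTA",
                          coverage=[5, 1, 5, 5, 5],
                          quality=[40, 40, 40, 5, 40]))
```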
Because most reads covering the ends of exons will also span splice junctions, the ends of exons in the pseudoconsensus will initially be covered by few reads, and as a result, an exon’s pseudoconsensus will likely be missing a small amount of sequence on each end.
In order to capture this sequence along with donor and acceptor sites from flanking introns, TopHat includes a small amount of flanking sequence from the reference on both sides of each island (default=45 bp).
Because genes transcribed at low levels will be sequenced at low coverage, the exons in these genes may have gaps. TopHat has a parameter that controls when two distinct but nearby exons should be merged into a single exon.
This parameter defines the length of the longest allowable coverage gap in a single island. Because introns shorter than 70 bp are rare in mammalian genomes such as mouse, any value less than 70 bp for this parameter is reasonable. To be conservative, the TopHat default is 6 bp.
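The island construction described above (merging across short coverage gaps, then extending each island with flanking reference sequence) can be sketched as follows. The function name and the per-base coverage list are assumptions for illustration; the defaults of 6 bp and 45 bp are the ones given in the text.

```python
def build_islands(coverage, max_gap=6, flank=45):
    """Return half-open (start, end) islands from a per-base coverage list.

    Covered runs separated by a gap of at most max_gap bases are merged,
    and each island is then extended by flank bases on both sides.
    """
    runs, start = [], None
    for pos, depth in enumerate(coverage):
        if depth > 0 and start is None:
            start = pos
        elif depth == 0 and start is not None:
            runs.append([start, pos])
            start = None
    if start is not None:
        runs.append([start, len(coverage)])

    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] <= max_gap:
            merged[-1][1] = e  # gap is short enough: extend previous island
        else:
            merged.append([s, e])

    n = len(coverage)
    # Flanked islands can overlap; this sketch leaves them as they are.
    return [(max(0, s - flank), min(n, e + flank)) for s, e in merged]


if __name__ == "__main__":
    cov = [0] * 10 + [3] * 5 + [0] * 4 + [2] * 5 + [0] * 20 + [1] * 5 + [0] * 10
    print(build_islands(cov, max_gap=6, flank=3))
```

In the usage example, the 4-base gap is bridged while the 20-base gap is not, so three covered runs become two islands.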
To map reads to splice junctions, TopHat first enumerates all canonical donor and acceptor sites within the island sequences (as well as their reverse complements).
Next, it considers all pairings of these sites that could form canonical (GT-AG) introns between neighboring (but not necessarily adjacent) islands.
Each possible intron is checked against the IUM reads for reads that span the splice junction.
By default, TopHat only examines potential introns longer than 70 bp and shorter than 20,000 bp, but these default minimum and maximum intron lengths can be adjusted by the user.
These values describe the vast majority of known eukaryotic introns. For example, more than 93% of mouse introns in the UCSC known gene set fall within this range.
However, users willing to make a small sacrifice in sensitivity will see substantially lower running time by reducing the maximum intron length.
To improve running times and avoid reporting false positives, the program excludes donor-acceptor pairs that fall entirely within a single island, unless the island is very deeply sequenced.
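The enumeration of candidate introns can be sketched as below. This is a simplified, forward-strand-only illustration with hypothetical function names. It omits reverse complements and the deep-coverage exception for single-island pairs, and it assumes islands are non-overlapping half-open intervals on a single sequence.

```python
def find_sites(genome, islands):
    """Collect GT donor and AG acceptor positions inside each island."""
    donors, acceptors = [], []
    for idx, (start, end) in enumerate(islands):
        for pos in range(start, end - 1):
            dinucleotide = genome[pos:pos + 2]
            if dinucleotide == "GT":
                donors.append((pos, idx))         # intron would start here
            elif dinucleotide == "AG":
                acceptors.append((pos + 2, idx))  # intron would end here (exclusive)
    return donors, acceptors


def candidate_introns(genome, islands, min_len=70, max_len=20000):
    """Pair donors with acceptors from different islands within the length bounds."""
    donors, acceptors = find_sites(genome, islands)
    introns = []
    for donor_pos, donor_island in donors:
        for acceptor_end, acceptor_island in acceptors:
            if donor_island == acceptor_island:
                continue  # pairs within a single island are excluded
            length = acceptor_end - donor_pos
            if min_len < length < max_len:
                introns.append((donor_pos, acceptor_end))
    return introns


if __name__ == "__main__":
    genome = "C" * 10 + "GT" + "C" * 96 + "AG" + "C" * 10
    print(candidate_introns(genome, [(0, 14), (106, 120)]))
```

Because every donor is compared with every acceptor, the number of pairings grows with the maximum intron length, which is why lowering that limit reduces running time.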
TopHat Implementation and Documentation --
TopHat is implemented in C++ and Python and runs on Linux and Mac OS X.
It makes substantial use of additional tools, including Bowtie (described above), Maq and the SeqAn library.
Maq - Maq stands for Mapping and Assembly with Quality. It builds an assembly by mapping short reads to reference sequences. Maq is a project hosted by SourceForge.net.
SeqAn - SeqAn is a library of efficient data types and algorithms for sequence analysis in computational biology.
SeqAn comprises implementations of existing, practical state-of-the-art algorithmic components to provide a sound basis for algorithm testing and development.
TopHat provides an HTML-based Getting Started guide and user manual.
System Requirements
Contact manufacturer.
Manufacturer
TopHat is a collaborative effort between:
- the Institute of Genetic Medicine at Johns Hopkins University,
- the Departments of Mathematics and Molecular and Cell Biology at the University of California, Berkeley, and
- the Department of Stem Cell and Regenerative Biology at Harvard University.
Questions about TopHat should be sent to: tophat.cufflinks at gmail.com.
Manufacturer Web Site TopHat
Price Contact manufacturer.
G6G Abstract Number 20811
G6G Manufacturer Number 104297