Genome Analysis Toolkit (GATK)

Category Cross-Omics>Next Generation Sequence Analysis/Tools

Abstract The Genome Analysis Toolkit (GATK) is a structured programming framework designed to enable rapid development of efficient and robust analysis tools for next-generation DNA sequencers.

The GATK solves the data management challenge by separating data access patterns from analysis algorithms, using the functional programming philosophy of Map/Reduce.

Consequently, the GATK is structured into data traversals and data walkers that interact through a programming contract in which the traversal provides a series of units of data to the walker, and the walker consumes each datum to generate an output for each datum.
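The contract described above can be sketched in a few lines of Java. This is a minimal illustration, not the GATK's actual API: the names `Read`, `Walker`, `CountReadsWalker`, and `traverseByRead` are hypothetical and chosen only for this example. The traversal owns the iteration and hands each read to the walker's map step; the reduce step folds the per-read outputs into one result.

```java
import java.util.List;

// Hypothetical read record; a real SAM record carries many more fields.
final class Read {
    final String name;
    final int start;
    final int length;
    Read(String name, int start, int length) {
        this.name = name;
        this.start = start;
        this.length = length;
    }
}

// Hypothetical walker contract: map one datum, then fold it into a running result.
interface Walker<M, R> {
    R reduceInit();             // starting value for the reduction
    M map(Read read);           // output produced for a single read
    R reduce(M value, R sum);   // combine one mapped value with the running result
}

// A walker that counts reads: every read maps to 1, and reduce adds them up.
class CountReadsWalker implements Walker<Integer, Long> {
    public Long reduceInit() { return 0L; }
    public Integer map(Read read) { return 1; }
    public Long reduce(Integer value, Long sum) { return sum + value; }
}

class TraversalSketch {
    // Hypothetical "by read" traversal: it owns iteration, so the walker
    // never deals with how the reads are stored or fetched.
    static <M, R> R traverseByRead(List<Read> reads, Walker<M, R> walker) {
        R result = walker.reduceInit();
        for (Read read : reads) {
            result = walker.reduce(walker.map(read), result);
        }
        return result;
    }
}
```

Under these assumptions, a new analysis only has to supply a different `Walker` implementation; the traversal code stays unchanged.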

Because many tools for analyzing next-generation sequencing (NGS) data access the data in very similar ways, the GATK can provide a small but nearly comprehensive set of traversal types that satisfy the data access needs of the majority of analysis tools.

For example, traversals “by each sequencer read” and “by every read covering each locus in a genome” are common throughout many tools such as counting reads, building base quality histograms, reporting average coverage of the genome, and calling SNPs.
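As a rough illustration of the "by locus" view, the sketch below computes average coverage from the set of reads covering each position. The class and method names (`AlignedRead`, `readsAt`, `averageCoverage`) are hypothetical, and the simple per-locus scan is an assumption made for readability; it does not reflect how the GATK engine itself is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical aligned read covering reference positions [start, start + length).
final class AlignedRead {
    final int start;
    final int length;
    AlignedRead(int start, int length) {
        this.start = start;
        this.length = length;
    }
    boolean covers(int locus) {
        return locus >= start && locus < start + length;
    }
}

class LocusTraversalSketch {
    // "By locus" view: all reads that overlap one reference position.
    static List<AlignedRead> readsAt(List<AlignedRead> reads, int locus) {
        List<AlignedRead> pileup = new ArrayList<>();
        for (AlignedRead read : reads) {
            if (read.covers(locus)) {
                pileup.add(read);
            }
        }
        return pileup;
    }

    // Average coverage over loci first..last (inclusive): mean pileup depth.
    static double averageCoverage(List<AlignedRead> reads, int first, int last) {
        long total = 0;
        for (int locus = first; locus <= last; locus++) {
            total += readsAt(reads, locus).size();
        }
        return (double) total / (last - first + 1);
    }
}
```

For two reads covering positions 1 to 4 and 3 to 6, the depths over loci 1 to 4 are 1, 1, 2, and 2, giving an average coverage of 1.5.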

Because only a small number of traversals is shared among many tools, the core GATK development team can optimize them for correctness, stability, CPU performance, and memory footprint, and in many cases can even parallelize calculations automatically.
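One way to see why a map/reduce contract lends itself to automatic parallelization is sketched below. It uses standard Java streams as a stand-in for the traversal engine, which is an assumption made for illustration and not the GATK's own mechanism; `ParallelTraversalSketch`, `mapRead`, and `totalBases` are hypothetical names. Because the reduction is an associative sum, the reads can be split across threads and the partial sums merged in any grouping without changing the result.

```java
import java.util.List;
import java.util.stream.Stream;

class ParallelTraversalSketch {
    // Map step: the value produced for one read (here, its length in bases).
    static long mapRead(String bases) {
        return bases.length();
    }

    // Reduce step: an associative sum, so serial and parallel runs agree.
    static long totalBases(List<String> reads, boolean parallel) {
        Stream<String> stream = parallel ? reads.parallelStream() : reads.stream();
        return stream.mapToLong(ParallelTraversalSketch::mapRead).sum();
    }
}
```

The analysis code (`mapRead`) is identical in both modes; only the traversal decides whether the work is distributed.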

Moreover, since the traversal engine encapsulates the complexity of efficiently accessing the next-generation sequencing data, researchers and developers are free to focus on their specific analysis algorithms.

This not only vastly improves the productivity of developers, who can quickly write new analyses, but also results in tools that are efficient and robust and that benefit from improvements to a common data management engine.

GATK Capabilities --

The Genome Analysis Toolkit (GATK) development environment is currently provided as a platform-independent Java programming language library.

The core system works with the nascent standard Sequence Alignment/Map (SAM) format to represent reads using a production-quality SAM library developed at the Broad Institute.

The system can access a variety of metadata files, such as dbSNP, HapMap, and RefSeq, and can work with genotype and SNP files in GLF, Geli, and other common formats.

The core system handles read data from Illumina/Solexa, SOLiD, and Roche/454.

According to the manufacturer, the current GATK engine can process all of the 1000 Genomes data, representing ~5 terabytes (TB) of data from these three (3) next-gen technologies, produced by multiple sequencing centers and aligned to the human reference genome with multiple aligners.

The GATK currently provides traversals by each read (ByRead traversal), by all reads covering each locus in the genome (ByLoci traversal), and by all reads within pre-specified intervals on the genome (ByWindow traversal).
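The third traversal type, by pre-specified intervals, can be sketched as follows. The names `Span` and `readsByWindow` are hypothetical, and the sketch assumes that "within an interval" means any overlap with it; the GATK's actual ByWindow semantics may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical closed interval [start, end] on a single reference sequence.
// Used here for both reads and windows.
final class Span {
    final int start;
    final int end;
    Span(int start, int end) {
        this.start = start;
        this.end = end;
    }
    boolean overlaps(Span other) {
        return start <= other.end && end >= other.start;
    }
}

class WindowTraversalSketch {
    // For each pre-specified window, collect the reads that overlap it.
    // The result has one list per window, in the order the windows were given.
    static List<List<Span>> readsByWindow(List<Span> reads, List<Span> windows) {
        List<List<Span>> result = new ArrayList<>();
        for (Span window : windows) {
            List<Span> inWindow = new ArrayList<>();
            for (Span read : reads) {
                if (read.overlaps(window)) {
                    inWindow.add(read);
                }
            }
            result.add(inWindow);
        }
        return result;
    }
}
```

A walker written against this view would receive one group of reads per interval rather than one read or one locus at a time.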

Why use the GATK?

The GATK aims to eliminate the constant writing and rewriting of error-prone boilerplate code to manage the reading, presentation, and output of sequencing data.

Before the GATK, users were forced to cobble together solutions for their analyses out of components of varying quality from disparate sources. To implement a genotyper, a biologist had to:

1) Find or create a SAM file reading utility.

2) Write a layer of bridge code to work with his or her SAM reader.

3) Examine data provided by the SAM reader, ensuring that the reader is accurately presenting all available data.

4) Repeat steps 1-3 for a FASTA file reader, a HapMap and/or dbSNP reader, and readers for any additional required data.

5) Write a data collector to group and view reads by locus.

6) Finally, implement the analysis. In many cases, this step requires the least amount of programming effort.

In any of the above steps, the biologist, or the implementer of one of the many data access layers, can introduce insidious data integrity and performance bugs that go undetected.

The GATK aims to simplify the development process by providing a fast, reliable mechanism to present data to the user. To write a similar analysis with the GATK, the biologist can:

1) Download the GATK.

2) Convert his or her data to one of the many standard formats supported by the GATK.

3) Write an analysis in Java, compile it, and copy it to the GATK’s plug-in directory.

The GATK can present data to the user in several different formats, including one read at a time, a locus and its context, or a window of loci.

Supported GATK Tools include:

1) Variant Detection;

2) Quality Control and Simple Analysis Tools;

3) BAM Processing and Analysis Tools;

4) Variant Discovery Tools;

5) Cancer-specific Variant Discovery Tools;

6) Variant Evaluation and Manipulation Tools;

7) Sequenom Utilities;

8) Companion Utilities;

9) Miscellaneous Experimental (and Potentially Unstable) Tools; and

10) Queue and the GATK-Pipeline - At the Broad Institute, the GSA team runs a production-scale Next-Generation Sequencing (NGS) data processing pipeline using Queue, the GATK companion pipeline execution engine.

System Requirements

Contact manufacturer.

Manufacturer

The Genome Sequencing and Analysis Group (GSA) in Medical and Population Genetics at the Broad Institute, USA.

Manufacturer Web Site Genome Analysis Toolkit (GATK)

Price Contact manufacturer.

G6G Abstract Number 20777

G6G Manufacturer Number 104354

The G6G Directory of Omics and Intelligent Software