Abstract

NextGENe is a software suite designed to allow biologists to analyze the vast amount of data generated by “Next Generation” Sequencing systems. NextGENe is compatible across instrument platforms.

All applications are tailored to accept data from:

1) Illumina® Genome Analyzer;

2) Roche Genome Sequencer FLX™;

3) Applied Biosystems SOLiD™ System;

4) Helicos Biosciences (in development).

NextGENe provides an intuitive Windows® user interface and a host of applications including:

1) de novo and target assembly;

2) SNP/INDEL Detection;

3) Digital Gene Expression Studies;

4) Transcriptome/ChIPSeq Analysis.

1) de novo Assembly --

de novo sequence assembly from the short reads produced by genome analyzers presents many challenges. With many current techniques, it is difficult to assemble short reads into a large contig of 1 to 5 kb (a contig, from 'contiguous', is a set of overlapping DNA segments derived from a single genetic source).

These techniques often create many false alignments due to two major issues: short reads with high base-calling error rates, and ambiguity within the genome.

Short reads containing Single Nucleotide Polymorphisms (SNPs) or Insertions and Deletions of base pairs (Indels) are often discarded, which is problematic when determining 'copy number variations' in applications such as chromatin immunoprecipitation (ChIP), gene expression and transcriptome studies.

The NextGENe sequence assembler was developed to address these problems. The software assembles short reads into contigs of 0.5 kb to 5 kb, where contigs end at repeat sequences. It then uniquely aligns these contigs to a reference genome.

The short reads used to assemble each contig are recorded to show the copy number and Indel positions. NextGENe can detect Indels of 1 to 30 bp.
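
To make the idea of assembling overlapping short reads into a contig concrete, the following is a minimal Python sketch of greedy overlap merging. It is an illustration only and is not the NextGENe assembler; the function names `overlap_length` and `greedy_assemble`, the `min_overlap` parameter, and the toy reads are all assumptions introduced here. It also assumes error-free reads on a single strand, which real data do not satisfy.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the two sequences that share the longest overlap.

    Hypothetical helper for illustration; stops when no pair overlaps
    by at least `min_overlap` bases and returns the remaining contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_length(a, b, min_overlap)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Three toy 7-base reads that tile a 13-base region.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

With base-calling errors or repeated sequence, a purely greedy merge like this can join reads that do not belong together, which is the kind of false alignment described above.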

de novo Assembly Methodology -- Using the 'Condensation Assembly Tool' (Patent Pending), NextGENe statistically polishes high-coverage (20-100x) datasets to remove random sequencing errors and roughly double the read lengths.

Repeating the Condensation removes systematic errors and further lengthens the sequence reads. The polished and elongated reads can then be assembled into large contigs while removing redundant reads.
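
The details of the Condensation Assembly Tool are not given here, so the following Python sketch shows only the general principle that high coverage allows random errors to be outvoted and overlapping reads to be merged into a longer sequence. It is a plain per-position majority vote, not the patent-pending method; the function name `consensus` and the assumption that each read's start offset is already known are introduced purely for illustration.

```python
from collections import Counter


def consensus(placed_reads: list[tuple[int, str]]) -> str:
    """Majority-vote consensus from (start_offset, sequence) pairs.

    Assumes the offsets are already known and that the reads cover a
    contiguous region with no gaps (both are simplifying assumptions).
    """
    columns: dict[int, Counter] = {}
    for start, seq in placed_reads:
        for i, base in enumerate(seq):
            columns.setdefault(start + i, Counter())[base] += 1
    # Take the most frequent base in each covered column, left to right.
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))


# Five toy 6-base reads; the third carries a single error (A instead of T).
reads = [(0, "ACGTAC"), (2, "GTACGG"), (0, "ACGAAC"), (1, "CGTACG"), (3, "TACGGA")]
print(consensus(reads))  # 9-base consensus with the error outvoted
```

In this toy case the erroneous base is outvoted four to one, and the resulting sequence is longer than any single input read.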

2) SNP/INDEL Detection --

SNPs and micro Indels of up to 30 bp can be detected in targeted sequencing data from both longer sequence reads and short reads from the Solexa sequencing technology (Illumina).

Use of the Condensation Tool elongates short reads, increasing the probability that each read is unique, while polishing the data to remove chemistry and instrument errors.
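
As a rough illustration of how a SNP can be read off aligned data, the Python sketch below piles up reads against a reference and reports positions where a well-supported base differs from the reference. This is a generic pileup approach and not NextGENe's detection method; the function name `call_snps`, the thresholds `min_depth` and `min_fraction`, and the known read offsets are all assumptions. It handles substitutions only and does not detect Indels.

```python
from collections import Counter


def call_snps(reference: str,
              placed_reads: list[tuple[int, str]],
              min_depth: int = 3,
              min_fraction: float = 0.8) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, observed_base) for candidate SNPs.

    A position is reported when it is covered by at least `min_depth`
    reads and the most common base differs from the reference in at
    least `min_fraction` of them (hypothetical thresholds).
    """
    pileup = [Counter() for _ in reference]
    for start, seq in placed_reads:
        for i, base in enumerate(seq):
            pos = start + i
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little coverage to make a call
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            snps.append((pos, reference[pos], base))
    return snps


# Toy example: all three reads show A where the reference has T (index 3).
ref = "ACGTACGT"
reads = [(0, "ACGAAC"), (1, "CGAACG"), (2, "GAACGT")]
print(call_snps(ref, reads))
```

Requiring a minimum depth is one simple way to avoid calling a variant from a single erroneous read.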

Features/capabilities include:

3) Digital Gene Expression Studies --

Features/capabilities include:

Gene expression studies are currently often performed using microarrays or DNA sequencing techniques such as Serial Analysis of Gene Expression (SAGE).

SAGE technology measures the counts of the 'sequence tags' relative to the genes of interest.

Next generation DNA sequencing technologies generate millions to hundreds of millions of short sequence reads.

The Illumina® Genome Analyzer, which utilizes the Solexa sequencing technology, uses PCR on a surface, while the Applied Biosystems SOLiD™ System uses emulsion PCR and sequencing by ligation.

Both of these systems produce short reads that are ideal for analyzing gene expression.

The NextGENe software package takes full advantage of short sequencing reads and has tools for analyzing SAGE tags. SAGE libraries are available that contain lists of sequence tags associated with particular genes.

NextGENe can load these libraries as a reference and align the sequence reads to the appropriate sequence tags. Digital gene expression reports are created to show the sequence of each tag, the coverage, gene names, and the location in the genome.

New gene tags that are not in the library are also reported.
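
The tag-counting idea behind such a report can be sketched in a few lines of Python. This is an illustrative simplification and not NextGENe's implementation: it assumes each read is exactly one tag and uses exact string matching, and the function name `digital_expression`, the dictionary layout of the library, and the gene names in the example are hypothetical.

```python
from collections import Counter


def digital_expression(reads: list[str], tag_library: dict[str, str]):
    """Count tags and split them into library (known) and novel tags.

    `tag_library` maps a tag sequence to a gene name (assumed layout).
    Returns (known, novel), where known maps tag -> (gene, count) and
    novel maps tag -> count.
    """
    counts = Counter(reads)
    known = {tag: (tag_library[tag], n)
             for tag, n in counts.items() if tag in tag_library}
    novel = {tag: n for tag, n in counts.items() if tag not in tag_library}
    return known, novel


# Toy library with made-up gene names.
library = {"CATGAAAC": "geneA", "CATGTTTG": "geneB"}
reads = ["CATGAAAC", "CATGAAAC", "CATGTTTG", "CATGCCCC", "CATGAAAC"]
known, novel = digital_expression(reads, library)
print(known)  # tags found in the library, with gene name and count
print(novel)  # tags absent from the library
```

A fuller report would also carry the genomic location of each tag, which requires annotation not modeled here.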

4) Transcriptome/Chromatin Immunoprecipitation (ChIPSeq) Analysis --

Features/capabilities include:

