Signature Evaluation Tool (SET)

Category Genomics>Gene Expression Analysis/Profiling/Tools

Abstract Signature Evaluation Tool (SET) is a standalone Java tool that can be used to evaluate and visualize the sample discrimination abilities of gene expression signatures.

It adopts Golub's weighted voting algorithm and incorporates a visual presentation of prediction strength for each array sample. SET provides a flexible and easy-to-follow platform to evaluate the discrimination power of a gene signature.

The manufacturers have demonstrated the application of SET for several purposes:

(1) for signatures consisting of a large number of genes, SET offers the ability to rapidly narrow down the number of genes;

(2) for a given signature (from third party analyses or user-defined), SET can re-evaluate and re-adjust its discrimination power by selecting/de-selecting genes repeatedly;

(3) for multiple microarray datasets, SET can evaluate the classification capability of a signature among datasets; and

(4) by providing a module to visualize the prediction strength for each sample, SET allows users to re-evaluate the discrimination power on mis-grouped or less-certain samples.

Information obtained from the above applications could be useful in prognostic analyses or clinical management decisions.

SET Implementation -- SET is deployed with Java Web Start technology, providing a flexible platform for researchers to evaluate gene signatures based on expression datasets. It enables users to analyze unpublished profiles locally while running the most up-to-date version of the program.

Results are visualized with JFreeChart, an open-source Java chart library, which displays a line chart of the error rate distribution and a scatter plot of the prediction strength analysis.

SET offers several unique presentations and user-friendly elements. An analysis follows four (4) simple steps:

Step 1 - Grouping arrays by supervised knowledge -- First, the user prepares and uploads two tab-delimited text files: one containing a gene expression matrix that has been normalized, filtered, or transformed, and another containing a list of genes that are potential classification markers.

In both files, individual genes (or probe IDs) are represented in rows while array samples or user-defined attributes are displayed in columns.
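The row and column layout described above can be read with a few lines of code. The following is a minimal sketch only, not SET's own parser: it assumes a header row whose first cell is a label and whose remaining cells are sample IDs, and it assumes every data cell is numeric. The function name and the example probe IDs are hypothetical.

```python
import csv
import io

def read_expression_matrix(handle):
    """Parse a tab-delimited matrix: genes in rows, array samples in columns.

    Assumed layout (not confirmed by the source): the first row is a header
    with sample IDs, the first column holds the gene or probe ID.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = {s: float(v) for s, v in zip(samples, row[1:])}
    return samples, matrix

# Hypothetical two-gene, two-sample file held in memory.
text = "probe\ts1\ts2\n1001_at\t7.5\t8.25\n1002_at\t3.0\t2.5\n"
samples, matrix = read_expression_matrix(io.StringIO(text))
```

The resulting dictionary maps each gene ID to its per-sample values, which is a convenient shape for per-gene scoring.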

To increase flexibility, SET implements parsers that recognize a variety of popular data formats, including normalized outputs from the Expression Console™ (see G6G Abstract Number 20001), BioConductor or dChip. The gene list input can be taken from published analytical results or be user-defined.

Upon uploading the files, array samples are assigned into two groups ("Supervised" groups) under the "Sample Grouping" panel. Samples of unknown identity can be assigned to the "Testing" group and their identities can be predicted in the later step of prediction strength analysis.

Samples to be excluded from later analyses can be assigned to the "Ignore" group.

Step 2 - Error rate distribution -- By default, the uploaded genes are ranked in descending order of the absolute values of their signal-to-noise scores, but the user can choose to rank them by other attributes such as p-values.
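The default ranking can be sketched as follows. This is an illustration rather than SET's code: it assumes the signal-to-noise score from Golub's original formulation, (mean of group A - mean of group B) / (SD of group A + SD of group B), and it assumes the sample standard deviation, which needs at least two samples per group and a non-zero SD sum. All function names, gene names and values are hypothetical.

```python
from statistics import mean, stdev

def signal_to_noise(values_a, values_b):
    """Golub-style score: (mean_a - mean_b) / (sd_a + sd_b).

    Assumes sample standard deviations and a non-zero denominator.
    """
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))

def rank_genes(expression, group_a, group_b):
    """Return (gene IDs sorted by |score| descending, dict of signed scores)."""
    scores = {
        gene: signal_to_noise([row[s] for s in group_a],
                              [row[s] for s in group_b])
        for gene, row in expression.items()
    }
    ranking = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return ranking, scores

# Hypothetical data: gene1 is high in group A, gene3 is high in group B,
# gene2 barely differs between the groups.
expression = {
    "gene1": {"s1": 10.0, "s2": 11.0, "s3": 2.0, "s4": 3.0},
    "gene2": {"s1": 5.0, "s2": 6.0, "s3": 5.5, "s4": 6.5},
    "gene3": {"s1": 1.0, "s2": 2.0, "s3": 9.0, "s4": 8.0},
}
ranking, scores = rank_genes(expression, ["s1", "s2"], ["s3", "s4"])
```

Ranking by the absolute value keeps strong markers of either group near the top, while the sign of each score still records which group the gene favours.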

Genes are added to a signature one at a time in ranking order. The error rate for each new signature is estimated by the weighted voting algorithm with Leave-One-Out Cross-Validation (LOOCV) and can be monitored via an error rate distribution plot. Subsequently, based on the error rate information, the user can select an appropriate composition of discriminating genes, for instance, a composition with the lowest error rate.
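The procedure can be sketched as below, under stated assumptions. The vote of each gene follows Golub's weighted voting scheme (weight times the distance of the sample from the midpoint of the two group means); the gene ranking is taken as given, and only the voting parameters are recomputed in each leave-one-out fold. Function names, the tie rule and the data are hypothetical and not taken from SET.

```python
from statistics import mean, stdev

def gene_params(values_a, values_b):
    """Return (weight, boundary): signal-to-noise score and group-mean midpoint."""
    mu_a, mu_b = mean(values_a), mean(values_b)
    weight = (mu_a - mu_b) / (stdev(values_a) + stdev(values_b))
    return weight, (mu_a + mu_b) / 2

def predict(expression, genes, group_a, group_b, sample):
    """Weighted voting: a positive vote total favours group A, otherwise B."""
    total = 0.0
    for gene in genes:
        row = expression[gene]
        weight, boundary = gene_params([row[s] for s in group_a],
                                       [row[s] for s in group_b])
        total += weight * (row[sample] - boundary)
    return "A" if total > 0 else "B"  # ties go to B (arbitrary assumption)

def loocv_error(expression, genes, group_a, group_b):
    """Hold out each supervised sample, train on the rest, count mistakes."""
    errors = 0
    for sample in group_a + group_b:
        train_a = [s for s in group_a if s != sample]
        train_b = [s for s in group_b if s != sample]
        truth = "A" if sample in group_a else "B"
        if predict(expression, genes, train_a, train_b, sample) != truth:
            errors += 1
    return errors / (len(group_a) + len(group_b))

def error_rate_distribution(expression, ranked_genes, group_a, group_b):
    """LOOCV error rate of the top-1, top-2, ..., top-n signatures."""
    return [loocv_error(expression, ranked_genes[:k], group_a, group_b)
            for k in range(1, len(ranked_genes) + 1)]

# Hypothetical data: three samples per group so that two remain for the
# standard deviation after one sample is held out.
expression = {
    "good":  {"s1": 10.0, "s2": 11.0, "s3": 12.0, "s4": 2.0, "s5": 3.0, "s6": 4.0},
    "noise": {"s1": 5.0, "s2": 6.5, "s3": 5.5, "s4": 6.0, "s5": 5.0, "s6": 6.5},
}
rates = error_rate_distribution(expression, ["good", "noise"],
                                ["s1", "s2", "s3"], ["s4", "s5", "s6"])
```

Plotting `rates` against signature size gives the kind of error rate distribution from which a composition can be chosen.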

Step 3 - Signature evaluation -- Genes within the chosen composition are ranked and displayed by their signal-to-noise scores and the user can manually select or de-select genes as appropriate.

Gene titles and gene symbols can be incorporated in this step if the annotations of an array platform are supported by the manufacturer's ArrayFusion database (see G6G Abstract Number 20319), which currently supports annotations for the majority of Affymetrix arrays and several Agilent arrays.

The potential of the selected genes to distinguish between the two supervised groups can be evaluated from the cross-validated error rate, where a lower error rate reflects a superior distinguishing potential. The 'significance of error rate' is estimated from 1,000 permutations of the group labels to ensure that the error rate is not a result of random chance.
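A permutation test of this kind can be sketched as follows. The source does not state how SET turns the permutations into a significance value, so the rule used here (the fraction of permutations whose error rate is no higher than the observed one) is an assumption. The error function is passed in as a parameter; the `midpoint_error` function and its data are hypothetical stand-ins for the cross-validated error rate.

```python
import random

def permutation_p_value(error_fn, group_a, group_b, n_permutations=1000, seed=0):
    """Return (observed error, fraction of label shuffles doing at least as well)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = error_fn(group_a, group_b)
    samples = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(samples)  # reassign samples to groups at random
        perm_a = samples[:len(group_a)]
        perm_b = samples[len(group_a):]
        if error_fn(perm_a, perm_b) <= observed:
            hits += 1
    return observed, hits / n_permutations

# Hypothetical single-gene data and a simple stand-in error function.
values = {"s1": 10.0, "s2": 11.0, "s3": 12.0, "s4": 2.0, "s5": 3.0, "s6": 4.0}

def midpoint_error(group_a, group_b):
    """Classify each sample by the midpoint of the two group means."""
    mean_a = sum(values[s] for s in group_a) / len(group_a)
    mean_b = sum(values[s] for s in group_b) / len(group_b)
    boundary = (mean_a + mean_b) / 2
    wrong = 0
    for s in group_a + group_b:
        predicted_a = (values[s] - boundary) * (mean_a - mean_b) > 0
        if predicted_a != (s in group_a):
            wrong += 1
    return wrong / (len(group_a) + len(group_b))

observed, p_value = permutation_p_value(
    midpoint_error, ["s1", "s2", "s3"], ["s4", "s5", "s6"])
```

With only six samples there are few distinct groupings, so the smallest attainable value is coarse; larger datasets give a finer estimate.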

The expression signature can be arbitrarily modified during the analysis and the corresponding error rate can be recalculated repeatedly.

Step 4 - Prediction strength -- The result of prediction strength (PS) analysis for each sample is shown once a signature is defined. The PS values range from -1 to +1, where the higher absolute values reflect stronger predictions.
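The PS value for one sample can be sketched as below. This is an illustration under assumptions: votes are computed as in Golub's weighted voting scheme, and the signed form PS = (V_A - V_B) / (V_A + V_B) is used so that the result falls between -1 and +1, with positive values taken to favour the first group. The sign convention, function name and data are hypothetical and not confirmed by the source.

```python
from statistics import mean, stdev

def prediction_strength(expression, genes, group_a, group_b, sample):
    """Signed PS in [-1, +1]; positive favours group A, negative group B."""
    votes_a = votes_b = 0.0
    for gene in genes:
        row = expression[gene]
        a = [row[s] for s in group_a]
        b = [row[s] for s in group_b]
        weight = (mean(a) - mean(b)) / (stdev(a) + stdev(b))
        boundary = (mean(a) + mean(b)) / 2
        vote = weight * (row[sample] - boundary)
        if vote > 0:
            votes_a += vote   # vote for group A
        else:
            votes_b -= vote   # store the magnitude of a vote for group B
    total = votes_a + votes_b
    return 0.0 if total == 0 else (votes_a - votes_b) / total

# Hypothetical data: "up" is high in group A, "down" is high in group B;
# t1 and t2 stand for samples in the "Testing" group.
expression = {
    "up":   {"s1": 10.0, "s2": 12.0, "s3": 2.0, "s4": 4.0, "t1": 11.0, "t2": 9.0},
    "down": {"s1": 1.0, "s2": 3.0, "s3": 9.0, "s4": 11.0, "t1": 2.0, "t2": 7.0},
}
group_a, group_b = ["s1", "s2"], ["s3", "s4"]
ps_t1 = prediction_strength(expression, ["up", "down"], group_a, group_b, "t1")
ps_t2 = prediction_strength(expression, ["up", "down"], group_a, group_b, "t2")
```

In this example both genes vote the same way for t1, giving the strongest possible call, while the genes disagree for t2 and its PS is correspondingly closer to zero.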

An overview of the results for samples in both the "Supervised" and "Testing" groups is illustrated by the PS plot for the selected signature, and the results can be used to evaluate and predict the certainty of group identity for an individual sample.

To increase the flexibility of evaluation, samples can be re-grouped (for instance, re-allocated from the "Testing" group to the "Supervised" group) and signature genes can be re-selected repeatedly. Results of the analysis provide the user with candidate genes for further experimental validation.

Serial signature evaluation -- SET is a tool for 'signature evaluation' rather than a machine learning tool for building an optimized prediction rule; in other words, the estimated error rate is only applied to the defined signature rather than to the signature building procedure that includes the feature selection process.

Application on multi-class datasets -- For datasets containing multiple phenotypes, one-versus-all comparisons can be performed to filter associated markers.
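Setting up one-versus-all comparisons amounts to building one two-group split per phenotype. The sketch below shows only that grouping step; the function name and phenotype labels are hypothetical, and each resulting pair of groups would then be analyzed as an ordinary two-group dataset.

```python
def one_versus_all(labels):
    """Map each phenotype to (samples with it, all remaining samples).

    labels: dict mapping sample ID -> phenotype label.
    """
    splits = {}
    for phenotype in sorted(set(labels.values())):
        in_group = [s for s, p in labels.items() if p == phenotype]
        rest = [s for s, p in labels.items() if p != phenotype]
        splits[phenotype] = (in_group, rest)
    return splits

# Hypothetical three-phenotype dataset.
labels = {"s1": "typeA", "s2": "typeB", "s3": "typeC", "s4": "typeA"}
splits = one_versus_all(labels)
```

Each entry in `splits` corresponds to one pair of "Supervised" groups, so a dataset with three phenotypes yields three separate two-group analyses.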

This strategy has proven successful in several high-quality microarray experiments, and the incorporation of algorithms designed for multivariate issues into the next version of SET is currently in progress.

SET and biological relevance analysis -- Although it is logical to assume that signature genes are biologically correlated with one another (for instance, through involvement in common pathways or genetic networks), identifying the biological relevance of input or output genes is not the primary function of SET.

This tool is principally aimed at providing a gene filtration threshold for gene identification.

Upon identification of a gene set of interest, the candidate genes can be applied to other biologically/clinically relevant analyses [such as Gene Ontology (GO) or Gene Set Enrichment Analysis (see G6G Abstract Number 20266)] to determine the biological significance of those genes.

System Requirements

Java SE 6 is required for every operating system platform. However, the manufacturer also provides a separate version (auto-detected, no table sorting) for Mac users to run via Java 5.

Manufacturer

Manufacturer Web Site Signature Evaluation Tool (SET)

Price Contact manufacturer.

G6G Abstract Number 20322

G6G Manufacturer Number 102873