Prospectr (PRiOritization by Sequence & Phylogenetic Extent of CandidaTe Regions)

Category Genomics>Genetic Data Analysis/Tools

Abstract Prospectr (PRiOritization by Sequence & Phylogenetic Extent of CandidaTe Regions) is an alternating decision tree that has been trained to differentiate between genes likely to be involved in disease and genes unlikely to be involved in disease.

It takes sequence-based features such as gene length, protein length and the percent identity of homologs in other species as input and returns a classification for a gene of interest.

The alternating decision tree outputs a classification ("likely to be involved in disease" or "unlikely to be involved in disease"), a score (which is a measure of confidence in the classification) and a breakdown of which factors contributed most to that score.

Given this score the product can also roughly estimate how much more or less likely it is that a particular gene is involved in human hereditary disease.
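The three outputs described above (classification, score and per-feature breakdown) can be illustrated with a minimal sketch of how an alternating decision tree is usually evaluated: every split under a reached prediction node is followed, the prediction values met along the way are summed, and the sign of the sum gives the class. The tree below is hypothetical; the feature names (gene_length, mouse_identity, protein_length), thresholds and prediction values are invented for illustration and are not PROSPECTR's actual rules.

```python
# Hypothetical alternating decision tree. Every feature name, threshold and
# prediction value here is invented for illustration, not taken from PROSPECTR.
TREE = {
    "value": 0.0,
    "splits": [
        {"feature": "gene_length", "threshold": 30000,
         "yes": {"value": -0.3},
         "no": {"value": 0.4, "splits": [
             {"feature": "mouse_identity", "threshold": 80.0,
              "yes": {"value": -0.2},
              "no": {"value": 0.3}},
         ]}},
        {"feature": "protein_length", "threshold": 400,
         "yes": {"value": -0.1},
         "no": {"value": 0.2}},
    ],
}


def classify(tree, gene):
    """Return (label, score, per-feature contributions) for one gene."""
    contributions = {}

    def visit(prediction_node):
        total = prediction_node["value"]
        # Follow every split under a reached prediction node, not just one,
        # and sum the prediction values encountered.
        for split in prediction_node.get("splits", []):
            feature = split["feature"]
            branch = "yes" if gene[feature] < split["threshold"] else "no"
            child = split[branch]
            # Credit the child's prediction value to the feature that was tested.
            contributions[feature] = contributions.get(feature, 0.0) + child["value"]
            total += visit(child)
        return total

    score = visit(tree)
    label = ("likely to be involved in disease" if score > 0
             else "unlikely to be involved in disease")
    return label, score, contributions


# Made-up feature values for a single gene of interest.
gene = {"gene_length": 52000, "mouse_identity": 91.0, "protein_length": 350}
label, score, contributions = classify(TREE, gene)
print(label, round(score, 3), contributions)
```

In this sketch the contributions plus the root value add up to the score, which is one simple way to present a breakdown of which factors mattered most.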

What Prospectr can be used for --

Prospectr can be used to enrich lists of genes found at a 'suspected disease' locus. Given a list of genes, Prospectr will return a 'ranked list' ordered by the likelihood of involvement in disease.

Tests on an independent data set of genes taken from the Human Gene Mutation Database (HGMD) suggest that Prospectr will, on average, enrich a list of ~200 genes two-fold 74% of the time, five-fold 33% of the time and twenty-fold 8% of the time.

95% of the time the list was enriched one-and-a-half-fold; that is, the target gene was in the top three-quarters of the ranked list. PROSPECTR is a simple and effective way to identify genes involved in Mendelian and oligogenic disorders.

On novel data, it performs markedly better than the single existing sequence-based classifier.

PROSPECTR could save investigators looking at large regions of interest time and effort by prioritizing positional candidate genes for mutation detection and case-control association studies.
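As a rough illustration of how a ranked list and a fold-enrichment figure relate, the sketch below sorts genes by score and computes enrichment for one target gene. It assumes fold enrichment means the list length divided by the rank of the target gene; that definition, the function names rank_genes and fold_enrichment, and the scores are assumptions made for this example, not details taken from PROSPECTR.

```python
def rank_genes(scores):
    """Order gene names from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def fold_enrichment(ranked, target):
    """Assumed definition: list length divided by the 1-based rank of the target."""
    rank = ranked.index(target) + 1
    return len(ranked) / rank


# Made-up scores for a 200-gene locus: gene_1 scores highest, gene_200 lowest.
scores = {f"gene_{i}": 1.0 - i / 100 for i in range(1, 201)}
ranked = rank_genes(scores)

# The hypothetical target gene sits at rank 40 of 200.
print(fold_enrichment(ranked, "gene_40"))
```

Under this assumed definition, a target gene ranked 40th in a list of 200 corresponds to five-fold enrichment, and one ranked 10th to twenty-fold.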

Defining features and building the training set --

A set of features was chosen based on a comparative study of ~18,000 known genes from Ensembl that are not known to be involved in human disease and the 1,084 Ensembl genes also listed in Online Mendelian Inheritance in Man (OMIM).

The feature set reflects the structure, content and phylogenetic extent (the extent to which a gene is conserved back through evolution based on homologs in other species) of each gene examined.

The manufacturer included signal peptide and transmembrane domain predictions; though these are, strictly speaking, functional attributes, they can be calculated with a high degree of accuracy directly from sequence.

Automatic classifiers are created by being trained on a set of genes that has already been classified manually.

The manufacturer's training set of genes was made up of the 1,084 genes found in both OMIM and Ensembl (the "disease genes") and 1,084 Ensembl genes not known to be involved in disease (the "control genes"), which were selected at random from the larger set of ~18,000 as a representative sample.
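The balanced design described above, with as many randomly chosen control genes as disease genes, can be sketched as follows. The function name build_training_set, the +1/-1 labels, the fixed seed and the placeholder gene identifiers are assumptions for illustration; the source does not describe how the random selection was actually carried out.

```python
import random


def build_training_set(disease_genes, candidate_controls, seed=0):
    """Pair each disease gene with one randomly drawn control gene.

    Returns a list of (gene_id, label) tuples, with label +1 for disease
    genes and -1 for control genes (labels are an assumption of this sketch).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    # Sample without replacement, one control per disease gene.
    controls = rng.sample(candidate_controls, k=len(disease_genes))
    return [(g, 1) for g in disease_genes] + [(g, -1) for g in controls]


# Placeholder identifiers standing in for the real gene lists.
disease = [f"disease_{i}" for i in range(1084)]
pool = [f"control_{i}" for i in range(18000)]
training_set = build_training_set(disease, pool)
print(len(training_set))
```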

Choosing an algorithm --

The manufacturer used Weka as the platform for its machine learning experiments.

A variety of machine learning methods were examined, but the alternating decision tree algorithm was chosen as the basis of the classification scheme because it couples high accuracy with a relatively small set of rules.

The advantage of decision-tree-based schemes over other popular algorithms such as k-Nearest Neighbor (kNN), Support Vector Machines (SVM) and Bayesian Networks is that the rules produced for classifying instances can be interpreted more easily by non-expert users.

This is particularly true for the alternating decision tree algorithm, which typically produces trees that are just as predictive as those created by more traditional decision tree algorithms but that are far more concise and thus easier to understand.

Alternating decision trees also allowed the manufacturer to measure the contribution of each feature to the final classification of a gene, which might provide insight into the essential differences between those genes more and less likely to be involved in disease.

Alternating decision trees are created by adding rules to the tree in an iterative fashion in the order of their predictive power, with the more effective rules being added first.

These rules are automatically derived from the differences between the disease and control genes in the training set provided.
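The iterative rule-adding idea can be sketched in a deliberately simplified form. The code below is not the Weka implementation: it omits the precondition search of a full alternating decision tree, so every rule attaches at the root, which amounts to boosting simple threshold rules. The function names best_rule and train, the smoothing constant of 1, the +1/-1 labels and the toy data are all assumptions made for this example.

```python
import math


def best_rule(X, y, w):
    """Pick the threshold rule with the lowest boosting criterion Z."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between adjacent values
            # Weighted counts of positives/negatives on each side of the rule.
            wp_t = wn_t = wp_f = wn_f = 0.0
            for row, label, wi in zip(X, y, w):
                if row[j] < t:
                    if label > 0:
                        wp_t += wi
                    else:
                        wn_t += wi
                elif label > 0:
                    wp_f += wi
                else:
                    wn_f += wi
            z = 2 * (math.sqrt(wp_t * wn_t) + math.sqrt(wp_f * wn_f))
            if best is None or z < best[0]:
                # Smoothed prediction values for the two branches.
                a = 0.5 * math.log((wp_t + 1) / (wn_t + 1))
                b = 0.5 * math.log((wp_f + 1) / (wn_f + 1))
                best = (z, j, t, a, b)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]


def train(X, y, n_rules=3):
    """Add rules one at a time, reweighting the examples after each."""
    w = [1.0] * len(y)
    rules = []
    for _ in range(n_rules):
        j, t, a, b = best_rule(X, y, w)
        rules.append((j, t, a, b))
        # Examples the new rule gets wrong gain weight for the next round.
        w = [wi * math.exp(-label * (a if row[j] < t else b))
             for row, label, wi in zip(X, y, w)]
    return rules


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1, 5], [2, 1], [8, 4], [9, 2]]
y = [1, 1, -1, -1]  # +1 = "disease gene", -1 = "control gene"
rules = train(X, y, n_rules=2)
print(rules[0])
```

On this toy data the first rule selected tests feature 0, since it separates the two classes cleanly, which mirrors the idea that the more effective rules are added first.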

PROSPECTR implementation --

The manufacturer implemented their classifier as a standalone script in Perl and designed an associated web interface to aid in the interpretation of the results produced.

PROSPECTR is freely accessible together with training and test sets of genes.

The web interface allows researchers to quickly obtain scores for regions of the genome or individual genes of interest.

System Requirements

Web-based.

Manufacturer

Manufacturer Web Site Prospectr

Price Contact manufacturer.

G6G Abstract Number 20428

G6G Manufacturer Number 104056