TreeNet®
Category Intelligent Software>Data Mining Systems/Tools
Abstract TreeNet is a data mining tool designed for very high-accuracy predictive modeling. Because TreeNet attempts to achieve this goal even if very complex models are required, models may be relatively difficult to understand in detail. However, the graphs produced by TreeNet software display the impact of any relevant predictor or pair of predictors on the target, thus revealing the underlying data structure.
Product can be used for credit risk scoring, targeted marketing, fraud detection, document classification, response modeling, and bioinformatics.
Salford Systems sees TreeNet as a tool to be used after the data has been explored with tools such as CART (see G6G Abstract Number 20053B60) and MARS (an additional product from this manufacturer).
CART and MARS produce output that can clearly reveal data errors and inconsistencies, quickly leading to a detailed understanding of the data and potential problems.
Once data quality has been assured and the basic understanding of the key drivers in the data has been achieved, reanalyzing the data with TreeNet is worthwhile. In most cases, TreeNet will confirm the primary findings reported by CART or MARS while substantially increasing the predictive accuracy of the models.
How TreeNet works and what a TreeNet model looks like -- A TreeNet model normally consists of several dozen to several hundred small trees, each typically having No more than two (2) to eight (8) terminal nodes.
The model is similar in spirit to a long series expansion (such as a Fourier or Taylor series) - a sum of terms that becomes progressively more accurate as the expansion continues.
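The series-expansion idea can be illustrated with a minimal sketch in Python. This is Not TreeNet's implementation: the squared-error loss, the single predictor, the function names, and the settings (200 trees, a weight of 0.1 per tree) are all assumptions chosen for illustration. Each term of the sum is a two-terminal-node tree fitted to what the earlier terms left unexplained.

```python
# Illustrative sketch only: an additive model of two-terminal-node trees (stumps).
# Loss (squared error), names, and settings are assumptions, not TreeNet internals.

def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the largest value so the right node is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_additive_model(xs, ys, n_trees=200, learn_rate=0.1):
    """Grow stumps one at a time, each fitted to the current residuals."""
    baseline = sum(ys) / len(ys)
    predictions = [baseline] * len(ys)
    stumps, mse_path = [], []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add the new term to the running sum with a small weight.
        predictions = [p + learn_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
        mse_path.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    model = {"baseline": baseline, "learn_rate": learn_rate, "stumps": stumps}
    return model, mse_path


def predict(model, x):
    """Evaluate the sum: baseline plus every stump's weighted contribution."""
    total = sum(stump_predict(s, x) for s in model["stumps"])
    return model["baseline"] + model["learn_rate"] * total


# Made-up data: a smooth curve that No single stump can represent.
xs = [i / 10 for i in range(50)]
ys = [x * x for x in xs]
model, mse_path = fit_additive_model(xs, ys)
print("training MSE after 1 tree:   ", mse_path[0])
print("training MSE after 200 trees:", mse_path[-1])
```

Under these assumptions the training error cannot increase as terms are added, which mirrors the way a series expansion improves as it is extended.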
Advantages of TreeNet --
1) Automatic selection from thousands of candidate predictors - No prior variable selection or data reduction is required.
2) Ability to handle data without preprocessing - Data does Not need to be rescaled, transformed, or modified in any way.
3) Resistance to outliers in predictors or the target variable;
4) Automatic handling of missing values;
5) General robustness to dirty and partially inaccurate data;
6) High speed - Trees are grown quickly, and small trees are grown extraordinarily quickly.
7) TreeNet is able to focus on the data that is Not easily predictable as the model evolves - Thus, as additional trees are grown, less and less data needs to be processed, and in many cases TreeNet is able to train effectively on 20% of the data.
8) Resistance to overtraining - When working with large databases, even models with 2,000 trees show little evidence of overtraining, and most models show maximum accuracy well before 1,000 trees are grown.
TreeNet's robustness extends to data contaminated with erroneous target labels. For example, in medicine there is some risk that patients labeled as healthy are in fact ill and vice versa.
This type of data error can be very challenging for conventional data mining methods and will be catastrophic for conventional boosting.
In contrast, TreeNet is generally immune to such errors as it dynamically rejects training data points too much at variance with the existing model.
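One way a boosting procedure can limit the influence of badly fitting points is sketched below in Python. It uses a Huber-style loss, which caps each point's residual before the next tree is fitted. This is an assumption for illustration and a softer relative of outright rejection; the source does Not say which mechanism TreeNet uses, and the function name and the cutoff `delta` are hypothetical.

```python
# Illustrative sketch only: capping the pull of points that disagree strongly
# with the current model. Huber-style clipping is an assumed stand-in, not
# necessarily the mechanism TreeNet uses.

def huber_pseudo_residuals(ys, predictions, delta):
    """Negative gradient of the Huber loss: residuals clipped to [-delta, delta]."""
    clipped = []
    for y, p in zip(ys, predictions):
        residual = y - p
        clipped.append(max(-delta, min(delta, residual)))
    return clipped


# Made-up example: the last target value is a gross error.
ys = [1.0, 1.2, 0.9, 25.0]
predictions = [1.0, 1.0, 1.0, 1.0]

plain = [y - p for y, p in zip(ys, predictions)]
robust = huber_pseudo_residuals(ys, predictions, delta=1.0)

print("plain residuals: ", plain)   # the bad point dominates
print("capped residuals:", robust)  # the bad point's pull is limited to delta
```

With plain residuals the corrupted point would dominate the next tree; with the cap, well-fitted points are unchanged and the corrupted point contributes No more than `delta`.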
In addition, TreeNet adds the advantage of a degree of accuracy usually Not attainable by a single model or by ensembles such as bagging or conventional boosting.
Independent real world tests in text mining, fraud detection, and credit worthiness have shown TreeNet to be dramatically more accurate on test data than other competing methods.
Of course, No one method can be best for all problems in all contexts. Typically, if TreeNet is Not well suited for a problem, it will yield accuracy on par with that achievable with a single CART tree.
What TreeNet output looks like --
The TreeNet model is a complex structure Not easily understood by studying its individual components. However, TreeNet produces a number of clear reports and graphs that reveal the core message and predictive content of the model. These include:
1) Variable importance ranking.
2) Graphs of the typical relationship between the target and any one predictor - All other variable effects are taken into account to arrive at a typical relationship. Technically, Salford Systems graphs E(Y|Xi) for a single predictor Xi, integrating out all other relevant predictors.
3) 3-D graphs of the target against any pair of predictors.
4) The first few trees of the model may also be displayed as a set of text rules.
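The single-predictor graphs in item 2 can be approximated as in the following Python sketch: hold one predictor at a fixed value, average the model's output over the observed values of all other predictors, and repeat across a grid of values. The function name, the toy model, and the data are hypothetical, and this averaging approach is an assumption about how "integrating out" might be carried out, Not a description of TreeNet's own computation.

```python
# Illustrative sketch only: averaging out all predictors except one.
# `model` is any callable taking a list of predictor values; names are hypothetical.

def partial_dependence(model, rows, feature_index, grid):
    """For each grid value, set one predictor to that value in every row
    and average the model's output over the rows."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)            # copy so the data is Not altered
            modified[feature_index] = value
            total += model(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Made-up stand-in for a fitted model: linear in X0, quadratic in X1."""
    return 2.0 * row[0] + row[1] * row[1]


rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
curve = partial_dependence(toy_model, rows, feature_index=0, grid=[0.0, 1.0, 2.0])
print(curve)  # rises by 2.0 per unit of X0; the X1 effect is averaged into the level
```

Plotting the resulting curve against the grid gives the kind of one-predictor graph described above; using a grid over two predictors at once would give the 3-D surface of item 3.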
How prediction and scoring are handled by TreeNet --
Optionally, TreeNet 1.0 will score any database and output predictions in the file format or database required. The data management system allows access to any of 85 file formats.
Input and output file formats can be different. Alternatively, the TreeNet model can be exported as a SAS® language subroutine.
TreeNet’s underlying technology and how it differs from boosting --
TreeNet uses 'gradient boosting' to achieve the benefit of boosting (accuracy) without the drawback of a tendency to be misled by bad data.
In boosting, each tree grown would normally be a fully articulated stand-alone model, with each boosted tree combined with its mates via a weighted voting scheme.
In contrast, each TreeNet component is a small tree, often No larger than two (2) terminal nodes, and the trees are summed together with very small weights on each component.
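The two ways of combining trees can be contrasted in a short Python sketch. Both functions, their names, and the numbers are hypothetical illustrations of the general schemes described above, Not code from TreeNet or from any particular boosting package.

```python
# Illustrative sketch only: two ways of combining trees into one prediction.

def weighted_vote(tree_votes, tree_weights):
    """Conventional boosting style: each full tree casts a +1/-1 class vote,
    and the votes are combined using a per-tree weight."""
    score = sum(weight * vote for vote, weight in zip(tree_votes, tree_weights))
    return 1 if score >= 0 else -1


def shrunken_sum(baseline, tree_outputs, learn_rate):
    """Gradient boosting style: every small tree contributes a numeric output,
    and the outputs are summed with the same very small weight."""
    return baseline + learn_rate * sum(tree_outputs)


# Three stand-alone classifiers: one heavily weighted tree decides the outcome.
vote = weighted_vote([1, -1, 1], [0.9, 1.5, 0.4])

# Three small trees: each one nudges the running score only slightly.
score = shrunken_sum(baseline=0.5, tree_outputs=[1.0, -2.0, 4.0], learn_rate=0.01)

print(vote, score)
```

In the voting scheme a single heavily weighted tree can overturn the others, whereas in the summed scheme No single small tree moves the result by much.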
System Requirements
TreeNet 1.0 requires that both training and test data reside in RAM. Thus, if large databases are being analyzed, TreeNet will be most effective when running on large-capacity servers. We recommend a minimum of 512 MB RAM. On Windows machines, Windows 2000, XP, or later versions of the OS are the preferred platforms for performance. TreeNet® is available for Windows 98/NT/2000 and UNIX (IBM AIX, Compaq Alpha, SGI, HP, and Sun) platforms and will run with as little as 64 MB RAM. A Linux version is planned.
Manufacturer
- Salford Systems
- 4740 Murphy Canyon Rd. Ste 200
- San Diego, Calif. 92123
- Tel: 619.543.8880
- Fax: 619.543.8888
- info@salford-systems.com
- support@salford-systems.com
Manufacturer Web Site TreeNet
Price Contact manufacturer.
G6G Abstract Number 20220
G6G Manufacturer Number 102305