CART 6.0 New Enhancements and Features

Category Intelligent Software>Data Mining Systems/Tools

Abstract CART data mining software is a decision tree tool that automatically sifts large, complex databases, searching for and isolating significant patterns and relationships.

This discovered knowledge is then used to generate reliable, predictive models for applications such as credit risk scoring (probability of default, loss given default); fraud detection; targeted marketing (new customer acquisition, cross-sell, up-sell); churn modeling [and related customer relationship management (CRM)]; document classification; microarray data analysis; genomics and proteomics; and manufacturing and production line quality control.

CART 6.0 incorporates many new user-requested enhancements and features:

Enhanced Descriptive Statistics -- The manufacturer's complete set of statistics, including standard summary statistics, quantiles, and detailed tabulations, continues to be available for data exploration in a single easy-to-access display. The manufacturer now also offers an abbreviated version in the traditional one-row-per-predictor format.

Also new in CART 6.0 are sub-group statistics based on any segmentation or stratification variable, as well as charts and histograms for visualizing your data.

Improved User Interface --

New setup activity window - The activity window offers a quick way to access summary statistics, summary graphs, the model setup dialog, a view of the data, and scoring.

Ability to control default settings - The user can customize and save default settings.

Model Building --

Set weighted values for minimum parent/terminal node size - Previous versions of CART have always allowed you to control the size of the smallest terminal node produced.

In version 6.0 the manufacturer now also allows you to control the minimum allowable weighted record count in any terminal node. A similar control applies to the minimum weighted size a node must have to become a parent node.
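As a rough illustration of the weighted-size idea (not CART's internal logic), the sketch below checks whether a candidate split leaves both child nodes with enough total case weight. The function name and the `min_child_weight` setting are hypothetical and are not taken from the product.

```python
def split_is_allowed(weights, goes_left, min_child_weight):
    """Return True if both children meet the minimum weighted record count.

    weights          -- case weight of each record in the parent node
    goes_left        -- True/False per record: does it fall in the left child?
    min_child_weight -- hypothetical setting: smallest allowed sum of weights
    """
    left = sum(w for w, is_left in zip(weights, goes_left) if is_left)
    right = sum(w for w, is_left in zip(weights, goes_left) if not is_left)
    return left >= min_child_weight and right >= min_child_weight


weights = [0.5, 0.5, 2.0, 1.0, 3.0]
goes_left = [True, True, False, False, False]

# Left child holds two records but only 1.0 units of weight.
print(split_is_allowed(weights, goes_left, min_child_weight=1.0))  # True
print(split_is_allowed(weights, goes_left, min_child_weight=2.0))  # False
```

The second call shows why a weighted limit differs from a plain record count: the left child has two records, yet its combined weight is too small to pass.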

Additional fraction for auto validation - Traditionally, CART trees are grown on learn (or training) data and evaluated on test data. Because the test data are used to help select the optimal tree, some practitioners prefer to conduct a further model check by evaluating performance on a “holdout” portion of the data. The manufacturer refers to these holdout data as the validation data, and CART 6.0 can set aside an additional fraction of the data for this purpose.
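The learn/test/validation idea can be sketched in a few lines. This is a generic random partition written for illustration, not the product's own sampling routine; the function name, the fractions, and the seed are assumptions.

```python
import random


def three_way_split(n_records, test_fraction, validation_fraction, seed=0):
    """Randomly partition record indices into learn, test, and validation sets."""
    rng = random.Random(seed)
    indices = list(range(n_records))
    rng.shuffle(indices)

    n_valid = int(round(n_records * validation_fraction))
    n_test = int(round(n_records * test_fraction))

    validation = indices[:n_valid]            # never used to grow or select the tree
    test = indices[n_valid:n_valid + n_test]  # used to select the optimal tree
    learn = indices[n_valid + n_test:]        # used to grow the tree
    return learn, test, validation


learn, test, validation = three_way_split(100, test_fraction=0.2, validation_fraction=0.1)
print(len(learn), len(test), len(validation))  # 70 20 10
```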

Select/reject predictors directly from variable importance list - Once a model is built, you can easily refine it by managing the variable importance list. Simply highlight the variables you want to keep for the next model and click the “Build New Model” button.

CART Pro and Pro EX provide a higher degree of automation for predictor list refinement (feature extraction) and offer an automated pre-modeling predictor discovery stage.

This can be very effective when you are faced with a large number of candidate predictors. In extensive experiments the manufacturers have established that automatic predictor discovery frequently improves CART model performance on independent holdout data.

Splits --

Forced splits - The user can dictate the splitting variable to be used in the root node or in either of the two child nodes of the root, allowing the user to impose some modest structure on a tree. More specific controls allow the user to specify the split values for both continuous and categorical variables.

Linear combination lists - In CART 6.0 you may specify lists of variables (LC lists) from which any linear combination can be constructed. For example, in a credit risk model you might list credit report variables on one list, core demographics on another list, and current income-related variables on a third list.

Time series analysts might create a separate LC list for a variable and all its lagged values. Such LC lists force combinations of variables used in an LC splitter to be of a specific type.
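One reading of the LC list rule is that every variable in a linear combination splitter must come from a single list. The check below sketches that reading only; the list names and variable names are invented for the credit risk example and do not come from the product.

```python
def lc_candidate_is_valid(variables, lc_lists):
    """True if every variable in the candidate combination comes from one LC list."""
    chosen = set(variables)
    return any(chosen <= set(members) for members in lc_lists.values())


# Hypothetical LC lists for a credit risk model.
lc_lists = {
    "credit_report": ["n_delinquencies", "utilization", "n_inquiries"],
    "demographics": ["age", "years_at_address"],
    "income": ["salary", "other_income"],
}

print(lc_candidate_is_valid(["utilization", "n_inquiries"], lc_lists))  # True
print(lc_candidate_is_valid(["utilization", "salary"], lc_lists))       # False
```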

Constraints and structured trees (patent pending) - CART 6.0 offers an advanced mechanism for generating structured trees by allowing you to specify where a variable or group of variables is permitted to appear in the tree. For example, in a marketing model, you might limit consumer-related variables to the top of the tree and product-related variables to the bottom.

The resulting model would first split the data into different consumer types and then analyze the product preferences of each group. Constraints could also be used to reflect the natural or causal order of variables within the model structure, or to construct a tree using broad and general predictors at the top and more specific and detailed predictors toward the bottom.

CART allows you to structure your trees in a number of ways. You can specify where a variable can appear in the tree based either on its location in the tree or on the size of the sample arriving at a node. You can also specify as many different regions in the tree as you wish.
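To make the idea of tree regions concrete, the sketch below filters the predictors a node may split on by its depth and by the number of records reaching it. The rule keys (`min_depth`, `max_depth`, `min_node_size`) and the variable names are hypothetical; CART's actual constraint syntax is not shown in this abstract.

```python
def allowed_predictors(predictors, rules, depth, node_size):
    """Return the predictors permitted to split a node at this depth and size."""
    allowed = []
    for name in predictors:
        rule = rules.get(name, {})  # predictors without a rule are unconstrained
        if depth < rule.get("min_depth", 0):
            continue
        if depth > rule.get("max_depth", float("inf")):
            continue
        if node_size < rule.get("min_node_size", 0):
            continue
        allowed.append(name)
    return allowed


predictors = ["age", "income", "product_color", "product_price"]
rules = {
    "age": {"max_depth": 2},            # consumer variables: top of the tree only
    "income": {"max_depth": 2},
    "product_color": {"min_depth": 3},  # product variables: lower levels only
    "product_price": {"min_depth": 3},
}

print(allowed_predictors(predictors, rules, depth=1, node_size=500))
print(allowed_predictors(predictors, rules, depth=4, node_size=40))
```

With these rules the first call returns only the consumer variables and the second only the product variables, mirroring the marketing example.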

Cross Validation --

User-controlled cross validation bins - You can create your own partition of the data for the purpose of cross-validation by instructing CART to use a variable you have created for this purpose.

This is most useful when there are repeated observations on a behavioral unit such as a person or a firm, and it is important to keep all records pertaining to such a unit together (either all records are in the training sample or all in the test sample). User-constructed CV bins are also useful in the analysis of time series or geographically correlated data.
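A minimal sketch of how such a bin variable might be built and used, assuming each record carries a unit identifier such as a customer ID. The function names are hypothetical; in CART itself you would simply name the bin variable you have created.

```python
def assign_bins(unit_ids, n_bins):
    """Give every record of the same unit the same cross-validation bin."""
    unit_to_bin = {}
    for unit in unit_ids:
        if unit not in unit_to_bin:
            unit_to_bin[unit] = len(unit_to_bin) % n_bins  # round-robin over units
    return [unit_to_bin[unit] for unit in unit_ids]


def folds_from_bins(bins):
    """Yield (held_out_bin, learn_indices, test_indices) for each distinct bin."""
    for held_out in sorted(set(bins)):
        test = [i for i, b in enumerate(bins) if b == held_out]
        learn = [i for i, b in enumerate(bins) if b != held_out]
        yield held_out, learn, test


customer_ids = ["a", "a", "b", "c", "c", "c", "d"]
bins = assign_bins(customer_ids, n_bins=2)
print(bins)  # [0, 0, 1, 0, 0, 0, 1]

for held_out, learn_idx, test_idx in folds_from_bins(bins):
    print(held_out, learn_idx, test_idx)
```

Because bins are assigned per customer rather than per record, no customer ever appears on both sides of a fold.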

Missing Value Analysis --

Automatically add missing value indicators - CART has always offered sophisticated high performance missing value handling. In CART 6.0 the manufacturer introduces a new set of missing value analysis tools for automatic exploration of the optimal handling of your incomplete data. On request, CART 6.0 will automatically add missing value indicator variables (MVIs) to your list of predictors and conduct a variety of analyses using them.

MVIs allow formal testing of the core predictive value of knowing that a field is missing. One of the models CART 6.0 will generate for you automatically is a model using only missing value indicators as predictors.

In some circumstances such a simple model can be very accurate and it is important to be aware of this predictive power. Other analyses explore the benefits of imposing penalties on variables that are frequently missing.

Allow “missing” as a legal discrete level - For categorical variables, missing values can be handled either by adding a separate MVI variable or by treating missing as a valid “level”. You can experiment to see which works best for your data.
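The two treatments of missing data can be illustrated on a small table of records. This is a sketch only: the `_MIS` suffix, the `"MISSING"` level label, and the use of `None` to mark a missing value are assumptions, not CART conventions.

```python
def add_missing_value_indicators(rows, columns):
    """Return copies of the rows with one 0/1 indicator per listed column."""
    out = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            new_row[col + "_MIS"] = 1 if row.get(col) is None else 0
        out.append(new_row)
    return out


def missing_as_level(rows, column, level="MISSING"):
    """Return copies of the rows with missing values recoded as their own level."""
    out = []
    for row in rows:
        new_row = dict(row)
        if new_row.get(column) is None:
            new_row[column] = level
        out.append(new_row)
    return out


rows = [
    {"income": 50000, "region": "N"},
    {"income": None, "region": None},
]

with_mvis = add_missing_value_indicators(rows, ["income", "region"])
with_level = missing_as_level(rows, "region")
print(with_mvis[1])
print(with_level[1])
```

A model that uses only the `_MIS` columns as predictors corresponds to the MVI-only model described above.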

Model Evaluation --

“Profit” display: track non-model variables across all nodes - “Profit” variables are any variables the modeler is interested in tracking in the terminal nodes. The “profit” tab on the summary window includes tabular and graphical displays of these variables, showing absolute and average node results and cumulative results based on the ordering of the nodes determined by the original target variable.

Train/test consistency: how well do train and test match up across all nodes - Classic CART trees are evaluated on the basis of the overall tree performance.

However, many users of CART are more interested in the performance of specific nodes and the degree to which terminal nodes exhibit strongly consistent results across the train and test samples. The TTC report provides new graphical and tabular reports to summarize train/test agreement.
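One simple way to think about node-level agreement is to compare each terminal node's lift (node response rate divided by the overall response rate) on the train and test samples and flag nodes where the two fall on opposite sides of 1.0. The sketch below assumes that definition; the measures actually used in the TTC report are not described in this abstract.

```python
def node_agreement(train_nodes, test_nodes):
    """Compare per-node lift on train and test data.

    Each argument maps node id -> (n_responders, n_records).
    Returns node id -> (train_lift, test_lift, same_direction).
    """
    def overall_rate(nodes):
        responders = sum(r for r, _ in nodes.values())
        records = sum(n for _, n in nodes.values())
        return responders / records

    base_train = overall_rate(train_nodes)
    base_test = overall_rate(test_nodes)

    report = {}
    for node_id, (r, n) in train_nodes.items():
        test_r, test_n = test_nodes[node_id]
        train_lift = (r / n) / base_train
        test_lift = (test_r / test_n) / base_test
        same_direction = (train_lift >= 1.0) == (test_lift >= 1.0)
        report[node_id] = (train_lift, test_lift, same_direction)
    return report


# Illustrative counts only.
train_nodes = {1: (40, 100), 2: (10, 100), 3: (10, 200)}
test_nodes = {1: (18, 50), 2: (9, 50), 3: (3, 100)}

for node_id, (train_lift, test_lift, ok) in node_agreement(train_nodes, test_nodes).items():
    print(node_id, round(train_lift, 2), round(test_lift, 2), ok)
```

In this made-up example node 2 looks below average on the train data but above average on the test data, so it is the one flagged as inconsistent.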

Hot spot detection: search many trees to find nodes of ultra-high response - In many modeling situations, an analyst is looking for “hot spots,” regions of modeling space richest in the event of interest.

For example, in a fraud detection problem, you might be interested in identifying a set of rules that lead to a high ratio of fraudulent transactions, allowing you to flag future records that fit those rules as almost certainly fraudulent. CART’s hot spot detection process is fully automated and is especially effective in processing batteries of models.
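The search for hot spots can be pictured as ranking terminal nodes from many trees by event rate while ignoring nodes that are too small to trust. The sketch below assumes the node summaries are already available as dictionaries; the field names and the `min_records` threshold are hypothetical.

```python
def find_hot_spots(nodes, min_records, top_n):
    """Return the top_n nodes with the highest event rate among large-enough nodes."""
    eligible = [node for node in nodes if node["n_records"] >= min_records]
    eligible.sort(key=lambda node: node["n_events"] / node["n_records"], reverse=True)
    return eligible[:top_n]


# Illustrative node summaries collected from three trees.
nodes = [
    {"tree": "A", "node": 3, "n_records": 120, "n_events": 90},
    {"tree": "A", "node": 7, "n_records": 8, "n_events": 8},     # pure but tiny
    {"tree": "B", "node": 2, "n_records": 300, "n_events": 150},
    {"tree": "C", "node": 5, "n_records": 60, "n_events": 57},
]

hot = find_hot_spots(nodes, min_records=50, top_n=2)
for node in hot:
    print(node["tree"], node["node"], node["n_events"] / node["n_records"])
```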

Additional Summary Reports --

ROC curves and variance of ROC measure - ROC curves have become a preferred way of summarizing the performance of a model, and these are now available for all CART models and ensembles on both train and test data. An estimate of the area under the ROC curve is also produced when cross validation is used to assess model performance.
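For readers unfamiliar with the measure, the area under the ROC curve equals the probability that a randomly chosen event record receives a higher score than a randomly chosen non-event record, with ties counted as one half. The sketch below computes it that way; it is a textbook pairwise calculation, not the estimator CART uses, and ties matter here because every record in a terminal node shares the same score.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Scores are node-level probabilities, so several records share each value.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.9, 0.9, 0.4, 0.4, 0.1]
print(roc_auc(labels, scores))
```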

Display learn, test, or pooled results - Results can be viewed for the training (learn) data, the test data, or the aggregate created by pooling the learn and test samples.

Gains chart: show perfect model curve - In a gains curve, the performance of a perfect model depends on the balance between the “response” and “non-response” sample sizes. The “perfect model” reference line helps to put the observed gains curve into proper perspective.
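The dependence on sample balance is easy to see with a small calculation. Assuming records are targeted in score order and a perfect model ranks every responder ahead of every non-responder, the share of responders captured rises linearly until the targeted fraction equals the response rate and is 100% from then on. The function name below is hypothetical.

```python
def perfect_gains_curve(n_responders, n_total, fractions):
    """Share of all responders a perfect model captures at each targeted fraction."""
    response_rate = n_responders / n_total
    return [min(f / response_rate, 1.0) for f in fractions]


fractions = [0.05, 0.10, 0.25, 0.50, 1.00]

# 10% responders: the perfect curve reaches 100% after targeting 10% of records.
rare = perfect_gains_curve(100, 1000, fractions)
# 50% responders: the perfect curve reaches 100% only at 50% of records.
balanced = perfect_gains_curve(500, 1000, fractions)

print(rare)
print(balanced)
```

An observed gains curve that looks modest against the 50% case may be close to perfect when responders are rare, which is why the reference line is useful.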

Scoring and Translation --

Multi-tree selection control for scoring and model translation - In CART 6.0, any tree in the pruning sequence can be used for scoring and model translation.

New model translation formats: Java and PMML - The manufacturers have added Java and PMML to their existing group of model translation languages. The Predictive Modeling Markup Language (PMML) is a form of XML specifically designed to express the predictive formulas or mechanisms of a data mining model. In CART 6.0 the manufacturer conforms to PMML release 3.0.

Unsupervised Learning --

Breiman’s column scrambler - The manufacturers believe that Leo Breiman invented this trick (although they are not entirely sure). The manufacturers start with the original data and then make a copy. Each column of the copy is randomly shuffled, independently of the others, to destroy the original correlation structure.

CART is then used to try to recognize whether a record belongs to the original data or to the shuffled copy. The stronger the correlation structure in the original data, the better CART will do, and the terminal nodes may identify interesting data segments.
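The data preparation step can be sketched as follows, assuming the data are held as a list of rows. Each column of the copy is shuffled independently, and a 0/1 label marks whether a row comes from the original data or the scrambled copy; any classifier, CART included, can then be trained on the combined table. The function names and the label coding are assumptions.

```python
import random


def scrambled_copy(rows, seed=0):
    """Return a copy of the data with each column shuffled independently."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)  # breaks the link between this column and the others
    return [list(row) for row in zip(*columns)]


def original_vs_copy_dataset(rows, seed=0):
    """Stack original rows (label 1) on top of scrambled rows (label 0)."""
    original = [list(row) + [1] for row in rows]
    copy = [row + [0] for row in scrambled_copy(rows, seed)]
    return original + copy


rows = [[1, 10], [2, 20], [3, 30], [4, 40]]  # two perfectly correlated columns
dataset = original_vs_copy_dataset(rows)
for record in dataset:
    print(record)
```

Each column keeps its own distribution in the copy, so only the relationships between columns can tell the two halves apart.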

Automated Model search: BATTERY --

Most modelers conduct a variety of experiments, trying different model control parameters in an effort to find the best settings. This is done for any method that has a number of control settings that can materially affect performance outcomes.

In CART 6.0, the manufacturers have made the process easier still by packaging their recommended “batteries of models” into batches that the modeler can request with a mouse click.
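Done by hand, such an experiment amounts to looping over every combination of control settings, building a model for each, and ranking the results. The sketch below shows that loop in generic form; the setting names and the toy scoring function are placeholders, and the contents of CART's own batteries are not listed in this abstract.

```python
from itertools import product


def run_battery(settings_grid, build_and_score):
    """Score every combination of settings and return results, best first."""
    names = sorted(settings_grid)
    results = []
    for values in product(*(settings_grid[name] for name in names)):
        settings = dict(zip(names, values))
        results.append((build_and_score(settings), settings))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


def toy_score(settings):
    """Placeholder for 'build a model and return its test performance'."""
    score = 1.0 - abs(settings["min_node_size"] - 10) / 100
    if settings["splitting_rule"] == "entropy":
        score -= 0.05
    return score


grid = {
    "min_node_size": [5, 10, 50],
    "splitting_rule": ["gini", "entropy"],
}

results = run_battery(grid, toy_score)
best_score, best_settings = results[0]
print(best_score, best_settings)
```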

Note 1: See CART 5.0 (G6G Abstract Number 20053) for additional features.

Note 2: See CART 5.0 New and Enhanced Features (G6G Abstract Number 20053A1) for additional features.

System Requirements

CART System Requirements

Manufacturer

Manufacturer Web Site CART 6.0 New Enhancements and Features

Price Contact manufacturer.

G6G Abstract Number 20053B60

G6G Manufacturer Number 102305