CART 5.0 New and Enhanced Features

Category Intelligent Software>Data Mining Systems/Tools

Abstract CART data mining software is a decision tree tool that automatically sifts large, complex databases, searching for and isolating significant patterns and relationships. This discovered knowledge is then used to generate reliable, predictive models for applications such as credit risk scoring (probability of default, loss given default); fraud detection; targeted marketing (new customer acquisition, cross-sell, up-sell); churn modeling [and related customer relationship management (CRM)]; document classification; microarray data analysis; genomics, proteomics; manufacturing and production line quality control.

CART 5.0 incorporates many new user-requested enhancements and features:

Discrete Variables -- Discrete variables (a.k.a. categorical variables) are those that take on a finite set of distinct values. Predictor variables can be discrete, as can the target variable (in which case the model is a classification tree).

CART 5 handles discrete variables in the following more flexible and easier-to-use ways:

1) Ability to Automatically Detect Distinct Classes -- It is no longer necessary to specify how many distinct classes a discrete variable has, even if the variable is the target. The user only has to identify which variables are to be treated as discrete; CART determines the rest.

2) Fractional Values Can Be Specified as Categorical -- Numeric discrete variables no longer need to take on whole-number values, nor do the values need to be contiguous. For instance, the following series of distinct values is supported in CART 5: 0.01, 0.1, 1.0, 1.001, 200, -500.

3) Character Data -- Character variables are now fully supported, as predictors, as the target or as auxiliary variables (described below). This is an important new feature because modern data, especially those arising from web logs and Internet transactions, are often character in nature.
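As a rough illustration of what automatic class detection involves, the following Python sketch collects the distinct levels of a column and assigns each an integer code. It is not CART's implementation; the function name `detect_levels` and the treatment of `None` as missing are assumptions made for the example.

```python
def detect_levels(values):
    """Return the sorted distinct levels of a discrete variable and a
    level -> integer code mapping. None is treated as missing and skipped
    (an assumption for this sketch)."""
    levels = sorted({v for v in values if v is not None})
    codes = {level: i for i, level in enumerate(levels)}
    return levels, codes


# The non-contiguous, fractional series quoted in item 2 above.
numeric_column = [0.01, 0.1, 1.0, 1.001, 200, -500, 0.1, 200]
levels, codes = detect_levels(numeric_column)
print(levels)        # distinct values, no level count supplied by the user
print(codes[1.001])  # internal code assigned to the level 1.001

# Character data works the same way.
char_levels, _ = detect_levels(["GET", "POST", None, "GET"])
print(char_levels)
```

The caller never states how many levels exist; the count falls out of the data.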

Native Support for Text Data -- CART 5 includes native support for text datasets, the most flexible and natural format for many users to maintain data. A single delimiter is used throughout the dataset, usually a comma, but semicolon, space, and tab are also supported as delimiters.
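To make the single-delimiter convention concrete, here is a minimal Python sketch that guesses the delimiter from the header line and then parses the whole dataset with it. The helper names `guess_delimiter` and `read_delimited`, the sample column names, and the counting heuristic are all assumptions for illustration; the source does not describe how CART itself detects the delimiter.

```python
import csv
import io

# The four delimiters named in the text above.
CANDIDATES = [",", ";", "\t", " "]


def guess_delimiter(header_line):
    """Pick the candidate that occurs most often in the header line.
    Ties resolve to the earliest candidate in CANDIDATES."""
    return max(CANDIDATES, key=header_line.count)


def read_delimited(text):
    """Parse delimited text into dict rows, using one delimiter throughout."""
    first_line = text.splitlines()[0]
    delimiter = guess_delimiter(first_line)
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


sample = "AGE;INCOME;REGION\n34;52000;WEST\n51;61000;EAST\n"
rows = read_delimited(sample)
print(rows[0]["REGION"])
```

Values come back as strings; deciding which columns are numeric is a separate step that this sketch leaves out.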

Data Information -- The 'Data Info' window is a new display in CART 5 that offers summary information about the variables in your dataset. For continuous variables it reports N, mean, sum, min, max, variance, standard deviation, skewness, kurtosis, conditional mean, and the N equal and unequal to 0.0, all of which may be weighted by a case weight variable. Also available is a fully weighted tabulation of distinct values, along with quantiles, quartiles, the interquartile range, and the N most and least frequent values.
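The effect of a case weight on these statistics can be sketched in a few lines of Python. The function `weighted_summary` is hypothetical, and the variance uses the sum of weights as its divisor, which is one common convention; the source does not state which formula CART uses.

```python
import math


def weighted_summary(values, weights=None):
    """Weighted N, sum, mean, variance, standard deviation, min and max.
    With weights=None every case counts once."""
    if weights is None:
        weights = [1.0] * len(values)
    n = sum(weights)                                   # weighted N
    total = sum(w * x for x, w in zip(values, weights))
    mean = total / n
    # Assumed convention: divide by the sum of weights.
    variance = sum(w * (x - mean) ** 2 for x, w in zip(values, weights)) / n
    return {
        "n": n,
        "sum": total,
        "mean": mean,
        "variance": variance,
        "std": math.sqrt(variance),
        "min": min(values),
        "max": max(values),
    }


unweighted = weighted_summary([1, 2, 3, 4])
weighted = weighted_summary([1.0, 2.0, 4.0], weights=[1, 1, 2])
print(unweighted["mean"], weighted["mean"])
```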

Auxiliary Variables -- CART 5 introduces the "auxiliary" variable. Any variable (discrete/continuous, character/numeric) can be summarized with descriptive statistics or a frequency distribution at any node level. Such variables are termed "auxiliary" variables. Auxiliary variables do not have to be predictors in the model, although they can be. For example, profit and revenue measures in the dataset can be summarized for each node without affecting the growth of the tree (i.e., they are not predictors in the model), allowing the most profitable partitions to be identified.
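The idea can be illustrated with a small Python sketch: once each record has been assigned to a node, an auxiliary measure such as profit is simply aggregated per node, independently of how the tree was grown. The record layout, the field names `node` and `profit`, and the function `summarize_by_node` are hypothetical.

```python
from collections import defaultdict


def summarize_by_node(records, node_key, aux_key):
    """Count and mean of an auxiliary variable within each node."""
    totals = defaultdict(lambda: [0, 0.0])   # node -> [count, running sum]
    for rec in records:
        entry = totals[rec[node_key]]
        entry[0] += 1
        entry[1] += rec[aux_key]
    return {node: {"n": n, "mean": s / n} for node, (n, s) in totals.items()}


# Hypothetical scored records: node assignment plus a profit measure
# that played no part in choosing the splits.
records = [
    {"node": 1, "profit": 10.0},
    {"node": 1, "profit": 30.0},
    {"node": 2, "profit": 5.0},
]
summary = summarize_by_node(records, "node", "profit")
most_profitable = max(summary, key=lambda node: summary[node]["mean"])
print(summary, most_profitable)
```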

Groves -- CART 5 introduces "grove" files, which replace the pre-CART 5 .TR1 file. A grove file is a binary file that stores all the information about the tree sequence needed to apply any tree from the sequence to new data, or to translate (export) the tree into a different presentation language. Grove files contain a variety of information, including node information, the optimal tree indicator, and predicted probabilities. Grove files are not limited to storing a single tree sequence; they may contain entire collections of trees obtained through bagging, arcing, or cross validation. The file format is flexible enough to accommodate further extensions and exotic tree-related objects created in other Salford Systems applications.

Note: Once a grove file is created, it can be translated into SAS-compatible code, C, and Predictive Model Markup Language (PMML).

Exporting CART Model Information -- CART 5 can export the model information contained in the binary grove file, including primary and surrogate splitting rules, as code in several programming languages. The files containing the exported code can be used outside CART for scoring data. Export language formats currently supported are SAS-compatible, C, and PMML.

Missing Value Summary Report -- This report identifies the proportion of records missing for the target and each predictor variable, and for each sample (learn/test), sorted from most- to least-missing.
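A report of this shape is straightforward to sketch in Python. The function `missing_report`, the use of `None` to mark a missing value, and the sample variable names are assumptions for the example; running it once per sample (learn, test) would give the per-sample breakdown described above.

```python
def missing_report(rows, variables):
    """Proportion of records missing for each variable,
    sorted from most- to least-missing."""
    n = len(rows)
    report = [
        (var, sum(1 for row in rows if row.get(var) is None) / n)
        for var in variables
    ]
    return sorted(report, key=lambda item: item[1], reverse=True)


# Hypothetical learn sample with None marking a missing value.
learn = [
    {"TARGET": 1, "AGE": 34, "INCOME": None},
    {"TARGET": 0, "AGE": None, "INCOME": None},
    {"TARGET": 1, "AGE": 51, "INCOME": 61000},
    {"TARGET": 0, "AGE": 47, "INCOME": 52000},
]
report = missing_report(learn, ["TARGET", "AGE", "INCOME"])
for name, share in report:
    print(f"{name:8s} {share:.0%}")
```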

Entropy Splitting Rule -- This well-known splitting rule is related to the likelihood function. With multilevel targets it tends to look for splits where some or as many levels as possible are divided perfectly or near perfectly. As a result, Entropy puts more emphasis on getting rare levels right relative to common levels than either Gini or Twoing does. Depending on the circumstances, its properties may be similar to those of Gini or Twoing, or fall somewhere between them.
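For readers who want to see the criteria side by side, the Python sketch below computes the textbook entropy and Gini impurities from class counts and the resulting improvement for a candidate split. These are the standard definitions, not code taken from CART; the base-2 logarithm and the function names are choices made for this example, and Twoing is omitted.

```python
import math


def entropy(counts):
    """Entropy impurity of a node given its class counts (log base 2)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_improvement(impurity, parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = sum(parent)
    return (
        impurity(parent)
        - (sum(left) / n) * impurity(left)
        - (sum(right) / n) * impurity(right)
    )


# A two-class node with 5 cases per class, split perfectly into pure children.
parent, left, right = [5, 5], [5, 0], [0, 5]
print(split_improvement(entropy, parent, left, right))
print(split_improvement(gini, parent, left, right))
```

Swapping the `impurity` argument is all it takes to compare how the two rules rank the same candidate split.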

32-Character Variable Names -- CART 5 supports variable names up to 32 characters.

Path Length Extended to Windows Maximum -- CART 5 supports a Windows maximum path length of 256 characters (including the file name).

Improved Navigator Window --

1) The tree topology navigator now allows you to display either the learn sample or the test sample.

2) Toggle the secondary navigator window panel to display terminal node counts or the relative cost curve with an emphasis on all trees within one standard deviation.

3) The new action button in the navigator allows the user to save the navigator and the grove file, score new data, or translate the tree into one of the available languages.

4) Compare learn and test samples at any level of the tree.

5) View the tree topology display with the focus on any specified auxiliary variable.

6) View auxiliary variable descriptive statistics or frequency tables at any level of the tree.

New and Improved Summary Reports --

1) An improved terminal node report now enables you to evaluate the purity or homogeneity of the terminal nodes, an indication of how well CART has partitioned the classes.

2) A prediction success report allows you to specify a focus target class, enabling quicker analysis of the most important class.

3) A learn/test sample breakdown is available in a majority of the post-processing result windows and dialogs.

4) Result windows allow the user to toggle displays between the percent of data or the number of cases.

5) Result windows allow the user to choose among various graph forms.
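The focus-class idea behind the prediction success report (item 2 above) can be sketched in Python as a one-versus-rest tally. The function `prediction_success`, its output keys, and the sample labels are hypothetical; the layout of CART's actual report is not described in the source.

```python
def prediction_success(actual, predicted, focus):
    """Tally predictions for one focus class against all other classes."""
    pairs = list(zip(actual, predicted))
    hit = sum(1 for a, p in pairs if a == focus and p == focus)
    missed = sum(1 for a, p in pairs if a == focus and p != focus)
    false_alarm = sum(1 for a, p in pairs if a != focus and p == focus)
    other_ok = sum(1 for a, p in pairs if a != focus and p != focus)
    n_focus = hit + missed
    return {
        "focus_cases": n_focus,
        "focus_correct": hit,
        "focus_pct_correct": 100.0 * hit / n_focus if n_focus else None,
        "false_alarms": false_alarm,
        "other_correct": other_ok,
    }


# Hypothetical test-sample labels with "bad" as the class of interest.
actual = ["bad", "good", "bad", "good", "bad"]
predicted = ["bad", "good", "good", "good", "bad"]
result = prediction_success(actual, predicted, focus="bad")
print(result)
```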

Additional Tree Details for Viewing and Printing -- An increased level of tree detail provides more information, allowing greater control when displaying and printing your trees.

Improved Model Setup -- Quickly specify the target, predictor, categorical, weighting, and auxiliary variables in a single setup tab.

Easy Data Access -- The manufacturer has retained the direct link to DBMS/Copy, which provides access to over 90 different file formats, including more than ten (10) new formats. For example, you can import and export data in the formats of statistical analysis packages (e.g., SAS, SPSS) and spreadsheets (e.g., Excel, Lotus). Native support for text datasets has also been added.

Printing -- Automatic page fitting allows the user to print trees on two (2) pages when possible. Upgraded support for large-format and plot printing allows the user to produce presentation-quality prints of large trees on a single piece of paper.

Note: See CART 5.0 (G6G Abstract Number 20053) for additional features.

System Requirements

CART requirements.

Manufacturer

  • Salford Systems
  • 9685 Via Excelencia
  • Suite 208
  • San Diego, CA 92126
  • USA
  • Telephone: (619) 543-8880
  • Fax: (619) 543-8888

Manufacturer Web Site Salford Systems CART

Price Contact manufacturer.

G6G Abstract Number 20053A1

G6G Manufacturer Number 102305