Cubist

Category Intelligent Software>Data Mining Systems/Tools

Abstract Cubist is a data mining tool that produces rule-based models for numerical prediction.

Each rule specifies the conditions under which an associated multivariate linear sub-model should be used.

Cubist models often yield more accurate predictions than simple linear models without giving up the advantages of interpretability.

Cubist builds rule-based predictive models that output numeric values, complementing See5/C5.0 (see G6G Abstract Number 20344), which predicts categories.

For instance, See5/C5.0 might classify the percentage yield from some process as "high", "medium", or "low", whereas Cubist would output a number such as "7.3".

Cubist is an advanced tool for generating rule-based models that balance the need for accurate prediction against the requirements of intelligibility.

Cubist models generally give better results than those produced by simple techniques such as multivariate linear regression, while also being easier to understand than neural networks.
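As an illustration of how such a model turns a case into a number, the following Python sketch represents each rule as a condition paired with a linear sub-model. It is not RuleQuest's code or model file format: the Rule class, the attribute names (temperature, pressure), the coefficients, and the choice to average the outputs of all matching rules are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Case = Dict[str, float]  # attribute name -> numeric value


@dataclass
class Rule:
    """Hypothetical rule: a condition plus a linear sub-model."""
    condition: Callable[[Case], bool]
    intercept: float
    coefficients: Dict[str, float]

    def predict(self, case: Case) -> float:
        # Linear sub-model: intercept + sum of coefficient * attribute value.
        return self.intercept + sum(
            coef * case[name] for name, coef in self.coefficients.items()
        )


def predict(rules: List[Rule], case: Case, default: float) -> float:
    """Average the sub-model outputs of every rule whose condition matches.

    Averaging over matching rules and falling back to `default` when no
    rule matches are assumptions of this sketch.
    """
    outputs = [rule.predict(case) for rule in rules if rule.condition(case)]
    if not outputs:
        return default
    return sum(outputs) / len(outputs)


# Illustrative rules for a made-up "percentage yield" process.
rules = [
    Rule(lambda c: c["temperature"] <= 50, 2.0, {"temperature": 0.1}),
    Rule(lambda c: c["temperature"] > 50, 9.0,
         {"temperature": -0.05, "pressure": 0.5}),
]

cool_case = {"temperature": 40.0, "pressure": 1.0}
hot_case = {"temperature": 60.0, "pressure": 2.0}
print(predict(rules, cool_case, default=0.0))
print(predict(rules, hot_case, default=0.0))
```

Because each prediction comes from a small set of readable conditions and one linear formula, a reader can trace exactly why a given case received its value.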

Cubist features/capabilities are:

1) Cubist has been designed to analyze substantial databases containing hundreds of thousands of records and tens to thousands of numeric or nominal fields.

If you have used neural networks or similar modeling tools, you'll be surprised by Cubist's speed! (Cubist also takes advantage of processors with quad cores, up to four CPUs, or Intel Hyper-Threading to speed up model-building.)

2) To maximize interpretability, Cubist models are expressed as collections of rules, where each rule has an associated multivariate linear model.

Whenever a case matches a rule's conditions, the associated linear model is used to calculate the predicted value.

3) Cubist is available for Windows 2000/XP/Vista and Linux.

4) Cubist is easy to use and does not presume advanced knowledge of statistics or machine learning (although these don't hurt, either!).

5) RuleQuest provides C source code so that models constructed by Cubist can be embedded in your organization's own systems.

Like See5/C5.0, Cubist pays particular attention to the issue of 'comprehensibility'.

RuleQuest believes that a data mining system should find patterns that not only facilitate accurate predictions, but also provide insight.

Additional Cubist features/capabilities are:

1) Cubist incorporates a novel method for generating 'composite instance-based' (nearest neighbor) and rule-based models.

These often improve predictive accuracy, although at the cost of being more difficult to understand than rule-based models alone.

Composite models can be selected by an option, or you can let Cubist decide whether they are appropriate for your application.

2) Cubist can also construct 'committees of models'. The first model is found as usual; the second model attempts to compensate for the errors of the first, the third tries to compensate for the errors of the second, and so on.

A committee prediction is obtained by averaging the predictions made by each model in the committee.

Committee models are usually both more accurate than single models and faster to evaluate than composite models (since finding nearest neighbors for large training sets is slow).

3) By default, all cases from which a model is constructed have equal importance.

This is not appropriate in some applications -- for example, cases describing high-value loans might be more critical than those of lower value.

Cubist allows an optional 'case weight' attribute to indicate the relative importance of each case.

4) Most models represent a trade-off between simplicity and accuracy -- simpler models are easier to understand but may under-fit the data. The balance can be shifted towards simplicity by setting a 'ceiling on the number of rules' that may appear in a model.

5) Cubist has built-in support for both 'cross-validation' and 'sampling' from large datasets.
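The committee scheme described in item 2 above can be sketched in Python as follows. This is a minimal illustration, not Cubist's algorithm: the one-split learner fit_stump stands in for a full rule-based model, and the target adjustment (2 * actual - previous prediction) is an assumed way of making each member compensate for the errors of the one before it. Only the final step, averaging the members' predictions, is taken directly from the description above.

```python
from statistics import mean
from typing import Callable, List, Sequence

Model = Callable[[float], float]


def fit_stump(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Hypothetical weak learner: one threshold, mean target on each side."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds (largest x excluded)
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values identical: predict the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def fit_committee(xs: Sequence[float], ys: Sequence[float], size: int) -> List[Model]:
    """Fit `size` members; each later member trains on adjusted targets."""
    members: List[Model] = []
    targets = list(ys)
    for _ in range(size):
        model = fit_stump(xs, targets)
        members.append(model)
        # Assumed adjustment: if the latest member under-predicted a case,
        # raise that case's target for the next member, and vice versa.
        targets = [2 * y - model(x) for x, y in zip(xs, ys)]
    return members


def committee_predict(members: List[Model], x: float) -> float:
    """Committee prediction: the average of the members' predictions."""
    return mean(m(x) for m in members)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 3.0, 5.0]
members = fit_committee(xs, ys, size=2)
for x, y in zip(xs, ys):
    print(x, y, members[0](x), committee_predict(members, x))
```

On this toy data the single first member misses the last two cases by 1.0 each, while the two-member average lands closer to the actual values overall, which is the effect the committee option aims for.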

New in Release 2.05 of Cubist --

1) More accurate models -

Cubist models should now have somewhat lower average absolute error on unseen cases.

2) Improved multi-threading -

Another bottleneck in Cubist's model-building algorithm has been parallelized and will now run on multiple CPUs or cores.

Assignment of some tasks to processors has also been adjusted to balance loads better and so reduce the time taken to process larger applications.

3) Faster composite models -

When Cubist constructs a composite model for applications with hundreds of thousands of training cases, a significant proportion of the total run time is taken up by calculating the accuracy of the model on these same cases.

Instead, Release 2.05 uses a large sample of the training cases to estimate this accuracy.
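The idea of estimating accuracy from a sample rather than from every training case can be sketched in Python as below. The function name estimated_mae, its parameters, and the use of simple random sampling are assumptions for illustration; the source does not say how Cubist draws its sample or how large it is.

```python
import random
from typing import Callable, Sequence, Tuple

Case = Tuple[float, float]  # (input value, actual target value)


def estimated_mae(
    model: Callable[[float], float],
    cases: Sequence[Case],
    sample_size: int,
    seed: int = 0,
) -> float:
    """Estimate average absolute error from a random sample of the cases."""
    if not cases:
        raise ValueError("at least one case is required")
    if sample_size < len(cases):
        # Evaluate only a random subset instead of every training case.
        subset = random.Random(seed).sample(list(cases), sample_size)
    else:
        subset = list(cases)
    return sum(abs(model(x) - y) for x, y in subset) / len(subset)


# Toy model that is always off by exactly 1, so any sample gives the same error.
toy_model = lambda x: 2 * x
toy_cases = [(x, 2 * x + 1) for x in range(1000)]
print(estimated_mae(toy_model, toy_cases, sample_size=100))
```

The cost of the estimate grows with the sample size rather than with the full number of training cases, at the price of some sampling error in the reported figure.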

System Requirements

Cubist is available for Windows 2000/XP/Vista and Linux.

Manufacturer

Manufacturer Web Site Cubist

Price Contact manufacturer.

G6G Abstract Number 20345

G6G Manufacturer Number 102311