I made a change to the Blogger configuration to make future blogging easier. It is possible that older entries are no longer correctly formatted.

Friday, 20 June 2008

Combining Data Mining Models

Here is a little summary of the possible ways of combining multiple models in data mining (my main resource is the "Data Mining" book by Ian H. Witten and Eibe Frank):
  • Bagging
  • Bagging with costs
  • Randomization
  • Boosting
  • Additive regression
  • Additive logistic regression
  • Option trees
  • Logistic model trees
  • Stacking
  • Error correcting output codes

Bagging

The principle of bagging is to build a number of models from a training set, each trained on its own bootstrap sample, and then, for a given instance, to predict the class returned most frequently across these models. In other words, it is important that the different models return the same set of possible class outputs.
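A minimal pure-Python sketch of this idea (no Weka involved): the `zero_r` learner and the toy data are invented stand-ins; any real learner that maps a training sample to a prediction function could take its place.

```python
import random
from collections import Counter

def bootstrap_sample(training_set, rng):
    """Draw a sample of the same size as the training set, with replacement."""
    return [rng.choice(training_set) for _ in training_set]

def bag_models(training_set, learn, n_models, seed=0):
    """Build n_models models, each trained on its own bootstrap sample."""
    rng = random.Random(seed)
    return [learn(bootstrap_sample(training_set, rng)) for _ in range(n_models)]

def bagging_predict(models, instance):
    """Predict the class returned most frequently across the models."""
    votes = Counter(model(instance) for model in models)
    return votes.most_common(1)[0][0]

# Trivial stand-in learner (0-R style): always predicts the majority class
# of its training sample, ignoring the instance.
def zero_r(sample):
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda instance, m=majority: m

# Toy data: instances are just integers, labels are "yes"/"no".
data = [(i, "yes") for i in range(8)] + [(i, "no") for i in range(8, 10)]
models = bag_models(data, zero_r, n_models=11)
print(bagging_predict(models, None))
```

Because every model emits labels from the same set ("yes"/"no"), the majority vote in `bagging_predict` is well defined, which is exactly the constraint noted above.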

Bagging with costs

This extension of the bagging approach uses a cost model. It is particularly useful when the models used in the bagging produce probability estimates indicating how likely each prediction is to be correct.
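One way to sketch this in Python, under the assumption that each bagged model returns class probabilities: average the probabilities, then pick the class that minimises expected cost. The two toy models and the cost matrix below are invented for illustration.

```python
def min_expected_cost_class(prob_models, instance, cost):
    """Average the class-probability estimates of the bagged models, then
    choose the class whose prediction minimises the expected cost.
    cost[pred][true] is the cost of predicting `pred` when `true` holds."""
    classes = list(cost.keys())
    avg = {c: sum(m(instance)[c] for m in prob_models) / len(prob_models)
           for c in classes}

    def expected_cost(pred):
        return sum(avg[true] * cost[pred][true] for true in classes)

    return min(classes, key=expected_cost)

# Toy models leaning towards "no"; hypothetical cost matrix where missing a
# "yes" (predicting "no" when the truth is "yes") is five times as costly.
m1 = lambda x: {"yes": 0.3, "no": 0.7}
m2 = lambda x: {"yes": 0.5, "no": 0.5}
cost = {"yes": {"yes": 0, "no": 1},
        "no":  {"yes": 5, "no": 0}}
print(min_expected_cost_class([m1, m2], None, cost))  # "yes"
```

Note how the cost model overrides the plain vote here: "no" is the more probable class (0.6 on average), but predicting "yes" has the lower expected cost (0.6 versus 2.0).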

Randomization

The idea here is to introduce some kind of randomization into the model creation process in order to obtain different models. The class predicted by a vote over these models can then be chosen as the prediction; how much this helps depends on the stability of the learning process.
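A small sketch of the mechanism: the same (invented) learner is run with different random seeds, so the randomized step inside it can produce different models, whose votes are then combined. Real randomized learners typically randomize attribute selection or tie-breaking; here a random subsample plays that role.

```python
import random
from collections import Counter

def randomized_learner(training_set, seed):
    """Stand-in for a learner with a randomized step (here: learning the
    majority class of a random subsample); different seeds can therefore
    produce different models."""
    rng = random.Random(seed)
    sample = rng.sample(training_set, 5)
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda instance, m=majority: m

# Toy data: 7 "yes" and 3 "no" instances.
data = [(i, "yes") for i in range(7)] + [(i, "no") for i in range(7, 10)]
models = [randomized_learner(data, seed) for seed in range(21)]
votes = Counter(model(None) for model in models)
print(votes.most_common(1)[0][0])
```

An individual model can be wrong (an unlucky subsample), but the vote over 21 differently-seeded models is far more stable, which is the point of the combination.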

Boosting

Similarly to the bagging approach, boosting creates models as a kind of cascade: each model is built with the purpose of better classifying the instances which were not suitably classified by the previous models. This is a type of forward stagewise additive modelling.
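The canonical example is AdaBoost. Below is a compact pure-Python sketch of it: instance weights are increased on misclassified instances so that each new weak learner (a decision stump on 1-D data, labels in {-1, +1}) concentrates on them. The data and the stump learner are toy constructions.

```python
import math

def train_stump(points, labels, weights):
    """Weak learner: the best threshold/polarity decision stump on 1-D data,
    measured by weighted error."""
    best = None
    for thr in sorted(set(points)):
        for polarity in (1, -1):
            preds = [polarity if x >= thr else -polarity for x in points]
            err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    err, thr, polarity = best
    return err, (lambda x, t=thr, s=polarity: s if x >= t else -s)

def adaboost(points, labels, rounds=3):
    n = len(points)
    weights = [1.0 / n] * n
    ensemble = []                        # list of (alpha, stump)
    for _ in range(rounds):
        err, stump = train_stump(points, labels, weights)
        err = max(err, 1e-10)            # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified instances, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump(x))
                   for w, x, y in zip(weights, points, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]

    def predict(x):
        return 1 if sum(a * s(x) for a, s in ensemble) >= 0 else -1
    return predict

# Toy data: label +1 for x >= 5, -1 otherwise (learnable by a single stump).
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [-1, -1, -1, -1, 1, 1, 1, 1]
clf = adaboost(xs, ys)
print([clf(x) for x in xs])  # [-1, -1, -1, -1, 1, 1, 1, 1]
```

The weighted combination of stumps (each weighted by its `alpha`) is what makes this forward stagewise and additive: each round adds one term to the final sum.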

Additive regression

Additive regression is also a kind of forward stagewise additive modelling, suitable for numeric prediction with regression. Here again the principle is to use a series of regression models, each of which tries to correct the errors (the residuals) left by the previous ones.
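This residual-fitting loop can be sketched in a few lines of Python. The weak learner here is an invented one-split regression stump on 1-D data; the final prediction is simply the sum of all the stages.

```python
def regression_stump(xs, rs):
    """Weak learner: a piecewise-constant fit with one split, minimising
    squared error on the current residuals rs."""
    best = None
    for thr in sorted(set(xs)):
        left = [r for x, r in zip(xs, rs) if x < thr]
        right = [r for x, r in zip(xs, rs) if x >= thr]
        lmean = sum(left) / len(left) if left else 0.0
        rmean = sum(right) / len(right) if right else 0.0
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda x: lmean if x < thr else rmean

def additive_regression(xs, ys, stages=3):
    models = []
    residuals = list(ys)
    for _ in range(stages):
        m = regression_stump(xs, residuals)
        models.append(m)
        # Each new stage fits what the previous stages got wrong.
        residuals = [r - m(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(m(x) for m in models)

# Toy numeric data.
xs = [1, 2, 3, 4]
ys = [10.0, 10.0, 20.0, 30.0]
f = additive_regression(xs, ys, stages=3)
```

With each stage the residuals shrink, so the summed model fits the training targets progressively better.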

Additive logistic regression

This type of regression is an adaptation of the previous combination approach to logistic regression.

Option trees

I still have to describe this, but the concept is quite simple.

Logistic model trees

I still have to describe this, but the concept is quite simple.

Stacking

The purpose of stacking is to combine models of different types, which might not even use the same learning algorithm. To achieve this, a meta-learner is introduced: a number of base models are first built from the data, and then a meta-learner, i.e. a model which makes its decision from the outputs of the other learners, is trained to combine the predictions of all the models from the first phase.
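A simplified Python sketch of the two-level idea: the tuple of base-model predictions becomes the input of the meta-learner. The base models and the lookup-table meta-learner below are invented for illustration; a proper implementation would also train the meta-learner on held-out predictions rather than on the same training data.

```python
from collections import Counter, defaultdict

def train_stacking(base_models, level0_data):
    """Build the level-1 dataset (base-model prediction tuples -> true label)
    and train a trivial meta-learner on it."""
    table = defaultdict(Counter)
    for instance, label in level0_data:
        key = tuple(m(instance) for m in base_models)
        table[key][label] += 1
    # Meta-learner: a lookup mapping each prediction tuple to the label it
    # most often co-occurred with.
    meta = {key: counts.most_common(1)[0][0] for key, counts in table.items()}

    def predict(instance):
        key = tuple(m(instance) for m in base_models)
        if key in meta:
            return meta[key]
        return Counter(key).most_common(1)[0][0]  # fallback: plain vote
    return predict

# Toy base models over integer instances (hypothetical rules).
m1 = lambda x: "yes" if x > 3 else "no"
m2 = lambda x: "yes" if x % 2 == 0 else "no"
data = [(1, "no"), (2, "no"), (4, "yes"), (5, "yes"), (6, "yes")]
clf = train_stacking([m1, m2], data)
print(clf(8))  # "yes"
```

The meta-learner can learn, for example, that one base model is more trustworthy than another in certain regions, something a plain vote cannot do.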

Error correcting output codes

I still have to describe this, but the concept is quite simple.

Of course, all these mechanisms have been implemented in Weka.