I made a change to the blogger configuration to make future blogging easier. Older entries may therefore not be formatted correctly.

Sunday 29 June 2008

Git-bisect instructions

After reading a bit of a long thread on testing bugs in the Linux kernel, I found a small HOWTO for running bisections of the Linux kernel.

I reproduce it here to make sure it is easier to find:

# install git

# clone Linus' tree:
git clone \
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

# start bisecting:
cd linux-2.6
git bisect start
git bisect bad v2.6.21
git bisect good v2.6.20
cp /path/to/.config .
# start a round
make oldconfig
make
# install kernel, check whether it's good or bad, then:
git bisect [bad|good]
# start next round
After about 10-15 reboots you'll have found the guilty commit
("... is first bad commit").
More information on git bisecting:
man git-bisect

But this is not all... The bisection can also be used to automatically check whether a given test works or not. Suppose you have a script to run a test, called test.sh. Then you could run the bisection with:

$ git bisect start v1.3 v1.1 -- # v1.3 is bad, v1.1 is good
$ git bisect run ~/test.sh
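
git bisect run drives the search using the exit status of the script: 0 marks the revision as good, 125 means "cannot be tested, skip it", and any other status from 1 to 127 marks it as bad. A minimal sketch of such a test.sh, assuming the bug can be reproduced without booting the freshly built kernel (the check_feature command is a hypothetical placeholder):

#!/bin/sh
# Build the revision that git bisect has currently checked out;
# exit with 125 so that unbuildable revisions are skipped, not blamed.
make oldconfig && make || exit 125

# Run the actual test; the exit status is reported back to git bisect
# (0 = good, anything from 1 to 127 except 125 = bad).
./check_feature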

A good explanation with more information on this can be found on this page.

Moreover, it is also possible to restrict the search to commits which changed a given part of the repository. For example, you may know that the bug is located in certain subdirectories. Then you can specify these directories:

$ git bisect start -- mydirectory1 mydirectory2/directory3

Friday 27 June 2008

Ketchup - Linux Kernel Automatic Patcher

I discovered an interesting script: ketchup, which can be used to automatically patch the kernel from one version to the next.
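
From what I have seen, a session looks roughly like this (a sketch from memory; the version numbers are only illustrative):

$ cd linux-2.6.25
$ ketchup 2.6.26   # downloads and applies the patches needed to reach 2.6.26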

Thursday 26 June 2008

Learning Method

How do I learn about a topic? Well, there are different strategies. One thing I often do is to look for excellent books on the topic, go through their contents and look for things I do not already know. Take, for example, the Database System Implementation book by Hector Garcia-Molina, Jeffrey D. Ullman and Jennifer Widom, Prentice-Hall International, London, 2000, ISBN 0-13-040264-8. I look at the sections in the table of contents. For example:
  • 2.4 Improving the access of secondary storage
    • 2.4.1 I/O model of computation
    • 2.4.2 Sorting data on secondary storage
    • 2.4.3 Merge Sort
    • 2.4.4 Two-phase, multiway merge sort
    • 2.4.5 Extension of multiway merging for larger relations
I already know something about merge sort, but I don't know much about the other topics, so I take a look at each of them. If something seems interesting and important, I try to sum up the information learnt somewhere, for example in this blog.

Friday 20 June 2008

Combining Data Mining Models

Here is a little summary of the possible ways of combining multiple models for Data Mining (I use as first resource the "Data Mining" book by Ian H. Witten and Eibe Frank):
  • Bagging
  • Bagging with costs
  • Randomization
  • Boosting
  • Additive regression
  • Additive logistic regression
  • Option trees
  • Logistic model trees
  • Stacking
  • Error correcting output codes

Bagging

The principle of bagging is to create a number of models from a training set and, for a given instance, to use the class returned most frequently by these models. In other words, it is important that the different models return the same set of possible output classes.

Bagging with costs

This extension of the bagging approach uses a cost model. It is particularly useful when the models used in the bagging return probabilities telling how likely each prediction is to be correct.

Randomization

The idea here is to introduce some kind of randomization into the model creation in order to obtain different models. Depending on the stability of the learning process, a certain class can then be chosen as the prediction.

Boosting

Similarly to the bagging approach, boosting creates models as a kind of cascade: each model is built with the purpose of better classifying the instances which have not been suitably classified by the previous models. This is a type of forward stagewise additive modelling.

Additive regression

Additive regression is also a kind of forward stagewise additive modelling, suitable for numeric prediction with regression. Here again the principle is to use a series of regressions, each one trying to predict better the elements which were predicted incorrectly before.

Additive logistic regression

This type of regression is an adaptation of the previous combination approach to logistic regression.

Option trees

I still have to describe this but the concept is quite simple

Logistic model trees

I still have to describe this but the concept is quite simple

Stacking

The purpose of stacking is to combine different types of models. In order to achieve this, a meta learner is introduced: a number of models are built for the data, and then a meta learner, i.e. a model which makes its decision based on the outputs of other learners, is created in order to combine the predictions of all the models from the first phase.

Error correcting output codes

I still have to describe this but the concept is quite simple

Of course, all these mechanisms have been implemented in Weka.
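
For instance, bagging a decision-tree learner with the Weka Java API might look like the following sketch (the data file weather.arff is just an assumed example, and the class and method names are those of the Weka 3.x API, so they should be checked against the installed version):

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaggingExample {
    public static void main(String[] args) throws Exception {
        // Load a dataset; the last attribute is taken as the class.
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Bag 10 J48 decision trees, each trained on a bootstrap sample.
        Bagging bagger = new Bagging();
        bagger.setClassifier(new J48());
        bagger.setNumIterations(10);

        // Estimate the performance with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(bagger, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}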

Thursday 19 June 2008

FishEye - Visualisation of Subversion

There seems to be a nice tool for the visualisation of Subversion repositories: FishEye (see this page). Unfortunately, the tool is not open source, though an open source project may ask for an open source community license. From the web site, it is not entirely clear to me which SCM (source code management) systems are supported; FishEye seems to support at least CVS, SVN and Perforce.

Tuesday 17 June 2008

Transaction Models

As a reminder, I will list here the possible transaction models. The aim is not to be exhaustive, but merely to have this entry as a quick reference on the possible transaction models. First, let's list the different models:
  • Flat transactions (supported by J2EE)
  • Nested Transactions (supported by J2EE)
  • Chained Transactions
  • Sagas

Flat transactions

Flat transactions consist of a series of operations which are performed as one unit of work and which, as a whole, either succeed or fail (a small J2EE sketch follows the list below). The reasons causing a transaction to abort may be, for instance:
  • some invalid parameters have been given to the components
  • some system state which should not have changed was violated
  • hardware or software failure
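
As an illustration of the flat model, here is a minimal sketch using the J2EE UserTransaction API; the Account type and its two operations are hypothetical:

import javax.naming.InitialContext;
import javax.transaction.UserTransaction;

interface Account {
    void withdraw(long amount);
    void deposit(long amount);
}

public class TransferService {
    public void transfer(Account from, Account to, long amount) throws Exception {
        UserTransaction tx = (UserTransaction)
                new InitialContext().lookup("java:comp/UserTransaction");
        tx.begin();
        try {
            from.withdraw(amount); // both operations are part of
            to.deposit(amount);    // the same unit of work...
            tx.commit();
        } catch (Exception e) {
            tx.rollback();         // ...and are undone together on failure
            throw e;
        }
    }
}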

Nested transactions

Contrary to flat transactions, nested transactions allow units of work to consist of other atomic units of work. Some of the embedded units of work may roll back without forcing the entire transaction to roll back, so that failed subtransactions can be performed again. Note that subtransactions of transactions may themselves be nested transactions.

Chained transactions

In the chained transaction model, a new transaction is started automatically for a program when the previous one has been committed or aborted. This is not implemented by J2EE.

Long Running Transactions - Sagas

Sagas, or long-running transactions, are transactions which may take a long time to finish, possibly lasting for days, either because they wait for an external event or because they wait for the decision of one of the actors involved in the transaction. This type of transaction usually occurs in a web service context (see here). This is not implemented by J2EE.

Wednesday 11 June 2008

Kernel KnowHow: Per CPU Variables

I just read the section on per-CPU variables in the Linux Device Drivers book from O'Reilly (TODO link).

The use of per-CPU variables helps improve performance by limiting cache sharing between CPUs. The idea is that each CPU has its own private instance of a given variable.

A variable can be defined statically using the following macro: DEFINE_PER_CPU(type, name);

It is important to note that some form of protection is still needed for per-CPU variables, because:

  • the scheduler could move the process to another CPU
  • the process could be preempted in the middle of a modification of the variable

Therefore, each instance should be accessed through a simple mechanism which protects the variable during the changes. This can be done by using the get_cpu_var() and put_cpu_var() macros, which disable and re-enable preemption, for example:

get_cpu_var(variabletoupdate)++;  /* preemption is disabled here */
put_cpu_var(variabletoupdate);    /* preemption is enabled again */

It is also possible to access the values of other processors using per_cpu(variable, cpu_id), but in that case a locking mechanism must be implemented by the caller.

Memory allocation for dynamically allocated per-CPU variables is performed using:

void *alloc_percpu(type);
void *__alloc_percpu(size_t size, size_t align);

Deallocation is achieved using: free_percpu().

Accessing a dynamically allocated per-CPU variable is done using per_cpu_ptr(void *per_cpu_var, int cpu_id), which returns a pointer to the variable's content for the given CPU. For example:
int cpu;
cpu = get_cpu();                     /* disable preemption, get current CPU id */
ptr = per_cpu_ptr(per_cpu_var, cpu); /* pointer to this CPU's instance */
/* ... work with ptr ... */
put_cpu();                           /* enable preemption again */

Note the use of get_cpu() and put_cpu() to disable and re-enable preemption around the access.
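
To put the pieces together, here is a minimal sketch of a statically defined per-CPU counter and its use; the packet counting scenario is a made-up example, not taken from the book:

#include <linux/percpu.h>
#include <linux/smp.h>

/* One private counter per CPU. */
static DEFINE_PER_CPU(long, packet_count);

/* Update the instance of the current CPU; get_cpu_var() disables
 * preemption, so the process cannot be moved to another CPU or be
 * preempted in the middle of the update. */
static void count_packet(void)
{
	get_cpu_var(packet_count)++;
	put_cpu_var(packet_count);
}

/* Sum the instances of all CPUs with per_cpu(); since the other CPUs
 * may update their counters concurrently, the result is only an
 * approximation unless extra locking is added. */
static long total_packets(void)
{
	long total = 0;
	int cpu;

	for_each_online_cpu(cpu)
		total += per_cpu(packet_count, cpu);
	return total;
}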

Tuesday 3 June 2008

Kernel Index page from LWN.net

As I was looking for some supplementary information on the kernel, I found this page, which provides a categorized index of the kernel articles published on LWN.net.