MrsCake Tutorial

Installing

To install mrscake, clone the GitHub repository and then run the usual configure / make / make install sequence:
    git clone https://github.com/matthiaskramm/mrscake.git
    cd mrscake
    ./configure
    make 
    make install

Creating Datasets

Suppose you've already acquired some training data. For example, you want to predict whether a given git commit breaks a build. (This is a real example from our company: we found we can predict 70% of possible build breakages just by looking at git commits.)

This is a typical git commit:

commit f1568a8318a64e68944b805f34cc6e08f1ef1c1b
Author: Karl Weizenfeld <karl@acme.com>
Date: Fri Apr 13 16:23:20 2012 -0400
So much noise in funnel.yml
config/funnel.yml | 56 ++++++++++----------
1 files changed, 28 insertions(+), 28 deletions(-)

It has a number of easily extracted features: the author, the commit message, the hour of the day, and the number of added and removed lines.
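
As a rough sketch, features like these could be pulled out of the commit text with a few regular expressions. The `extract_features` helper below is hypothetical and not part of mrscake; it assumes the commit text looks like the `git log --stat` output shown above and uses the author's lowercased first name as the author feature.

```ruby
# Hypothetical helper (not part of mrscake): turns the text of one
# `git log --stat` entry into a feature hash.
def extract_features(commit_text)
  lines = commit_text.lines.map(&:strip).reject(&:empty?)

  # "Author: Karl Weizenfeld <karl@acme.com>" -> :karl
  author_name = commit_text[/^\s*Author:\s*(\S+)/, 1]
  author = author_name && author_name.downcase.to_sym

  # "Date: Fri Apr 13 16:23:20 2012 -0400" -> 16
  hour = commit_text[/^\s*Date:\s*\w+ \w+ \d+ (\d+):/, 1].to_i

  # "1 files changed, 28 insertions(+), 28 deletions(-)"
  # A missing count yields nil, and nil.to_i is 0.
  added   = commit_text[/(\d+) insertions?\(\+\)/, 1].to_i
  removed = commit_text[/(\d+) deletions?\(-\)/, 1].to_i

  # Assumption: the first non-empty line after "Date:" is the commit message.
  date_index = lines.index { |line| line.start_with?("Date:") }
  message = date_index ? lines[date_index + 1] : nil

  { :author => author, :added_lines => added, :removed_lines => removed,
    :message => message, :hour => hour }
end

commit = <<~COMMIT
  commit f1568a8318a64e68944b805f34cc6e08f1ef1c1b
  Author: Karl Weizenfeld <karl@acme.com>
  Date: Fri Apr 13 16:23:20 2012 -0400
  So much noise in funnel.yml
  config/funnel.yml | 56 ++++++++++----------
  1 files changed, 28 insertions(+), 28 deletions(-)
COMMIT

p extract_features(commit)
```

The resulting hash has the same keys as the feature hashes used in the rest of this tutorial.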

Suppose we also have a labelled training set from past software releases that specifies, for every commit, whether that commit broke the build or not.

To feed all this data into mrscake, we create a dataset:

(This example is in Ruby; the Python API looks very similar.)

require 'mrscake'
d = MrsCake::DataSet.new()

d.add({:author=>:stefan,:added_lines=>5, :removed_lines=>5,  
       :message=>"misprint oopsie",         :hour=>18}, :broken)
d.add({:author=>:karl,  :added_lines=>80,:removed_lines=>14, 
       :message=>"fixed log output",        :hour=>11}, :not_broken)    
d.add({:author=>:peter, :added_lines=>2, :removed_lines=>1,  
       :message=>"more strace goodness",    :hour=>19}, :not_broken)    
d.add({:author=>:karl,  :added_lines=>2, :removed_lines=>2,  
       :message=>"Fix for bug 3718",        :hour=>14}, :broken)    
d.add({:author=>:bran,  :added_lines=>36,:removed_lines=>0,  
       :message=>"migration for user split",:hour=>10}, :not_broken)    
d.add({:author=>:peter, :added_lines=>3, :removed_lines=>7,  
       :message=>"simplify group handling", :hour=>17}, :not_broken)    

As you can see, DataSet::add takes two parameters: a hash (or array) of features, and the training label (i.e., the desired output for these features).

Notice that the desired output is always a category, not a number: mrscake can only categorize, not make numerical predictions (it's a classification engine, not a regression suite).
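
If the quantity you care about is numeric, one workaround is to bucket it into categories before adding it to the dataset. The sketch below is only an illustration: the `duration_label` helper, the idea of predicting build duration, and the thresholds are all assumptions made up for this example, not part of mrscake.

```ruby
# Hypothetical helper: maps a numeric target (here, a build time in seconds)
# onto a category symbol that can serve as a training label.
# The thresholds are arbitrary and only serve the illustration.
def duration_label(seconds)
  if seconds < 60
    :fast
  elsif seconds < 600
    :medium
  else
    :slow
  end
end

p duration_label(42)
p duration_label(1800)
```

A label produced this way can then be passed as the second parameter of DataSet::add, in place of :broken / :not_broken.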

Training Models

Once you've fed your data into mrscake in this way, you can train a prediction model. The easiest way to do this is to call DataSet::train and let mrscake pick the right model for you:

model = d.train()

train runs model selection over a large number of models, including neural networks, decision trees, random forests, support vector machines and many others. It then picks the model that explains your data best (and in the most condensed way).
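
To give a feel for what "best and most condensed" can mean, here is a generic toy illustration of model selection. It is not mrscake's actual algorithm: the `Candidate` struct, the `select_model` function and the size penalty are assumptions invented for this sketch, and for brevity it scores candidates on the training examples themselves rather than on held-out data.

```ruby
# Generic illustration of model selection; NOT how mrscake works internally.
# Each candidate is a stand-in with a name, a rough size (number of rules),
# and a lambda that maps a feature hash to a label.
Candidate = Struct.new(:name, :size, :predict)

# Picks the candidate with the lowest score, where
# score = number of misclassified examples + size_penalty * model size.
def select_model(candidates, examples, size_penalty = 0.1)
  candidates.min_by do |candidate|
    errors = examples.count do |features, label|
      candidate.predict.call(features) != label
    end
    errors + size_penalty * candidate.size
  end
end

# Authors and labels from the dataset above.
examples = [
  [{ :author => :stefan }, :broken],
  [{ :author => :karl },   :not_broken],
  [{ :author => :peter },  :not_broken],
  [{ :author => :karl },   :broken],
  [{ :author => :bran },   :not_broken],
  [{ :author => :peter },  :not_broken],
]

candidates = [
  Candidate.new("always not_broken", 1, ->(features) { :not_broken }),
  Candidate.new("author rule", 2, lambda { |features|
    [:stefan, :karl].include?(features[:author]) ? :broken : :not_broken
  }),
]

puts select_model(candidates, examples).name
```

The penalty term is what rewards condensed models: between two candidates with the same error count, the smaller one wins.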

The resulting model can now make predictions:

model.predict({:author=>:karl, :added_lines=>508, :removed_lines=>529, :message=>"Refactor default layout", :hour=>12})
=> :not_broken
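
To check how well predictions hold up on commits the model has not seen, you could set aside some labelled commits and compare them against the model's predictions. The `accuracy` helper below is a sketch, not part of mrscake. It only assumes that the model responds to predict with a feature hash, as shown above. `StubModel` and the held-out commits are made up so that the example runs without mrscake.

```ruby
# Fraction of examples for which the model's prediction matches the label.
# `model` can be anything that responds to predict(features).
def accuracy(model, examples)
  return 0.0 if examples.empty?
  correct = examples.count { |features, label| model.predict(features) == label }
  correct.to_f / examples.size
end

# Stand-in used only to make this sketch runnable without mrscake.
class StubModel
  def predict(features)
    [:stefan, :karl].include?(features[:author]) ? :broken : :not_broken
  end
end

# Made-up held-out commits, for illustration only.
held_out = [
  [{ :author => :karl,   :added_lines => 12, :removed_lines => 3, :hour => 15 }, :broken],
  [{ :author => :peter,  :added_lines => 4,  :removed_lines => 4, :hour => 9 },  :not_broken],
  [{ :author => :stefan, :added_lines => 1,  :removed_lines => 0, :hour => 20 }, :not_broken],
  [{ :author => :bran,   :added_lines => 60, :removed_lines => 2, :hour => 11 }, :not_broken],
]

puts accuracy(StubModel.new, held_out)
```

With a real model, you would pass the object returned by train in place of `StubModel.new`.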

Saving models

As training takes some time, you'll probably want to save the resulting model to disk:

model.save("broken_build_predict.dat")

Another way of making your model persistent is to have it generate a function and then plug that function into your codebase:

puts model.generate_code("ruby")

In this example, the output will be something like this:

def predict(message, author, hour, added_lines, removed_lines)
    if !![:stefan,:karl].index((author)) then
        return :broken
    else
        return :not_broken
    end
end

This is a decision tree, which is usually the best model if you have only a small amount of training data.

You can generate code in C, Python, Ruby and JavaScript.
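
The generated Ruby function takes positional arguments, while the rest of this tutorial passes features as hashes. A small adapter can bridge the two. This is a sketch that assumes the generated signature matches the example output above; `ARGUMENT_ORDER` and `predict_from_hash` are names invented here, not something mrscake generates.

```ruby
# Generated function, copied from the example output above.
def predict(message, author, hour, added_lines, removed_lines)
    if !![:stefan,:karl].index((author)) then
        return :broken
    else
        return :not_broken
    end
end

# Hypothetical adapter: the order must match the generated parameter list.
ARGUMENT_ORDER = [:message, :author, :hour, :added_lines, :removed_lines]

def predict_from_hash(features)
  predict(*features.values_at(*ARGUMENT_ORDER))
end

p predict_from_hash({ :author => :karl, :added_lines => 2, :removed_lines => 2,
                      :message => "Fix for bug 3718", :hour => 14 })
# => :broken
```

If the generated parameter list changes after retraining, `ARGUMENT_ORDER` has to be updated to match.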

Training specific models

If you already have a hunch about what kind of model might suit your data best, you can also have mrscake train a specific model for you:

model = d.train("neuronal network (gaussian) with 2 layers")

Use

p MrsCake::model_names

to see all the supported models.

Parallel processing

It's possible to parallelize model selection. To do this, run

mrscake-job-server

on a couple of machines and then add them to mrscake before you run train():

MrsCake::add_server("129.187.2.1")
MrsCake::add_server("129.187.2.2")
MrsCake::add_server("129.187.2.3")
MrsCake::add_server("129.187.2.4")