Installing
To install mrscake, clone the GitHub repository and then run the usual configure / make / make install incantation:
git clone git://github.com/matthiaskramm/mrscake.git
cd mrscake
./configure
make
make install
Creating Datasets
Suppose you've already acquired some training data. For example, say you want to predict whether a given git commit breaks a build. (This is a real example from our company: we found we can predict 70% of possible build breakages just by looking at git commits.)
This is a typical git commit:
commit f1568a8318a64e68944b805f34cc6e08f1ef1c1b
Author: Karl Weizenfeld <karl@acme.com>
Date:   Fri Apr 13 16:23:20 2012 -0400

    So much noise in funnel.yml

 config/funnel.yml |   56 ++++++++++---------
 1 files changed, 28 insertions(+), 28 deletions(-)
It has a number of easily extracted features:
- Time of day (hour, 0-23)
- Author name
- Number of added/removed lines
- Commit message
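As a rough illustration of how these features might be pulled out of the commit text shown above, here is a minimal sketch in plain Ruby. The helper name commit_features is hypothetical and not part of mrscake. It assumes the default git log output with a stat summary, and it uses the author's lowercased first name as the author feature.

```ruby
# Hypothetical helper (not part of mrscake): turn the text of one
# git log entry (with --stat summary) into a feature hash.
def commit_features(text)
  author  = text[/^Author:\s+(\S+)/, 1]                 # first name only
  hour    = text[/^Date:.*?(\d{1,2}):\d{2}:\d{2}/, 1]   # hour as written in the commit
  added   = text[/(\d+) insertions?\(\+\)/, 1]
  removed = text[/(\d+) deletions?\(-\)/, 1]
  # git indents the commit message by four spaces.
  message = text.lines.select { |l| l.start_with?("    ") }.map(&:strip).join(" ")

  {
    :author        => author.downcase.to_sym,
    :added_lines   => added.to_i,    # nil.to_i is 0 when the summary is missing
    :removed_lines => removed.to_i,
    :message       => message,
    :hour          => hour.to_i
  }
end

SAMPLE_COMMIT = [
  "commit f1568a8318a64e68944b805f34cc6e08f1ef1c1b",
  "Author: Karl Weizenfeld <karl@acme.com>",
  "Date:   Fri Apr 13 16:23:20 2012 -0400",
  "",
  "    So much noise in funnel.yml",
  "",
  " config/funnel.yml |   56 ++++++++++---------",
  " 1 files changed, 28 insertions(+), 28 deletions(-)"
].join("\n")

features = commit_features(SAMPLE_COMMIT)
p features
```

The resulting hash has the same keys as the feature hashes used in the rest of this walkthrough.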
Suppose we also have a labelled training set from past software releases that specifies, for every commit, whether that commit broke the build or not.
To feed all this data into mrscake, we create a dataset:
(This example is in Ruby; the Python API looks very similar.)
require 'mrscake'
d = MrsCake::DataSet.new()
d.add({:author=>:stefan, :added_lines=>5,  :removed_lines=>5,  :message=>"misprint oopsie",          :hour=>18}, :broken)
d.add({:author=>:karl,   :added_lines=>80, :removed_lines=>14, :message=>"fixed log output",         :hour=>11}, :not_broken)
d.add({:author=>:peter,  :added_lines=>2,  :removed_lines=>1,  :message=>"more strace goodness",     :hour=>19}, :not_broken)
d.add({:author=>:karl,   :added_lines=>2,  :removed_lines=>2,  :message=>"Fix for bug 3718",         :hour=>14}, :broken)
d.add({:author=>:bran,   :added_lines=>36, :removed_lines=>0,  :message=>"migration for user split", :hour=>10}, :not_broken)
d.add({:author=>:peter,  :added_lines=>3,  :removed_lines=>7,  :message=>"simplify group handling",  :hour=>17}, :not_broken)
As you can see, DataSet::add takes two parameters: a hash (or array) of features, and the training label (i.e., the desired output for these features).
Notice that the desired output is always a category, not a number: mrscake can only categorize, not make numerical predictions (it's a classification engine, not a regression suite).
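If the quantity you care about is numeric, one possible workaround is to bucket it into a handful of categories before using it as a label. The sketch below is plain Ruby and does not call mrscake. The helper name duration_label, the build-duration example and the thresholds are all made up for illustration.

```ruby
# Hypothetical helper (not part of mrscake): map a numeric value, here a
# build duration in minutes, onto a small set of category symbols.
# The thresholds are arbitrary and only serve as an example.
def duration_label(minutes)
  if minutes < 5
    :fast
  elsif minutes < 30
    :medium
  else
    :slow
  end
end

p [2, 12, 45].map { |m| duration_label(m) }
```

A symbol produced this way can then be passed as the second argument to DataSet::add, in place of :broken / :not_broken.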
Training Models
Once you've fed your data into mrscake in this way, you can train a prediction model. The easiest way to do this is to call DataSet::train and let mrscake pick the right model for you:
model = d.train()
train will run a model selection over a large number of models, including neural networks, decision trees, random forests, support vector machines and many others. It will then pick the model that explains your data best (and in the most condensed way).
The resulting model can now make predictions:
model.predict({:author=>:karl, :added_lines=>508, :removed_lines=>529, :message=>"Refactor default layout", :hour=>12})
# => :not_broken
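To give a feel for what "model selection" means, here is a toy sketch in plain Ruby. It is not mrscake's actual algorithm: the three candidate rules, the size values and the penalty are assumptions made up for this illustration. Each candidate is scored by how many training rows it gets wrong plus a small penalty for its size, and the lowest score wins.

```ruby
# Toy training set: [features, label] pairs, using two of the features
# from the example dataset above.
DATA = [
  [{:author => :stefan, :hour => 18}, :broken],
  [{:author => :karl,   :hour => 11}, :not_broken],
  [{:author => :peter,  :hour => 19}, :not_broken],
  [{:author => :karl,   :hour => 14}, :broken],
  [{:author => :bran,   :hour => 10}, :not_broken],
  [{:author => :peter,  :hour => 17}, :not_broken]
]

# Hypothetical candidates: each has a predict lambda and a rough "size"
# that stands in for how complicated the model is.
CANDIDATES = {
  "always not_broken" => {
    :size    => 1,
    :predict => lambda { |f| :not_broken }
  },
  "author rule" => {
    :size    => 2,
    :predict => lambda { |f| [:stefan, :karl].include?(f[:author]) ? :broken : :not_broken }
  },
  "afternoon rule" => {
    :size    => 2,
    :predict => lambda { |f| f[:hour] >= 14 ? :broken : :not_broken }
  }
}

# Score = misclassified rows plus a penalty per unit of size (lower is better).
def score(model, data, penalty = 0.1)
  errors = data.count { |features, label| model[:predict].call(features) != label }
  errors + penalty * model[:size]
end

best_name, _best_model = CANDIDATES.min_by { |_name, model| score(model, DATA) }
puts best_name
```

On this toy data the author rule misclassifies one row while the other two candidates misclassify two, so it is the one selected.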
Saving Models
As training takes some time, you'll probably want to save the resulting model to disk:
model.save("broken_build_predict.dat")
Another way of making your model persistent is to have it generate a function and then plug that function into your codebase:
puts model.generate_code("ruby")
In this example, the output will look something like this:
def predict(message, author, hour, added_lines, removed_lines)
  if !![:stefan,:karl].index((author)) then
    return :broken
  else
    return :not_broken
  end
end
This is a decision tree, which is usually the best model if you have only a small amount of training data.
You can generate code in C, Python, Ruby and JavaScript.
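Once pasted into your codebase, the generated function needs nothing from mrscake at runtime. The sketch below copies the example output shown above and adds a small wrapper, predict_commit, which is a hypothetical convenience (not something mrscake generates) so that callers can keep passing the same feature hash used for training.

```ruby
# Generated function, copied from the example generate_code output.
def predict(message, author, hour, added_lines, removed_lines)
  if !![:stefan,:karl].index((author)) then
    return :broken
  else
    return :not_broken
  end
end

# Hypothetical wrapper: unpack a feature hash into positional arguments.
def predict_commit(features)
  predict(features[:message], features[:author], features[:hour],
          features[:added_lines], features[:removed_lines])
end

p predict_commit({:author=>:stefan, :added_lines=>5, :removed_lines=>5, :message=>"misprint oopsie", :hour=>18})
p predict_commit({:author=>:peter, :added_lines=>3, :removed_lines=>7, :message=>"simplify group handling", :hour=>17})
```

Because this particular tree only looks at the author, the other arguments are passed through but ignored.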
Training Specific Models
If you already have a hunch what kind of model might suit your data best, you can also have mrscake train a specific model for you:
d.train("neuronal network (gaussian) with 2 layers")
Use
p MrsCake::model_names
to see all the supported models.
Parallel Processing
It's possible to parallelize model selection. To do this, run mrscake-job-server on a couple of machines and then add them to mrscake before you run train():
MrsCake::add_server("129.187.2.1")
MrsCake::add_server("129.187.2.2")
MrsCake::add_server("129.187.2.3")
MrsCake::add_server("129.187.2.4")