MLBox : a short regression tutorial

I have recently discovered MLBox, an automated machine learning Python library.

Its main author, Axel Aronio de Romblay, promises :

  • Fast reading and distributed data preprocessing/cleaning/formatting.
  • Highly robust feature selection and leak detection.
  • Accurate hyper-parameter optimization in high-dimensional space.
  • State-of-the art predictive models for classification and regression (Deep Learning, Stacking, LightGBM,…).
  • Prediction with models interpretation.

Quite promising, no ? What about real life ? Let’s dive into it.

If you’re interested in classification, have a look at this great tutorial on Analytics Vidhya.

I have decided to test the library on Kaggle’s competition “House prices”, to compare with my last solution.

Downloading and installing MLBox

Good news: MLBox is available for Linux, but not only there, since I was able to install it on my Windows 10 system using Anaconda Navigator. However, there is no conda package, so we are going to run into some difficulties (I hope I'll save you a few hours).

  • Create an MLBox environment in Anaconda; Python 3.6 is OK. Simply click on 'Create' (bottom left), select Python 3.6 and click 'Create' again; then you'll have to wait for Anaconda to download the packages and install everything.

Creating a new environment in Anaconda

  • Before trying to install MLBox, you’ll have to install XGBoost.

XGBoost cannot be installed via pip at the moment, so if you try to install MLBox right away, the installation process will crash.

You can choose Git or MinGW if you feel comfortable with them, but I used a simpler method. Go to this page and download the .whl file corresponding to your system. Next, open a command prompt from Anaconda in your newly created environment.

Open terminal from here

Then cd to the folder where you downloaded the .whl file and, from there, simply run pip install xgboost-0.6-cp36-cp36m-win_amd64.whl (or whatever file name you downloaded).

Now you should be able to proceed by simply typing

pip install mlbox

If the installation stumbles on a blocked dependency, you may have to download the corresponding .whl file and pip install it before retrying to install mlbox.

OK, we are now ready to work !

Using MLBox for regression

1. MLBox “Blackbox” edition

MLBox can be used as a complete black box: you feed it the train/test sets, define the target and you're done.

Let’s try the basic blackbox approach.

#Import MLBox and other packages
import numpy as np   #needed later for np.NaN in the hyper-parameter space
import mlbox as mlb  #I don't really like * imports

#Read the files using preprocessing.Reader
#Usage: train_test_split([path to training data, path to test data], target)
#The target is "SalePrice", i.e. the price the houses were sold for
data = mlb.preprocessing.Reader(sep=",").train_test_split(["data.csv","data_test.csv"], 'SalePrice')

#Preprocess the data
#1/ Remove the Ids
#2/ Remove variables that drift between the train and test sets
data = mlb.preprocessing.Drift_thresholder().fit_transform(data)

#Evaluate the pipeline with its default parameters
#(no hyper-parameter space is given yet, hence the None)
best = mlb.optimisation.Optimiser().evaluate(None, data)

#Fit on the train set and predict on the test set
mlb.prediction.Predictor().fit_predict(best, data)

That was easy. We now have a subfolder named “save”, in which we can find a .csv file with the predictions, as well as feature importances and drift coefficients for all variables.
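Before uploading anything, you can take a quick look at the generated file with pandas. A minimal sketch, assuming the predictions are saved as "save/SalePrice_predictions.csv" (check your save folder for the exact file name):

#Quick peek at the predictions (file name assumed, adjust to what MLBox produced)
import pandas as pd
preds = pd.read_csv("save/SalePrice_predictions.csv")
print(preds.head())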

Let’s submit the prediction to Kaggle … wait for the data to upload and be processed … and the score is …

0.26013

Hmm, that’s not too bad considering I simply ran a library on data I had barely examined. However, with such a score I’d rank 1590th out of 1765.

There’s hopefully something we can do about it.

2. Optimising the pipeline

So, we’ve seen that MLBox can deliver some usable predictions without any work, but we can do much better by optimising the model parameters. Similarly to GridSearchCV in scikit-learn, we’ll feed the model a dictionary containing key/value pairs of parameters.

space_xgb={
'ne__numerical_strategy'    :{"search":"choice",
                              "space":[0,'mean','median','most_frequent']},
'ne__categorical_strategy'  :{"search":"choice",
                              "space":[np.NaN,"None"]},
'ce__strategy'              :{"search":"choice",
                              "space":['label_encoding','entity_embedding','dummification']},
'fs__strategy'              :{"search":"choice",
                              "space":['l1','variance','rf_feature_importance']},
'fs__threshold'             :{"search":"uniform",
                              "space":[0.01,0.6]},
'est__strategy'             :{"search":"choice",
                              "space":["XGBoost"]},
'est__max_depth'            :{"search":"choice",
                              "space":[3,4,5,6,7]},
'est__learning_rate'        :{"search":"uniform",
                              "space":[0.01,0.1]},
'est__subsample'            :{"search":"uniform",
                              "space":[0.4,0.9]},
'est__reg_alpha'            :{"search":"uniform",
                              "space":[0,10]},
'est__reg_lambda'           :{"search":"uniform",
                              "space":[0,10]},
'est__n_estimators'         :{"search":"choice",
                              "space":[1000,1250,1500]}
}

Using the documentation, let’s try to decipher the code I used above.

“space_xgb” is a hyper-parameters space (for my XGBoost estimator), where :

  • Keys must respect the syntax “enc__param” (note the double underscore)

“enc” can be: ‘ne’ for the numerical encoder, ‘ce’ for the categorical encoder, ‘fs’ for feature selection, ‘est’ for the estimator.

“param” can be (almost) any parameter accepted by the corresponding step. I suggest you refer to the XGBoost documentation for the class xgboost.XGBRegressor to understand all the parameters I used here.

  • Values must respect the following syntax: {“search” : strategy, “space” : list}, where ‘strategy’ may be “choice” for discrete values given as a list, or “uniform” for a continuous range given as [start_value, end_value]

Let me explain some of these lines in more detail:

'ne__numerical_strategy' :{"search":"choice", "space":[0,'mean','median','most_frequent']} means: try replacing missing numerical values with 0, with their mean, their median or their most frequent value.

'est__strategy' :{"search":"choice", "space":["XGBoost"]} means : Use XGBoost for regression.

'est__learning_rate' :{"search":"uniform", "space":[0.01,0.1]}, means : for my estimator (previously defined as XGBRegressor), use values ranging from 0.01 to 0.1 for the parameter “learning_rate”.

'est__n_estimators' :{"search":"choice", "space":[1000,1250,1500]} means: for the XGBoost parameter ‘n_estimators’, use only the values 1000, 1250 and 1500.

Let’s optimise and fit_predict: MLBox will evaluate up to max_evals combinations sampled from space_xgb to find the best fit.

#Optimisation
best_xgb=mlb.optimisation.Optimiser(scoring="r2",n_folds=5).optimise(space_xgb,data,120)

#Prediction
mlb.prediction.Predictor().fit_predict(best_xgb,data)

Note that for the scoring I had to select “r2” instead of “mean_squared_error”, because of an annoying deprecation warning: “mean_squared_error” has been replaced by “neg_mean_squared_error” since sklearn 0.18 (I’ll try to report this ‘bug’).
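If you would rather optimise on the MSE directly, one possible workaround is to build the scorer yourself with scikit-learn. This is only a sketch: it assumes Optimiser also accepts an sklearn scorer object for its scoring parameter, which you should check against the MLBox documentation for your version.

#Sketch: pass a negated-MSE scorer instead of the "mean_squared_error" string
#(assumes Optimiser accepts a callable scorer, not only a string)
from sklearn.metrics import make_scorer, mean_squared_error
neg_mse = make_scorer(mean_squared_error, greater_is_better=False)
best_xgb = mlb.optimisation.Optimiser(scoring=neg_mse, n_folds=5).optimise(space_xgb, data, 120)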

I have set max_evals=120 instead of the default value (40); it seems to yield better results, but I suggest you keep it at 40 for your own tests, since higher values increase the optimisation time a lot (2600 seconds for the code above).

Now, what are the best model parameters ?

best_xgb

Out[58]: 
{'ce__strategy': 'entity_embedding',
 'est__learning_rate': 0.05171805330989606,
 'est__max_depth': 4,
 'est__n_estimators': 1500,
 'est__reg_alpha': 1.8150695175445302,
 'est__reg_lambda': 3.7116897460071567,
 'est__strategy': 'XGBoost',
 'est__subsample': 0.6404841935660599,
 'fs__strategy': 'variance',
 'fs__threshold': 0.161493332373689,
 'ne__categorical_strategy': 'None',
 'ne__numerical_strategy': 0}

Missing numerical values are set to 0, missing categorical values to the string ‘None’, and about 16% of the features are discarded, selected according to their variance. The numeric parameters for the estimator are also shown.

In the subfolder ‘save’ we also have a bar graph of feature importances:

Feature importance as determined by MLBox

From a business point of view, this graph makes sense, since the most important features include: the area of the house, its overall quality, the area of the garage, the area of the garden, the year the house was built, the quality of the neighborhood, etc.

Let’s submit this new prediction to Kaggle: the new score is now…

0.12612

Wow! That’s a huge improvement! Now I’d rank about 668/1765 on the leaderboard.

Fortunately, as shown in my previous post, my best score is 0.12594, which ranked me 661/1765. I’m still 7 places ahead of the machine, but probably not for long, as I’ll keep improving the optimisation step.

3. Being smarter than MLBox ?

Considering I spent several days building a model manually, I was equally frustrated and excited to see that a few lines could provide a similar result. I decided to check whether I could be more intelligent by preprocessing the data manually, as I did in my previous post.

I went back to the data set and decided to remove some features that I considered irrelevant. I tried several combinations with [“GarageYrBlt”, “MoSold”,”MasVnrArea”, “GarageCars”, “GarageArea”]. In particular, when looking at the (reduced) correlation map :

Correlation map for selected features

There seems to be a strong collinearity between “GarageCars” (the number of cars the garage can hold) and “GarageArea”, meaning we have a redundancy there. Intuitively, I would try to delete one of these features (or create a new one by combining them).
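For reference, here is a minimal sketch of how such a manual removal can be done on the dictionary returned by train_test_split, before handing it to Drift_thresholder (the column list is just one of the combinations mentioned above; the "train"/"test" keys are the ones MLBox’s Reader produces):

#Manually drop a few columns from both the train and test dataframes
cols_to_drop = ["GarageYrBlt", "MoSold", "MasVnrArea", "GarageCars", "GarageArea"]
for key in ["train", "test"]:
    data[key] = data[key].drop([c for c in cols_to_drop if c in data[key].columns], axis=1)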

Well, I confess that every combination I tried led to scores that were still good, but worse: 0.12672, 0.12827, 0.12852…

Whatever I try, the algorithm always performs better by itself: pretty impressive! I suppose MLBox is smart enough to deal with multicollinearity on its own. I feel like it’s telling me: “Don’t even try, I don’t need you.”

Conclusion

First of all, I hope this tutorial will help you get started with MLBox, which I think is a wonderful tool for machine learning.

I hope I’ll have enough time to try and improve the model for this House price problem. I already have some ideas I’d like to implement, so stay tuned for a next blog post.

Concerning the library itself, it is:

  • Easy to use
  • Fast
  • Able to yield interesting results with very little knowledge of the underlying models
  • Able to yield much better results with some fine-tuning, so do not expect it to solve your problems completely on its own

Do not forget to keep an eye on the subfolder /save/joblib, which can grow rapidly: mine grew by 6 GB in just one day (not great for my online drive and limited bandwidth).
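If disk space is an issue, a small housekeeping snippet run between experiments can keep it under control. A sketch, assuming the cache lives in the save/joblib folder mentioned above:

#Clear MLBox's joblib cache between runs (path assumed to be save/joblib)
import os, shutil
joblib_dir = os.path.join("save", "joblib")
if os.path.isdir(joblib_dir):
    shutil.rmtree(joblib_dir)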

As a side note, I’d like to point out that I was not able to get any good prediction with LightGBM (nothing better than 0.26), whatever parameters I tried. If you have any idea why, please comment below.
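If you want to try to reproduce that attempt, switching the estimator in the search space only takes a few lines. A minimal sketch, with LightGBM parameter values that are purely illustrative (they are assumptions, not the exact settings I tried):

#Minimal LightGBM search space (values are illustrative only)
space_lgb = {
'est__strategy'      :{"search":"choice", "space":["LightGBM"]},
'est__learning_rate' :{"search":"uniform","space":[0.01,0.1]},
'est__n_estimators'  :{"search":"choice", "space":[1000,1500]}
}
best_lgb = mlb.optimisation.Optimiser(scoring="r2", n_folds=5).optimise(space_lgb, data, 40)
mlb.prediction.Predictor().fit_predict(best_lgb, data)

If you get anything better than 0.26 with it, I’d be glad to hear about it in the comments.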
