Porto Seguro’s Safe Driver Prediction (Kaggle)

This competition was held on Kaggle from August to November 2017. Porto Seguro is a large Brazilian insurance company that wishes to build a model predicting the probability that a driver will initiate an auto insurance claim in the next year.

The data

The training data is an anonymized 113 MB .csv file with 59 features and ~595k samples; the data for submitting predictions has around 892k samples.

Feature engineering will be limited here, because we have almost no information about the features except whether they are numerical, categorical or binary. I could try to randomly combine or transform features, but doing so relies on luck and I did not want to waste time on it.

I have written a class that:

  1. Analyses the data to automatically determine the type of each feature
  2. Encodes missing values with a strategy that depends on the type: mean (or median) for numeric features, most frequent value for categorical or binary features
  3. Scales numeric features
  4. Binarizes categorical features by one-hot encoding.

And that’s all for feature engineering.

import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelBinarizer, StandardScaler

class encoder:
    """
    Applies sklearn.preprocessing.LabelBinarizer and 
    sklearn.preprocessing.StandardScaler
    """

    def __init__(self):
        #binary, categorical or numerical features
        self.binary = list()
        self.categorical = list()
        self.numeric = list()
        self.feat_remove = list()
        self.binarizer = False

    def __repr__(self):
        return "<Binarizer and Scaler> object"
    
    def get_params(self):
        
        """ Returns parameters 
        
        Parameters
        ----------
        None
        
        Returns
        -------
        dict : parameter keywords
        """
        
        return {"Binarizer": self.binarizer}

    def auto_cat(self,df):        
        ''' 
        Reads DataFrame to automatically determine if a feature is binary, 
        categorical, numerical or other.
        Text, dates and other types are not handled.
        
        Parameters
        ----------
        df : Pandas DataFrame
             DataFrame whose columns are to be categorized
             
        Returns
        ------
        None
        
        '''
        colnames = df.columns
        for col in colnames:
            if (col=='target') | (col=='id'):
                pass
            # If only 2 different values then binary 
            elif len(pd.unique(df[col]))==2:
                self.binary.append(col)
                print("{0}\t-\tBinary".format(str(col)))           
            # If integer then categorical (an integer column with as many
            # unique values as rows could also be an index, which is not
            # handled here)
            elif np.dtype(df[col]) == np.int64:
                self.categorical.append(col)
                print("{0}\t-\tCategorical".format(str(col)))
            # Everything else is considered as a numeric feature
            else:
                print("{0}\t-\tNumeric".format(str(col)))
                self.numeric.append(col)
    
    def na_encode(self,df, NA=np.NaN):
        ''' 
        
        Encodes null values using different strategies depending on the 
        category of the columns
        
        Parameters
        ----------
        df : Pandas DataFrame
            DataFrame containing the data
        NA : The representation of missing values in the DataFrame
        
        Returns
        -------
        df : Pandas DataFrame
            The DataFrame after its missing values were replaced
        '''
        
        # 'Most frequent' strategy for categorical features
        print("Processing categorical features...")
        for c in self.categorical:
            # Missing values are represented by "-1" in the original data,
            # so this method should be called with NA=-1 (the default
            # NA=np.NaN would never match anything, since NaN != NaN)
            df.loc[df[c] == NA, c] = np.NaN
            na_count = df[c].isnull().sum()
            print("Number of missing values for feature {0}: {1}".format(c, na_count))
            df[c].fillna(value=df[c].mode()[0], inplace=True)
        # 'Mean' strategy for numeric features (median is an alternative)
        print("Processing numerical features...")
        for c in self.numeric:
            df.loc[df[c] == NA, c] = np.NaN
            na_count = df[c].isnull().sum()
            print("Number of missing values for feature {0}: {1}".format(c, na_count))
            df[c].fillna(value=df[c].mean(), inplace=True)
        # 'Most frequent' strategy for binary features
        for c in self.binary:
            df.loc[df[c] == NA, c] = np.NaN
            na_count = df[c].isnull().sum()
            print("Number of missing values for feature {0}: {1}".format(c, na_count))
            df[c].fillna(value=df[c].mode()[0], inplace=True)
        return df

    def binarize_scale(self, df, NA=np.NaN, binarizer=False):
        
        """
        Get column names and determine if numeric/cat/bin
        
        parameters
        ----------
        
        df: A pandas DataFrame
            DataFrame containing the data that will be processed
        NA : The representation of missing values in the DataFrame
        binarizer : Boolean
            Select if the categorical features will be binarized or not
        
        Returns
        -------
        df : Pandas DataFrame
            The DataFrame after processing (binarized and scaled)
        """
        
        self.auto_cat(df)
        print("Encoding null values")
        df = self.na_encode(df,NA)
        self.binarizer = binarizer
        if self.binarizer:
            print("Binarizing...")
            # Create the one-hot encoded features; note that the original
            # categorical columns are kept alongside the new ones
            label = LabelBinarizer()
            for c in self.categorical:
                print(c)
                _ = label.fit_transform(df[c])
                for i in range(np.shape(_)[1]):
                    df[str(c) + "_" + str(i)] = _[:, i]
        print("Scaling...")
        # Scale numeric features; the result must be assigned back to the
        # DataFrame, otherwise the scaling is silently discarded
        scaler = StandardScaler()
        for c in self.numeric:
            df[c] = scaler.fit_transform(
                df[c].astype(np.float64).values.reshape(-1, 1)).ravel()
        return df
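A minimal usage sketch (assuming the training data has been loaded into a DataFrame called data; in this data set missing values are coded as -1):

#Hypothetical usage: categorize, impute, one-hot encode and scale in one call
enc = encoder()
data = enc.binarize_scale(data, NA=-1, binarizer=True)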

Note that the data is imbalanced: only about 3.6% of the targets are “1”. I deal with this by using built-in parameters such as “class_weight”, or by manually oversampling the data.
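As a rough sketch (assuming the training DataFrame data with its binary target column), the observed imbalance can be turned into the weighting parameters used later:

#Sketch: derive weighting parameters from the observed class imbalance
n_pos = data['target'].sum()
n_neg = len(data) - n_pos
ratio = n_neg / n_pos  #~26.6 for a 3.6% positive rate
class_weight = {0: 1.0, 1: ratio}  #for LogisticRegression(class_weight=...)
scale_pos_weight = ratio           #for xgb.XGBClassifier(scale_pos_weight=...)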

Feature selection

First, I deleted the features having too many null values:

data.drop(["ps_car_03_cat", "ps_car_05_cat"], inplace=True, axis=1)

Following several discussions on Kaggle’s forums, all features named like “*_calc_*” are also dropped: they are presumed to have been engineered by Porto Seguro’s data scientists and to carry no extra signal in an anonymized data set. Most models shared on Kaggle got rid of them and showed an increase in the final score.
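A sketch for dropping them in one pass (assuming, as above, that the DataFrame is called data):

#Sketch: drop every "*_calc_*" column at once
calc_cols = [c for c in data.columns if '_calc_' in c]
data.drop(calc_cols, axis=1, inplace=True)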

Since we do not have access to the feature names, I used scikit-learn’s feature selection tools, in particular tree-based techniques (ExtraTreesClassifier combined with SelectFromModel).

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
import matplotlib.pyplot as plt

features_init = data.columns
features_init = features_init.drop(['id', 'target'])
target = "target"

clf = ExtraTreesClassifier(n_jobs=-1)
clf = clf.fit(data[features_init],data[target])
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
#Use threshold to tune the number of selected features
model = SelectFromModel(clf,threshold=0.005)
model.fit(data[features_init], data[target])
#Feature selection on the full data set: data -> data_new
data_new = model.transform(data[features_init])
n_kept_features = data_new.shape[1]
print("The model has kept {0} features".format(n_kept_features))
 
# Plot the feature importances of the forest
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
plt.figure()
plt.title("Feature importances")
plt.bar(range(data_new.shape[1]), importances[indices][:n_kept_features],
       color="r", yerr=std[indices][:n_kept_features], align="center")
plt.xticks(range(data_new.shape[1]), indices[:n_kept_features])
plt.xlim([-1, data_new.shape[1]])
plt.show()
print("LIST OF FEATURES BY IMPORTANCE")
for i in range(len(features_init)):
    print("feature {0}--\t{1}\t{2}".format(i+1,features_init[indices[i]],
           importances[indices[i]])) 
 
kept_features = features_init[indices][:n_kept_features]
    
data_new = pd.DataFrame(data[kept_features].copy())
#Do not forget to keep the target column
data_new['target'] = data['target']
data_valid_new = pd.DataFrame(data_valid[kept_features].copy())

Here is the output of the code above:

LIST OF FEATURES BY IMPORTANCE
feature 1--     ps_car_13       0.08395602322296314
feature 2--     ps_reg_03       0.07654451444144916
feature 3--     ps_ind_15       0.07156292447280062
feature 4--     ps_ind_03       0.07016063421350557
feature 5--     ps_reg_02       0.06672868042596583
feature 6--     ps_car_14       0.06386550357905733
feature 7--     ps_car_15       0.059201449534025216
feature 8--     ps_ind_01       0.05868455931428655
feature 9--     ps_reg_01       0.05413478290904592
feature 10--    ps_car_11_cat   0.052169950169067114
feature 11--    ps_car_01_cat   0.05197268130395137
feature 12--    ps_car_06_cat   0.0456994765624346
feature 13--    ps_car_12       0.04410778610553676
feature 14--    ps_car_09_cat   0.03390578438345717
feature 15--    ps_ind_02_cat   0.029693212195217516
feature 16--    ps_car_11       0.026579370022879727
feature 17--    ps_ind_04_cat   0.02039494171735725
feature 18--    ps_ind_05_cat   0.01855091183309448
feature 19--    ps_car_04_cat   0.01214762037764512
feature 20--    ps_car_08_cat   0.008835714035118429
feature 21--    ps_car_07_cat   0.008477157474454863
feature 22--    ps_car_02_cat   0.00759847442162129
feature 23--    ps_ind_18_bin   0.006347465882805775
feature 24--    ps_ind_16_bin   0.006235044573153435
feature 25--    ps_ind_08_bin   0.0038822273859338515
feature 26--    ps_ind_07_bin   0.0036710319094359614
feature 27--    ps_ind_09_bin   0.0030633329019027476
feature 28--    ps_ind_17_bin   0.0026276329367167623
feature 29--    ps_car_10_cat   0.0024131270893488926
feature 30--    ps_ind_06_bin   0.002386273793299049
feature 31--    ps_ind_14       0.0017387505487839207
feature 32--    ps_ind_12_bin   0.0016463162659332339
feature 33--    ps_ind_11_bin   0.0004192578744535972
feature 34--    ps_ind_13_bin   0.0003628728513604819
feature 35--    ps_ind_10_bin   0.0002345132719372518

You can tune the number of selected features by changing the threshold passed to SelectFromModel above.

Logistic regression

This linear model is one of the simplest we can use for binary classification. I used it in the first stages of the competition, for example to test the sample selection threshold and other parameters.

One limitation is that the data must be binarized first, which leads to more than 230 features. To speed up the computation, I used PCA to reduce the number of features. The function below pipelines PCA followed by logistic regression.

from sklearn import decomposition
from sklearn.pipeline import Pipeline

def pipePCALogit(logistic, pca_n_components):
    """
    Creates a pipeline that chains a PCA and a logit, without GridSearch.
    Should be used once the best hyperparameters have been found.
    """
    print("#################\nPCA decomposition\n#################")
    pca = decomposition.PCA(n_components=pca_n_components)
    pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)],
                    memory='./test')
    print("Fitting pipeline with {0} PCA components".format(pca_n_components))
    pipe.fit(X_train[features], y_train)
    print("Predicting on test set")
    pred_pipe_train = pipe.predict(X_train[features])
    pred_pipe_train_proba = pipe.predict_proba(X_train[features])[:,1]
    pred_pipe_test = pipe.predict(X_test[features])
    pred_pipe_test_proba = pipe.predict_proba(X_test[features])[:,1]
    n_positives_test = pred_pipe_test.sum()
    n_positives_train = pred_pipe_train.sum()
    print("There are {0} positive predictions on the test set"\
          .format(n_positives_test))
    print("There are {0} positive predictions on the train set"\
          .format(n_positives_train))
    print("Predicting on the validation set")
    print("There are {:.2f}% of positive predictions on the train test"
          .format(100*pred_pipe_train.sum()/len(train)))    
    print("There are {:.2f}% of positive predictions on the test test"
          .format(100*pred_pipe_test.sum()/len(test)))
    print("Train set scores:\n",\
          evalscores(y_train,pred_pipe_train,pred_pipe_train_proba))
    print("Test set scores:\n",\
          evalscores(y_test,pred_pipe_test,pred_pipe_test_proba))
    
    print("Predicting on validation data ...")
    pred_pipe_valid = pipe.predict(data_valid_new[features])
    print("This model predicts {0:.2f}% of positive results on the validation data"
              .format(100*pred_pipe_valid.sum()/len(data_valid_new)))
    pred_pipe_valid_proba = pipe.predict_proba(data_valid_new[features])[:,1]
    
    return pipe, pred_pipe_test, pred_pipe_test_proba, pred_pipe_valid_proba

This function can be used like this:

logistic = LogisticRegression(n_jobs=-1,
                                 max_iter=650,
                                 C=1,
                                 class_weight={0:0.1,1:1.3},
                                 solver='saga')

components = [120]
for c in components:
    t_init=time.time()
    log_model, pred_logit_test, pred_logit_test_proba, pred_logit_valid = \
        pipePCALogit(logistic, c)
    t_final=time.time()
    print("Execution time: {0}\n".format(t_final-t_init))

The “class_weight” parameter can be tuned to take the data imbalance into account. I adjusted the parameters using a randomized search rather than a grid search. To achieve the highest Gini score, the optimal number of PCA components lies in the 140-180 range (depending on the parameters and the features selected), and the class weights should be adjusted so that the percentage of positive predictions is slightly over 4%.
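For reference, a minimal sketch of such a randomized search over the pipeline (assuming the X_train, y_train and features objects defined earlier; the parameter ranges are illustrative):

from sklearn import decomposition
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

#Sketch: random search over the PCA size and the class weights
pipe = Pipeline(steps=[('pca', decomposition.PCA()),
                       ('logistic', LogisticRegression(solver='saga',
                                                       max_iter=650))])
param_dist = {'pca__n_components': [120, 140, 160, 180],
              'logistic__C': [0.1, 1, 10],
              'logistic__class_weight': [{0: 0.1, 1: w}
                                         for w in (1.1, 1.3, 1.5)]}
search = RandomizedSearchCV(pipe, param_distributions=param_dist,
                            n_iter=10, scoring='roc_auc', n_jobs=-1)
search.fit(X_train[features], y_train)
print(search.best_params_)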

The graphs below show the train and test score as a function of the number of components, as well as the fitting time. The test score clearly reaches its optimum around 140 components. The impact of the PCA also shows in the fitting time, which increases sharply above ~180 components.
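A sketch of the kind of loop that produces such curves (assuming the pipePCALogit function and the logistic and y_test objects defined above; the component grid is illustrative):

import time
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score

#Sketch: test AUC and fitting time as a function of the PCA size
components = [60, 100, 140, 180, 220]
scores, times = [], []
for n in components:
    t0 = time.time()
    pipe, _, proba_test, _ = pipePCALogit(logistic, n)
    times.append(time.time() - t0)
    scores.append(roc_auc_score(y_test, proba_test))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(components, scores)
ax1.set_xlabel('n_components')
ax1.set_ylabel('test ROC AUC')
ax2.plot(components, times)
ax2.set_xlabel('n_components')
ax2.set_ylabel('fitting time (s)')
plt.show()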

Gradient boosting classifier

For gradient boosting classification I used XGBoost’s powerful scikit-learn API. The data is binarized beforehand and the features are selected with the tree-based method described in the previous section. This classifier has many parameters, so a randomized search is necessary to fine-tune the model.

import xgboost as xgb

clf = xgb.XGBClassifier(nthread=16, learning_rate=0.01)

# Utility function to report best scores
def reporting(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

# specify parameters and distributions to sample from
param_dist = {"max_depth": [1, 2, 3, 4, 5, 10],
              "min_child_weight":[1,2,3],
              "n_estimators": [200, 300, 400,500],
              "reg_alpha": [.005,.004,0.001, 0.01,.1, .2, .3, .4, .5, .6,1,],
              "reg_lambda": [.1,.2,.3,.01,.05,.001,.005,.0001,.0005],
              "subsample": [.9, 1],
              "gamma": [0, 1, 2, 3, 4, 5, 6, 8],
              "scale_pos_weight": [15,15.2,15.5,15.6,15.7,16,16.2]
              }

from sklearn.metrics import make_scorer, matthews_corrcoef
#matthews_corrcoef expects class labels, not probabilities
scorer = make_scorer(matthews_corrcoef)

from sklearn.model_selection import RandomizedSearchCV
import time

# run randomized search
n_iter_search = 30
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search,
                                   scoring=scorer)

start = time.time()
random_search.fit(X_train, y_train)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time.time() - start), n_iter_search))
reporting(random_search.cv_results_)

Once the hyperparameters are tuned, the model can be used as shown below, where the data may be binarized or not.

t_init = time.time()
xgb_model = xgb.XGBClassifier(n_estimators=500, 
                              max_depth=3, 
                              min_child_weight=2,
                              learning_rate=0.001,
                              nthread=16,       
                              subsample=0.9,
                              reg_alpha=.006,
                              reg_lambda=.005,
                              gamma=5,
                              scale_pos_weight=15
                              )
print("Fitting on train set with XGBoost")
xgb_model.fit(X_train,y_train)
print("Total fitting time: {}".format(time.time()-t_init))
print("Predicting on test data")
xgb_model.predict(X_test)
t_final = time.time()
pred_xgb_train = xgb_model.predict(X_train)
pred_xgb_train_proba = xgb_model.predict_proba(X_train)[:,1]
pred_xgb_test = xgb_model.predict(X_test)
pred_xgb_test_proba = xgb_model.predict_proba(X_test)[:,1]
#Plot ROC AUC curve
from sklearn.metrics import roc_curve
fpr, tpr, _ = roc_curve(y_test, pred_xgb_test_proba, pos_label=1)
plt.plot(fpr,tpr)
plt.plot([0,1],[0,1],'r--')
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')
print("There are {:.2f}% of positive predictions on the test set"
      .format(100*pred_xgb_test.sum()/len(test)))
print("There are {:.2f}% of positive predictions on the train set"
      .format(100*pred_xgb_train.sum()/len(train)))
print("Predicting on validation data ...")
if binarizer:
    pred_xgb_valid = xgb_model.predict(data_valid_new[features])
    print("This model predicts {0:.2f}% of positive results on the validation data"
          .format(100*pred_xgb_valid.sum()/len(data_valid_new)))
    pred_xgb_proba_valid = xgb_model.predict_proba(data_valid_new[features])[:,1]
else:
    pred_xgb_valid = xgb_model.predict(data_valid[features])
    print("This model predicts {0:.2f}% of positive results on the validation data"
          .format(100*pred_xgb_valid.sum()/len(data_valid)))
    pred_xgb_proba_valid = xgb_model.predict_proba(data_valid[features])[:,1]
print("Train set scores:\n",evalscores(y_train,pred_xgb_train,pred_xgb_train_proba))
print("Test set scores:\n",evalscores(y_test,pred_xgb_test,pred_xgb_test_proba))

Neural network model

The third model I built for this competition is a simple 3-to-4-layer neural network using TensorFlow 1.4 (Python API).

Since DNNClassifier does not have a class_weight parameter, I wrote a simple function, classbalance, that oversamples the under-represented class up to the desired level.

The number of layers and nodes is defined in the call to dnn_model below. After training, a little work is needed to extract the predictions, since TensorFlow returns them as a generator of dictionaries: I loop over the predictions to collect the predicted classes and the associated probabilities.

def classbalance(dat, thresh):
    """Oversamples the positive class until it makes up thresh % of the data."""
    print("There are {:.2f}% of positive samples in the original data set"
          .format(100*dat['target'].sum()/len(dat)))
    #Fetch only the positive entries
    df_positives_train_data = dat.loc[dat['target'] == 1]
    #Percentage of positive targets
    proportion = 100*dat['target'].sum()/len(dat)
    while proportion < thresh:
        #Duplicate the positive entries to balance the data
        dat = dat.append(df_positives_train_data)
        proportion = 100*dat['target'].sum()/len(dat)
        print("There are {:.2f}% of positive samples in the data set"
              .format(proportion))
    return dat

data_new_balanced = classbalance(data_new.copy(),22)

from sklearn.model_selection import train_test_split

train, test = train_test_split(data_new_balanced, train_size=0.85,
                               test_size=0.15)

X_train, y_train, X_test, y_test = \
    train[features], train[target], test[features], test[target]
print("There are {:.2f}% of positive predictions in the train set"
      .format(100*y_train.sum()/len(y_train)))
print("There are {:.2f}% of positive predictions in the test set"
      .format(100*y_test.sum()/len(y_test)))


import tensorflow as tf

def testinput(X_test, y_test):
    """
    Creates the test inputs
    """
    
    test_input_fn = tf.estimator.inputs.numpy_input_fn(
            x={"x": np.array(X_test)},
            y=np.array(y_test),
            num_epochs=1,
            shuffle=False)
    return test_input_fn

def dnn_model(hu, model_dir, features):
    # Specify the shape of the features columns
    feature_columns = [tf.feature_column.numeric_column("x", 
                                                        shape=[len(features),1]
                                                        )]
    # Build n layer DNN with hu units (hu is an array)
    # The default optimizer is "AdaGrad" but we can specify another model
    classifier = tf.estimator.DNNClassifier\
                    (feature_columns=feature_columns,
                     hidden_units=hu,
                     n_classes=2,                                          
                     model_dir=model_dir,
                     optimizer=tf.train.ProximalAdagradOptimizer\
                                 (learning_rate=1e-3,
                                  l1_regularization_strength=0.0,
                                  l2_regularization_strength=1e-4 #ridge
                                 )
                    )
    # An alternative optimizer that could be swapped in:
    # optimizer=tf.train.AdamOptimizer(learning_rate=0.00001,
    #                                  beta1=0.1,
    #                                  beta2=0.99)

    # Define the training inputs
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(X_train)},
        y=np.array(y_train.values.reshape((len(y_train),1))),
        num_epochs=None,
        shuffle=True)
    return classifier, train_input_fn

t_init = time.time()
# 4 hidden layers
classifier, train_input_fn = dnn_model([228, 600, 600, 600], "./tmp/DNN1", features)
#Let's train
classifier.train(input_fn=train_input_fn, max_steps=20000)
test_input_fn = testinput(X_test, y_test)

validation_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(data_valid_new[features])}, 
        y=None,
        num_epochs=1,
        shuffle=False)

training_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(X_train[features])}, 
        y=None,
        num_epochs=1,
        shuffle=False)

print("Predicting on train set...")
pred_tf_train_data = classifier.predict(input_fn=training_input_fn)
print("Predicting on test set...")    
pred_tf_test_data_temp = classifier.predict(input_fn=test_input_fn)
print("Predicting on validation set...")
pred_valid_temp = classifier.predict(input_fn=validation_input_fn)

print("Working on train predictions...")
predictions_tf_train = list(pred_tf_train_data)
pred_tf_train = list()
pred_tf_train_proba = list()
for p in predictions_tf_train:
    pred_tf_train.append(p['class_ids'][0])
    pred_tf_train_proba.append(p['probabilities'][1])

print("Working on validation predictions...")    
predictions = list(pred_valid_temp)
pred_tf_validation = list()
pred_tf_validation_proba = list()
for p in predictions:
    pred_tf_validation.append(p['class_ids'][0])
    pred_tf_validation_proba.append(p['probabilities'][1])

print("Working on test predictions...")
pred_tf_test_temp = list(pred_tf_test_data_temp)
pred_tf_test_proba = list()
pred_tf_test = list()
for p in pred_tf_test_temp:
    pred_tf_test_proba.append(p['probabilities'][1])
    pred_tf_test.append(p['class_ids'][0])
print("Total processing time: {0:.3f}".format(time.time()-t_init))
print("Computing roc auc curve...")   
fpr, tpr, _ = roc_curve(y_test, pred_tf_test_proba, pos_label=1)
plt.plot(fpr,tpr)
plt.plot([0,1],[0,1],'r--')
plt.ylabel('True positive rate')
plt.xlabel('False positive rate')

print("There are {:.2f}% of positive predictions on the train set"
      .format(100*pred_tf_train.count(1)/len(y_train)))
print("There are {:.2f}% of positive predictions on the test set"
      .format(100*pred_tf_test.count(1)/len(y_test)))
print("Number of positives in test set: ",y_test.sum())
print("There are {:.2f}% of positive predictions on the validation set"
      .format(100*pred_tf_validation.count(1)/len(data_valid)))

pred_tf_test = np.array(pred_tf_test)
pred_tf_train = np.array(pred_tf_train)

print("Train set scores:\n",evalscores(y_train,pred_tf_train,pred_tf_train_proba))
print("Test set scores:\n",evalscores(y_test,pred_tf_test,pred_tf_test_proba))

Final predictions and score

The best score I achieved came from averaging the predictions of the TensorFlow and XGBoost models.
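A minimal blending sketch (assuming the validation probability vectors pred_xgb_proba_valid and pred_tf_validation_proba produced above, and that data_valid still holds the original id column; the 50/50 weighting is illustrative):

import numpy as np
import pandas as pd

#Sketch: simple average of the two models' predicted probabilities
blend = 0.5*np.array(pred_xgb_proba_valid) + 0.5*np.array(pred_tf_validation_proba)
submission = pd.DataFrame({'id': data_valid['id'], 'target': blend})
submission.to_csv('submission.csv', index=False)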

My best score was a normalized Gini of 0.28054 (~top 50%).

I suppose I could try different weightings of these models to improve the score and, of course, re-train all the models using cross-validation, but my goal here was to try and test different models, not necessarily to reach the highest possible rank.

All comments are welcome!
