TensorFlow classification example: Titanic competition

After completing the first three lectures of Andrew Ng’s excellent deep learning course on Coursera, I decided to practice my new skills on Kaggle competitions.

For a first example, I’ll use the Titanic dataset again.

The data has already been analysed and processed (log, binning, etc.) in a previous article, so I’ll skip this part.

import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

# Helper function: rounded median age of the passengers in a given class
def median_age(data, Pclass):
    med_age = round(data["Age"][data["Pclass"] == Pclass].median())
    return med_age

data = pd.read_csv("train.csv")
data.loc[(data.Age.isnull()) & (data["Fare"] == 0), "Age"] = 0
for i in [1,2,3]:
    data.loc[(data.Age.isnull()) & (data["Pclass"] == i), "Age"] = median_age(data,i)
data.loc[(data.Age < 1), "Age"] = 1
data.loc[(data.Sex == "male", "Sexe")]=1
data.loc[(data.Sex == "female", "Sexe")]=0
data.loc[(data.Cabin.isnull()==False), "Cabin"] = 1
data.loc[(data.Cabin.isnull()), "Cabin"] = 0
data["Agebin"] = pd.cut(data["Age"],bins=[0,4,12,25,60,85], labels=[1,2,3,4,5])
data["Famille"] = 1+data["SibSp"]+data["Parch"]
data["LogFare"] = data["Fare"].apply(lambda x: np.log(x) if x > 0 else x)

data_validation = pd.read_csv("test.csv")
data_validation.loc[(data_validation.Age.isnull()) & (data_validation["Fare"] == 0), "Age"] = 0
for i in [1,2,3]:
    data_validation.loc[(data_validation.Age.isnull()) & (data_validation["Pclass"] == i), "Age"] = median_age(data,i)
data_validation.loc[(data_validation.Age <1), "Age"] = 1
data_validation.loc[(data_validation.Sex == "male", "Sexe")]=1
data_validation.loc[(data_validation.Sex == "female", "Sexe")]=0
data_validation.loc[(data_validation.Cabin.isnull()==False), "Cabin"] = 1
data_validation.loc[(data_validation.Cabin.isnull()), "Cabin"] = 0
data_validation["Agebin"] = pd.cut(data_validation["Age"],bins=[0,4,12,25,60,85], labels=[1,2,3,4,5])
data_validation["Famille"] = 1+data_validation["SibSp"]+data_validation["Parch"]
data_validation.loc[(data_validation.Fare.isnull()), "Fare"] = 9
data_validation["LogFare"] = data_validation["Fare"].apply(lambda x: np.log(x) if x>0 else x)
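
To make the Agebin and LogFare steps above concrete, here is a minimal pure-Python sketch of what they compute for a single value. It assumes pd.cut's default right-closed intervals, so the bins are (0, 4], (4, 12], and so on. The names age_bin and log_fare are hypothetical helpers used only for this illustration; they are not part of the pipeline above.

```python
import math
from bisect import bisect_left

# Same bin edges as the pd.cut call above; labels are 1..5
AGE_EDGES = [0, 4, 12, 25, 60, 85]

def age_bin(age):
    """Return the bin label 1..5 for an age, or None if it falls outside (0, 85]."""
    i = bisect_left(AGE_EDGES, age)
    # i == 0 means age <= 0; i == len(AGE_EDGES) means age > 85
    return i if 0 < i < len(AGE_EDGES) else None

def log_fare(fare):
    """Natural log of the fare, leaving zero fares untouched."""
    return math.log(fare) if fare > 0 else fare

print(age_bin(4), age_bin(5), age_bin(30))
print(log_fare(0), round(log_fare(9), 3))
```

Values outside the bin range come back as None here, which plays the role of the NaN that pd.cut would produce.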

The training data contains 891 samples and 16 features, of which I’ll use only 5, as in the previous article. We could split the data into train and test sets, but here I’ll keep almost all of it for training (test_size=0.001).

target = "Survived"
features = ["Pclass", "Sexe","Famille","Fare", "Age"]

train, test = train_test_split(data, test_size=0.001)

X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

print("Dimensions of the training set : {0}".format(np.shape(X_train)))
print("Dimensions of the training set (target) : {0}".format(np.shape(y_train.values.reshape(len(y_train),1))))

Let’s move on to building a neural network (NN). TensorFlow has several high-level functions that we can use: for instance, let’s build a NN with n hidden layers and use a ProximalGradientDescentOptimizer instead of the default AdaGrad optimizer (this is actually not the best option, but take it as an illustration).
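
For intuition about what the two regularization strengths do, here is a rough single-weight sketch of a proximal gradient step. It follows my reading of the update rule documented for TensorFlow's proximal gradient descent op (a gradient step, then L1 soft-thresholding, then L2 rescaling), so treat it as an assumption rather than the library's actual implementation. The function name proximal_step is hypothetical.

```python
def proximal_step(w, grad, lr=0.01, l1=0.1, l2=0.1):
    """One proximal gradient descent update for a single weight (illustrative)."""
    # Plain gradient step
    v = w - lr * grad
    # Soft-thresholding: the L1 term shrinks |v| by lr * l1 and clips at zero
    shrunk = max(abs(v) - lr * l1, 0.0)
    sign = 1.0 if v > 0 else -1.0 if v < 0 else 0.0
    # The L2 term rescales the result towards zero
    return sign * shrunk / (1.0 + lr * l2)

print(proximal_step(0.5, grad=1.0))     # a large weight is slightly shrunk
print(proximal_step(0.0005, grad=0.0))  # a tiny weight is set exactly to zero
```

The second call shows why L1 regularization tends to produce sparse weights: anything smaller than lr * l1 after the gradient step is zeroed out.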

To build the training input, we use tf.estimator.inputs.numpy_input_fn, which returns an input function that feeds a dict of numpy arrays into the model. Pay attention to the shape of “y”, the target array.

def model(hu, model_dir, features):
    # Specify the shape of the feature column, so [5,1] here
    feature_columns = [tf.feature_column.numeric_column("x", shape=[len(features),1])]

    # Build a DNN with one hidden layer per entry of hu (a list of layer sizes)
    # The default optimizer is AdaGrad, but we can specify another one
    classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns,
                                        hidden_units=hu,
                                        n_classes=2,
                                        optimizer=tf.train.ProximalGradientDescentOptimizer(
                                            learning_rate=0.01,
                                            l1_regularization_strength=0.1,
                                            l2_regularization_strength=0.1),                                      
                                        model_dir=model_dir)

    # Define the training inputs (reads the global X_train and y_train)
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(X_train)},
        y=np.array(y_train.values.reshape((len(y_train),1))),
        num_epochs=None,
        shuffle=True)
    return classifier, train_input_fn

To train a NN with three hidden layers, the function above can be called like this:

# 3 hidden layers with 32, 64 and 32 units
classifier, train_input_fn = model([32,64,32], "./tmp/DNN", features)

# Let's train
classifier.train(input_fn=train_input_fn, steps=1000)

To evaluate the model, we can follow a similar procedure with the test set:

# Define the test inputs
def testinput(X_test, y_test):
    test_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(X_test)},
        y=np.array(y_test),
        num_epochs=1,
        shuffle=False)

    return test_input_fn
    
# Build the test input function, then evaluate accuracy.
test_input_fn = testinput(X_test, y_test)
accuracy_score = classifier.evaluate(input_fn=test_input_fn)["accuracy"]

print("\nTest Accuracy: {0:f}\n".format(accuracy_score))

Here, the accuracy is about 30%, which is not meaningful since the test set holds only a handful of samples, but you get the idea.

Now that the model has been trained, let’s proceed to the prediction step. Again, we’ll define an input function using tf.estimator.inputs.numpy_input_fn, this time with y=None since we have no target column.

my_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(data_validation[features])},
        y=None,
        num_epochs=1,
        shuffle=False)

pred = classifier.predict(input_fn=my_input_fn)

Notice that the result (“pred”) is a generator:

pred
<generator object Estimator.predict at 0x0000022B625A26D0>

So we’ll convert it to a list. Each element is a dict with the predicted class and probabilities:

predictions = list(pred)
predictions[0]

{'class_ids': array([0], dtype=int64),
 'classes': array([b'0'], dtype=object),
 'logistic': array([ 0.12466089], dtype=float32),
 'logits': array([-1.94901419], dtype=float32),
 'probabilities': array([ 0.87533915,  0.12466089], dtype=float32)}
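
The fields of this dict are related in a simple way. As a quick check, assuming the binary classifier applies a sigmoid to its single logit (which the numbers above are consistent with), the values can be reproduced by hand:

```python
import math

def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

logit = -1.94901419                                   # 'logits' from the output above
p_survived = sigmoid(logit)                           # corresponds to 'logistic'
probabilities = [1.0 - p_survived, p_survived]        # [P(class 0), P(class 1)]
class_id = probabilities.index(max(probabilities))    # corresponds to 'class_ids'

print(round(p_survived, 4), class_id)
```

So 'logistic' is the probability of class 1 (survived), and 'class_ids' is simply the index of the larger of the two probabilities.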

To get only the predicted classes, I’ll loop through the list of dicts:

final_pred = np.array([])
for p in predictions:
    final_pred = np.append(final_pred,p['class_ids'][0])

# cast from float to integer (np.append on an empty array yields floats)
final_pred = final_pred.astype(int)
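
Since np.append copies the whole array on every call, a list comprehension is a possible alternative to the loop above. The sketch below uses a small made-up list (toy_predictions, with plain lists standing in for the numpy arrays) so that it runs without TensorFlow; the values are illustrative only.

```python
# Toy stand-in for list(pred): plain lists replace the numpy arrays
toy_predictions = [
    {"class_ids": [0], "probabilities": [0.88, 0.12]},
    {"class_ids": [1], "probabilities": [0.30, 0.70]},
    {"class_ids": [0], "probabilities": [0.65, 0.35]},
]

# Take the first (and only) class id of each prediction as a Python int
toy_classes = [int(p["class_ids"][0]) for p in toy_predictions]
print(toy_classes)
```

The same comprehension should work on the real predictions list, since indexing a one-element numpy array with [0] also returns a scalar.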

And we’re done!

Let’s write the prediction file for submission.

result = pd.DataFrame(columns=["PassengerId", "Survived"])
result["PassengerId"] = data_validation['PassengerId']
result["Survived"] = final_pred
result.to_csv("Submission-tf1.csv", index=False) # index=False keeps the row index out of the file

The final score is 0.78, which is very close to what was obtained using logistic regression or random forest with scikit-learn.

Bonus: using K-fold

The dataset is rather small, so using K-fold cross validation might be useful to build a better model:

from sklearn.model_selection import KFold
k_fold = KFold(n_splits=3, shuffle=True)

features = ["Pclass", "Sexe", "Famille", "Age", "LogFare"]
target = "Survived"
accuracy = np.array([])
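for fold, (train_index, test_index) in enumerate(k_fold.split(data)):
    X_train = data.loc[train_index,features]
    y_train = data.loc[train_index,target]
    X_test = data.loc[test_index,features]
    y_test = data.loc[test_index,target]
    # hu must be a list; a separate model_dir per fold prevents the estimator
    # from resuming training from the previous fold's checkpoint
    classifier, train_input_fn = model([10, 20, 10], "./tmp/DNN20_fold{0}".format(fold), features)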
    classifier.train(input_fn=train_input_fn, steps=3000)
    test_input_fn = testinput(X_test,y_test)
    accuracy_score = classifier.evaluate(input_fn=test_input_fn)["accuracy"]
    accuracy = np.append(accuracy,accuracy_score)
    print("\nTest Accuracy: {0:f}\n".format(accuracy_score))

np.mean(accuracy) #typically between 0.78-0.82
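
To see what the splits look like, here is a hand-rolled sketch of K-fold index generation without shuffling. It is not scikit-learn's implementation (kfold_indices is a hypothetical helper), but it shows the idea: each sample lands in the test fold exactly once and in the training folds otherwise.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for k in range(n_splits):
        stop = start + base + (1 if k < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop

# 891 training samples and 3 folds, as in the loop above
folds = list(kfold_indices(891, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

With 891 samples and 3 splits, each model is trained on 594 samples and evaluated on the remaining 297.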

 
