After completing the first three lectures of Andrew Ng’s excellent deep learning course on Coursera, I decided to practice my new skills on Kaggle competitions.
For a first example, I’ll use the Titanic dataset again.
The data has already been analysed and processed (log, binning, etc.) in a previous article, so I’ll skip this part.
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

# Helper function: median age per passenger class
def median_age(data, Pclass):
    med_age = round(data["Age"][data["Pclass"] == Pclass].median())
    return med_age

data = pd.read_csv("train.csv")

# Fill missing ages: 0 when the fare is 0, otherwise the median age of the class
data.loc[(data.Age.isnull()) & (data["Fare"] == 0), "Age"] = 0
for i in [1, 2, 3]:
    data.loc[(data.Age.isnull()) & (data["Pclass"] == i), "Age"] = median_age(data, i)
data.loc[(data.Age < 1), "Age"] = 1

# Encode sex as a numeric column
data.loc[data.Sex == "male", "Sexe"] = 1
data.loc[data.Sex == "female", "Sexe"] = 0

# Cabin: 1 if a cabin number is known, 0 otherwise
data.loc[data.Cabin.isnull() == False, "Cabin"] = 1
data.loc[data.Cabin.isnull(), "Cabin"] = 0

# Age bins, family size and log of the fare
data["Agebin"] = pd.cut(data["Age"], bins=[0, 4, 12, 25, 60, 85], labels=[1, 2, 3, 4, 5])
data["Famille"] = 1 + data["SibSp"] + data["Parch"]
data["LogFare"] = data["Fare"].apply(lambda x: np.log(x) if x > 0 else x)

# Same preprocessing for the validation (submission) set
data_validation = pd.read_csv("test.csv")
data_validation.loc[(data_validation.Age.isnull()) & (data_validation["Fare"] == 0), "Age"] = 0
for i in [1, 2, 3]:
    data_validation.loc[(data_validation.Age.isnull()) & (data_validation["Pclass"] == i), "Age"] = median_age(data, i)
data_validation.loc[(data_validation.Age < 1), "Age"] = 1
data_validation.loc[data_validation.Sex == "male", "Sexe"] = 1
data_validation.loc[data_validation.Sex == "female", "Sexe"] = 0
data_validation.loc[data_validation.Cabin.isnull() == False, "Cabin"] = 1
data_validation.loc[data_validation.Cabin.isnull(), "Cabin"] = 0
data_validation["Agebin"] = pd.cut(data_validation["Age"], bins=[0, 4, 12, 25, 60, 85], labels=[1, 2, 3, 4, 5])
data_validation["Famille"] = 1 + data_validation["SibSp"] + data_validation["Parch"]
data_validation.loc[data_validation.Fare.isnull(), "Fare"] = 9
data_validation["LogFare"] = data_validation["Fare"].apply(lambda x: np.log(x) if x > 0 else x)
My training data contains 891 samples and 16 features, of which I’ll use only 5, as in the previous article. We could split the data into train/test sets; here I’ll use almost all of it for training (test_size=0.001).
target = "Survived" features = ["Pclass", "Sexe","Famille","Fare", "Age"] train, test = train_test_split(data, test_size=0.001) X_train = train[features] y_train = train[target] X_test = test[features] y_test = test[target] print("Dimensions of the training set : {0}".format(np.shape(X_train))) print("Dimensions of the training set (target) : {0}".format(np.shape(y_train.values.reshape(len(y_train),1))))
Let’s move on to building a neural network (NN). TensorFlow has several high-level functions we can use: for instance, let’s build an NN with n hidden layers and use a ProximalGradientDescentOptimizer instead of the default Adagrad optimizer (this is actually not the best option, but take it as an illustration).
To build the training input, we use tf.estimator.inputs.numpy_input_fn, which returns an input function that feeds a dict of NumPy arrays into the model. Pay attention to the shape of “y”, the target array.
def model(hu, model_dir, features):
    # Specify the shape of the feature columns, so [5,1] here
    feature_columns = [tf.feature_column.numeric_column("x", shape=[len(features), 1])]
    # Build an n-layer DNN with hu units per layer (hu is a list)
    # The default optimizer is "Adagrad" but we can specify another one
    classifier = tf.estimator.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=hu,
        n_classes=2,
        optimizer=tf.train.ProximalGradientDescentOptimizer(
            learning_rate=0.01,
            l1_regularization_strength=0.1,
            l2_regularization_strength=0.1),
        model_dir=model_dir)
    # Define the training inputs
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(X_train)},
        y=np.array(y_train.values.reshape((len(y_train), 1))),
        num_epochs=None,
        shuffle=True)
    return classifier, train_input_fn
To train an NN with three hidden layers (32, 64 and 32 units), the function above can be called like this:
# 3 hidden layers
classifier, train_input_fn = model([32, 64, 32], "./tmp/DNN", features)
# Let's train
classifier.train(input_fn=train_input_fn, steps=1000)
To evaluate the model, we can follow a similar procedure with the test set:
# Define the test inputs
def testinput(X_test, y_test):
    test_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(X_test)},
        y=np.array(y_test),
        num_epochs=1,
        shuffle=False)
    return test_input_fn

# Evaluate accuracy
test_input_fn = testinput(X_test, y_test)
accuracy_score = classifier.evaluate(input_fn=test_input_fn)["accuracy"]
print("\nTest Accuracy: {0:f}\n".format(accuracy_score))
Here the accuracy is 30%, which is meaningless since my test set contains only 3 or 4 samples, but you get the idea.
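If we wanted a meaningful estimate at this stage, a more conventional split could be used instead (an illustrative sketch only; the rest of this article keeps the tiny test set above):

# A more standard 80/20 split (illustration, not what was used for the submission)
train, test = train_test_split(data, test_size=0.2, random_state=42)
X_train, y_train = train[features], train[target]
X_test, y_test = test[features], test[target]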
Now that the model has been trained, let’s proceed to the prediction step. Again, we’ll define an input function with tf.estimator.inputs.numpy_input_fn, this time with y=None since there is no target column.
my_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": np.array(data_validation[features])},
    y=None,
    num_epochs=1,
    shuffle=False)

pred = classifier.predict(input_fn=my_input_fn)
Notice that the prediction object (“pred”) is a generator:
pred
<generator object Estimator.predict at 0x0000022B625A26D0>
So we’ll convert it to a list: each element is a dict holding the predicted class and the probabilities:
predictions = list(pred)
predictions[0]
{'class_ids': array([0], dtype=int64),
 'classes': array([b'0'], dtype=object),
 'logistic': array([ 0.12466089], dtype=float32),
 'logits': array([-1.94901419], dtype=float32),
 'probabilities': array([ 0.87533915,  0.12466089], dtype=float32)}
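A quick way to read these fields: ‘logistic’ is the sigmoid of ‘logits’, and ‘probabilities’ is [1 - logistic, logistic], so the second entry is the probability of surviving. A small check on the prediction above:

p = predictions[0]
sigmoid = 1.0 / (1.0 + np.exp(-p['logits'][0]))              # ~0.1247
print(np.isclose(sigmoid, p['logistic'][0]))                 # True
print(np.isclose(p['probabilities'][1], p['logistic'][0]))   # True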
To get only the predicted classes, I’ll loop through the list of dictionaries:
final_pred = np.array([])
for p in predictions:
    final_pred = np.append(final_pred, p['class_ids'][0])
# cast back from float to integer (np.append promoted the class ids to floats)
final_pred = final_pred.astype(int)
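Equivalently, since every element carries a ‘class_ids’ entry, the loop above can be replaced by a one-line list comprehension:

# Same result in a single line
final_pred = np.array([p['class_ids'][0] for p in predictions], dtype=int)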
And we’re done!
Let’s write the prediction file for submission.
result = pd.DataFrame(columns=["PassengerId", "Survived"])
result["PassengerId"] = data_validation['PassengerId']
result["Survived"] = final_pred
result.to_csv("Submission-tf1.csv", index=False)  # Do not forget to remove the index
The final score is 0.78, which is very close to what I obtained using logistic regression or a random forest with scikit-learn.
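For reference, the kind of scikit-learn baseline I’m comparing against looks roughly like this (a minimal sketch with default hyperparameters, not the exact code from the previous article):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = data[features], data[target]

# 5-fold cross-validated accuracy for two quick baselines
for clf in (LogisticRegression(), RandomForestClassifier(n_estimators=100)):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, round(scores.mean(), 3))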
Bonus: using K-fold
The dataset is rather small, so using K-fold cross validation might be useful to build a better model:
from sklearn.model_selection import KFold

k_fold = KFold(n_splits=3, shuffle=True)
features = ["Pclass", "Sexe", "Famille", "Age", "LogFare"]
target = "Survived"

accuracy = np.array([])
for fold, (train_index, test_index) in enumerate(k_fold.split(data)):
    X_train = data.loc[train_index, features]
    y_train = data.loc[train_index, target]
    X_test = data.loc[test_index, features]
    y_test = data.loc[test_index, target]
    # A fresh model_dir per fold, so the estimator does not restore the previous fold's checkpoint
    classifier, train_input_fn = model([10, 20, 10], "./tmp/DNN20_fold{0}".format(fold), features)
    classifier.train(input_fn=train_input_fn, steps=3000)
    test_input_fn = testinput(X_test, y_test)
    accuracy_score = classifier.evaluate(input_fn=test_input_fn)["accuracy"]
    accuracy = np.append(accuracy, accuracy_score)
    print("\nTest Accuracy: {0:f}\n".format(accuracy_score))

np.mean(accuracy)  # typically between 0.78-0.82
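Note that each fold gets its own model_dir: if the same directory were reused, the estimator would restore the checkpoint trained on the previous fold, which has already seen part of the current test fold, and the cross-validation estimate would be optimistically biased.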