# TensorFlow classification example: Titanic competition

After completing the first three lectures of Andrew Ng’s excellent deep learning course on Coursera, I decided to practice my new skills on Kaggle competitions.

For a first example, I’ll use the Titanic dataset again.

The data has already been analysed and processed (log, binning, etc.) in a previous article, so I’ll skip the analysis and only reproduce the preprocessing code here.

```
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

# Assumes `data` (training set) and `data_validation` (Kaggle test set)
# are pandas DataFrames that have already been loaded.

# Helper function: median age of a passenger class
def median_age(data, Pclass):
    med_age = round(data["Age"][data["Pclass"] == Pclass].median())
    return med_age

# --- Training set ---
# Missing ages: 0 when the fare is 0, otherwise the median age of the class
data.loc[(data.Age.isnull()) & (data["Fare"] == 0), "Age"] = 0
for i in [1, 2, 3]:
    data.loc[(data.Age.isnull()) & (data["Pclass"] == i), "Age"] = median_age(data, i)
data.loc[(data.Age < 1), "Age"] = 1
# Encode sex as an integer and cabin as a known/unknown flag
data.loc[data.Sex == "male", "Sexe"] = 1
data.loc[data.Sex == "female", "Sexe"] = 0
data.loc[(data.Cabin.isnull() == False), "Cabin"] = 1
data.loc[(data.Cabin.isnull()), "Cabin"] = 0
data["Agebin"] = pd.cut(data["Age"], bins=[0, 4, 12, 25, 60, 85], labels=[1, 2, 3, 4, 5])
# Family size (the passenger included) and log of the fare
data["Famille"] = 1 + data["SibSp"] + data["Parch"]
data["LogFare"] = data["Fare"].apply(lambda x: np.log(x) if x > 0 else x)

# --- Validation set: same steps, with medians taken from the training set ---
data_validation.loc[(data_validation.Age.isnull()) & (data_validation["Fare"] == 0), "Age"] = 0
for i in [1, 2, 3]:
    data_validation.loc[(data_validation.Age.isnull()) & (data_validation["Pclass"] == i), "Age"] = median_age(data, i)
data_validation.loc[(data_validation.Age < 1), "Age"] = 1
data_validation.loc[data_validation.Sex == "male", "Sexe"] = 1
data_validation.loc[data_validation.Sex == "female", "Sexe"] = 0
data_validation.loc[(data_validation.Cabin.isnull() == False), "Cabin"] = 1
data_validation.loc[(data_validation.Cabin.isnull()), "Cabin"] = 0
data_validation["Agebin"] = pd.cut(data_validation["Age"], bins=[0, 4, 12, 25, 60, 85], labels=[1, 2, 3, 4, 5])
data_validation["Famille"] = 1 + data_validation["SibSp"] + data_validation["Parch"]
# Missing fares are replaced by 9 before taking the log
data_validation.loc[(data_validation.Fare.isnull()), "Fare"] = 9
data_validation["LogFare"] = data_validation["Fare"].apply(lambda x: np.log(x) if x > 0 else x)
```
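To see what the Agebin step does, here is a small sketch on made-up ages (the values are illustrative, not taken from the dataset). pd.cut uses right-closed intervals by default, so with these bin edges an age of 0 falls outside the first bin, which may be why ages below 1 are set to 1 beforehand.

```python
import pandas as pd

# Illustrative ages only (not from the Titanic data)
ages = pd.Series([0, 1, 4, 5, 30, 85])

# Same bin edges and labels as in the preprocessing code
binned = pd.cut(ages, bins=[0, 4, 12, 25, 60, 85], labels=[1, 2, 3, 4, 5])

# Intervals are (0, 4], (4, 12], (12, 25], (25, 60], (60, 85]:
# age 0 is outside (0, 4] and becomes NaN, age 4 still belongs to bin 1
print(binned.tolist())
```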

My training data contains 891 samples and 16 features, of which I’ll be using only 5, as in the previous article. We could split the data into train/test sets, but here I’ll use almost all of the data for training (hence the tiny test_size).

```
target = "Survived"
features = ["Pclass", "Sexe", "Famille", "Fare", "Age"]

train, test = train_test_split(data, test_size=0.001)

X_train = train[features]
y_train = train[target]
X_test = test[features]
y_test = test[target]

print("Dimensions of the training set : {0}".format(np.shape(X_train)))
print("Dimensions of the training set (target) : {0}".format(np.shape(y_train.values.reshape(len(y_train), 1))))
```

Let’s move on to building a neural network (NN). TensorFlow has several high-level functions that we can use: for instance, let’s build an NN with n hidden layers and use a ProximalGradientDescentOptimizer instead of the default AdaGrad optimizer (this is actually not the best option, but take it as an illustration).

To build the training input, we use tf.estimator.inputs.numpy_input_fn, which returns an input function that feeds a dict of numpy arrays into the model. Pay attention to the shape of “y”, the target array.

```
def model(hu, model_dir, features):
    # Specify the shape of the feature columns, so [5,1] here
    feature_columns = [tf.feature_column.numeric_column("x", shape=[len(features), 1])]

    # Build an n-layer DNN; hu is a list with the number of units of each hidden layer
    # The default optimizer is "Adagrad" but we can specify another one
    classifier = tf.estimator.DNNClassifier(
        feature_columns=feature_columns,
        hidden_units=hu,
        n_classes=2,
        optimizer=tf.train.ProximalGradientDescentOptimizer(
            learning_rate=0.01,
            l1_regularization_strength=0.1,
            l2_regularization_strength=0.1),
        model_dir=model_dir)

    # Define the training inputs (reads the global X_train and y_train)
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(X_train)},
        y=np.array(y_train.values.reshape((len(y_train), 1))),
        num_epochs=None,
        shuffle=True)
    return classifier, train_input_fn
```
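The reshape applied to “y” in the training input function turns the 1-D target into a column vector with one row per sample. A minimal check on a made-up target (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical target column with four passengers
y_train = pd.Series([0, 1, 1, 0], name="Survived")

flat = np.array(y_train)                             # 1-D array
column = y_train.values.reshape((len(y_train), 1))   # one row per sample, one column

print(flat.shape, column.shape)
```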

To train an NN with three hidden layers, the function above can be called like this:

```
# 3 hidden layers
classifier, train_input_fn = model([32, 64, 32], "./tmp/DNN", features)

# Let's train
classifier.train(input_fn=train_input_fn, steps=1000)
```

To evaluate the model, we can follow a similar procedure with the test set:

```
# Define the test inputs
def testinput(X_test, y_test):
    test_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(X_test)},
        y=np.array(y_test),
        num_epochs=1,
        shuffle=False)
    return test_input_fn

# Build the input function for the test set
test_input_fn = testinput(X_test, y_test)

# Evaluate accuracy.
accuracy_score = classifier.evaluate(input_fn=test_input_fn)["accuracy"]

print("\nTest Accuracy: {0:f}\n".format(accuracy_score))
```

Here, the accuracy is only 30% because my test set contains just 3 or 4 samples, but you get the idea.
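With so few test samples, the accuracy can only take a handful of values, so a single misclassification moves it a lot. A small sketch with made-up labels and predictions (not the actual results):

```python
# Made-up labels and predictions for a three-sample test set
y_true = [1, 0, 1]
y_pred = [0, 0, 0]

# Accuracy is the fraction of predictions that match the labels
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Every value the accuracy can take with this many samples
possible = [k / len(y_true) for k in range(len(y_true) + 1)]

print(accuracy, possible)
```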

Now that the model has been trained, let’s proceed to the prediction step. Again, we’ll define an input function using tf.estimator.inputs.numpy_input_fn, this time with y=None since we have no target column.

```
my_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": np.array(data_validation[features])},
    y=None,
    num_epochs=1,
    shuffle=False)

pred = classifier.predict(input_fn=my_input_fn)
```

Notice that the result (“pred”) is a generator:

```
pred
<generator object Estimator.predict at 0x0000022B625A26D0>
```

So we’ll convert it to a list, in which each element is a dict with the predicted class and probabilities:

```
predictions = list(pred)
predictions[0]  # look at the first element
```

```
{'class_ids': array([0], dtype=int64),
 'classes': array([b'0'], dtype=object),
 'logistic': array([ 0.12466089], dtype=float32),
 'logits': array([-1.94901419], dtype=float32),
 'probabilities': array([ 0.87533915,  0.12466089], dtype=float32)}
```
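The fields of this dictionary appear to be related in the usual way for a binary classifier: 'logistic' is the sigmoid of 'logits', 'probabilities' holds one minus that value followed by the value itself, and the predicted class is the index of the larger probability. A quick check using the numbers printed above:

```python
import math

# Logit copied from the prediction shown above
logit = -1.94901419

# Sigmoid of the logit: probability of class 1
logistic = 1.0 / (1.0 + math.exp(-logit))

# [P(class 0), P(class 1)]
probabilities = [1.0 - logistic, logistic]

# The predicted class is the index of the largest probability
class_id = probabilities.index(max(probabilities))

print(round(logistic, 5), class_id)
```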

To get only the predicted classes, I’ll loop through the list of dictionaries:

```
final_pred = np.array([])
for p in predictions:
    final_pred = np.append(final_pred, p['class_ids'])

# cast from float (the dtype of the empty array) to integer
final_pred = final_pred.astype(int)
```
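A more compact alternative is a list comprehension. This is only a sketch: the dictionaries below are hand-written stand-ins that mimic the 'class_ids' field of the real predictions.

```python
import numpy as np

# Hand-written stand-ins for the dictionaries returned by predict()
predictions = [
    {"class_ids": np.array([0])},
    {"class_ids": np.array([1])},
    {"class_ids": np.array([0])},
]

# Take the single class id of each prediction and build an integer array
final_pred = np.array([p["class_ids"][0] for p in predictions], dtype=int)

print(final_pred)
```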

And we’re done!

Let’s write the prediction file for submission.

```
result = pd.DataFrame(columns=["PassengerId", "Survived"])
result["PassengerId"] = data_validation['PassengerId']
result["Survived"] = final_pred
result.to_csv("Submission-tf1.csv", index=False)  # Do not forget to remove the index
```
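To see why the index has to be removed, here is a small sketch that writes a made-up result to memory instead of a file (the ids and predictions are illustrative). By default pandas writes the index as an extra unnamed first column, whereas the submission should contain only the two named columns.

```python
import io
import pandas as pd

# Illustrative ids and predictions (not real results)
result = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})

with_index = io.StringIO()
result.to_csv(with_index)                    # default: index written as first column

without_index = io.StringIO()
result.to_csv(without_index, index=False)    # only the two named columns

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))
print(repr(header_without))
```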

The final score is 0.78, which is very close to what I obtained using logistic regression or random forest with scikit-learn.

### Bonus: using K-fold

The dataset is rather small, so using K-fold cross-validation might be useful to evaluate the model more reliably:

```
from sklearn.model_selection import KFold
k_fold = KFold(n_splits=3, shuffle=True)

features = ["Pclass", "Sexe", "Famille", "Age", "LogFare"]
target = "Survived"
accuracy = np.array([])
for fold, (train_index, test_index) in enumerate(k_fold.split(data)):
    # model() reads the global X_train / y_train, so they are reassigned for each fold
    # (.loc with positional indices assumes data has the default 0..n-1 index)
    X_train = data.loc[train_index, features]
    y_train = data.loc[train_index, target]
    X_test = data.loc[test_index, features]
    y_test = data.loc[test_index, target]
    # hidden_units must be a list; one model_dir per fold so that each fold
    # starts from a fresh model instead of resuming the previous checkpoint
    classifier, train_input_fn = model([10, 20, 10], "./tmp/DNN20_fold{0}".format(fold), features)
    classifier.train(input_fn=train_input_fn, steps=3000)
    test_input_fn = testinput(X_test, y_test)
    accuracy_score = classifier.evaluate(input_fn=test_input_fn)["accuracy"]
    accuracy = np.append(accuracy, accuracy_score)
    print("\nTest Accuracy: {0:f}\n".format(accuracy_score))

np.mean(accuracy)  # typically between 0.78-0.82
```
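To make the splitting itself concrete, here is a rough, numpy-only sketch of what a shuffled 3-fold split does with 891 samples. kfold_indices is a hypothetical helper written for illustration; it is not the scikit-learn implementation, only an approximation of its behaviour.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed=0):
    """Hypothetical helper: yield (train_index, test_index) pairs."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)        # shuffle the row positions once
    folds = np.array_split(shuffled, n_splits)   # cut them into n_splits chunks
    for i, test_index in enumerate(folds):
        # Every chunk except the i-th one is used for training
        train_index = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_index, test_index

splits = list(kfold_indices(n_samples=891, n_splits=3))
for train_index, test_index in splits:
    print(len(train_index), len(test_index))
```

Each sample ends up in the test fold exactly once, so every row contributes to the averaged accuracy.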
