Titanic Data Set – How I increased my score from 79% to 82%

The Titanic data set is often called the starting point for every aspiring data scientist. So two years ago, after an econometrics course at university had introduced me to R, I sat down to give the competition a shot. My goal was to achieve an accuracy of 80% or higher.

However, my first try ended with me producing some summary statistics and trying to solve the challenge, which is a classification problem, with a linear regression model 🙂. In short, I misinterpreted what the data science community meant by “a competition for beginners”.

Now, two years later, I tried my luck again. Knowing about training and testing data sets, overfitting, cross-validation, the bias-variance trade-off, regular expressions, and different classification models hopefully makes me better prepared this time. After almost completing a statistics degree, spending countless hours on Coursera, DataCamp, and Stack Overflow, and with a data science internship under my belt, I finally declared myself a “beginner” in the data science community and ready for the Titanic Kaggle competition.

Long story short, don’t be discouraged if you hear the words “beginner competition”, but are not yet able to understand everything or produce the results other people showcase in their kernels.

This blog post is an attempt to make the Titanic data set easier to understand. Throughout it, I will be explaining my code.

What we will be covering in this blog post:

  • An Exploratory Data Analysis with ggplot2 and dplyr.
  • Feature Engineering for some Variables.
  • Dealing with Missing Values.
  • Model Building and Model Evaluation:
    1. Logistic Regression
    2. Random Forest
    3. Linear Discriminant Analysis
    4. K-Nearest Neighbors

In the end, I ended up with a score of around 79%. That was pretty disappointing, because I didn’t achieve my initial goal of 80%. This is why, in the second part of this tutorial, I evaluate the so-called gender model, with which I achieved a score of 81.82%.

Let’s get started!

First, we load the libraries we need. The tidyverse consists of various packages (dplyr, ggplot2, etc.) and is perfect for data manipulation. The ggpubr package is nice for visualizations and gives us some extra flexibility. The arsenal package makes it easy to create nice-looking tables. The other packages are for building predictive models.

After loading the packages, we load the data sets. In order to rbind() them, the train and test data sets must have the same columns. That is why I am creating the Survived column in the test data set. After that, I am transforming some variables to characters and factors with mutate().
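To make the combining step concrete, here is a minimal sketch in base R. The two small data frames are invented stand-ins for train.csv and test.csv, and I use plain assignments instead of mutate() so that the example runs without any packages.

```r
# Toy stand-ins for the train and test files (invented values, not the real data)
train <- data.frame(PassengerId = 1:3,
                    Survived    = c(0, 1, 1),
                    Pclass      = c(3, 1, 3),
                    Sex         = c("male", "female", "female"),
                    stringsAsFactors = FALSE)
test  <- data.frame(PassengerId = 4:5,
                    Pclass      = c(3, 2),
                    Sex         = c("male", "female"),
                    stringsAsFactors = FALSE)

# The test set has no outcome, so add a placeholder column before stacking
test$Survived <- NA

# rbind() matches data frame columns by name, so the column order may differ
full <- rbind(train, test)

# Convert the categorical variables to factors
full$Pclass   <- factor(full$Pclass)
full$Sex      <- factor(full$Sex)
full$Survived <- factor(full$Survived)

str(full)
```

Keeping train and test in one data frame means every later transformation is applied to both at once. The rows with a missing Survived value are the ones to predict.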

Exploratory Data Analysis of the Titanic Data Set

Investigating Gender for the Titanic Data Set

For all my plots, I am using ggplot2. If you are unfamiliar with the syntax, the R for Data Science book, DataCamp, and the ggplot2 cheat sheet are great resources which you can refer to.

titanic data set

It looks like most of the female Titanic passengers survived and most of the male passengers died. This becomes especially visible when looking at the percentages in the right plot: about 75% of all female passengers survived, whereas less than 25% of male passengers did.

This is a crucial finding and the key to the 81.82% score of the gender model I am discussing in part 2.
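The percentages behind such a plot can be computed with a two-way table. The sketch below uses a tiny invented data frame (eight made-up passengers, not the real data), so the numbers only illustrate the calculation.

```r
# Invented toy data: 4 women and 4 men
toy <- data.frame(
  Sex      = c("female", "female", "female", "female",
               "male", "male", "male", "male"),
  Survived = c(1, 1, 1, 0, 0, 0, 0, 1)
)

# Counts of died (0) and survived (1) per gender
counts <- table(toy$Sex, toy$Survived)

# Row-wise percentages: within each gender, what share died or survived?
pct <- round(100 * prop.table(counts, margin = 1), 1)
print(pct)
```

Using margin = 1 makes each row sum to 100, which is what the percentage plot shows for each gender.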

Investigating Gender and Class for the Titanic Data Set

Almost all female passengers in classes 1 and 2 survived. For male passengers, the passenger class is a much weaker predictor of survival: regardless of class, they do not really seem to benefit much from being in a higher class.

Investigating Age, Fare, and Embarked for the Titanic Data Set

 

titanic data set fare
  • For the Age variable, we can see that younger children are more likely to survive. From around 0 – 10, the survival chances are pretty good.
  • There are a lot of fares around 10 pounds (the fares in this data set are recorded in pounds). People who paid this amount had really bad survival chances. It seems like the more expensive the fare, the better the survival chances are.
  • The third plot shows where people embarked. This plot does not say a lot about survival. We can see however, that we have some missing values there and most people came on board in “S”, which stands for Southampton.

Investigating The Titles of Passengers for the Titanic Data Set

Next, we do some feature engineering. This means we derive new variables from variables that already exist in the data set, so that the new variables have more explanatory power for predicting who survived and who died. Name is an example of a variable we can derive from. To do that, it helps to have a basic understanding of regular expressions. When I first looked at other people’s kernels, I had no idea how they got the titles out of the Name variable. The code looks pretty complicated and it takes some time to get used to regular expressions.

When we look at the output of the head function, then we see that every passenger name starts with their last name, followed by a comma, followed by their title with a dot, and then their first names. We are going to extract the title with the gsub() function.

Here is a little working example. In order to extract the title of my name from the string below we are saying that everything before and including the comma should be removed and then everything after and including the dot should be removed as well.

We do that with the code below. The .*, part matches everything up to and including the comma, so that it gets removed. I put two backslashes (\) in front of the comma; strictly speaking, the comma is not a special character in a regular expression, so escaping it is harmless but not required. Then we have the “or” bar (|). After that we need the two backslashes again, and this time they are necessary: an unescaped dot matches any character, so the literal dot after the title has to be escaped (two backslashes, because the backslash itself must be escaped inside an R string). The .* that follows removes everything after the dot. So what we are doing is replacing “Schmidt,” with nothing and then the dot and “ Pascal David Fabian” with nothing as well. And voilà, we are left with the title only.
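Here is a minimal sketch of that idea. The title “Mr” in the example string is an assumption on my part, and the pattern is one common way to write this regular expression (without escaping the comma), not necessarily character for character the one described above.

```r
# Name in the "Last name, Title. First names" layout; the title "Mr" is assumed
name <- "Schmidt, Mr. Pascal David Fabian"

# "(.*, )"   matches everything up to and including the comma and the space
# "(\\..*)"  matches a literal dot and everything after it
# "|"        means "either of the two"; gsub() replaces every match with ""
title <- gsub("(.*, )|(\\..*)", "", name)

print(title)
```

gsub() replaces all matches, so both the part before the title and the part after it are removed in a single call.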

We do that for the entire Name vector, saving the titles in the titles object, and then displaying how many titles there are with the table() function.
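Because gsub() is vectorized, the same call works on a whole vector of names. A small sketch with four names in the data set's format (in the real analysis the input would be the full Name column):

```r
# A few passenger names in the "Last name, Title. First names" layout
names_vec <- c("Braund, Mr. Owen Harris",
               "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
               "Heikkinen, Miss. Laina",
               "Allen, Mr. William Henry")

# Strip everything except the title from every element at once
titles <- gsub("(.*, )|(\\..*)", "", names_vec)

# Count how often each title occurs
title_counts <- table(titles)
print(title_counts)
```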

titanic data set titles

After that, we are going to do some visualizations. Of course in ggplot2 again!

titanic data set titles

We do some data manipulation with dplyr in order to see the percentage of survived versus died passengers for each unique title. Then we plot the result. Ouch, a lot of misters died! That’s actually quite interesting. Mrs and Miss do pretty well. So let’s group some of the titles together. We do that with the code below.

titanic table titles
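As an illustration of what such a grouping can look like, here is a sketch in base R. Which titles get merged is purely my assumption for this example (variant spellings are folded into Miss and Mrs, everything rare becomes “Other”); the exact groups used in the analysis may differ.

```r
# A toy vector of extracted titles
titles <- c("Mr", "Mrs", "Miss", "Master", "Mlle", "Ms", "Mme", "Dr", "Rev", "Col")

# Assumed grouping: fold variant spellings into the common titles
titles[titles %in% c("Mlle", "Ms")] <- "Miss"
titles[titles == "Mme"]             <- "Mrs"

# Assumed grouping: everything that is not one of the four common titles becomes "Other"
common <- c("Mr", "Mrs", "Miss", "Master")
titles[!titles %in% common] <- "Other"

grouped_counts <- table(titles)
print(grouped_counts)
```

Fewer, larger groups give the model more observations per level, which usually makes the estimated effects more stable.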

Investigating Cabin Numbers for the Titanic Data Set

A lot of cabin numbers are missing. This is really unfortunate because, based on the visualization, I think our final model could have benefited from knowing the correct cabin number of every single passenger.

titanic data set cabin

The letter at the start of a cabin number indicates the deck on which the passenger had their cabin. Survival is pretty good for decks B, D, and E. A lot of passengers with unknown cabin numbers died. Unfortunately, there are too many missing values in the Cabin variable.

Next, we investigate the family sizes of passengers. SibSp is the number of siblings or spouses a passenger had on board the Titanic. Parch is the number of parents or children on board. So we engineer a variable called family_size, which is the sum of Parch, SibSp, and the passenger themself.
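In code, this derived variable is a one-liner. A minimal sketch with a few invented rows:

```r
# Invented example rows: siblings/spouses and parents/children on board
full <- data.frame(SibSp = c(1, 0, 3, 0),
                   Parch = c(0, 0, 2, 1))

# Family size = siblings/spouses + parents/children + the passenger
full$family_size <- full$SibSp + full$Parch + 1

print(full)
```

A passenger travelling alone therefore has a family_size of 1, not 0.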

Investigating Survival Chances of Families for the Titanic Data Set

titanic data set family size

It seems like survival is highest for couples and parents with 1-3 children. For larger families and people travelling alone, survival chances do not seem to be great. Because larger families do not do well, we group them together.
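One way to do such a grouping is with cut(). The cut-offs below (1 = alone, 2 to 4 = small, 5 or more = large) are placeholders I picked for this sketch, not necessarily the ones used in the analysis.

```r
# Invented family sizes
family_size <- c(1, 2, 4, 5, 8)

# Assumed cut-offs: (0,1] alone, (1,4] small, (4,Inf) large
family_group <- cut(family_size,
                    breaks = c(0, 1, 4, Inf),
                    labels = c("alone", "small", "large"))

print(data.frame(family_size, family_group))
```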

Dealing With Missing Values

Age

First, we identify which rows of the Titanic data set have a missing Age. Afterwards, we construct a for loop which imputes every missing age with the median age of the passenger’s title.
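The loop can be sketched as follows. The data are invented, and the column name titles is assumed to hold the extracted titles; the medians are computed once from the observed ages, so imputed values never feed back into them.

```r
# Invented toy data with two missing ages
full <- data.frame(
  titles = c("Mr", "Mr", "Mr", "Miss", "Miss", "Miss"),
  Age    = c(30, 40, NA, 20, NA, 24)
)

# Median observed age per title
title_medians <- tapply(full$Age, full$titles, median, na.rm = TRUE)

# Rows with a missing age
missing_age <- which(is.na(full$Age))

# Fill each missing age with the median age of that passenger's title
for (i in missing_age) {
  full$Age[i] <- unname(title_medians[as.character(full$titles[i])])
}

print(full)
```

The idea behind using the title is that it carries age information: a Master is a young boy, while a Mr is an adult man.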

Embarked

titanic data set embarked

The mode of the Embarked variable is Southampton. Therefore, we are substituting the empty string of the two passengers with an S.
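A minimal sketch of this mode imputation, using an invented Embarked vector in which missing ports are stored as empty strings:

```r
# Invented toy vector; "" marks a missing port of embarkation
embarked <- c("S", "C", "", "S", "Q", "S", "")

# Find the most frequent non-missing port (the mode)
port_counts <- table(embarked[embarked != ""])
mode_port   <- names(port_counts)[which.max(port_counts)]

# Replace the empty strings with the mode
embarked[embarked == ""] <- mode_port

print(embarked)
```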

Fare

titanic data set missing fare

In order to impute the one missing value for Fare, we look at the median fare of class 3 and impute the missing value with it.
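Sketched in base R with invented fares, the imputation looks like this. It assumes, as in the text, that the only missing fare belongs to a third-class passenger.

```r
# Invented toy data: one first-class and four third-class passengers
full <- data.frame(Pclass = c(1, 3, 3, 3, 3),
                   Fare   = c(80, 7.25, 8.05, NA, 7.90))

# Median fare among third-class passengers, ignoring the missing value
third_class_median <- median(full$Fare[full$Pclass == 3], na.rm = TRUE)

# Fill the single missing fare (assumed to be a third-class passenger)
full$Fare[is.na(full$Fare)] <- third_class_median

print(full)
```

The median is a safer choice than the mean here because fares are strongly right-skewed.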

Model Building

Logistic Regression

As we can see from the output above, there are some highly significant variables. Moreover, some of the titles are statistically significant too. We did a good job in creating the titles variable. Let’s see what prediction accuracy we get for this model.
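For readers who have not fitted a logistic regression in R before, here is a self-contained sketch. The data are simulated with invented effect sizes, so the coefficients and the accuracy say nothing about the real Titanic data; the point is only the glm() and predict() workflow.

```r
set.seed(1)
n <- 200

# Simulated passengers (invented data)
toy <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Age    = round(runif(n, 1, 70))
)

# Simulated outcome: women and higher classes survive more often
lin <- 1.5 - 2.5 * (toy$Sex == "male") -
  0.8 * (as.integer(toy$Pclass) - 1) - 0.02 * toy$Age
toy$Survived <- rbinom(n, 1, plogis(lin))

# Logistic regression: family = binomial gives the logit link
fit <- glm(Survived ~ Sex + Pclass + Age, data = toy, family = binomial)
print(summary(fit)$coefficients)  # estimates, standard errors, z and p values

# Predicted survival probabilities, turned into 0/1 with a 0.5 threshold
prob <- predict(fit, newdata = toy, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

train_accuracy <- mean(pred == toy$Survived)
print(train_accuracy)
```

Note that this is the accuracy on the training data; the score Kaggle reports is computed on the test set, where Survived is unknown to us.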

titanic data set submission

One thing that is worrisome about our model is the high standard errors for the Sex and titles variables. This means that the coefficient estimates are not precise. How can we diagnose the cause? With the variance inflation factor (VIF), which measures how much the variance of a coefficient estimate is inflated by correlation among the predictors. A VIF value of 5 or below is usually regarded as acceptable.

When we check for multicollinearity among the predictors we can see that the VIF value for the Sex variable is very high. This means that there is multicollinearity in our model which is responsible for inflated standard errors. So, how can we deal with that?
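To show what the VIF actually measures, the sketch below computes it by hand on simulated data: each predictor is regressed on all the others, and the VIF is 1 / (1 - R²) of that regression. The variables x1, x2, and x3 are invented for this illustration. In practice a packaged function such as vif() from the car package does this for you, and for factor predictors it reports a generalized version.

```r
set.seed(42)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)  # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                 # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# VIF of a predictor = 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all the other predictors
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(round(vif_by_hand, 1))
```

The two collinear predictors get VIF values far above 5, while the independent one stays close to 1.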

One commonly suggested remedy is to standardize the variables with a high variance inflation factor, although this mainly helps when the collinearity comes from interaction or polynomial terms. If this does not yield success, then throwing them out of the model is another solution. For our model, we decide to throw away the Sex variable, since the titles variable already encodes gender (Mr versus Mrs and Miss).

After having done that, you will realize that there is still multicollinearity in our model. The VIF values for Parch, SibSp, and family_size are still high. This is not really a surprise. Remember, we derived the family_size variable from the SibSp and Parch variables. We decide to throw away our derived variable family_size and then have another look at the VIF values.

Now, that looks much better! No single variable is above 5 anymore. In addition to that, our model output looks much better now.

As we can see, the standard errors of the coefficients are no longer high, so the estimates have become more precise. Age, Parch, and SibSp became more statistically significant as well. Let’s make another submission!

titanic data set submission

Wow! Our model has improved by only doing some variable selection.

Random Forest

titanic data set random forest

Let’s send in our predictions!

titanic data set random forest prediction

The random forest model has the same prediction accuracy as the logistic regression model. Why is that?

A random forest copes well with multicollinearity. In the code above, we specified that we want to build 1000 trees. Each tree is grown on a bootstrap sample of the data, and at each split only a random subset of the predictors is considered. For classification, the randomForest() function by default considers about √p of the p predictors at each split. Because strongly correlated predictors can no longer dominate every tree, the trees become less similar to each other, which reduces the variance of the averaged prediction. In the book “An Introduction to Statistical Learning”, it says:

“We can think of this process (only taking a subset of predictors) as decorrelating the trees, thereby making the average of the resulting trees less variable and hence more reliable.”

Knowing how a random forest works, it is not surprising that it achieves good accuracy even with a data set with highly correlated predictors such as the Titanic one.
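The “subset of predictors at each split” is controlled by the mtry argument of randomForest(). The sketch below shows the default for classification, floor(sqrt(p)), and, only if the randomForest package is installed, fits a forest on simulated data. The number of predictors and the data are invented for this example.

```r
p <- 9                          # hypothetical number of predictors
mtry_default <- floor(sqrt(p))  # predictors considered at each split
print(mtry_default)

# Optional part: only runs when the randomForest package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  n   <- 200
  toy <- data.frame(matrix(rnorm(n * p), ncol = p))  # columns X1 ... X9
  toy$y <- factor(ifelse(toy$X1 + toy$X2 + rnorm(n) > 0, "survived", "died"))

  rf <- randomForest::randomForest(y ~ ., data = toy,
                                   ntree = 1000, mtry = mtry_default)
  print(rf$mtry)  # the value that was actually used
}
```

Setting mtry equal to p would turn the random forest into plain bagging, where every split sees all predictors and the trees become more alike.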

Linear Discriminant Analysis

I already explained the theory behind LDA and also provided a practical example. So, in this blog post, we are only looking at the prediction accuracy to see whether the model outperforms the previous ones.
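As a quick reminder of the workflow, here is a sketch with lda() from the MASS package on simulated data. The two predictors and their distributions are invented, so the accuracy is not comparable to the Titanic score.

```r
library(MASS)  # provides lda(); ships with standard R installations

set.seed(7)
n <- 200

# Simulated data: survivors are younger and paid more (invented effect)
toy <- data.frame(
  Age      = c(rnorm(n / 2, 25, 8),  rnorm(n / 2, 40, 8)),
  Fare     = c(rnorm(n / 2, 60, 15), rnorm(n / 2, 20, 15)),
  Survived = factor(rep(c(1, 0), each = n / 2))
)

# Hold out 50 rows to estimate the prediction accuracy
train_idx <- sample(n, 150)

fit  <- lda(Survived ~ Age + Fare, data = toy[train_idx, ])
pred <- predict(fit, newdata = toy[-train_idx, ])$class

holdout_accuracy <- mean(pred == toy$Survived[-train_idx])
print(holdout_accuracy)
```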

titanic data set submission lda

Wow, our best prediction so far!

K-Nearest Neighbors

The last algorithm that we are going to evaluate is k-nearest neighbors. The critical part of this algorithm is choosing its flexibility through the number of neighbors k. The fewer neighbors, the more flexible the algorithm; the more neighbors, the less flexible it is.

We determine the best number of neighbors for the Titanic data set with the caret package and cross-validation.

knn Cross Validation

As we can see in the plot above, the best accuracy is achieved by 6 neighbors. Therefore, we will choose k = 6 in our knn algorithm from the class package.
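The idea of picking k by cross-validation can be sketched without caret. Below I swap in leave-one-out cross-validation with knn.cv() from the class package, which is a different procedure from the caret setup described above, and the data are simulated, so the selected k has nothing to do with the 6 found for the Titanic data.

```r
library(class)  # provides knn() and knn.cv(); ships with standard R installations

set.seed(3)
n <- 200

# Simulated, standardized predictors (knn is distance based, so scale matters)
X <- scale(cbind(Age = rnorm(n, 30, 12), Fare = rexp(n, 1 / 30)))
y <- factor(ifelse(X[, "Fare"] - 0.5 * X[, "Age"] + rnorm(n, sd = 0.7) > 0, 1, 0))

# Leave-one-out cross-validated accuracy for each candidate k
ks <- 1:15
cv_accuracy <- sapply(ks, function(k) mean(knn.cv(train = X, cl = y, k = k) == y))
best_k <- ks[which.max(cv_accuracy)]
print(best_k)

# Final predictions with the chosen k (here simply for the first five rows)
new_passengers <- X[1:5, , drop = FALSE]
pred <- knn(train = X, test = new_passengers, cl = y, k = best_k)
print(pred)
```

A very small k follows the training data closely (high variance), while a very large k smooths too much (high bias). Cross-validation looks for the k in between that predicts unseen data best.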

titanic data set knn submission

All of our models give pretty much the same accuracy. Still, two models come out ahead for the Titanic data set: our LDA model and our knn model give the best accuracy.

Unfortunately, we have not yet achieved an accuracy of 80% or higher. In my next blog post, we will. After some research, I came across the gender model, which will boost our accuracy to 82%.

I hope you have enjoyed this blog post about the Titanic data set. If you have any questions, let me know in the comments below. Thank you.