The Titanic Data Set And The Woman-Child Model – 82% Test Set Accuracy
September 25, 2018 By Pascal Schmidt Personal Project R
In this tutorial, I will be showing you how to achieve 82% accuracy with the titanic data set and the woman-child model.
There are two rules to the titanic data set and the woman-child model:
- We predict that all males die except for boys in families where most females and boys live.
- We predict that all females live except for females in families where most females and boys die.
The last tutorial, we achieved an accuracy of around 79%. We used a random forest algorithm, a logistic regression, k-nearest neighbors, and linear discriminant analysis. All of these models were decent but are not able to reach the women-child model.
The titanic data set and the women-child model can be created by only looking at the last names of passengers and also their corresponding ticket numbers. Let’s jump into our analysis.
The Titanic Data Set And The Woman-Child Model
First, we have to load in the data.
library(tidyverse) library(here) train <- read.csv(here::here("docs", "train.csv")) test <- read.csv(here::here("docs", "test.csv")) test$Survived <- NA titanic <- rbind(train, test) # combining data sets titanic %>% dplyr::mutate( Name = as.character(Name), Ticket = as.character(Ticket), Cabin = as.character(Cabin), Survived = as.factor(Survived), Pclass = as.factor(Pclass) ) -> titanic
Again, we are engineering a new variable called titles
like last time.
titanic$titles <- gsub("(.*\\,|\\..*)", "", titanic$Name) %>% gsub("[[:space:]]", "", .) titanic$last_name <- gsub("\\,.*", "", titanic$Name)
Sometimes, ticket numbers differ from each other in their last digits. For example, in the table below we can be sure that the first three passengers traveled together. This is because they embarked in Southampton and all are class 3. Their ticket number only differs in the last digits.
`{r results="asis"}
titanic %>%
dplyr::group_by(Ticket, Pclass, Embarked) %>%
dplyr::summarise() %>%
tail() %>%
pander::pandoc.table()
So what we will be doing is to substitute the last 2 digits or letters of a ticket with “XX”. Below is a little working example of how this works. We are grabbing the last 2 characters of the string with str_sub
from the stringr
package and then we are substituting it with “XX”. This will be done with the gsub
function.
x <- "hello" gsub(stringr::str_sub(x, -1), "X", x) ## [1] "hellX"
Now, we are applying this method to the entire Ticket
column in the titanic data set.
for(i in 1:length(titanic$Ticket)) { titanic$ticket_number[[i]] <- gsub(stringr::str_sub(titanic$Ticket[[i]], -2), "XX", titanic$Ticket[[i]]) }
Before we are continuing with our analysis, we will be throwing out all rows with passengers who are male, except for children.
titanic %>% dplyr::filter(Sex == "female" | titles == "Master") %>% dplyr::group_by(last_name, titles, ticket_number, Sex, Survived, PassengerId) %>% dplyr::summarise() %>% dplyr::select(last_name, titles, ticket_number, Sex, Survived, PassengerId) -> gender_model head(gender_model) %>% pander::pandoc.table()
After we have ordered the data, we have to identify families now. We do that by comparing the last_name
and ticket_number
columns. When either the preceding or following ticket_number
and last_name
are identical, then the familyID will be family
. Otherwise, no family
. Afterwards, we have to throw out passengers who traveled alone.
gender_model$familyID <- NA gender_model$familyID[[1]] <- "family" for(i in 2:(length(gender_model$last_name) - 1)) { if ((gender_model$last_name[[i]] == gender_model$last_name[[i + 1]] | gender_model$last_name[[i]] == gender_model$last_name[[i - 1]]) & (gender_model$ticket_number[[i]] == gender_model$ticket_number[[i + 1]] | gender_model$ticket_number[[i]] == gender_model$ticket_number[[i - 1]])) { gender_model$familyID[[i]] <- "family" } else { gender_model$familyID[[i]] <- "no family" } } gender_model %>% dplyr::filter(familyID == "family") -> gender_model length(unique(gender_model$last_name)) ## [1] 81
We are ending up with 81 unique families.
Visualizing The Titanic Data Set And The Woman-Child Model With ggplot
ggplot(gender_model, aes(x = last_name, fill = Survived)) + geom_bar() + ylab("Family Name") + ggtitle("All 81 Women-Child Groups") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
The plot makes the power of the women-child model visible. Either all family members die or survive. Only family Alison and family Asplund have family members who died and survived. For other families, either all members survived or died.
What we will be doing next is labeling who survived and died. All NA values for families who all survived will be predicted to have survived as well. All NA values for families who all died will be predicted to have died as well. For the family Alison and Asplund we are assigning the survived and died labels to the NA values based on the majority voting. If the majority of the family survived, then the NA values will be substituted with survived (1). If the majority died, then the NA value will be substituted with died (0).
For the families Gibson, Klasen, and Peacock, all values are missing.
gender_model$Survived <- as.numeric(as.character(gender_model$Survived)) gender_mod <- split(gender_model, gender_model$last_name) for (i in 1:length(gender_mod)) { if (is.nan(mean(gender_mod[[i]]$Survived, na.rm = TRUE))) { gender_mod[[i]]$Survived[is.na(gender_mod[[i]]$Survived)] <- NA } else if (mean(gender_mod[[i]]$Survived, na.rm = TRUE) == 0) { gender_mod[[i]]$Survived[is.na(gender_mod[[i]]$Survived)] <- 0 } else if (mean(gender_mod[[i]]$Survived, na.rm = TRUE) == 1) { gender_mod[[i]]$Survived[is.na(gender_mod[[i]]$Survived)] <- 1 } else if (mean(gender_mod[[i]]$Survived, na.rm = TRUE) >= 0.5) { gender_mod[[i]]$Survived[is.na(gender_mod[[i]]$Survived)] <- 1 } else { gender_mod[[i]]$Survived[is.na(gender_mod[[i]]$Survived)] <- 0 } } gender_mod <- do.call(rbind, gender_mod)
plot_pclass <- ggplot(titanic[1:891, ], aes(x = Pclass, fill = Survived)) + geom_bar() plot_pclass
titanic %>% dplyr::filter(last_name %in% c("Klasen", "Peacock", "Gibson")) %>% dplyr::select(last_name, Pclass) %>% pander::pandoc.table()
gender_mod$Survived[gender_mod$last_name %in% c("Klasen", "Peacock")] <- 0 gender_mod$Survived[gender_mod$last_name == "Gibson"] <- 1
Because the Klasen family and the Peacock family are class 3 passengers, we are predicting that they died. Because the Gibson family traveled in class 1, we are predicting that they survived.
Now, we have to merge our results back into our original titanic data set. We can do that with a double for loop. However, this takes some time.
system.time( for(i in 1:nrow(gender_mod)) { for(j in 1:nrow(titanic)) { if (gender_mod[["PassengerId"]][[i]] == titanic[["PassengerId"]][[j]]) { titanic[["Survived"]][[j]] <- gender_mod[["Survived"]][[i]] } } } ) ## user system elapsed ## 3.12 0.00 3.14
So, if we want to save time we can also vectorize our operation like in the code below.
gender_mod_survived <- gender_mod %>% filter(Survived == 1) gender_mod_died <- gender_mod %>% filter(Survived == 0) titanic$Survived[titanic$PassengerId %in% gender_mod_survived$PassengerId] <- gender_mod_survived$Survived titanic$Survived[titanic$PassengerId %in% gender_mod_died$PassengerId] <- gender_mod_died$Survived
Now, that our gender_model
results are back in our original data frame, we will be predicting that every male passenger whose survival is still unknown will die and every female who’s survival is still unknown will live.
The Titanic Data Set And The Woman-Child Model – Submission
After that, we will be sending in our predictions.
titanic$Survived[is.na(titanic$Survived) & titanic$Sex == "male"] <- 0 titanic$Survived[is.na(titanic$Survived) & titanic$Sex == "female"] <- 1 sub <- data.frame(PassengerId = 892:1309, Survived = titanic[892:1309, ]$Survived) write.csv(sub, here::here("docs", "gender_model.csv"), row.names = FALSE)
Heya! Almost 82%. That is fantastic. The women-child model is way less time consuming than all the model building in part 1. It is easy, straight forward but still very powerful. So powerful that it gives us around 3% better accuracy on the test set.
Other interesting Kernels on Kaggle that you might like can be explored with the links below:
I hope you have enjoyed the second part of the tutorial and if you have any questions, you can write them in the comments below.
Recent Posts
Recent Comments
- Kardiana on The Lasso – R Tutorial (Part 3)
- Pascal Schmidt on RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium
- Pascal Schmidt on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications
- Gisa on Persistent Data Storage With a MySQL Database in R Shiny – An Example App
- Nicholas on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications