September 25, 2018 By Pascal Schmidt Personal Project R

In this tutorial, I will be showing you how to achieve 82% accuracy with the titanic data set and the woman-child model.

There are two rules to the titanic data set and the woman-child model:

We predict that all males die except for boys in families where most females and boys live.
We predict that all females live except for females in families where most females and boys die.

The last tutorial, we achieved an accuracy of around 79%. We used a random forest algorithm, a logistic regression, k-nearest neighbors, and linear discriminant analysis. All of these models were decent but are not able to reach the women-child model.

The titanic data set and the women-child model can be created by only looking at the last names of passengers and also their corresponding ticket numbers. Let’s jump into our analysis.

The Titanic Data Set And The Woman-Child Model

First, we have to load in the data.

library(tidyverse)
library(here)

train <- read.csv(here::here("docs", "train.csv"))
test <- read.csv(here::here("docs", "test.csv"))
test$Survived <- NA
titanic <- rbind(train, test) # combining data sets

titanic %>%
  dplyr::mutate(
    Name = as.character(Name),
    Ticket = as.character(Ticket),
    Cabin = as.character(Cabin),
    Survived = as.factor(Survived),
    Pclass = as.factor(Pclass)
  ) -> titanic

Again, we are engineering a new variable called titles like last time.

titanic$titles <- gsub("(.*\\,|\\..*)", "", titanic$Name) %>%
  gsub("[[:space:]]", "", .)

titanic$last_name <- gsub("\\,.*", "", titanic$Name)

Sometimes, ticket numbers differ from each other in their last digits. For example, in the table below we can be sure that the first three passengers traveled together. This is because they embarked in Southampton and all are class 3. Their ticket number only differs in the last digits.

`{r results="asis"}
titanic %>%
  dplyr::group_by(Ticket, Pclass, Embarked) %>%
  dplyr::summarise() %>%
  tail() %>%
  pander::pandoc.table()

So what we will be doing is to substitute the last 2 digits or letters of a ticket with “XX”. Below is a little working example of how this works. We are grabbing the last 2 characters of the string with str_sub from the stringr package and then we are substituting it with “XX”. This will be done with the gsub function.

x <- "hello"
gsub(stringr::str_sub(x, -1), "X", x)


## [1] "hellX"

Now, we are applying this method to the entire Ticket column in the titanic data set.

for(i in 1:length(titanic$Ticket)) {
  titanic$ticket_number[[i]] <- gsub(stringr::str_sub(titanic$Ticket[[i]], -2), 
                                     "XX", 
                                     titanic$Ticket[[i]])
}

Before we are continuing with our analysis, we will be throwing out all rows with passengers who are male, except for children.

titanic %>%
  dplyr::filter(Sex == "female" | titles == "Master") %>%
  dplyr::group_by(last_name, titles, ticket_number, Sex, Survived, PassengerId) %>%
  dplyr::summarise() %>%
  dplyr::select(last_name, titles, ticket_number, Sex, Survived, PassengerId) -> gender_model

head(gender_model) %>%
  pander::pandoc.table()

After we have ordered the data, we have to identify families now. We do that by comparing the last_name and ticket_number columns. When either the preceding or following ticket_number and last_name are identical, then the familyID will be family. Otherwise, no family. Afterwards, we have to throw out passengers who traveled alone.

gender_model$familyID <- NA
gender_model$familyID[[1]] <- "family"
for(i in 2:(length(gender_model$last_name) - 1)) {
  if ((gender_model$last_name[[i]] == gender_model$last_name[[i + 1]] |
      gender_model$last_name[[i]] == gender_model$last_name[[i - 1]]) &
      (gender_model$ticket_number[[i]] == gender_model$ticket_number[[i + 1]] |
      gender_model$ticket_number[[i]] == gender_model$ticket_number[[i - 1]])) {
    gender_model$familyID[[i]] <- "family"
  } else {
        gender_model$familyID[[i]] <- "no family"
      }
}

gender_model %>%
  dplyr::filter(familyID == "family") -> gender_model
length(unique(gender_model$last_name))

## [1] 81

We are ending up with 81 unique families.

Visualizing The Titanic Data Set And The Woman-Child Model With ggplot

ggplot(gender_model, aes(x = last_name, fill = Survived)) + 
  geom_bar() +
  ylab("Family Name") +
  ggtitle("All 81 Women-Child Groups") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()

The plot makes the power of the women-child model visible. Either all family members die or survive. Only family Alison and family Asplund have family members who died and survived. For other families, either all members survived or died.

What we will be doing next is labeling who survived and died. All NA values for families who all survived will be predicted to have survived as well. All NA values for families who all died will be predicted to have died as well. For the family Alison and Asplund we are assigning the survived and died labels to the NA values based on the majority voting. If the majority of the family survived, then the NA values will be substituted with survived (1). If the majority died, then the NA value will be substituted with died (0).

For the families Gibson, Klasen, and Peacock, all values are missing.

gender_model$Survived <- as.numeric(as.character(gender_model$Survived))
gender_mod <- split(gender_model, gender_model$last_name)


for (i in 1:length(gender_mod)) {
  if (is.nan(mean(gender_mod[[i]]$Survived, na.rm = TRUE))) {
    gender_mod[[i]]$Survived[is.na(gender_mod[[i]]$Survived)] <- NA
  } else if (mean(gender_mod[[i]]$Survived, na.rm = TRUE) == 0) {
    gender_mod[[i]]$Survived[is.na(gender_mod[[i]]$Survived)] <- 0
  } else if (mean(gender_mod[[i]]$Survived, na.rm = TRUE) == 1) {
    gender_mod[[i]]$Survived[is.na(gender_mod[[i]]$Survived)] <- 1
  } else if (mean(gender_mod[[i]]$Survived, na.rm = TRUE) >= 0.5) {
    gender_mod[[i]]$Survived[is.na(gender_mod[[i]]$Survived)] <- 1 
  } else {
    gender_mod[[i]]$Survived[is.na(gender_mod[[i]]$Survived)] <- 0 
  }
}

gender_mod <- do.call(rbind, gender_mod)

plot_pclass <- ggplot(titanic[1:891, ], aes(x = Pclass, fill = Survived)) + 
  geom_bar()
plot_pclass

titanic %>%
  dplyr::filter(last_name %in% c("Klasen", "Peacock", "Gibson")) %>%
  dplyr::select(last_name, Pclass) %>%
  pander::pandoc.table()

gender_mod$Survived[gender_mod$last_name %in% c("Klasen", "Peacock")] <- 0
gender_mod$Survived[gender_mod$last_name == "Gibson"] <- 1

Because the Klasen family and the Peacock family are class 3 passengers, we are predicting that they died. Because the Gibson family traveled in class 1, we are predicting that they survived.

Now, we have to merge our results back into our original titanic data set. We can do that with a double for loop. However, this takes some time.

system.time(
for(i in 1:nrow(gender_mod)) {
  for(j in 1:nrow(titanic)) {
    if (gender_mod[["PassengerId"]][[i]] == titanic[["PassengerId"]][[j]]) {
      titanic[["Survived"]][[j]] <- gender_mod[["Survived"]][[i]]
    }
  }
}
)


##    user  system elapsed 
##    3.12    0.00    3.14

So, if we want to save time we can also vectorize our operation like in the code below.

gender_mod_survived <- gender_mod %>%
  filter(Survived == 1)
gender_mod_died <- gender_mod %>%
  filter(Survived == 0)

titanic$Survived[titanic$PassengerId %in% gender_mod_survived$PassengerId] <- gender_mod_survived$Survived
titanic$Survived[titanic$PassengerId %in% gender_mod_died$PassengerId] <- gender_mod_died$Survived

Now, that our gender_model results are back in our original data frame, we will be predicting that every male passenger whose survival is still unknown will die and every female who’s survival is still unknown will live.

The Titanic Data Set And The Woman-Child Model – Submission

After that, we will be sending in our predictions.

titanic$Survived[is.na(titanic$Survived) & titanic$Sex == "male"] <- 0
titanic$Survived[is.na(titanic$Survived) & titanic$Sex == "female"] <- 1


sub <- data.frame(PassengerId = 892:1309, Survived = titanic[892:1309, ]$Survived)
write.csv(sub, here::here("docs", "gender_model.csv"), row.names = FALSE)

titanic data set gender model submission

Heya! Almost 82%. That is fantastic. The women-child model is way less time consuming than all the model building in part 1. It is easy, straight forward but still very powerful. So powerful that it gives us around 3% better accuracy on the test set.

Other interesting Kernels on Kaggle that you might like can be explored with the links below:

I hope you have enjoyed the second part of the tutorial and if you have any questions, you can write them in the comments below.

Tags: data analysis Kaggle machine learning R Titanic