More dplyr – How to Get Your Data Into the Right Shape

Last time we looked at the basic verbs of the tidyverse and this time we will be looking at some more verbs that make data munging/shaping a lot easier. This time we will be covering:

  • select_if()/all()
  • filter_if()/at()
  • mutate_if()/at()
  • between()
  • na_if()
  • case_when()
  • pull()

Again, we will be looking at the Pokemon data set.

library(tidyverse)
poke <- read.csv("Pokemon.csv") %>%
  dplyr::select(-X.)

dplyr’s select_if()/all() Function

Personally, I rarely use the select + suffix functions, however, they can sometimes be very handy. The select_if() function selects certain variables of a data set based on a condition. I personally use select_if() in combination with a correlation matrix. As we know, a correlation matrix can only be produced if the variables are all numeric. Hence, we are going to only select all the numeric variables in our data set.

Instead of subsetting the data frame and looking for numeric variables (a real pain), we do:

poke %>%
  dplyr::select_if(is.numeric) %>%
  stats::cor() %>%
  head(2)

          Total        HP    Attack   Defense   Sp..Atk   Sp..Def     Speed Generation
Total 1.0000000 0.6187484 0.7362107 0.6127874 0.7472499 0.7176095 0.5759427 0.04838402
HP    0.6187484 1.0000000 0.4223860 0.2396223 0.3623799 0.3787181 0.1759521 0.05868251

Very easy and efficient.

Next, we could use select_all() to transform all the variables according to the tidyverse style guide. Meaning that we want all variables to be lower case and convert the dot into snake case.

These are the column names before:

colnames(poke)

 [1] "Name"       "Type.1"     "Type.2"     "Total"      "HP"         "Attack"     "Defense"   
 [8] "Sp..Atk"    "Sp..Def"    "Speed"      "Generation" "Legendary"

and after:

tidy_col <- function(x) gsub("(\\.|\\.\\.)", "_", x)

poke %>%
  dplyr::select_all(tolower) %>%
  dplyr::select_all(tidy_col) %>%
  colnames()

[1] "name"       "type_1"     "type_2"     "total"      "hp"         "attack"     "defense"   
[8] "sp_atk"     "sp_def"     "speed"      "generation" "legendary"

The tidyverse is all about ease and efficiency. It cannot get any easier than that.

dplyr’s filter_if()/at() Function

The filter_at() function is useful when you want to filter for the same condition, for different variables. In the vars() function we will be specifying the variables we are filtering on. In the next argument we will be using all_vars() or any_vars. all_vars() is the intersection (&) and any_vars() is the union (|). Let’s have a look at the code.

# intersection
poke %>%
  dplyr::filter_at(vars(Attack, Defense), all_vars(. >= 100)) %>%
  head()

# equivalent to 
poke %>%
  dplyr::filter(Attack >= 100 & Defense >= 100) %>%
  head()

##                        Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1     VenusaurMega Venusaur  Grass Poison   625 80    100     123     122
## 2 CharizardMega Charizard X   Fire Dragon   634 78    130     111     130
## 3   BlastoiseMega Blastoise  Water          630 79    103     120     135
## 4                 Sandslash Ground          450 75    100     110      45
## 5                     Golem   Rock Ground   495 80    120     130      55
## 6                   Kingler  Water          475 55    130     115      50


# union
poke %>%
  dplyr::filter_at(vars(Attack, Defense), any_vars(. >= 100)) %>%
  head()

# equivalent to 
poke %>%
  dplyr::filter(Attack >= 100 | Defense >= 100) %>%
  head()

##                        Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
## 1     VenusaurMega Venusaur  Grass Poison   625 80    100     123     122
## 2 CharizardMega Charizard X   Fire Dragon   634 78    130     111     130
## 3 CharizardMega Charizard Y   Fire Flying   634 78    104      78     159
## 4                 Blastoise  Water          530 79     83     100      85
## 5   BlastoiseMega Blastoise  Water          630 79    103     120     135
## 6     BeedrillMega Beedrill    Bug Poison   495 65    150      40      15
poke %>%
  dplyr::filter_if(is.numeric, any_vars(dplyr::between(., 150, 400)))


##                          Name   Type.1   Type.2 Total  HP Attack Defense
## 1                   Bulbasaur    Grass   Poison   318  45     49      49
## 2                  Charmander     Fire            309  39     52      43
## 3   CharizardMega Charizard Y     Fire   Flying   634  78    104      78
## 4                    Squirtle    Water            314  44     48      65
## 5                    Caterpie      Bug            195  45     30      35
## 6                     Metapod      Bug            205  50     20      55

filter_if() works as expected. It filters any variables that are between 50 and 125 in the specified columns. Here, you just got introduced to the between() function in dplyr which can reduce the amount of code you are writing.

The na_if() function is a great way to turn rows into missing values. In our case, there are a lot of empty strings in the Type.2 column. These values are missing and will be turned into NA values with the na_if() function.

poke %>%
  dplyr::mutate(Type.2 = dplyr::na_if(Type.2, ""))

##                          Name   Type.1   Type.2 Total  HP Attack Defense
## 1                   Bulbasaur    Grass   Poison   318  45     49      49
## 2                     Ivysaur    Grass   Poison   405  60     62      63
## 3                    Venusaur    Grass   Poison   525  80     82      83
## 4       VenusaurMega Venusaur    Grass   Poison   625  80    100     123
## 5                  Charmander     Fire     <NA>   309  39     52      43
## 6                  Charmeleon     Fire     <NA>   405  58     64      58

In the code below, we are identifying outlier Pokemons with the near() function. Whenever a numeric variable is larger or smaller than 3 standard deviations from the mean, then the specific value is filled out as an outlier. Afterwards, we are filtering all Pokemons that have at least one outlier in their row.

poke %>%
  dplyr::mutate_at(vars(Total, HP, Attack, Defense, Sp..Atk, Sp..Def, Speed), 
                   ~ ifelse(dplyr::near(., mean(.), sd(.) * 3), NA, "Outlier")) %>%
  dplyr::filter_if(is.character, any_vars(. == "Outlier"))

##                       Name  Type.1   Type.2 Total      HP  Attack Defense
## 1    AlakazamMega Alakazam Psychic             NA    <NA>    <NA>    <NA>
## 2      SlowbroMega Slowbro   Water  Psychic    NA    <NA>    <NA> Outlier
## 3                 Cloyster   Water      Ice    NA    <NA>    <NA> Outlier
## 4                  Chansey  Normal             NA Outlier    <NA>    <NA>
## 5                  Snorlax  Normal             NA Outlier    <NA>    <NA>
## 6      MewtwoMega Mewtwo X Psychic Fighting    NA    <NA> Outlier    <NA>

dplyr’s case_when() Function

Another great function in combination with the mutate() function is case_when(). With this function, we can group variables in certain categories. For example, let’s specify if a value is low, normal, or high based on the mean and standard deviation of the particular column.

If you are not sure about the dot placeholder, then check out this blog post about magrittr’s dot placeholder.

poke %>%
  dplyr::mutate_if(is.numeric, ~ dplyr::case_when(
    . > mean(.) + sd(.) ~ "high",
    . < mean(.) - sd(.) ~ "low",
    . >= mean(.) - sd(.) | . <= mean(.) + sd(.) ~ "normal"
  ))

poke <- read.csv("C:/Users/Pascal Schmidt/Desktop/Pokemon.csv") %>%
  dplyr::select(-X.)

# equivalent to dplyr version only way more complicated 🙂
for(i in 1:ncol(poke)) {
  if(is.integer(poke[, i])) {
    mean_col <- mean(poke[, i], na.rm = TRUE)
    sd_col <- sd(poke[, i], na.rm = TRUE)
    poke[, i][poke[, i] > mean_col + sd_col] <- "high"
    poke[, i][poke[, i] < mean_col - sd_col] <- "low"
    poke[, i][poke[, i] != "high" & poke[, i] != "low"] <- "normal"
  }
}

##                          Name   Type.1   Type.2  Total     HP Attack
## 1                   Bulbasaur    Grass   Poison normal normal normal
## 2                     Ivysaur    Grass   Poison normal normal normal
## 3                    Venusaur    Grass   Poison normal normal normal
## 4       VenusaurMega Venusaur    Grass   Poison   high normal normal
## 5                  Charmander     Fire             low    low normal
## 6                  Charmeleon     Fire          normal normal normal

On the other hand, we could have constructed this for loop instead of using some of dplyr’s verbs. In my opinion, dplyr is more readable and less complicated to use.

dplyr’s pull() Function

The pull function is great when you want to get a single column vector out of the data frame. So when you use select(), the output will be a data frame and when you use pull(), the output will be a vector.

poke <- read.csv("C:/Users/Pascal Schmidt/Desktop/Pokemon.csv") %>%
  dplyr::select(-X.)

Below you can see an example of using pull(). We pull out the Type.1 variable and then do some manipulation to the vector with paste(). Afterward, we are suppressing magrittr’s behavior of placing the vector as the first argument in the mutate() function. As the first argument in the mutate() function, we put the data frame and then placing the dot placeholder (the manipulated vector) as the second argument.

Check out my magrittr blog post. Since I have been through the documentation more thoroughly, I am better at writing code with the pipe.

poke %>%
  dplyr::pull(Type.1) %>%
  base::paste(" Pokemon") %>%
  {dplyr::mutate(poke, Type.1 = .)}

##                          Name            Type.1   Type.2 Total  HP Attack
## 1                   Bulbasaur    Grass  Pokemon   Poison   318  45     49
## 2                     Ivysaur    Grass  Pokemon   Poison   405  60     62
## 3                    Venusaur    Grass  Pokemon   Poison   525  80     82
## 4       VenusaurMega Venusaur    Grass  Pokemon   Poison   625  80    100
## 5                  Charmander     Fire  Pokemon            309  39     52
## 6                  Charmeleon     Fire  Pokemon            405  58     64

Additional Resources:

Post your comment