More dplyr – How to Get Your Data Into the Right Shape
April 8, 2019 By Pascal Schmidt R Tidyverse Tutorial
Last time we looked at the basic verbs of the tidyverse and this time we will be looking at some more verbs that make data munging/shaping a lot easier. This time we will be covering:
select_if()/all()
filter_if()/at()
mutate_if()/at()
between()
na_if()
case_when()
pull()
Again, we will be looking at the Pokemon data set.
library(tidyverse) poke <- read.csv("Pokemon.csv") %>% dplyr::select(-X.)
dplyr
’s select_if()/all()
Function
Personally, I rarely use the select
+ suffix
functions, however, they can sometimes be very handy. The select_if()
function selects certain variables of a data set based on a condition. I personally use select_if()
in combination with a correlation matrix. As we know, a correlation matrix can only be produced if the variables are all numeric. Hence, we are going to only select all the numeric variables in our data set.
Instead of subsetting the data frame and looking for numeric variables (a real pain), we do:
poke %>% dplyr::select_if(is.numeric) %>% stats::cor() %>% head(2) Total HP Attack Defense Sp..Atk Sp..Def Speed Generation Total 1.0000000 0.6187484 0.7362107 0.6127874 0.7472499 0.7176095 0.5759427 0.04838402 HP 0.6187484 1.0000000 0.4223860 0.2396223 0.3623799 0.3787181 0.1759521 0.05868251
Very easy and efficient.
Next, we could use select_all()
to transform all the variables according to the tidyverse
style guide. Meaning that we want all variables to be lower case and convert the dot into snake case.
These are the column names before:
colnames(poke) [1] "Name" "Type.1" "Type.2" "Total" "HP" "Attack" "Defense" [8] "Sp..Atk" "Sp..Def" "Speed" "Generation" "Legendary"
and after:
tidy_col <- function(x) gsub("(\\.|\\.\\.)", "_", x) poke %>% dplyr::select_all(tolower) %>% dplyr::select_all(tidy_col) %>% colnames() [1] "name" "type_1" "type_2" "total" "hp" "attack" "defense" [8] "sp_atk" "sp_def" "speed" "generation" "legendary"
The tidyverse
is all about ease and efficiency. It cannot get any easier than that.
dplyr
’s filter_if()/at()
Function
The filter_at()
function is useful when you want to filter for the same condition, for different variables. In the vars()
function we will be specifying the variables we are filtering on. In the next argument we will be using all_vars()
or any_vars
. all_vars()
is the intersection (&) and any_vars()
is the union (|). Let’s have a look at the code.
# intersection poke %>% dplyr::filter_at(vars(Attack, Defense), all_vars(. >= 100)) %>% head() # equivalent to poke %>% dplyr::filter(Attack >= 100 & Defense >= 100) %>% head() ## Name Type.1 Type.2 Total HP Attack Defense Sp..Atk ## 1 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 ## 2 CharizardMega Charizard X Fire Dragon 634 78 130 111 130 ## 3 BlastoiseMega Blastoise Water 630 79 103 120 135 ## 4 Sandslash Ground 450 75 100 110 45 ## 5 Golem Rock Ground 495 80 120 130 55 ## 6 Kingler Water 475 55 130 115 50 # union poke %>% dplyr::filter_at(vars(Attack, Defense), any_vars(. >= 100)) %>% head() # equivalent to poke %>% dplyr::filter(Attack >= 100 | Defense >= 100) %>% head() ## Name Type.1 Type.2 Total HP Attack Defense Sp..Atk ## 1 VenusaurMega Venusaur Grass Poison 625 80 100 123 122 ## 2 CharizardMega Charizard X Fire Dragon 634 78 130 111 130 ## 3 CharizardMega Charizard Y Fire Flying 634 78 104 78 159 ## 4 Blastoise Water 530 79 83 100 85 ## 5 BlastoiseMega Blastoise Water 630 79 103 120 135 ## 6 BeedrillMega Beedrill Bug Poison 495 65 150 40 15
poke %>% dplyr::filter_if(is.numeric, any_vars(dplyr::between(., 150, 400))) ## Name Type.1 Type.2 Total HP Attack Defense ## 1 Bulbasaur Grass Poison 318 45 49 49 ## 2 Charmander Fire 309 39 52 43 ## 3 CharizardMega Charizard Y Fire Flying 634 78 104 78 ## 4 Squirtle Water 314 44 48 65 ## 5 Caterpie Bug 195 45 30 35 ## 6 Metapod Bug 205 50 20 55
filter_if()
works as expected. It filters any variables that are between 50 and 125 in the specified columns. Here, you just got introduced to the between()
function in dplyr
which can reduce the amount of code you are writing.
The na_if()
function is a great way to turn rows into missing values. In our case, there are a lot of empty strings in the Type.2 column. These values are missing and will be turned into NA values with the na_if()
function.
poke %>% dplyr::mutate(Type.2 = dplyr::na_if(Type.2, "")) ## Name Type.1 Type.2 Total HP Attack Defense ## 1 Bulbasaur Grass Poison 318 45 49 49 ## 2 Ivysaur Grass Poison 405 60 62 63 ## 3 Venusaur Grass Poison 525 80 82 83 ## 4 VenusaurMega Venusaur Grass Poison 625 80 100 123 ## 5 Charmander Fire <NA> 309 39 52 43 ## 6 Charmeleon Fire <NA> 405 58 64 58
In the code below, we are identifying outlier Pokemons with the near()
function. Whenever a numeric variable is larger or smaller than 3 standard deviations from the mean, then the specific value is filled out as an outlier. Afterwards, we are filtering all Pokemons that have at least one outlier in their row.
poke %>% dplyr::mutate_at(vars(Total, HP, Attack, Defense, Sp..Atk, Sp..Def, Speed), ~ ifelse(dplyr::near(., mean(.), sd(.) * 3), NA, "Outlier")) %>% dplyr::filter_if(is.character, any_vars(. == "Outlier")) ## Name Type.1 Type.2 Total HP Attack Defense ## 1 AlakazamMega Alakazam Psychic NA <NA> <NA> <NA> ## 2 SlowbroMega Slowbro Water Psychic NA <NA> <NA> Outlier ## 3 Cloyster Water Ice NA <NA> <NA> Outlier ## 4 Chansey Normal NA Outlier <NA> <NA> ## 5 Snorlax Normal NA Outlier <NA> <NA> ## 6 MewtwoMega Mewtwo X Psychic Fighting NA <NA> Outlier <NA>
dplyr
’s case_when()
Function
Another great function in combination with the mutate()
function is case_when()
. With this function, we can group variables in certain categories. For example, let’s specify if a value is low, normal, or high based on the mean and standard deviation of the particular column.
If you are not sure about the dot placeholder, then check out this blog post about magrittr’s dot placeholder.
poke %>% dplyr::mutate_if(is.numeric, ~ dplyr::case_when( . > mean(.) + sd(.) ~ "high", . < mean(.) - sd(.) ~ "low", . >= mean(.) - sd(.) | . <= mean(.) + sd(.) ~ "normal" )) poke <- read.csv("C:/Users/Pascal Schmidt/Desktop/Pokemon.csv") %>% dplyr::select(-X.) # equivalent to dplyr version only way more complicated 🙂 for(i in 1:ncol(poke)) { if(is.integer(poke[, i])) { mean_col <- mean(poke[, i], na.rm = TRUE) sd_col <- sd(poke[, i], na.rm = TRUE) poke[, i][poke[, i] > mean_col + sd_col] <- "high" poke[, i][poke[, i] < mean_col - sd_col] <- "low" poke[, i][poke[, i] != "high" & poke[, i] != "low"] <- "normal" } } ## Name Type.1 Type.2 Total HP Attack ## 1 Bulbasaur Grass Poison normal normal normal ## 2 Ivysaur Grass Poison normal normal normal ## 3 Venusaur Grass Poison normal normal normal ## 4 VenusaurMega Venusaur Grass Poison high normal normal ## 5 Charmander Fire low low normal ## 6 Charmeleon Fire normal normal normal
On the other hand, we could have constructed this for loop instead of using some of dplyr
’s verbs. In my opinion, dplyr
is more readable and less complicated to use.
dplyr
’s pull()
Function
The pull function is great when you want to get a single column vector out of the data frame. So when you use select()
, the output will be a data frame and when you use pull()
, the output will be a vector.
poke <- read.csv("C:/Users/Pascal Schmidt/Desktop/Pokemon.csv") %>% dplyr::select(-X.)
Below you can see an example of using pull()
. We pull out the Type.1
variable and then do some manipulation to the vector with paste()
. Afterward, we are suppressing magrittr’s
behavior of placing the vector as the first argument in the mutate()
function. As the first argument in the mutate()
function, we put the data frame and then placing the dot placeholder (the manipulated vector) as the second argument.
Check out my magrittr blog post. Since I have been through the documentation more thoroughly, I am better at writing code with the pipe.
poke %>% dplyr::pull(Type.1) %>% base::paste(" Pokemon") %>% {dplyr::mutate(poke, Type.1 = .)} ## Name Type.1 Type.2 Total HP Attack ## 1 Bulbasaur Grass Pokemon Poison 318 45 49 ## 2 Ivysaur Grass Pokemon Poison 405 60 62 ## 3 Venusaur Grass Pokemon Poison 525 80 82 ## 4 VenusaurMega Venusaur Grass Pokemon Poison 625 80 100 ## 5 Charmander Fire Pokemon 309 39 52 ## 6 Charmeleon Fire Pokemon 405 58 64
Additional Resources:
Recent Posts
Recent Comments
- Kardiana on The Lasso – R Tutorial (Part 3)
- Pascal Schmidt on RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium
- Pascal Schmidt on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications
- Gisa on Persistent Data Storage With a MySQL Database in R Shiny – An Example App
- Nicholas on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications