Learning the Tidyverse: Basic dplyr Verbs for Data Manipulation
March 26, 2019 By Pascal Schmidt R Tidyverse Tutorial
Data comes in various shapes and forms. Hence, basic data manipulation is a must-have skill for a data scientist. The tidyverse packages are an excellent solution for data shaping. They are developed by Hadley Wickham and the team at RStudio rather than by the R core team, but they are so widely used that I would almost consider them part of base R. The dplyr package is the most useful one for data manipulation.
In this blog post, we will be going over the basic verbs from the dplyr package. One of the great things about the tidyverse is that very little programming logic is involved. Hence, even without prior experience, you can pick up these packages and perform powerful manipulations based on your needs. This is how I started: by learning the tidyverse.
The best thing about the tidyverse is that it is very intuitive, and it is the quickest way to get started with data manipulation. Once you have mastered the tidyverse and can do everything you want to do, you can slowly progress to learning base R functions and syntax for data manipulation. However, the most important parts of learning are:
- making progress
- seeing results
- staying motivated
What we will be covering in this post are verbs such as:
- select()
- filter()
- mutate()
- group_by()
- summarise()
- arrange()
- rename()
As you can see, even without any knowledge about the functions, you can already guess what kind of data manipulation each one will do. The verbs are pretty self-explanatory. Now, we only have to pick up the syntax and we are good to go.
dplyr’s select() Function
With select(), we are able to select a subset of a data frame’s columns, or we can name the columns that we want to drop. The first argument of the select() function is always the data frame itself. With the pipe operator %>%, we can make the code more readable. Check out my blog post about magrittr’s pipe if you want to understand it. In short, the pipe inserts the result of the previous code as the first argument of the function that follows the pipe. Have a look.
dplyr::select(poke, Name, Speed, Legendary)
poke %>% dplyr::select(Name, Speed, Legendary)
poke %>% dplyr::select("Name", "Speed", "Legendary")
poke %>% dplyr::select(-X., -c(Type.1:Sp..Def), -Generation)
##                    Name Speed Legendary
## 1             Bulbasaur    45     False
## 2               Ivysaur    60     False
## 3              Venusaur    80     False
## 4 VenusaurMega Venusaur    80     False
## 5            Charmander    65     False
## 6            Charmeleon    80     False
All four lines of code above produce the same output: a data frame with the columns Name, Speed, and Legendary. We can write the column names with or without quotation marks. I always write them without quotation marks because it is less typing.
The last line of code says that we want to drop the column X., all columns from Type.1 up to Sp..Def, and the column Generation. Note that a range such as Type.1:Speed includes its end point, so it would drop Speed as well.
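To make the equivalence concrete, here is a minimal sketch on a small hand-typed stand-in for the Pokemon data. The data frame poke_toy is hypothetical and only carries a few of the original columns; the checks simply compare the results of the different select() spellings.

```r
library(dplyr)

# Hypothetical toy stand-in for the post's `poke` data frame
poke_toy <- data.frame(
  X. = c(1, 2, 3),
  Name = c("Bulbasaur", "Ivysaur", "Venusaur"),
  Type.1 = c("Grass", "Grass", "Grass"),
  Speed = c(45, 60, 80),
  Legendary = c("False", "False", "False"),
  stringsAsFactors = FALSE
)

# Four spellings of the same column selection
direct    <- dplyr::select(poke_toy, Name, Speed, Legendary)    # data frame as first argument
piped     <- poke_toy %>% dplyr::select(Name, Speed, Legendary) # pipe supplies the first argument
quoted    <- poke_toy %>% dplyr::select("Name", "Speed", "Legendary")
by_drop   <- poke_toy %>% dplyr::select(-X., -Type.1)           # drop instead of keep

# All four results should agree
all_same <- isTRUE(all.equal(direct, piped)) &&
  isTRUE(all.equal(direct, quoted)) &&
  isTRUE(all.equal(direct, by_drop))
print(all_same)
```

Dropping columns is mostly convenient when the list of columns to remove is shorter than the list of columns to keep.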
When we want to reorder some columns in our data frame, we can do that with the following code.
poke %>% dplyr::select(Name, Speed, Legendary, everything()) %>% head()
##                    Name Speed Legendary X. Type.1 Type.2 Total HP Attack
## 1             Bulbasaur    45     False  1  Grass Poison   318 45     49
## 2               Ivysaur    60     False  2  Grass Poison   405 60     62
## 3              Venusaur    80     False  3  Grass Poison   525 80     82
## 4 VenusaurMega Venusaur    80     False  3  Grass Poison   625 80    100
## 5            Charmander    65     False  4   Fire          309 39     52
## 6            Charmeleon    80     False  5   Fire          405 58     64
We put Name first, then Speed, then Legendary, and then everything else.
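The reordering idiom can be tried on a tiny data frame as well. The sketch below uses a hypothetical poke_toy with four columns. It also shows dplyr::relocate(), which is not used in this post but, in dplyr 1.0.0 and later, moves the named columns to the front by default.

```r
library(dplyr)

# Hypothetical toy stand-in for the post's `poke` data frame
poke_toy <- data.frame(
  X. = c(1, 2),
  Name = c("Bulbasaur", "Ivysaur"),
  Type.1 = c("Grass", "Grass"),
  Speed = c(45, 60),
  stringsAsFactors = FALSE
)

# Name and Speed first, then every remaining column in its original order
reordered <- poke_toy %>% dplyr::select(Name, Speed, everything())

# Alternative (assumes dplyr >= 1.0.0): relocate() moves columns to the front
relocated <- poke_toy %>% dplyr::relocate(Name, Speed)

print(names(reordered))
print(names(relocated))
```

Because everything() only adds columns that have not been selected yet, no column ends up duplicated.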
That is basically all there is to select(). Pretty easy and pretty powerful. No base R subsetting necessary.
Let’s continue with filter().
dplyr’s filter() Function
dplyr’s filter() function selects certain rows instead of columns. It keeps only the rows that satisfy the conditions you give it. Let’s see how that works with the Pokemon data set.
poke %>% dplyr::filter(Speed > 65) %>% head()
poke %>% dplyr::filter(Speed > 65 & HP < 40 & Total >= 200) %>% head()
The first line keeps Pokemon that have Speed above 65. The second line keeps Pokemon with Speed greater than 65, HP less than 40, and Total greater than or equal to 200. So if you want to filter on more than one variable, you can simply combine conditions with &, which stands for and, or |, which stands for or, and use as many variables and conditions as you want.
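As a small sketch of how & and | differ, the example below filters a hypothetical four-row data frame whose numbers are illustrative rather than taken from the real data set. It also shows that separating conditions with a comma inside filter() behaves like &.

```r
library(dplyr)

# Hypothetical toy data; the values are illustrative only
poke_toy <- data.frame(
  Name  = c("A", "B", "C", "D"),
  Speed = c(45, 80, 56, 90),
  HP    = c(45, 58, 40, 35),
  stringsAsFactors = FALSE
)

# Both conditions must hold
with_and   <- poke_toy %>% dplyr::filter(Speed > 65 & HP < 40)

# Comma-separated conditions are combined with "and" as well
with_comma <- poke_toy %>% dplyr::filter(Speed > 65, HP < 40)

# At least one condition must hold
with_or    <- poke_toy %>% dplyr::filter(Speed > 65 | HP < 40)

print(with_and$Name)
print(with_or$Name)
```

With these toy values, only "D" is both fast and low on HP, while "B" and "D" satisfy at least one of the two conditions.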
One important thing to notice about the filter() function is that the expression inside has to evaluate to TRUE or FALSE for every row. In the code above, we used relational operators that evaluate to a boolean (true or false). The code below shows one example where we look for Pokemon with a specific type.
poke %>% dplyr::filter(Type.2 == "Bug" | Type.2 == "Flying") %>% head()
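When several values of the same column are acceptable, the chained | comparisons can also be written with base R’s %in% operator. This is a sketch on a hypothetical toy data frame, not code from the original analysis.

```r
library(dplyr)

# Hypothetical toy data; an empty string marks a missing second type
poke_toy <- data.frame(
  Name   = c("A", "B", "C", "D"),
  Type.2 = c("Poison", "Flying", "Bug", ""),
  stringsAsFactors = FALSE
)

# Chained "or" comparisons, as in the post
with_or <- poke_toy %>% dplyr::filter(Type.2 == "Bug" | Type.2 == "Flying")

# Same rows using %in% with a vector of accepted values
with_in <- poke_toy %>% dplyr::filter(Type.2 %in% c("Bug", "Flying"))

print(with_in$Name)
```

The %in% form stays readable as the list of accepted values grows.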
So far so good. Sometimes, we also want to filter for Pokemon whose names contain a certain string. For example, let’s find all Pokemon whose names start with the letters “Pi”.
For this kind of filtering, we would use the stringr package from the tidyverse. If you want to find out more about it, read up here.
poke %>% dplyr::filter(stringr::str_detect(Name, "^Pi")) %>% head()
##                  Name   Type.1 Type.2 Total HP Attack Defense Sp..Atk
##                Pidgey   Normal Flying   251 40     45      40      35
##             Pidgeotto   Normal Flying   349 63     60      55      50
##               Pidgeot   Normal Flying   479 83     80      75      70
##   PidgeotMega Pidgeot   Normal Flying   579 83     80      80     135
##               Pikachu Electric          320 35     55      40      50
##                Pinsir      Bug          500 65    125     100      55
In the code above, we are looking for Pokemon whose names start with “Pi”. The ^ symbol states that we only want matches where “Pi” appears at the start of the name and not somewhere in the middle or at the end. Always remember that the inside of the filter() function has to evaluate to a boolean.
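To see the boolean vector that filter() receives, str_detect() can be called on its own. The sketch below uses a hand-picked vector of names (an assumption for illustration) and contrasts the anchored pattern with an unanchored, lowercase one. Matching is case sensitive by default.

```r
library(dplyr)
library(stringr)

# Hand-picked names for illustration
poke_names <- c("Pikachu", "Pidgey", "Caterpie", "Spinarak")

# Anchored: "Pi" must sit at the very start of the string
starts_with_pi <- stringr::str_detect(poke_names, "^Pi")

# Unanchored: lowercase "pi" may appear anywhere in the string
contains_pi <- stringr::str_detect(poke_names, "pi")

# The same logical vector drives filter()
poke_toy <- data.frame(Name = poke_names, stringsAsFactors = FALSE)
filtered <- poke_toy %>% dplyr::filter(stringr::str_detect(Name, "^Pi"))

print(starts_with_pi)
print(contains_pi)
print(filtered$Name)
```

One logical value is produced per name, which is exactly the shape filter() needs to decide row by row.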
This function was also very intuitive and easy to understand. As soon as you get the logic down, it will be easy as pi to select certain rows in your data set.
dplyr’s mutate() Function
The next function is able to create new columns based on calculations of our choice or overwrite columns. Let’s see how that works out with the Pokemon data set.
poke %>% dplyr::mutate(Attack_1000 = Attack * 1000, Attack = Attack_1000) %>% head()
##                    Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
##               Bulbasaur  Grass Poison   318 45  49000      49      65
##                 Ivysaur  Grass Poison   405 60  62000      63      80
##                Venusaur  Grass Poison   525 80  82000      83     100
##   VenusaurMega Venusaur  Grass Poison   625 80 100000     123     122
##              Charmander   Fire          309 39  52000      43      60
##              Charmeleon   Fire          405 58  64000      58      80
The code above has no practical purpose. However, it shows what is possible. We created a new column called Attack_1000, where we multiplied the values of Attack by 1000. Afterward, we were able to use our newly created column right away and overwrote all the values in Attack with the values in Attack_1000.
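The key point is that mutate() evaluates its arguments from left to right, so a column created earlier in the call is available to later arguments. A minimal sketch on a hypothetical two-row data frame:

```r
library(dplyr)

# Hypothetical toy stand-in for the post's `poke` data frame
poke_toy <- data.frame(
  Name   = c("Bulbasaur", "Ivysaur"),
  Attack = c(49, 62),
  stringsAsFactors = FALSE
)

result <- poke_toy %>%
  dplyr::mutate(
    Attack_1000 = Attack * 1000, # new column, appended at the end
    Attack      = Attack_1000    # existing column, overwritten in place
  )

print(result)
```

The overwritten Attack column keeps its original position, while the new Attack_1000 column is added as the last column.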
Another great way to use the mutate() function is in combination with the base::ifelse() function. The first argument of the ifelse() function is the condition that evaluates to true or false. Every element for which the condition is true takes on the second argument, and every element for which it is false takes on the third argument. Let’s see how that works out.
poke %>% dplyr::mutate(Legendary = base::ifelse(Legendary == "False", "Not Legendary", "Legendary")) %>% head()
##   Sp..Def Speed Generation     Legendary
## 1      65    45          1 Not Legendary
## 2      80    60          1 Not Legendary
## 3     100    80          1 Not Legendary
## 4     120    80          1 Not Legendary
## 5      50    65          1 Not Legendary
## 6      65    80          1 Not Legendary
When a Pokemon is legendary, we want the row in that column to say Legendary, and if not, it should say Not Legendary.
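Since ifelse() is vectorized, it can be tried on a plain vector before using it inside mutate(). The vector below is a made-up example that mimics the "True"/"False" strings of the Legendary column.

```r
# Made-up values mimicking the Legendary column
legendary <- c("False", "True", "False")

# Element-wise: condition, value if TRUE, value if FALSE
labels <- base::ifelse(legendary == "False", "Not Legendary", "Legendary")

print(labels)
```

The result has one label per input element, which is why it can be assigned directly to a column of the same length.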
Another useful way to use the base::ifelse() function is with a placeholder. Check out my blog post about magrittr’s placeholders if you want to know more.
poke %>% dplyr::mutate(Type.1 = ifelse(Type.1 == "Grass", "Grass Pokemon", as.character(.$Type.1))) %>% head()
##                    Name        Type.1 Type.2 Total HP Attack Defense
##               Bulbasaur Grass Pokemon Poison   318 45     49      49
##                 Ivysaur Grass Pokemon Poison   405 60     62      63
##                Venusaur Grass Pokemon Poison   525 80     82      83
##   VenusaurMega Venusaur Grass Pokemon Poison   625 80    100     123
##              Charmander          Fire          309 39     52      43
##              Charmeleon          Fire          405 58     64      58
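The sketch below, on a hypothetical two-row data frame, illustrates two details of this pattern. First, the as.character() call matters when Type.1 is stored as a factor, because ifelse() would otherwise typically fall back to the factor’s internal integer codes instead of its labels. Second, inside mutate() the bare column name refers to the same values as the placeholder .$Type.1 in an ungrouped pipeline, so both spellings should give the same result.

```r
library(dplyr)

# Hypothetical toy data with Type.1 stored as a factor
poke_toy <- data.frame(
  Name   = c("Bulbasaur", "Charmander"),
  Type.1 = factor(c("Grass", "Fire"))
)

# Placeholder version, as in the post: `.` is the data frame on the left of the pipe
with_placeholder <- poke_toy %>%
  dplyr::mutate(Type.1 = ifelse(Type.1 == "Grass", "Grass Pokemon", as.character(.$Type.1)))

# Bare column name version: mutate() looks up Type.1 in the data frame itself
with_bare <- poke_toy %>%
  dplyr::mutate(Type.1 = ifelse(Type.1 == "Grass", "Grass Pokemon", as.character(Type.1)))

print(with_placeholder$Type.1)
print(with_bare$Type.1)
```

The placeholder is mainly needed when the data frame has to be passed somewhere other than the first argument of a function.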