Learning the Tidyverse: Basic dplyr Verbs for Data Manipulation

March 26, 2019 By Pascal Schmidt R Tidyverse Tutorial

Data comes in various shapes and forms, so basic data manipulation is a must-have skill for a data scientist. The tidyverse packages are an excellent tool for shaping data. They are developed mainly by Hadley Wickham and his colleagues at RStudio, not by the R core team, but they are so widely used that I would almost consider them part of base R. Within the tidyverse, dplyr is the most useful package for data manipulation.

In this blog post, we will go over the basic verbs of the dplyr package. One of the great things about the tidyverse is that you need very little programming logic to get started. Even without prior experience, you can pick up these packages and perform powerful manipulations tailored to your needs. This is how I started myself: by learning the tidyverse.

The best thing about the tidyverse is that it is very intuitive and one of the quickest ways to get started with data manipulation. Once you have mastered the tidyverse and can do everything you want to do, you can slowly progress to learning base R functions and syntax for data manipulation. However, the most important parts of learning are:

  • making progress
  • seeing results
  • staying motivated

What we will be covering in this post are verbs such as:

  • select()
  • filter()
  • mutate()
  • group_by()
  • summarise()
  • arrange()
  • rename()

As you can see, even without knowing the functions, you can already guess what kind of data manipulation each one performs. The verbs are pretty self-explanatory. Now we only have to pick up the syntax and we are good to go.


dplyr’s select() Function

With select(), we can pick a subset of a data frame’s columns, or drop the columns we do not want. The first argument of select() is always the data frame itself. With the pipe operator %>%, we can make the code more readable. Check out my blog post about magrittr’s pipe if you want to understand it in depth. In short, the pipe inserts the result of the code on its left as the first argument of the function that follows it. Have a look.

# 1. Data frame passed as the first argument
dplyr::select(poke, Name, Speed, Legendary)

# 2. Same call, with the data frame piped in
poke %>%
  dplyr::select(Name, Speed, Legendary)

# 3. Column names given as strings
poke %>%
  dplyr::select("Name", "Speed", "Legendary")

# 4. Drop every column except Name, Speed, and Legendary
poke %>%
  dplyr::select(-X., -c(Type.1:Sp..Def), -Generation)

##                    Name Speed Legendary
## 1             Bulbasaur    45     False
## 2               Ivysaur    60     False
## 3              Venusaur    80     False
## 4 VenusaurMega Venusaur    80     False
## 5            Charmander    65     False
## 6            Charmeleon    80     False

All four snippets above produce the same output: a data frame with the columns Name, Speed, and Legendary (only the first six rows are shown). We can write the column names with or without quotation marks. I always write them without quotation marks because it means less typing.

The last snippet drops the column X., all columns from Type.1 through Sp..Def, and the column Generation. A range such as Type.1:Speed would also remove Speed, so the range has to stop one column earlier, at Sp..Def.
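For readers who do not have the Pokemon file at hand, here is a minimal sketch on a small hand-made data frame. The name poke_toy and its values are made up for illustration and only mimic a few columns of the real data. It checks that the piped and unpiped calls return identical results, and that dropping columns with a minus sign leaves the same columns as naming them.

```r
library(dplyr)  # also makes the %>% pipe available

# Hypothetical toy data, not the real Pokemon data set
poke_toy <- data.frame(
  X.        = 1:3,
  Name      = c("Bulbasaur", "Ivysaur", "Venusaur"),
  Type.1    = c("Grass", "Grass", "Grass"),
  Speed     = c(45, 60, 80),
  Legendary = c("False", "False", "False"),
  stringsAsFactors = FALSE
)

# The pipe passes poke_toy as the first argument of select()
without_pipe <- dplyr::select(poke_toy, Name, Speed, Legendary)
with_pipe    <- poke_toy %>% dplyr::select(Name, Speed, Legendary)

# Dropping the unwanted columns leaves the same three columns
by_dropping  <- poke_toy %>% dplyr::select(-X., -Type.1)

identical(without_pipe, with_pipe)    # TRUE
identical(without_pipe, by_dropping)  # TRUE
```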

When we want to reorder the columns of our data frame, for example to move a few of them to the front, we can do that with the code below.

poke %>%
  dplyr::select(Name, Speed, Legendary, everything()) %>%
  head()

##                    Name Speed Legendary X. Type.1 Type.2 Total HP Attack
## 1             Bulbasaur    45     False  1  Grass Poison   318 45     49
## 2               Ivysaur    60     False  2  Grass Poison   405 60     62
## 3              Venusaur    80     False  3  Grass Poison   525 80     82
## 4 VenusaurMega Venusaur    80     False  3  Grass Poison   625 80    100
## 5            Charmander    65     False  4   Fire          309 39     52
## 6            Charmeleon    80     False  5   Fire          405 58     64

We put Name first, then Speed, then Legendary, and then everything else.

That is basically all there is to select(). Pretty easy and pretty powerful. No base R subsetting necessary.

Let’s continue with filter().

dplyr’s filter() Function

dplyr’s filter() function selects certain rows instead of columns. It keeps only the rows that satisfy the conditions you give it. Let’s see how that works with the Pokemon data set.

poke %>%
  dplyr::filter(Speed > 65) %>%
  head()

poke %>%
  dplyr::filter(Speed > 65 & HP < 40 & Total >= 200) %>%
  head()

The first snippet keeps only Pokemon with a Speed above 65. The second keeps Pokemon with a Speed greater than 65, HP less than 40, and Total greater than or equal to 200. So if you want to filter on more than one variable, you can combine conditions with & (and) or | (or), and use as many variables and conditions as you like.

One important thing to notice about filter() is that the expression inside it has to evaluate to TRUE or FALSE for each row. In the code above, we used relational operators, which evaluate to a boolean (true or false). The code below shows an example where we look for Pokemon with a specific type.
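To make the difference between "and" and "or" concrete, here is a small sketch on made-up data (stats_toy and its values are hypothetical). It also shows that conditions separated by a comma inside filter() are combined with "and", just like &.

```r
library(dplyr)

# Hypothetical toy data for illustration
stats_toy <- data.frame(
  Name  = c("A", "B", "C", "D"),
  Speed = c(70, 90, 50, 80),
  HP    = c(35, 60, 30, 39),
  stringsAsFactors = FALSE
)

# Both conditions must hold (and)
both_amp   <- stats_toy %>% dplyr::filter(Speed > 65 & HP < 40)
# Comma-separated conditions are combined with "and" as well
both_comma <- stats_toy %>% dplyr::filter(Speed > 65, HP < 40)
# At least one condition must hold (or)
either     <- stats_toy %>% dplyr::filter(Speed > 65 | HP < 40)

both_amp$Name                    # "A" "D"
identical(both_amp, both_comma)  # TRUE
either$Name                      # "A" "B" "C" "D"
```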

poke %>% 
  dplyr::filter(Type.2 == "Bug" | Type.2 == "Flying") %>%
  head()
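When the same column is compared against several values, the chain of == and | can also be written with base R’s %in% operator. The sketch below uses a made-up data frame (types_toy is a hypothetical name) to check that both spellings keep the same rows.

```r
library(dplyr)

# Hypothetical toy data; an empty string stands for "no second type"
types_toy <- data.frame(
  Name   = c("Butterfree", "Pidgey", "Pikachu", "Bulbasaur"),
  Type.2 = c("Flying", "Flying", "", "Poison"),
  stringsAsFactors = FALSE
)

with_or <- types_toy %>%
  dplyr::filter(Type.2 == "Bug" | Type.2 == "Flying")

# %in% tests each value for membership in the given vector
with_in <- types_toy %>%
  dplyr::filter(Type.2 %in% c("Bug", "Flying"))

identical(with_or, with_in)  # TRUE
with_in$Name                 # "Butterfree" "Pidgey"
```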

So far so good. Sometimes, we also want to filter for Pokemon whose names contain a certain string. For example, let’s find all Pokemon whose names start with the letters Pi.

For this kind of filtering, we use the stringr package from the tidyverse. If you want to find out more about it, read up here.

poke %>%
  dplyr::filter(stringr::str_detect(Name, "^Pi")) %>%
  head()

##                Name   Type.1 Type.2 Total HP Attack Defense Sp..Atk
##              Pidgey   Normal Flying   251 40     45      40      35
##           Pidgeotto   Normal Flying   349 63     60      55      50
##             Pidgeot   Normal Flying   479 83     80      75      70
## PidgeotMega Pidgeot   Normal Flying   579 83     80      80     135
##             Pikachu Electric          320 35     55      40      50
##              Pinsir      Bug          500 65    125     100      55

In the code above, we are looking for Pokemon whose names start with “Pi”. The ^ symbol anchors the pattern to the beginning of the string, so we only match names that start with “Pi” and not names that contain it somewhere in the middle or at the end. Always remember that the inside of the filter() function has to evaluate to a boolean.
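The effect of the ^ anchor is easiest to see on a plain character vector. This is a small sketch with made-up names (names_toy is hypothetical); note that the match is case sensitive, so the lowercase "pi" in "Caterpie" does not count.

```r
library(stringr)

# Hypothetical names for illustration
names_toy <- c("Pidgey", "Pikachu", "Mega Pinsir", "Caterpie")

# Anchored: "Pi" must be at the very start of the name
starts_with_pi <- str_detect(names_toy, "^Pi")
starts_with_pi  # TRUE TRUE FALSE FALSE

# Not anchored: "Pi" may appear anywhere in the name
contains_pi <- str_detect(names_toy, "Pi")
contains_pi     # TRUE TRUE TRUE FALSE
```

Because str_detect() returns one TRUE or FALSE per element, its result can be used directly inside filter().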

This function was also very intuitive and easy to understand. As soon as you get the logic down, it will be easy as pi to select certain rows in your data set.

dplyr’s mutate() Function

The next function is able to create new columns based on calculations of our choice or overwrite columns. Let’s see how that works out with the Pokemon data set.

poke %>%
  dplyr::mutate(Attack_1000 = Attack * 1000,
                Attack = Attack_1000) %>%
  head()

##                  Name Type.1 Type.2 Total HP Attack Defense Sp..Atk
##             Bulbasaur  Grass Poison   318 45  49000      49      65
##               Ivysaur  Grass Poison   405 60  62000      63      80
##              Venusaur  Grass Poison   525 80  82000      83     100
## VenusaurMega Venusaur  Grass Poison   625 80 100000     123     122
##            Charmander   Fire          309 39  52000      43      60
##            Charmeleon   Fire          405 58  64000      58      80

The calculation above is not meaningful in itself, but it shows what is possible. We created a new column called Attack_1000 by multiplying the values of Attack by 1000. We were then able to use the newly created column right away, within the same mutate() call, to overwrite all the values in Attack with the values in Attack_1000.

Another great way to use the mutate() function is in combination with the base::ifelse() function. The first argument of ifelse() is a condition that evaluates to true or false. Wherever the condition is true, the result takes the value of the second argument, and wherever it is false, it takes the value of the third argument. Let’s see how that works out.
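The same idea on a tiny made-up data frame (attack_toy is a hypothetical name) makes two details visible: new columns are appended at the end, and mutate() returns a new data frame rather than changing the original one.

```r
library(dplyr)

# Hypothetical toy data for illustration
attack_toy <- data.frame(
  Name   = c("A", "B"),
  Attack = c(49, 62),
  stringsAsFactors = FALSE
)

result <- attack_toy %>%
  dplyr::mutate(Attack_1000 = Attack * 1000,  # create a new column
                Attack = Attack_1000)         # reuse it in the same call

names(result)      # "Name" "Attack" "Attack_1000"
result$Attack      # 49000 62000
attack_toy$Attack  # 49 62 (the original data frame is unchanged)
```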

poke %>%
  dplyr::mutate(Legendary = base::ifelse(Legendary == "False", 
                                         "Not Legendary", 
                                         "Legendary")) %>%
  head()

##   Sp..Def Speed Generation     Legendary
## 1      65    45          1 Not Legendary
## 2      80    60          1 Not Legendary
## 3     100    80          1 Not Legendary
## 4     120    80          1 Not Legendary
## 5      50    65          1 Not Legendary
## 6      65    80          1 Not Legendary

When a Pokemon is legendary, we want its row in that column to say Legendary, and if not, it should say Not Legendary.

Another useful way to use the base::ifelse() function is with magrittr’s placeholder, the dot (.), which stands for the data frame being piped in. Check out my blog post about magrittr’s placeholders if you want to know more.
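Since ifelse() works element by element, it can be tried out on a plain vector before using it inside mutate(). The vector below is made up for illustration; the missing value is included to show that ifelse() returns NA wherever the condition itself is NA.

```r
# Hypothetical values for illustration
legendary_raw <- c("False", "True", NA)

legendary_label <- ifelse(legendary_raw == "False",
                          "Not Legendary",  # used where the condition is TRUE
                          "Legendary")      # used where the condition is FALSE

legendary_label  # "Not Legendary" "Legendary" NA
```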

poke %>%
  dplyr::mutate(Type.1 = ifelse(Type.1 == "Grass", 
                                "Grass Pokemon", 
                                as.character(.$Type.1))) %>%
  head()

##                  Name        Type.1 Type.2 Total HP Attack Defense
##             Bulbasaur Grass Pokemon Poison   318 45     49      49
##               Ivysaur Grass Pokemon Poison   405 60     62      63
##              Venusaur Grass Pokemon Poison   525 80     82      83
## VenusaurMega Venusaur Grass Pokemon Poison   625 80    100     123
##            Charmander          Fire          309 39     52      43
##            Charmeleon          Fire          405 58     64      58
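Two details of the snippet above are worth checking on a small example. The sketch below uses a made-up data frame (type_toy is a hypothetical name) in which Type.1 is a factor, as it would be if the data were read with strings converted to factors. First, on an ungrouped data frame, .$Type.1 and the bare column name Type.1 give the same result inside mutate(). Second, as.character() matters because ifelse() would otherwise return the factor’s internal integer codes instead of its labels.

```r
library(dplyr)

# Hypothetical toy data; Type.1 is deliberately a factor
type_toy <- data.frame(
  Name   = c("Bulbasaur", "Charmander"),
  Type.1 = factor(c("Grass", "Fire"))
)

with_placeholder <- type_toy %>%
  dplyr::mutate(Type.1 = ifelse(Type.1 == "Grass",
                                "Grass Pokemon",
                                as.character(.$Type.1)))

without_placeholder <- type_toy %>%
  dplyr::mutate(Type.1 = ifelse(Type.1 == "Grass",
                                "Grass Pokemon",
                                as.character(Type.1)))

identical(with_placeholder, without_placeholder)  # TRUE
with_placeholder$Type.1  # "Grass Pokemon" "Fire"

# Without as.character(), the factor's integer code leaks through
ifelse(type_toy$Type.1 == "Grass", "Grass Pokemon", type_toy$Type.1)
# "Grass Pokemon" "1"
```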
