From Tidyverse to Pandas and Back – An Introduction to Data Wrangling with Pyhton and R
September 27, 2020 By Pascal Schmidt #TidyTuesday R Tidyverse Tutorial
In this tutorial, we are going to have a look at a tidytuesday data set. We are exploring college tuition with R and Python at the same time! We are going from pandas to tidyverse and from tidyverse to pandas. Both languages are really cool for data science and a combination of both languages gives you more flexibility.
Let’s see how the tidyverse package compares to pandas and vice versa!
Python and R Setup
First, we have to load the R packages and then specify the virtual environment for python.
library(tidyverse)
library(ggrepel)
library(reticulate)
use_condaenv("anaconda", required = TRUE)
Then, we’ll load pandas, numpy, and plotnine.
import pandas as pd import numpy as np from plotnine import * from adjustText import adjust_text from mizani.formatters import dollar_format
For R, I just downloaded the data frame from the GitHub URL with read_csv() and then saved them as a data frame in a folder called data.
tuition_cost <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_cost.csv') %>%
dplyr::select(name, type, degree_length, in_state_tuition)
tuition_income <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_income.csv')
readr::write_csv(tuition_cost, here::here("/college-tuition/data/tuition_cost.csv"))
readr::write_csv(tuition_income, here::here("college-tuition/data/tuition_income.csv"))
From the data folder, I am reading in the data with pd.read_csv().
# read in data
cost = pd.read_csv("data/tuition_cost.csv", encoding = "ISO-8859-1")
income = pd.read_csv("data/tuition_income.csv")
Comparison:
– readr::read_csv() versus pd.read_csv()
Very similar and easy to remember.
Now, for our data sets, we are interested in which college has consistently increased tuition prices over the past years. This is the research question that we are trying to answer.
In order to do that, we are dropping net_cost and income_lvl, group by name and year, and then calculating the mean for each university in each year.
In order to accomplish the task, we are using select() to drop columns and group_by() + summarize() to calculate the mean. The summarize() function will automatically drop rows and displays the groups with the corresponding value.
Data Wrangling: Python vs. R, Pandas vs. Tidyverse
tuition_income %>%
dplyr::select(-c(net_cost, income_lvl)) %>%
dplyr::group_by(name, year) %>%
dplyr::summarise(avg_per_year = mean(total_price, na.rm = TRUE)) %>%
dplyr::ungroup() %>%
dplyr::arrange(name, year) -> temp_r
For python, we are using filter() to drop columns and group_by() + agg() for calculating the mean. In order to avoid multilevel indexing, we have to reset the index and then also rename the total_price column to avg_per_year.
temp_py = (income.filter(["name", "total_price", "year"])
# group by year and name to aggregate multiple rows with same year and name
.groupby(["name", "year"]) # or as_index = False
.agg({"total_price": "mean"})
.reset_index()
.sort_values(["name", "year"])
.rename(columns = {"total_price": "avg_per_year"}))
Another way to accomplish this would be to use the assign() method in combination with groupby() + transform(). This way, we are adding a new column, avg_per_year. One thing to note is that the transform() method works similarly to the mutate() function in R and the number of rows is equal to the original data frame. Hence, we are dropping duplicate rows and then also dropping the total_price column afterward in the code below.
temp_py = (income.filter(["name", "total_price", "year"])
# group by year and name to aggregate multiple rows with same year and name
.assign(avg_per_year = lambda x: x.groupby(["name", "year"])["total_price"]
.transform(lambda x: x.mean()))
.drop("total_price", axis = 1)
.drop_duplicates()
.sort_values(["name", "year"]))
Comparison:
– dplyr::select() versus pandas filter() method
– dplyr::group_by() + dplyr::summarize() versus pandas groupby() + agg() + reset_index() methods
Now as a sanity check that we are truly a master of pandas and tidyverse, we are comparing data frames and see if we get the same results for both of them.
dplyr::all_equal(py$temp_py, temp_r)
The result equals TRUE and we can move to the next steps.
We are the masters!!
Calculating the % Change in Tuition with pandas and tidyverse
In order to see which university has the most increasing tuition consistently, we will compute the difference for each consecutive year.
temp_r %>%
dplyr::group_by(name) %>%
# get difference from year to year for every university
dplyr::mutate(count = dplyr::n(),
difference = c(NA, diff(avg_per_year))) %>%
# filter out universities where tuition was not increasing over time
dplyr::filter(difference >= 0 | is.na(difference)) %>%
# after filtering out universities count again
dplyr::mutate(count_2 = dplyr::n()) %>%
# compare counts to first ones and filter out universities
# that have not had increasing tuition for every single year
dplyr::filter(count == count_2) -> temp_r
Comparison:
– pandas groupby + transform(“count”) versus tidyverse group_by() function and dplyr n() function
– For Python, we can use the the transform method with np.diff to calculate the difference and in R we are using the base function diff.
The rest is pretty straight forward and self-explanatory.
# calculate the difference for each year in college tuition
temp_py = (temp_py.assign(count = lambda x: x.groupby("name")["name"].transform("count"),
difference = lambda x: x.groupby("name")["avg_per_year"].transform(lambda x: np.append(np.nan, np.diff(x)))) \
# filter out tuition where there has not been an increase in tuition each year
.query("(difference >= 0) or difference != difference") \
.assign(count_2 = lambda x: x.groupby("name")["name"].transform("count")) \
# throw out rows where there has not been a consistent increase intuition over the years
.query("(count_2 == count)"))
When we look if both data frames are equal, we can see that only the type for the data frames looks different.
dplyr::all_equal(py$temp_py, temp_r) [1] - Different types for columncount: double vs integer - Different types for columncount_2: double vs integer
Next, we are going to filter for only 4-year degrees, join the data with the tuition_cost data set, calculate the percentages, split the data frame by college type, and then pick the 3 universities from each type, that have had the most increase in college tuition.
temp_r %>%
dplyr::filter(count_2 >= 4) %>%
dplyr::ungroup() %>%
dplyr::inner_join(tuition_cost %>%
dplyr::select(type, name, degree_length), by = "name") %>%
# only consider universities with 4 years of length
dplyr::filter(degree_length == "4 Year") %>%
dplyr::group_by(name) %>%
dplyr::mutate(change_in_price = (dplyr::last(avg_per_year) - avg_per_year[1]) / avg_per_year[1]) %>%
dplyr::ungroup() %>%
dplyr::group_split(type) %>%
purrr::map(~ dplyr::arrange(., desc(change_in_price))) %>%
purrr::map(~ dplyr::filter(., stringr::str_detect(name, { unique(.$name)[1:3] } %>%
paste0(collapse = "|")))) %>%
base::do.call(rbind, .) %>%
dplyr::ungroup() %>%
dplyr::mutate(name = forcats::fct_reorder(name, avg_per_year),
date = lubridate::ymd(year, truncated = 2L)) -> line_chart
line_chart %>%
dplyr::group_by(name) %>%
dplyr::summarise_all(~ max(.)) -> df_text
Here is the equivalent python code. First, we merge the data frame and then calculate the percentage change from the very first year and the last year.
# inner join to get degree length and type of university
tuition_income = pd.merge(temp_py,
cost,
on = "name")
temp_py_2 = (tuition_income.query('degree_length == "4 Year"') \
.assign(change_in_price = lambda x: x.groupby('name')['avg_per_year'] \
.transform(lambda x: (x.iloc[-1] - x.iloc[0]) / x.iloc[0])))
The percentage change is calculated with iloc and then the corresponding index of the row.
In the code below, we are going to split the data frame with the groupby method and then we are going to get the top universities for each type with the get_group method and indexing.
diff_types = np.unique(tuition_income["type"])
# initializing empty data frame
df = pd.DataFrame(columns = tuition_income.columns)
for i in range(len(diff_types)):
tu_in = (temp_py_2.groupby("type") \
.get_group(diff_types[i]) \
.sort_values("change_in_price", ascending = False))
top_3 = tu_in.drop_duplicates("name")["name"][0:3].reset_index()
tu_in = (tu_in[tu_in["name"].isin(top_3['name'])] \
.sort_values(["name", "year"]))
df = df.append(tu_in)
Python vs. R: plotnine vs. ggplot
Below, we are using the plotnine library which is the ggplot implementation in python. It is not as good as ggplot in R but still a decent plotting library for python.
df_text = df.sort_values(["name", "type", "year"], ascending = False) \
.drop_duplicates("name") \
.assign(change_in_price = lambda x: round(x["change_in_price"] * 100, 2))
(ggplot(df, aes(x = "factor(year)", y = "avg_per_year", color = "factor(name)", group = "name")) +
geom_point() +
geom_line() +
geom_label(data = df_text, mapping = aes(x = "factor(year)", y = "avg_per_year",
label = "change_in_price"), color = "black",
format_string='{}%') +
theme_light() +
theme(legend_position = (.5, -.1),
legend_direction = 'horizontal',
plot_title = element_text(hjust = 0.5, size = 20),
legend_title = element_blank(),
axis_title = element_text(size = 15),
axis_text = element_text(size = 12),
legend_text = element_text(size = 12)) +
scale_y_continuous(labels = dollar_format()) +
ylab("Tuition") +
xlab("Year") +
ggtitle("Top 9 Universities With the Most Tuition Increase Over Time"))
And here, we have the ggplot equivalent.
ggplot(line_chart) +
geom_point(aes(x = date, y = avg_per_year, col = name)) +
geom_line(aes(x = date, y = avg_per_year, col = name)) +
theme_minimal() +
theme(legend.position = "bottom",
plot.title = element_text(hjust = 0.5, size = 20),
legend.title = element_blank(),
axis.title = element_text(size = 15),
axis.text = element_text(size = 12),
legend.text = element_text(size = 12)) +
expand_limits(x = lubridate::ymd(2018.5, truncated = 2L)) +
geom_label_repel(data = df_text, show.legend = FALSE,
aes(x = date, y = avg_per_year, size = 0.5,
label = paste0(round(change_in_price, 2) * 100, "% ", type))) +
scale_y_continuous(labels = scales::dollar_format()) +
ylab("Tuition") +
xlab("Year") +
ggtitle("Top 3 Universities From Each Type (Public, Private, For Profit) \n With the Most Tuition Increase Over Time") +
guides(col = guide_legend(nrow = 3, byrow = TRUE)) +
scale_color_brewer(palette = "Set1")
ggsave("line_chart_ggplot.png", width = 10, height = 6)
I hope you have learned a few tricks and can have a smooth transition from pandas to the tidyverse or from the tidyverse to pandas.
Let me know in the comments below which language you are preferring the most for data cleaning and data wrangling task.
Recent Posts
Recent Comments
- Kardiana on The Lasso – R Tutorial (Part 3)
- Pascal Schmidt on RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium
- Pascal Schmidt on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications
- Gisa on Persistent Data Storage With a MySQL Database in R Shiny – An Example App
- Nicholas on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications
