From Tidyverse to Pandas and Back – An Introduction to Data Wrangling with Pyhton and R
September 27, 2020 By Pascal Schmidt #TidyTuesday R Tidyverse Tutorial
In this tutorial, we are going to have a look at a tidytuesday data set. We are exploring college tuition with R and Python at the same time! We are going from pandas
to tidyverse
and from tidyverse
to pandas
. Both languages are really cool for data science and a combination of both languages gives you more flexibility.
Let’s see how the tidyverse
package compares to pandas
and vice versa!
Python and R Setup
First, we have to load the R packages and then specify the virtual environment for python.
library(tidyverse) library(ggrepel) library(reticulate) use_condaenv("anaconda", required = TRUE)
Then, we’ll load pandas
, numpy
, and plotnine
.
import pandas as pd import numpy as np from plotnine import * from adjustText import adjust_text from mizani.formatters import dollar_format
For R, I just downloaded the data frame from the GitHub URL with read_csv()
and then saved them as a data frame in a folder called data
.
tuition_cost <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_cost.csv') %>% dplyr::select(name, type, degree_length, in_state_tuition) tuition_income <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_income.csv') readr::write_csv(tuition_cost, here::here("/college-tuition/data/tuition_cost.csv")) readr::write_csv(tuition_income, here::here("college-tuition/data/tuition_income.csv"))
From the data folder, I am reading in the data with pd.read_csv()
.
# read in data cost = pd.read_csv("data/tuition_cost.csv", encoding = "ISO-8859-1") income = pd.read_csv("data/tuition_income.csv")
Comparison:
– readr::read_csv()
versus pd.read_csv()
Very similar and easy to remember.
Now, for our data sets, we are interested in which college has consistently increased tuition prices over the past years. This is the research question that we are trying to answer.
In order to do that, we are dropping net_cost
and income_lvl
, group by name
and year
, and then calculating the mean for each university in each year.
In order to accomplish the task, we are using select()
to drop columns and group_by()
+ summarize()
to calculate the mean. The summarize()
function will automatically drop rows and displays the groups with the corresponding value.
Data Wrangling: Python vs. R, Pandas vs. Tidyverse
tuition_income %>% dplyr::select(-c(net_cost, income_lvl)) %>% dplyr::group_by(name, year) %>% dplyr::summarise(avg_per_year = mean(total_price, na.rm = TRUE)) %>% dplyr::ungroup() %>% dplyr::arrange(name, year) -> temp_r
For python, we are using filter()
to drop columns and group_by()
+ agg()
for calculating the mean. In order to avoid multilevel indexing, we have to reset the index and then also rename the total_price
column to avg_per_year
.
temp_py = (income.filter(["name", "total_price", "year"]) # group by year and name to aggregate multiple rows with same year and name .groupby(["name", "year"]) # or as_index = False .agg({"total_price": "mean"}) .reset_index() .sort_values(["name", "year"]) .rename(columns = {"total_price": "avg_per_year"}))
Another way to accomplish this would be to use the assign()
method in combination with groupby()
+ transform()
. This way, we are adding a new column, avg_per_year
. One thing to note is that the transform()
method works similarly to the mutate()
function in R and the number of rows is equal to the original data frame. Hence, we are dropping duplicate rows and then also dropping the total_price
column afterward in the code below.
temp_py = (income.filter(["name", "total_price", "year"]) # group by year and name to aggregate multiple rows with same year and name .assign(avg_per_year = lambda x: x.groupby(["name", "year"])["total_price"] .transform(lambda x: x.mean())) .drop("total_price", axis = 1) .drop_duplicates() .sort_values(["name", "year"]))
Comparison:
– dplyr::select()
versus pandas
filter()
method
– dplyr::group_by()
+ dplyr::summarize()
versus pandas
groupby()
+ agg()
+ reset_index()
methods
Now as a sanity check that we are truly a master of pandas
and tidyverse
, we are comparing data frames and see if we get the same results for both of them.
dplyr::all_equal(py$temp_py, temp_r)
The result equals TRUE
and we can move to the next steps.
We are the masters!!
Calculating the % Change in Tuition with pandas and tidyverse
In order to see which university has the most increasing tuition consistently, we will compute the difference for each consecutive year.
temp_r %>% dplyr::group_by(name) %>% # get difference from year to year for every university dplyr::mutate(count = dplyr::n(), difference = c(NA, diff(avg_per_year))) %>% # filter out universities where tuition was not increasing over time dplyr::filter(difference >= 0 | is.na(difference)) %>% # after filtering out universities count again dplyr::mutate(count_2 = dplyr::n()) %>% # compare counts to first ones and filter out universities # that have not had increasing tuition for every single year dplyr::filter(count == count_2) -> temp_r
Comparison:
– pandas
groupby
+ transform(“count”)
versus tidyverse
group_by()
function and dplyr
n()
function
– For Python, we can use the the transform
method with np.diff
to calculate the difference and in R we are using the base function diff
.
The rest is pretty straight forward and self-explanatory.
# calculate the difference for each year in college tuition temp_py = (temp_py.assign(count = lambda x: x.groupby("name")["name"].transform("count"), difference = lambda x: x.groupby("name")["avg_per_year"].transform(lambda x: np.append(np.nan, np.diff(x)))) \ # filter out tuition where there has not been an increase in tuition each year .query("(difference >= 0) or difference != difference") \ .assign(count_2 = lambda x: x.groupby("name")["name"].transform("count")) \ # throw out rows where there has not been a consistent increase intuition over the years .query("(count_2 == count)"))
When we look if both data frames are equal, we can see that only the type for the data frames looks different.
dplyr::all_equal(py$temp_py, temp_r) [1] - Different types for columncount
: double vs integer - Different types for columncount_2
: double vs integer
Next, we are going to filter for only 4-year degrees, join the data with the tuition_cost
data set, calculate the percentages, split the data frame by college type, and then pick the 3 universities from each type, that have had the most increase in college tuition.
temp_r %>% dplyr::filter(count_2 >= 4) %>% dplyr::ungroup() %>% dplyr::inner_join(tuition_cost %>% dplyr::select(type, name, degree_length), by = "name") %>% # only consider universities with 4 years of length dplyr::filter(degree_length == "4 Year") %>% dplyr::group_by(name) %>% dplyr::mutate(change_in_price = (dplyr::last(avg_per_year) - avg_per_year[1]) / avg_per_year[1]) %>% dplyr::ungroup() %>% dplyr::group_split(type) %>% purrr::map(~ dplyr::arrange(., desc(change_in_price))) %>% purrr::map(~ dplyr::filter(., stringr::str_detect(name, { unique(.$name)[1:3] } %>% paste0(collapse = "|")))) %>% base::do.call(rbind, .) %>% dplyr::ungroup() %>% dplyr::mutate(name = forcats::fct_reorder(name, avg_per_year), date = lubridate::ymd(year, truncated = 2L)) -> line_chart line_chart %>% dplyr::group_by(name) %>% dplyr::summarise_all(~ max(.)) -> df_text
Here is the equivalent python code. First, we merge the data frame and then calculate the percentage change from the very first year and the last year.
# inner join to get degree length and type of university tuition_income = pd.merge(temp_py, cost, on = "name") temp_py_2 = (tuition_income.query('degree_length == "4 Year"') \ .assign(change_in_price = lambda x: x.groupby('name')['avg_per_year'] \ .transform(lambda x: (x.iloc[-1] - x.iloc[0]) / x.iloc[0])))
The percentage change is calculated with iloc
and then the corresponding index of the row.
In the code below, we are going to split the data frame with the groupby
method and then we are going to get the top universities for each type with the get_group
method and indexing.
diff_types = np.unique(tuition_income["type"]) # initializing empty data frame df = pd.DataFrame(columns = tuition_income.columns) for i in range(len(diff_types)): tu_in = (temp_py_2.groupby("type") \ .get_group(diff_types[i]) \ .sort_values("change_in_price", ascending = False)) top_3 = tu_in.drop_duplicates("name")["name"][0:3].reset_index() tu_in = (tu_in[tu_in["name"].isin(top_3['name'])] \ .sort_values(["name", "year"])) df = df.append(tu_in)
Python vs. R: plotnine vs. ggplot
Below, we are using the plotnine
library which is the ggplot
implementation in python. It is not as good as ggplot
in R but still a decent plotting library for python.
df_text = df.sort_values(["name", "type", "year"], ascending = False) \ .drop_duplicates("name") \ .assign(change_in_price = lambda x: round(x["change_in_price"] * 100, 2)) (ggplot(df, aes(x = "factor(year)", y = "avg_per_year", color = "factor(name)", group = "name")) + geom_point() + geom_line() + geom_label(data = df_text, mapping = aes(x = "factor(year)", y = "avg_per_year", label = "change_in_price"), color = "black", format_string='{}%') + theme_light() + theme(legend_position = (.5, -.1), legend_direction = 'horizontal', plot_title = element_text(hjust = 0.5, size = 20), legend_title = element_blank(), axis_title = element_text(size = 15), axis_text = element_text(size = 12), legend_text = element_text(size = 12)) + scale_y_continuous(labels = dollar_format()) + ylab("Tuition") + xlab("Year") + ggtitle("Top 9 Universities With the Most Tuition Increase Over Time"))
And here, we have the ggplot
equivalent.
ggplot(line_chart) + geom_point(aes(x = date, y = avg_per_year, col = name)) + geom_line(aes(x = date, y = avg_per_year, col = name)) + theme_minimal() + theme(legend.position = "bottom", plot.title = element_text(hjust = 0.5, size = 20), legend.title = element_blank(), axis.title = element_text(size = 15), axis.text = element_text(size = 12), legend.text = element_text(size = 12)) + expand_limits(x = lubridate::ymd(2018.5, truncated = 2L)) + geom_label_repel(data = df_text, show.legend = FALSE, aes(x = date, y = avg_per_year, size = 0.5, label = paste0(round(change_in_price, 2) * 100, "% ", type))) + scale_y_continuous(labels = scales::dollar_format()) + ylab("Tuition") + xlab("Year") + ggtitle("Top 3 Universities From Each Type (Public, Private, For Profit) \n With the Most Tuition Increase Over Time") + guides(col = guide_legend(nrow = 3, byrow = TRUE)) + scale_color_brewer(palette = "Set1")
ggsave("line_chart_ggplot.png", width = 10, height = 6)
I hope you have learned a few tricks and can have a smooth transition from pandas to the tidyverse
or from the tidyverse
to pandas
.
Let me know in the comments below which language you are preferring the most for data cleaning and data wrangling task.
Recent Posts
Recent Comments
- Kardiana on The Lasso – R Tutorial (Part 3)
- Pascal Schmidt on RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium
- Pascal Schmidt on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications
- Gisa on Persistent Data Storage With a MySQL Database in R Shiny – An Example App
- Nicholas on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications