From Tidyverse to Pandas and Back – An Introduction to Data Wrangling with Pyhton and R

In this tutorial, we are going to have a look at a tidytuesday data set. We are exploring college tuition with R and Python at the same time! We are going from pandas to tidyverse and from tidyverse to pandas. Both languages are really cool for data science and a combination of both languages gives you more flexibility.

Let’s see how the tidyverse package compares to pandas and vice versa!

Python and R Setup

First, we have to load the R packages and then specify the virtual environment for python.

library(tidyverse) 
library(ggrepel) 
library(reticulate)

use_condaenv("anaconda", required = TRUE)

Then, we’ll load pandasnumpy, and plotnine.

import pandas as pd
import numpy as np
from plotnine import *
from adjustText import adjust_text
from mizani.formatters import dollar_format

For R, I just downloaded the data frame from the GitHub URL with read_csv() and then saved them as a data frame in a folder called data.

tuition_cost <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_cost.csv') %>%
    dplyr::select(name, type, degree_length, in_state_tuition)
tuition_income <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_income.csv')

readr::write_csv(tuition_cost, here::here("/college-tuition/data/tuition_cost.csv"))
readr::write_csv(tuition_income, here::here("college-tuition/data/tuition_income.csv"))

From the data folder, I am reading in the data with pd.read_csv().

# read in data 
cost = pd.read_csv("data/tuition_cost.csv", encoding = "ISO-8859-1") 
income = pd.read_csv("data/tuition_income.csv")

Comparison:

readr::read_csv() versus pd.read_csv()

Very similar and easy to remember.

Now, for our data sets, we are interested in which college has consistently increased tuition prices over the past years. This is the research question that we are trying to answer.

In order to do that, we are dropping net_cost and income_lvl, group by name and year, and then calculating the mean for each university in each year.

In order to accomplish the task, we are using select() to drop columns and group_by() + summarize() to calculate the mean. The summarize() function will automatically drop rows and displays the groups with the corresponding value.

Data Wrangling: Python vs. R, Pandas vs. Tidyverse

tuition_income %>%
    dplyr::select(-c(net_cost, income_lvl)) %>%
    dplyr::group_by(name, year) %>%
    dplyr::summarise(avg_per_year = mean(total_price, na.rm = TRUE)) %>%
    dplyr::ungroup() %>%
    dplyr::arrange(name, year) -> temp_r

For python, we are using filter() to drop columns and group_by() + agg() for calculating the mean. In order to avoid multilevel indexing, we have to reset the index and then also rename the total_price column to avg_per_year.

temp_py = (income.filter(["name", "total_price", "year"])

    # group by year and name to aggregate multiple rows with same year and name
    .groupby(["name", "year"]) # or as_index = False
    .agg({"total_price": "mean"})
    .reset_index()
    .sort_values(["name", "year"])
    .rename(columns = {"total_price": "avg_per_year"}))

Another way to accomplish this would be to use the assign() method in combination with groupby() + transform(). This way, we are adding a new column, avg_per_year. One thing to note is that the transform() method works similarly to the mutate() function in R and the number of rows is equal to the original data frame. Hence, we are dropping duplicate rows and then also dropping the total_price column afterward in the code below.

temp_py = (income.filter(["name", "total_price", "year"])

    # group by year and name to aggregate multiple rows with same year and name
    .assign(avg_per_year = lambda x: x.groupby(["name", "year"])["total_price"] 
    .transform(lambda x: x.mean())) 
    .drop("total_price", axis = 1)
    .drop_duplicates()
    .sort_values(["name", "year"]))

Comparison:

dplyr::select() versus pandas filter() method
dplyr::group_by() + dplyr::summarize() versus pandas groupby() + agg() + reset_index() methods

Now as a sanity check that we are truly a master of pandas and tidyverse, we are comparing data frames and see if we get the same results for both of them.

dplyr::all_equal(py$temp_py, temp_r)

The result equals TRUE and we can move to the next steps.

We are the masters!!

Calculating the % Change in Tuition with pandas and tidyverse

In order to see which university has the most increasing tuition consistently, we will compute the difference for each consecutive year.

temp_r %>%
    dplyr::group_by(name) %>%

    # get difference from year to year for every university
    dplyr::mutate(count = dplyr::n(),
    difference = c(NA, diff(avg_per_year))) %>%

    # filter out universities where tuition was not increasing over time
    dplyr::filter(difference >= 0 | is.na(difference)) %>%

    # after filtering out universities count again
    dplyr::mutate(count_2 = dplyr::n()) %>%

    # compare counts to first ones and filter out universities
    # that have not had increasing tuition for every single year
    dplyr::filter(count == count_2) -> temp_r

Comparison:

pandas  groupby + transform(“count”) versus tidyverse group_by() function and dplyr n() function
– For Python, we can use the the transform method with np.diff to calculate the difference and in R we are using the base function diff.

The rest is pretty straight forward and self-explanatory.

# calculate the difference for each year in college tuition
temp_py = (temp_py.assign(count = lambda x: x.groupby("name")["name"].transform("count"),
    difference = lambda x: x.groupby("name")["avg_per_year"].transform(lambda x: np.append(np.nan, np.diff(x)))) \

    # filter out tuition where there has not been an increase in tuition each year
    .query("(difference >= 0) or difference != difference") \
    .assign(count_2 = lambda x: x.groupby("name")["name"].transform("count")) \

    # throw out rows where there has not been a consistent increase intuition over the years
    .query("(count_2 == count)"))

When we look if both data frames are equal, we can see that only the type for the data frames looks different.

dplyr::all_equal(py$temp_py, temp_r)

[1] - Different types for column count: double vs integer
    - Different types for column count_2: double vs integer

Next, we are going to filter for only 4-year degrees, join the data with the tuition_cost data set, calculate the percentages, split the data frame by college type, and then pick the 3 universities from each type, that have had the most increase in college tuition.

temp_r %>%
    dplyr::filter(count_2 >= 4) %>%
    dplyr::ungroup() %>%
    dplyr::inner_join(tuition_cost %>%
    dplyr::select(type, name, degree_length), by = "name") %>%

    # only consider universities with 4 years of length
    dplyr::filter(degree_length == "4 Year") %>%
    dplyr::group_by(name) %>%
    dplyr::mutate(change_in_price = (dplyr::last(avg_per_year) - avg_per_year[1]) / avg_per_year[1]) %>%
    dplyr::ungroup() %>%
    dplyr::group_split(type) %>%
    purrr::map(~ dplyr::arrange(., desc(change_in_price))) %>%
    purrr::map(~ dplyr::filter(., stringr::str_detect(name, { unique(.$name)[1:3] } %>%
                                                                 paste0(collapse = "|")))) %>%
    base::do.call(rbind, .) %>%
    dplyr::ungroup() %>%
    dplyr::mutate(name = forcats::fct_reorder(name, avg_per_year),
                  date = lubridate::ymd(year, truncated = 2L)) -> line_chart

line_chart %>%
    dplyr::group_by(name) %>%
    dplyr::summarise_all(~ max(.)) -> df_text

Here is the equivalent python code. First, we merge the data frame and then calculate the percentage change from the very first year and the last year.

# inner join to get degree length and type of university
tuition_income = pd.merge(temp_py,
                          cost,
                          on = "name")

temp_py_2 = (tuition_income.query('degree_length == "4 Year"') \
    .assign(change_in_price = lambda x: x.groupby('name')['avg_per_year'] \
    .transform(lambda x: (x.iloc[-1] - x.iloc[0]) / x.iloc[0])))

The percentage change is calculated with iloc and then the corresponding index of the row.

In the code below, we are going to split the data frame with the groupby method and then we are going to get the top universities for each type with the get_group method and indexing.

diff_types = np.unique(tuition_income["type"])
# initializing empty data frame
df = pd.DataFrame(columns = tuition_income.columns)

for i in range(len(diff_types)):

    tu_in = (temp_py_2.groupby("type") \
        .get_group(diff_types[i]) \
        .sort_values("change_in_price", ascending = False))

    top_3 = tu_in.drop_duplicates("name")["name"][0:3].reset_index()
    tu_in = (tu_in[tu_in["name"].isin(top_3['name'])] \
        .sort_values(["name", "year"]))

    df = df.append(tu_in)

Python vs. R: plotnine vs. ggplot

Below, we are using the plotnine library which is the ggplot implementation in python. It is not as good as ggplot in R but still a decent plotting library for python.

df_text = df.sort_values(["name", "type", "year"], ascending = False) \
    .drop_duplicates("name") \
    .assign(change_in_price = lambda x: round(x["change_in_price"] * 100, 2))

(ggplot(df, aes(x = "factor(year)", y = "avg_per_year", color = "factor(name)", group = "name")) +
    geom_point() +
    geom_line() +
    geom_label(data = df_text, mapping = aes(x = "factor(year)", y = "avg_per_year",
    label = "change_in_price"), color = "black",
    format_string='{}%') +
    theme_light() +
    theme(legend_position = (.5, -.1),
          legend_direction = 'horizontal',
          plot_title = element_text(hjust = 0.5, size = 20),
          legend_title = element_blank(),
          axis_title = element_text(size = 15),
          axis_text = element_text(size = 12),
          legend_text = element_text(size = 12)) +
          scale_y_continuous(labels = dollar_format()) +
     ylab("Tuition") +
     xlab("Year") +
     ggtitle("Top 9 Universities With the Most Tuition Increase Over Time"))

And here, we have the ggplot equivalent.

ggplot(line_chart) + 
  geom_point(aes(x = date, y = avg_per_year, col = name)) +
  geom_line(aes(x = date, y = avg_per_year, col = name)) +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.title = element_text(hjust = 0.5, size = 20),
        legend.title = element_blank(),
        axis.title = element_text(size = 15),
        axis.text = element_text(size = 12),
        legend.text = element_text(size = 12)) +
  expand_limits(x = lubridate::ymd(2018.5, truncated = 2L)) +
  geom_label_repel(data = df_text, show.legend = FALSE,
                   aes(x = date, y = avg_per_year, size = 0.5,
                       label = paste0(round(change_in_price, 2) * 100, "% ", type))) +
  scale_y_continuous(labels = scales::dollar_format()) +
  ylab("Tuition") +
  xlab("Year") +
  ggtitle("Top 3 Universities From Each Type (Public, Private, For Profit) \n With the Most Tuition Increase Over Time") +
  guides(col = guide_legend(nrow = 3, byrow = TRUE)) +
  scale_color_brewer(palette = "Set1")

 

ggsave("line_chart_ggplot.png", width = 10, height = 6)

pandas vs.tidyverse

I hope you have learned a few tricks and can have a smooth transition from pandas to the tidyverse or from the tidyverse to pandas.

Let me know in the comments below which language you are preferring the most for data cleaning and data wrangling task.

Post your comment