What Were the Most Hyped Broadway Musicals of All Time? #TidyTuesday
May 2, 2020 By Pascal Schmidt #TidyTuesday R
During this week’s #TidyTuesday challenge, I had the opportunity to explore how to place images in ggplot
. To release you from your suspense about my plot, here it is:
We can clearly see how far ahead Hamilton is. In fact, since its preview, there has not been any performance that has not been sold out completely! Wow! In second place comes Harry Potter and the cursed child. The Harry Potter musical was consecutively sold out since its start in early 2018 until the end of February where there were still some seats left in the theaters. Still, the musical was consecutively sold out for almost 2 years.
Maybe it is a surprise that we do not see The Lion King among the top ten since it is the most successful musical in terms of the overall money being spent to watch the musical.
Data Investigation
After loading the packages and data…
library(tidyverse) library(tidytuesdayR) library(ggimage) tuesdata <- tidytuesdayR::tt_load("2020-04-28") gross <- tuesdata$grosses
… we’ll have a look at it.
head(gross) ## # A tibble: 6 x 14 ## week_ending week_number weekly_gross_ov~ show theatre weekly_gross ## <date> <dbl> <dbl> <chr> <chr> <dbl> ## 1 1985-06-09 1 3915937 42nd~ St. Ja~ 282368 ## 2 1985-06-09 1 3915937 A Ch~ Sam S.~ 222584 ## 3 1985-06-09 1 3915937 Aren~ Brooks~ 249272 ## 4 1985-06-09 1 3915937 Arms~ Circle~ 95688 ## 5 1985-06-09 1 3915937 As Is Lyceum~ 61059 ## 6 1985-06-09 1 3915937 Big ~ Eugene~ 255386 ## # ... with 8 more variables: potential_gross <lgl>, avg_ticket_price <dbl>, ## # top_ticket_price <lgl>, seats_sold <dbl>, seats_in_theatre <dbl>, ## # pct_capacity <dbl>, performances <dbl>, previews <dbl>;
As you can see, the data is aggregated by week. We are interested in the pct_capacity
column. The pct_capacity
column contains the percentage of tickets sold / seats in theater
. Sometimes, the number can be more than one when standing tickets sell.
Data Manipulation
Now, to see for how many consecutive months each show was sold out, I am splitting the show
column, which consists of all the musical names, into a list that contains data frames for each show. Then I will arrange()
all these data frames by week_ending
. Next, we want to see if, for each week, the pct_capacity
value will be 1 or more. We could do that with a for loop and see if df[i, “pct_capacity”] >= 1
and then have a counter to see for how many weeks that is the case.
However, thanks to purrr
and the head_while()
function, we get back a data frame until a certain condition is not being met. Then, we count the number of rows in the data frame to see how many consecutive weeks a show has been sold out for.
head(gross) ## # A tibble: 6 x 14 ## week_ending week_number weekly_gross_ov~ show theatre weekly_gross ## <date> <dbl> <dbl> <chr> <chr> <dbl> ## 1 1985-06-09 1 3915937 42nd~ St. Ja~ 282368 ## 2 1985-06-09 1 3915937 A Ch~ Sam S.~ 222584 ## 3 1985-06-09 1 3915937 Aren~ Brooks~ 249272 ## 4 1985-06-09 1 3915937 Arms~ Circle~ 95688 ## 5 1985-06-09 1 3915937 As Is Lyceum~ 61059 ## 6 1985-06-09 1 3915937 Big ~ Eugene~ 255386 ## # ... with 8 more variables: potential_gross <lgl>, avg_ticket_price <dbl>, ## # top_ticket_price <lgl>, seats_sold <dbl>, seats_in_theatre <dbl>, ## # pct_capacity <dbl>, performances <dbl>, previews <dbl>
We can see, that for the first six shows in our list, the first show was not sold out. However, we do not know exactly what shows belong to what number. Hence…
gross %>% dplyr::group_split(show) %>% purrr::map2(., most_hyped, ~ .x[1:.y, ]) %>% do.call(rbind, .) -> most_hyped head(most_hyped) ## # A tibble: 6 x 14 ## week_ending week_number weekly_gross_ov~ show theatre weekly_gross ## <date> <dbl> <dbl> <chr> <chr> <dbl> ## 1 2004-10-24 21 12484083 "'ni~ Royale~ 125452 ## 2 2003-05-11 50 13957866 ""M~ Royale~ 148824 ## 3 2008-07-06 6 18880521 "[ti~ Lyceum~ 69299 ## 4 2007-04-15 46 22462312 "110~ Studio~ 197785 ## 5 2008-09-21 17 15646164 "13" Bernar~ 203326 ## 6 1997-07-20 7 9449380 "177~ Gershw~ 150045 ## # ... with 8 more variables: potential_gross <lgl>, avg_ticket_price <dbl>, ## # top_ticket_price <lgl>, seats_sold <dbl>, seats_in_theatre <dbl>, ## # pct_capacity <dbl>, performances <dbl>, previews <dbl>
… we split our initial data frame again by show and then use map2()
to loop over each data frame in the list. In the case above, .x is each data frame and .y is the list with the integers we calculated in the previous code chunk. So .x[1:.y, ]
returns all the rows for which shows were consecutively sold out.
Now, let’s have a look at the top ten and let’s transform the weeks into months.
most_hyped %>% dplyr::count(show, sort = TRUE) %>% dplyr::arrange(desc(n)) %>% .[1:10, ] %>% dplyr::mutate(months = round(n / 52 * 12, 2)) -> top_ten top_ten ## # A tibble: 10 x 3 ## show n months ## <chr> <int> <dbl> ## 1 Hamilton 241 55.6 ## 2 Harry Potter and the Cursed Child, Parts One and Two 100 23.1 ## 3 To Kill A Mockingbird 66 15.2 ## 4 Moulin Rouge! The Musical! 36 8.31 ## 5 Cats 34 7.85 ## 6 Springsteen On Broadway 27 6.23 ## 7 Mean Girls 25 5.77 ## 8 Motown The Musical 20 4.62 ## 9 Fish in the Dark 18 4.15 ## 10 Hughie
So, n is the weeks (the number of rows) and then we transform them into months. There we go, the Hamilton musical seems to be the most hyped musical of all times with shows that have been consecutively sold old since day 1. For almost 56 months.
Plotting the Most Hyped Musicals of All Times With ggplot
For the ggplot
you have already seen on top, I downloaded the musical images from Google and used the ggimage
package to place them into the plot.
First, however, I saved the images and then put them into the top ten
data frame for easier plotting.
images <- list.files()[stringr::str_detect(list.files(), pattern = ".*png")] path <- vector(mode = "character", length = nrow(top_ten)) for (i in 1:nrow(top_ten)) { index <- which(grepl( pattern = stringr::str_remove(images[i], "\.png"), top_ten$show, ignore.case = TRUE )) path[index] <- images[i] } path <- paste0(here::here(), "/", path) top_ten %>% dplyr::bind_cols(dplyr::as_tibble(path)) %>% dplyr::mutate(show = forcats::fct_reorder(show, n, .desc = TRUE)) -> top_ten top_ten ## # A tibble: 10 x 4 ## show n months value ## <fct> <int> <dbl> <chr> ## 1 Hamilton 241 55.6 C:/Users/Pascal Schmidt/Desktop/ti~ ## 2 Harry Potter and the Cursed~ 100 23.1 C:/Users/Pascal Schmidt/Desktop/ti~ ## 3 To Kill A Mockingbird 66 15.2 C:/Users/Pascal Schmidt/Desktop/ti~ ## 4 Moulin Rouge! The Musical! 36 8.31 C:/Users/Pascal Schmidt/Desktop/ti~ ## 5 Cats 34 7.85 C:/Users/Pascal Schmidt/Desktop/ti~ ## 6 Springsteen On Broadway 27 6.23 C:/Users/Pascal Schmidt/Desktop/ti~ ## 7 Mean Girls 25 5.77 C:/Users/Pascal Schmidt/Desktop/ti~ ## 8 Motown The Musical 20 4.62 C:/Users/Pascal Schmidt/Desktop/ti~ ## 9 Fish in the Dark 18 4.15 C:/Users/Pascal Schmidt/Desktop/ti~ ## 10 Hughie 14 3.23 C:/Users/Pascal Schmidt/Desktop/ti~
Before we start plotting, you can see in my plot, that the gold bars become more transparent. This is because I also added an alpha
column to the data frame.
top_ten %>% dplyr::mutate(alpha = c(1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)) -> top_ten
Now, we can start plotting!
ggplot(top_ten, aes(x = show, y = months)) + geom_col(fill = "#DAA520", aes(alpha = alpha)) + geom_image(aes(image = value, y = months + 4.5), size = 0.085) + theme( panel.grid.major = element_blank(), panel.grid.minor = element_blank(), plot.background = element_rect(fill = "black"), panel.background = element_rect(fill = "black"), axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks = element_blank(), axis.line = element_blank(), axis.text.y = element_text(color = "#e5e4e2"), axis.title.y = element_text(color = "#e5e4e2"), plot.subtitle = element_text(color = "#e5e4e2", hjust = 0.5, size = 12), plot.title = element_text(color = "#e5e4e2", hjust = 0.5, size = 16), legend.position = "none" ) + ylab("Months") + labs( title = "Top Ten of the Most Initially Hyped Musicals", subtitle = "For How Many Consecutive Months After Premiering Were Musicals Sold Out?" ) + annotate("curve", x = 4, y = 50, xend = 1.5, yend = 60, curvature = 0.3, arrow = arrow(length = unit(3, "mm")), color = "#DAA520") + annotate("text", x = 5, y = 45, label = "Hamilton has been sold out since its first preview non July 19th 2015 for every single performance. nWill there ever be a performance that is not sold out???", color = "#DAA520" ) ggsave("final/broadway.jpeg", width = 10)
Again, the created plot:
I hope you have enjoyed this week’s #TidyTuesday challenge. The code can be found on my Github.
For more #TidyTuesday, check out the links below:
Recent Posts
Recent Comments
- Kardiana on The Lasso – R Tutorial (Part 3)
- Pascal Schmidt on RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium
- Pascal Schmidt on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications
- Gisa on Persistent Data Storage With a MySQL Database in R Shiny – An Example App
- Nicholas on Dynamic Tabs, insertTab, and removeTab For More efficient R Shiny Applications