What Were the Most Hyped Broadway Musicals of All Time? #TidyTuesday

May 2, 2020 By Pascal Schmidt #TidyTuesday R

During this week’s #TidyTuesday challenge, I had the opportunity to explore how to place images in ggplot. To release you from your suspense about my plot, here it is:

We can clearly see how far ahead Hamilton is. In fact, since its preview, there has not been any performance that has not been sold out completely! Wow! In second place comes Harry Potter and the cursed child. The Harry Potter musical was consecutively sold out since its start in early 2018 until the end of February where there were still some seats left in the theaters. Still, the musical was consecutively sold out for almost 2 years.
Maybe it is a surprise that we do not see The Lion King among the top ten since it is the most successful musical in terms of the overall money being spent to watch the musical.

Data Investigation

After loading the packages and data…

library(tidyverse)
library(tidytuesdayR)
library(ggimage)
tuesdata <- tidytuesdayR::tt_load("2020-04-28")
gross <- tuesdata$grosses

… we’ll have a look at it.

head(gross)
## # A tibble: 6 x 14
##   week_ending week_number weekly_gross_ov~ show  theatre weekly_gross
##   <date>            <dbl>           <dbl> <chr> <chr>          <dbl>
## 1 1985-06-09            1          3915937 42nd~ St. Ja~       282368
## 2 1985-06-09            1          3915937 A Ch~ Sam S.~       222584
## 3 1985-06-09            1          3915937 Aren~ Brooks~       249272
## 4 1985-06-09            1          3915937 Arms~ Circle~        95688
## 5 1985-06-09            1          3915937 As Is Lyceum~        61059
## 6 1985-06-09            1          3915937 Big ~ Eugene~       255386
## # ... with 8 more variables: potential_gross <lgl>, avg_ticket_price <dbl>,
## #   top_ticket_price <lgl>, seats_sold <dbl>, seats_in_theatre <dbl>,
## #   pct_capacity <dbl>, performances <dbl>, previews <dbl>;

As you can see, the data is aggregated by week. We are interested in the pct_capacity column. The pct_capacity column contains the percentage of tickets sold / seats in theater. Sometimes, the number can be more than one when standing tickets sell.

Data Manipulation

Now, to see for how many consecutive months each show was sold out, I am splitting the show column, which consists of all the musical names, into a list that contains data frames for each show. Then I will arrange() all these data frames by week_ending. Next, we want to see if, for each week, the pct_capacity value will be 1 or more. We could do that with a for loop and see if df[i, “pct_capacity”] >= 1 and then have a counter to see for how many weeks that is the case.
However, thanks to purrr and the head_while() function, we get back a data frame until a certain condition is not being met. Then, we count the number of rows in the data frame to see how many consecutive weeks a show has been sold out for.

head(gross)
## # A tibble: 6 x 14
##   week_ending week_number weekly_gross_ov~ show  theatre weekly_gross
##   &lt;date&gt;            &lt;dbl&gt;            &lt;dbl&gt; &lt;chr&gt; &lt;chr&gt;          &lt;dbl&gt;
## 1 1985-06-09            1          3915937 42nd~ St. Ja~       282368
## 2 1985-06-09            1          3915937 A Ch~ Sam S.~       222584
## 3 1985-06-09            1          3915937 Aren~ Brooks~       249272
## 4 1985-06-09            1          3915937 Arms~ Circle~        95688
## 5 1985-06-09            1          3915937 As Is Lyceum~        61059
## 6 1985-06-09            1          3915937 Big ~ Eugene~       255386
## # ... with 8 more variables: potential_gross &lt;lgl&gt;, avg_ticket_price &lt;dbl&gt;,
## #   top_ticket_price &lt;lgl&gt;, seats_sold &lt;dbl&gt;, seats_in_theatre &lt;dbl&gt;,
## #   pct_capacity &lt;dbl&gt;, performances &lt;dbl&gt;, previews &lt;dbl&gt;

We can see, that for the first six shows in our list, the first show was not sold out. However, we do not know exactly what shows belong to what number. Hence…

gross %>%
  dplyr::group_split(show) %>%
  purrr::map2(., most_hyped, ~ .x[1:.y, ]) %>%
  do.call(rbind, .) -> most_hyped
head(most_hyped)
## # A tibble: 6 x 14
##   week_ending week_number weekly_gross_ov~ show  theatre weekly_gross
##   <date>            <dbl>            <dbl> <chr> <chr>          <dbl>
## 1 2004-10-24           21         12484083 "'ni~ Royale~       125452
## 2 2003-05-11           50         13957866 ""M~ Royale~       148824
## 3 2008-07-06            6         18880521 "[ti~ Lyceum~        69299
## 4 2007-04-15           46         22462312 "110~ Studio~       197785
## 5 2008-09-21           17         15646164 "13"  Bernar~       203326
## 6 1997-07-20            7          9449380 "177~ Gershw~       150045
## # ... with 8 more variables: potential_gross <lgl>, avg_ticket_price <dbl>,
## #   top_ticket_price <lgl>, seats_sold <dbl>, seats_in_theatre <dbl>,
## #   pct_capacity <dbl>, performances <dbl>, previews <dbl>

… we split our initial data frame again by show and then use map2() to loop over each data frame in the list. In the case above, .x is each data frame and .y is the list with the integers we calculated in the previous code chunk. So .x[1:.y, ] returns all the rows for which shows were consecutively sold out.
Now, let’s have a look at the top ten and let’s transform the weeks into months.

most_hyped %>%
  dplyr::count(show, sort = TRUE) %>%
  dplyr::arrange(desc(n)) %>%
  .[1:10, ] %>%
  dplyr::mutate(months = round(n / 52 * 12, 2)) -> top_ten
top_ten
## # A tibble: 10 x 3
##    show                                                     n months
##    <chr>                                                <int>  <dbl>
##  1 Hamilton                                               241  55.6 
##  2 Harry Potter and the Cursed Child, Parts One and Two   100  23.1 
##  3 To Kill A Mockingbird                                   66  15.2 
##  4 Moulin Rouge! The Musical!                              36   8.31
##  5 Cats                                                    34   7.85
##  6 Springsteen On Broadway                                 27   6.23
##  7 Mean Girls                                              25   5.77
##  8 Motown The Musical                                      20   4.62
##  9 Fish in the Dark                                        18   4.15
## 10 Hughie

So, n is the weeks (the number of rows) and then we transform them into months. There we go, the Hamilton musical seems to be the most hyped musical of all times with shows that have been consecutively sold old since day 1. For almost 56 months.

Plotting the Most Hyped Musicals of All Times With ggplot

For the ggplot you have already seen on top, I downloaded the musical images from Google and used the ggimage package to place them into the plot.
First, however, I saved the images and then put them into the top ten data frame for easier plotting.

images <- list.files()[stringr::str_detect(list.files(), pattern = ".*png")]
path <- vector(mode = "character", length = nrow(top_ten))
for (i in 1:nrow(top_ten)) {
  index <- which(grepl(
    pattern = stringr::str_remove(images[i], "\.png"),
    top_ten$show, ignore.case = TRUE
  ))
  path[index] <- images[i]
}
path <- paste0(here::here(), "/", path)
top_ten %>%
  dplyr::bind_cols(dplyr::as_tibble(path)) %>%
  dplyr::mutate(show = forcats::fct_reorder(show, n, .desc = TRUE)) -> top_ten
top_ten
## # A tibble: 10 x 4
##    show                             n months value                              
##    <fct>                        <int>  <dbl> <chr>                              
##  1 Hamilton                       241  55.6  C:/Users/Pascal Schmidt/Desktop/ti~
##  2 Harry Potter and the Cursed~   100  23.1  C:/Users/Pascal Schmidt/Desktop/ti~
##  3 To Kill A Mockingbird           66  15.2  C:/Users/Pascal Schmidt/Desktop/ti~
##  4 Moulin Rouge! The Musical!      36   8.31 C:/Users/Pascal Schmidt/Desktop/ti~
##  5 Cats                            34   7.85 C:/Users/Pascal Schmidt/Desktop/ti~
##  6 Springsteen On Broadway         27   6.23 C:/Users/Pascal Schmidt/Desktop/ti~
##  7 Mean Girls                      25   5.77 C:/Users/Pascal Schmidt/Desktop/ti~
##  8 Motown The Musical              20   4.62 C:/Users/Pascal Schmidt/Desktop/ti~
##  9 Fish in the Dark                18   4.15 C:/Users/Pascal Schmidt/Desktop/ti~
## 10 Hughie                          14   3.23 C:/Users/Pascal Schmidt/Desktop/ti~

Before we start plotting, you can see in my plot, that the gold bars become more transparent. This is because I also added an alpha column to the data frame.

top_ten %>%
  dplyr::mutate(alpha = c(1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)) -> top_ten

Now, we can start plotting!

ggplot(top_ten, aes(x = show, y = months)) +
  geom_col(fill = "#DAA520", aes(alpha = alpha)) +
  geom_image(aes(image = value, y = months + 4.5), size = 0.085) +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    plot.background = element_rect(fill = "black"),
    panel.background = element_rect(fill = "black"),
    axis.title.x = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks = element_blank(),
    axis.line = element_blank(),
    axis.text.y = element_text(color = "#e5e4e2"),
    axis.title.y = element_text(color = "#e5e4e2"),
    plot.subtitle = element_text(color = "#e5e4e2", hjust = 0.5, size = 12),
    plot.title = element_text(color = "#e5e4e2", hjust = 0.5, size = 16),
    legend.position = "none"
  ) +
  ylab("Months") +
  labs(
    title = "Top Ten of the Most Initially Hyped Musicals",
    subtitle = "For How Many Consecutive Months After Premiering Were Musicals Sold Out?"
  ) +
  annotate("curve", x = 4, y = 50, xend = 1.5, yend = 60, curvature = 0.3, arrow = arrow(length = unit(3, "mm")), color = "#DAA520") +
  annotate("text",
    x = 5,
    y = 45,
    label = "Hamilton has been sold out since its first preview non July 19th 2015 for every single performance. nWill there ever be a performance that is not sold out???",
    color = "#DAA520"
  )
ggsave("final/broadway.jpeg", width = 10)

Again, the created plot:

I hope you have enjoyed this week’s #TidyTuesday challenge. The code can be found on my Github.

For more #TidyTuesday, check out the links below:

Tags: data manipulation data visualization ggplot2 R