Analyzing My Website With Google Analytics, R and googleAnalyticsR
February 5, 2020 · By Pascal Schmidt · Personal Project, R
Today, we will go over some basic analytics for my data science blog (thatdatatho.com). We will look at my page views, which articles generate the most traffic, and a few other things.
I have been blogging for a bit over two years now. I started the blog to become a better data scientist and to share what I learn with others.
For my analysis, I used the googleAnalyticsR package to pull data from my Google Analytics account.
Let’s get started
Google Analytics Set-up
library(tidyverse)
library(tidytext)
library(igraph)
library(ggraph)
library(wordcloud)

# Ran last on February 4th, 2020
# library(googleAnalyticsR)
# googleAnalyticsR::ga_auth()
#
# my_accounts <- ga_account_list()
# my_ID <- my_accounts %>%
#   dplyr::pull(viewId) %>%
#   as.integer()
#
# web_data <- google_analytics(my_ID,
#                              date_range = c("2018-01-15", "today"),
#                              metrics = c("sessions", "pageviews",
#                                          "entrances", "bounces",
#                                          "bounceRate", "sessionDuration"),
#                              dimensions = c("date", "deviceCategory", "hour",
#                                             "dayOfWeekName", "channelGrouping",
#                                             "source", "keyword", "pagePath"),
#                              anti_sample = TRUE)
#
# write.csv(web_data, "data/web_data.csv", row.names = FALSE)

web_data <- readr::read_csv("data/web_data.csv")
If you want to know more about the googleAnalyticsR package, check out the documentation. If you are interested in other metrics and dimensions, you can choose them from this link.
If you want to reproduce my analysis, or want to look at other metrics or dimensions, head over to my GitHub, where I provide the code and the data set for today's analysis.
Blog Development Over Time With googleAnalyticsR
web_data %>%
  dplyr::group_by(date) %>%
  dplyr::summarise(total_views = sum(pageviews)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(month = lubridate::month(date, label = TRUE),
                year = lubridate::year(date)) %>%
  tidyr::unite("month_year", month, year, sep = "-", remove = TRUE) %>%
  dplyr::group_by(month_year) %>%
  dplyr::mutate(max_view_month = max(total_views),
                max_view_month = base::ifelse(max_view_month == total_views,
                                              max_view_month, NA)) -> total_views

ggplot(total_views, aes(x = date, y = total_views)) +
  geom_line() +
  geom_text(aes(label = max_view_month), check_overlap = TRUE, vjust = -0.5) +
  theme_minimal() +
  ggtitle("Line Chart of Page Views With Maximum Page Views Per Month") +
  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  xlab("Date") +
  ylab("Page Views")
I started my blog on January 15th, 2018, and since the middle of April 2018 I have been keeping track of my analytics with Google Analytics. We can see that the blog has grown steadily over time. Currently, I am averaging around 5,000 page views and around 4,000 unique visitors per month. The spikes you see each month, accompanied by numbers, are the maximum number of page views reached on a single day of that month. In December 2019, I reached my highest single-day page-view count so far.
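The monthly average mentioned above can be computed directly from the daily data. Here is a minimal sketch using a toy stand-in for web_data (the dates and counts are made up, not my real numbers):

```r
library(dplyr)
library(tibble)

# Toy stand-in for web_data: one row per day with a page-view count
toy_data <- tibble(
  date      = seq(as.Date("2019-11-01"), as.Date("2019-12-31"), by = "day"),
  pageviews = 150
)

monthly_views <- toy_data %>%
  mutate(month_year = format(date, "%Y-%m")) %>%
  group_by(month_year) %>%
  summarise(total_views = sum(pageviews))

# Average page views per month across the toy period
mean(monthly_views$total_views)  # 4575
```

Swapping toy_data for the real web_data gives the actual monthly average behind the figure above.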
What Blog Posts Generate the Most Traffic?
web_data %>%
  dplyr::group_by(pagePath) %>%
  dplyr::summarise(n = sum(pageviews)) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(pagePath = stringr::str_remove_all(pagePath, "([[0-9]]|\\/)")) %>%
  dplyr::filter(pagePath != "") %>%
  .[1:10, ] %>%
  dplyr::arrange(n) %>%
  dplyr::mutate(pagePath = factor(pagePath, levels = pagePath)) -> top_10_articles

ggplot(top_10_articles, aes(x = pagePath, y = n)) +
  geom_bar(stat = "identity") +
  geom_text(data = top_10_articles %>% .[5:10, ],
            aes(label = n), hjust = 1, color = "white") +
  geom_text(data = top_10_articles %>% .[1:4, ],
            aes(label = n), hjust = -0.1, color = "black") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  coord_flip() +
  xlab("Articles") +
  ylab("Count") +
  ggtitle("Top 10 of My Most Popular Blog Posts")
Interestingly, almost half of my traffic is generated by one single article, in which I present different summary statistics tables. I published this post in fall 2018, and it grew out of my research into summary tables at the BC Cancer Agency. Check out my most successful article here, as well as my internship experience at the BC Cancer Agency.
What Days of the Week are Most Popular?
web_data %>%
  dplyr::group_by(dayOfWeekName) %>%
  dplyr::summarise(n = sum(pageviews)) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(dayOfWeekName = factor(dayOfWeekName, levels = dayOfWeekName)) -> week_day

ggplot(week_day, aes(x = dayOfWeekName, y = n)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  xlab("") +
  ylab("") +
  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  ggtitle("Most Popular Days")
Tuesdays and Wednesdays are the most popular. As expected, no one wants to become a better data scientist on the weekend 🙁
Device Category and Traffic Source Analytics
web_data %>%
  dplyr::group_by(date, deviceCategory) %>%
  dplyr::summarise(total_views = sum(pageviews)) -> device_cat

ggplot(device_cat, aes(x = date, y = total_views, col = deviceCategory)) +
  geom_smooth(se = FALSE, span = 0.2) +
  theme_minimal()
Most people read my posts on a desktop, followed by mobile and tablet.
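The smoothed lines show the trend, but the device split can also be summarized as a simple share of total page views. A minimal sketch, again with a toy stand-in for web_data and hypothetical counts:

```r
library(dplyr)
library(tibble)

# Hypothetical page-view counts per device (not my real numbers)
toy_data <- tibble(
  deviceCategory = c("desktop", "desktop", "mobile", "tablet"),
  pageviews      = c(600, 200, 150, 50)
)

device_share <- toy_data %>%
  group_by(deviceCategory) %>%
  summarise(total_views = sum(pageviews)) %>%
  mutate(prop = round(total_views / sum(total_views) * 100, 1)) %>%
  arrange(desc(total_views))

device_share
# deviceCategory total_views  prop
# desktop                800  80.0
# mobile                 150  15.0
# tablet                  50   5.0
```

Run against the real web_data, this table makes the desktop-first ordering above precise.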
web_data %>%
  dplyr::group_by(channelGrouping) %>%
  dplyr::summarise(total_views = sum(pageviews)) %>%
  dplyr::mutate(prop = round(total_views / sum(total_views), 4) * 100) %>%
  dplyr::arrange(desc(total_views)) %>%
  dplyr::mutate(channelGrouping = factor(channelGrouping, levels = channelGrouping)) -> source

ggplot(source, aes(x = channelGrouping, y = total_views, fill = channelGrouping)) +
  geom_bar(stat = "identity") +
  geom_text(data = source %>% .[1, ],
            aes(label = paste(channelGrouping, "\n", total_views, "\n", prop, "%")),
            vjust = 1) +
  geom_text(data = source %>% .[2:4, ],
            aes(label = paste(channelGrouping, "\n", total_views, "\n", prop, "%")),
            vjust = -0.1) +
  theme_minimal() +
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        plot.title = element_text(hjust = 0.5),
        legend.position = "none") +
  ggtitle("Traffic Sources")
More than 80% of my traffic comes from organic searches. A good idea for increasing direct traffic would be to create an email list.
Analyzing Keyword Searches with Google Analytics and R
web_data %>%
  dplyr::select(keyword) %>%
  dplyr::filter(!(keyword %in% c("(not set)", "(not provided)"))) %>%
  tidytext::unnest_tokens(word, keyword) -> key_words

key_words %>%
  dplyr::count(word, sort = TRUE) %>%
  .[1:10, ] %>%
  dplyr::mutate(word = stats::reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  theme_minimal()
In the graph above, I visualized the keywords people typed into Google to find my articles. Unfortunately, the data is very sparse, so I did not remove any stop words. Again, we can see that my summary statistics table tutorial and my RSelenium tutorial are very popular.
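If the keyword data were richer, stop words could be filtered out with tidytext's built-in stop_words lexicon and an anti-join. A minimal sketch with made-up search phrases (not my real keyword data):

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical keyword searches
searches <- tibble(keyword = c("how to make a summary table in r",
                               "the rselenium tutorial"))

cleaned <- searches %>%
  unnest_tokens(word, keyword) %>%          # one token per row
  anti_join(stop_words, by = "word") %>%    # drop common stop words
  count(word, sort = TRUE)

cleaned
```

Content words such as "summary", "table", and "tutorial" survive, while filler words like "how", "to", and "the" are dropped.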
In the graph below, I visualized the same words as a word cloud.
key_words %>%
  dplyr::count(word) %>%
  with(wordcloud::wordcloud(word, n))
Visualizing a Bigram With Google Analytics and R
In the code below, we use the unnest_tokens() function to tokenize readers' keyword searches into sequences of two consecutive words (bigrams). We do this to see how often a word X is followed by a word Y. Through this kind of analysis, we can model relationships between words.
Below, we arrange the words into a network, or "graph": each node is a word, and the edges connect words that frequently appear next to each other.
web_data %>%
  dplyr::select(keyword) %>%
  dplyr::filter(!(keyword %in% c("(not set)", "(not provided)"))) %>%
  tidytext::unnest_tokens(bigram, keyword, token = "ngrams", n = 2) -> bigram

bigram %>%
  tidyr::separate(bigram, c("word_1", "word_2")) %>%
  dplyr::count(word_1, word_2, sort = TRUE) -> bigram_counts

bigram_counts %>%
  dplyr::filter(n > 10) %>%
  igraph::graph_from_data_frame() -> bigram_graph

a <- grid::arrow(type = "closed", length = unit(0.15, "inches"))

ggraph::ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(0.07, "inches")) +
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
From the graph above, we can see common connections between data science terms. For example, people searched for "gradient descent", "variance bias", "line search", "machine learning", "web scraping", "cross-validation", or "trade off".
I hope you have enjoyed this blog about Google Analytics and R. If you have any questions, let me know in the comments below.