Analyzing My Website With Google Analytics, R and googleAnalyticsR
February 5, 2020 · By Pascal Schmidt · Personal Project, R
Today, we will go over some basic analytics for my data science blog (thatdatatho.com). We will look at my page views, which articles generate the most traffic, and a few other things.
I have been blogging for a bit over two years now. I started the blog to become a better data scientist and to share what I learn with others.
For my analysis, I used the googleAnalyticsR package to pull data from my Google Analytics account.
Let’s get started
Google Analytics Set-up
library(tidyverse)
library(tidytext)
library(igraph)
library(ggraph)
library(wordcloud)

# Ran last on February 4th, 2020
# library(googleAnalyticsR)
# googleAnalyticsR::ga_auth()
#
# my_accounts <- ga_account_list()
# my_ID <- my_accounts %>%
#   dplyr::pull(viewId) %>%
#   as.integer()
#
# web_data <- google_analytics(my_ID,
#                              date_range = c("2018-01-15", "today"),
#                              metrics = c("sessions", "pageviews",
#                                          "entrances", "bounces",
#                                          "bounceRate", "sessionDuration"),
#                              dimensions = c("date", "deviceCategory", "hour",
#                                             "dayOfWeekName", "channelGrouping",
#                                             "source", "keyword", "pagePath"),
#                              anti_sample = TRUE)
#
# write.csv(web_data, "data/web_data.csv", row.names = FALSE)

web_data <- readr::read_csv("data/web_data.csv")
If you want to know more about the googleAnalyticsR package, check out the documentation. If you are interested in other metrics and dimensions, you can choose them from this link.
If you want to reproduce my analysis, or want to look at other metrics or dimensions, head over to my GitHub, where I provide the code and the data set for today's analysis.
Blog Development Over Time With googleAnalyticsR
web_data %>%
  dplyr::group_by(date) %>%
  dplyr::summarise(total_views = sum(pageviews)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(month = lubridate::month(date, label = TRUE),
                year = lubridate::year(date)) %>%
  tidyr::unite("month_year", month, year, sep = "-", remove = TRUE) %>%
  dplyr::group_by(month_year) %>%
  dplyr::mutate(max_view_month = max(total_views),
                max_view_month = base::ifelse(max_view_month == total_views,
                                              max_view_month, NA)) -> total_views

ggplot(total_views, aes(x = date, y = total_views)) +
  geom_line() +
  geom_text(aes(label = max_view_month), check_overlap = TRUE, vjust = -0.5) +
  theme_minimal() +
  ggtitle("Line Chart of Page Views With Maximum Page Views Per Month") +
  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  xlab("Date") +
  ylab("Page Views")
I started my blog on January 15th, 2018, and since the middle of April 2018 I have been keeping track of my analytics with Google Analytics. We can see that the blog has grown steadily over time. Currently, I am averaging around 5,000 page views and around 4,000 unique visitors per month. The spikes you see each month, accompanied by numbers, are the maximum number of page views reached on a single day of that month. In December 2019, I reached my highest single-day page-view count so far.
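The monthly average mentioned above can be computed directly from the daily data. Here is a minimal sketch using a toy stand-in for web_data (the dates and counts are made up, not my real numbers):

```r
library(dplyr)
library(tibble)

# Toy stand-in for web_data: one row per day with a page-view count
toy_data <- tibble(
  date      = seq(as.Date("2019-11-01"), as.Date("2019-12-31"), by = "day"),
  pageviews = 150
)

monthly_views <- toy_data %>%
  mutate(month_year = format(date, "%Y-%m")) %>%
  group_by(month_year) %>%
  summarise(total_views = sum(pageviews))

# Average page views per month across the toy period
mean(monthly_views$total_views)  # 4575
```

Swapping toy_data for the real web_data gives the actual monthly average behind the figure above.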
What Blog Posts Generate the Most Traffic?
web_data %>%
  dplyr::group_by(pagePath) %>%
  dplyr::summarise(n = sum(pageviews)) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(pagePath = stringr::str_remove_all(pagePath, "([[0-9]]|\\/)")) %>%
  dplyr::filter(pagePath != "") %>%
  .[1:10, ] %>%
  dplyr::arrange(n) %>%
  dplyr::mutate(pagePath = factor(pagePath, levels = pagePath)) -> top_10_articles

ggplot(top_10_articles, aes(x = pagePath, y = n)) +
  geom_bar(stat = "identity") +
  geom_text(data = top_10_articles %>% .[5:10, ],
            aes(label = n), hjust = 1, color = "white") +
  geom_text(data = top_10_articles %>% .[1:4, ],
            aes(label = n), hjust = -0.1, color = "black") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  coord_flip() +
  xlab("Articles") +
  ylab("Count") +
  ggtitle("Top 10 of My Most Popular Blog Posts")
Interestingly, almost half of my traffic is generated by one single article, in which I present different summary statistics tables. I published this post in fall 2018, and it grew out of my research into summary tables at the BC Cancer Agency. Check out my most successful article here, as well as my internship experience at the BC Cancer Agency.
What Days of the Week are Most Popular?
web_data %>%
  dplyr::group_by(dayOfWeekName) %>%
  dplyr::summarise(n = sum(pageviews)) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(dayOfWeekName = factor(dayOfWeekName, levels = dayOfWeekName)) -> week_day

ggplot(week_day, aes(x = dayOfWeekName, y = n)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  xlab("") +
  ylab("") +
  theme(plot.title = element_text(hjust = 0.5, size = 15)) +
  ggtitle("Most Popular Days")
Tuesdays and Wednesdays are the most popular. As expected, no one wants to become a better data scientist on the weekend 🙁
Device Category and Traffic Source Analytics
web_data %>%
  dplyr::group_by(date, deviceCategory) %>%
  dplyr::summarise(total_views = sum(pageviews)) -> device_cat

ggplot(device_cat, aes(x = date, y = total_views, col = deviceCategory)) +
  geom_smooth(se = FALSE, span = 0.2) +
  theme_minimal()
Most people read my posts on a desktop, followed by mobile and tablet.
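The smoothed lines show the trend, but the device split can also be summarized as a simple share of total page views. A minimal sketch, again with a toy stand-in for web_data and hypothetical counts:

```r
library(dplyr)
library(tibble)

# Hypothetical page-view counts per device (not my real numbers)
toy_data <- tibble(
  deviceCategory = c("desktop", "desktop", "mobile", "tablet"),
  pageviews      = c(600, 200, 150, 50)
)

device_share <- toy_data %>%
  group_by(deviceCategory) %>%
  summarise(total_views = sum(pageviews)) %>%
  mutate(prop = round(total_views / sum(total_views) * 100, 1)) %>%
  arrange(desc(total_views))

device_share
# deviceCategory total_views  prop
# desktop                800  80.0
# mobile                 150  15.0
# tablet                  50   5.0
```

Run against the real web_data, this table makes the desktop-first ordering above precise.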
web_data %>%
  dplyr::group_by(channelGrouping) %>%
  dplyr::summarise(total_views = sum(pageviews)) %>%
  dplyr::mutate(prop = round(total_views / sum(total_views), 4) * 100) %>%
  dplyr::arrange(desc(total_views)) %>%
  dplyr::mutate(channelGrouping = factor(channelGrouping, levels = channelGrouping)) -> source

ggplot(source, aes(x = channelGrouping, y = total_views, fill = channelGrouping)) +
  geom_bar(stat = "identity") +
  geom_text(data = source %>% .[1, ],
            aes(label = paste(channelGrouping, "\n", total_views, "\n", prop, "%")),
            vjust = 1) +
  geom_text(data = source %>% .[2:4, ],
            aes(label = paste(channelGrouping, "\n", total_views, "\n", prop, "%")),
            vjust = -0.1) +
  theme_minimal() +
  theme(axis.text = element_blank(),
        axis.title = element_blank(),
        plot.title = element_text(hjust = 0.5),
        legend.position = "none") +
  ggtitle("Traffic Sources")
More than 80% of my traffic comes from organic searches. A good idea for increasing direct traffic would be to create an email list.
Analyzing Keyword Searches with Google Analytics and R
web_data %>%
  dplyr::select(keyword) %>%
  dplyr::filter(!(keyword %in% c("(not set)", "(not provided)"))) %>%
  tidytext::unnest_tokens(word, keyword) -> key_words

key_words %>%
  dplyr::count(word, sort = TRUE) %>%
  .[1:10, ] %>%
  dplyr::mutate(word = stats::reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  theme_minimal()
In the graph above, I visualized the keywords people typed into Google to find my articles. Unfortunately, the data is very sparse, so I did not remove any stop words. Again, we can see that my summary statistics table tutorial and my RSelenium tutorial are very popular.
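If the keyword data were richer, stop words could be filtered out with tidytext's built-in stop_words lexicon and an anti-join. A minimal sketch with made-up search phrases (not my real keyword data):

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical keyword searches
searches <- tibble(keyword = c("how to make a summary table in r",
                               "the rselenium tutorial"))

cleaned <- searches %>%
  unnest_tokens(word, keyword) %>%          # one token per row
  anti_join(stop_words, by = "word") %>%    # drop common stop words
  count(word, sort = TRUE)

cleaned
```

Content words such as "summary", "table", and "tutorial" survive, while filler words like "how", "to", and "the" are dropped.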
In the graph below, I visualized the same words as a word cloud.
key_words %>%
  dplyr::count(word) %>%
  with(wordcloud::wordcloud(word, n))
Visualizing a Bigram With Google Analytics and R
In the code below, we use the unnest_tokens() function to tokenize readers' keyword searches into sequences of two consecutive words (bigrams). We do this to see how often a word X is followed by a word Y. Through this kind of analysis, we can model relationships between words.
Below, we arrange the words into a network, or "graph": each node is a word, and the edges connect words that frequently appear next to each other.
web_data %>%
  dplyr::select(keyword) %>%
  dplyr::filter(!(keyword %in% c("(not set)", "(not provided)"))) %>%
  tidytext::unnest_tokens(bigram, keyword, token = "ngrams", n = 2) -> bigram

bigram %>%
  tidyr::separate(bigram, c("word_1", "word_2")) %>%
  dplyr::count(word_1, word_2, sort = TRUE) -> bigram_counts

bigram_counts %>%
  dplyr::filter(n > 10) %>%
  igraph::graph_from_data_frame() -> bigram_graph

a <- grid::arrow(type = "closed", length = unit(0.15, "inches"))

ggraph::ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(0.07, "inches")) +
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
From the graph above, we can see common connections between data science terms. For example, people searched for "gradient descent", "variance bias", "line search", "machine learning", "web scraping", "cross-validation", or "trade off".
I hope you have enjoyed this blog about Google Analytics and R. If you have any questions, let me know in the comments below.