Analyzing Web Data with Google Search Console and searchConsoleR in RStudio
February 12, 2020 · By Pascal Schmidt · Personal Project, R
In my last blog post, we investigated my website with data from Google Analytics and the Google Analytics API. Today, we will be using Google's Search Console and its API to pull data into RStudio and then analyze it. If you are interested in my analysis with the Google Analytics API and the googleAnalyticsR
package, then check out this post.
Below is the setup to pull the data into R. I also provided the data set on my GitHub if you want to reproduce my analysis or look at other variables that interest you. Let's get started.
library(googleAuthR)
library(searchConsoleR)
library(tidyverse)
library(countrycode)
library(tidytext)
library(igraph)
library(ggraph)
library(wordcloud)

# scr_auth()
# searchConsoleR::search_analytics(sc_websites$siteUrl,
#                                  start = "2018-01-15", end = Sys.Date(),
#                                  dimensions = c("page", "query", "country", "date"),
#                                  rowLimit = 100000) -> web_data
#
# web_data %>%
#   dplyr::as_tibble() -> web_data
#
# write.csv(web_data, "data/search_console.csv")
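The commented-out pull above refers to sc_websites, which comes from listing the verified sites in the Search Console account. Below is a minimal sketch of that step, assuming the scr_auth() and list_websites() helpers from searchConsoleR; it has to run interactively because authentication opens a browser for the OAuth flow.

# authenticate with the Google account that owns the site (opens a browser)
scr_auth()

# data frame of verified sites; the siteUrl column feeds search_analytics()
sc_websites <- list_websites()
sc_websites$siteUrl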
After we have set everything up, we can start our analysis. First, I wanted to know if there is a difference in how my articles rank across different continents. I used the countrycode
package to add a column with continents.
web_data <- readr::read_csv("data/search_console.csv")

web_data %>%
  dplyr::mutate(continent = countrycode::countrycode(sourcevar = countryName,
                                                     origin = "country.name",
                                                     destination = "continent")) -> web_data
web_data %>%
  dplyr::mutate(date_month = lubridate::floor_date(date, "month")) %>%
  dplyr::group_by(date_month, continent) %>%
  dplyr::summarise(avg_pos = mean(position, na.rm = TRUE)) %>%
  dplyr::filter(!(is.na(continent))) %>%
  ggplot(aes(x = date_month, y = avg_pos, col = continent)) +
  geom_line() +
  geom_point() +
  ylab("Average SEO Position Per Month") +
  xlab("Date") +
  ggtitle("Average SEO Positions for Different Continents") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "bottom") +
  scale_y_continuous(limits = c(0, 85))
The lines look very similar. However, queries coming from Asia and Africa do not rank as high as queries coming from Europe, the Americas, and Oceania. Maybe keyword searches from the Americas, Europe, and Oceania are better targeted towards my blog posts.
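One caveat: the monthly average above weights every exported row equally. As a variant (just a sketch, assuming the impressions column that search_analytics returns by default), you could weight each row by its impressions so that high-volume queries influence the average position more:

web_data %>%
  dplyr::mutate(date_month = lubridate::floor_date(date, "month")) %>%
  dplyr::filter(!is.na(continent)) %>%
  dplyr::group_by(date_month, continent) %>%
  # impression-weighted average position per continent and month
  dplyr::summarise(avg_pos = weighted.mean(position, w = impressions, na.rm = TRUE)) %>%
  dplyr::ungroup()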
To further analyze the differences in average position, we would need to group by query and date. However, my blog only averages around 5,000 visitors a month, so there is not enough data to answer this question reliably. The sketch below shows roughly what such a per-query comparison would look like.
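A minimal sketch of the per-query comparison, assuming the same web_data columns as above; the cut-off of ten rows is an arbitrary choice for illustration to drop query/continent pairs with too little data:

web_data %>%
  dplyr::filter(!is.na(continent)) %>%
  dplyr::group_by(query, continent) %>%
  dplyr::summarise(avg_pos = mean(position, na.rm = TRUE),
                   n = dplyr::n()) %>%
  dplyr::ungroup() %>%
  # arbitrary cut-off: keep query/continent pairs with enough observations
  dplyr::filter(n >= 10) %>%
  dplyr::arrange(query, continent)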
For Which Queries Did My Blog Posts Rank on the First Page of Google in Early 2020?
Next, I wanted to know for which queries I am on the first page of Google. Below is a table of the queries for which some of my articles showed up on the first page of Google since January 2020.
web_data_2 <- readr::read_csv("data/web_data_2.csv")

web_data_2 %>%
  dplyr::as_tibble() -> web_data_2

web_data_2 %>%
  dplyr::mutate(date_month = lubridate::floor_date(date, "month")) %>%
  dplyr::group_by(date_month, query) %>%
  dplyr::summarise(avg_pos = mean(position)) %>%
  dplyr::filter(avg_pos <= 10 & date_month >= "2020-01-01") %>%
  dplyr::rename(Date = date_month, Query = query, `Average Position` = avg_pos) %>%
  knitr::kable(digits = 0, caption = "First Page Google Articles for Certain Queries")
Date | Query | Average Position |
---|---|---|
2020-01-01 | adjusted r squared formula | 2 |
2020-01-01 | assumptions of lda | 4 |
2020-01-01 | best subset regression in r | 2 |
2020-01-01 | bias consistency | 10 |
2020-01-01 | bias vs consistency | 9 |
2020-01-01 | consistency vs bias | 9 |
2020-01-01 | consistent bias | 10 |
2020-01-01 | data analyst education | 4 |
2020-01-01 | how to use rselenium | 3 |
2020-01-01 | imap purrr | 8 |
2020-01-01 | lda assumptions | 4 |
2020-01-01 | parsimony definition statistics | 7 |
2020-01-01 | physics data analyst | 5 |
2020-01-01 | placeholder in r | 9 |
2020-01-01 | price to rent ratio vancouver | 3 |
2020-01-01 | purrr imap | 7 |
2020-01-01 | purrr pokemon | 7 |
2020-01-01 | qda | 6 |
2020-01-01 | r selenium tutorial | 4 |
2020-01-01 | r selenium web scraping | 3 |
2020-01-01 | random error examples | 10 |
2020-01-01 | random vs systematic error examples | 8 |
2020-01-01 | rselenium example | 4 |
2020-01-01 | rselenium tutorial | 4 |
2020-01-01 | rselenium web scraping | 3 |
2020-01-01 | scrape zillow data r | 5 |
2020-01-01 | statistical bias | 7 |
2020-01-01 | systematic error example | 5 |
2020-02-01 | assumptions of lda | 3 |
2020-02-01 | data analyst degree | 5 |
2020-02-01 | lda assumptions | 4 |
2020-02-01 | price to rent ratio vancouver | 2 |
Click-Through Rates for thatdatatho.com
Next, I am visualizing the click-through rates of my blog for rows with at least five impressions. The graph below is right-skewed: it has a long right tail, and the mean is larger than the median.
ctr_data <- web_data_2 %>%
  dplyr::mutate(date_month = lubridate::floor_date(date, "month")) %>%
  dplyr::filter(impressions >= 5) %>%
  dplyr::filter(ctr != 0)

ctr_data %>%
  ggplot(aes(x = ctr)) +
  geom_histogram(binwidth = 0.02) +
  xlab("CTR") +
  ylab("Count") +
  ggtitle("Histogram of Click Through Rate for thatdatatho.com") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  annotate("text", x = 0.3, y = 85,
           label = paste0("mean = ", round(mean(ctr_data$ctr), 2))) +
  annotate("text", x = 0.3, y = 80,
           label = paste0("median = ", round(median(ctr_data$ctr), 2)))
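As a complementary number (just a sketch on the same filtered ctr_data from above; this overall figure will differ from the per-row mean), the site-wide click-through rate is total clicks divided by total impressions:

ctr_data %>%
  # overall CTR across all filtered rows, not the average of per-row CTRs
  dplyr::summarise(overall_ctr = sum(clicks) / sum(impressions))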
Google Search Console Correlations for Clicks, Impressions, CTR, and Position
library(corrplot)

corr_results <- web_data_2 %>%
  dplyr::filter(impressions > 5) %>%
  dplyr::select(clicks:position) %>%
  cor()

corrplot(corr_results,
         method = "color",
         type = "upper",
         addCoef.col = "black",
         tl.col = "black",
         tl.srt = 45,
         diag = FALSE)
The correlation plot shows that the more clicks we get, the higher the click-through rate is. Moreover, the further down the search results an article appears (the larger its position value), the lower its click-through rate.
Next, we want to see how the click-through rate depends on where my blog posts rank on Google.
Click-Through Rates and Positions
avg_ctr <- web_data_2 %>%
  dplyr::group_by(query) %>%
  dplyr::filter(impressions > 5) %>%
  dplyr::summarize(clicks = sum(clicks),
                   impressions = sum(impressions),
                   position = median(position)) %>%
  dplyr::mutate(page_group = 1 * (position %/% 1)) %>%  # CREATE NEW COLUMN TO GROUP AVG POSITIONS
  dplyr::filter(position < 21) %>%                      # FILTER ONLY FIRST 2 PAGES
  dplyr::mutate(ctr = 100 * (clicks / impressions)) %>% # CONVERT CTR TO PERCENT
  dplyr::ungroup()

# PLOT OUR RESULTS
avg_ctr %>%
  ggplot() +
  geom_boxplot(aes(page_group, ctr, group = page_group)) +
  labs(x = "SERP Position", y = "Click-through Rate (%)") +
  theme_minimal()
We can see that, on average, the closer a post sits to the top of the search results, the higher its click-through rate is across the first two Google pages. However, more data would be needed to get an accurate picture: some boxplots collapse to a single horizontal line because there are very few observations in that position bucket.
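To see how thin the data gets further down the results, we can count how many queries fall into each position bucket (a small sketch using the avg_ctr data frame created above):

avg_ctr %>%
  # number of queries per SERP position bucket
  dplyr::count(page_group, name = "n_queries") %>%
  dplyr::arrange(page_group)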
Query Visualizations with Google Search Console and RStudio
pal2 <- brewer.pal(8, "Dark2")

web_data %>%
  dplyr::count(query) %>%
  with(wordcloud::wordcloud(query, n,
                            max.words = 100,
                            colors = pal2,
                            rot.per = .15))
Above, I visualized the search queries through which readers landed on my blog. Most often, people found my blog through queries such as "quadratic discriminant analysis" and "rselenium". In addition to single terms, we can look at which pairs of words appear together in search queries by building a bigram network.
web_data %>%
  dplyr::select(query) %>%
  dplyr::distinct() %>%
  tidytext::unnest_tokens(bigram, query, token = "ngrams", n = 2) -> bigram

bigram %>%
  tidyr::separate(bigram, c("word_1", "word_2")) %>%
  dplyr::count(word_1, word_2, sort = TRUE) -> bigram_counts

bigram_counts %>%
  dplyr::filter(n > 20) %>%
  igraph::graph_from_data_frame() -> bigram_graph

a <- grid::arrow(type = "closed", length = unit(0.15, "inches"))

ggraph::ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(0.07, "inches")) +
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
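Search queries contain relatively few filler words, but if you wanted to drop stop words before counting the bigrams, a sketch using tidytext's built-in stop_words data set could look like this:

bigram %>%
  tidyr::separate(bigram, c("word_1", "word_2"), sep = " ") %>%
  # remove bigrams where either word is a common English stop word
  dplyr::filter(!word_1 %in% tidytext::stop_words$word,
                !word_2 %in% tidytext::stop_words$word) %>%
  dplyr::count(word_1, word_2, sort = TRUE) -> bigram_counts_filtered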
I hope you have enjoyed this blog post about the Google Search Console and how to pull data into R with searchConsoleR. If you want to find out more about what you can do with the searchConsoleR package, then check out this post.