Analyzing Web Data with Google Search Console and searchConsoleR in RStudio
February 12, 2020 · By Pascal Schmidt · Personal Project, R
In my last blog post, we investigated my website with data from Google Analytics and the Google Analytics API. Today, we will be using Google's Search Console and its API to pull data into RStudio and then analyze it. If you are interested in my analysis with the Google Analytics API and the googleAnalyticsR
package, then check out this post.
Below is the setup to pull the data into R. I also provided the data set on my GitHub if you want to reproduce my analysis or look at other variables that interest you. Let's get started.
library(googleAuthR)
library(searchConsoleR)
library(tidyverse)
library(countrycode)
library(tidytext)
library(igraph)
library(ggraph)
library(wordcloud)

# scr_auth()
# searchConsoleR::search_analytics(sc_websites$siteUrl,
#                                  start = "2018-01-15", end = Sys.Date(),
#                                  dimensions = c("page", "query", "country", "date"),
#                                  rowLimit = 100000) -> web_data
#
# web_data %>%
#   dplyr::as_tibble() -> web_data
#
# write.csv(web_data, "data/search_console.csv")
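The commented-out pull above refers to sc_websites, which comes from listing the verified sites in the Search Console account. Below is a minimal sketch of that step, assuming the scr_auth() and list_websites() helpers from searchConsoleR; it has to run interactively because authentication opens a browser for the OAuth flow.

# authenticate with the Google account that owns the site (opens a browser)
scr_auth()

# data frame of verified sites; the siteUrl column feeds search_analytics()
sc_websites <- list_websites()
sc_websites$siteUrl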
After we have set everything up, we can start our analysis. First, I wanted to know if there is a difference in how my articles rank across different continents. I used the countrycode
package to add a column with continents.
web_data <- readr::read_csv("data/search_console.csv")

web_data %>%
  dplyr::mutate(continent = countrycode::countrycode(sourcevar = countryName,
                                                     origin = "country.name",
                                                     destination = "continent")) -> web_data
web_data %>%
  dplyr::mutate(date_month = lubridate::floor_date(date, "month")) %>%
  dplyr::group_by(date_month, continent) %>%
  dplyr::summarise(avg_pos = mean(position, na.rm = TRUE)) %>%
  dplyr::filter(!(is.na(continent))) %>%
  ggplot(aes(x = date_month, y = avg_pos, col = continent)) +
  geom_line() +
  geom_point() +
  ylab("Average SEO Position Per Month") +
  xlab("Date") +
  ggtitle("Average SEO Positions for Different Continents") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "bottom") +
  scale_y_continuous(limits = c(0, 85))
The lines look very similar. However, queries coming from Asia and Africa do not rank as high as queries coming from Europe, the Americas, and Oceania. Maybe keyword searches from the Americas, Europe, and Oceania are better targeted towards my blog posts.
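One caveat: the monthly average above weights every exported row equally. As a variant (just a sketch, assuming the impressions column that search_analytics returns by default), you could weight each row by its impressions so that high-volume queries influence the average position more:

web_data %>%
  dplyr::mutate(date_month = lubridate::floor_date(date, "month")) %>%
  dplyr::filter(!is.na(continent)) %>%
  dplyr::group_by(date_month, continent) %>%
  # impression-weighted average position per continent and month
  dplyr::summarise(avg_pos = weighted.mean(position, w = impressions, na.rm = TRUE)) %>%
  dplyr::ungroup()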
To further analyze the differences in average position, we would need to group by query and date. However, my blog only averages around 5,000 visitors a month, so there is not enough data to answer this question reliably. The sketch below shows roughly what such a per-query comparison would look like.
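A minimal sketch of the per-query comparison, assuming the same web_data columns as above; the cut-off of ten rows is an arbitrary choice for illustration to drop query/continent pairs with too little data:

web_data %>%
  dplyr::filter(!is.na(continent)) %>%
  dplyr::group_by(query, continent) %>%
  dplyr::summarise(avg_pos = mean(position, na.rm = TRUE),
                   n = dplyr::n()) %>%
  dplyr::ungroup() %>%
  # arbitrary cut-off: keep query/continent pairs with enough observations
  dplyr::filter(n >= 10) %>%
  dplyr::arrange(query, continent)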
For Which Queries Did My Blog Posts Rank on the First Page of Google in Early 2020?
Next, I wanted to know for which queries I am on the first page of Google. Below is a table of the queries for which some of my articles showed up on the first page of Google since January 2020.
web_data_2 <- readr::read_csv("data/web_data_2.csv")

web_data_2 %>%
  dplyr::as_tibble() -> web_data_2

web_data_2 %>%
  dplyr::mutate(date_month = lubridate::floor_date(date, "month")) %>%
  dplyr::group_by(date_month, query) %>%
  dplyr::summarise(avg_pos = mean(position)) %>%
  dplyr::filter(avg_pos <= 10 & date_month >= "2020-01-01") %>%
  dplyr::rename(Date = date_month, Query = query, `Average Position` = avg_pos) %>%
  knitr::kable(digits = 0, caption = "First Page Google Articles for Certain Queries")
Date | Query | Average Position |
---|---|---|
2020-01-01 | adjusted r squared formula | 2 |
2020-01-01 | assumptions of lda | 4 |
2020-01-01 | best subset regression in r | 2 |
2020-01-01 | bias consistency | 10 |
2020-01-01 | bias vs consistency | 9 |
2020-01-01 | consistency vs bias | 9 |
2020-01-01 | consistent bias | 10 |
2020-01-01 | data analyst education | 4 |
2020-01-01 | how to use rselenium | 3 |
2020-01-01 | imap purrr | 8 |
2020-01-01 | lda assumptions | 4 |
2020-01-01 | parsimony definition statistics | 7 |
2020-01-01 | physics data analyst | 5 |
2020-01-01 | placeholder in r | 9 |
2020-01-01 | price to rent ratio vancouver | 3 |
2020-01-01 | purrr imap | 7 |
2020-01-01 | purrr pokemon | 7 |
2020-01-01 | qda | 6 |
2020-01-01 | r selenium tutorial | 4 |
2020-01-01 | r selenium web scraping | 3 |
2020-01-01 | random error examples | 10 |
2020-01-01 | random vs systematic error examples | 8 |
2020-01-01 | rselenium example | 4 |
2020-01-01 | rselenium tutorial | 4 |
2020-01-01 | rselenium web scraping | 3 |
2020-01-01 | scrape zillow data r | 5 |
2020-01-01 | statistical bias | 7 |
2020-01-01 | systematic error example | 5 |
2020-02-01 | assumptions of lda | 3 |
2020-02-01 | data analyst degree | 5 |
2020-02-01 | lda assumptions | 4 |
2020-02-01 | price to rent ratio vancouver | 2 |
Click-Through Rates for thatdatatho.com
Next, I am visualizing the click-through rates of my blog for rows with at least five impressions. The graph below is right-skewed: it has a long right tail, and the mean is larger than the median.
ctr_data <- web_data_2 %>%
  dplyr::mutate(date_month = lubridate::floor_date(date, "month")) %>%
  dplyr::filter(impressions >= 5) %>%
  dplyr::filter(ctr != 0)

ctr_data %>%
  ggplot(aes(x = ctr)) +
  geom_histogram(binwidth = 0.02) +
  xlab("CTR") +
  ylab("Count") +
  ggtitle("Histogram of Click Through Rate for thatdatatho.com") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  annotate("text", x = 0.3, y = 85,
           label = paste0("mean = ", round(mean(ctr_data$ctr), 2))) +
  annotate("text", x = 0.3, y = 80,
           label = paste0("median = ", round(median(ctr_data$ctr), 2)))
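As a complementary number (just a sketch on the same filtered ctr_data from above; this overall figure will differ from the per-row mean), the site-wide click-through rate is total clicks divided by total impressions:

ctr_data %>%
  # overall CTR across all filtered rows, not the average of per-row CTRs
  dplyr::summarise(overall_ctr = sum(clicks) / sum(impressions))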
Google Search Console Correlations for Clicks, Impressions, CTR, and Position
library(corrplot)

corr_results <- web_data_2 %>%
  dplyr::filter(impressions > 5) %>%
  dplyr::select(clicks:position) %>%
  cor()

corrplot(corr_results,
         method = "color",
         type = "upper",
         addCoef.col = "black",
         tl.col = "black",
         tl.srt = 45,
         diag = FALSE)
The correlation plot shows that the more clicks we get, the higher the click-through rate is. Moreover, the further down the search results an article appears (the larger its position value), the lower its click-through rate.
Next, we want to see how the click-through rate depends on where my blog posts rank on Google.
Click-Through Rates and Positions
avg_ctr <- web_data_2 %>%
  dplyr::group_by(query) %>%
  dplyr::filter(impressions > 5) %>%
  dplyr::summarize(clicks = sum(clicks),
                   impressions = sum(impressions),
                   position = median(position)) %>%
  dplyr::mutate(page_group = 1 * (position %/% 1)) %>%  # CREATE NEW COLUMN TO GROUP AVG POSITIONS
  dplyr::filter(position < 21) %>%                      # FILTER ONLY FIRST 2 PAGES
  dplyr::mutate(ctr = 100 * (clicks / impressions)) %>% # CONVERT CTR TO PERCENT
  dplyr::ungroup()

# PLOT OUR RESULTS
avg_ctr %>%
  ggplot() +
  geom_boxplot(aes(page_group, ctr, group = page_group)) +
  labs(x = "SERP Position", y = "Click-through Rate (%)") +
  theme_minimal()
We can see that, on average, the closer a post sits to the top of the search results, the higher its click-through rate is across the first two Google pages. However, more data would be needed to get an accurate picture: some boxplots collapse to a single horizontal line because there are very few observations in that position bucket.
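To see how thin the data gets further down the results, we can count how many queries fall into each position bucket (a small sketch using the avg_ctr data frame created above):

avg_ctr %>%
  # number of queries per SERP position bucket
  dplyr::count(page_group, name = "n_queries") %>%
  dplyr::arrange(page_group)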
Query Visualizations with Google Search Console and RStudio
pal2 <- brewer.pal(8, "Dark2")

web_data %>%
  dplyr::count(query) %>%
  with(wordcloud::wordcloud(query, n,
                            max.words = 100,
                            colors = pal2,
                            rot.per = .15))
Above, I visualized the search queries through which readers landed on my blog. Most often, people found my blog through queries such as "quadratic discriminant analysis" and "rselenium". In addition to single terms, we can look at which pairs of words appear together in search queries by building a bigram network.
web_data %>%
  dplyr::select(query) %>%
  dplyr::distinct() %>%
  tidytext::unnest_tokens(bigram, query, token = "ngrams", n = 2) -> bigram

bigram %>%
  tidyr::separate(bigram, c("word_1", "word_2")) %>%
  dplyr::count(word_1, word_2, sort = TRUE) -> bigram_counts

bigram_counts %>%
  dplyr::filter(n > 20) %>%
  igraph::graph_from_data_frame() -> bigram_graph

a <- grid::arrow(type = "closed", length = unit(0.15, "inches"))

ggraph::ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(0.07, "inches")) +
  geom_node_point(color = "lightblue", size = 3) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
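Search queries contain relatively few filler words, but if you wanted to drop stop words before counting the bigrams, a sketch using tidytext's built-in stop_words data set could look like this:

bigram %>%
  tidyr::separate(bigram, c("word_1", "word_2"), sep = " ") %>%
  # remove bigrams where either word is a common English stop word
  dplyr::filter(!word_1 %in% tidytext::stop_words$word,
                !word_2 %in% tidytext::stop_words$word) %>%
  dplyr::count(word_1, word_2, sort = TRUE) -> bigram_counts_filtered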
I hope you have enjoyed this blog post about the Google Search Console and how to pull data into R with searchConsoleR. If you want to find out more about what you can do with the searchConsoleR package, then check out this post.