RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium

January 22, 2019 By Pascal Schmidt R web scraping

Scraping data from the web is a common way to collect data for analysis. It can also give us a unique data set that few others have analyzed before. Often, we can use packages such as rvest, scrapeR, or Rcrawler to get the job done. However, sometimes we want to scrape dynamic web pages that those packages cannot handle, and that is where RSelenium comes in. This RSelenium tutorial will introduce you to how web scraping works with the R package.

RSelenium automates a web browser and lets us scrape content that is dynamically altered by, for example, JavaScript.

In this RSelenium tutorial, we will be going over two examples of how it can be used.

  • For example #1, we want to get some latitude and longitude coordinates for some street addresses we have in our data set. In order to do that, we have to let RSelenium type in our addresses, hit the enter button, and then scrape the latitude and longitude coordinates from the website.
  • For example #2, we are doing something similar with postal codes.

Let’s jump into our examples and this RSelenium tutorial!

UPDATE 09/11/2019: 

  • After having trouble opening a remote driver because the version did not match with the RSelenium package, I changed the web driver version here.
  • I also fixed some typos thanks to Sam’s comment!

UPDATE 16/02/2020: 

  • After I again had trouble connecting to my Chrome browser, I found the following solution on StackOverflow. I copy-pasted the Windows version of that code, which you can see below.
  • You can find the code for this tutorial on my GitHub.

Example #1

Step 1: Navigate to the URL

For the first example, we are going to visit https://www.latlong.net/.

[Image: RSelenium latitude longitude]

In the picture above, we can see the Place Name text box, where we are going to let RSelenium type in our street addresses. Afterward, we let RSelenium click the Find button and then scrape the results that appear in the Latitude and Longitude boxes.

Step 2: Let RSelenium Type in the Necessary Fields

################
### Original ###
################
library(RSelenium)
library(tidyverse)
 
driver <- rsDriver(browser=c("chrome"))
remote_driver <- driver[["client"]]
remote_driver$open()
 
remote_driver$navigate("https://www.latlong.net/convert-address-to-lat-long.html")

#########################
### UPDATE 09/11/2019 ###
#########################

driver <- rsDriver(browser = c("chrome"), chromever = "78.0.3904.70")
remote_driver <- driver[["client"]] 
remote_driver$navigate("https://www.latlong.net/convert-address-to-lat-long.html")

#########################
### UPDATE 16/02/2020 ###
#########################

driver <- RSelenium::rsDriver(browser = "chrome",
                              chromever =
                                system2(command = "wmic",
                                        args = 'datafile where name="C:\\\\Program Files (x86)\\\\Google\\\\Chrome\\\\Application\\\\chrome.exe" get Version /value',
                                        stdout = TRUE,
                                        stderr = TRUE) %>%
                                stringr::str_extract(pattern = "(?<=Version=)\\d+\\.\\d+\\.\\d+\\.") %>%
                                magrittr::extract(!is.na(.)) %>%
                                stringr::str_replace_all(pattern = "\\.",
                                                         replacement = "\\\\.") %>%
                                paste0("^",  .) %>%
                                stringr::str_subset(string =
                                                      binman::list_versions(appname = "chromedriver") %>%
                                                      dplyr::last()) %>%
                                as.numeric_version() %>%
                                max() %>%
                                as.character())

remote_driver <- driver[["client"]] 
remote_driver$navigate("https://www.latlong.net/convert-address-to-lat-long.html")

First, we load the libraries. Then we connect to the Chrome driver and navigate to the URL we want to scrape data from.
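Once we are done scraping, it is a good idea to close the browser and stop the Selenium server again so that the port is freed for the next run. Below is a minimal sketch, assuming `driver` is the list returned by rsDriver() with its `client` and `server` elements. The helper name `stop_selenium` is hypothetical and not part of RSelenium.

```r
# Hypothetical helper: close the browser and stop the Selenium server.
# Assumes `driver` is the list returned by RSelenium::rsDriver().
stop_selenium <- function(driver) {
  driver[["client"]]$close()  # close the browser session
  driver[["server"]]$stop()   # stop the server process started by rsDriver()
  invisible(TRUE)
}

# Usage (needs a live session, so it is not run here):
# stop_selenium(driver)
```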

Now, we have to find where the Place Name box is located in the HTML code.

[Image: place-name-box]

In the HTML snippet above, we can see that the box has the class attribute “width70”. The code below shows how to locate that particular text box.

address_element <- remote_driver$findElement(using = 'class', value = 'width70')

Now, we have to let RSelenium type in the address we want to get coordinates for.

address_element$sendKeysToElement(list("Lombard Street, San Francisco"))
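If the text box might still contain text from an earlier search, it can help to clear it before typing. The sketch below assumes the element is an RSelenium webElement. `type_text` is a hypothetical helper name, while clearElement() and sendKeysToElement() are the webElement methods it relies on. Setting `press_enter = TRUE` additionally sends the enter key after the text.

```r
# Hypothetical helper: clear a text box, type into it, optionally press enter.
type_text <- function(element, text, press_enter = FALSE) {
  element$clearElement()                   # remove any existing text
  keys <- list(text)
  if (press_enter) {
    keys <- c(keys, list(key = "enter"))   # named "key" entries are special keys
  }
  element$sendKeysToElement(keys)
  invisible(element)
}

# Usage (needs a live session, so it is not run here):
# type_text(address_element, "Lombard Street, San Francisco")
```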

We are almost done. Now we have to press the Find button in order to get the coordinates.

[Image: find-button RSelenium]

In the code below, we use the class attribute “button” to locate the button.

button_element <- remote_driver$findElement(using = 'class', value = "button")

After we have located the button, we have to click it.

button_element$clickElement()

Step 3: Scrape the Coordinates From the Website

When we scroll down, we see the coordinates on the page. In the HTML code, they are located in the element with the class attribute “coordinatetxt”.

out <- remote_driver$findElement(using = "class", value="coordinatetxt")
lat_long <- out$getElementText()

When we have many addresses we want to get coordinates for, we can wrap the steps in a function like this:

street_names <- c("Lombard Street, San Francisco", 
                  "Santa Monica Boulevard", 
                  "Bourbon Street, New Orleans", 
                  "Fifth Avenue, New York", 
                  "Richards Street, Vancouver")

get_lat_lon <- function(street_names) {
  remote_driver$navigate("https://www.latlong.net/convert-address-to-lat-long.html")
  final <- c()
  for (i in seq_along(street_names)) {
    
    remote_driver$refresh()
    Sys.sleep(1)
    
    address_element <- remote_driver$findElement(using = 'class', value = 'width70')
    
    address_element$sendKeysToElement(list(street_names[i]))
    button_element <- remote_driver$findElement(using = 'class', value = "button")
    
    button_element$clickElement()
    Sys.sleep(3)
    
    out <- remote_driver$findElement(using = "class", value = "coordinatetxt")
    output <- out$getElementText()
    final <- c(final, output)
    
  }
  
  return(final)
}


vector_out <- get_lat_lon(street_names)
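The fixed Sys.sleep() calls in the function above work, but they either wait too long or not long enough when the page is slow. One alternative is to poll for the element until it shows up. The sketch below is only one way to do that. `wait_for_element` is a hypothetical helper, and the timeout and interval values are assumptions. Note that it only checks that the element exists, not that its text has already been updated.

```r
# Hypothetical helper: keep trying findElement() until it succeeds
# or until `timeout` seconds have passed.
wait_for_element <- function(remote_driver, using, value,
                             timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    element <- tryCatch(
      suppressMessages(remote_driver$findElement(using = using, value = value)),
      error = function(e) NULL   # element not there yet
    )
    if (!is.null(element)) {
      return(element)
    }
    if (Sys.time() >= deadline) {
      stop("Element not found within ", timeout, " seconds: ", value)
    }
    Sys.sleep(interval)
  }
}

# Usage (needs a live session, so it is not run here):
# out <- wait_for_element(remote_driver, "class name", "coordinatetxt")
```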

Afterward, we can extract the latitude and longitude values with the code below.

# Name the column so that mutate() and separate() can refer to it.
data.frame(street_names, vector_out = purrr::flatten_chr(vector_out)) %>%
  dplyr::mutate(vector_out = stringr::str_remove_all(vector_out, "\\(|\\)")) %>%
  tidyr::separate(vector_out, into = c("latitude", "longitude"), sep = ",")

Output:

                   street_names   latitude    longitude
1 Lombard Street, San Francisco  37.799999  -122.434402
2        Santa Monica Boulevard -27.867491   153.353973
3   Bourbon Street, New Orleans  29.964100   -90.060791
4        Fifth Avenue, New York  42.105412   -76.247070
5    Richards Street, Vancouver  49.279030  -123.119431
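If we want the coordinates as numbers rather than text, for example to plot them, we can also parse the scraped strings with base R. The sketch below assumes that each string has the form "(latitude, longitude)", which is what the parenthesis removal and the comma separator above suggest. `parse_lat_long` is a hypothetical helper name.

```r
# Hypothetical helper: turn strings such as "(37.799999, -122.434402)"
# into a data frame with numeric latitude and longitude columns.
parse_lat_long <- function(x) {
  cleaned <- gsub("[()]", "", x)                  # drop the parentheses
  parts <- strsplit(cleaned, ",", fixed = TRUE)   # split at the comma
  data.frame(
    latitude  = as.numeric(trimws(vapply(parts, `[`, character(1), 1))),
    longitude = as.numeric(trimws(vapply(parts, `[`, character(1), 2)))
  )
}

parse_lat_long(c("(37.799999, -122.434402)", "(49.279030, -123.119431)"))
```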

Let’s jump to the next example of this RSelenium tutorial.

Example #2

Step 1: Navigate to the URL

As before, we first go to the website we want to scrape data from. In our second example, we will be using this URL: https://www.canadapost.ca/cpo/mc/personal/postalcode/fpc.jsf#

Again, the page has a box where we enter our address and a search button that we click after we have entered it.

Step 2: Let RSelenium Type in the Necessary Fields

First, we have to navigate to the desired URL.

url <- "https://www.canadapost.ca/cpo/mc/personal/postalcode/fpc.jsf"
remote_driver$navigate(url)

Then, we have to tell RSelenium to type the desired address into the box. We do that by locating the box in the HTML code, where it has the id “addressComplete”. The code to put text in the text box looks like this:

address_element <- remote_driver$findElement(using = 'id', value = 'addressComplete')
address_element$sendKeysToElement(list("413 Seymour Street Vancouver"))

Now, we have to locate the Search button in order to get the postal code for the address. In the HTML code, the button has the id “searchFpc”.

To click the search button, we execute the following code:

button_element <- remote_driver$findElement(using = 'id', value = 'searchFpc')
button_element$clickElement()


##############
### Update ###
##############

# The website has changed a bit, so try the code below.
# Note: the argument is called `using`, not `class`.
button_element <- remote_driver$findElement(using = 'class', value = 'clear')
button_element$clickElement()

After that, we only have to extract the desired information and we are done!

Step 3: Scrape the Postal Code From the Website

To get the address, we locate the element with the id “HeaderAddressLabel” and read its text:

output <- remote_driver$findElement(using = "id", value="HeaderAddressLabel")
output <- output$getElementText()

Output:

"413 SEYMOUR ST\nVANCOUVER BC   V6B 3H5"

To only get the postal code, we can simply do:

unlist(output) %>%
  stringr::str_sub(., start = -7, end = -1)

Output:
"V6B 3H5"

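Taking the last seven characters works as long as the postal code is always at the very end of the text. A slightly more robust option is to search for the postal code pattern itself. The sketch below uses base R and assumes the usual Canadian format of letter, digit, letter, an optional space, then digit, letter, digit, all in upper case. `extract_postal_code` is a hypothetical helper name.

```r
# Hypothetical helper: pull the first Canadian postal code out of each string.
# Returns NA for strings that contain no match.
extract_postal_code <- function(x) {
  pattern <- "[A-Z][0-9][A-Z] ?[0-9][A-Z][0-9]"
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  hits <- which(m > 0)               # positions of strings with a match
  out[hits] <- regmatches(x, m)      # regmatches() returns only the matches
  out
}

extract_postal_code("413 SEYMOUR ST\nVANCOUVER BC   V6B 3H5")
```

With the scraped result from above, this could be called as extract_postal_code(unlist(output)), since getElementText() returns a list.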
I hope you have enjoyed this short RSelenium tutorial about web scraping. If you have any questions or suggestions then let me know in the comments below.


Comments (16)

  1. Hi, thanks for your time in putting this together. This was very helpful for me.
    However, I’m having trouble executing your function and dataframe codes from example 1. In the second set of code from step 3, you include “street_address” as an object. Do you mean “street_names” instead? Second, “lenght” should be “length.” Third, I could only get this function to work by changing the last line from “out[[i]] <<- out$getElementText()" to "out[[as.character(i)]] <<- out$getElementText()."
    After doing these steps, I am able to run the function successfully. However, I am unable to make the dataframe from the third set of code; there's no object named "vector_out" and I'm not sure what to do to make this work. Thanks for your time.

    Sam

    1. Hi Sam,

      Thanks for your comment. I had a couple of bad typos in there possibly due to copy pasting incorrectly. My bad! I updated the post and ran the first example again. It all works on my part now.

      Hopefully there are no more errors in there. Let me know if you can get it to work this time! Thanks again for pointing out the mistakes!

  2. Hi, thanks a lot for this post. I ran your codes (example #2). I checked the screenshot using screenshot(display = TRUE) to verify the address is input correctly. But I got a weird result: “4-1041 PINE ST\nDUNNVILLE ON N1A 2N1”.

    Any idea how I can fix it? Thanks.

    CL

    1. Hi CL,

      Thank you for your comment!

      Try this code:
      button_element <- remote_driver$findElement(using = 'class', value = 'clear')
      button_element$clickElement()

      instead of this one:

      button_element <- remote_driver$findElement(using = 'id', value = 'searchFpc')
      button_element$clickElement()

      This should make it work. Next time it won't take that long for me to reply and I hope it still helps.

  3. Hey Pascal, great blog post! Thank you for putting this tutorial together. I was able to connect to the Selenium server (the rsDriver() wrapper was giving me some trouble so I did it the old fashion way). I was able to make the driver, use a Firefox browser to access the sites and then specific HTML elements referenced, etc. However, I’m getting no data once I run my code. Viewing the source for the two websites (https://www.canadapost.ca/cpo/mc/personal/postalcode/fpc.jsf) and (https://www.latlong.net/convert-address-to-lat-long.html) it seem like when I put in the example addresses, the Lat&Lng/Canadian Postal code aren’t actually on the website as they were in your example (The HTML for the coordinates site looked like this:

    Lat Long
    0,0

    and for the Canadian Postal Code site looked like this:

    Address found

    )

    I don’t know too much about webdev but I am assuming the content is loaded dynamically through some sort of JavaScript. Do you know if there is a way through RSelenium to access that content? Thanks again for the tutorial, really appreciate you taking the time 🙂

    1. Hi Adrian,

      Thanks for your comment! Try connecting to the chrome driver and run the code again. I have updated some code after I had trouble connecting to my chrome driver and ran my first example. Everything seems to work fine on my end. If you could provide your code that you ran that would be useful to me to help you out and provide better advice.

      If you still have trouble connecting to the chrome driver, here is a discussion on StackOverflow: https://stackoverflow.com/questions/55201226/session-not-created-this-version-of-chromedriver-only-supports-chrome-version-7/56173984#56173984

      I hope that helps! Let me know if you have any more questions.


    2. url <- "https://www.canadapost.ca/cpo/mc/personal/postalcode/fpc.jsf"
      remote_driver$navigate(url)

      address_element <- remote_driver$findElement(using = 'id', value = 'addressComplete')
      address_element$sendKeysToElement(list("413 Seymour Street Vancouver"))
      address_element$sendKeysToElement(list(" "))

      button_element <- remote_driver$findElement(using = 'id', value = 'searchFpc')
      button_element$clickElement()

      output <- remote_driver$findElement(using = "id", value="HeaderAddressLabel")
      output <- output$getElementText()


      unlist(output) %>%
      stringr::str_sub(., start = -7, end = -1)

      For the Canada Post website, there is a problem with autocompleting the address. The above code works but there also should be a better solution I have not found yet.

  4. I just want to thank the author for this tutorial. Very straight forward and saved me several more hours of chasing ghosts. 🙂

  5. Hi,

    Thank you. Very useful this tutorial. I have one question. If you have an input column in R (let’s say Place_Name column in a data frame named Data), how do you use this in the sentence sendKeysToElement? Because it doesn’t work like sendKeysToElement(Data$Place_Name). And you can’t use a list when you have 1000 rows or more.

  6. Hi I tried to use your code in the first example, but it gave me error message.
    driver <- rsDriver(browser=c("chrome"))
    remote_driver <- driver[["client"]]
    remote_driver$open()
    For these three lines of code, they sent messages to me saying "Selenium message:session not created: This version of ChromeDriver only supports Chrome version 95
    Current browser version is 94.0.4606.54 with binary path C:\Program Files (x86)\Google\Chrome\Application\chrome.exe
    Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
    System info: host: 'DESKTOP-ISSUGN5', ip: '192.168.1.73', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '16.0.1'
    Driver info: driver.version: unknown"

    I've updated my chrome to the latest version of 94, which was only released yesterday 9.22.2021.

    Could you please help with this problem? Thank you!

  7. Hi, I need help. In the process that I do, I need to go down to the bottom of the page, I have done this with the following code

    webElem <- remDr$findElement("css", "body")
    webElem$sendKeysToElement(list(key = "end"))

    Now I need to go back to the beginning on that same page, I would like to know how to do this?, or what is the key that I should use.
