RSelenium Tutorial: A Tutorial to Basic Web Scraping With RSelenium
January 22, 2019 · By Pascal Schmidt · R · Web Scraping
Scraping data from the web is a common task in data analysis. In fact, it is a creative way to build a unique data set that no one else has analyzed before. Oftentimes, we can use packages such as rvest, scrapeR, or Rcrawler to get the job done. However, sometimes we want to scrape dynamic web pages that can only be scraped with RSelenium. This RSelenium tutorial will introduce you to how web scraping works with this R package.
RSelenium automates a web browser and lets us scrape content that is dynamically altered by JavaScript, for example.
In this RSelenium tutorial, we will go over two examples of how it can be used.

- For example #1, we want to get latitude and longitude coordinates for some street addresses we have in our data set. In order to do that, we have to let RSelenium type in our addresses, hit the Enter button, and then scrape the latitude and longitude coordinates from the website.
- For example #2, we do something similar with postal codes.

Let's jump into our examples and this RSelenium tutorial!
UPDATE 09/11/2019:

- After having trouble opening a remote driver because the version did not match the RSelenium package, I changed the web driver version here.
- I also fixed some typos thanks to Sam's comment!
UPDATE 16/02/2020:
- After I had trouble connecting to my Chrome browser again, I found the following solution on StackOverflow. I copy-pasted the code from there (for Windows), which you can see below.
- You can find the code for this tutorial on my GitHub.
Example #1
Step 1: Navigate to the URL
For the first example, we are going to visit https://www.latlong.net/.
In the picture above, we can see the text box Place Name, where we are going to let RSelenium type in our street addresses. Afterward, we have to let RSelenium click the Find button, and then we have to scrape the results that appear in the Latitude and Longitude boxes.
Step 2: Let RSelenium Type in the Necessary Fields
```r
################
### Original ###
################
library(RSelenium)
library(tidyverse)

driver <- rsDriver(browser = c("chrome"))
remote_driver <- driver[["client"]]
remote_driver$open()
remote_driver$navigate("https://www.latlong.net/convert-address-to-lat-long.html")

#########################
### UPDATE 09/11/2019 ###
#########################
driver <- rsDriver(browser = c("chrome"), chromever = "78.0.3904.70")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.latlong.net/convert-address-to-lat-long.html")

#########################
### UPDATE 16/02/2020 ###
#########################
driver <- RSelenium::rsDriver(
  browser = "chrome",
  chromever =
    system2(command = "wmic",
            args = 'datafile where name="C:\\\\Program Files (x86)\\\\Google\\\\Chrome\\\\Application\\\\chrome.exe" get Version /value',
            stdout = TRUE,
            stderr = TRUE) %>%
    stringr::str_extract(pattern = "(?<=Version=)\\d+\\.\\d+\\.\\d+\\.") %>%
    magrittr::extract(!is.na(.)) %>%
    stringr::str_replace_all(pattern = "\\.", replacement = "\\\\.") %>%
    paste0("^", .) %>%
    stringr::str_subset(string = binman::list_versions(appname = "chromedriver") %>%
                          dplyr::last()) %>%
    as.numeric_version() %>%
    max() %>%
    as.character()
)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.latlong.net/convert-address-to-lat-long.html")
```
First, we have to load the libraries. Then we connect to the Chrome driver and navigate to the desired URL we want to scrape data from.
Now, we have to take a look at where the Place Name box is located in the HTML code.
When looking at the HTML code, we can see that the box is located in the snippet above, inside the element with @class = "width70". So, the code below shows how to navigate to that particular text box.
```r
address_element <- remote_driver$findElement(using = 'class', value = 'width70')
```
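If the class name ever stops being unique on the page, the same input can be targeted with an explicit XPath instead. This is a hedged sketch, assuming the address box still carries the width70 class and an open RSelenium session named remote_driver:

```r
# Alternative locator: an XPath that matches the same input element.
# Assumption: the page still marks the address box with class "width70".
address_element <- remote_driver$findElement(
  using = "xpath",
  value = "//input[@class='width70']"
)
```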
Now, we have to let RSelenium type in the address we want to get coordinates for.
```r
address_element$sendKeysToElement(list("Lombard Street, San Francisco"))
```
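When the same box is reused for several addresses, leftover text from a previous query can corrupt the next one. A small sketch, using RSelenium's clearElement() method on the element located above, that empties the field before typing:

```r
# Clear any previous input before sending the new address,
# so queries do not get concatenated in the text box.
address_element$clearElement()
address_element$sendKeysToElement(list("Lombard Street, San Francisco"))
```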
We are almost done. Now we have to press the Find button in order to get the coordinates.
In the code below, we locate the button via its class attribute, @class = "button".
```r
button_element <- remote_driver$findElement(using = 'class', value = "button")
```
After we have located the button, we have to click it.
```r
button_element$clickElement()
```
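As an alternative to clicking the button, many search forms also submit when Enter is pressed inside the text box. A sketch using RSelenium's named-key list (assuming the address_element located earlier):

```r
# Press Enter inside the address box instead of clicking "Find".
# RSelenium maps the name "enter" to the Enter key via its selKeys table.
address_element$sendKeysToElement(list(key = "enter"))
```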
Step 3: Scrape the Coordinates From the Website
When we scroll down, we see the coordinates displayed like this:
They are located here in the HTML code, under the element with @class = "coordinatetxt".
```r
out <- remote_driver$findElement(using = "class", value = "coordinatetxt")
lat_long <- out$getElementText()
```
When we have a lot of addresses we want coordinates for, this can be accomplished like this:
```r
street_names <- c("Lombard Street, San Francisco",
                  "Santa Monica Boulevard",
                  "Bourbon Street, New Orleans",
                  "Fifth Avenue, New York",
                  "Richards Street, Vancouver")

get_lat_lon <- function(street_names) {
  remote_driver$navigate("https://www.latlong.net/convert-address-to-lat-long.html")
  final <- c()
  for (i in 1:length(street_names)) {
    remote_driver$refresh()
    Sys.sleep(1)
    address_element <- remote_driver$findElement(using = 'class', value = 'width70')
    address_element$sendKeysToElement(list(street_names[i]))
    button_element <- remote_driver$findElement(using = 'class', value = "button")
    button_element$clickElement()
    Sys.sleep(3)
    out <- remote_driver$findElement(using = "class", value = "coordinatetxt")
    output <- out$getElementText()
    final <- c(final, output)
  }
  return(final)
}

vector_out <- get_lat_lon(street_names)
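The fixed Sys.sleep(3) can be fragile on a slow connection. One hedged alternative is to poll the coordinate box until its text changes from whatever it showed before the click; the helper below is a sketch (the function name wait_for_change and the timeout are my own, not part of the original tutorial):

```r
# Poll for up to `timeout` seconds instead of sleeping a fixed interval.
# `old_text` is whatever the coordinate box showed before clicking "Find";
# we return as soon as it changes to something non-empty.
wait_for_change <- function(driver, old_text, timeout = 10) {
  start <- Sys.time()
  repeat {
    el <- driver$findElement(using = "class", value = "coordinatetxt")
    new_text <- unlist(el$getElementText())
    if (!identical(new_text, old_text) && nzchar(new_text)) return(new_text)
    if (as.numeric(difftime(Sys.time(), start, units = "secs")) > timeout) {
      return(NA_character_)  # give up after the timeout
    }
    Sys.sleep(0.25)
  }
}
```

Inside the loop, `Sys.sleep(3)` would then be replaced by a call like `output <- wait_for_change(remote_driver, old_text)`.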
Afterward, we can extract the latitude and longitude values with the code below.
```r
data.frame(street_names, vector_out = purrr::flatten_chr(vector_out)) %>%
  dplyr::mutate(., vector_out = stringr::str_remove_all(vector_out, "\\(|\\)")) %>%
  tidyr::separate(., vector_out, into = c("latitude", "longitude"), sep = ",")
```

Output:

```
                   street_names   latitude    longitude
1 Lombard Street, San Francisco  37.799999  -122.434402
2        Santa Monica Boulevard -27.867491   153.353973
3   Bourbon Street, New Orleans  29.964100   -90.060791
4        Fifth Avenue, New York  42.105412   -76.247070
5    Richards Street, Vancouver  49.279030  -123.119431
```
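Note that separate() leaves latitude and longitude as character columns. A small self-contained sketch (using made-up example strings in the same "(lat, long)" shape the site returns, no live browser needed) shows the extra conversion step before mapping or joining:

```r
# Example strings shaped like the scraped output
coords <- c("(37.799999, -122.434402)", "(29.964100, -90.060791)")

# Strip parentheses, split on the comma, then coerce to numeric
df <- data.frame(raw = stringr::str_remove_all(coords, "\\(|\\)"))
df <- tidyr::separate(df, raw, into = c("latitude", "longitude"), sep = ",")
df$latitude  <- as.numeric(df$latitude)
df$longitude <- as.numeric(df$longitude)  # as.numeric() tolerates the leading space
```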
Let's jump to the next example of this RSelenium tutorial.
Example #2
Step 1: Navigate to the URL
As before, we want to go to the website we want to scrape data from. In our second example, we will be using the URL https://www.canadapost.ca/cpo/mc/personal/postalcode/fpc.jsf#.
Again, we can see the box where we have to enter our address, and the search button we have to click after inserting our address.
Step 2: Let RSelenium Type in the Necessary Fields
First, we have to navigate to the desired URL.
```r
url <- "https://www.canadapost.ca/cpo/mc/personal/postalcode/fpc.jsf"
remote_driver$navigate(url)
```
Then, we have to tell RSelenium to put the desired address into the box. We do that by locating where the box lies in the HTML code.
The XPath is underlined in green. The code to put text in the text box looks like this:
```r
address_element <- remote_driver$findElement(using = 'id', value = 'addressComplete')
address_element$sendKeysToElement(list("413 Seymour Street Vancouver"))
```
Now, we have to locate the Search button in order to get the postal code for the address.
The XPath is underlined in green.
To click to the search button, we have to execute the following code:
```r
button_element <- remote_driver$findElement(using = 'id', value = 'searchFpc')
button_element$clickElement()

##############
### Update ###
##############
# The website has changed a bit, so try the code below instead
button_element <- remote_driver$findElement(using = 'class', value = 'clear')
button_element$clickElement()
```
After that, we only have to extract the desired information and we are done!
Step 3: Scrape the Postal Code From the Website
In order to get the address we have to do the following:
```r
output <- remote_driver$findElement(using = "id", value = "HeaderAddressLabel")
output <- output$getElementText()
```

Output:

```
"413 SEYMOUR ST\nVANCOUVER BC V6B 3H5"
```
To only get the postal code, we can simply do:
```r
unlist(output) %>%
  stringr::str_sub(., start = -7, end = -1)
```

Output:

```
"V6B 3H5"
```
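The str_sub() call assumes the postal code always occupies the last seven characters of the scraped string. A slightly more robust sketch matches the general Canadian postal-code shape anywhere in the text (a simplified pattern; real postal codes exclude a few letters, which this regex does not enforce):

```r
# Canadian postal codes follow the shape A1A 1A1
# (letter-digit-letter, optional space, digit-letter-digit)
text <- "413 SEYMOUR ST\nVANCOUVER BC V6B 3H5"
postal <- stringr::str_extract(text, "[A-Za-z]\\d[A-Za-z] ?\\d[A-Za-z]\\d")
```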
I hope you have enjoyed this short RSelenium tutorial about web scraping. If you have any questions or suggestions, then let me know in the comments below.
Additional Resources
- If you are interested in other web scraping tutorials, then you can check out my post about scraping Indeed Job Postings.
- Another example of web scraping would be my post about building a scraper for a real estate website.
Comments (16)
Hi, thanks for your time in putting this together. This was very helpful for me.
However, I’m having trouble executing your function and dataframe codes from example 1. In the second set of code from step 3, you include “street_address” as an object. Do you mean “street_names” instead? Second, “lenght” should be “length.” Third, I could only get this function to work by changing the last line from “out[[i]] <<- out$getElementText()" to "out[[as.character(i)]] <<- out$getElementText()."
After doing these steps, I am able to run the function successfully. However, I am unable to make the dataframe from the third set of code; there's no object named "vector_out" and I'm not sure what to do to make this work. Thanks for your time.
Sam
Hi Sam,
Thanks for your comment. I had a couple of bad typos in there possibly due to copy pasting incorrectly. My bad! I updated the post and ran the first example again. It all works on my part now.
Hopefully there are no more errors in there. Let me know if you can get it to work this time! Thanks again for pointing out the mistakes!
Hi, thanks a lot for this post. I ran your codes (example #2). I checked the screenshot using screenshot(display = TRUE) to verify the address is input correctly. But I got a weird result: “4-1041 PINE ST\nDUNNVILLE ON N1A 2N1”.
Any idea how I can fix it? Thanks.
CL
Hi CL,
Thank you for your comment!
Try this code:
```r
button_element <- remote_driver$findElement(using = 'class', value = 'clear')
button_element$clickElement()
```

instead of this one:

```r
button_element <- remote_driver$findElement(using = 'id', value = 'searchFpc')
button_element$clickElement()
```

This should make it work. Next time it won't take that long for me to reply, and I hope it still helps.
Hey Pascal, great blog post! Thank you for putting this tutorial together. I was able to connect to the Selenium server (the rsDriver() wrapper was giving me some trouble, so I did it the old-fashioned way). I was able to make the driver, use a Firefox browser to access the sites and the specific HTML elements referenced, etc. However, I'm getting no data once I run my code. Viewing the source for the two websites (https://www.canadapost.ca/cpo/mc/personal/postalcode/fpc.jsf and https://www.latlong.net/convert-address-to-lat-long.html), it seems like when I put in the example addresses, the Lat/Long and Canadian postal code aren't actually in the source as they were in your example (the coordinates site just showed "Lat Long" and "0,0", and the Canada Post site just showed "Address found").
I don’t know too much about webdev but I am assuming the content is loaded dynamically through some sort of JavaScript. Do you know if there is a way through RSelenium to access that content? Thanks again for the tutorial, really appreciate you taking the time 🙂
Hi Adrian,
Thanks for your comment! Try connecting to the Chrome driver and run the code again. I have updated some code after I had trouble connecting to my Chrome driver, and I reran my first example. Everything seems to work fine on my end. If you could provide the code that you ran, that would help me give you better advice.
If you still have trouble connecting to the Chrome driver, here is a discussion on StackOverflow: https://stackoverflow.com/questions/55201226/session-not-created-this-version-of-chromedriver-only-supports-chrome-version-7/56173984#56173984
I hope that helps! Let me know if you have any more questions.
```r
url <- "https://www.canadapost.ca/cpo/mc/personal/postalcode/fpc.jsf"
remote_driver$navigate(url)

address_element <- remote_driver$findElement(using = 'id', value = 'addressComplete')
address_element$sendKeysToElement(list("413 Seymour Street Vancouver"))
address_element$sendKeysToElement(list(" "))

button_element <- remote_driver$findElement(using = 'id', value = 'searchFpc')
button_element$clickElement()

output <- remote_driver$findElement(using = "id", value = "HeaderAddressLabel")
output <- output$getElementText()

unlist(output) %>%
  stringr::str_sub(., start = -7, end = -1)
```
For the Canada Post website, there is a problem with autocompleting the address. The above code works, but there should also be a better solution that I have not found yet.
Hello,
Can you suggest a way to refer to a hyperlink in a page and click on it ?
Hello,
You could just navigate to the href attribute and then open the URL as I showed in this tutorial.
I just want to thank the author for this tutorial. Very straight forward and saved me several more hours of chasing ghosts. 🙂
Thank you 🙂 very much appreciated
Hi,
Thank you, a very useful tutorial. I have one question. If you have an input column in R (let's say a Place_Name column in a data frame named Data), how do you use this with sendKeysToElement? Because it doesn't work like sendKeysToElement(Data$Place_Name). And you can't write out a list by hand when you have 1000 rows or more.
Hi, I tried to use your code from the first example, but it gave me an error message.
```r
driver <- rsDriver(browser = c("chrome"))
remote_driver <- driver[["client"]]
remote_driver$open()
```

For these three lines of code, I got the message:

```
Selenium message: session not created: This version of ChromeDriver only supports Chrome version 95
Current browser version is 94.0.4606.54 with binary path C:\Program Files (x86)\Google\Chrome\Application\chrome.exe
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'DESKTOP-ISSUGN5', ip: '192.168.1.73', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '16.0.1'
Driver info: driver.version: unknown
```
I've updated my Chrome to the latest version of 94, which was only released yesterday (9/22/2021).
Could you please help with this problem? Thank you!
Hi, I need help. In my workflow, I need to go down to the bottom of the page. I have done this with the following code:
```r
webElem <- remDr$findElement("css", "body")
webElem$sendKeysToElement(list(key = "end"))
```
Now I need to go back to the top of that same page. How can I do this, or what key should I use?
I think you can try `webElem$sendKeysToElement(list(key = "home"))`. Let me know if that works.