An Introduction to Scraping Real Estate Data with rvest and RSelenium

December 14, 2018 By Pascal Schmidt

In this tutorial, I will explain how to scrape real estate data with rvest and RSelenium. In a previous tutorial, we scraped Indeed job postings and analysed what it takes to become a Data Scientist. Scraping data from Indeed was relatively easy because all the information we needed was available in the view source.

Sometimes, however, the view source does not contain all the code we see when a website loads. This is a problem because rvest cannot scrape an element that has no location in the view source. In short, we are unable to access that code without the help of a browser.

Here is what we will be covering in this tutorial on scraping real estate data with rvest and RSelenium:

  • Why do we need RSelenium?
  • Scraping real estate data with RSelenium
  • Developing a working scraper step by step
  1. Identifying the URL and page structure
  2. Scraping all the links from each page result
  3. Scraping the necessary information from each link
  4. Scraping real estate prices
  5. Scraping real estate street addresses
  6. Scraping square feet, bedrooms, bathrooms, age, and year built
  7. Putting all parts together into one big scraper

Let’s start scraping real estate data with rvest and RSelenium.

Why do we Need RSelenium?

When you look at the view source, you see the HTML exactly as it was delivered by the server, without any modification by JavaScript, for example. In order to see the website’s code in its current, rendered state, you have to use RSelenium.

Selenium is a web automation tool. It opens up a browser and sees everything on the website that you are seeing. This means that, when you inspect a website’s code, you are able to scrape all elements from it, even if the website is dynamically altered by JavaScript.

This is very handy and leaves you with many more opportunities to automatically get data from websites.

Scraping Real Estate Data with rvest and RSelenium

In order to see how to use RSelenium, we looked at Sotheby’s real estate postings. However, you can use any other real estate website that is dynamically altered and then use RSelenium to get the information you are interested in. Some websites do not allow web scraping, and there are specific copyright laws around using their data. Therefore, always make sure you scrape carefully.

When you are on the website and are searching postings from a specific town, then you’ll see something like this:

RSelenium Real Estate

The pictures link to the details of each house or apartment. There, you can find the price, the number of bedrooms and bathrooms, how old the house is, how much you have to pay in taxes each year, and the specific address of the property.

So far so good. However, when we click view source in our browser and want to collect all the links in order to scrape the details of each property, we see this:

RSelenium javascript

This means that the links are dynamically inserted by JavaScript. So, unfortunately, there is no way to get the links with rvest alone. rvest is not able to extract them because it only finds “JavaScript” instead of the actual link.
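To make this concrete, here is a minimal sketch of what the rvest-only attempt looks like; search_url stands for the search results URL we identify in Step 1 below, and the .plink class is the link element shown in the screenshots:

library(rvest)

# read the HTML exactly as the server delivers it (i.e. before JavaScript runs)
page <- xml2::read_html(search_url)

page %>%
  rvest::html_nodes(".plink") %>%
  rvest::html_attr("href")
# instead of the listing URLs we only get "JavaScript" placeholders (or nothing at all),
# because the real hrefs are only injected after the page has loaded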

However, if we click to inspect the page then we can clearly see that the links are there:

RSelenium

This is where RSelenium comes into play and helps us to scrape the necessary information. It helps us to see the website in real-time. 

Developing a Working Scraper Step by Step

Step 1: Identifying the URL and Page Structure

The first step is to identify the URL of the page. Let’s say we are interested in scraping real estate data in Vancouver for condos below one million dollars. The URL looks like this:

url <- "https://sothebysrealty.ca/en/search-results/
    region-greater-vancouver-british-columbia-real-estate/
    tloc-1/rloc-3/ptype-condo/price-0-1000000/
    view-grid/show-mls/sort-featured/pp-60/status-sales"

Afterwards, we have to identify the page structure. This is where we look at how many page results the website shows and how the URL changes when clicking through the pages.

RSelenium

There are 38 result pages in total, with 60 homes per page. In order to get the links for all 60 * 38 = 2280 homes, we do it like this:

urls <- sapply(2:38, function(x) {
  base_url <- paste0(
    "https://sothebysrealty.ca/en/search-results/",
    "region-greater-vancouver-british-columbia-real-estate/",
    "tloc-1/rloc-3/ptype-condo/price-0-1000000/",
    "view-grid/show-mls/sort-featured/pp-60/status-sales/page-"
  )
  paste0(base_url, x)
})

Notice how we are appending /page- to the URL. What we are doing is looping over 2, 3, 4, …, 38 and each time appending /page-2, /page-3, …, /page-38 to the base URL with paste0(base_url, x). Then, we store all 37 page URLs in the urls object; each page lists 60 homes, which gives us roughly 2280 listings in total. (If you also want the very first results page, add the base URL without the /page- suffix to urls.)

Step 2: Scraping all the Links from Each Page Result

Next, we use RSelenium to get all 60 links from each URL we stored previously. If you have not installed RSelenium yet, now is the perfect time to do it and to open up a browser with:

install.packages("RSelenium")
library(RSelenium)

driver <- rsDriver(browser=c("chrome"))
remDr <- driver[["client"]]
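If everything worked, a browser window should pop up. As a quick, optional sanity check that the session responds, you can open the site and read back the page title (the exact title text will vary):

# open the site in the automated browser and read back the page title
remDr$navigate("https://sothebysrealty.ca/en")
remDr$getTitle()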

Now, we are ready to get all the links. We will be storing the links in a data frame. The code below shows you how to do that:

df_all <- data.frame()

for(i in 1:length(urls)) {
  # open the i-th results page in the browser
  remDr$navigate(urls[[i]])
  Sys.sleep(1)
  
  # grab every element of class "plink" and extract its href attribute
  links <- remDr$findElements(using = "xpath", value = "//*[@class = 'plink']")
  df <- data.frame(link = unlist(sapply(links, function(x) { x$getElementAttribute("href") })))
  Sys.sleep(1)
  
  df_all <- rbind(df_all, df)
}

The code above loops over each URL with a for loop. It opens the URL, locates the elements with the specified XPath (have a look at the picture above, where it shows a link and where it says class="plink"), loops over all elements of class="plink" with sapply, and then extracts each link via the href attribute. Lastly, we store the links in the df_all data frame. That was not too hard, was it? If you are confused about what an XPath is and how we got the links, check out this tutorial, where I go into more detail about web scraping and XPaths.
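Two small housekeeping steps are worth adding at this point (both are cautious suggestions rather than strict requirements): dropping any duplicate links, and shutting the browser down once all links are collected.

# drop duplicate links in case the same listing shows up on more than one page
df_all <- unique(df_all)

# we no longer need the browser: close the client and stop the Selenium server
remDr$close()
driver[["server"]]$stop()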

Step 3: Scraping the Necessary Information from Each Link

Now, it is just a piece of cake to locate the information we need. 

Scraping Real Estate Data with rvest

Step 3.1: Scraping Real Estate Prices

Rvest real estate web scraping

The price is located under the ul element and can be extracted with the following XPath: //*[@class="price_social"]. The code would look like this:

#get house price
  house_price <- page %>% 
    rvest::html_nodes("ul") %>% 
    rvest::html_nodes(xpath = '//*[@class="price_social"]') %>% 
    rvest::html_text() %>%
    .[[1]]
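html_text() returns the price as raw text, typically with a currency symbol and thousands separators (the exact format is an assumption). A minimal sketch for turning it into a number:

# hypothetical cleanup: keep only digits and the decimal point, then convert to numeric
house_price_num <- house_price %>%
  stringr::str_replace_all("[^0-9.]", "") %>%
  as.numeric()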

Step 3.2: Scraping Real Estate Street Addresses

Rvest real estate address

The address is located under the span element and the h1 header. The XPath would be //*[@class="span8"]. The code to get the address looks like this:

#get street address
 street_address <- page %>% 
   rvest::html_nodes("span") %>% 
   rvest::html_nodes(xpath = '//*[@class="span8"]') %>% 
   rvest::html_nodes("h1") %>%
   rvest::html_text()

Step 3.3: Scraping Square Feet, Bedrooms, Bathrooms, Age, and Year Built

Extracting the remaining information from the website is very easy. There is a section called Key Facts which contains everything else we need. We can scrape it with the code below:

# getting the key facts from the condo
  # key facts are: building type, square feet, year built, bedrooms and bathrooms,
  # taxes, and age
  key_facts <- page %>% 
    rvest::html_nodes("ul") %>% 
    rvest::html_nodes(xpath = '//*[@class="key_facts"]') %>% 
    rvest::html_nodes("li") %>%
    rvest::html_text()
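The key facts come back as plain strings such as "Bedrooms: 2" (the exact wording on the page is an assumption). In the full scraper below we turn this vector into a named vector so that each fact can be looked up by its label; here is a minimal preview of that step:

# minimal preview: the text before the colon becomes the name,
# the text after the colon becomes the value ("Bedrooms: 2" -> c(Bedrooms = "2"))
key_facts <- key_facts %>%
  stringr::str_replace_all(".*: ", "") %>%
  purrr::set_names(nm = stringr::str_replace_all(key_facts, ":.*", ""))

The scraper in Step 4 does the same thing, plus a little extra cleanup of the names.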

And voila! We have all the information we need. Now it is time to put the different parts together into one big scraper that does the work for us. 

Step 4: Putting All Parts Together Into One Big Scraper

# convert the links, which are stored as factors in the data frame, to a character vector
links <- sapply(df_all$link, as.character)

# initialize an empty data frame where we will store the scraped data
df_all_data <- data.frame()

# write our scraper function
scraper <- function(links) {
  
  # save link in url object
  url <- links
  # parse page url
  page <- xml2::read_html(url)
  Sys.sleep(0.25)
  
  #get house price
  house_price <- page %>% 
    rvest::html_nodes("ul") %>% 
    rvest::html_nodes(xpath = '//*[@class="price_social"]') %>% 
    rvest::html_text() %>%
    .[[1]]
  
  #get street address
  street_address <- page %>% 
    rvest::html_nodes("span") %>% 
    rvest::html_nodes(xpath = '//*[@class="span8"]') %>% 
    rvest::html_nodes("h1") %>%
    rvest::html_text()
  
  # getting the key facts from the condo
  # key facts are: building type, square feet, year built, bedrooms and bathrooms,
  # taxes, and age
  key_facts <- page %>% 
    rvest::html_nodes("ul") %>% 
    rvest::html_nodes(xpath = '//*[@class="key_facts"]') %>% 
    rvest::html_nodes("li") %>%
    rvest::html_text()
  
  # removing unnecessary content from the vector of strings and naming the vector elements
  key_facts %>%
    stringr::str_replace_all(., ".*: ", "") %>%
    purrr::set_names(., nm = stringr::str_replace_all(key_facts, ":.*", "") %>%
                       stringr::str_replace_all(., "[0-9]+", "") %>%
                       stringi::stri_trim_both(.)) -> key_facts
  
  # the following code assigns the scraped data for each condo where applicable
  # if the information is not available, we are filling the observation with a NA value
  # for example, there are condos where taxes are not available
  # moreover, some condos are going to get build in the future, so age was not available
  
  # get the building type 
  building_type <- ifelse("Property Type" %in% names(key_facts),
                          key_facts[ grep("Property Type", names(key_facts), ignore.case = TRUE, value = TRUE) ],
                          NA) 
  
  # get square feet
  square_feet <- ifelse("Living Space" %in% names(key_facts),
                        key_facts[ grep("Living Space", names(key_facts), ignore.case = TRUE, value = TRUE) ],
                        NA)
  
  # get the number of bedrooms
  bedrooms <- ifelse("Bedrooms" %in% names(key_facts),
                     key_facts[ grep("Bedrooms", names(key_facts), ignore.case = TRUE, value = TRUE) ],
                     NA)
  
  # get the number of bathrooms
  bathrooms <- ifelse("Bathrooms" %in% names(key_facts),
                      key_facts[ grep("Bathrooms", names(key_facts), ignore.case = TRUE, value = TRUE) ],
                      NA)
  
  # get when the condo was built
  year_built <- ifelse("Year Built" %in% names(key_facts),
                       key_facts[ grep("Year Built", names(key_facts), ignore.case = TRUE, value = TRUE) ],
                       NA)
  
  # get the age of the condo
  age <- ifelse("Approximate Age" %in% names(key_facts),
                key_facts[ grep("Age", names(key_facts), ignore.case = TRUE, value = TRUE) ],
                NA)
  
  # get the taxes (property taxes)
  taxes <- ifelse("Other Taxes" %in% names(key_facts) | "Municipal Taxes" %in% names(key_facts),  
                  key_facts[ grep("taxes", names(key_facts), ignore.case = TRUE, value = TRUE) ], 
                  NA)
  
  # storing individual links in df_individual_page object
  df_individual_page <- data.frame(price = house_price,
                                   address = street_address,
                                   squares = square_feet,
                                   type = building_type,
                                   year = year_built,
                                   age = age,
                                   bed = bedrooms,
                                   bath = bathrooms,
                                   tax = taxes)
  
  # append this condo's row to df_all_data
  # <<- assigns into the global environment, so df_all_data keeps growing outside the function
  df_all_data <<- rbind(df_all_data, df_individual_page)
}

# looping over all links in the vector and applying scraper function to each link
sapply(links, scraper)

And that’s it. We did some data cleaning inside the function and made some modifications to make sure our function does not throw an error when it cannot find elements such as age or taxes.
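Before moving on to any analysis, it is worth writing the scraped data to disk so you do not have to re-scrape it (the file name below is just an example):

# save the scraped listings to a CSV file for later analysis
readr::write_csv(df_all_data, "sothebys_vancouver_condos.csv")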

Let me know in the comments below if this tutorial about scraping real estate data with rvest and RSelenium was helpful in getting you started. Also, let me know what kind of questions you want to answer with your new real estate data.

Questions that might be helpful to start a project would be:

  • What are the different price to rent ratios in different areas of a city? (see the sketch below this list)
  • What price for a house would be considered below market value?
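As a hypothetical sketch for the first question, suppose you have derived a neighbourhood column from the scraped address and have a separate table of median annual rents per neighbourhood (both the column and the rents_by_neighbourhood table are assumptions, not part of the data we scraped above); the ratio could then be computed like this:

library(dplyr)

# hypothetical sketch: median sale price per neighbourhood divided by median annual rent
df_all_data %>%
  dplyr::mutate(price_num = as.numeric(stringr::str_replace_all(as.character(price), "[^0-9.]", ""))) %>%
  dplyr::group_by(neighbourhood) %>%
  dplyr::summarise(median_price = median(price_num, na.rm = TRUE)) %>%
  dplyr::left_join(rents_by_neighbourhood, by = "neighbourhood") %>%  # assumed external rent table
  dplyr::mutate(price_to_rent = median_price / median_annual_rent)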

It would be great to build different scrapers for multiple websites. This would decrease bias in the kind of real estate data we are getting. For example, some websites only post luxury homes. Another selection bias would be websites that only post data for a specific neighbourhood in a city.

If you are looking for other tutorials as well, here is a tutorial about how to scrape Zillow real estate postings. 

Disclaimer: Any code provided in this tutorial is for illustration and learning purposes only. The presence of this code does not imply that we encourage scraping or that we scrape the websites referenced in the code and tutorial. The tutorials only illustrate the technique of programming web scrapers with RStudio, rvest, and RSelenium for various websites.

Comments (5)

  1. Hey man,

    Awesome post! I tried following along but I can’t seem to get it to work for the website I am trying to scrape. I was wondering if you could give me any help. The website is : http://www.njaqinow.net
    I would like to scrape the table under the “Current Status” section on the left nav bar. I’m confused on what xpath I should use

    1. Hey man,

      Thank you. I am not sure either; it seems like you also have to use RSelenium to scrape the table. I just downloaded the txt file, read it into R, and made some adjustments so the data looks better. Hope that helps. Here is the code:


      library(tidyverse)

      table <- read.table("YOUR WORKING DIRECTORY/table.txt", sep = "|", header = TRUE)

      table %>%
        .[-1, -1] %>%
        as.data.frame() %>%
        dplyr::mutate_all(as.character) %>%
        dplyr::mutate_all(stringi::stri_trim_both) %>%
        dplyr::mutate_at(vars(O3:WSPD), as.numeric)

      Have you figured out the xpath yet?
