A Detailed Guide to Web Scraping Indeed Jobs With R and rvest
November 1, 2018 By Pascal Schmidt R web scraping
In this tutorial we will be covering how to scrape Indeed jobs with R and rvest. You will learn how to locate exactly the information you want in the HTML document. By the end, we will have developed a fully functioning scraper for your own use.
Before we jump in, here and here are two blog posts about the analysis of the scraped data. Specifically, we wanted to know how to become a Data Scientist. We looked at the level of education required, which majors you should pick, and which technologies are most important to know.
The second post is about the differences between a Data Scientist, a Data Analyst, and a Data Engineer.
Now let’s jump in!
Web Scraping Indeed Jobs With R and rvest: Where to Start?
Working in a technology-driven industry, it is very important to keep up with current trends. This is particularly true in the field of Data Science, where no one knows exactly where the boundaries between a Data Scientist, a Data Analyst, and a Data Engineer lie, let alone what skills are required for each position. Therefore, we will attempt to clarify what it takes to become a Data Scientist by developing a web scraper for Indeed job postings.
Web scraping Indeed jobs with R can easily be accomplished with the rvest package. With this package, getting the relevant information from Indeed's website is a straightforward process.
So let’s start with what we will be covering:
- How to get job titles from Indeed’s website.
- How to get job locations.
- How to look for company names.
- How to scrape all summary descriptions for each job.
- Building an entire scraper by putting all parts together.
First, we will be loading the required packages for this tutorial.
library(tidyverse)
library(rvest)
library(xml2)
Then we’ll have a look at how to get job titles from the web page. We want to look for Data Scientist jobs in Vancouver, Canada.
How To Get Indeed’s Data Into R
After we are done with that, we will copy the link address and store the URL in a variable called url. Then we will use the xml2 package and the read_html function to parse the page. In short, this means that the function will read in the code from the webpage and break it down into different elements (<div>, <span>, <p>, etc.) for you to analyse.
In the code below, we will show you how to get the page into R in order for you to analyse it.
url <- "https://www.indeed.ca/Data-Scientist-jobs-in-Vancouver%2C+BC"
page <- xml2::read_html(url)
After we are done, we are ready to extract the relevant nodes from the XML object.
We call elements like divs (<div>), spans (<span>), paragraphs (<p>), or anchors (<a>) nodes after they have been parsed by the xml2::read_html() function. Besides element nodes, there are also attribute nodes and text nodes.
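To make the distinction concrete, here is a minimal, self-contained sketch with an invented HTML snippet (not Indeed's real markup) showing an element node, an attribute node, and a text node:

# an invented HTML snippet for illustration only
snippet <- xml2::read_html('<div><a class="jobtitle" title="Data Scientist">Data Scientist</a></div>')

snippet %>% rvest::html_nodes("a")                                # the element node itself
snippet %>% rvest::html_nodes("a") %>% rvest::html_attr("title")  # an attribute node: "Data Scientist"
snippet %>% rvest::html_nodes("a") %>% rvest::html_text()         # a text node: "Data Scientist"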
But wait… how do we know where to find all the relevant information we are looking for? Well, that is the hard part of developing a successful scraper. Luckily, Indeed’s website is not very hard to scrape.
So, what we will be doing is looking at the source code of the website and also inspecting individual elements.
Inspecting Indeed’s Webpage For Scraping
Let’s first inspect the code. On Windows, you can right-click and then select Inspect while you are on Indeed’s website. This should look like this:
Then, click on the little arrow in the top right corner and hover over elements on Indeed’s website.
By doing that, you can see that the corresponding code on the right-hand side gets highlighted. The job title Data Scientist and Statistician is located under the anchor tag. If we look more into it, we can also see that it is located under the jobtitle CSS selector and under the XPath a[@class="jobtitle"]. This makes it so much easier to find individual pieces on a website. Now you might ask yourself what CSS selectors are and what an XPath is.
XPath: a path to specifically extract certain parts from a tree-structured document such as XML or HTML. The path can be very specific and makes it possible to grab certain parts of a website easily.
CSS Selectors: A CSS selector has a similar function to an XPath. Namely, locating certain nodes in a document and extracting information from these nodes. Every CSS selector can be translated into an equivalent XPath, but not the other way around.
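To make the relationship concrete, here is a tiny sketch reusing the page object parsed above and the jobtitle class we saw in the inspector, expressing the same selection both ways:

# the same anchors, selected via XPath and via the corresponding CSS selector;
# note the XPath matches the exact class string, while the CSS selector would
# also match anchors that carry additional classes
page %>% rvest::html_nodes(xpath = '//a[@class="jobtitle"]')
page %>% rvest::html_nodes("a.jobtitle")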
Scraping Indeed Jobs With R and rvest: How to Get Job Titles
Here is an example of how the syntax of an XPath works: //tagname[@attribute = "value"]
Now let’s have a look at a html code snippet on Indeed’s website:
Here we can see that there is an attribute data-tn-element whose value is "jobTitle". This particular attribute is under the anchor node. So let's construct the XPath: //a[@data-tn-element = "jobTitle"]. And voila, we get all job titles. You'll notice that we have included //* instead of //a in our code below. The star acts as a wildcard and selects all elements or nodes, not just the anchor node. For Indeed's website, the attribute data-tn-element is always under the anchor node, so the wildcard symbol wouldn't be necessary.
So let’s recap what we have done:
- We looked at the source code and identified that the job title is located within the anchor (<a>) and <div> nodes.
- Then we looked at the attribute data-tn-element with the value "jobTitle".
- From there, we grabbed the title attribute and extracted the information.
Don’t be discouraged if it looks somewhat complicated at first. It takes some time to get used to the structure of an HTML document. If you have a basic notion of what an XPath is, of the different kinds of nodes, and of how you can select elements from a document, then you can start trying. Keep selecting different nodes and different attributes until you are happy with your results. Especially at the beginning, there is a lot to learn from web scraping, and it’s trial and error until you get the information you want.
In the code below, we are selecting all div nodes and specifying the XPath from which we grab the title attribute to get all job titles from the website.
#get the job title
job_title <- page %>%
  rvest::html_nodes("div") %>%
  rvest::html_nodes(xpath = '//*[@data-tn-element = "jobTitle"]') %>%
  rvest::html_attr("title")
Alternatively, we could have specified a CSS selector as well. This would look like this:
#get the job title
page %>%
  rvest::html_nodes('[data-tn-element="jobTitle"]') %>%
  rvest::html_attr("title")

# or
page %>%
  rvest::html_nodes('a[data-tn-element="jobTitle"]') %>%
  rvest::html_attr("title")
Either way, we are getting all the job titles from the website.
Let’s move on to the next step: getting the job location and the company name. First, let’s have a look at the source code and find out where company names and locations are located in the document.
Scraping Indeed Jobs With R and rvest: How to Get Company Locations and Company Names
The next picture shows where the company name is located.
We can see that the company location and name are located in <span> elements with class attribute values of location and company, respectively.
Let’s see how we can extract this information from the document. First, we’ll specify the XPath.
# get company location
page %>%
  rvest::html_nodes("span") %>%
  rvest::html_nodes(xpath = '//*[@class="location"]') %>%
  rvest::html_text() %>%
  stringi::stri_trim_both()

# get company name
page %>%
  rvest::html_nodes("span") %>%
  rvest::html_nodes(xpath = '//*[@class="company"]') %>%
  rvest::html_text() %>%
  stringi::stri_trim_both()
Now, we are getting the exact same information, just with the corresponding CSS selectors.
# get job location
page %>%
  rvest::html_nodes(".location") %>%
  rvest::html_text()

# get the company name
page %>%
  rvest::html_nodes(".company") %>%
  rvest::html_text() %>%
  stringi::stri_trim_both()
Scraping Indeed Jobs With R and rvest: How to Get All Links
Lastly, we want to get the job description from every single job on the website. You’ll notice that, on the current page, there is just a short meta description of the job summary. However, we want the full description of how many years of experience we need, what skill set is required, and what responsibilities the job entails.
In order to do that we have to collect the links on the website. We do that with the following code.
# get links (XPath)
page %>%
  rvest::html_nodes("div") %>%
  rvest::html_nodes(xpath = '//*[@data-tn-element="jobTitle"]') %>%
  rvest::html_attr("href")

# get links (CSS selector)
page %>%
  rvest::html_nodes('[data-tn-element="jobTitle"]') %>%
  rvest::html_attr("href")
After we have collected the links we can now locate where the job description is located in the document.
Looking at the picture above, we notice that the job description is in a <span> element with a class attribute value of jobsearch-JobComponent-description icl-u-xs-mt--md. Let’s have a look at the code below.
# get job description (XPath)
page %>%
  rvest::html_nodes("span") %>%
  rvest::html_nodes(xpath = '//*[@class="jobsearch-JobComponent-description icl-u-xs-mt--md"]') %>%
  rvest::html_text() %>%
  stringi::stri_trim_both()

# get job description (CSS selector); the two classes are chained with a dot
page %>%
  rvest::html_nodes('.jobsearch-JobComponent-description.icl-u-xs-mt--md') %>%
  rvest::html_text() %>%
  stringi::stri_trim_both()
That was the majority of our work! We are now done scraping Indeed jobs with R and can focus on building a functioning scraper. That means we have to put all the different parts together. One more thing we have to implement in our scraper is handling multiple pages of results.
We can do that by tweaking the URL in our code. Notice what happens to the URL when we click on page number 2 and keep going until the end.
We have to manually find out how many pages of results Indeed’s website returns for our query. When we have done that, we are finally ready to build the scraper. Let’s go!
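For example, assuming the page size stays at ten results per page, the paginated URLs differ only in the &start= parameter, which is what the scraper below exploits:

# Indeed appends &start=10, &start=20, ... as you page through the results
first_page_url <- "https://ca.indeed.com/jobs?q=Data+Scientist&l=Vancouver%2C+BC"
paste0(first_page_url, "&start=", seq(from = 10, to = 30, by = 10))
# produces the URLs for result pages 2, 3, and 4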
page_result_start <- 10  # starting page
page_result_end <- 190   # last page of results
page_results <- seq(from = page_result_start, to = page_result_end, by = 10)

full_df <- data.frame()

for (i in seq_along(page_results)) {

  first_page_url <- "https://ca.indeed.com/jobs?q=Data+Scientist&l=Vancouver%2C+BC"
  url <- paste0(first_page_url, "&start=", page_results[i])
  page <- xml2::read_html(url)

  # Sys.sleep pauses R for two seconds before it resumes.
  # Putting it here avoids error messages such as
  # "Error in open.connection(con, "rb") : Timeout was reached"
  Sys.sleep(2)

  # get the job title
  job_title <- page %>%
    rvest::html_nodes("div") %>%
    rvest::html_nodes(xpath = '//a[@data-tn-element = "jobTitle"]') %>%
    rvest::html_attr("title")

  # get the company name
  company_name <- page %>%
    rvest::html_nodes("span") %>%
    rvest::html_nodes(xpath = '//*[@class="company"]') %>%
    rvest::html_text() %>%
    stringi::stri_trim_both()

  # get the job location
  job_location <- page %>%
    rvest::html_nodes("span") %>%
    rvest::html_nodes(xpath = '//*[@class="location"]') %>%
    rvest::html_text() %>%
    stringi::stri_trim_both()

  # get the links to the full job postings
  links <- page %>%
    rvest::html_nodes("div") %>%
    rvest::html_nodes(xpath = '//*[@data-tn-element="jobTitle"]') %>%
    rvest::html_attr("href")

  # follow every link and scrape the full job description
  job_description <- c()
  for (j in seq_along(links)) {
    link_url <- paste0("https://ca.indeed.com", links[j])  # links already start with "/"
    link_page <- xml2::read_html(link_url)

    job_description[[j]] <- link_page %>%
      rvest::html_nodes("span") %>%
      rvest::html_nodes(xpath = '//*[@class="jobsearch-JobComponent-description icl-u-xs-mt--md"]') %>%
      rvest::html_text() %>%
      stringi::stri_trim_both()
  }

  df <- data.frame(job_title, company_name, job_location, job_description)
  full_df <- rbind(full_df, df)
}

df_Vancouver <- full_df %>%
  dplyr::distinct() %>%
  dplyr::mutate(city = "Vancouver")

# re-run the loop with the Montreal and Toronto search URLs before building these
df_Montreal <- full_df %>%
  dplyr::distinct() %>%
  dplyr::mutate(city = "Montreal")

df_Toronto <- full_df %>%
  dplyr::distinct() %>%
  dplyr::mutate(city = "Toronto")

df_Canada <- rbind(df_Vancouver, df_Toronto, df_Montreal)

# some cleaning: strip line breaks from the descriptions before writing to disk
df_Canada$job_description <- gsub("[\r\n]", "", df_Canada$job_description)
write.csv(df_Canada, "df_Canada.csv")

# read the data back in for the analysis
df_Canada <- read.csv(here::here("df_Canada.csv"))
The last page when we built the scraper was 190. So we are specifying page_result_end to be 190, and the starting page, page_result_start, to be 10.
Afterward, we initialize full_df and then start the for loop. We scrape the job title, the company name, the company location, and the links. Then we start the second for loop, where we collect all the job summaries from the first page of results. Then we put all our scraped data into a data frame and move on to page 2. We do that until we have reached the last page, and then we are done.
The scraper currently only does one city at a time. However, you can easily expand on that and add another for loop where you specify which cities you want to scrape data for, as sketched below. I manually changed the URLs and scraped data for Vancouver, Toronto, and Montreal. Afterward, I put all postings into one data frame, where it is ready for analysis in our next blog post.
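Here is a rough sketch of what such a city loop could look like. The scrape_indeed() helper is hypothetical (it stands in for the scraping loop above), and the exact location strings Indeed expects are assumptions:

# scrape_indeed() is a hypothetical wrapper around the scraping loop above,
# taking a search URL and returning the scraped data frame
cities <- c(Vancouver = "Vancouver%2C+BC",
            Toronto   = "Toronto%2C+ON",
            Montreal  = "Montreal%2C+QC")

df_Canada <- purrr::map_dfr(names(cities), function(city) {
  url <- paste0("https://ca.indeed.com/jobs?q=Data+Scientist&l=", cities[[city]])
  scrape_indeed(url) %>%
    dplyr::distinct() %>%
    dplyr::mutate(city = city)
})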
I hope you have enjoyed scraping Indeed jobs with R and rvest. If you have any questions or feedback, let me know in the comments below.
Comments (8)
Hello Pascal,
Very nice and helpful code. I noticed you may have a bug in your code:
first_page_url <- "https://ca.indeed.com/jobs?q=Data+Scientist&l=Vancouver%2C+BC"
url <- paste0(first_page_url, "&start=", page_results[i])
page <- xml2::read_html(first_page_url)
By using
page <- xml2::read_html(first_page_url)
it creates multiple objects of the same data as it never leaves the first page. I believe it should be
page <- xml2::read_html(url)
instead so it can access the specified number of pages requested.
Let me know if this looks right or if I am making a mistake.
Thank you,
Brady
Hi Brady,
Thanks for your comment. Indeed, it should be url instead of first_page_url. I fixed it. Very good eye! 🙂
Hello,
first of all, thank you for this great content.
I would like to ask you whether there is an Indeed API?
Thanks
Hello,
Thank you. Yes, there is and you have to create a publisher account to access the job data.
https://opensource.indeedeng.io/api-documentation/docs/job-search/#prerequisite
Hi,
I noticed that the code (in the loop) which scrapes the job_description only adds the last job description to the list.
I changed the line
"job_description <- page %>%"
to
"newelement <- page %>%"
and added the line:
"job_description <- c(job_description, newelement)"
job_description <- c()
for (i in seq_along(links)) {
  url <- paste0("https://pl.indeed.com/", links[i])
  page <- xml2::read_html(url)
  newelement <- page %>%
    rvest::html_nodes("div") %>%
    rvest::html_nodes(xpath = '//*[@class="jobsearch-jobDescriptionText"]') %>%
    rvest::html_text() %>%
    stringi::stri_trim_both()
  job_description <- c(job_description, newelement)
}
Awesome, thanks for your comment and code!
Hi, thank you so much for your code, this is really helpful!
I just wanted to know how you would solve the situation in which not all job listings advertise the company that posted them, hence leaving us with a vector of a different length that does not fit the data frame.
I would check whether a company name can be scraped, and if the length-one vector comes back empty, put an NA in there. Web scraping can be difficult because of the many edge cases one has to handle for some websites.
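For instance, here is a minimal sketch of that idea, assuming each job posting sits in a card that can be selected with something like .jobsearch-SerpJobCard (that selector is an assumption; check the current markup before relying on it):

# return NA instead of an empty vector when a node is missing from a card
scrape_or_na <- function(card, css) {
  out <- card %>% rvest::html_nodes(css) %>% rvest::html_text()
  if (length(out) == 0) NA_character_ else stringi::stri_trim_both(out[1])
}

# scraping per job card guarantees exactly one value per posting, so the
# vectors always line up when building the data frame
cards <- page %>% rvest::html_nodes(".jobsearch-SerpJobCard")  # selector is an assumption
company_name <- purrr::map_chr(cards, scrape_or_na, css = ".company")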