Scraping Netflix Data with Python
November 21, 2022 By Pascal Schmidt python
As we all know, Netflix is an OTT platform with a huge catalog of shows and movies. You can scrape Netflix to collect episode names, cast, ratings, similar shows, plan pricing, etc.
Using this data, you can analyze what users are watching these days, and it can also support sentiment analysis.
I will be using Python for scraping Netflix, and I am assuming you have already installed Python on your computer.
Why Scrape Netflix with Python
With its extensive collection of libraries, Python makes web scraping considerably more flexible than many other languages. The Python community is large, so help is easy to find whenever you get stuck while writing your code.
If you are new to web scraping using Python, consider going through this tutorial.
Let’s Scrape Netflix
To begin with, we will create a folder and install all the libraries we might need during this tutorial.
For now, we will install two libraries.
- Requests will help us to make an HTTP connection with Netflix.com.
- BeautifulSoup will help us to create an HTML tree for smooth data extraction.
>> mkdir netflix
>> pip install requests
>> pip install beautifulsoup4
Inside this folder, you can create a Python file where we will write our code. We will scrape this Netflix page (https://www.netflix.com/in/title/80057281, the URL used in the code). Our data of interest will be:
- Name of the show
- Number of seasons
- What the show is about
- Episode names
- Episode overviews
- Genre
- Show category
- Social media links
- Cast
I know it’s a long list of data, but in the end, you will have ready-made code for scraping any title page from Netflix, not just this one.
Let’s find the location of each of these elements
The title is stored under the h1 tag of class title-title.
The number of seasons is stored under the span tag of the duration class.
The about section is stored under the div tag of the class hook-text.
The episode title is stored under the h3 tag with the class episode-title.
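The episode overview is stored under the p tag with the class episode-synopsis. Note that the code in this tutorial spells this class epsiode-synopsis, so check the page source to see which spelling the live markup uses.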
The genre is stored under the span tag with the class item-genres.
The category of the show is stored under the span tag with the class item-mood-tag.
Social media links can be found under a tags whose data-uia attribute starts with social-link (for example, social-link-facebook).
The cast is stored under the span tag with the class item-cast.
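The locations listed above can be kept in one place and checked against a page before any extraction is written. This is only a minimal sketch, assuming the class names from the list still match the page; the `SELECTORS` dictionary, the `count_matches` helper, and the sample markup are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

# Field name -> (tag, class), taken from the list above.
SELECTORS = {
    "name": ("h1", "title-title"),
    "seasons": ("span", "duration"),
    "about": ("div", "hook-text"),
    "episode-title": ("h3", "episode-title"),
    "genre": ("span", "item-genres"),
    "mood": ("span", "item-mood-tag"),
    "cast": ("span", "item-cast"),
}

# Made-up markup standing in for a real Netflix page.
SAMPLE_HTML = """
<h1 class="title-title">Example Show</h1>
<span class="duration">2 Seasons</span>
<span class="item-genres">Drama,</span>
<span class="item-genres">Thriller</span>
"""


def count_matches(html, selectors=SELECTORS):
    """Return how many elements each (tag, class) pair matches."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        field: len(soup.find_all(tag, class_=cls))
        for field, (tag, cls) in selectors.items()
    }


counts = count_matches(SAMPLE_HTML)
print(counts)
```

A field that reports zero matches suggests that the element is absent on that page or that the markup has changed since the selectors were written.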
Let’s start with making a standard GET request to the target webpage and see what happens.
import requests
from bs4 import BeautifulSoup

target_url = "https://www.netflix.com/in/title/80057281"
resp = requests.get(target_url)
print(resp.status_code)
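The request above has no timeout and only prints the status code without acting on it. As a hedged sketch, the same call can be wrapped in a small helper that returns the HTML only when the request works. The `fetch_html` name, the optional `session` parameter, and the 10-second timeout are assumptions made for this example, not part of the original tutorial.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page HTML, or None if the request fails or is not 200."""
    http = session or requests  # anything with a requests-style .get()
    try:
        resp = http.get(url, timeout=timeout)
    except requests.RequestException:
        # Connection errors, timeouts, invalid URLs and similar failures.
        return None
    if resp.status_code != 200:
        return None
    return resp.text
```

Calling `fetch_html(target_url)` would then give either the page source or `None`, so the parsing code can skip pages that could not be downloaded.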
If you get 200, the request succeeded and the HTML of the target page is available in resp.text. Now, let’s extract information from this data using BeautifulSoup (BS4).
soup = BeautifulSoup(resp.text, 'html.parser')
l = list()  # all results
o = {}  # show-level fields
e = {}  # episode
d = {}  # genre
m = {}  # mood
c = {}  # cast
Let us first extract the data properties one by one, using the exact HTML locations identified above.
o["name"] = soup.find("h1", {"class": "title-title"}).text
o["seasons"] = soup.find("span", {"class": "duration"}).text
o["about"] = soup.find("div", {"class": "hook-text"}).text
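When an element is missing, `soup.find` returns `None`, and calling `.text` on it raises an `AttributeError`. A minimal sketch of a guard for this case follows; the `text_or_none` helper and the sample markup are hypothetical names introduced for illustration.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, tag, class_name):
    """Return the stripped text of the first match, or None if absent."""
    node = soup.find(tag, {"class": class_name})
    return node.get_text(strip=True) if node is not None else None


# Made-up markup: it has a title but no season count.
sample = BeautifulSoup(
    '<h1 class="title-title"> Example Show </h1>', "html.parser"
)
name = text_or_none(sample, "h1", "title-title")
seasons = text_or_none(sample, "span", "duration")
print(name, seasons)
```

With a helper like this, a page that lacks one field still yields the other fields instead of stopping the whole script.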
Now, let’s extract the episode details.
episodes = soup.find("ol", {"class": "episodes-container"}).find_all("li")
for i in range(0, len(episodes)):
    e["episode-title"] = episodes[i].find("h3", {"class": "episode-title"}).text
    e["episode-description"] = episodes[i].find("p", {"class": "epsiode-synopsis"}).text
    l.append(e)
    e = {}
The complete episode list is inside an ol tag. So, we first find the ol tag and then all the li tags inside it. Then we use a for loop to extract the title and the description of each episode.
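The same ol-then-li walk can also be written so that each episode gets a fresh dictionary and missing pieces do not raise an error. This is a sketch under assumptions: `parse_episodes` and the sample markup are hypothetical, and the `synopsis_class` default simply copies the spelling used in the tutorial's code.

```python
from bs4 import BeautifulSoup

# Made-up markup; the second episode has no synopsis on purpose.
SAMPLE = """
<ol class="episodes-container">
  <li><h3 class="episode-title">1. Pilot</h3>
      <p class="epsiode-synopsis">First overview.</p></li>
  <li><h3 class="episode-title">2. Second</h3></li>
</ol>
"""


def parse_episodes(html, synopsis_class="epsiode-synopsis"):
    """Return one dict per <li> inside the episodes container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("ol", {"class": "episodes-container"})
    if container is None:
        return []
    episodes = []
    for item in container.find_all("li"):
        title = item.find("h3", {"class": "episode-title"})
        synopsis = item.find("p", {"class": synopsis_class})
        episodes.append({
            "episode-title":
                title.get_text(strip=True) if title is not None else None,
            "episode-description":
                synopsis.get_text(strip=True) if synopsis is not None else None,
        })
    return episodes


parsed = parse_episodes(SAMPLE)
print(parsed)
```

Building a new dictionary inside the loop removes the need to reset a shared one after every append.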
Let’s extract the genre now.
genres = soup.find_all("span", {"class": "item-genres"})
for x in range(0, len(genres)):
    d["genre"] = genres[x].text.replace(",", "")
    l.append(d)
    d = {}
The genres can be found under the class item-genres. Again, we use a for loop to extract all the genres, removing the separating commas along the way.
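Because genres, mood tags, and cast members are all read the same way, the loop can be folded into one reusable function. The `collect_texts` helper and the sample markup are hypothetical; this is a sketch of the pattern rather than the tutorial's own code.

```python
from bs4 import BeautifulSoup


def collect_texts(soup, tag, class_name):
    """Return the text of every match with commas and outer spaces removed."""
    return [
        node.get_text().replace(",", "").strip()
        for node in soup.find_all(tag, {"class": class_name})
    ]


# Made-up markup with a trailing comma, as on a comma-separated list.
sample = BeautifulSoup(
    '<span class="item-genres">TV Dramas, </span>'
    '<span class="item-genres">Sci-Fi TV</span>',
    "html.parser",
)
genre_names = collect_texts(sample, "span", "item-genres")
print(genre_names)
```

The same call with `item-mood-tag` or `item-cast` as the class name would return the mood tags or the cast as a plain list of strings.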
Let’s extract the rest of the data properties with a similar technique.
mood = soup.find_all("span", {"class": "item-mood-tag"})
for y in range(0, len(mood)):
    m["mood"] = mood[y].text.replace(",", "")
    l.append(m)
    m = {}

o["facebook"] = soup.find("a", {"data-uia": "social-link-facebook"}).get("href")
o["twitter"] = soup.find("a", {"data-uia": "social-link-twitter"}).get("href")
o["instagram"] = soup.find("a", {"data-uia": "social-link-instagram"}).get("href")

cast = soup.find_all("span", {"class": "item-cast"})
for t in range(0, len(cast)):
    c["cast"] = cast[t].text
    l.append(c)
    c = {}

l.append(o)
print(l)
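The three social links differ only in the network name inside the data-uia attribute, and a show may not have all of them. A minimal sketch that loops over the networks and tolerates missing links could look like this; `social_links` and the sample markup are hypothetical names used for illustration.

```python
from bs4 import BeautifulSoup


def social_links(soup, networks=("facebook", "twitter", "instagram")):
    """Map each network to its href, or None when the link is missing."""
    links = {}
    for network in networks:
        anchor = soup.find("a", {"data-uia": f"social-link-{network}"})
        links[network] = anchor.get("href") if anchor is not None else None
    return links


# Made-up markup that only carries a Facebook link.
sample = BeautifulSoup(
    '<a data-uia="social-link-facebook" href="https://example.com/fb">fb</a>',
    "html.parser",
)
found_links = social_links(sample)
print(found_links)
```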
We have managed to scrape all the data we set out to collect from this Netflix page.
Complete Code
With this code, we have managed to scrape the name, number of seasons, what the show is about, cast, genre, mood, social links, etc. With just a few more changes to this code, you can extract more data from Netflix. You can use Scrapingdog’s Web Scraping API to extract data from Netflix at scale without getting blocked.
import requests
from bs4 import BeautifulSoup

l = list()
o = {}
e = {}
d = {}
m = {}
c = {}

target_url = "https://www.netflix.com/in/title/80057281"
resp = requests.get(target_url)
soup = BeautifulSoup(resp.text, 'html.parser')

o["name"] = soup.find("h1", {"class": "title-title"}).text
o["seasons"] = soup.find("span", {"class": "duration"}).text
o["about"] = soup.find("div", {"class": "hook-text"}).text

episodes = soup.find("ol", {"class": "episodes-container"}).find_all("li")
for i in range(0, len(episodes)):
    e["episode-title"] = episodes[i].find("h3", {"class": "episode-title"}).text
    e["episode-description"] = episodes[i].find("p", {"class": "epsiode-synopsis"}).text
    l.append(e)
    e = {}

genres = soup.find_all("span", {"class": "item-genres"})
for x in range(0, len(genres)):
    d["genre"] = genres[x].text.replace(",", "")
    l.append(d)
    d = {}

mood = soup.find_all("span", {"class": "item-mood-tag"})
for y in range(0, len(mood)):
    m["mood"] = mood[y].text.replace(",", "")
    l.append(m)
    m = {}

o["facebook"] = soup.find("a", {"data-uia": "social-link-facebook"}).get("href")
o["twitter"] = soup.find("a", {"data-uia": "social-link-twitter"}).get("href")
o["instagram"] = soup.find("a", {"data-uia": "social-link-instagram"}).get("href")

cast = soup.find_all("span", {"class": "item-cast"})
for t in range(0, len(cast)):
    c["cast"] = cast[t].text
    l.append(c)
    c = {}

l.append(o)
print(l)
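The list printed by the complete code mixes episode, genre, mood, cast, and show-level dictionaries in one flat sequence. As one possible alternative, the same pieces can be grouped under named keys and written out as JSON. The `build_record` function and every value passed to it are made up for this sketch.

```python
import json


def build_record(show, episodes, genres, moods, cast):
    """Group the scraped pieces under named keys instead of one flat list."""
    record = dict(show)  # copy so the caller's dict is left untouched
    record["episodes"] = list(episodes)
    record["genres"] = list(genres)
    record["moods"] = list(moods)
    record["cast"] = list(cast)
    return record


# All values here are placeholders, not real scraped data.
record = build_record(
    {"name": "Example Show", "seasons": "2 Seasons"},
    [{"episode-title": "1. Pilot", "episode-description": "First overview."}],
    ["TV Dramas"],
    ["Suspenseful"],
    ["Actor One", "Actor Two"],
)
print(json.dumps(record, indent=2))
```

A record shaped like this is easier to store per show or to load into a data frame later.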
Conclusion
This was just a quick way to scrape a complete Netflix title page. By changing the title ID in the URL, you can scrape almost any show on Netflix, provided you have the IDs of those shows. In place of BS4, you can also use XPath (for example, through the lxml library) to query the HTML tree for data extraction. You can use a web scraping API to extract data from Netflix at scale without getting blocked.
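Changing the title ID amounts to rebuilding the URL used throughout this tutorial. The sketch below assumes that other titles follow the same URL pattern; `BASE_URL`, `title_urls`, and the second ID are placeholders, with only 80057281 taken from the tutorial itself.

```python
# URL pattern copied from the target_url used in this tutorial.
BASE_URL = "https://www.netflix.com/in/title/{title_id}"


def title_urls(title_ids):
    """Build one title-page URL per show ID, skipping blanks and duplicates."""
    seen = set()
    urls = []
    for title_id in title_ids:
        key = str(title_id).strip()
        if key and key not in seen:
            seen.add(key)
            urls.append(BASE_URL.format(title_id=key))
    return urls


# 12345678 is a placeholder ID, not a known Netflix title.
urls = title_urls([80057281, "80057281", " 12345678 "])
print(urls)
```

Each URL in the resulting list can then be passed through the same request and parsing steps shown earlier.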
I hope you liked this quick tutorial on scraping Netflix, and if you did, please share this blog on your social networks.