Scrape Wikipedia Using Node.js

February 20, 2023 | By Pascal Schmidt | Web Scraping

Wikipedia is a data-rich website that contains a large amount of information. This data can be used to make informed decisions, or to train bots and neural networks.

In this post, we are going to scrape Wikipedia using Node.js, targeting the Coronavirus article (https://en.wikipedia.org/wiki/Coronavirus). You can also read Web Scraping with Node.js if you are a beginner and want to learn how websites can be scraped with Node.js.

Before you start scraping, visit the page and take a moment to analyze its structure.

Setting up the prerequisites

Before we start coding, we have to install the libraries that will be used throughout this article. I am assuming that you have already installed Node.js on your machine.

Before installing the libraries let’s create a folder where we will keep our scraping files.

mkdir wikipedia

Now, using npm install the required libraries.

npm install unirest
npm install cheerio

Unirest: This library will be used to make a GET request to the target website.

Cheerio: This library will be used to parse the HTML response.

Also, create a file (for example, wikipedia.js) where you will write the code.

What are we going to extract?

We are going to extract titles and their explanations. It is always better to decide what you want to scrape before even writing a single line of code.

The titles are marked in red and the explanations in green.

Scraping Wikipedia

Before we start writing the code, let's find the location of the titles and their explanations inside the DOM.

As you can see, all the explanations are inside p tags.

All the section headings are h2 tags.

Let's code this in Node.js step by step.

1. The first step is to import the libraries we installed earlier.
const unirest = require('unirest');
const cheerio = require('cheerio');

This will import unirest and cheerio into our file.

2. Then, we will make a GET request to our target page to get its raw HTML.

async function wikipediaScraper(){

 let data = await unirest.get("https://en.wikipedia.org/wiki/Coronavirus").header("Accept", "text/html")

}

The unirest.get method makes an HTTP request to our target URL, and the .header method sets the Accept header to text/html.
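
If you want to make sure the request actually returned the page before parsing it, you can inspect data.body (the raw HTML string that unirest puts on the response). A minimal check might look like this; the exact output depends on the live page:

 // data.body holds the raw HTML returned by Wikipedia.
 console.log(typeof data.body);        // "string"
 console.log(data.body.slice(0, 80));  // first 80 characters of the HTML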

3. Next, we will load the raw HTML response using Cheerio.

async function wikipediaScraper(){

 let data = await unirest.get("https://en.wikipedia.org/wiki/Coronavirus").header("Accept", "text/html")

 const $ = cheerio.load(data.body); 

}

The cheerio.load method loads the HTML into a Cheerio object, which lets us query the raw HTML with jQuery-like selectors and extract the information we need.
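
As a quick sanity check that the document loaded correctly, you can query any element with the jQuery-style selectors Cheerio provides. For example, the article's main heading should match the page title (the exact text depends on the live page):

 // The first <h1> on the page is the article title.
 console.log($("h1").first().text()); // e.g. "Coronavirus"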

4. Finally, we will extract the titles and their explanation paragraphs.

async function wikipediaScraper(){

 let data = await unirest.get("https://en.wikipedia.org/wiki/Coronavirus").header("Accept", "text/html")

 const $ = cheerio.load(data.body);

 $("h2").each(function(i, elem) {

   console.log($(elem).text());

   console.log($(elem).nextUntil("h2").text().trim());

 });

}

The $("h2") selector selects all the h2 tags in the HTML response, and the .each() method iterates over them one by one.

For each heading, .nextUntil("h2") collects all the sibling elements that follow it, up to (but not including) the next h2 tag. In other words, it grabs the content that sits between two headings.
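
To make the .nextUntil() behaviour concrete, here is a tiny self-contained sketch using made-up HTML (not the Wikipedia page):

 const cheerio = require('cheerio');

 // Two headings, each followed by sibling paragraphs.
 const $ = cheerio.load(`
   <h2>First heading</h2>
   <p>Paragraph A</p>
   <p>Paragraph B</p>
   <h2>Second heading</h2>
   <p>Paragraph C</p>
 `);

 // For the first h2, nextUntil("h2") collects the two paragraphs that
 // sit between it and the next h2, so this logs their combined text.
 console.log($("h2").first().nextUntil("h2").text());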

We used .text() to get the text content of each heading and of the elements collected by .nextUntil().

5. Finally, call the async function to execute the script.

async function wikipediaScraper(){

 let data = await unirest.get("https://en.wikipedia.org/wiki/Coronavirus").header("Accept", "text/html")

 const $ = cheerio.load(data.body);

 $("h2").each(function(i, elem) {

   console.log($(elem).text());

   console.log($(elem).nextUntil("h2").text().trim());

 });

}

wikipediaScraper()
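
Because wikipediaScraper is an async function, you may also want to catch network or parsing errors when calling it, for example:

 wikipediaScraper().catch((err) => {
   console.error("Scraping failed:", err.message);
 });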

The program will print each section title followed by its explanation paragraphs, one after another.

Complete Code
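
Putting all the steps together, the full script looks like this:

 const unirest = require('unirest');
 const cheerio = require('cheerio');

 async function wikipediaScraper(){

  let data = await unirest.get("https://en.wikipedia.org/wiki/Coronavirus").header("Accept", "text/html")

  const $ = cheerio.load(data.body);

  $("h2").each(function(i, elem) {

    console.log($(elem).text());

    console.log($(elem).nextUntil("h2").text().trim());

  });

 }

 wikipediaScraper()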

In this article, you learned how to scrape Wikipedia using Node.js. You were introduced to libraries like Cheerio and Unirest, and finally we wrote code to extract data from Wikipedia.

Now you can make small changes to this code to extract a little more data. For example, you could handle h2 and h3 headings separately, as sketched below.
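
Here is a rough sketch of that idea (untested against the live page); inside wikipediaScraper, you could replace the $("h2") loop with something like this, where elem.name holds the lowercase tag name in Cheerio's underlying DOM:

 $("h2, h3").each(function (i, elem) {
   const level = elem.name;                              // "h2" or "h3"
   const heading = $(elem).text().trim();
   const body = $(elem).nextUntil("h2, h3").text().trim();
   console.log(`[${level}] ${heading}`);
   console.log(body);
 });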

I hope you enjoyed this little tutorial. If you did, please share it with your friends and on social media.
