Web Scraping with JavaScript

A quick look at web scraping with JavaScript.

JavaScript is one of the core technologies used in web development, along with HTML and CSS.

If you’re new to programming languages, it might be worth noting that JavaScript and Java share some superficial similarities, such as the name and parts of the syntax. Still, in a practical sense, they are two completely different languages.

One of the critical use-cases of JavaScript is to add interactivity to websites and to go beyond what is possible with just HTML and CSS. Some of the most widely used front-end frameworks like Angular, React, and Vue.js use JavaScript.

However, in this guide, we will focus on using JavaScript for a different purpose – web scraping.

To get started, let’s first take a look at NodeJS – a runtime environment that enables you to execute JavaScript server-side too!

What is NodeJS?

Initially, JavaScript was mostly used to add interactive elements to web pages, which allowed for a lot of dynamic functionality that we take for granted today.

JavaScript was mostly executed client-side during the early days, and the web browser would act as the runtime environment for the JavaScript code on the page. But if you wanted to use JavaScript for server-side tasks like reading or modifying files, there was no easy solution.

NodeJS changed all of that.

NodeJS was released in 2009 as a server-side JavaScript runtime. It is a modular, open-source runtime environment that allows JavaScript code to run outside a browser.

Instead of using different programming languages for server-side functions and client-side functions, NodeJS allowed developers to use one single language for web development in general.

Since we will be using JavaScript outside a browser, we’ll use NodeJS as the runtime environment for our scraping scripts.

To install NodeJS, visit the official NodeJS website at nodejs.org.

Now that you’ve installed NodeJS, let’s take a quick look at how easy it is to set up a web server with it:

const http = require('http');
const PORT = 3000;

const server = http.createServer((req, res) => {
  res.statusCode = 200;
  res.setHeader('Content-Type', 'text/plain');
  res.end('Hello World');
});

server.listen(PORT, () => {
  console.log(`Server running at http://localhost:${PORT}/`);
});

Save the code above as “create_server.js” and then execute it by typing the following command on your CLI:

node create_server.js

If you navigate to localhost:3000 with your browser, you should see a page with “Hello World” written on it.

HTTP with JavaScript

To extract data from web pages, we’ll be using HTTP clients to handle HTTP requests and responses. If you want to learn more about how HTTP works, we covered it in-depth in our article – Web Scraping From the Ground up With Python.

Although we’ll be using Axios as our HTTP client for most of this guide, we will also cover a few available alternatives.

HTTP with JavaScript: Request

Request is one of the most popular HTTP clients for JavaScript. While the tool is robustly built and still used by many, the library’s author has officially declared it to be deprecated.

However, since it is still one of the most widely used HTTP clients, it’s worth looking at it anyway. Here’s how you can make an HTTP request using Request in JavaScript:

const request = require('request')
request('https://www.google.com', function (error, response, body) {
  console.error('error:', error)
  console.log('body:', body)
})

If you want to run the code snippet provided above, you will first have to install Request by running the following command on your CLI –

npm install request

HTTP with JavaScript: Axios

Request is one of the most popular HTTP clients, but it has some drawbacks. For example, it natively uses callbacks. Request was deprecated partly to make way for HTTP clients that are a better fit for modern JavaScript design.

Axios is a modern JavaScript HTTP client that is promise-based. It natively supports promises instead of callbacks, which makes it easier to use than Request.

While the differences between callbacks and promises are interesting, they are outside this guide’s scope, so we’ll mostly stick to the HTTP clients at a functional level.

To install Axios, simply use the following command on your CLI –

npm install axios

Here is how you can use Axios to send an HTTP request and get the response back:

const axios = require('axios')

axios
	.get('https://www.google.com')
	.then((response) => {
		console.log(response)
	})
	.catch((error) => {
		console.error(error)
	});
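
Axios also works well with async/await, which many developers find easier to read than promise chains. Here’s a minimal sketch of the same request in that style; the fetchPage helper name is just for illustration:

const axios = require('axios')

// The same GET request written with async/await instead of .then()/.catch()
const fetchPage = async (url) => {
	try {
		const response = await axios.get(url)
		console.log(response.data)
	} catch (error) {
		console.error(error)
	}
}

fetchPage('https://www.google.com')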

HTTP with JavaScript: Superagent

Superagent is a modern alternative to Axios and also supports promises natively. However, in terms of popularity and support, Superagent still lags behind Axios. To install Superagent, use the following command on your CLI –

npm install superagent

Here is how you can send an HTTP request with Superagent:

const superagent = require("superagent")
const forumURL = "https://www.google.com"

superagent
	.get(forumURL)
	.then((response) => {
		console.log(response)
	})
	.catch((error) => {
		console.error(error)
	})

Working with Regular Expressions in JavaScript

Now that we have covered how to handle HTTP requests and responses with JavaScript, let’s take a look at how to extract information from the data we obtain.

The most challenging way to extract meaningful information from web pages in JavaScript is to use regular expressions.

A regular expression is a handy tool that you can use to extract data from a string based on particular patterns. It is not the most flexible method, and it is not a good idea to rely on regular expressions in larger projects, as the code can get confusing and complex fast.

However, it is still important to know how regular expressions work in JavaScript, as they are an additional tool that can sometimes come in handy during web scraping. Here’s a quick code example showing how you can use a regular expression in JavaScript:

const htmlString = '<label>Price: $15</label>'
const result = htmlString.match(/<label>(.+)<\/label>/)

console.log(result[1].split(": ")[1])

In the code snippet above, we create a constant named htmlString that contains the HTML code we want to parse. The string is “<label>Price: $15</label>”, and we want to extract “$15” from it.

We call the match() function with the regular expression /<label>(.+)<\/label>/. This means that the constant result will hold the parts of htmlString that match our regular expression.

In this case, the code captures everything between the <label> tags in htmlString and stores it in result.

We then use split() to further parse the captured text and remove “Price:”, so the result is formatted exactly how we want it to be.
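
If all you need is the price, you can also let the regular expression do the whole job by capturing just the dollar amount. Here’s a minimal sketch of that approach, using the same htmlString as above:

const htmlString = '<label>Price: $15</label>'

// Capture only the dollar amount inside the label
const priceMatch = htmlString.match(/<label>Price: (\$\d+)<\/label>/)

console.log(priceMatch[1]) // "$15"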

Cheerio and using jQuery server-side

If you have used JavaScript for client-side scripting, you have probably come across jQuery.

jQuery is one of the most popular JavaScript libraries; it makes it easy to work with the DOM and manipulate the elements of a web page.

If you would like to learn more about the DOM, the W3Schools page on the HTML DOM is a good place to start.

Cheerio is a JavaScript library that allows you to use jQuery server-side. To install Cheerio, use the following command in your CLI:

npm install cheerio

Since we want our code to be efficient and straightforward, using Cheerio makes sense as it is a powerful tool for parsing data from web pages with user-friendly code. Here is an example of how Cheerio can be used to parse and manipulate data.

const cheerio = require('cheerio')
const $ = cheerio.load('<h2 class="title">Scraper is Good!</h2>')

$('h2.title').text('Scraper is Great!')
$('h2').addClass('welcome')

console.log($.html())

The output should be:

<html><head></head><body><h2 class="title welcome">Scraper is Great!</h2></body></html>

As demonstrated in the code snippet, you can easily traverse the DOM with Cheerio’s jQuery-style syntax and also manipulate it to an extent. However, Cheerio has some limitations: it cannot render DOM elements or load external resources.

So while it is a vital tool in your JavaScript toolkit, it might not be the best tool to use if you’re scraping complex websites.
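
Besides reading and changing text, Cheerio’s jQuery-style selectors also let you pull attribute values out of elements, which is a common need when scraping. Here’s a small sketch; the HTML snippet is made up for the example:

const cheerio = require('cheerio')
const $ = cheerio.load('<a class="link" href="https://scraper.dev/">Home</a>')

// attr() reads an attribute value from the first matched element
console.log($('a.link').attr('href')) // "https://scraper.dev/"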

As a final example of how you can use Cheerio for web scraping, let’s scrape all the headings from the DOM page that we linked above. Here’s how we would do it using Cheerio:

const axios = require('axios');
const cheerio = require('cheerio');

const getHeaders = async () => {
	try {
		// Fetch the raw HTML of the page with Axios
		const { data } = await axios.get(
			'https://www.w3schools.com/js/js_htmldom.asp'
		);
		// Load the HTML into Cheerio so we can query it with jQuery-style selectors
		const $ = cheerio.load(data);
		const headers = [];

		// Collect the text of every h2 element on the page
		$('h2').each((_idx, el) => {
			const header = $(el).text()
			headers.push(header)
		});

		return headers;
	} catch (error) {
		throw error;
	}
};

getHeaders()
	.then((headers) => console.log(headers))
	.catch((error) => console.error(error));

Here’s a quick overview of what’s happening in the code snippet above:

  • getHeaders() is an asynchronous function in which the HTTP client (Axios) fetches the data from the linked page, which is then fed into Cheerio.
  • Once the data has been fed into Cheerio, we use a loop to collect all the “h2” elements in the DOM.
  • Once all the “h2” elements are collected, we use console.log() to print the results in the console.

As stated previously, you can use Cheerio to get basic web scraping tasks done. However, if you are working on a project that involves more complexity, such as JavaScript execution or loading of external resources, you might need a more comprehensive and robust library.

JSDOM

JSDOM essentially does the same thing as Cheerio but provides more functionality. While both of them simulate the DOM on Node, JSDOM also allows you to interact with the web page. If you’re trying to scrape a Single Page Application (SPA) website, you might have to interact with the website during the scraping process.

To install JSDOM, type the following command in your CLI:

npm install jsdom axios

Here’s a quick example of how to use JSDOM to manipulate the DOM:

const { JSDOM } = require('jsdom')
const { document } = new JSDOM(
	'<h2 class="title">Scrapy is Good</h2>'
).window
const heading = document.querySelector('.title')
heading.textContent = 'Scrapy is Great!'
heading.classList.add('improved')

console.log(heading.innerHTML)

As demonstrated, you can use JSDOM to modify the DOM quite easily. JSDOM is the closest you can get to DOM manipulation on Node.
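
To illustrate the kind of interaction mentioned earlier, here is a small sketch that lets JSDOM execute a page’s own script with the runScripts: 'dangerously' option and then simulates a button click. The markup is made up for the example, standing in for the kind of script-driven content a Single Page Application would produce:

const { JSDOM } = require('jsdom')

// A made-up page whose content is filled in by its own script
const dom = new JSDOM(
	`<button id="load">Load</button>
	<div id="output"></div>
	<script>
		document.getElementById('load').addEventListener('click', () => {
			document.getElementById('output').textContent = 'Loaded!'
		})
	</script>`,
	{ runScripts: 'dangerously' }
)

const { document } = dom.window

// Simulate the click and read the content the page's script generated
document.getElementById('load').click()
console.log(document.getElementById('output').textContent) // "Loaded!"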

However, JSDOM is also not without flaws and might have some performance issues compared to headless browsers like Puppeteer and Nightmare.

Puppeteer and Nightmare: Headless browsers for JavaScript

If you are working on a complex scraping project, JSDOM might not be enough for you. In this case, headless browsers like Puppeteer and Nightmare can help you out.

A headless browser allows you to simulate a real web browser when interacting with a website. This opens up many more possibilities, such as taking screenshots, saving websites as PDFs, and simulating user interactions like form submissions, navigation, keyboard inputs, and more!

In this section of the guide, we’ll take a look at two headless browsers in the JavaScript ecosystem – Puppeteer and Nightmare.

Puppeteer is a headless browser that offers a robust API that lets you control a headless version of Chrome. To install Puppeteer, use the following command on your CLI –

npm install puppeteer

The following code example shows how you can use Puppeteer to take a screenshot of a web page and also save it as a PDF file.

const puppeteer = require('puppeteer')

async function getScreenshotandPDF() {
	try {
		const URL = 'https://scraper.dev/'
		// Launch a headless Chromium instance and open a new tab
		const browser = await puppeteer.launch()
		const page = await browser.newPage()

		// Navigate to the page, then capture it as an image and as a PDF
		await page.goto(URL)
		await page.screenshot({ path: 'screenshot.png' })
		await page.pdf({ path: 'page.pdf' })

		await browser.close()
	} catch (error) {
		console.error(error)
	}
}

getScreenshotandPDF()

As you can see, Puppeteer allows you to quickly script complex browser behavior with accessible functions such as goto(), screenshot(), and pdf().
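
Puppeteer can also drive the user interactions mentioned earlier, such as typing and navigation. The following sketch is only an illustration: the URL and the #search and .result selectors are placeholders and would need to be replaced with real selectors from the page you are actually scraping.

const puppeteer = require('puppeteer')

async function searchAndScrape() {
	try {
		const browser = await puppeteer.launch()
		const page = await browser.newPage()

		// The URL and selectors below are placeholders for this example
		await page.goto('https://example.com/search')

		// Type a query into the search box and submit it with the Enter key,
		// waiting for the resulting navigation to finish
		await page.type('#search', 'web scraping')
		await Promise.all([
			page.waitForNavigation(),
			page.keyboard.press('Enter'),
		])

		// Read the text of the first result (again, a placeholder selector)
		const firstResult = await page.$eval('.result', (el) => el.innerText)
		console.log(firstResult)

		await browser.close()
	} catch (error) {
		console.error(error)
	}
}

searchAndScrape()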

Nightmare is an alternative to Puppeteer and is generally considered to be slightly faster. You might have noticed that when you install Puppeteer, you also have to download a Chromium bundle; Nightmare is a less resource-intensive alternative and downloads faster than Puppeteer.

To install Nightmare, use the following command on your CLI –

npm install nightmare

Here’s a quick example of how you can use Nightmare to simplify the scraping process. In this example, we’ll use Nightmare to get the first post on the “new” page of Hacker News.

const Nightmare = require('nightmare')
const nightmare = Nightmare()

nightmare
	// Open the Hacker News front page
	.goto('https://news.ycombinator.com/')
	// Click the "new" link in the top navigation to sort posts by time
	.click("#hnmain > tbody > tr:nth-child(1) > td > table > tbody > tr > td:nth-child(2) > span > a:nth-child(2)")
	.wait('body')
	// Read the title of the first post; this selector is tied to a specific
	// post ID, so it will need updating as the page changes
	.evaluate(
		() =>
			document.querySelector(
				'#\\32 4862645 > td:nth-child(3) > a'
			).innerText
	)
	.end()
	.then((link) => {
		console.log(link)
	})
	.catch((error) => {
		console.error('Query failed:', error)
	})

In the code snippet above, a Nightmare instance is first created, and then it visits the Hacker News home page. We then click on the navigation button that sorts the posts by time. Then we wait for the body of the new page to load.

Once the new page has loaded, we get the value of the first link posted using the selector and then extract the inner text from it. After that, the inner text is printed using console.log(). In case something goes wrong, we handle it by showing the message “Query failed” and the error that made it fail.

While this might be just a simple script, headless browsers like Puppeteer and Nightmare allow you to create complex scraping scripts while keeping the code relatively simple and easy to understand.

In conclusion

This guide serves as an introduction to web scraping with JavaScript, and we hope it helped you understand how you can use JavaScript for your web scraping scripts and projects. Here’s a quick summary of all the topics we covered:

  • NodeJS – A runtime environment for JavaScript that allows you to use the programming language server-side too.
  • Request, Axios, and Superagent – HTTP clients that help you deal with HTTP requests and responses in JavaScript.
  • Cheerio – A lightweight library that brings the power of jQuery to server-side scripting and allows you to parse HTML quickly.
  • JSDOM – A native JavaScript implementation of the DOM that allows you to manipulate it with NodeJS.
  • Puppeteer and Nightmare – Headless browsers that let your scraping scripts drive a real browser, allowing for more convenience and functionality.

As you can probably guess by now, web scraping with JavaScript can be a bit tough to deal with initially. In this guide, we have stuck to the basics and have not gone into topics like using proxy servers, which help ensure you do not get blocked while scraping.

You can also use our API if you’re looking for a powerful, user-friendly, and reliable web scraping API for your next project.
