May 24, 2019 Thanks to Node.js, JavaScript is a great language to use for a web scraper: not only is Node fast, but you’ll likely end up using a lot of the same methods you’re used to from querying the DOM. Jul 24, 2020 In this article, we’re going to illustrate how to perform web scraping with JavaScript and Node.js by rendering a static page and scraping desired content. Next, we’ll cover how to use a headless browser, Puppeteer, to retrieve data from a dynamic website that loads content via JavaScript.
2013-02-10T21:32:50Z
Jun 30, 2020 JS is a well-known language with wide adoption and strong community support. It can be used for both client-side and server-side scripting, which makes it well suited to writing scrapers and crawlers. Most of these libraries' advantages can be obtained by using our API, and some of these libraries can be used alongside it. Pjscrape is a framework for anyone who's ever wanted a command-line tool for web scraping using Javascript and jQuery. Built to run with PhantomJS, it allows you to scrape pages in a fully rendered, Javascript-enabled context from the command line, no browser required.
Web scraping is a technique used to extract data from websites using a computer program that acts as a web browser. The program requests pages from web servers in the same way a web browser does, and it may even simulate a user logging in to obtain access. It downloads the pages containing the desired data and extracts the data out of the HTML code. Once the data is extracted it can be reformatted and presented in a more useful way.
In this article I'm going to show you how to write web scraping scripts in Javascript using Node.js.
Why use web scraping?
Here are a few examples where web scraping can be useful:
- You have several bank accounts with different institutions and you want to generate a combined report that includes all your accounts.
- You want to see data presented by a website in a different format. For example, the website shows a table, but you want to see a chart.
- A web site presents related data across multiple pages. You want to see all or part of this data combined in a single report.
- You are an app developer with apps in iTunes and several Android app stores, and you want to have a report of monthly sales across all app stores.
Web scraping can also be used in ways that are dishonest and sometimes even illegal. Harvesting email addresses for spam purposes and sniping eBay auctions are examples of such uses. As a matter of principle I only use web scraping to collect and organize information that is either available to everyone (stock prices, movie showtimes, etc.) or only available to me (personal bank accounts, etc.). I avoid using this technique for profit; I just do it to simplify the task of obtaining information.
In this article I'm going to show you a practical example that implements this technique. Ready? Then let's get started!
Tools for web scraping
In its most basic form, a web scraping script just needs to have a way to download web pages and then search for data in them. All modern languages provide functions to download web pages, or at least someone wrote a library or extension that can do it, so this is not a problem. Locating and isolating data in HTML pages, however, is difficult. An HTML page has content, layout and style elements all intermixed, so a non trivial effort is required to parse and identify the interesting parts of the page.
For example, consider the following HTML page:
Let's say we want to extract the names of the people that appear in the table with id='data' that is in the page. How do we get to those?
Typically the web page will be downloaded into a string, so it would be simple to just search this string for all the occurrences of <td> and extract what comes after that and until the following </td>.
But this could easily make us find incorrect data. The page could have other tables, either before or after the one we want, that use the same CSS classes for some of their cells. Or worse, maybe this simple search algorithm works fine for a while, but one day the layout of the page changes so that the old <td> becomes <td align='left'>, making our search find nothing.
While there is always a risk that a change to the target web page can break a scraping script, it is a good idea to be smart about how items are located in the HTML so that the script does not need to be revised every time the web site changes.
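To make the fragility concrete, here is a minimal sketch of the naive string search described above. The helper name naiveCells and both sample HTML strings are made up for illustration and are not taken from the example page.

```javascript
// Naive extraction: find every literal "<td>" and take the text up to the
// next "</td>". Shown only to illustrate why this approach is fragile.
function naiveCells(html) {
  var cells = [];
  var start = html.indexOf('<td>');
  while (start !== -1) {
    var end = html.indexOf('</td>', start);
    if (end === -1) break;
    cells.push(html.substring(start + 4, end));
    start = html.indexOf('<td>', end);
  }
  return cells;
}

// Hypothetical sample markup, before and after a small layout change.
var before = '<table id="data"><tr><td>John</td></tr><tr><td>Susan</td></tr></table>';
var after = "<table id=\"data\"><tr><td align='left'>John</td></tr></table>";

console.log(naiveCells(before)); // both cells are found
console.log(naiveCells(after));  // nothing is found: "<td>" no longer appears literally
```

The second call returns an empty list even though the data is still in the page, which is the kind of silent breakage a selector-based search is meant to avoid.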
If you have ever written client-side Javascript for the browser using a library like jQuery then you know how the tricky task of locating DOM elements becomes much easier using CSS selectors.
For example, in the browser we could easily extract the names from the above web page as follows:
The CSS selector is what goes inside jQuery's $ function, #data .name in this example. This is saying that we want to locate all the elements that are descendants of an element with the id data and have a CSS class name. Note that we are not saying anything about the data being in a table in this case. CSS selectors have great flexibility in how you specify search terms for elements, and you can be as specific or vague as you want.
The each function will just call the function given as an argument for all the elements that match the selector, with the this context set to the matching element. If we were to run this in the browser we would see an alert box with the name John, and then another one with the name Susan.
Wouldn't it be nice if we could do something similar outside of the context of a web browser? Well, this is exactly what we are about to do.
Introducing Node.js
Javascript was born as a language to be embedded in web browsers, but thanks to the open source Node.js project we can now write stand-alone scripts in Javascript that can run on a desktop computer or even on a web server.
Manipulating the DOM inside a web browser is something that Javascript and libraries like jQuery do really well so to me it makes a lot of sense to write web scraping scripts in Node.js, since we can use many techniques that we know from DOM manipulation in the client-side code for the web browser.
If you would like to try the examples I will present in the rest of this article then this is the time to download and install Node.js. Installers for Windows, Linux and OS X are available at http://nodejs.org.
Node.js has a large library of packages that simplify different tasks. For web scraping we will use two packages called request and cheerio. The request package is used to download web pages, while cheerio generates a DOM tree and provides a subset of the jQuery function set to manipulate it. To install Node.js packages we use a package manager called npm that is installed with Node.js. This is equivalent to Ruby's gem or Python's easy_install and pip; it simplifies the download and installation of packages.
So let's start by creating a new directory where we will put our web scraping scripts and install these two modules in it:
Node.js modules will be installed in the scraping/node_modules subdirectory and will only be accessible to scripts that are in the scraping directory. It is also possible to install Node.js packages globally, but I prefer to keep things organized by installing modules locally.
Now that we have all the tools installed let's see how we can implement the above scraping example using cheerio. Let's call this script example.js:
The first line imports the cheerio package into the script. The require statement is similar to #include in C/C++, require in Ruby or import in Python.
In the second line we instantiate a DOM for our example HTML, by sending the HTML string to cheerio.load(). The return value is the constructed DOM, which we store in a variable called $ to match how the DOM is accessed in the browser when using jQuery.
Once we have a DOM created we just go about business as if we were using jQuery on the client side. So we use the proper selector and the each iterator to find all the occurrences of the data we want to extract. In the callback function we use the console.log function to write the extracted data. In Node.js console.log writes to the console, so it is handy to dump data to the screen.
Here is how to run the script and what output it produces:
Easy, right? In the following section we'll write a more complex scraping script.
Real world scraping
Let's use web scraping to solve a real problem.
The Tualatin Hills Park and Recreation District (THPRD) is a Beaverton, Oregon organization that offers area residents a number of recreational options, among them swimming. There are eight swimming pools, all in the area, each offering swimming instruction, lap swimming, open swim and a variety of other programs. The problem is that THPRD does not publish a combined schedule for the pools; it only publishes individual schedules for each pool. But the pools are all located close to each other, so many times the choice of pool is less important than what programs are offered at a given time. If I wanted to find the time slots a given program is offered at any of the pools I would need to access eight different web pages and search eight schedules.
For this example we will say that we want to obtain the list of times during the current week when there is an open swim program offered in any of the pools in the district. This requires obtaining the schedule pages for all the pools, locating the open swim entries and listing them.
Before we start, click here to open one of the pool schedules in another browser tab. Feel free to inspect the HTML for the page to familiarize yourself with the structure of the schedule.
The schedule pages for the eight pools have a URL with the following structure:
The id is what selects which pool to show a schedule for. I took the effort to open all the schedules manually to take note of the names of each pool and its corresponding id, since we will need those in the script. We will also use an array with the names of the days of the week. We can scrape these names from the web pages, but since this is information that will never change we can simplify the script by incorporating the data as constants.
Web scraping skeleton
With the above information we can sketch out the structure of our scraping script. Let's call the script thprd.js:
We begin the script importing the two packages that we are going to use and defining the constants for the eight pools and the days of the week.
Then we download the schedule web pages of each of the pools in a loop. For this we construct the URL of each pool schedule and send it to the request function. This is an asynchronous function that takes a callback as its second argument. If you are not very familiar with Javascript this may seem odd, but in this language asynchronous functions are very common. The request() function returns immediately, so it is likely that the eight request() calls will be issued almost simultaneously and will be processed concurrently in the background.
When a request completes, its callback function will be invoked with three arguments: an error code, a response object and the body of the response. Inside the callback we make sure there is no error and then we just send the body of the response into cheerio to create a DOM from it. When we reach this point we are ready to start scraping.
We will look at how to scrape this content later. For now we just print the name of the pool as a placeholder. If you run this first version of our script you'll get a surprise:
What? Why do we get the same pool name eight times? Shouldn't we see all the pool names here?
Javascript scoping
Remember I said above that the request() function is asynchronous? The for loop will do its eight iterations, spawning a background job in each. The loop then ends, leaving the loop variable set to the pool name that was used in the last iteration. When the callback functions are invoked a few seconds later they will all see this value and print it.
I made this mistake on purpose to demonstrate one of the main sources of confusion among developers that are used to traditional languages and are new to Javascript's asynchronous model.
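The effect can be reproduced without any network access. The sketch below is not the thprd.js script: the pool names, ids and URL are placeholders, and fakeRequest is a hypothetical stand-in for request() that only stores each callback so it can be run after the loop has finished.

```javascript
// Hypothetical pool table; names and ids are placeholders.
var pools = { 'Pool A': 1, 'Pool B': 2, 'Pool C': 3 };

// Stand-in for an asynchronous call such as request(): it stores the
// callback instead of running it, so callbacks run after the loop ends.
var pending = [];
function fakeRequest(url, callback) {
  pending.push(callback);
}

var seen = [];
for (var pool in pools) {
  fakeRequest('http://example.com/schedule?id=' + pools[pool], function () {
    // By the time this runs the loop is over and `pool` holds its last value.
    seen.push(pool);
  });
}

// Simulate the responses arriving later.
pending.forEach(function (callback) { callback(); });
console.log(seen); // [ 'Pool C', 'Pool C', 'Pool C' ]
```

All three callbacks share the single pool variable declared by var, so they all report the name from the last iteration.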
How can we get the correct pool name to be sent to each callback function then?
The solution is to bind the name of the pool to the callback function at the time the callback is created and sent to the request() function, because that is when the pool variable has the correct value.
As we've seen before the callback function will execute some time in the future, after the loop in the main script completed. But the callback function can still access the loop variable even though the callback runs outside of the context of the main script. This is because the scope of Javascript functions is defined at the time the function is created. When we created the callback function the loop variable was in scope, so the variable is accessible to the callback. The url variable is also in the scope, so the callback can also make use of it if necessary, though the same problem will exist with it: its last value will be seen by all callbacks.
So what I'm basically saying is that the scope of a function is determined at the time the function is created, but the values of the variables in the scope are only retrieved at the time the function is called.
We can take advantage of these seemingly odd scoping rules of Javascript to insert any variable into the scope of a callback function. Let's do this with a simple function:
Can you guess what the output of this script will be? The output will be 2, because that's the value of variable a at the time the function stored in variable f is invoked.
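The behavior just described can be sketched in a few lines. This is a reconstruction from the description (a variable a, a function stored in f, and an output of 2), not necessarily the exact listing the text refers to.

```javascript
var a = 1;
var f = function () {
  // `a` is looked up when f runs, not when f is defined.
  return a;
};
a = 2;
console.log(f()); // 2
```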
To freeze the value of a at the time f is created we need to insert the current value of a into the scope:
Let's analyze this alternative way to create f one step at a time:
We clearly see that the expression enclosed in parentheses supposedly returns a function, and we invoke that function and pass the current value of a as an argument. This is not a callback function that will execute later, this is executing right away, so the current value of a that is passed into the function is 1.
Here we see a bit more of what's inside the parentheses. The expression is, in fact, a function that expects one argument. We called that argument a, but we could have used a different name.
In Javascript a construct like the above is called a self-executing function. You could consider this the reverse of a callback function. While a callback function is a function that is created now but runs later, a self-executing function is created and immediately executed. Whatever this function returns will be the result of the whole expression, and will get assigned to f in our example.
Why would you want to use a self-executing function when you can make any code execute directly without enclosing it inside a function? The difference is subtle. By putting code inside a function we are creating a new scope level, and that gives us the chance to insert variables into that scope simply by passing them as arguments to the self-executing function.
We know f should be a function, since later in the script we want to invoke it. So the return value of our self-executing function must be the function that will get assigned to f:
Does it make a bit more sense now? The function that is assigned to f now has a parent function that received a as an argument. That a is a level closer than the original a in the scope of f, so that is the a that the scope of f sees. When you run the modified script you will get a 1 as output.
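Putting the steps together, a complete version of the modified script might look like the sketch below. It is assembled from the description above and returns the value instead of printing inside f, so treat it as an illustration of the technique and not as the original listing.

```javascript
var a = 1;
var f = (function (a) {
  // This inner `a` is the parameter of the self-executing function. It was
  // bound to 1 at the moment of the call and is not affected by later
  // changes to the outer `a`.
  return function () {
    return a;
  };
})(a);
a = 2;
console.log(f()); // 1
```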
Here is how the self-executing trick can be applied to our web scraping script:
This is pretty much identical to the simpler example above using the pool variable instead of a. Running this script again gives us the expected result:
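The same fix can be tried without any network access. As before, this is only a sketch: the pool names, ids and URL are placeholders, and fakeRequest is a hypothetical stand-in for request() that defers each callback until after the loop.

```javascript
// Hypothetical pool table; names and ids are placeholders.
var pools = { 'Pool A': 1, 'Pool B': 2, 'Pool C': 3 };

// Stand-in for request(): stores the callback so it runs after the loop.
var pending = [];
function fakeRequest(url, callback) {
  pending.push(callback);
}

var seen = [];
for (var pool in pools) {
  fakeRequest('http://example.com/schedule?id=' + pools[pool], (function (pool) {
    // The parameter `pool` shadows the loop variable and keeps the value
    // it had when this self-executing function was called.
    return function () {
      seen.push(pool);
    };
  })(pool));
}

pending.forEach(function (callback) { callback(); });
console.log(seen); // [ 'Pool A', 'Pool B', 'Pool C' ]
```

In ES2015 and later, declaring the loop variable with let instead of var gives each iteration its own binding and has the same effect without the wrapper function.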
Scraping the swimming pool schedules
To be able to scrape the contents of the schedule tables we need to discover how these schedules are structured. In rough terms the schedule table is located inside a page that looks like this:
Inside each of these <td> elements that hold the daily schedules there is a <div> wrapper around each scheduled event. Here is a simplified structure for a day:
Each <td> element contains a link at the top that we are not interested in, then a sequence of <div> elements, each containing the information for an event.
One way we can get to these event <div> elements is with the following selector:
The problem with the above selector, though, is that we will get all the events of all the days in sequence, so we will not know what events happen on which day.
Instead, we can separate the search in two parts. First we locate the <td> element that defines a day, then we search for <div> elements within it:
The function that we pass to the each() iterator receives the index number of the found element as a first argument. This is handy because for our outer search this is telling us which day we are in. We do not need an index number in the inner search, so there we do not need to use an argument in our function.
Running the script now shows the pool name, then the day of the week and then the text inside the event <div>, which has the information that we want. The text() function applied to any element of the DOM returns the constant text filtering out any HTML elements, so this gets rid of the <strong> and <br> elements that exist there and just returns the filtered text.
We are now very close. The only remaining problem is that the text we extracted from the <div> element has a lot of whitespace in it. There is whitespace at the start and end of the text and also in between the event time and event description. We can eliminate the leading and trailing whitespace with trim():
This leaves us with a few lines of whitespace in between the event time and the description. To remove that we can use replace():
Note that the regular expression we use to remove the spaces requires at least two whitespace characters. This is because the event description can contain spaces as well. If we search for two or more whitespace characters we will find only the large whitespace block in the middle and not affect the description.
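The two cleanup steps can be tried in isolation. In this sketch the raw string is invented to resemble the text of one event, and using a comma as the replacement is an assumption made to produce a CSV-style line.

```javascript
// Hypothetical raw text, shaped like what text() might return for one event.
var raw = '\n      9:00a-10:30a\n\n          Open Swim - Deep Water\n    ';

// Step 1: drop the leading and trailing whitespace.
var trimmed = raw.trim();

// Step 2: collapse every run of two or more whitespace characters into a
// single comma. Single spaces inside the description are left alone.
var line = trimmed.replace(/\s\s+/g, ',');

console.log(line); // 9:00a-10:30a,Open Swim - Deep Water
```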
When we run the script now this is what we get:
And this is just a CSV version of all the pool schedules combined!
We said that for this exercise we were only interested in obtaining the open swim events, so we need to add one more filtering layer to just print the targeted events:
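One possible shape for that extra filter is sketched below. The rows are invented sample data in a pool,day,time,description layout, and matching on the phrase "open swim" regardless of case is an assumption about how those events are labeled.

```javascript
// Hypothetical combined rows in "pool,day,time,description" form.
var rows = [
  'Pool A,Monday,9:00a-10:30a,Open Swim',
  'Pool A,Monday,11:00a-12:00p,Lap Swim',
  'Pool B,Tuesday,1:00p-2:30p,Open Swim - Family'
];

// Keep only the rows whose text mentions open swim, ignoring case.
var openSwim = rows.filter(function (row) {
  return /open swim/i.test(row);
});

openSwim.forEach(function (row) {
  console.log(row);
});
```

In the scraping script the same test would be applied to each event's text before printing it.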
And now we have completed our task. Here is the final version of our web scraping script:
Running the script gives us this output:
From this point on it is easy to continue to massage this data to get it into a format that is useful. My next step would be to sort the list by day and time instead of by pool, but I'll leave that as an exercise to interested readers.
Final words
I hope this introduction to web scraping was useful to you and the example script serves you as a starting point for your own projects.
If you have any questions feel free to leave them below in the comments section.
Thanks!
Miguel
Javascript is a widely-used programming language and an ever-increasing number of websites use JavaScript to fetch and render user content. While there are various tools available for web scraping, a growing number of people are exploring Javascript web scraping tools.
To carry out your web scraping projects, you need to familiarize yourself with web scraping tools to choose the right one. We will walk through open source Javascript tools and frameworks that are great for web crawling, web scraping, parsing, and extracting data.
Open Source Javascript Web Scraping Tools and Frameworks
Features/Tools | Github Stars | Github Forks | Github Open Issues | Last Updated | Documentation | License |
---|---|---|---|---|---|---|
Apify SDK | 22K | 1.4K | 216 | June 2020 | Excellent | MIT |
NodeCrawler | 5.4K | 828 | 23 | Nov 2015 | Good | MIT |
Puppeteer | 62K | 6.4K | 1,039 | June 2020 | Excellent | Apache License 2.0 |
Playwright | 13.3K | 402 | 115 | May 2020 | Good | Apache License 2.0 |
Node SimpleCrawler | 2K | 344 | 51 | April 2020 | Good | BSD 2-Clause |
PJScrape | 1K | 175 | 28 | Oct 2011 | Poor | MIT |
Cheerio | 22K | 1.4K | 216 | April 2020 | Good | MIT |
Note: All details in the table above are current at the time of writing this article.
Apify SDK
Apify SDK is a Node.js library that is a lot like Scrapy, positioning itself as a universal web scraping library in JavaScript, with support for Puppeteer, Cheerio, and more. With RequestQueue you can start with several URLs and then recursively follow links to other pages, and with AutoscaledPool you can run the scraping tasks at the maximum capacity of the system.
Requirements – The Apify SDK requires Node.js 10.17 or later
Available Selectors – CSS
Available Data Formats – JSON, JSONL, CSV, XML, Excel or HTML
Pros
- Supports any type of website
- Best library for web crawling in Javascript we have tried so far.
- Built-in support for Puppeteer and Cheerio
Installation
Add Apify SDK to any Node.js project by running:
Best Use Case
Apify SDK is a preferred tool when other solutions fall flat during heavier tasks: performing deep crawls, rotating proxies to mask the browser, scheduling the scraper to run multiple times, caching results to prevent data loss if the code happens to crash, and more. Apify handles such operations with ease but it can also help to develop web scrapers of your own in Javascript.
Node SimpleCrawler
Simplecrawler is designed to provide a basic, flexible, and robust API for crawling websites. It was written to archive, analyze, and search some very large websites and can get through hundreds of thousands of pages and write large volumes of data without issue. It has a lot of useful events that can help you track the progress of your crawling process. This crawler is extremely configurable and there’s a long list of settings you can change to adapt it to your specific needs.
Requirements – Node.js 8.0+
Pros
- Respects robots.txt rules
- Highly configurable
- Easy setup and installation
Cons
- Does not download the response body when it encounters an HTTP error status in the response
- No promise support
- May get invalid URLs because of its brute force approach
Installation
To install simplecrawler type the command:
Best Use Case
If you need to start off with a flexible and configurable base for writing your own crawler
NodeCrawler
Nodecrawler is a popular and fast web crawler for Node.js. If you prefer coding in JavaScript, or you are working on a mostly Javascript project, Nodecrawler is a suitable web crawler to use. Its installation is pretty simple too. It uses Cheerio or JSDOM for server-side HTML parsing, with JSDOM being the more robust of the two.
Requires Version – Node v4.0.0 or greater
Available Selectors – CSS, XPath
Available Data Formats – CSV, JSON, XML
Pros
- Easy installation
Cons
- It has no Promise support
Installation
To install this package with npm:
Best Use Case
If you need a lightweight web crawler that combines efficiency and convenience.
PJScrape
PJscrape is a web scraping framework that uses Javascript and jQuery. It is built to run with PhantomJS, so it allows you to scrape pages in a fully rendered, Javascript-enabled context from the command line, with no browser required. The scraper functions are evaluated in a full browser context. This means you not only have access to the DOM, but you also have access to Javascript variables and functions, AJAX-loaded content, etc.
Requires Version – Node v4.0.0+, PhantomJS v.1.3+
Available Selectors – CSS
Available Data Format – JSON
Pros
- Easy installation and setup for more than one scraper
- Suitable for recursive crawling
Cons
- Poor documentation
Installation
To install this package with npm:
Best Use Case
If you need a web scraping tool in Javascript and JQuery
Puppeteer
Puppeteer is a Node library which provides a powerful but simple API that allows you to control Google’s headless Chrome browser. A headless browser means you have a browser that can send and receive requests but has no GUI. It works in the background, performing actions as instructed by an API. You can truly simulate the user experience, typing where they type and clicking where they click.
A headless browser is a great tool for automated testing and server environments where you don’t need a visible UI shell. For example, you may want to run some tests against a real web page, create a PDF of it, or just inspect how the browser renders a URL. Puppeteer can also be used to take screenshots of web pages as they would appear when opened in a web browser.
Puppeteer’s API is very similar to Selenium WebDriver, but works only with Google Chrome. Puppeteer has a more active support than Selenium, so if you are working with Chrome, Puppeteer is your best option for web scraping.
Requires Version – Node v6.4.0 or greater (Node v7.6.0 or greater for async/await)
Available Selectors – CSS
Available Data Formats – JSON
Pros
- With its full-featured API, it covers a majority of use cases
- The best option for scraping Javascript websites on Chrome
Cons
- Only available for Chrome/Chromium browser
- Supports only JSON format
Installation
To install Puppeteer in your project run:
This will install Puppeteer and download a recent version of the Chromium browser to run the Puppeteer code. By default, Puppeteer works with the Chromium browser but you can also use Chrome. You can also use the lightweight version of Puppeteer, puppeteer-core. To install it, type the command:
Best Use Case
- If you need to test the speed, performance, responsiveness, and UI of a website.
- If you are using Chrome, Puppeteer is your best option for web scraping.
- If the information you want is generated using
Playwright
Playwright is a Node library to automate multiple browsers with a single API. It enables cross-browser web automation that is ever-green, capable, reliable, and fast. Playwright was created to improve automated UI testing by eliminating flakiness, improving the speed of execution, and offering insights into the browser operation.
Playwright is very similar to Puppeteer in many respects. The API methods are identical in most cases, and Playwright also bundles compatible browsers by default. Playwright’s biggest differentiating point is cross-browser support. It can drive Chromium, WebKit, MS Edge, and Firefox.
A noteworthy difference is that Playwright has a more powerful browser context feature than Puppeteer. This lets you simulate multiple devices with a single browser instance.
Requires Version – Node.js 10.15 or above.
Available Selectors – CSS
Available Data Formats – JSON
Pros
- Cross Browser support
- Detailed documentation
Con
- They have only patched the WebKit and Firefox debugging protocols, not the actual rendering engine
Installation
To install the package:
This installs Playwright and browser binaries for Chromium, Firefox, and WebKit. Once installed, you can use Playwright in a Node.js script and automate web browser interactions.
Best Use Case
If you need an efficient tool as good as Puppeteer to perform UI testing but across multiple browsers, you should use Playwright.
Cheerio
Cheerio is a library that parses raw HTML and XML documents and allows you to use the syntax of jQuery while working with the downloaded data. With Cheerio, you can write filter functions to fine-tune which data you want from your selectors. If you are writing a web scraper in JavaScript, Cheerio API is a fast option that makes parsing, manipulating, and rendering efficient.
Cheerio does not interpret the result as a web browser does, produce a visual rendering, apply CSS, load external resources, or execute JavaScript. If you require any of these features, you should consider projects like PhantomJS or JSDom.
Requirements – Up to date versions of Node.js and npm
Available Selectors – CSS
Pros
- Parsing, rendering and manipulating documents is very efficient
- Flexible, Easy to Use
- Very fast (preliminary end-to-end benchmarks suggest it's 8x faster than JSDOM)
Cons
- Does not fare well for dynamic Javascript websites
Installation
To install the required modules using NPM, simply type the following command:
Best Use Case
If you need speed, go for Cheerio.
These are just some of the open-source javascript web scraping tools and frameworks you can use for your web scraping projects. If you have greater scraping requirements or would like to scrape on a much larger scale it’s better to use web scraping services.
If you aren’t proficient with programming or your needs are complex, or you need large volumes of data to be scraped, there are great web scraping services that will suit your requirements to make the job easier for you.
You can save time and get clean, structured data by trying us out instead – we are a full-service provider that doesn’t require the use of any tools and all you get is clean data without any hassles.