crawlerr is a simple yet powerful web crawler for Node.js, based on Promises. It lets you crawl only the URLs that match wildcard patterns, and it uses a Bloom filter for caching. Responses expose a server-side DOM, which gives crawling a browser-like feel.

Simple: our crawler is simple to use;
Elegant: provides a verbose, Express-like API;
MIT Licensed: free for personal and commercial use;
Server-side DOM: we use JSDOM to make you feel like you are in your browser;
Configurable: pool size, retries, rate limit and more.
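
To illustrate the Bloom filter idea mentioned above, here is a minimal sketch of how such a filter can remember visited URLs in a fixed amount of memory. It is not crawlerr's actual implementation: the `BloomFilter` class, its sizes, and the seeded FNV-1a hash are all assumptions chosen for illustration.

```javascript
// Illustrative Bloom filter, NOT crawlerr's internal code.
class BloomFilter {
  constructor(size = 1024, hashes = 3) {
    this.size = size;
    this.hashes = hashes;
    // One byte per slot for simplicity; a real filter would pack bits.
    this.bits = new Uint8Array(size);
  }

  // Seeded FNV-1a hash, reduced to a slot index.
  indexFor(value, seed) {
    let hash = (2166136261 ^ seed) >>> 0;
    for (let i = 0; i < value.length; i++) {
      hash ^= value.charCodeAt(i);
      hash = Math.imul(hash, 16777619) >>> 0;
    }
    return hash % this.size;
  }

  add(value) {
    for (let seed = 0; seed < this.hashes; seed++) {
      this.bits[this.indexFor(value, seed)] = 1;
    }
  }

  // May return a false positive, but never a false negative.
  has(value) {
    for (let seed = 0; seed < this.hashes; seed++) {
      if (!this.bits[this.indexFor(value, seed)]) return false;
    }
    return true;
  }
}

const seen = new BloomFilter();
seen.add("http://example.com/");
console.log(seen.has("http://example.com/")); // true
```

The trade-off is that `has` can occasionally report an unseen URL as seen, so a crawler built this way might skip a small number of pages in exchange for constant memory use.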

Installation

$ npm install crawlerr

Usage

crawlerr(base [, options])

You can find several examples in the examples/ directory. Here are some of the most important ones:

Example 1: Requesting title from a page

const crawlerr = require("crawlerr");

const spider = crawlerr("http://google.com/");

spider.get("/")
  .then(({ req, res, uri }) => console.log(res.document.title))
  .catch(error => console.log(error));

Example 2: Scanning a website for specific links

const spider = crawlerr("http://blog.npmjs.org/");

spider.when("/post/[digit:id]/[all:slug]", ({ req, res, uri }) => {
  const post = req.param("id");
  const slug = req.param("slug").split("?")[0];

  console.log(`Found post with id: ${post} (${slug})`);
});

Example 3: Server side DOM

const spider = crawlerr("http://example.com/");

spider.get("/").then(({ req, res, uri }) => {
  const document = res.document;
  const elementA = document.getElementById("someElement");
  const elementB = document.querySelector(".anotherForm");

  console.log(elementA.innerHTML);
});

Example 4: Setting cookies

const url = "http://example.com/";
const spider = crawlerr(url);

spider.request.setCookie(spider.request.cookie("foobar=…"), url);
spider.request.setCookie(spider.request.cookie("session=…"), url);

spider.get("/profile").then(({ req, res, uri }) => {
  //… spider.request.getCookieString(url);
  //… spider.request.getCookies(url);
});

API

`crawlerr(base [, options])`

Creates a new Crawlerr instance for a specific website with custom options. All routes are resolved relative to base.

| Option | Default | Description |
| --- | --- | --- |
| `concurrent` | `10` | How many requests can run simultaneously |
| `interval` | `250` | How often a new request should be sent (in ms) |
| … | `null` | See `request` defaults for more information |
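
As a rough illustration of how the defaults in the table could combine with user-supplied options, the sketch below merges the two. The `DEFAULTS` object and the `resolveOptions` helper are hypothetical names, and the merge strategy is an assumption rather than crawlerr's documented behavior.

```javascript
// Defaults copied from the options table; the merge itself is an assumption.
const DEFAULTS = { concurrent: 10, interval: 250 };

// Hypothetical helper: caller-supplied values override the defaults.
function resolveOptions(options = {}) {
  return { ...DEFAULTS, ...options };
}

console.log(resolveOptions({ interval: 1000 }));
// { concurrent: 10, interval: 1000 }
```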

public `.get(url)`

Requests url. Returns a Promise which resolves with { req, res, uri }, where:

req is the Request object;
res is the Response object;
uri is the absolute url (resolved from base).

Example:

spider
  .get("/")
  .then(({ res, req, uri }) => …);
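
To make "resolved from base" concrete, the sketch below resolves a route against a base with Node's built-in WHATWG `URL` class. The `resolveUri` helper is a hypothetical name, and whether crawlerr resolves routes exactly this way is an assumption.

```javascript
// Hypothetical helper showing one way a route could be resolved against base.
function resolveUri(base, route) {
  return new URL(route, base).href;
}

console.log(resolveUri("http://example.com/", "/profile"));
// http://example.com/profile
console.log(resolveUri("http://blog.npmjs.org/", "post/1"));
// http://blog.npmjs.org/post/1
```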

public `.when(pattern)`

Searches the entire website for urls which match the specified pattern. pattern can include named wildcards, which can then be retrieved from the request via req.param.

Example:

spider
  .when("/users/[digit:userId]/repos/[digit:repoId]", ({ res, req, uri }) => …);
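
The following sketch shows how a pattern with named wildcards might be translated into a regular expression. It is not crawlerr's matcher: `FILTERS`, `compilePattern` and `matchPattern` are hypothetical names, and the wildcard semantics ("digit" as one or more digits, "all" as any non-empty run of characters) are assumptions inferred from the examples above.

```javascript
// Assumed wildcard semantics; crawlerr's real rules may differ.
const FILTERS = { digit: "\\d+", all: ".+" };

// Hypothetical helper: turns "/post/[digit:id]" into /^\/post\/(?<id>\d+)$/.
function compilePattern(pattern) {
  const source = pattern
    .split(/(\[[a-z]+:[A-Za-z_]\w*\])/) // keep placeholders as separate parts
    .map(part => {
      const placeholder = /^\[([a-z]+):([A-Za-z_]\w*)\]$/.exec(part);
      if (!placeholder) {
        // Literal text: escape regex metacharacters.
        return part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
      }
      const [, filter, name] = placeholder;
      if (!(filter in FILTERS)) throw new Error(`Unknown filter: ${filter}`);
      return `(?<${name}>${FILTERS[filter]})`;
    })
    .join("");
  return new RegExp(`^${source}$`);
}

// Returns the named wildcard values, or null when the path does not match.
function matchPattern(pattern, path) {
  const match = compilePattern(pattern).exec(path);
  return match ? { ...match.groups } : null;
}

console.log(matchPattern("/users/[digit:userId]/repos/[digit:repoId]", "/users/12/repos/34"));
// { userId: '12', repoId: '34' }
```

Under these assumptions an `all` wildcard also swallows any query string, which would explain why Example 2 strips everything after `?` from the slug.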

public `.on(event, callback)`

Executes a callback for a given event. For more information about which events are emitted, refer to queue-promise.

Example:

spider.on("error", …);
spider.on("resolve", …);

public `.start()`/`.stop()`

Starts/stops the crawler.

Example:

spider.start();
spider.stop();

public `.request`

A configured request object which is used by retry-request when crawling webpages. Extends from request.jar(). Can be configured when initializing a new crawler instance through options. See crawler options and request documentation for more information.

Example:

const url = "https://example.com";
const spider = crawlerr(url);
const request = spider.request;

request.post(`${url}/login`, (err, res, body) => {
  request.setCookie(request.cookie("session=…"), url);
  // Next requests will include this cookie

  spider.get("/profile").then(…);
  spider.get("/settings").then(…);
});

Request

Extends the default Node.js incoming message.

public `get(header)`

Returns the value of an HTTP header. The Referrer header field is special-cased: Referrer and Referer are interchangeable.

Example:

req.get("Content-Type"); // => "text/plain"
req.get("content-type"); // => "text/plain"
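
The lookup described above can be sketched as follows. This is modeled on how Express handles the same case and assumes headers are stored under lowercase keys, as Node.js does for incoming messages. `getHeader` is a hypothetical standalone helper, not crawlerr's code.

```javascript
// Hypothetical helper: case-insensitive lookup with the Referrer/Referer alias.
function getHeader(headers, name) {
  const key = name.toLowerCase();
  if (key === "referer" || key === "referrer") {
    return headers.referrer || headers.referer;
  }
  return headers[key];
}

const headers = { "content-type": "text/plain", referer: "http://example.com/" };
console.log(getHeader(headers, "Content-Type")); // text/plain
console.log(getHeader(headers, "Referrer"));     // http://example.com/
```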

public `is(...types)`

Checks whether the incoming request has a "Content-Type" header field and whether it matches the given MIME type. Based on type-is.

Example:

// Returns true with "Content-Type: text/html; charset=utf-8"
req.is("html");
req.is("text/html");
req.is("text/*");
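
For intuition, here is a simplified sketch of this kind of matching. It is not the type-is package: `matchesType` and the tiny `SHORTHAND` table are hypothetical, whereas the real library resolves shorthands through a full MIME database and supports more pattern forms.

```javascript
// Hypothetical shorthand table; type-is uses a complete MIME database instead.
const SHORTHAND = { html: "text/html", json: "application/json", text: "text/plain" };

// Simplified check: exact types, shorthands from the table, and "type/*" wildcards.
function matchesType(contentType, ...types) {
  if (!contentType) return false;
  // Drop parameters such as "; charset=utf-8" before comparing.
  const actual = contentType.split(";")[0].trim().toLowerCase();
  const [actualType] = actual.split("/");
  return types.some(type => {
    const expected = (SHORTHAND[type] || type).toLowerCase();
    if (expected.endsWith("/*")) return expected.slice(0, -2) === actualType;
    return expected === actual;
  });
}

console.log(matchesType("text/html; charset=utf-8", "html"));   // true
console.log(matchesType("text/html; charset=utf-8", "text/*")); // true
console.log(matchesType("text/html; charset=utf-8", "json"));   // false
```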

public `param(name [, default])`

Returns the value of param name when present, or default otherwise:

checks route placeholders, ex: user/[all:username];
checks body params, ex: id=12, {"id":12};
checks query string params, ex: ?id=12;

Example:

// .when("/users/[all:username]/[digit:someID]")
req.param("username");  // /users/foobar/123456 => foobar
req.param("someID");    // /users/foobar/123456 => 123456
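
The lookup can be pictured with the sketch below, which assumes the three sources are consulted in the order listed above (route placeholders, then body, then query string). `lookupParam` and the shape of its `sources` argument are hypothetical and only illustrate the fallback logic.

```javascript
// Hypothetical helper: the first source that owns the key wins, else the default.
function lookupParam(sources, name, defaultValue) {
  const { params = {}, body = {}, query = {} } = sources;
  for (const source of [params, body, query]) {
    if (Object.prototype.hasOwnProperty.call(source, name)) {
      return source[name];
    }
  }
  return defaultValue;
}

const sources = {
  params: { username: "foobar" }, // from /users/[all:username]
  body: { id: 12 },               // from {"id":12}
  query: { page: "2" }            // from ?page=2
};

console.log(lookupParam(sources, "username"));         // foobar
console.log(lookupParam(sources, "page"));             // 2
console.log(lookupParam(sources, "missing", "none"));  // none
```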

Response

public `jsdom`

Returns the JSDOM object.

public `window`

Returns the DOM window for response content. Based on JSDOM.

public `document`

Returns the DOM document for response content. Based on JSDOM.

Example:

res.document.getElementById(…);
res.document.getElementsByTagName(…);
// …

Tests

npm test

Related Projects

node-http-server-tools

Some wrappers around the standard node:http API

31 Aug 2024 0

body-parser

Node.js body parsing middleware

06 Jan 2014 5,435

node-crawler

Web Crawler/Spider for NodeJS + server-side jQuery ;-)

25 Nov 2010 6,629

websight

🕷A simple but *really* fast crawler built with Node.js & TypeScript

14 Jul 2019 18

http-proxy-middleware

The one-liner node.js http-proxy middleware for connect, express, next.js and more

14 Mar 2015 10,722

axios

Promise based HTTP client for the browser and node.js

18 Aug 2014 104,005

spiderable-middleware

🤖 Prerendering for JavaScript powered websites. Great solution for PWAs (Progressive Web Apps), S...

12 Mar 2016 34

article-extractor

To extract main article from given URL with Node.js

29 Nov 2015 1,570

jsdom

A JavaScript implementation of various web standards, for use with Node.js

19 Jan 2010 19,941
