A simple and fully customizable web crawler/spider for Node.js with server-side DOM. Comes with elegant and hell-simple APIs.
MIT License
crawlerr is a simple yet powerful web crawler for Node.js, based on Promises. It lets you crawl only the urls that match wildcard patterns, uses a Bloom filter for caching, and offers a browser-like feel through a server-side DOM.
$ npm install crawlerr
crawlerr(base [, options])
You can find several examples in the examples/ directory. Here are some of the most important ones:
const spider = crawlerr("http://google.com/");
spider.get("/")
  .then(({ req, res, uri }) => console.log(res.document.title))
  .catch(error => console.log(error));
const spider = crawlerr("http://blog.npmjs.org/");
spider.when("/post/[digit:id]/[all:slug]", ({ req, res, uri }) => {
  const post = req.param("id");
  const slug = req.param("slug").split("?")[0];
  console.log(`Found post with id: ${post} (${slug})`);
});
const spider = crawlerr("http://example.com/");
spider.get("/").then(({ req, res, uri }) => {
  const document = res.document;
  const elementA = document.getElementById("someElement");
  const elementB = document.querySelector(".anotherForm");
  console.log(elementA.innerHTML);
});
const url = "http://example.com/";
const spider = crawlerr(url);
spider.request.setCookie(spider.request.cookie("foobar=…"), url);
spider.request.setCookie(spider.request.cookie("session=…"), url);
spider.get("/profile").then(({ req, res, uri }) => {
  //… spider.request.getCookieString(url);
  //… spider.request.getCookies(url);
});
crawlerr(base [, options])
Creates a new Crawlerr instance for a specific website with custom options. All routes will be resolved to base.
Option | Default | Description |
---|---|---|
concurrent | 10 | How many requests can run simultaneously |
interval | 250 | How often a new request should be sent (in ms) |
… | null | See request defaults for more information |
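To illustrate how the defaults in the table might combine with user-supplied values, here is a minimal sketch. resolveOptions is a hypothetical helper, not part of crawlerr's API; it assumes only the defaults listed above and that any other keys are forwarded as request defaults.

```javascript
// Hypothetical helper: not crawlerr's actual code.
// Defaults taken from the options table above.
const DEFAULTS = { concurrent: 10, interval: 250 };

function resolveOptions(options = {}) {
  // Split the queue-related options from everything else.
  const { concurrent, interval, ...requestDefaults } = options;
  return {
    concurrent: concurrent ?? DEFAULTS.concurrent,
    interval: interval ?? DEFAULTS.interval,
    // Assumption: the remaining keys are forwarded to request as defaults.
    requestDefaults
  };
}

const resolved = resolveOptions({ interval: 500, timeout: 5000 });
console.log(resolved);
```

Here interval overrides its default, concurrent falls back to 10, and timeout ends up among the request defaults.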
.get(url)
Requests url. Returns a Promise which resolves with { req, res, uri }, where:
req is the Request object;
res is the Response object;
uri is the absolute url (resolved from base).
Example:
spider
  .get("/")
  .then(({ res, req, uri }) => …);
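The uri value is described as the absolute url resolved from base. The sketch below shows that kind of resolution with Node's built-in URL class. Whether crawlerr uses this exact mechanism internally is an assumption, and resolve is a hypothetical helper name.

```javascript
// Resolve a route against a base URL with Node's built-in WHATWG URL class.
// Assumption: this mirrors how `uri` relates to `base`; crawlerr may differ.
const base = "http://example.com/";

function resolve(route) {
  return new URL(route, base).href;
}

console.log(resolve("/"));           // http://example.com/
console.log(resolve("/profile"));    // http://example.com/profile
console.log(resolve("posts/1?x=2")); // http://example.com/posts/1?x=2
```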
.when(pattern)
Searches the entire website for urls which match the specified pattern. pattern can include named wildcards, which can then be retrieved from the request via req.param.
Example:
spider
  .when("/users/[digit:userId]/repos/[digit:repoId]", ({ res, req, uri }) => …);
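To make the named-wildcard idea concrete, here is a minimal sketch that turns a pattern into a regular expression with named groups. It is not crawlerr's implementation: compile and match are hypothetical helpers, and the meaning given to digit (one or more digits) and all (any characters) is an assumption.

```javascript
// Hypothetical sketch of named-wildcard matching; not crawlerr's implementation.
// Assumption: `digit` matches one or more digits, `all` matches any characters.
const WILDCARDS = { digit: "\\d+", all: ".+" };

function compile(pattern) {
  // Escape regex metacharacters, then swap each [type:name] for a named group.
  const source = pattern
    .replace(/[.*+?^${}()|\\]/g, "\\$&")
    .replace(/\[(\w+):(\w+)\]/g, (match, type, name) => {
      if (!(type in WILDCARDS)) throw new Error(`Unknown wildcard: ${type}`);
      return `(?<${name}>${WILDCARDS[type]})`;
    });
  return new RegExp(`^${source}$`);
}

function match(pattern, path) {
  const result = compile(pattern).exec(path);
  // Return the captured wildcards as a plain object, or null when nothing matches.
  return result ? { ...result.groups } : null;
}

const params = match("/post/[digit:id]/[all:slug]", "/post/42/hello-world");
console.log(params); // { id: '42', slug: 'hello-world' }
```

A path such as /post/abc/hello-world returns null here, because abc does not satisfy the digit wildcard.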
.on(event, callback)
Executes a callback for a given event. For more information about which events are emitted, refer to queue-promise.
Example:
spider.on("error", …);
spider.on("resolve", …);
.start()/.stop()
Starts/stops the crawler.
Example:
spider.start();
spider.stop();
.request
A configured request object which is used by retry-request when crawling webpages. Extends from request.jar(). Can be configured through options when initializing a new crawler instance. See the crawler options and the request documentation for more information.
Example:
const url = "https://example.com";
const spider = crawlerr(url);
const request = spider.request;
request.post(`${url}/login`, (err, res, body) => {
  request.setCookie(request.cookie("session=…"), url);
  // Next requests will include this cookie
  spider.get("/profile").then(…);
  spider.get("/settings").then(…);
});
Request
Extends the default Node.js incoming message.
get(header)
Returns the value of an HTTP header. The Referrer header field is special-cased: Referrer and Referer are interchangeable.
Example:
req.get("Content-Type"); // => "text/plain"
req.get("content-type"); // => "text/plain"
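The behaviour described above (case-insensitive names, with Referrer and Referer treated alike) can be sketched as follows. getHeader is a hypothetical standalone function, not crawlerr's code; it relies on the fact that Node.js stores incoming header names in lower case.

```javascript
// Hypothetical sketch of a case-insensitive header lookup; not crawlerr's code.
// Node.js stores incoming header names in lower case.
function getHeader(headers, name) {
  const key = name.toLowerCase();
  // Treat "Referrer" and "Referer" as the same header.
  if (key === "referer" || key === "referrer") {
    return headers.referrer || headers.referer;
  }
  return headers[key];
}

const headers = { "content-type": "text/plain", referer: "http://example.com/" };
console.log(getHeader(headers, "Content-Type")); // text/plain
console.log(getHeader(headers, "Referrer"));     // http://example.com/
```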
is(...types)
Checks whether the incoming request has a "Content-Type" header field and whether it matches the given MIME type. Based on type-is.
Example:
// Returns true with "Content-Type: text/html; charset=utf-8"
req.is("html");
req.is("text/html");
req.is("text/*");
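As a rough illustration of the matching shown in the example, here is a simplified sketch. It is not the type-is package and not crawlerr's code: the is function and the small SHORTHANDS table are hypothetical, and the real library covers many more cases.

```javascript
// Simplified, hypothetical sketch of Content-Type matching; the real type-is
// package supports more forms and a full MIME database.
const SHORTHANDS = { html: "text/html", json: "application/json" };

function is(contentType, ...types) {
  if (!contentType) return false;
  // Drop parameters such as "; charset=utf-8".
  const actual = contentType.split(";")[0].trim().toLowerCase();
  const [actualType, actualSubtype] = actual.split("/");
  return types.some(type => {
    const expected = SHORTHANDS[type] || type.toLowerCase();
    const [expectedType, expectedSubtype] = expected.split("/");
    return (expectedType === "*" || expectedType === actualType) &&
      (expectedSubtype === "*" || expectedSubtype === actualSubtype);
  });
}

const header = "text/html; charset=utf-8";
console.log(is(header, "html"));      // true
console.log(is(header, "text/html")); // true
console.log(is(header, "text/*"));    // true
console.log(is(header, "json"));      // false
```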
param(name [, default])
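Returns the value of param name when present, or defaultValue otherwise. The value can come from:
route placeholders, e.g. user/[all:username];
body params, e.g. id=12, {"id":12};
query string params, e.g. ?id=12.
Example: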
// .when("/users/[all:username]/[digit:someID]")
req.param("username"); // /users/foobar/123456 => foobar
req.param("someID"); // /users/foobar/123456 => 123456
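A lookup with a fallback value, as described for param, might look like the sketch below. This is not crawlerr's implementation: the standalone param function, the shape of the sources object, and the lookup order (route wildcards, then body, then query string) are all assumptions.

```javascript
// Hypothetical sketch of a param lookup; not crawlerr's implementation.
// Assumption: route wildcards are checked first, then body, then query string.
function param(sources, name, defaultValue) {
  const { params = {}, body = {}, query = {} } = sources;
  for (const source of [params, body, query]) {
    if (Object.prototype.hasOwnProperty.call(source, name)) return source[name];
  }
  return defaultValue;
}

const sources = {
  params: { username: "foobar", someID: "123456" }, // from /users/foobar/123456
  query: { page: "2" }                              // from ?page=2
};

console.log(param(sources, "username"));       // foobar
console.log(param(sources, "page"));           // 2
console.log(param(sources, "missing", "n/a")); // n/a
```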
Response
jsdom
Returns the JSDOM object.
window
Returns the DOM window for response content. Based on JSDOM.
document
Returns the DOM document for response content. Based on JSDOM.
Example:
res.document.getElementById(…);
res.document.getElementsByTagName(…);
// …
npm test