crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Apache-2.0 License

Downloads: 148 · Stars: 15.2K


crawlee - v0.19.1

Published by mnmkng over 4 years ago

  • BREAKING (EXPERIMENTAL): session.checkStatus() -> session.retireOnBlockedStatusCodes() (see the sketch after this list).
  • Session API is no longer considered experimental.
  • Updates documentation and introduces a few internal changes.
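
A minimal sketch of the renamed session method inside CheerioCrawler, assuming the useSessionPool option introduced in v0.18.0 and that handlePageFunction receives the response and session objects; the exact parameters of retireOnBlockedStatusCodes() may differ:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ url: 'https://example.com' }],
    });
    await requestList.initialize();

    const crawler = new Apify.CheerioCrawler({
        requestList,
        useSessionPool: true, // assumed option name, see the v0.18.0 notes below
        handlePageFunction: async ({ $, request, response, session }) => {
            // Replaces the former session.checkStatus(): retires the session
            // when the response status code indicates it has been blocked.
            session.retireOnBlockedStatusCodes(response.statusCode);
            await Apify.pushData({ url: request.url, title: $('title').text() });
        },
    });

    await crawler.run();
});
```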
crawlee - v0.19.0

Published by mnmkng over 4 years ago

  • BREAKING: APIFY_LOCAL_EMULATION_DIR env var is no longer supported (deprecated on 2018-09-11).
    Use APIFY_LOCAL_STORAGE_DIR instead (see the sketch after this list).
  • SessionPool API updates and fixes. The API is no longer considered experimental.
  • Logging of system info moved from require time to Apify.main() invocation.
  • Use native RegExp instead of xregexp for unicode property escapes.
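
A minimal sketch of the replacement variable; the path is just an example and would normally be set in the shell rather than in code:

```js
// APIFY_LOCAL_EMULATION_DIR is no longer read; use APIFY_LOCAL_STORAGE_DIR.
process.env.APIFY_LOCAL_STORAGE_DIR = './my_storage'; // example path

const Apify = require('apify');

Apify.main(async () => {
    // Local datasets, key-value stores and request queues are now persisted
    // under the directory configured above.
    const dataset = await Apify.openDataset();
    await dataset.pushData({ hello: 'world' });
});
```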
crawlee - v0.18.1

Published by mnmkng almost 5 years ago

  • Fix SessionPool not automatically working in CheerioCrawler.
  • Fix incorrect management of page count in PuppeteerPool.
crawlee - v0.18.0

Published by petrpatek almost 5 years ago

  • BREAKING: CheerioCrawler now ignores SSL errors by default (options.ignoreSslErrors: true).
    See the sketch after this list to restore strict checking.
  • Add SessionPool implementation to CheerioCrawler.
  • Add SessionPool implementation to PuppeteerPool and PuppeteerCrawler.
  • Fix Request constructor not making a copy of objects such as userData and headers.
  • Fix desc option not being applied in local dataset.getData().
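
A minimal sketch of opting back into strict SSL checking after this change; everything except ignoreSslErrors is plain scaffolding:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ url: 'https://example.com' }],
    });
    await requestList.initialize();

    const crawler = new Apify.CheerioCrawler({
        requestList,
        // SSL errors are ignored by default since v0.18.0; set this to false
        // to restore strict certificate checking.
        ignoreSslErrors: false,
        handlePageFunction: async ({ $, request }) => {
            await Apify.pushData({ url: request.url, title: $('title').text() });
        },
    });

    await crawler.run();
});
```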
crawlee - v0.17.0

Published by mnmkng almost 5 years ago

  • BREAKING: Node 8 and 9 are no longer supported. Please use Node 10.17.0 or higher.
  • DEPRECATED: Apify.callTask() body and contentType options are now deprecated.
    Use input instead; it must be of content type application/json (see the sketch after this list).
  • Add default SessionPool implementation to BasicCrawler.
  • Add the ability to create ad-hoc webhooks via Apify.call() and Apify.callTask().
  • Add an example of form filling with Puppeteer.
  • Add country option to Apify.getApifyProxyUrl().
  • Add Apify.utils.puppeteer.saveSnapshot() helper to quickly save HTML and screenshot of a page.
  • Add the ability to pass got supported options to requestOptions in CheerioCrawler
    thus supporting things such as cookieJar again.
  • Switch Puppeteer to web socket again due to suspected pipe errors.
  • Fix an issue where some encodings were not correctly parsed in CheerioCrawler.
  • Fix parsing bad Content-Type headers for CheerioCrawler.
  • Fix custom headers not being correctly applied in Apify.utils.requestAsBrowser().
  • Fix dataset limits not being correctly applied.
  • Fix a race condition in RequestQueueLocal.
  • Fix RequestList persistence of downloaded sources in key-value store.
  • Fix Apify.utils.puppeteer.blockRequests() always including default patterns.
  • Fix inconsistent behavior of Apify.utils.puppeteer.infiniteScroll() on some websites.
  • Fix retry histogram statistics sometimes showing invalid counts.
  • Added regular expressions for YouTube videos (YOUTUBE_REGEX, YOUTUBE_REGEX_GLOBAL) to utils.social.
  • Added documentation for the json option in handlePageFunction of CheerioCrawler.
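
A minimal sketch of the new input style for Apify.callTask() and the country option of Apify.getApifyProxyUrl(); the task ID and input fields are placeholders, and both calls expect platform credentials (APIFY_TOKEN, APIFY_PROXY_PASSWORD) in the environment:

```js
const Apify = require('apify');

Apify.main(async () => {
    // Pass a JSON-serializable object as input instead of the deprecated
    // body/contentType options. 'my-user/my-task' is a placeholder task ID.
    const run = await Apify.callTask('my-user/my-task', {
        startUrls: [{ url: 'https://example.com' }],
    });
    console.log(`Task run ${run.id} finished with status: ${run.status}`);

    // The new country option limits Apify Proxy to servers in a given country.
    const proxyUrl = Apify.getApifyProxyUrl({ country: 'US' });
    console.log(`Proxy URL: ${proxyUrl}`);
});
```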
crawlee - v0.16.1

Published by drobnikj almost 5 years ago

  • Add useIncognitoPages option to PuppeteerPool to enable opening new pages in incognito
    browser contexts. This is useful to keep cookies and cache unique for each page.
  • Added options to load every content type in CheerioCrawler. There are new body and contentType
    options in handlePageFunction for this purpose (see the sketch after this list).
  • DEPRECATED: The CheerioCrawler html option in handlePageFunction was replaced with the body option.
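
A minimal sketch of the new body and contentType parameters, assuming contentType is the parsed { type, encoding } object; the JSON URL is illustrative:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ url: 'https://example.com/data.json' }], // illustrative URL
    });
    await requestList.initialize();

    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ request, body, contentType }) => {
            // body replaces the deprecated html parameter and is also available
            // for non-HTML responses, where Cheerio's $ would not help.
            if (contentType.type === 'application/json') {
                await Apify.pushData({ url: request.url, data: JSON.parse(body.toString()) });
            }
        },
    });

    await crawler.run();
});
```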
crawlee - v0.16.0

Published by mnmkng about 5 years ago

  • Update @apify/http-request to version 1.1.2.
  • Update CheerioCrawler to use requestAsBrowser() to better disguise as a real browser.
crawlee - v0.15.5

Published by mnmkng about 5 years ago

  • This release just updates some dependencies (not Puppeteer).
crawlee - v0.15.4

Published by mnmkng about 5 years ago

  • DEPRECATED: dataset.delete(), keyValueStore.delete() and requestQueue.delete() methods have been deprecated in favor of *.drop() methods, because the drop name more clearly communicates the fact that those methods drop / delete the storage itself, not individual elements in the storage.
  • Added Apify.utils.requestAsBrowser() helper function that enables you to make HTTP(S) requests disguised as a browser (Firefox). This may help in overcoming certain anti-scraping and anti-bot protections (see the sketch after this list).
  • Added options.gotoTimeoutSecs to PuppeteerCrawler to enable easier setting of navigation timeouts.
  • PuppeteerPool options that were deprecated from the PuppeteerCrawler constructor were finally removed. Please use maxOpenPagesPerInstance, retireInstanceAfterRequestCount, instanceKillerIntervalSecs, killInstanceAfterSecs and proxyUrls via the puppeteerPoolOptions object.
  • On the Apify Platform a warning will now be printed when using an outdated apify package version.
  • Apify.utils.puppeteer.enqueueLinksByClickingElements() will now print a warning when the nodes it
    tries to click become modified (detached from the DOM). This is useful for debugging unexpected behavior.
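
A minimal sketch of requestAsBrowser() together with the renamed drop() method; the dataset name is just an example:

```js
const Apify = require('apify');

Apify.main(async () => {
    // Plain HTTP(S) request that disguises itself as Firefox.
    const response = await Apify.utils.requestAsBrowser({ url: 'https://example.com' });
    console.log(response.statusCode, response.body.length);

    // Storages are now removed with drop() instead of the deprecated delete().
    const dataset = await Apify.openDataset('temporary-results'); // example name
    await dataset.pushData({ bodyLength: response.body.length });
    await dataset.drop();
});
```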
crawlee - v0.15.3

Published by mnmkng about 5 years ago

  • Apify.launchPuppeteer() now accepts proxyUrl with the https, socks4
    and socks5 schemes, as long as it doesn't contain a username or password.
    This fixes Issue #420 (see the sketch after this list).
  • Added the desiredConcurrency option to the AutoscaledPool constructor and removed
    an unnecessary bounds check from the setter property.
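
A minimal sketch of launching Puppeteer through a SOCKS5 proxy; the proxy address is a placeholder and must not contain credentials:

```js
const Apify = require('apify');

Apify.main(async () => {
    // proxyUrl may now use the https, socks4 or socks5 schemes,
    // as long as it carries no username or password.
    const browser = await Apify.launchPuppeteer({
        proxyUrl: 'socks5://127.0.0.1:9050', // placeholder proxy address
    });

    const page = await browser.newPage();
    await page.goto('https://example.com');
    console.log(await page.title());

    await browser.close();
});
```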
crawlee - v0.15.2

Published by mnmkng over 5 years ago

  • Fix error where Puppeteer would fail to launch when pipes are turned off.
  • Switch back to default Web Socket transport for Puppeteer due to upstream issues.
crawlee - v0.15.1

Published by mnmkng over 5 years ago

  • BREAKING CHANGE: Removed support for Web Driver (Selenium), since no further updates are planned.
    If you wish to continue using Web Driver, please stay on Apify SDK version ^0.14.15.
  • BREAKING CHANGE: Dataset.getData() throws an error if user provides an unsupported option
    when using local disk storage.
  • DEPRECATED: options.userData of Apify.utils.enqueueLinks() is deprecated.
    Use options.transformRequestFunction instead (see the sketch at the end of this list).
  • Improve logging of memory overload errors.
  • Improve error message in Apify.call().
  • Fix multiple log lines appearing when a crawler was about to finish.
  • Add Apify.utils.puppeteer.enqueueLinksByClickingElements() function which enables you
    to add requests to the queue from pure JavaScript navigations, form submissions etc.
  • Add Apify.utils.puppeteer.infiniteScroll() function which helps you with scrolling to the bottom
    of websites that auto-load new content.
  • The RequestQueue.handledCount() function has been resurrected from deprecation
    in order to have a compatible interface with RequestList.
  • Add useExtendedUniqueKey option to Request constructor to include method and payload
    in the Request's computed uniqueKey.
  • Updated Puppeteer to 1.18.1.
  • Updated apify-client to 0.5.22.
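
A minimal sketch combining infiniteScroll() with enqueueLinks() and the new transformRequestFunction; the selector and the timeoutSecs option are assumptions made for illustration:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com' });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ page, request }) => {
            // Scroll to the bottom of pages that lazy-load their content.
            await Apify.utils.puppeteer.infiniteScroll(page, { timeoutSecs: 30 });

            // transformRequestFunction replaces the deprecated userData option.
            await Apify.utils.enqueueLinks({
                page,
                requestQueue,
                selector: 'a', // illustrative selector
                transformRequestFunction: (req) => {
                    req.userData = { referrer: request.url };
                    return req;
                },
            });
        },
    });

    await crawler.run();
});
```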