Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Apache-2.0 License
- APIFY_LOCAL_STORAGE_ENABLE_WAL_MODE (closes #956)
- Added @ts-ignore comments to imports of optional peer dependencies (#1152)
- sdk.openSessionPool() (#1154)
- infiniteScroll (#1140)

Published by mnmkng about 3 years ago
This release improves the stability of the SDK.

- Improvements to ProxyConfiguration and CheerioCrawler.
- Updated got-scraping to receive multiple improvements.

Published by mnmkng about 3 years ago
Fixes an issue in CheerioCrawler caused by parser conflicts in recent versions of cheerio.

Published by mnmkng about 3 years ago
Pins got-scraping to 2.0.1 until fully compatible.

Published by mnmkng about 3 years ago
We're releasing SDK 2 ahead of schedule because we need state-of-the-art HTTP2 support for scraping, and with Node.js versions below 15.10, HTTP2 is not very reliable. We bundled in two more potentially breaking changes that we were waiting for, but we expect those to have very little impact on users. Migration should therefore be super simple. Just bump your Node.js version.
If you're waiting for full TypeScript support and new features, those are still in the works and will be released in SDK 3 at the end of this year.
- Updated cheerio to 1.0.0-rc.10 from rc.3. There were breaking changes in cheerio between the versions, so this bump might be breaking for you as well.
- Removed LiveViewServer, which was deprecated before the release of SDK v1.

Published by B4nan about 3 years ago
- browser-pool rewrite

Published by mnmkng about 3 years ago
Fixes headerGeneratorOptions not being passed to got-scraping in requestAsBrowser.

Published by petrpatek over 3 years ago
Fixes /v2 duplication in apiBaseUrl.

Published by mnmkng over 3 years ago
CheerioCrawler

CheerioCrawler downloads the web pages using the requestAsBrowser utility function. As opposed to the browser-based crawlers, which automatically encode the URLs, the requestAsBrowser function will not do so. We either need to encode the URLs manually with the encodeURI() function, or set forceUrlEncoding: true in the requestAsBrowserOptions, which will automatically encode all the URLs before accessing them.

We can either use forceUrlEncoding or encode manually, but not both - it would result in double encoding and therefore lead to invalid URLs.

We can use the preNavigationHooks to adjust requestAsBrowserOptions:
preNavigationHooks: [
    (crawlingContext, requestAsBrowserOptions) => {
        requestAsBrowserOptions.forceUrlEncoding = true;
    },
]
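
For context, here is a minimal sketch of wiring this hook into a crawler; the start URL, request list name, and handlePageFunction body are illustrative placeholders, not part of the release notes:

const Apify = require('apify');

Apify.main(async () => {
    const requestList = await Apify.openRequestList('start-urls', [
        'https://example.com/some page', // note the unencoded space
    ]);

    const crawler = new Apify.CheerioCrawler({
        requestList,
        preNavigationHooks: [
            (crawlingContext, requestAsBrowserOptions) => {
                // let requestAsBrowser encode the URL before the request is made
                requestAsBrowserOptions.forceUrlEncoding = true;
            },
        ],
        handlePageFunction: async ({ request, $ }) => {
            console.log(`Title of ${request.url}: ${$('title').text()}`);
        },
    });

    await crawler.run();
});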
Apify class and Configuration

Adds two new named exports:

- Configuration class that serves as the main configuration holder, replacing explicit usage of environment variables.
- Apify class that allows configuring the SDK. Env vars still have precedence over the SDK configuration.

When using the Apify class, there should be no side effects. Also adds new configuration for WAL mode in ApifyStorageLocal.
As opposed to using the global helper functions like main, there is an alternative approach using the Apify class. It has mostly the same API, but the methods on an Apify instance will use the configuration provided in the constructor. Environment variables will have precedence over this configuration.
const { Apify } = require('apify'); // use named export to get the class
const sdk = new Apify({ token: '123' });
console.log(sdk.config.get('token')); // '123'
// the token will be passed to the `call` method automatically
const run = await sdk.call('apify/hello-world', { myInput: 123 });
console.log(`Received message: ${run.output.body.message}`);
Another example shows how the default dataset name can be changed:
const { Apify } = require('apify'); // use named export to get the class
const sdk = new Apify({ defaultDatasetId: 'custom-name' });
await sdk.pushData({ myValue: 123 });
is equivalent to:
const Apify = require('apify'); // use default export to get the helper functions
const dataset = await Apify.openDataset('custom-name');
await dataset.pushData({ myValue: 123 });
- Added the Configuration class and the Apify named export, see above.
- Fixed proxyUrl without a port throwing an error when launching browsers.
- Fixed maxUsageCount of a Session not being persisted.
- Updated puppeteer and playwright to match stable Chrome (90).
- Added taskTimeoutSecs to allow control over the timeout of AutoscaledPool tasks.
- Added forceUrlEncoding to requestAsBrowser options.
- Added preNavigationHooks and postNavigationHooks to CheerioCrawler.
- Deprecated the prepareRequestFunction and postResponseFunction methods of CheerioCrawler.
- Added a new aborting event for handling gracefully aborted runs from the Apify platform (a sketch follows below).

Published by B4nan over 3 years ago
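
A minimal sketch of handling the aborting event added in this release, using the SDK's Apify.events emitter; the logging is illustrative only:

const Apify = require('apify');

Apify.main(async () => {
    // 'aborting' fires when a run is gracefully aborted on the Apify
    // platform, giving the actor a chance to persist its state first.
    Apify.events.on('aborting', () => {
        console.log('Graceful abort requested, persisting state...');
    });

    // ... actor or crawler logic would go here
});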
Fixes requestAsBrowser behavior with various combinations of the json and payload legacy options. Closes #1028.

Published by mnmkng over 3 years ago
This release brings the long-awaited HTTP2 capabilities to requestAsBrowser. It could make HTTP2 requests even before, but it was not very helpful in making browser-like ones. This is very important for disguising as a browser and for reducing the number of blocked requests. requestAsBrowser now uses got-scraping.

The most important new feature is that the full set of headers requestAsBrowser uses will now be generated using live data about browser headers that we collect. This means that the "header fingerprint" will always match existing browsers and should be indistinguishable from a real browser request. The header sets will be automatically rotated for you to further reduce the chances of blocking.

We also switched the default HTTP version from 1 to 2 in requestAsBrowser. We don't expect this change to be breaking, and we took precautions, but we're aware that there are always some edge cases, so please let us know if it causes trouble for you.
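
A minimal sketch of the updated call; the URL is a placeholder, and the snippet relies only on the defaults described above:

const Apify = require('apify');

Apify.main(async () => {
    // Browser-like headers are generated and rotated automatically,
    // and useHttp2 now defaults to true.
    const response = await Apify.utils.requestAsBrowser({
        url: 'https://example.com', // placeholder URL
    });
    console.log(response.statusCode);
});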
- Replaced the underlying HTTP client of utils.requestAsBrowser() with got-scraping.
- useHttp2 is now true by default with utils.requestAsBrowser().
- Fixed Apify.call() failing with empty OUTPUT.
- Updated puppeteer to 8.0.0 and playwright to 1.10.0, with Chromium 90 in Docker images.
- Updated @apify/ps-tree to support Windows better.
- Updated @apify/storage-local to support Node.js 16 prebuilds.

Published by mnmkng over 3 years ago
- Deprecated utils.waitForRunToFinish; please use the apify-client package and its waitForFinish functions. Sorry, forgot to deprecate this with the v1 release.
- Fixed a require that broke the SDK with the underscore 1.13 release.
- Updated @apify/storage-local to v2, written in TypeScript.

Published by mnmkng over 3 years ago
- Fixed SessionPoolOptions not being correctly used in BrowserCrawler.
- Fixed an issue related to puppeteer or playwright installations.

Published by mnmkng over 3 years ago
In this minor release we focused on the SessionPool. Besides fixing a few bugs, we added one important feature: setting and getting sessions by ID.
// Open a pool first (not part of the original snippet).
const sessionPool = await Apify.openSessionPool();

// Now you can add specific sessions to the pool,
// instead of relying on random generation.
await sessionPool.addSession({
    id: 'my-session',
    // ... some config
});

// Later, you can retrieve the session. This is useful
// for example when you need a specific login session.
const session = await sessionPool.getSession('my-session');
- Added the sessionPool.addSession() function to add a new session to the session pool (possibly with the provided options, e.g. with a specific session ID).
- Added an optional sessionId parameter to sessionPool.getSession(), to be able to retrieve a session from the session pool with a specific session ID.
- Fixed SessionPool not working properly in both PuppeteerCrawler and PlaywrightCrawler.
- Fixed Apify.call() and Apify.callTask() output to make it backwards compatible with previous versions of the client.
- Updated browser-pool to fix issues with failing hooks causing browsers to get stuck in limbo.
- Removed the proxy-chain dependency, because it's now covered in browser-pool.

Published by mnmkng over 3 years ago
- Allow changing the ProxyConfiguration status check URL with the APIFY_PROXY_STATUS_URL env var.
- Fixed an issue that occurred when SessionPool was used.

Published by mnmkng over 3 years ago
- Fixed dataset.pushData() validation, which would not allow anything other than plain objects.
- Fixed PuppeteerLaunchContext.stealth throwing when used in PuppeteerCrawler.