Crawlee: a web scraping and browser automation library for Node.js that helps you build reliable crawlers in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation.
APACHE-2.0 License
Published by mnmkng over 3 years ago
After 3.5 years of rapid development and a lot of breaking changes and deprecations, here comes the result: Apify SDK v1. There were two goals for this release: stability, and support for more browsers, namely Firefox and WebKit (Safari).

The SDK has grown quite popular over the years, powering thousands of web scraping and automation projects. We think our developers deserve a stable environment to work in, and by releasing SDK v1, we commit to making breaking changes only once a year, with a new major release.
We added support for more browsers by replacing `PuppeteerPool` with `browser-pool`, a new library that we created specifically for this purpose. It builds on the ideas from `PuppeteerPool` and extends them to support Playwright. Playwright is a browser automation library similar to Puppeteer. It works with all well-known browsers and uses almost the same interface as Puppeteer, while adding useful features and simplifying common tasks. Don't worry, you can still use Puppeteer with the new `BrowserPool`.
A large breaking change is that neither `puppeteer` nor `playwright` is bundled with SDK v1. To make the choice of a library easier and installs faster, users will have to install the selected modules and versions themselves. This allows us to add support for even more libraries in the future.
Thanks to the addition of Playwright, we now have a `PlaywrightCrawler`. It is very similar to `PuppeteerCrawler`, and you can pick whichever you prefer. It also means we needed to make some interface changes. The `launchPuppeteerFunction` option of `PuppeteerCrawler` is gone, and `launchPuppeteerOptions` were replaced by `launchContext`. We also moved things around in the `handlePageFunction` arguments. See the migration guide for a more detailed explanation and migration examples.
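As a sketch of the migration, assuming only the option names mentioned above, the new shape looks roughly like this (values and the handler body are illustrative, not taken from the migration guide):

```javascript
// Sketch of the v1 options shape (illustrative values). The old
// `launchPuppeteerOptions` are replaced by `launchContext`, and pool
// references moved under the new `crawler` argument. In a real project
// this object would be passed to `new Apify.PuppeteerCrawler(...)`.
const crawlerOptions = {
  launchContext: {
    // launcher-specific options now live inside the launch context
    launchOptions: { headless: true },
  },
  handlePageFunction: async ({ request, page, crawler }) => {
    // `crawler.autoscaledPool` replaces the old `autoscaledPool` argument
    return { url: request.url, title: await page.title() };
  },
};
```

With `PlaywrightCrawler` the same shape applies; only the underlying launcher differs.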
What's in store for SDK v2? We want to split the SDK into smaller libraries, so that everyone can install only the things they need. We plan a TypeScript migration to make crawler development faster and safer. Finally, we will take a good look at the interface of the whole SDK and update it to improve the developer experience. Bug fixes and scraping features will of course keep landing in versions 1.X as well.
- Removed `puppeteer` from dependencies. If you want to use Puppeteer, you must install it yourself.
- Removed `PuppeteerPool`. Use `browser-pool`.
- Removed `PuppeteerCrawlerOptions.launchPuppeteerOptions`. Use `launchContext`.
- Removed `PuppeteerCrawlerOptions.launchPuppeteerFunction`. Use `PuppeteerCrawlerOptions.preLaunchHooks` and `postLaunchHooks`.
- Removed `args.autoscaledPool` and `args.puppeteerPool` from `handle(Page/Request)Function` arguments. Use `args.crawler.autoscaledPool` and `args.crawler.browserPool`.
- The `useSessionPool` and `persistCookiesPerSession` options of crawlers are now `true` by default. Explicitly set them to `false` to override the behavior.
- `Apify.launchPuppeteer()` no longer accepts `LaunchPuppeteerOptions`. It now accepts `PuppeteerLaunchContext`.
- Removed `PuppeteerCrawlerOptions.gotoFunction`. Use `PuppeteerCrawlerOptions.preNavigationHooks` and `postNavigationHooks`.
- Removed `Apify.utils.puppeteer.enqueueLinks()`. Deprecated in 01/2019. Use `Apify.utils.enqueueLinks()`.
- Removed `autoscaledPool.(set|get)MaxConcurrency()`. Deprecated in 2019. Use `autoscaledPool.maxConcurrency`.
- Removed `CheerioCrawlerOptions.requestOptions`. Deprecated in 03/2020. Use `CheerioCrawlerOptions.prepareRequestFunction`.
- Removed `Launch.requestOptions`. Deprecated in 03/2020. Use `CheerioCrawlerOptions.prepareRequestFunction`.
- Added `Apify.PlaywrightCrawler`, which is almost identical to `PuppeteerCrawler`, but it crawls with the `playwright` library.
- Added the `Apify.launchPlaywright(launchContext)` helper function.
- Added `browserPoolOptions` to `PuppeteerCrawler` to configure `BrowserPool`.
- Added `crawler` to `handle(Request/Page)Function` arguments.
- Added `browserController` to `handlePageFunction` arguments.
- Added the `crawler.crawlingContexts` `Map`, which includes all running `crawlingContext`s.

Published by mnmkng almost 4 years ago
- Fixed `Apify.pushData()` and `keyValueStore.forEachKey()` by updating `@apify/storage-local` to `1.0.2`.

Published by mnmkng almost 4 years ago
- Pinned `cheerio` to `1.0.0-rc.3` to avoid install problems in some builds.
- Increased the default `maxEventLoopOverloadedRatio` in `SystemStatusOptions` to 0.6.

Published by mnmkng almost 4 years ago
This is the last major release before SDK v1.0.0. We're committed to delivering v1 at the end of 2020, so stay tuned. Besides Playwright integration via a new `BrowserPool`, it will be the first release of the SDK that we'll support for an extended period of time. We will not make any breaking changes until 2.0.0, which will come at the end of 2021. But enough about v1, let's see the changes in 0.22.0.
In this release we've changed a lot of code, but you may not even notice. We've updated the underlying `apify-client` package, which powers all communication with the Apify API, to version `1.0.0`. This means a completely new API for all internal calls. If you use `Apify.client` calls in your code, this will be a large breaking change for you. Visit the client docs to see what's new in the client, but also note that we removed the default client available under `Apify.client` and replaced it with an `Apify.newClient()` function. We think it's better to have separate clients for users and internal use.
Until now, local emulation of Apify Storages has been a part of the SDK. We moved the logic into a separate package, `@apify/storage-local`, which shares its interface with `apify-client`. `RequestQueue` is now powered by `SQLite3` instead of the file system, which improves reliability and performance quite a bit. `Dataset` and `KeyValueStore` still use the file system, for easy browsing of data. The structure of the `apify_storage` folder remains unchanged.
After collecting common developer mistakes, we've decided to make argument validation stricter. You will no longer be able to pass extra arguments to functions and constructors. This is to alleviate the frustration when you mistakenly pass `useChrome` to `PuppeteerPoolOptions` instead of `LaunchPuppeteerOptions` and don't realize it. Before this version, the SDK wouldn't let you know and would silently continue with Chromium. Now, it will throw an error saying that `useChrome` is not an allowed property of `PuppeteerPoolOptions`.
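The idea can be illustrated with a minimal sketch. This is not the SDK's actual validation code, just the principle of rejecting unknown properties instead of silently ignoring them (`maxOpenPagesPerInstance` is used here only as an example allowed key):

```javascript
// Illustrative strict validation: throw on properties that are not part
// of the declared signature, instead of silently dropping them.
function validateOptions(options, allowedKeys) {
  for (const key of Object.keys(options)) {
    if (!allowedKeys.includes(key)) {
      throw new Error(`${key} is not an allowed property`);
    }
  }
}

// Passing `useChrome` where it is not allowed now fails loudly,
// instead of the crawler silently continuing with Chromium.
let caught = null;
try {
  validateOptions({ useChrome: true }, ['maxOpenPagesPerInstance']);
} catch (err) {
  caught = err;
}
```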
Based on developer feedback, we decided to remove `--no-sandbox` from the default Puppeteer launch args. It will only be used on the Apify Platform. This gives you the chance to use your own sandboxing strategy.
`LiveViewServer` and `puppeteerPoolOptions.useLiveView` were never very user-friendly or performant solutions, due to the inherent performance issues with rapidly taking many screenshots in Puppeteer, so we've decided to remove them. If you need similar functionality, try the `devtools-server` NPM package, which utilizes the Chrome DevTools frontend for screen-casting a live view of the running browser.
Full list of changes:
- BREAKING: Updated `apify-client` to `1.0.0` with a completely new interface. We also removed the `Apify.client` property and replaced it with an `Apify.newClient()` function that creates a new `ApifyClient` instance.
- BREAKING: Removed `--no-sandbox` from default Puppeteer launch arguments. This will most likely be breaking for Linux and Docker users.
- BREAKING: Function argument validation is now stricter and will not accept extra parameters which are not defined by the functions' signatures.
- DEPRECATED: `puppeteerPoolOptions.useLiveView` is now deprecated. Use the `devtools-server` NPM package instead.
- Added `postResponseFunction` to `CheerioCrawlerOptions`. It allows you to override properties on the HTTP response before processing by `CheerioCrawler`.
- Added HTTP2 support to `utils.requestAsBrowser()`. Set `useHttp2` to `true` in `RequestAsBrowserOptions` to enable it.
- Fixed handling of XML content types in `CheerioCrawler`.
- Fixed capitalization of headers when using `utils.puppeteer.addInterceptRequestHandler`.
- Fixed `utils.puppeteer.saveSnapshot()` overwriting screenshots with HTML when running locally.
- Updated `puppeteer` to version `5.4.1` with Chrom(ium) 87.
- Removed `RequestQueueLocal` in favor of the `@apify/storage-local` API emulator.
- Removed `KeyValueStoreLocal` in favor of the `@apify/storage-local` API emulator.
- Removed `DatasetLocal` in favor of the `@apify/storage-local` API emulator.
- Removed the `userData` option from `Apify.utils.enqueueLinks` (deprecated in Jun 2019). Use `transformRequestFunction` instead.
- Removed `instanceKillerIntervalMillis` and `killInstanceAfterMillis` (deprecated in Feb 2019). Use `instanceKillerIntervalSecs` and `killInstanceAfterSecs` instead.
- Removed the `memory` option from `Apify.call` `options` (deprecated in 2018). Use `memoryMbytes` instead.
- Removed `delete()` methods from `Dataset`, `KeyValueStore` and `RequestQueue` (deprecated in Jul 2019). Use `.drop()`.
- Removed `utils.puppeteer.hideWebDriver()` (deprecated in May 2019). Use `LaunchPuppeteerOptions.stealth`.
- Removed `utils.puppeteer.enqueueRequestsFromClickableElements()` (deprecated in 2018). Use `utils.puppeteer.enqueueLinksByClickingElements`.
- Removed `request.doNotRetry()` (deprecated in June 2019). Use `request.noRetry = true`.
- Removed `RequestListOptions.persistSourcesKey` (deprecated in Feb 2020). Use `persistRequestsKey`.
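One item above, HTTP2 support in `utils.requestAsBrowser()`, can be sketched as follows. The option names follow the list; the actual network call is commented out so the snippet stands alone, and the URL is a placeholder:

```javascript
// Sketch: enabling HTTP2 in utils.requestAsBrowser() (0.22.0+).
const requestOptions = {
  url: 'https://example.com', // placeholder URL
  useHttp2: true,             // opt in to the new HTTP2 support
};
// In a real project, with the Apify SDK installed:
// const { body } = await Apify.utils.requestAsBrowser(requestOptions);
```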
Published by mnmkng almost 4 years ago
Published by mnmkng almost 4 years ago
- … `stealth`.
- Fixed `SessionPool` not retiring sessions immediately when they become unusable. It fixes a problem where `PuppeteerPool` would not retire browsers with bad sessions.

Published by mnmkng about 4 years ago
- Made `PuppeteerCrawler` safe against malformed Puppeteer responses.

Published by mnmkng about 4 years ago
- Fixed random crashes in `PuppeteerCrawler` caused by `page.goto()` randomly returning `null`.

Published by mnmkng about 4 years ago
It appears that `CheerioCrawler` was correctly retiring sessions on timeouts and blocked status codes (401, 403, 429), whereas `PuppeteerCrawler` did not. Apologies for the omission; this release fixes the problem.

- … `PuppeteerCrawler`.
- … `PuppeteerCrawler`.
- Updated `apify-shared` to version `0.5.0`.

Published by mnmkng about 4 years ago
This is a very minor release that fixes some issues that were preventing use of the SDK with Node 14.

- … `RequestList`.
Published by mnmkng about 4 years ago
The request statistics that you may remember from logs are now persisted in the key-value store, so you won't lose count when your actor restarts. We've also added a lot of useful stats there, which you can inspect after a run finishes. Besides that, we fixed some bugs and annoyances and improved the TypeScript experience a bit.

- Added the `Statistics` class and automatically persist it in `BasicCrawler`.
- … `ProxyConfiguration` to throw …
- Fixed `RequestAsBrowserOptions` missing some values and added `RequestQueueInfo` … `requestQueue.getInfo()`.
Published by mnmkng about 4 years ago
Published by mnmkng about 4 years ago
Published by mnmkng over 4 years ago
We fixed some bugs, improved a few things and bumped Puppeteer to match the latest Chrome 84.

- … `Apify.createProxyConfiguration` to be used seamlessly with the proxy component.
- … `CheerioCrawler` with the `crawler.use()` function.
- … `RequestQueueLocal` to fail handling requests.
- … `SessionPool`.
- … `ProxyConfiguration` error message for missing password / token.

Published by mnmkng over 4 years ago
This release comes with breaking changes that will affect most, if not all, of your projects. See the migration guide for more information and examples.

The first large change is a redesigned proxy configuration. `Cheerio` and `Puppeteer` crawlers now accept a `proxyConfiguration` parameter, which is an instance of `ProxyConfiguration`. This class now exclusively manages both Apify Proxy and custom proxies. Visit the new proxy management guide. We also removed `Apify.utils.getRandomUserAgent()` as it was no longer effective in avoiding bot detection, and changed the default values for empty properties in `Request` instances.
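The new flow described above can be sketched as follows. The `Apify.createProxyConfiguration()` call is commented out (it needs the SDK and platform credentials), and the stand-in object below only illustrates the `newUrl(sessionId)` interface; its URL format is invented for illustration:

```javascript
// In a real project the configuration comes from the SDK:
// const proxyConfiguration = await Apify.createProxyConfiguration();
// Stand-in object illustrating the interface described above: each
// session id maps to a stable proxy URL, enabling session-based rotation.
const proxyConfiguration = {
  newUrl(sessionId) {
    // invented URL shape, for illustration only
    return `http://proxy.example.com:8000/?session=${sessionId}`;
  },
};

const urlA = proxyConfiguration.newUrl('session_a');
const urlB = proxyConfiguration.newUrl('session_b');
```

The same `proxyConfiguration` instance is what the crawlers accept via their new `proxyConfiguration` option.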
- Removed `Apify.getApifyProxyUrl()`. To get an Apify Proxy URL, use `proxyConfiguration.newUrl([sessionId])`.
- Removed the `useApifyProxy`, `apifyProxyGroups` and `apifyProxySession` parameters from all applications in the SDK. Use `proxyConfiguration` in crawlers and `proxyUrl` in `requestAsBrowser` and `Apify.launchPuppeteer`.
- Removed `Apify.utils.getRandomUserAgent()` as it was no longer effective in avoiding bot detection.
- `Request` instances no longer initialize empty properties with `null`, which means that empty `errorMessages` are now represented by `[]`, and `loadedUrl`, `payload` and `handledAt` are `undefined`.
- Added the `Apify.createProxyConfiguration()` `async` function to create `ProxyConfiguration` instances. `ProxyConfiguration` itself is not exposed.
- Added `proxyConfiguration` to `CheerioCrawlerOptions` and `PuppeteerCrawlerOptions`.
- Added `proxyInfo` to `CheerioHandlePageInputs` and `PuppeteerHandlePageInputs`. You can use this object to retrieve information about the currently used proxy in `Puppeteer` and `Cheerio` crawlers.
- … `Apify.utils.puppeteer.infiniteScroll()`.
- Fixed an issue where `Apify.utils.requestAsBrowser()` would get into redirect loops.
- Fixed `Apify.utils.getMemoryInfo()` crashing the process on AWS Lambda and on systems running in Docker without memory cgroups enabled.

Published by mnmkng over 4 years ago
- Added `Apify.utils.waitForRunToFinish()`, which simplifies waiting for an actor run to finish.
- … `async` handlers in `Apify.utils.puppeteer.addInterceptRequestHandler()`.
- … `cheerioCrawler.use()` function to enable attaching `CrawlerExtension`.
- … `SessionPool`.
- Updated `@apify/http-request` to fix an issue in the `proxy-agent` package.

Published by mnmkng over 4 years ago
- `CheerioCrawlerOptions.requestOptions` is now deprecated. Please use `CheerioCrawlerOptions.prepareRequestFunction` instead.
- Added a `limit` option to `Apify.utils.enqueueLinks()` for situations when full crawls are not needed.
- Added `suggestResponseEncoding` and `forceResponseEncoding` options to `CheerioCrawler` to allow …
- Fixed `Apify.utils.puppeteer.saveSnapshot()` when used locally.
- … `CheerioCrawler`.

Published by mnmkng over 4 years ago
- Fixed an issue where `SessionPool` would fail if a cookie included an invalid `expires` value.

Published by mnmkng over 4 years ago
- `Apify.utils.requestAsBrowser()` no longer aborts the request on status code 406 … `text/html` type is received. Use `options.abortFunction` if you want to …
- Added the `useInsecureHttpParser` option to `Apify.utils.requestAsBrowser()`, which is `true` by default and forces the function to use an HTTP parser that is less strict than …
- `RequestList` now removes all the elements from the `sources` array on … `RequestList` … `Request` instances.
- `RequestListOptions.persistSourcesKey` is now deprecated. Please use `RequestListOptions.persistRequestsKey`.
- `RequestListOptions.sources` can now be an array of `string` URLs as well.
- Added `sourcesFunction` to `RequestListOptions`. It enables dynamic fetching of sources … `Requests` were not retrieved from the key-value store.
- … `stealth` hiding of `webdriver` to avoid recent detections.
- `Apify.utils.log` now points to an updated logger instance which prints colored logs (in TTY) …
- … `Apify.launchPuppeteer()` code to prevent triggering bugs in Puppeteer by passing … to `puppeteer.launch()`.
- … `BasicCrawler.autoscaledPool` property, and added `CheerioCrawler.autoscaledPool` and `PuppeteerCrawler.autoscaledPool` properties.
- `SessionPool` now persists state on `teardown`. Before, it only persisted state every minute.
- Updated the `proxy-chain` NPM package from 0.2.7 to 0.4.1 and many other dependencies.
- … `request` package.

Published by mnmkng over 4 years ago
- `session.checkStatus()` -> `session.retireOnBlockedStatusCodes()`.
- The `Session` API is no longer considered experimental.