Crawlee: a web scraping and browser automation library for Node.js that helps you build reliable crawlers in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation.
APACHE-2.0 License
Published by mnmkng over 3 years ago
After 3.5 years of rapid development and a lot of breaking changes and deprecations, here comes the result: Apify SDK v1. There were two goals for this release: stability, and support for more browsers, namely Firefox and WebKit (Safari).

The SDK has grown quite popular over the years, powering thousands of web scraping and automation projects. We think our developers deserve a stable environment to work in, and by releasing SDK v1, we commit to making breaking changes only once a year, with a new major release.
We added support for more browsers by replacing `PuppeteerPool` with `browser-pool`, a new library that we created specifically for this purpose. It builds on the ideas from `PuppeteerPool` and extends them to support Playwright. Playwright is a browser automation library similar to Puppeteer. It works with all well-known browsers and uses almost the same interface as Puppeteer, while adding useful features and simplifying common tasks. Don't worry, you can still use Puppeteer with the new `BrowserPool`.
A large breaking change is that neither `puppeteer` nor `playwright` is bundled with SDK v1. To make the choice of a library easier and installs faster, users will have to install the selected modules and versions themselves. This allows us to add support for even more libraries in the future.
Thanks to the addition of Playwright, we now have a `PlaywrightCrawler`. It is very similar to `PuppeteerCrawler`, and you can pick whichever you prefer. It also means we needed to make some interface changes. The `launchPuppeteerFunction` option of `PuppeteerCrawler` is gone, and `launchPuppeteerOptions` were replaced by `launchContext`. We also moved things around in the `handlePageFunction` arguments. See the migration guide for a more detailed explanation and migration examples.
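As a sketch of the migration, assuming only the option names mentioned above, the new shape looks roughly like this (values and the handler body are illustrative, not taken from the migration guide):

```javascript
// Sketch of the v1 options shape (illustrative values). The old
// `launchPuppeteerOptions` are replaced by `launchContext`, and pool
// references moved under the new `crawler` argument. In a real project
// this object would be passed to `new Apify.PuppeteerCrawler(...)`.
const crawlerOptions = {
  launchContext: {
    // launcher-specific options now live inside the launch context
    launchOptions: { headless: true },
  },
  handlePageFunction: async ({ request, page, crawler }) => {
    // `crawler.autoscaledPool` replaces the old `autoscaledPool` argument
    return { url: request.url, title: await page.title() };
  },
};
```

With `PlaywrightCrawler` the same shape applies; only the underlying launcher differs.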
What's in store for SDK v2? We want to split the SDK into smaller libraries, so that everyone can install only the things they need. We plan a TypeScript migration to make crawler development faster and safer. Finally, we will take a good look at the interface of the whole SDK and update it to improve the developer experience. Bug fixes and scraping features will of course keep landing in versions 1.X as well.
- Removed `puppeteer` from dependencies. If you want to use Puppeteer, you must install it yourself.
- Removed `PuppeteerPool`. Use `browser-pool`.
- Removed `PuppeteerCrawlerOptions.launchPuppeteerOptions`. Use `launchContext`.
- Removed `PuppeteerCrawlerOptions.launchPuppeteerFunction`. Use `PuppeteerCrawlerOptions.preLaunchHooks` and `postLaunchHooks`.
- Removed `args.autoscaledPool` and `args.puppeteerPool` from `handle(Page/Request)Function` arguments. Use `args.crawler.autoscaledPool` and `args.crawler.browserPool`.
- The `useSessionPool` and `persistCookiesPerSession` options of crawlers are now `true` by default. Explicitly set them to `false` to override the behavior.
- `Apify.launchPuppeteer()` no longer accepts `LaunchPuppeteerOptions`. It now accepts `PuppeteerLaunchContext`.
- Removed `PuppeteerCrawlerOptions.gotoFunction`. Use `PuppeteerCrawlerOptions.preNavigationHooks` and `postNavigationHooks`.
- Removed `Apify.utils.puppeteer.enqueueLinks()`. Deprecated in 01/2019. Use `Apify.utils.enqueueLinks()`.
- Removed `autoscaledPool.(set|get)MaxConcurrency()`. Deprecated in 2019. Use `autoscaledPool.maxConcurrency`.
- Removed `CheerioCrawlerOptions.requestOptions`. Deprecated in 03/2020. Use `CheerioCrawlerOptions.prepareRequestFunction`.
- Removed `Launch.requestOptions`. Deprecated in 03/2020. Use `CheerioCrawlerOptions.prepareRequestFunction`.
- Added `Apify.PlaywrightCrawler`, which is almost identical to `PuppeteerCrawler`, but it crawls with the `playwright` library.
- Added the `Apify.launchPlaywright(launchContext)` helper function.
- Added `browserPoolOptions` to `PuppeteerCrawler` to configure `BrowserPool`.
- Added `crawler` to `handle(Request/Page)Function` arguments.
- Added `browserController` to `handlePageFunction` arguments.
- Added the `crawler.crawlingContexts` `Map`, which includes all running `crawlingContext`s.

Published by mnmkng almost 4 years ago
- Fixed `Apify.pushData()` and `keyValueStore.forEachKey()` by updating `@apify/storage-local` to `1.0.2`.

Published by mnmkng almost 4 years ago
- Pinned `cheerio` to `1.0.0-rc.3` to avoid install problems in some builds.
- Increased the default `maxEventLoopOverloadedRatio` in `SystemStatusOptions` to 0.6.

Published by mnmkng almost 4 years ago
This is the last major release before SDK v1.0.0. We're committed to delivering v1 at the end of 2020, so stay tuned. Besides Playwright integration via a new `BrowserPool`, it will be the first release of the SDK that we'll support for an extended period of time. We will not make any breaking changes until 2.0.0, which will come at the end of 2021. But enough about v1, let's see the changes in 0.22.0.
In this release we've changed a lot of code, but you may not even notice. We've updated the underlying `apify-client` package, which powers all communication with the Apify API, to version `1.0.0`. This means a completely new API for all internal calls. If you use `Apify.client` calls in your code, this will be a large breaking change for you. Visit the client docs to see what's new in the client, but also note that we removed the default client available under `Apify.client` and replaced it with an `Apify.newClient()` function. We think it's better to have separate clients for users and internal use.
Until now, local emulation of Apify Storages has been a part of the SDK. We moved the logic into a separate package, `@apify/storage-local`, which shares its interface with `apify-client`. `RequestQueue` is now powered by `SQLite3` instead of the file system, which improves reliability and performance quite a bit. `Dataset` and `KeyValueStore` still use the file system, for easy browsing of data. The structure of the `apify_storage` folder remains unchanged.
After collecting common developer mistakes, we've decided to make argument validation stricter. You will no longer be able to pass extra arguments to functions and constructors. This is to alleviate the frustration when you mistakenly pass `useChrome` to `PuppeteerPoolOptions` instead of `LaunchPuppeteerOptions` and don't realize it. Before this version, the SDK wouldn't let you know and would silently continue with Chromium. Now, it will throw an error saying that `useChrome` is not an allowed property of `PuppeteerPoolOptions`.
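The idea can be illustrated with a minimal sketch. This is not the SDK's actual validation code, just the principle of rejecting unknown properties instead of silently ignoring them (`maxOpenPagesPerInstance` is used here only as an example allowed key):

```javascript
// Illustrative strict validation: throw on properties that are not part
// of the declared signature, instead of silently dropping them.
function validateOptions(options, allowedKeys) {
  for (const key of Object.keys(options)) {
    if (!allowedKeys.includes(key)) {
      throw new Error(`${key} is not an allowed property`);
    }
  }
}

// Passing `useChrome` where it is not allowed now fails loudly,
// instead of the crawler silently continuing with Chromium.
let caught = null;
try {
  validateOptions({ useChrome: true }, ['maxOpenPagesPerInstance']);
} catch (err) {
  caught = err;
}
```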
Based on developer feedback, we decided to remove `--no-sandbox` from the default Puppeteer launch args. It will only be used on the Apify Platform. This gives you the chance to use your own sandboxing strategy.
`LiveViewServer` and `puppeteerPoolOptions.useLiveView` were never very user-friendly or performant solutions, due to the inherent performance issues with rapidly taking many screenshots in Puppeteer, so we've decided to remove them. If you need similar functionality, try the `devtools-server` NPM package, which utilizes the Chrome DevTools frontend for screen-casting a live view of the running browser.
Full list of changes:
- BREAKING: Updated `apify-client` to `1.0.0` with a completely new interface. We also removed the `Apify.client` property and replaced it with an `Apify.newClient()` function that creates a new `ApifyClient` instance.
- BREAKING: Removed `--no-sandbox` from default Puppeteer launch arguments. This will most likely be breaking for Linux and Docker users.
- BREAKING: Function argument validation is now stricter and will not accept extra parameters which are not defined by the functions' signatures.
- DEPRECATED: `puppeteerPoolOptions.useLiveView` is now deprecated. Use the `devtools-server` NPM package instead.
- Added `postResponseFunction` to `CheerioCrawlerOptions`. It allows you to override properties on the HTTP response before processing by `CheerioCrawler`.
- Added HTTP2 support to `utils.requestAsBrowser()`. Set `useHttp2` to `true` in `RequestAsBrowserOptions` to enable it.
- Fixed handling of XML content types in `CheerioCrawler`.
- Fixed capitalization of headers when using `utils.puppeteer.addInterceptRequestHandler`.
- Fixed `utils.puppeteer.saveSnapshot()` overwriting screenshots with HTML when running locally.
- Updated `puppeteer` to version `5.4.1` with Chrom(ium) 87.
- Removed `RequestQueueLocal` in favor of the `@apify/storage-local` API emulator.
- Removed `KeyValueStoreLocal` in favor of the `@apify/storage-local` API emulator.
- Removed `DatasetLocal` in favor of the `@apify/storage-local` API emulator.
- Removed the `userData` option from `Apify.utils.enqueueLinks` (deprecated in Jun 2019). Use `transformRequestFunction` instead.
- Removed `instanceKillerIntervalMillis` and `killInstanceAfterMillis` (deprecated in Feb 2019). Use `instanceKillerIntervalSecs` and `killInstanceAfterSecs` instead.
- Removed the `memory` option from `Apify.call` `options` (deprecated in 2018). Use `memoryMbytes` instead.
- Removed `delete()` methods from `Dataset`, `KeyValueStore` and `RequestQueue` (deprecated in Jul 2019). Use `.drop()`.
- Removed `utils.puppeteer.hideWebDriver()` (deprecated in May 2019). Use `LaunchPuppeteerOptions.stealth`.
- Removed `utils.puppeteer.enqueueRequestsFromClickableElements()` (deprecated in 2018). Use `utils.puppeteer.enqueueLinksByClickingElements`.
- Removed `request.doNotRetry()` (deprecated in June 2019). Use `request.noRetry = true`.
- Removed `RequestListOptions.persistSourcesKey` (deprecated in Feb 2020). Use `persistRequestsKey`.
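One item above, HTTP2 support in `utils.requestAsBrowser()`, can be sketched as follows. The option names follow the list; the actual network call is commented out so the snippet stands alone, and the URL is a placeholder:

```javascript
// Sketch: enabling HTTP2 in utils.requestAsBrowser() (0.22.0+).
const requestOptions = {
  url: 'https://example.com', // placeholder URL
  useHttp2: true,             // opt in to the new HTTP2 support
};
// In a real project, with the Apify SDK installed:
// const { body } = await Apify.utils.requestAsBrowser(requestOptions);
```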
Published by mnmkng almost 4 years ago
Published by mnmkng almost 4 years ago
- … `stealth`.
- Fixed `SessionPool` not retiring sessions immediately when they become unusable. It fixes a problem where `PuppeteerPool` would not retire browsers with bad sessions.

Published by mnmkng about 4 years ago
- Made `PuppeteerCrawler` safe against malformed Puppeteer responses.

Published by mnmkng about 4 years ago
- Fixed random crashes in `PuppeteerCrawler` caused by `page.goto()` randomly returning `null`.

Published by mnmkng about 4 years ago
It appears that `CheerioCrawler` was correctly retiring sessions on timeouts and blocked status codes (401, 403, 429), whereas `PuppeteerCrawler` did not. Apologies for the omission; this release fixes the problem.

- … `PuppeteerCrawler`.
- … `PuppeteerCrawler`.
- Updated `apify-shared` to version `0.5.0`.

Published by mnmkng about 4 years ago
This is a very minor release that fixes some issues that were preventing use of the SDK with Node 14.

- … `RequestList`.
Published by mnmkng about 4 years ago
The request statistics that you may remember from logs are now persisted in the key-value store, so you won't lose count when your actor restarts. We've also added a lot of useful stats there, which you can inspect after a run finishes. Besides that, we fixed some bugs and annoyances and improved the TypeScript experience a bit.

- Added the `Statistics` class and automatically persist it in `BasicCrawler`.
- … `ProxyConfiguration` to throw …
- Fixed `RequestAsBrowserOptions` missing some values and added `RequestQueueInfo` … `requestQueue.getInfo()`.
Published by mnmkng about 4 years ago
Published by mnmkng about 4 years ago
Published by mnmkng over 4 years ago
We fixed some bugs, improved a few things and bumped Puppeteer to match the latest Chrome 84.

- … `Apify.createProxyConfiguration` to be used seamlessly with the proxy component.
- … `CheerioCrawler` with the `crawler.use()` function.
- … `RequestQueueLocal` to fail handling requests.
- … `SessionPool`.
- … `ProxyConfiguration` error message for missing password / token.

Published by mnmkng over 4 years ago
This release comes with breaking changes that will affect most, if not all, of your projects. See the migration guide for more information and examples.

The first large change is a redesigned proxy configuration. `Cheerio` and `Puppeteer` crawlers now accept a `proxyConfiguration` parameter, which is an instance of `ProxyConfiguration`. This class now exclusively manages both Apify Proxy and custom proxies. Visit the new proxy management guide. We also removed `Apify.utils.getRandomUserAgent()` as it was no longer effective in avoiding bot detection, and changed the default values for empty properties in `Request` instances.
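The new flow described above can be sketched as follows. The `Apify.createProxyConfiguration()` call is commented out (it needs the SDK and platform credentials), and the stand-in object below only illustrates the `newUrl(sessionId)` interface; its URL format is invented for illustration:

```javascript
// In a real project the configuration comes from the SDK:
// const proxyConfiguration = await Apify.createProxyConfiguration();
// Stand-in object illustrating the interface described above: each
// session id maps to a stable proxy URL, enabling session-based rotation.
const proxyConfiguration = {
  newUrl(sessionId) {
    // invented URL shape, for illustration only
    return `http://proxy.example.com:8000/?session=${sessionId}`;
  },
};

const urlA = proxyConfiguration.newUrl('session_a');
const urlB = proxyConfiguration.newUrl('session_b');
```

The same `proxyConfiguration` instance is what the crawlers accept via their new `proxyConfiguration` option.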
- Removed `Apify.getApifyProxyUrl()`. To get an Apify Proxy URL, use `proxyConfiguration.newUrl([sessionId])`.
- Removed the `useApifyProxy`, `apifyProxyGroups` and `apifyProxySession` parameters from all applications in the SDK. Use `proxyConfiguration` in crawlers and `proxyUrl` in `requestAsBrowser` and `Apify.launchPuppeteer`.
- Removed `Apify.utils.getRandomUserAgent()` as it was no longer effective in avoiding bot detection.
- `Request` instances no longer initialize empty properties with `null`, which means that empty `errorMessages` are now represented by `[]`, and `loadedUrl`, `payload` and `handledAt` are `undefined`.
- Added the `Apify.createProxyConfiguration()` `async` function to create `ProxyConfiguration` instances. `ProxyConfiguration` itself is not exposed.
- Added `proxyConfiguration` to `CheerioCrawlerOptions` and `PuppeteerCrawlerOptions`.
- Added `proxyInfo` to `CheerioHandlePageInputs` and `PuppeteerHandlePageInputs`. You can use this object to retrieve information about the currently used proxy in `Puppeteer` and `Cheerio` crawlers.
- … `Apify.utils.puppeteer.infiniteScroll()`.
- Fixed an issue where `Apify.utils.requestAsBrowser()` would get into redirect loops.
- Fixed `Apify.utils.getMemoryInfo()` crashing the process on AWS Lambda and on systems running in Docker without memory cgroups enabled.

Published by mnmkng over 4 years ago
- Added `Apify.utils.waitForRunToFinish()`, which simplifies waiting for an actor run to finish.
- … `async` handlers in `Apify.utils.puppeteer.addInterceptRequestHandler()`.
- … `cheerioCrawler.use()` function to enable attaching `CrawlerExtension`.
- … `SessionPool`.
- Updated `@apify/http-request` to fix an issue in the `proxy-agent` package.

Published by mnmkng over 4 years ago
- `CheerioCrawlerOptions.requestOptions` is now deprecated. Please use `CheerioCrawlerOptions.prepareRequestFunction` instead.
- Added a `limit` option to `Apify.utils.enqueueLinks()` for situations when full crawls are not needed.
- Added `suggestResponseEncoding` and `forceResponseEncoding` options to `CheerioCrawler` to allow …
- Fixed `Apify.utils.puppeteer.saveSnapshot()` when used locally.
- … `CheerioCrawler`.

Published by mnmkng over 4 years ago
- Fixed an issue where `SessionPool` would fail if a cookie included an invalid `expires` value.

Published by mnmkng over 4 years ago
- `Apify.utils.requestAsBrowser()` no longer aborts the request on status code 406 … `text/html` type is received. Use `options.abortFunction` if you want to …
- Added the `useInsecureHttpParser` option to `Apify.utils.requestAsBrowser()`, which is `true` by default and forces the function to use an HTTP parser that is less strict than …
- `RequestList` now removes all the elements from the `sources` array on … `RequestList` … `Request` instances.
- `RequestListOptions.persistSourcesKey` is now deprecated. Please use `RequestListOptions.persistRequestsKey`.
- `RequestListOptions.sources` can now be an array of `string` URLs as well.
- Added `sourcesFunction` to `RequestListOptions`. It enables dynamic fetching of sources … `Requests` were not retrieved from the key-value store.
- … `stealth` hiding of `webdriver` to avoid recent detections.
- `Apify.utils.log` now points to an updated logger instance which prints colored logs (in TTY) …
- … `Apify.launchPuppeteer()` code to prevent triggering bugs in Puppeteer by passing … to `puppeteer.launch()`.
- … `BasicCrawler.autoscaledPool` property, and added `CheerioCrawler.autoscaledPool` and `PuppeteerCrawler.autoscaledPool` properties.
- `SessionPool` now persists state on `teardown`. Before, it only persisted state every minute.
- Updated the `proxy-chain` NPM package from 0.2.7 to 0.4.1 and many other dependencies.
- … `request` package.

Published by mnmkng over 4 years ago
- `session.checkStatus()` -> `session.retireOnBlockedStatusCodes()`.
- The `Session` API is no longer considered experimental.