crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.

Apache-2.0 License

Downloads: 148 · Stars: 15.2K


crawlee - v0.19.1

Published by mnmkng over 4 years ago

  • BREAKING (EXPERIMENTAL): session.checkStatus() -> session.retireOnBlockedStatusCodes() (see the sketch after this list).
  • Session API is no longer considered experimental.
  • Updates documentation and introduces a few internal changes.
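
A minimal sketch of the renamed session method inside CheerioCrawler, assuming the useSessionPool option introduced in v0.18.0 and that handlePageFunction receives the response and session objects; the exact parameters of retireOnBlockedStatusCodes() may differ:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ url: 'https://example.com' }],
    });
    await requestList.initialize();

    const crawler = new Apify.CheerioCrawler({
        requestList,
        useSessionPool: true, // assumed option name, see the v0.18.0 notes below
        handlePageFunction: async ({ $, request, response, session }) => {
            // Replaces the former session.checkStatus(): retires the session
            // when the response status code indicates it has been blocked.
            session.retireOnBlockedStatusCodes(response.statusCode);
            await Apify.pushData({ url: request.url, title: $('title').text() });
        },
    });

    await crawler.run();
});
```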
crawlee - v0.19.0

Published by mnmkng over 4 years ago

  • BREAKING: APIFY_LOCAL_EMULATION_DIR env var is no longer supported (deprecated on 2018-09-11).
    Use APIFY_LOCAL_STORAGE_DIR instead (see the sketch after this list).
  • SessionPool API updates and fixes. The API is no longer considered experimental.
  • Logging of system info moved from require time to Apify.main() invocation.
  • Use native RegExp instead of xregexp for unicode property escapes.
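
A minimal sketch of the replacement variable; the path is just an example and would normally be set in the shell rather than in code:

```js
// APIFY_LOCAL_EMULATION_DIR is no longer read; use APIFY_LOCAL_STORAGE_DIR.
process.env.APIFY_LOCAL_STORAGE_DIR = './my_storage'; // example path

const Apify = require('apify');

Apify.main(async () => {
    // Local datasets, key-value stores and request queues are now persisted
    // under the directory configured above.
    const dataset = await Apify.openDataset();
    await dataset.pushData({ hello: 'world' });
});
```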
crawlee - v0.18.1

Published by mnmkng almost 5 years ago

  • Fix SessionPool not automatically working in CheerioCrawler.
  • Fix incorrect management of page count in PuppeteerPool.
crawlee - v0.18.0

Published by petrpatek almost 5 years ago

  • BREAKING: CheerioCrawler now ignores SSL errors by default (options.ignoreSslErrors: true).
    See the sketch after this list to restore strict checking.
  • Add SessionPool implementation to CheerioCrawler.
  • Add SessionPool implementation to PuppeteerPool and PuppeteerCrawler.
  • Fix Request constructor not making a copy of objects such as userData and headers.
  • Fix desc option not being applied in local dataset.getData().
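
A minimal sketch of opting back into strict SSL checking after this change; everything except ignoreSslErrors is plain scaffolding:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ url: 'https://example.com' }],
    });
    await requestList.initialize();

    const crawler = new Apify.CheerioCrawler({
        requestList,
        // SSL errors are ignored by default since v0.18.0; set this to false
        // to restore strict certificate checking.
        ignoreSslErrors: false,
        handlePageFunction: async ({ $, request }) => {
            await Apify.pushData({ url: request.url, title: $('title').text() });
        },
    });

    await crawler.run();
});
```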
crawlee - v0.17.0

Published by mnmkng almost 5 years ago

  • BREAKING: Node 8 and 9 are no longer supported. Please use Node 10.17.0 or higher.
  • DEPRECATED: Apify.callTask() body and contentType options are now deprecated.
    Use input instead; it must be of content type application/json (see the sketch after this list).
  • Add default SessionPool implementation to BasicCrawler.
  • Add the ability to create ad-hoc webhooks via Apify.call() and Apify.callTask().
  • Add an example of form filling with Puppeteer.
  • Add country option to Apify.getApifyProxyUrl().
  • Add Apify.utils.puppeteer.saveSnapshot() helper to quickly save HTML and screenshot of a page.
  • Add the ability to pass got supported options to requestOptions in CheerioCrawler
    thus supporting things such as cookieJar again.
  • Switch Puppeteer to web socket again due to suspected pipe errors.
  • Fix an issue where some encodings were not correctly parsed in CheerioCrawler.
  • Fix parsing bad Content-Type headers for CheerioCrawler.
  • Fix custom headers not being correctly applied in Apify.utils.requestAsBrowser().
  • Fix dataset limits not being correctly applied.
  • Fix a race condition in RequestQueueLocal.
  • Fix RequestList persistence of downloaded sources in key-value store.
  • Fix Apify.utils.puppeteer.blockRequests() always including default patterns.
  • Fix inconsistent behavior of Apify.utils.puppeteer.infiniteScroll() on some websites.
  • Fix retry histogram statistics sometimes showing invalid counts.
  • Added regular expressions for YouTube videos (YOUTUBE_REGEX, YOUTUBE_REGEX_GLOBAL) to utils.social.
  • Added documentation for the json option in handlePageFunction of CheerioCrawler.
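
A minimal sketch of the new input style for Apify.callTask() and the country option of Apify.getApifyProxyUrl(); the task ID and input fields are placeholders, and both calls expect platform credentials (APIFY_TOKEN, APIFY_PROXY_PASSWORD) in the environment:

```js
const Apify = require('apify');

Apify.main(async () => {
    // Pass a JSON-serializable object as input instead of the deprecated
    // body/contentType options. 'my-user/my-task' is a placeholder task ID.
    const run = await Apify.callTask('my-user/my-task', {
        startUrls: [{ url: 'https://example.com' }],
    });
    console.log(`Task run ${run.id} finished with status: ${run.status}`);

    // The new country option limits Apify Proxy to servers in a given country.
    const proxyUrl = Apify.getApifyProxyUrl({ country: 'US' });
    console.log(`Proxy URL: ${proxyUrl}`);
});
```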
crawlee - v0.16.1

Published by drobnikj almost 5 years ago

  • Add useIncognitoPages option to PuppeteerPool to enable opening new pages in incognito
    browser contexts. This is useful to keep cookies and cache unique for each page.
  • Added options to load every content type in CheerioCrawler. There are new body and contentType
    options in handlePageFunction for this purpose (see the sketch after this list).
  • DEPRECATED: The CheerioCrawler html option in handlePageFunction was replaced with the body option.
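
A minimal sketch of the new body and contentType parameters, assuming contentType is the parsed { type, encoding } object; the JSON URL is illustrative:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ url: 'https://example.com/data.json' }], // illustrative URL
    });
    await requestList.initialize();

    const crawler = new Apify.CheerioCrawler({
        requestList,
        handlePageFunction: async ({ request, body, contentType }) => {
            // body replaces the deprecated html parameter and is also available
            // for non-HTML responses, where Cheerio's $ would not help.
            if (contentType.type === 'application/json') {
                await Apify.pushData({ url: request.url, data: JSON.parse(body.toString()) });
            }
        },
    });

    await crawler.run();
});
```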
crawlee - v0.16.0

Published by mnmkng about 5 years ago

  • Update @apify/http-request to version 1.1.2.
  • Update CheerioCrawler to use requestAsBrowser() to better disguise as a real browser.
crawlee - v0.15.5

Published by mnmkng about 5 years ago

  • This release just updates some dependencies (not Puppeteer).
crawlee - v0.15.4

Published by mnmkng about 5 years ago

  • DEPRECATED: dataset.delete(), keyValueStore.delete() and requestQueue.delete() methods have been deprecated in favor of *.drop() methods, because the drop name more clearly communicates the fact that those methods drop / delete the storage itself, not individual elements in the storage.
  • Added Apify.utils.requestAsBrowser() helper function that enables you to make HTTP(S) requests disguised as a browser (Firefox). This may help in overcoming certain anti-scraping and anti-bot protections (see the sketch after this list).
  • Added options.gotoTimeoutSecs to PuppeteerCrawler to enable easier setting of navigation timeouts.
  • PuppeteerPool options that were deprecated from the PuppeteerCrawler constructor were finally removed. Please use maxOpenPagesPerInstance, retireInstanceAfterRequestCount, instanceKillerIntervalSecs, killInstanceAfterSecs and proxyUrls via the puppeteerPoolOptions object.
  • On the Apify Platform a warning will now be printed when using an outdated apify package version.
  • Apify.utils.puppeteer.enqueueLinksByClickingElements() will now print a warning when the nodes it
    tries to click become modified (detached from the DOM). This is useful for debugging unexpected behavior.
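
A minimal sketch of requestAsBrowser() together with the renamed drop() method; the dataset name is just an example:

```js
const Apify = require('apify');

Apify.main(async () => {
    // Plain HTTP(S) request that disguises itself as Firefox.
    const response = await Apify.utils.requestAsBrowser({ url: 'https://example.com' });
    console.log(response.statusCode, response.body.length);

    // Storages are now removed with drop() instead of the deprecated delete().
    const dataset = await Apify.openDataset('temporary-results'); // example name
    await dataset.pushData({ bodyLength: response.body.length });
    await dataset.drop();
});
```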
crawlee - v0.15.3

Published by mnmkng about 5 years ago

  • Apify.launchPuppeteer() now accepts proxyUrl with the https, socks4
    and socks5 schemes, as long as it doesn't contain a username or password.
    This fixes Issue #420 (see the sketch after this list).
  • Added the desiredConcurrency option to the AutoscaledPool constructor and removed
    an unnecessary bounds check from the setter property.
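
A minimal sketch of launching Puppeteer through a SOCKS5 proxy; the proxy address is a placeholder and must not contain credentials:

```js
const Apify = require('apify');

Apify.main(async () => {
    // proxyUrl may now use the https, socks4 or socks5 schemes,
    // as long as it carries no username or password.
    const browser = await Apify.launchPuppeteer({
        proxyUrl: 'socks5://127.0.0.1:9050', // placeholder proxy address
    });

    const page = await browser.newPage();
    await page.goto('https://example.com');
    console.log(await page.title());

    await browser.close();
});
```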
crawlee - v0.15.2

Published by mnmkng over 5 years ago

  • Fix error where Puppeteer would fail to launch when pipes are turned off.
  • Switch back to default Web Socket transport for Puppeteer due to upstream issues.
crawlee - v0.15.1

Published by mnmkng over 5 years ago

  • BREAKING CHANGE: Removed support for Web Driver (Selenium), since no further updates are planned.
    If you wish to continue using Web Driver, please stay on Apify SDK version ^0.14.15.
  • BREAKING CHANGE: Dataset.getData() throws an error if user provides an unsupported option
    when using local disk storage.
  • DEPRECATED: options.userData of Apify.utils.enqueueLinks() is deprecated.
    Use options.transformRequestFunction instead (see the sketch at the end of this list).
  • Improve logging of memory overload errors.
  • Improve error message in Apify.call().
  • Fix multiple log lines appearing when a crawler was about to finish.
  • Add Apify.utils.puppeteer.enqueueLinksByClickingElements() function which enables you
    to add requests to the queue from pure JavaScript navigations, form submissions etc.
  • Add Apify.utils.puppeteer.infiniteScroll() function which helps you with scrolling to the bottom
    of websites that auto-load new content.
  • The RequestQueue.handledCount() function has been resurrected from deprecation
    in order to have a compatible interface with RequestList.
  • Add useExtendedUniqueKey option to Request constructor to include method and payload
    in the Request's computed uniqueKey.
  • Updated Puppeteer to 1.18.1.
  • Updated apify-client to 0.5.22.
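
A minimal sketch combining infiniteScroll() with enqueueLinks() and the new transformRequestFunction; the selector and the timeoutSecs option are assumptions made for illustration:

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com' });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ page, request }) => {
            // Scroll to the bottom of pages that lazy-load their content.
            await Apify.utils.puppeteer.infiniteScroll(page, { timeoutSecs: 30 });

            // transformRequestFunction replaces the deprecated userData option.
            await Apify.utils.enqueueLinks({
                page,
                requestQueue,
                selector: 'a', // illustrative selector
                transformRequestFunction: (req) => {
                    req.userData = { referrer: request.url };
                    return req;
                },
            });
        },
    });

    await crawler.run();
});
```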