crawlee

Crawlee is a web scraping and browser automation library for Node.js that helps you build reliable crawlers in JavaScript and TypeScript. Use it to extract data for AI, LLMs, RAG, or GPTs, and to download HTML, PDF, JPG, PNG, and other files from websites. It works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation.

Apache-2.0 License

Downloads: 1.9M
Stars: 15.2K
Committers: 94


crawlee - v3.11.4

Published by apify-service-account 27 days ago

3.11.4 (2024-09-23)

Bug Fixes

  • SitemapRequestList.teardown() doesn't break persistState calls (#2673) (fb2c5cd), closes #2672
crawlee - v3.11.3

Published by apify-service-account about 2 months ago

3.11.3 (2024-09-03)

Bug Fixes

  • improve FACEBOOK_REGEX to match older style page URLs (#2650) (a005e69), closes #2216
  • RequestQueueV2: reset recently handled cache too if the queue is pending for too long (#2656) (51a69bc)
crawlee - v3.11.2

Published by apify-service-account about 2 months ago

3.11.2 (2024-08-28)

Bug Fixes

  • RequestQueueV2: remove inProgress cache, rely solely on locked states (#2601) (57fcb08)
  • use namespace imports for cheerio to be compatible with v1 (#2641) (f48296f)
  • Use the correct mutex in memory storage RequestQueueClient (#2623) (2fa8a29)

Features


This release pins the dependency on cheerio to the last RC version. We might postpone official support for v1 to the next major, or at least wait for them to fix their stuff. Nice demonstration of how not to maintain popular open source projects 😞

crawlee - v3.11.1

Published by apify-service-account 3 months ago

3.11.1 (2024-07-24)

Bug Fixes

crawlee - v3.11.0

Published by apify-service-account 3 months ago

3.11.0 (2024-07-09)

Features

  • add iframe expansion to parseWithCheerio in browsers (#2542) (328d085), closes #2507
  • add ignoreIframes opt-out from the Cheerio iframe expansion (#2562) (474a8dc)
  • Sitemap-based request list implementation (#2498) (7bf8f0b)
crawlee - v3.10.5

Published by apify-service-account 4 months ago

3.10.5 (2024-06-12)

Bug Fixes

  • allow creating new adaptive crawler instance without any parameters (9b7f595)
  • declare missing peer dependencies in @crawlee/browser package (#2532) (3357c7f)
  • fix detection of HTTP site when using the useState in adaptive crawler (#2530) (7e195c1)
  • mark context.request.loadedUrl and id as required inside the request handler (#2531) (2b54660)
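
The last fix above is a type-level change: inside a request handler the request has already been loaded, so `loadedUrl` and `id` can be typed as always present. A minimal sketch of that idea follows. `SketchRequest`, `WithRequired`, `LoadedSketchRequest`, and `describeLoaded` are hypothetical names used for illustration, not Crawlee's own declarations.

```typescript
// Hypothetical, simplified stand-in for a crawler request (not Crawlee's class).
interface SketchRequest {
    url: string;
    id?: string;        // optional until the request is stored in a queue
    loadedUrl?: string; // optional until the page has actually been loaded
}

// Make the listed optional keys required and leave every other key untouched.
type WithRequired<T, K extends keyof T> = T & { [P in K]-?: T[P] };

// The request as a handler would see it: both fields are guaranteed to exist.
type LoadedSketchRequest = WithRequired<SketchRequest, 'id' | 'loadedUrl'>;

// No optional chaining or non-null assertion is needed on `id` or `loadedUrl`.
function describeLoaded(request: LoadedSketchRequest): string {
    return `${request.id}: ${request.url} -> ${request.loadedUrl.toLowerCase()}`;
}
```

With a type like this, handler code that reads `request.loadedUrl` compiles under strict null checks without extra guards.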
crawlee - v3.10.4

Published by apify-service-account 4 months ago

3.10.4 (2024-06-11)

Bug Fixes

  • add waitForAllRequestsToBeAdded option to enqueueLinks helper (925546b), closes #2318
  • add missing useState implementation into crawling context (eec4a71)
  • make crawler.log publicly accessible (#2526) (3e9e665)
  • playwright: allow passing new context options in launchOptions on type level (0519d40), closes #1849
  • respect crawler.log when creating child logger for Statistics (0a0d75d), closes #2412
crawlee - v3.10.3

Published by apify-service-account 4 months ago

3.10.3 (2024-06-07)

Bug Fixes

  • adaptive-crawler: log only once for the committed request handler execution (#2524) (533bd3f)
  • increase timeout for retiring inactive browsers (#2523) (195f176)
  • respect implicit router when no requestHandler is provided in AdaptiveCrawler (#2518) (31083aa)
  • revert the scaling steps back to 5% (5bf32f8)

Features

  • add waitForSelector context helper + parseWithCheerio in adaptive crawler (#2522) (6f88e73)
  • log desired concurrency in the default status message (9f0b796)
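
To illustrate what a 5% scaling step means for the desired concurrency logged in the status message, here is a minimal sketch. It assumes the step is computed as a ratio of the current desired concurrency and rounded up. The release notes do not state the formula, and `scaleUp`, `scaleDown`, and `SCALE_STEP_RATIO` are hypothetical names.

```typescript
// The 5% step named in the release note; the formula below is an assumption.
const SCALE_STEP_RATIO = 0.05;

// Raise the desired concurrency by one step, never exceeding the maximum.
function scaleUp(desired: number, maxConcurrency: number, ratio: number = SCALE_STEP_RATIO): number {
    const step = Math.ceil(desired * ratio); // at least 1 for any positive value
    return Math.min(maxConcurrency, desired + step);
}

// Lower the desired concurrency by one step, never dropping below the minimum.
function scaleDown(desired: number, minConcurrency: number, ratio: number = SCALE_STEP_RATIO): number {
    const step = Math.ceil(desired * ratio);
    return Math.max(minConcurrency, desired - step);
}
```

Under these assumptions a small step changes concurrency gradually: a pool at 50 moves to 53 on one scale-up instead of jumping by a large fraction.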
crawlee - v3.10.2

Published by apify-service-account 5 months ago

3.10.2 (2024-06-03)

Bug Fixes

Features

crawlee - v3.10.1

Published by apify-service-account 5 months ago

3.10.1 (2024-05-23)

Bug Fixes

crawlee - v3.10.0

Published by apify-service-account 5 months ago

3.10.0 (2024-05-16)

Bug Fixes

  • EnqueueStrategy.All erroring with links using unsupported protocols (#2389) (8db3908)
  • conversion between tough cookies and browser pool cookies (#2443) (74f73ab)
  • fire local SystemInfo events every second (#2454) (1fa9a66)
  • use createSessionFunction when loading Session from persisted state (#2444) (3c56b4c)
  • do not drop statistics on migration/resurrection/resume (#2462) (8ce7dd4)
  • double tier decrement in tiered proxy (#2468) (3a8204b)
  • Fixed double extension for screenshots (#2419) (e8b39c4), closes #1980
  • malformed sitemap url when sitemap index child contains querystring (#2430) (e4cd41c)
  • return true when robots.isAllowed returns undefined (#2439) (6f541f8), closes #2437
  • sitemap content-type check breaks on content-type parameters (#2442) (db7d372)
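
The content-type fix above concerns headers such as `application/xml; charset=utf-8`, where a plain string comparison against the media type fails. A minimal sketch of a parameter-tolerant check follows. `SITEMAP_XML_TYPES`, `mediaType`, and `isXmlSitemapContentType` are hypothetical helpers, and the set of accepted types is an assumption, not Crawlee's actual list.

```typescript
// Hypothetical set of media types treated as XML sitemaps in this sketch.
const SITEMAP_XML_TYPES = new Set(['application/xml', 'text/xml']);

// Return the bare media type: drop everything after the first ';',
// trim surrounding whitespace, and lowercase for comparison.
function mediaType(contentType: string): string {
    const [type = ''] = contentType.split(';');
    return type.trim().toLowerCase();
}

// True when the header names an XML media type, with or without parameters.
function isXmlSitemapContentType(contentType: string | undefined): boolean {
    if (contentType === undefined) return false;
    return SITEMAP_XML_TYPES.has(mediaType(contentType));
}
```

Comparing only the media type keeps the check stable when servers append `charset` or other parameters.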

Features

Performance Improvements

  • improve scaling based on memory (#2459) (2d5d443)
  • optimize RequestList memory footprint (#2466) (12210bd)
  • optimize adding large amount of requests via crawler.addRequests() (#2456) (6da86a8)
crawlee - v3.9.2

Published by apify-service-account 6 months ago

3.9.2 (2024-04-17)

Bug Fixes

Features

crawlee - v3.9.1

Published by apify-service-account 6 months ago

3.9.1 (2024-04-11)

Features

crawlee - v3.9.0

Published by apify-service-account 6 months ago

3.9.0 (2024-04-10)

Bug Fixes

  • include actual key in error message of KVS' setValue (#2411) (9089bf1)
  • notify autoscaled pool about newly added requests (#2400) (a90177d)
  • puppeteer: allow passing networkidle to waitUntil in gotoExtended (#2399) (5d0030d), closes #2398
  • sitemaps support application/xml (#2408) (cbcf47a)

Features

crawlee - v3.8.2

Published by apify-service-account 7 months ago

3.8.2 (2024-03-21)

Bug Fixes

  • core: solve possible deadlocks in RequestQueueV2 (#2376) (ffba095)
  • correctly report gzip decompression errors (#2368) (84a2f17)
  • puppeteer: improve detection of older versions (98d4e86), closes #2370
  • use 0 (number) instead of false as default for sessionRotationCount (#2372) (667a3e7)

Features

  • implement global storage access checking and use it to prevent unwanted side effects in adaptive crawler (#2371) (fb3b7da), closes #2364
crawlee - v3.8.1

Published by apify-service-account 8 months ago

3.8.1 (2024-02-22)

Bug Fixes

  • fix crawling context type in router.addHandler() (#2355) (d73c202)
crawlee - v3.8.0

Published by apify-service-account 8 months ago

3.8.0 (2024-02-21)

Bug Fixes

Features

  • KeyValueStore.recordExists() (#2339) (8507a65)
  • accessing crawler state, key-value store and named datasets via crawling context (#2283) (58dd5fc)
  • adaptive playwright crawler (#2316) (8e4218a)
  • add Sitemap.tryCommonNames to check well known sitemap locations (#2311) (85589f1), closes #2307
  • ci: snapshot docs automatically on minor/major publish (#2344) (092f51e)
  • core: add userAgent parameter to RobotsFile.isAllowed() + RobotsFile.from() helper (#2338) (343c159)
  • Support plain-text sitemap files (sitemap.txt) (#2315) (0bee7da)
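
A plain-text sitemap lists one absolute URL per line. As a rough illustration of what supporting `sitemap.txt` involves, here is a minimal parsing sketch. `parseTextSitemap` is a hypothetical helper, not Crawlee's implementation, and restricting results to `http:` and `https:` is an assumption made for this example.

```typescript
// Parse the body of a plain-text sitemap into a list of absolute URLs.
function parseTextSitemap(body: string): string[] {
    const urls: string[] = [];
    for (const line of body.split(/\r?\n/)) {
        const candidate = line.trim();
        if (candidate === '') continue; // skip blank lines
        try {
            // `new URL` throws for anything that is not an absolute URL.
            const { protocol } = new URL(candidate);
            if (protocol === 'http:' || protocol === 'https:') urls.push(candidate);
        } catch {
            // Not a parseable absolute URL: ignore the line.
        }
    }
    return urls;
}
```

Lines that are empty, malformed, or use another protocol are dropped without failing the whole file.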
crawlee - v3.7.3

Published by apify-service-account 9 months ago

3.7.3 (2024-01-30)

Bug Fixes

crawlee - v3.7.2

Published by apify-service-account 9 months ago

3.7.2 (2024-01-09)

Bug Fixes

  • RequestQueue: always clear locks when a request is reclaimed (#2263) (0fafe29), closes #2262
crawlee - v3.7.1

Published by apify-service-account 10 months ago

3.7.1 (2024-01-02)

Bug Fixes

  • ES2022 build compatibility and move to NodeNext for module (#2258) (7fe1e68), closes #2257