crawlee

Crawlee is a web scraping and browser automation library for Node.js, written for JavaScript and TypeScript, for building reliable crawlers. Use it to extract data for AI, LLMs, RAG, or GPTs, and to download HTML, PDF, JPG, PNG, and other files from websites. It works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, runs in both headful and headless mode, and supports proxy rotation.

APACHE-2.0 License

Downloads
148
Stars
15.2K

crawlee - v3.7.1

Published by apify-service-account 10 months ago

3.7.1 (2024-01-02)

Bug Fixes

  • ES2022 build compatibility and move to NodeNext for module (#2258) (7fe1e68), closes #2257
crawlee - v3.7.0

Published by apify-service-account 10 months ago

3.7.0 (2023-12-21)

Bug Fixes

  • retryOnBlocked doesn't override the blocked HTTP codes (#2243) (81672c3)
  • browser-pool: respect user options before assigning fingerprints (#2190) (f050776), closes #2164
  • filter out empty globs (#2205) (41322ab), closes #2200
  • make CLI work on Windows too with --no-purge (#2244) (83f3179)
  • make SessionPool queue up getSession calls to prevent overruns (#2239) (0f5665c), closes #1667
  • MemoryStorage: lock request JSON file when reading to support multiple process crawling (#2215) (eb84ce9)
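To illustrate the "filter out empty globs" fix above, here is a minimal sketch of the idea: remove empty or whitespace-only patterns before any URL matching happens. The `GlobEntry` type and the `dropEmptyGlobs` helper are hypothetical names chosen for this example; they are not Crawlee's actual typings or implementation.

```typescript
// Hypothetical shape of a glob entry; not copied from Crawlee's typings.
type GlobEntry = string | { glob: string };

// Drop entries whose pattern is empty or whitespace-only, so they can
// never take part in URL matching.
function dropEmptyGlobs(globs: GlobEntry[]): GlobEntry[] {
    return globs.filter((entry) => {
        const pattern = typeof entry === 'string' ? entry : entry.glob;
        return pattern.trim().length > 0;
    });
}

const cleanedGlobs = dropEmptyGlobs([
    'https://example.com/**',
    '',
    { glob: '   ' },
    { glob: '**/docs/*' },
]);
console.log(cleanedGlobs.length); // 2
```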

Features

crawlee - v3.6.2

Published by apify-service-account 11 months ago

3.6.2 (2023-11-26)

Bug Fixes

  • prevent race condition in KeyValueStore.getAutoSavedValue() (#2193) (e340e2b)
crawlee - v3.6.1

Published by apify-service-account 11 months ago

3.6.1 (2023-11-15)

Bug Fixes

  • ts: ignore import errors for got-scraping (012fc9e)
  • ts: specify type explicitly for logger (aec3550)

Features

crawlee - v3.6.0

Published by apify-service-account 11 months ago

3.6.0 (2023-11-15)

Bug Fixes

  • add skipNavigation option to enqueueLinks (#2153) (118515d)
  • BrowserPool: ignore --no-sandbox flag for webkit launcher (#2148) (1eb2f08), closes #1797
  • core: respect some advanced options for RequestList.open() + improve docs (#2158) (c5a1b07)
  • declare missing dependency on got-scraping in the core package (cd2fd4d)
  • provide more detailed error messages for browser launch errors (#2157) (f188ebe)
  • retry incorrect Content-Type when response has blocked status code (#2176) (b54fb8b), closes #1994

Features

crawlee - v3.5.8

Published by apify-service-account about 1 year ago

3.5.8 (2023-10-17)

Bug Fixes

  • MemoryStorage: ignore invalid files for request queues (#2132) (fa58581), closes #1985
  • refactor extractUrls to split the text line by line first (#2122) (7265cd7)
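The `extractUrls` refactor above can be pictured with a small sketch that splits the input into lines and runs the URL pattern on each line separately, so no single regex call sees the whole text. The helper name `extractUrlsLineByLine` and the simplified `URL_PATTERN` are assumptions made for this example; Crawlee's own pattern and function signature may differ.

```typescript
// Simplified URL pattern for illustration only; a production pattern would be stricter.
const URL_PATTERN = /https?:\/\/[^\s"'<>]+/g;

// Split the text into lines first, then collect the matches from each line.
function extractUrlsLineByLine(text: string): string[] {
    const urls: string[] = [];
    for (const line of text.split(/\r?\n/)) {
        const matches = line.match(URL_PATTERN);
        if (matches) urls.push(...matches);
    }
    return urls;
}

const sampleText = 'see https://a.example/x and\nhttp://b.example\nno links here';
console.log(extractUrlsLineByLine(sampleText));
```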
crawlee - v3.5.7

Published by apify-service-account about 1 year ago

3.5.7 (2023-10-05)

Bug Fixes

  • add warning when we detect use of RL and RQ, but RQ is not provided explicitly (#2115) (6fb1c55), closes #1773
  • ensure the status message cannot get the crawler stuck (#2114) (9034f08)
  • RQ request count is consistent after migration (#2116) (9ab8c18), closes #1855
crawlee - v3.5.6

Published by apify-service-account about 1 year ago

3.5.6 (2023-10-04)

Bug Fixes

  • types: re-export RequestQueueOptions as an alias to RequestProviderOptions (#2109) (0900f76)

Features

crawlee - v3.5.5

Published by apify-service-account about 1 year ago

3.5.5 (2023-10-02)

Bug Fixes

  • allow using any version of puppeteer or playwright (#2102) (0cafceb), closes #2101
  • session pool leaks memory on multiple crawler runs (#2083) (b96582a), closes #2074 #2031
  • templates: install browsers on postinstall for playwright (#2104) (323768b)
  • types: make return type of RequestProvider.open and RequestQueue(v2).open strict and accurate (#2096) (dfaddb9)

Features

  • experimental support for request locking (Request Queue v2) (#1975) (70a77ee), closes #1365
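The general idea behind request locking can be sketched with an in-memory queue: a consumer that takes a request locks it for a limited time, so other consumers skip it until the lock expires or the request is marked as handled. Everything below (`LockingQueue`, `lockNext`, `markHandled`, the injected clock) is hypothetical and only illustrates the concept; it is not the Request Queue v2 API.

```typescript
interface LockedRequest {
    url: string;
    lockedUntil: number; // timestamp in milliseconds; 0 means "not locked"
    handled: boolean;
}

class LockingQueue {
    private readonly requests: LockedRequest[] = [];
    private readonly now: () => number;

    // The clock is injectable so the behaviour can be checked deterministically.
    constructor(now: () => number = () => Date.now()) {
        this.now = now;
    }

    add(url: string): void {
        this.requests.push({ url, lockedUntil: 0, handled: false });
    }

    // Return the first unhandled request whose lock has expired and lock it
    // for lockMillis; return null when nothing is currently available.
    lockNext(lockMillis: number): string | null {
        const current = this.now();
        const next = this.requests.find((r) => !r.handled && r.lockedUntil <= current);
        if (!next) return null;
        next.lockedUntil = current + lockMillis;
        return next.url;
    }

    markHandled(url: string): void {
        const request = this.requests.find((r) => r.url === url);
        if (request) request.handled = true;
    }
}
```

With two consumers sharing one such queue, each call to `lockNext` hands out a different request, and a request whose consumer crashed becomes available again once its lock expires.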
crawlee - v3.5.4

Published by apify-service-account about 1 year ago

3.5.4 (2023-09-11)

Bug Fixes

  • core: allow explicit calls to purgeDefaultStorage to wipe the storage on each call (#2060) (4831f07)
  • various helpers opening KVS now respect Configuration (#2071) (59dbb16)

Features

  • remove side effect from the deprecated error context augmentation (#2069) (f9fb5c4)
crawlee - v3.5.3

Published by apify-service-account about 1 year ago

3.5.3 (2023-08-31)

Bug Fixes

  • browser-pool: improve error handling when browser is not found (#2050) (282527f), closes #1459
  • clean up inProgress cache when delaying requests via sameDomainDelaySecs (#2045) (f63ccc0)
  • crawler instances with different StorageClients do not affect each other (#2056) (3f4c863)
  • pin all internal dependencies (#2041) (d6f2b17), closes #2040
  • respect current config when creating implicit RequestQueue instance (845141d), closes #2043

Features

  • core: add default dataset helpers to BasicCrawler (#2057) (e2a7544)
crawlee - v3.5.2

Published by apify-service-account about 1 year ago

3.5.2 (2023-08-21)

Bug Fixes

  • make the Request constructor options typesafe (#2034) (75e7d65)
  • pin @crawlee/* packages versions in crawlee metapackage (#2040) (61f91c7)
  • support DELETE requests in HttpCrawler (#2039) (7ea5c41), closes #1658

Features

crawlee - v3.5.1

Published by apify-service-account about 1 year ago

3.5.1 (2023-08-16)

Bug Fixes

  • add Request.maxRetries to the RequestOptions interface (#2024) (6433821)
  • log original error message on session rotation (#2022) (8a11ffb)

Features

  • exceeding maxSessionRotations calls failedRequestHandler (#2029) (b1cb108), closes #2028
crawlee - v3.5.0

Published by apify-service-account about 1 year ago

3.5.0 (2023-07-31)

Bug Fixes

  • cleanup worker stuff from memory storage to fix vitest (#2004) (d2e098c), closes #1999
  • core: add requests from URL list (requestsFromUrl) to the queue in batches (418fbf8), closes #1995
  • core: support relative links in enqueueLinks explicitly provided via urls option (#2014) (cbd9d08), closes #2005

Features

  • add closeCookieModals context helper for Playwright and Puppeteer (#1927) (98d93bb)
  • add support for sameDomainDelaySecs (#2003) (e796883), closes #1993
  • basic-crawler: allow configuring the automatic status message (#2001) (3eb4e4c)
  • core: use RequestQueue.addBatchedRequests() in enqueueLinks helper (4d61ca9), closes #1995
  • retire session on proxy error (#2002) (8c0928b), closes #1912
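A per-domain delay such as `sameDomainDelaySecs` can be modelled as a small gate that remembers when the last request to each hostname started. The sketch below is one possible way to express that logic, not Crawlee's implementation; the `SameDomainDelay` class and its `tryStart` method are hypothetical names.

```typescript
class SameDomainDelay {
    private readonly lastStart = new Map<string, number>();
    private readonly delayMillis: number;

    constructor(delaySecs: number) {
        this.delayMillis = delaySecs * 1000;
    }

    // Returns 0 when the request may start now (and records the start time),
    // otherwise the number of milliseconds the caller should still wait.
    tryStart(url: string, nowMillis: number): number {
        const host = new URL(url).hostname;
        const last = this.lastStart.get(host);
        if (last !== undefined && nowMillis - last < this.delayMillis) {
            return this.delayMillis - (nowMillis - last);
        }
        this.lastStart.set(host, nowMillis);
        return 0;
    }
}

const gate = new SameDomainDelay(2);
console.log(gate.tryStart('https://a.example/1', 0));   // 0
console.log(gate.tryStart('https://a.example/2', 500)); // 1500
console.log(gate.tryStart('https://b.example/', 500));  // 0
```

Requests to different hostnames are independent, so a delay on one domain does not slow down the others.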
crawlee - v3.4.2

Published by apify-service-account over 1 year ago

3.4.2 (2023-07-19)

Bug Fixes

  • basic-crawler: limit internalTimeoutMillis in addition to requestHandlerTimeoutMillis (#1981) (8122622), closes #1766

Features

  • core: add RequestQueue.addRequestsBatched() that is non-blocking (#1996) (c85485d), closes #1995
  • retryOnBlocked detects blocked webpage (#1956) (766fa9b)
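The batching idea behind `addRequestsBatched()` can be sketched as splitting a large list into fixed-size chunks. A caller could then add the first chunk right away and schedule the remaining ones later, so the crawl can start before every request is enqueued. The `splitIntoBatches` helper is a hypothetical name for this example and says nothing about how Crawlee sizes or schedules its batches.

```typescript
// Split a list into consecutive chunks of at most batchSize items.
function splitIntoBatches<T>(items: T[], batchSize: number): T[][] {
    if (!Number.isInteger(batchSize) || batchSize <= 0) {
        throw new RangeError('batchSize must be a positive integer');
    }
    const batches: T[][] = [];
    for (let i = 0; i < items.length; i += batchSize) {
        batches.push(items.slice(i, i + batchSize));
    }
    return batches;
}

const pageUrls = Array.from({ length: 5 }, (_, i) => `https://example.com/page/${i}`);
const [firstBatch, ...remainingBatches] = splitIntoBatches(pageUrls, 2);
console.log(firstBatch.length, remainingBatches.length); // 2 2
```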
crawlee - v3.4.1

Published by apify-service-account over 1 year ago

3.4.1 (2023-07-13)

Bug Fixes

  • http-crawler: replace IncomingMessage with PlainResponse for context's response (#1973) (2a1cc7f), closes #1964

Features

  • jsdom,linkedom: Expose document to crawler router context (#1950) (4536dc2)
crawlee - v3.4.0

Published by apify-service-account over 1 year ago

3.4.0 (2023-06-12)

Bug Fixes

Features

crawlee - v3.3.3

Published by apify-service-account over 1 year ago

3.3.3 (2023-05-31)

Bug Fixes

  • MemoryStorage: handle EXDEV errors when purging storages (#1932) (e656050)
  • set status message every 10 seconds and log it via debug level (#1918) (32aede6)

Features

  • add support for requestsFromUrl to RequestQueue (#1917) (7f2557c)
  • core: add Request.maxRetries to allow overriding the maxRequestRetries (#1925) (c5592db)
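The per-request override described above amounts to preferring the request's own limit over the crawler-wide default when deciding whether to retry. The sketch below assumes a simplified `RetryableRequest` shape and a hypothetical `shouldRetry` helper; only the names `maxRetries` and `maxRequestRetries` come from the changelog entry.

```typescript
// Simplified request shape for illustration; not Crawlee's Request class.
interface RetryableRequest {
    url: string;
    retryCount: number;
    maxRetries?: number; // optional per-request override
}

// Use the per-request limit when present, otherwise the crawler-wide default.
function shouldRetry(request: RetryableRequest, maxRequestRetries: number): boolean {
    const limit = request.maxRetries ?? maxRequestRetries;
    return request.retryCount < limit;
}

console.log(shouldRetry({ url: 'https://example.com', retryCount: 3 }, 3));                // false
console.log(shouldRetry({ url: 'https://example.com', retryCount: 3, maxRetries: 5 }, 3)); // true
```

Using `??` rather than `||` keeps an explicit `maxRetries: 0` meaningful: such a request is never retried.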
crawlee - v3.3.2

Published by B4nan over 1 year ago

3.3.2 (2023-05-11)

Bug Fixes

  • MemoryStorage: cache requests in RequestQueue (#1899) (063dcd1)
  • respect config object when creating SessionPool (#1881) (db069df)

Features

  • allow running single crawler instance multiple times (#1844) (9e6eb1e), closes #765
  • HttpCrawler: add parseWithCheerio helper to HttpCrawler (#1906) (ff5f76f)
  • router: allow inline router definition (#1877) (2d241c9)
  • RQv2 memory storage support (#1874) (049486b)
  • support alternate storage clients when opening storages (#1901) (661e550)
crawlee - v3.3.1

Published by B4nan over 1 year ago

3.3.1 (2023-04-11)

Bug Fixes

  • infiniteScroll() not working in Firefox (#1826) (4286c5d), closes #1821
  • jsdom: add timeout to the window.load wait when runScripts are enabled (806de31)
  • jsdom: delay closing of the window and add some polyfills (2e81618)
  • jsdom: use no-op enqueueLinks in http crawlers when parsing fails (fd35270)
  • MemoryStorage: handling of readable streams for key-value stores when setting records (#1852) (a5ee37d), closes #1843
  • start status message logger after the crawl actually starts (5d1df7a)
  • status message - total requests (#1842) (710f734)
  • Storage: queue up opening storages to prevent issues in concurrent calls (#1865) (044c740)
  • templates: added missing '@types/node' peer dependency (#1860) (d37a7e2)
  • try to detect stuck request queue and fix its state (#1837) (95a9f94)

Features

  • add parseWithCheerio context helper to cheerio crawler (b336a73)
  • jsdom: add parseWithCheerio context helper (c8f0796)
Package Rankings
Top 7.39% on Npmjs.org