crawlee

Crawlee is a web scraping and browser automation library for Node.js that helps you build reliable crawlers in JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. It works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, supports both headful and headless modes, and provides proxy rotation.

Apache-2.0 license

crawlee - v2.1.0

Published by B4nan about 3 years ago

What's Changed

Full Changelog: https://github.com/apify/apify-js/compare/v2.0.7...v2.1.0

crawlee - v2.0.7

Published by B4nan about 3 years ago

  • Fix casting of int/bool environment variables (e.g. APIFY_LOCAL_STORAGE_ENABLE_WAL_MODE), closes #956
  • Fix incognito pages and user data dir (#1145)
  • Add @ts-ignore comments to imports of optional peer dependencies (#1152)
  • Use config instance in sdk.openSessionPool() (#1154)
  • Add a breaking callback to infiniteScroll (#1140)
crawlee - v2.0.6

Published by mnmkng about 3 years ago

  • Fix deprecation messages logged from ProxyConfiguration and CheerioCrawler.
  • Update got-scraping to receive multiple improvements.
crawlee - v2.0.5

Published by B4nan about 3 years ago

2.0.5 / 2021/08/24

  • Fix error handling in puppeteer crawler
crawlee - v2.0.4

Published by szmarczak about 3 years ago

  • feat: use session token with got-scraping (#1122) d3261ca

This update makes the browser-like headers generated via got-scraping persistent: requests that share the same session token keep the same header set.
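For illustration, a minimal sketch of the underlying idea using got-scraping directly; the gotScraping export and the sessionToken option follow got-scraping's documentation, and the URL is illustrative:

const { gotScraping } = require('got-scraping');

(async () => {
    // Any non-primitive value can serve as a session token; requests sharing
    // the same token keep the same generated browser-like headers.
    const sessionToken = {};

    const response = await gotScraping({
        url: 'https://example.com',
        sessionToken,
    });
    console.log(response.statusCode);
})();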

crawlee - v2.0.3

Published by szmarczak about 3 years ago

  • chore: add aborting event to events docs [skip ci] c89f532f2ba01eb886c0c4fb7d54be0eae2b0e01
  • fix: refactor requestAsBrowser to Got 12 (#1111) ef9a4ad32701f6613aea0a6104c68824687fe93a
  • fix: limit handleRequestTimeoutMillis to max valid value (#1116) 594895835222d1f78218e9798619a417edde731f
  • fix: disable SSL validation on MITM proxies (#1117) 853c5cd474bf28773adb214a431e274cef2b244c
  • fix: bump got-scraping to 3.0.1 (#1121) b9e99b7013654cef9feca33d54ebe6a82b73077e

This release improves the stability of the SDK.

crawlee - v2.0.2

Published by mnmkng about 3 years ago

  • Fix serialization issues in CheerioCrawler caused by parser conflicts in recent versions of cheerio.
crawlee - v2.0.1

Published by mnmkng about 3 years ago

  • Use got-scraping 2.0.1 until fully compatible.
crawlee - v2.0.0

Published by mnmkng about 3 years ago

We're releasing SDK 2 ahead of schedule because we need state-of-the-art HTTP2 support for scraping, and HTTP2 is not very reliable on Node.js versions below 15.10. We bundled in two more potentially breaking changes that we had been waiting for, but we expect those to have very little impact on users. Migration should therefore be very simple: just bump your Node.js version (see the package.json sketch after the list below).

If you're waiting for full TypeScript support and new features, those are still in the works and will be released in SDK 3 at the end of this year.

  • BREAKING: Require Node.js >=15.10.0, because HTTP2 support on lower Node.js versions is very buggy.
  • BREAKING: Bump cheerio to 1.0.0-rc.10 from rc.3. There were breaking changes in cheerio between these versions, so this bump might be breaking for you as well.
  • Remove LiveViewServer, which was deprecated before the release of SDK v1.
  • We no longer tag beta releases.
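For example, the new requirement can be enforced in package.json via the engines field; a minimal sketch:

{
    "engines": {
        "node": ">=15.10.0"
    }
}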
crawlee - v1.3.4

Published by B4nan about 3 years ago

1.3.4 / 2021/08/04

  • Fix issues with TS builds caused by incomplete browser-pool rewrite
crawlee - v1.3.3

Published by B4nan about 3 years ago

1.3.3 / 2021/08/04

  • Fix public URL getter of key-value stores
crawlee - v1.3.2

Published by mnmkng about 3 years ago

  • Fix headerGeneratorOptions not being passed to got-scraping in requestAsBrowser.
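A hedged sketch of how the fixed option is used; the generator values (browsers, devices) follow the header-generator package's documented options, and the URL is illustrative:

const Apify = require('apify');

(async () => {
    // headerGeneratorOptions is forwarded to got-scraping's header generator.
    const response = await Apify.utils.requestAsBrowser({
        url: 'https://example.com',
        headerGeneratorOptions: {
            browsers: [{ name: 'firefox', minVersion: 88 }],
            devices: ['desktop'],
        },
    });
    console.log(response.statusCode);
})();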
crawlee - v1.3.1

Published by petrpatek over 3 years ago

1.3.1 / 2021/07/13

  • Fix client /v2 duplication in apiBaseUrl.
crawlee - v1.3.0

Published by mnmkng over 3 years ago

Navigation hooks in CheerioCrawler

CheerioCrawler downloads web pages using the requestAsBrowser utility function.
Unlike the browser-based crawlers, which encode URLs automatically, the
requestAsBrowser function does not. We either need to encode the URLs manually
via the encodeURI() function, or set forceUrlEncoding: true in the
requestAsBrowserOptions, which encodes all URLs automatically before they are requested.

Use either forceUrlEncoding or manual encoding, but not both; combining them
would double-encode the URLs and produce invalid ones.

We can use the preNavigationHooks to adjust requestAsBrowserOptions:

preNavigationHooks: [
    (crawlingContext, requestAsBrowserOptions) => {
        requestAsBrowserOptions.forceUrlEncoding = true;
    }
]
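For context, a sketch of where the hook plugs into a crawler; the start URL and the page handler are illustrative:

const Apify = require('apify');

Apify.main(async () => {
    const requestList = await Apify.openRequestList('start', [
        'https://example.com/some page', // the space needs encoding
    ]);

    const crawler = new Apify.CheerioCrawler({
        requestList,
        preNavigationHooks: [
            (crawlingContext, requestAsBrowserOptions) => {
                requestAsBrowserOptions.forceUrlEncoding = true;
            },
        ],
        handlePageFunction: async ({ request, $ }) => {
            console.log(`${request.url}: ${$('title').text()}`);
        },
    });

    await crawler.run();
});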

Apify class and Configuration

Adds two new named exports:

  • Configuration class that serves as the main configuration holder, replacing explicit usage of
    environment variables.
  • Apify class that allows configuring the SDK. Env vars still have precedence over the SDK configuration.

Unlike the global helper functions, using the Apify class has no side effects on global state.
This release also adds new configuration for WAL mode in ApifyStorageLocal (see the sketch below).
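A sketch of the new WAL configuration; the enableWalMode option name is an assumption here, inferred from the APIFY_LOCAL_STORAGE_ENABLE_WAL_MODE environment variable:

const { ApifyStorageLocal } = require('@apify/storage-local');

// Assumption: WAL journaling is toggled via a constructor option that
// mirrors the APIFY_LOCAL_STORAGE_ENABLE_WAL_MODE environment variable.
const storage = new ApifyStorageLocal({ enableWalMode: false });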

As an alternative to the global helper functions like main, you can use the Apify class.
It has mostly the same API, but the methods on an Apify instance use the configuration provided in the constructor.
Environment variables still take precedence over this configuration.

const { Apify } = require('apify'); // use the named export to get the class

const sdk = new Apify({ token: '123' });
console.log(sdk.config.get('token')); // '123'

(async () => {
    // the token will be passed to the `call` method automatically
    const run = await sdk.call('apify/hello-world', { myInput: 123 });
    console.log(`Received message: ${run.output.body.message}`);
})();

Another example shows how the default dataset name can be changed:

const { Apify } = require('apify'); // use named export to get the class

const sdk = new Apify({ defaultDatasetId: 'custom-name' });
await sdk.pushData({ myValue: 123 });

is equivalent to:

const Apify = require('apify'); // use default export to get the helper functions

const dataset = await Apify.openDataset('custom-name');
await dataset.pushData({ myValue: 123 });

Full list of changes:

  • Add Configuration class and Apify named export, see above.
  • Fix proxyUrl without a port throwing an error when launching browsers.
  • Fix maxUsageCount of a Session not being persisted.
  • Update puppeteer and playwright to match stable Chrome (90).
  • Fix support for building TypeScript projects that depend on the SDK.
  • Add taskTimeoutSecs to allow control over the timeout of AutoscaledPool tasks.
  • Add forceUrlEncoding to the requestAsBrowser options.
  • Add preNavigationHooks and postNavigationHooks to CheerioCrawler.
  • Deprecate the prepareRequestFunction and postResponseFunction methods of CheerioCrawler.
  • Add a new aborting event for gracefully handling runs aborted from the Apify platform.
crawlee - v1.2.1

Published by B4nan over 3 years ago

  • Fix requestAsBrowser behavior with various combinations of the legacy options json and payload. Closes #1028.
crawlee - v1.2.0

Published by mnmkng over 3 years ago

This release brings the long-awaited HTTP2 capabilities to requestAsBrowser. It could make HTTP2 requests before, but it was not much help in making browser-like ones, which matters for disguising requests as a browser and reducing the number of blocked requests. requestAsBrowser now uses got-scraping under the hood.

The most important new feature is that the full set of headers requestAsBrowser uses is now generated from live data about browser headers that we collect. This means the "header fingerprint" will always match existing browsers and should be indistinguishable from a real browser request. The header sets are rotated automatically to further reduce the chances of blocking.

We also switched the default HTTP version in requestAsBrowser from 1 to 2. We don't expect this change to be breaking, and we took precautions, but there are always some edge cases, so please let us know if it causes trouble for you. If it does, HTTP/2 can be switched off per request, as the sketch below shows.
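A minimal sketch of opting out of the new default (the URL is illustrative):

const Apify = require('apify');

(async () => {
    // useHttp2 now defaults to true; switch it off per request if needed.
    const response = await Apify.utils.requestAsBrowser({
        url: 'https://example.com',
        useHttp2: false,
    });
    console.log(response.statusCode);
})();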

Full list of changes:

  • Replace the underlying HTTP client of utils.requestAsBrowser() with got-scraping.
  • Make useHttp2 true by default with utils.requestAsBrowser().
  • Fix Apify.call() failing with empty OUTPUT.
  • Update puppeteer to 8.0.0 and playwright to 1.10.0 with Chromium 90 in Docker images.
  • Update @apify/ps-tree to support Windows better.
  • Update @apify/storage-local to support Node.js 16 prebuilds.
crawlee - v1.1.2

Published by mnmkng over 3 years ago

  • DEPRECATED: utils.waitForRunToFinish. Please use the apify-client package and its waitForFinish functions instead. Sorry, we forgot to deprecate this with the v1 release.
  • Fix internal require that broke the SDK with underscore 1.13 release.
  • Update @apify/storage-local to v2 written in TypeScript.
crawlee - v1.1.1

Published by mnmkng over 3 years ago

  • Fix SessionPoolOptions not being correctly used in BrowserCrawler.
  • Improve error messages for missing puppeteer or playwright installations.
crawlee - v1.1.0

Published by mnmkng over 3 years ago

In this minor release we focused on the SessionPool. Besides fixing a few bugs, we added one important feature: setting and getting sessions by ID.

// Now you can add specific sessions to the pool,
// instead of relying on random generation.
await sessionPool.addSession({
    id: 'my-session',
    // ... some config
});

// Later, you can retrieve the session. This is useful
// for example when you need a specific login session.
const session = await sessionPool.getSession('my-session');

Full list of changes:

  • Add a sessionPool.addSession() function to add a new session to the session pool, optionally with provided options, e.g. a specific session ID.
  • Add an optional sessionId parameter to sessionPool.getSession(), so a session with a specific ID can be retrieved from the pool.
  • Fix SessionPool not working properly in both PuppeteerCrawler and PlaywrightCrawler.
  • Fix Apify.call() and Apify.callTask() output - make it backwards compatible with previous versions of the client.
  • Improve handling of browser executable paths when using the official SDK Docker images.
  • Update browser-pool to fix issues with failing hooks causing browsers to get stuck in limbo.
  • Remove the proxy-chain dependency, because it is now covered by browser-pool.
crawlee - v1.0.2

Published by mnmkng over 3 years ago

  • Add the ability to override ProxyConfiguration status check URL with the APIFY_PROXY_STATUS_URL env var.
  • Fix inconsistencies in cookie handling when SessionPool was used.
  • Fix TS types in multiple places. TS is still not a first-class citizen, but this should improve the experience.