percollate

A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.

MIT License

Downloads: 339
Stars: 4.1K
Committers: 18

percollate - Latest Release

Published by danburzo 5 months ago

Fixes

  • Fetching article images (when using the --inline flag, or when producing EPUBs) now sends the Referer header as the browser would when rendering the article (using the strict-origin-when-cross-origin referrer policy) (#172). Note that when producing HTML or PDF documents of web pages containing images that require a Referer header, you’ll need to use the --inline flag.
  • Use a fallback when slugifying an article title results in the empty string.
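As a rough illustration of the first fix above, the strict-origin-when-cross-origin policy can be sketched as a small function. This is not Percollate's actual code: the name refererFor is hypothetical, and the notion of a secure origin is simplified here to the https: scheme.

```javascript
// Hypothetical helper: which Referer value would accompany a request for
// imageUrl made from the page at articleUrl, under the
// strict-origin-when-cross-origin policy (simplified).
function refererFor(articleUrl, imageUrl) {
  const from = new URL(articleUrl);
  const to = new URL(imageUrl);

  // Same origin: send the full URL, minus credentials and fragment.
  if (from.origin === to.origin) {
    from.username = '';
    from.password = '';
    from.hash = '';
    return from.href;
  }

  // Downgrade from HTTPS to a non-HTTPS destination: send nothing.
  if (from.protocol === 'https:' && to.protocol !== 'https:') {
    return undefined;
  }

  // Any other cross-origin request: send only the origin.
  return from.origin + '/';
}

console.log(refererFor('https://example.com/post#top', 'https://cdn.example.net/a.png'));
// prints "https://example.com/"
```

A server that checks the Referer header would therefore see the article's origin for images hosted on another domain, and the full article URL for images hosted on the same domain.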

percollate -

Published by danburzo 5 months ago

New features

Compress entries in the EPUB file with the DEFLATE algorithm at maximum level of compression (#169).

percollate -

Published by danburzo 6 months ago

New features

Adds the --toc-level=<level> option. By default, the table of contents is a flat list of article titles. With the --toc-level option the table of contents will include headings under each article title (<h2>, <h3>, etc.), up to the specified heading depth. A number between 1 and 6 is expected. Using --toc-level with a value greater than 1 implies --toc.
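The effect of the heading depth can be sketched as a filter over each article's headings. This is an illustration only, not Percollate's implementation: the function name tocEntries and the shape of the article objects ({ title, headings: [{ level, text }] }) are assumptions made for the example.

```javascript
// Hypothetical sketch: select which headings end up in the table of contents.
// Level 1 is the article title itself; levels 2..6 map to <h2>..<h6>.
function tocEntries(articles, tocLevel = 1) {
  if (!Number.isInteger(tocLevel) || tocLevel < 1 || tocLevel > 6) {
    throw new RangeError('--toc-level expects a number between 1 and 6');
  }
  return articles.map(article => ({
    title: article.title,
    // With the default level of 1 no headings pass the filter,
    // which leaves a flat list of article titles.
    headings: article.headings.filter(
      h => h.level >= 2 && h.level <= tocLevel
    )
  }));
}
```

Under these assumptions, a level of 3 would list each article's <h2> and <h3> headings beneath its title and leave out anything deeper.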

percollate -

Published by danburzo 9 months ago

Bug fixes

percollate -

Published by danburzo 11 months ago

Bug fixes

  • Updated hyphenopoly and fixed the loading of language patterns for hyphenation when percollate is used as an npm package (see #163, hyphenopoly#207). With thanks to @mnater, @yashha.

percollate -

Published by danburzo about 1 year ago

Bug fixes

  • Fixes usage of mdast-util-gfm to allow serializing HTML <table> elements to Markdown when using percollate md (#161)

percollate -

Published by danburzo over 1 year ago

Bug fixes

Further improvements to detecting and bundling images re: #141 (which should have really been part of v4.0.1, had the necessary insight not manifested exactly five seconds after publishing said version).

percollate -

Published by danburzo over 1 year ago

Bug fixes

Thank you @vongrad for contributing two fixes to this release:

  • Fixes regression in --inline failing to base64-encode images (#154)
  • Fixes heuristic in imagesAtFullSize() DOM enhancement to exclude non-English Wikipedia URLs that look like they point to images but are in fact HTML pages (e.g. wiki/File: URLs in English) (#156, #141)

percollate -

Published by danburzo over 1 year ago

Breaking changes

This release changes how Percollate interprets operands (See #150): when no operand is provided, an implicit - (stdin) is assumed. This makes it nicer to pipe data into percollate from an external tool.
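The operand rule amounts to a one-line default. A minimal sketch, in which resolveOperands is a hypothetical name and not part of Percollate:

```javascript
// Hypothetical sketch: when the command line carries no operands,
// fall back to a single "-" operand, which stands for stdin.
function resolveOperands(operands) {
  return operands.length > 0 ? operands : ['-'];
}

console.log(resolveOperands([]));
// prints [ '-' ]
```

Explicit operands are passed through untouched, so existing invocations that list URLs behave as before.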

Although not part of the public API, Percollate's logging has largely shifted from stdout to stderr, to allow html and md to be piped to an external tool.

New features

  • Support for Markdown output with percollate md (#93)
  • html and md commands can output to stdout with the -o - / --output=- flag (#150). When used in combination with the --individual flag, all results are concatenated to stdout.

percollate -

Published by danburzo over 1 year ago

⚠️ Breaking changes

Node 14 required

Node.js 14.17 or later is required to run Percollate 3.0.0. Users on Node.js 12.x can continue using Percollate 2.x by installing it with:

npm install -g percollate@2

Programmatic API breaking changes

Note: The programmatic API is not currently part of the public, documented API.

fetchContent(), which used to return the page content as a string decoded as 'utf-8', now returns an object of the shape { buffer: ArrayBuffer, contentType: string? }. Consequently, the results of pdf(), epub() and html() carry this new structure in their .originalContent property as well. See Programmatic API migration below for details.

New features

Experimental Firefox support for PDF rendering

Added experimental Firefox Nightly support for rendering PDFs, via the percollate pdf --browser=firefox option. To fetch Firefox Nightly, perform the following installation steps:

# fetches Chrome
npm install -g percollate

# fetches Firefox Nightly
PUPPETEER_PRODUCT=firefox npm install -g percollate

Bug fixes

Better default styles for code blocks with the tab-size: 2 CSS property.

Migration

Programmatic API migration

Note: The programmatic API is not currently part of the public, documented API.

In general, an ArrayBuffer can be converted to a String with the TextDecoder class available in Node.js. In case the content uses a different encoding than the default utf-8, you can use the whatwg-mimetype and html-encoding-sniffer packages (on which jsdom already depends) to obtain the content's encoding:

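import { TextDecoder } from 'node:util';
import htmlEncodingSniffer from 'html-encoding-sniffer';
import MimeType from 'whatwg-mimetype';

// fetchContent() now resolves to { buffer, contentType } instead of a string
const { buffer, contentType } = await fetchContent(...);

// Charset declared in the Content-Type header, if any
const encoding = contentType
	? new MimeType(contentType).parameters.get('charset')
	: undefined;

// Sniff the encoding from the bytes, passing the header's charset as the
// transport-layer hint, then decode the buffer to a string
const str = new TextDecoder(
	htmlEncodingSniffer(buffer, {
		transportLayerEncodingLabel: encoding
	})
).decode(buffer);

percollate -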

Published by danburzo over 1 year ago

Bug fixes

  • Duplicate file names are now given a numeric suffix to avoid one overwriting the other (#144)
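One way to picture the fix is a pass that remembers which names are already taken. The sketch below is an assumption-laden illustration: the function name uniqueNames and the "-1", "-2" suffix format are invented for the example and may differ from what Percollate actually writes.

```javascript
// Hypothetical sketch: give repeated file names a numeric suffix so that
// a later file does not overwrite an earlier one.
function uniqueNames(names) {
  const used = new Set();
  return names.map(name => {
    // Split "article.pdf" into "article" and ".pdf" (extension may be absent).
    const dot = name.lastIndexOf('.');
    const base = dot > 0 ? name.slice(0, dot) : name;
    const ext = dot > 0 ? name.slice(dot) : '';

    // Try name, then base-1.ext, base-2.ext, ... until one is free.
    let candidate = name;
    for (let i = 1; used.has(candidate); i++) {
      candidate = `${base}-${i}${ext}`;
    }
    used.add(candidate);
    return candidate;
  });
}

console.log(uniqueNames(['a.pdf', 'a.pdf', 'b.pdf', 'a.pdf']));
// prints [ 'a.pdf', 'a-1.pdf', 'b.pdf', 'a-2.pdf' ]
```

Checking against the set of names already handed out, instead of counting occurrences, also covers the case where an input is literally named like a suffixed file.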

percollate -

Published by danburzo about 2 years ago

Bug fixes

  • Improves Windows compatibility of some generated path names (#139)
  • Fixes some images not showing up on Wikipedia article pages (#141)

percollate -

Published by danburzo over 2 years ago

New features

Adds the -w, --wait=<sec> option to process URLs sequentially, and pause for a number of seconds between URLs. If unspecified, URLs are processed in parallel as before. (#133)
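The difference between the two modes can be sketched as follows. The names processAll and processUrl are hypothetical stand-ins for whatever fetches and converts a single URL; this is not Percollate's code.

```javascript
// Resolve after the given number of seconds.
const sleep = seconds =>
  new Promise(resolve => setTimeout(resolve, seconds * 1000));

// Hypothetical sketch: process URLs in parallel when no wait is given,
// otherwise one at a time with a pause between consecutive URLs.
async function processAll(urls, processUrl, waitSeconds) {
  if (waitSeconds === undefined) {
    return Promise.all(urls.map(url => processUrl(url)));
  }
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    // Pause between URLs, but not before the first one.
    if (i > 0) await sleep(waitSeconds);
    results.push(await processUrl(urls[i]));
  }
  return results;
}
```

Calling processAll(urls, processUrl, 2) would then space requests roughly two seconds apart, which is gentler on servers that rate-limit clients.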

percollate -

Published by danburzo about 3 years ago

New features

Add support for the --inline flag. This fetches the images and embeds them into the document as Base64-encoded data URIs, so that you can use percollate html to obtain self-contained HTML files.
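For readers unfamiliar with data URIs, the encoding step can be sketched like this. The helper name toDataUri is hypothetical, and the sketch assumes the image bytes and MIME type have already been fetched.

```javascript
// Hypothetical sketch: turn fetched image bytes into a Base64 data URI
// that can replace the original value of an <img> src attribute.
function toDataUri(bytes, mimeType) {
  const base64 = Buffer.from(bytes).toString('base64');
  return `data:${mimeType};base64,${base64}`;
}

// The three bytes below spell "GIF" in ASCII.
console.log(toDataUri(new Uint8Array([0x47, 0x49, 0x46]), 'image/gif'));
// prints "data:image/gif;base64,R0lG"
```

Base64 encodes every three bytes as four characters, so inlined documents come out roughly a third larger than the images they embed.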

percollate - v2.0.0

Published by danburzo about 3 years ago

⚠️ Breaking change: Percollate 2.x is ESM only. As such:

  • It requires Node.js 12.20.0, Node.js 14.10.0, or Node 16.0 or later to run.
  • It can no longer be require()d into your project. You must either import it statically, or import() it dynamically.

You can continue to use Percollate 1.x on Node.js 10, and as a CommonJS dependency:

npm install -g percollate@1

(Please note that while the 1.x version is perfectly usable, it will no longer receive updates going forward.)

Additionally, the default Git branch has been renamed to main.

percollate -

Published by danburzo about 3 years ago

percollate -

Published by danburzo about 3 years ago

This release includes some fixes to make articles on acoup.blog work better in epub, thanks @Akuukis!

Bug fixes

  • When fetching images for bundling in the EPUB, include URLs that use query parameters (e.g. image.png?w=1024)
  • When an <img> or <source> element contains a src attribute and a srcset attribute, discard the srcset to keep the EPUB size down. (Previously, several versions of an image would have been bundled in the EPUB, to the detriment of disk space.)
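The query-parameter fix in the first item can be illustrated by looking at the URL's path rather than the raw string. This is a sketch under assumptions: the name imageExtension and the list of extensions are invented for the example and are not taken from Percollate.

```javascript
// Hypothetical sketch: recognise image URLs by the extension of their
// path, so that a query string such as "?w=1024" does not get in the way.
function imageExtension(src, base) {
  // Resolve relative URLs against the article URL; the query string and
  // fragment are kept out of pathname by the URL parser.
  const { pathname } = new URL(src, base);
  const match = pathname.match(/\.(png|jpe?g|gif|svg|webp)$/i);
  return match ? match[1].toLowerCase() : undefined;
}

console.log(imageExtension('image.png?w=1024', 'https://example.com/post/'));
// prints "png"
```

Matching on the parsed path also avoids false positives where an image-like name only appears inside the query string.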

percollate -

Published by danburzo about 3 years ago

Upgraded to puppeteer@9 which fixes installation on Node.js on Apple Silicon.

percollate -

Published by danburzo over 3 years ago

Bug fixes:

  • Adds uuid as an explicit direct dependency in package.json (#127, thanks @Jackymancs4!)

percollate -

Published by danburzo over 3 years ago

Bug fixes:

  • EPUB: Don't bundle images that have been stripped out by Readability (see #124)
  • EPUB: Fixes XHTML generation to avoid putting a <body> element inside the article's content (see #124)

New features:

  • Programmatic API: Return something useful — in the shape of { items, options } — from the pdf() / epub() / html() methods (see #122, thanks @yashha)