A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.
MIT License
Fetching images (with the --inline flag, or when producing EPUBs) now sends the Referer header as the browser would when rendering the article (using the strict-origin-when-cross-origin referrer policy) (#172). Note that when producing HTML or PDF documents of web pages containing images that require a Referer header, you'll need to use the --inline flag.
Adds the --toc-level=<level> option. By default, the table of contents is a flat list of article titles. With the --toc-level option, the table of contents will include headings under each article title (<h2>, <h3>, etc.), up to the specified heading depth. A number between 1 and 6 is expected. Using --toc-level with a value greater than 1 implies --toc.
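As a rough illustration of the depth rule described above, the sketch below keeps only the headings whose level does not exceed the requested depth. It is not Percollate's implementation: the function name renderToc and the article shape { title, headings: [{ level, text }] } are hypothetical, chosen only to make the rule concrete.

```javascript
// Hypothetical sketch: renders a table of contents as indented text lines.
// `articles` is an assumed shape: [{ title, headings: [{ level, text }] }].
function renderToc(articles, tocLevel = 1) {
  if (!Number.isInteger(tocLevel) || tocLevel < 1 || tocLevel > 6) {
    throw new RangeError('tocLevel must be an integer between 1 and 6');
  }
  const lines = [];
  for (const article of articles) {
    // Level 1 is the article title itself, always listed.
    lines.push(article.title);
    for (const heading of article.headings) {
      // Include <h2>, <h3>, ... only up to the requested depth.
      if (heading.level >= 2 && heading.level <= tocLevel) {
        lines.push('  '.repeat(heading.level - 1) + heading.text);
      }
    }
  }
  return lines;
}

const sampleArticles = [
  {
    title: 'First article',
    headings: [
      { level: 2, text: 'Introduction' },
      { level: 3, text: 'Background' }
    ]
  }
];

console.log(renderToc(sampleArticles, 2).join('\n'));
```

With a depth of 1 the result is the flat list of titles, which matches the default behavior described above.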
<pre> tags on the Rust reference one-pager (#165)
… and fixed the loading of language patterns for hyphenation when percollate is used as an npm package (see #163, hyphenopoly#207). With thanks to @mnater, @yashha.
Thank you @vongrad for contributing two fixes to this release:
--inline failing to base64-encode images (#154)
imagesAtFullSize() DOM enhancement to exclude non-English Wikipedia URLs that look like they point to images but are in fact HTML pages (e.g. wiki/File: URLs in English) (#156, #141)
This release changes how Percollate interprets operands (see #150): when no operand is provided, an implicit - (stdin) is assumed. This makes it nicer to pipe data into percollate from an external tool.
Although not part of the public API, Percollate's logging has largely shifted from stdout to stderr, to allow the output of the html and md commands to be piped to an external tool.
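A minimal sketch of driving this from a Node.js script, assuming percollate is installed on the PATH and that the data piped in is HTML markup. The helper name convertViaStdin is hypothetical. It passes no operand (so stdin is read), asks for output on stdout with -o -, and leaves stderr attached to the terminal so log messages stay out of the captured result.

```javascript
// Hypothetical helper: pipes HTML into a command's stdin and collects its stdout.
// The defaults assume a `percollate` executable is available on the PATH.
async function convertViaStdin(html, command = 'percollate', args = ['md', '-o', '-']) {
  // import() is usable from both CommonJS and ES modules.
  const { spawn } = await import('node:child_process');
  return new Promise((resolve, reject) => {
    // stdin and stdout are pipes; stderr is inherited so logs are not captured.
    const child = spawn(command, args, { stdio: ['pipe', 'pipe', 'inherit'] });
    const chunks = [];
    child.stdout.on('data', chunk => chunks.push(chunk));
    child.on('error', reject);
    child.stdin.on('error', reject);
    child.on('close', code => {
      if (code === 0) {
        resolve(Buffer.concat(chunks).toString('utf8'));
      } else {
        reject(new Error(`${command} exited with code ${code}`));
      }
    });
    // No operand was given, so the command is expected to read from stdin.
    child.stdin.end(html);
  });
}
```

Calling convertViaStdin('<article>…</article>') would then resolve with whatever the md command writes to stdout.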
percollate md (#93)
html and md commands can output to stdout with the -o - / --output=- flag (#150). When used in combination with the --individual flag, all results are concatenated to stdout.
Node.js 14.17 or later is required to run Percollate 3.0.0. Users on Node.js 12.x can continue using Percollate 2.x by installing it with:
npm install -g percollate@2
Note: The programmatic API is not currently part of the public, documented API.
fetchContent(), which used to return the page content as a string decoded as 'utf-8', now returns an object of the shape { buffer: ArrayBuffer, contentType: string? }. Consequently, the results of pdf(), epub() and html() now carry this new structure under .originalContent as well. See Programmatic API migration below for details.
Added experimental Firefox Nightly support for rendering PDFs, via the percollate pdf --browser=firefox option. To fetch Firefox Nightly, perform the following installation steps:
# fetches Chrome
npm install -g percollate
# fetches Firefox Nightly
PUPPETEER_PRODUCT=firefox npm install -g percollate
Better default styles for code blocks with the tab-size: 2 CSS property.
Note: The programmatic API is not currently part of the public, documented API.
In general, an ArrayBuffer can be converted to a String with the TextDecoder class available in Node.js. In case the content uses a different encoding than the default utf-8, you can use the whatwg-mimetype and html-encoding-sniffer packages (on which jsdom already depends) to obtain the content's encoding:
import { TextDecoder } from 'node:util';
import htmlEncodingSniffer from 'html-encoding-sniffer';
import MimeType from 'whatwg-mimetype';

const { buffer, contentType } = await fetchContent(...);

// Charset declared in the Content-Type header, if any
const encoding = contentType
	? new MimeType(contentType).parameters.get('charset')
	: undefined;

// The sniffer reads bytes by index, so hand it a Uint8Array view of the ArrayBuffer
const str = new TextDecoder(
	htmlEncodingSniffer(new Uint8Array(buffer), {
		transportLayerEncodingLabel: encoding
	})
).decode(buffer);
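For the simpler case where the content is known to be UTF-8, no sniffing is needed. The sketch below relies on the global TextDecoder; the helper name arrayBufferToString is hypothetical, and the sample buffer is built locally to stand in for the buffer returned by fetchContent().

```javascript
// Hypothetical helper: decodes an ArrayBuffer to a string, UTF-8 by default.
function arrayBufferToString(buffer, encoding = 'utf-8') {
  return new TextDecoder(encoding).decode(buffer);
}

// Build a sample ArrayBuffer to stand in for the fetched page content.
const sampleBytes = new TextEncoder().encode('<p>héllo</p>');
const sampleBuffer = sampleBytes.buffer.slice(
  sampleBytes.byteOffset,
  sampleBytes.byteOffset + sampleBytes.byteLength
);

console.log(arrayBufferToString(sampleBuffer));
```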
Published by danburzo about 3 years ago
⚠️ Breaking change: Percollate 2.x is ESM only. As such:
It can no longer be require()d into your project. You must either import it statically, or import() it dynamically.
You can continue to use Percollate 1.x on Node.js 10, and as a CommonJS dependency:
npm install -g percollate@1
(Please note that while the 1.x version is perfectly usable, it will no longer receive updates going forward.)
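For projects that stay on CommonJS but want Percollate 2.x, the dynamic import() route can be sketched as below. The helper name loadEsm is hypothetical, and the usage comment assumes percollate is installed locally and exposes pdf as a named export, which these notes do not confirm.

```javascript
// Hypothetical helper: loads an ES module from either CommonJS or ESM code.
// import() returns a promise for the module namespace object.
async function loadEsm(specifier) {
  return import(specifier);
}

// Usage sketch (assumptions: percollate 2.x is installed, `pdf` is a named export):
//
//   const { pdf } = await loadEsm('percollate');
//   await pdf(['https://example.com'], { output: 'example.pdf' });
```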
Additionally, the default Git branch has been renamed to main.
This release includes some fixes to make articles on acoup.blog work better in epub, thanks @Akuukis!
… image.png?w=1024)
When an <img> or <source> element contains a src attribute and a srcset attribute, discard the srcset to keep the EPUB size down. (Previously, several versions of an image would have been bundled in the EPUB, to the detriment of disk space.)
Bug fixes:
<body> element inside the article's content (see #124)
New features:
{ items, options }, from the pdf() / epub() / html() methods (see #122, thanks @yashha)