Extract the main article from a given URL with Node.js
MIT License
Published by ndaidong over 1 year ago
Published by ndaidong over 1 year ago
Published by ndaidong almost 2 years ago
Related issues: #321, #326
Published by ndaidong almost 2 years ago
Published by ndaidong almost 2 years ago
Rename article-parser to @extractus/article-extractor
Published by ndaidong almost 2 years ago
Published by ndaidong about 2 years ago
Published by ndaidong about 2 years ago
By removing the HTML validation step, we made extraction about 4x to 5x faster.
Before, article-parser checked whether the input to extract was a URL or valid HTML to decide the next step.
Now, when it receives the input and that input isn't a URL, it assumes the input is an HTML string and starts extracting immediately.
v7.2.2 - Before
v7.2.3 - After
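The simplified input check described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual source: the helper names isHttpUrl and planExtraction are hypothetical, and the sketch only decides which step comes next rather than fetching or parsing anything.

```javascript
// Hypothetical helper names; this is not the library's actual code.

// Returns true when the input parses as an absolute http(s) URL.
function isHttpUrl(input) {
  try {
    const { protocol } = new URL(input);
    return protocol === 'http:' || protocol === 'https:';
  } catch {
    // new URL() throws on anything that is not an absolute URL.
    return false;
  }
}

// Decide the next step without validating the HTML first:
// anything that is not a URL is treated as an HTML string.
function planExtraction(input) {
  return isHttpUrl(input)
    ? { step: 'fetch', url: input }
    : { step: 'parse', html: input };
}
```

Under this assumption, the only check left on the hot path is a single URL parse, which is consistent with the speed-up coming from dropping the HTML validation pass.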
Published by ndaidong about 2 years ago
Published by ndaidong about 2 years ago
Published by ndaidong about 2 years ago
urlpattern-polyfill to bypass Deno/Bun error
Published by ndaidong about 2 years ago
tldts
Published by ndaidong about 2 years ago
Published by ndaidong about 2 years ago
Published by ndaidong about 2 years ago
Release v7.0.0 for production.
This version uses transformations instead of queryRules. The absence of predefined rules may break extraction for some article sources, but you can easily fix these problems with a little knowledge of DOM manipulation. After that, transformations can help you improve extraction results in a completely new way.
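To give a feel for what a transformation does, here is a dependency-free model of the idea. Everything in it is an assumption made for illustration: the names addTransformation and applyPreTransformations, the shape of the rule object, and the use of plain HTML strings in place of DOM documents. It is not the library's API, so consult the project documentation for the real function signatures.

```javascript
// Illustrative model only: names and shapes here are assumptions,
// and plain HTML strings stand in for DOM documents.
const transformations = [];

// Register a rule: URL patterns plus a `pre` step run before extraction.
function addTransformation({ patterns, pre }) {
  transformations.push({ patterns, pre });
}

// Run every registered `pre` step whose pattern matches the article URL.
function applyPreTransformations(url, html) {
  return transformations
    .filter(({ patterns }) => patterns.some((re) => re.test(url)))
    .reduce((current, { pre }) => pre(current), html);
}

// Example rule: strip a site-specific advert block before extraction.
addTransformation({
  patterns: [/^https:\/\/example\.com\/news\//],
  pre: (html) => html.replace(/<div class="advert">[\s\S]*?<\/div>/g, ''),
});
```

The point of the design is that site-specific fixes live in small per-site rules that you register yourself, instead of in a fixed table of predefined query rules shipped with the package.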
Published by ndaidong over 2 years ago
source (for a consistent format)
https://www.google.com/s2/favicons?domain={DOMAIN.TLD}
description length, tend to take summary from content, remove unnecessary parts
Published by ndaidong over 2 years ago
Accept-Encoding to request options
Published by ndaidong over 2 years ago
Replace queryRule with a new one called transformation
Published by ndaidong over 2 years ago