article-extractor

Extract the main article from a given URL with Node.js

MIT License

Downloads: 14.8K
Stars: 1.6K
Committers: 13

article-extractor - v7.2.10

Published by ndaidong over 1 year ago

  • Fix issue #331
  • Update dependencies
  • Remove unnecessary watermark
article-extractor - v7.2.9

Published by ndaidong over 1 year ago

  • Fix issue #329
  • Update dependencies
  • Improve unit test
article-extractor - v7.2.8

Published by ndaidong almost 2 years ago

  • Expose new API method extractFromHtml()
  • Update dependencies
  • Change coding style (remove standardjs)

Related issues: #321, #326

article-extractor - v7.2.7

Published by ndaidong almost 2 years ago

  • Update dependencies
  • Fix CI issues
  • Update docs & links
article-extractor - v7.2.6 - Change name

Published by ndaidong almost 2 years ago

  • Change package name from article-parser to @extractus/article-extractor
  • Move to new organization Extractus
article-extractor - v7.2.5

Published by ndaidong almost 2 years ago

  • Update dependencies
  • Improve metadata extraction
  • Add security policy
article-extractor - v7.2.4

Published by ndaidong about 2 years ago

  • Improve space/newline processing
    • line breaks are no longer all removed; only runs of multiple empty lines are stripped
    • similarly, multiple spaces are replaced with a single space
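The whitespace rules above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the function name `normalizeWhitespace` is hypothetical, and it assumes that a run of empty lines collapses to a single empty line rather than disappearing entirely.

```javascript
// Hypothetical helper illustrating the described whitespace handling.
// Not taken from the library's source.
function normalizeWhitespace(text) {
  return text
    .replace(/[ \t]+/g, ' ') // collapse runs of spaces/tabs into one space
    .replace(/ ?\n ?/g, '\n') // drop spaces that hug a line break
    .replace(/\n{3,}/g, '\n\n') // keep at most one empty line between blocks
    .trim()
}

console.log(JSON.stringify(normalizeWhitespace('Hello    world\n\n\n\nNext   line')))
```

Single line breaks survive, so paragraph structure is kept while redundant spacing is removed.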
article-extractor - v7.2.3

Published by ndaidong about 2 years ago

  • Optimize performance

By removing the HTML validation step, we made extraction about 4x to 5x faster.

Previously, article-parser checked whether the input to extract() was a URL or valid HTML before deciding the next step.
Now, if the input isn't a URL, it is assumed to be an HTML string and extraction starts immediately.

[Images: v7.2.2 (before), v7.2.3 (after)]
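The input handling described in this release can be pictured with a small sketch. The names `isHttpUrl` and `nextStep` are hypothetical and only illustrate the decision; the library's real check may differ.

```javascript
// Hypothetical sketch: decide what to do with the input to extract().
// Anything that is not an http(s) URL is treated as an HTML string,
// with no separate HTML validation pass.
function isHttpUrl(input) {
  try {
    const { protocol } = new URL(input)
    return protocol === 'http:' || protocol === 'https:'
  } catch {
    return false // new URL() throws on strings that are not absolute URLs
  }
}

function nextStep(input) {
  return isHttpUrl(input) ? 'fetch' : 'parse-html'
}

console.log(nextStep('https://example.com/post'))
console.log(nextStep('<html><body><p>Hi</p></body></html>'))
```

Skipping validation trades early error detection for speed: malformed HTML is simply handed to the parser.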

article-extractor - v7.2.2

Published by ndaidong about 2 years ago

  • Add options to extract method
    • Replace global config with on-request parserOptions
    • Add new param fetchOptions to extract() method
      • Allow passing requests through a proxy
  • Remove unnecessary dependencies to reduce bundle size
  • Fix a problem when building the ESM version for browsers
  • Add a demo for running in the browser
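One way to picture the move from global config to on-request options is the sketch below. It is an illustration under assumptions, not the library's code: `DEFAULT_PARSER_OPTIONS`, `resolveParserOptions`, and the option names and values shown are used here only as examples.

```javascript
// Hypothetical defaults, for illustration only.
const DEFAULT_PARSER_OPTIONS = Object.freeze({
  wordsPerMinute: 300,
  descriptionTruncateLen: 210,
})

// Merge the options passed with a single call over the defaults.
// No shared state is mutated, so concurrent calls cannot affect each other.
function resolveParserOptions(parserOptions = {}) {
  return { ...DEFAULT_PARSER_OPTIONS, ...parserOptions }
}

console.log(resolveParserOptions({ wordsPerMinute: 250 }))
```

With this shape, two requests can use different settings in the same process without a global setter.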
article-extractor - v7.2.1

Published by ndaidong about 2 years ago

  • Use external string-similarity
  • Improve fetch control
  • Update build script
  • Fix typos in example packages
article-extractor - v7.2.0

Published by ndaidong about 2 years ago

  • Refactor some parts to run on deno, bun and ts-node
    • Use internal string-similarity file to bypass bun.js resolve error
    • Stop depending on urlpattern-polyfill to bypass deno/bun error
      • Replace URLPattern syntax with regular RegExp
  • Add some examples for each platform
  • Remove rarely used configuration methods
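Matching URLs with plain RegExp instead of URLPattern, as mentioned above, might look like the sketch below. The pattern, the `example.com` domain, and the `matchesAny` helper are hypothetical and chosen only for illustration.

```javascript
// Hypothetical URL patterns written as regular expressions.
// RegExp is built into every JavaScript runtime, so no polyfill is needed.
const patterns = [
  /^https?:\/\/([\w-]+\.)?example\.com\/posts\/.+/,
]

// Return true when the URL matches at least one pattern.
function matchesAny(url, regexes) {
  return regexes.some((re) => re.test(url))
}

console.log(matchesAny('https://blog.example.com/posts/hello', patterns))
console.log(matchesAny('https://example.com/about', patterns))
```

The patterns avoid the `g` flag on purpose, since `test()` on a global RegExp keeps state between calls.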
article-extractor - v7.1.0

Published by ndaidong about 2 years ago

The first step toward getting it to work in deno and bun environments

  • Replace axios with cross-fetch
  • Remove 4 API methods relating to axios and htmlcrush
article-extractor - v7.0.3

Published by ndaidong about 2 years ago

  • Update dependencies
  • Remove dependency on tldts
  • Use conditional exports
  • Improve pre-defined options
article-extractor - v7.0.2

Published by ndaidong about 2 years ago

  • Update dependencies
  • Add button "Deploy to Deta"
  • Use Deta service for example faas
  • Copy types definition to cjs dist (#287)
article-extractor - v7.0.1

Published by ndaidong about 2 years ago

  • Fix potential logic error while generating description
  • Update dependencies
article-extractor - v7.0.0

Published by ndaidong about 2 years ago

Release v7.0.0 for production.

This version uses transformations instead of queryRules. The absence of pre-defined rules may break extraction for some article sources, but you can easily fix these problems with a little knowledge of DOM manipulation. After that, transformations can help you improve extraction results in a completely new way.
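As a rough idea of the DOM manipulation involved, a transformation could be shaped like the sketch below. The object layout (`patterns` plus a `pre` function that receives and returns a document) is an assumption based on the description here, and the `example.com` pattern and `.advertise-area` selector are hypothetical. Check the project README for the exact API used to register it.

```javascript
// Hypothetical transformation: strip ad blocks before extraction runs.
const transformation = {
  // URLs this transformation should apply to (illustrative pattern)
  patterns: [/^https?:\/\/([\w-]+\.)?example\.com\/.*/],
  // Receives a DOM document, edits it in place, and returns it
  pre: (document) => {
    document
      .querySelectorAll('.advertise-area')
      .forEach((element) => element.remove())
    return document
  },
}

console.log(transformation.patterns[0].test('https://news.example.com/story'))
```

Only standard DOM calls (`querySelectorAll`, `remove`) are used, so the same function works with any DOM implementation.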

article-extractor - v7.0.0rc4

Published by ndaidong over 2 years ago

  • Use tldts to get the domain and use this value as source (for a consistent format)
    • with the domain as source, you can access its favicon at https://www.google.com/s2/favicons?domain={DOMAIN.TLD}
  • Increase description length, prefer taking the summary from content, remove unnecessary parts
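Building the favicon link from the source value could look like this sketch. The helper names are hypothetical, and `sourceFromUrl` is only a crude stand-in that strips a leading `www.` from the hostname; tldts resolves the registrable domain properly, which this approximation does not.

```javascript
// Hypothetical stand-in for the domain lookup (tldts does this properly).
function sourceFromUrl(url) {
  return new URL(url).hostname.replace(/^www\./, '')
}

// Build the favicon URL using the endpoint quoted in the release note.
function faviconUrl(domain) {
  return `https://www.google.com/s2/favicons?domain=${encodeURIComponent(domain)}`
}

console.log(faviconUrl(sourceFromUrl('https://www.example.com/news/today')))
```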
article-extractor - v7.0.0rc3

Published by ndaidong over 2 years ago

  • Add default Accept-Encoding to request options
  • Update default sanitizeHtml options
  • Update dependencies
article-extractor - v7.0.0rc2

Published by ndaidong over 2 years ago

  • Update processing logic
  • Replace old concept queryRule with new one called transformation
  • Re-organize source code structure
article-extractor - v6.0.6

Published by ndaidong over 2 years ago

  • Improve set/get config methods
  • Fix potential problems with query rules
  • Apply multiple transformations from all matched query rules
  • Add more guidance on query rules