scrapper

Web scraper with a simple REST API that runs in Docker and uses a headless browser and Readability.js for parsing.

Apache-2.0 License

scrapper - v0.17.0 Latest Release

Published by amerkurev 6 months ago

  • fix: handle empty strings in levenshtein_similarity to avoid division by zero
  • feat(parser): include zero and one pixel elements in hidden checks
  • feat(parser): add comment removal functionality to article parser
  • feat(htmlutil): add text content improvement function
  • fix: remove 'media' from negative regex to prevent image removal in Readability.js
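
The division-by-zero fix in the first item above can be illustrated with a minimal sketch. The release note only names `levenshtein_similarity`; the normalization by the longer string's length, the helper `levenshtein_distance`, and the choice to return 1.0 for two empty strings are assumptions made for illustration, not the project's confirmed implementation.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; assumed normalization by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Both strings are empty: without this guard the division below
        # would raise ZeroDivisionError. Treating them as identical is an
        # assumption of this sketch.
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

With this guard, `levenshtein_similarity("", "")` returns a value instead of raising, while non-empty inputs are unaffected.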

scrapper - v0.16.0

Published by amerkurev 10 months ago

  • fix: remove fixed width from code elements in custom.css
  • feat(htmlutil): expand tag checks in content improvement function

scrapper - v0.15.0

Published by amerkurev 10 months ago

  • add support for fetching any web page
  • add support for the device parameter to simulate specific device behaviors
  • deprecate explicit viewport-width, viewport-height, screen-width, and screen-height in favor of using the device parameter

BREAKING CHANGE: Default values for viewport-width, viewport-height, screen-width, and screen-height have been removed. Users should now use the device parameter to specify device-specific behaviors.
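
As a rough illustration of the migration, the sketch below builds a request URL that passes `device` instead of the four explicit size parameters. The base URL `http://localhost:3000`, the `/api/article` path, the `url` parameter name, and the device name `iPhone 12` are assumptions for the example; check the project's documentation for the actual endpoint and the accepted device names.

```python
from urllib.parse import urlencode

# Assumed values for illustration only; adjust to your deployment.
SCRAPPER_BASE_URL = "http://localhost:3000"
ARTICLE_ENDPOINT = "/api/article"


def build_request_url(page_url: str, device: str) -> str:
    """Build a query that relies on `device` rather than on
    viewport-width, viewport-height, screen-width, and screen-height."""
    query = urlencode({"url": page_url, "device": device})
    return f"{SCRAPPER_BASE_URL}{ARTICLE_ENDPOINT}?{query}"


print(build_request_url("https://example.com/post", "iPhone 12"))
```

Requests that previously depended on the default viewport and screen sizes would, under this assumption, name a device explicitly instead.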

scrapper - v0.14.0

Published by amerkurev 10 months ago

  • add Docker health check features
  • fix parser for extra HTTP headers
  • display app version in API, documentation, and UI
  • rename user_data_dir to user_dir for clarity
  • add browser context limit
  • refactor test script for clarity
  • add Pylint for code linting
  • add HTTPS and authentication deployment instructions

scrapper - v0.13.0

Published by amerkurev 10 months ago

  • update dependencies: Playwright and Readability (0.5.0)
  • move backend from Flask to FastAPI (async)
  • switch from Gunicorn to Uvicorn
  • fix some errors
  • add tests

scrapper - v0.12.0

Published by amerkurev about 1 year ago

Added --user-scripts-timeout parameter

scrapper - v0.11.0

Published by amerkurev over 1 year ago

fix HTTP 404 error

scrapper - v0.10.0

Published by amerkurev over 1 year ago

  • extract social meta tags (Open Graph, Twitter)
  • new request parameter: headless
  • new request parameter: scroll-down
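
To show what extracting social meta tags involves, here is a minimal sketch that collects Open Graph (`og:*`) and Twitter (`twitter:*`) tags from an HTML string with Python's standard `html.parser`. It is not the project's implementation, which works through a headless browser; the class `SocialMetaParser` and the function `extract_social_meta` are hypothetical names used only for this example.

```python
from html.parser import HTMLParser


class SocialMetaParser(HTMLParser):
    """Collects <meta> tags whose key starts with "og:" or "twitter:"."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph tags usually use `property`, Twitter tags use `name`.
        key = attributes.get("property") or attributes.get("name") or ""
        content = attributes.get("content")
        if content is not None and key.startswith(("og:", "twitter:")):
            self.tags[key] = content


def extract_social_meta(html: str) -> dict[str, str]:
    parser = SocialMetaParser()
    parser.feed(html)
    return parser.tags


sample = """
<head>
  <meta property="og:title" content="Example article">
  <meta name="twitter:card" content="summary" />
  <meta name="description" content="Not a social tag">
</head>
"""
print(extract_social_meta(sample))
```

Only the `og:` and `twitter:` entries are kept; ordinary meta tags such as `description` are ignored.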

scrapper - v0.9.0

Published by amerkurev over 1 year ago

add page URL after redirects

scrapper - v0.8.0

Published by amerkurev over 1 year ago

add link parser

scrapper - v0.7.0

Published by amerkurev over 1 year ago

fix some errors

scrapper - v0.6.0

Published by amerkurev over 1 year ago

modular project structure

scrapper - v0.5.0

Published by amerkurev over 1 year ago

fix error: cannot take screenshot larger than ...

scrapper - v0.4.0

Published by amerkurev over 1 year ago

first public release