crawler

A website crawler implementation written in PHP. Highly extendable, indexes PDFs, and is very memory efficient.
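A minimal usage sketch (the class names follow the library's documented setup pattern but should be verified against the current README; the target URL is a placeholder):

```php
<?php
require 'vendor/autoload.php';

use Nadar\Crawler\Crawler;
use Nadar\Crawler\Handlers\DebugHandler;
use Nadar\Crawler\Parsers\HtmlParser;
use Nadar\Crawler\Runners\LoopRunner;
use Nadar\Crawler\Storage\ArrayStorage;

// Crawl a site, parse HTML pages, and dump the results for debugging.
$crawler = new Crawler('https://example.com/', new ArrayStorage(), new LoopRunner());
$crawler->addParser(new HtmlParser());
$crawler->addHandler(new DebugHandler());
$crawler->setup();
$crawler->run();
```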

MIT License

Downloads: 19.7K · Stars: 10 · Committers: 2


crawler - 1.7.1 Latest Release

Published by nadar over 2 years ago

1.7.1 (5. April 2022)

  • Added a catch for Throwable when parsing PDFs; also updated to the latest version of smalot/pdfparser.
crawler - 1.7.0

Published by nadar about 3 years ago

1.7.0 (10. August 2021)

  • #20 Improved tag stripping in the HTML parser to generate cleaner, more readable output when $stripTags is enabled. Markup like <p>foo</p><p>bar</p> is now handled as foo bar instead of foobar.
crawler - 1.6.2

Published by nadar over 3 years ago

1.6.2 (16. April 2021)

  • #18 Fixed an issue with pages that have UTF-8 characters in the title tag.
crawler - 1.6.1

Published by nadar over 3 years ago

1.6.1 (16. April 2021)

  • #17 Fixed an issue where the crawler group was not generated correctly.
crawler - 1.6.0

Published by nadar over 3 years ago

1.6.0 (16. March 2021)

  • #15 Links with rel="nofollow" are no longer followed by default. This can be configured via the HtmlParser::$ignoreRels property.
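If nofollow links should still be crawled, the default can be overridden through the property named above (a sketch; the empty-array value assumes the property holds a list of rel values to skip):

```php
use Nadar\Crawler\Parsers\HtmlParser;

$parser = new HtmlParser();
// By default this list contains 'nofollow'; emptying it makes the
// crawler follow rel="nofollow" links again.
$parser->ignoreRels = [];
```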
crawler - 1.5.0

Published by nadar almost 4 years ago

1.5.0 (13. January 2021)

  • #14 The status code of the response is now passed to the parsers; only HTML and PDF responses with status 200 (OK) are processed.
crawler - 1.4.0

Published by nadar almost 4 years ago

1.4.0 (13. January 2021)

  • #13 The new Crawler method getCycles() returns the number of times the run() method was called.
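The cycle counter can be used, for example, to report progress after a crawl (a sketch; the crawler setup is omitted and assumed to be configured as in the library's README):

```php
// ... after configuring $crawler and calling $crawler->setup():
$crawler->run();

// getCycles() reports how many times run() has been invoked.
echo 'run() was called ' . $crawler->getCycles() . " time(s).\n";
```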
crawler - 1.3.0

Published by nadar almost 4 years ago

1.3.0 (20. December 2020)

  • #10 Added a relative URL check to the Url class.
  • #8 Merge the path of a URL when only a query parameter is provided.
crawler - 1.2.1

Published by nadar almost 4 years ago

1.2.1 (17. December 2020)

  • #9 Fixed an issue where the CRAWL_IGNORE tag had no effect. The array value for found links (which equals the link title) is now trimmed.
crawler - 1.2.0

Published by nadar almost 4 years ago

1.2.0 (14. November 2020)

  • #7 By default, response content larger than 5 MB is no longer passed to the parsers. To turn off this behavior, use 'maxSize' => false, or increase the limit, e.g. 'maxSize' => 15000000 (15 MB). The value must be provided in bytes. The main goal is to keep the PDF parser from consuming very large amounts of memory. This restriction does not stop the crawler from downloading the URL (whether or not it exceeds the maxSize limit); it only prevents memory exhaustion when the parsers start to interact with the response content.
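Based on the key/value form shown above, the limit could be configured like this (a sketch; the changelog shows only the option key, so exactly where the options array is passed in the Crawler setup is an assumption to verify against the docs):

```php
// Disable the size restriction entirely:
$options = ['maxSize' => false];

// Or raise the limit to 15 MB (the value is in bytes):
$options = ['maxSize' => 15000000];
```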
crawler - 1.1.2

Published by nadar almost 4 years ago

1.1.2 (12. November 2020)

  • Decreased the cURL request timeout. A site now times out after 5 seconds when attempting to crawl it.
crawler - 1.1.1

Published by nadar almost 4 years ago

1.1.1 (21. October 2020)

  • #5 Fixed a bug in the unfinished isValid function, which checks whether a URL is a mailto link or similar.
crawler - 1.1.0

Published by nadar almost 4 years ago

1.1.0 (21. October 2020)

  • #4 Added an option to encode the URL paths.
crawler - 1.0.0

Published by nadar about 4 years ago

1.0.0 (25. September 2020)

  • First stable release.