crawler

A website crawler implementation written in PHP. Highly extendable, indexes PDFs, and is very memory efficient.
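A minimal usage sketch (the class names follow the library's documented setup pattern but should be verified against the current README; the target URL is a placeholder):

```php
<?php
require 'vendor/autoload.php';

use Nadar\Crawler\Crawler;
use Nadar\Crawler\Handlers\DebugHandler;
use Nadar\Crawler\Parsers\HtmlParser;
use Nadar\Crawler\Runners\LoopRunner;
use Nadar\Crawler\Storage\ArrayStorage;

// Crawl a site, parse HTML pages, and dump the results for debugging.
$crawler = new Crawler('https://example.com/', new ArrayStorage(), new LoopRunner());
$crawler->addParser(new HtmlParser());
$crawler->addHandler(new DebugHandler());
$crawler->setup();
$crawler->run();
```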

MIT License

Downloads: 19.7K · Stars: 10 · Committers: 2


crawler - 1.7.1 Latest Release

Published by nadar over 2 years ago

1.7.1 (5. April 2022)

  • Added a catch for Throwable when parsing PDFs; also updated to the latest version of smalot/pdfparser.
crawler - 1.7.0

Published by nadar about 3 years ago

1.7.0 (10. August 2021)

  • #20 Improved tag stripping in the HTML parser to generate cleaner, more readable output when $stripTags is enabled. Markup like <p>foo</p><p>bar</p> is now handled as foo bar instead of foobar.
crawler - 1.6.2

Published by nadar over 3 years ago

1.6.2 (16. April 2021)

  • #18 Fixed an issue with pages that have UTF-8 characters in the title tag.
crawler - 1.6.1

Published by nadar over 3 years ago

1.6.1 (16. April 2021)

  • #17 Fixed an issue where the crawler group was not generated correctly.
crawler - 1.6.0

Published by nadar over 3 years ago

1.6.0 (16. March 2021)

  • #15 Links with rel="nofollow" are no longer followed by default. This can be configured via the HtmlParser::$ignoreRels property.
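If nofollow links should still be crawled, the default can be overridden through the property named above (a sketch; the empty-array value assumes the property holds a list of rel values to skip):

```php
use Nadar\Crawler\Parsers\HtmlParser;

$parser = new HtmlParser();
// By default this list contains 'nofollow'; emptying it makes the
// crawler follow rel="nofollow" links again.
$parser->ignoreRels = [];
```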
crawler - 1.5.0

Published by nadar almost 4 years ago

1.5.0 (13. January 2021)

  • #14 The status code of the response is now passed to the parsers; only HTML and PDF responses with status 200 (OK) are processed.
crawler - 1.4.0

Published by nadar almost 4 years ago

1.4.0 (13. January 2021)

  • #13 The new Crawler method getCycles() returns the number of times the run() method was called.
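The cycle counter can be used, for example, to report progress after a crawl (a sketch; the crawler setup is omitted and assumed to be configured as in the library's README):

```php
// ... after configuring $crawler and calling $crawler->setup():
$crawler->run();

// getCycles() reports how many times run() has been invoked.
echo 'run() was called ' . $crawler->getCycles() . " time(s).\n";
```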
crawler - 1.3.0

Published by nadar almost 4 years ago

1.3.0 (20. December 2020)

  • #10 Added a relative URL check to the Url class.
  • #8 Merge the path of a URL when only a query parameter is provided.
crawler - 1.2.1

Published by nadar almost 4 years ago

1.2.1 (17. December 2020)

  • #9 Fixed an issue where the CRAWL_IGNORE tag had no effect. The array value for found links (which equals the link title) is now trimmed.
crawler - 1.2.0

Published by nadar almost 4 years ago

1.2.0 (14. November 2020)

  • #7 By default, response content larger than 5 MB is no longer passed to the parsers. To turn off this behavior, use 'maxSize' => false, or increase the limit, e.g. 'maxSize' => 15000000 (15 MB). The value must be provided in bytes. The main goal is to keep the PDF parser from consuming very large amounts of memory. This restriction does not stop the crawler from downloading the URL (whether or not it exceeds the maxSize limit); it only prevents memory exhaustion when the parsers start to interact with the response content.
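Based on the key/value form shown above, the limit could be configured like this (a sketch; the changelog shows only the option key, so exactly where the options array is passed in the Crawler setup is an assumption to verify against the docs):

```php
// Disable the size restriction entirely:
$options = ['maxSize' => false];

// Or raise the limit to 15 MB (the value is in bytes):
$options = ['maxSize' => 15000000];
```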
crawler - 1.1.2

Published by nadar almost 4 years ago

1.1.2 (12. November 2020)

  • Decreased the cURL request timeout. A site now times out after 5 seconds when attempting to crawl it.
crawler - 1.1.1

Published by nadar almost 4 years ago

1.1.1 (21. October 2020)

  • #5 Fixed a bug in the unfinished isValid function, which checks whether a URL is a mailto link or similar.
crawler - 1.1.0

Published by nadar almost 4 years ago

1.1.0 (21. October 2020)

  • #4 Added an option to encode the URL paths.
crawler - 1.0.0

Published by nadar about 4 years ago

1.0.0 (25. September 2020)

  • First stable release.