A website crawler implementation written in PHP. Highly extendible, indexes PDFs, and is very memory efficient.
MIT License
A highly extendible, dependency-free crawler for HTML, PDFs, or any other type of document.
Why another page crawler? Indeed, there are already very good crawlers around, so these were my goals:
Composer is required to install this library:

```sh
composer require nadar/crawler
```
In order to use the PDF parser, the optional library smalot/pdfparser must be installed:

```sh
composer require smalot/pdfparser
```
Create your handler; handlers are the classes which interact with the crawler in order to store your content/results somewhere. The afterRun() method runs whenever a URL is crawled and receives the results:
```php
use Nadar\Crawler\Crawler;
use Nadar\Crawler\Result;

class MyCrawlHandler implements \Nadar\Crawler\Interfaces\HandlerInterface
{
    public function afterRun(Result $result)
    {
        echo $result->title . " with content " . $result->content . " for url " . $result->url->getNormalized();
    }

    public function onSetup(Crawler $crawler)
    {
        // Runs before the crawler starts; e.g. truncate the temporary table where the results will be stored.
    }

    public function onEnd(Crawler $crawler)
    {
        // Runs when the crawler has finished; e.g. synchronize your temporary index table with the "real" site index.
    }
}
```
```php
$crawler = new Crawler('https://luya.io', new ArrayStorage, new LoopRunner);

// What kind of document types would you like to parse?
$crawler->addParser(new Nadar\Crawler\Parsers\Html);

// Optional: enabling the PDF parser increases memory consumption.
// $crawler->addParser(new Nadar\Crawler\Parsers\Pdf);

// Register your handler in order to interact with the results, e.g. store them in a database.
$crawler->addHandler(new MyCrawlHandler);

// Setup and start the crawl process.
$crawler->setup();
$crawler->run();
```
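As a sketch of the "store them in a database" idea above: only the HandlerInterface methods come from the library; the PDO usage and the pre-existing SQLite table `pages(url, title, content)` are illustrative assumptions.

```php
use Nadar\Crawler\Crawler;
use Nadar\Crawler\Result;

class DatabaseCrawlHandler implements \Nadar\Crawler\Interfaces\HandlerInterface
{
    private $pdo;

    public function __construct(PDO $pdo)
    {
        $this->pdo = $pdo;
    }

    public function onSetup(Crawler $crawler)
    {
        // Start from an empty index table for every crawl run (illustrative schema).
        $this->pdo->exec('DELETE FROM pages');
    }

    public function afterRun(Result $result)
    {
        // Persist every crawled page.
        $stmt = $this->pdo->prepare('INSERT INTO pages (url, title, content) VALUES (?, ?, ?)');
        $stmt->execute([$result->url->getNormalized(), $result->title, $result->content]);
    }

    public function onEnd(Crawler $crawler)
    {
        // This is where a temporary table could be swapped with the live index.
    }
}
```

Registering it works the same as with any other handler, e.g. `$crawler->addHandler(new DatabaseCrawlHandler(new PDO('sqlite:index.db')));`.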
Attention: Keep in mind that when you enable the PDF parser and run multiple concurrent requests, memory usage can increase drastically (especially with large PDFs)! It is therefore recommended to lower the concurrency when enabling the PDF parser.
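For example, assuming the crawler exposes a public property controlling the number of parallel requests (the name `concurrentJobs` below is an assumption; check the Crawler class for the actual property), the setup could look like this:

```php
// Hypothetical property name: verify against the Crawler class source.
$crawler->concurrentJobs = 5; // lower concurrency to keep memory in check
$crawler->addParser(new Nadar\Crawler\Parsers\Pdf);
```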
Of course these benchmarks may vary depending on internet connection, bandwidth, and servers, but all tests were made under the same conditions. The memory peak varies strongly when the PDF parser is used, therefore we tested only with the HTML parser:
| Index Size | Concurrent Requests | Memory Peak | Time | Storage |
|---|---|---|---|---|
| 308 | 30 | 6MB | 19s | ArrayStorage |
| 308 | 30 | 6MB | 20s | FileStorage |
Still looking for a good website to use for benchmarking. See the benchmark.php file for the test setup.
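A minimal sketch of how comparable numbers can be taken, using only standard PHP functions around the crawler run shown above:

```php
$start = microtime(true);

$crawler->setup();
$crawler->run();

// Report wall-clock time and the peak memory used by the whole run.
printf(
    "time: %ds, memory peak: %dMB\n",
    round(microtime(true) - $start),
    round(memory_get_peak_usage(true) / 1024 / 1024)
);
```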
For a better understanding, here is an explanation of how the classes are encapsulated and what they are used for.
Lifecycle

```
Crawler -> Job -> (ItemQueue -> Storage) -> RequestResponse -> Parser -> ParserResult -> Result
```