Miner is a PHP library that extracts metadata and interesting text content (such as author and summary) from HTML pages. It acts like a simplified version of the HTML metadata parser in Apache Tika.
MIT License
This library is part of Project Golem; see yoozi/golem for more info.
Ta-da! Consider the screenshot taken from LinkedIn below:
When you post a link to your connections on LinkedIn, it automatically extracts the title, summary, and even a cover image for you. Miner is typically used for tasks like this.
The easiest way to install this package is with Composer.
Open your composer.json and add the following to the require array:
"yoozi/miner": "1.0.*"
Run Composer to install or update the new package dependencies.
php composer install
or
php composer update
The Hybrid parser is enabled by default. You can switch parsers to best fit your needs:
// Use the Readability Parser.
$extractor->getConfig()->set('parser', 'readability');
// Or...use the Hybrid Parser.
// $extractor->getConfig()->set('parser', 'hybrid');
// Or...use the Meta Parser.
// $extractor->getConfig()->set('parser', 'meta');
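If the parser name comes from user input or a configuration file, it may be worth validating it before handing it to the extractor. The following is a minimal sketch, not part of Miner: `resolveParser()` is a hypothetical helper, and its list of names is taken only from the three parsers mentioned above, so it is an assumption that no other names are valid.

```php
<?php

/**
 * Hypothetical helper (not part of Miner): returns a parser name that is
 * one of the three mentioned in this README, falling back to 'hybrid'.
 */
function resolveParser(?string $name, string $default = 'hybrid'): string
{
    // Parser names as they appear in this README; assumed to be the full list.
    $known = ['hybrid', 'readability', 'meta'];

    // Normalize case and surrounding whitespace before comparing.
    $name = strtolower(trim((string) $name));

    return in_array($name, $known, true) ? $name : $default;
}

// Usage with an existing $extractor instance (assumed to be created elsewhere):
// $extractor->getConfig()->set('parser', resolveParser($_GET['parser'] ?? null));
```

Falling back to `hybrid` mirrors the default stated above, so an unrecognized name leaves the extractor in its default mode.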
We can parse a remote URL and extract its metadata directly.
<?php
use Yoozi\Miner\Extractor;
use Buzz\Client\Curl;
$extractor = new Extractor();
// Use the Hybrid Parser.
$extractor->getConfig()->set('parser', 'hybrid');
// Strip all HTML tags in the description we parsed.
$extractor->getConfig()->set('strip_tags', true);
$meta = $extractor->fromUrl('http://www.example.com/', new Curl)->run();
var_dump($meta);
Data returned:
array(9) {
["title"]=>
string(14) "Example Domain"
["author"]=>
NULL
["keywords"]=>
array(0) {
}
["description"]=>
string(220) "
Example Domain
This domain is established to be used for illustrative examples in documents. You may use this
domain in examples without prior coordination or asking for permission.
More information...
"
["image"]=>
NULL
["url"]=>
string(23) "http://www.example.com/"
["host"]=>
string(22) "http://www.example.com"
["domain"]=>
string(11) "example.com"
["favicon"]=>
string(52) "http://www.google.com/s2/favicons?domain=example.com"
}
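As a rough illustration of the link-preview use case described earlier, the returned array could be turned into a small HTML snippet. This is a sketch under stated assumptions: `renderPreview()` is a hypothetical helper that is not part of Miner, the markup and the `preview` class name are made up for the example, and the code relies only on the `title`, `description`, `image`, and `url` keys shown in the output above.

```php
<?php

/**
 * Hypothetical helper (not part of Miner): builds a small HTML preview
 * from a metadata array shaped like the output above.
 */
function renderPreview(array $meta): string
{
    // Collapse the line breaks and repeated whitespace seen in the description.
    $summary = trim((string) preg_replace('/\s+/', ' ', (string) ($meta['description'] ?? '')));

    // Escape every value before placing it into HTML.
    $escape = function ($value) {
        return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
    };

    $html = '<a class="preview" href="' . $escape($meta['url'] ?? '') . '">';

    // The image may be NULL, as in the example output, so it is optional.
    if (!empty($meta['image'])) {
        $html .= '<img src="' . $escape($meta['image']) . '" alt="">';
    }

    $html .= '<strong>' . $escape($meta['title'] ?? '') . '</strong>';
    $html .= '<p>' . $escape($summary) . '</p>';
    $html .= '</a>';

    return $html;
}

// Sample data shortened from the output above.
$meta = [
    'title'       => 'Example Domain',
    'description' => "\nExample Domain\nThis domain is established to be used for illustrative examples in documents.\n",
    'image'       => null,
    'url'         => 'http://www.example.com/',
];

echo renderPreview($meta), PHP_EOL;
```

Normalizing whitespace here is a design choice for display only; the array returned by the extractor is left untouched.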