A text extraction package for Laravel
MIT License
A Laravel package for extracting text from files such as DOC, Excel, image, PDF, and more.
The following file formats are currently supported. To work with all of them, you need to install the corresponding extensions on your server. The package checks each file's content MIME type before running the extraction.
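To illustrate what a content-based MIME check means, here is a minimal sketch using PHP's built-in fileinfo extension. It is not the package's own implementation, and the helper name detectMimeType is hypothetical.

```php
<?php

// Hypothetical helper (not part of the package): detect a file's MIME type
// from its content rather than from its extension.
// Assumes the fileinfo extension is enabled.
function detectMimeType(string $path): string
{
    if (!is_file($path) || !is_readable($path)) {
        throw new InvalidArgumentException("File not readable: {$path}");
    }

    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime = $finfo->file($path);

    if ($mime === false) {
        throw new RuntimeException("Could not detect MIME type: {$path}");
    }

    return $mime;
}

// Example: a plain-text file saved with a misleading .pdf extension.
$path = sys_get_temp_dir() . DIRECTORY_SEPARATOR . uniqid('mime_', true) . '.pdf';
file_put_contents($path, "Hello, this is plain text.\n");

// Prints the type detected from the content, regardless of the extension.
echo detectMimeType($path) . PHP_EOL;

unlink($path);
```

Checking the content rather than the extension means a renamed file is still routed by what it actually contains.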
We are working hard to make this Laravel plugin useful. If you find an issue, please post it in the discussions.
composer require nilgems/laravel-textract
Once installed you can do stuff like this:
# Run the extractor
$output = Textract::run('/path/to/file.extension');
# Display the extracted text
echo $output->text;
# Display the extracted text word count
echo $output->word_count;
# Display the extracted text with direct string conversion
echo (string) $output;
Run the extractor on any supported file:
Textract::run(string $file_path, [string $job_id], [TesseractOcrOptions $extra_data]);
Option | Type | Default value | Required | Description |
---|---|---|---|---|
$file_path | String | No default value | Yes | Absolute path of the file to extract text from. |
$job_id | String | NULL | No | Optional. Extraction job ID. If left blank, the plugin creates the ID automatically. |
$extra_data | TesseractOcrOptions | NULL | No | Optional. Extra parameters. When extracting from an image file, you can specify languages and more through Nilgems\PhpTextract\ExtractorService\Ocr\Contracts\TesseractOcrOptions. |
Add the service provider to the providers array in app.php under the config folder of your application:
'providers' => [
...
Nilgems\PhpTextract\Providers\ServiceProvider::class,
...
]
To use the facade in your application, register its alias in app.php under the config folder:
'aliases' => [
...
'Textract' => Nilgems\PhpTextract\Textract::class,
...
]
To publish the config file, run:
php artisan vendor:publish --tag=textract
You can extract text from any supported file format.
It is recommended to run the extractor inside a Laravel queue job for better performance.
PHP restricts execution time and memory through the max_execution_time and memory_limit options defined in the php.ini file. If the file is large, the process may be killed when it exceeds either limit. You can use a queue (database/redis) or Laravel Horizon to run the process in the background.
........
Route::get('/textract', function(){
return Textract::run('/path/to/image/example.png');
});
........
To get better extraction output from an image file, you can specify the languages it contains:
........
Route::get('/textract', function(){
return Textract::run('/path/to/image/example.png', null, [
'lang' => ['eng', 'jpn', 'spa']
]);
});
........
# Install Tesseract OCR on Ubuntu
sudo apt update
sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
# Refresh the package index after adding the PPA, then install
sudo apt update
sudo apt install -y tesseract-ocr
tesseract --version
choco install capture2text --version 5.0
Note: Recent versions of Capture2Text stopped shipping the tesseract binary.
sudo apt update
sudo apt-get install poppler-utils
pdftotext -v
pdftotext is provided by Poppler, which is not yet available for Windows. However, you can install and use the library through the Windows Subsystem for Linux (WSL). Alternatively, you can install Laravel Homestead in your project and run it in an Ubuntu virtual server using Vagrant virtualization.
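After installing the tools above, it can help to confirm that PHP can see the tesseract and pdftotext binaries on its PATH. This is a minimal sketch in plain PHP. The helper name findBinary is hypothetical, and the package may locate binaries differently.

```php
<?php

// Hypothetical helper (not part of the package): search the PATH
// directories for an executable with the given name.
// Note: on Windows the file name would also need its extension (e.g. ".exe").
function findBinary(string $name): ?string
{
    $dirs = explode(PATH_SEPARATOR, (string) getenv('PATH'));

    foreach ($dirs as $dir) {
        if ($dir === '') {
            continue;
        }

        $candidate = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;

        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }

    return null;
}

// Report where each external tool was found, if at all.
foreach (['tesseract', 'pdftotext'] as $binary) {
    $path = findBinary($binary);
    echo $binary . ': ' . ($path ?? 'not found on PATH') . PHP_EOL;
}
```

Run it as the same user as your web server or queue worker, since their PATH may differ from your login shell's.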