A MediaWiki extension that enables full-text search of uploaded files by using Apache Tika to extract text and metadata
TikaAllTheFiles (TATF) is an extension for MediaWiki which facilitates full-text search over uploaded files by using the Apache Tika content analysis toolkit, which "detects and extracts metadata and text from over a thousand different file types". In practical terms: if you already have CirrusSearch set up and working on your wiki, TATF will allow you to perform full-text searches over the contents of almost any uploaded file, not just PDFs.
TATF's features and capabilities:
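* full-text search over the contents of uploaded files (via CirrusSearch);
* optional OCR of image content (via Tesseract);
* display of Tika-extracted metadata on `File:` pages;

Brought to you by...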
This extension is developed by the Center for Transparent Analysis and Policy, a 501(c)(3) charitable non-profit organization. If this extension is useful for your wiki, consider making a donation to support CTAP. CTAP All The Donations!
To make use of TikaAllTheFiles (TATF), you will need:
* PHP >= 8.1.0
* MediaWiki >= 1.37
* CirrusSearch extension
* Apache Tika Server >= 2.1.0
* (optional) Tesseract OCR
Setting up the prerequisites is beyond the scope of these instructions, but some pointers for Tika and Tesseract are provided in Hints and Tips.
TikaAllTheFiles (TATF) has defaults that should get it to do something useful out-of-the-box, but it is helpful to understand how it works a bit before installing it.
In MediaWiki, any operation on an uploaded file that requires interpreting its content is provided by a MediaHandler. Thumbnails, image display, metadata extraction, text extraction, etc., all depend on a MediaHandler. Without a MediaHandler for a file, MediaWiki only knows about its name, size, and MIME type.
MediaHandlers are registered to MIME types. MediaWiki provides a handful of MediaHandlers in its core code, e.g., `JpegHandler` for JPEG images (MIME type `image/jpeg`). The rest are provided by extensions. The `PdfHandler` extension, which ships with MediaWiki and is installed by default, provides a MediaHandler for PDF files (MIME type `application/pdf`).
TATF works by providing a MediaHandler that knows how to extract text and metadata by farming files out to a Tika server. Unlike a typical media extension, however, TATF does not register its MediaHandler for specific MIME types. Instead, it installs a special MediaHandlerFactory that knows how to provide its MediaHandler for any MIME type that shows up. (It's called "Tika All The Files" for a reason.)
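When MediaWiki needs a MediaHandler for a file, it asks TATF's factory, and the factory returns one of three results:
* no TATF handler at all, leaving the file to the existing MediaHandler (if any) for its MIME type;
* a "solo" TATF MediaHandler, which handles the file by itself;
* a "wrapping" TATF MediaHandler, which wraps the existing MediaHandler for the MIME type.
Which outcome occurs depends on the configuration of TATF's `MimeTypeProfiles` parameter.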
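A TATF MediaHandler offers two types of functionality:
* content: extracting text from a file, for search indexing;
* metadata: extracting metadata from a file, to be recorded in the database and displayed on the file's `File:` page.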
The content and metadata functions are independent of each other; if both are enabled and if both are invoked by MediaWiki for a given file, then TATF will actually query Tika twice for that file. One query would occur when the file is initially uploaded (to record its metadata in the database); the other would occur when the search engine indexes the file (to obtain the text content to be indexed).
A solo TATF MediaHandler will simply provide its Tika-based content and/or metadata services, and that's that. It is not able to provide thumbnails or previews or any other MediaHandler functionality.
A wrapping TATF MediaHandler is able to delegate to the wrapped MediaHandler for any function beyond content or metadata. Thus, TATF can be used to add text extraction (and enhanced metadata extraction) for MIME types and MediaHandlers that don't already support it. This is, for example, what enables TATF to be used to extract searchable text from bitmap image files.
Which of content and/or metadata functionality is provided by TATF, and how the Tika results are blended with the native output of a wrapped MediaHandler, is all configurable via the `MimeTypeProfiles` parameter.
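The three outcomes can be summarized as a small decision function. This is only a sketch, not TATF's actual code: the function name and the returned strings are hypothetical, and the logic is inferred from the `handler_strategy` keywords (`fallback`, `override`, `wrapping`) described under Configuration.

```php
<?php
// Hypothetical helper, not part of TATF: maps a profile's
// 'handler_strategy' and the presence of another MediaHandler
// to one of the three outcomes described above.
function chooseHandlerKindSketch( ?string $strategy, bool $hasOtherHandler ): string {
    switch ( $strategy ) {
        case 'fallback':
            // Only step in when nothing else handles this MIME type.
            return $hasOtherHandler ? 'none' : 'solo';
        case 'override':
            // Take over, ignoring any other handler.
            return 'solo';
        case 'wrapping':
            // Wrap the other handler if there is one to wrap.
            return $hasOtherHandler ? 'wrapping' : 'solo';
        default:
            // No complete profile: TATF leaves the file alone.
            return 'none';
    }
}
```

For example, under the default `fallback` strategy a PDF (which already has `PdfHandler`) yields `'none'`, while a file type with no handler yields `'solo'`.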
The recommended installation method for TikaAllTheFiles (TATF) is to use `composer`. This will automatically install any (future) PHP dependencies.
Go to your MediaWiki installation directory and run two `composer` commands:
$ cd YOUR-MEDIA-WIKI-DIRECTORY
$ COMPOSER=composer.local.json composer require --no-update centertap/tika-all-the-files
$ composer update centertap/tika-all-the-files --with-dependencies --no-dev --optimize-autoloader
The `require` command will add an entry for TATF to your `composer.local.json` file (creating the file if necessary). The `update` command will update your `composer.lock` file and download/install TATF in the `extensions` directory.
If you want to pin the major version of this extension (so that future updates do not inadvertently introduce breaking changes), change the first command to something like this (e.g., for major revision "194"):
$ COMPOSER=composer.local.json composer require --no-update centertap/tika-all-the-files:^194.0.0
Edit your site's `LocalSettings.php` to load the extension:
...
wfLoadExtension( 'TikaAllTheFiles' );
...
Configure TATF as needed. (See Configuration below.)
Run some post-configuration commands to (re)index files that have already been uploaded to your wiki. (See Post-configuration / Maintenance below.)
TikaAllTheFiles (TATF) has the following configuration parameters; each of them has a prefix of `$wgTikaAllTheFiles_` which has been omitted here for brevity:
| parameter | default | description |
|---|---|---|
| `TikaServiceBaseUrl` | `http://localhost:9998/` | Base URL of the Tika server |
| `QueryTimeoutSeconds` | `5` | Tika server response time limit (seconds) |
| `QueryRetryCount` | `2` | Number of times to retry a failed Tika query |
| `QueryRetryDelaySeconds` | `2` | Delay (seconds) before query retry |
| `LocalCacheSize` | `16` | Number of entries in the local query cache |
| `MimeTypeProfiles` | see below | Handler configuration, by mime-type |
| `PropertyMap` | `[]` | Additional mappings for Tika metadata |
All the parameters have nominally reasonable defaults that should cause TATF to do something useful. The most important thing is that `TikaServiceBaseUrl` points to your Tika server. More details on the parameters follow below.
`$wgTikaAllTheFiles_TikaServiceBaseUrl` (default: `http://localhost:9998/`)
`$wgTikaAllTheFiles_QueryTimeoutSeconds` (default: `5`)
`$wgTikaAllTheFiles_QueryRetryCount` (default: `2`)
`$wgTikaAllTheFiles_QueryRetryDelaySeconds` (default: `2`)
`$wgTikaAllTheFiles_LocalCacheSize` (default: `16`)
$wgTikaAllTheFiles_MimeTypeProfiles
[
  'defaults' => [
    'handler_strategy' => 'fallback',
    'allow_ocr' => false,
    'ocr_languages' => '',
    'content_strategy' => 'combine',
    'content_composition' => 'text',
    'metadata_strategy' => 'prefer_other',
    'ignore_content_service_errors' => false,
    'ignore_content_parsing_errors' => false,
    'ignore_metadata_service_errors' => false,
    'ignore_metadata_parsing_errors' => false,
    'cache_expire_success_before' => false,
    'cache_expire_failure_before' => false,
    'cache_file_backend' => false,
  ],
  '*' => 'defaults',
]
The effect of the built-in default profile configuration (shown above) is:
* Every MIME type is handled in 'fallback' mode.
* TATF will provide a "solo" handler for files that do not already have
a handler (provided by the MW core or another extension).
* The TATF handler will provide Tika-extracted text for search indexing
(but only text, not metadata).
* Text extraction will not use OCR.
* The TATF handler will provide Tika-extracted metadata to display on
a file's File: page.
* Errors encountered while querying Tika will not be ignored.
* Cached Tika responses will not be expired.
* No persistent, file-based cache will be used.
To customize the configuration, it is best to leave the `defaults` profile alone; new versions of TATF may add new default parameters to allow for seamless upgrades. Instead, create a profile that inherits from the `defaults` profile, and make all your modifications there.
For example, if you put the following in `LocalSettings.php`:
$wgTikaAllTheFiles_MimeTypeProfiles['*'] = [
  'inherits' => 'defaults',
  'handler_strategy' => 'wrapping',
  'allow_ocr' => true,
  'content_composition' => 'text_and_metadata',
  'metadata_strategy' => 'combine',
  'cache_file_backend' => 'my-tatf-cache',
];
$wgTikaAllTheFiles_MimeTypeProfiles['application/pdf'] = [
  'inherits' => '*',
  'allow_ocr' => false,
];
it will build on top of the built-in defaults with the result:
* Every MIME type is handled in 'wrapping' mode.
* TATF will provide a "solo" handler for files that do not already have
a handler, and a "wrapping" handler for those that do.
* The TATF handler will provide both Tika-extracted text and metadata for
search indexing, and combine that content with any content produced by
a wrapped handler.
* TATF will persistently cache Tika responses in a file-backend called
`'my-tatf-cache'`.
* Text extraction will use OCR if it is available --- but not for PDF files!
* The TATF handler will combine Tika-extracted metadata along with metadata
from a wrapped handler, for display on a file's File: page.
How TikaAllTheFiles (TATF) handles any particular file is determined by the file's mime-type (that is, the mime-type as decided by the MW core). TATF looks up the mime-type in the `MimeTypeProfiles` array and assembles a profile which configures a MediaHandler for the file.
The keys of the `MimeTypeProfiles` parameter array are called labels. A label can be any arbitrary string, but `'*'` has a special meaning as the catch-all label.
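A label can map to:
* a string, which is interpreted as another label (as in `'*' => 'defaults'`);
* an array, which is a profile block;
* `false`, which causes profile assembly to abort.
A profile block contains profile parameters. The special parameter `'inherits'` can be used to reference another label/block.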
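Profile assembly for a mime-type works like this:
* Choose the starting label: if the mime-type is an existing label, use that; otherwise, if `'*'` is an existing label, use that.
* If the current label maps to `false`, abort profile assembly.
* If the current label maps to another label, that becomes the next current label.
* If the current label maps to a profile block, its parameters are added to the profile without replacing any that are already set; if the block contains `'inherits'`, that becomes the next current label.
If a complete profile cannot be assembled for a mime-type, then TATF will leave the file alone and it will get handled by the existing handler (if any) for that mime-type.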
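The lookup described above can be sketched in a few lines of PHP. This is an illustration under stated assumptions, not TATF's implementation: the function name is hypothetical, and the handling of unknown labels and inheritance loops (both abort assembly here) is an assumption the text does not spell out.

```php
<?php
// Hypothetical helper, not part of TATF: assembles a profile for a
// MIME type by following labels and 'inherits' references.
// Returns null when assembly is aborted.
function assembleProfileSketch( array $profiles, string $mimeType ): ?array {
    // Starting label: the exact MIME type first, then the catch-all.
    if ( array_key_exists( $mimeType, $profiles ) ) {
        $label = $mimeType;
    } elseif ( array_key_exists( '*', $profiles ) ) {
        $label = '*';
    } else {
        return null;
    }
    $profile = [];
    $seen = [];
    while ( $label !== null ) {
        // ASSUMPTION: unknown labels and inheritance loops abort assembly.
        if ( isset( $seen[$label] ) || !array_key_exists( $label, $profiles ) ) {
            return null;
        }
        $seen[$label] = true;
        $value = $profiles[$label];
        if ( $value === false ) {
            return null;
        }
        if ( is_string( $value ) ) {
            // A string is just a pointer to another label.
            $label = $value;
            continue;
        }
        $next = $value['inherits'] ?? null;
        unset( $value['inherits'] );
        // Array union keeps parameters that are already set, so more
        // specific blocks win over the blocks they inherit from.
        $profile += $value;
        $label = $next;
    }
    return $profile;
}
```

With the earlier example configuration, `application/pdf` would resolve to `'allow_ocr' => false` from its own block, with everything else filled in from `'*'` and then `'defaults'`.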
A complete profile requires values for each of the following parameters:
* `'handler_strategy'`: keyword - one of:
  * `'fallback'`: TATF will only handle this type if there is no other handler for it
  * `'override'`: TATF will take over handling of this type, by itself
  * `'wrapping'`: TATF will handle this type, injecting its own behavior for content and/or metadata and delegating everything else to the existing handler (if any)
* `'allow_ocr'`: boolean - whether or not to allow Tika to perform OCR
* `'ocr_languages'`: string - which languages to enable for OCR; a list of Tesseract language codes separated by `+` characters
* `'content_strategy'`: keyword for how to handle text extraction - one of:
  * `'no_tika'`: don't use Tika-extracted content at all
  * `'prefer_other'`: only use Tika-extracted content if no content is provided by another handler
  * `'combine'`: combine Tika-extracted content with any content provided by another handler
  * `'prefer_tika'`: only use content provided by another handler if there is no Tika-extracted content
  * `'only_tika'`: don't use another handler's content at all
* `'content_composition'`: keyword describing what content should be indexed - one of:
  * `'text'` - index extracted text
  * `'metadata'` - index metadata
  * `'text_and_metadata'` - index extracted text and metadata
* `'metadata_strategy'`: keyword describing how TATF should handle metadata; one of:
  * `'no_tika'`: don't use Tika-extracted metadata at all
  * `'prefer_other'`: only use Tika-extracted metadata if no metadata is provided by another handler
  * `'combine'`: combine Tika-extracted metadata with any metadata provided by another handler
  * `'prefer_tika'`: only use metadata provided by another handler if there is no Tika-extracted metadata
  * `'only_tika'`: don't use another handler's metadata at all
* `'ignore_metadata_service_errors'`: boolean
* `'ignore_metadata_parsing_errors'`: boolean
* `'ignore_content_service_errors'`: boolean
* `'ignore_content_parsing_errors'`: boolean
  * `metadata` refers to a context where TATF is extracting metadata
  * `content` means a context where TATF is extracting content
  * `parsing_errors` refers to problems Tika has in processing a file
  * `service_errors` refers to problems communicating with the Tika server
  * if `false`, errors in the given context become exceptions
  * if `true`, errors in the given context are ignored
* `'cache_expire_success_before'`: string|false
* `'cache_expire_failure_before'`: string|false
  * if `false`, no expiration occurs. Otherwise, the value must be a date/time string such as `'2021-02-14T20:54:32.171+00:00'`.
* `'cache_file_backend'`: string|false
  * the name of a `FileBackend` to use for file-based caching, or `false` to disable file-based caching.

$wgTikaAllTheFiles_PropertyMap
[]
TikaAllTheFiles (TATF) contains an internal property map which controls how metadata properties are formatted, both when rendered on `File:` pages and when added to search-indexable text content. You can add new mappings, or override existing mappings, by adding entries to `$wgTikaAllTheFiles_PropertyMap`.
PropertyMap example
Configuring `PropertyMap` like so:
$wgTikaAllTheFiles_PropertyMap['dc:language'] = true;
$wgTikaAllTheFiles_PropertyMap['!'] = false;
will cause the `dc:language` property to be trivially formatted, and all other properties to be discarded.
PropertyMap configuration
The key of each key-value entry can take one of three forms:
* a property name, such as `'some-name'`, to be matched exactly;
* `'!'`, which matches any property that does not have an exact match in `$wgTikaAllTheFiles_PropertyMap`;
* `'*'`, which matches any property that does not have an exact match in `$wgTikaAllTheFiles_PropertyMap` or in the internal property map.
A Tika property will be mapped to the first match in this order:
1. an entry in `$wgTikaAllTheFiles_PropertyMap` with exactly matching name;
2. the entry in `$wgTikaAllTheFiles_PropertyMap` with special name `'!'`;
3. an entry in the internal property map with exactly matching name;
4. the entry in `$wgTikaAllTheFiles_PropertyMap` with special name `'*'`;
5. `true` if nothing matches.
The value of each entry can take one of three forms as well:
* `false` - drop/ignore the property;
* `true` - trivially format the property (render the name and value);
* `[ callable, arg1, arg2, ... ]` - process the property with `callable`.
In the third case, `callable` must be a PHP callable that accepts at least three arguments, one of which is either `false` or an `IContextSource` context for string rendering. Any additional `arg1`, `arg2`, etc., in the property map entry will be provided as additional arguments to `callable`. The return value of `callable` must be either `null` (if the property should be discarded) or an instance of `TikaAllTheFiles::ProcessedProperty`. If you are still reading at this point, you should look at the code to understand how/why to construct a `ProcessedProperty`.
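TikaAllTheFiles (TATF) implements two layers of caching of Tika responses:
* a process-local cache, which lives only for the duration of a single request;
* a persistent, file-based cache.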
The cache keeps track of both Tika query successes and failures, indexed by the SHA1 hash of the contents of queried files, not the pathnames of files. (Files often move around the system during uploads, and the same file could also be uploaded multiple times with different filenames.)
The process-local cache layer is enabled by default, and there is no known reason to ever disable it under normal operating circumstances. Due to its internal wiring, MediaWiki tends to ask TATF for metadata for the same file multiple times during a single web request while uploading a single file. This cache layer prevents TATF from unnecessarily repeatedly querying Tika during such requests.
The file-based cache layer is not enabled by default, as it requires configuration of a place to store the files. This layer is configured by the `'cache_file_backend'` parameter within the handler profiles. This allows it to be customized per MIME-type, if one has a need for that. (E.g., file-based caching could be enabled only for file types for which OCR text extraction is also enabled, or different file types could have their cache-files stored in different places.)
The entire cache system can be configured to have cache entries expire. Expiration of cached successes and cached failures are configured independently of each other. This is also controlled per MIME-type by parameters in type profiles: `'cache_expire_success_before'` and `'cache_expire_failure_before'`.
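As an illustration, a profile that expires cached entries might look like the fragment below. This is a sketch only: the timestamps are placeholders, and the reading that entries recorded before the given time are treated as expired is an assumption based on the parameter names.

```php
// LocalSettings.php fragment (sketch).
// ASSUMPTION: cache entries recorded before these timestamps are expired.
$wgTikaAllTheFiles_MimeTypeProfiles['*'] = [
    'inherits' => 'defaults',
    // Re-query Tika for results cached before this moment...
    'cache_expire_success_before' => '2021-02-14T20:54:32.171+00:00',
    // ...and retry failures cached before this moment.
    'cache_expire_failure_before' => '2021-02-14T20:54:32.171+00:00',
];
```

Keeping the two values separate makes it possible, for example, to retry old failures after fixing a Tika problem without throwing away cached successes.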
To set up a persistent file-based cache on the local filesystem:
1. Choose a directory for the cache files, separate from the `images/` directory from which media files are served, e.g., `/somewhere/on/disk/amazing-tatf-cache/`.
2. Define a `LockManager` in `$wgLockManagers`. For example:

       $wgLockManagers[] = [
         'name' => 'my-tatf-lock-manager',
         'class' => FSLockManager::class,
         'lockDirectory' => "/somewhere/on/disk/amazing-tatf-cache/lockdir",
       ];

3. Define a `FileBackend` in `$wgFileBackends`. For example:

       $wgFileBackends[] = [
         'name' => 'my-tatf-cache',
         'class' => FSFileBackend::class,
         'domainId' => '',
         'lockManager' => 'my-tatf-lock-manager',
         'basePath' => "/somewhere/on/disk/amazing-tatf-cache",
         'fileMode' => 0644,
         'directoryMode' => 0755,
       ];

4. In your handler profile(s) (in `$wgTikaAllTheFiles_MimeTypeProfiles`), set `'cache_file_backend'` to `'my-tatf-cache'`.

File-based caching should work with any `FileBackend` provided by MediaWiki; for example, there are extensions that facilitate connecting to various cloud-based storage backends.
The search indexing and metadata recording operations for an uploaded file are typically triggered once (each), when the file is uploaded. That means that after you install and configure TikaAllTheFiles (TATF), you will want to tell MediaWiki to repeat these operations for the files that have already been uploaded to your wiki.
Likewise, when you upgrade TATF or change its configuration in a way that will affect its content or metadata extraction, you may want to rescan any affected files.
If you are using the metadata extraction features of TATF (e.g., profiles with `metadata_strategy` other than `no_tika`), then you can force a refresh of metadata for all uploaded files like so:
$ cd YOUR-WIKI-INSTALL-DIRECTORY/maintenance
$ php refreshImageMetadata.php --force
It is possible to refresh only a subset of files. See https://www.mediawiki.org/wiki/Manual:RefreshImageMetadata.php for more information (or use the `--help` option).
If you are using the content extraction features of TATF (e.g., profiles with `content_strategy` other than `no_tika`), and if you are using CirrusSearch as your search engine, then you can force a re-indexing of all uploaded files like so:
$ cd YOUR-WIKI-INSTALL-DIRECTORY/extensions/CirrusSearch/maintenance/
$ php ForceSearchIndex.php
It is possible to re-index only a subset of files. Use the `--help` option to get a list of all the command-line options.
TikaAllTheFiles (TATF) doesn't do anything without access to a Tika server.
If you want to quickly fire up a Tika server to try it out:
1. Install a Java runtime, e.g., on Debian:

       $ apt install default-jre-headless

2. Download and run `tika-server-standard-2.1.0.jar`:

       $ wget https://dlcdn.apache.org/tika/2.1.0/tika-server-standard-2.1.0.jar
       $ java -jar tika-server-standard-2.1.0.jar

That should be enough to get a Tika server listening for queries at `http://localhost:9998`.
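To check by hand that the server responds, a short PHP script can send a file to Tika server's `/tika` endpoint and print the extracted text. This is a sketch for manual testing only, not how TATF itself talks to Tika; the function names are hypothetical, and it assumes PHP's `allow_url_fopen` setting is enabled.

```php
<?php
// Hypothetical helpers for a manual smoke test; not part of TATF.

// Joins a base URL and an endpoint name without doubling slashes.
function tikaEndpointUrlSketch( string $baseUrl, string $endpoint ): string {
    return rtrim( $baseUrl, '/' ) . '/' . ltrim( $endpoint, '/' );
}

// Sends the file's bytes to Tika with an HTTP PUT and asks for plain text.
// Returns the extracted text, or false if the request fails.
function tikaExtractTextSketch( string $baseUrl, string $path ) {
    $context = stream_context_create( [ 'http' => [
        'method' => 'PUT',
        'header' => "Accept: text/plain\r\n",
        'content' => file_get_contents( $path ),
        'timeout' => 5,
    ] ] );
    return @file_get_contents(
        tikaEndpointUrlSketch( $baseUrl, 'tika' ), false, $context
    );
}
```

Calling `tikaExtractTextSketch( 'http://localhost:9998/', $someFile )` with the path of any local document should return its text if the server is up, or `false` otherwise.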
There are two overlapping timeouts involved in Tika queries:
* TATF's own timeout, set by the `QueryTimeoutSeconds` parameter;
* the Tika server's timeout, set by its `taskTimeoutMillis` parameter.
You'll need to decide how long you are willing to let Tika analyze a file, and set both timeouts appropriately. For metadata, Tika is very fast, and the limiting factor is likely just the time necessary to transfer large files into Tika. On the other hand, text extraction with OCR (see below) can take multiple minutes.
Note that if TATF's `QueryTimeoutSeconds` is less than Tika's own `taskTimeoutMillis`, then if TATF times out and gives up on a query, Tika will keep chugging along, unaware that any result it produces will ultimately be ignored.
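When choosing a value for `QueryTimeoutSeconds`, it can help to estimate how long a single lookup might block once retries are included. The helper below is a hypothetical back-of-the-envelope calculation, not part of TATF; it assumes that every attempt runs to the full timeout and that every retry waits the full `QueryRetryDelaySeconds` first.

```php
<?php
// Hypothetical helper, not part of TATF.
// ASSUMPTION: each attempt may use the full timeout, and each retry is
// preceded by the full retry delay.
function worstCaseQuerySecondsSketch( int $timeout, int $retries, int $delay ): int {
    return ( $retries + 1 ) * $timeout + $retries * $delay;
}

// Built-in defaults: 5 s timeout, 2 retries, 2 s delay.
echo worstCaseQuerySecondsSketch( 5, 2, 2 ), "\n";   // prints 19
// A five-minute timeout chosen to accommodate OCR.
echo worstCaseQuerySecondsSketch( 300, 2, 2 ), "\n"; // prints 904
```

Under these assumptions, raising the timeout for OCR may be a reason to lower `QueryRetryCount` as well.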
See https://cwiki.apache.org/confluence/display/TIKA/TikaOCR for information on installing and using Tesseract with Tika.
On Debian, it is as simple as `apt install tesseract-ocr`. However, that by itself will only install the language pack for English. You will need to install more `tesseract-ocr-*` packages if you want support for other languages.
By default, Tika only enables English language support ("eng"). To enable other languages, in addition to installing the appropriate Tesseract language packs, you will need to override Tika's default configuration for the `language` parameter of `TesseractOCRParser`.
You can do this in TATF by setting a handler profile's `ocr_languages` parameter to a non-empty value. The parameter should be set to a list of Tesseract language codes, separated by `+` characters (for example, `'ocr_languages' => 'eng+fra+jpn'`).
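For instance, a profile that enables English and French OCR for PNG images might look like the fragment below. This is only a sketch: the choice of `image/png` and of the `wrapping` strategy are illustrative assumptions, and the matching Tesseract language packs must be installed separately.

```php
// LocalSettings.php fragment (sketch): OCR for PNG images in two languages.
$wgTikaAllTheFiles_MimeTypeProfiles['image/png'] = [
    'inherits' => 'defaults',
    // Wrap the existing image handler so thumbnails keep working.
    'handler_strategy' => 'wrapping',
    'allow_ocr' => true,
    // Tesseract language codes, joined with '+'.
    'ocr_languages' => 'eng+fra',
];
```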
OCR is a really neat trick, but it can also be really slow, reportedly increasing Tika query times by a factor of a hundred. For that reason, the TATF configuration defaults to disabling OCR (`'allow_ocr' => false`).
If you enable OCR:
* you will probably need to increase `$wgTikaAllTheFiles_QueryTimeoutSeconds`, since text extraction with OCR can take multiple minutes.
PDFs have an intricate relationship with Tika's OCR functionality; see the Tika wiki for the full scoop.
With Tika's default settings, it will do the following with PDFs:
So, if you want Tika to fall back to OCR on image-only PDFs, you will need to set `'allow_ocr' => true` for a PDF profile in your TATF configuration.
PdfHandler Extension
MediaWiki comes with the `PdfHandler` extension, which (with the help of a few external programs like `pdftotext`) can extract searchable text, extract metadata, and display per-page previews and thumbnails of PDF documents. In other words, `PdfHandler` does everything that TATF does and more, for PDF files. With the default configuration, TATF will let `PdfHandler` take care of PDF files.
However, you may want to configure TATF to wrap `PdfHandler` instead, for a number of possible reasons:
* `PdfHandler` stores its extracted text in the wiki database. You can prevent `PdfHandler` from extracting text at all by unsetting `$wgPdftoText` in your local settings:

      unset( $wgPdftoText );

* `PdfHandler` can only extract embedded digital text from PDFs.

For example:
$wgTikaAllTheFiles_MimeTypeProfiles['application/pdf'] = [
  'handler_strategy' => 'wrapping',
  'allow_ocr' => true,
  'content_strategy' => 'only_tika',
  'content_composition' => 'text_and_metadata',
  'metadata_strategy' => 'prefer_other',
  'inherits' => 'defaults',
];
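will cause TATF to:
* wrap `PdfHandler` (allowing `PdfHandler` to continue providing its page previews and thumbnails);
* allow Tika to perform OCR on PDF files;
* index only Tika-extracted text and metadata for search, ignoring any content from `PdfHandler`;
* prefer `PdfHandler`'s metadata, using Tika-extracted metadata only if `PdfHandler` provides none.

See `RELEASE-NOTES.md`.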
TATF is expected to work with MediaWiki 1.40 and 1.41; however, it has not yet been tested with any version >1.39. If there are any version-related issues, we would only expect them to affect MIME types configured to use the `wrapping` handler-strategy.
TATF's metadata property processing/formatting is still under development, and is currently pretty coarse. The current efforts have focused on properties that would be found in document files (versus properties found in image files, which are already handled by MediaWiki). We try to use existing MW core facilities for interpretation and localization, but Tika provides a lot of novel properties. Setting up localization for Tika-only properties is on the ToDo list.
See `TODO` comments in the source code.
TikaAllTheFiles is licensed under GPL 3.0 (or any later version). See `LICENSE` for details.
SPDX-License-Identifier: GPL-3.0-or-later