Proof-of-concept tool for extracting data from Bags and indexing it in Elasticsearch
UNLICENSE License
A proof-of-concept tool for extracting data from serialized (zipped) Bags and indexing it in Elasticsearch. Its purpose is to demonstrate potential techniques for managing Bags, ranging from retrieving a specific file in a Bag to preparing for digital preservation processes such as auditing or format migrations.
For example, questions you can ask of the sample data in this Git repository include:

* which Bags contain a specific file in their `data` directory
* which Bags contain specific kinds of files in their `data` directory

With a little more development beyond this proof of concept, you could ask questions like:

* which Bags have `fetch` URLs
* which Bags have `fetch` URLs that point to a specific hostname

Using Elasticsearch's Kibana, it is possible to create visualizations of the indexed data. This video provides a useful introduction to Kibana.
Features that may be desirable in a tool based on this proof of concept include:
This proof of concept implementation can index Bags stored at disparate locations (and on heterogeneous hardware):
In addition to preservation staff querying the index, automated processes can as well, for example a script to generate a daily list of new Bags added to the index.
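Such a daily-report script would boil down to a date-range query against Elasticsearch. The sketch below (Python, standard library only) just builds the query body; it assumes the index records an indexing timestamp in a field such as `document_timestamp` (the field name is an assumption here, so adjust it to match your mapping):

```python
import json

def new_bags_query(since="now-1d/d"):
    """Build an Elasticsearch query body for Bags indexed since `since`.
    Assumes documents carry a `document_timestamp` field recording when
    they were (re)indexed -- adjust the field name to your mapping."""
    return {
        "query": {
            "range": {
                "document_timestamp": {"gte": since}
            }
        },
        "sort": [{"document_timestamp": {"order": "desc"}}],
    }

# POST this body to http://localhost:9200/bags/_search to get the daily list.
print(json.dumps(new_bags_query(), indent=2))
```

A cron job could POST this body once a day and mail the hits to preservation staff.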
To install and run this proof of concept indexer, you will need:

* PHP and Composer (the indexer scripts are PHP)
* a running Elasticsearch instance
* the Python watchdog library, if you want to use the `watch` script

To install the Bagit Indexer:

```
cd bagit_indexer
php composer.phar install
```

(or equivalent on your system, e.g., `./composer install`).

`./index` extracts data from Bags and pushes it into Elasticsearch.
Run `./index --help` to get help info:

```
--help
  Show the help page for this command.
-i/--input <argument>
  Required. Absolute or relative path to either a directory containing Bags
  (trailing slash is optional), or a Bag filename.
-c/--content_files <argument>
  Comma-separated list of plain text or XML file paths relative to the Bag
  data directory that are to be indexed into the "content" field, e.g.,
  "--content MODS.xml,notes.txt".
-e/--elasticsearch_url <argument>
  URL (including port number) of your Elasticsearch endpoint. Default is
  "http://localhost:9200".
-x/--elasticsearch_index <argument>
  Elasticsearch index. Default is "bags".
```
To index Bags (serialized or loose) in your input directory, run the `index` script like this:

```
./index -i sample_bags
```

You will see the following:

```
====================================================================================================> 100%
Done. 5 Bags added to http://localhost:9200/bags
```
This indexing results in an Elasticsearch document for each Bag like this:

```json
{
  "_index": "bags",
  "_type": "bag",
  "_id": "ebd53651c768da1dbca352988e8a93d3f5f9c2d7",
  "_version": 2,
  "found": true,
  "_source": {
    "bag_location_exact": "\/home\/mark\/Documents\/hacking\/bagit\/bagit_indexer\/sample_bags\/bag_03.tgz",
    "bag_location": "\/home\/mark\/Documents\/hacking\/bagit\/bagit_indexer\/sample_bags\/bag_03.tgz",
    "bag_validated": {
      "timestamp": "2017-11-19T22:36:52Z",
      "result": "valid"
    },
    "bag_hash": {
      "type": "sha1",
      "value": "ebd53651c768da1dbca352988e8a93d3f5f9c2d7"
    },
    "bagit_version": {
      "major": 0,
      "minor": 96
    },
    "fetch": {
      "fileName": "fetch.txt",
      "data": [],
      "fileEncoding": "UTF-8"
    },
    "serialization": "tgz",
    "content": "",
    "bag-info": {
      "External-Description": "A simple bag.",
      "Bagging-Date": "2016-02-28",
      "Internal-Sender-Identifier": "bag_03",
      "Source-Organization": "Acme Bags",
      "Contact-Email": "[email protected]"
    },
    "data_files": [
      "data\/atextfile.txt",
      "data\/master.tif",
      "data\/metadata.xml"
    ],
    "manifest": {
      "fileName": "manifest-sha1.txt",
      "hashEncoding": "sha1",
      "fileEncoding": "UTF-8",
      "data": {
        "data\/atextfile.txt": "eb2614a66a1d34a6d007139864a1a9679c9b96aa",
        "data\/master.tif": "44b16ef126bd6e0ac642460ddb1d8b1551064b03",
        "data\/metadata.xml": "78f4cb10e0ad1302e8f97f199620d8333efaddfb"
      }
    },
    "tombstone": false
  }
}
```
This is the data that you will be querying in the "Finding Bags" section.
Within the index, each Bag is identified by its SHA1 checksum value at the time of initial indexing. Using the SHA1 value ensures that each Bag's ID is unique. Alternative identifiers include the Bag's filename or the value of a required tag in the `bag-info.txt` file. However, both of these are problematic because it would be very difficult to guarantee that they provide unique values. Another option is to have the `index` script assign a UUID. A UUID would also be unique, but the SHA1 value has the added advantage of being derivable from the serialized Bag file itself in the event that the Elasticsearch index is lost.

This advantage only applies to Bags that have never been modified: once a Bag is modified, its current SHA1 checksum no longer matches its indexed ID. This disadvantage can be mitigated by storing the history of changes to each Bag's Elasticsearch document in a Git repository, so that the Bag's current SHA1 value can be searched for in the Git history and its original ID recovered from there.
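Re-deriving the ID is simply a matter of re-hashing the serialized Bag file. A minimal sketch in Python (the function name is illustrative, not part of the tool):

```python
import hashlib

def bag_id_from_file(path, chunk_size=8192):
    """Recompute a serialized Bag's index ID: the SHA1 checksum of the
    file itself. This matches the indexed ID only if the Bag has not
    been modified since it was first indexed."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so large Bags don't have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()
```

Given an unmodified copy of `bag_03.tgz`, this would return the `_id` shown in the sample document above.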
Including the `--content_files` option will index the content of the specified files and store it in the Elasticsearch 'content' field. You should only include paths to plain text or XML files, not paths to image, word processing, or other binary files. If you list multiple files, the content from all files is combined into one 'content' field.
A possible enhancement to this feature would be to use Apache Tika to extract the text content from a wide variety of file formats.
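The combining step can be as simple as concatenating the files' text. A sketch of what it might look like (the function name and separator are illustrative; the indexer's actual PHP code may differ):

```python
import os

def combine_content_files(bag_data_dir, relative_paths):
    """Read the listed plain text/XML files from a Bag's data directory
    and join them into a single string for the 'content' field."""
    parts = []
    for rel in relative_paths:
        with open(os.path.join(bag_data_dir, rel), encoding="utf-8") as f:
            parts.append(f.read())
    # Newline-join so tokens from adjacent files don't run together.
    return "\n".join(parts)
```

A Tika-based version would replace the `open()`/`read()` step with a call out to a text-extraction service, leaving the combining logic unchanged.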
The `find` script allows you to perform simple queries against the indexed data. The following types of queries are possible:

* on the value of the `bag-info.txt` 'External-Description' tag
* on the value of the `bag-info.txt` 'Bagging-Date' tag
* on the value of the `bag-info.txt` 'Source-Organization' tag
* on the names of files in the Bag's `data` directory
* on the path given in `index`'s `--input` option when the index was populated

Queries take the form `-q field:query`. For example, to search for the phrase "cold storage" in the description, run the following command (note that quotes are required because of the space in the query):

```
./find -q "description:cold storage"
```
which will return the following results:

```
Your query found 2 hit(s):
--------------------------------------------------------------------------------------------------------------------------------
| Bag ID                                   | External-Description                                                              |
================================================================================================================================
| 212835b8628503774e482279167a1c965d107303 | Contains some stuff we want to put into cold storage.                             |
--------------------------------------------------------------------------------------------------------------------------------
| 0216ce82b6a3c4ff127c28569f4ae84589bc3e99 | Contains some stuff we want to put into cold storage, and that is very important. |
--------------------------------------------------------------------------------------------------------------------------------
```
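An argument like "description:cold storage" has to be split on the first colon only, so that query values containing further colons survive intact. A sketch of such parsing (illustrative; not the `find` script's actual code):

```python
def parse_query(arg):
    """Split a 'field:query' argument on the first colon only."""
    field, sep, query = arg.partition(":")
    if not sep or not query:
        raise ValueError("expected field:query, got %r" % arg)
    return field, query
```

For example, `parse_query("description:cold storage")` yields the field `description` and the search phrase `cold storage`.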
To search for Bags that have a Bagging-Date of "2017-06-18", run this command:

```
./find -q date:2017-06-18
```

which will return the following result:

```
Your query found 4 hit(s):
-----------------------------------------------------------
| Bag ID                                   | Bagging-Date |
===========================================================
| 0216ce82b6a3c4ff127c28569f4ae84589bc3e99 | 2017-06-18   |
-----------------------------------------------------------
| 212835b8628503774e482279167a1c965d107303 | 2017-06-18   |
-----------------------------------------------------------
| 7c17053b7d30abd69c5e0eb10d5cc4c2ad915f4f | 2017-06-18   |
-----------------------------------------------------------
| fa50e06f6cc12e9e1b90e84da1f394bb8b624d54 | 2017-06-18   |
-----------------------------------------------------------
```
To search for Bags that contain a file under `data` named 'master.tif', run this command:

```
./find -q file:master.tif
```

which will return the following result:

```
Your query found 1 hit(s):
-----------------------------------------------------------------------------------------------------
| Bag ID                                   | Data files                                             |
=====================================================================================================
| ebd53651c768da1dbca352988e8a93d3f5f9c2d7 | data/atextfile.txt, data/master.tif, data/metadata.xml |
-----------------------------------------------------------------------------------------------------
```
If you want to see a list of all Bags' IDs and file path locations, issue the following command:

```
./find -a
```

If you want to retrieve the raw Elasticsearch document for a specific Bag, use the `--id` option instead of the `-q` option, and provide the Bag's ID:

```
./find --id ebd53651c768da1dbca352988e8a93d3f5f9c2d7
```

Here are the values from `bag-info.txt` tags and the list of files in the `data` directories for the sample Bags, in case you want to try some searches of your own:
The Python script `watch` will monitor a directory for new and updated Bags and index them automatically. Run it like this:

```
./watch /path/to/input/dir
```

where /path/to/input/dir is the directory you want to watch. This should correspond to the directory specified in the `-i`/`--input` option used with `index`. Currently the watcher only reacts to new and deleted Bag files, but it would be possible to make it react to modified, renamed, and moved Bag files as well (provided those features were added to the `index` script).
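The real `watch` script is built on the watchdog library, but the underlying new/deleted detection can be sketched with nothing but the standard library. This is an illustrative polling version, not the actual `watch` code:

```python
import os
import time

def scan(path):
    """Snapshot the (non-hidden) entries in a directory."""
    return {e for e in os.listdir(path) if not e.startswith(".")}

def diff_scans(before, after):
    """Return (new, deleted) entries between two snapshots."""
    return sorted(after - before), sorted(before - after)

def watch_loop(path, on_new, on_deleted, interval=5):
    """Poll `path` and report Bag files that appear or disappear."""
    seen = scan(path)
    while True:
        time.sleep(interval)
        current = scan(path)
        new, deleted = diff_scans(seen, current)
        for name in new:
            on_new(os.path.join(path, name))      # e.g., re-run ./index on it
        for name in deleted:
            on_deleted(os.path.join(path, name))  # e.g., run the tombstone script
        seen = current
```

Watchdog replaces the polling loop with OS-level filesystem events, which is why the real script also gets modified/moved notifications for free.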
Deletions of Bags should be recorded with the `tombstone` script, which updates the Bag's entry in the index in the following ways:

* the `tombstone` field is updated to `true`
* the `document_timestamp` field is updated to the date when `tombstone` was run

The `tombstone` command's parameters are:

```
--help
  Show the help page for this command.
-e/--elasticsearch_url <argument>
  URL (including port number) of your Elasticsearch endpoint. Default is
  "http://localhost:9200".
-x/--elasticsearch_index <argument>
  Elasticsearch index. Default is "bags".
-i/--id <argument>
  The ID of the Bag to create the tombstone for. Use either this option or --path.
-p/--path <argument>
  Absolute or relative path to the Bag filename to create the tombstone for.
  Use either this option or --id.
```
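In Elasticsearch terms, tombstoning amounts to a partial update of the Bag's document. A sketch of what such an update body could look like, using the two field names listed above (the helper itself is illustrative, not the `tombstone` script's code):

```python
import datetime
import json

def tombstone_body(now=None):
    """Build an Elasticsearch partial-update body that flags a Bag as
    tombstoned and records when the flag was set."""
    if now is None:
        now = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"doc": {"tombstone": True, "document_timestamp": now}}

# One way to apply it: POST this body to the document's _update endpoint
# (the exact URL pattern depends on your Elasticsearch version).
print(json.dumps(tombstone_body()))
```

Because it is an update rather than a delete, the Bag's original metadata stays queryable alongside the tombstone flag.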
To see which Bag entries in the index are flagged as tombstones, you can issue queries like this:

```
./find -q "tombstone:true"
```

```
Your query found 1 hit(s):
--------------------------------------------------------
| Bag ID                                   | Tombstone |
========================================================
| 212835b8628503774e482279167a1c965d107303 | 1         |
--------------------------------------------------------
```

```
./find -q "tombstone:false"
```

```
Your query found 4 hit(s):
--------------------------------------------------------
| Bag ID                                   | Tombstone |
========================================================
| 0216ce82b6a3c4ff127c28569f4ae84589bc3e99 |           |
--------------------------------------------------------
| ebd53651c768da1dbca352988e8a93d3f5f9c2d7 |           |
--------------------------------------------------------
| 7c17053b7d30abd69c5e0eb10d5cc4c2ad915f4f |           |
--------------------------------------------------------
| fa50e06f6cc12e9e1b90e84da1f394bb8b624d54 |           |
--------------------------------------------------------
```

The false values show up as blank in the results; that is normal.
To the extent possible under law, Mark Jordan has waived all copyright and related or neighboring rights to this work. This work is published from Canada.
Since this is proof-of-concept code, I don't intend to add a lot more features. However, this proof of concept could be used as the basis for a production application. Fork and enjoy!
That said, if you have any questions or suggestions, feel free to open an issue.