Islandora Bagger is a utility that generates Bags for objects via Islandora's REST interface, using either a command-line tool or a batch-oriented queue. In addition, Islandora Bagger provides its own REST interface that allows population of the queue. Specific content is added to the Bag's data directory and bag-info.txt file using plugins. Bags are compliant with version 1.0 of the BagIt specification. If you want to allow your Islandora users to initiate the creation of Bags, install the Islandora Bagger Integration module.
This utility is for Islandora 8.x-1.x. For creating Bags for Islandora 7.x, use Islandora Fetch Bags.
cd islandora_bagger
php composer.phar install
(or equivalent on your system, e.g., ./composer install)
Even though each Bag is created using options defined in its own configuration file (see next section), Islandora Bagger uses several application-wide configuration options defined in the parameters section of config/services.yaml.
You probably don't need to change app.queue.path and app.location.log.path, since these specify default locations for some data files. However, if you let users download serialized Bags, you will need to change the app.bag.download.prefix parameter to the hostname/path to prepend to each Bag's filename, as described in the "Making Bags downloadable" section below.
The command to generate a Bag takes two required parameters, --settings and --node. Assuming the configuration file is named sample_config.yml, and the Drupal node ID you want to generate a Bag from is 112, the command would look like this:
./bin/console app:islandora_bagger:create_bag --settings=sample_config.yml --node=112
A third parameter, --extra, is explained in the "Passing settings via the command line" section below.
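Because the node ID is passed on the command line rather than stored in the configuration file, the same command can be assembled for several nodes in a loop. The following is a minimal Python sketch and not part of Islandora Bagger: the helper name build_create_bag_command and the node IDs are hypothetical, and the code only builds the argument lists without running anything.

```python
def build_create_bag_command(settings_path, node_id):
    """Return the argument list for one create_bag invocation."""
    return [
        "./bin/console",
        "app:islandora_bagger:create_bag",
        f"--settings={settings_path}",
        f"--node={node_id}",
    ]


# Hypothetical node IDs, used for illustration only.
node_ids = [112, 113, 114]
commands = [build_create_bag_command("sample_config.yml", nid) for nid in node_ids]

for command in commands:
    # Print the command line that would be run for each node.
    print(" ".join(command))
```

To actually run a command, each list could be passed to subprocess.run() from within the Islandora Bagger directory.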
For each Bag it creates, Islandora Bagger requires a configuration file in YAML format:
####################
# General settings #
####################
# Required.
drupal_base_url: 'http://localhost:8000'
drupal_basic_auth: ['admin', 'islandora']
# Register creation of this Bag with Islandora Bagger Integration. Default is false.
register_bags_with_islandora: true
# Required. How to name the Bag directory (or file if serialized). One of 'nid' or 'uuid'.
bag_name: nid
# Optional. Template for the Bag name. The % is replaced by the nid or uuid (depending on
# the value of "bag_name") in the name of the Bag directory (or file if serialized). If absent,
# the bare value of the nid or uuid is used.
# bag_name_template: sfu_aip_%
# Both temp_dir and output_dir are required.
temp_dir: /tmp/islandora_bagger_temp
output_dir: /tmp
# Required. Whether or not to zip up the Bag. One of 'false', 'zip', or 'tgz'.
serialize: zip
# Required. Whether or not to log Bag creation. Set log output path in config/packages/{environment}/monolog.yaml.
log_bag_creation: true
# Optional. Static bag-info.txt tags. No plugin needed. You can use any combination
# of tag name / value here, as long as you separate tags from values using a colon (:).
bag-info:
    Contact-Name: Mark Jordan
    Contact-Email: [email protected]
    Source-Organization: Simon Fraser University
    Foo: Bar
# Optional. Whether or not to include the Payload-Oxum tag in bag-info.txt. Defaults to true.
# include_payload_oxum: false
# Optional. Which hash algorithm(s) to use.
# One of md5, sha1, sha224, sha256, sha384, sha512, sha3224, sha3256, sha3384, sha3512,
# or a list of values. Default is sha512.
# hash_algorithm: md5
# hash_algorithm: [md5, sha1, sha256]
# Optional. Timeout to use for Guzzle requests, in seconds. Default is 60.
# http_timeout: 120
# Optional. Whether or not to verify the Certificate Authority in Guzzle requests
# against websites that implement HTTPS. Used on Mac OSX if Islandora Bagger is
# interacting with websites running HTTPS. Default is true. Note that if you set
# verify_ca to false, you are disabling verification of the remote website's
# certificate, which leaves the connection open to man-in-the-middle attacks.
# Use at your own risk.
# verify_ca: false
# Optional. Whether or not to delete the settings file upon successful creation
# of the Bag. Default is false.
# delete_settings_file: true
# Optional. Whether or not to log the serialized Bag's location so Islandora can
# retrieve the Bag's download URL. Default is false.
# log_bag_location: true
############################
# Plugin-specific settings #
############################
# Required. Register plugins to populate bag-info.txt and the data directory.
# Plugins are executed in the order they are listed here.
plugins: ['AddBasicTags', 'AddMedia', 'AddNodeJson', 'AddNodeJsonld', 'AddMediaJson', 'AddMediaJsonld', 'AddFileFromTemplate', 'AddFedoraTurtle', 'AddNodeCsv']
# Used by the 'AddFedoraTurtle' plugin.
fedora_base_url: 'http://localhost:8080/fcrepo/rest/'
# Used by the 'AddMedia' plugin. These are the Drupal taxonomy term IDs
# from the "Islandora Media Use" vocabulary. Use an empty list (e.g., [])
# to include all media.
drupal_media_tags: ['/taxonomy/term/16']
# Used by the 'AddMedia' plugin. Indicates whether the Bag should contain a file
# named 'media_use_summary.tsv' that lists all the media files plus the taxonomy
# name corresponding to the 'drupal_media_tags' list. Default is false.
include_media_use_list: true
# Used by the 'AddMedia' plugin. Include this option to save media files within the
# specified subdirectories of the Bag's data directory. Include the trailing /.
# media_file_directories: 'foo/bar/baz/'
# Used by the 'AddFileFromTemplate' plugin.
# template_path can be absolute or relative to the Islandora Bagger directory.
template_path: 'templates/mods.twig'
# templated_output_filename will be assigned to the file generated from the template,
# which will be added to the Bag's data directory. You may include a subdirectory
# or subdirectories as part of the filename.
templated_output_filename: 'metadata/MODS.xml'
# Used by the 'AddNodeCsv' plugin.
# csv_output_filename will be assigned to the CSV file, which will be added to
# the Bag's data directory. You may include a subdirectory or subdirectories
# as part of the filename.
csv_output_filename: 'metadata.csv'
####################
# Post-Bag scripts #
####################
# post_bag_scripts: ["php /tmp/test.php", "python /path/to/script.py"]
The resulting Bag would look like this:
/tmp/112
├── bag-info.txt
├── bagit.txt
├── data
│   ├── IMG_1410.JPG
│   ├── media.json
│   ├── media.jsonld
│   ├── node.json
│   ├── node.jsonld
│   ├── metadata
│   │   └── MODS.xml
│   ├── metadata.csv
│   ├── media_use_summary.tsv
│   └── node.turtle.rdf
├── manifest-sha1.txt
└── tagmanifest-sha1.txt
Since the Drupal node's ID is not included in the configuration file, the same file can be used for multiple Bags. It is called a 'per-Bag' configuration file because it is used each time Islandora Bagger creates a Bag.
In some cases, you may want to define configuration options in config/services.yml that are normally defined in the per-Bag configuration file. The most common reasons to do this are 1) to keep sensitive data such as login credentials out of the per-Bag configuration files and 2) to centralize commonly used options in one place rather than repeat them in each per-Bag configuration file.
To do this, define the options from the per-Bag configuration file in config/services.yml and prepend their keys with app. (including the trailing period). For example, to define drupal_base_url and drupal_basic_auth in config/services.yml, do the following:
First, comment out or remove the options in the per-Bag configuration file:
# Required.
# drupal_base_url: 'http://localhost:8000'
# drupal_basic_auth: ['admin', 'islandora']
Then add the options to the parameters section of config/services.yml, prefixing each option key with app.:
parameters:
    app.queue.path: '%kernel.project_dir%/var/islandora_bagger.queue'
    app.location.log.path: '%kernel.project_dir%/var/islandora_bagger.locations'
    # The hostname/path to where users can download serialized bags. This string
    # will be prepended to the Bag's filename.
    app.bag.download.prefix: 'http://example.com/bags/'
    # These options are usually defined in the per-Bag config file.
    app.drupal_base_url: 'http://localhost:8000'
    app.drupal_basic_auth: ['admin', 'islandora']
A couple of things to note about this:
- Options in a per-Bag configuration file take precedence over the same options in config/services.yml. This way, you can define commonly used options in config/services.yml but override them on a per-Bag basis.
- Options defined in config/services.yml are not accessible to post-Bag scripts.
You can pass settings to Islandora Bagger on the command line using the optional --extra parameter:
./bin/console app:islandora_bagger:create_bag --settings=sample_config.yml --node=112 --extra='{"serialize": "tgz", "hash_algorithm": "md5"}'
The value of this parameter is a serialized JSON object containing key:value pairs of settings. Key:value pairs passed in this way will be added to the config settings and will also override settings in the config file and in 'config/services.yml'.
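The override order described above (options in config/services.yml, then the per-Bag configuration file, then --extra) can be illustrated with a short Python sketch. This is an illustration of the precedence rules only, not Islandora Bagger's actual code: the function merge_settings and the sample values are hypothetical, and stripping the app. prefix is an assumption about how the two sets of keys line up.

```python
import json


def merge_settings(services_parameters, per_bag_config, extra_json=None):
    """Merge settings so that later sources override earlier ones."""
    merged = {}
    # ASSUMPTION: options in config/services.yml carry an "app." prefix,
    # which is stripped so the keys match the per-Bag option names.
    for key, value in services_parameters.items():
        if key.startswith("app."):
            merged[key[len("app."):]] = value
    # Per-Bag options override the application-wide ones.
    merged.update(per_bag_config)
    # Options passed via --extra override everything else.
    if extra_json:
        merged.update(json.loads(extra_json))
    return merged


# Hypothetical sample values.
services = {"app.drupal_base_url": "http://localhost:8000", "app.serialize": "zip"}
per_bag = {"serialize": "zip", "hash_algorithm": "sha512"}
extra = json.dumps({"serialize": "tgz", "hash_algorithm": "md5"})

settings = merge_settings(services, per_bag, extra)
print(settings)
```

Building the --extra value with json.dumps(), as above, avoids hand-writing JSON and the quoting mistakes that come with it.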
Islandora Bagger can also initiate the creation of Bags via a simple REST interface. It does this by 1) receiving a POST request containing the node ID of the Islandora object to be bagged in an "Islandora-Node-ID" header and 2) receiving a YAML configuration file as the body of the request. Using this data, it adds the request to a queue (see below), which is processed at a later time. The REST interface also provides the ability to GET a Bag's download URL.
Note that requests to the REST interface do not generate Bags directly; they only populate a queue, as described below.
To use the REST API to add a Bag-creation job to the queue:
symfony server:start
curl -v -X POST -H "Islandora-Node-ID: 4" --data-binary "@sample_config.yml" http://127.0.0.1:8001/api/createbag
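The same request can be prepared from a script. The sketch below uses Python's standard urllib.request module to build, but not send, a request equivalent to the curl command above. The helper name build_createbag_request and the two-line sample configuration are hypothetical; the URL, header, and method are taken from the curl example.

```python
import urllib.request


def build_createbag_request(base_url, node_id, config_yaml):
    """Build a POST request that queues a Bag-creation job."""
    return urllib.request.Request(
        url=f"{base_url}/api/createbag",
        # The YAML configuration file is sent as the request body.
        data=config_yaml.encode("utf-8"),
        # The node ID travels in the Islandora-Node-ID header.
        headers={"Islandora-Node-ID": str(node_id)},
        method="POST",
    )


# Hypothetical, shortened configuration used for illustration only.
config_yaml = "drupal_base_url: 'http://localhost:8000'\nserialize: zip\n"
request = build_createbag_request("http://127.0.0.1:8001", 4, config_yaml)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen() would send it to a running Islandora Bagger instance.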
To use the REST API to get a serialized Bag's location for download:
1. Make sure the serialize setting is either "zip" or "tgz", and the log_bag_location setting is true.
2. Queue the Bag for creation using the POST request described above; the Bag is generated when the queue is processed.
3. Run curl -v -H "Islandora-Node-ID: 4" http://127.0.0.1:8001/api/createbag. Your response will be a JSON string containing the node ID, the Bag's location, and an ISO 8601 timestamp of when the Bag was created, e.g.:
{"nid":"4","location":"http:\/\/example.com\/bags\/4.zip","created":"2019-05-06T19:31:33-0700"}
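A client can decode this response with any JSON parser. The following Python sketch parses the sample response shown above using only the standard library; the variable names are illustrative.

```python
import json
from datetime import datetime

# The sample response from above; the raw string keeps the escaped slashes intact.
response_body = r'{"nid":"4","location":"http:\/\/example.com\/bags\/4.zip","created":"2019-05-06T19:31:33-0700"}'

bag = json.loads(response_body)
# The timestamp carries a UTC offset without a colon, which %z accepts.
created = datetime.strptime(bag["created"], "%Y-%m-%dT%H:%M:%S%z")

print(bag["nid"], bag["location"], created.isoformat())
```

The JSON parser turns the escaped \/ sequences back into plain slashes, so the location value can be used directly as a URL.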
A couple of things to note about this REST API:
- The API accepts POST requests that contain a request body (in this case, the YAML configuration file). Some HTTP clients (Guzzle, for example) convert requests that are redirected (e.g., in response to a 301) to GET. If this happens, the request body is lost and the resulting YAML configuration files will be empty. If you are running Islandora Bagger in a web server environment that returns HTTP response codes in the 3xx range, such as inside an Apache Alias directive, configure your HTTP client so that it does not convert redirected POST requests into GET requests. Guzzle's documentation on this behavior is useful.
As described in the previous section, the location of each Bag is available via a GET request to Islandora Bagger's REST interface. If you want to use this information to provide a way to download Bags from Islandora Bagger, follow these steps:
1. Make sure the serialize option is set to zip or tgz (only serialized Bags can be downloaded).
2. Make sure the log_bag_location option is set to true.
3. Make sure the directory specified in the output_dir option is exposed to the web.
4. Make sure that, in the config/services.yml file, the app.bag.download.prefix parameter contains the hostname/path leading to the directory specified in the configuration file's output_dir option.
GET requests to the REST API will now return location values that contain URLs that combine the path specified in app.bag.download.prefix with the serialized Bag's filename.
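How the pieces combine into a download URL can be sketched as follows. This is a rough Python illustration rather than Islandora Bagger's code: the function bag_download_url is hypothetical, and it assumes that the serialized file's extension is the same as the serialize value.

```python
def bag_download_url(download_prefix, identifier, serialize, bag_name_template=None):
    """Combine the download prefix with the serialized Bag's filename."""
    if serialize not in ("zip", "tgz"):
        raise ValueError("only serialized Bags can be downloaded")
    # If a template is given, the % is replaced by the nid or uuid.
    name = bag_name_template.replace("%", identifier) if bag_name_template else identifier
    # ASSUMPTION: the file extension matches the serialize setting.
    return f"{download_prefix}{name}.{serialize}"


print(bag_download_url("http://example.com/bags/", "4", "zip"))
print(bag_download_url("http://example.com/bags/", "4", "tgz", "sfu_aip_%"))
```

The first call reproduces the location value from the sample response in the previous section.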
This is insecure, since anyone who can guess the path to a Bag will have access to it. Please join the discussion at this issue if you have a suggestion on implementing more robust security on Bag downloads.
Another approach is to use a post-Bag script (see below) to copy the Bag to a location from where it can be downloaded, and to email the user with the location.
Islandora Bagger implements a simple processing queue, which is populated mainly by REST requests to generate Bags. However, the queue can be populated by any process (manually, scripted, etc.). Islandora Bagger processes the queue by inspecting each entry in first-in, first-out order and, for each entry, running the app:islandora_bagger:create_bag command, which creates the Bag by fetching the files and other data from the Islandora instance as defined in that entry's configuration file.
The queue is a simple tab-delimited text file that contains one entry per line. The three fields in each entry are 1) the node ID, 2) the full path to the YAML configuration file, and 3) an ISO 8601 timestamp, e.g.:
2073 /home/mark/Documents/hacking/islandora_bagger/var/islandora_bagger.2073.yaml 2020-09-14T19:01:46-0700
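Because the queue is plain tab-delimited text, any process that populates or inspects it can handle entries with a few lines of code. The Python sketch below parses the sample entry above; the QueueEntry type and the parse_queue_line helper are hypothetical names, not part of Islandora Bagger.

```python
from datetime import datetime
from typing import NamedTuple


class QueueEntry(NamedTuple):
    node_id: str
    config_path: str
    queued_at: datetime


def parse_queue_line(line):
    """Split one tab-delimited queue line into its three fields."""
    node_id, config_path, timestamp = line.rstrip("\n").split("\t")
    queued_at = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S%z")
    return QueueEntry(node_id, config_path, queued_at)


# The sample entry from above, with the fields separated by tabs.
line = (
    "2073\t"
    "/home/mark/Documents/hacking/islandora_bagger/var/islandora_bagger.2073.yaml\t"
    "2020-09-14T19:01:46-0700\n"
)
entry = parse_queue_line(line)
print(entry.node_id, entry.config_path, entry.queued_at.isoformat())
```

Writing an entry is the reverse operation: join the three fields with tabs and append the line to the queue file.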
To process the queue, run the following command:
./bin/console app:islandora_bagger:process_queue --queue=var/islandora_bagger.queue
where the value of the --queue option is the path to the queue file. Run this command as needed, or from a scheduled job managed by cron. It iterates through the queue in first-in, first-out order, and each entry is removed from the queue once it has been processed. You can optionally specify how many queue entries to process by including the --entries option, e.g., ./bin/console app:islandora_bagger:process_queue --queue=var/islandora_bagger.queue --entries=100
Since the queue file is a plain tab-separated value file, you can look at its contents in a variety of ways (opening it in a text editor, using cat, etc.). Islandora Bagger offers two other ways of inspecting the queue:
- the console command app:islandora_bagger:get_queue (e.g., ./bin/console app:islandora_bagger:get_queue --queue=var/islandora_bagger.queue --output_format=json)
- a GET request to the REST interface (e.g., curl -v http://127.0.0.1:8000/api/queue)
In both cases, the output is a serialized JSON object containing each item in the queue. The console command can also print the raw queue if the --output_format option has a value of "csv".
Customizing the generated Bags is done via values in the configuration file and via plugins.
Items in the "General settings" section of the configuration file provide some simple options for customizing Bags, e.g., which static tags are added to the bag-info.txt file. Tags specified in the general settings' bag-info option are static in that they are simple strings. In order to include tags that are dynamically generated, you must use a plugin.
Apart from the static tags mentioned in the previous section, all file content and additional tags are added to the Bag using plugins. Plugins are registered in the plugins section of the configuration file.
The following plugins are bundled with Islandora Bagger:
- AddBasicTags: adds the Internal-Sender-Identifier bag-info.txt tag using the Drupal URL for the node as its value, and the Bagging-Date tag using the current date as its value.
- AddNodeJson: adds the JSON representation of the node, as returned by /node/1234?_format=json.
- AddNodeJsonld: adds the JSON-LD representation of the node, as returned by /node/1234?_format=jsonld.
- AddMedia: adds the node's media files, filtered by the terms listed in the drupal_media_tags configuration option.
- AddMediaJson: adds the JSON representation of the node's media, as returned by /node/1234/media?_format=json.
- AddMediaJsonld: adds the JSON-LD representation of the node's media, as returned by /node/1234/media?_format=jsonld.
- A plugin that adds the files listed in the files_to_add configuration option, e.g., files_to_add: ['/tmp/file1.txt', '/tmp/file2.txt'].
- A plugin that adds a fetch.txt file to the Bag, using URLs listed in the fetch_urls configuration option, e.g., fetch_urls: ['http://example.com/path/to/file.htm', 'https://someother.url.com/about'].
Each plugin is a PHP class that extends the base AbstractIbPlugin class. The Sample.php plugin illustrates what you can (and must) do within a plugin. Plugins are located in the islandora_bagger/src/Plugin directory, and must implement an execute() method. Within that method, you have access to the Bag object, the Bag temporary directory, the node's ID, and the node's JSON representation from Drupal. You also have access to all values in the configuration file via the $this->settings associative array.
To use a custom plugin, simply register its class name in the plugins list in your configuration file.
The post_bag_scripts option in the configuration file allows you to specify a list of scripts to run after the Bag has been successfully created. These scripts can send email messages, copy Bag files to alternate locations, and perform other tasks. You can include any script, in any language, with the following constraints:
app:islandora_bagger:create_bag command
In the YAML configuration file, you can define any options needed by your scripts. For example, if your script /opt/utils/send_bag_notice.py requires an email address to send its notice to, you can include that option's value in your configuration file, as long as the script can parse YAML files:
####################
# Post-Bag scripts #
####################
post_bag_scripts: ["python /opt/utils/send_bag_notice.py"]
recipient_email: [email protected]
Then within your script, you would have access to the value of recipient_email. Within your scripts, you have access to all options used by Islandora Bagger's app:islandora_bagger:create_bag command, and you can define any additional options you need as long as they don't have the same key names as existing values.
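As a rough illustration of the script side, the Python sketch below pulls a custom option such as recipient_email out of the configuration text. It makes several assumptions: the helper read_top_level_option is hypothetical, the sample address is made up, and the line scan is only a stand-in for a real YAML parser that handles simple top-level key: value lines. How the script locates the configuration file is not covered here.

```python
def read_top_level_option(config_text, key):
    """Return the value of a simple top-level 'key: value' line, or None."""
    for line in config_text.splitlines():
        # Skip comments and lines that are not key/value pairs.
        if line.startswith("#") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        # Indented (nested) keys keep their leading spaces and never match.
        if name == key:
            return value.strip().strip("'\"")
    return None


# Hypothetical configuration text with a made-up email address.
sample_config = (
    'post_bag_scripts: ["python /opt/utils/send_bag_notice.py"]\n'
    "recipient_email: someone@example.org\n"
)
recipient = read_top_level_option(sample_config, "recipient_email")
print(f"Would send a Bag notice to {recipient}")
```

A production script would more likely load the whole file with a YAML library so that lists and nested options are handled correctly.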
See CONTRIBUTING.md.