Islandora batch module for ingesting objects that have pregenerated derivatives (or, in other words, pregenerated datastreams). The typical use cases are:
We need to use a specialized batch ingest module for this because the standard Islandora Batch only allows for two files per object, one .xml file for the MODS or DC and one other file for the OBJ. Islandora Batch with Derivatives allows you to group all of the files corresponding to an object's datastreams (with the exception of RELS-EXT) into a subdirectory, as illustrated below.
The Islandora Book Batch and Islandora Newspaper Batch modules allow you to add derivative files to page-level directories, which speeds up ingestion of those content types considerably. This module takes the same approach, but for other content models.
Enable this module, then run its drush command to import objects:
drush --user=admin islandora_batch_with_derivs_preprocess --key_datastream=MODS --scan_target=/path/to/object/files --namespace=mynamespace --parent=islandora:mycollection
Then, to perform the ingest:
drush --user=admin islandora_batch_ingest
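If you script these two steps, the commands can be assembled programmatically. The following Python sketch only builds the argument lists from the options shown above. The function names, and the idea of wrapping drush in Python at all, are assumptions made for illustration and are not part of this module.

```python
import shlex


def build_preprocess_command(scan_target, namespace, parent,
                             key_datastream="MODS", user="admin"):
    """Return the preprocess command as an argument list (hypothetical helper)."""
    return [
        "drush",
        f"--user={user}",
        "islandora_batch_with_derivs_preprocess",
        f"--key_datastream={key_datastream}",
        f"--scan_target={scan_target}",
        f"--namespace={namespace}",
        f"--parent={parent}",
    ]


def build_ingest_command(user="admin"):
    """Return the ingest command as an argument list (hypothetical helper)."""
    return ["drush", f"--user={user}", "islandora_batch_ingest"]


if __name__ == "__main__":
    commands = (
        build_preprocess_command(
            "/path/to/object/files", "mynamespace", "islandora:mycollection"
        ),
        build_ingest_command(),
    )
    # Print the commands for review rather than running them.
    for command in commands:
        print(shlex.join(command))
```

To actually run a command, the argument list can be passed to `subprocess.run(command, check=True)` from a directory where drush can find your Drupal site.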
Islandora has a setting that turns derivative creation off. To enable it, go to Admin > Islandora > Configuration and check "Defer derivative generation during ingest".
When using this batch module, you do not need to turn derivative creation off. If you do not turn this off, datastreams based on the files in the input directories will be created, and Islandora will only generate other datastreams, as defined by the object's content model, that do not have corresponding files in the input directories.
If you do turn derivative creation off, only datastreams based on the files in your input directories will be created. Other derivatives will not be created. You should probably return this setting to its original value after your batch finishes running.
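The effect of this setting can be pictured as a set difference. The sketch below is an illustration only: it assumes that a datastream ID is the file name without its extension, and the list of datastream IDs a content model would generate is supplied by the caller as a hypothetical input rather than read from Islandora.

```python
from pathlib import Path


def datastreams_from_files(object_dir):
    """Datastream IDs implied by the file names, e.g. TN.jpg -> TN."""
    return {p.stem for p in Path(object_dir).iterdir() if p.is_file()}


def datastreams_left_to_islandora(object_dir, content_model_dsids,
                                  defer_derivatives):
    """Datastream IDs that Islandora would still generate on ingest.

    content_model_dsids is a hypothetical, caller-supplied collection of the
    derivative datastream IDs defined by the object's content model.
    """
    if defer_derivatives:
        # Derivative creation is off: only the supplied files become datastreams.
        return set()
    return set(content_model_dsids) - datastreams_from_files(object_dir)
```

For example, an object directory containing MODS.xml, OBJ.jpg and TN.jpg, checked against the hypothetical set {"TN", "MEDIUM_SIZE", "TECHMD"}, would leave MEDIUM_SIZE and TECHMD for Islandora to generate when derivative creation is left on.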
This batch module uses filenames to identify the files that correspond to specific datastreams. All of the files you are ingesting with an object should go in one directory (a subdirectory of the path you identify in the drush command with the --scan_target option). Each object-level subdirectory must have at least a file for the "key datastream", which is either the MODS (MODS.xml) or DC (DC.xml) datastream. This datastream is identified in the --key_datastream option. All other datastream files are optional, but you will usually also include a file corresponding to the OBJ datastream.
Some points to note:
<title> element, or if there is no MODS.xml file in the object directory, from the DC <title> element.
--key_datastream, the DC datastream that is generated for each object will be generated from the MODS.xml (or MADS.xml) file, which is Islandora's default behavior. If you prefer that a minimal DC datastream containing only a title and an identifier element be generated (in other words, Fedora's default DC datastream), include the --create_dc=false option in the islandora_batch_with_derivs_preprocess command (e.g., islandora_batch_with_derivs_preprocess --key_datastream=MODS --create_dc=false). If both MODS.xml (or MADS.xml) and DC.xml exist in the object's input directory, both datastreams are populated from the files.
--content_models option. Note that the specified content model must apply to all objects in the current batch.
--content_models=islandora:personCModel.
--content_models option.
Each object in the batch must be in its own subdirectory under the path specified in --scan_target. Within each object directory are all the files that will be used to create that object's datastreams, named using datastream IDs:
/tmp/valueofscantarget
├── foo
│ ├── DC.xml
│ ├── MEDIUM_SIZE.jpg
│ ├── MODS.xml
│ ├── OBJ.jpg
│ ├── TECHMD.xml
│ └── TN.jpg
├── bar
│ ├── DC.xml
│ ├── MEDIUM_SIZE.jpg
│ ├── MODS.xml
│ ├── OBJ.jpg
│ ├── TECHMD.xml
│ └── TN.jpg
└── baz
  ├── DC.xml
  ├── MEDIUM_SIZE.jpg
  ├── MODS.xml
  ├── OBJ.jpg
  ├── TECHMD.xml
  └── TN.jpg
The names of the object subdirectories have no significance (unless the --use_pids=true option is present, as described below).
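Before running the preprocess command, it can be useful to confirm that every object directory contains the key datastream file. The following Python sketch is one way to do that. The function name is hypothetical, and the sketch assumes the key file is named after the value of --key_datastream with an .xml extension (MODS.xml or DC.xml), as described above.

```python
from pathlib import Path


def find_objects_missing_key_datastream(scan_target, key_datastream="MODS"):
    """Return the names of object directories that lack <key_datastream>.xml."""
    key_file = f"{key_datastream}.xml"
    missing = []
    for object_dir in sorted(Path(scan_target).iterdir()):
        # Only subdirectories represent objects; stray files are skipped.
        if object_dir.is_dir() and not (object_dir / key_file).is_file():
            missing.append(object_dir.name)
    return missing
```

An empty list means every object directory under the scan target has its key datastream file. This check is separate from anything the module itself does during preprocessing.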
It is possible to use this module to ingest content that originated in another Islandora instance. You can choose to allow the target Islandora to generate new PIDs for the ingested objects, or you can choose to assign specific PIDs to the new objects. A common use case for the latter is that you want the target Islandora instance to use the PIDs assigned to the objects in the source Islandora instance in order to preserve relationships between objects, and to make object URLs in the target Islandora map directly to their equivalents in the source Islandora.
The default behavior of this module is to allow the target Islandora to mint new PIDs. The most important implication of this is that Islandora will generate a new RELS-EXT datastream for each object, containing only the RDF statements that are generated on ingest (collection membership and content model). If a RELS-EXT datastream file is present, it will be ignored. The namespace of the new PIDs will be the one specified in the islandora_batch_with_derivs_preprocess command's --namespace option.
PIDs can be assigned to the objects being ingested by using the islandora_batch_with_derivs_preprocess command's --use_pids option. If this option has a value of 'true' (e.g., --use_pids=true), the name of each object-level directory containing the datastream files will be converted to a PID, and that PID will be assigned to the resulting object. For this to work, the colon in the PID must be represented in the directory name by a plus sign (+); the plus sign will be converted back into a colon in the assigned PID. For example, with the --use_pids=true option, the following directory structure will result in objects with PIDs 'foo:23', 'bar:198392', and 'baz:special_object':
/tmp/valueofscantarget
├── foo+23
│ ├── DC.xml
│ ├── MEDIUM_SIZE.jpg
│ ├── MODS.xml
│ ├── OBJ.jpg
│ ├── RELS-EXT.rdf
│ ├── TECHMD.xml
│ └── TN.jpg
├── bar+198392
│ ├── DC.xml
│ ├── MEDIUM_SIZE.jpg
│ ├── MODS.xml
│ ├── OBJ.jpg
│ ├── RELS-EXT.rdf
│ ├── TECHMD.xml
│ └── TN.jpg
└── baz+special_object
  ├── DC.xml
  ├── MEDIUM_SIZE.jpg
  ├── MODS.xml
  ├── OBJ.jpg
  ├── RELS-EXT.rdf
  ├── TECHMD.xml
  └── TN.jpg
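The directory naming rule can be sketched in Python as follows. This is an illustration of the plus-sign convention described above, not the module's own code: the function names are hypothetical, and the sketch assumes exactly one plus sign separating the namespace from the rest of the PID.

```python
def directory_name_to_pid(name):
    """Convert a directory name such as 'foo+23' to the PID 'foo:23'."""
    namespace, separator, identifier = name.partition("+")
    if not separator or not namespace or not identifier:
        raise ValueError(f"{name!r} is not of the form namespace+identifier")
    return f"{namespace}:{identifier}"


def pid_to_directory_name(pid):
    """Convert a PID such as 'foo:23' to the directory name 'foo+23'."""
    return pid.replace(":", "+", 1)


if __name__ == "__main__":
    for name in ("foo+23", "bar+198392", "baz+special_object"):
        print(name, "->", directory_name_to_pid(name))
```

The second function is the inverse mapping, which may help when you are naming directories for objects exported from a source Islandora instance.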
Note that:
--use_pids=true is present, the relationships expressed in it will be parsed out and added to the new object. All new relationships resulting from the ingest (e.g., additional collection membership) will also be added to the object's RELS-EXT datastream with the exception of duplicate 'isMemberOfCollection' relationships.
--content_models option is present.
--use_pids=true option were absent or false.
drush islandora_batch_with_derivs_check_pids --scan_target=/path/to/object/files
A useful strategy for migrating objects between Islandora instances, with their PIDs and relationships intact, is to export objects using the Islandora Dump Datastreams module and then ingest the resulting packages as described in this section.
Feel free to open issues in this GitHub repo. Use cases and suggestions are welcome, as are pull requests (but before you open a pull request, please open an issue).