Under the DLIB004 project, the Digital Library team successfully implemented Goobi, a new software for digitisation workflows.
The goal of the DLIB011 project was to establish a proof-of-concept for an automated link between Goobi and Archivematica to support the digital preservation of new content.
Within the Digital Library, the Development and Systems team, in close collaboration with the digitisation team and the Digital Archivist, has been developing automation technology to support the digital preservation of digitised content processed through Goobi. A process has now been developed that enables Goobi users to send digitised collection material at the end of a Goobi workflow to a ‘hot folder’, where Archivematica can pick it up and process it for long-term preservation storage. The authenticity and integrity of the transferred material are checked at two points: at ingest into Archivematica, using checksums assigned by Goobi, and through an automated email alert to the DIU team if the preservation process in Archivematica fails.
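As an illustration of the checksum check described above, the following is a minimal sketch only, not the project's actual ingest script. It assumes, hypothetically, that the transfer folder contains a manifest named checksum.sha256 with one "digest  relative/path" entry per line; the manifest name, its layout and the function names are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Compute the hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.new(algorithm)
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_transfer(
    transfer_dir: Path,
    manifest_name: str = "checksum.sha256",  # hypothetical manifest name
    algorithm: str = "sha256",
) -> list[str]:
    """Return the relative paths that are missing or whose digest differs.

    An empty list means every file listed in the manifest was verified.
    """
    failures = []
    manifest = transfer_dir / manifest_name
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Assumed line layout: "<hex digest>  <relative path>"
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.strip()
        target = transfer_dir / rel_path
        if not target.is_file() or file_digest(target, algorithm) != expected.lower():
            failures.append(rel_path)
    return failures
```

A hot folder script could call verify_transfer on each newly arrived folder and, if the returned list is not empty, hold the transfer back and notify the responsible team.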
While the test material has been successfully transferred through the new automated process, work remains to develop an end-to-end workflow for the digital preservation of digitised materials, both for legacy materials and newly digitised content. In particular, pre-ingest curation and management procedures as well as long-term management and preservation planning procedures need to be created for both legacy and newly digitised content.
The scope of this project comprised the following:
Automating a digital preservation workflow to accommodate new content - establishing a proof-of-concept
Automating a digital preservation workflow to accommodate legacy content – using a selection of content as a proof-of-concept (note: this element was removed from scope during the project because a re-evaluation of the work necessary to process legacy material indicated it would take longer than originally anticipated; for a more detailed discussion, see below)
Ensuring appropriate scalability of Archivematica in order to support the pilot content material
Updating relevant procedures and documentation, and delivering appropriate training to specified Digital Library staff to ensure adoption and advocacy of the new preservation processes
Out of scope
The following elements were deemed out of scope:
The use of DataVault for the storage of collections data. DataVault holds content for up to 10 years but does not carry out any preservation actions on the files, for example migrating file formats in response to technology obsolescence. Furthermore, the Digital Library has a requirement, established in the Digital Preservation Policy, to safeguard this material for longer than 10 years in most cases.
Any scalability of storage was to be strictly limited to what was required for the preservation of the proof-of-concept pilot content. Any further increase beyond this is to be assessed incrementally as part of a future project.
This project covered the proof-of-concept for newly digitised content only; processing the existing backlog of legacy material was out of scope. The processing of backlog content will be completed as a future service activity once the project is complete.
Objectives & Deliverables
|No||Description||MoSCoW||Delivered?||Reason for not delivering||Output|
|O1||To create an automated digital preservation workflow to ensure all new digitised content is preserved (and to stop the backlog growing)|
|D1||Archivematica configuration settings agreed||MUST||YES||Settings implemented within Archivematica|
|D2||Goobi configured to output correctly configured JSON file, METS metadata file and checksum validation information||MUST||YES||New digital preservation step created in Goobi workflow|
|D3||Hot folder that links Goobi output to Archivematica ingest activated||MUST||YES||Hot folder ingest script created|
|D4||Lyell notebooks run through new digitisation process as proof-of-concept||MUST||YES*||*While Lyell content was used to draft metadata requirements and to assess the necessary steps in Goobi, alternate content (Books and Borrowing) was determined to be more appropriate and was used in establishing the proof of concept|
|D5||Results tested to confirm accuracy||MUST||YES|
|O2||To create an automated digital preservation workflow to preserve identified legacy material|
|D6||Metadata requirements established||MUST||NO||Legacy content proof-of-concept was removed from scope during the course of the project (see issues section below)|
|D7||Script written to generate correct JSON metadata files using current library process||MUST||NO||As above|
|D8||Chosen pilot material run through new digitisation process as proof-of-concept||MUST||NO||As above|
|D9||Results tested to confirm accuracy||MUST||NO||As above|
|O3||To update procedures and processes, create user guides, and deliver training to specified Digital Library staff to ensure adoption and advocacy of new preservation processes|
|D10||Goobi User Guide created||MUST||YES||Goobi User Guide|
|D11||New procedures documented||MUST||YES||Link to GitHub|
|D12||Training delivered to Digital Library staff||MUST||YES||
Initial training has been delivered. A 1-hour session on Digital Preservation Concepts and Overview of Archivematica was delivered by the Digital Archivist on 3/12/20 (after kick-off of DLIB011). It was aimed at the project team but was also opened up to Library staff. The objectives of the session were:
What Does Archivematica actually *do*?
What is digital preservation, in brief?
What is OAIS and why is it used by Archivematica?
What are the main microservices run by Archivematica and why are they useful?
What are the configuration settings in Archivematica?
Why is it important to design an AIP and document those decisions?
Attendees: Gavin Willshaw, Susan Pettigrew, Ianthe Sutherland, Scott Renton, Hrafn Malmquist, David Speed, Alex Ross, Aline Brodin, Kirsty Stewart, and Lesley Bryson
Additional training will be delivered by the Digital Archivist to the digitisation team on main concepts of digital preservation and basic digital preservation process for content managed in Goobi. This training will be delivered on 03/06/21.
|Success Criteria as stated in project brief||Delivered?||How delivered?|
|Successful proof-of-concept for the digital preservation of new content - items are automatically picked up by Archivematica and processed as designed without failure, AIPs and DIPs appear in correct location and contain correct content and preservation description information||YES||Proof-of-concept has been created and testing shows it is operating as designed. The automated link between Goobi and Archivematica will support an end-to-end workflow for newly digitised content, although manual activities and system functionality for long-term management and preservation planning remain outstanding due to the variation and complexity of digitised materials and will be addressed outwith this project|
|Successful proof-of-concept for the digital preservation of legacy content - items are automatically picked up by Archivematica and processed as designed without failure, AIPs and DIPs appear in correct location and contain correct content and preservation description information||N/A||Legacy content proof-of-concept was removed from scope during the course of the project (see variances section below)|
|Procedure agreed and documented for digitally preserving new material moving forward||PARTLY||Relevant user guides have been updated with information about the automation of transfer of digitised content from Goobi to Archivematica. However, there is a caveat of outstanding procedures yet to be agreed (see below)|
|Procedure agreed and documented for the task of processing backlog of legacy data (task to be undertaken outwith this project)||NO||This will be determined by the DIU team outwith this project (see outstanding issues below)|
|Automation of preservation workflow||Delivered||How delivered|
|Automation of preservation workflow||YES||Established and tested proof-of-concept now in place|
|Ability to process new and legacy content||N/A||Legacy content proof-of-concept was removed from scope during the course of the project (see variances section below)|
|Provides a further step in the progress towards Digital Preservation as a Service||YES||This established proof-of-concept for an automated transfer process will enable the expansion of digital preservation as a service|
|Increase skills capacity within Digital Library team to use Archivematica||YES||Through cross-team collaborative working and training delivered, as well as through the use of the new digital preservation process moving forward|
Analysis of Resource Usage
|Project Brief estimated costs||23d||23d|
|Changes to costs||0d||0d|
In the course of more detailed effort and time estimations undertaken during the planning phase of the project, it was determined that a total of 23 days would be required in order to adequately resource the project. Additionally, the delivery and closure milestones were moved back slightly in order to ensure adequate time was available to complete all project tasks. This change was approved by WIS on 11/12/2020.
Due to unforeseen complexity in configuring Goobi to produce the correct metadata.json files, and the project's assigned developer being on half-furlough, the initial milestone estimates proved to be slightly optimistic. The dates for the milestones related to the delivery of work packages were therefore revised during the project team meeting on 25/02/2021.
After a re-scoping exercise, the project team determined that a small amount of additional time would be required in order to complete the remaining project tasks. This was due to delays caused by the availability of key development staff over the Easter respite period and slightly more effort than was originally anticipated being required to complete the Archivematica automated script development. The revised dates were approved by WIS on 16/04/2021.
In addition to revising the Delivery and Closure milestones, the re-scoping exercise highlighted the need for the scope of the project to be adjusted as follows:
- This change of scope relates to WP2: Proof-of-concept for legacy content, aimed at extending the newly automated link between Goobi and Archivematica to support the transfer of legacy content.
- While the WP2: Proof-of-concept for legacy content work package was initially deemed in-scope, as the project progressed the project team determined that it made more sense for this piece of work to be completed outwith the scope of this project at a later date.
- The rationale for this was that a significant amount of data cleansing, sorting and key decisions around the renaming of files would need to take place before the team would be in a position to establish an effective process for processing this material. The work required to achieve this potentially involves time-consuming data wrangling, metadata editing and the preparation of new workflows, such as a file format policy, which was unforeseen at the time of initial scoping for DLIB011.
- The team was therefore of the view that any work to create a proof-of-concept for legacy material at this stage (pre data cleansing) could only be done using 'low-hanging fruit' type content. This meant that a significant amount of work would need to be redone or significantly adjusted at a later date in order to take into account more complex content types and the changes that will be occurring through the metadata review process and the implementation of the new DAMS (DLIB008), bringing the value of completing this proof-of-concept into question.
- The removal of WP2 from the scope of the project was approved by WIS on 16/04/2021.
- Goobi digital preservation was not sufficiently mature at the start of the project, so checksum generation could only be implemented later in the project. This resulted in the Lyell content needing to go into Digital Preservation manually, which led to the decision to use Books & Borrowing content for the pilot.
Key Learning Points
Increased cross-team collaboration between Digital Library & Project Services
The project team worked well and effectively, particularly given the significant impact of COVID-19 remote working
Successful development of automation from a content management system (Goobi) to Archivematica has provided a highly valuable template for automating the transfer from other systems to long-term storage, potentially supporting digital preservation workflows for other content across the Library
Collaborative high-level digital preservation review of digitisation workflows and some types of digitised content during the course of building requirements for this project has furthered the development of digital preservation strategies for digitised content
Drafting the metadata requirements for digital preservation of a digitised collection against the format and output options from Goobi has been a valuable exercise in aligning systems and processes across teams
There remains a need to establish an appropriate reporting mechanism for gathering statistics on the throughput of content through the new automated process, in order to provide transparent and verifiable evidence that processes are working effectively and to communicate the benefits and trustworthiness of the digital preservation service across the Library and to wider stakeholders. A bespoke solution or effective work-around will need to be found to address this moving forward outwith the project
A procedure for how the existing backlog of content is to be processed will need to be agreed
A documented procedure for retrieving content from the digital preservation system to support the delivery of access to digitised content remains outstanding
A guiding policy and strategy for how to share responsibility for the management and maintenance of preserved digitised content, as well as system functionality to support that management and maintenance, remain outstanding
The proof-of-concept for the automated process did not include configuring a step to ensure that successfully preserved content is removed from its original deposit location at the end of the process. This is an important “housekeeping” step that will be addressed outwith the project
Because the automation from Goobi is the second source of content so far feeding into Archivematica (the other being a similarly configured hot folder), queuing/pooling is an issue. In a live scenario, if this workflow were executed in the current environment, new content would be stalled in potentially significant queues for preservation processing, creating a bottleneck and a barrier to the use of Archivematica by multiple users. (This will be addressed as part of the scalability review being undertaken in DLIB012)
The outstanding issues above are to be addressed through close collaboration between the Development and Systems team, the digitisation team and the Digital Archivist and after further analysis an appropriate determination will be made as to whether these are to be addressed through the initiation of new DLIB projects, or as part of BAU. This includes the following steps:
Increasing the scalability of the system, including the potential upgrade of technology, and investigating the establishment of multiple concurrent instances of Archivematica. A scalability review is being conducted as part of DLIB012, and the results of this will inform future steps here
Begin the processing of backlog content
Expand process to address legacy content, including creation of new Goobi workflow/s for this
Establish appropriate reporting mechanism for gathering statistics on throughput (see above)
Agree outstanding process and procedure questions (see above)