Closure Report

 

Project Summary 

 

Background 

Under the DLIB004 project, the Digital Library team successfully implemented Goobi, new software for managing digitisation workflows.

The goal of the DLIB011 project was to establish a proof-of-concept for an automated link between Goobi and Archivematica to support the digital preservation of new content. 

Within the Digital Library, the Development and Systems team, in close collaboration with the digitisation team and the Digital Archivist, has been developing automation technology to support the digital preservation of digitised content processed through Goobi. A process has now been developed that enables Goobi users to send digitised collection material, at the end of a Goobi workflow, to a ‘hot folder’ where Archivematica can pick it up and process it for long-term preservation storage. The authenticity and integrity of the transferred material are verified at multiple points: at ingest into Archivematica, using checksums assigned by Goobi, and through an automated email alert to the DIU team if the preservation process in Archivematica fails.
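The checksum check described above can be illustrated with a minimal Python sketch. It is not the project's actual ingest script and not Archivematica's own verification. It assumes, purely for illustration, that the checksums are SHA-256 digests listed in a manifest file with one `<hex digest>  <relative path>` entry per line. The function names `sha256_of` and `verify_manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(transfer_dir: Path, manifest: Path) -> list[str]:
    """Return the relative paths that are missing or whose digest differs.

    Assumed manifest format (hypothetical): "<hex digest>  <relative path>",
    one entry per line, with paths relative to transfer_dir.
    """
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.strip()
        target = transfer_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty result means every listed file is present and unchanged. A non-empty result is the kind of condition that would warrant an alert to the responsible team.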

While the test material has been successfully transferred through the new automated process, work remains to develop an end-to-end workflow for the digital preservation of digitised materials, both for legacy materials and newly digitised content. In particular, pre-ingest curation and management procedures as well as long-term management and preservation planning procedures need to be created for both legacy and newly digitised content. 

  

Scope 

The scope of this project comprised the following:

  • Automating a digital preservation workflow to accommodate new content - establishing a proof-of-concept   

  • Automating a digital preservation workflow to accommodate legacy content, using a selection of content as a proof-of-concept (note: this element was removed from scope during the project, as a re-evaluation of the work necessary to process legacy material indicated it would take longer than originally anticipated; for a more detailed discussion, see Variances below)

  • Ensuring appropriate scalability of Archivematica in order to support the pilot content material  

  • Updating relevant procedures and documentation, and delivering appropriate training to specified Digital Library staff to ensure adoption and advocacy of the new preservation processes  

  

Out of scope 

The following elements were deemed out of scope:

  • The use of DataVault for the storage of collections data. DataVault holds content for up to 10 years but does not carry out any preservation actions on the files, such as migrating file formats in response to technology obsolescence. Furthermore, the Digital Library has a requirement, established in the Digital Preservation Policy, to safeguard this material for longer than 10 years in most cases.

  • Any scalability of storage was to be strictly limited to what was required for the preservation of the proof-of-concept pilot content. Any further increase beyond this is to be assessed incrementally as part of a future project.  

  • Processing the existing backlog of legacy material. This project covered the proof-of-concept for newly digitised content only; the backlog content will be processed as a future service activity once the project is complete.

 

Objectives & Deliverables

No | Description | MoSCoW | Delivered? | Reason for not delivering | Output
O1 | To create an automated digital preservation workflow to ensure all new digitised content is preserved (and to stop the backlog growing) | | | |
D1 | Archivematica configuration settings agreed | MUST | YES | | Settings implemented within Archivematica
D2 | Goobi configured to output correctly configured JSON file, METS metadata file and checksum validation information | MUST | YES | | New digital preservation step created in Goobi workflow
D3 | Hot folder that links Goobi output to Archivematica ingest activated | MUST | YES | | Hot folder ingest script created
D4 | Lyell notebooks run through new digitisation process as proof-of-concept | MUST | YES* | | *While Lyell content was used to draft metadata requirements and to assess the necessary steps in Goobi, alternate content (Books and Borrowing) was determined to be more appropriate and was used in establishing the proof-of-concept
D5 | Results tested to confirm accuracy | MUST | YES | |
O2 | To create an automated digital preservation workflow to preserve identified legacy material | | | |
D6 | Metadata requirements established | MUST | NO | Legacy content proof-of-concept was removed from scope during the course of the project (see Variances section below) |
D7 | Script written to generate correct JSON metadata files using current library process | MUST | NO | As above |
D8 | Chosen pilot material run through new digitisation process as proof-of-concept | MUST | NO | As above |
D9 | Results tested to confirm accuracy | MUST | NO | As above |
O3 | To update procedures and processes, create user guides, and deliver training to specified Digital Library staff to ensure adoption and advocacy of new preservation processes | | | |
D10 | Goobi User Guide created | MUST | YES | | Goobi User Guide
D11 | New procedures documented | MUST | YES | | Link to GitHub
D12 | Training delivered to Digital Library staff | MUST | YES | |

Initial training has been delivered. A one-hour session on Digital Preservation Concepts and Overview of Archivematica was delivered by the Digital Archivist on 3/12/20 (after the kick-off of DLIB011). It was aimed at the project team but opened up to Library staff. The objectives of the session were:

  

What does Archivematica actually *do*?

What is digital preservation, in brief? 

What is OAIS and why is it used by Archivematica?  

What are the main microservices run by Archivematica and why are they useful? 

What are the configuration settings in Archivematica? 

Why is it important to design an AIP and document those decisions? 

  

Attendees: Gavin Willshaw, Susan Pettigrew, Ianthe Sutherland, Scott Renton, Hrafn Malmquist, David Speed, Alex Ross, Aline Brodin, Kirsty Stewart, and Lesley Bryson  

 

Additional training will be delivered by the Digital Archivist to the digitisation team on the main concepts of digital preservation and the basic digital preservation process for content managed in Goobi. This training is scheduled for 03/06/21.

           

 

Success Criteria

Success Criteria as stated in project brief | Delivered? | How delivered?
Successful proof-of-concept for the digital preservation of new content: items are automatically picked up by Archivematica and processed as designed without failure; AIPs and DIPs appear in the correct location and contain the correct content and preservation description information | YES | Proof-of-concept has been created and testing shows it is operating as designed. The automated link between Goobi and Archivematica will support an end-to-end workflow for newly digitised content. However, manual activities and system functionality for long-term management and preservation planning remain outstanding, due to the variation and complexity of digitised materials, and will be addressed outwith this project
Successful proof-of-concept for the digital preservation of legacy content: items are automatically picked up by Archivematica and processed as designed without failure; AIPs and DIPs appear in the correct location and contain the correct content and preservation description information | N/A | Legacy content proof-of-concept was removed from scope during the course of the project (see Variances section below)
Procedure agreed and documented for digitally preserving new material moving forward | PARTLY | Relevant user guides have been updated with information about the automated transfer of digitised content from Goobi to Archivematica. However, some procedures are yet to be agreed (see Outstanding Issues below)
Procedure agreed and documented for the task of processing backlog of legacy data (task to be undertaken outwith this project) | NO | This will be determined by the DIU team outwith this project (see Outstanding Issues below)

 

Benefits

Benefit | Delivered? | How delivered?
Automation of preservation workflow | YES | Established and tested proof-of-concept now in place
Ability to process new and legacy content | N/A | Legacy content proof-of-concept was removed from scope during the course of the project (see Variances section below)
Provides a further step in the progress towards Digital Preservation as a Service | YES | This established proof-of-concept for an automated transfer process will enable the expansion of digital preservation as a service
Increase skills capacity within Digital Library team to use Archivematica | YES | Through cross-team collaborative working and training delivered, as well as through the use of the new digital preservation process moving forward

 

Analysis of Resource Usage

Cost

 | 20/21 | Total
Project Brief estimated costs | 23d | 23d
Changes to costs | 0d | 0d
Actual Cost | 23d | 23d

 

 

Variances  

  • In the course of more detailed effort and time estimations undertaken during the planning phase of the project, it was determined that a total of 23 days would be required in order to adequately resource the project. Additionally, the delivery and closure milestones were moved back slightly in order to ensure adequate time was available to complete all project tasks. This change was approved by WIS on 11/12/2020. 

  • Due to unforeseen complexity in configuring Goobi to produce the correct metadata.json files, and the project's assigned developer being on half-furlough, the initial milestone dates proved slightly optimistic. The milestone dates for the delivery of work packages were therefore revised during the project team meeting on 25/02/2021.

  • After a re-scoping exercise, the project team determined that a small amount of additional time would be required in order to complete the remaining project tasks. This was due to the limited availability of key development staff over the Easter respite period, and to the Archivematica automated script development requiring slightly more effort than originally anticipated. The revised dates were approved by WIS on 16/04/2021.

  • In addition to revising the Delivery and Closure milestones, the re-scoping exercise highlighted the need for the scope of the project to be adjusted as follows: 

    • This change of scope relates to WP2: Proof-of-concept for legacy content, aimed at extending the newly automated link between Goobi and Archivematica to support the transfer of legacy content.  
    • While the WP2: Proof-of-concept for legacy content work package was initially deemed in-scope, as the project progressed the project team determined that it made more sense for this piece of work to be completed outwith the scope of this project at a later date. 
    • The rationale was that a significant amount of data cleansing and sorting, along with key decisions around the renaming of files, would need to take place before the team would be in a position to establish an effective process for this material. The work required to achieve this potentially involves time-consuming data wrangling, metadata editing and the preparation of new workflows, such as a file format policy, none of which was foreseen at the time of initial scoping for DLIB011.
    • The team was therefore of the view that any work to create a proof-of-concept for legacy material at this stage (pre data cleansing) could only be done using 'low-hanging fruit' type content. This meant that a significant amount of work would need to be redone or significantly adjusted at a later date in order to take into account more complex content types and the changes that will be occurring through the metadata review process and the implementation of the new DAMS (DLIB008), bringing the value of completing this proof-of-concept into question. 
    • The removal of WP2 from the scope of the project was approved by WIS on 16/04/2021.
  • Goobi's digital preservation functionality was not sufficiently mature at the start of the project, so checksum generation could only be implemented later in the project. This resulted in the Lyell content needing to go into digital preservation manually, which led to the decision to use Books and Borrowing content for the pilot.

 

Key Learning Points  

  1. Increased cross-team collaboration between Digital Library & Project Services 

  2. The project team worked well and effectively, particularly given the significant impact of COVID-19 remote working

  3. Successful development of automation from a content management system (Goobi) to Archivematica has provided a highly valuable template for automating the transfer from other systems to long-term storage, potentially supporting digital preservation workflows for other content across the Library

  4. Collaborative high-level digital preservation review of digitisation workflows and some types of digitised content during the course of building requirements for this project has furthered the development of digital preservation strategies for digitised content

  5. Drafting the metadata requirements for digital preservation of a digitised collection against the format and output options from Goobi has been a valuable exercise in aligning systems and processes across teams

Outstanding Issues  

  1. There remains a need to establish an appropriate reporting mechanism for gathering statistics on the throughput of content through the new automated process. This is needed to provide transparent and verifiable evidence that processes are working effectively, and to communicate the benefits and trustworthiness of the digital preservation service across the Library and to wider stakeholders. A bespoke solution or effective work-around will need to be found to address this moving forward outwith the project

  2. A procedure for how the existing backlog of content is to be processed will need to be agreed

  3. A documented procedure for retrieving content from the digital preservation system to support the delivery of access to digitised content remains outstanding

  4. A guiding policy and strategy for sharing responsibility for the management and maintenance of preserved digitised content, as well as system functionality to support that management and maintenance, remain outstanding

  5. The proof-of-concept for the automated process did not include configuring a process to ensure that successfully preserved content is removed from its original deposit location at the end of the process. This is an important “housekeeping” step that will be addressed outwith the project

  6. The automation process coming from Goobi is the second source of content so far feeding into Archivematica (the other being a similarly set up hot folder), which raises the issue of queuing/pooling. In a live scenario, if this workflow were executed in the current environment, new content would be stalled in potentially significant queues for preservation processing, creating a bottleneck and a barrier to the use of Archivematica by multiple users. (This will be addressed as part of the scalability review being undertaken in DLIB012)

 

Next steps 

The outstanding issues above are to be addressed through close collaboration between the Development and Systems team, the digitisation team and the Digital Archivist. After further analysis, a determination will be made as to whether they are to be addressed through the initiation of new DLIB projects or as part of BAU. This includes the following steps:

  1. Increasing the scalability of the system, including the potential upgrade of technology, and investigating the establishment of multiple concurrent instances of Archivematica. A scalability review is being conducted as part of DLIB012, and the results of this will inform future steps here

  2. Beginning the processing of backlog content

  3. Expanding the process to address legacy content, including the creation of new Goobi workflow/s for this

  4. Establishing an appropriate reporting mechanism for gathering statistics on throughput (see above)

  5. Agreeing outstanding process and procedure questions (see above)

Project Info

Project: Automation of Digital Preservation Workflow for Digitisation
Code: DLIB011
Programme: Digital Transformation - Digital Library
Management Office: ISG PMO
Project Manager: Alex Ross
Project Sponsor: Kirsty Lingstadt
Current Stage: Close
Status: Closed
Project Classification: Transform
Start Date: 13-Nov-2020
Planning Date: 04-Dec-2020
Delivery Date: 30-Apr-2021
Close Date: 14-May-2021
Programme Priority: 5
Overall Priority: Normal
Category: Discretionary
