Closure Report

Project Summary

The University of Edinburgh’s DataVault, part of the Research Data Service, is a facility supported by the Research Data Support (RDS) team and delivered by the Digital Library, both part of the Library and University Collections (L&UC) division. It helps researchers comply with funding regulations by providing cost-effective, secure, long-term archiving of their research data.

In 2015-16, the JISC-funded DataVault project, run by the University’s RDS team in collaboration with the University of Manchester, developed generic software designed to collect basic research metadata and deposit research datasets into an archive service. In parallel, the University’s ITI Research Services Section (RSS) implemented an initial systems architecture to support these archive services.

Because of demand and the time needed to develop a UoE-specific DataVault service, an interim service was developed and launched in 2016. The interim service used the RSS Storage Manager facility to manage research data deposits on the disk-based DataStore system.

In 2017, a project to deliver a fully supported, cost-effective DataVault service began, and the service was launched in January 2019. The main objectives of this DataVault project (RSS022) were to –

  • expand the interim service to include –

    • an improved (self-service) user interface
    • encryption of data
    • resilient storage on internal and external archive services
    • usage reporting
  • design a robust system architecture for the service
  • define the processes and policies of the full service
  • achieve GDPR compliance for the service
  • launch DataVault as a fully supported RDS service

A third project, DataVault III (RSS044), funded from the FY2018-19 DRS budget, ran from February to November 2019 and delivered additional features to the DataVault service including –

  • organisational structures and roles for access management
  • automated billing and usage reporting
  • storage of deposits up to 10TB
  • reduced processing times and improved resilience
  • the migration of deposits from the interim DataVault service
  • improved release management for DataVault deployments to test and production

The latest project, RSS212 – DataVault Development Phase IV, will run from November 2019 to July 2020 and is chartered to deliver the following features –

  • retention date management
  • processing data from other file systems in addition to DataStore
  • automated scheduling and processing of deposits
  • additional resilience and recovery measures
  • user interface improvements
  • improved synchronisation of metadata between DataVault and Pure
  • processing increased deposit sizes (above 10TB)
  • evaluation of improved encryption key management
  • improved processing times

 

Project Scope

DataVault has been developed to provide a low-cost and reliable archiving service for the University’s research community in compliance with research data management and funding policies. The scope of this project will concentrate on improving the features, resilience and performance of the service for the University’s research community.

Significant features were added to the DataVault service during the previous project (RSS044), including roles management, processing of datasets up to 10TB, storage auditing, automated billing and usage reporting. Work in this project will include improvements to retention date management, resilience and recovery, encryption key management, processing times, and the synchronisation of metadata collected by DataVault with Pure.

In addition, options were presented to increase archive sizing above the 10TB limit and, potentially, to process deposits of unlimited size. The project will, at least, develop proofs of concept for some of the options suggested and, depending on the development work required, could implement the processing of increased deposit sizes based on one of these options.

DataVault development has so far concentrated on the University’s research datasets stored in DataStore. The project plans to deliver the processing of data from other sources, e.g. local drives, network drives and OneDrive. In addition, the scope will include the evaluation of automated, scheduled processing of data from research equipment.

Encryption key management is essential to providing a secure service for DataVault users. It is recognised that improvements in this area need to be reviewed to provide enhanced security management to counter the increased threat levels to a service that manages, in part, sensitive research data.

 

Out of Scope

Expansion of the service to support the long-term needs of other groups within the University or elsewhere, outside the research community, is not within scope for this project. Future projects could be initiated to expand the service to provide, inter alia –

  • archiving of University business data
  • a DataVault service to other universities
  • a DataVault service to non-academic organisations

 

Objectives

The key objectives of this project are to deliver additional functionality and improved processing goals for the DataVault service. These include –

Objective | Achieved? | Supporting Notes (if applicable)
To provide an effective retention date management service which automatically tells users when deposits need to be reviewed for retention or deletion | Yes |
To increase maximum deposit sizes above the current 10TB limit (and potentially unlimited) to meet the demand to archive larger data sets e.g. imaging data | Yes |
To deliver a more resilient and recoverable service so that fewer failed deposits occur and, if they do, archiving, validation and retrieval can automatically restart from point of failure | Yes |
To reduce deposit and retrieval processing times to provide a basis for improved customer service levels while ensuring acceptable impact on other systems and network services | Yes |
To provide processing for other data sources, in addition to DataStore | Yes | Can deposit from Mac, VMs, etc. See documentation provided by William Petit.
To implement automated processing of deposits from continuously created data sources | No | Not prioritised through the sprint process in the time available.
To continue to develop the website user interface to ensure simple and effective access to DataVault services and features including the elimination or, at least, simplification of the authentication ‘cut-and-paste’ process currently used to access a user’s DataStore | Majority completed | The user interface is improved. This is part of the Pure / interface works still being completed and will need further review as part of any future project works. Again, time and prioritisation meant that this objective was never fully defined.
To improve the user experience when creating a vault, providing better synchronisation of research metadata between DataVault and Pure | Majority completed | The user interface is improved. This is part of the Pure / interface works still being completed.
To research improved encryption key management methods to ensure more secure access to archived data and service recovery | No | Other priorities prevailed and this was not prioritised through the sprint process.
To evaluate how DataVault could be used for low-cost archiving by other groups in the University (e.g. Finance), and even external organisations, that require regulatory data retention | No | Other priorities prevailed and this was not prioritised through the sprint process.
To continue to promote and provide workshops and training for researchers and support staff on the services offered by DataVault | Yes |
To implement more responsive release management processes, potentially using the new shared Puppet service, to ensure updates and fixes are installed effectively, minimising outages to the production services | Yes | The process has been improved and scripts are available, though it was decided that Puppet would not be used.

 

 

Deliverables

More details on specific deliverables to meet the objectives of the project are given in the table below.

Requirement | Deliverable(s) | MoSCoW | Achieved? | Supporting Notes (if applicable)
Provide effective retention date management and review features | A subsystem that manages data retention dates and alerts users to pending reviews for retention and deletion of data | Must | Yes |
Increase deposit sizes above the current 10TB limit | A DataVault service handling data sets above 10TB in size and potentially unlimited | Should | Yes |
Improve processing recovery procedures | A recovery subsystem that restarts archive, validation and retrieval processing from point of failure | Must | Yes |
Provide processing for other data sources, in addition to DataStore | An expanded DataVault service processing data from sources in addition to DataStore | Should | Yes |
Implement automated processing of deposits from continuously created data sources | Scheduling in place in DataVault to process data on a repetitive basis | Should | No |
Improve the website user interface and features | Improved website presentation and navigation features based on UX recommendations by Digirati developed during the previous DataVault project (RSS044) | Must | Yes |
Eliminate DataStore ‘cut-and-paste’ authentication | A simplified method of providing the user with access to their DataStore data from the DataVault | Should | No | De-prioritised during the project
Simplify retention policy updates | An automated data-driven (file) subsystem that provides facilities for DataVault administrators to easily update retention policy information | Must | Yes |
Develop a simplified process to update homepage announcements without downtime | An automated data-driven (file) process to change homepage notifications without the need to restart the service | Must | No | De-prioritised during the project
Implement an effective ‘service unavailable’ process during downtime | A ‘service unavailable’ page in place of the homepage when the service is down for any reason | Should | No | De-prioritised during the project
Better Pure synchronisation | An improved connection to Pure to push metadata; effective management of Pure API upgrades | Must | No | Not yet completed; works ongoing
Reduce deposit and retrieval processing times | Significantly reduced deposit and retrieval processing times in comparison to current observed rates | Must | Yes |
Evaluate alternative opportunities for the use of the DataVault service | A report on specific areas of possible implementation including benefits analysis and development effort | Should | No | De-prioritised during the project
Promote information and training on DataVault services to potential users | Publicity and training material including video, presentations, workshops, blog posts | Must | Yes |
Automated release management procedures | An automated and reliable Puppet deployment service for release to test, demonstration and live environments | Should | Yes | The process has been improved and scripts are available, though it was decided that Puppet would not be used.

 

 

Success Criteria

The following are the criteria to be met to ensure a successful completion of this project –

Success Criteria | Achieved? | Supporting Notes (if applicable)
An increase in the usage of the DataVault service through the promotion of the service, publicising the additional features and processing improvements to be delivered by this project | Yes |
The effective deployment of retention review reminders via email and other methods | Yes |
The elimination of deposit failures from unplanned outages and loss of TSM and Oracle subsystems | Yes |
A reduction in system outages and downtime through improved resilience and data driven parameterisation of the service | Yes |
A significant reduction in the processing time (e.g. 25%) for the archiving, validation and retrieval of large datasets | Yes |
Report published on the potential use of the DataVault service, internally and externally, in archiving other data requiring regulatory retention | No | De-prioritised during the project

 

Benefits

The successful delivery of this project is expected to provide the following benefits –

Benefits | Achieved? | Supporting Notes (if applicable)
Improved website user interface and features for users of DataVault which should attract new users and retain existing ones | Yes |
Better control of retention dates so that deposits can be reviewed and removed in a timely manner, reducing costs to data owners | Yes |
Enable researchers to archive deposits in excess of 10TB thus opening DataVault to new users with larger datasets | Yes | Though maximum is 10TB currently
Recovery, from point of failure, of deposits during processing, if impacted by planned or unplanned outages, again contributing to increased customer satisfaction | Yes |
Deposit and retrieval processing times improved so reducing wait times and risk of unplanned outages and improving customer satisfaction | Yes |
Identification of improved data security measures through more robust encryption key management processes making DataVault more attractive to users with sensitive data to archive | No | De-prioritised during the project
Increased promotion of the service and training channels to publicise the service and expand the user base | Yes |

 

Outcome

Through a 'semi' Agile approach, the project stayed focused and delivered the majority of the key deliverables identified by the project stakeholders / business team.

The team worked hard and, despite factors such as Covid, achieved all major deliverables / objectives, while also adapting to deliver business as usual (BAU) fixes as required and to meet a requirement to migrate from Oracle Gen1 to Gen2 storage.

 

Key Learning Points

  • Adopting a 'semi' Agile / sprint approach worked very well for the team and allowed them to focus on the agreed priorities.
  • Regular and effective communication and planning in focused sprints, using a Jira board, also helped the work progress.
  • Focused meetings meant that fewer of them were required.
  • Having multiple technical resources available at times allowed for discussion and learning, especially when problems were experienced.
  • Effective use of tools, notably Slack and Teams, facilitated remote working and almost nullified the impact of Covid.

 

Outstanding Issues

The following identified project works remain outstanding. All should be reviewed and assessed as part of any future DV project planning, alongside a review of the DV IV backlog in Jira:

  • Improved synchronisation of metadata between DataVault and Pure

    • It is estimated that there are still approx. 40 days of development / testing to complete these works.
    • Interface / screen works are almost completed; however, integration and database population are outstanding. The delay can chiefly be attributed to the need to pick up the Oracle migration works through the agile process.
    • These works include a number of other User interface improvements
  • Automated scheduling and processing of deposits
  • Eliminate DataStore ‘cut-and-paste’ authentication
  • Evaluation of improved encryption key management
  • Evaluate alternative opportunities for the use of the DataVault service
  • Develop a simplified process to update homepage announcements without downtime
  • Implement an effective ‘service unavailable’ process during downtime
  • Storage Manager Certificate Issue 15/12: Jira "RSS212-161 Analysis for options for adding key to Storage manager issue" tracks progress against this issue. There has been no progress: a workaround is in place and higher priorities in RSS Systems have required attention. The Jira will remain open in the project backlog so that it can be completed at the first appropriate time. The issue will be updated to reflect this and closed within the project, to be tracked via BAU or any new DV project.

 

 

After project closure the DV team needs to decide how to progress. Options include:

  • Reverting to BAU, while postponing the Pure / Metadata Interface Works
  • Finishing the Pure / Metadata Interface Works alongside BAU
  • Creating a DV V Project that could:
    • Finish the Pure / Metadata Interface Works
    • Include agreed refactoring works
    • Any other considerations?

This will be decided in the coming weeks.

 

 

 

 

 

Project Info

Project: DataVault Development Phase IV
Code: RSS212
Programme: ITI - Research Services (RSS)
Management Office: ISG PMO
Project Manager: Richard Bailey
Project Sponsor: Anthony Weir
Current Stage: Close
Status: Closed
Project Classification: Grow
Start Date: 04-Nov-2019
Planning Date: 08-Apr-2020
Delivery Date: 16-Jul-2021
Close Date: 28-Jul-2021
Overall Priority: Normal
Category: Compliance
