Closure Report
Project Summary
The University of Edinburgh’s DataVault, part of the Research Data Service, is a facility supported by the Research Data Support (RDS) team and delivered by the Digital Library, both part of the Library and University Collections (L&UC) division. It assists researchers in complying with funding regulations by providing cost-effective, secure, long-term archiving of their research data.
In 2015-16, the Jisc-funded DataVault project, run by the University’s RDS team in collaboration with the University of Manchester, developed generic software designed to collect basic research metadata and deposit research datasets into an archive service. In parallel, the University’s ITI Research Services Section (RSS) implemented an initial systems architecture to support these archive services.
Due to demand and the time needed to develop a UoE-specific DataVault service, an interim service was developed and launched in 2016. The interim service used the RSS Storage Manager facility to manage research data deposits on the disk-based DataStore system.
In 2017, a project to deliver a fully supported, cost-effective DataVault service began, and the service was launched in January 2019. The main objectives of this DataVault project (RSS022) were to deliver the following:
- expand the interim service to include:
  - an improved (self-service) user interface
  - encryption of data
  - resilient storage on internal and external archive services
  - usage reporting
- design a robust system architecture for the service
- define the processes and policies of the full service
- achieve GDPR compliance for the service
- launch DataVault as a fully supported RDS service
In 2019, a third project, DataVault III (RSS044), funded from the FY2018-19 DRS budget, ran from February to November 2019 and delivered additional features to the DataVault service, including:
- organisational structures and roles for access management
- automated billing and usage reporting
- storage of deposits up to 10TB
- reduced processing times and improved resilience
- the migration of deposits from the interim DataVault service
- improved release management for DataVault deployments to test and production
The latest project, RSS212 – DataVault Development Phase IV, was chartered to run from November 2019 to July 2020 and to deliver the following features:
- retention date management
- processing data from other file systems in addition to DataStore
- automated scheduling and processing of deposits
- additional resilience and recovery measures
- user interface improvements
- improved synchronisation of metadata between DataVault and Pure
- processing increased deposit sizes (above 10TB)
- evaluation of improved encryption key management
- improved processing times
Project Scope
DataVault has been developed to provide a low-cost and reliable archiving service for the University’s research community in compliance with research data management and funding policies. The scope of this project concentrates on improving the features, resilience and performance of the service.
Significant features were added to the DataVault service during the previous project (RSS044), including roles management, processing of datasets up to 10TB, storage auditing, and automated billing and usage reporting. Work in this project includes improvements to retention date management, resilience and recovery, encryption key management, processing times, and the synchronisation of metadata collected by DataVault with Pure.
In addition, options were presented to increase archive sizes above the 10TB limit and potentially to process deposits of unlimited size. The project will, at least, develop proofs of concept for some of the options suggested and, depending on the development work required, could implement the processing of increased deposit sizes based on one of these options.
DataVault development has so far concentrated on the University’s research datasets stored in DataStore. The project plans to deliver the processing of data from other sources, e.g. local drives, network drives and OneDrive. In addition, the scope includes the evaluation of automated, scheduled processing of data from research equipment.
Encryption key management is essential to providing a secure service for DataVault users. It is recognised that improvements in this area need to be reviewed to provide enhanced security management and to counter the increased threat levels facing a service that manages, in part, sensitive research data.
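As a purely illustrative sketch of the kind of key management pattern such a review might consider (an assumption for illustration only, not a description of DataVault’s actual implementation), the example below shows envelope encryption: each deposit is encrypted with its own data key, and that data key is itself wrapped by a separate master key, so the master key can be held or rotated independently of the archived data. The class and method names are hypothetical.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical envelope-encryption sketch, not DataVault's actual code.
// Each deposit gets its own AES data key; the data key is wrapped by a master
// key that would normally live in a dedicated key store (e.g. an HSM or secrets service).
public class EnvelopeEncryptionSketch {

    private static final SecureRandom RANDOM = new SecureRandom();

    // Generate a fresh 256-bit AES key (used here for both the master key and data keys).
    static SecretKey newAesKey() throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(256);
        return gen.generateKey();
    }

    // Encrypt one chunk of deposit data with the data key using AES-GCM,
    // prepending the random IV so the chunk can be decrypted later.
    static byte[] encryptChunk(SecretKey dataKey, byte[] plaintext) throws Exception {
        byte[] iv = new byte[12];
        RANDOM.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, dataKey, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        byte[] out = new byte[iv.length + ciphertext.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
        return out;
    }

    // Wrap (encrypt) the data key with the master key; only the wrapped copy is stored.
    static byte[] wrapDataKey(SecretKey masterKey, SecretKey dataKey) throws Exception {
        Cipher cipher = Cipher.getInstance("AESWrap");
        cipher.init(Cipher.WRAP_MODE, masterKey);
        return cipher.wrap(dataKey);
    }

    // Unwrap the data key again when a deposit needs to be retrieved or audited.
    static SecretKey unwrapDataKey(SecretKey masterKey, byte[] wrapped) throws Exception {
        Cipher cipher = Cipher.getInstance("AESWrap");
        cipher.init(Cipher.UNWRAP_MODE, masterKey);
        return (SecretKey) cipher.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
    }

    public static void main(String[] args) throws Exception {
        // In practice the master key would come from a key store, not be generated in place.
        SecretKey masterKey = newAesKey();
        SecretKey dataKey = newAesKey();

        byte[] encryptedChunk = encryptChunk(dataKey, "example deposit data".getBytes());
        byte[] wrappedKey = wrapDataKey(masterKey, dataKey);

        // Only encryptedChunk and wrappedKey need to be stored with the archive.
        SecretKey recovered = unwrapDataKey(masterKey, wrappedKey);
        System.out.println("data key recovered: "
                + Arrays.equals(recovered.getEncoded(), dataKey.getEncoded())
                + ", encrypted chunk bytes: " + encryptedChunk.length);
    }
}
```

The attraction of this pattern is that rotating or better protecting the master key does not require re-encrypting the archived data itself, only re-wrapping the stored data keys.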
Out of Scope
Expansion of the service to support the long-term needs of other groups within the University or elsewhere, outside the research community, is not within scope for this project. Future projects could be initiated to expand the service to provide, inter alia:
- archiving of University business data
- a DataVault service to other universities
- a DataVault service to non-academic organisations
Objectives
The key objectives of this project were to deliver additional functionality and improved processing for the DataVault service. These include the following:
| Objective | Achieved? | Supporting Notes (if applicable) |
| --- | --- | --- |
| To provide an effective retention date management service which automatically tells users when deposits need to be reviewed for retention or deletion | Yes | |
| To increase maximum deposit sizes above the current 10TB limit (and potentially unlimited) to meet the demand to archive larger data sets, e.g. imaging data | Yes | |
| To deliver a more resilient and recoverable service so that fewer failed deposits occur and, if they do, archiving, validation and retrieval can automatically restart from the point of failure | Yes | |
| To reduce deposit and retrieval processing times to provide a basis for improved customer service levels while ensuring acceptable impact on other systems and network services | Yes | |
| To provide processing for other data sources, in addition to DataStore | Yes | Deposits can be made from Macs, VMs, etc. See documentation provided by William Petit. |
| To implement automated processing of deposits from continuously created data sources | No | Not prioritised through the sprint process in the time available. |
| To continue to develop the website user interface to ensure simple and effective access to DataVault services and features, including the elimination or, at least, simplification of the authentication ‘cut-and-paste’ process currently used to access a user’s DataStore | | The majority is complete and the user interface is improved. This is part of the Pure/interface work still under completion and will need further review as part of any future project. Time and prioritisation constraints meant that this objective was never fully defined. |
| To improve the user experience when creating a vault, providing better synchronisation of research metadata between DataVault and Pure | | The majority is complete and the user interface is improved. This is part of the Pure/interface work still under completion. |
| To research improved encryption key management methods to ensure more secure access to archived data and service recovery | No | Other priorities prevailed and this was not prioritised through the sprint process. |
| To evaluate how DataVault could be used for low-cost archiving by other groups in the University (e.g. Finance), and even external organisations, that require regulatory data retention | | |
| To continue to promote and provide workshops and training for researchers and support staff on the services offered by DataVault | Yes | |
| To implement more responsive release management processes, potentially using the new shared Puppet service, to ensure updates and fixes are installed effectively, minimising outages to the production services | Yes | The process has been improved and there are scripts available, though it was decided that Puppet would not be used. |
Deliverables
More details on specific deliverables to meet the objectives of the project are given in the table below.
| Requirement | Deliverable(s) | MoSCoW | Achieved? | Supporting Notes (if applicable) |
| --- | --- | --- | --- | --- |
| Provide effective retention date management and review features | A subsystem that manages data retention dates and alerts users to pending reviews for retention and deletion of data | Must | Yes | |
| Increase deposit sizes above the current 10TB limit | A DataVault service handling data sets above 10TB in size and potentially unlimited | Should | Yes | |
| Improve processing recovery procedures | A recovery subsystem that restarts archive, validation and retrieval processing from the point of failure | Must | Yes | |
| Provide processing for other data sources, in addition to DataStore | An expanded DataVault service processing data from sources in addition to DataStore | Should | Yes | |
| Implement automated processing of deposits from continuously created data sources | Scheduling in place in DataVault to process data on a repetitive basis | Should | No | |
| Improve the website user interface and features | Improved website presentation and navigation features based on UX recommendations by Digirati developed during the previous DataVault project (RSS044) | Must | Yes | |
| Eliminate DataStore ‘cut-and-paste’ authentication | A simplified method of providing the user with access to their DataStore data from the DataVault | Should | No | De-prioritised during the project. |
| Simplify retention policy updates | An automated, data-driven (file-based) subsystem that provides facilities for DataVault administrators to easily update retention policy information | Must | Yes | |
| Develop a simplified process to update homepage announcements without downtime | An automated, data-driven (file-based) process to change homepage notifications without the need to restart the service | Must | No | De-prioritised during the project. |
| Implement an effective ‘service unavailable’ process during downtime | A ‘service unavailable’ page in place of the homepage when the service is down for any reason | Should | No | De-prioritised during the project. |
| Better Pure synchronisation | An improved connection to Pure to push metadata; effective management of Pure API upgrades | Must | No | Not yet completed; work is ongoing. |
| Reduce deposit and retrieval processing times | Significantly reduced deposit and retrieval processing times in comparison to current observed rates | Must | Yes | |
| Evaluate alternative opportunities for the use of the DataVault service | A report on specific areas of possible implementation including benefits analysis and development effort | Should | No | De-prioritised during the project. |
| Promote information and training on DataVault services to potential users | Publicity and training material including videos, presentations, workshops and blog posts | Must | Yes | |
| Automated release management procedures | An automated and reliable Puppet deployment service for release to test, demonstration and live environments | Should | Yes | The process has been improved and there are scripts available, though it was decided that Puppet would not be used. |
Success Criteria
The following are the criteria to be met to ensure successful completion of this project:
| Success Criteria | Achieved? | Supporting Notes (if applicable) |
| --- | --- | --- |
| An increase in the usage of the DataVault service through the promotion of the service, publicising the additional features and processing improvements to be delivered by this project | Yes | |
| The effective deployment of retention review reminders via email and other methods | Yes | |
| The elimination of deposit failures from unplanned outages and loss of TSM and Oracle subsystems | Yes | |
| A reduction in system outages and downtime through improved resilience and data-driven parameterisation of the service | Yes | |
| A significant reduction in the processing time (e.g. 25%) for the archiving, validation and retrieval of large datasets | Yes | |
| Report published on the potential use of the DataVault service, internally and externally, in archiving other data requiring regulatory retention | No | De-prioritised during the project. |
Benefits
The successful delivery of this project is expected to provide the following benefits:
| Benefits | Achieved? | Supporting Notes (if applicable) |
| --- | --- | --- |
| Improved website user interface and features for users of DataVault, which should attract new users and retain existing ones | Yes | |
| Better control of retention dates so that deposits can be reviewed and removed in a timely manner, reducing costs to data owners | Yes | |
| Enable researchers to archive deposits in excess of 10TB, thus opening DataVault to new users with larger datasets | Yes | Though the maximum is currently 10TB. |
| Recovery, from the point of failure, of deposits during processing, if impacted by planned or unplanned outages, again contributing to increased customer satisfaction | Yes | |
| Deposit and retrieval processing times improved, so reducing wait times and the risk of unplanned outages and improving customer satisfaction | Yes | |
| Identification of improved data security measures through more robust encryption key management processes, making DataVault more attractive to users with sensitive data to archive | No | De-prioritised during the project. |
| Increased promotion of the service and training channels to publicise the service and expand the user base | Yes | |
Outcome
The project, through the use of a ‘semi-agile’ approach, successfully stayed focused and delivered the majority of the key deliverables identified by the project stakeholders and business team.
The team worked hard and, despite factors such as Covid, all major deliverables and objectives were achieved, alongside adapting to deliver business-as-usual (BAU) fixes as required and a requirement to migrate from Oracle Gen1 to Gen2 storage.
Key Learning Points
- Adopting a ‘semi-agile’, sprint-based approach worked well for the team and allowed them to focus on the agreed priorities.
- Regular and effective communication and planning, using a Jira board in focused sprints, also helped facilitate the work.
- Focused meetings meant that fewer of them were required.
- Having, at times, multiple technical resources allowed for discussion and learning, especially when problems were encountered.
- Effective use of tools, notably Slack and Teams, facilitated remote working and almost nullified the impact of Covid and remote working.
Outstanding Issues
The following identified project work remains outstanding; all items should be reviewed and assessed as part of any future DataVault project planning, alongside a review of the DataVault IV backlog in Jira:
- Improved synchronisation of metadata between DataVault and Pure
  - It is estimated that approximately 40 days of development and testing remain to complete this work.
  - Interface and screen work is almost complete; however, integration and database population are outstanding. This remaining effort can chiefly be attributed to the need to pick up the Oracle migration work through the agile process.
  - This work includes a number of other user interface improvements.
- Automated scheduling and processing of deposits
- Eliminate DataStore ‘cut-and-paste’ authentication
- Evaluation of improved encryption key management
- Evaluate alternative opportunities for the use of the DataVault service
- Develop a simplified process to update homepage announcements without downtime
- Implement an effective ‘service unavailable’ process during downtime
- Storage Manager Certificate Issue (15/12): the Jira issue “RSS212-161 Analysis for options for adding key to Storage manager issue” tracks progress against this issue. There has been no update or progress on it; a workaround is in place and there have been higher priorities in RSS Systems that required addressing. The Jira issue will remain open in the project backlog to allow completion at the first appropriate time. The issue will be updated to reflect this and closed within the project, to be tracked via BAU or any new DataVault project.
After project closure the DataVault team needs to decide how to progress; options include:
- reverting to BAU, while postponing the Pure/metadata interface work
- finishing the Pure/metadata interface work alongside BAU
- creating a DataVault V project that could:
  - finish the Pure/metadata interface work
  - include agreed refactoring work
- any other considerations?
This will be decided in the coming weeks.