RSS044 - Data Vault III
|Tony Weir||Project Sponsor||Director, IT Infrastructure|
|Robin Rice||Service Owner||Data Librarian & Head, Research Data Support Services||17/02/2020|
|Kirsty Lingstadt||Senior Supplier||Head, Digital Library & Deputy Director, L&UC|
|David Fergusson||Senior Supplier||Head, ITI Research Services Section||14/02/2020|
|Product Owner||Development & Systems Manager, Digital Library||14/02/2020|
|Maurice Franceschi||Programme Manager (RSS)||ITI Portfolio Manager||20/02/2020|
 Scott Renton to approve for Ianthe Sutherland
|Fraser Muir||Senior User||Chief Information Officer, CAHSS|
|Anthony Davie||Senior User||IS Campus Leader, MVM|
|Colin Higgs||Senior User||Computing Officer, School of Engineering|
|Janet Roberts||Senior Supplier||Director, EDINA|
|Graeme Wood||Senior Supplier||Head, ITI Enterprise Services|
The University of Edinburgh’s DataVault is a component of the Research Data Management services which helps researchers comply with funding requirements for the long-term retention of their research data. In 2015-16, the JISC-funded DataVault project, run by the Library and University Collections (L&UC) team at the University of Edinburgh in collaboration with the University of Manchester, developed generic software designed to collect basic research metadata and deposit research datasets into an archive service. In parallel, the University’s ITI Research Services Section (RSS) defined an initial systems architecture to provide robust archive services.
Due to demand and the time to develop a UoE specific DataVault service, an interim service was launched in 2016. The interim service used the RSS Storage Manager facility to manage research data deposits on the disk-based DataStore system.
A project to deliver a full, supported DataVault service began in 2017 and the service was launched in January 2019. The main objectives of the DataVault project (RSS022) were to –
- expand the interim service to include –
  - an improved (self-service) user interface
  - encryption of data
  - resilient storage on internal and external archive services
  - usage reporting
- design a robust system architecture for the service
- define the processes and policies of the full service
- achieve GDPR compliance for the service
- launch DataVault as a fully supported Research Data service
This project, DataVault III (RSS044), funded for FY2018-19, ran from February 2019 to February 2020 and planned to deliver additional features to the DataVault service including –
- a review to improve the DataVault user experience (UX)
- the introduction of organisational structures and roles for data management
- improvements to retention date management
- increasing the maximum deposit size that can be processed to 10TB (from 2TB)
- the migration of deposits from the interim DataVault service
- providing more detailed usage reporting
- providing automated billing processes (to replace manual billing)
- scheduling the auditing of deposits
- improving encryption management
- reducing processing times
- improving resilience and recovery
- improving release management procedures
DataVault was developed to provide a low-cost and reliable archiving service for the University’s research community in compliance with research data management and funding policies. The scope of this project concentrated on improving the service for the University’s research community.
Significant work on the design and development of the DataVault user interface was achieved in the previous project. Further UX work in this project included a review of the user journey and recommendations on the redesign of the user interface, incorporating feedback received during the early adoption period of the DataVault system.
A major requirement in this project was to improve data management within a research group or department by incorporating organisational structures and roles, enabling data to be shared and ownership to be transferred. This ensures, for example, that the University retains control of data access and retention if the original researcher leaves the University.
Retention date management was a major deliverable of the project, intended to ensure users were informed of review and expiry dates for deposits in a timely manner. This work was originally assigned to Digirati, an external development company, but complexity and cost precluded the work being done in this project. The requirement has been moved to the new DataVault project, RSS212 DV IV.
In February 2020, DataVault could process and retrieve deposits up to 2TB. The ability to process deposits up to 10TB was a key objective of this project, based on requests from several researchers interested in storing larger datasets in DataVault.
The migration of deposits in the Interim DataVault was essential to enable the interim system to be decommissioned and to provide users with the advanced features of the new DataVault service.
From an administration perspective, the project also included work to improve usage reporting and to automate the manual billing process.
Storage auditing on a selective and scheduled basis was required to check the integrity of deposits in the TSM archive regularly, so that data can be retrieved successfully if required.
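The auditing requirement above can be illustrated with a minimal sketch. It assumes each deposit has a checksum recorded at deposit time and that a random sample of deposits is retrieved and re-hashed on each scheduled run. The function names, the mapping layout and the choice of SHA-256 as the checksum algorithm are illustrative assumptions, not a description of the DataVault implementation.

```python
import hashlib
import random
from pathlib import Path
from typing import Optional


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through SHA-256 so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_sample(recorded: dict, sample_size: int, seed: Optional[int] = None) -> dict:
    """Re-hash a random sample of retrieved deposits and compare checksums.

    `recorded` maps the path of a retrieved archive file to the checksum
    stored at deposit time (a hypothetical layout, not the DataVault schema).
    Returns {path: True if the checksum still matches, else False}.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(recorded), min(sample_size, len(recorded)))
    return {name: sha256_of(Path(name)) == recorded[name] for name in chosen}
```

Any deposit reported as `False` would need investigation before a researcher requests retrieval. Sampling keeps each run short at the cost of covering the whole archive only over many runs.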
The DataVault was designed to archive personal and sensitive data, within the scope of GDPR, by (a) advising users to anonymise or pseudonymise personal data and (b) encrypting all data, using SHA-256 protocols, prior to transmission and deposit to the archive subsystems. Further work was undertaken in this project to review encryption and key management processes.
The project also included work to improve the resilience of the system during planned and unplanned loss of service, to prepare the system to support the recovery of a deposit being processed from the point of failure and to reduce the processing times for archiving and retrieval.
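As a rough illustration of recovery "from the point of failure", the sketch below records each completed processing stage in a checkpoint file so that a restarted run skips work already done. The stage names and the JSON checkpoint format are hypothetical and are not taken from the DataVault code.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

Stage = Tuple[str, Callable[[], None]]


def run_with_checkpoint(stages: Sequence[Stage], checkpoint: Path) -> List[str]:
    """Run stages in order, skipping any already recorded as complete.

    Returns the names of the stages executed during this call.
    """
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else []
    executed = []
    for name, action in stages:
        if name in done:
            continue  # finished in an earlier attempt
        action()  # may raise; the checkpoint then lists only finished stages
        done.append(name)
        checkpoint.write_text(json.dumps(done))
        executed.append(name)
    return executed
```

With stages such as `package`, `transfer` and `verify` (illustrative names), a failure during `transfer` leaves `package` recorded, so the next call restarts at the beginning of `transfer`. That corresponds to restarting from the start of the current stage rather than from the exact byte at which the failure occurred.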
Release management procedures needed to be improved to ensure code and deployment management were robust enough to support up to three development workstreams working in parallel. In addition, it was agreed to evaluate, later in the project, the use of the new shared Puppet environment for deployments.
The project achieved most of the key goals and objectives defined during the planning phase. The table below lists the initial deliverables of the project and their outcomes.
|Deliverable||Priority||Description||Achieved|
|Review the website user experience (UX)||Must||A report on the user journey plus the suggested redesign of the user interface to simplify navigation and support additional features||Yes|
|Improve vault webpage features and PURE interface||Should||Improvements to vault webpage; an improved metadata connection using PURE API||No|
|Implement organisational structures and roles for data management||Must||A user interface and back-end subsystem that supports a research organisation structure, the ability for a PI to share a data archive with colleagues, and the ability for department supervisors to reassign ownership of deposits||Yes|
|Improve retention date management and review features||Must||A subsystem that manages review and expiry dates of deposits by informing users proactively||No|
|Simplify retention policy updates||Should||An automated data-driven subsystem that provides facilities for DataVault administrators to easily update retention policy information||Yes|
|Increase maximum deposit sizes to 10TB (currently 2TB)||Should||A DataVault service handling data sets up to 10TB in size||Yes|
|Migrate deposits from the interim DataVault service||Must||To be done manually by retrieving from the interim vault and depositing to the DataVault||Yes|
|Provide usage reports for Schools administrators||Must||A reporting subsystem providing usage reports automatically to senior research staff||Yes|
|Implement automated billing||Should||An automated billing system using usage statistics to provide DataVault users with invoices for costs of storage to replace the current manual methods||Yes|
|Provide storage auditing features||Must||Regular data integrity checks: (a) checks of database records against archive storage data and (b) regular, random deposit retrievals to check datasets||Yes|
|Improve encryption key management||Should||An industry standard encryption and key management system integrated into DataVault||Partial|
|Optimise deposit and retrieval processing times||Must||Improved processing which reduces deposit and retrieval times by a significant margin||Yes|
|Introduce processing recovery procedures||Must||A recovery subsystem that restarts processing from point of failure (or start of current stage at minimum)||Partial|
|Improve release management||Should||A reliable code and deployment subsystem for release to test, demonstration and live environments supporting multiple development workstreams||Yes|
The Data Vault III project was subject to some delay for various reasons. The project was initiated in January 2019 and planning was completed at the end of March with a delivery date of October 2019. However, after delays to the design stage and prolongation of the Roles development, the project end date moved out to December 2019. Following further delays to the delivery of deposit auditing (hampered by TSM service issues) and to the processing of datasets up to 10TB (delayed by issues with SAN storage backups), final delivery and closure took place in February 2020.
The original plan was to engage an external software developer, Digirati, to develop the following work packages –
- UX and system review
- Roles and Permissions
- Usage Reporting
- Automated Billing
- Retention Date Management
However, due to cost, only the first two work packages were awarded to and completed by Digirati. Usage Reporting and Automated Billing were then delivered by Digital Library Systems using contract development resources. Due to cost and complexity, Retention Date Management was removed from the scope of the project.
The table below shows the planned phase exit dates for the key work packages in the project. The project timeline at project closure is shown in the Appendix.
|Module||Planned Date||Actual Date||Comment|
|Resilience & Recovery||26/04/2019||26/04/2019|
|UX & System Review||10/05/2019||12/07/2019||Delay in completing review and approval to proceed (Digirati)|
|Improved Processing Rates||24/05/2019||24/05/2019|
|Roles & Permissions||07/06/2019||04/10/2019||Delay in development (Digirati) and testing|
|Deposit Auditing||07/06/2019||30/08/2019||Rescheduled to use contract development staff|
|Usage Reporting||21/06/2019||30/08/2019||Rescheduled to use contract development staff|
|Improved Release management||05/07/2019||05/07/2019||Improved release management process to support Digital Library Systems, EDINA and Digirati developing separate modules in parallel|
|Interim Data Vault Migration||05/07/2019||07/02/2020||Dependent on Roles availability and hiring of temporary staff|
|Retention Date Management||19/07/2019||N/A||Removed from scope; requirements completed|
|UI/Pure Improvements||02/08/2019||N/A||Removed from scope; requirements completed|
|Encryption Improvements||30/08/2019||30/09/2019||Evaluation only (e.g. Thales SafeNet)|
|Up to 10TB deposits||27/09/2019||14/02/2020||Delayed by SAN backup issues|
 Still to be activated in production
 Plus evaluation of Puppet for deployment in February 2020
 Dependent on Roles module being available
 Transferred to RSS212 DataVault IV
 Transferred to RSS212 DataVault IV
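The slippage per work package can be derived directly from the planned and actual dates in the table above. The short sketch below does this for the rows that have both dates. The dates are copied from the table, while the variable and function names are illustrative.

```python
from datetime import datetime

# (module, planned, actual) copied from the table above; dates are DD/MM/YYYY.
PHASES = [
    ("Resilience & Recovery", "26/04/2019", "26/04/2019"),
    ("UX & System Review", "10/05/2019", "12/07/2019"),
    ("Improved Processing Rates", "24/05/2019", "24/05/2019"),
    ("Roles & Permissions", "07/06/2019", "04/10/2019"),
    ("Deposit Auditing", "07/06/2019", "30/08/2019"),
    ("Usage Reporting", "21/06/2019", "30/08/2019"),
    ("Improved Release management", "05/07/2019", "05/07/2019"),
    ("Interim Data Vault Migration", "05/07/2019", "07/02/2020"),
    ("Encryption Improvements", "30/08/2019", "30/09/2019"),
    ("Up to 10TB deposits", "27/09/2019", "14/02/2020"),
]


def slippage_days(planned: str, actual: str) -> int:
    """Calendar days between the planned and actual phase exit dates."""
    fmt = "%d/%m/%Y"
    return (datetime.strptime(actual, fmt) - datetime.strptime(planned, fmt)).days


SLIPPAGE = {module: slippage_days(planned, actual) for module, planned, actual in PHASES}

# List the work packages from most to least delayed.
for module, days in sorted(SLIPPAGE.items(), key=lambda item: item[1], reverse=True):
    print(f"{module}: {days} days")
```

On these figures the interim migration slipped the most (217 calendar days), followed by the 10TB work (140 days) and Roles & Permissions (119 days). Three work packages exited on their planned dates.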
Estimated resourcing on the project has been significantly exceeded by actuals due to the delays on the project. At the end of planning in March 2019, 258 days of effort was estimated for Research Data Services, Digital Library Systems and EDINA development staff, and project management. This estimate excluded development effort from Digirati which was covered by project budget.
During the project, estimated resource actuals exceeded these estimates in October 2019, the original planned end date. The final estimated actuals for internal resourcing by February 2020 are 306 days, 18% over plan, which is outside project tolerance. The costs for contract development staff in Digital Library Systems were covered by budget originally earmarked to cover Digirati development.
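The overrun figure follows from the two effort totals quoted above. A minimal check, using only those two numbers:

```python
planned_days = 258  # internal effort estimated at the end of planning, March 2019
actual_days = 306   # final estimated actuals, February 2020

overrun_days = actual_days - planned_days
overrun_pct = 100 * overrun_days / planned_days

print(f"Overrun: {overrun_days} days ({overrun_pct:.1f}% over plan)")
```

This gives 48 days, or 18.6% of the planned effort, which corresponds to the 18% quoted when rounded down.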
The table below shows planned effort and estimated actuals by month during the project.
|Month||Planned Effort||Estimated Actual|
For FY2018-19, this Data Vault project was allocated a budget of £111,000 capital and £129,750 revenue. A further £22,000 revenue was allocated to the project from the FY2019-20 budget.
A total of £102,824 was spent from the capital budget on external development costs, leaving a surplus of £7,176. A total of £145,018 was spent from the revenue budget to cover EDINA and contract development costs. An additional £22,000 was used from the FY2019-20 budget to cover 0.5 FTE for curation support in the Research Data Services Team. More detail on budget and spend is shown in the table below.
|Digirati system design review/UX design||£29,880||£29,880|
|Digirati development (Roles UI)||£36,000||£24,000||£60,000|
|Digital Library contract staff (Aug)||£12,944||£12,944|
|EDINA development staff (Nov)||£8,575||£8,575|
|EDINA development staff (Dec-Jan)||£11,550||£11,550|
|EDINA development staff (Feb-Jul)||£36,750||£36,750|
|Digital Library contract staff (May-Jul)||£53,632||£53,632|
|EDINA development staff (Aug-Oct)||£14,350||£14,350|
Income from the DataVault service during the project was estimated as £10,000. Actual service income over the period was £9,550.
The following, within the project scope, were not completed –
- Retention date management, transferred to RSS212 – DataVault IV
- Improved encryption key management, Thales SafeNet evaluated but further work moved to DataVault IV
- Improved resilience and recovery, some work completed, but remainder moved to DataVault IV
- Further work on UI and Pure improvements to be covered by DataVault IV
The following key features are planned for the next Data Vault project, RSS212 DV IV, running from December 2019 to August 2020 –
- retention date management
- processing data from other file systems in addition to DataStore (e.g. Windows drives)
- automated scheduling and processing of deposits
- additional resilience and recovery measures
- user interface improvements
- improved synchronisation of metadata between DataVault and Pure
- evaluation of improved encryption key management
- processing increased deposit sizes (above 10TB)
- improved processing times
The key observations from the project are summarised in the table below –
|External development support||External development work by Digirati was of high quality but costly and delayed the project||Consider scope and costs of any work package to be developed by an external company||High|
|Detailed system design was limited||There was no system architect available to support the project. Overall design was limited and provided by Digirati in the main. Detailed design was done at the start of each work package.||Ensure overall design workshops are held and design documents produced during a distinct design phase prior to development starting||High|
|Some budget overspend on development resources||Spending on development resources ate into the FY2019-20 budget (£14,350)||Ensure project planning and costing includes contingency for potential delays||Medium|
|Processing times are still high||Processing times are still high (19MB/s) but have reduced by over 20% during the project due to performance tuning||Continue to take measures to ensure the service achieves reduced processing times particularly in the archive and verification phases||Medium|
|High working storage overheads||Currently the working storage overheads are proportional to the size of the dataset being processed e.g. 20TB is needed to process a 10TB dataset.||Review processing methods that are not dependent on large working storage requirements e.g. streaming||Medium|
|TSM infrastructure unreliable||Although a second TSM subsystem was added during the project, various issues and outages were encountered||Ensure TSM support staff are available to proactively reduce potential issues with the system||Medium|
|Oracle contract cost high||The current Oracle contract covers 500TB/month storage, well above usage||Ensure next Oracle contract covers a more realistic lower storage limit||Medium|
|TSM and Oracle password expiry||Links to both TSM and Oracle archive services were lost when access passwords expired||Put process in place to schedule password changes prior to expiry (implemented)||Low|
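To put the processing rate and working storage observations in context, the sketch below estimates elapsed time and working storage for a deposit of a given size. It assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes), that the 19MB/s rate applies across the whole of processing, and a fixed overhead factor of 2 taken from the "20TB for a 10TB dataset" example. These are simplifying assumptions rather than measured behaviour.

```python
TB = 10**12  # bytes per terabyte (decimal units assumed)
MB = 10**6   # bytes per megabyte (decimal units assumed)


def processing_hours(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move `size_tb` terabytes at a sustained rate."""
    seconds = size_tb * TB / (rate_mb_per_s * MB)
    return seconds / 3600


def working_storage_tb(size_tb: float, overhead_factor: float = 2.0) -> float:
    """Working storage needed if the overhead scales linearly with dataset size."""
    return size_tb * overhead_factor


for size in (2, 10):
    hours = processing_hours(size, 19)
    print(f"{size}TB: about {hours:.0f} hours, {working_storage_tb(size):.0f}TB working storage")
```

Under these assumptions a 10TB deposit takes roughly 146 hours (about six days) at 19MB/s, which shows why reduced processing times and a streaming approach that avoids large working storage remain priorities.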