Closure Report

RSS044 - Data Vault III

Approvals

Name Role Position Date
Tony Weir Project Sponsor Director, IT Infrastructure  
Robin Rice Service Owner Data Librarian & Head, Research Data Support Services 17/02/2020
Kirsty Lingstadt Senior Supplier Head, Digital Library & Deputy Director, L&UC  
David Fergusson Senior Supplier Head, ITI Research Services Section 14/02/2020

Ianthe Sutherland[1] / Scott Renton Product Owner Development & Systems Manager, Digital Library 14/02/2020
Maurice Franceschi Programme Manager (RSS) ITI Portfolio Manager 20/02/2020

[1] Scott Renton to approve for Ianthe Sutherland

Distribution

Name Role Position
Fraser Muir Senior User Chief Information Officer, CAHSS
Anthony Davie Senior User IS Campus Leader, MVM
Colin Higgs Senior User Computing Officer, School of Engineering
Janet Roberts Senior Supplier Director, EDINA
Graeme Wood Senior Supplier Head, ITI Enterprise Services

Project Summary

The University of Edinburgh’s DataVault is a component of Research Data Management services which assists researchers to comply with funding requirements for the long-term retention of their research data. In 2015-16, the JISC-funded DataVault project, run by the Library and University Collections (L&UC) team at the University of Edinburgh in collaboration with the University of Manchester, developed generic software designed to collect basic research metadata and deposit research datasets into an archive service. In parallel, the University’s ITI Research Services Section (RSS) defined an initial systems architecture to provide robust archive services.

Due to demand and the time needed to develop a UoE-specific DataVault service, an interim service was launched in 2016. The interim service used the RSS Storage Manager facility to manage research data deposits on the disk-based DataStore system.

A project to deliver a full, supported DataVault service began in 2017 and the service was launched in January 2019. The main objectives of the DataVault project (RSS022) were to deliver the following -

  • expand the interim service to include –

    • an improved (self-service) user interface
    • encryption of data
    • resilient storage on internal and external archive services
    • usage reporting
  • design a robust system architecture for the service
  • define the processes and policies of the full service
  • achieve GDPR compliance for the service
  • launch DataVault as a fully supported Research Data service

This project, DataVault III (RSS044), funded for FY2018-19, ran from February 2019 to February 2020 and planned to deliver additional features to the DataVault service including –

  • a review to improve the DataVault user experience (UX)
  • the introduction of organisational structures and roles for data management
  • improvements to retention date management
  • increasing the processing of deposits up to 10TB (from 2TB)
  • the migration of deposits from the interim DataVault service
  • providing more detailed usage reporting
  • providing automated billing processes (to replace manual billing)
  • scheduling the auditing of deposits
  • improving encryption management
  • reducing processing times
  • improving resilience and recovery
  • improving release management procedures

Project Scope

DataVault was developed to provide a low-cost and reliable archiving service for the University’s research community in compliance with research data management and funding policies. The scope of this project concentrated on improving the service for the University’s research community.

Significant work on the design and development of the DataVault user interface was achieved in the previous project. Further UX work in this project included a review of the user journey and recommendations on the redesign of the user interface, incorporating feedback received during the early adoption period of the DataVault system.

A major requirement in this project was to improve data management within a research group or department by incorporating organisational structures and roles, to enable sharing of data and transfer of ownership. This ensures, for example, that the University retains control of data access and retention if the original researcher leaves the University.

Retention date management was a major deliverable of the project, intended to ensure users were informed of review and expiry dates for deposits on a timely basis. This work was originally assigned to Digirati, an external development company, but complexity and cost precluded the work being done in this project. The requirement has been moved to the new DataVault project, RSS212 DV IV.

In February 2020, DataVault could process and retrieve deposits up to 2TB. The ability to process deposits up to 10TB was a key objective of this project, based on requests from several researchers interested in storing larger datasets in DataVault.

The migration of deposits in the Interim DataVault was essential to enable the interim system to be decommissioned and to provide users with the advanced features of the new DataVault service.

From an administration perspective, the project also included work to improve usage reporting and to automate the manual billing process.

Storage auditing on a selective and scheduled basis was required to ensure the data integrity of deposits in the TSM archive on a regular basis, to guarantee successful retrieval of data, if required.
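
As an illustration of what a scheduled integrity check involves, the sketch below recomputes a SHA-256 digest for a retrieved deposit file and compares it with a digest recorded at deposit time. It is a minimal sketch only: the function names, the chunk size and the idea of one stored digest per file are assumptions made for illustration, not a description of how DataVault implements auditing.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Reading in chunks keeps memory use flat regardless of file size.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_file(path: Path, recorded_digest: str) -> bool:
    """Hypothetical audit step: True if the file still matches its recorded digest."""
    return sha256_of_file(path) == recorded_digest.lower()
```

A scheduled audit would call `audit_file` for each selected deposit after retrieval and report any file for which it returns `False`.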

The DataVault was designed to archive personal and sensitive data, within the scope of GDPR regulations, by (a) advising users to anonymise or pseudonymise personal data and (b) encrypting all data prior to transmission and deposit, using SHA-256 protocols, to the archive subsystems. Further work was undertaken in this project to review encryption and key management processes.

The project also included work to improve the resilience of the system during planned and unplanned loss of service, to prepare the system to support the recovery of a deposit being processed from the point of failure and to reduce the processing times for archiving and retrieval.
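
The recovery requirement, restarting a deposit from the point of failure or at least from the start of the current stage, can be pictured as a staged pipeline that records each completed stage. The sketch below is illustrative only: the `run_with_checkpoints` function, the caller-supplied stage names and the JSON checkpoint file are all assumptions, not DataVault's design.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def run_with_checkpoints(
    stages: Dict[str, Callable[[], None]],
    checkpoint_file: Path,
) -> List[str]:
    """Run stages in order, skipping any already recorded as complete.

    Returns the names of the stages executed in this call.
    """
    done: List[str] = []
    if checkpoint_file.exists():
        done = json.loads(checkpoint_file.read_text())

    executed: List[str] = []
    for name, action in stages.items():
        if name in done:
            continue  # finished in an earlier run, do not repeat it
        action()  # if this raises, the checkpoint still lists earlier stages
        done.append(name)
        checkpoint_file.write_text(json.dumps(done))
        executed.append(name)
    return executed
```

If a stage raises an exception, a second call with the same checkpoint file resumes at the start of the failed stage rather than at the beginning of the deposit.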

Release management procedures needed to be improved to ensure code and deployment management was durable enough to support up to three development workstreams working in parallel. In addition, it was agreed to evaluate, later in the project, the use of the new shared Puppet environment for deployments.

Objectives/Deliverables

The project achieved most of the key goals and objectives defined during the planning phase. The table below lists the initial deliverables of the project and their outcomes.

 

Objective MoSCoW Deliverable(s) Achieved
Review the website user experience (UX) Must A report on the user journey plus the suggested redesign of the user interface to simplify navigation and support additional features Yes
Improve vault webpage features and PURE interface Should Improvements to vault webpage; an improved metadata connection using PURE API No
Implement organisational structures and roles for data management Must A user interface and back-end subsystem that supports a research organisation structure and the ability for a PI to share a data archive with colleagues and department supervisors to reassign ownership of deposits Yes
Improve retention date management and review features Must A subsystem that manages review and expiry dates of deposits by informing users proactively No
Simplify retention policy updates Should An automated data-driven subsystem that provides facilities for DataVault administrators to easily update retention policy information Yes
Increase maximum deposit sizes to 10TB (currently 2TB) Should A DataVault service handling data sets up to 10TB in size Yes
Migrate deposits from the interim DataVault service Must To be done manually by retrieving from the interim vault and depositing to the DataVault Yes
Provide usage reports for Schools administrators Must A reporting subsystem providing usage reports automatically to senior research staff Yes
Implement automated billing Should An automated billing system using usage statistics to provide DataVault users with invoices for costs of storage to replace the current manual methods Yes
Provide storage auditing features Must Regular data integrity checks (a) database records against archive storage data and (b) regular, random deposit retrievals to check datasets Yes
Improve encryption key management Should An industry standard encryption and key management system integrated into DataVault Partial
Optimise deposit and retrieval processing times Must Improved processing which reduces deposit and retrieval times by a significant margin Yes
Introduce processing recovery procedures Must A recovery subsystem that restarts processing from point of failure (or start of current stage at minimum) Partial
Improve release management Should A reliable code and deployment subsystem for release to test, demonstration and live environments supporting multiple development workstreams Yes

Project Quality

Project Plan

The Data Vault III project was subject to several delays. The project was initiated in January 2019 and planning was completed at the end of March, with a planned delivery date of October 2019. However, after delays to the design stage and prolonged development of the Roles module, the project end date moved out to December 2019. Further delays followed: the work to deliver deposit auditing was hampered by TSM service issues, and processing of datasets up to 10TB was delayed by issues with SAN storage backups. As a result, final delivery and closure took place in February 2020.

The original plan was to engage an external software developer, Digirati, to develop the following work packages –

  • UX and system review
  • Roles and Permissions
  • Usage Reporting
  • Automated Billing
  • Retention Date Management

However, due to cost, only the first two work packages were awarded to and completed by Digirati. Usage Reporting and Automated Billing were then delivered by Digital Library Systems using contract development resources. Due to cost and complexity, Retention Date Management was removed from the scope of the project.

The table below shows the planned phase exit dates for the key work packages in the project. The project timeline at project closure is shown in the Appendix.

Module Planned Date Actual Date Comment
Planning 29/03/2019 29/03/2019  
Resilience & Recovery 26/04/2019 26/04/2019  
UX & System Review 10/05/2019 12/07/2019 Delay in completing review and approval to proceed (Digirati)
Improved Processing Rates 24/05/2019 24/05/2019  
Roles & Permissions 07/06/2019 04/10/2019 Delay in development (Digirati) and testing
Deposit Auditing[1] 07/06/2019 30/08/2019 Rescheduled to use contract development staff
Usage Reporting 21/06/2019 30/08/2019 Rescheduled to use contract development staff
Improved Release management 05/07/2019 05/07/2019 Improved release management process to support Digital Library Systems, EDINA and Digirati developing separate modules in parallel[2]
Interim Data Vault Migration[3] 05/07/2019 07/02/2020 Dependent on Roles availability and hiring of temporary staff
Retention Date Management 19/07/2019 N/A Removed from scope; requirements completed[4]
UI/Pure Improvements 02/08/2019 N/A Removed from scope; requirements completed[5]
Automated Billing 16/08/2019 30/08/2019  
Encryption Improvements 30/08/2019 30/09/2019 Evaluation only (e.g. Thales SafeNet)
Up to 10TB deposits 27/09/2019 14/02/2020 Delayed by SAN backup issues
Project closure 18/10/2019 21/02/2020  
       

[1] Still to be activated in production

[2] Plus evaluation of Puppet for deployment in February 2020

[3] Dependent on Roles module being available

[4] Transferred to RSS212 DataVault IV

[5] Transferred to RSS212 DataVault IV

Project Resourcing

Estimated resourcing on the project has been significantly exceeded by actuals due to the delays on the project. At the end of planning in March 2019, 258 days of effort was estimated for Research Data Services, Digital Library Systems and EDINA development staff, and project management. This estimate excluded development effort from Digirati which was covered by project budget.

During the project, estimated actuals exceeded these estimates in October 2019, the original planned end date. The final estimated actual for internal resourcing by February 2020 was 306 days, about 19% over plan, which is outside project tolerance. The costs for contract development staff in Digital Library Systems were covered by budget originally earmarked for Digirati development.

The table below shows planned effort and estimated actuals by month during the project.

Month Planned Effort Estimated Actual
February 2019 28.0 25.0
March 2019 29.0 48.0
April 2019 30.0 42.0
May 2019 22.0 35.0
June 2019 27.0 19.0
July 2019 25.5 18.0
August 2019 26.5 15.0
September 2019 38.0 16.0
October 2019 32.0 22.0
November 2019 0.0 27.0
December 2019 0.0 14.0
January 2020 0.0 16.0
February 2020 0.0 9.0
Total 258.0 306.0
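
The totals and the size of the overrun can be recomputed from the monthly figures above. The short check below is only a worked calculation on the table's values; the variable names are illustrative.

```python
# Monthly effort in days, copied from the table above: (planned, estimated actual).
effort = {
    "2019-02": (28.0, 25.0),
    "2019-03": (29.0, 48.0),
    "2019-04": (30.0, 42.0),
    "2019-05": (22.0, 35.0),
    "2019-06": (27.0, 19.0),
    "2019-07": (25.5, 18.0),
    "2019-08": (26.5, 15.0),
    "2019-09": (38.0, 16.0),
    "2019-10": (32.0, 22.0),
    "2019-11": (0.0, 27.0),
    "2019-12": (0.0, 14.0),
    "2020-01": (0.0, 16.0),
    "2020-02": (0.0, 9.0),
}

planned_total = sum(planned for planned, _ in effort.values())
actual_total = sum(actual for _, actual in effort.values())
overrun_days = actual_total - planned_total
overrun_pct = 100 * overrun_days / planned_total

print(f"Planned {planned_total:.1f} days, actual {actual_total:.1f} days")
print(f"Overrun {overrun_days:.1f} days ({overrun_pct:.1f}% over plan)")
# Planned 258.0 days, actual 306.0 days
# Overrun 48.0 days (18.6% over plan)
```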

Project Budget

For FY2018-19, this Data Vault project was allocated a budget of £110,000 capital and £129,750 revenue. In addition, £22,000 revenue was allocated to the project from the FY2019-20 budget.

A total of £102,824 was spent from the capital budget on external development costs, leaving a surplus of £7,176. A total of £145,018 was spent from the revenue budget to cover EDINA and contract development costs. An additional £22,000 was used from the FY2019-20 budget to cover 0.5 FTE for curation support in the Research Data Services team. More detail on budget and spend is shown in the table below.

Category Expenditure FY2018-19 FY2019-20 Total
Capital Budget £110,000 - £110,000
  Digirati system design review/UX design £29,880 - £29,880
  Digirati development (Roles UI) £36,000 £24,000 £60,000
  Digital Library contract staff (Aug) - £12,944 £12,944
  Capital subtotal £65,880 £36,944 £102,824
  Over/underspend - - £7,176
Revenue Budget £129,750 £40,000 £169,750
  EDINA development staff (Nov) £8,575 - £8,575
  EDINA development staff (Dec-Jan) £11,550 - £11,550
  Curation consultancy £20,161 £22,000 £42,161
  EDINA development staff (Feb-Jul) £36,750 - £36,750
  Digital Library contract staff (May-Jul) £53,632 - £53,632
  EDINA development staff (Aug-Oct) - £14,350 £14,350
  Revenue subtotal £130,668 £36,350 £167,018
  Over/underspend - - £2,732
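
The subtotals and variances in the table can be cross-checked with a few lines of arithmetic. This is a sketch using the figures above; the allocation of each amount to a financial year is inferred from the column subtotals and should be treated as an assumption, and the variable names are illustrative.

```python
# Spend in pounds as (FY2018-19, FY2019-20), taken from the table above.
capital = {
    "Digirati system design review/UX design": (29_880, 0),
    "Digirati development (Roles UI)": (36_000, 24_000),
    "Digital Library contract staff (Aug)": (0, 12_944),
}
revenue = {
    "EDINA development staff (Nov)": (8_575, 0),
    "EDINA development staff (Dec-Jan)": (11_550, 0),
    "Curation consultancy": (20_161, 22_000),
    "EDINA development staff (Feb-Jul)": (36_750, 0),
    "Digital Library contract staff (May-Jul)": (53_632, 0),
    "EDINA development staff (Aug-Oct)": (0, 14_350),
}


def totals(items):
    """Return (FY2018-19 subtotal, FY2019-20 subtotal, overall total)."""
    fy1 = sum(a for a, _ in items.values())
    fy2 = sum(b for _, b in items.values())
    return fy1, fy2, fy1 + fy2


capital_budget = 110_000
revenue_budget = 129_750 + 40_000

cap = totals(capital)
rev = totals(revenue)
print("Capital:", cap, "underspend", capital_budget - cap[2])
print("Revenue:", rev, "underspend", revenue_budget - rev[2])
# Capital: (65880, 36944, 102824) underspend 7176
# Revenue: (130668, 36350, 167018) underspend 2732
```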

Income from the DataVault service during the project was estimated as £10,000. Actual service income over the period was £9,550.

Outstanding Issues

The following, within the project scope, were not completed –

  • Retention date management, transferred to RSS212 – DataVault IV
  • Improved encryption key management, Thales SafeNet evaluated but further work moved to DataVault IV
  • Improved resilience and recovery, some work completed, but remainder moved to DataVault IV
  • Further work on UI and Pure improvements to be covered by DataVault IV

The following key features are planned for the next Data Vault project, RSS212 DV IV, running from December 2019 to August 2020 –

  • retention date management
  • processing data from other file systems in addition to DataStore (e.g. Windows drives)
  • automated scheduling and processing of deposits
  • additional resilience and recovery measures
  • user interface improvements
  • improved synchronisation of metadata between DataVault and Pure
  • evaluation of improved encryption key management
  • processing increased deposit sizes (above 10TB)
  • improved processing times

Lessons Learned

The key observations from the project are summarised in the table below –

Observation Description Recommendations Impact
External development support External development work by Digirati was of high quality but costly and delayed the project Consider scope and costs of any work package to be developed by an external company High
Detailed system design was limited There was no system architect available to support the project. Overall design was limited and provided by Digirati in the main. Detailed design was done at the start of each work package. Ensure overall design workshops are held and design documents produced during a distinct design phase prior to development starting High
Some budget overspend on development resources Spending on development resources ate into the FY2019-20 budget (£14,350) Ensure project planning and costing includes contingency for potential delays Medium
Processing times are still high Processing times are still high (19MB/s) but have reduced by over 20% during the project due to performance tuning Continue to take measures to ensure the service achieves reduced processing times particularly in the archive and verification phases Medium
High working storage overheads Currently the working storage overheads are proportional to the size of the dataset being processed e.g. 20TB is needed to process a 10TB dataset. Review processing methods that are not dependent on large working storage requirements e.g. streaming Medium
TSM infrastructure unreliable Although a second TSM subsystem was added during the project, various issues and outages were encountered Ensure TSM support staff are available to proactively reduce potential issues with the system Medium
Oracle contract cost high The current Oracle contract covers 500TB/month storage, well above usage Ensure next Oracle contract covers a more realistic lower storage limit Medium
TSM and Oracle password expiry Links to both TSM and Oracle archive services were lost when access passwords expired Put process in place to schedule password changes prior to expiry (implemented) Low
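
To put the processing rate and working storage observations in context, the sketch below estimates how long a deposit takes at a sustained rate and how much working storage it needs. It assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes), treats 19MB/s as a sustained end-to-end rate, and takes the 2x working storage factor from the 20TB-for-10TB example; the function names are illustrative.

```python
def processing_hours(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours to move size_tb terabytes at rate_mb_per_s megabytes per second."""
    size_bytes = size_tb * 10**12
    rate_bytes_per_s = rate_mb_per_s * 10**6
    return size_bytes / rate_bytes_per_s / 3600


def working_storage_tb(size_tb: float, factor: float = 2.0) -> float:
    """Working storage needed when the overhead scales with dataset size."""
    return size_tb * factor


for size in (2, 10):
    hours = processing_hours(size, 19)
    print(f"{size}TB: about {hours:.0f} hours, "
          f"{working_storage_tb(size):.0f}TB working storage")
# 2TB: about 29 hours, 4TB working storage
# 10TB: about 146 hours, 20TB working storage
```

On these assumptions a 10TB deposit needs roughly six days of sustained transfer, which illustrates why both processing rate and working storage remain areas for improvement.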

Appendix 1 – Final Project Timeline

 

Image: time.png (124.84 KB)

Project Info

Project: Data Vault III - Continuing Development
Code: RSS044
Programme: ITI - Research Services (RSS)
Management Office: ISG PMO
Project Manager: Lawrence Stevenson
Project Sponsor: Anthony Weir
Current Stage: Close
Status: Closed
Project Classification: Transform
Start Date: 04-Feb-2019
Planning Date: 29-Mar-2019
Delivery Date: 14-Feb-2020
Close Date: 21-Feb-2020
Programme Priority: 1
Overall Priority: Normal
Category: Discretionary
