Closure Report

RSS022 - Data Vault

Approvals

Name Role Position Date
Tony Weir Project Sponsor Head, IT Infrastructure -
Robin Rice Service Owner Data Librarian & Head, Research Data Support Services 22/02/2019
Fraser Muir Senior User Chief Information Officer, CAHSS 11/03/2019
Anthony Davie Senior User IS Campus Leader, MVM 11/03/2019
Colin Higgs Senior User Computing Officer, School of Engineering 11/03/2019
Kirsty Lingstadt Senior Supplier Head, Digital Library & Deputy Director, L&UC 27/02/2019
David Fergusson Senior Supplier Head, ITI Research Services Section 11/03/2019
Claire Young Senior Supplier Operations Manager, EDINA -
Graeme Wood Senior Supplier Head, ITI Enterprise Services -
Ianthe Sutherland Product Owner Development & Systems Manager, Digital Library 06/03/2019
Maurice Franceschi Programme Manager (RSS) ITI Portfolio Manager, ISG 25/02/2019
Lawrence Stevenson Project Manager ITI Project Manager, ISG 15/02/2019

Project Summary

The University of Edinburgh’s Data Vault is a service component of the 2012 Research Data Management Roadmap which helps researchers comply with funding requirements for long-term retention of their research data. In 2015-16, the JISC-funded Data Vault project, run by the Library and University Collections (L&UC) team at the University of Edinburgh in collaboration with the University of Manchester, developed generic software designed to collect basic research metadata and deposit research datasets into an archive service. In parallel, the University’s ITI Research Services Section (RSS) defined an initial systems architecture to provide robust archive services.

Because of demand and the time needed to develop a UoE-specific Data Vault service, an interim service was launched in 2016. The interim service uses the RSS Storage Manager facility to manage research data deposits on the RSS disk-based DataStore system.

The main objectives of the UoE Data Vault project, as published in the original project brief in May 2017, were to –

  • expand the interim service
  • make Data Vault ‘self-service’ for users
  • define the processes and policies of the full service
  • design a robust system architecture for the service
  • launch Data Vault, as soon as possible, as a fully supported Research Data service

Project Scope

Data Vault was developed to provide a low-cost and reliable archiving service for the University’s research community in compliance with research data management and funding policies. Researchers are only able to deposit datasets in Data Vault if the corresponding metadata is held in the University’s PURE research management information service.

Researchers can create a ‘vault’ to deposit datasets. Normally a single vault is used to hold the dataset(s) from a single research project. Initial Data Vault requirements also included the need for secure data sharing within a research group or department, by incorporating organisational structures and roles, so that the University retains control of data access and retention if the original researcher is no longer a member of staff.

Unlike the interim service, the Data Vault has been designed to archive personal and sensitive data, within the scope of GDPR regulations, by (a) advising users to anonymise or pseudonymise personal data and (b) encrypting all data, using SHA-256 protocols, prior to transmission and deposit to the archive subsystems.
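SHA-256 is a hash function rather than a cipher, so in an archiving workflow it is typically used to produce a checksum confirming that a deposit is unchanged after transfer or retrieval. The following is a minimal sketch of such an integrity check, not the Data Vault implementation; the function name, chunk size and example data are assumptions chosen for illustration.

```python
import hashlib
import io


def sha256_digest(stream, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    # Read fixed-size chunks so large deposits need not fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Hypothetical example data standing in for a deposited file and a retrieved copy.
original = io.BytesIO(b"example research dataset")
retrieved = io.BytesIO(b"example research dataset")

# Matching digests indicate the retrieved copy is byte-for-byte identical.
copies_match = sha256_digest(original) == sha256_digest(retrieved)
print(copies_match)
```

Reading in chunks keeps memory use constant, which matters for deposits measured in terabytes.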

Out of Scope

The following deliverables were not included in the scope of this project –

  • archiving of University business data[1]
  • storage of staff file systems
  • storage of staff personal data (e.g. photo or music collections)
  • archiving of student file systems and data
  • archiving of PURE ‘restricted’ data

[1] This service may be included in the scope of a future Data Vault project

Outcomes

Objectives

The key objectives of the Data Vault project were to provide –

  • a legacy research archive and retrieval service at low-cost (£50/TB per annum) that met the relevant funding body’s storage and retention policies
  • controlled archiving of research data by only allowing users to deposit datasets with (unrestricted) metadata already defined in the University’s PURE research information system
  • support for research organisational structures and roles to enable data sharing and control, particularly when the principal investigator is no longer employed at the University
  • a web-based user experience (UX) that provides simple and user-friendly archive, retrieval and review functionality
  • encrypted transmission and storage of data that ensures datasets can only be accessed by authorised users
  • resilience in archiving and retrieval by storing three (3) copies of every deposit: two in on-site facilities, using IBM’s Tivoli Storage Manager (TSM), and one off-site on the Oracle Cloud Archive service
  • automated billing and usage reports
  • support for dataset deposits up to 10TB

Most of the objectives outlined above have been achieved and confirmed. The principal goal of the project to provide a low-cost, secure, resilient and reliable service has been met.

Requirements

The status of the key requirements is summarised in the table below.

Requirement MoSCoW Status
Low-cost archiving meeting funding retention policies Must Delivered
Controlled archiving using metadata in PURE Must Delivered
Embedded organisational structures and roles to support data sharing Must Not delivered
A user-friendly web user interface Must Delivered
Encrypted transmission and storage of data Must Delivered
Resilient back-end archive services Must Delivered
Automated billing and usage reports Should Not delivered
Support for dataset deposits up to 10TB Should Not delivered

Deliverables

More detail on the status of deliverables is given in the table below.

Objective Deliverable(s) Achieved
A legacy research archive and retrieval service at low-cost (£50/TB per annum) that met the relevant funding body’s storage and retention policies A Data Vault service that meets cost constraints and conforms to funding regulations Yes. Service launched in January 2019 with costs pegged at the interim system level and meeting funding regulations
Controlled archiving of research data by only allowing users to deposit datasets with (unrestricted) metadata already defined in the University’s PURE research management information system A service that controls the depositing of data by only using metadata held in the PURE system Yes. Only data defined by (unrestricted) metadata held in PURE can be processed
Research organisational structures and roles incorporated to enable data sharing and control, particularly when the principal investigator is no longer employed at the University A Data Vault subsystem that provides the ability to create an organisational structure and roles to enable data control and access No. Not achieved; included in the objectives of the next Data Vault project, initiated in February 2019
A web-based user experience (UX) that provides simple and user-friendly archive and retrieval functionality An easy-to-use and informative user interface (UI) presenting Data Vault functions in a consistent form Yes. Extensive work was done on the design and development of the user interface with clear instructions, fields and supporting text
Encrypted transmission and storage of data that ensures datasets can only be accessed by authorised users Secure transmission and storage of data to internal and external archive services Yes. Client-side encryption of all deposits completed before transmission
Resilience in archiving and retrieval by storing three (3) copies of every deposit: two in on-site facilities, using Tivoli Storage Manager (TSM), and one off-site on the Oracle Cloud Archive service Data stored, encrypted, in two on-site (TSM) archiving services and one off-site (Oracle Cloud) service Yes. Data archived to TSM and Oracle Cloud services
Automated billing and usage reports Billing and usage reports sent to user groups automatically on a regular basis from the Data Vault service Partial. Online reports provided for staff from the Library Research Support team to prepare group-based billing and usage reports manually
Support for dataset deposits up to 10TB A Data Vault service handling datasets up to 10TB in size Partial. Data Vault can currently handle datasets up to 2TB in size

Product Quality

The quality of the Data Vault as a product is measured below against the University’s and ISG’s strategic visions as published in the project brief.

University Strategic Vision

Vision Commentary
A unique Edinburgh offer for all of our students
  • All of our undergraduates developed as students and/or researchers with clear, supported pathways through to Masters and PhD
The Data Vault service provides the research community with a low-cost, secure and regulation-compliant data archiving service for research projects
  • All our students offered the opportunity to draw from deep expertise outside their core discipline
N/A
  • A highly satisfied student body with a strong sense of community.
N/A
Strong and vibrant communities within and beyond the University – making the most of our unique offer of world-leading thinking and learning within one of the world’s most attractive cities N/A
A larger, more international staff who feel valued and supported in a University that is a great and collegial place to work, develop and progress Data Vault supplements existing services to researchers
More postgraduate students – underpinned by the best support in the sector to ensure we attract the brightest and best regardless of ability to pay Data Vault does not currently support postgraduate-based research work
A strong culture of philanthropic support focussed especially on our students and on outstanding research capabilities. N/A
Many more students benefiting from the Edinburgh experience (largely or entirely) in their own country – supported by deep international partnerships and world leading online distance learning N/A
Sustained world leading reputation for the breadth, depth and inter-disciplinarity of our research supported by strong growth in research funding and strong international partnerships – drawing from well-established and less well developed sources Data Vault provides a unique service, not currently available to Universities in the UK
An estate that matches expectations, responds flexibly to changing student and staff needs, and showcases the University The Data Vault service meets the needs for data retention required by funding bodies as well as meeting GDPR regulations
A deeper and earlier collaboration with industry, the public sector and the third sector – in terms of research; knowledge exchange; and in giving our students the best possible set of skills for their future Data Vault has the potential to be extended to provide low-cost, secure storage for legacy data that require retention by other organisations in the University, e.g. Finance, and to external organisations seeking similar services

ISG Strategic Vision

Vision Commentary
Student Experience
  • Student experience and the unique Edinburgh offer
N/A
  • Online and distance learning leaders
N/A
  • Library national and international leadership
Data Vault development has been driven by the Library and University Collections team, who recognised a gap in service provision that could be exploited in the UK and overseas
Research and Innovation
  • Research IT and Data Sciences
The service gives the University’s research community the ability to store completed research data cheaply and meet regulatory requirements, while keeping the data accessible as and when required
  • Innovation
A market review has shown that Data Vault provides a type of service that is currently not available but would benefit University researchers across the UK
  • Collaborative leadership and social responsibility
The University, along with the University of Manchester, is taking the lead in delivering a JISC-funded version of Data Vault for use in UK Universities
Service Excellence
  • Process improvement, efficiency, quality and best practice
Data Vault provides researchers with a secure central facility to improve their research process management for storing, retrieving and sharing past research data
  • Long-term IS strategic planning and linked professional services
Data Vault is one of the strands of development to meet the goals of the Research Data Management Roadmap, which also includes complementary initiatives such as DataShare and Data Safe Haven
  • Information Security
Data Vault incorporates secure user account management by integrating with the University’s EASE and Shibboleth authentication services. The service also incorporates client-side encryption and decryption to ensure the secure transmission and storage of data

Project Quality

Project Plan

The Data Vault project has been subject to considerable delay for various reasons. The project was initiated in February 2017, and the initial project brief and plan were prepared in May 2017 with a delivery date of October 2017. However, after a change of project manager, the project was re-planned in October 2017, with a delivery date of February 2018 and an Agile (Scrum) approach adopted for project planning and tracking. In December 2017, with another change of project manager, the project was again re-planned, with the Jira backlog reverse-engineered to produce a complementary project plan. At that point, delivery of a minimum viable product (MVP) was estimated for the end of May 2018. Further iterations generated delivery dates in September 2018 and December 2018.

Finally, working storage was moved from DataStore, where peripheral disk performance was being adversely affected, to the ITI Enterprise Services SAN. User and performance testing, originally run in June and July 2018, was rerun in November and December 2018, and Data Vault was launched in January 2019.

Another reason for the delays was the frequent addition of new requirements to the Jira backlog, which grew from around 50 tasks in October 2017 to over 300 tasks at the end of the project. The core reason for this behaviour was the lack of effective planning and design in the early stages of the project and the leeway to continually expand on requirements during the project under Agile (Scrum) governance, as illustrated in the cumulative flow diagram in Appendix 1.

Project Resourcing

Actual resourcing on the project significantly exceeded estimates because of the delays. When the plan was revised in January 2018, 468 days of effort were estimated (based on an August 2017 start date), including EDINA resources. Excluding EDINA resources, the estimate was 252 days.

During the project, resource actuals exceeded estimates in July 2018. Additional resourcing in FY2018-19 was estimated as 180 days, including EDINA development. If EDINA resources are removed this gives 95 days net. This means that if EDINA resources (covered by revenue, see below) are excluded, project resourcing was 37% above plan.
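The resourcing figures above can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted in this section; the variable names are assumptions introduced for illustration.

```python
# Figures quoted in the Project Resourcing section (days of effort).
planned_total = 468        # January 2018 estimate, including EDINA
planned_excl_edina = 252   # same estimate, excluding EDINA
extra_total = 180          # additional FY2018-19 estimate, including EDINA
extra_excl_edina = 95      # additional estimate, excluding EDINA

# EDINA share implied by the figures above.
planned_edina = planned_total - planned_excl_edina
extra_edina = extra_total - extra_excl_edina

# Overrun relative to the plan, excluding EDINA resources.
overrun_pct = extra_excl_edina / planned_excl_edina * 100

print(f"EDINA days: planned {planned_edina}, additional {extra_edina}")
print(f"Overrun excluding EDINA: {overrun_pct:.1f}%")
```

The ratio 95/252 comes out a little under 38%, which matches the 37% quoted above when the figure is truncated rather than rounded.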

Project Budget

In FY2017-18, the Data Vault project was allocated £111,000 capital and £100,000 revenue budget. £12,766 was spent from the capital budget, on TSM licences. Costs from the revenue budget were £97,915 with £84,350 for EDINA development support and £13,565 for Oracle Cloud Archive services. An additional £20,000 was used from the FY2018-19 budget to cover the EDINA development costs needed to launch Data Vault.
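As a cross-check on the budget figures above, the following minimal sketch reproduces the revenue total and derives the unspent amounts. It uses only the values quoted in this section; the variable names are assumptions introduced for illustration.

```python
# Figures quoted in the Project Budget section (GBP).
capital_budget = 111_000
revenue_budget = 100_000
capital_spend = 12_766      # TSM licences
edina_spend = 84_350        # EDINA development support (FY2017-18)
oracle_spend = 13_565       # Oracle Cloud Archive services
extra_fy2018_19 = 20_000    # drawn from the FY2018-19 budget for EDINA development

# Revenue spend should equal the two itemised revenue costs.
revenue_spend = edina_spend + oracle_spend

# Amounts left unspent against the FY2017-18 allocations.
capital_unspent = capital_budget - capital_spend
revenue_unspent = revenue_budget - revenue_spend

# Total paid for EDINA development across both financial years.
total_edina = edina_spend + extra_fy2018_19

print(revenue_spend, capital_unspent, revenue_unspent, total_edina)
```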

Lessons Learned

The key observations from the project are summarised in the table below –

Observation Description Recommendations Impact
Limited knowledge transfer from interim system The development team for DV II was not the same as that used on the interim system, and there was limited knowledge transfer from the interim system, which put the new development team at a disadvantage. Minimal documentation was produced from the interim system. Ensure DV III architecture and operations documentation is of a detail and quality that benefits DV III developers. High
New development team engaged, distinct from interim system team Development for DV II was subcontracted to EDINA. The lack of knowledge transfer and documentation created a steep learning curve for the developers. Ensure DV III development team includes EDINA developers, if possible. If not, ensure L&UC development staff are involved in the final stages of DV II and the system is fully documented at the architecture and operations level. Medium
No detailed system design There was no appointed system architect so no detailed system design was undertaken. A 3-4 page outline system design document was produced. Extensive design decisions were made 'on the fly' during the build phase of the project. The system evolved rather than being designed. Ensure a system architect is in place and detailed design, functional and infrastructure, is completed in a distinct design phase prior to system build. High
Workflow has high working storage overheads and long processing times Workflow is inefficient with high working storage overheads and long processing times in what is essentially a system integration exercise e.g. encryption, TSM, Oracle. DV is fundamentally a secure archiving service. A detailed redesign exercise is recommended to ensure the service provides acceptable archive and retrieval times. The system design evolved and includes various steps that could be replaced by more efficient methods.  High
Planning moved to Jira when project was in flight A decision was made to use Jira (Scrum version) for planning by adding functional requirements directly into the Jira backlog in November 2017 (project started February 2017). Limited formal planning seems to have been in place. A formal plan was reverse-engineered from Jira in January 2018. Jira is a short-term planning and tracking tool. It does not provide facilities for overall project planning such as scheduling, resourcing and dependencies. Use a formal planning process and tools for DV III. Jira can still be used for task-level initiation and tracking. Medium
Adoption of Jira forced a Scrum-based approach Data Vault is not best suited as a pure Agile project for the following reasons: the system is quasi-regulatory (e.g. GDPR required); the end product was well defined; the project team were not 100% assigned to the project and had limited Agile/Scrum experience. The DV III project should be managed using a formal requirements-planning-design-build-test-launch approach. Once complete, the plan can be added to Jira and a sprint- or Kanban-based framework used for tracking tasks. Medium
Long lead time identifying cloud supplier The selection of Oracle Cloud as the external archive supplier took over 6 months using the G-Cloud framework. With the value of the contract (£13,500 over 3 years) a single supplier could have been engaged directly. Ensure procurement approach is appropriate for the costs involved. High
TSM licences not available as agreed TSM licences from the RSS pool were to be used but when they were required they were not available. Unplanned expenditure (£12,766) was required to purchase the requisite licences. Ensure commitment to supply licences is formally confirmed. Medium
Processing on DataStore impacted other users Processing on DataStore, a networked shared storage system, impacted other users including staff network drives. Testing had to be curtailed and a request was raised to move to the ENT SAN, where performance and reliability improved. Ensure appropriate storage technology is used for DV working storage during the design phase. High
Design of the system infrastructure was ad hoc Design of the system infrastructure was done just prior to the provisioning of final test and production environments. The team had limited infrastructure design knowledge. Ensure infrastructure design is included at the start of the project as part of the system design process. Medium
Significant budget overspend on development resources Spending on development resources ate into the FY2018-19 project (DV III) budget, potentially reducing the work that could be done on the next phase of DV. Effective planning at the start of the project identifies resource and scheduling requirements that can be used to control project spend more efficiently. High

Outstanding Issues

The following, within the project scope, were not completed –

  • Deposits up to 10TB not available – current limit is 2TB
  • Organisation structure and roles not implemented for data sharing
  • Automated billing and usage reports – basic reporting provided

The following key features are planned for the next Data Vault project, DV III, which starts in February 2019.

  • Increase maximum deposit sizes to 10TB (currently 2TB)
  • Improve release management
  • Improve processing recovery procedures
  • Implement organisational structures and roles for data sharing
  • Improve the website user experience (UX)
  • Provide storage auditing features
  • Implement automated billing
  • Provide retention date review features
  • Migrate deposits from the interim Data Vault service
  • Provide usage reports for Schools administrators
  • Simplify retention policy updates
  • Improve encryption key management
  • Optimise deposit and retrieval processing times

Appendix 1 – Jira Cumulative Flow Diagram

[Figure: Jira cumulative flow diagram (jira_cumulative_chart.jpg)]

Project Info

Project: Data Vault
Code: RSS022
Programme: ITI - Research Services (RSS)
Management Office: ISG PMO
Project Manager: Lawrence Stevenson
Project Sponsor: Robin Rice
Current Stage: Close
Status: Closed
Project Classification: Transform
Start Date: 08-Feb-2017
Planning Date: 16-Oct-2018
Delivery Date: 25-Jan-2019
Close Date: 15-Mar-2019
Overall Priority: Highest
Category: Discretionary