Closure Report
RSS022 - Data Vault
Approvals
| Name | Role | Position | Date | 
|---|---|---|---|
| Tony Weir | Project Sponsor | Head, IT Infrastructure | - | 
| Robin Rice | Service Owner | Data Librarian & Head, Research Data Support Services | 22/02/2019 | 
| Fraser Muir | Senior User | Chief Information Officer, CAHSS | 11/03/2019 | 
| Anthony Davie | Senior User | IS Campus Leader, MVM | 11/03/2019 | 
| Colin Higgs | Senior User | Computing Officer, School of Engineering | 11/03/2019 | 
| Kirsty Lingstadt | Senior Supplier | Head, Digital Library & Deputy Director, L&UC | 27/02/2019 | 
| David Fergusson | Senior Supplier | Head, ITI Research Services Section | 11/03/2019 | 
| Claire Young | Senior Supplier | Operations Manager, EDINA | - | 
| Graeme Wood | Senior Supplier | Head, ITI Enterprise Services | - | 
| Ianthe Sutherland | Product Owner | Development & Systems Manager, Digital Library | 06/03/2019 | 
| Maurice Franceschi | Programme Manager (RSS) | ITI Portfolio Manager, ISG | 25/02/2019 | 
| Lawrence Stevenson | Project Manager | ITI Project Manager, ISG | 15/02/2019 | 
Project Summary
The University of Edinburgh’s Data Vault is a service component of the 2012 Research Data Management Roadmap which helps researchers comply with funding requirements for long-term retention of their research data. In 2015-16, the JISC-funded Data Vault project, run by the Library and University Collections (L&UC) team at the University of Edinburgh in collaboration with the University of Manchester, developed generic software designed to collect basic research metadata and deposit research datasets into an archive service. In parallel, the University’s ITI Research Services Section (RSS) defined an initial systems architecture to provide robust archive services.
Due to demand and the time needed to develop a UoE-specific Data Vault service, an interim service was launched in 2016. The interim service uses the RSS Storage Manager facility to manage research data deposits on the RSS disk-based DataStore system.
The main objectives of the UoE Data Vault project, published in the original project brief in May 2017, were to –
- expand the interim service
- make Data Vault ‘self-service’ for users
- define the processes and policies of the full service
- design a robust system architecture for the service
- launch Data Vault, as soon as possible, as a fully supported Research Data service
Project Scope
Data Vault was developed to provide a low-cost and reliable archiving service for the University’s research community in compliance with research data management and funding policies. Researchers are only able to deposit datasets in Data Vault if the corresponding metadata is held in the University’s PURE research management information service.
Researchers can create a ‘vault’ to deposit datasets. Normally a single vault is used to hold the dataset(s) from a single research project. Initial Data Vault requirements also included the need for secure data sharing within a research group or department, by incorporating organisational structures and roles, so that the University retains control of data access and retention if the original researcher is no longer a member of staff.
Unlike the interim service, the Data Vault has been designed to archive personal and sensitive data, within the scope of GDPR regulations, by (a) advising users to anonymise or pseudo-anonymise personal data and (b) encrypting all data prior to transmission and deposit, using SHA-256 protocols, to the archive subsystems.
Out of Scope
The following deliverables were not included in the scope of this project –
- archiving of University business data[1]
- storage of staff file systems
- storage of staff personal data (e.g. photo or music collections)
- archiving of student file systems and data
- archiving of PURE ‘restricted’ data
[1] This service may be included in the scope of a future Data Vault project
Outcomes
Objectives
The key objectives of the Data Vault project were to provide –
- a legacy research archive and retrieval service at low-cost (£50/TB per annum) that met the relevant funding body’s storage and retention policies
- controlled archiving of research data by only allowing users to deposit datasets with (unrestricted) metadata already defined in the University’s PURE research information system
- support for research organisational structures and roles to enable data sharing and control, particularly when the principal investigator is no longer employed at the University
- a web-based user experience (UX) that provides simple and user-friendly archive, retrieval and review functionality
- encrypted transmission and storage of data that ensures datasets can only be accessed by authorised users
- resilience in archiving and retrieval by storing three (3) copies of every deposit: two in on-site facilities, using IBM’s Tivoli Storage Manager (TSM), and one off-site on the Oracle Cloud Archive service
- automated billing and usage reports
- support for dataset deposits up to 10TB
Most of the objectives outlined above have been achieved and confirmed. The principal goal of the project to provide a low-cost, secure, resilient and reliable service has been met.
Requirements
The status of the key requirements is summarised in the table below.
| Requirement | MoSCoW | Status | 
|---|---|---|
| Low-cost archiving meeting funding retention policies | Must | Delivered | 
| Controlled archiving using metadata in PURE | Must | Delivered | 
| Embedded organisational structures and roles to support data sharing | Must | Not delivered | 
| A user-friendly web user interface | Must | Delivered | 
| Encrypted transmission and storage of data | Must | Delivered | 
| Resilient back-end archive services | Must | Delivered | 
| Automated billing and usage reports | Should | Partially delivered | 
| Support for dataset deposits up to 10TB | Should | Partially delivered | 
Deliverables
More detail on the status of deliverables is given in the table below.
| Objective | Deliverable(s) | Achieved | 
|---|---|---|
| A legacy research archive and retrieval service at low-cost (£50/TB per annum) that met the relevant funding body’s storage and retention policies | A Data Vault service that meets cost constraints and conforms to funding regulations | Yes. Service launched in January 2019 with costs pegged at the interim system level and meeting funding regulations | 
| Controlled archiving of research data by only allowing users to deposit datasets with (unrestricted) metadata already defined in the University’s PURE research management information system | A service that controls the depositing of data by only using metadata held in the PURE system | Yes. Only data defined by (unrestricted) metadata held in PURE can be processed | 
| Research organisational structures and roles incorporated to enable data sharing and control, particularly when the principal investigator is no longer employed at the University | A Data Vault subsystem that provides the ability to create an organisational structure and roles to enable data control and access | No. Not achieved; included in the objectives of the next Data Vault project, initiated in February 2019 | 
| A web-based user experience (UX) that provides simple and user-friendly archive and retrieval functionality | An easy-to-use and informative user interface (UI) presenting Data Vault functions in a consistent form | Yes. Extensive work was done on the design and development of the user interface with clear instructions, fields and supporting text | 
| Encrypted transmission and storage of data that ensures datasets can only be accessed by authorised users | Secure transmission and storage of data to internal and external archive services | Yes. Client-side encryption of all deposits completed before transmission | 
| Resilience in archiving and retrieval by storing three (3) copies of every deposit: two in on-site facilities, using Tivoli Storage Manager (TSM), and one off-site on the Oracle Cloud Archive service | Data stored, encrypted, in two on-site (TSM) archiving services and one off-site (Oracle Cloud) service | Yes. Data archived to TSM and Oracle Cloud services | 
| Automated billing and usage reports | Billing and usage reports sent to user groups automatically on a regular basis from the Data Vault service | Partial. Online reports provided for staff from the Library Research Support team to prepare group-based billing and usage reports manually | 
| Support for dataset deposits up to 10TB | A Data Vault service handling datasets up to 10TB in size | Partial. Data Vault can currently handle datasets up to 2TB in size | 
Product Quality
The quality of the Data Vault as a product is measured below against the University’s and ISG’s strategic visions as published in the project brief.
University Strategic Vision
| Vision | Commentary | 
|---|---|
| A unique Edinburgh offer for all of our students | |
| | The Data Vault service provides the research community with a low-cost, secure and regulatory data archiving service for research projects | 
| | N/A | 
| | N/A | 
| Strong and vibrant communities within and beyond the University – making the most of our unique offer of world-leading thinking and learning within one of the world’s most attractive cities | N/A | 
| A larger, more international staff who feel valued and supported in a University that is a great and collegial place to work, develop and progress | Data Vault supplements existing services to researchers | 
| More postgraduate students – underpinned by the best support in the sector to ensure we attract the brightest and best regardless of ability to pay | Data Vault does not currently support postgraduate based research work | 
| A strong culture of philanthropic support focussed especially on our students and on outstanding research capabilities. | N/A | 
| Many more students benefiting from the Edinburgh experience (largely or entirely) in their own country – supported by deep international partnerships and world leading online distance learning | N/A | 
| Sustained world leading reputation for the breadth, depth and inter-disciplinarity of our research supported by strong growth in research funding and strong international partnerships – drawing from well-established and less well developed sources | Data Vault provides a unique service, not currently available to Universities in the UK | 
| An estate that matches expectations, responds flexibly to changing student and staff needs, and showcases the University | The Data Vault service meets the data retention needs of funding bodies as well as meeting GDPR regulations | 
| A deeper and earlier collaboration with industry, the public sector and the third sector – in terms of research; knowledge exchange; and in giving our students the best possible set of skills for their future | Data Vault has the potential to be extended to provide low-cost, secure storage for legacy data that require retention by other organisations in the University e.g. Finance and to external organisations seeking similar services | 
ISG Strategic Vision
| Vision | Commentary | 
|---|---|
| Student Experience | |
| | N/A | 
| | N/A | 
| | Data Vault development has been driven by the Library and University Collections team who recognised a gap in service provisioning in an area that could be exploited in the UK and overseas | 
| Research and Innovation | |
| | The service provides the University’s research community with the ability to store completed research data cheaply and meet regulatory requirements, while keeping the data accessible as and when required | 
| | A market review has shown that Data Vault provides a type of service that is currently not available but would benefit University researchers across the UK | 
| | The University, along with the University of Manchester, is taking the lead in delivering a JISC-funded version of Data Vault for use in UK Universities | 
| Service Excellence | |
| | Data Vault provides researchers with a secure central facility to improve their research process management for storing, retrieving and sharing past research data | 
| | Data Vault is one of the strands of development to meet the goals of the Research Data Management Roadmap, which also includes complementary initiatives such as DataShare and Data Safe Haven | 
| | Data Vault incorporates secure user account management by integrating with the University’s EASE and Shibboleth authentication services. The service also incorporates client-side encryption and decryption to ensure the secure transmission and storage of data | 
Project Quality
Project Plan
The Data Vault project has been subject to considerable delay for various reasons. The project was initiated in February 2017, and the initial project brief and plan were prepared in May 2017 with a delivery date of October 2017. However, after a change of project manager, the project was re-planned in October 2017, with a delivery date of February 2018 and an Agile (Scrum) approach adopted for project planning and tracking. In December 2017, with another change of project manager, the project was again re-planned, with the Jira backlog reverse-engineered to produce a complementary project plan. At that point, delivery of a minimum viable product (MVP) was estimated for the end of May 2018. Further iterations generated delivery dates in September 2018 and December 2018.
Finally, working storage was moved from DataStore, where peripheral disk performance was being adversely affected, to the ITI Enterprise Services SAN. User and performance testing, originally run in June and July 2018, was rerun in November and December 2018, and Data Vault was launched in January 2019.
Another reason for delays was the frequent addition of new requirements to the Jira backlog, which went from around 50 tasks in October 2017 to over 300 tasks at the end of the project. The core reasons for this behaviour were the lack of effective planning and design in the early stages of the project and the leeway to continually expand on requirements during the project under Agile (Scrum) governance, as illustrated in the cumulative flow diagram in Appendix 1.
Project Resourcing
Actual resourcing on the project significantly exceeded estimates because of the delays. When the plan was revised in January 2018, 468 days of effort were estimated (based on an August 2017 start date), including EDINA resources. Excluding EDINA resources, the estimate was 252 days.
Resource actuals overtook estimates in July 2018. Additional resourcing in FY2018-19 was estimated as 180 days, including EDINA development, or 95 days net once EDINA resources are removed. This means that if EDINA resources (covered by revenue, see below) are excluded, project resourcing was 37% above plan (95 additional days against the 252-day estimate).
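The overrun percentage can be reproduced from the day counts quoted above. The following is a minimal sketch, assuming the "additional resourcing" figures represent effort over and above the January 2018 estimates; the function and variable names are illustrative only, and the "including EDINA" percentage is derived here rather than stated in the report.

```python
# Days of effort quoted in the Project Resourcing section.
PLANNED_INCL_EDINA = 468      # January 2018 estimate, including EDINA
PLANNED_EXCL_EDINA = 252      # January 2018 estimate, excluding EDINA
ADDITIONAL_INCL_EDINA = 180   # extra FY2018-19 effort, including EDINA
ADDITIONAL_EXCL_EDINA = 95    # extra FY2018-19 effort, excluding EDINA


def overrun_percent(planned_days, additional_days):
    """Return the additional effort as a percentage of the planned effort."""
    return 100.0 * additional_days / planned_days


excl = overrun_percent(PLANNED_EXCL_EDINA, ADDITIONAL_EXCL_EDINA)
incl = overrun_percent(PLANNED_INCL_EDINA, ADDITIONAL_INCL_EDINA)

print(f"Excluding EDINA: {excl:.1f}% above plan")
print(f"Including EDINA: {incl:.1f}% above plan")
```

Excluding EDINA, the calculation gives 37.7%, which matches the reported 37% when truncated to a whole number.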
Project Budget
In FY2017-18, the Data Vault project was allocated £111,000 capital and £100,000 revenue budget. £12,766 was spent from the capital budget, on TSM licences. Costs from the revenue budget were £97,915 with £84,350 for EDINA development support and £13,565 for Oracle Cloud Archive services. An additional £20,000 was used from the FY2018-19 budget to cover the EDINA development costs needed to launch Data Vault.
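The budget figures above can be cross-checked with a few lines of arithmetic. This is a sketch only: the dictionary layout and variable names are assumptions made for illustration, and the EDINA total assumes the additional £20,000 was spent in full on top of the FY2017-18 EDINA costs.

```python
# FY2017-18 allocations and spend (GBP), as quoted in the Project Budget section.
CAPITAL_BUDGET = 111_000
REVENUE_BUDGET = 100_000

capital_spend = {"TSM licences": 12_766}
revenue_spend = {
    "EDINA development support": 84_350,
    "Oracle Cloud Archive services": 13_565,
}

# Additional EDINA development funding drawn from the FY2018-19 budget.
EXTRA_FY2018_19 = 20_000

capital_total = sum(capital_spend.values())
revenue_total = sum(revenue_spend.values())

capital_unspent = CAPITAL_BUDGET - capital_total
revenue_unspent = REVENUE_BUDGET - revenue_total

# Assumption: the extra 20,000 adds to the FY2017-18 EDINA spend.
edina_total = revenue_spend["EDINA development support"] + EXTRA_FY2018_19

print(f"Capital spent: {capital_total:,} (unspent: {capital_unspent:,})")
print(f"Revenue spent: {revenue_total:,} (unspent: {revenue_unspent:,})")
print(f"EDINA development across both years: {edina_total:,}")
```

The revenue items sum to the £97,915 quoted in the report, which confirms the breakdown is internally consistent.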
Lessons Learned
The key observations from the project are summarised in the table below –
| Observation | Description | Recommendations | Impact | 
|---|---|---|---|
| Limited knowledge transfer from interim system | The development team for DV II was not the same as the team used on the interim system, and there was limited knowledge transfer from the interim system, which put the new development team at a disadvantage. Minimal documentation was produced for the interim system. | Ensure DV III architecture and operations documentation is of a detail and quality that benefits DV III developers. | High | 
| New development team engaged, distinct from interim system team | Development for DV II was subcontracted to EDINA. The lack of knowledge transfer and documentation created a steep learning curve for the developers. | Ensure DV III development team includes EDINA developers, if possible. If not, ensure L&UC development staff are involved in the final stages of DV II and the system is fully documented at the architecture and operations level. | Medium | 
| No detailed system design | There was no appointed system architect, so no detailed system design was undertaken. A 3-4 page outline system design document was produced. Extensive design decisions were made 'on the fly' during the build phase of the project. The system evolved rather than being designed. | Ensure a system architect is in place and detailed design, both functional and infrastructure, is completed in a distinct design phase prior to system build. | High | 
| Workflow has high working storage overheads and long processing times | Workflow is inefficient with high working storage overheads and long processing times in what is essentially a system integration exercise e.g. encryption, TSM, Oracle. DV is fundamentally a secure archiving service. | A detailed redesign exercise is recommended to ensure the service provides acceptable archive and retrieval times. The system design evolved and includes various steps that could be replaced by more efficient methods. | High | 
| Planning moved to Jira when project was in flight | A decision was made to use Jira (Scrum version) for planning by adding functional requirements directly into the Jira backlog in November 2017 (project started February 2017). Limited formal planning seems to have been in place. A formal plan was reverse-engineered from Jira in January 2018. Jira is a short-term planning and tracking tool. It does not provide facilities for overall project planning such as scheduling, resourcing and dependencies. | Use a formal planning process and tools for DV III. Jira can still be used for task-level initiation and tracking. | Medium | 
| Adoption of Jira forced a Scrum-based approach | Data Vault is not best suited to being run as a pure Agile project for the following reasons: the system is quasi-regulatory (e.g. GDPR compliance is required); the end product was well defined; the project team were not 100% assigned to the project and had limited Agile/Scrum experience. | DV III project should be managed using a formal requirements-planning-design-build-test-launch approach. Once complete, the plan can be added to Jira and a sprint- or Kanban-based framework used for tracking tasks. | Medium | 
| Long lead time identifying cloud supplier | The selection of Oracle cloud as the external archive supplier took over 6 months using the G-Cloud framework. With the value of the contract (£13,500 over 3 years) a single supplier could have been engaged directly. | Ensure procurement approach is appropriate for the costs involved. | High | 
| TSM licences not available as agreed | TSM licences from the RSS pool were to be used but when they were required they were not available. Unplanned expenditure (£12,766) was required to purchase the requisite licences. | Ensure commitment to supply licences is formally confirmed. | Medium | 
| Processing on DataStore impacted other users | Processing on DataStore, a networked shared storage system, impacted other users including staff network drives. Testing had to be curtailed and a request to move to ENT SAN raised where performance and reliability improved. | Ensure appropriate storage technology is used for DV working storage during the design phase. | High | 
| Design of the system infrastructure was ad hoc | Design of the system infrastructure was done just prior to the provisioning of final test and production environments. The team had limited infrastructure design knowledge. | Ensure infrastructure design is included at the start of the project as part of the system design process | Medium | 
| Significant budget overspend on development resources | Spending on development resources ate into the FY2018-19 project (DV III) budget potentially reducing the work that could be done on the next phase of DV. | Effective planning at the start of the project identifies resource and scheduling requirements that can be used to control project spend more efficiently. | High | 
Outstanding Issues
The following, within the project scope, were not completed –
- Deposits up to 10TB not available – current limit is 2TB
- Organisation structure and roles not implemented for data sharing
- Automated billing and usage reports – basic reporting provided
The following key features are planned for the next Data Vault project, DV III, which starts in February 2019.
- Increase maximum deposit sizes to 10TB (currently 2TB)
- Improve release management
- Improve processing recovery procedures
- Implement organisational structures and roles for data sharing
- Improve the website user experience (UX)
- Provide storage auditing features
- Implement automated billing
- Provide retention date review features
- Migrate deposits from the interim Data Vault service
- Provide usage reports for Schools administrators
- Simplify retention policy updates
- Improve encryption key management
- Optimise deposit and retrieval processing times
Appendix 1 – Jira Cumulative Flow Diagram

Figure: Jira cumulative flow diagram (attachment jira_cumulative_chart.jpg, 46.54 KB)
