Future Storage Models
Researchers have a strong requirement to run analysis across any data they hold, regardless of storage system or location, and to operate on that data as a single large data pool (often referred to as a Data Lake).
A distributed storage architecture presents data, which may be held at any location within the system, to the user or to compute resources as the equivalent of local storage.
The original EDDIE 3/DataStore upgrade had this as an ambition for the new systems, but at that point the technology was not yet mature or reliable. In the subsequent 3 to 4 years the technology has developed rapidly, to the point where a number of major providers now offer such technologies suitable for large-scale operation.
In 17/18 IT-I RSS ran a feasibility study looking at the available extensions to the IBM technology currently used for DataStore that provide this capability, together with a limited scan of competing options. In summary, the technology is potentially suitable, but the IBM offering may currently be too costly.
Proof of Concept systems for the IBM Spectrum Cloud technology and the competing open-source Ceph system have been examined. We would like to extend this to a small number of other potential systems (e.g. Western Digital ActiveScale and Cloudian) and move to providing a limited-scale development system that can be made available to researchers interested in testing the capabilities of this technology, with the aim of identifying the correct direction for the eventual roll-out of a comprehensive system.
In addition, many modern distributed storage solutions have some form of Object Storage at their core. This type of storage differs from traditional file-based storage in that the data is stored as, effectively, a binary blob with associated metadata providing the data location information.
Such an approach means that data can be located by metadata search rather than by traversing file trees. This has a number of benefits for the user and the system: rich search and data discovery, and improved system speed because operations do not need to span file trees.
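To make the contrast with file-tree traversal concrete, the following is a minimal sketch of the idea in Python. It is an in-memory toy, not the interface of any product named in this document: the class `ToyObjectStore`, its methods, and the metadata keys (`project`, `year`) are all hypothetical names chosen for illustration. Objects live in a flat namespace keyed by an identifier, and lookup is by metadata rather than by path.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes                 # opaque binary blob
    metadata: dict[str, str]    # descriptive key/value pairs


@dataclass
class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store."""
    _objects: dict[str, StoredObject] = field(default_factory=dict)

    def put(self, data: bytes, **metadata: str) -> str:
        # Flat namespace: the identifier is derived from the content,
        # not from a position in a directory hierarchy.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = StoredObject(data, dict(metadata))
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id].data

    def search(self, **criteria: str) -> list[str]:
        # Match on metadata only; no file tree is walked.
        # This toy scans every object; a real system would be
        # expected to maintain a metadata index instead.
        return sorted(
            oid for oid, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


# Example usage with made-up data and metadata values.
store = ToyObjectStore()
reads_id = store.put(b"reads-1", project="genomics", year="2021")
image_id = store.put(b"image-1", project="imaging", year="2021")
genomics_ids = store.search(project="genomics")   # list containing reads_id
all_2021_ids = store.search(year="2021")          # both identifiers
```

The point of the sketch is only the access pattern: a researcher asks for "everything tagged with this project" without needing to know where the data sits or how directories are laid out.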
This Project complements and extends the POC for Future Storage Models already carried out under RSS039.
It will investigate and compare the new options using the same POC parameters as the previous study.
The options being considered are:
1. Arcastream
2. Cloudian
3. Western Digital
Discussions with the third parties have taken place; all companies have been sent the POC details and are eager to progress. However, owing to limited RSS resources, this Project has been placed on hold until Q1/2, when resources should be available to concentrate on the POC.
Current project status
Report Date | RAG | Budget | Effort Completed | Effort to complete |
---|---|---|---|---|
July 2021 | BLUE | 50.0 days | 4.0 days | 46.0 days |