The traditional file system and database web backend is no longer adequate and must make way for storage systems that handle unstructured data. In this article, we’ll learn the differences between structured and unstructured data, and why web storage backends need to scale to handle unstructured data.
Traditionally, web applications use file systems and databases to store user data. This is simple to manage, because web applications generate structured data by accepting text input into forms and saving the input to a database. However, times are changing; with the advent of social media, cloud storage and data analytics platforms, increasing amounts of unstructured data are being pushed onto the internet.
IDC conducted a study in 2014 that predicted that unstructured data created and copied worldwide would reach 44 zettabytes, or 44 trillion gigabytes, per year by 2020. This is a 10-fold increase compared to the 2013 figure of 4.4 zettabytes. If you think that’s a bit much, think again: as of 2015, unstructured data already represents 90% of all digital data!
So, like other computing paradigms, storage systems must evolve to handle this new wave of unstructured data that has hit the Internet. But before we go any further, let me define unstructured data for you. Data that cannot be organized for storage in a relational database is generally referred to as unstructured data. You can have textual or non-textual unstructured data. Text documents, emails, and presentations are examples of unstructured text data. Examples of non-text unstructured data include videos, images, and audio files. You can also take a look at this Quora Feed to get an idea of the difference between structured and unstructured data.
Why object storage?
We now know that a lot of unstructured data is being generated, and that it needs to be stored in a way that is easy to access, yet secure and reliable. We already have a storage mechanism that people have used since the beginning of modern computing: the file system.
So why do we need a whole new storage paradigm? The answer lies in the details. Let’s zoom in a bit and understand the requirements.
When we talk about unstructured data and its scale, it is important to understand that the underlying system used to store the data must scale very well. But scaling filesystems is difficult. Not only do you have to deal with the (sometimes) unnecessary metadata and hierarchy that file systems impose on you, but there are also maintenance considerations such as managing backups.
It is not enough to collect unstructured data. You also need to apply some level of organization to make sense of the data. Techniques such as text analytics, auto-categorization, and auto-tagging are key to making business sense of all the unstructured data you collect. This is difficult to achieve with filesystems because they have fixed layouts.
File systems were designed for humans, not for HTTP(S). Sharing and managing files in a file system is hard to do programmatically, and handling file streams and their possible edge cases is error-prone and time-consuming.
To get around all of this, something new is needed, something designed from the ground up with the new requirements in focus. This brings us to object storage.
What is object storage?
Unlike files in file systems, objects are stored in a flat structure. There is only one object pool: no folders, directories or hierarchies. You simply request a given object by presenting its object ID. Objects can be local or on a cloud server thousands of miles away, but because they’re in a flat address space, they’re retrieved in exactly the same way.
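To make the flat address space concrete, here is a minimal in-memory sketch in Python. It is not how any particular product works: the `FlatObjectStore` class and its `put` and `get` methods are hypothetical names invented for this illustration, and a plain dictionary stands in for the object pool.

```python
import uuid


class FlatObjectStore:
    """Toy in-memory object store: one flat pool keyed by object ID."""

    def __init__(self):
        # A single dictionary stands in for the flat address space:
        # no folders, no directories, no hierarchy.
        self._objects = {}

    def put(self, data: bytes) -> str:
        """Store the bytes and return a newly generated object ID."""
        object_id = uuid.uuid4().hex
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        """Fetch an object by its ID; raises KeyError if the ID is unknown."""
        return self._objects[object_id]


store = FlatObjectStore()
photo_id = store.put(b"...image bytes...")
report_id = store.put(b"quarterly report text")

# Retrieval needs only the ID, never a path or a directory listing.
print(store.get(report_id))
```

The caller never learns where the bytes live. A real system could keep them on a local disk or in a remote data center and the call would look the same.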
An important aspect is metadata management. Object storage provides great flexibility because object metadata is arbitrary. Metadata is not limited to what the storage system deems important (think of the fixed metadata in file systems). You can manually add any type or amount of metadata. For example, you can record the type of application the object is associated with; the importance of that application; the level of data protection you want to assign to the object; whether you want the object to be replicated to one or more other sites; when to move the object to another storage tier or to another geographical area; and when to delete the object. And so on; the possibilities are limitless.
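As a rough illustration of arbitrary metadata, the Python sketch below attaches free-form key/value pairs to each object and then selects objects by those pairs. The `StoredObject` class, the `find_by_metadata` helper, the object IDs, and the metadata keys (`app`, `replicas`, `tier`, `delete_after`) are all assumptions made up for this example, not part of any real object storage API.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    # Free-form metadata: the store does not dictate which keys exist.
    metadata: dict = field(default_factory=dict)


def find_by_metadata(objects: dict, **criteria) -> list:
    """Return sorted IDs of objects whose metadata matches every key/value given."""
    return sorted(
        object_id
        for object_id, obj in objects.items()
        if all(obj.metadata.get(key) == value for key, value in criteria.items())
    )


# Hypothetical object IDs and metadata keys, chosen only for illustration.
objects = {
    "a1": StoredObject(b"invoice scan", {"app": "billing", "replicas": 3, "tier": "hot"}),
    "b2": StoredObject(b"old log archive", {"app": "billing", "tier": "cold", "delete_after": "2030-01-01"}),
    "c3": StoredObject(b"profile picture", {"app": "social", "replicas": 2, "tier": "hot"}),
}

hot_billing = find_by_metadata(objects, app="billing", tier="hot")
print(hot_billing)  # ['a1']
```

Because each object carries its own description, a policy such as "move cold billing data to cheaper storage" becomes a metadata query rather than a walk over a directory tree.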
It is very important that files are accessible via HTTP(S), so that they are easy to retrieve and can then be subjected to analysis or other techniques. Object storage handles this well. Almost every platform that offers object storage has REST APIs to help you access files over HTTP(S). Not only are the APIs useful for accessing data, but they also help you authenticate, get file properties, and manage permissions, all of which you would have to do manually in a file system.
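The sketch below shows the general shape of HTTP access to objects, using only the Python standard library. It is a toy: the `ObjectHandler` class, the `OBJECTS` dictionary, and the `report-2015` object ID are assumptions for this example. It has no authentication or permissions, and it does not implement the S3, Swift, or any other real API.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

OBJECTS = {}  # object ID -> bytes; a stand-in for real storage


class ObjectHandler(BaseHTTPRequestHandler):
    def _object_id(self):
        # The URL path, minus the leading slash, is treated as the object ID.
        return self.path.lstrip("/")

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        OBJECTS[self._object_id()] = self.rfile.read(length)
        self.send_response(201)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        data = OBJECTS.get(self._object_id())
        if data is None:
            self.send_error(404, "No such object")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        pass  # keep the example quiet


# Port 0 asks the operating system for any free port.
server = ThreadingHTTPServer(("127.0.0.1", 0), ObjectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

# An empty ProxyHandler makes sure local requests bypass any configured proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# Upload an object with PUT, then read it back with GET.
put = urllib.request.Request(f"{base_url}/report-2015", data=b"hello object", method="PUT")
with opener.open(put) as resp:
    put_status = resp.status

with opener.open(f"{base_url}/report-2015") as resp:
    body = resp.read()

server.shutdown()
server.server_close()
print(put_status, body)
```

Any HTTP client, in any language, could talk to such an endpoint in the same way, which is what makes this style of access easy to automate.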
Now that the majority of data on the Internet is unstructured, and experts are predicting double-digit growth in this trend, it’s important to meet this challenge head-on. Unstructured data needs to be stored in an easy-to-access way, and we need to have the tools to make business sense of the vast amounts of unstructured data we collect.
Let’s take a look at some of the most popular open source object storage solutions available:
Ceph is a distributed object, block, and file storage platform. Ceph’s software libraries provide client applications with direct access to the RADOS object-based storage system and also provide a foundation for some of Ceph’s advanced features, including RADOS Block Device (RBD), RADOS Gateway, and the Ceph File System. (See An Introduction to Ceph Storage for OpenStack.)
Minio is a minimal object storage server that is API-compatible with Amazon S3. Written in Go, Minio is lightweight and highly concurrent. (See Minimal object storage with Minio.)
OpenStack Swift is a highly available, distributed, and eventually consistent object/blob store. Written in Python, Swift supports REST APIs and other clients to access data. (Read more Opensource.com articles on Swift.)
A version of this article was previously published on the Minio blog. Republished with permission and under Creative Commons.