Big data has emerged as an option for organizations looking to provide data warehousing for transactional systems such as ERP, CRM, SCM, and HRMS. Big data stores can replace traditional MOLAP, ROLAP, and HOLAP structures to meet business intelligence requirements: the analytics and visualization layer of dashboards and reports can be repointed from OLAP cubes to big data (data lakes) without changes to the visualizations themselves.
Storage, therefore, becomes a critical component of big data solutions. There are two main options: the Hadoop Distributed File System (HDFS), the workhorse of the Apache ecosystem, and object storage, which has gained wide popularity. Let us examine the value propositions each has to offer -
|HDFS can operate on commodity hardware. It is highly flexible for horizontal scaling, with clusters ranging from a few nodes to thousands (Facebook has reported one with 2,000 nodes and 21 PB of storage)
|Object storage, on the other hand, scales vertically: we can add storage to existing servers, but adding a server is not easy, which limits horizontal scalability. Hence it needs high-end storage servers, preferably NVMe-based
|These high-end machines give object storage high performance at a relatively lower price
|HDFS has both compute and storage components. An HDFS cluster consists of a minimum of one NameNode (more than one in high-availability environments), with storage distributed across multiple DataNodes
|Object storage, on the other hand, is pure storage with practically no compute
|A decoupled storage-and-compute architecture provides flexibility; AWS's serverless solutions are a prime example
|HDFS handles structured data well, such as sales figures for a region linked to a product and a salesperson. Its folder hierarchies are similar to a conventional file system, making them very transparent. It also handles 'append,' i.e., adding records to the end of a file
|Object storage works very well with unstructured and semi-structured data such as images, music, videos, and web content. Each 'object' is created, consumed, and deleted as a whole; the model focuses on 'create' and 'delete' rather than in-place updates
|The use case here is data warehousing, which is primarily structured data
|HDFS relies on replicating data to avoid data loss (three copies of each block by default), which means it consumes 2x or more additional storage space through duplication
|Object storage instead uses erasure coding, which avoids the need to replicate data while still providing a fallback in case of failures
|Although HDFS has now introduced erasure coding, it is relatively new, so there is no reliable history of its effectiveness. Moreover, HDFS runs on commodity hardware, while object storage requires high-end storage servers
|HDFS has more moving parts (processors, memory, and storage across many nodes) to provide horizontal scalability. On-premises, these also need cooling and electricity
|Object storage has a smaller number of high-end machines, so there is less to power, cool, and maintain
|HDFS incurs relatively higher utility costs than object storage
|Although HDFS distributions come from multiple vendors such as Cloudera and MapR, the underlying file system remains HDFS
|Object storage comes in several flavors, each from a different vendor. Some examples are S3, ADLS, and MinIO
|HDFS thus provides transparency and interoperability across vendor offerings
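The storage-overhead difference between replication and erasure coding, noted above, can be quantified with simple arithmetic. The sketch below compares HDFS's default 3x replication with a Reed-Solomon RS(6,3) layout (one of the erasure-coding policies HDFS itself offers); the 1 PB figure is purely illustrative:

```python
# Raw storage needed for 1 PB of logical data under two redundancy schemes.
# Numbers are illustrative; RS(6,3) means 6 data blocks + 3 parity blocks
# per stripe, tolerating the loss of any 3 blocks.

def replication_raw_storage(data_tb: float, replicas: int = 3) -> float:
    """Raw capacity when every block is stored `replicas` times."""
    return data_tb * replicas

def erasure_coded_raw_storage(data_tb: float, data_units: int = 6,
                              parity_units: int = 3) -> float:
    """Raw capacity under Reed-Solomon RS(data_units, parity_units)."""
    return data_tb * (data_units + parity_units) / data_units

data = 1000.0  # 1 PB of logical data, in TB
print(replication_raw_storage(data))    # 3000.0 TB: 200% overhead
print(erasure_coded_raw_storage(data))  # 1500.0 TB: 50% overhead
```

Both schemes survive multiple failures, but erasure coding halves the raw capacity bill here, which is a large part of object storage's cost advantage at petabyte scale.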
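The access-model difference also matters for the append workloads mentioned above. The toy in-memory `ObjectStore` class below is a hypothetical illustration (not any real client API) of why object stores suit whole-object create/read/delete but make appends expensive, since an object is always replaced in full:

```python
# Hypothetical in-memory object store illustrating whole-object semantics.
# Real object stores (S3, ADLS, MinIO) expose a similar put/get/delete
# surface; "appending" requires rewriting the entire object.

class ObjectStore:
    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> None:
        # Put always replaces the whole object; no partial writes.
        self._objects[key] = body

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete(self, key: str) -> None:
        del self._objects[key]

    def append(self, key: str, more: bytes) -> None:
        # No in-place append: read the full object, then rewrite it.
        self.put(key, self._objects.get(key, b"") + more)

store = ObjectStore()
store.put("sales/2023.csv", b"region,amount\n")
store.append("sales/2023.csv", b"EMEA,100\n")  # full read-modify-write
print(store.get("sales/2023.csv").decode())
```

On HDFS, by contrast, the same record could be appended directly to the end of the file, which is why append-heavy structured workloads lean toward HDFS.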
In conclusion, object storage presents an excellent storage option, and I would vote for it under two conditions. First, the data is unstructured or semi-structured, or the use case is archiving. Second, the expected storage size runs upwards of 5 petabytes. For sub-5 PB storage where structured data is the prime need, HDFS is the way to go. However, if you're concerned about the long-term viability of HDFS, or your vision is to migrate to S3, ADLS, or Snowflake, object storage is the better option.