Object Storage vs. HDFS - Which is Better?

 

Data Warehousing - The Context

Big Data platforms have emerged as an option for organizations that need data warehousing for transactional systems such as ERP, CRM, SCM, and HRMS. A data lake can replace traditional MOLAP, ROLAP, and HOLAP structures to meet business intelligence requirements. You can point the analytics and visualization layer, consisting of dashboards and reports, at the data lake without changing the dashboards and reports themselves.

Storage, therefore, becomes a critical component of Big Data solutions. The options are the Hadoop Distributed File System (HDFS), a workhorse of the Apache ecosystem, or object storage, which has gained wide popularity. Let us examine the value propositions that each has to offer:
 

| Criterion | HDFS | Object Storage | Advantage | Rationale |
|---|---|---|---|---|
| Infrastructure | HDFS can operate on commodity hardware. It scales horizontally with great flexibility, with clusters running from a few nodes to thousands of nodes (Facebook boasts of one with 2000 nodes and 21 PB of storage). | Object storage, on the other hand, is more vertically scalable. We can add storage to existing servers, but it is not easy to add a server, which limits horizontal scalability. Hence it needs high-end storage servers, preferably NVMe. | HDFS | High performance at a relatively cheaper price. |
| Architecture | HDFS has both compute and storage components. An HDFS cluster consists of at least one NameNode (more than one in high-availability environments), with storage distributed across multiple DataNodes. | Object storage, on the other hand, is pure storage with practically no compute. | Object Storage | A decoupled storage and compute architecture provides flexibility. AWS's serverless solutions are a prime example. |
| Data Types | HDFS handles structured data, such as sales figures for a region linked to product and salesperson. Its folder hierarchies are similar to those of file systems, making them very transparent. It also handles 'Append', i.e., adding records to the end of a table. | Object storage works very well with unstructured and semi-structured data such as images, music, videos, and web content, where the entire 'document' is created, consumed, and deleted; the focus is on 'Create' and 'Delete'. | HDFS | The use case here is data warehousing, which is purely structured data. |
| Storage Utilization | HDFS relies on replicating data to avoid data loss, so it uses 2x or more storage space for the duplicated data. | Object storage uses erasure coding, which obviates the need to replicate data yet still provides a fallback in case of failures. | None | Although HDFS has now introduced erasure coding, it is relatively new, so there is no reliable history of its effectiveness. Moreover, HDFS runs on commodity hardware, while object storage requires high-end storage. |
| Utility Costs | HDFS has more moving parts, i.e., processors, memory, and storage, to provide horizontal scalability. On-prem deployments also need cooling and electricity. | Object storage runs on fewer high-end machines, so its utility costs are lower. | Object Storage | HDFS incurs relatively higher utility costs than object storage. |
| Flavors/Variations | Though it comes from multiple vendors such as Cloudera and MapR, the underlying file system remains HDFS. | Object storage comes in several flavors, each from a different vendor. Some examples are S3, ADLS, and MinIO. | HDFS | HDFS provides transparency and interoperability between offerings. |
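
To make the Storage Utilization comparison concrete, here is a minimal Python sketch of the raw capacity each approach needs for the same amount of data. The function names are illustrative, and the defaults are assumptions rather than figures from the comparison above: a replication factor of 3 and a Reed-Solomon style layout of 6 data units plus 3 parity units. Real deployments may use different values.

```python
def replication_raw_tb(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw capacity needed when every block is stored `replication_factor` times.

    The default of 3 is an assumption for illustration.
    """
    return logical_tb * replication_factor


def erasure_coded_raw_tb(logical_tb: float, data_units: int = 6, parity_units: int = 3) -> float:
    """Raw capacity needed when data is split into `data_units` pieces
    and protected by `parity_units` parity pieces.

    The 6 + 3 layout is an assumption for illustration.
    """
    return logical_tb * (data_units + parity_units) / data_units


if __name__ == "__main__":
    logical_tb = 100.0  # hypothetical warehouse size in TB
    print("replication  :", replication_raw_tb(logical_tb), "TB raw")
    print("erasure coded:", erasure_coded_raw_tb(logical_tb), "TB raw")
```

Under these assumed settings, replication needs three times the logical size while the erasure-coded layout needs one and a half times, which is the storage gap the table refers to.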
  

In conclusion, object storage is an excellent storage option, and it gets my vote under two conditions. First, the data is unstructured, semi-structured, or archival. Second, the expected storage size runs upwards of 5 petabytes. For storage under 5 PB where structured data is the prime need, HDFS is the way to go. However, if you're concerned about the long-term viability of HDFS, or your vision is to migrate to S3, ADLS, or Snowflake, object storage is the better option.
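
The rule of thumb above can be written down as a small Python sketch. The `Workload` fields and the `recommend_storage` function are hypothetical names chosen for illustration. The combinations the conclusion does not cover (for example, structured data above 5 PB) are deliberately returned as "case by case" rather than guessed.

```python
from dataclasses import dataclass

SIZE_THRESHOLD_PB = 5.0  # the 5 PB cut-off from the conclusion above


@dataclass
class Workload:
    size_pb: float                  # expected storage size in petabytes
    structured: bool                # True for purely structured (warehouse) data
    prefers_cloud_path: bool = False  # HDFS viability concerns or a planned move to S3, ADLS, or Snowflake


def recommend_storage(w: Workload) -> str:
    """Encode the article's rule of thumb; not a substitute for a real sizing exercise."""
    if w.prefers_cloud_path:
        return "object storage"
    if not w.structured and w.size_pb > SIZE_THRESHOLD_PB:
        return "object storage"
    if w.structured and w.size_pb < SIZE_THRESHOLD_PB:
        return "HDFS"
    # Combinations the conclusion does not address.
    return "case by case"


if __name__ == "__main__":
    print(recommend_storage(Workload(size_pb=2.0, structured=True)))
    print(recommend_storage(Workload(size_pb=8.0, structured=False)))
    print(recommend_storage(Workload(size_pb=2.0, structured=True, prefers_cloud_path=True)))
```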

