Big data has over the years gained acceptance as a store for unstructured data such as emails, text files, web pages, digital images, multimedia content, navigation details, and social media posts. In the last few years, the focus has shifted to capturing structured data, such as customer information, sales transactions, and purchase orders, in data lakes, replacing conventional data warehouses and eliminating the 'Transformation' step from ETL. The biggest challenge in big data remains what is commonly referred to as "the small file problem".
What is the "small file problem"?
When you ingest data into a data lake, the driver (usually Hive) stores it as files in HDFS or object storage. Unless each insert carries a large amount of data, you end up creating many small files. Small files are a significant bottleneck for distributed query engines, which are optimized to consume large volumes of data in single tasks.
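A compaction pass is the usual remedy: merge many small files into fewer files close to a target size. Here is a minimal sketch in plain Python; the function name, file naming scheme, and target size are illustrative, not any engine's API:

```python
import os

def compact_small_files(paths, target_bytes, out_dir):
    """Merge many small files into fewer files of roughly target_bytes each."""
    os.makedirs(out_dir, exist_ok=True)
    buf, size, count, outputs = [], 0, 0, []
    for p in paths:
        with open(p, "rb") as f:
            data = f.read()
        buf.append(data)
        size += len(data)
        if size >= target_bytes:  # flush once we reach the target file size
            out = os.path.join(out_dir, f"part-{count:05d}.bin")
            with open(out, "wb") as f:
                f.write(b"".join(buf))
            outputs.append(out)
            buf, size, count = [], 0, count + 1
    if buf:  # write any leftover records as a final, smaller file
        out = os.path.join(out_dir, f"part-{count:05d}.bin")
        with open(out, "wb") as f:
            f.write(b"".join(buf))
        outputs.append(out)
    return outputs
```

Real engines do the same thing at scale, e.g., Spark's coalesce/repartition before writing, or a periodic compaction job over the landing directory.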
Let us explore the considerations that become critical to the success of a big data solution, based on the type of data.
Unstructured data is ingested (Sqoop, Kafka) and stored (HDFS, object storage), then retrieved with a distributed query engine (Presto, Impala, HiveQL) with relative ease. For structured data, we must store not only the data itself but also its schema information. That calls for an additional component in the big data architecture, a translation layer between the data source and storage, e.g., Hive. Hive keeps a metadata store in a relational database (Derby, MySQL, Postgres) that captures the schema information when the data is written. The query engine then uses the same metadata when the data is queried.
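The role of the metastore can be sketched as a lookup from table name to schema and storage location. The class below is a toy stand-in for Hive's metastore, with invented names and a made-up location, just to show the write-time/read-time contract:

```python
class MiniMetastore:
    """Toy stand-in for the Hive metastore: table name -> schema + location.
    The real metastore lives in a relational database (Derby, MySQL, Postgres)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, schema, location):
        # Called at write time: the schema is recorded alongside the data location.
        self.tables[name] = {"schema": schema, "location": location}

    def schema_of(self, name):
        # Called at read time: the query engine looks up the same schema.
        return self.tables[name]["schema"]
```

The point of the sketch is the symmetry: the schema captured during ingestion is the same record the query engine consults later, so the two sides never have to agree out of band.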
Relational databases feature indexes to optimize query performance. Big data has no indices and relies on partitions to optimize data retrieval. Since unstructured data, be it a picture or a multimedia file, has no internal data elements, its partitions must be based on data attributes such as time and/or geolocation. Structured data consists of multiple data elements, so the partition can be based on any of them, such as sales order ship-from location, customer credit category, or ship-to region. Choosing a suitable partition basis from the different sets of bases available for structured and unstructured data forms part of the optimization strategy.
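Partitioning by a data element amounts to grouping records by the value of the chosen key; each group would map to one directory in HDFS or object storage. A sketch, with illustrative field names:

```python
from collections import defaultdict

def partition_records(records, key):
    """Group structured records by a chosen partition key.
    Each resulting group corresponds to one partition directory in storage."""
    parts = defaultdict(list)
    for r in records:
        parts[r[key]].append(r)
    return dict(parts)
```

A query filtered on the partition key (e.g., ship-to region) then needs to read only the matching group, which is the pruning effect that replaces an index in the big data world.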
Having picked a good partition key for the expected data volume, we must ensure that every insert consolidates into as few files as possible; an 'Append' capability is therefore critical for structured data. A table holds a certain number of records, and as new transactions flow in, we need to add them to the table, i.e., 'append' them to the existing file. For structured data using Hive, this consolidation exists out of the box for HDFS storage. For object storage, the consolidation needs to happen during ingest, using in-memory processing with Spark, or after ingest, using a Delta Lake compaction strategy.
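The append-then-compact pattern can be sketched as follows: each incoming batch lands as a new fragment, and once fragments pile up they are merged into one. This is a toy version of the post-ingest compaction a Delta Lake strategy performs; the threshold and names are illustrative:

```python
def append_batch(table_files, new_batch, max_files=4):
    """Append a batch of rows as a new fragment; compact when fragments pile up.
    table_files is a list of fragments, each fragment a list of rows."""
    table_files.append(list(new_batch))
    if len(table_files) > max_files:
        # Too many small fragments: merge them into a single consolidated file.
        merged = [row for frag in table_files for row in frag]
        table_files[:] = [merged]
    return table_files
```

Without the compaction branch, every insert would leave one more small file behind, which is exactly how the small file problem accumulates on object storage.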
Number of Partitions / Partition Key
The partition key is the basis for creating the partitions used to store the data. Structured data can use data elements as partition keys, which usually results in a large number of partitions. Unstructured data, on the other hand, is partitioned on data attributes and has far fewer partitions. A large number of partitions impacts performance adversely, so for structured data an optimization strategy is required to keep the partition count to a minimum.
Query and File Formats
In the big data world, we have multiple file-format options, such as Parquet for columnar storage, Avro for row-based storage, and ORC for combined row and column storage. Query needs for unstructured and structured data differ. You query unstructured data either as a single object, a picture or a music track, or with an aggregation query, e.g., all social media posts from a geolocation. Structured data sees aggregation queries, e.g., the sum of the amount across all sales orders, as well as queries returning some or all data elements of a record, e.g., sales order number, customer, item, price, discount. A columnar format such as Parquet suits aggregation queries and unstructured data, while ORC is better suited to structured data, giving the best of both worlds.
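The row-versus-column trade-off can be sketched by pivoting row-oriented records into a column-oriented layout: an aggregation then touches only the one column it needs, which is why Parquet-style storage favors such queries. A sketch of the layout idea only, not Parquet's actual encoding:

```python
def to_columnar(rows):
    """Pivot row-oriented records into a column-oriented layout.
    Columnar engines read only the columns a query references."""
    cols = {}
    for row in rows:
        for k, v in row.items():
            cols.setdefault(k, []).append(v)
    return cols
```

With this layout, summing the amount over all sales orders scans one list; in the row layout, the same query would have to walk every field of every record.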