Data Size Reduction (Buhler, Erl, Khattak)

How can the size of data be reduced to enable more cost-effective storage and easier data movement when dealing with very large amounts of data?

Problem

Storing increasingly large amounts of data inside a Big Data solution environment can quickly exhaust existing storage capacity, requiring frequent capacity expansion that increases costs. In addition, transferring very large files within a cluster can lengthen overall data processing time.

Solution

The storage footprint of incoming raw data is reduced before the data is stored inside the Big Data platform.

Application

Acquired data is compressed by applying compression techniques, either in flight for streaming data or after the dataset has been acquired for batch data.

A compression engine mechanism is introduced within the Big Data platform that works closely with the data transfer engine to compress data as it is acquired. In other circumstances, already acquired data can be processed to create a reduced-size dataset, or the output from the processing engine can be configured to be compressed automatically.
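The two modes described above can be illustrated with a minimal sketch that uses Python's standard `zlib` and `gzip` modules as a stand-in for the compression engine. The function names `compress_stream` and `compress_batch` and the sample sensor records are hypothetical and are not part of the pattern itself.

```python
import gzip
import zlib
from typing import Iterable, Iterator


def compress_stream(chunks: Iterable[bytes], level: int = 6) -> Iterator[bytes]:
    """Compress chunks in flight, yielding compressed bytes as they become available."""
    # wbits=31 selects the gzip container, so the output can be read by gzip tools.
    compressor = zlib.compressobj(level, zlib.DEFLATED, 31)
    for chunk in chunks:
        out = compressor.compress(chunk)
        if out:  # the compressor may buffer input and return nothing yet
            yield out
    # Emit whatever is still buffered and close the gzip stream.
    yield compressor.flush()


def compress_batch(data: bytes, level: int = 6) -> bytes:
    """Compress a dataset that has already been acquired in full."""
    return gzip.compress(data, compresslevel=level)


# Hypothetical sample data standing in for acquired records.
records = [f"sensor_id={i % 10},temp=21.5\n".encode() for i in range(10_000)]
raw = b"".join(records)

streamed = b"".join(compress_stream(records))
batched = compress_batch(raw)

print(f"raw: {len(raw)} bytes")
print(f"streamed: {len(streamed)} bytes, batched: {len(batched)} bytes")
```

In the streaming case the compressor never needs the whole dataset in memory, which is why it suits data that is compressed as it is acquired. The batch case is simpler but assumes the dataset is already available in full.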

Applying this pattern requires care, because an incorrect application can increase overall data processing time and waste processing resources. The compression engine should be efficient: it should need relatively few processing cycles to compress and decompress data while still achieving a worthwhile reduction in dataset size. In general, a compression engine that achieves greater compression requires more computing power and time, and one that uses less computing power and time achieves less compression.
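One way to evaluate this trade-off before committing to a setting is to measure it on a representative sample. The sketch below assumes `zlib` as the compression engine and uses a synthetic sample. The helper `measure` and the sample records are hypothetical, and real results depend on the data and hardware.

```python
import time
import zlib
from typing import Dict, Tuple


def measure(data: bytes, level: int) -> Tuple[float, float]:
    """Return (compressed size / original size, seconds spent compressing)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Confirm the data survives a round trip before trusting the numbers.
    if zlib.decompress(compressed) != data:
        raise ValueError("round trip failed")
    return len(compressed) / len(data), elapsed


# Hypothetical sample standing in for a slice of the acquired dataset.
sample = "".join(
    f"{i},sensor-{i % 50},{(i * 37) % 1000 / 10}\n" for i in range(50_000)
).encode()

# Level 1 favours speed, level 9 favours size, level 6 is zlib's usual default.
results: Dict[int, Tuple[float, float]] = {
    level: measure(sample, level) for level in (1, 6, 9)
}

for level, (ratio, seconds) in results.items():
    print(f"level {level}: ratio {ratio:.3f}, compress time {seconds * 1000:.1f} ms")
```

Comparing the printed ratio against the compression time for each level gives a rough basis for choosing a setting that reduces size without adding disproportionate processing time.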

Data Size Reduction: A component is added to the Big Data platform that reduces the size of the data before it is saved to the storage device. This not only keeps the storage cost low but further facilitates faster data movement within the cluster, which helps achieve quicker processing of data.

In the preceding diagram, while the amount of acquired data is moderate, IT spending increases only slightly over time. As the amount of acquired data grows exponentially, IT spending tends to grow exponentially as well. If a data compression engine is introduced, however, storage capacity no longer needs to grow in proportion to the acquired data, so IT spending increases only slightly.