Dataset Decomposition (Buhler, Erl, Khattak)

How can a large dataset be made amenable to distributed data processing in a Big Data solution environment?

Problem

A very large dataset stored as a single file cannot take advantage of the distributed processing capabilities of a Big Data solution environment. Processing it with centralized data processing techniques instead leads to severe processing latency.

Solution

The dataset is stored in a distributed manner by breaking down the original dataset into multiple parts.

Application

A Big Data storage technology that implements automatic decomposition and storage of a dataset across multiple nodes in a cluster is used to store the dataset.

A distributed file system storage device is employed that automatically divides a large file into multiple smaller sub-files and stores them across the cluster. When a processing engine, such as MapReduce, needs to process the data, it reads each sub-file independently, which enables distributed data processing. The sub-files are automatically stitched together whenever the file needs to be read in a streaming manner or copied to a different storage technology.
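The splitting and stitching behavior described above can be illustrated with a minimal in-memory sketch. It is not the implementation of any particular distributed file system: the function names `decompose` and `reassemble`, the fixed block size, and the round-robin placement are all assumptions made for illustration, and real systems add concerns such as replication that are omitted here.

```python
from typing import Dict, List, Tuple

BlockIndex = List[Tuple[int, str]]          # ordered (block_id, node) entries
NodeStorage = Dict[str, Dict[int, bytes]]   # node -> {block_id: block bytes}


def decompose(data: bytes, block_size: int, nodes: List[str]) -> Tuple[BlockIndex, NodeStorage]:
    """Split data into fixed-size blocks and place them round-robin on nodes.

    Hypothetical helper: the placement policy is an assumption, not taken
    from the pattern description.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if not nodes:
        raise ValueError("at least one node is required")

    index: BlockIndex = []
    storage: NodeStorage = {node: {} for node in nodes}
    for block_id, offset in enumerate(range(0, len(data), block_size)):
        node = nodes[block_id % len(nodes)]
        storage[node][block_id] = data[offset:offset + block_size]
        index.append((block_id, node))
    return index, storage


def reassemble(index: BlockIndex, storage: NodeStorage) -> bytes:
    """Join all blocks in index order to recover the original file."""
    return b"".join(storage[node][block_id] for block_id, node in index)


if __name__ == "__main__":
    index, storage = decompose(b"abcdefghij", block_size=4, nodes=["node-a", "node-b"])
    print(index)
    print(reassemble(index, storage))
```

Keeping an ordered index of which node holds which block is what allows each block to be read independently for processing while still letting the whole file be rebuilt in the correct order for streaming reads or export.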

Dataset Decomposition: The large dataset is automatically split into multiple datasets and stored across multiple nodes in the cluster. Each sub-dataset can then be separately accessed by the processing engine. If the file needs to be exported, all parts are automatically joined together in the correct order to get the original file.

  1. A large dataset is saved as a single file at a central location.
  2. (a,b,c,d) The dataset needs to be processed using a processing engine deployed over a cluster.
  3. (a,b,c,d) To enable distributed data processing, the dataset is imported to a distributed file system that automatically breaks the dataset into smaller datasets spread across the cluster.
  4. (a,b,c,d) The dataset is now successfully processed by the processing engine.
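The steps above can be sketched end to end on a single machine, under several assumptions: a thread pool stands in for the cluster's processing engine, a word count stands in for the processing job, and the dataset is split on record (line) boundaries. The names `split_records`, `count_words`, and `process_distributed` are hypothetical and do not come from the pattern description.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_records(lines: List[str], parts: int) -> List[List[str]]:
    """Break the dataset into at most `parts` contiguous sub-datasets of whole records."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    size = max(1, -(-len(lines) // parts))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_words(part: List[str]) -> Counter:
    """Process one sub-dataset independently of all the others."""
    return Counter(word for line in part for word in line.split())


def process_distributed(lines: List[str], parts: int = 4) -> Counter:
    """Process every sub-dataset in parallel, then merge the partial results."""
    sub_datasets = split_records(lines, parts)
    # Each worker thread plays the role of one cluster node (an assumption).
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(count_words, sub_datasets))
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    dataset = ["a b", "b c", "c d", "d a", "a"]
    print(process_distributed(dataset, parts=4))
```

Because each sub-dataset is processed without reference to the others, the merged result matches what a single centralized pass over the whole dataset would produce; the decomposition only changes how the work is scheduled.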