Streaming Storage (Buhler, Erl, Khattak)

How can large datasets be accessed in a way that lends itself to efficient processing of data in batch mode?

Problem

Batch data processing techniques require contiguous blocks of input data to achieve high throughput. However, databases are generally optimized for random access to individual records and do not deliver data as contiguous blocks.

Solution

A Big Data storage device with streaming data access capability is used.

Application

Streaming data access technology is implemented to store datasets for non-random, simple sequential access, which achieves higher data transfer throughput.

A distributed file system storage device is used to enable streaming data access. When data is required for batch processing, only the start position of the file needs to be found; the rest of the file is then output as a continuous stream until the end of the file. Although it enables batch data processing, a distributed file system provides no file search capability: a file can only be accessed at a known location, and its data can only be searched through a sequential scan of the whole file.
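The access model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a local temporary file stands in for a file held on a distributed file system, and the function names `stream_blocks` and `scan_for`, the block size, and the sample data are all hypothetical.

```python
import os
import tempfile


def stream_blocks(path, block_size=64 * 1024):
    """Yield the file's contents as consecutive blocks, from start to end."""
    with open(path, "rb") as f:  # a newly opened file starts at offset 0
        while True:
            block = f.read(block_size)  # each read continues where the last ended
            if not block:  # an empty read means end of file
                break
            yield block


def scan_for(path, needle):
    """Return the line numbers containing `needle`.

    No index is used: every line is read in order, so the cost is a
    full sequential scan of the file.
    """
    with open(path, "rb") as f:
        return [i for i, line in enumerate(f) if needle in line]


# Demo on a small temporary file standing in for a large dataset.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"alpha\nbeta\ngamma beta\n")
    demo_path = tmp.name

blocks = list(stream_blocks(demo_path, block_size=8))
matches = scan_for(demo_path, b"beta")
os.remove(demo_path)
```

Note that `stream_blocks` needs only the file's location, while `scan_for` has to touch every line to answer a query, which mirrors the trade-off described above.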

This pattern is generally applied together with the Large-Scale Batch Processing pattern to provide a complete solution.

Streaming Storage: A storage device capable of non-random data access is used to store large amounts of data in support of batch data processing. Restricting data access to non-random mode allows data to be provided as contiguous blocks without requiring multiple seek operations.

  1. A distributed file system is used to store large amounts of unstructured data.
  2. When the data is required for batch processing, the distributed file system performs a single seek to find the start position of the file. It then streams the file without any further seeks.
  3. This results in very high throughput and reduces the overall data processing time.
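The difference in seek operations between the two access modes can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not a measurement of any real storage device: `SeekCountingFile` is a hypothetical in-memory wrapper that counts seeks, and the block size and data are arbitrary.

```python
import io


class SeekCountingFile:
    """In-memory stand-in for a stored file that counts seek operations."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.seeks = 0

    def seek(self, offset):
        self.seeks += 1
        self._buf.seek(offset)

    def read(self, n):
        return self._buf.read(n)


def read_streaming(f, block_size):
    """One seek to the start of the file, then consecutive reads to the end."""
    f.seek(0)
    parts = []
    while True:
        block = f.read(block_size)
        if not block:
            break
        parts.append(block)
    return b"".join(parts)


def read_random(f, block_size, order):
    """Fetch blocks in an arbitrary order, seeking before every read."""
    blocks = {}
    for i in order:
        f.seek(i * block_size)
        blocks[i] = f.read(block_size)
    return b"".join(blocks[i] for i in sorted(blocks))


data = bytes(range(256)) * 4  # 1024 bytes, i.e. 8 blocks of 128 bytes

streamed = SeekCountingFile(data)
streamed_data = read_streaming(streamed, 128)

scattered = SeekCountingFile(data)
scattered_data = read_random(scattered, 128, [5, 2, 7, 0, 3, 6, 1, 4])

print(streamed.seeks, scattered.seeks)
```

Both readers return the same bytes, but the streaming reader issues one seek regardless of file size, while the random-access reader issues one seek per block.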