Home > Design Patterns > Large-Scale Graph Processing

Large-Scale Graph Processing (Buhler, Erl, Khattak)

How can algorithms that require iterative processing of a large dataset that may or may not contain connected entities be processed in an efficient and timely manner?

Large-Scale Graph Processing

Problem

Implementation of a recursive algorithm for processing a large dataset using a general, distributed processing framework results in an inefficient solution because the algorithm needs to be executed over multiple and separate processing runs.

Solution

A specialized, iteration-friendly processing technique is used that executes the entire algorithm as a single processing run.

Application

A graph processing engine is used that enables passing messages between entities in the dataset and further maintains state between successive iterations.

A graph processing engine, such as bulk synchronous parallel (BSP), is used, which internally implements a message-passing functionality to provide a method of communication between entities that are spread across multiple logical processors within the cluster. Such a processing engine further persists the entity state in memory. Keeping state data in memory provides a means for restarting the next iteration of the job immediately without having to initialize a new processing job.

When faced with extremely large datasets, the performance gains achieved from the application of the Large-Scale Graph Processing pattern may deteriorate, for the intermediate entity state may need to be stored to disk due to memory constraints. Also, due to the specialized nature of the processing engine, graph processing engine-specific implementations for all algorithms may not be available.

Large-Scale Graph Processing: Data is processed using a distributed processing engine that makes use of stateful data processing techniques in support of executing iteration-based algorithms. This helps with massively speeding up the execution time of an iterative algorithm.

Data is processed using a distributed processing engine that makes use of stateful data processing techniques in support of executing iteration-based algorithms. This helps with massively speeding up the execution time of an iterative algorithm.

  1. A shortest path algorithm needs to be applied to a dataset, comprising a set of connected entities, to find the shortest path between Entities A and F.
  2. A graph processing engine is used to implement the shortest path calculation algorithm that is able to execute the entire algorithm within a single processing job in order to find the shortest path.
  3. The entire execution of the shortest path algorithm takes a short time to complete.