Automated Dataset Execution (Buhler, Erl, Khattak)

How can the execution of a number of data processing activities starting from data ingress to egress be automated?

Problem

Processing large volumes of varied data, whether it arrives at high or low velocity, requires the repeated execution of a series of tasks. Performed manually, these tasks can significantly slow down data analysis.

Solution

The execution of various data processing tasks as well as the ingress and egress of data is automated.

Application

A component is introduced within the Big Data platform that creates a workflow of activities which can be configured to run automatically.

A workflow engine mechanism is used to create and execute the workflow. Using the interface that the workflow engine provides, either a markup language or a graphical user interface (GUI), the user specifies each operation needed to achieve the required end result. Once the workflow is created, the workflow engine executes it automatically, calling in turn the Big Data mechanism responsible for each workflow step.
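The relationship between a workflow, its steps, and the mechanisms that perform them can be illustrated with a minimal sketch. The `Step` and `WorkflowEngine` names, the `register` and `run` methods, and the convention of passing each step's output to the next step are assumptions made for illustration only; they do not correspond to the API of any particular workflow engine.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """One configured operation in a workflow (hypothetical structure)."""
    name: str        # human-readable label of the step
    mechanism: str   # key of the mechanism that performs the step
    params: Dict[str, Any] = field(default_factory=dict)


class WorkflowEngine:
    """Illustrative engine: runs steps in order by calling registered mechanisms."""

    def __init__(self) -> None:
        self._mechanisms: Dict[str, Callable[..., Any]] = {}

    def register(self, key: str, func: Callable[..., Any]) -> None:
        # Make a mechanism available to workflows under the given key.
        self._mechanisms[key] = func

    def run(self, steps: List[Step], data: Any = None) -> Any:
        # Call the mechanism for each step in turn, feeding it the
        # previous step's output plus the step's own parameters.
        for step in steps:
            if step.mechanism not in self._mechanisms:
                raise KeyError(f"no mechanism registered for {step.mechanism!r}")
            data = self._mechanisms[step.mechanism](data, **step.params)
        return data
```

In this sketch the list of `Step` objects plays the role of the workflow definition that a user would otherwise write in a markup language or draw in a GUI; once defined, it can be passed to `run` any number of times without further manual input.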

The productivity gained by applying the Automated Dataset Execution pattern depends on how many different types of data processing operations the workflow engine can automate, which in turn depends on how many different types of Big Data mechanisms it can invoke. An extensible workflow engine that provides extension points for future integration should therefore be chosen.
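One way to picture an extension point is a registry to which new mechanisms can be added without changing the engine itself. The sketch below is hypothetical: the `mechanism` decorator, the `unsupported` helper, and the operation names are invented for illustration and are not taken from any real workflow engine.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical registry mapping an operation type to the mechanism that handles it.
_REGISTRY: Dict[str, Callable[..., Any]] = {}


def mechanism(operation: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Extension point: register a callable as the handler for an operation type."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        _REGISTRY[operation] = func
        return func
    return decorator


def unsupported(operations: Iterable[str]) -> List[str]:
    """Return the operation types that no registered mechanism can handle."""
    return [op for op in operations if op not in _REGISTRY]


@mechanism("ingress")
def load_records(source: Iterable[Any]) -> List[Any]:
    # Illustrative mechanism: materialize the incoming records.
    return list(source)


@mechanism("cleanse")
def drop_empty(records: Iterable[Any]) -> List[Any]:
    # Illustrative mechanism: discard empty records.
    return [r for r in records if r]
```

Under these assumptions, calling `unsupported(["ingress", "cleanse", "stream_join"])` reports `stream_join` as an operation that would still have to be performed manually. Registering a further mechanism with `@mechanism("stream_join")` closes that gap without modifying the engine.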

Automated Dataset Execution: The set of operations that needs to be executed is specified in the form of a flowchart. The entire flowchart is then executed automatically, without human intervention. This results in a configure-once, execute-often solution.

  1. A user needs to acquire two datasets, cleanse them, join them together, apply a machine learning algorithm to the joined data and then export the results to a dashboard.
  2. The user creates a workflow of all required activities in the workflow engine.
  3. The workflow engine generates a flowchart of activities that need executing.
  4. The entire set of activities, from data ingress to egress, is then scheduled and automatically executed by the workflow engine, which calls the required Big Data mechanisms in turn to perform the configured activities.
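The four steps above can be sketched as a small dependency graph that is executed in topological order. Everything in this sketch is an assumption made for illustration: the sample customer and order records, the task names, the in-memory `dashboard` list standing in for a real dashboard, and the `score` function, which is a trivial placeholder for a machine learning algorithm. Scheduling is omitted; the sketch covers only the automatic execution of an already configured workflow.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

dashboard = []  # stand-in for the dashboard that receives the exported results


def acquire_customers():
    return [{"id": 1, "name": " Ann "}, {"id": 2, "name": None}, {"id": 3, "name": "Bob"}]


def acquire_orders():
    return [{"id": 1, "total": 40.0}, {"id": 3, "total": 160.0}]


def cleanse(rows):
    # Drop rows with missing values and trim whitespace from strings.
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
        if None not in row.values()
    ]


def join(customers, orders):
    # Inner join of the two datasets on the shared "id" field.
    totals = {o["id"]: o["total"] for o in orders}
    return [{**c, "total": totals[c["id"]]} for c in customers if c["id"] in totals]


def score(rows):
    # Placeholder for a machine learning algorithm: flag above-average totals.
    mean = sum(r["total"] for r in rows) / len(rows)
    return [{**r, "high_value": r["total"] > mean} for r in rows]


def export(rows):
    dashboard.extend(rows)
    return len(rows)


# The workflow as a flowchart: task name -> (function, names of upstream tasks).
WORKFLOW = {
    "acquire_customers": (acquire_customers, []),
    "acquire_orders": (acquire_orders, []),
    "cleanse_customers": (cleanse, ["acquire_customers"]),
    "cleanse_orders": (cleanse, ["acquire_orders"]),
    "join": (join, ["cleanse_customers", "cleanse_orders"]),
    "model": (score, ["join"]),
    "export": (export, ["model"]),
}


def run(workflow):
    # Execute every task after its upstream tasks, passing their results along.
    graph = {name: deps for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func(*(results[d] for d in deps))
    return results
```

Calling `run(WORKFLOW)` executes all seven tasks from ingress to egress without further input, which is the configure-once, execute-often behavior the pattern describes. A production workflow engine would additionally handle scheduling, retries, and invocation of external Big Data mechanisms, none of which is modeled here.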