Canonical Data Format (Buhler, Erl, Khattak)

How can the same dataset be consumed by disparate client programs?

Problem

Client programs developed in different languages may be unable to read data serialized with a specific encoding. This forces either an amendment to the client program or a format conversion each time the dataset is accessed.

Solution

Store a single copy of the dataset in a common format that is interoperable between disparate clients.

Application

Standardize on an interoperable encoding scheme and serialize the dataset using the standardized encoding.

An interoperable data encoding scheme is selected and adopted as the standard serialization scheme for the Big Data platform, together with a serialization engine capable of encoding and decoding data in that scheme. To save time and avoid unnecessary use of processing resources, the data transfer engine can be configured to output data directly in the standardized encoding. In other cases, especially with relational data or datasets acquired from third-party sources such as data markets, the datasets must first be processed to shape the data into the format required by the standardized encoding scheme.
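The shaping step for relational or third-party data can be illustrated with a minimal sketch. It assumes JSON Lines as the standardized encoding purely because it ships with Python's standard library; the pattern itself does not name a specific encoding. The sample rows, `FIELD_TYPES`, `to_canonical`, and `from_canonical` are hypothetical names introduced for illustration.

```python
import csv
import io
import json

# Hypothetical relational export from a third-party source (CSV text).
RAW_CSV = "id,name,signup_date\n1,Ada,2021-03-04\n2,Lin,2022-11-30\n"

# Hypothetical field types agreed on for the standardized encoding.
FIELD_TYPES = {"id": int, "name": str, "signup_date": str}


def to_canonical(raw_csv):
    """Shape CSV rows into JSON Lines: one typed record per line."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    lines = []
    for row in reader:
        # Cast each column from CSV text to its agreed type.
        record = {field: cast(row[field]) for field, cast in FIELD_TYPES.items()}
        lines.append(json.dumps(record, sort_keys=True))
    return "\n".join(lines) + "\n"


def from_canonical(payload):
    """Decode the canonical payload back into a list of records."""
    return [json.loads(line) for line in payload.splitlines() if line]


canonical = to_canonical(RAW_CSV)   # done once, when the data is ingested
records = from_canonical(canonical)  # done by any client, with no conversion
```

In this sketch the conversion cost is paid once at ingestion, so every later read works against the same stored copy.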

Canonical Data Format: A canonical, extensible serialization format is chosen for storing data so that disparate clients can read and write it. This removes the need to perform data format conversion or to keep multiple copies of a dataset in different formats. The canonical serialization format is generally schema-driven, meaning that it carries information about the structure of the data.

A dataset is serialized into a common format that is then consumed by three disparate clients without the need to perform any data format conversion.
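The schema-driven, extensible idea can be sketched as follows, assuming a self-describing JSON document that stores the schema next to the records. The schema layout, the `default` convention for fields added later, and the names `SCHEMA`, `serialize`, and `read_fields` are assumptions made for illustration, not part of the pattern or of any particular serialization library. All three clients are simulated in one language here, whereas the pattern targets clients written in different languages.

```python
import json

# Hypothetical schema. "country" stands for a field added after the first
# records were written; its default keeps older records readable.
SCHEMA = {
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},
    ]
}


def serialize(schema, records):
    """Store schema and records together as one self-describing document."""
    return json.dumps({"schema": schema, "records": records})


def read_fields(payload, wanted):
    """Generic reader: consult the embedded schema, fill declared defaults."""
    doc = json.loads(payload)
    known = {f["name"]: f for f in doc["schema"]["fields"]}
    rows = []
    for rec in doc["records"]:
        row = {}
        for name in wanted:
            if name not in known:
                raise KeyError("field not in schema: " + name)
            row[name] = rec.get(name, known[name].get("default"))
        rows.append(row)
    return rows


# A single stored copy; the second record predates the "country" field.
payload = serialize(SCHEMA, [
    {"id": 1, "name": "Ada", "country": "UK"},
    {"id": 2, "name": "Lin"},
])

# Three simulated clients read the same payload, each taking what it needs.
client_a = read_fields(payload, ["id", "name"])
client_b = read_fields(payload, ["country"])
client_c = read_fields(payload, ["id", "country"])
```

Because the structure travels with the data, each client discovers the available fields from the payload itself instead of relying on a private copy in its own format.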