Like many other messaging products and services, the services I build with my team at Microsoft mostly take a neutral stance towards payload data. We move byte arrays and streams. In fact, in my team we made it a hard principle to never touch the message payload inside our services. The upside of that stance is that we can easily support end-to-end payload encryption, because we don’t attempt to make any decisions based on the content.

But: Applications need to make hard choices about how they encode their data, and therefore I’m sharing some in-depth guidance that I initially wrote for an early draft of the Azure IoT Reference Architecture, but which didn’t make it into the final document in its entirety due to concerns about its overall size and depth. The guidance applies quite broadly to messaging and not only to IoT. This is the “Director’s Cut”:

Introduction

There is a large and growing number of formats available for encoding structured data for communication purposes, and the optimal data encoding choice differs from use-case to use-case and is sometimes even constrained by factors like the available code library footprint on a device.

  • JSON and XML (yes, still) are ubiquitous on the server and many clients and enjoy very broad library or platform-inherent support, but cause a very significant wire footprint.
  • CSV is simple, interoperable, and compact (for text), but it’s structurally constrained to rows of simple value columns – which is very often enough for time-series data.
  • BSON, CBOR, and MessagePack are efficient binary encodings that lean on the JSON model and have great encoding size advantages, but require their own libraries and bring about some idiosyncratic choices like no first-class array support in the case of BSON.
  • Protobuf and Apache Thrift yield very small encoding sizes, but require distribution of an external schema (or even code) to all potential consumers, which is a prohibitive requirement in systems of nontrivial composition complexity.
  • Apache Avro is generally as efficient as or more efficient than the prior options, natively supports layered-on compression, and can carry the required schema as a preamble – though carrying the preamble puts Avro at a disadvantage compared to MessagePack, CBOR, or BSON for small or highly structured payloads with minimal structural repetition.

This list is not complete, but reflects the most popular options from what I can see.

Just as important as the encoding is the data layout, which can also have a major impact on encoding size. A naïve JSON encoding approach, where telemetry data is sent as an array of objects and each object carries explicit properties for all values, has enormously greater metadata overhead than a data layout mimicking CSV, with a shared list of headers followed by an array of arrays carrying the row data.
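
To make the difference concrete, here is a minimal sketch in Python (standard json module only; the field names are hypothetical but match the examples used later in this article) that compares the wire size of the two layouts:

import json

# Naïve layout: every record repeats its property names.
naive = [
    {"id": "1221DEF", "ts": 1417128869 + i, "temp": 20.2, "rpm": 3202, "hum": 52}
    for i in range(100)
]

# CSV-like layout: property names are stated once, rows carry only the values.
compact = {
    "_h": ["id", "ts", "temp", "rpm", "hum"],
    "_d": [["1221DEF", 1417128869 + i, 20.2, 3202, 52] for i in range(100)],
}

print(len(json.dumps(naive)))    # metadata repeated per record
print(len(json.dumps(compact)))  # metadata stated once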

Data Structure Considerations

The most common data encodings fall into three broad groups in terms of their approach to data structuring:

  • Comma-Separated-Values (CSV, including all other kinds of separators), and practically all other tabular data representations, lay out data records as rows with the record data being split into columns. The column definitions uniformly apply to all rows, but rows may be “sparse”, meaning that a particular row may only carry values for a subset of the columns. The rows/columns structure is very suitable for time-series information.

  • XML and HTML use a structural model based on a nominally unbounded tree structure made up of nodes, whereby “elements” can be annotated with qualifying attributes and may contain other nodes, including plaintext content. This structural model is most suitable for carrying and describing distinct sets of complex content that is to be processed by generic infrastructure components, such as a browser rendering a web page.

  • JSON and many other encodings, of which several will be explicitly discussed further on, use a structural model that is very closely aligned with the structural models used in the most popular programming languages and frameworks. Values are either held in (one-dimensional) arrays or in maps, whereby maps are dictionaries with uniquely keyed entries holding values. Values are of primitive types or are arrays or maps. Multidimensional arrays are modeled as arrays of arrays. The map/array/value structural model is the most universal, and I generally recommend it, since the rows/columns and elements/attributes models can be expressed on top of it, while the reverse is not true.

For most application-to-application scenarios, the map/array/value structural model is generally preferable, and that ultimately also explains the success JSON had against XML, in spite of XML initially being the staunchly defended darling of the standards establishment.

Structural Metadata

There are several models for how data is described with metadata, in terms of data types and item identifiers, providing the system information about the particular layout and allowing data items to be identified and appropriately encoded and decoded.

The most common models are:

  • External schema – With external schema models, the description of data types and structure is shared or distributed separately from the communication path over which the data is exchanged or from where the data is stored. Data encodings like Apache Thrift or Google Protocol Buffers (“protobuf”) use this approach with the goal of reducing the data volume transferred over the communication path or stored on media.

  • Schema preamble – Schema preamble models separate the schema information from the data, but carry the schema with the data at all times. CSV commonly uses a descriptive schema preamble in the form of a header line that provides identifiers for the columns, and the column data types can commonly be inferred from the data itself. Apache Avro has a formal schema language, allowing for description of complex structures, and a copy of the schema is, as a matter of principle, always carried as a preamble with any Avro data container.

  • Tagged data – In JSON and many other encodings, data is tagged, with each data item individually carrying the identifier (where needed) and the data type.

While the encodings using external schema, like Protobuf and Thrift, do achieve the goal of reduced footprint, the cost imposed on a complex system is enormous, as the external schema must be distributed and synchronized throughout all system components that need to process the information. A common mitigation is to hold the schema information in shared registries.

Information that is durably stored and somehow gets separated from the external metadata is effectively rendered unusable through that separation. It’ll be a moment of intense grief when you’re in a highly regulated engineering field like in automotive or aerospace, open a raw certification telemetry data archive from cold storage in 10 years for an accident investigation, and someone forgot to keep the associated schema service going.

I therefore strongly discourage using any data encoding requiring external schema for durable storage. Furthermore, I do not recommend using any data encoding requiring external schema for any scenario where the two communicating endpoints are not under common control or where it is not practical to near-simultaneously upgrade/change the schema at both ends of the communication path, even if the data encoding supports additive changes.

The marginal efficiency advantages Thrift or Protobuf may have over encodings like MessagePack or Avro in certain scenarios will always be severely overshadowed by the burden of external schema management, which becomes excessively more complex as systems evolve.

Encoding Formats

In the following, I discuss several data encodings with usage scenarios, whereby all of them use either the schema preamble or the tagging model. That means I am excluding Protobuf and Thrift a priori because of the external schema concerns laid out above.

The descriptions are brief, and I encourage you to study the linked specifications or overviews.

JSON – JavaScript Object Notation

JSON (IETF RFC8259) is a lightweight, text-based data interchange format providing a map/array/value structural model that is derived from a subset of JavaScript (ECMAScript). JSON is quite easy to parse and trivial to generate and therefore ubiquitously available or easily implementable anywhere.

JSON is a good default choice for all structured data, at rest and in motion, as it is the most interoperable option with the broadest reach. Practically speaking, a solution might end up never using JSON given the format options listed below, but JSON ought always to be a supported option on all processing and communication paths.

As JSON is text-based, it has efficiency disadvantages compared to binary formats, and those will often be preferable in scenarios where storage or communication path footprint or encoding effort are of concern. JSON is, however, always the most interoperable option. Because it is text, it’s also a more robust long-term archival choice. You’ll surely be able to read JSON in 30 years; that’s somewhat less assured with binary formats that are largely defined by a particular implementation.

Unless otherwise specified in the communication transport frame, all JSON text is assumed to use the UTF-8 (IETF RFC3629) text encoding.
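
As a minimal sketch (Python, standard library only), encoding a record to UTF-8 JSON for the wire and decoding it back into a map/array/value graph looks like this:

import json

record = {"id": "1221DEF", "ts": 1417128869, "temp": 20.2, "rpm": 3202, "hum": 52}

# Encode to UTF-8 bytes for transfer or storage.
payload = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Decode back into a map/array/value graph.
decoded = json.loads(payload.decode("utf-8"))
assert decoded == record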

CSV – Comma Separated Values

CSV (IETF RFC4180) is a very broadly used and very simple convention for encoding tabular data made up of rows and columns. RFC4180 is an attempt at standardizing the convention, but “comma-separated” data factually occurs in a broad variety of forms with semicolons, the vertical bar (pipe) symbol, tabs, and other characters used as separators and with data occurring quoted or unquoted.

The advantage of CSV is that it allows for a quite compact text encoding of tabular data, with a schema preamble in the header followed by data with little overhead except for separators, and thus a much more compact rendering than a (naïve) JSON encoding that uses an array of records with repetitive metadata per record. The downside of CSV is, as mentioned, the lack of a reliable standard or convention and thus the absence of a type model. In lieu of that, I am suggesting the following constraints:

  • In extension of RFC4180, UTF-8 (IETF RFC3629) character data is used for text encoding

  • In extension of RFC4180, additionally allowed separator characters are the semicolon (;), space (0x20), tab (0x09), and the vertical bar (|), but comma (,) is strongly preferred.

  • As a constraint to RFC4180, all CSV files and streams MUST have a header line with the column names. Column names may occur quoted or unquoted. The column name must comply with the JSON rule for constructing strings (RFC7159, Section 7).

  • As an extension to RFC4180, JSON type inference is used for column data during decoding.

    • All data occurring in surrounding single quotes (‘) or double quotes (“) is treated as string data, whereby the quotes are removed and do not count towards the string data.

    • All column data that is a valid numeric JSON expression (RFC7159, Section 6) is treated as a number of the furthermore inferred subtype.

    • All column data that is a valid JSON null, true, or false value (RFC7159, Section 3) is interpreted either as Boolean value or Null as applicable.

    • An empty column (no data or only unquoted whitespace data) is Null.

    • All other column data is treated as string.

With the above rules applied, CSV is preferred over JSON for encoding of tabular data where all columns carry data of primitive types, when a minimum of two rows is commonly expected.
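
Here is a minimal decoding sketch that follows the type-inference rules above, using Python’s standard csv and json modules; the infer helper is my own illustration, not part of any standard or library:

import csv
import io
import json

def infer(value):
    v = value.strip()
    # Quoted data is string data; the surrounding quotes are removed.
    if len(v) >= 2 and v[0] == v[-1] and v[0] in ("'", '"'):
        return v[1:-1]
    # Empty (or whitespace-only) columns decode as null.
    if v == "":
        return None
    # Valid JSON numbers, true, false, and null are taken as such.
    try:
        parsed = json.loads(v)
        if parsed is None or isinstance(parsed, (bool, int, float)):
            return parsed
    except ValueError:
        pass
    # Everything else is treated as a string.
    return value

text = "id,ts,temp,rpm,hum\n1221DEF,1417128869,20.2,3202,52\n"
rows = csv.reader(io.StringIO(text))
header = next(rows)
records = [dict(zip(header, (infer(c) for c in row))) for row in rows]
# [{'id': '1221DEF', 'ts': 1417128869, 'temp': 20.2, 'rpm': 3202, 'hum': 52}]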

Apache Avro

Apache Avro (Specification) is a data serialization system developed in the Apache Foundation, which features a very compact (and fairly straightforward) binary data encoding format, a formal schema language, and implementations across a number of languages and platforms, including Java, C#, C/C++, and Node.js, which are most relevant in server-side processing.

While Avro requires a schema for the encoder and decoder logic to function, it defines a container model where the JSON-encoded formal schema can be carried as a preamble for the encoded data. A suitable schema for encoding into Avro can be dynamically inferred from a given object graph, which means that a schema is always available for decoding and a schema can always be synthesized for encoding from any given concrete graph. That being so, Avro is a very suitable alternative to JSON.

Avro yields an extremely compact data encoding that can be further improved by data compression, which is also directly supported by the specification and the library implementations.
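
As a sketch, assuming the third-party fastavro package for Python (its writer/reader functions carry the JSON schema as a container preamble; verify the API against your version), writing and reading a compressed Avro container looks like this:

from io import BytesIO
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "FanReading",
    "fields": [
        {"name": "ts", "type": "long"},
        {"name": "temp", "type": "double"},
        {"name": "rpm", "type": "int"},
        {"name": "hum", "type": "int"},
    ],
})

records = [{"ts": 1417128869 + i, "temp": 20.2, "rpm": 3202, "hum": 52} for i in range(1000)]

buf = BytesIO()
# The container carries the schema as a preamble; the codec enables block compression.
writer(buf, schema, records, codec="deflate")

buf.seek(0)
decoded = list(reader(buf))  # the reader recovers the schema from the preamble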

Because of the schema preamble being carried as plaintext JSON, the Avro encoding can only start playing out its strength once the data encoding savings eclipse the size of the schema preamble when compared to a JSON encoding or an encoding in one of the alternate binary formats explained below.

I recommend Apache Avro as the preferred binary encoding for transferred time-series data and all other structured data with significant structural repetition, obviously only if an Avro implementation is available for the devices in question. I also recommend Apache Avro as the preferred service-side media storage format for structured data due to its compactness and native support for data compression.

Apache Avro use should, however, be carefully considered in all cases where data must be preserved in archives and outside the system context for extended periods of time. Plain text formats take up more space, but the lack of dependency on a particular binary format specification and implementation of such a specification reduces the risk of the data not being decipherable decades into the future.

AMQP Encoding

The AMQP 1.0 Protocol (ISO/IEC 19464:2014, OASIS) includes a compact binary type encoding providing a map/array/value structural model. The AMQP encoding is a tagged format that is significantly more efficient than JSON, and has a higher fidelity type system for numeric types and date-time expressions.

The advantage of AMQP 1.0 encoding is that the encoder is readily available as part of any AMQP 1.0 client stack and therefore doesn’t require adding another library to the overall client library footprint, which is often a concern in embedded systems scenarios.

Generally, when AMQP 1.0 is used as a transport, AMQP type encoding is technically superior to JSON. The downside of choosing AMQP encoding is that the encoder/decoder is typically tied to the transport stack, meaning that it’s a problematic choice when the encoding/decoding from/into object graphs doesn’t happen right at the messaging API boundary or when the messaging API is abstracted (like in JMS).

AMQP is a compact choice for single records and highly structured information with minimal structural repetition, but is less efficient than Apache Avro for data with highly repetitive structural elements like time-series data.

Except in pure AMQP scenarios that aim for maximum efficiency while using a single encoding stack, it’s not a preferable choice for payload encoding due to standalone AMQP encoders not being in widespread use.

MessagePack Encoding

MessagePack (Specification) is a very compact binary encoding providing a map/array/value structural model. MessagePack is a schemaless, tagged format. It is more efficient than JSON and the AMQP encoding.

Apache Avro’s schema-preamble strategy and native compression support still yield significant advantages over MessagePack for data with highly repetitive structural elements, but MessagePack is a great encoding choice for single records and highly structured information with minimal structural repetition.
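
A quick sketch, assuming the msgpack package for Python, comparing a single tagged record against its JSON rendering:

import json
import msgpack

record = {"id": "1221DEF", "ts": 1417128869, "temp": 20.2, "rpm": 3202, "hum": 52}

packed = msgpack.packb(record)  # tagged, schemaless binary encoding
text = json.dumps(record, separators=(",", ":")).encode("utf-8")

print(len(packed), len(text))   # the binary form is noticeably smaller
assert msgpack.unpackb(packed, raw=False) == record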

When Avro is not an option for structural reasons, whether a solution opts for AMQP or MessagePack encoding depends on protocol use, library availability, and library footprint considerations. AMQP encoding is essentially a very reasonable and only slightly less efficient fallback option whenever MessagePack can’t be used, or when AMQP’s ISO/IEC standardization matters for policy reasons.

Like AMQP, MessagePack encoding is not recommended for bulk data storage.

CBOR Encoding

The Concise Binary Object Representation (CBOR) (Specification) is another compact binary encoding providing a map/array/value structural model, blessed by IETF in RFC7049. Like MessagePack, it’s a schemaless, tagged format and also has quite a few implementations, even though not quite as many as MessagePack.

If MessagePack is a good choice for a scenario, CBOR will likely be a similarly good choice and my guidance is equivalent.
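
The equivalent sketch, assuming the cbor2 package (one of several CBOR implementations for Python):

import cbor2

record = {"id": "1221DEF", "ts": 1417128869, "temp": 20.2, "rpm": 3202, "hum": 52}

encoded = cbor2.dumps(record)   # tagged, schemaless binary encoding
assert cbor2.loads(encoded) == record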

Data Layout Convention for Map/Array/Value Encodings

While data encoding refers to how data is transformed to and from bits for transfer and storage, a data layout convention describes how the structure of the data is constrained so that data can be handled universally across a system.

The data layout convention presented here should be understood as a guidance model informing schema definitions at the solution layer, with message and data schemas following the layout conventions below, and adding concrete semantics and data type choices for particular data items on top of the generic foundation.

The most important principle around data layout is that the data unit handled and processed in the context of this model is a record, not a message, storage block, or document. Each of these storage or messaging transfer units may contain one or multiple data records (or events). A sequence of records may span multiple messages or storage units. In the following, “message” will refer both to messages and to storage blocks or documents.

The row/column model of CSV provides a natural set of constraints for the layout: it only allows for a list of rows (not explicitly bounded), each equating to a record, with a set of columns (also not explicitly bounded), whereby each column value is of primitive type.

For the map/array/value structural model supported by the JSON, Avro, AMQP, and MessagePack data encodings, the following layout models are proposed:

Singular Record

A singular record occurs as a distinct object inside of a message. It may reside at the root of the message, or it may be uniquely and unambiguously identifiable through a reference expression. A single record is laid out as a map (dictionary). The values of a single record may be of primitive types, arrays, or objects.

One message may contain multiple, independent singular records.

In the simplest case, the record is equivalent to the payload:

{
    "fan" : {
        "id": "1221DEF",
        "ts": 1417128869,
        "temp": 20.2,
        "rpm": 3202,
        "hum": 52
    }
}

Records may also be found nested in the message payload graph:

{
    "id" : "1221DEF",
    "ts": 1417128869,
    "fan" : {
        "temp": 20.2,
        "rpm": 3202,
        "hum": 52
    },
    "lamp" : {
        "state" : "on",
        "bulbA" : {
            "bulbhours" : 1219
        },
        "bulbB" : {
            "bulbhours" : 12,
            "state" : "broken"
        }
    }
}

In queries

  • “fan” selects the “fan” record
  • “lamp.bulbA” selects the “bulbA” record inside the lamp object

For the selected record, all properties of its container objects might become properties of the record itself, with values of inner objects overriding values of outer objects.

If “bulbA” were chosen as a record, it will inherit “id”, “ts”, and “state” from the outer objects and its effective shape will be:

{
  "bulbA" :
  {
      "id" : "1221DEF",
      "ts": 1417128869,
      "state" : "on",
      "bulbhours" : 1219
  }
}

If “bulbB” were chosen as a record, it will inherit “id” and “ts” and override “state”, and its effective shape will be:

{
  "bulbB" :
  {
     "id" : "1221DEF",
     "ts": 1417128869,
     "state" : "broken",
     "bulbhours" : 12
  }
}
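
The selection and inheritance rules can be expressed in a few lines of Python; the select_record helper below is purely illustrative (not part of any library), returns the record body without the wrapping key, and assumes dot-separated reference expressions as in the examples above:

def select_record(message, path):
    inherited, node = {}, message
    for name in path.split("."):
        # Collect the primitive properties of each container on the way down;
        # values of inner objects override values of outer objects.
        inherited.update({k: v for k, v in node.items() if not isinstance(v, dict)})
        node = node[name]
    return {**inherited, **node}

message = {
    "id": "1221DEF",
    "ts": 1417128869,
    "lamp": {
        "state": "on",
        "bulbA": {"bulbhours": 1219},
        "bulbB": {"bulbhours": 12, "state": "broken"},
    },
}

select_record(message, "lamp.bulbA")
# {'id': '1221DEF', 'ts': 1417128869, 'state': 'on', 'bulbhours': 1219}
select_record(message, "lamp.bulbB")
# {'id': '1221DEF', 'ts': 1417128869, 'state': 'broken', 'bulbhours': 12}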

Record Sequence

A record sequence occurs as an array inside of a message. It may reside at the root of the message, or it may be uniquely and unambiguously identifiable through a reference expression. The records inside the array follow the above rules for singular records. Records inside the array don’t need to be entirely homogeneous, meaning they don’t need to have identical sets of properties, but all properties that do overlap by identifier must be of the same type.

{
  "fan" :
  [
    {
      "ts": 1417128859,
      "temp": 20.2,
      "rpm": 3202,
      "hum": 52 },
    {
      "ts": 1417128863,
      "temp": 20.2,
      "rpm": 3202,
      "hum": 52
    }
  ]
}

In queries,

  • “fan” selects the “fan” sequence
  • “fan[0]” selects a singular record

The same inheritance rules apply as for single records:

{
  "ts": 1417128859,
  "fan" :
  [
    {
      "temp": 20.2,
      "rpm": 3202,
      "hum": 52 },
    {
      "temp": 20.2,
      "rpm": 3202,
      "hum": 52
    }
  ]
}

For the selected record sequence, all properties of its container objects may become properties of all records in the sequence, with values of inner objects overriding values of outer objects.
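
In code, distributing the container properties over a sequence is a one-liner; this sketch (plain Python, illustrative only) applies the same override rule per record:

def expand_sequence(container_props, rows):
    # Container properties become properties of every record; record values win.
    return [{**container_props, **row} for row in rows]

expand_sequence({"ts": 1417128859}, [{"temp": 20.2, "rpm": 3202, "hum": 52}])
# [{'ts': 1417128859, 'temp': 20.2, 'rpm': 3202, 'hum': 52}]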

Record Sequence with Metadata Preamble

The metadata preamble layout model adopts the strategy of CSV for more efficient encoding for flat records with tagged data formats.

The record sequence is encoded as an array of arrays inside of an object that has two properties:

{
  "_h" :["id","ts","temp","rpm","hum"],
  "_d" :[
     ["1221DEF",1417128869,20.2,3202,52],
     ["1221DEF",1417128871,20.2,3202,52]
  ]
}
  • “_h” (for header) holds an array of strings containing the column names
  • “_d” (for data) holds an array (rows) of arrays (columns).

The columns may diverge in type, as permitted by JSON and MessagePack. The data is modeled as a list in AMQP encoding and as an array of union types in Avro.
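
Converting between a list of flat records and the preamble layout is straightforward; the helper names below are illustrative only, while _h and _d are as defined above:

def to_preamble(records):
    # State the union of property names once, then emit rows of values.
    header = list(dict.fromkeys(k for r in records for k in r))
    return {"_h": header, "_d": [[r.get(k) for k in header] for r in records]}

def from_preamble(payload):
    header = payload["_h"]
    return [dict(zip(header, row)) for row in payload["_d"]]

rows = [
    {"id": "1221DEF", "ts": 1417128869, "temp": 20.2, "rpm": 3202, "hum": 52},
    {"id": "1221DEF", "ts": 1417128871, "temp": 20.2, "rpm": 3202, "hum": 52},
]

assert from_preamble(to_preamble(rows)) == rows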

Encoding Decision Matrix

For deciding which of the presented encodings to choose, this matrix may help. In the use-case column, “flat” refers to records that solely consist of primitive data types. “Complex” refers to data where records contain nested object structures. The Avro column assumes use of containers with inline schema.

Use-Case                         JSON   CSV   Avro   AMQP   MsgPack/CBOR
Single-Record, Flat Data          +      o     -      ++     ++
Single-Record, Complex Data       +      -     -      ++     ++
Record Sequence, Flat Data        o      +     ++     o      o
Record Sequence, Complex Data     o      -     ++     +      +
Record Sequence w/ Preamble       +      ++    ++     ++     ++

Symbol key:

  • - excessive overhead or data representation issues
  • o significant overhead
  • + good fit
  • ++ best fit, least overhead

Summary

As a summary, I’d like to suggest embracing the “Content-Type” declaration available in protocols like HTTP, MQTT (5.0+), and AMQP, and choosing the appropriate data layout and encoding per use-case. Make data encoding and layout choices an explicit engineering decision, and don’t blindly pick a format to rule them all. Also be deliberate about long-term storage choices and really think hard about taking dependencies on formats requiring external schema references outside of RPC scenarios.

PS: There has been some feedback that I should have included one or more of the ASN.1 encodings, in particular DER. Since ASN.1 is a schema format, my concerns are substantially the same as for Protobuf and Thrift: Use external schema with caution.
