MQTT. An Implementer’s Perspective

A few weeks ago, I sat down in front of an empty C# project with a printout of the latest OASIS MQTT 3.1.1 specification review draft and started to implement the protocol from scratch.

There were several reasons, including a few non-technical ones, not to pick up an existing implementation such as Paolo Patierno's M2Mqtt library (which I'm using as a test client). One of them was that I required a server implementation with a certain shape of hooks, but a key reason was also that I wanted to understand the MQTT protocol at an implementer's level.

As I started, I had a good but still cursory understanding of MQTT, which was probably at about the same level as anyone reading the "the de-facto standard protocol of Internet of Things" claims in semi-technical articles that cover its existence, but not its function. Published consensus is that it's very compact, it's easy to implement, and it originates from and is backed by IBM, and therefore must be a default good choice for device scenarios.

After implementing most of it, and I will explain which parts I left out and why, I am very disappointed.

Two exemplary scenarios I have in mind as I write this are bi-directional, long-haul communication with moving vehicles on GSM/LTE with national cross-carrier roaming, and bi-directionally connected Smart Meters on 802.15.4 based networks over unlicensed, and thus potentially very congested, public frequency spectrum. These are two key volume scenarios for the "Internet of Things" as I see it shaping up. Interestingly, you can read success stories for MQTT for these exact scenarios; and I do have some sense for how well things are really going in some of those.

The conclusion I will explain in this post is that MQTT is not a good protocol for long-haul communication (i.e. across the Internet), especially not when the going gets tough. It's also not a particularly well-designed protocol. That is also why this article is as long as it is.

Before I get into the details, there's a little bit of backstory that ought to be told, and that backstory is about IBM and the context in which MQTT came into being. As you consider the following, mind that while I work at Microsoft, this is my personal perspective; my posts are not read, reviewed, or approved by marketing. I care about stuff working right, and about making stuff work right, and I also care about honesty and transparency in engineering.

IBM has a very successful enterprise messaging business and has had it for many years; related product names are MQSeries and WebSphereMQ. "Successful" is an understatement. They dominate the space. As they dominate, IBM has held the MQ wire protocol under tight wraps to this day. The Advanced Message Queuing Protocol (AMQP) development effort started as a customer-driven initiative of Wall Street banks aiming to create an alternative messaging protocol with the goal of breaking out of that lock-in.

It is very interesting to observe how IBM are now playing open-protocol champions, having repurposed the "MQ Integrator SCADA Device Protocol" into MQTT and driving community efforts on the connected devices front, while still keeping MQ closed and conveniently positioning a fairly expensive messaging appliance offering as a bridge.

That appliance speaks MQTT out to the device-side and MQ out to the backend-side. IBM has steadfastly refused to join the AMQP effort from the earliest days, so it doesn't seem like the motivation behind their strategy is ubiquitous messaging interoperability. I believe, personally, that IBM has published MQTT specifically to segregate messaging protocols in order to protect the MQ business. I believe IBM kept and keeps MQTT intentionally limited. Yes, IBM indeed has an AMQP 1.0 protocol runtime in Beta called MQ Light, which seems like a nice way to funnel AMQP traffic into MQ without implementing AMQP. But this article is not about AMQP. It's about MQTT.

MQTT is not a messaging protocol; I would call it a funnel protocol serving to move binary frames of data, preferably from constrained clients into data collection systems. It's too limited for actual messaging work, which requires message metadata that a broker can work with. It is doing reasonably well at a very, very narrow set of use-cases and it is terrible at everything that goes beyond those use-cases. What it's reasonably good at is best-effort, raw-data ingestion from clients and best-effort raw-data delivery to clients using a solicit-push pattern (I'll have an explanation later). And as it turns out, the things MQTT is good at can be done in much simpler ways, while retaining more flexibility at the same time.

As we go through MQTT, the text will have many hyperlinks to various places in the MQTT specification, so there's no great danger for me to get off the rails with regards to the facts. Mind that the hyperlinks can't go to precise sections because the OASIS MQTT 3.1.1 specification doesn't have a lot of hyperlink anchors.

My goal is to simultaneously explain MQTT coarsely (go to the spec for details) and then comment on it.

Connection Model

MQTT is a session-oriented protocol overlaid over a stream transport protocol that has a clear notion of a client and a server [MQTT 4.2]. On TCP, MQTT clients connect to the server port 1883 for plaintext communication (the IANA port registration shows as ibm-mqisdp). Using TCP with overlaid TLS, MQTT clients connect to the server port 8883 for secured communication. [MQTT 5.1]

An MQTT connection over an existing (TLS-) socket is established through a handshake, with the client sending a connect packet, and the server replying with a connack packet. When the session is rejected, the server will terminate the connection after connack has been sent, which will then contain an error code.

The connect packet allows initializing quite a few capabilities. Because it is the first message flowing, it carries a protocol name and version indicator. It also carries a set of protocol flags that describe the connect message itself and how the broker shall treat it, and depending on those flags it optionally carries authentication information (username/password), and also a "Will" message.
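To make that layout concrete, here is a minimal sketch of how a connect packet could be assembled for MQTT 3.1.1, including the optional "Will" and credential fields (the "Will" is explained right after this). The function and parameter names are my own illustration and not taken from any library; the flag bit positions follow my reading of [MQTT 3.1.2].

```python
import struct


def encode_string(text: str) -> bytes:
    """Two-byte big-endian length prefix followed by UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack("!H", len(data)) + data


def encode_remaining_length(n: int) -> bytes:
    """7-bit variable length encoding used in the fixed header."""
    out = bytearray()
    while True:
        n, digit = divmod(n, 128)
        if n > 0:
            digit |= 0x80  # continuation bit: more length bytes follow
        out.append(digit)
        if n == 0:
            return bytes(out)


def build_connect(client_id, keep_alive=60, clean_session=True,
                  will_topic=None, will_message=b"", will_qos=0,
                  will_retain=False, username=None, password=None) -> bytes:
    """Hypothetical helper: assemble an MQTT 3.1.1 connect packet."""
    flags = 0x02 if clean_session else 0x00
    payload = encode_string(client_id)
    if will_topic is not None:
        flags |= 0x04 | (will_qos << 3)      # will flag plus will QoS
        if will_retain:
            flags |= 0x20
        payload += encode_string(will_topic)
        payload += struct.pack("!H", len(will_message)) + will_message
    if username is not None:
        flags |= 0x80
        payload += encode_string(username)
    if password is not None:
        flags |= 0x40
        payload += struct.pack("!H", len(password)) + password
    # protocol name, protocol level 4 (= 3.1.1), connect flags, keep alive
    header = encode_string("MQTT") + bytes([4, flags]) + struct.pack("!H", keep_alive)
    body = header + payload
    return bytes([0x10]) + encode_remaining_length(len(body)) + body


packet = build_connect("c1", will_topic="gone", will_message=b"bye", will_qos=1)
print(packet.hex(" "))
```

Note how the flags byte in the "header" decides which fields appear in the "payload" and in which order; that coupling comes up again further down.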

"Will" [MQTT 3.1.2.5] is an interesting concept. It allows the client to park a message with the server for the duration of the session, and that message gets published on the server to a specified "Will Topic" once the session gets unexpectedly torn down for any reason. A clean "disconnect" from the client will cancel the "Will", i.e. it will not be sent.

Message Structure

One of MQTT's goals is for it to be super-compact on the wire. That's also arguably one of its greatest appeals. To that end, the preamble of each message can be as tiny as 2 bytes. The first byte splits into two nibbles [MQTT 2.2].

The first nibble (4 bits) indicates the packet type and the second nibble holds special flags related to that packet type. The packet type serves as the operation selection criterion for the MQTT stack. The protocol can therefore accommodate exactly 16 different packet types, of which 14 are currently used and 2 have a hard reservation, causing a mandatory protocol violation error when used [MQTT 2.2.1].
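As a small illustration, splitting that first byte could look like the sketch below. The table lists the 14 packet types of MQTT 3.1.1; the function name is made up for this example, and raising an exception stands in for whatever a real stack would do on a protocol violation.

```python
# The 14 packet types defined by MQTT 3.1.1; values 0 and 15 are reserved.
PACKET_TYPES = {
    1: "CONNECT", 2: "CONNACK", 3: "PUBLISH", 4: "PUBACK",
    5: "PUBREC", 6: "PUBREL", 7: "PUBCOMP", 8: "SUBSCRIBE",
    9: "SUBACK", 10: "UNSUBSCRIBE", 11: "UNSUBACK",
    12: "PINGREQ", 13: "PINGRESP", 14: "DISCONNECT",
}


def split_first_byte(first: int):
    """Return (packet type name, flags nibble) for the first header byte."""
    packet_type = first >> 4   # upper nibble selects the operation
    flags = first & 0x0F       # lower nibble carries type-specific flags
    if packet_type not in PACKET_TYPES:
        raise ValueError(f"reserved packet type {packet_type}")
    return PACKET_TYPES[packet_type], flags


print(split_first_byte(0x82))
```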

The second byte is the start of the packet length indicator, which is a sequence of 7-bit integers (value sits in bits 0-6). Whenever bit 7 is set, the next byte carries a further 7 bits of the length, with each successive byte's value shifted up by another 7 bits (the least significant group comes first). Thus, a packet length of 127 or less can be expressed in one byte, and as there are four bytes allowed, the encoding allows for packets of up to 256 MBytes [MQTT 2.2.3].
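For reference, the length encoding can be sketched in a few lines. This is my own rendering of the scheme, not the pseudo-code from the spec, and the function names are illustrative.

```python
MAX_REMAINING_LENGTH = 268_435_455  # four bytes with 7 usable bits each


def encode_remaining_length(n: int) -> bytes:
    """Encode n as 1 to 4 bytes, least significant 7-bit group first."""
    if not 0 <= n <= MAX_REMAINING_LENGTH:
        raise ValueError("length out of range")
    out = bytearray()
    while True:
        n, digit = divmod(n, 128)
        if n > 0:
            digit |= 0x80  # bit 7 set: another length byte follows
        out.append(digit)
        if n == 0:
            return bytes(out)


def decode_remaining_length(data: bytes, offset: int = 0):
    """Return (length, number of bytes consumed)."""
    value = 0
    for i in range(4):
        byte = data[offset + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, i + 1
    raise ValueError("remaining length longer than four bytes")


for n in (0, 127, 128, 16_383, MAX_REMAINING_LENGTH):
    print(n, encode_remaining_length(n).hex(" "))
```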

This preamble sets the tone for most of MQTT, which is that the protocol is dictated by (unfortunately not consistent) wire-footprint greed and will trade away many of the key capabilities of modern protocols for reduction of wire footprint. Unfortunately, the spec doesn't tell the whole story about the actual wire footprint, and some of the decisions start looking questionable once you start looking at the true wire-footprint with IP and TCP headers and the requisite TLS framing added, as well as what you need to put into the payload to compensate for what MQTT does not provide.

The packet type indicator nibble with the accompanying flags nibble nicely demonstrates that greed, but simultaneously also shows that MQTT doesn't have much future runway as a protocol without some drastic changes. If the protocol needs to add just one more packet type (and that need would exist if proper error handling were added), the only way to rescue the current structure would be to use the last reserved value as an escape hatch and start putting packet types elsewhere, and the only good place seems to be that other nibble, since everything else would mess up the protocol structure even more. So the extensibility runway even for new protocol revisions is very constrained.

The length indicator greed and related computation requirement is also somewhat surprising when we consider that the goal is to connect very constrained devices on metered networks and that you may not only want to be saving every byte, but also saving on overall protocol overhead and compute overhead.

The extra protocol overhead sizes to be aware of are at least 40 bytes TCP/IPv4 and 60 bytes TCP/IPv6 packet overhead for any first-try successful transmission, plus about 40 bytes TLS frame overhead. On IPv6 networks, which will be the norm for at least part of the communication paths for many devices in the future, the base transport packet overhead thus sits at some 80-100 bytes at a minimum. This is the base wire-footprint cost for any MQTT packet, when we neglect Nagling effects where a device would send many distinct messages within a short time that then would be able to hitch a ride inside the same TLS frame.

In this overall context, MQTT chooses 7-bit encoding requiring a calculation that the spec even includes pseudo-code for. It could also just say that the length indicator is either two or four bytes long and if bit 15 of the two-byte integer were set, it'd be a 31-bit integer and you need to factor in the next two bytes and just mask out the most significant bit. That's easier to code, and it also frees up 2 more bits.
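To show what I mean, here is a sketch of that hypothetical two-or-four-byte alternative. It is not part of MQTT; the big-endian byte order and the function names are my assumptions for this illustration.

```python
import struct


def encode_length_alt(n: int) -> bytes:
    """Hypothetical alternative: 2 bytes, or 4 bytes with the top bit set."""
    if 0 <= n < 0x8000:
        return struct.pack("!H", n)               # fits in 15 bits
    if n < 0x8000_0000:
        return struct.pack("!I", n | 0x8000_0000)  # 31-bit value, flag in top bit
    raise ValueError("length out of range")


def decode_length_alt(data: bytes, offset: int = 0):
    """Return (length, number of bytes consumed)."""
    (short,) = struct.unpack_from("!H", data, offset)
    if not short & 0x8000:
        return short, 2
    (wide,) = struct.unpack_from("!I", data, offset)
    return wide & 0x7FFF_FFFF, 4                   # mask out the flag bit


print(encode_length_alt(127).hex(" "), encode_length_alt(100_000).hex(" "))
```

There is no loop and no shifting by a running multiplier; the reader looks at one bit and then knows how many bytes to take.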

You'd pay for a simpler model with potentially losing 1 byte compared to the 7-bit encoding on messages smaller than 128 bytes. It's worth debating whether sending messages of 128 bytes or less (including further MQTT overhead that I'll discuss in the following) is a good use of precious metered bandwidth. At that size, you're going to be sitting at or near 50% pure protocol overhead over your payload when using TCP/IPv6 and overlaid TLS. But you will have saved a byte.
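A back-of-the-envelope check, using only the rough figures quoted above (60 bytes for TCP/IPv6 and about 40 bytes for TLS framing) and assuming one MQTT packet per TLS frame and TCP segment. The function name is made up, and real numbers will vary with TCP options and cipher suite.

```python
def overhead_share(message_bytes: int,
                   transport_overhead: int = 60,
                   tls_overhead: int = 40) -> float:
    """Fraction of bytes on the wire that are not the MQTT packet itself."""
    overhead = transport_overhead + tls_overhead
    return overhead / (overhead + message_bytes)


for size in (16, 64, 100, 128):
    print(f"{size:4d}-byte MQTT packet: {overhead_share(size):.0%} transport overhead")
```

One byte saved in the length indicator moves these percentages by a fraction of a point.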

Following the 2 to 5 bytes of preamble ("fixed header"), MQTT defines specific layouts for the various control packets. The first portion of that layout is called the "variable headers" [MQTT 2.3] and the second is called "payload" [MQTT 2.4], but the spec is fairly confused about where to draw the line between these and how they're interrelated.

On the connect packet, the "variable headers" carry the protocol version identifier, keep alive indicator, and session control flag, but also declare the layout of the following payload section. The payload section then begins with the mandatory client identifier, and it feels very arbitrary that this ought to be a payload component while the version indicator string is a header component.

It appears just as arbitrary that the "Will" mechanism is shredded apart across the header section (QoS) and the payload section (topic and payload). Sadly, that is a theme of the spec in that it lacks principle on what is payload and what is header.

There are four kinds of header value encodings in MQTT:

Two-byte length-prefixed strings
Single byte
8-bit bit-field
2-byte integer

The two-byte length-prefixed string is inconsistent with the 7-bit encoding effort on packet length; even the MQTT version-indicator string header in the "connect" packet is wasting a byte with an extra zero for the 4-character string "MQTT".

Protocol Versioning

The versioning story of MQTT is a sad chapter and nearly comical given all the greed that goes into the preamble. The MQTT 3.1.1 connect message burns up 7 bytes to tell you who it is, and the MQTT 3.1 connect message (IBM's last version before submitting to OASIS) burns up 9 bytes for announcing "MQIsdp" and the protocol sub-version, which is a single byte header following that string.
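Those byte counts are easy to verify with a short sketch of the two-byte length-prefixed string encoding. The helper names are mine; the protocol level values (3 for MQTT 3.1, 4 for MQTT 3.1.1) are the ones the respective specs define.

```python
import struct


def encode_string(text: str) -> bytes:
    """Two-byte big-endian length prefix followed by UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack("!H", len(data)) + data


def protocol_preamble(name: str, level: int) -> bytes:
    """Protocol name string plus the single-byte protocol level."""
    return encode_string(name) + bytes([level])


v311 = protocol_preamble("MQTT", 4)     # MQTT 3.1.1
v31 = protocol_preamble("MQIsdp", 3)    # MQTT 3.1

print(len(v311), v311.hex(" "))
print(len(v31), v31.hex(" "))
```

The leading 00 byte in both outputs is the wasted upper half of the string length.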

When a protocol implementation is able to read that version indicator, it's already at least 2, if not 5 bytes into reading the message, as the protocol and protocol version indicator trail the fixed header preamble. The specification also treats any use of unspecified values as a protocol violation. It doesn't do so for the packet types explicitly, but it does so for the associated flags. So if you implement the current version correctly, you will be nuking the connection before you can even read the version string if a new revision of the protocol introduces a new packet type or changes the permitted flags.

That means that the fixed section of the protocol is, if implemented correctly, effectively locked in forever in the current shape, and it will require hacks to change.

What I absolutely didn't expect was that MQTT 3.1 and MQTT 3.1.1 (the OASIS edit) would use a wholly different protocol identifier string. An MQTT 3.1.1 client will cause any correct MQTT 3.1 broker to reject connections, even though that's the only difference between the versions. What's worse is that the spec further limits its own runway by locking that in, making the length indicator bits explicit and normative and stating: "The string, its offset and length will not be changed by future versions of the MQTT specification." [MQTT 3.1.2.1].

Then why have all these 6 bytes and not just use, maybe, the UTF-8 characters for 'T' and 'T' and a trailing version counter of two bytes, as you'll want to give yourself some room for spec revisions, instead of presenting this as if it were an extensible thing? And then put that 4-byte preamble at the beginning of the stream, so a server can pick the implementation stack as the connection opens?

Extensibility

If your impression so far is that MQTT is an extensible protocol, as I have been writing about fixed and variable headers, control types, and payloads, then I must apologize for misleading you in the same way that the MQTT spec is misleading. MQTT is not extensible, at all.

MQTT is a protocol design that sets the clock back at least 20 years, while the rest of the distributed systems community has made extensibility and Postel's robustness principle driving considerations for practically all modern protocol designs.

If you read the spec and ignore the differentiation between "header" and "payload" and just think of the various information bits as "fields", you get closer to what the spec actually describes. The "connect" message is composed of a sequence of 8 required fields and 4 optional fields. The presence of the optional fields is indicated in the content of the prior 8 fields.

"Header" telegraphs the presence of a concept that MQTT doesn't provide and that you may think of when reading the spec: custom message metadata extensibility.

MQTT has no facility for a client to communicate application-specific information to the server or the recipient. There really are no application-usable headers. There is no specified way for the client to indicate anything about a published message to the infrastructure and there is no specified way for the client to attach metadata to a message that provides functional information outside the message payload.

For someone accustomed to protocols like HTTP, XMPP, AMQP, or even approaches where people throw JSON on a WebSocket, the notion of lack of extensibility may be pretty hard to grasp in a "what do you mean?" kind of way. There is no custom metadata. Whatever you want to tell the other side outside of what the spec says, must go into the payload and it can only go into the payload of the one packet type that allows for a free form payload: publish.

The rest of MQTT, all other 13 defined gestures, is a non-extensible locked box and one with a weak versioning story at that. You either take the protocol as-is, in the current shape, or you don't. Some people might see that as a virtue; I find the notion of lack of extensibility for the key gestures of a protocol horrifyingly backwards.

There are obviously potential hacks to get around that limitation for that one precious publish packet type. It's conceivable to throw custom metadata into the publish message's topic name field in a sort of a query string – but what would be the format of that? Hacks are hacks.

Payload Encoding

One mainstream use-case for MQTT is telemetry ingestion (as it is the "MQ telemetry transport", after all). That means the client just needs to know two gestures: connect/connack as discussed, and publish.

The publish packet carries a number of flags (that I'll get to later), the topic name, and a packet-identifier whose presence depends on the flags. That information is followed by a free-form binary payload, whose length is determined by subtracting the length of the topic name, including its own leading length indicator, and the length of the packet-identifier from the remaining length included in the fixed header.
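The subtraction described above looks roughly like this in code. It is a sketch for a complete publish packet already sitting in a buffer; the function names are illustrative, and the flag layout (QoS in bits 1-2 of the first byte) follows my reading of the spec.

```python
import struct


def decode_remaining_length(data: bytes, offset: int = 0):
    """Return (length, number of bytes consumed) of the 7-bit encoding."""
    value = 0
    for i in range(4):
        byte = data[offset + i]
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, i + 1
    raise ValueError("remaining length longer than four bytes")


def parse_publish(packet: bytes):
    """Return (topic, qos, packet_id, payload) from a publish packet."""
    first = packet[0]
    if first >> 4 != 3:
        raise ValueError("not a publish packet")
    qos = (first >> 1) & 0x03
    remaining, used = decode_remaining_length(packet, 1)
    pos = 1 + used
    end = pos + remaining
    (topic_len,) = struct.unpack_from("!H", packet, pos)
    pos += 2
    topic = packet[pos:pos + topic_len].decode("utf-8")
    pos += topic_len
    packet_id = None
    if qos > 0:                      # packet identifier only with QoS 1 or 2
        (packet_id,) = struct.unpack_from("!H", packet, pos)
        pos += 2
    # the payload has no length of its own: it is whatever is left over
    return topic, qos, packet_id, packet[pos:end]


sample = bytes([0x32, 11]) + b"\x00\x03a/b" + b"\x00\x0a" + b"data"
print(parse_publish(sample))
```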

As a result, you find two subsequent length indicator fields (remaining length and topic name length) on the wire that are interrelated as you decode them, and that's where MQTT gets caught up in its faked-up "header" model and its protocol-manifested desire to know the length of the message and read the entirety of the message into memory before beginning to process it.

That's the opposite of HTTP, which – with some success – uses a model allowing incremental discovery of the message that also permits payload chunking. It's also different from AMQP or WebSockets, which both have a cleanly layered notion of framing, and where each frame has a frame-length preamble, before there's any consideration about what's in it, so the transport stack can pull a frame off the network without having to communicate any information discovered in the process to an upper layer. Even XMPP, whose XML-fragment based framing model I personally don't like much, allows the network reader to pull a frame off the network without considering the contents other than counting the balance of elements and attributes. The MQTT design throws a bit of significant protocol right in front of the framing and therefore introduces unnecessary coupling.

The great sin of MQTT with regards to payload is that it is completely oblivious to payload. Not only does MQTT not define a payload format as AMQP or XMPP or JSON-over-WebSocket do, it doesn't even provide negotiable or at least discoverable content-types or encodings as HTTP and AMQP do. On the subscriber side of an MQTT communication path you just have to know what's in that byte array. There is no facility to tell you what the content-type or content-encoding might be. This is a protocol in active standardization in the year 2014. It's unbelievable.

As a consequence, stating that a system supports MQTT can't be a complete statement. You really have to say "mqtt+bson", "mqtt+json", "mqtt+protobuf", or whatever you use as payload encoding, because the protocol gives no indication and, yet, payload and framing are necessarily coupled when you're in the publisher or consumer role. A broker can be oblivious to the payload, its clients can't be. The MQTT specification isn't showing much consideration for the needs of those.

Errors

Amazingly, MQTT does manage to take it up a notch from unbelievable to inexcusable when you look at the error handling strategy.

MQTT's error handling model for everything from client-side protocol violations to intermittent error conditions, where a server can't flush a message to disk for whatever reason, is for the server to drop the network connection. The rule is that whenever something happens that's unexpected and not written down in the spec as the blessed one way, the server will cut the cord, with no explanation.

The sole exception from that rule is the connect/connack handshake where connack does indeed define a set of error codes that explain why a connection cannot be established. However, should you be making the mistake of sending "connect" and have one of the header flags wrong, the server will (must) cut the socket even before replying with "connack".
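For completeness, this is the entire error vocabulary MQTT 3.1.1 gives a client, sketched as a connack decoder. The return code meanings are the ones listed in the spec's connack section as I read it; the function name is my own.

```python
# Connect return codes defined for connack in MQTT 3.1.1.
CONNACK_RETURN_CODES = {
    0: "connection accepted",
    1: "connection refused, unacceptable protocol version",
    2: "connection refused, identifier rejected",
    3: "connection refused, server unavailable",
    4: "connection refused, bad user name or password",
    5: "connection refused, not authorized",
}


def parse_connack(packet: bytes):
    """Return (session_present, return_code, description)."""
    if len(packet) != 4 or packet[0] != 0x20 or packet[1] != 0x02:
        raise ValueError("malformed connack packet")
    session_present = bool(packet[2] & 0x01)
    code = packet[3]
    return session_present, code, CONNACK_RETURN_CODES.get(code, "reserved")


print(parse_connack(b"\x20\x02\x00\x05"))
```

Anything that goes wrong after this point has no code at all.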

This toddler-like attitude of MQTT towards error conditions instantly disqualifies it as a serious protocol choice for practically all use-cases where predictability and reliability matter. It makes the protocol hard to debug, and it's also actively sabotaging MQTT's goal of providing minimal wire-footprint.

The strategy is really something to behold: For an application field where wireless communication with significant traffic contention and packet loss and network-roaming clients with frequent disconnects due to cell-tower hopping will be the norm, MQTT indicates errors, including those caused by transient conditions in the broker, by making them indistinguishable from network-infrastructure imposed error conditions and absolutely impossible to diagnose.

The strategy also demonstrates MQTT's inherent obliviousness of the wire-footprint and latency effects of adding security to the transport layer. The protocol is greedy to the single byte, but in the case where the server's storage layer were having the hiccups for a minute (and that happens in a distributed system), the protocol is perfectly happy to toss out a negotiated TLS channel with the socket at the bottom of it, and force the client to restart into a multi-hop handshake that may include, depending on TLS authentication mode, the mutual exchange of certificates, potentially weighing in at 5-10 kBytes per reconnection attempt over networks with a few hundred milliseconds of base roundtrip latency. Because the server has the hiccups. The client and the client's metered M2M data plan link are paying for the server having the hiccups.

Anyone with any experience of operating scale-out systems ought to be offended by a design that assumes backend reliability perfection and lays the blame on the client if that assumption cannot be satisfied, with the backend assuming zero responsibility and not even telling the client what's wrong.

In MQTT's error handling model, the client has no way of distinguishing between a weak network link with excessive packet loss, the timeout of a NAT session, roaming-imposed change of network address causing a socket collapse, or the server having the hiccups. As the client can't tell, its only chance is to aggressively reconnect if it needs to get rid of data and thus it will run up the phone bill specifically when communication infrastructure is not at fault.

I'll be revisiting the error handling aspect a little later when we get to reliability.

Subscriptions

The way you receive messages from an MQTT broker is by creating a subscription. MQTT subscriptions are different from those in many other brokers in that they set up both a filter-based delivery route and a message solicitation gesture at the same time, and that the message solicitation gesture is active for as long as the subscription is active.

In other words, you tell the broker what you're interested in, and then you tell it that you want to get any message fitting that criterion as it becomes available without doing any further action. MQTT implements a "solicit push" pattern; the client connects and establishes a delivery route for messages and either creates a new subscription or reuses an established subscription set up during a previous session with the same client-id. If any messages are available at the time of connection they'll be instantly delivered and further messages will be delivered into the existing connection to clients with a matching subscription as they become available.
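On the wire, that combined filter-and-solicit gesture is a single subscribe packet. A minimal sketch, with made-up helper names and an example topic filter of my own choosing:

```python
import struct


def encode_string(text: str) -> bytes:
    """Two-byte big-endian length prefix followed by UTF-8 bytes."""
    data = text.encode("utf-8")
    return struct.pack("!H", len(data)) + data


def encode_remaining_length(n: int) -> bytes:
    """7-bit variable length encoding used in the fixed header."""
    out = bytearray()
    while True:
        n, digit = divmod(n, 128)
        if n > 0:
            digit |= 0x80
        out.append(digit)
        if n == 0:
            return bytes(out)


def build_subscribe(packet_id: int, filters) -> bytes:
    """filters is a list of (topic filter, requested QoS) pairs."""
    body = struct.pack("!H", packet_id)
    for topic_filter, qos in filters:
        body += encode_string(topic_filter) + bytes([qos])
    # 0x82: packet type 8 (subscribe) with the mandatory flags nibble 0010
    return bytes([0x82]) + encode_remaining_length(len(body)) + body


print(build_subscribe(1, [("vehicle/42/#", 1)]).hex(" "))
```

There is nothing in this packet that says how many messages the client is prepared to take, which is where the trouble described next begins.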

The push model is great since it relieves you of having to pull individual messages or batches of messages; there is no need for a "receive" gesture. There is no "polling" as some people might put it, and the absence of that is something many people seem to find an attractive feature.

The flipside is that while there is no explicit need to pull, there's now also no ability to control the flow of messages coming up the pipe to the client other than refusing to read from the socket. That's not a problem for scenarios where the client is performing a singular task or several tasks that do not take any significant work like displaying a line in a chat window.

Not having any flow control can get fairly dicey when there is significant work to do for processing a message and there might be several different subsystems on the receiving end of the MQTT connection where the work differs. As MQTT offers no flow control capability whatsoever, you can thus get into the situation that, say, a commercial vehicle telematics box is subscribed to traffic information that can just be quickly appended to a list and, simultaneously, to messages that must be shown to the driver and explicitly acknowledged. With that, you're multiplexing two streams of messages, of which one you can potentially process at several thousand messages per second, while for the other one you're really depending on when the driver is willing to pay attention. Mind that there's also no message-level time-to-live information at the MQTT level that you could lean on for how long the sender is willing to wait for a reply.

In a situation like this, the required strategy with MQTT is to rip everything down from the wire as you can and process it locally or store it locally, because of the absence of flow control. You can also, to some degree, leave in-flight messages unacknowledged, but whether that's possible depends on the ordering requirements.

If there's any data stream over the connection requiring in-order delivery, then that automatically extends to all streams. Generally, the flow control issue will force messages to be flagged as consumed towards the broker, even with the at-least-once and exactly-once delivery assurance models (next section) when you have done nothing to process the message, because you need to keep getting at the messages coming up behind the one you're pulling off. MQTT lacks both flow-control and multiplexing.

The way AMQP deals with this is that it allows multiplexing of links through a single connection and has a credit-based message solicitation model, meaning that you can ask for a single message on one link and a quasi-unbounded sequence of messages on another link and thus can get "pull" and "solicit push" at the same time.

The maximum number of messages that can be kept in flight is limited by the 2-byte packet-identifier. If you wanted to maintain a high-speed transfer link with individual messages across a high-bandwidth, but high-latency network, either with very largely scaled-up TCP receive buffers or using alternate transports like UDT, MQTT will start tripping on itself at 64K in-flight messages pending acknowledgment on QoS 1 or better. With a 4 byte payload and Nagling, MQTT would hit that point just past 1MByte of wire footprint, which corresponds to roughly a 500ms roundtrip at 20 Mbit/sec.
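The arithmetic behind that estimate can be sketched as follows. The 6-byte topic name is an assumption I'm making for this illustration (the figure depends on it), and TCP/TLS framing is ignored on the premise that Nagling packs many publish packets into each segment.

```python
MAX_IN_FLIGHT = 65_535  # non-zero 16-bit packet identifiers


def publish_wire_size(topic_len: int, payload_len: int) -> int:
    """Size of a small QoS 1 publish packet (remaining length fits in 1 byte)."""
    fixed_header = 2          # type/flags byte plus one length byte
    topic = 2 + topic_len     # length prefix plus topic name
    packet_id = 2
    return fixed_header + topic + packet_id + payload_len


size = publish_wire_size(topic_len=6, payload_len=4)   # topic length assumed
total_bytes = size * MAX_IN_FLIGHT
seconds_at_20mbit = total_bytes * 8 / 20_000_000

print(size, total_bytes, round(seconds_at_20mbit, 3))
```

In other words, on a 20 Mbit/sec link with about half a second of roundtrip time, the identifier space runs out at roughly the point where the first acknowledgments could arrive.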

State Management

Subscriptions must be held by the broker as long as the client maintains a session or until the client tells the broker to clear out the session [MQTT 3.1.2.4] and nuke away all context established for it. What maintaining a session means is fairly ambiguous in the specification, stating "some Sessions last only as long as the Network Connection, others can span multiple consecutive Network Connections between a Client and a Server".

The ambiguity this creates is fairly interesting when taken together with the error handling strategy.

Is the server entitled to nuke all subscriptions when there is – let's say – a storage failover, while the client is connected and happens to send a message and that message can't be stored at that second? Since the server's only way to communicate errors is to kill the connection and the connection may represent the session boundary, it is legitimate for the server to throw all subscriptions out at that point, as per the specification. That does, actually, mean that the server is completely entitled to forget about all subscriptions with any sort of excuse. "I don't feel like it right now" is a fine transient condition to nuke the connection and drop everything.

Thus, following the words of the spec, the only way for a client to reliably reestablish its subscription context is to set the "clean session" flag at every reconnection attempt, and reissue the "subscribe" packets. And since it can't tell whether the server dropped the connection on purpose or the connection dropped due to some networking condition, there's really no good way to optimize the behavior. Better be safe than sorry. Turns out, reissuing a receive gesture is exactly what you'd do on AMQP as well and also what you'd do with HTTP long-polling.

The downside of resetting the subscription is, of course, that the subscription's state with regard to the message stream gets lost, and you end up with an online-only subscription model with no offline capability as the subscription doesn't exist while offline. So the exact opposite strategy is to never issue a "subscribe" gesture, never set the clean session flag, and just go with the assumption that the "subscribe" gesture has been established at some point in the past. You don't subscribe; the subscription is just there. That gets you online/offline capability. Using "subscribe" on the wire does not. I'll revisit this at the end of this article.

The concept of "Will" is oddly not suffering from the same ambiguity issues. The intent here is straightforward in allowing the client to park a message that says "I'm no longer here, because I'm network-unreachable", which is a way to implement presence functionality, i.e. turn the "is present" flag off. For "Will", the rules are very clear in that it's tied to the network connection.

Resource Governance

Compared to MQTT 3.1, the OASIS MQTT 3.1.1 spec clarifies [MQTT 4.7.3] that topics are not generally dynamic and may be predefined. It also clarifies that there "may be a security component that authorizes access to certain topics". That's a more than necessary addition to the specification, especially when considering the existence of the "retain" flag. What the spec also ought to mention is that the way to express an authorization failure is to disconnect the client and that the client won't ever know why that happened.

The retain flag [MQTT 3.3.1.3] requires the server to retain the given message, of whatever size, on the topic, so that it will be delivered to "any future subscribers when subscriptions match the topic name". Any subsequent message with that flag will replace the existing message with "retain".

Dynamic topics in conjunction with this retention model are a nice puzzle to solve, because a single publisher client can stuff a server with messages on randomly named topics that nobody will ever pick up, and the server can't effectively defend against that other than not offering dynamic topics or tracking dynamic topic use with a per-user (not per client-identifier) quota.

If authorization were defined, the retention model would likely require a special level of authorization beyond "send" permission, since the "retain" message is a shared resource that ought to have special protection; a client's set "retain" message ought to have a guarding mechanism against undesired overrides by some other, lesser privileged client. This is similar to the permission to allow creating sticky posts in web forums, which usually requires administrative permission.

I've chosen to not implement support for "retain" not only because of resource governance and authorization concerns that will have to be solved, but also because it requires direct broker support, and will require special behavior for all other protocols on a multi-protocol broker. "Retain" is conceptually interesting, but I think I would like an explicit broker feature that works cleanly across protocols even more; some (any) explicit customer use-case demand for the capability would also help bump the priority up.

Delivery Assurances

Next we'll take a look at delivery assurances. MQTT defines 3 levels of "Quality of Service". Level 0 is providing best effort, "at most once" message delivery assurance. Level 1 aims to provide an "at least once" message delivery assurance, and Level 2 even an "exactly once" delivery assurance.

Unfortunately, MQTT defines this assurance model only for the publish gesture and not for its own, inherent operations, which is a good source of confusion. For instance, if you're creating a subscription, and creating the subscription succeeds while the route to the client collapses (let's say due to mobile national roaming, switching towers) just as the server is sending the "suback" response, then it's not unlikely that the client will come back to reconnect to some other server node before the current server node even knows what's going on, and as the server is still waiting for the socket to collapse, hanging on the timeout for the pending response. At that point, does a subscription exist for the client in spite of the client not having received "suback"?

It's not clear what the correct behavior ought to be. The client could interpret the disconnection as a network failure occurring before or after "subscribe" could be delivered, or a failure to establish the subscription due to any transient server condition, such as an ongoing server node failover or any other reason. It can't know. The server node will look to send the "suback" command back to the client, but there is no rule about what happens to the subscription when that fails. "Suback" is (implicitly) a QoS 0 message, and it is therefore inherently acceptable to lose it, and thus the subscription probably ought to stand. The client can't distinguish these cases due to the broken error handling model and will be in doubt about whether the subscription exists.

That means the client is forced to retry under any such circumstances. The specification already anticipates this by making subscriptions effectively idempotent: a subscribe command matching an existing subscription must replace that subscription with a new one, and the message flow must not be interrupted.
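The retry-until-suback behavior can be sketched as follows. This is an illustrative Python model, not a real MQTT client: SubscriptionTable, subscribe_with_retry, and the send callback are hypothetical names, and a lost "suback" is modeled as a ConnectionError.

```python
class SubscriptionTable:
    """Server-side view: subscriptions keyed by (client_id, topic_filter)."""

    def __init__(self):
        self._subs = {}

    def subscribe(self, client_id, topic_filter, qos):
        # The same key overwrites the earlier entry, so a retry is harmless.
        self._subs[(client_id, topic_filter)] = qos
        return qos  # granted QoS, as it would be carried in suback

    def count(self, client_id):
        return sum(1 for cid, _ in self._subs if cid == client_id)


def subscribe_with_retry(send, topic_filter, qos, attempts=3):
    """Call send() until it returns a granted QoS.

    send is assumed to raise ConnectionError when the connection is lost
    before suback arrives; the client cannot tell whether the subscription
    exists at that point, so it simply asks again.
    """
    for _ in range(attempts):
        try:
            return send(topic_filter, qos)
        except ConnectionError:
            continue
    raise ConnectionError("no suback after %d attempts" % attempts)
```

Because the table is keyed on the client identifier and topic filter, the second attempt replaces the first entry instead of adding a duplicate, which is the only reason the blind retry is safe.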

This case shows how MQTT's lack of separation between the message transfer model on one side and overlaid semantics on the other is quite problematic. "subscribe" and "suback" ought to be messages that are both delivered with "at least once" assurance if the subscription were indeed held durably. Thus, if a subscription has been established as a connection collapses, "suback" would instead be delivered on the reestablished connection. MQTT's reliable payload message transfer model realized with "publish" and related acknowledgements isn't available for the MQTT management operations and that hurts their reliability semantics.

The "fire and forget" QoS level 0 is helped by the client. The client will try once and if it fails to get the message to the server, for whatever reason, it will instantly give up. That ensures "at most once" by way of client cooperation with the rule, but there's no governance opportunity for the server.

The QoS 1 "at least once" model with publish and puback provides a reasonable level of assurance to the client about whether a message has been accepted by the broker when puback gets back through. Until the client has received puback, it must hold on to any sent messages and if there is any failure condition, the client will set the "dup" flag [MQTT 3.3.1.1] in a retry copy of the message and send again. The presence of the "dup" flag allows the server to determine that this is a retry. If the server has already sent puback for a given packet identifier, it must treat the message as a new publication [MQTT 4.3.2-2].
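For reference, this is where the "dup" flag lives on the wire. In MQTT 3.1.1 the first byte of a publish packet carries the packet type (3) in the upper four bits, then DUP in bit 3, the QoS level in bits 2 and 1, and RETAIN in bit 0. The Python sketch below encodes that byte and pairs it with a small in-flight table; the InFlight class and its method names are illustrative assumptions, not part of any library.

```python
PUBLISH = 3  # control packet type for publish in MQTT 3.1.1


def publish_header_byte(qos: int, dup: bool = False, retain: bool = False) -> int:
    """Build the first fixed-header byte of a publish packet."""
    if qos not in (0, 1, 2):
        raise ValueError("QoS must be 0, 1 or 2")
    if dup and qos == 0:
        raise ValueError("DUP must be 0 for QoS 0 messages")
    return (PUBLISH << 4) | (int(dup) << 3) | (qos << 1) | int(retain)


class InFlight:
    """QoS 1 messages the client has sent and must keep until puback."""

    def __init__(self):
        self._pending = {}  # packet_id -> payload

    def sent(self, packet_id, payload):
        self._pending[packet_id] = payload
        return publish_header_byte(qos=1)

    def resend(self, packet_id):
        # A retry carries the same packet id and payload, with DUP set.
        return publish_header_byte(qos=1, dup=True), self._pending[packet_id]

    def puback(self, packet_id):
        # Only now may the client forget the message.
        self._pending.pop(packet_id, None)
```

A first transmission at QoS 1 yields 0x32 and a retry yields 0x3A; the single differing bit is all the information the flag conveys.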

The "dup" flag is a bit of a mystery. Personally, I don't know what to do with it. The spec is clear that I can't rely on having seen a previous packet with "dup" set to 0 – which is logical as the client can have run into a local network condition as it tried to put the first packet on the wire. It escapes me what I do with the knowledge that the client has retried sending this packet at least once (I may be looking at the umpteenth resend) and the specification is no help. It states that a set "dup" flag indicates a retransmission, but there's no rule that depends on it. This smells like protocol baggage.

The QoS 2 "exactly once" assurance is the assurance level that I, so far, chose to not implement, largely because I have serious doubts about it being possible to provide "exactly once" as an end-to-end assurance in a scale-out messaging system, and if the assurance can't be given end-to-end it makes little sense to provide it on any of the legs.

Without going into too much detail, there is a range of edge-case error conditions that can occur in large high-throughput, multi-node broker systems where you'll favor duplicating a message over potentially losing it completely. That's especially true in cases where the gateway and the broker run on different nodes, and the gateway hands off a message straight into a broker failover situation. In that case, the broker might just get the message off to disk but doesn't get a chance to report that fact back to the gateway. In traditional transactional systems, you would span a transaction from the client over to the message store to ensure consensus on the outcome of the operation so that the broker won't make the stored message available for consumption until the client permits it, but many contemporary scale-out broker systems can't and won't subject their stores to transaction-scope control by untrusted clients for availability, reliability, and security reasons.

MQTT tries to mimic that traditional transaction model similar to how Azure Service Bus's proprietary SBMP protocol (which is phased out in favor of AMQP) mimics it. The message gets published with publish and the server stores and holds it. The server then confirms the receipt with pubrec, which establishes consensus between client and server that the server has the message safely in hands. The client then issues pubrel to permit the server to publish the message, which is confirmed by the server with pubcomp. The pubrel/pubcomp exchange is a QoS 1 exchange, meaning the client will repeatedly reissue the pubrel message until it receives a pubcomp confirmation. Oddly, the client isn't allowed to set the "dup" flag on these retries [MQTT 3.6.1], which underlines my suspicion that the "dup" flag is largely protocol fluff or is someone's implementation detail seeping out into the spec.
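The receiving side of that four-step exchange can be sketched in Python as below. This is a single-node, in-memory illustration under the hold-until-pubrel reading described above; the class name Qos2Receiver and its attributes are hypothetical, and nothing here addresses durability or failover.

```python
class Qos2Receiver:
    """Server-side handling of publish/pubrec/pubrel/pubcomp for QoS 2."""

    def __init__(self):
        self._held = {}      # packet_id -> payload, stored but not yet released
        self.delivered = []  # payloads forwarded onward to subscribers

    def on_publish(self, packet_id, payload):
        # A retransmitted publish for an id that is already held must not
        # create a second copy of the message.
        self._held.setdefault(packet_id, payload)
        return ("pubrec", packet_id)

    def on_pubrel(self, packet_id):
        payload = self._held.pop(packet_id, None)
        if payload is not None:
            self.delivered.append(payload)
        # pubcomp is returned even for a repeated pubrel, so a client that
        # missed the first pubcomp can stop retrying.
        return ("pubcomp", packet_id)
```

The whole assurance rests on the _held dictionary surviving between the two steps, which is trivial in one process and exactly the hard part once the gateway and the store live on different nodes.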

MQTT's prescribed QoS 2 exchange will, if successful, achieve transferring exactly one message copy to the server. The pattern is a path well-traveled. It does not aim to ensure that exactly-once delivery is achieved end-to-end with the publisher knowing that delivery has been successful.

The reason I didn't implement QoS 2 is that I would have to make a transactional scale-out store to hold on to these messages that would have to live outside of the actual broker to keep the promise I make in pubrec. Without deep integration with the broker message store, I would actually just move the problem by one layer and might still only get "at least once" assurance. I explain this more in the next section.

To make the model solid, the broker backend behind an MQTT gateway must directly support the transaction gestures on its store, meaning the broker would have to store and lock messages handed to it, and then promise not to forward them until a second gesture clears them for forwarding. There's an interesting abuse vector here in that you could potentially stuff a server with messages and never release them. The specification's section on ordering [MQTT 4.6] cites an undefined "in-flight window" (which appears to be an implementation detail of IBM's MicroBroker that has no place in an OASIS spec) in a non-normative comment and speaks about how restricting in-flight messages will address this.

Data Retention and Failover

Since I'm looking at MQTT from the perspective of building a scaled-out broker infrastructure, the reliability semantics of the protocol are inseparable from the failover behavior, as failover – meaning that a server node shuts down for any reason and another node kicks in to replace it – is how any large scale system stays available.

On failover, the first interesting aspect is the maintenance of the session-related state across all frontends. MQTT's state-management semantics work out to demand either a "CP" state management backplane ("CP" means consistency-biased per the CAP theorem) or no cross-node state management, at all.

Directly copying from the specification [MQTT 3.1.2.4], the session state on the server consists of the following:

The existence of a Session, even if the rest of the Session state is empty.
The Client's subscriptions.
QoS 1 and QoS 2 messages which have been sent to the Client, but have not been completely acknowledged.
QoS 1 and QoS 2 messages pending transmission to the Client.
QoS 2 messages which have been received from the Client, but have not been completely acknowledged.
Optionally, QoS 0 messages pending transmission to the Client.

The rules on state retention [MQTT 4.1] are disappointingly noncommittal for a specification that imposes so many state retention obligations on a server. Session state (and therefore the session) must be maintained for as long as the network connection exists, but beyond that it can be liberally discarded based on time, administrator action, because arbitrary stuff goes wrong (state corruption), because of resource constraints, or a full moon. It's compliant to shout "error!" and throw all state away and the client will have to cope with it.

Sadly, this noncommittal attitude of the specification also throws all QoS 1 and QoS 2 assurances straight out of the window. A client that has established a subscription on which it expects QoS 2 message delivery of presumably important data on a topic, and that gets disconnected for any reason (including the server having the hiccups) gets absolutely no assurance at the state retention layer that either the subscription or the in-flight QoS 2 messages will be retained and held available for a reconnect.

Mind that I can't let the excuse "it depends on what the implementation does" count. Either the specification provides me with watertight assurances or it does not. MQTT does not. It doesn't even try.

It's wishy-washy with "some" and "others" (MQTT 1.2, "Some Sessions last only as long as the Network Connection, others can span multiple consecutive Network Connections between a Client and a Server.") or "can" (MQTT 3.1.2.4 "The Client and Server can store Session state to enable reliable messaging to continue across a sequence of Network Connections"). There's no MUST or even just SHOULD with regards to retention rules.

But let's assume the spec were more assertive and let's go through the session state items that the protocol asks to retain for the duration of a session. Let that be until the client disconnects or a timeout that is known to both parties a priori (that's my alternate definition, not the spec's). I've taken the liberty to reorder the item list from the spec for the purpose of a better flow of explanation.

For the following discussion I will assume that the node running the MQTT broker will be one of at least two in a farm and one of them fails (assume an instant death due to a power-supply failure) and the other needs to kick in as the failover secondary, with the client instantly reconnecting to the other node.

The existence of a Session, even if the rest of the Session state is empty – A session exists when there's an ongoing relationship with a particular client-id. The fact that there is a session must be retained and all subsequent items are presumably to be anchored on that session. The session is [MQTT 3.1.2.4] "identified by the Client identifier" so there must only be one. In fact, the client-identifier really ought to be called session-identifier, because using a true client identifier has fairly negative security implications, as I'll discuss in the next section. If client state has to be retained across connections and server nodes in failover situations, the immediate consequence from this most basic rule is that if you are indeed retaining session state, you cannot return connack (which confirms establishing or recovering a session) until all server nodes have access to a replica of this fact. The spec doesn't say that.
The Client's subscriptions – Client subscriptions are subject to the same considerations and I already touched on the in-doubt issues with suback in the previous section. If subscriptions ought to survive network connections and they have QoS 1 or QoS 2 assurances attached, the record of their existence must be known by all server nodes before suback is returned. The spec doesn't say that. I'm cutting the spec some slack for QoS 0, because those subscriptions could indeed be replicated in an eventually consistent manner as fumbling some messages is inherently acceptable while the replica propagates.
QoS 1 and QoS 2 messages pending transmission to the Client – Since we presumably have a broker with peek-lock and server-side cursor support for subscriptions backing the MQTT implementation, this is a straightforward requirement to fulfill as it means that messages available on the subscription but not yet delivered will be retained. Brokers do that.
Optionally, QoS 0 messages pending transmission to the Client – see above.
QoS 1 and QoS 2 messages which have been sent to the Client, but have not been completely acknowledged – Here it gets very interesting, because we're required to log the in-flight client interactions on a per-session basis in a way that any server node in the farm can instantly take over redelivery. For QoS 1 and with the protocol implementation backed by a broker, this is not all that hard if the broker counts delivery attempts so that you can set the "dup" flag correctly (which is required for protocol compliance in spite of serving no purpose I can see). For QoS 2, being failover-safe practically means that you will either have to distribute the fact of pending pubrel throughout the farm on a per-session basis before you send it, and also garbage collect that data after you receive pubcomp, or – easier – have to run pubrel through the backend broker, since you need to remember pending deliveries of that message just as you have to for the "publish" message per se. The tradeoff for "easier" is that you're running edge-protocol specific control messages through the backend broker.
QoS 2 messages which have been received from the Client, but have not been completely acknowledged – This requirement is quite tough in a scale-out failover model unless you directly own the broker store or the broker allows for a model of queuing messages under a lock. You will have to retain all these messages received via "publish" for access by all (secondary) nodes across the farm before you return "pubrec", but without having them committed into (or released from) the broker for delivery until the matching "pubrel" is received.
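The six items above, and the replicate-before-acknowledge consequence drawn from them, can be summarized in a short Python sketch. The field names, the ReplicatedSessions class, and the use of plain dictionaries as stand-ins for server nodes are all assumptions for illustration; a real farm would need a consistent replication protocol rather than a loop.

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    client_id: str  # the session exists even if everything below is empty
    subscriptions: dict = field(default_factory=dict)         # topic filter -> max QoS
    pending_outbound: list = field(default_factory=list)      # QoS 1/2, not yet sent
    pending_qos0: list = field(default_factory=list)          # optional per the spec
    sent_unacked: dict = field(default_factory=dict)          # packet id -> payload
    received_qos2_unacked: dict = field(default_factory=dict) # packet id -> payload


class ReplicatedSessions:
    """Answer the client only after every node holds the session record."""

    def __init__(self, replicas):
        self._replicas = replicas  # list of dicts standing in for server nodes

    def connect(self, client_id):
        for node in self._replicas:
            # Recover an existing session or create an empty one on each node.
            node.setdefault(client_id, SessionState(client_id))
        # Only at this point is it safe to confirm the session.
        return "connack"
```

The same ordering applies to "suback" and "pubrec": the reply is a promise about retained state, so the state has to be replicated first.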

I didn't implement QoS 2 for the time being, since I can't fulfill the last QoS 2 retention requirement with the broker I'm using. Azure Service Bus does indeed support queuing messages under a lock when using transactions, but losing the client and client connection triggers the transaction being abandoned. I'm in the lucky position to be able to ask our broker development team directly for an extension of that capability to allow for a lock that can be explicitly managed, and I might actually do that; this will not, however, solve the replication problem of all potential secondary nodes having to know about that lock at the protocol gateway edge and its association with the client-id and the sequence-id, meaning that in addition to the lock, there's information about the lock that the gateway needs to retain server-side.

MQTT is far from easy to implement if you want to do it correctly, and across more than one server node.

I believe that MQTT specifically suffers from the madness of attempting to provide reliable messaging using a "solicit push" pattern, where the solicitation of an unbounded sequence of messages occurs when the subscription is established, and the delivery of those messages is potentially subjected to the QoS 1 or 2 delivery assurances defined in MQTT. With a "pull" based model that separates establishing subscriptions and message solicitation, you can leave delivery resumption control to the client; with MQTT those two aspects are coupled.

AMQP also supports sophisticated patterns for resumption of links with all in-flight deliveries being retained intact and those are just as hard to do at scale, but it's a perfectly valid option there to have all deliveries fail out and make the clients ask for messages again once they reconnect. "Pull" provides a way to push the in-flight problem out to the clients, and makes scale-out scenarios more reliable. HTTP follows the same principle (not having server interactions interdepend is an aspect of REST).

Because of these state management considerations, my particular implementation choice for MQTT is to not implement state retention, at all. Instead, I turn the actual establishment of a per-client subscription into an out-of-band gesture and reinterpret the MQTT "subscribe" gesture to mean receive (or push-me-stuff-while-this-connection-lasts) on that pre-existing backend broker (Topic-) subscription.

That means I'm intentionally coupling all MQTT semantics to particular connections; which also means I can't provide QoS 2, but that's fairly easy to replace with message deduplication on the client, anyways.
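Client-side deduplication on top of "at least once" delivery can be as small as the following Python sketch. It assumes the application puts its own unique message id into each payload or header, since MQTT packet identifiers are reused and can't serve that purpose; the Deduplicator class and its capacity parameter are hypothetical.

```python
from collections import OrderedDict


class Deduplicator:
    """Remember recently seen message ids in a bounded, oldest-first window."""

    def __init__(self, capacity=1024):
        self._seen = OrderedDict()
        self._capacity = capacity

    def first_time(self, message_id) -> bool:
        if message_id in self._seen:
            # Refresh the entry so frequently repeated ids stay in the window.
            self._seen.move_to_end(message_id)
            return False
        self._seen[message_id] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest id
        return True
```

A receiver would call first_time() on every incoming message and process only those for which it returns True. The window is bounded, so a duplicate that arrives after its id has been evicted will be processed again; the capacity has to be sized against the longest plausible redelivery delay.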

That separation also enables an interesting trick that I already alluded to earlier:

If I wanted to save the "subscribe" gesture upon connection for footprint reasons, the pre-existing and decoupled backend subscription will allow me to pretend that "subscribe" has been issued on a previous connection in ancient history. With that model, and if the client never uses the "clean session" flag, I can provide instant "solicit push" on the topic associated with the client with QoS 1 assurances over the existing backend topic; extra "subscribe" gestures are basically ignored.

Security

MQTT 3.1.1 Section 5 states "As a transport protocol, MQTT is concerned only with message transmission and it is the implementer's responsibility to provide appropriate security features. This is commonly achieved by using TLS", i.e. security is your own problem.

Punting on security doesn't stop the spec authors from including a few pages of mentions of security and even regulation considerations, including references to Sarbanes-Oxley (!), the NIST Cyber Security Framework, and PCI-DSS, all of which MQTT has absolutely nothing to do with or enables in any particular fashion. I find the name-dropping disturbing and I feel like there's an attempt to trick me into believing there are relationships where there are none.

It continues when, after mentioning TLS as an option, the security section also mentions that "Advanced Encryption Standard [AES] and Data Encryption Standard [DES] are widely adopted" (btw, DES is also very much broken, thank you) and that "Where TLS is used, SSL Certificates sent from the Client can be used by the Server to authenticate the Client" and goes on name-dropping some details of X.509 and TLS for the rest of the section.

The only enlightening part of the MQTT security section is [MQTT 5.4.8] on Detecting Abnormal Behaviors, which enumerates a few actual threats that MQTT implementations ought to be able to monitor and defend themselves against. Unfortunately, this "for example" list is far from complete and doesn't represent any thorough analysis.

The first suggested measure is that "Server implementations might disconnect Clients that breach its security rules" (which is fairly handy as that's how MQTT deals with every error), and the second measure is to implement a dynamic block list based on identifiers such as IP address or Client Identifier or to punt the problem up to the firewall in a similar fashion. That's all reasonable advice for any network protocol.

Remember: "It is the implementer's responsibility to provide appropriate security features". The problem is that if there is no security, there is no solution; not in any commercial environment. And without having a well-defined security model, there is no interoperability.

There are some pretty evil threat vectors looming around MQTT that the specification doesn't mention.

The gravest mistake in the specification is that it fails to mandate that the Client Identifier, and therefore the associated session state, MUST be tied to the authenticated client initiating a session, meaning that a Client Identifier MUST only be used by the authenticated client while such a session exists.

Without this rule, which I'm providing here, any client with access to the server that has knowledge of an existing Client-Identifier can walk up and steal the session when the owning client happens to be disconnected for any reason, which obviously includes, as we know, transient server error conditions for which MQTT's error model is to disconnect the client.

Naming the Client-Identifier what it is makes this threat fairly real as it suggests a fixed association of the client instance and the server. If MQTT were implemented in a device that holds an extractable credential (username/password or certificate) and the Client Identifier were chosen to be some obvious identifier such as the device's serial number, taking ownership of one device would potentially enable an attacker to hijack all sessions on that server. Hijacking a session does include taking over all previously established subscriptions, which means that even if there were an authorization model for Topics that were used during "subscribe", this approach would allow the attacker to bypass the authorization boundary.

If the identifier were named Session-Identifier, implementers would more likely lean to make it an ephemeral and quasi random value (like a GUID) and that's much harder to guess.
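Both ideas, binding the identifier to the authenticated client and making it hard to guess, fit in a few lines of Python. This is a sketch of the rule proposed above, not of anything the specification requires; SessionRegistry, attach, and new_session_identifier are hypothetical names, and "principal" stands for whatever identity the server's authentication step yields.

```python
import uuid


class SessionRegistry:
    """Tie each client identifier to the principal that first used it."""

    def __init__(self):
        self._owner = {}  # client identifier -> authenticated principal

    def attach(self, principal: str, client_id: str) -> bool:
        # The first authenticated user of an identifier becomes its owner;
        # anyone else presenting the same identifier is refused.
        owner = self._owner.setdefault(client_id, principal)
        return owner == principal

    def release(self, client_id: str) -> None:
        # Called when the session is discarded, freeing the identifier.
        self._owner.pop(client_id, None)


def new_session_identifier() -> str:
    """Generate a quasi-random identifier instead of a serial number."""
    return uuid.uuid4().hex
```

With the ownership check in place, knowing another client's identifier is no longer enough to take over its session; the random identifier merely makes the guess harder in deployments that lack the check.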

Conclusion

For the last 7 years I've been involved in shipping one of the biggest, if not the biggest, multi-tenant, multi-datacenter, transactional, cloud-based message brokers in the world, with several tens of thousands of concurrent tenants across nearly 20 global datacenter locations: Microsoft Azure Service Bus.

Do I have a conflict of interest debating a pet protocol of one of our competitors? Maybe; you'll be the judge of whether this analysis is biased. If you ask people who know me personally they'll tell you that I will call a spade a spade.

I very strongly believe that MQTT 3.1.1 cannot be implemented correctly providing anything but QoS 0 assurance at the scale we provide, and I'm not comfortable providing anything but a QoS 0 assurance for MQTT by the words of the spec, because MQTT 3.1.1 is a fundamentally broken protocol at the present time. I can still provide "at least once", but only with the mentioned workaround of assuming that subscriptions for a given client are established out of band.

I have implemented it, however, because customers are asking for it. Some customers who are asking are already using it and for those I see the implementation as a way to move them forward from where they are. Some customers are looking at fresh implementations of MQTT and for those (you) I wrote this analysis so you can read the specification informed by an implementer's perspective. If MQTT remains your choice, I will try to make you as successful with it as I can, but there will be limits to the lengths I can go due to the inherent deficiencies. There were times when technical pride would get in the way of folks working at Microsoft supporting what the customers demand; that's not my notion of running "services".

MQTT needs significant changes and I think MQTT can opt for one of two potential rescue paths. The pity is that both ways will and ought to lead to its destruction as it gets too close to viable and modern alternatives.

Either MQTT brutally simplifies and gets rid of all the cruft, while taking on its debt, most prominently the lack of extensibility. On that route, it'll quickly become indistinguishable from JSON-over-WebSockets or particular incarnations of that model like Node's socket.io or ASP.NET's SignalR, and this includes wire footprint.

The alternative is that MQTT fixes all of its reliability deficiencies including ditching the "solicit push" model spanning connections, the awful error handling model, and its lack of multiplexing support, but then we're getting mighty close to AMQP 1.0. Which IBM doesn't seem to want to support in any serious fashion. For a reason. See up above.

MQTT is an old, recycled, and often weirdly inconsistent mess. It's not a good protocol, and certainly not a good protocol for the Internet of Things where we will look to connect devices over long-haul links with unpredictable network conditions, and I believe it's unfixable without becoming something different entirely. We ought to know better, and OASIS also ought to know better.

[Update: Some reactions covered in this post]