As our team was starting to transform our parts of the Azure Services Platform from a CTP ‘labs’ service exploring features into a full-on commercial service, it started to dawn on us that we had set ourselves up for writing a bunch of ‘enterprise apps’. The shiny parts of Service Bus and Access Control that we parade around are all about user-facing features, but if I look back at the work we had to go from a toy service to a commercial offering, I’d guess that 80%-90% of the effort went into aspects like infrastructure, deployment, upgradeability, billing, provisioning, throttling, quotas, security hardening, and service optimization. The lesson there was: when you’re boarding the train to shipping a V1, you don’t load new features on that train – you rather throw some off.
The most interesting challenge for these infrastructure apps sitting on the backend was that we didn’t have much solid ground to stand on. Remember – these were very early days, so we couldn’t use SQL Azure since the folks over in SQL were on a pretty heroic schedule themselves and didn’t want to take on any external dependencies even from close friends. We also couldn’t use any of the capabilities of our own bits because building infrastructure for your features on your features would just be plain dumb. And while we could use capabilities of the Windows Azure platform we were building on, a lot of those parts still had rough edges as those folks were going through a lot of the same that we went through. In those days, the table store would be very moody, the queue store would sometimes swallow or duplicate messages, the Azure fabric controller would occasionally go around and kill things. All normal – bugs.
So under those circumstances we had to figure out the architecture for some subsystems where we need to do a set of coordinated action across a distributed set of resources – a distributed transaction or saga of sorts. The architecture had a few simple goals: when we get an activation request, we must not fumble that request under any circumstance, we must run the job to completion for all resources and, at the same time, we need to minimize any potential for required operator intervention, i.e. if something goes wrong, the system better knows how to deal with it – at best it should self-heal.
My solution to that puzzle is a pattern I call “Scheduler-Agent-Supervisor Pattern” or, short, “Supervisor Pattern”. We keep finding applications for this pattern in different places, so I think it’s worth writing about it in generic terms – even without going into the details of our system.
The pattern foots on two seemingly odd and very related assumptions: ‘the system is perfect’ and ‘all error conditions are transient’. As a consequence, the architecture has some character traits of a toddler. It’s generally happily optimistic and gets very grumpy, very quickly when things go wrong – to the point that it will simply drop everything and run away screaming. It’s very precisely like that, in fact.
The first picture here shows all key pieces except the Supervisor that I’ll introduce later. At the core we have a Scheduler that manages a simple state machine made up of Jobs and those jobs have Steps. The steps may have a notion of interdependency or may be completely parallelizable. There is a Job Store that holds jobs and steps and there are Agents that execute operations on some resource. Each Agent is (usually) fronted by a queue and the Scheduler has a queue (or service endpoint) through which it receives reply messages from the Agents.
Steps are recorded in a durable storage table of some sort that has at least the following fields: Current State (say: Disabled, Active), Desired State (say: Disabled, Active), LockedUntil (Date/Time value), and Actor plus any step specific information you want to store and eventually submit with the job to the step agent.
When Things Go Right
The initial flow is as follows:
(1)a – Submit a new job into the Scheduler (and wait)
(2)a – The Scheduler creates a new job and steps with an initial current state (‘Disabled’) in the job store
(2)b – The Scheduler sets ‘desired state’ of the job and of all schedulable steps (dependencies?) to the target state (‘Active’) and sets the ‘locked until’ timeout of the step to a value in the near future, e.g. ‘Now’ + 2 minutes.
(1)b – Job submission request unblocks and returns
If all went well, we now have a job record and, here in this example, two step records in our store. They have a current state of ‘Disabled’ and a desired state of ‘Active’. If things didn’t go well, we’d have incomplete or partially wedged records or nothing in the job store, at all. The client would also know about it since we’ve held on to the reply until we have everything done – so the client is encouraged to retry. If we have nothing in the store and the client doesn’t retry – well, then the job probably wasn’t all that important, after all. But if we have at least a job record, we can make it all right later. We’re optimists, though; let’s assume it all went well.
For the next steps we assume that there’s a notion of dependencies between the steps and the second steps depends on the first. If that were not the case, the two actions would just be happening in parallel.
(3) – Place a step message into the queue for the actor for the first step; Agent 1 in this case. The message contains all the information about the step, including the current and desired state and also the LockedUntil that puts an ultimatum on the activity. The message may further contain an action indicator or arguments that are taken from the step record.
(4) – After the agent has done the work, it places a completion record into the reply queue of the Scheduler.
(5) – The Scheduler records the step as complete by setting the current state from ‘Disabled’ to ‘Active’; as a result the desired and the current state are now equal.
(6) – The Scheduler sets the next step’s desired state to the target state (‘Active’) and sets the LockedUntil timeout of the step to a value in the near future, e.g. ‘Now’ + 1 minute. The lock timeout value is an ultimatum for when the operation is expected to be complete and reported back as being complete in a worst-case success case. The actual value therefore depends on the common latency of operations in the system. If operations usually complete in milliseconds and at worst within a second, the lock timeout can be short – but not too short. We’ll discuss this value in more detail a bit later.
(7), (8), (9) are equivalent to (3), (4), (5).
Once the last step’s current state is equal to the current state, the job’s current state gets set to the desired state and we’re done. So that was the “99% of the time” happy path.
When Things Go Wrong
So what happens when anything goes wrong? Remember the principle ‘all errors are transient’. What we do in the error case – anywhere – is to log the error condition and then promptly drop everything and simply hope that time, a change in system conditions, human or divine intervention, or – at worst – a patch will heal matters. That’s what the second principle ‘the system is perfect’ is about; the system obviously isn’t really perfect, but if we construct it in a way that we can either wait for it to return from a wedged state into a functional state or where we enable someone to go in and apply a fix for a blocking bug while preserving the system state, we can consider the system ‘perfect’ in the sense that pretty much any conceivable job that’s already in the system can be driven to completion.
In the second picture, we have Agent 2 blowing up as it is processing the step it got handed in (7). If the agent just can’t get its work done since some external dependency isn’t available – maybe a database can’t be reached or a server it’s talking to spews out ‘server too busy’ errors – it may be able to back off for a moment and retry. However, it must not retry past the LockedUntil ultimatum that’s in the step record. When things fail and the agent is still breathing, it may, as a matter of courtesy, notify the scheduler of the fact and report that the step was completed with no result, i.e. the desired state and the achieved state don’t match. That notification may also include diagnostic information. Once the LockedUntil ultimatum has passed, the Agent no longer owns the job and must drop it. It must even not report failure state back to the Scheduler past that point.
If the agent keels over and dies as it is processing the step (or right before or right after), it is obviously no longer in a position to let the scheduler know about its fate. Thus, there won’t be any message flowing back to the scheduler and the job is stalled. But we expect that. In fact, we’re ok with any failure anywhere in the system. We could lose or fumble a queue message, we could get a duplicate message, we could have the scheduler die a fiery death (or just being recycled for patching at some unfortunate moment) – all of those conditions are fine since we’ve brought the doctor on board with us: the Supervisor.
The Supervisor is a schedule driven process (or thread) of which one or a few instances may run occasionally. The frequency depends on much on the average duration of operations and the expected overall latency for completion of jobs.
The Supervisor’s job is to recover steps or jobs that have failed – and we’re assuming that failures are due to some transient condition. So if the system would expect a transient resource failure condition that prevented a job from completing just a second ago to be healed two seconds later, it’d depend on the kind of system and resource whether that’d be a good strategy. What’s described here is a pattern, not a solution, so it depends on the concrete scenario to get the timing right for when to try operations again once they fail.
This desired back-off time manifests in the LockedUntil value. When a step gets scheduled, the Scheduler needs to state how long it is willing to wait for that step to complete; this includes some back-off time padding. Once that ultimatum has passed and the step is still in an inconsistent state (desired state doesn’t equal the current state) the Supervisor can pick it up at any time and schedule it.
(1) – Supervisor queries the job store for any inconsistent steps whose LockedUntil value has expired.
(2) – The Supervisor schedules the step again by setting the LockedUntil value to a new timeout and submitting the step into the target actor’s queue
(3) – Once the step succeeds, the step is reported as complete on the regular path back to the Scheduler where it completes normally as in steps (8), (9) from the happy-path scenario above. If it fails, we simply drop it again. For failures that allow reporting an error back to the Scheduler it may make sense to introduce an error counter that round-trips with the step so that the system could detect poisonous steps that fail ‘forever’ and have the Supervisor ignore those after some threshold.
The Supervisor can pursue a range of strategies for recovery. It can just take a look at individual steps and recover them by rescheduling them – assuming the steps are implemented as idempotent operations. If it were a bit cleverer, it may consider error information that a cooperative (and breathing) agent has submitted back to the Scheduler and even go as far as to fire an alert to an operator if the error condition were to require intervention and then take the step out of the loop by marking it and setting the LockedUntil value to some longer timeout so it’s taken out of the loop and someone can take a look.
At the job-scope, the Supervisor may want to perform recovery such that it first schedules all previously executed steps to revert back to the initial state by performing compensation work (all resources that got set to active are getting disabled again here in our example) and then scheduling another attempt at getting to the desired state.
In step (2)b up above, we’ve been logging current and desired state at the job-scope and with that we can also always find inconsistent jobs where all steps are consistent and wouldn’t show up in the step-level recovery query. That situation can occur if the Scheduler were to crash between logging one step as complete and scheduling the next step. If we find inconsistent jobs with all-consistent steps, we just need to reschedule the next step in the dependency sequence whose desired state isn’t matching the desired state of the overall job.
To be thorough, we could now take a look at all the places where things can go wrong in the system. I expect that survey to yield that at as long we can successfully get past step (2)b from the first diagram, the Supervisor is always in a position to either detect that a job isn’t making progress and help with recovery or can at least call for help. The system always knows what its current intent is, i.e. which state transitions it wants to drive, and never forgets about that intent since that intent is logged in the job store at all times and all progress against that intent is logged as well. The submission request (1) depends on the outcome of (2)a/b to guard against failures while putting a job and its steps into the system so that a client can take corrective action. In fact, once the job record is marked as inconsistent in step (2)b, the scheduler could already report success back to the submitting party even before the first step is scheduled, because the Supervisor would pick up that inconsistency eventually.