September 6, 2012
@ 07:08 PM

As I thumb through some people's code on GitHub, I see a fairly large number of "catch all" exception handling cases. It's difficult to blame folks for that, since there's generally (and sadly) very little discipline about exception contracts and exception masking, i.e. wrapping exceptions so that failure conditions of underlying implementation details don't bubble through.

If you're calling a function that sits on a mountain of dependencies and folks don't care about masking exceptions, there are many dozens of candidate exceptions that can bubble back up to you, and there's little chance of dealing with them all or even knowing them. Java has been trying to enforce more discipline in that regard, but people cheat there with "catch all" as well. There's also the question of what the right way to deal with most exceptions is. In many cases, folks implement "intercept, shrug and log" and mask the failure by telling users that something went wrong. In other common cases, folks implement retries. It's actually fairly rare to see deeply customized and careful reactions to particular exceptions. Again - things are complicated and exceptions are supposed to be exceptional (reminder: throwing exceptions as part of the regular happy path is horrifyingly bad for performance and terrible from a style perspective), so these blanket strategies are typically an efficient way of dealing with things.

That all said ...

Never, never ever do this:


And not even this:

catch(Exception e)

Those examples are universally bad. (Yes, you will probably find examples of that type even in the archive of this blog and some of my public code. Just goes to show that I've learned some better coding practices here at Microsoft in the past 6 1/2 years.)

The problem with them is that they catch not only the benign stuff, but they also catch and suppress the C# runtime equivalent of the Zombie Apocalypse. If you get thread-abort, out-of-memory, or stack-overflow exceptions thrown back at you, you don't want to suppress those. Once you run into these, your code has ignored all the red flags and exhausted its resources, and whatever it was that you called didn't get its job done and likely sits there as a zombie in an undefined state. That class of exceptions is raining down your call stack like a shower of knife blades. They can't happen. Your code must be written defensively enough to never run into that situation and overtax resources in that way; if it does without you knowing what the root cause is, this is an automatic "Priority 0", "drop-everything-you're-working-on" class bug. It certainly is if you're writing services that need to stay up 99.95%+.

What do we do? If we see any of those exceptions, it's an automatic death penalty for the process. Once you see an unsafe out-of-memory exception or stack overflow, you can't trust the state of the respective part of the system and likely not the stability of the system. Mind that there's also an "it depends" here; I would follow a different strategy if I was talking about software for an autonomous Mars Rover that can't crash even if it's gravely ill. There I would likely spend a few months on the exception design and "what could go wrong here" before even thinking about functionality, so that's a different ballgame. In a cloud system, booting a cluster machine that has the memory flu is a good strategy.

Here's a variation of the helper we use:

public static bool IsFatal(this Exception exception)
{
    while (exception != null)
    {
        // InsufficientMemoryException derives from OutOfMemoryException but is recoverable
        if ((exception is OutOfMemoryException && !(exception is InsufficientMemoryException)) ||
            exception is ThreadAbortException ||
            exception is AccessViolationException ||
            exception is SEHException ||
            exception is StackOverflowException)
        {
            return true;
        }

        // only look inside wrappers that commonly hide the real failure
        if (!(exception is TypeInitializationException) &&
            !(exception is TargetInvocationException))
        {
            break;
        }
        exception = exception.InnerException;
    }
    return false;
}

If you put this into a static utility class, you can use this on any exception as an extension. And whenever you want to do a "catch all", you do this:

}
catch (Exception e)
{
    if (e.IsFatal())
    {
        throw;
    }
    Trace.TraceError(..., e);
}

If the exception is fatal, you simply throw it up as high as you can. Eventually it'll end up at the bottom of whatever thread it happened on (where you might log and rethrow) and will hopefully take the process with it. Threads marked as background threads don't do that, so it's actually not a good idea to use those. These exceptions are unhandled, process-terminating disasters, and the resulting process crash-dump is something you want to force in a 24/7 system so that you can weed them out one by one.
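To make the effect of the unwrapping loop concrete, here is a small console sketch. It is not the helper we use in production: the IsFatal below is deliberately cut down to two fatal types, and TryRun is a hypothetical wrapper invented for this illustration. It shows a benign failure being logged and swallowed, and a fatal exception escaping the catch-all even though it arrives wrapped in a TargetInvocationException.

```csharp
using System;
using System.Diagnostics;
using System.Reflection;

static class ExceptionExtensions
{
    // Cut-down stand-in for the helper above (assumption: two fatal types are enough for a demo).
    public static bool IsFatal(this Exception exception)
    {
        while (exception != null)
        {
            if ((exception is OutOfMemoryException && !(exception is InsufficientMemoryException)) ||
                exception is AccessViolationException)
            {
                return true;
            }
            // Only unwrap the wrapper types; anything else ends the search.
            if (!(exception is TypeInitializationException) &&
                !(exception is TargetInvocationException))
            {
                break;
            }
            exception = exception.InnerException;
        }
        return false;
    }
}

static class Program
{
    // Hypothetical wrapper: a "catch all" that lets fatal exceptions pass through.
    public static bool TryRun(Action work)
    {
        try
        {
            work();
            return true;
        }
        catch (Exception e)
        {
            if (e.IsFatal())
            {
                throw;
            }
            Trace.TraceError("Work failed: {0}", e);
            return false;
        }
    }

    static void Main()
    {
        // Benign failure: logged and suppressed, TryRun reports false.
        Console.WriteLine(TryRun(() => { throw new InvalidOperationException("benign"); }));

        // Fatal failure hidden inside a reflection wrapper: it is rethrown.
        // The outer catch exists only so the demo can report what happened;
        // real code would let the exception take the process down.
        try
        {
            TryRun(() => { throw new TargetInvocationException(new OutOfMemoryException()); });
        }
        catch (TargetInvocationException)
        {
            Console.WriteLine("fatal exception escaped the catch-all");
        }
    }
}
```

The simulated OutOfMemoryException is thrown by hand here, which is only acceptable because this is a demonstration of the control flow.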

(Update) As Richard Blewett pointed out after reading this post, the StackOverflowException can't be caught in .NET 2.0+, at all, and the ThreadAbortException automatically rethrows even if you try to suppress it. There are two reasons for them to be on the list: first, to shut up any code review debates about which of the .NET stock exceptions are fatal and ought to be there; second, because code might (ab-)use these exceptions as fail-fast exceptions and fake-throw them, or the exceptions might be blindly rethrown when marshaled from a terminated background thread where they were caught at the bottom of the thread. However they show up, it's always bad for them to show up.

If you catch a falling knife, rethrow.

Categories: Technology | CLR

September 2, 2012
@ 05:31 PM

After just over six years in the United States, our family is going to relocate back to Germany sometime in the second half of this month.

Thanks to a lot of effort by our management and HR teams at Microsoft, including our VP Scott Guthrie, I will be staying with the Windows Azure engineering group and with the Service Bus feature team and will come back to the mothership fairly frequently, likely 5-6 times a year; you can look at this as "working from home" with a 14h each-way commute to work.

Since it's soon going to be fairly obvious from my Twitter timeline that the move is happening, I thought it made sense to let you know here as well, with a few more than 140 characters. I already spoke to a few folks who're good at reading tea leaves and writing about it (like Darryl Taft and Mary-Jo Foley) back at TechEd North America about this coming up, so there wouldn't be speculation about me jumping ship. I'm not.

There are two sets of reasons for why we're moving back at this time. The primary set of reasons is around family concerns. Our daughter is now 5 and the grandparents and the rest of the family deserve extended time with her and us. We also have a choice between having her set her cultural and educational roots in America or in Europe, and whether our daughter is going to communicate with us in English or German in the long run.

Everyone has their notions of patriotism and I'm a proud European. And irrespective of what the media panic says, I'm bullish on Europe - and I'm also bullish on the Middle East and Africa. I see an awesome number of really sophisticated customer cloud solutions and concepts in progress in manufacturing, commerce, and energy in the EMEA region, and setting up camp near Mönchengladbach/Düsseldorf will put me within 1-2 flying hours of most of these customers and my colleagues working with them. I think that having more folks from the core Windows Azure engineering organization over in Europe - my Service Bus colleague David Ingham is in Newcastle/England already - will be a good thing.

And even though we're still working on calibrating the exact shape of my remote role and coordinating with the local colleagues, I'm fairly certain that conference and event attendees in EMEA will see a bit more of me again. The first European conference I'll be speaking at that I would otherwise not have been able to go to will be the German ADC conference in early November. If you run/chair a conference in Europe and would be interested in having me speak, drop me an email.

Also - this isn't the last word in the U.S. for us. Our daughter is a dual citizen and we're keeping the door open to come back in a few years' time, so this is technically a temporary relocation. We obviously have a lot of friends here, and the Puget Sound area is one of the most beautiful places in the world (even when it's gray) and there's no better place in the world for one of my newly acquired hobbies, which is, probably oddly, Civil and Military Aviation History. I also acquired an appreciation for Baseball and the up-and-coming (you just have to believe) Seattle Mariners - and, of course, Football and the Seahawks.

Bottom line: Same job, different continent, different time-zone.


September 1, 2012
@ 04:49 AM

Today has been a lively day in some parts of the Twitterverse debating the Saga pattern. As it stands, there are a few frameworks for .NET out there that use the term "Saga" for some framework implementation of a state machine or workflow. Trouble is, that's not what a Saga is. A Saga is a failure management pattern.

Sagas come out of the realization that particularly long-lived transactions (originally even just inside databases), but also far-distributed transactions across location and/or trust boundaries, can't easily be handled using the classic ACID model with two-phase commit and holding locks for the duration of the work. Instead, a Saga splits work into individual transactions whose effects can be, somehow, reversed after work has been performed and committed.


The picture shows a simple Saga. If you book a travel itinerary, you want a car and a hotel and a flight. If you can't get all of them, it's probably not worth going. It's also very certain that you can't enlist all of these providers into a distributed ACID transaction. Instead, you'll have an activity for booking rental cars that knows both how to perform a reservation and also how to cancel it - and one for a hotel and one for flights.

The activities are grouped in a composite job (routing slip) that's handed along the activity chain. If you want, you can sign/encrypt the routing slip items so that they can only be understood and manipulated by the intended receiver. When an activity completes, it adds a record of the completion to the routing slip along with information on where its compensating operation can be reached (e.g. via a Queue). When an activity fails, it cleans up locally and then sends the routing slip backwards to the last completed activity's compensation address to unwind the transaction outcome.

If you're a bit familiar with travel, you'll also notice that I've organized the steps by risk. Reserving a rental car almost always succeeds if you book in advance, because the rental car company can move more cars on-site if there is high demand. Reserving a hotel is slightly more risky, but you can commonly back out of a reservation without penalty until 24h before the stay. Airfare often comes with a refund restriction, so you'll want to do that last.
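Stripped of queues and routing slips, the forward/backward flow can be sketched in a few lines. This is only an in-process illustration under my own assumptions, not the distributed mockup: SagaStep, SagaSketch and the journal list are hypothetical names, and the whole chain runs in one loop instead of hopping between activity hosts. Steps are executed in order of increasing risk; the first failure triggers the compensations of everything already completed, in reverse order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical step type: a name, a forward action and its compensating action.
sealed class SagaStep
{
    public SagaStep(string name, Func<bool> doWork, Action compensate)
    {
        Name = name;
        DoWork = doWork;
        Compensate = compensate;
    }

    public string Name { get; private set; }
    public Func<bool> DoWork { get; private set; }
    public Action Compensate { get; private set; }
}

static class SagaSketch
{
    // Runs the steps in order. On the first failure, undoes the completed
    // steps in reverse order and returns false; returns true if all succeed.
    public static bool Run(IEnumerable<SagaStep> steps, List<string> journal)
    {
        var completed = new Stack<SagaStep>();
        foreach (var step in steps)
        {
            if (step.DoWork())
            {
                journal.Add("done:" + step.Name);
                completed.Push(step);
                continue;
            }

            journal.Add("failed:" + step.Name);
            while (completed.Count > 0)
            {
                var undo = completed.Pop();
                undo.Compensate();
                journal.Add("undone:" + undo.Name);
            }
            return false;
        }
        return true;
    }

    static void Main()
    {
        var journal = new List<string>();
        var steps = new[]
        {
            new SagaStep("car", () => true, () => { }),
            new SagaStep("hotel", () => true, () => { }),
            new SagaStep("flight", () => false, () => { }) // simulate a sold-out flight
        };

        Console.WriteLine(Run(steps, journal));
        // prints "done:car, done:hotel, failed:flight, undone:hotel, undone:car"
        Console.WriteLine(string.Join(", ", journal));
    }
}
```

The stack is what makes the unwind order the reverse of the completion order; in the distributed version that stack travels inside the routing slip.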

I created a Gist on GitHub that you can run as a console application. It illustrates this model in code. Mind that it is a mockup and not a framework. I wrote this in less than 90 minutes, so don't expect to reuse this.

The main program sets up an exemplary routing slip (all the classes are in the one file) and creates three completely independent "processes" (activity hosts) that are each responsible for handling a particular kind of work. The "processes" are linked by a "network" and each kind of activity has an address for forward progress work and one for compensation work. The network resolution is simulated by "Send".

static ActivityHost[] processes;

static void Main(string[] args)
{
    var routingSlip = new RoutingSlip(new WorkItem[]
        {
            new WorkItem<ReserveCarActivity>(new WorkItemArguments{{"vehicleType", "Compact"}}),
            new WorkItem<ReserveHotelActivity>(new WorkItemArguments{{"roomType", "Suite"}}),
            new WorkItem<ReserveFlightActivity>(new WorkItemArguments{{"destination", "DUS"}})
        });

    // imagine these being completely separate processes with queues between them
    processes = new ActivityHost[]
                        {
                            new ActivityHost<ReserveCarActivity>(Send),
                            new ActivityHost<ReserveHotelActivity>(Send),
                            new ActivityHost<ReserveFlightActivity>(Send)
                        };

    // hand off to the first address
    Send(routingSlip.ProgressUri, routingSlip);
}

static void Send(Uri uri, RoutingSlip routingSlip)
{
    // this is effectively the network dispatch
    foreach (var process in processes)
    {
        if (process.AcceptMessage(uri, routingSlip))
        {
            break;
        }
    }
}

The activities each implement a reservation step and an undo step. Here's the one for cars:

class ReserveCarActivity : Activity
{
    static Random rnd = new Random(2);

    public override WorkLog DoWork(WorkItem workItem)
    {
        Console.WriteLine("Reserving car");
        var car = workItem.Arguments["vehicleType"];
        var reservationId = rnd.Next(100000);
        Console.WriteLine("Reserved car {0}", reservationId);
        return new WorkLog(this, new WorkResult { { "reservationId", reservationId } });
    }

    public override bool Compensate(WorkLog item, RoutingSlip routingSlip)
    {
        var reservationId = item.Result["reservationId"];
        Console.WriteLine("Cancelled car {0}", reservationId);
        return true;
    }

    public override Uri WorkItemQueueAddress
    {
        get { return new Uri("sb://./carReservations"); }
    }

    public override Uri CompensationQueueAddress
    {
        get { return new Uri("sb://./carCancellactions"); }
    }
}

The chaining happens solely through the routing slip. The routing slip is "serializable" (it's not, pretend that it is) and it's the only piece of information that flows between the collaborating activities. There is no central coordination. All work is local on the nodes and once a node is done, it either hands the routing slip forward (on success) or backward (on failure). For forward progress data, the routing slip has a queue and for backwards items it maintains a stack. The routing slip also handles resolving and invoking whatever the "next" thing to call is on the way forward and backward.

class RoutingSlip
{
    readonly Stack<WorkLog> completedWorkLogs = new Stack<WorkLog>();
    readonly Queue<WorkItem> nextWorkItem = new Queue<WorkItem>();

    public RoutingSlip()
    {
    }

    public RoutingSlip(IEnumerable<WorkItem> workItems)
    {
        foreach (var workItem in workItems)
        {
            this.nextWorkItem.Enqueue(workItem);
        }
    }

    public bool IsCompleted
    {
        get { return this.nextWorkItem.Count == 0; }
    }

    public bool IsInProgress
    {
        get { return this.completedWorkLogs.Count > 0; }
    }

    public bool ProcessNext()
    {
        if (this.IsCompleted)
        {
            throw new InvalidOperationException();
        }

        var currentItem = this.nextWorkItem.Dequeue();
        var activity = (Activity)Activator.CreateInstance(currentItem.ActivityType);
        try
        {
            var result = activity.DoWork(currentItem);
            if (result != null)
            {
                this.completedWorkLogs.Push(result);
                return true;
            }
        }
        catch (Exception e)
        {
            Console.WriteLine("Exception {0}", e.Message);
        }
        return false;
    }

    public Uri ProgressUri
    {
        get
        {
            if (IsCompleted)
            {
                return null;
            }
            else
            {
                return
                    ((Activity)Activator.CreateInstance(this.nextWorkItem.Peek().ActivityType)).
                        WorkItemQueueAddress;
            }
        }
    }

    public Uri CompensationUri
    {
        get
        {
            if (!IsInProgress)
            {
                return null;
            }
            else
            {
                return
                    ((Activity)Activator.CreateInstance(this.completedWorkLogs.Peek().ActivityType)).
                        CompensationQueueAddress;
            }
        }
    }

    public bool UndoLast()
    {
        if (!this.IsInProgress)
        {
            throw new InvalidOperationException();
        }

        var currentItem = this.completedWorkLogs.Pop();
        var activity = (Activity)Activator.CreateInstance(currentItem.ActivityType);
        try
        {
            return activity.Compensate(currentItem, this);
        }
        catch (Exception e)
        {
            Console.WriteLine("Exception {0}", e.Message);
            throw;
        }
    }
}

The local work and the decision making are encapsulated in the ActivityHost, which calls ProcessNext() on the routing slip to resolve the next activity and call its DoWork() function on the way forward, or resolves the last executed activity on the way back and invokes its Compensate() function. Again, there's nothing centralized here; all that work hinges on the routing slip, and the three activities and their execution are completely disjoint.

abstract class ActivityHost
{
    Action<Uri, RoutingSlip> send;

    public ActivityHost(Action<Uri, RoutingSlip> send)
    {
        this.send = send;
    }

    public void ProcessForwardMessage(RoutingSlip routingSlip)
    {
        if (!routingSlip.IsCompleted)
        {
            // if the current step is successful, proceed
            // otherwise go to the Unwind path
            if (routingSlip.ProcessNext())
            {
                // recursion stands for passing context via message
                // the routing slip can be fully serialized and passed
                // between systems.
                this.send(routingSlip.ProgressUri, routingSlip);
            }
            else
            {
                // pass message to unwind message route
                this.send(routingSlip.CompensationUri, routingSlip);
            }
        }
    }

    public void ProcessBackwardMessage(RoutingSlip routingSlip)
    {
        if (routingSlip.IsInProgress)
        {
            // UndoLast can put new work on the routing slip
            // and return false to go back on the forward
            // path
            if (routingSlip.UndoLast())
            {
                // recursion stands for passing context via message
                // the routing slip can be fully serialized and passed
                // between systems
                this.send(routingSlip.CompensationUri, routingSlip);
            }
            else
            {
                this.send(routingSlip.ProgressUri, routingSlip);
            }
        }
    }

    public abstract bool AcceptMessage(Uri uri, RoutingSlip routingSlip);
}


That's a Saga.

Categories: Architecture | SOA