Poison Control.

A bad sign for how much I’m coding these days is that I had an HDD crash three weeks ago and only restored Visual Studio into fully working condition with all my tools and stuff today. I’ve decided that this has to change, otherwise I’ll get really rusty.

Picking up the thread from “Professor Indigo” Nicholas Allen, I’ve built a little program that illustrates an alternate handling strategy for poisonous messages that WCF throws into the poison queue on Vista and Longhorn Server if you ask it to (ReceiveErrorHandling.Move). The one we show in the docs implements a local resolution strategy that fires within the service when the service ends up faulting; that’s the strategy for ReceiveErrorHandling.Fault and works for MSMQ 3.0. The strategy I’m showing here requires our latest OS wave.

When a message arrives at a WCF endpoint through a queue, WCF will – if the queue is transactional – open a transaction and de-queue the message. It will then try to dispatch it to the target service and operation. Assuming the dispatch works, the operation gets invoked and – might – tank. If it does, an exception is raised, thrown back into the WCF stack and the transaction aborts. Happily, WCF grabs the next message from the queue – which happens to be the one that just caused the failure due to the rollback – and the operation – might – tank again.

Now, the reasons why the operation might fail are as numerous as the combinations of program statements that you could put there. Anything could happen: the program is completely broken, the input data causes the app to go to that branch that nobody ever cared to test (or apparently not enough), the backend database is permanently offline, the machine is having an extremely bad hardware day, power fails, you name it.

So what if the application just keeps choking and throwing on that particular message? With either of the aforementioned error handling modes, WCF is going to take the message out of the loop when its patience with the patient is exhausted. With the ReceiveErrorHandling.Fault option, WCF will raise an error event that can be caught and processed with a handler. When you use ReceiveErrorHandling.Move things are a bit more flexible, because the message causing all that trouble now sits in a queue again.
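As a rough mental model of that loop (and explicitly not WCF’s actual implementation), here is a tiny in-memory sketch. The names Deliver, maxAttempts and poisonQueue are made up for illustration; the real thing works against transactional MSMQ queues and is driven by the binding’s retry settings.

```csharp
using System;
using System.Collections.Generic;

public static class PoisonLoopSketch
{
    // Hypothetical helper: tries the handler up to maxAttempts times.
    // Returns true if the message was given up on and moved to the poison queue.
    public static bool Deliver(string message, Action<string> handler,
                               int maxAttempts, Queue<string> poisonQueue)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                handler(message);   // success: the "transaction" commits
                return false;       // and the message is consumed
            }
            catch (Exception)
            {
                // failure: the "transaction" aborts, the message is back in
                // the queue and gets picked up again on the next iteration
            }
        }

        // Patience exhausted: take the message out of the loop.
        poisonQueue.Enqueue(message);
        return true;
    }

    public static void Main()
    {
        Queue<string> poison = new Queue<string>();
        int attempts = 0;

        bool moved = Deliver(
            "purchase order",
            m => { attempts++; throw new InvalidOperationException("tanked"); },
            3,
            poison);

        // Prints: Attempts: 3, moved: True, poison queue length: 1
        Console.WriteLine("Attempts: {0}, moved: {1}, poison queue length: {2}",
                          attempts, moved, poison.Count);
    }
}
```

The point of the sketch is only the shape of the problem: without the attempt limit, a handler that always throws would spin on the same message forever.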

The headache-causing problem with poison messages is that you really, really need to do something about them. From the sender’s perspective, the message has been delivered and it puts its trust into the receiver to do the right thing. “Here’s that $1,000,000 purchase order! I’m done, go party!” If the receiving service goes into the bug-induced loop of recurring death, you’ve got two problems: you have a nasty bug that’s probably difficult to repro since it happens under stress, and you’ve got a $1,000,000 purchase order unhappily sitting in a dark hole. Guess what your great-grand-boss’ boss cares more about.

The second, technically slightly more headache-causing problem with poison messages (if that’s possible to imagine) is that they just sit there with all the gold and diamonds that they might represent, but they are effectively just a bunch of (if you’re lucky) XML goo. Telling a system operator to go and check the poison message queues or to surface their contents to him/her and look what’s going on there is probably not a winning strategy.

So what to do? Your high-throughput automated-processing solution that does the regular business behind the queue has left the building for lunch. That much is clear. How do you hook in some alternate processing path that does at least surface the problem to an operator or “information worker” – or even a call center agent pool – in a legible and intelligible fashion so that a human can look at the problem and try finding a fix? In the end, we’ve got the best processing unit for non-deterministic and unexpected events sitting between our shoulders, one would hope. How about writing a slightly less automated service alternative that’s easy to adjust and tries to get the issue surfaced to someone or just tries multiple things [Did someone just say “Workflow”?] – and hooking that straight up to where all the bad stuff lands: the poison queue?

Here’s the code. I just coded that up for illustrative purposes and hence there’s absolutely room for improvement. I’m going to put the project files up on wcf.netfx3.com and will update this post with the link. We’ll start with the boilerplate stuff and the “regular” service:

using System;
using System.Collections.Generic;
using System.Text;
using System.ServiceModel.Channels;
using System.ServiceModel;
using System.Runtime.Serialization;
using System.ServiceModel.Description;
using System.Workflow.Runtime;
using ServerErrorHandlingWorkflow;
using ServerData;

namespace Server
{
    [ServiceContract(Namespace=Program.ServiceNamespaceURI)]
    interface IApplicationContract
    {
        [OperationContract(IsOneWay=true)]
        void SubmitData(ApplicationData data);
    }

    [ServiceBehavior(TransactionAutoCompleteOnSessionClose=true,
                    ReleaseServiceInstanceOnTransactionComplete=true)]
    class ApplicationService : IApplicationContract
    {
        [OperationBehavior(TransactionAutoComplete=true,TransactionScopeRequired=true),
        System.Diagnostics.DebuggerStepThrough]
        public void SubmitData(ApplicationData data)
        {
            throw new Exception("The method or operation is not implemented.");
        }
    }

Not much excitement here, except that the throw statement in SubmitData will always cause the service to tank. In real life, the path to the particular place where the service consistently finds its way into a trouble spot is more convoluted and may involve a few thousand lines, but this is a good approximation of what happens when you hit a poison message. Stuff keeps failing.

The next snippet is our alternate service. Instead of boldly trying to do complex processing, it simply punts the message data to a Workflow. That’s assuming that the message isn’t completely messed up to begin with and can indeed be de-serialized. To mitigate that scenario we could also use a one-way universal contract and be even more careful. The key difference between this and the “regular” service is that the alternate service turns off the WCF address filter check. We’ll get back to that.

    [ServiceBehavior(AddressFilterMode = AddressFilterMode.Any)]
    class ApplicationErrorService : IApplicationContract
    {
        public void SubmitData(ApplicationData data)
        {
            Dictionary<string,object> workflowArgs = new Dictionary<string,object>();
            workflowArgs.Add("ApplicationData",data);
            WorkflowInstance workflowInstance =
                Program.WorkflowRuntime.CreateWorkflow(
                          typeof(ErrorHandlingWorkflow),
                          workflowArgs);
            workflowInstance.Start();
        }
    }
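For the case where the message can’t even be de-serialized into ApplicationData, the one-way universal contract mentioned above could look roughly like the following. This is a sketch and not part of the sample project: IUniversalOneWayContract and UniversalErrorService are names I’m making up here, and the handler just dumps what it got to the console.

```csharp
using System;
using System.ServiceModel;
using System.ServiceModel.Channels;

// Action = "*" makes this the catch-all operation, and taking the raw
// Message means WCF never tries to de-serialize the body into a typed object.
[ServiceContract]
public interface IUniversalOneWayContract
{
    [OperationContract(IsOneWay = true, Action = "*")]
    void ProcessMessage(Message message);
}

// Hypothetical handler; like ApplicationErrorService it has to turn off the
// address filter check because it listens on the poison sub-queue.
[ServiceBehavior(AddressFilterMode = AddressFilterMode.Any)]
public class UniversalErrorService : IUniversalOneWayContract
{
    public void ProcessMessage(Message message)
    {
        // Surface at least the action and whatever the message renders as.
        Console.WriteLine("Poison message with action: {0}", message.Headers.Action);
        Console.WriteLine(message.ToString());
    }
}
```

From there you can still attempt a typed de-serialization inside a try/catch and fall back to handing the raw XML to a human if that fails.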

So now we’ve got the fully automated middle-of-the-road default service and our “what do we do next” alternate service. Let’s hook them up.

    class Program
    {
        public const string ServiceNamespaceURI =
                "http://samples.microsoft.com/2007/03/WCF/PoisonHandling/Service";
        public static WorkflowRuntime WorkflowRuntime = new WorkflowRuntime();

        static void Main(string[] args)
        {
            string msmqQueueName = Properties.Settings.Default.QueueName;
            string msmqPoisonQueueName = msmqQueueName+";poison";
            string netMsmqQueueName =
                 "net.msmq://" + msmqQueueName.Replace('\\', '/').Replace("$","");
            string netMsmqPoisonQueueName = netMsmqQueueName+";poison";

            if (!System.Messaging.MessageQueue.Exists(msmqQueueName))
            {
                System.Messaging.MessageQueue.Create(msmqQueueName, true);
            }

First – and for this little demo only – we’re setting up a local queue and doing a little stringsmithing to turn the MSMQ-format queue name stored in app.config into the net.msmq URI format. Next …

            ServiceHost applicationServiceHost = new ServiceHost(typeof(ApplicationService));
            NetMsmqBinding queueBinding = new NetMsmqBinding(NetMsmqSecurityMode.None);
            queueBinding.ReceiveErrorHandling = ReceiveErrorHandling.Move;
            queueBinding.ReceiveRetryCount = 1;
            queueBinding.RetryCycleDelay = TimeSpan.FromSeconds(1);
            applicationServiceHost.AddServiceEndpoint(typeof(IApplicationContract),
                                                      queueBinding,
                                                      netMsmqQueueName);

Now we’ve bound the “regular” application service to the queue. I’m setting the binding parameters (look them up at your leisure) in a way that we’re failing very fast here. By default, the RetryCycleDelay is set to 30 minutes, which means that WCF is giving you a reasonable chance to fix temporary issues while stuff hangs out in the retry queue. Now for the poison handler service:

            ServiceHost poisonHandlerServiceHost = new ServiceHost(typeof(ApplicationErrorService));
            NetMsmqBinding poisonBinding = new NetMsmqBinding(NetMsmqSecurityMode.None);
            poisonBinding.ReceiveErrorHandling = ReceiveErrorHandling.Drop;
            poisonHandlerServiceHost.AddServiceEndpoint(typeof(IApplicationContract),
                                                        poisonBinding,
                                                        netMsmqPoisonQueueName);

Looks almost the same, hmm? The trick here is that we’re pointing this one to the poison queue into which the regular service drops all the stuff that it can’t deal with. Otherwise it’s (almost) just a normal service. The key difference between the ApplicationErrorService service and its sibling is that the poison-message handler service implementation is decorated with [ServiceBehavior(AddressFilterMode = AddressFilterMode.Any)]. Since the original message was sent to a different (the original) queue and we’re now looking at a sub-queue that has a different name and therefore a different WS-Addressing:To identity, WCF would normally reject processing that message. With this behavior setting we can tell WCF to ignore that and have the service treat the message as if it landed at the right place – which is what we want.

And now for the unspectacular run-it and drop-a-message-into-queue finale:

            applicationServiceHost.Open();
            poisonHandlerServiceHost.Open();

            Console.WriteLine("Application running");

            ChannelFactory<IApplicationContract> client =
               new ChannelFactory<IApplicationContract>(queueBinding,
                                                        netMsmqQueueName);
            IApplicationContract channel = client.CreateChannel();
            ApplicationData data = new ApplicationData();
            data.FirstName = "Clemens";
            data.LastName = "Vasters";
            channel.SubmitData(data);
            ((IClientChannel)channel).Close();

            Console.WriteLine("Press ENTER to exit");

            Console.ReadLine();
        }
    }
}

The Workflow that’s hooked up to the poison handler in my particular sample project does nothing big. It’s got a property that is initialized with the data item and just has a code activity that spits out the message to the console. It could send an email, page an operator through messenger, etc. Whatever works.
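Since the Workflow itself isn’t shown above, here is a hand-written approximation of what such a thing could look like in code. Treat it as a sketch under assumptions: the shape of ApplicationData is guessed from the two fields the client sets, the activity name is invented, and the real project may well have been built with the designer rather than by hand.

```csharp
using System;
using System.Runtime.Serialization;
using System.Workflow.Activities;

namespace ServerData
{
    // Assumed shape: only FirstName and LastName show up in the post.
    [DataContract]
    public class ApplicationData
    {
        [DataMember] public string FirstName;
        [DataMember] public string LastName;
    }
}

namespace ServerErrorHandlingWorkflow
{
    using ServerData;

    public sealed class ErrorHandlingWorkflow : SequentialWorkflowActivity
    {
        private ApplicationData applicationData;

        // The runtime sets this from the "ApplicationData" entry in the
        // workflowArgs dictionary; key and property name have to match.
        public ApplicationData ApplicationData
        {
            get { return applicationData; }
            set { applicationData = value; }
        }

        public ErrorHandlingWorkflow()
        {
            // Same pattern the designer generates: unlock, add, lock.
            this.CanModifyActivities = true;
            CodeActivity report = new CodeActivity();
            report.Name = "reportPoisonMessage";   // invented name
            report.ExecuteCode += new EventHandler(ReportPoisonMessage);
            this.Activities.Add(report);
            this.Name = "ErrorHandlingWorkflow";
            this.CanModifyActivities = false;
        }

        // The single code activity: spit the data item out to the console.
        private void ReportPoisonMessage(object sender, EventArgs e)
        {
            Console.WriteLine("Poison message for {0} {1} needs attention",
                              applicationData.FirstName,
                              applicationData.LastName);
        }
    }
}
```

Swapping the console output for an email or a task in someone’s inbox only means replacing or adding activities; the poison handler service doesn’t change.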

Clemens Vasters