See Part 2

Now that we have a way to detect deadlocks and similar error conditions we might encounter during a transaction, we need a mechanism that allows us to back out of the doomed transaction and retry the failed operation. The good thing about transaction failures of that sort is that you can essentially retry as often as you want without doing any harm – as long as all the resources you are reading from and writing to are guarded by transactional resource managers that are enlisted in the transaction. And that is the hint at the trick we are going to use.

If we were executing the code below through a remote call, it might be rather difficult to recover from a failure. The local transaction inside the component will roll back – ok – but that punts the problem back out to the caller, who will have to implement the appropriate logic to single out the failure causes that a retry can fix and to actually resubmit the request. That can be rather tricky.

[Transaction(TransactionOption.Required)]
public class DoesStuff : ServicedComponent
{
    /* things omitted: the dbConnection and sprocUpdateAndQueryStuff
       fields and the result variable are set up elsewhere */

    [AutoComplete]
    public void DoStuff( string argument )
    {
        try
        {
            dbConnection.Open();
            sprocUpdateAndQueryStuff.Parameters["@StuffArgument"].Value = argument;
            result = this.GetResultFromReader( sprocUpdateAndQueryStuff.ExecuteReader() );
        }
        catch( Exception exception )
        {
            // map retryable failure conditions (deadlocks and the like)
            // onto RepeatableOperationException; see Part 2
            throw RepeatableOperationExceptionMapper.MapException( exception );
        }
        finally
        {
            dbConnection.Close();
        }
    }
}

Even worse, we might not even learn about a failure, depending on when and why the transaction fails. In distributed transaction scenarios, it is absolutely possible (even though rare) that a transaction fails long after the application code is done with all of its processing. Let’s assume everything in the call to DoStuff() above works just fine. The code calls a stored procedure on a remote SQL Server, SQL Server is completely happy and so is the component. Hence, neither has any good reason to complain. Let’s further assume there is a component DoesOtherStuff just like the one above, and that component is subsequently called by the remote client and updates some other SQL Server database from within the same distributed transaction. That SQL Server and that component are both happy as well. Everybody is happy.

The client now commits the transaction, maybe by calling ServiceDomain.Leave() or by returning from a method of a ServicedComponent that serves as the transaction root. Now the DTC (the distributed transaction coordinator) goes around and asks everyone for their opinion about the transaction outcome. At this point, all the user code you’ve contributed to this story is done executing. And now (we are talking about transactions, so this is where the fun is) disaster strikes and one of the SQL Servers you were talking to suddenly gets cut off from the network in some way. Hardware issue. Crash. Something.

If you are lucky, the problem occurs while the DTC is still walking around asking for votes (Phase 1 of the two-phase commit). In this case, the transaction will just abort. If you are horribly unlucky, disaster strikes while the DTC is already issuing commit commands (Phase 2). In that case, the transaction may (and will) just hang forever until either the network connection is restored or some operator throws in the towel and resolves the transaction manually (and may deliberately cause inconsistency by doing so). Since all of this may happen asynchronously, none of your code might ever learn that the transaction failed, and the (transient) transaction results just end up being quietly rolled back in the background.

Before you get all nervous about the asynchronous case, let’s look at a strategy to deal with the scenario from the client side, assuming that the transaction is resolved synchronously. We will take into account that there are transaction failures that can be fixed by rerunning the transaction.

public TransactionStatus RunTx(string argument)
{
    // assume failure until the transaction has actually committed
    TransactionStatus status = TransactionStatus.Aborted;
    bool done = false;

    while( !done )
    {
        // start a new root transaction for this attempt
        ServiceConfig cfg = new ServiceConfig();
        cfg.Transaction = TransactionOption.Required;
        ServiceDomain.Enter(cfg);

        try
        {
            DoesStuff doesStuff = new DoesStuff();
            doesStuff.DoStuff(argument);

            ContextUtil.SetComplete();
            done = true;
        }
        catch (RepeatableOperationException)
        {
            // the failure is transient; abort and let the loop retry
            ContextUtil.SetAbort();
        }
        catch
        {
            // any other failure is fatal; abort and bail out
            ContextUtil.SetAbort();
            throw;
        }
        finally
        {
            status = ServiceDomain.Leave();
        }
    }
    return status;
}

If a component throws an exception indicating that running the transaction again might yield a success, we catch that condition, abort the local transaction and, since we don’t set done to true, get back to the top of the loop and rerun the transaction. If we catch any other exception, we abort the transaction and re-throw the exception. The loop might look a bit awkward, but it does the job of re-running the transaction whenever that’s desired. So all you need to do is use that sort of loop everywhere in your code and you are all set. Well … almost.
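Just for illustration, a call might look like the sketch below. The argument value is made up for the example, and TransactionStatus.Commited is, in fact, how that enum value is spelled in System.EnterpriseServices:

// hypothetical caller; "PO-4711" is a made-up argument value
TransactionStatus outcome = RunTx("PO-4711");
if (outcome != TransactionStatus.Commited)
{
    // RunTx re-throws fatal exceptions, so we normally only get here
    // if the root transaction was doomed despite SetComplete()
    Console.WriteLine("Transaction did not commit: {0}", outcome);
}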

Assume that the argument we pass into RunTx() is something as complex and valuable (!) as a purchase order that just came into your system. A purchase order is as good as cash. The longer you keep that purchase order around in memory, the greater the risk that your local process (or machine) will fail catastrophically for some arbitrary reason. So if you need to re-run the transaction one or two times because the backend system is struggling (again, this is all technology for the worst imaginable case), you are making yourself increasingly vulnerable to a local failure until the transaction finally succeeds. To reduce that risk, it makes sense to add another step here.

Before we even start the transaction that does the processing, we grab the input data and guard it using a transactional resource manager – such as MSMQ. If we throw the input data into a private, transactional MSMQ queue as the very first step, before we do anything else, the data is safely guarded on disk. If the machine fails once the data is in the queue, the data – encapsulated in a message – will still be available once the machine comes back up. Using a message queue listener, the messages in the queue are then each read from within individual transactions and the data is submitted to the processing component. Because a message is simply put back into the queue if its transaction fails, re-running the transaction is practically automatic once the message queue listener restarts.

Which leaves one problem: there are failing transactions that we want to repeat, and there are failing transactions that we know will never succeed no matter how often we retry them. If a transactional component throws an exception because it has a bug, the transaction will abort, and no number of re-runs will make it succeed. Messages (data) that cause this sort of behavior are “poisonous” and we need to sort them out. Stay tuned; the message queue listener that takes care of all of that is up next.
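Until then, here is a minimal sketch of the capture step described above, using the System.Messaging API. The queue path and the PurchaseOrder type are assumptions made up for this example:

using System.Messaging;

public class OrderCapture
{
    // hypothetical private, transactional queue; the path is an assumption
    private const string QueuePath = @".\private$\incomingOrders";

    // PurchaseOrder is a made-up placeholder for your input data type
    public void Capture(PurchaseOrder order)
    {
        // create the queue as transactional if it does not exist yet
        if (!MessageQueue.Exists(QueuePath))
        {
            MessageQueue.Create(QueuePath, true /* transactional */);
        }

        using (MessageQueue queue = new MessageQueue(QueuePath))
        using (MessageQueueTransaction msmqTx = new MessageQueueTransaction())
        {
            // the send is transactional: the message either lands
            // durably in the queue or it doesn't exist at all
            msmqTx.Begin();
            queue.Send(order, msmqTx);
            msmqTx.Commit();
        }
    }
}

On the receiving side, the listener would dequeue with MessageQueueTransactionType.Automatic, so that the receive enlists in the ambient DTC transaction and an aborted transaction puts the message right back into the queue – but that’s for the next installment.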
