Deadlocks and other locking conflicts that cause transactional database operations to fail puzzle many application developers. Sure, proper database design and careful implementation of database access (with appropriate support from the database engine) should take care of that problem, but they cannot do so in all cases. Sometimes, especially under stress and in other situations with high lock contention, a database has little choice but to pick at least one of the transactions competing for the same locks as the victim of the deadlock resolution and abort it. Generally speaking, transactions that abort and roll back are a good thing, because this behavior guarantees data integrity; in the end, we use transaction technology precisely for those cases where data integrity is at risk. What’s interesting is that even though transactions are a technology that is explicitly about things going wrong, the strategy for dealing with failing transactions is often not much more than to bubble the problem up to the user and say “We apologize for the inconvenience. Please press OK.”

The appropriate strategy for handling a deadlock, or any other recoverable cause of a transaction abort, at the application level is to back out of the entire operation and retry the transaction. Retrying is a gamble that the next time the transaction runs it won’t hit the same deadlock again, or that it will at least come out victorious when the database picks its victims. Eventually it’ll work, even if it takes a few attempts. That’s the idea. It’s quite simple.

What is not really all that simple is the implementation. Whenever you use transactions, you must make your code aware that such “good errors” may occur at any time. Wrapping your transactional ODBC/OLEDB/ADO/ADO.NET code, or your calls to transactional Enterprise Services or COM+ components, in a try/catch block, writing errors to log files and showing message boxes to users just isn’t the right thing to do. The right thing is to simply do the same batch of work again, and again, until it succeeds.
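
Here’s a minimal sketch of that retry loop, assuming SQL Server through ADO.NET, where being chosen as the deadlock victim surfaces as a SqlException with error number 1205 (the helper name and the retry limit are made up for illustration):

```csharp
using System;
using System.Data.SqlClient;

static class TransactionRetry
{
    // Runs a batch of transactional work, retrying when SQL Server
    // aborts the transaction as a deadlock victim (error 1205).
    public static void Run(string connectionString,
                           Action<SqlTransaction> work,
                           int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using (SqlConnection connection = new SqlConnection(connectionString))
                {
                    connection.Open();
                    using (SqlTransaction tx = connection.BeginTransaction())
                    {
                        work(tx);    // the actual batch of work
                        tx.Commit(); // success: done, no retry needed
                        return;
                    }
                }
            }
            catch (SqlException ex)
            {
                // Anything other than a deadlock victim, or running out
                // of attempts, is a real error and bubbles up as usual.
                if (ex.Number != 1205 || attempt >= maxAttempts)
                    throw;
                // Deadlock victim: the database has already rolled back,
                // so we simply go around the loop and try again.
            }
        }
    }
}
```

The interesting part is not the loop itself but what you pass in as the batch of work: it has to be the entire unit, safe to re-run from the top.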

The problem that some developers seem to have with “just retry” is that it’s not so clear what should be retried. It’s a problem of finding and defining the proper transaction scope. Especially when user interaction is in the picture, things easily get very confusing. If a user has filled in a form on a web page or in some dialog window and all of his/her input is complete and correct, should the user be bothered with a message that the update transaction failed due to a locking issue? Certainly not. Should the user know when the transaction fails because the database is currently unavailable? Maybe, but not necessarily. Should the user be made aware that the application he/she is using has suddenly become incompatible with the schema of the backend database? Maybe, but what does Joe in the sales department do with that valuable piece of information?

If stuff fails, should we just forget about Joe’s input and tell him to come back when the system is happier to serve him? So, in other words, do we have Joe retry the job? That’s easy to program, but that sort of strategy doesn’t really make Joe happy, does it?

So what’s the right thing to do? One part of the solution is a proper separation between the things the user (or a program) does and the things that the transaction does. This gives us two layers and “a job” that can be handed from the presentation layer down to the “transaction layer”. Once that separation is in place, we can come up with a mechanism that runs those jobs in transactions and automates how and when transactions are retried. Transactional MSMQ queues turn out to be a brilliant tool that makes this very easy to implement; a rough sketch of the shape follows.
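
This back-of-the-envelope sketch uses System.Messaging; the queue path and the string payload are made up, and a complete implementation would enlist the queue receive and the database work in a single DTC transaction rather than keeping them separate:

```csharp
using System;
using System.Messaging;

class JobPump
{
    const string QueuePath = @".\private$\jobs"; // hypothetical queue name

    // Presentation layer: capture the user's input as a job message and
    // hand it down. The send is transactional, so the job is either
    // enqueued completely or not at all.
    public static void SubmitJob(string job)
    {
        if (!MessageQueue.Exists(QueuePath))
            MessageQueue.Create(QueuePath, true /* transactional */);

        using (MessageQueue queue = new MessageQueue(QueuePath))
        {
            queue.Send(job, MessageQueueTransactionType.Single);
        }
    }

    // "Transaction layer": receive a job and run the database work.
    // If ProcessJob throws, the pending transaction aborts on Dispose
    // and the message lands back in the queue -- an automatic retry.
    public static void RunOneJob()
    {
        using (MessageQueue queue = new MessageQueue(QueuePath))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
            using (MessageQueueTransaction tx = new MessageQueueTransaction())
            {
                tx.Begin();
                Message message = queue.Receive(tx);
                ProcessJob((string)message.Body); // the transactional DB work
                tx.Commit();
            }
        }
    }

    static void ProcessJob(string job)
    {
        // placeholder: run the database transaction for this job,
        // e.g. via the retry helper sketched above
    }
}
```

More tomorrow. Stay tuned.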
