The little series I am currently writing here on my blog has inspired me to write way more code than actually necessary to get my point across. So by now I've got my own MSMQ transport for WSE 2.0 (yes, I know that others have written that already, but I am shooting for an "enterprise strength" implementation), a WebRequest/WebResponse pair to smuggle under arbitrary ASMX proxies, and I am more than halfway done with a server-side host for MSMQ-to-ASMX (spelled out: ASP.NET Web Services).
What bugs me is that WSE 2.0's messaging model is "asynchronous only" and that it always performs a push/pull translation and that there is no way to push a message through to a service on the receiving thread. Whenever I grab a message from the queue and put it into my SoapTransport's "Dispatch()" method, the message gets queued up in an in-memory queue and that is then, on a concurrent thread, pulled (OnReceiveComplete) by the SoapReceivers collection and submitted into ProcessMessage() of the SoapReceiver (like any SoapService derived implementation) matching the target endpoint. So while I can dequeue from MSMQ within a transaction scope (ServiceDomain), that transaction scope doesn't make it across onto the thread that will actually execute the action inside the SoapReceiver/SoapService.
So now I am sitting here, contemplating and trying to figure out a workaround that doesn't require me to rewrite a big chunk of WSE 2.0 (which I am totally not shy of if that is what it takes). Transaction marshaling, thread synchronization, ah, I love puzzles. Once I know how to solve this and have made the adjustments, I'll post the queue listener I promised to wrap up the series. The other code I've written in the process will likely surface in some other way.
"My Lists", "My Photos", "My Profile" .... sounds all very familiar over there in MSN Spaces. So ... roll in the Web service interfaces, please.
That blue thing that I am running PowerPoint on during my talks is an Alienware Area51-m laptop. Heavy like a block of lead, battery life of about 2 hrs (which makes me carry two), but very fast and "feel good". Except for some minor annoyances, I have been happy with it for about a year now. And I don't think I have ever had a machine that was still faster than most of the other machines out there 12 months after I got it. One big problem I had with it until today, though, was that the machine got increasingly louder; more precisely, the fan noise was just unbearable. Since the machine is essentially a high-end desktop in a notebook shell, the box needs a lot of cooling. It turns out that the elaborate cooling mechanism that Alienware puts in those machines clogs up easily with dust, but hides the dust quite well. So while the machine looked "clean", it really, really, really wasn't. Wow. Yucky! Now it's all clean again and the fan is back to a bearable noise level. It's happy to be breathing again.
Wow. It's been a long time. 2 years again, already? Mr. Forte lets me stay at his house and invited me to his Christmas party in New York, and since I am not going to Denver for the rest of the year as I was planning and I still need a few miles to retain my current status tier with Lufthansa for '05/'06, there wasn't much to think about. I am thrilled to go back. In the two years I lived in Manhattan ('95 and '96), New York became my "second home" and it is always fantastic to go back. I love The City. Dec 17-21.
See Part 2
Now that we have a way to detect deadlocks and similar error conditions we might encounter during a transaction, we need to find a mechanism that will allow us to back out of the doomed transaction and retry the failed operation. The good thing about transaction failures of that sort is that you can essentially retry as often as you want without doing anything wrong – as long as all the resources you are reading from and writing to are guarded by transactional resource managers that are enlisted in the transaction. And that’s the hint at the trick we need to use.
If we were executing the code below through a remote call, it might be rather difficult to recover from a failure. The local transaction inside the component will roll back – ok – but that punts the problem back out to the caller, who will have to implement the appropriate logic to single out transaction failure causes that can be fixed by a retry and to actually resubmit the request. That can be rather tricky.
```csharp
[Transaction(TransactionOption.Required)]
public class DoesStuff : ServicedComponent
{
    /* things omitted */

    [AutoComplete]
    public void DoStuff( string argument )
    {
        try
        {
            dbConnection.Open();
            sprocUpdateAndQueryStuff.Parameters["@StuffArgument"].Value = argument;
            result = this.GetResultFromReader( sprocUpdateAndQueryStuff.ExecuteReader() );
        }
        catch( Exception exception )
        {
            throw RepeatableOperationExceptionMapper.MapException( exception );
        }
        finally
        {
            dbConnection.Close();
        }
    }
}
```
Even worse, we might not even learn about a failure depending on when and why the transaction fails. In distributed transaction scenarios, it is absolutely possible (even though rare) that a transaction fails long after the application code is done with all of its processing. Let’s assume everything in the call to DoStuff() above works just fine. The code calls a stored procedure on a remote SQL Server, SQL Server is completely happy and so is the component. Hence, neither has any good reason to complain. Let’s further assume there is a component DoesOtherStuff just like the one above and that component is subsequently called by the remote client and updates some other SQL Server database from within the same distributed transaction. That SQL Server and that component are both happy as well. Everybody is happy. The client commits the transaction, maybe by calling ServiceDomain.Leave() or leaving from a method of a ServicedComponent that serves as the transaction root. Now DTC (the transaction coordinator) wants to go around and ask everyone for their opinion about the transaction outcome. At this point, all user code you’ve contributed to this story is done executing. And now (we are talking about transactions, so this is where the fun is) disaster strikes and one of the SQL Servers you were talking to suddenly gets cut off from the network in some way. Hardware issue. Crash. Something. If you are lucky, the problem occurs while DTC is still walking around asking for votes (Phase 1). In this case, the transaction will just abort. If you are horribly unlucky, disaster strikes while DTC is already issuing commit commands (Phase 2). In that case, the transaction may (and will) just hang forever until either the network connection is restored or some operator throws in the hammer and resolves the transaction manually (and may deliberately cause inconsistency). 
Since all this might happen asynchronously, none of your code might ever end up knowing that the transaction failed and the (transient) transaction results just end up being quietly rolled back in the background.
Before you get all nervous about the asynchronous case, let’s look at a strategy to deal with the scenario from the client side, assuming that the transaction is resolved synchronously. We will take into account that there are transaction failures that can be fixed by rerunning the transaction.
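```csharp
public TransactionStatus RunTx(string argument)
{
    // Initialized so the compiler sees a definite assignment before the return.
    TransactionStatus status = TransactionStatus.NoTransaction;
    bool done = false;

    while( !done )
    {
        // Each attempt runs in a fresh transaction scope.
        ServiceConfig cfg = new ServiceConfig();
        cfg.Transaction = TransactionOption.Required;
        ServiceDomain.Enter(cfg);

        try
        {
            DoesStuff doesStuff = new DoesStuff();
            doesStuff.DoStuff(argument);

            ContextUtil.SetComplete();
            done = true;
        }
        catch (RepeatableOperationException)
        {
            // Worth retrying: abort and go around the loop again.
            ContextUtil.SetAbort();
        }
        catch
        {
            // Anything else: abort and hand the exception to the caller.
            ContextUtil.SetAbort();
            throw;
        }
        finally
        {
            status = ServiceDomain.Leave();
        }
    }
    return status;
}
```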
If a component throws an exception indicating that running the transaction again might yield a success, we’ll catch that condition, abort the local transaction and since we don’t set done to true, we will get back to the top of the loop and rerun the transaction. If we find another exception, we will abort the transaction and re-throw the exception we caught. The loop might look a bit awkward, but it does the job of re-running the transaction whenever that’s desired. So all you need to do is to use that sort of a loop everywhere in your code and you are all set. Well … almost.
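The loop in RunTx() retries without limit. Here is a minimal sketch of the same idea with a cap on the number of attempts and a pause between them. The helper name RetryRunner and its parameters are made up for this illustration, the exception class is a stand-in for the one from Part 2, and the ServiceDomain enter/leave calls are left out so that the sketch compiles on its own. It uses delegate and lambda syntax from newer C# versions.

```csharp
using System;
using System.Threading;

// Stand-in for the RepeatableOperationException from Part 2,
// so that this sketch compiles without the rest of the series.
public class RepeatableOperationException : Exception
{
    public RepeatableOperationException(string message) : base(message) { }
}

// Hypothetical helper, not part of Enterprise Services or any framework.
public static class RetryRunner
{
    // Runs 'work' until it succeeds or maxAttempts is used up.
    // Returns the number of attempts that were needed.
    public static int Run(Action work, int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException("maxAttempts");

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                work();
                return attempt;
            }
            catch (RepeatableOperationException)
            {
                // Out of attempts: let the repeatable failure surface to the caller.
                if (attempt >= maxAttempts) throw;
                Thread.Sleep(delay);
            }
            // Any other exception propagates immediately, as in RunTx().
        }
    }
}
```

A call could look like `RetryRunner.Run(() => doesStuff.DoStuff(argument), 3, TimeSpan.FromMilliseconds(200));`. Whether a cap makes sense at all depends on the job; the loop above deliberately keeps going.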
Assume that the argument we pass into RunTx() is something as complex and valuable (!) as a purchase order that you just got into your system. A purchase order is as good as cash. The longer you keep that purchase order around in memory, the greater is the risk that your local process (or machine) will fail catastrophically for some arbitrary reason. So if you need to re-run the transaction one or two times because the backend system is struggling (again, this is all technology for the worst imaginable case) you are making yourself increasingly vulnerable to a local failure until the transaction finally succeeds. To reduce that risk, it therefore makes sense to add another step here.
Before we even start the transaction that does the processing, we grab the input data and guard it using a transactional resource manager – such as MSMQ. If we throw the input data into a private, transactional MSMQ queue as the first step and before we do anything else, the data is safely guarded on disk. If the machine fails once the data is in the queue, the data – encapsulated in a message – will still be available once the machine comes back up. Using a message queue listener, the messages in the queue are then each read from within individual transactions and the data is submitted into the processing component. Because the messages are simply put back into the queue if the transaction fails, re-running the transaction is practically automatic once the message queue listener restarts. Which leaves a problem: There are failing transactions that we want to repeat and there are failing transactions of which we know that they will not succeed no matter how often we try them. If a transactional component throws an exception because it has a bug, the transaction will abort and no number of re-runs will have that transaction succeed. Messages (data) that cause this sort of behavior are “poisonous” and we need to sort them out. Stay tuned; the message queue listener that takes care of all of that is up next.
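To make the poison-message problem a bit more tangible, here is a rough in-memory sketch. It does not use MSMQ at all: a plain Queue<string> plays the transactional queue, putting a message back stands in for a rolled-back receive, and the names PoisonAwareListener, maxAttempts and DeadLetters are invented for this example. It is not the listener promised above, just an illustration of sorting repeatable failures from hopeless ones.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the RepeatableOperationException from Part 2.
public class RepeatableOperationException : Exception
{
    public RepeatableOperationException(string message) : base(message) { }
}

// Hypothetical in-memory listener. A real one would receive from a
// transactional MSMQ queue inside a transaction instead.
public class PoisonAwareListener
{
    private readonly Queue<string> queue = new Queue<string>();
    // Attempt counts are keyed by message text, a simplification for this sketch.
    private readonly Dictionary<string, int> attempts = new Dictionary<string, int>();
    private readonly int maxAttempts;

    public readonly List<string> Processed = new List<string>();
    public readonly List<string> DeadLetters = new List<string>();

    public PoisonAwareListener(int maxAttempts) { this.maxAttempts = maxAttempts; }

    public void Enqueue(string message) { queue.Enqueue(message); }

    // Drains the queue, handing each message to 'handler'.
    public void Run(Action<string> handler)
    {
        while (queue.Count > 0)
        {
            string message = queue.Dequeue();
            int count;
            attempts.TryGetValue(message, out count);
            attempts[message] = ++count;

            try
            {
                handler(message);
                Processed.Add(message);
            }
            catch (RepeatableOperationException)
            {
                // Transient failure: simulate the rollback by re-queuing,
                // unless the message has used up its attempts.
                if (count < maxAttempts) queue.Enqueue(message);
                else DeadLetters.Add(message);
            }
            catch (Exception)
            {
                // Non-repeatable failure: retrying cannot help, so set it aside at once.
                DeadLetters.Add(message);
            }
        }
    }
}
```

Setting poisonous messages aside rather than dropping them keeps the data around for someone to look at, which matters if the message is a purchase order.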
See Part 1
Before we can do anything about deadlocks or deal with similar troubles, we first need to be able to tell that we indeed have a deadlock situation. Finding this out is a matter of knowing the respective error codes that your database gives you and a mechanism to bubble that information up to some code that will handle the situation. So before we can think about and write the handling logic for failed/failing but safely repeatable transactions, we need to build a few little things. The first thing we’ll need is an exception class that will wrap the original exception indicating the reason for the transaction failure. The new exception class’s identity will later serve to filter out exceptions in a “catch” statement and take the appropriate actions.
```csharp
using System;
using System.Runtime.Serialization;

namespace newtelligence.EnterpriseTools.Data
{
    [Serializable]
    public class RepeatableOperationException : Exception
    {
        public RepeatableOperationException():base()
        {
        }

        public RepeatableOperationException(Exception innerException)
            :base(null,innerException)
        {
        }

        public RepeatableOperationException(string message, Exception innerException)
            :base(message,innerException)
        {
        }

        public RepeatableOperationException(string message):base(message)
        {
        }

        public RepeatableOperationException(
            SerializationInfo serializationInfo,
            StreamingContext streamingContext)
            :base(serializationInfo,streamingContext)
        {
        }

        public override void GetObjectData(
            System.Runtime.Serialization.SerializationInfo info,
            System.Runtime.Serialization.StreamingContext context)
        {
            base.GetObjectData (info, context);
        }
    }
}
```
Having an exception wrapper with the desired semantics, we now need to be able to figure out when to replace the original exception with this wrapper and re-throw it up on the call stack. The idea is that whenever you execute a database operation – or, more generally, any operation that might be repeatable on failure – you will catch the resulting exception and run it through a factory, which will analyze the exception and wrap it with the RepeatableOperationException if the issue at hand can be resolved by re-running the transaction. The (still a little naïve) code below illustrates how to use such a factory in the application code. Later we will flesh out the catch block a little more, since we will lose the original call stack if we end up re-throwing the original exception as shown here:
```csharp
try
{
    dbConnection.Open();
    sprocUpdateAndQueryStuff.Parameters["@StuffArgument"].Value = argument;
    result = this.GetResultFromReader( sprocUpdateAndQueryStuff.ExecuteReader() );
}
catch( Exception exception )
{
    // Wraps repeatable failures; anything else comes back unchanged.
    throw RepeatableOperationExceptionMapper.MapException( exception );
}
finally
{
    dbConnection.Close();
}
```
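On the lost call stack: in C#, `throw someException;` resets the stack trace of that exception object, while a bare `throw;` inside the catch block keeps it. One possible way to flesh out the catch block is sketched here with stand-in types. ToyMapper and StackPreservingCall are invented names, and the toy mapper treats TimeoutException as repeatable purely for illustration. The idea is to throw the wrapper when the mapper produced one, and to rethrow with a bare `throw;` otherwise.

```csharp
using System;

// Stand-in for the wrapper exception, reduced to the one constructor needed here.
public class RepeatableOperationException : Exception
{
    public RepeatableOperationException(Exception inner) : base(null, inner) { }
}

// Toy replacement for RepeatableOperationExceptionMapper (assumption for this sketch).
public static class ToyMapper
{
    public static Exception MapException(Exception exception)
    {
        if (exception is TimeoutException) return new RepeatableOperationException(exception);
        return exception;
    }
}

public static class StackPreservingCall
{
    public static void Execute(Action operation)
    {
        try
        {
            operation();
        }
        catch (Exception exception)
        {
            Exception mapped = ToyMapper.MapException(exception);
            if (!ReferenceEquals(mapped, exception))
            {
                // New wrapper: the original exception, with its stack trace,
                // travels along as InnerException.
                throw mapped;
            }
            // Not repeatable: a bare 'throw' rethrows without resetting the stack trace.
            throw;
        }
    }
}
```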
The factory class itself is rather simple in structure, but a bit tricky to put together, because you have to know the right error codes for all resource managers you will ever run into. In the example below I put in what I believe to be the appropriate codes for SQL Server and Oracle (corrections are welcome) and left the ODBC and OLE DB factories (for which one would have to inspect the driver type and the respective driver-specific error codes) blank. The factory will check the exception's type and delegate the mapping to a private method that is specialized for a specific managed provider.
```csharp
using System;
using System.Data.SqlClient;
using System.Data.OleDb;
using System.Data.Odbc;
using System.Data.OracleClient;

namespace newtelligence.EnterpriseTools.Data
{
    public class RepeatableOperationExceptionMapper
    {
        /// <summary>
        /// Maps the exception to a Repeatable exception, if the error code
        /// indicates that the transaction is repeatable.
        /// </summary>
        /// <param name="sqlException"></param>
        /// <returns></returns>
        private static Exception MapSqlException( SqlException sqlException )
        {
            switch ( sqlException.Number )
            {
                case -2:   /* Client Timeout */
                case 701:  /* Out of Memory */
                case 1204: /* Lock Issue */
                case 1205: /* Deadlock Victim */
                case 1222: /* Lock Request Timeout */
                case 8645: /* Timeout waiting for memory resource */
                case 8651: /* Low memory condition */
                    return new RepeatableOperationException(sqlException);
                default:
                    return sqlException;
            }
        }

        private static Exception MapOleDbException( OleDbException oledbException )
        {
            switch ( oledbException.ErrorCode )
            {
                default:
                    return oledbException;
            }
        }

        private static Exception MapOdbcException( OdbcException odbcException )
        {
            return odbcException;
        }

        private static Exception MapOracleException( OracleException oracleException )
        {
            switch ( oracleException.Code )
            {
                case 104:  /* ORA-00104: Deadlock detected; all public servers blocked waiting for resources */
                case 1013: /* ORA-01013: User requested cancel of current operation */
                case 2087: /* ORA-02087: Object locked by another process in same transaction */
                case 60:   /* ORA-00060: Deadlock detected while waiting for resource */
                    return new RepeatableOperationException( oracleException );
                default:
                    return oracleException;
            }
        }

        public static Exception MapException( Exception exception )
        {
            if ( exception is SqlException )
            {
                return MapSqlException( exception as SqlException );
            }
            else if ( exception is OleDbException )
            {
                return MapOleDbException( exception as OleDbException );
            }
            else if (exception is OdbcException )
            {
                return MapOdbcException( exception as OdbcException );
            }
            else if (exception is OracleException )
            {
                return MapOracleException( exception as OracleException );
            }
            else
            {
                return exception;
            }
        }
    }
}
```
With that little framework of two classes, we can now selectively throw exceptions that convey whether a failed/failing transaction is worth repeating. Next step: How do we actually run such repeats and make sure we neither lose data nor make the user unhappy in the process? Stay tuned.
Deadlocks and other locking conflicts that cause transactional database operations to fail are things that puzzle many application developers. Sure, proper database design and careful implementation of database access (and appropriate support by the database engine) should take care of that problem, but it cannot do so in all cases. Sometimes, especially under stress and in other situations with high lock contention, a database has little choice but to pick at least one of the transactions competing for the same locks as the victim and to abort it in order to resolve the deadlock. Generally speaking, transactions that abort and roll back are a good thing, because this behavior guarantees data integrity. In the end, we use transaction technology for those cases where data integrity is at risk. What’s interesting is that even though transactions are a technology that is explicitly about things going wrong, the strategy for dealing with failing transactions is often not much more than to bubble the problem up to the user and say “We apologize for the inconvenience. Please press OK”.
The appropriate strategy for handling a deadlock or some other recoverable reason for a transaction abort on the application level is to back out of the entire operation and to retry the transaction. Retrying is a gamble that the next time the transaction runs, it won’t run into the same deadlock situation again or that it will at least come out victorious when the database picks its victims. Eventually, it’ll work. Even if it takes a few attempts. That’s the idea. It’s quite simple.
What is not really all that simple is the implementation. Whenever you are using transactions, you must make your code aware that such “good errors” may occur at any time. Wrapping your transactional ODBC/OLEDB/ADO/ADO.NET code or calls to transactional Enterprise Services or COM+ components with a try/catch block, writing errors to log-files and showing message boxes to users just isn’t the right thing to do. The right thing is to simply do the same batch of work again until it succeeds.
The problem that some developers seem to have with “just retry” is that it’s not so clear what should be retried. It’s a problem of finding and defining the proper transaction scope. Especially when user interaction is in the picture, things easily get very confusing. If a user has filled in a form on a web page or some dialog window and all of his/her input is complete and correct, should the user be bothered with a message that the update transaction failed due to a locking issue? Certainly not. Should the user know when the transaction fails because the database is currently unavailable? Maybe, but not necessarily. Should the user be made aware that the application he/she is using is for some sudden reason incompatible with the database schema of the backend database? Maybe, but what does Joe in the sales department do with that valuable piece of information?
If stuff fails, should we just forget about Joe’s input and tell him to come back when the system is happier to serve him? So, in other words, do we have Joe retry the job? That’s easy to program, but that sort of strategy doesn’t really make Joe happy, does it?
So what’s the right thing to do? One part of the solution is a proper separation between the things the user (or a program) does and the things that the transaction does. This will give us two layers and “a job” that can be handed down from the presentation layer down to the “transaction layer”. Once this separation is in place, we can come up with a mechanism that will run those jobs in transactions and will automate how and when transactions are to be retried. Transactional MSMQ queues turn out to be a brilliant tool to make this very easy to implement. More tomorrow. Stay tuned.
If I were really good at writing about life, love, happiness and tragedy, weird relationships, drama, grand obstacles, successes and defeats, and all those sudden unexpected turns and twists that a story could possibly have, and I had been willing to share what went on with and around me in real life in the last six months -- my blog would now have an entirely different audience and I could easily sell the movie rights by now. So the actual reason why you haven't seen much happening here is simply that a dramatic surge in "personal life activity" (no, not starting at "no life") took over the "blogging timeslice" and had, frankly, some adverse effects on my work morale at times. The good news is that there is definitely light at the end of that tunnel and the better news (for you as a reader) is that this place here won't be as quiet as it has been in the recent months. I've got some interesting stuff cooking.
I am presently doing some intense research on services, service patterns, message exchange patterns and many other issues related to services (No surprise there). However, I can't do that without external help and since many people are reading my blog, I can just as well start asking around right here:
I would like to get in touch with companies (preferably insurance companies and banks) that can afford a corporate history department. The ambitious goal I have is to reconstruct a few banking or insurance or purchasing business processes of ca. 1955-1965. I have come to believe that there is a lot, a lot to be learned there that will be very useful to what we're all doing. The deal is that if you share, I share whatever I have as soon as I have it. My contact address is clemensv@newtelligence.com
The good news is that the V|@gr@ spam is getting less, but what scares me is that I start getting lots and lots of religious spam from Jesusland.
In two hours I'll be back on the road (well, airport, to be precise). Today I will fly out to Reykjavik in Iceland where Achim and I will do the first of a series of SOA workshops with Microsoft EMEA from Monday to Wednesday, explaining principles of Service Oriented Architectures and the application of those principles in real applications with today's technologies. Other stops on the tour will be in Denmark in early December and, early next year, in Poland, Belgium and the Netherlands (AFAIK, all of these events are invite-only Microsoft customer events). A German-language, newtelligence-branded edition of that workshop will take place December 1-3 in Düsseldorf and we plan a newtelligence event in South Africa in early March 2005.
When I come back from Iceland Wednesday night, I'll stay home for less than 12 hours and will then fly out to Denver for a long weekend and when I come back from there the following Wednesday I'll go straight at our own TornadoCamp event held in Bad Ems (half way between Frankfurt and Düsseldorf). Coming back it'll be another short turnaround of likely less than a day before I will leave for the Microsoft EastMed Developer Conference in Amman. (Very much looking forward to that)
So with that schedule and a few customer engagements in between, I have plenty of days on the road and only two days left at the newtelligence office, before I'll move my office desk to Denver on December 11 for the rest of the year and into the new year so that I can spend Christmas with Jen, get some better traction with Visual Studio 2005 and do some writing. And for when I come back on January 10, the schedule looks just as busy for the following weeks and months.
I've got nothing against advertising on websites. However, there are two things that are completely annoying. The first are popup windows, and my popup blocker is taking care of those. As an alternative, the advertising people have invented the Macromedia Flash-popup that pops up on the page and obscures the content for a little while. That's annoying but something I can absolutely deal with. What I cannot deal with, and that's the second annoying thing, is that some advertising twits try to entertain me with music and/or other sorts of 30-second audio/video shows. You people might find that cool, but I don't. Sound effects and music are strictly an "opt-in" feature on my computer and at my work desk. I just uninstalled the Flash player. Silence has returned and websites became instantly more useful. Try it.
[Usually it's as easy as this: In Internet Explorer select "Tools/Internet Options" on the menu, then click the "Settings..." button in the "Temporary Internet Files" box on the "General" tab of the dialog that opens up. Click the "View Objects" button in the "Settings" dialog that opens up. There will be another window opening. Find "Shockwave Flash" and delete the object. Close IE. Done.]