January 15, 2013
@ 06:56 PM

The basic idea of the Enterprise Service Bus paints a wonderful picture of a harmonious coexistence, integration, and collaboration of software services. Services for a particular general cause are built or procured once and reused across the Enterprise by way of publishing them and their capabilities in a corporate services repository from which they can be discovered. The repository holds contracts and policies that allow dynamically generating functional adapters to integrate with services. Collaboration and communication are virtualized through an intermediary layer that knows how to translate messages from and to any other service hooked into the ESB, like the Babel fish in The Hitchhiker’s Guide to the Galaxy. The ESB is a bus, meaning it aspires to be a smart, virtualizing, mediating, orchestrating messaging substrate permeating the Enterprise, providing uniform and mediated access anytime and anywhere throughout today’s global Enterprise. That idea is so beautiful, it rivals My Little Pony. Sadly, it’s also about as realistic. We tried regardless.

As with many utopian ideas, before we can get to the pure ideal of an ESB, there’s a less ideal and usually fairly ugly phase involved in which non-conformant services are made conformant. Until they are turned into WS-* services, CICS transactions and SAP BAPIs are fronted with translators, and as that skinning renovation takes place, there’s also some optimization around message flow, meaning messages get batched or de-batched, enriched or reduced. In that phase, the industry also learned the value and lure of central control. SOA Governance is an interesting idea to get customers drunk on. That ultimately led to cheating on the ‘B’. When you look around at products proudly carrying the moniker ‘Enterprise Service Bus’, you will see hubs. In practice, the B in ESB is mostly just a lie. Some vendors sell ESB servers, some even sell ESB appliances. If you need to walk to a central place to talk to anyone, it’s a hub. Not a bus.

Yet, the bus does exist. The IP network is the bus. It turns out to suit us well on the Internet. Mind that I’m explicitly talking about “IP network” and not “Web” as I do believe that there are very many useful protocols beyond HTTP. The Web is obviously the banner example for a successful implementation of services on the IP network that does just fine without any form of centralized services other than the highly redundant domain name system.

Centralized control over services does not scale in any dimension. Intentionally creating a bottleneck through a centrally controlling committee of ESB machines, however far scaled out, is not a winning proposition in a time where every potential or actual customer carries a powerful computer in their pocket that allows them to initiate ad-hoc transactions at any time and from anywhere, and where we see vehicles, machines, and devices increasingly spew out telemetry and accept remote control commands. Central control and policy-driven governance over all services in an Enterprise also kills all agility and reduces the ability to adapt services to changing needs, because governance invariably implies process and certification. Five-year plan, anyone?

If the ESB architecture ideal weren’t a failure already, the competitive pressure to adopt direct digital interaction with customers via Web and apps, and therefore to scale not to the size of the enterprise but to the size of the enterprise’s customer base, will seal its collapse.

Service Orientation

While the ESB as a concept permeating the entire Enterprise is dead, the related notion of Service Orientation is thriving even though the four tenets of SOA are rarely mentioned anymore. HTTP-based services on the Web embrace explicit message passing. They mostly do so over the baseline application contract and negotiated payloads that the HTTP specification provides for. In the case of SOAP or XML-RPC, they are using abstractions on top that have their own application protocol semantics. Services are clearly understood as units of management, deployment, and versioning and that understanding is codified in most platform-as-a-service offerings.

That said, while explicit boundaries, autonomy, and contract sharing have been clearly established, the notion of policy-driven compatibility – arguably a political addition to the list to motivate WS-Policy at the time – has generally been replaced by something even more powerful: Code. JavaScript code to be more precise. Instead of trying to tell a generic client how to adapt to service settings by way of giving it a complex document explaining what switches to turn, clients now get code that turns the switches outright. The successful alternative is to simply provide no choice. There’s one way to gain access authorization for a service, period. The “policy” is in the docs.

The REST architecture model is service oriented – and I don’t mean to imply that it is so because of any particular influence. The foundational principles were becoming common sense around the time these terms were coined, as the notion of broadly interoperable programmable services started to gain traction in the late 1990s. The subsequent grand dissent that arose was around whether pure HTTP was sufficient to build these services, or whether the ambitious multi-protocol abstraction of WS-* would be needed. I think it’s fairly easy to declare the winner there.

Federated Autonomous Services

Windows Azure, to name a system that would surely be one to fit the kind of solution complexity that ESBs were aimed at, is a very large distributed system with a significant number of independent multi-tenant services and deployments that are spread across many data centers. In addition to the publicly exposed capabilities, there are quite a number of “invisible” services for provisioning, usage tracking and analysis, billing, diagnostics, deployment, and other purposes. Some components of these internal services integrate with external providers. Windows Azure doesn’t use an ESB. Windows Azure is a federation of autonomous services.

The basic shape of each of these services is effectively identical, and that’s not owing, at least not to my knowledge, to any central architectural directive, even though the services that shipped after the initial wave certainly took a good look at the patterns that emerged. Practically all services have a gateway, whose purpose is to handle, dispatch, and sometimes preprocess incoming network requests or sessions, and a backend that ultimately fulfills the requests. The services interact through public IP space, meaning that if Service Bus wants to talk to its SQL Database backend, it is using a public IP address and not some private IP. The Internet is the bus. The backend and its structure are entirely a private implementation matter. It could be a single role or many roles.

Any gateway’s job is to provide network request management, which includes establishing and maintaining sessions, session security and authorization, API versioning where multiple variants of the same API are often provided in parallel, usage tracking, defense mechanisms, and diagnostics for its areas of responsibility. This functionality is specific and inherent to the service. And it’s not all HTTP. SQL Database has a gateway that speaks the Tabular Data Stream protocol (TDS) over TCP, for instance, and Service Bus has a gateway that speaks AMQP and the binary proprietary Relay and Messaging protocols.

Governance and diagnostics don’t work by putting a man in the middle and watching the traffic coming by, which is akin to trying to tell whether a business is healthy by counting the trucks going to its warehouse. Instead, we integrate the data feeds that come out of the respective services and are generated with full knowledge of internal state, and concentrate these data streams, like the billing stream, in yet other services that are also autonomous and have their own gateways. All these services interact and integrate even though they’re built by a composite team far exceeding the scale of most Enterprises’ largest projects, and while teams run on separate schedules where deployments into the overall system happen multiple times daily. It works because each service owns its gateway, is explicit about its versioning strategy, and has a very clear mandate to honor published contracts, which includes explicit regression testing. It would be unfathomable to maintain a system of this scale through a centrally governed switchboard service like an ESB.

Well, where does that leave “ESB technologies” like BizTalk Server? The answer is simply that they’re being used for what they’re commonly used for in practice: as gateway technologies. If a service in such a federation has to adhere to a particular industry standard for commerce, for instance if it has to understand EDIFACT or X.12 messages sent to it, the gateway would employ an appropriate and proven implementation and thus likely rely on BizTalk if implemented on the Microsoft stack. If a service has to speak to an external service for which it has to build EDI exchanges, it would likely be very cost effective to also use BizTalk as the appropriate tool for that outbound integration. Likewise, if data has to be extracted from backend-internal message traffic for tracking purposes and BizTalk’s BAM capabilities are a fit, it might be a reasonable component to use for that. If there’s a long-running process around exchanging electronic documents, BizTalk Orchestration might be appropriate; if there’s a document exchange involving humans, then SharePoint and/or Workflow would be good candidates from the toolset.

For most services, the key gateway technology of choice is HTTP, using frameworks like ASP.NET and Web API, probably paired with IIS features like application request routing, and the gateway is largely stateless.

In this context, Windows Azure Service Bus is, in fact, a technology choice to implement application gateways. A Service Bus namespace thus forms a message bus for “a service” and not for “all services”. It’s as scoped to a service or a set of related services as an IIS site usually is. The Relay is a way to place a gateway into the cloud for services whose backend resides outside of the cloud environment, and it also allows multiple systems, e.g. branch systems, to be federated into a single gateway addressable from other systems, forming a gateway of gateways. The messaging capabilities with Queues and Pub/Sub Topics provide a way for inbound traffic to be authorized and queued up on behalf of the service, with Service Bus acting as the mediator and first line of defense, so that a service will never get a message from the outside world unless it explicitly fetches it from Service Bus. The service can’t be overstressed, and it can’t be accessed except by sending it a message.

The next logical step on that journey is to provide federation capabilities with reliable handoff of messages between services, meaning that you can safely enqueue a message within a service and then have Service Bus replicate that message (or a copy, in the case of pub/sub) over to another service’s gateway – across namespaces and across datacenters or your own sites, and using the open AMQP protocol. You can do that today with a few lines of code, but this will become inherent to the system later this year.

Categories: Architecture | SOA | Azure

September 1, 2012
@ 04:49 AM

Today has been a lively day in some parts of the Twitterverse debating the Saga pattern. As it stands, there are a few frameworks for .NET out there that use the term "Saga" for some framework implementation of a state machine or workflow. Trouble is, that's not what a Saga is. A Saga is a failure management pattern.

Sagas come out of the realization that particularly long-lived transactions (originally even just inside databases), but also far-distributed transactions across location and/or trust boundaries, can't easily be handled using the classic ACID model with 2-phase commit and holding locks for the duration of the work. Instead, a Saga splits work into individual transactions whose effects can be, somehow, reversed after the work has been performed and committed.


The picture shows a simple Saga. If you book a travel itinerary, you want a car and a hotel and a flight. If you can't get all of them, it's probably not worth going. It's also very certain that you can't enlist all of these providers into a distributed ACID transaction. Instead, you'll have an activity for booking rental cars that knows both how to perform a reservation and also how to cancel it - and one for a hotel and one for flights.

The activities are grouped in a composite job (routing slip) that's handed along the activity chain. If you want, you can sign/encrypt the routing slip items so that they can only be understood and manipulated by the intended receiver. When an activity completes, it adds a record of the completion to the routing slip along with information on where its compensating operation can be reached (e.g. via a Queue). When an activity fails, it cleans up locally and then sends the routing slip backwards to the last completed activity's compensation address to unwind the transaction outcome.

If you're a bit familiar with travel, you'll also notice that I've organized the steps by risk. Reserving a rental car almost always succeeds if you book in advance, because the rental car company can move more cars on-site if there is high demand. Reserving a hotel is slightly more risky, but you can commonly back out of a reservation without penalty until 24h before the stay. Airfare often comes with a refund restriction, so you'll want to do that last.

I created a Gist on GitHub that you can run as a console application. It illustrates this model in code. Mind that it is a mockup and not a framework. I wrote this in less than 90 minutes, so don't expect to reuse this.

The main program sets up an example routing slip (all the classes are in the one file) and creates three completely independent "processes" (activity hosts) that are each responsible for handling a particular kind of work. The "processes" are linked by a "network" and each kind of activity has an address for forward-progress work and one for compensation work. The network resolution is simulated by 'Send'.

static ActivityHost[] processes;

static void Main(string[] args)
{
    var routingSlip = new RoutingSlip(new WorkItem[]
        {
            new WorkItem<ReserveCarActivity>(new WorkItemArguments{{"vehicleType", "Compact"}}),
            new WorkItem<ReserveHotelActivity>(new WorkItemArguments{{"roomType", "Suite"}}),
            new WorkItem<ReserveFlightActivity>(new WorkItemArguments{{"destination", "DUS"}})
        });

    // imagine these being completely separate processes with queues between them
    processes = new ActivityHost[]
        {
            new ActivityHost<ReserveCarActivity>(Send),
            new ActivityHost<ReserveHotelActivity>(Send),
            new ActivityHost<ReserveFlightActivity>(Send)
        };

    // hand off to the first address
    Send(routingSlip.ProgressUri, routingSlip);
}

static void Send(Uri uri, RoutingSlip routingSlip)
{
    // this is effectively the network dispatch
    foreach (var process in processes)
    {
        if (process.AcceptMessage(uri, routingSlip))
        {
            break;
        }
    }
}

The activities each implement a reservation step and an undo step. Here's the one for cars:

class ReserveCarActivity : Activity
{
    static Random rnd = new Random(2);

    public override WorkLog DoWork(WorkItem workItem)
    {
        Console.WriteLine("Reserving car");
        var car = workItem.Arguments["vehicleType"];
        var reservationId = rnd.Next(100000);
        Console.WriteLine("Reserved car {0}", reservationId);
        return new WorkLog(this, new WorkResult { { "reservationId", reservationId } });
    }

    public override bool Compensate(WorkLog item, RoutingSlip routingSlip)
    {
        var reservationId = item.Result["reservationId"];
        Console.WriteLine("Cancelled car {0}", reservationId);
        return true;
    }

    public override Uri WorkItemQueueAddress
    {
        get { return new Uri("sb://./carReservations"); }
    }

    public override Uri CompensationQueueAddress
    {
        get { return new Uri("sb://./carCancellations"); }
    }
}

The chaining happens solely through the routing slip. The routing slip is "serializable" (it's not, pretend that it is) and it's the only piece of information that flows between the collaborating activities. There is no central coordination. All work is local on the nodes and once a node is done, it either hands the routing slip forward (on success) or backward (on failure). For forward progress data, the routing slip has a queue and for backwards items it maintains a stack. The routing slip also handles resolving and invoking whatever the "next" thing to call is on the way forward and backward.

class RoutingSlip
{
    readonly Stack<WorkLog> completedWorkLogs = new Stack<WorkLog>();
    readonly Queue<WorkItem> nextWorkItem = new Queue<WorkItem>();

    public RoutingSlip()
    {
    }

    public RoutingSlip(IEnumerable<WorkItem> workItems)
    {
        foreach (var workItem in workItems)
        {
            this.nextWorkItem.Enqueue(workItem);
        }
    }

    public bool IsCompleted
    {
        get { return this.nextWorkItem.Count == 0; }
    }

    public bool IsInProgress
    {
        get { return this.completedWorkLogs.Count > 0; }
    }

    public bool ProcessNext()
    {
        if (this.IsCompleted)
        {
            throw new InvalidOperationException();
        }

        var currentItem = this.nextWorkItem.Dequeue();
        var activity = (Activity)Activator.CreateInstance(currentItem.ActivityType);
        try
        {
            var result = activity.DoWork(currentItem);
            if (result != null)
            {
                this.completedWorkLogs.Push(result);
                return true;
            }
        }
        catch (Exception e)
        {
            Console.WriteLine("Exception {0}", e.Message);
        }
        return false;
    }

    public Uri ProgressUri
    {
        get
        {
            if (IsCompleted)
            {
                return null;
            }
            else
            {
                return
                    ((Activity)Activator.CreateInstance(this.nextWorkItem.Peek().ActivityType)).
                        WorkItemQueueAddress;
            }
        }
    }

    public Uri CompensationUri
    {
        get
        {
            if (!IsInProgress)
            {
                return null;
            }
            else
            {
                return
                    ((Activity)Activator.CreateInstance(this.completedWorkLogs.Peek().ActivityType)).
                        CompensationQueueAddress;
            }
        }
    }

    public bool UndoLast()
    {
        if (!this.IsInProgress)
        {
            throw new InvalidOperationException();
        }

        var currentItem = this.completedWorkLogs.Pop();
        var activity = (Activity)Activator.CreateInstance(currentItem.ActivityType);
        try
        {
            return activity.Compensate(currentItem, this);
        }
        catch (Exception e)
        {
            Console.WriteLine("Exception {0}", e.Message);
            throw;
        }
    }
}

The local work and the decision making are encapsulated in the ActivityHost, which calls ProcessNext() on the routing slip to resolve the next activity and call its DoWork() function on the way forward, or resolves the last executed activity on the way back and invokes its Compensate() function. Again, there's nothing centralized here; all the work hinges on the routing slip, and the three activities and their execution are completely disjoint.

abstract class ActivityHost
{
    Action<Uri, RoutingSlip> send;

    public ActivityHost(Action<Uri, RoutingSlip> send)
    {
        this.send = send;
    }

    public void ProcessForwardMessage(RoutingSlip routingSlip)
    {
        if (!routingSlip.IsCompleted)
        {
            // if the current step is successful, proceed
            // otherwise go to the Unwind path
            if (routingSlip.ProcessNext())
            {
                // recursion stands for passing context via message
                // the routing slip can be fully serialized and passed
                // between systems.
                this.send(routingSlip.ProgressUri, routingSlip);
            }
            else
            {
                // pass message to unwind message route
                this.send(routingSlip.CompensationUri, routingSlip);
            }
        }
    }

    public void ProcessBackwardMessage(RoutingSlip routingSlip)
    {
        if (routingSlip.IsInProgress)
        {
            // UndoLast can put new work on the routing slip
            // and return false to go back on the forward
            // path
            if (routingSlip.UndoLast())
            {
                // recursion stands for passing context via message
                // the routing slip can be fully serialized and passed
                // between systems
                this.send(routingSlip.CompensationUri, routingSlip);
            }
            else
            {
                this.send(routingSlip.ProgressUri, routingSlip);
            }
        }
    }

    public abstract bool AcceptMessage(Uri uri, RoutingSlip routingSlip);
}


That's a Saga.

Categories: Architecture | SOA

March 5, 2012
@ 11:00 PM

Greg says what it’s not, and since he didn’t use the opportunity to also succinctly express what it is, I helped him out in the comments:

CQRS ("Command-Query Responsibility Segregation") is a simple pattern that strictly segregates the responsibility of handling command input into an autonomous system from the responsibility of handling side-effect-free query/read access on the same system. Consequently, the decoupling allows for any number of homogeneous or heterogeneous query/read modules to be paired with a command processor and this principle presents a very suitable foundation for event sourcing, eventual-consistency state replication/fan-out and, thus, high-scale read access. In simple terms: You don’t service queries via the same module of a service that you process commands through. For REST heads: GET wires to a different thing from what PUT/POST/DELETE wire up to.

Martin Fowler has a nice discussion here, with pictures. Udi Dahan has another nice description, also with pictures. To say it in yet another way, the key point of the pattern is that the read and write paths in a system are entirely separate.
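In code, the segregation can be as small as keeping the command-handling module and the query-handling module apart and letting them share nothing but the underlying store. Here's a minimal sketch; the type names and the list standing in for an event store are made up for illustration and don't come from any particular framework.

```csharp
using System;
using System.Collections.Generic;

// Commands flow through one module and return no data; queries flow
// through a separate, side-effect-free module. The shared List<string>
// stands in for an event/state store that a read model would be fed from.
class OrderCommandHandler
{
    readonly List<string> eventLog;

    public OrderCommandHandler(List<string> eventLog) { this.eventLog = eventLog; }

    public void PlaceOrder(string orderId, decimal amount)
    {
        // validate, then record the state change; nothing is returned
        if (amount <= 0) throw new ArgumentException("amount");
        eventLog.Add(string.Format("OrderPlaced:{0}:{1}", orderId, amount));
    }
}

class OrderReadModel
{
    readonly List<string> eventLog;

    public OrderReadModel(List<string> eventLog) { this.eventLog = eventLog; }

    // side-effect-free query over a (possibly eventually consistent) view
    public int PlacedOrderCount()
    {
        int count = 0;
        foreach (var e in eventLog)
        {
            if (e.StartsWith("OrderPlaced:")) count++;
        }
        return count;
    }
}
```

Because the read model only ever consumes the log, you can pair any number of differently shaped read models with the same command processor, which is exactly the fan-out property mentioned above.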

Categories: Architecture | SOA

December 1, 2011
@ 05:38 PM

I answered 4 questions in Richard Seroter’s series of interviews with folks working on connected systems. See the Q&A here.

Categories: Architecture | SOA

Elastic and dynamic multitenant cloud environments have characteristics that make traditional failure management mechanisms using coordinated 2-phase transactions a suboptimal choice. The common 2-phase commit protocols depend on a number of parties enlisted into a transaction making hard promises on the expected outcome of their slice of the transaction. Those promises are difficult to keep in an environment where systems may go down at any time with their local state vanishing, where not all parties trust each other, where significant latency may be involved, and where network connectivity cannot be assumed to be reliable. 2-phase commit is also not a good choice for operations that take significant amounts of time and span a significant number of resources, because such a coordinated transaction may adversely affect the availability of said resources, especially in high-density multitenant solutions where virtualized, conceptually isolated resources are collocated on the same base resources. In such a case, database locks and locks on other resources taken to satisfy coordinated transaction promises may easily break the isolation model of a multitenant system and have one tenant affect another.

Therefore, failure management – and this is ultimately what transactions are about – requires a somewhat different approach in cloud environments and other scalable distributed systems with similar characteristics.

To find a suitable set of alternative approaches, let’s quickly dissect what goes on in a distributed transaction:

To start, two or more parties ‘enlist’ into a shared transaction scope to perform some coordinated work that’s commonly motivated by a shared notion of a ‘job’ that needs to be executed. The goal of having a shared transaction scope is that the overall system will remain correct and consistent in both the success and the failure cases. Consistency in the success case is trivial. All participating parties could complete their slice of the job that had to be done. Consistency in the failure case is more interesting. If any party fails in doing their part of the job, the system will end up in a state that is not consistent. If you were trying to book a travel package and ticketing with the airline failed, you may end up with a hotel and a car, but no flight. In order to prevent that, a ‘classic’ distributed transaction asks the participants to make promises on the outcome of the transaction as the transaction is going on.

As all participating parties have tentatively completed but not finalized their work, the distributed transaction goes into a voting phase where every participant is asked whether it could tentatively complete its portion of the job and whether it can furthermore guarantee with a very high degree of certainty that it can finalize the job outcome and make it effective when asked to do so. Imagine a store clerk who puts an item on the counter that you’d like to purchase – you’ll show him your $10 and ask for a promise that he will hand you the item if you give him the money – and vice versa.

Finally, once all parties have made their promises and agreed that the job can be finalized, they are told to do so.
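The enlist/vote/finalize sequence just described can be sketched in a few lines. The interface and class names here are invented for illustration, and a real 2PC implementation adds durable logging and recovery that this sketch deliberately omits; the point is only the shape of the protocol.

```csharp
using System;
using System.Collections.Generic;

// Coordinator-side sketch of 2-phase commit: phase 1 collects promises,
// phase 2 finalizes. A single 'no' vote aborts everyone who prepared.
interface IParticipant
{
    bool Prepare();   // tentatively do the work, promise the outcome
    void Commit();    // finalize; must not fail after a 'yes' vote
    void Rollback();  // undo the tentative work
}

class Coordinator
{
    public static bool Run(IList<IParticipant> participants)
    {
        var prepared = new List<IParticipant>();

        // phase 1: voting
        foreach (var p in participants)
        {
            if (p.Prepare())
            {
                prepared.Add(p);
            }
            else
            {
                // abort: unwind everyone who already made a promise
                foreach (var q in prepared) q.Rollback();
                return false;
            }
        }

        // phase 2: all promises are in, tell everyone to finalize
        foreach (var p in participants) p.Commit();
        return true;
    }
}
```

Note how much weight rests on Commit() never failing once Prepare() returned true; that hard promise is exactly what is so difficult to keep in the environments described above.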

There are two big interesting things to observe about the 2-phase-commit (2PC) distributed transaction model that I just described: First, it’s incredibly simple from a developer’s perspective because the transaction outcome negotiation is externalized and happens as ‘magic’. Second, it doesn’t resemble anything that happens in real life, and that should be somewhat suspicious. You may have noticed that there was no neutral escrow agent present when you bought the item at the store for $10 two paragraphs earlier.

The grand canonical example for 2PC transactions is a bank account transfer. You debit one account and credit another. These two operations need to succeed or fail together because otherwise you are either creating or destroying money (which is illegal, by the way). So that’s the example that’s very commonly used to illustrate 2PC transactions. The catch is – that’s not how it really works, at all. Getting money from one bank account to another bank account is a fairly complicated affair that touches a ton of other accounts. More importantly, it’s not a synchronous fail-together/succeed-together scenario. Instead, principles of accounting apply (surprise!). When a transfer is initiated, let’s say in online banking, the transfer is recorded in the form of a message for submission into the accounting system, and the debit is recorded in the account as a ‘pending’ transaction that affects the displayed balance. From the user’s perspective, the transaction is ‘done’, but factually nothing has happened, yet. Eventually, the accounting system will get the message and start performing the transfer, which often causes a cascade of operations, many of them yielding further messages, including booking into clearing accounts and notifying the other bank of the transfer. The principle here is that all progress is forward. If an operation doesn’t work for some technical reason, it can be retried once the technical reason is resolved. If an operation fails for a business reason, the operation can be aborted – not by annihilating previous work, but by doing the inverse of previous work. If an account was credited, that credit is annulled with a debit of the same amount. For some types of failed transactions, the ‘inverse’ operation may not be fully symmetric but may result in extra actions like imposing penalty fees. In fact, in accounting, annihilating any work is illegal – ‘delete’ and ‘update’ are a great way to end up in prison.
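The append-only, forward-only bookkeeping principle can be illustrated with a toy ledger. This is a sketch, not real banking code; the class and the in-memory entry list are invented for illustration, and a failed transfer is undone by an inverse booking rather than by deleting anything.

```csharp
using System;
using System.Collections.Generic;

// Toy append-only ledger: entries are only ever added, never updated or
// deleted. 'Undoing' a booking means appending the inverse booking, so
// the full history of what happened remains in the ledger.
class Ledger
{
    readonly List<Tuple<string, decimal>> entries =
        new List<Tuple<string, decimal>>();

    public void Book(string account, decimal amount)
    {
        entries.Add(Tuple.Create(account, amount)); // append, always
    }

    public void Reverse(string account, decimal amount)
    {
        // the original entry stays untouched; the correction is a new
        // entry (real systems might also append penalty-fee entries here)
        entries.Add(Tuple.Create(account, -amount));
    }

    public decimal Balance(string account)
    {
        decimal sum = 0;
        foreach (var e in entries)
        {
            if (e.Item1 == account) sum += e.Item2;
        }
        return sum;
    }

    public int EntryCount { get { return entries.Count; } }
}
```

After booking a debit/credit pair and then reversing it, the balances return to zero, but the ledger holds four entries, not zero; nothing was annihilated.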

As all the operations occur that eventually lead to the completion or failure of the grand complex operation that is a bank transfer, the one thing we’ll be looking to avoid is being in any kind of ‘doubt’ about the state of the system. All participants must be able to have a great degree of confidence in their knowledge about the success or failure of their respective actions. No shots into the dark. There’s no maybe. Succeed or fail.

That said, “fail” is a funny thing in distributed systems because it happens quite a bit. In many cases, “fail” isn’t something that a bit of patience can’t fix. Which means that teaching the system some patience and tenacity is probably a good idea instead of giving up too easily. So if an operation fails because it runs into a database deadlock, or the database is offline, or the network is down, or the local machine’s network adapter just got electrocuted, that’s all not necessarily a reason to fail the operation. That’s a reason to write an alert into a log and call for help so someone can fix the environmental condition.

If we zoom into an ‘operation’ here, we might see a message that we retrieve from some sort of reliable queue or some other kind of message store, and subsequently an update of system state based on that message. Once the state has been successfully updated, which may mean that we’ve inserted a new database record, we can tell the message system that the message has been processed and that it can be discarded. That’s the happy case.

Let’s say we take the message and, just as the process is about to write to the database, the power shuts off. Click. Darkness. Not a problem. Assuming the messaging system supports a ‘peek/lock’ model that allows the process to first take the message and only remove it from the queue once processing has been completed, the message will reappear on the queue after the lock has expired and the operation can be retried, possibly on a different node. That model holds true for all failures of the operation through to and in the database. If the operation fails due to some transient condition (including the network card smoking out, see above), the message is either explicitly abandoned by the process or returns into the queue by way of a lock timeout. If the operation fails because something is really logically wrong, like trying to ship a product out of the inventory that’s factually out of stock, we’ll have to take some forward action to deal with that. We’ll get to that in a bit.
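The peek/lock cycle just described boils down to a small amount of code. The interfaces below are a made-up abstraction of what brokers with peek/lock semantics offer, not any real client API; the point is the order of operations: do the work first, settle the message last.

```csharp
using System;

// Made-up abstraction of a peek/lock queue; not a real client API.
interface IQueueMessage
{
    string Body { get; }
    void Complete();  // remove from the queue: processing succeeded
    void Abandon();   // release the lock: message becomes visible again
}

interface IQueueClient
{
    IQueueMessage Receive(); // locks the message instead of deleting it
}

class Worker
{
    public static void ProcessOne(IQueueClient queue, Action<string> updateState)
    {
        var message = queue.Receive();
        if (message == null) return; // nothing to do right now

        try
        {
            updateState(message.Body); // e.g. insert the database record
            message.Complete();        // only now does the message go away
        }
        catch (Exception)
        {
            // transient failure: put the message back for a retry,
            // possibly on a different node. A crash at any point here
            // has the same effect once the lock expires.
            message.Abandon();
            throw;
        }
    }
}
```

A power failure between Receive() and Complete() needs no special handling at all: the lock expires and the message simply reappears.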

Assuming the operation succeeded, the next tricky waypoint is failure after success, meaning that the database operation succeeded, but the message subsequently can’t be flagged as completed and thus can’t be removed from the queue. That situation would potentially lead to another delivery of the message even though the job has already been completed, and therefore would cause the job to be executed again – which is only a problem if the system isn’t expecting that, or, in fancier terms, if it’s not ‘idempotent’. If the job is updating a record to absolute values and the particular process/module/procedure is the only avenue to perform that update (meaning there are no competing writers elsewhere), doing that update again and again and again is just fine. That’s natural idempotency. If the job is inserting a record, the job should contain enough information, such as a causality or case or logical transaction identifier, that allows the process to figure out whether the desired record has already been inserted; if that’s the case, it should do nothing, consider its own action a duplicate, and just act as if it succeeded.
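A minimal sketch of the insert-side idempotency check, assuming the message carries a logical transaction identifier. The field names and the dictionary-as-database are made up for illustration:

```python
# Toy 'database' keyed by the logical transaction id carried in the message.
database = {}

def process(message):
    """Insert the record for this message; a redelivery becomes a no-op."""
    txid = message["transaction_id"]
    if txid in database:
        # Duplicate delivery: the record is already there, so consider our
        # own action done and act as if we succeeded.
        return "duplicate"
    database[txid] = {"item": message["item"], "qty": message["qty"]}
    return "inserted"

msg = {"transaction_id": "order-42", "item": "widget", "qty": 3}
first = process(msg)    # normal processing
second = process(msg)   # the broker redelivered the same message
```

Either way the message can be flagged as completed afterwards, because the outcome in the database is the same.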

Checkpoint: With what I said in the last two paragraphs, you can establish pretty good confidence about failure or success of individual operations that are driven by messages. You fail and retry, you fail and take forward action, or you succeed and take steps to avoid retrying even if the system presents the same job again. There’s very little room for doubt. So that’s good.

The ‘forward action’ that results from failure is often referred to as ‘compensation’, but that’s a bit simplistic. The forward action resulting from running into the warehouse with the belief that there’s still product present while the shelf is factually empty isn’t to back out and cancel the order (unless you’re doing a firesale of a touch tablet your management just killed). Instead, you notify the customer of the shipping delay, flag a correction of the inventory levels, and put the item on backorder. For the most part, pure ‘compensation’ doesn’t really exist. With every action, the system ends up in a consistent state. It’s just that some states are more convenient than others, and there are some states for which the system has a good answer and some for which it doesn’t. If the system ends up in a dead-end street and just wants to sit down and cry because nobody told it what to do now, it should phone home and ask for human intervention. That’s fine and likely a wise strategy in weird edge cases.

Initiating the ‘forward action’ – and, really, any action in a system that’s using messaging as its lifeline and as a backplane for failure resilience as I’m describing it here – is not entirely without failure risk in itself. It’s possible that you want to initiate an action and can’t reach the messaging system, or sending the message fails for some other reason. Here again, patience and tenacity are a good idea. If we can’t send, our overall operation is considered failed and we won’t flag the initiating message as completed. That will cause the job to show up again, but since we’ve got idempotency in the database, that operation will again succeed (even if by playing dead) or fail, and we will have the same outcome, allowing us to retry the send. If it looks like we can send but sending fails sometime during the operation, there might be doubt about whether we sent the message. Since doubt is a problem and we shouldn’t send the same message twice, duplicate detection in the messaging system can help suppress the duplicate so that it never shows up at the receiver. That allows the sender to confidently resend if it’s in doubt about success in a prior incarnation of processing the same message.
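The duplicate-detection idea can be sketched like this: the broker remembers recently seen message ids and silently drops repeats, so a sender in doubt can always resend. Again an in-memory stand-in with invented names, not a real messaging API:

```python
class DedupingQueue:
    """Broker stand-in that suppresses messages with an already-seen id."""
    def __init__(self):
        self.delivered = []     # what the receiver would actually see
        self.seen_ids = set()   # recently seen message ids

    def send(self, message_id, body):
        if message_id in self.seen_ids:
            return              # duplicate: suppressed, receiver never sees it
        self.seen_ids.add(message_id)
        self.delivered.append(body)

q = DedupingQueue()
q.send("msg-1", "notify customer of shipping delay")
# The sender wasn't sure the first send made it; resend with the same id.
q.send("msg-1", "notify customer of shipping delay")
```

The key design point is that the message id must be stable across retries – derived from the job, not generated fresh on each send attempt.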

Checkpoint: We can now also establish pretty good confidence about initiating forward action, or any other action in the system, provided the ‘current’ action follows the principles described above.

So far I’ve talked about individual actions and also about chains of actions, albeit just in the failure case. Obviously the same applies to success cases where you want to do something ‘next’ once you’re done with ‘this’.

Now let’s assume you want to do multiple things in parallel, like updating multiple stores as part of executing a single job – which gets us back to the distributed transaction scenario discussed earlier. What helps in these cases is if the messaging system supports ‘topics’ that allow dropping a message (the job) into the messaging system once and serving the message to each participant in the composite activity via their own subscription on the topic. Since the messaging system is internally transactional, it will guarantee that each message that is successfully submitted will indeed appear on each subscription, so it ensures the distribution. With that, the failure handling story for each slice of the composite job turns into the same model that I’ve been explaining above. Each participant can be patient and tenacious when it comes to transient error conditions. In hard failure cases, the forward action can be a notification to the initiator, which will then have to decide how to progress, including annulling or otherwise invoking forward actions on activities that have been executed in parallel. In the aforementioned case of a ticketing failure, that means that the ticketing module throws its hands up and the module responsible for booking the travel package either decides to bubble the case up to the customer or an operator, leaving the remaining reservations intact, or to cancel the reservations for the car and the hotel that were made in parallel. Should two out of three or more participants’ operations fail and each report up to the initiator, the initiator can either keep track of whether it already took corrective forward action on the third participant or, in doubt, rely on the idempotency rule to avoid doing the same thing twice.
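A topic with per-participant subscriptions can be sketched as follows. The subscription names follow the travel-package example; the mechanics are an in-memory stand-in for what a transactional broker guarantees:

```python
class Topic:
    """Stand-in for a pub/sub topic: one publish, a copy per subscription."""
    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name):
        self.subscriptions[name] = []
        return self.subscriptions[name]

    def publish(self, message):
        # The broker guarantees a successfully submitted message lands
        # on every subscription, so the fan-out itself can't half-fail.
        for queue in self.subscriptions.values():
            queue.append(message)

topic = Topic()
flights = topic.subscribe("flights")
hotels = topic.subscribe("hotels")
cars = topic.subscribe("cars")

topic.publish({"job": "book-travel-package", "customer": "C-17"})
```

Each participant then applies the same peek/lock, retry, and idempotency rules to its own copy of the job.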

The model described here is loosely based on the notion of ‘Sagas’, first described in a 1987 ACM paper by Hector Garcia-Molina and Kenneth Salem, so this isn’t grand news. However, the notion of such Sagas is only now really gaining momentum, with long-running and widely distributed transactions becoming more commonplace, so it’s well worth dragging the model further out into the limelight and giving it coverage. The original paper on Sagas still assumes that the individual steps can be encapsulated in a regular transaction, which may not even be the case in the cloud and with infrastructures that don’t have inherent transaction support. The role of the messaging system with the capabilities mentioned above is to help compensate for the absence of that support.

… to be continued …

Categories: Architecture | SOA

From //build in Anaheim

Categories: AppFabric | Architecture | SOA | ISB | Web Services

As our team was starting to transform our parts of the Azure Services Platform from a CTP ‘labs’ service exploring features into a full-on commercial service, it started to dawn on us that we had set ourselves up for writing a bunch of ‘enterprise apps’. The shiny parts of Service Bus and Access Control that we parade around are all about user-facing features, but if I look back at the work it took to go from a toy service to a commercial offering, I’d guess that 80%-90% of the effort went into aspects like infrastructure, deployment, upgradeability, billing, provisioning, throttling, quotas, security hardening, and service optimization. The lesson there was: when you’re boarding the train to shipping a V1, you don’t load new features onto that train – rather, you throw some off.

The most interesting challenge for these infrastructure apps sitting on the backend was that we didn’t have much solid ground to stand on. Remember – these were very early days, so we couldn’t use SQL Azure, since the folks over in SQL were on a pretty heroic schedule themselves and didn’t want to take on any external dependencies, even from close friends. We also couldn’t use any of the capabilities of our own bits, because building infrastructure for your features on your features would just be plain dumb. And while we could use capabilities of the Windows Azure platform we were building on, a lot of those parts still had rough edges, as those folks were going through a lot of the same things we went through. In those days, the table store would be very moody, the queue store would sometimes swallow or duplicate messages, and the Azure fabric controller would occasionally go around and kill things. All normal – bugs.

So under those circumstances we had to figure out the architecture for some subsystems where we needed to perform a set of coordinated actions across a distributed set of resources – a distributed transaction or saga of sorts. The architecture had a few simple goals: when we get an activation request, we must not fumble that request under any circumstance; we must run the job to completion for all resources; and, at the same time, we need to minimize any potential for required operator intervention, i.e. if something goes wrong, the system had better know how to deal with it – at best, it should self-heal.

My solution to that puzzle is a pattern I call “Scheduler-Agent-Supervisor Pattern” or, short, “Supervisor Pattern”. We keep finding applications for this pattern in different places, so I think it’s worth writing about it in generic terms – even without going into the details of our system.

The pattern rests on two seemingly odd and closely related assumptions: ‘the system is perfect’ and ‘all error conditions are transient’. As a consequence, the architecture has some character traits of a toddler. It’s generally happily optimistic and gets very grumpy, very quickly when things go wrong – to the point that it will simply drop everything and run away screaming. It’s very precisely like that, in fact.


The first picture here shows all key pieces except the Supervisor that I’ll introduce later. At the core we have a Scheduler that manages a simple state machine made up of Jobs and those jobs have Steps. The steps may have a notion of interdependency or may be completely parallelizable. There is a Job Store that holds jobs and steps and there are Agents that execute operations on some resource.  Each Agent is (usually) fronted by a queue and the Scheduler has a queue (or service endpoint) through which it receives reply messages from the Agents.

Steps are recorded in a durable storage table of some sort that has at least the following fields: Current State (say: Disabled, Active), Desired State (say: Disabled, Active), LockedUntil (Date/Time value), and Actor plus any step specific information you want to store and eventually submit with the job to the step agent.
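Here’s one way such a step record might look as a data structure – a sketch only, using the fields and state names from the text; a real system would put this in a durable storage table, not an in-memory object:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class StepRecord:
    job_id: str
    step_id: str
    actor: str                                     # which agent handles this step
    current_state: str = "Disabled"
    desired_state: str = "Disabled"
    locked_until: datetime = field(default_factory=datetime.now)
    payload: dict = field(default_factory=dict)    # step-specific information

    @property
    def consistent(self) -> bool:
        # A step is settled when current and desired state agree.
        return self.current_state == self.desired_state

# Create a step, then set the desired state and the lock ultimatum,
# as the Scheduler does when it puts a new job into the store.
step = StepRecord(job_id="job-1", step_id="step-1", actor="agent-1")
step.desired_state = "Active"
step.locked_until = datetime.now() + timedelta(minutes=2)
```

The `consistent` test is the one query that everything else hinges on: an inconsistent step with an expired lock is, by definition, work that stalled.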

When Things Go Right

The initial flow is as follows:

(1)a – Submit a new job into the Scheduler (and wait)
(2)a – The Scheduler creates a new job and steps with an initial current state (‘Disabled’) in the job store 
(2)b – The Scheduler sets ‘desired state’ of the job and of all schedulable steps (dependencies?) to the target state (‘Active’) and sets the ‘locked until’ timeout of the step to a value in the near future, e.g. ‘Now’ + 2 minutes.
(1)b – Job submission request unblocks and returns

If all went well, we now have a job record and, in this example, two step records in our store. They have a current state of ‘Disabled’ and a desired state of ‘Active’. If things didn’t go well, we’d have incomplete or partially wedged records, or nothing in the job store at all. The client would also know about it, since we’ve held on to the reply until everything was done – so the client is encouraged to retry. If we have nothing in the store and the client doesn’t retry – well, then the job probably wasn’t all that important, after all. But if we have at least a job record, we can make it all right later. We’re optimists, though; let’s assume it all went well.

For the next steps we assume that there’s a notion of dependencies between the steps and the second steps depends on the first. If that were not the case, the two actions would just be happening in parallel.

(3) – Place a step message into the queue for the actor for the first step; Agent 1 in this case. The message contains all the information about the step, including the current and desired state and also the LockedUntil that puts an ultimatum on the activity. The message may further contain an action indicator or arguments that are taken from the step record.
(4) – After the agent has done the work, it places a completion record into the reply queue of the Scheduler.
(5) – The Scheduler records the step as complete by setting the current state from ‘Disabled’ to ‘Active’; as a result the desired and the current state are now equal.
(6) – The Scheduler sets the next step’s desired state to the target state (‘Active’) and sets the LockedUntil timeout of the step to a value in the near future, e.g. ‘Now’ + 1 minute. The lock timeout value is an ultimatum for when the operation is expected to be complete and reported back as complete, even in the worst-case success case. The actual value therefore depends on the common latency of operations in the system. If operations usually complete in milliseconds and at worst within a second, the lock timeout can be short – but not too short. We’ll discuss this value in more detail a bit later.
(7), (8), (9) are equivalent to (3), (4), (5).

Once the last step’s current state is equal to its desired state, the job’s current state gets set to the desired state and we’re done. So that was the “99% of the time” happy path.
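The happy path above can be sketched end to end. The step and job records are simplified dictionaries, and the comments refer to the numbered flow steps; everything here is illustrative:

```python
# Two dependent steps: step-2 isn't released (desired stays 'Disabled')
# until step-1 completes.
steps = [
    {"id": "step-1", "current": "Disabled", "desired": "Active"},
    {"id": "step-2", "current": "Disabled", "desired": "Disabled"},
]
job = {"current": "Disabled", "desired": "Active"}

def on_step_complete(index):
    # (5)/(8): record the step as complete; current now equals desired.
    steps[index]["current"] = steps[index]["desired"]
    if index + 1 < len(steps):
        # (6): release the next step by setting its desired state.
        steps[index + 1]["desired"] = job["desired"]
    elif all(s["current"] == s["desired"] for s in steps):
        # Last step settled: the job itself reaches its desired state.
        job["current"] = job["desired"]

on_step_complete(0)   # Agent 1 reported back
on_step_complete(1)   # Agent 2 reported back
```

In the real pattern each of these transitions is a durable write to the job store, which is what lets the Supervisor reconstruct intent after any crash.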


When Things Go Wrong

So what happens when anything goes wrong? Remember the principle ‘all errors are transient’. What we do in the error case – anywhere – is to log the error condition and then promptly drop everything and simply hope that time, a change in system conditions, human or divine intervention, or – at worst – a patch will heal matters. That’s what the second principle ‘the system is perfect’ is about; the system obviously isn’t really perfect, but if we construct it in a way that we can either wait for it to return from a wedged state into a functional state or where we enable someone to go in and apply a fix for a blocking bug while preserving the system state, we can consider the system ‘perfect’ in the sense that pretty much any conceivable job that’s already in the system can be driven to completion.

In the second picture, we have Agent 2 blowing up as it is processing the step it got handed in (7). If the agent just can’t get its work done since some external dependency isn’t available – maybe a database can’t be reached or a server it’s talking to spews out ‘server too busy’ errors – it may be able to back off for a moment and retry. However, it must not retry past the LockedUntil ultimatum that’s in the step record. When things fail and the agent is still breathing, it may, as a matter of courtesy, notify the Scheduler of the fact and report that the step was completed with no result, i.e. the desired state and the achieved state don’t match. That notification may also include diagnostic information. Once the LockedUntil ultimatum has passed, the Agent no longer owns the job and must drop it. It must not even report failure state back to the Scheduler past that point.

If the agent keels over and dies as it is processing the step (or right before or right after), it is obviously no longer in a position to let the scheduler know about its fate. Thus, there won’t be any message flowing back to the scheduler and the job is stalled. But we expect that. In fact, we’re ok with any failure anywhere in the system. We could lose or fumble a queue message, we could get a duplicate message, we could have the scheduler die a fiery death (or just being recycled for patching at some unfortunate moment) – all of those conditions are fine since we’ve brought the doctor on board with us: the Supervisor. 


The Supervisor

The Supervisor is a schedule-driven process (or thread) of which one or a few instances may run occasionally. The frequency depends on the average duration of operations and on the expected overall latency for completion of jobs.

The Supervisor’s job is to recover steps or jobs that have failed – and we’re assuming that failures are due to some transient condition. Whether it’s a good strategy to expect a transient resource failure that prevented a job from completing just a second ago to be healed two seconds later depends on the kind of system and resource. What’s described here is a pattern, not a solution, so getting the timing right for when to retry failed operations depends on the concrete scenario.

This desired back-off time manifests in the LockedUntil value. When a step gets scheduled, the Scheduler needs to state how long it is willing to wait for that step to complete; this includes some back-off time padding. Once that ultimatum has passed and the step is still in an inconsistent state (desired state doesn’t equal the current state), the Supervisor can pick it up at any time and schedule it again.

(1) – Supervisor queries the job store for any inconsistent steps whose LockedUntil value has expired.
(2) – The Supervisor schedules the step again by setting the LockedUntil value to a new timeout and submitting the step into the target actor’s queue
(3) – Once the step succeeds, the step is reported as complete on the regular path back to the Scheduler  where it completes normally as in steps (8), (9) from the happy-path scenario above. If it fails, we simply drop it again. For failures that allow reporting an error back to the Scheduler it may make sense to introduce an error counter that round-trips with the step so that the system could detect poisonous steps that fail ‘forever’ and have the Supervisor ignore those after some threshold.
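The Supervisor’s sweep boils down to a query over the step store for expired, inconsistent steps. A sketch with an in-memory store and invented field names:

```python
import time

now = time.time()
job_store = [
    # consistent: nothing to do
    {"id": "step-1", "current": "Active",   "desired": "Active", "locked_until": now - 60},
    # inconsistent and lock expired: this one stalled
    {"id": "step-2", "current": "Disabled", "desired": "Active", "locked_until": now - 5},
    # inconsistent but still locked: its agent may still be working on it
    {"id": "step-3", "current": "Disabled", "desired": "Active", "locked_until": now + 90},
]
agent_queue = []    # stand-in for the target actor's queue

def supervise(store, lock_timeout=120):
    """(1) find expired inconsistent steps, (2) re-lock and resubmit them."""
    for step in store:
        inconsistent = step["current"] != step["desired"]
        if inconsistent and step["locked_until"] < time.time():
            step["locked_until"] = time.time() + lock_timeout  # new ultimatum
            agent_queue.append(step["id"])                     # reschedule

supervise(job_store)
```

Note that step-3 is deliberately left alone: its lock hasn’t expired, so touching it would race with an agent that may still legitimately own it.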

The Supervisor can pursue a range of strategies for recovery. It can just take a look at individual steps and recover them by rescheduling them – assuming the steps are implemented as idempotent operations. If it were a bit cleverer, it might consider error information that a cooperative (and breathing) agent has submitted back to the Scheduler, and might even go as far as firing an alert to an operator if the error condition requires intervention, then take the step out of the loop by marking it and setting the LockedUntil value to some longer timeout so that someone can take a look.

At the job-scope, the Supervisor may want to perform recovery such that it first schedules all previously executed steps to revert back to the initial state by performing compensation work (all resources that got set to active are getting disabled again here in our example) and then scheduling another attempt at getting to the desired state.

In step (2)b up above, we’ve been logging current and desired state at the job-scope and with that we can also always find inconsistent jobs where all steps are consistent and wouldn’t show up in the step-level recovery query. That situation can occur if the Scheduler were to crash between logging one step as complete and scheduling the next step. If we find inconsistent jobs with all-consistent steps, we just need to reschedule the next step in the dependency sequence whose desired state isn’t matching the desired state of the overall job.

To be thorough, we could now take a look at all the places where things can go wrong in the system. I expect that survey to yield that, as long as we can successfully get past step (2)b from the first diagram, the Supervisor is always in a position to either detect that a job isn’t making progress and help with recovery, or at least call for help. The system always knows what its current intent is, i.e. which state transitions it wants to drive, and never forgets about that intent, since that intent is logged in the job store at all times and all progress against that intent is logged as well. The submission request (1) depends on the outcome of (2)a/b to guard against failures while putting a job and its steps into the system, so that a client can take corrective action. In fact, once the job record is marked as inconsistent in step (2)b, the Scheduler could already report success back to the submitting party even before the first step is scheduled, because the Supervisor would pick up that inconsistency eventually.


Categories: Architecture | SOA | Azure | Technology

In case you need a refresher or update about the things my team and I work on at Microsoft, go here for a very recent and very good presentation by my PM colleague Maggie Myslinska from TechEd Australia 2010 about Windows Azure AppFabric, with Service Bus demos and a demo of the new Access Control V2 CTP.

Categories: AppFabric | SOA | Azure | Technology | ISB | WCF | Web Services

April 3, 2008
@ 06:10 AM

Earlier today I hopefully gave a somewhat reasonable, simple answer to the question "What is a Claim?" Let's try the same with "Token":

In the WS-* security world, "Token" is really just a another name the security geniuses decided to use for "Handy package for all sorts of security stuff". The most popular type of token is the SAML (just say "samel") token. If the ladies and gentlemen designing and writing security platform infrastructure and frameworks are doing a good job you might want to know about the existence of such a thing, but otherwise be blissfully ignorant of all the gory details.

Tokens are meant to be a thing that you need to know about in much the same way you need to know about ... ummm... rebate coupons you can cut out of your local newspaper or all those funny books that you get in the mail. I have really no idea how the accounting works behind the scenes between the manufacturers and the stores, but it really doesn't interest me much, either. What matters to me is that we get $4 off that jumbo pack of diapers and we go through a lot of those these days with a 9 month old baby here at home. We cut out the coupon, present it at the store, four bucks saved. Works for me.

A token is the same kind of deal. You go to some (security) service, get a token, and present that token to some other service. The other service takes a good look at the token and figures whether it 'trusts' the token issuer and might then do some further inspection; if all is well you get four bucks off. Or you get to do the thing you want to do at the service. The latter is more likely, but I liked the idea for a moment.

Remember when I mentioned the surprising fact that people lie from time to time when I wrote about claims? Well, that's where tokens come in. The security stuff in a token is there to keep people honest and to make 'assertions' about claims. The security dudes and dudettes will say "Err, that's not the whole story", but for me it's good enough. It's actually pretty common (that'll be their objection) that there are tokens that don't carry any claims and where the security service effectively says "whoever brings this token is a fine person; they are ok to get in". It's like having a really close buddy relationship with the boss of the nightclub when you are having troubles with the monsters guarding the door. I'm getting a bit ahead of myself here, though.

In the post about claims I claimed that "I am authorized to approve corporate acquisitions with a transaction volume of up to $5Bln". That's a pretty obvious lie. If there was such a thing as a one-click shopping button for companies on some Microsoft Intranet site (there isn't, don't get any ideas) and I were to push it, I surely should not be authorized to execute the transaction. The imaginary "just one click and you own Xigg" button would surely have some sort of authorization mechanism on it.

I don't know what Xigg is assumed to be worth these days, but there would actually be a second authorization gate to check. I might indeed be authorized to do one-click shopping for corporate acquisitions, but even with my made-up $5Bln limit claim, Xigg may just be worth more than I'm claiming I'm authorized to approve. I digress.

How would the one-click-merger-approval service be secured? It would expect some sort of token that absolutely, positively asserts that my claim "I am authorized to approve corporate acquisitions with a transaction volume of up to $5Bln" is truthful, and the one-click-merger-approval service would have to absolutely trust the security service that is making that assertion. The resulting token that I'm getting from the security service would contain the claim as an attribute of the assertion, and that assertion would be signed and encrypted in mysterious (to me) yet very secure and interoperable ways, so that I can't tamper with it no matter how long I look at the token while holding it in my hands.

The service receiving the token is the only one able to crack the token (I'll get to that point in a later post) and look at its internals and the asserted attributes. So what if I were indeed authorized to spend a bit of Microsoft's reserves and I were trying to acquire Xigg at the touch of a button and, for some reason I wouldn't understand, the valuation were outside my acquisition limit? That's the service's job. It'd look at my claim, understand that I can't spend more than $5Bln and say "nope!" - and it would likely send email to SteveB under the covers. Trouble.

Bottom line: For a client application, a token is a collection of opaque (and mysterious) security stuff. The token may contain an assertion (saying "yep, that's actually true") about a claim or a set of claims that I am making. I shouldn't have to care about the further details unless I'm writing a service and I'm interested in some deeper inspection of the claims that have been asserted. I will get to that.

Before that, I notice that I talked quite a bit about some sort of "security service" here. Next post...

Categories: Architecture | SOA | CardSpace | WCF | Web Services

April 2, 2008
@ 08:20 PM

If you ask any search engine "What is a Claim?" and you mean the sort of claim used in the WS-* security space, you'll likely find an answer somewhere, but that answer is just as likely buried in a sea of complex terminology that is only really comprehensible if you have already wrapped your head around the details of the WS-* security model. I would have thought that by now there would be a simple and not too technical explanation of the concept that's easy to find on the Web, but I haven't really had success finding one. 

So "What is a Claim?" It's really simple.

A claim is just a simple statement like "I am Clemens Vasters", or "I am over 21 years of age", or "I am a Microsoft employee", or "I work in the Connected Systems Division", or "I am authorized to approve corporate acquisitions with a transaction volume of up to $5Bln". A claim set is just a bundle of such claims.

When I walk up to a service with some client program and want to do something on the service that requires authorization, the client program sends a claim set along with the request. For the client to know what claims to send along, the service lets it know about its requirements in its policy.

When a request comes in, this imaginary (U.S.) service looks at the request knowing "I'm a service for an online game promoting alcoholic beverages!". Then it looks at the claim set, finds the "I am over 21 years of age" claim, and thinks "Alright, I think we got that covered".

The service didn't really care who was trying to get at the service. And it shouldn't. To cover the liquor company's legal behind, they only need to know that you are over 21. They don't really need to know (and you probably don't want them to know) who is talking to them. From the client's perspective that's a good thing, because the client is now in a position to refuse giving out (m)any clues about the user's identity and only provide the exact data needed to pass the authorization gate. Mind that the claim isn't the date of birth for that exact reason. The claim just says "over 21".
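The authorization check described here boils down to set containment over claims – the service checks for the claims its policy requires and ignores everything else. A toy sketch, with claim names invented for illustration:

```python
# The liquor site's policy: the only claim it needs is 'over-21'.
required_claims = {"over-21"}

def authorize(claim_set):
    """Grant access if every required claim is present; identity is irrelevant."""
    return required_claims <= claim_set     # subset test over the claim set

# A client can present a minimal claim set and still pass the gate...
granted = authorize({"over-21"})
# ...while a claim set without the required claim is turned away.
denied = authorize({"name:some kid"})
```

Notice that `authorize` never asks who the caller is, which mirrors the point above: the service only needs "over 21", not a name or a date of birth.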

Providing control over what claims are being sent to a service (I'm lumping websites, SOAP, and REST services all in the same bucket here) is one of the key reasons why Windows CardSpace exists, by the way. The service asks for a set of claims, you get to see what is being asked for, and it's ultimately your personal, interactive decision to provide or refuse to provide that information.

The only problem with relying on simple statements (claims) of that sort is that people lie. When you go to the Jack Daniel's website, you are asked to enter your date of birth before you can proceed. In reality, it's any date you like, and a 10-year-old kid is easily smart enough to figure that out.

All that complex security stuff is mostly there to keep people honest. Next time ...

Categories: Architecture | SOA | CardSpace | WCF | Web Services

Christian Weyer shows off the few lines of pretty straightforward WCF code & config he needed to figure out in order to set up a duplex conversation through BizTalk Services.

Categories: Architecture | SOA | BizTalk | WCF | Web Services | XML

Steve has a great analysis of what BizTalk Services means for Corzen and how he views it in the broader industry context.

Categories: Architecture | SOA | IT Strategy | Technology | BizTalk | WCF | Web Services

April 25, 2007
@ 03:28 AM

"ESB" (for "Enterprise Service Bus") is an acronym floating around in the SOA/BPM space for quite a while now. The notion is that you have a set of shared services in an enterprise that act as a shared foundation for discovering, connecting and federating services. That's a good thing and there's not much of a debate about the usefulness, except whether ESB is the actual term is being used to describe this service fabric or whether there's a concrete product with that name. Microsoft has, for instance, directory services, the UDDI registry, and our P2P resolution services that contribute to the discovery portion, we've got BizTalk Server as a scalable business process, integration and federation hub, we've got the Windows Communication Foundation for building service oriented applications and endpoints, we've got the Windows Workflow Foundation for building workflow-driven endpoint applications, and we have the Identity Platform with ILM/MIIS, ADFS, and CardSpace that provides the federated identity backplane.

Today, the division I work in (Connected Systems Division) has announced BizTalk Services, which John Shewchuk explains here and Dennis Pilarinos drills into here.

Two aspects that make the idea of a "service bus" generally very attractive are that the service bus enables identity federation and connectivity federation. This idea gets far more interesting and more broadly applicable when we remove the "Enterprise" constraint from ESB and put "Internet" into its place, thus elevating it to an "Internet Services Bus", or ISB. If we look at the most popular Internet-dependent applications outside of the browser these days, like the many Instant Messaging apps, BitTorrent, Limewire, VoIP, Orb/Slingbox, Skype, Halo, Project Gotham Racing, and others, many of them depend on one or two key services: Identity Federation (or, in absence of that, a central identity service) and some sort of message relay in order to connect up two or more application instances that each sit behind firewalls – and at the very least some stable, shared rendezvous point or directory to seed P2P connections. The question "how does Messenger work?" has, from a high-level architecture perspective, a simple answer: the Messenger "switchboard" acts as a message relay.

The problem gets really juicy when we look at the reality of what connecting such applications means when an ISV (or you!) comes up with the next cool thing on the Internet:

You'll soon find out that you will have to run a whole lot of server infrastructure and the routing of all of that traffic goes through your pipes. If your cool thing involves moving lots of large files around (let's say you'd want to build a photo sharing app like the very unfortunately deceased Microsoft Max) you'd suddenly find yourself running some significant sets of pipes (tubes?) into your basement even though your users are just passing data from one place to the next. That's a killer for lots of good ideas as this represents a significant entry barrier. Interesting stuff can get popular very, very fast these days and sometimes faster than you can say "Venture Capital".

Messenger runs such infrastructure. And the need for such infrastructure was indeed a (not entirely unexpected) important takeaway from the cited Max project. What looked to be just a very polished and cool client app to showcase all the Vista and NETFX 3.0 goodness was just the tip of a significant iceberg of (just as cool) server functionality that was running in a Microsoft data center to make the sharing experience as seamless and easy as it was. Once you want to do cool stuff that goes beyond the request/response browser thing, you easily end up running a data center. And people will quickly think that your application sucks if that data center doesn't "just work". And that translates into several "nines" in terms of availability in my book. And that'll cost you.

As cool as Flickr and YouTube are, I don't consider any of them or their brethren nearly as disruptive in terms of architectural paradigm shift and long-term technology impact as Napster, ICQ and Skype were as they appeared on the scene. YouTube is just a place with interesting content. ICQ changed the world of collaboration. Napster's and Skype's impact changed and is changing entire industries. The Internet is far more and has more potential than just having some shared, mashed-up places where lots of people go to consume, search and upload stuff. "Personal computing" where I'm in control of MY stuff and share between MY places from wherever I happen to be and NOT giving that data to someone else so that they can decorate my stuff with ads has a future. The pendulum will swing back. I want to be able to take a family picture with my digital camera and snap that into a digital picture frame at my dad's house at the push of a button without some "place" being in the middle of that. The picture frame just has to be able to stick its head out to a place where my camera can talk to it so that it can accept that picture and know that it's me who is sending it.

Another personal, and very concrete and real case in point: I am running, and I've written about that before, a custom-built (software/hardware) combo of two machines (one in Germany, one here in the US) that provide me and my family with full Windows Media Center embedded access to live and recorded TV along with electronic program guide data for 45+ German TV channels, Sports Pay-TV included. The work of getting the connectivity right (dynamic DNS, port mappings, firewall holes), dealing with the bandwidth constraints and shielding this against unwanted access was ridiculously complicated. This solution and IP telephony and video conferencing (over Messenger, Skype) are shrinking the distance to home to what's effectively just the inconvenience of the time difference of 9 hours and that we don't see family and friends in person all that often. Otherwise we're completely "plugged in" on what's going on at home and in Germany in general. That's an immediate and huge improvement of the quality of living for us, is enabled by the Internet, and has very little to do with "the Web", let alone "Web 2.0" - except that my Program Guide app for Media Center happens to be an AJAX app today. Using BizTalk Services would throw out a whole lot of complexity that I had to deal with myself, especially on the access control/identity and connectivity and discoverability fronts. Of course, as I've done it the hard way and it's working to a degree that my wife is very happy with it as it stands (which is the customer satisfaction metric that matters here), I'm not making changes for technology's sake until I'm attacking the next revision of this or I'll wait for one of the alternative and improving solutions (Orb is on a good path) to catch up with what I have.

But I digress. Just as much as the services that were just announced (and the ones that are lined up to follow) are a potential enabler for new Napster/ICQ/Skype type consumer space applications from innovative companies who don't have the capacity or expertise to run their own data center, they are also and just as importantly the "Small and Medium Enterprise Service Bus".

If you are an ISV catering shrink-wrapped business solutions to SMEs whose network infrastructure may be as simple as a DSL line (with a dynamic IP) that goes into a (wireless) hub and is as locked down as it possibly can be by the local networking company that services them, then we can do as much as we want as an industry in trying to make inter-company B2B work and expand it to SMEs; your customers just aren't playing in that game if they can't get over these basic connectivity hurdles.

Your app, which lives behind the firewall shield, NAT, and a dynamic IP, doesn't have a stable, public place where it can publish its endpoints, and you have no way to federate identity (and access control) unless you are doing some pretty invasive surgery on their network setup or you end up building and running a bunch of infrastructure on-site or for them. And that's the same problem that the aforementioned consumer apps have. Even more so: if you look at the list of "coming soon" services, you'll find that capabilities like relaying events or coordinating work with workflows are very suitable for many common use-cases in SME business applications once you imagine expanding their scope to inter-company collaboration.

So where's "Megacorp Enterprises" in that play? First of all, Megacorp isn't an island. Every Megacorp depends on lots of SME suppliers and retailers (or their equivalents in the respective lingo of the verticals). Plugging all of them directly into Megacorp's "ESB" often isn't feasible for lots of reasons, and increasingly less so if the SME has a second or third (imagine that!) customer and/or supplier.

Second, Megacorp isn't a uniform big entity. The count of "enterprise applications" running inside of Megacorp is measured in thousands rather than dozens. We're often inclined to think of SAP or Siebel when we think of enterprise applications, but the vast majority are much simpler and more scoped than that. It's not entirely ridiculous to think that some of those applications run (gasp!) under someone's desk or in a cabinet in an extra room of a department. And it's also not entirely ridiculous to think that these applications are so vertical and special that their integration into the "ESB" gets continuously overridden by someone else's higher priorities and yet, the respective business department needs a very practical way to connect with partners now and be "connectable" even though it sits deeply inside the network thicket of Megacorp. While it is likely on every CIO's goal sheet to contain that sort of IT anarchy, it's a reality that needs answers in order to keep the business bringing in the money.

Third, Megacorp needs to work with Gigacorp. To make it interesting, let's assume that Megacorp and Gigacorp don't like each other much and trust each other even less. They even compete. Yet, they've got to work on a standard and hence they need to collaborate. It turns out that this scenario is almost entirely the same as the "Panic! Our departments take IT in their own hands!" scenario described above. At most, Megacorp wants to give Gigacorp a rendezvous and identity federation point on neutral ground. So instead of letting Gigacorp on their ESB, they both hook their apps and their identity infrastructures into the ISB and let the ISB be the mediator in that play.

Bottom line: There are very many solution scenarios, of which I mentioned just a few, where "I" is a much more suitable scope than "E". Sometimes the appropriate scope is just "I", sometimes the appropriate scope is just "E". The key to achieving the agility that SOA strategies commonly promise is the ability to do the "E to I" scale-up whenever you need it in order to enable broader communication. If you need to elevate one service or a set of services from your ESB to Internet scope, you have the option to go and do so as appropriate and integrated with your identity infrastructure. And since this is all strictly WS-* standards-based, your "E" might actually be "whatever you happen to run today". BizTalk Services is the "I".

Or, in other words, this is a pretty big deal.

Categories: Architecture | SOA | IT Strategy | Microsoft | MSDN | BizTalk | WCF | Web Services

Recently, a gentleman from Switzerland wrote me an email after attending the “WinFX Tour” presentations in Zurich. He is a business consultant advising corporations on IT strategy and an IT industry veteran, with his first programming work dating back as far as 1962. He was quite interested in the Workflow part of my presentation, but wrote me that he thinks that those abstraction efforts go the wrong way. He sees the fundamental gap between business and IT widening and sees very little hope for the two sides to ever find a way to communicate effectively with each other. In his view, IT isn’t truly interested in the reality of business. He wrote me a very long email with several statements and questions, which I won’t quote – the (very long) reply below should give you enough context: 

Your main concern is, in my words, about the disconnect between the reality of the business vs. the snapshot of a perceived business reality that is translated into a software system. I say “perceived” because the capturing of the actual business reality is done by analysts who are on the fence between being business experts and IT experts and even though they would ideally be geniuses in both worlds to do that translation, they often are coming down on one side of that fence in terms of their core competencies.  

The only way to close that gap is to pull people off that fence onto the business side and enable them to capture the reality of the business and the way the business processes flow with tools that fit their needs and don’t demand that they are programmers or even have the sense of abstraction that a software designer or process analyst possesses. Our industry is only starting to understand what is required to achieve this and we are certainly thinking hard about these problems. 

You state that the Business/IT gap cannot be bridged. I do not fully agree with that assessment. I think what you are observing is a particular effect of software architecture and implementation as it exists today. You are truly an “industry veteran” and can certainly see much more clearly how software has evolved since you got into the trade in 1962 than I can as a relative youngling. However, my (humble) observation is that the fundamental concepts of business software design haven’t changed all that much since then. A business application is a scoped set of siloed functionality built for a set of predefined purposes, and whether the user interacts with the system through batch jobs, green screens, web sites or whether the system is made up of 5000 identical fat client applications with identical logic that talk to a central database is merely an implementation detail. The tradition of (interactive) business software is very much that we’ve got a system with some sort of menu screen or other form of selecting the task you want to perform with the system and any number of forms/screens/dialogs with which you can interact with the system. The reason for your observation of IT conveniently neglecting at least 20% of what they are told to do is not only caused by them not understanding the business, but also by them not being in control of the monsters they create, because the scope grows too big, everything is tightly coupled to everything else, and a lot of functionality is crammed together in “multi-purpose” user interfaces that often make changes or adjustments mutually exclusive and hence “impossible”. 

The actual business process is often external to the software, or if the software has workflow guidance, the workflows are not taking all the “offline” activities into account and the process becomes lossy. The way most applications are built today, they present the user with a grab-bag of tools and procedures and leave it up to them to navigate through it. Even worse, business processes are often changed to fit the constraints of software – and not the other way around. Testimony for this is that the organizational structure of many companies is hardly recognizable once the SAP/PeopleSoft/Oracle/etc. ERP consultants have left the building. 

So what’s changing with the SOA/BPM “hype” as you call it? Or I shall better say: What’s the opportunity?  

First, we’re working hard to get to a point where we can build interoperable, autonomous pieces of software with well-defined interfaces where we don’t have to spend 80% of our time and money trying to make those pieces communicate with each other. Before the industry consensus around XML, SOAP and the WS-* specifications, this most fundamental requirement for breaking up the solution silos simply didn’t exist (despite previous efforts like DCE or CORBA). I am not saying that we have completely arrived at that point, but we’ve got a better foundation than ever to build composable, loosely coupled, distributed systems that can interact irrespective of their implementation specifics. 

Second, we are starting to see growing insight on the IT side for the need of a common understanding of the idea of “services”. Any employee, department, division or vendor assumes various roles within an organization and renders a certain portfolio of services towards the organization. The notion of service-oriented architecture speaks about writing software that fits into the organization instead of fitting the organization to the software. Just like an employee assumes roles, software assumes roles as well. From an architecture perspective, people and software are peers collaborating on the same business task. The former are just a lot more flexible than the latter. 

Business processes (or bureaucracy), whether formalized or ad-hoc, are driven by the flow of information. If the information flow is not central to the execution of the process you have friction and efficiency suffers. With a paper-bound “offline” process where all information flows in a (paper-) file folder and on forms carried by courier from department to department there’s arguably a better and more complete information flow than in an environment where information is scattered around dozens of different computer systems where the flow of information and the flow of tasks are disconnected and only knitted together by people shuffling mice around on their desks. 

The opportunity is right at that point. With the technology we have available and a common notion of “services”, the line between the implementation of deterministic work (done by programs) and non-deterministic work (done by humans) begins to blur. It is fairly easy today to write chat-bots, mail-bots, or speech-enabled services that allow rich computer/human interaction within their respective scope. It is likewise fairly easy to expose application to application services that allow exchanging rich data across system and platform boundaries using Web Services. The concrete form of how a service interacts with a peer depends on who the peer is. If you have an address book lookup service, it may have all of these capabilities at once. With that, services can be integrated into real-life ad-hoc scenarios singly or in combination because they are built to satisfy specific roles and their capabilities are exposed for (re)use in arbitrary contexts.  

When you have a service that’s specialized in creating complex sales offers and you are a salesman on the road sitting in a taxi, it’s absolutely feasible to build a solution that allows you to call in, identify yourself towards a voice service and ask for a baseline offer for 500 units of A and 300 units of B for a specific customer and have the result, with all applicable rebates and possible delivery time frames, including shipping cost and considering the schedule of the container freighter ship from China, pushed onto your hotel fax or by email or SMS/MMS. However, the question of how realistic it is to build that service purely depends on how easy it is to make all the necessary backend systems talk to each other and wire up all the roles into a process that can jointly accomplish the task. And it also depends on how well any necessary human intervention into that process can be integrated into the respective flow. Assuming that the resulting offer would cross a certain threshold in terms of the total order amount, the offer might require approval by a manager – the manager renders a “decision service” towards the process and might do so by responding to an email that is sent to him/her by a program and will be evaluated by a program.  

Ideally, you could teach a service (through mail, speech or chat) the workflow ad-hoc just as you would tell an administrative assistant a sequence of activities. “Get A and B, do C, let Mike take a look at it and send it to me”. The information can be parsed, mapped to activities, the activities can be wired up to a one-off workflow and the workflow can be launched. The crux is that you need to have those individual capabilities and activities catalogued and available in a fashion that allows composition. And the above example of the approval manager goes to show that people’s roles and capabilities must be part of that same catalogue. 
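The kind of ad-hoc composition described above can be sketched in a few lines. What follows is a minimal, hypothetical sketch in Python: the activity catalogue and the names `get_item`, `review` and `run_workflow` are made up for illustration and do not correspond to any shipping API.

```python
from typing import Callable, List

# An "activity" is anything that takes the workflow context and returns it,
# whether it is backed by a program or by a person's decision.
Activity = Callable[[dict], dict]

def get_item(name: str) -> Activity:
    """Catalogued activity: fetch an item into the working context."""
    def step(ctx: dict) -> dict:
        ctx.setdefault("items", []).append(name)
        return ctx
    return step

def review(approver: str) -> Activity:
    """Catalogued activity: a stand-in for a human 'decision service'."""
    def step(ctx: dict) -> dict:
        ctx["approved_by"] = approver
        return ctx
    return step

def run_workflow(steps: List[Activity]) -> dict:
    """Wire catalogued activities into a one-off workflow and run it."""
    ctx: dict = {}
    for step in steps:
        ctx = step(ctx)
    return ctx

# "Get A and B, let Mike take a look at it":
result = run_workflow([get_item("A"), get_item("B"), review("Mike")])
```

The point of the sketch is the composition model, not the activities themselves: because each capability is catalogued with a uniform shape, a one-off sequence can be assembled from parsed instructions rather than compiled in advance.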

We (Microsoft) are already shipping and will ship even more building blocks not only for creating such services and workflows, but we also have an increasingly complete federated identity and access control infrastructure that makes it possible to realize all that decentralized interaction in a secure fashion. From a purely technical perspective, the above scenario is not utopia. We have every single component in place to let customers build this, voice recognition included. However, mind that I am not saying “very easy”. 

“BPM” tools such as the Windows Workflow Foundation or BizTalk Orchestration are all about putting the process into the center. Services are all about creating easy-to-integrate, loosely coupled, business-aligned pieces of functionality that can be used and reused in as many contexts as a role in a business can be used in different contexts. A workflow may just be one step that you just have in your head as you are interacting with the address lookup service or it may be a more formalized workflow with several steps and intertwined people-based and program-based activities. The realization here is that while individual roles in an organization are relatively sticky, the way that the organization acts across those roles and tunes the rules for the roles is very dynamic. BPM tools are precisely about shaping and reshaping the flow and rules quickly and deploying those instantly into the business environment. 

The design of the Windows Workflow Foundation also recognizes that business processes, especially those that run for days and weeks rather than seconds or minutes, are never set in stone. If you’ve got a (very) long running master workflow that, for instance, tracks an insurance policy and applies rebates and handles claims as time progresses, and suddenly the client comes around and sues the insurance company, that policy certainly no longer belongs into the same bucket as the other 100000s of policies that are being tracked. So for such unforeseeable circumstances, the foundation makes it possible to jump right in, assess the status and redirect or reshape the respective flow on a case-by-case basis, even if the new action that you add into the flow in order to terminate it is merely handing off the entire case with all of its status to your legal department’s “dropoff service”. 

You write about SO/A and BPM as an “effort to extend the borders of IT”. From my perspective it’s rather the attempt by IT to humbly fit itself into the ever changing dynamic nature of business. On an industry level we have started to realize that we need to get away from the thinking that applications are silos and that nobody should factually care about whether he/she interacts with a “CRM” or “ERP” system or needs to navigate across a dozen intranet websites to get a job done. The whole notion of “we build a loan application handling program” is indeed misguided. That’s the disconnect. 

But how alien is the notion of building a library of services that are not wired up into a thing that you can install and instantly run as a program? We hire people into organizations who have a broad education of which we only tap a fraction at any given point in time, but we can trust that we can instantly tap some other capability as the process changes. Which CEO/CFO/CIO will commit to a project that “educates” a software system to have a broad spectrum of capabilities that may or may not be used in the circumstances of “now” but which may become a pressing necessity as you need to quickly adapt to a change in the business? How strange of an idea is it that you might produce a software package that consists of hundreds of roles and thousands of activities but none of them are connected in any way, because that’s the job of the space that’s intentionally left blank and undefined to host the customized business process? How does the buyer justify the expense? What can the vendor charge? Would you continually service and update a “dormant” software-based capability to the latest policies, laws and regulations so that it can be used whenever the need arises? 

All that comes back to a completely different communication breakdown: How does IT explain that sort of perspective to the business stakeholders? The great challenge is not in the bits, it’s in the heads.

Categories: Architecture | SOA

Inside the big house....

Back in December of last year and about two weeks before I publicly announced that I will be working for Microsoft, I started a nine-part series on REST/POX* programming with Indigo WCF. (1, 2, 3, 4, 5, 6, 7, 8, 9). Since then, the WCF object model has seen quite a few feature and usability improvements across the board and those are significant enough to justify that I rewrite the entire series to get it up to the February CTP level, and I will keep updating it through Vista/WinFX Beta2 and as we are marching towards our RTM. We've got a few changes/extensions in our production pipeline to make the REST/POX story for WCF v1 stronger and I will track those changes with yet another re-release of this series.

Except on one or two occasions, I haven't re-posted a reworked story on my blog. This here is quite a bit different, because of its sheer size and the things I learned in the process of writing it and developing the code along the way. So even though it is relatively new, it's already due for an end-to-end overhaul to represent my current thinking. It's also different, because I am starting to cross-post content to http://blogs.msdn.com/clemensv with this post; however http://friends.newtelligence.net/clemensv remains my primary blog since that runs my engine ;-)


The "current thinking" is of course very much influenced by now working for the team that builds WCF instead of being a customer looking at things from the outside. That changes the perspective quite a bit. One great insight I gained is how non-dogmatic and customer-oriented our team is. When I started the concrete REST/POX work with WCF back in last September (on the customer side still working with newtelligence), the extensions to the HTTP transport that enabled this work were just showing up in the public builds and they were sometimes referred to as the "Tim/Aaaron feature". Tim Ewald and Aaron Skonnard had beat the drums for having simple XML (non-SOAP) support in WCF so loudly that the team investigated the options and figured that some minimal changes to the HTTP transport would enable most of these scenarios**. Based on that feature, I wrote the set of dispatcher extensions that I've been presenting in the V1 of this series and newtellivision as the applied example did not only turn out to be a big hit as a demo, it also was one of many motivations to give the REST/POX scenario even deeper consideration within the team.

REST/POX is a scenario we think about as a first-class scenario alongside SOAP-based messaging - we are working with the ASP.NET Atlas team to integrate WCF with their AJAX story and we continue to tweak the core WCF product to enable those scenarios in a more straightforward fashion. Proof for that is that my talk (PPT here) at the MIX06 conference in Las Vegas two weeks ago was entirely dedicated to the non-SOAP scenarios.

What does that say about SOAP? Nothing. There are two parallel worlds of application-level network communication that live in peaceful co-existence:

  • Simple point-to-point, request/response scenarios with limited security requirements and no need for "enterprise features" along the lines of reliable messaging and transaction integration.
  • Rich messaging scenarios with support for message routing, reliable delivery, discoverable metadata, out-of-band data, transactions, one-way and duplex communication, etc.

The Faceless Web

The first scenario is the web as we know it. Almost. HTTP is an incredibly rich application protocol once you dig into RFC2616 and look at the methods in detail and consider response codes beyond 200 and 404. HTTP is strong because it is well-defined, widely supported and designed to scale; HTTP is weak because it is effectively constrained to request/response, has no story for server-to-client notifications, and abstracts away the inherent reliability of the Transmission Control Protocol (TCP). These pros and cons lists are not exhaustive.
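The richness beyond 200 and 404 is easy to see from the status-code catalogue that most platforms carry in their standard library; in Python, for instance:

```python
from http import HTTPStatus

# A few of the less-traveled response codes defined beyond 200 and 404,
# each carrying real application-protocol semantics:
for status in (
    HTTPStatus.SEE_OTHER,            # 303: redirect a POST to a GET-able result
    HTTPStatus.CONFLICT,             # 409: concurrent-update conflict
    HTTPStatus.PRECONDITION_FAILED,  # 412: an If-Match/If-Unmodified-Since check failed
):
    print(status.value, status.phrase)
```

Codes like 409 and 412 are exactly the kind of protocol machinery that makes HTTP usable as a full application protocol rather than just a document fetcher.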

What REST/POX does is to elevate the web model above the "you give me text/html or */* and I give you application/x-www-form-urlencoded" interaction model. Whether the server punts up markup in the form of text/html or text/xml or some other angle-bracket dialect or some raw binary isn't too interesting. What's changing the way applications are built and what is really creating the foundation for, say, AJAX is that the path back to the server is increasingly XML'ised. PUT and POST with a content-type of text/xml is significantly different from application/x-www-form-urlencoded. What we are observing is the emancipation of HTTP from HTML to a degree that the "HT" in HTTP is becoming a misnomer. Something like IXTP ("Interlinked XML Transport Protocol" - I just made that up) would be a better fit by now.
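The contrast between the two request styles can be made concrete with a minimal sketch; the endpoint URL and the payload here are made up purely for illustration:

```python
import urllib.request

xml_body = b"<order><item sku='123' qty='2'/></order>"

# A POX-style request: the body is an XML document, not key/value pairs.
pox = urllib.request.Request(
    "http://example.org/orders",        # hypothetical endpoint
    data=xml_body,
    headers={"Content-Type": "text/xml"},
    method="POST",
)

# The classic browser-form request the web grew up on, for contrast:
form = urllib.request.Request(
    "http://example.org/orders",
    data=b"sku=123&qty=2",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)
```

Same verb, same URL; the Content-Type is what turns the path back to the server from form fields into XML, which is precisely the shift described above.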

The astonishing bit in this is that there has been no fundamental technology change that has been driving this. The only thing I can identify is that browsers other than IE are now supporting XMLHTTP and therefore created the critical mass for broad adoption. REST/POX rips the face off the web and enables a separation of data and presentation in a way that mashups become easily possible and we're driving towards a point where the browser cache becomes more of an application repository than merely a place that holds cacheable collateral. When developing the newtellivision application I spent quite a bit of time on tuning the caching behavior in a way that HTML and script are pulled from the server only when necessary and as static resources, and all actual interaction with the backend services happens through XMLHTTP and in REST/POX style. newtellivision is not really a hypertext website, it's more like a smart client application that is delivered through the web technology stack.

Distributed Enterprise Computing

All that said, the significant investments in SOAP and WS-* that were made by Microsoft and industry partners such as Sun, IBM, Tibco and BEA have their primary justification in the parallel universe of highly interoperable, feature-rich intra- and inter-application communication as well as in enterprise messaging. Even though there was a two-way split right through the industry in the 1990s with one side adopting the Distributed Computing Environment (DCE) and the other side driving the Common Object Request Broker Architecture (CORBA), both of these camps made great advances towards rich, interoperable (within their boundaries) enterprise communication infrastructures. All of that got effectively killed by the web gold-rush starting in 1994/1995 as the focus (and investment) in the industry turned to HTML/HTTP and to building infrastructures that supported the web in the first place and everything else as a secondary consideration. The direct consequence of the resulting (even if big) technology islands that sit underneath the web and the neglect of inter-application communication needs was that inter-application communication has slowly grown to become one of the greatest industry problems and cost factors. Contributing to that is that the average yearly number of corporate mergers and acquisitions has tripled compared to 10-15 years ago (even though the trend has slowed in recent years) and the information technology dependency of today's corporations has grown to become one of the deciding if not the deciding competitive factor for an ever increasing number of industries.

What we (the industry as a whole) are doing now and for the last few years is that we're working towards getting to a point where we're both writing the next chapter of the story of the web and we're fixing the distributed computing story at the same time by bringing them both onto a commonly agreed platform. The underpinning of that is XML; REST/POX is the simplest implementation. SOAP and the WS-* standards elevate that model up to the distributed enterprise computing realm.

If you compare the core properties of SOAP+WS-Addressing and the Internet Protocol (IP) in an interpretative fashion side-by-side and then also compare the Transmission Control Protocol (TCP) to WS-ReliableMessaging, it may become quite clear to you what a fundamental abstraction above the networking stacks and concrete technology coupling the WS-* specification family has become. Every specification in the long list of WS-* specs is about converging and unifying formerly proprietary approaches to messaging, security, transactions, metadata, management, business process management and other aspects of distributed computing into this common platform.


The beauty of that model is that it is an implementation superset of the web. SOAP is the out-of-band metadata container for these abstractions. The key feature of SOAP is SOAP:Header, which provides a standardized facility to relay the required metadata alongside payloads. If you are willing to constrain out-of-band metadata to one transport or application protocol, you don't need SOAP.

There is really very little difference between SOAP and REST/POX in terms of the information model. SOAP carries headers and HTTP carries headers. In HTTP they are bolted to the protocol layer and in SOAP they are tunneled through whatever carries the envelope. [In that sense, SOAP is calculated abuse of HTTP as a transport protocol for the purpose of abstraction.] You can map WS-Addressing headers from and to HTTP headers.
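To make the "you can map WS-Addressing headers from and to HTTP headers" point concrete, here is a small, hypothetical sketch; the choice of HTTP-level equivalents is illustrative only and not the normative WS-Addressing binding:

```python
def wsa_to_http(wsa_headers):
    """Map a few WS-Addressing headers onto their nearest HTTP-level
    equivalents. Illustrative only, not a normative binding."""
    # wsa:To plays the role of the HTTP request URI.
    request_uri = wsa_headers["wsa:To"]
    # wsa:Action is roughly what SOAPAction conveys at the HTTP layer.
    http_headers = {"SOAPAction": wsa_headers.get("wsa:Action", "")}
    # Correlation identifiers can travel as plain HTTP headers too.
    if "wsa:MessageID" in wsa_headers:
        http_headers["Message-ID"] = wsa_headers["wsa:MessageID"]
    return request_uri, http_headers

uri, headers = wsa_to_http({
    "wsa:To": "http://example.org/orders",
    "wsa:Action": "urn:submit-order",
    "wsa:MessageID": "urn:uuid:1234",
})
```

The mechanics are trivial; the point is that the same metadata lives at the protocol layer in HTTP and inside the envelope in SOAP, which is exactly the "bolted to the protocol vs. tunneled with the payload" distinction above.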

The SOAP/WS-* model is richer, more flexible and more complex. The SOAP/WS-* set of specifications is about infrastructure protocols. HTTP is an application protocol and therefore it is naturally more constrained - but it has inherently defined qualities and features that require an explicit protocol implementation in the SOAP/WS-* world; one example is the inherent CRUD (create, read, update, delete) support in HTTP that is matched by the explicitly composed-on-top WS-Transfer protocol in SOAP/WS-*.

The common platform is XML. You can scale down from SOAP/WS-* to REST/POX by putting the naked payload on the wire and relying on HTTP for your metadata, error and status information if that suits your needs. You can scale up from REST/POX to SOAP/WS-* by encapsulating payloads and leveraging the WS-* infrastructure for all the flexibility and features it brings to the table. [It is fairly straightforward to go from HTTP to SOAP/WS-*, and it is harder to go the other way. That's why I say "superset".]

Doing the right thing for a given scenario is precisely what we are enabling in WCF. There is a place for REST/POX for building the surface of the mashed-up and faceless web and there is a place for SOAP for building the backbone of it - and some may choose to mix and match these worlds. There are many scenarios and architectural models that suit them. What we want is

One Way To Program

* REST=REpresentational State Transfer; POX="Plain-Old XML" or "simple XML"

Categories: Architecture | SOA | MIX06 | Technology | Web Services

To (O/R) map or not to map.

The monthly discussion about the benefits and dangers of O/R mapping is making the rounds on one of the mailing lists that I am signed up to. One big problem in this space - from my experience of discussing this through with a lot of people over and over – is that O/R mapping is one of those things where the sheer wish for an elegant solution to the data/object schism obscures most of the rational argumentation. If an O/R mapper provides a nice programming or tooling experience, developers (and architects) are often willing to accept performance hits and a less-than-optimal tight coupling to the data model, because they are lured by the aesthetics of the abstraction.

Another argument I keep hearing is that O/R mapping yields a significant productivity boost. However, if that were the case and if O/R mapping cut the average development cost of a departmental development project by – say – a quarter or more, O/R mapping would likely have taken over the world by now. It hasn't. And it's not that the idea is new; it's been around for well more than a decade.

To me, O/R mapping is one of the unfortunate consequences of trying to apply OOP principles to anything and everything. For "distributed objects", we're fixing that with the service orientation idea and the consequential constraints when we talk about the network edge of applications. It turns out that many of the same principles apply to the database edge as well. The list below is just to give you the idea. I could write a whole article about this and I wish I had the time:

  • Boundaries are explicit => Database access is explicit
  • Services avoid coupling (autonomy) => Database schema and in-process data representation are disjoint and mapped explicitly
  • Share schema not code => Query/Sproc result sets and Sproc inputs form data access schema (aliased result sets provide a degree of separation from phys. schema)
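A hypothetical sketch of what the second and third points mean in code (the CustomerSummary type and the column names are made up for illustration): the in-process representation is populated explicitly from an aliased query/sproc result set, so it stays disjoint from the physical schema instead of being generated transparently from it.

```csharp
using System;
using System.Data;

// In-process representation: deliberately NOT a mirror of any table.
class CustomerSummary
{
    public string Name;
    public int OpenOrders;
}

class ExplicitMapping
{
    // Maps one row of the result-set schema into the in-process type.
    public static CustomerSummary FromRow(DataRow row)
    {
        CustomerSummary c = new CustomerSummary();
        c.Name = (string)row["CustomerName"]; // aliased column, not the phys. column name
        c.OpenOrders = (int)row["OpenOrders"];
        return c;
    }

    static void Main()
    {
        // This DataTable stands in for a sproc/query result set.
        DataTable t = new DataTable();
        t.Columns.Add("CustomerName", typeof(string));
        t.Columns.Add("OpenOrders", typeof(int));
        t.Rows.Add(new object[] { "Contoso", 3 });

        CustomerSummary c = FromRow(t.Rows[0]);
        Console.WriteLine(c.Name + " has " + c.OpenOrders + " open orders");
    }
}
```

The mapping is boring and explicit by design; the result-set schema, not the table layout, is the contract.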

In short, I think the dream of transparent O/R mapping is the same dream that fueled the development of fully transparent distributed objects in the early days of DSOM, CORBA and (D)COM when we all thought that'd just work and were neglecting the related issues of coupling, security, bandwidth, etc.

Meanwhile, we’ve learned the hard way that even though the idea was fantastic, it was rather naïve to apply local development principles to distributed systems. The same goes for database programming. Data is the most important thing in the vast majority of applications. Every class of data items (table) comes with its own special considerations: read-only, read/write, insert-only; update frequency, currency and replicability; access authorization; business relevance; caching strategies; etc., etc.

Proper data management is the key to great architecture. Ignoring this and abstracting data access and data management away just to have a convenient programming model is … problematic.

And in closing: Many of the proponents of O/R mapping that I run into (and that is a generalization and I am not trying to offend anyone – just an observation) are folks who don't know SQL and RDBMS technology in any reasonable depth and/or often have no interest in doing so. It may be worth exploring how tooling can better help the SQL-challenged instead of obscuring all data access deep down in some framework and make all data look like a bunch of local objects. If you have ideas, shoot. Comment section is open for business.

Categories: Architecture | SOA

See Part 1

Before we can do anything about deadlocks or deal with similar troubles, we first need to be able to tell that we indeed have a deadlock situation. Finding this out is a matter of knowing the respective error codes that your database gives you and having a mechanism to bubble that information up to some code that will handle the situation. So before we can think about and write the handling logic for failed/failing but safely repeatable transactions, we need to build a few little things. The first thing we’ll need is an exception class that wraps the original exception and indicates the reason for the transaction failure. The new exception class’s identity will later serve to filter out exceptions in a “catch” statement and take the appropriate actions.

using System;
using System.Runtime.Serialization;

namespace newtelligence.EnterpriseTools.Data
{
    [Serializable]
    public class RepeatableOperationException : Exception
    {
        public RepeatableOperationException() : base()
        { }

        public RepeatableOperationException(Exception innerException)
            : base(null, innerException)
        { }

        public RepeatableOperationException(string message, Exception innerException)
            : base(message, innerException)
        { }

        public RepeatableOperationException(string message) : base(message)
        { }

        public RepeatableOperationException(
            SerializationInfo serializationInfo,
            StreamingContext streamingContext)
            : base(serializationInfo, streamingContext)
        { }

        public override void GetObjectData(
            SerializationInfo info,
            StreamingContext context)
        {
            base.GetObjectData(info, context);
        }
    }
}
Having an exception wrapper with the desired semantics, we now need to be able to figure out when to replace the original exception with this wrapper and re-throw it up the call stack. The idea is that whenever you execute a database operation – or, more generally, any operation that might be repeatable on failure – you will catch the resulting exception and run it through a factory, which will analyze the exception and wrap it in the RepeatableOperationException if the issue at hand can be resolved by re-running the transaction. The (still a little naïve) code below illustrates how to use such a factory in application code. Later we will flesh out the catch block a little more, since we will lose the original call stack if we end up re-throwing the original exception as shown here:

try
{
    sprocUpdateAndQueryStuff.Parameters["@StuffArgument"].Value = argument;
    result = this.GetResultFromReader( sprocUpdateAndQueryStuff.ExecuteReader() );
}
catch( Exception exception )
{
    throw RepeatableOperationExceptionMapper.MapException( exception );
}

The factory class itself is rather simple in structure, but a bit tricky to put together, because you have to know the right error codes for all resource managers you will ever run into. In the example below I put in what I believe to be the appropriate codes for SQL Server and Oracle (corrections are welcome) and left the ODBC and OLE DB factories (for which one would have to inspect the driver type and the respective driver-specific error codes) blank. The factory inspects the exception’s type and delegates the mapping to a private method specialized for a specific managed provider.

using System;
using System.Data.SqlClient;
using System.Data.OleDb;
using System.Data.Odbc;
using System.Data.OracleClient;

namespace newtelligence.EnterpriseTools.Data
{
    public class RepeatableOperationExceptionMapper
    {
        /// <summary>
        /// Maps the exception to a RepeatableOperationException, if the error code
        /// indicates that the transaction is repeatable.
        /// </summary>
        /// <param name="sqlException">The exception thrown by the SqlClient provider.</param>
        /// <returns>The wrapped or the original exception.</returns>
        private static Exception MapSqlException( SqlException sqlException )
        {
            switch ( sqlException.Number )
            {
                case -2:   /* Client Timeout */
                case 701:  /* Out of Memory */
                case 1204: /* Lock Issue */
                case 1205: /* Deadlock Victim */
                case 1222: /* Lock Request Timeout */
                case 8645: /* Timeout waiting for memory resource */
                case 8651: /* Low memory condition */
                    return new RepeatableOperationException( sqlException );
                default:
                    return sqlException;
            }
        }

        private static Exception MapOleDbException( OleDbException oledbException )
        {
            switch ( oledbException.ErrorCode )
            {
                default:
                    return oledbException;
            }
        }

        private static Exception MapOdbcException( OdbcException odbcException )
        {
            return odbcException;
        }

        private static Exception MapOracleException( OracleException oracleException )
        {
            switch ( oracleException.Code )
            {
                case 104:  /* ORA-00104: Deadlock detected; all public servers blocked waiting for resources */
                case 1013: /* ORA-01013: User requested cancel of current operation */
                case 2087: /* ORA-02087: Object locked by another process in same transaction */
                case 60:   /* ORA-00060: Deadlock detected while waiting for resource */
                    return new RepeatableOperationException( oracleException );
                default:
                    return oracleException;
            }
        }

        public static Exception MapException( Exception exception )
        {
            if ( exception is SqlException )
                return MapSqlException( exception as SqlException );
            else if ( exception is OleDbException )
                return MapOleDbException( exception as OleDbException );
            else if ( exception is OdbcException )
                return MapOdbcException( exception as OdbcException );
            else if ( exception is OracleException )
                return MapOracleException( exception as OracleException );
            else
                return exception;
        }
    }
}

With that little framework of two classes, we can now selectively throw exceptions that convey whether a failed/failing transaction is worth repeating. Next step: How do we actually run such repeats and make sure we neither lose data nor make the user unhappy in the process? Stay tuned.
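As a teaser, a minimal sketch of what such a retry driver could look like (the RetryDriver name and shape are mine; the one-line exception class below is only a stand-in for the full RepeatableOperationException listed above, so that this compiles on its own):

```csharp
using System;

// Stand-in for the RepeatableOperationException shown earlier.
class RepeatableOperationException : Exception { }

class RetryDriver
{
    public delegate void TransactedWork();

    // Runs the work and re-runs it whenever it fails with a
    // RepeatableOperationException, up to maxAttempts; any other
    // exception bubbles up immediately.
    public static void Run(TransactedWork work, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                work();
                return; // committed
            }
            catch (RepeatableOperationException)
            {
                if (attempt >= maxAttempts)
                    throw; // give up and surface the failure
                // else: fall through and repeat the whole transaction
            }
        }
    }

    static int failuresLeft = 2;

    static void FlakyWork()
    {
        // simulates a transaction that is picked as deadlock victim twice
        if (failuresLeft-- > 0)
            throw new RepeatableOperationException();
    }

    static void Main()
    {
        Run(new TransactedWork(FlakyWork), 5);
        Console.WriteLine("committed after retries");
    }
}
```

The real version needs to re-run the *entire* transaction, including re-reading any data the transaction depends on, which is exactly why the scope question in the next part matters.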

Categories: Architecture | SOA | Enterprise Services | MSMQ

Deadlocks and other locking conflicts that cause transactional database operations to fail are things that puzzle many application developers. Sure, proper database design and careful implementation of database access (and appropriate support by the database engine) should take care of that problem, but it cannot do so in all cases. Sometimes, especially under stress and in other situations with high lock contention, a database has little choice but to pick at least one of the transactions competing for the same locks as the victim of resolving the deadlock situation and abort it. Generally speaking, transactions that abort and roll back are a good thing, because this behavior guarantees data integrity. In the end, we use transaction technology for those cases where data integrity is at risk. What’s interesting is that even though transactions are a technology that is explicitly about things going wrong, the strategy for dealing with failing transactions is often not much more than bubbling the problem up to the user and saying “We apologize for the inconvenience. Please press OK”.

The appropriate strategy for handling a deadlock or some other recoverable reason for a transaction abort on the application level is to back out of the entire operation and to retry the transaction. Retrying is a gamble that the next time the transaction runs, it won’t run into the same deadlock situation again or that it will at least come out victorious when the database picks its victims. Eventually, it’ll work. Even if it takes a few attempts. That’s the idea. It’s quite simple.

What is not really all that simple is the implementation. Whenever you are using transactions, you must make your code aware that such “good errors” may occur at any time. Wrapping your transactional ODBC/OLEDB/ADO/ADO.NET code or calls to transactional Enterprise Services or COM+ components in a try/catch block, writing errors to log files and showing message boxes to users just isn’t the right thing to do. The right thing is to simply do the same batch of work again, and again, until it succeeds.

The problem that some developers seem to have with “just retry” is that it’s not so clear what should be retried. It’s a problem of finding and defining the proper transaction scope. Especially when user interaction is in the picture, things easily get very confusing. If a user has filled in a form on a web page or some dialog window and all of his/her input is complete and correct, should the user be bothered with a message that the update transaction failed due to a locking issue? Certainly not. Should the user know when the transaction fails because the database is currently unavailable? Maybe, but not necessarily. Should the user be made aware that the application he/she is using is for some sudden reason incompatible with the database schema of the backend database? Maybe, but what does Joe in the sales department do with that valuable piece of information?

If stuff fails, should we just forget about Joe’s input and tell him to come back when the system is happier to serve him? So, in other words, do we have Joe retry the job? That’s easy to program, but that sort of strategy doesn’t really make Joe happy, does it?

So what’s the right thing to do? One part of the solution is a proper separation between the things the user (or a program) does and the things that the transaction does. This gives us two layers and “a job” that can be handed from the presentation layer down to the “transaction layer”. Once this separation is in place, we can come up with a mechanism that runs those jobs in transactions and automates how and when transactions are to be retried. Transactional MSMQ queues turn out to be a brilliant tool for making this very easy to implement. More tomorrow. Stay tuned.
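The layering can be sketched roughly like this (all names here are mine, invented for illustration): the presentation layer hands a self-contained job to the transaction layer and is done with it; a worker later runs each job in its own transaction. An in-memory Queue stands in for the transactional MSMQ queue that makes this durable and retry-friendly in the real design.

```csharp
using System;
using System.Collections;

// A self-contained description of the work, detached from the UI.
class SubmitOrderJob
{
    public string Customer;
    public int Quantity;
}

class TransactionLayer
{
    private Queue jobs = new Queue(); // stand-in for a transactional MSMQ queue

    // Called by the presentation layer: enqueue and return immediately.
    // Joe never waits on the transaction and never sees a deadlock.
    public void Submit(SubmitOrderJob job)
    {
        jobs.Enqueue(job);
    }

    // Called by a worker: run each pending job in its own transaction.
    // On a repeatable failure, the job would simply go back on the queue.
    public int RunPending()
    {
        int done = 0;
        while (jobs.Count > 0)
        {
            SubmitOrderJob job = (SubmitOrderJob)jobs.Dequeue();
            // begin transaction / execute job / commit elided here
            Console.WriteLine("processed order for " + job.Customer);
            done++;
        }
        return done;
    }

    static void Main()
    {
        TransactionLayer layer = new TransactionLayer();
        SubmitOrderJob job = new SubmitOrderJob();
        job.Customer = "Joe";
        job.Quantity = 1;
        layer.Submit(job);   // the presentation layer is finished here
        layer.RunPending();  // a worker picks the job up later
    }
}
```

With a transactional queue in place of the in-memory one, the dequeue, the database work, and the commit all join the same transaction, so a failed attempt automatically puts the job back for a retry.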

Categories: Architecture | SOA | Enterprise Services | MSMQ

I am presently doing some intense research on services, service patterns, message exchange patterns and many other issues related to services (No surprise there). However, I can't do that without external help and since many people are reading my blog, I can just as well start asking around right here:

I would like to get in touch with companies (preferably insurances and banks) that maintain a corporate history department. The ambitious goal I have is to reconstruct a few banking, insurance, or purchasing business processes of ca. 1955-1965. I have come to believe that there is a lot, really a lot, to be learned there that will be very useful to what we're all doing. The deal is that if you share, I share whatever I have as soon as I have it. My contact address is clemensv@newtelligence.com

Categories: SOA

I get emails like that very frequently. I have some news.

Short story: Microsoft is still willing and working to publish the application that I presented at TechEd Europe (see Benjamin's report) and they keep telling me that it will come out. Apparently there is a lot of consensus building to be done to get a big sample application out the door. So there's nothing to be found on MSDN, yet.

Little known secret: There are 15 lucky individuals who have already received (hand-delivered) the Proseware code as a technical preview under a non-disclosure agreement. Because we (newtelligence) designed and wrote the sample application, we have permission to distribute the complete sample to participants of our SOA workshops and seminars.

So if you want to get your hands on it, all you need to do is send mail to training@newtelligence.com to sign up for one of the public events [the next public date is Dec 1-3, and the event is held in German, unless we get swamped with international inquiries], or send email to the same address asking for an on-site workshop delivery. At this time, we (and MS) tie the code sample to workshop attendance so that you really understand why the application was built the way it's built and fully understand the guidance that the application implicitly and explicitly carries (and doesn't carry).

Categories: SOA | Web Services

I feel like I have been "out of business" for a really long time and like I really got nothing done in the past 3 months, even though that's objectively not true. I guess that's "conference & travel withdrawal", because I had tons and tons of bigger events in the first half of the year and 3 smaller events since TechEd Amsterdam in July. On the upside, I am pretty relaxed and have certainly reduced my stress-related health risks ;-)

So with winter and its short days coming up, and the other half of my life living a third of the way around the planet until next spring, I can and will spend some serious time on a bunch of things:

On the new programming stuff front:
     Catch up on what has been going on in Indigo in recent months, dig deeper into "everything Whidbey", figure out the CLR aspects of SQL 2005 and familiarize myself with VS Team System.

On the existing programming stuff front:
      Consolidate my "e:\development\*" directory on my harddrive, pull together all my samples and utilities for Enterprise Services, ASP.NET Web Services and other enterprise-development technologies, and create a production-quality library from them for us and our customers to use. Also, because the Indigo team has been doing quite a bit of COM/COM+ replumbing recently in order to have that programming model ride on Indigo, I have some hope that I can now file bugs/wishes against COM+ that might have a chance of being addressed. If that happens and a particular showstopper gets out of the way, I will reopen this project here and will, at the very least, release it as a toy.

On the architectural stuff front:
         Refine our SOA Workshop material, do quite a bit of additional work on the FABRIQ, evolve the Proseware architecture model, and get some pending projects done. In addition to our own SOA workshops (the next English-language workshop is held December 1-3, 2004 in Düsseldorf), there will be a series of invite-only Microsoft events on Service Orientation throughout Europe this fall/winter, and I am very happy that I will be speaking -- mostly on architecture topics -- at the Microsoft Eastern Mediterranean Developer Conference in Amman/Jordan in November and several other locations in the Middle East early next year. 

And even though I hate the effort around writing books, I am seriously considering writing a book about "Services" in the next months. There's a lot of stuff here on the blog that should really be consolidated into a coherent story, and there are lots and lots of considerations and motivations for decisions I made for FABRIQ and Proseware and other services-related work that I should probably write down in one place. One goal of the book would be to write a pragmatic guide on how to design and build services using currently shipping (!) technologies, one that focuses on how to get stuff done and not on how to craft new, exotic SOAP headers, how to do WSDL trickery, or do other "cool" but not necessarily practical things. So don't expect a 1200 page monster.

In addition to the "how to" part, I would also like to incorporate and consolidate other architects' good (and bad) practical design and implementation experiences, and write about adoption accelerators and barriers, and some other aspects that are important to get the service idea past the CFO. That's a great pain point for many people thinking about services today. If you would be interested in contributing experiences (named or unnamed), I certainly would like to know about it.

And I also think about a German-to-English translation and a significant (English) update to my German-language Enterprise Services book.....

[And to preempt the question: No, I don't have a publisher for either project, yet.]

Categories: Architecture | SOA | Blog | IT Strategy | newtelligence | Other Stuff | Talks

September 7, 2004
@ 11:09 AM

A little while ago a new build of Indigo found its way onto my desk. One thing that’s interesting about this particular interim milestone (don’t hope for juicy details here on the blog) is that it doesn’t support WSDL. No. Calm down. It’s not what you think. It just happens that the WSDL support wasn’t included in this particular version; it’ll be there, no worries.

Such interim binary drops that the Microsoft product groups give out to very early adopters are really only meant for the bravest of the brave with too much time on their hands. Therefore, the documentation is not entirely in synch with the feature set – and “not entirely” is a stretch. (Hey, I am not complaining.) So until someone told me “yeah, there’s no WSDL or metadata exchange worth talking about in this build”, I tried to find out how contract works in this build and of course I couldn’t find it. Well, that’s not really true. It turns out that contract works just like with ASP.NET 1.0 or ASP.NET 1.1 Web Services. At runtime, WSDL usually doesn’t play any role at all. Unless you use some funky dynamic binding logic, WSDL is just a design-time metadata container and the basis for generating CLR metadata and code. Although there is an implementation aspect to it when you generate proxies or server skeletons, the most important job of WSDL.EXE is to convert the WSDL rendering of the message contract into a format that the ASMX infrastructure can readily understand. That format happens to be classes with methods and attribute annotations. Here is a client and a server I just typed up with Notepad:



<% @WebService class="MyService" language="C#"%>

using System.Web.Services;
using System.Web.Services.Protocols;

public class MyService
{
    [WebMethod]
    public string Hello( string Test )
    {
        return "Hello " + Test;
    }
}

using System;
using System.Web.Services;
using System.Web.Services.Protocols;

public class MyService : SoapHttpClientProtocol
{
    public string Hello( string Test )
    {
        return (string)Invoke("Hello", new object[]{Test})[0];
    }
}

public class MainApp
{
    static void Main()
    {
        MyService m = new MyService();
        m.Url = "http://localhost/contract.asmx";
        Console.WriteLine("Result: " + m.Hello("Test"));
    }
}

Drop the contract.asmx into x:\inetpub\wwwroot, compile the client with “csc client.cs” and run it. No WSDL ever changed hands, no “Add Web Reference”, it just works. Ok, ok, it’s the year ‘04 now and there’s really no magic there anymore. Now here’s a tiny bit of refactoring:



<% @WebService class="MyService" language="C#"%>

using System.Web.Services;
using System.Web.Services.Protocols;

public interface IMyService
{
    string Hello( string Test );
}

public class MyService : IMyService
{
    [WebMethod]
    public string Hello( string Test )
    {
        return "Hello " + Test;
    }
}

using System;
using System.Web.Services;
using System.Web.Services.Protocols;

public interface IMyService
{
    string Hello( string Test );
}

public class MyService : SoapHttpClientProtocol, IMyService
{
    public string Hello( string Test )
    {
        return (string)Invoke("Hello", new object[]{Test})[0];
    }
}

public class MainApp
{
    static void Main()
    {
        MyService m = new MyService();
        m.Url = "http://localhost:8080/contract.asmx";
        Console.WriteLine("Result: " + m.Hello("Test"));
    }
}

Extracting the interface from server and proxy makes it very, very clear that we’re dealing with the very same message contract. Mind that we only have source-code level contract equivalence between client and server here. The compiled code on either side yields distinct IMyService types and that’s supposed to be that way. In the case I am illustrating here, the language C# serves as the metadata language (and the clipboard is the mechanism) for sharing contract between client and server (or endpoints).

There are two things I find (mildly) annoying about ASP.NET Web Services 1.x, and they both become quite apparent in this example: #1, the server side and the client side have a different minimum set of required attributes, and #2, ASP.NET’s ASMX support doesn’t look at inherited method-level attributes. The following does not even compile:



<% @WebService class="MyService" language="C#"%>

using System.Web.Services;
using System.Web.Services.Protocols;

public interface IMyService
{
    string Hello( string Test );
}

public class MyService : IMyService
{
    public string Hello( string Test )
    {
        return "Hello " + Test;
    }
}

using System;
using System.Web.Services;
using System.Web.Services.Protocols;

public interface IMyService
{
    string Hello( string Test );
}

public class MyService : SoapHttpClientProtocol, IMyService
{
    public string Hello( string Test )
    {
        return (string)Invoke("Hello", new object[]{Test})[0];
    }
}

public class MainApp
{
    static void Main()
    {
        MyService m = new MyService();
        m.Url = "http://localhost:8080/contract.asmx";
        Console.WriteLine("Result: " + m.Hello("Test"));
    }
}

And, frankly, it’s probably not bad that it doesn’t compile, because half of the attributes on the resulting interface declaration would be useless on either side.

Indigo pulls both sides together into the notion of a service contract that’s good for either side (I am using a simple variant of the “publicly known” notation here):

public interface IMyService
{
    string Hello( string Test );
}

If you want to see it that way, this is Indigo’s IDL (Interface Definition Language). There’s an equivalence transformation from and to WSDL, if you want to share the contract with folks who program on other platforms or who use another programming language. There are just no angle brackets here and it’s much easier to read, too.

So if you see code that I wrote and you find (even today) seemingly unnecessary interface declarations that are implemented on a single web service class within a project, nowhere else, and are also not referenced from anywhere – they might not be as unnecessary as they seem. It’s simply IDL!

More confusingly, you might find those declarations replicated! in source code! in another service! Yes, because services are designed to be evolved independently of each other. So while the target service is already in V4, the consumer may still be on the level of V2. Moving up to the V4 interface version is a conscious choice by the developer of the service consumer – and at that point in time he/she imports the most recent contract. Whether that’s copy/paste of a C#/VB/C++ declaration or clicking “Update” on some menu in Visual Studio that turns WSDL into proxy code is not very important; it’s just a matter of tool preference.

Using explicit interface declarations with Web Services is strictly a “contract first” model. It’s just not using WSDL, that’s all.

Categories: SOA

(1) Policy-negotiated behavior, (2) Explicitness of Boundaries, (3) Autonomy and (4) Contract/Schema Exchange are the proclaimed tenets of service orientation. As I am getting along with the design for the services infrastructure we're working on, I find that one of the four towers over the others in importance and really doesn't fit well with them: Autonomy.

P, E and CE say things about the edge of a service. P speaks about how a service negotiates its edge behavior, guarantees and expectations with others. E speaks about how talking to another service is and must be different from talking to anything within your own service. CE speaks about how it is possible to establish a common understanding about messages and data while allowing independent evolution of internals by not sharing programming-level type constructs.

P is about creating and using techniques for metadata exchange and metadata-controlled dynamic configuration of service-edges, E is about creating and using techniques for shaping service-edge communication paths and tools in a way that services don't get into each other's underwear, and CE is about techniques for defining, exchanging, and implementing the flow of data across service-edges so that services can deal with each other's "stuff".

P, E and CE are guiding tenets for "Web Services" and the stack that evolves around SOAP. These are about "edge stuff". Proper tooling around SOAP, WSDL, XML Schema, WS-Policy (think ASMX, WSE, Axis, or Indigo) makes or will make it relatively easy for any programmer to do "the right thing" about these tenets.

Autonomy, on the other hand, is rather independent from the edge of a service. It describes a fundamental approach to architecture. An "autonomous digital entity" is "alive", makes independent decisions, hides its internals, and is in full control of the data it owns. An autonomous digital entity is not fully dependent on stuff happening on its edge or on inbound commands or requests. Instead, an autonomous service may routinely wake up from a self-chosen cryostasis and check whether certain data items that it is taking care of are becoming due for some actions, or it may decide that it is time to switch on the lights and lower the window blinds to fend off burglars while its owner is on vacation.

Autonomy is actually quite difficult to (teach and) achieve and much more a matter of discipline than a matter of tooling. If you have two "Web services" that sit on top of the very same data store and frequently touch what's supposed to be each other's private (data) parts, each of them may very well fulfill the P, E, CE tenets, but they are not autonomous. If you try to scale out and host such a pair or group of Web services in separate locations and on top of separate stores, you end up with a very complicated Siamese-twins-separation surgery.

That gets me to very clearly separate the two stories: Web Services <> Services.

A service is an autonomous digital entity that may or may not accept commands, inquiries (and data) from the outside world. If it chooses to accept such input, it can choose to do so through one or more web service edges that fulfill the P, E, CE tenets and/or may choose to accept totally different forms of input. If it communicates with the outside world from within, it may or may not choose to do so via web services. A service is a program.

A web service is an implementation of a set of edge definitions (policies and contracts and channels) that fronts a service and allows services to communicate with each other. Two services may choose to communicate using multiple web services, if they wish to do so. A web service is a communication tool.

With that, I'll cite myself from three paragraphs earlier: "If you have two 'Web services' that sit on top of the very same data store [...] they are not autonomous. If you try to scale out [...] you end up with a very complicated Siamese-twins-separation surgery." ... and correct myself: If you have two "Web services", they don't sit on the data store; they front a service. The service is the smallest deployable unit. The service provides the A; the web services bring P, E and CE to the table. A web service may, indeed, just be a message router, security gateway, translator or other intermediary that handles messages strictly on the wire-level and dispatches them to yet another web service fronting an actual service that does the work prescribed by the message.

All of this, of course, causes substantial confusion about the duplicate use of the word service. The above is terribly difficult to read. I wish it were still possible to kill the (inappropriate) term "web service" and just call it "edge", or "XML Edge", or (for the marketing people) "SOAP Edge Technology", or maybe "Service Edge XML", although I don't think the resulting acronym would go over well in the U.S.

Categories: SOA

August 6, 2004
@ 02:40 PM

I don't blog much in summer. That's mostly because I am either enjoying some time off or I am busy figuring out "new stuff".

So here's a bit of a hint what currently keeps me busy. If you read this in an RSS aggregator, you better come to the website for this explanation to make sense.

This page here is composed from several elements. There are navigation elements on the left, including a calendar, a categories tree, and an archive link list, all built from index information in the content store. The rest of the page, header and footer elements aside, contains the articles, which are composed onto the page based on a set of rules, and there's some content aggregation going on to produce, for instance, the comment counter. Each of these jobs takes a little time and they are worked on sequentially; while the data is acquired from the backend, the web-site (rendering) thread sits idle.

Likewise, imagine you have an intranet web portal that's customizable and gives the user all sorts of individual information like the items on his/her task list, the unread mails, today's weather at his/her location, a list of current projects and their status, etc.  All of these are little visual components on the page that are individually requested and each data item takes some time (even if not much) to acquire. Likely more than here on this dasBlog site. And all the information comes from several, distributed services with the portal page providing the visual aggregation. Again, usually all these things are worked on sequentially. If you have a dozen of those elements on a page and it takes unusually long to get one of them, you'll still sit there and wait for the whole dozen. If the browser times out on you during the wait, you won't get anything, even if 11 out of 12 information items could be acquired.

One aspect of what I am working on is having all those 12 things done at the same time, allowing the rendering thread to do a little bit of work whenever one of the items is ready, and allowing the page to proceed whenever it loses its patience with one or more of those jobs. So all of the data acquisition work happens in parallel rather than in sequence, and the results can be collected and processed in any order and as they are ready. What's really exciting about this from an SOA perspective is that I am killing request/response in the process. The model sits entirely on one-way messaging. No return values, no output parameters anywhere in sight.
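A minimal sketch of that scatter/gather idea, using nothing but the thread pool. The job names and the fake data source are made up for illustration; the real thing sits on one-way messages rather than in-process callbacks:

```csharp
using System;
using System.Collections;
using System.Threading;

class PageComposer
{
    static Hashtable results = Hashtable.Synchronized(new Hashtable());

    static void Main()
    {
        // fire all acquisition jobs at once instead of sequentially
        string[] jobs = { "TaskList", "UnreadMail", "Weather" };
        ManualResetEvent[] done = new ManualResetEvent[jobs.Length];
        for (int i = 0; i < jobs.Length; i++)
        {
            done[i] = new ManualResetEvent(false);
            ThreadPool.QueueUserWorkItem(new WaitCallback(Acquire),
                                         new object[] { jobs[i], done[i] });
        }

        // lose patience after two seconds; render whatever is ready
        WaitHandle.WaitAll(done, 2000, false);
        foreach (string job in jobs)
        {
            object data = results[job];
            Console.WriteLine("{0}: {1}", job, data == null ? "(pending)" : data);
        }
    }

    static void Acquire(object state)
    {
        object[] args = (object[])state;
        // placeholder for the actual backend/service call
        results[(string)args[0]] = "data for " + (string)args[0];
        ((ManualResetEvent)args[1]).Set();
    }
}
```

The point of the sketch is only the shape: jobs run concurrently, results are collected as they arrive, and the page gives up on stragglers instead of waiting for the whole dozen.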

In case you wondered why it is so silent around here ... that's why.

Categories: SOA

I was a little off when I compared my problem here to a tail call. Gordon Weakliem corrected me with the term "continuation".

The fact that the post got 28 comments shows that this seems to be an interesting problem and, naming aside, it is indeed a tricky thing to implement in a framework when the programming language you use (C# in my case) doesn't support the construct. What's specifically tricky about the concrete case that I have is that I don't know where I am yielding control to at the time when I make the respective call.

I'll recap. Assume there is the following call

CustomerService cs = new CustomerService();
cs.FindCustomer(customerId);

FindCustomer is a call that will not return any result as a return value. Instead, the invoked service comes back into the caller's program at some completely different place, such as this:

public void FindCustomerReply(Customer[] result)
{
    // the result arrives here, on a different thread, possibly much later
}

So what we have here is a "duplex" conversation. The result of an operation initiated by an outbound message (call) is received, some time later, through an inbound message (call), but not on the same thread and not on the same "object". You could say that this is a callback, but that's not precisely what it is, because a "callback" usually happens while the initiating call (as above FindCustomer) has not yet returned back to its scope or at least while the initiating object (or an object passed by some sort of reference) is still alive. Here, instead, processing of the FindCustomer call may take a while and the initiating thread and the initiating object may be long gone when the answer is ready.

Now, the additional issue I have is that at the time when the FindCustomer call is made, it is not known which "FindCustomerReply" message handler is going to process the result, and it is really not known what happens next. The decision about what happens next and which handler is chosen depends on several factors, including the time that it takes to receive the result. If FindCustomer is called from a web page and the service providing FindCustomer drops a result at the caller's doorstep within 2-3 seconds [1], the FindCustomerReply handler can go and hijack the initial call's thread (and HTTP context) and render a page showing the result. If the reply takes longer, the web page (the caller) may lose its patience [2] and choose to continue by rendering a page that says "We are sending the result to your email account.", and the message handler will not throw HTML into an HTTP response on an open socket, but rather render it to an email, send it via SMTP, and maybe even alert the user through his/her instant messenger when/if the result arrives.

[1] HTTP Request => FindCustomer() =?> "FindCustomerReply" => yield to CustomerList.aspx => HTTP Response
[2] HTTP Request => FindCustomer() =?> Timeout!            => yield to YouWillGetMail.aspx => HTTP Response
                               T+n =?> "FindCustomerReply" => SMTP Mail
                                                           => IM Notification

So, in case [1] I need to correlate the reply with the request and continue processing on the original thread. In case [2], the original thread continues on a "default path" without an available reply and the reply is processed on (possibly two) independent threads and using two different notification channels.
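The correlation part of cases [1] and [2] can be sketched with a pending-call table. All names here are hypothetical illustrations, not the actual infrastructure; the real dispatch also has to pick the right notification channels:

```csharp
using System;
using System.Collections;
using System.Threading;

// Hypothetical correlation table: the initiating call registers itself,
// the reply handler looks the entry up and decides which path to take.
class PendingCall
{
    public ManualResetEvent Arrived = new ManualResetEvent(false);
    public object Reply;
}

class Correlator
{
    static Hashtable pending = Hashtable.Synchronized(new Hashtable());

    // called on the web request thread after the one-way call went out
    public static object WaitForReply(Guid correlationId, int timeout)
    {
        PendingCall call = new PendingCall();
        pending[correlationId] = call;
        if (call.Arrived.WaitOne(timeout, false))
        {
            pending.Remove(correlationId);
            return call.Reply;          // case [1]: hijack this thread, render page
        }
        pending.Remove(correlationId);  // case [2]: timeout, default path continues
        return null;
    }

    // called on the message-receive thread when the reply shows up
    public static void DispatchReply(Guid correlationId, object reply)
    {
        PendingCall call = (PendingCall)pending[correlationId];
        if (call != null)
        {
            call.Reply = reply;
            call.Arrived.Set();         // original thread is still waiting
        }
        else
        {
            // original thread is long gone: use the alternate channels,
            // e.g. render to SMTP mail and fire the IM notification
        }
    }
}
```

A production version would need to close the race between the timeout removal and a concurrently arriving reply; the sketch only shows the two routing outcomes.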

A slightly different angle. Consider a workflow application environment in a bank, where users are assigned tasks and simply fetch the next thing from the to-do list (by clicking a link in an HTML-rendered list). The reply that results from "LookupAndDoNextTask" is a message that contains the job that the user is supposed to do.  

[1] HTTP Request => LookupAndDoNextTask() =?> Job: "Call Customer" => yield to CallCustomer.aspx => HTTP Response
[2] HTTP Request => LookupAndDoNextTask() =?> Job: "Review Credit Offer" => yield to ReviewCredit.aspx => HTTP Response
[3] HTTP Request => LookupAndDoNextTask() =?> Job: "Approve Mortgage" => yield to ApproveMortgage.aspx => HTTP Response
[4] HTTP Request => LookupAndDoNextTask() =?> No Job / Timeout => yield to Solitaire.aspx => HTTP Response

In all of these cases, calls to "FindCustomer()" and "LookupAndDoNextTask()" that are made from the code that deals with the incoming request will (at least in the theoretical model) never return to their caller, and the thread will continue to execute in a different context that is "TBD" at the time of the call. By the time the call stack is unwound and the initiating call (like FindCustomer) indeed returns, the request is therefore fully processed and the caller may not perform any further actions.

So the issue at hand is to make that fact clear in the programming model. In ASP.NET, there is a single construct called "Server.Transfer()" for that sort of continuation, but it's very specific to ASP.NET and requires that the caller knows where it wants to yield control to. In the case I have here, the caller knows that it is surrendering the thread to some other handler, but it doesn't know to whom, because this is dynamically determined by the underlying frameworks. All that's visible, and should be visible, in the code is a "normal" method call.

cs.FindCustomer(customerId) might therefore not be a good name, because it looks "too normal". And of course I don't have the powers to invent a new statement for the C# language like continue(cs.FindCustomer(customerId)) that would result in a continuation that simply doesn't return to the call location. Since I can't do that, there has to be a different way to flag it. Sure, I could put an attribute on the method, but Intellisense wouldn't show that, would it? So it seems the best way is to have a convention of prefixing the method name.

There were a bunch of ideas in the comments for method-name prefixes. Here is a selection:

  • cs.InitiateFindCustomer(customerId)
  • cs.YieldFindCustomer(customerId)
  • cs.YieldToFindCustomer(customerId)
  • cs.InjectFindCustomer(customerId)
  • cs.PlaceRequestFindCustomer(customerId)
  • cs.PostRequestFindCustomer(customerId)

I've got most of the underlying correlation and dispatch infrastructure sitting here, but finding a good programming model for that sort of behavior is quite difficult.

[Of course, this post won't make it on Microsoft Watch, eWeek or The Register]

Categories: Architecture | SOA | Technology | ASP.NET | CLR

July 19, 2004
@ 07:07 AM

The recording of last week's .NET Rocks show, on which I explained my view on the "services mindset" (at 4 AM in the morning), is now available for download from franklins.net.

Categories: SOA

newtelligence AG will be hosting an open workshop on service-oriented development, covering principles, architecture ideas and implementation guidance on October 13-15 in Düsseldorf, Germany.

The workshop will be held in English, will be hosted by my partner and “Mr. Methodologies” Achim Oellers and myself, and is limited to just 15 (!) attendees to assure an interactive environment that maximizes everyone’s benefit. The cap on the number of attendees also allows us to adjust the content to individual needs to some extent.

We will cover the "services philosophy" and theoretical foundations of service-compatible transaction techniques, scalability and federation patterns, autonomy, and other important aspects. And once we've shared our "services mind-set", we will take the participants on a very intense "guided tour" through (a lot of) very real, production-level-quality code (including the Proseware example application that newtelligence built for Microsoft Corporation) that turns the theory into practice on the Windows platform and shows that there's no need to wait for some shiny future technology to come out in 2 years' time to benefit from services today.

Regular pricing for the event is €2500.00 (plus applicable taxes) and includes:

  • 3-day workshop in English from 9:00 – 18:00 (or later depending on topic/evening) 
  • 2 nights' hotel stay (Oct 13th and 14th)
  • Group dinner with the experts on the first night.  The 2nd night is at your disposal to enjoy Düsseldorf’s fabulous Altstadt at your own leisure
  • Lunch (and snacks/drinks throughout the day)
  • Printed materials (in English), as appropriate
  • Post-Workshop CD containing all presentations and materials used/shown

For registration inquiries, information about the prerequisites, as well as for group and early-bird discount options, please contact Mr. Fons Habes via training@newtelligence.com. If the event is sold out at the time of your inquiry or if you are busy on this date, we will be happy to pre-register you for one of the upcoming event dates or arrange for an event at your site.

Categories: Architecture | SOA | newtelligence

July 15, 2004
@ 11:17 AM

We have a bit of a wording problem. What I am currently building has a bit of (though not precisely) a notion of "tail calls". Here's an example:

public void LookupMessage(int messageId)
{
    MessageStoreService messageStore = new MessageStoreService();
    messageStore.LookupMessage(messageId);
    // control is surrendered here; the reply surfaces elsewhere
}

The call to LookupMessage() doesn't return anything as a return value or through output parameters. Instead, the resulting reply message surfaces moments later at a totally different place within the same application. At the same time, the object with the method you see here, surrenders all control to the (anonymous) receiver of the reply. It's a tiny bit like Server.Transfer() in ASP.NET.

So the naming problem is that neither of "GetMessage()", "LookupMessage()", "RequestMessage()" sounds right and they all look odd if there's no request/response pattern. The current favorite is to prefix all such methods with "Yield" so that we'd have "YieldLookupMessage()". Or "LookupMessageAndYield()"? Or something else?

Update: Also consider this

public void LookupCustomer(int customerId)
{
    CustomerService cs = new CustomerService();
    cs.LookupCustomer(customerId);
}

Categories: SOA

Carl invited me for .NET Rocks on Thursday night. That is July 15th, 10 PM-Midnight Eastern Standard Time (U.S.) which is FOUR A.M. UNTIL SIX A.M. Central European Time (CET) on Friday morning. I am not sure whether my brain can properly operate at that time. The most fun thing would be to go out drinking Thursday night ;-)   I want to talk about (guess what) Services. Not Indigo, not WSE, not Enterprise Services, not SOAP, not XML. Services. Mindset first, tools later.

Categories: Architecture | SOA

July 12, 2004
@ 01:07 PM

I've had several epiphanies in the past 12 months or so. I don't know how it is for other people, but the way my thinking evolves is that I've got some inexpressible "thought clouds" going around in my head for months that I can't really get on paper or talk about in any coherent way. And then, at some point, there's some catalyst and "bang", it all comes together: suddenly those clouds start raining ideas and my thinking very rapidly goes through an actual paradigm shift.

The first important epiphany occurred when Arvindra gave me a compact explanation of his very pragmatic view on Agent Technology and Queueing Networks, which booted the FABRIQ effort. Once I saw what Arvindra had done in his previous projects and I put that together with my thinking about services, a lot of things clicked. The insight that formed from there was that RPC'ish request/response interactions are very restrictive exceptions in a much larger picture where one-way messages and much more complex message flow-patterns possibly involving an arbitrary number of parties are the norm.

The second struck me while on stage in Amsterdam during the "The Nerd, The Suit, and the Fortune Teller" play, as Pat and myself were discussing Service Oriented User Interaction. (You need to understand that we had very limited time for preparation, and hence we had a good outline, but the rest of the script essentially said "go with the flow", so most of it was pure improvisation theater.) The insight that formed can (with all due respect) be shortened to "the user is just another service". Not only shall users drive the interaction by issuing messages (commands) to a system from which they expect one or more out of a set of possible replies; there should also be a way for systems to drive an interaction by issuing messages to users, expecting one or more out of a set of possible replies. There is no good reason why either of these two directions of driving the interaction should receive preferred treatment. There is no client and there is no server. There are just roles in interactions. That moment, the 3-layer/3-tier model of building applications died a quick and painless death in my head. I think I have a new one, but the clouds are still raining ideas. Too early for details. Come back and ask in a few months.

Categories: Architecture | SOA

In my comment view for the last post (comment #1), Piyush Pant writes about the confusion around different pipeline models and frameworks that are popping up all over the place and mentions Proseware, so I need to clarify some things:

I'll address the "too many frameworks" concern first: Proseware's explicit design goal, and my job, was to use the technologies ASP.NET Web Services, WSE 2.0, IIS, MSMQ, and Enterprise Services as purely as possible, and I intentionally did not introduce yet another framework for the runtime bits beyond a few utility classes used by the services as common infrastructure (like a config-driven web service proxy factory, the queue listener, or the just-in-time activation proxy pooling). What my job was, and what I reasonably succeeded at, was to show that:

Writing Service Oriented Applications on today's Windows Server 2003 platform does not require yet another framework.

The framework'ish pieces that I had to add simply address some deployment issues, like creating accounts, setting ACLs, or setting up databases, that need to be done in a "real" app that isn't a toy. Such things are sometimes difficult to abstract at the level of what the .NET Framework can offer as a general-purpose platform, or are simply not there yet. All of these extra classes reside in an isolated assembly that's only used by the installers.

The total number of utility classes that play a role of any importance at runtime is 5 (in words five) and none of them has more than three screen pages worth of actual code. Let me repeat:

Writing Service Oriented Applications on today's Windows Server 2003 platform does not require yet another framework.

I do have a dormant (newtelligence-owned) code branch sitting here that'd make a lot of things in Proseware easier and more elegant to develop and makes reconfiguring services more convenient, but it's a developer convenience and productivity framework. No pipelines, no other architecture, just a prettier shell around the exact Proseware architecture and technologies I chose.

To illustrate my point about the fact that we don't need another entirely new framework, I have here (MessageQueueWebRequest.cs.txt, MessageQueueWebResponse.cs.txt) an early 0.1 prototype copy of our MessageQueueWebRequest/-WebResponse class pair that supports sending WS messages through MSMQ. (That prototype only does very simple one-way messages; you can do a lot more with MSMQ).  

Take the code, put it in yours, create a private queue, take an arbitrary ASMX WebService proxy, call MessageQueueWebRequest.RegisterMSMQProtocol() when your app starts, instantiate the proxy, set the Url property of the proxy to msmq://mymachine/private$/myqueue, invoke the proxy and watch how a SOAP message materializes in the queue.
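In code, those steps amount to something like the following sketch. MessageQueueWebRequest is the prototype class linked above; CustomerServiceProxy and customerId are placeholders for whatever generated ASMX proxy you happen to use:

```csharp
// one-time registration so the WebRequest machinery understands msmq:// URLs
MessageQueueWebRequest.RegisterMSMQProtocol();

// any generated ASMX proxy will do; CustomerServiceProxy is a placeholder
CustomerServiceProxy proxy = new CustomerServiceProxy();
proxy.Url = "msmq://mymachine/private$/myqueue";

// the invocation serializes the SOAP message into the queue
// instead of pushing it over HTTP; there is no reply to wait for
proxy.FindCustomer(customerId);
```

The proxy code doesn't change at all; only the Url scheme decides whether the message travels over HTTP or lands in the queue.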

Next step: use a WSE proxy. Works too. I'll leave the receiver logic to your imagination, but that's not really much more than listening to the queue and throwing the message into a WSE 2.0 SoapMethod or throwing it as a raw HTTP request at an ASMX WebMethod or by using a SimpleWorkerRequest on a self-hosted ASP.NET AppDomain (just like WebMatrix's Cassini hosts that stuff).


On to "pipelines" in the same context: Pipelines are a very common design pattern and you can find hundreds of variations of them in many projects (likely dozens from MS) which all have some sort of a notion of a pipeline. It's just "pipeline", not Pipeline(tm) 2003 SP1.

User-extensible pipeline models are a nice idea, but I don't think they are very useful to have or consider for most services of the type that Proseware has (and that covers a lot of types).

Frankly, most things that are done with pipelines in generalized architectures that wrap around endpoints (in/out crosscutting pipelines) and that are not about "logging" (which is, IMHO, more useful if done explicitly and in-context) are already in the existing technology stack (Enterprise Services, WSE) or are really jobs for other services.

There is no need to invent another pipeline to process custom headers in ASMX, if you have SoapExtensions. There is no need to invent a new pipeline model to do WS-Security, if you can plug the WSE 2.0 pipeline into the ASMX SoapExtension pipeline already. There is no need to invent a new pipeline model to push a new transaction context on the stack, if you can hook the COM+ context pipeline into your call chain by using ES. There is no need to invent another pipeline for authorization, if you can hook arbitrary custom stuff into the ASP.NET Http Pipeline or the WSE 2.0 pipeline already has or simply use what the ES context pipeline gives you.

I just enumerated four (!) different pipeline models, and all of them are in the bits you already have on a shipping platform today, and, as it happens, all of them compose really well with each other. The fact that I am writing this might show that most of us just use and configure our services without even thinking of them as a composite pipeline model.

"We don't need another Pipeline" (I want Tina Turner to sing that for me).

Of course there are other pipeline jobs, right? Mapping!

Well, mapping between schemas is something that goes against the notion of a well-defined contract of a service. Either you have a well-defined contract or two or three or you don't. If you have a well-defined contract and there's a sender that doesn't adhere to it, it's the job of another service to provide that sort of data negotiation, because that's a business-logic task in and by itself.

Umm ... ah! Validation!

That might be true if schema validation is enough, but validation of data is a business logic level task if things get more complex (like if you need to check a PO against your catalog and need to check whether that customer is actually entitled to get a certain discount bracket). That's not a cross-cutting concern. That's a core job of the app.

Pipelines are for plumbers


Now, before I confuse everyone (and because Piyush mentioned it explicitly):

FABRIQ is a wholly different ballgame, because it is precisely a specialized architecture for dynamically distributable, queued (pull-model), one-way pipeline message processing and that does require a bit of a framework, because the platform doesn't readily support it.

We don't really have a notion of an endpoint in FABRIQ that is the default terminal for any message arriving at a node. We just let stuff asynchronously flow in one direction and across machines and handlers can choose to look at, modify, absorb or yield resultant messages into the pipeline as a result of what they do. In that model, the pipeline is the application. Very different story, very different sets of requirements, very different optimization potential and not really about services in the first place (although we stick to the tenets), but rather about distributing work dynamically and about doing so as fast as we can make it go.

Sorry, Piyush! All of that totally wasn't going against your valued comments, but you threw a lit match into a very dry haystack.


Categories: Architecture | SOA

Benjamin Mitchell wrote a better summary of my "Building Proseware Inc." session at TechEd Amsterdam than I ever could.

Because ... whenever the lights go on and the mike is open, I somehow automatically switch into an adrenalin-powered auto-pilot mode that luckily works really well and since my sessions take up so much energy and "focus on the moment", I often just don't remember all the things I said once the session is over and I am cooled down. That also explains why I almost never rehearse sessions (meaning: I never ever speak to the slides until I face an audience) except when I have to coordinate with other speakers. Yet, even though most of my sessions are really ad-hoc performances, whenever I repeat a session I usually remember whatever I said last time just at the very moment when the respective topic comes up, so there's an element of routine. It is really strange how that works. That's also why I am really a bad advisor on how to do sessions the right way, because that is a very risky approach. I just write slides that provide me with a list of topics and "illustration helpers" and whatever I say just "happens". 

About Proseware: All the written comments that people submitted after the session have been collected and are being read and it's very well understood that you want to get your hands on the bits as soon as possible. One of my big takeaways from the project is that if you're Microsoft, releasing stuff that is about giving "how-to" guidance is (for more reasons you can imagine) quite a bit more complicated than just putting bits up on a download site. It's being worked on. In the meantime, I'll blog a bit about the patterns I used whenever I can allocate a timeslice.

Categories: Architecture | SOA | TechEd Europe

Simple question: Please show me a case where inheritance and/or full data encapsulation makes sense for business/domain objects on the implementation level. 

I'll steal the low-hanging fruit: Address. Address is a great candidate when you look at an OOA model as you could model yourself to death having BaseAddress(BA) and BA<-StreetAddress(SA) and BA<-PostalAddress(PA) and SA<-US_StreetAddress and SA<-DE_StreetAddress and SA<-UK_StreetAddress and so forth.

When it comes to implementation, you'll end up refactoring the class into one thing: Address. There's probably an AddressType attribute, and there's a Country field that indicates the formatting, and since implementing a full address-validation component is way too much work, that feature gets cut anyway; hence we end up with a multiline text field holding the properly formatted address, and stuff like Street and PostOfficeBox (eventually normalized to AddressField), City, PostalCode, Country, and Region is kept separate really just to make searching easier and faster. The stuff that goes onto the letter envelope is really only the preformatted address text.

Maybe I am too much of a data (read: XML, Messages, SQL) guy by now, but I have lost faith that objects are any good at the "business logic" abstraction level. The whole inheritance story is usually refactored away for very pragmatic reasons, and the encapsulation story isn't all that useful either. You simply can't pragmatically regard validation of data at a property get/set level as a useful general design pattern, because a type like Address is one type with interdependencies between its elements, and not simply a container of types. The rules for Region depend on Country, and the rules for AddressField (or Street/PostOfficeBox) depend on AddressType. Since the object can't know your intent about what data you want to supply to it at a property get/set level, it can't do meaningful validation at that level. Hence, you end up calling something like address.Validate(), and from there it's really a small step to separate code and data into a message and a service that deals with it, and call Validate(address). And that sort of service is the best way to support polymorphism over a scoped set of "classes", because it can potentially support "any" address schema and can yet concentrate and share all the validation logic (which is largely the same across whatever format you might choose) in a single place, and not spread it across an array of specialized classes that's much, much harder to maintain.
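A tiny sketch of that Validate(address) shape, with made-up rules and field names purely for illustration. The address is dumb data; the interdependent rules live in one place that sees the whole message:

```csharp
// the message: pure data, no behavior, no per-property gatekeeping
public class Address
{
    public string AddressType;   // e.g. "Street" or "PostalBox"
    public string AddressField;  // street or P.O. box, depending on AddressType
    public string City;
    public string PostalCode;
    public string Country;
    public string Region;
}

// the service: validation logic shared across all address variants
public class AddressValidationService
{
    public static bool Validate(Address a)
    {
        if (a.AddressField == null || a.City == null || a.Country == null)
            return false;

        // rules for one element depend on the others, so validation can
        // only happen here, with the complete address in view
        switch (a.Country)
        {
            case "US":
                return a.Region != null
                    && a.PostalCode != null && a.PostalCode.Length >= 5;
            case "DE":
                return a.PostalCode != null && a.PostalCode.Length == 5;
            default:
                return true; // minimal fallback rule for this sketch
        }
    }
}
```

Adding a country here means adding a case in one place, not another subclass in a parallel hierarchy.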

What you end up with are elements and attributes (infoset) for the data that flows across, services that deal with the data that flows, and rows and columns that efficiently store data and let you retrieve it flexibly and quickly. Objects lost (except on the abstract and conceptional analysis level where they are useful to understand a problem space) their place in that picture for me.

While objects are fantastic for frameworks, I've absolutely unlearned why I would ever want them on the business logic level in practice. Reeducate me.

Categories: Architecture | SOA

We've built FABRIQ, we've built Proseware. We have written seminar series about Web Services Best Practices and Service Orientation for Microsoft Europe. I speak about services and aspects of services at conferences around the world. And at all events where I talk about Services, I keep hearing the same question: "Enough of the theory, how do I do it?"

Therefore we have announced a seminar/workshop around designing and building service-oriented systems that puts together all the things we've found out in the past years about how services can be built today on today's Microsoft technology stack, and how your systems can be designed with migration to the next-generation Microsoft technology stack in mind. Together with our newtelligence Associates, we are offering this workshop for in-house delivery at client sites worldwide and are planning to announce dates and locations for central, "open for all" events soon.

If you are interested in inviting us for an event at your site, contact Bart DePetrillo, or write to sales@newtelligence.com. If you are interested in participating at a central seminar, Bart would like to hear about it (no obligations) so that we can select reasonable location(s) and date(s) that fit your needs.

Categories: Architecture | SOA | FABRIQ | Indigo | Web Services

June 4, 2004
@ 09:15 AM

Autonomy means that a service is alive.

Here are my sub-tenets:

  • It has its very own, independent view on data. That may or may not result in fully owning its own data store (I think it should, but that's all a matter of scale and use case), but it certainly shall never share its own view on a shared store with others. The service's public interface(s) provide(s) the only way to manipulate its view on data.
  • It controls its own lifetime. It can do periodical tasks, spin its own threads and should not be forced to shut down because its hosting process model thinks it's idle for the sole reason that it hasn't seen inbound traffic for a while.
  • It has its own identity and carries a security responsibility. It identifies itself with a service-unique principal against other services, and through its own authorization rules it takes upon itself the responsibility that no user gains illegitimate access to backend data or services. It identifies and takes responsibility for those that invoke it, but never assumes their identity.

The PEACE tenets for SO are a composite set. Autonomy is architecturally the most far reaching of the SO tenets and it is much more about the inside and fundamental behavior of a service than about its edge.

Categories: SOA

I am back home from San Diego now. About 3 more hours of jet-lag to work on. This will be a very busy two weeks until I make a little excursion to the Pakistan Developer Conference in Karachi and then have another week to do the final preparations for TechEd Europe.

One of the three really cool talks I'll do at TechEd Europe is called "Building Proseware" and explains the scenario, architecture, and core implementation techniques of Proseware, an industrial-strength, robust, service-oriented example application that newtelligence has designed and implemented for Microsoft over the past 2 months.

The second talk is one that I have been looking forward to for a long time: Rafal Lukawiecki and myself are going to co-present a session. And if that weren't enough: The moderator of our little on-stage banter about services is nobody else than Pat Helland.

And lastly, I'll likely sign-off on the first public version of the FABRIQ later this week (we had been waiting for WSE 2.0 to come out), which means that Arvindra Sehmi and myself can not only repeat our FABRIQ talk in Amsterdam but have shipping bits to show this time. There will even be a hands-on lab on FABRIQ led by newtelligence instructors Achim Oellers and Jörg Freiberger. The plan is to launch the bits before the show, so watch this space for "when and where".

Overall, and as much as I like meeting all my friends in the U.S. and appreciate the efforts of the TechEd team over there, I think that for the last 4 years TechEd Europe consistently has been and will be again the better of the two TechEd events from a developer perspective. In Europe, we have TechEd and IT Forum, whereby TechEd is more developer focused and IT Forum is for the operations side of the house. Hence, TechEd Europe can go and does go a lot deeper into developer topics than TechEd US.

There's a lot of work ahead so don't be surprised if the blog falls silent again until I unleash the information avalanche on Proseware and FABRIQ.

Categories: Architecture | SOA | Talks | TechEd Europe | TechEd US | FABRIQ

All the wonderful loose coupling on the service boundary doesn't help you one bit if you tightly couple a set of services on a common store. The temptation is just too big that some developer will go and make a database join across the "data domains" of services, causing a co-location dependency of data and schema dependencies between services. If you share data stores, you break the autonomy rule and you simply don't have a service.

Separating out data stores means at least that every service has its own "tablespace" or "database" and that in-store joins between those stores are absolutely forbidden. If you have a service managing customers and a service managing invoices, the invoice service must go through the service front for anything that has to do with customer data.

If you want to do reporting across data owned by several services, you must have a reporting service that pulls the data through service interfaces, consolidates it and creates the reports from there.
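
To make the "reporting through service interfaces" idea concrete, here is a minimal sketch. This is purely illustrative Python (the real thing in this post's world would be WSDL-fronted services); the class names and data are invented stand-ins for real service fronts:

```python
# Hypothetical sketch: a reporting service that consolidates data through
# service interfaces instead of joining across stores. All names and data
# here are invented for illustration.

class CustomerService:
    """Owns the customer store; nobody else touches its tables."""
    def __init__(self):
        self._customers = {1: {"id": 1, "name": "Contoso"},
                           2: {"id": 2, "name": "Fabrikam"}}

    def get_customer(self, customer_id):
        return self._customers[customer_id]

class InvoiceService:
    """Owns the invoice store; knows customers only by their identifier."""
    def __init__(self):
        self._invoices = [{"customer_id": 1, "amount": 120.0},
                          {"customer_id": 1, "amount": 80.0},
                          {"customer_id": 2, "amount": 45.0}]

    def get_invoices(self):
        return list(self._invoices)

class ReportingService:
    """Joins data at the service layer, never inside a shared store."""
    def __init__(self, customers, invoices):
        self._customers = customers
        self._invoices = invoices

    def revenue_by_customer(self):
        totals = {}
        for inv in self._invoices.get_invoices():
            cid = inv["customer_id"]
            totals[cid] = totals.get(cid, 0.0) + inv["amount"]
        # The "join" happens here, through the customer service's public face.
        return {self._customers.get_customer(cid)["name"]: total
                for cid, total in totals.items()}
```

The point is that the correlation between customers and invoices happens in the reporting service's own code, against the public interfaces, never as a SQL join across the two stores.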

Will this all be a bit slower than coupling in the store? Sure. It will make your architecture infinitely more agile, though, and allows you to implement a lot of clustering scalability patterns. In that way, autonomy is not about making everything a Porsche 911; it's about making the roads wider so that nobody (including the Porsche) ends up in a traffic jam all the time. It's also about paving roads that not only get you from A to B in one stretch, but also have something useful called "exits" that let you get on or off that road at any place between those two points.

If you decide to throw out your own customer service and replace it with a wrapper around Siebel, your invoice service will never learn about that change. If the invoice service were reaching over into co-located tables owned by the (former) customer service, you'd have a lot of work to do to untangle things. You don't need that untangling and all that complication. As an architect, you should keep things separate from the start and make it insanely difficult for developers to break those rules. Having different databases and, better yet, scattering them over several machines, at least at development time, makes it hard enough to keep the discipline.

Categories: SOA

May 19, 2004
@ 07:56 AM

The four fundamental transaction principles are nicely grouped into the acronym "ACID" that's simple to remember, and so I was looking for something that does the same for the SOA tenets and that sort of represents what the service idea has done to the distributed platform wars:

  • Policy-Based Behavior Negotiation
  • Explicitness of Boundaries
  • Autonomy
  • Contract
Categories: Architecture | SOA

May 16, 2004
@ 11:58 AM

Ralf Westphal responded to this and there are really just two sentences that I’d like to pick out from Ralf’s response, because that allows me to go quite a bit deeper into the data services idea and might help to further clarify what I understand as a service-oriented approach to data and resource management. Ralf says: “There is no necessity to put data access into a service and deploy it pretty far away from its clients. Sometimes it might make sense, sometimes it doesn’t.”

I like patterns that eliminate that sort of doubt and which allow one to say “data services always make sense”.

Co-locating data acquisition and storage with business rules inside a service makes absolute sense if all accessed data can be assumed to be co-located on the same store and has similar characteristics with regards to the timely accuracy the data must have. In all other cases, it’s very beneficial to move data access into a separate, autonomous data service and, as I’ll explain here, the design can be made so flexible that the data service consumer won’t even notice radical architectural changes to how data is stored. I will show three quite large scenarios to help illustrate what I mean: a federated warehouse system, a partitioned customer data storage system and a master/replica catalog system.

The central question that I want to answer is: Why would you want to delegate data acquisition and storage to dedicated services? The short answer is: Because data doesn’t always live in a single place and not all data is alike.

Here’s the long answer:

The Warehouse

The Warehouse Inventory Service (WIS) holds data about all the goods/items that are stored in the warehouse. It’s a data service in the sense that it manages the records (quantity in stock, reorder levels, items on back order) for the individual goods and performs some simplistic accounting-like work to allocate pools of items to orders, but it doesn’t really contain any sophisticated business rules. The services implementing the supply order process and the order fulfillment process for customer orders implement such business rules – the warehouse service just keeps data records.

The public interface [“Explicit Boundary” SOA tenet] for this service is governed by one (or a set of) WSDL portType(s), which define(s) a set of actions and message exchanges that the service implements and understands [“Shared Contract” SOA tenet]. Complementary is a deployment-dependent policy definition for the service, which describes several assertions about the Security and QoS requirements the service makes [“Policy” SOA tenet].

The WIS controls its own, isolated store over which it has exclusive control and the only way that others can get at the content of that data store is through actions available on the public interface of the service [“Autonomy” SOA tenet].

Now let’s say the company running the system is a bit bigger and has a central website (of which replicas might be hosted in several locations) and has multiple warehouses from where items can be delivered. So now, we are putting a total of four instances of WIS into our data centers at the warehouses in New Jersey, Houston, Atlanta and Seattle. The services need to live there, because only the people on site can effectively manage the “shelf/database relationship”. So how does that impact the order fulfillment system that used to talk to the “simple” WIS? It doesn’t, because we can build a dispatcher service implementing the very same portType that accepts order information, looks at the order’s shipping address and routes the allocation requests to the warehouse closest to the shipping destination. In fact, the formerly “dumb” WIS can now be outfitted with some more sophisticated rules that allow splitting or shifting the allocation of items to orders across or between warehouses to limit freight cost or to ensure the earliest possible delivery in case the preferred warehouse is out of stock for a certain item. Still, from the perspective of the service consumer, the WIS is just a data service. All that additional complexity is hidden in the underlying “service tree”.
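
The dispatcher idea fits in a few lines. This is a hypothetical Python sketch; the warehouse locations and the distance rule are made up, and a real WIS would of course expose a WSDL portType rather than a Python class:

```python
# Illustrative sketch: a dispatcher that implements the same interface as a
# single Warehouse Inventory Service, but routes each allocation request to
# the warehouse closest to the shipping destination. The distance function
# is a placeholder assumption supplied by the caller.

class WarehouseInventoryService:
    """The 'dumb' data service: one instance per warehouse."""
    def __init__(self, location):
        self.location = location

    def allocate(self, order):
        # A real implementation would allocate stock records here.
        return f"allocated at {self.location}"

class DispatchingInventoryService:
    """Implements the same 'allocate' action, hiding multiple warehouses."""
    def __init__(self, warehouses, distance):
        self._warehouses = warehouses
        self._distance = distance  # callable: (warehouse_loc, ship_to) -> number

    def allocate(self, order):
        nearest = min(self._warehouses,
                      key=lambda w: self._distance(w.location, order["ship_to"]))
        return nearest.allocate(order)
```

Because the dispatcher and the individual warehouse services share one contract, the order fulfillment system never learns whether it is talking to one warehouse or four.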

While all the services implement the very same portType, their service policies may differ significantly. Authentication may require certificates for one warehouse and some other token for another warehouse. The connection to some warehouses might be done through a typically rock-solid reliable direct leased line, while another is reached through a less-than-optimal Internet tunnel, which impacts the application-level demand for the reliable messaging assurances. All these aspects are deployment specific and hence made an external deployment-time choice. That’s why WS-Policy exists.

The Customer Data Storage

This scenario for the Customer Data Storage Service (CDS) starts as simple as the Warehouse Inventory scenario and with a single service. The design principles are the same.

Now let’s assume we’re running a quite sophisticated e-commerce site where customers can customize quite a few aspects of the site, can store and reload shopping carts, make personal annotations on items, and can review their own order history. Let’s also assume that we’re pretty aggressively tracking what they look at, what their search keywords are and also what items they put into any shopping cart, so that we can show them a very personalized selection of goods that precisely matches their interest profile. Let’s say that all in all, we need about 2 MB of storage for the cumulative profile/tracking data of each customer. And we happen to have 2 million customers. Even in the Gigabyte age, about 4 million MB (4 TB) is quite a bit of data payload to manage in a read/write access database that should be reasonably responsive.

So, the solution is to partition the customer data across an array of smaller (cheaper!) machines that each holds a bucket of customer records. With that we’re also eliminating the co-location assumption.

As in the warehouse case, we are putting a dispatcher service implementing the very same CDS portType on top of the partitioned data service array and therefore hide the storage strategy re-architecture from the service consumers entirely. With this application-level partitioning strategy (and a set of auxiliary services to manage partitions that I am not discussing here), we could scale this up to 2 billion customers and still have an appropriate architecture. Mind that we can have any number of dispatcher instances as long as they implement the same rules for how to route requests to partitions. Strategies for this are a direct partition reference in the customer identifier or a locator service sitting on a customer/machine lookup dictionary.
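
A rough sketch of the “direct partition reference” strategy, again in hypothetical Python and with an invented hashing rule standing in for whatever partition scheme the real system would use:

```python
# Sketch of application-level partitioning: the dispatcher implements the
# same contract as a single Customer Data Storage service, but derives the
# partition directly from the customer identifier. The modulo-hash rule is
# an illustrative assumption, not a prescribed scheme.

class CustomerPartition:
    """One bucket of customer records on one (cheap) machine."""
    def __init__(self):
        self._records = {}

    def load(self, customer_id):
        return self._records.get(customer_id)

    def store(self, customer_id, profile):
        self._records[customer_id] = profile

class PartitionDispatcher:
    """Same portType as a single CDS; routes by a stable partition rule."""
    def __init__(self, partitions):
        self._partitions = partitions

    def _pick(self, customer_id):
        # Every dispatcher instance must apply the identical rule,
        # or requests land on the wrong machine.
        return self._partitions[hash(customer_id) % len(self._partitions)]

    def load(self, customer_id):
        return self._pick(customer_id).load(customer_id)

    def store(self, customer_id, profile):
        self._pick(customer_id).store(customer_id, profile)
```

Adding capacity then means adding partitions and adjusting the routing rule, while the consumer keeps talking to the same contract.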

Now you might say “my database engine does this for me”. Yes, so-called “shared-nothing” clustering techniques have existed at the database level for a while now, but the following addition to the scenario mandates putting more logic into the dispatching and allocation service than – for instance – SQL Server’s “distributed partitioned views” are ready to deal with.

What I am adding to the picture is the European Union’s Data Privacy Directive. Very much simplified: under the EU directives and regulations, it is illegal to permanently store personal data of EU citizens outside EU territory, unless the storage operator and the legislation governing the operator comply with the respective “Safe Harbor” regulations spelled out in these EU rules.

So let’s say we’re a tiny little bit evil and want to treat EU data according to EU rules, but be more “relaxed” about data privacy for the rest of the world. Hence, we permanently store all EU customer data in a data center near Dublin, Ireland and the data for the rest of the world in a data center in Dallas, TX (not making any implications here).

In that case, we’re adding yet another service on top of the unaltered partitioning architecture that implements the same CDS contract and which internally implements the appropriate data routing and service access rules. Those rules will most likely be based on some location code embedded in the customer identifier (“E1223344” vs. “U1223344”). Based on these rules, requests are dispatched to the right data center. To improve performance and avoid having data travel along the complete path repeatedly or in small chunks during an interactive session with the customer (the customer is logged into the web site), the dispatcher service might choose to keep a temporary, non-permanent cache of customer data that is filled with a single request and allows quicker repeat access to customer data. Changes to the customer’s data that result from the interactive session can later be replicated out to the remote permanent storage.
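
As a sketch (Python for brevity; the “E”/“U” prefixes follow the example above, and the write-through cache policy is deliberately naive):

```python
# Illustrative sketch of the location-based routing layer, again behind the
# same CDS contract. RegionalStore is an invented stand-in for the permanent
# stores near Dublin and in Dallas.

class RegionalStore:
    """Permanent storage in one data center."""
    def __init__(self):
        self._records = {}

    def load(self, customer_id):
        return self._records.get(customer_id)

    def store(self, customer_id, profile):
        self._records[customer_id] = profile

class RegionRouter:
    """Same CDS contract; routes by the location code in the identifier."""
    def __init__(self, eu_service, row_service):
        self._eu = eu_service      # permanent store near Dublin
        self._row = row_service    # permanent store in Dallas (rest of world)
        self._cache = {}           # temporary, session-scoped cache

    def _backend(self, customer_id):
        return self._eu if customer_id.startswith("E") else self._row

    def load(self, customer_id):
        # Fill the cache with a single request; serve repeats locally.
        if customer_id not in self._cache:
            self._cache[customer_id] = self._backend(customer_id).load(customer_id)
        return self._cache[customer_id]

    def store(self, customer_id, profile):
        self._cache[customer_id] = profile
        # Here we write through immediately; the text's variant would
        # replicate the change out to permanent storage later.
        self._backend(customer_id).store(customer_id, profile)
```
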

Again, the service consumer doesn’t really need to know about these massive architectural changes in the underlying data services tree. It only talks to a service that understands a well-known contract.

The Catalog System

Same picture to boot with and the same rules: here we have a simple service fronting a catalog database. If you have millions of catalog items with customer reviews, pictures, audio and/or video clips, you might choose to partition this just like we did with the customer data.

If you have different catalogs depending on the markets you are selling into (for instance German-language books for Austria, Switzerland and Germany), you might want to partition by location just as in the warehouse scenario.

One thing that’s very special about catalog data is that much of it rarely ever changes. Reviews are added, media might be added, but except for corrections, the title, author, ISBN and content summary for a book really don’t ever change as long as the book is kept in the catalog. Such data is essentially “insert once, change never”. It’s read-only for all practical purposes.

What’s wonderful about read-only data is that you can replicate it, cache it, move it close to the consumer and pre-index it. You’re expecting that a lot of people will search for items with “Christmas” in the item description come November? Instead of running a full text search every time, run that query once, save the result set in an extra table and have the stored procedure running the “QueryByItemDescription” activity simply return the entire table if it sees that keyword. Read-only data is optimization heaven.
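
A toy illustration of that optimization (invented catalog rows; in the real system this would be a stored procedure and an extra table rather than Python dictionaries):

```python
# Sketch of the "run the query once, save the result set" idea for
# read-only catalog data. Catalog rows and the hot-keyword table are
# invented for illustration.

CATALOG = [
    {"id": 1, "description": "Christmas carols songbook"},
    {"id": 2, "description": "Gardening for beginners"},
    {"id": 3, "description": "Christmas cookie recipes"},
]

# Precomputed once (say, in November), not on every request.
PRECOMPUTED = {
    "christmas": [item for item in CATALOG
                  if "christmas" in item["description"].lower()]
}

def query_by_item_description(keyword):
    """Serve hot keywords from the saved result set; fall back to a scan."""
    key = keyword.lower()
    if key in PRECOMPUTED:
        return PRECOMPUTED[key]
    # Stand-in for the full-text search that runs for everything else.
    return [item for item in CATALOG if key in item["description"].lower()]
```
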

Also, for catalog data, timeliness is not a great concern. If a customer review or a correction isn’t immediately reflected on the presentation surface, but only 30 minutes or 3 hours after it has been added to the master catalog, it doesn’t do any harm as long as the individual adding such information is sufficiently informed of such a delay.

So what we can do with the catalog is to periodically (every few hours or even just twice a week) consolidate, pre-index and then propagate the master catalog data to distributed read-only replicas. The data services fronting the replicas will satisfy all read operations from the local store and will delegate all write operations directly (passthrough) to the master catalog service. They might choose to update their local replica to reflect those changes immediately, but that would bypass editorial or validation rules that might be enforced by the master catalog service.
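
A small sketch of such a replica front (illustrative Python; `refresh` stands in for the periodic consolidation/propagation job):

```python
# Sketch of a read-only replica front: reads are satisfied from the local
# replica, writes pass straight through to the master catalog service.
# Names and the refresh mechanism are illustrative assumptions.

class MasterCatalog:
    def __init__(self):
        self._items = {}

    def read(self, item_id):
        return self._items.get(item_id)

    def write(self, item_id, record):
        # Editorial/validation rules would run here before accepting a change.
        self._items[item_id] = record

    def snapshot(self):
        return dict(self._items)

class ReplicaCatalogService:
    """Same contract as the master; local reads, passthrough writes."""
    def __init__(self, master):
        self._master = master
        self._replica = master.snapshot()

    def read(self, item_id):
        return self._replica.get(item_id)    # served locally, possibly stale

    def write(self, item_id, record):
        self._master.write(item_id, record)  # not visible here until refresh

    def refresh(self):
        # Periodic consolidation/propagation from the master catalog.
        self._replica = self._master.snapshot()
```

The consumer sees one catalog contract throughout; that a write takes a detour through the master and only shows up locally after the next refresh is exactly the acceptable staleness discussed above.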


So there you have it. What I’ve described here is the net effect of sticking to SOA rules.

  • Shared Contract: Any number of services can implement the same contract (although their concrete implementation, purpose and hence their type differ). Layering contract-compatible services with gradually increasing levels of abstraction and refining rules over existing services creates very clear and simple designs that help you scale and distribute data very well.
  • Explicit Boundaries: Forbidding foreign access to, or even knowledge about, service internals allows radical changes inside and “underneath” services.
  • Autonomy: Allows for data partitioning and data access optimization and avoids “tight coupling in the backend”.
  • Policy: Separating policy from the service/message contract allows flexible deployment of compatible services across a variety of security and trust scenarios, and also allows dynamic adaptation to “far” or “near” communication paths by mandating certain QoS properties such as reliable messaging.


Service-Orientation is most useful if you don’t consider it as just another technique or tool, but embrace it as a paradigm. And very little of this thinking has to do with SOAP or XML. SOAP and XML are indeed just tools.

Categories: Architecture | SOA