A question that is raised quite often in the context of “SOA” is that of how to deal with data. Specifically, people are increasingly interested in (and concerned about) appropriate caching strategies. What I see described in that context is often motivated by a fundamental misunderstanding: the SO tenet that speaks about “autonomy” is perceived to mean “autonomous computing” while it really means “avoid coupling”. The former is an architecture prescription; the latter is just a statement about the quality of a network edge.

I will admit that the use of “autonomy” confused me for a while as well. Specifically, in my 5/2004 “Data Services” post, I showed principles of autonomous computing and how there is a benefit to loose coupling at the network edge when combined with autonomous computing principles, but at the time I did not yet fully understand how orthogonal those two things really are. I guess that one of the aspects of blogging is that you’ve got to be ready to learn and evolve your knowledge in front of everybody. Mind that I stand by the architectural patterns and the notion of data services that I explained in that post, except for the notion that the “Autonomy” SO tenet speaks about autonomous computing.

The picture here illustrates the difference. By autonomous computing principles, the left shape of the service is “correct”. The service is fully autonomous and protects its state. That’s a model that strictly follows the Fiefdoms/Emissaries idea that Pat Helland formulated a few years back. Very many applications look like the shape on the right: a number of services sticking up that share a common backend store. That’s not following autonomous computing principles. However, if you look across the top, you’ll see that the endpoints (different colors, different contracts) look precisely alike from the outside for both pillars. That’s the split: autonomous computing talks very much about how things are supposed to look behind your service boundary (which is not and should not be anyone’s business but yours), while service orientation talks about being able to hide any such architectural decision behind a loosely coupled network edge. The two ideas compose well, but they are not the same, at all.

Which leads me to the greater story: in terms of software architecture, “SOA” introduces very little new. All distributed systems patterns that have evolved since the 1960s remain true. I haven’t really seen any revolutionary new architecture pattern come out since we started speaking about Web Services. Brokers, Intermediaries, Federators, Pub/Sub, Queuing, STP, Conversations – all of that has been known for a long time. We’ve just commonly discovered that loose coupling is a quality that’s worth something.

In all reality, the “SOA” hype is about the notion of aligning business functions with software in order to streamline integration. SOA doesn’t talk about software architecture; in other words, it does not talk about how to shape the mechanics of a system. From a software architecture perspective, any notion of an “SOA revolution” is complete hogwash. From a Business/IT convergence perspective – to drive analysis and high-level design – there’s meat in the whole story, but I see the SOA term being used mostly for describing technology pieces. “We are building an SOA” really means “we are building a distributed system and we’re trying to make all parts loosely coupled to the best of our abilities”. Whether that distributed system is indeed aligned with the business functions is a wholly different story.

However, I digress. Coming back to the data management issue, it’s clear that a strict autonomous computing design introduces quite a few challenges in terms of data management. Data consolidation across separate stores for the purposes of reporting requires quite a bit of special consideration, and so does caching of data. When the data for a system is dispersed across a variety of stores, comes together only through service channels without the ability to freely query across the data stores, and those services are potentially “far” away in terms of bandwidth and latency, data management becomes considerably more difficult than in a monolithic app with a single store. However, this added complexity is a function of choosing to make the service architecture follow autonomous computing principles, not of how you shape the service edge or whether you use service orientation principles to implement it.

To be clear: I continue to believe that aligning data storage with services is a good thing. It is an additional strategy for looser coupling between services and allows the sort of data patterns and flexibility that I have explained in the post I linked to above. However, “your mileage may vary” is as true here as anywhere. For some scenarios, tightly coupling services in the backyard might be the right thing to do. That’s especially true for “service-enabling” existing applications. All these architectural considerations are, however, strictly orthogonal to the tenets of SO.

Generally, my advice with respect to data management in distributed systems is to handle all data explicitly as part of the application code and not to hide data management in some obscure interception layer. There are a lot of approaches that attempt to hide complex caching scenarios from application programmers by introducing caching magic on the call/message path. That is a reasonable thing to do if the goal is to optimize message traffic and the granularity it gives you is acceptable; I had a scenario where that was just the right fit in one of my last newtelligence projects. Be that as it may, proper data management, caching included, is somewhat like the holy grail of distributed computing, and unless people know what they’re doing, it’s dangerous to try to hide it away.
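To make the “explicit” part concrete, here is a minimal sketch of what I mean; all of the names (fetch_rate_from_service, get_rate, the rates cache) are made up for illustration. The point is that the staleness trade-off is visible and tunable at the call site instead of being buried in an interception layer:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical in-process cache: currency -> (rate, time it was fetched).
_rates_cache: dict[str, tuple[float, datetime]] = {}

def fetch_rate_from_service(currency: str) -> float:
    # Stub standing in for the actual remote call.
    return 1.0

def get_rate(currency: str, max_age: timedelta) -> float:
    """The caller states explicitly how much staleness it can tolerate."""
    now = datetime.now(timezone.utc)
    cached = _rates_cache.get(currency)
    if cached is not None and now - cached[1] < max_age:
        return cached[0]  # good enough for this caller's stated purpose
    rate = fetch_rate_from_service(currency)
    _rates_cache[currency] = (rate, now)
    return rate

# Different callers make different trade-offs, visibly: display code may
# accept an hour-old rate where settlement code would not.
rate = get_rate("EUR", max_age=timedelta(hours=1))
```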

That said, I believe that it is worth a thought to make caching a first-class consideration in any distributed system where data flows across boundaries. If it’s known at the data source that a particular record or set of records won’t be updated until 1200h tomorrow (many banks, for instance, still do accounting batch runs just once or twice daily), then it is helpful to flow that information alongside the data to allow any receiver to determine the caching strategy for the particular data item(s). Likewise, if it’s known that a record or record set is unlikely to change or even guaranteed not to change within an hour/day/week/month, or if some staleness of that record is typically acceptable, the caching metadata can indicate an absolute or relative time instant at which the data has to be considered stale and possibly a time instant at which it absolutely expires and must be cleaned from any cache. Adding caching hints to each record or set of records allows clients to make much better-informed decisions about how to deal with that data. This is ultimately about loose coupling and giving every participant of a distributed system enough information to make their own decisions about how to deal with things.
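To sketch what such caching hints could look like, here is one possible shape; CacheHint, Record, and the batch schedule are illustrative assumptions, not a prescribed format. The source stamps each record with a “stale at” and an “expires at” instant, and the receiver acts on them:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Optional

@dataclass
class CacheHint:
    stale_at: datetime    # after this instant, treat the data as stale
    expires_at: datetime  # after this instant, evict it from any cache

@dataclass
class Record:
    payload: Any
    hint: CacheHint

def banking_batch_hint() -> CacheHint:
    """The source knows its own update schedule - say, a daily batch run at
    1200h - so it can promise the data will not change before then."""
    now = datetime.now(timezone.utc)
    next_batch = now.replace(hour=12, minute=0, second=0, microsecond=0)
    if next_batch <= now:
        next_batch += timedelta(days=1)
    return CacheHint(stale_at=next_batch,
                     expires_at=next_batch + timedelta(days=1))

class HintAwareCache:
    def __init__(self) -> None:
        self._items: dict[str, Record] = {}

    def put(self, key: str, record: Record) -> None:
        self._items[key] = record

    def get(self, key: str) -> Optional[Record]:
        record = self._items.get(key)
        if record is None:
            return None
        now = datetime.now(timezone.utc)
        if now >= record.hint.expires_at:
            del self._items[key]  # hard expiry: must not be served anymore
            return None
        if now >= record.hint.stale_at:
            return None           # stale: caller should re-fetch from the source
        return record
```

The receiver never needs to know why the data is stable until noon; the hint alone is enough to decide, which is exactly the loose coupling point.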

Which leaves the question of where to cache stuff. The instant “obvious best idea” is to hold stuff in memory. However, if the queries into the cached data become more complex than “select all” or reasonably simple hashtable lookups, it’s not too unlikely that, if you run on Windows, a local SQL Server (-Express) instance holding the cache data will perform as well as or better than a custom query “engine” – increasingly so with growing data volume – even if that engine serves data out of memory. That’s especially true for caching frameworks that can be written within the time/budget of a typical enterprise project. Besides, long-lived cached data whose expiration window exceeds the lifetime of the application instance needs a home, too. One of the bad caching scenarios is that the network gets saturated at 8 in the morning when everybody starts up their smart client apps and tries to suck the central database dry at once – that’s exactly what purely in-memory caching approaches cause, because every client has to repopulate its cache from scratch on every start.
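Here is a sketch of that “local database as cache” shape; I’m using SQLite only to keep the example self-contained (the argument above is about a local SQL Server Express instance), and the table and column names are made up. The cache is a real table on disk: it survives an application restart, so the 8-o’clock stampede never happens, and it answers real queries, not just key lookups:

```python
import sqlite3

# On-disk database: the cached data outlives the process.
conn = sqlite3.connect("local_cache.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS cached_orders (
        order_id   INTEGER PRIMARY KEY,
        customer   TEXT NOT NULL,
        total      REAL NOT NULL,
        expires_at TEXT NOT NULL  -- ISO-8601, taken from the record's cache hint
    )""")
conn.execute("CREATE INDEX IF NOT EXISTS ix_customer ON cached_orders(customer)")
conn.commit()

def query_cached_orders(customer: str, now_iso: str) -> list:
    # A query like this is awkward against an in-memory hashtable but
    # trivial against a local store - and it never touches the network.
    return conn.execute(
        "SELECT order_id, total FROM cached_orders "
        "WHERE customer = ? AND expires_at > ? "
        "ORDER BY total DESC",
        (customer, now_iso)).fetchall()
```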
