The FABRIQ
Arvindra Sehmi, “Senior Architect” at Microsoft EMEA, is one of the most brilliant architects I know. He also happens to be the project manager and “owner” of the project I am currently working on as lead architect (I’ve hinted at it here and here), and he has finally allowed me to say a bit more about what we’re up to.
The goal of this project, code-named “FABRIQ”, is to create a special-purpose, high-performance, service-oriented, one-way messaging infrastructure for queuing networks, agents and agile computing. It’s not a Microsoft product. It’s an architecture blue-print backed by code that we write so that customers don’t need to – at least that’s the plan.
In case that doesn’t tell you anything, I’ll try to give you a little bit of an idea (It’s long, but it’s hopefully worth it):
In financial trading and many other scenarios, in finance but also in other business areas, the activities performed and the information produced by systems aren’t all black and white and ones and zeroes all the time.
Let’s say you want to sell securities for $500,000.00. To do that, you send a job to your investment manager, who’ll forward it to a broker. The broker does what he needs to do to find a buyer and eventually finds one. Now, as it happens, the buyer can (by indirection) close a contract with you right away, but that doesn’t mean that you get the money right away. There are so-called “settlement periods”, which exist for various practical reasons, mostly related to people doing some work. Let’s assume the settlement period is three (3) days.
Within those three days, you are required to deliver the goods and the buyer is required to pass you the bucks. Until that happens, there are a lot of things happening behind the scenes to get the money from there to here and the goods from here to there. Many parties are involved. It’s a horrible process. All you and everyone in between have is a more or less strong belief that the trade and all intermediate steps will turn out well, but all these steps only happen one by one during the settlement period and it all gets finalized only at the very last possible second. While that happens, the markets keep buzzing on. Now, based on some risk assessment that you need to make for yourself (and everyone else makes), you (and they) can act on that belief and go to some other party and buy something for $300,000 and speculate on settling this new deal with the money you will eventually have in hand within the next three days. Everyone else does the same. So, in essence, the whole back office handling lags the market by some three days here, because the processes and procedures aren’t exactly trivial. And of course everyone wants to hold on to their money for as long as possible. Of course, big and little things may break in the process. Things don’t go as they were believed to go. Commitments from others about future actions may change and then you need to act on those changes while the process is going on to meet or correct your commitments to others.
That’s why systems – like people – in this world act much more on trust and belief than on hard facts. If it were only black and white, you’d only proceed with a subsequent action if you were sure that the previous transaction was done with. But, hey, stock markets are the biggest casinos available to man and hence the systems must be built to have a bit of that gambler nature in them.
Honestly, I don’t claim to have more than just a tiny little bit of understanding of how that all works, because my experience with financial systems is mostly in areas where things take days and weeks to finalize and years and years to settle (loans and collections). But when I refreshed my memory on all that before going into this project, it struck me that this is not dissimilar from other businesses. And that may be a bit easier to understand than all these very specialized and rather abstract finance terms:
Take my girlfriend Patricia’s business as an example: Patricia’s company, GVA Krefeld GmbH, is the world’s premier supplier of slagpots for the steel industry. When steel is produced from molten ore and/or scrap, all the non-metal “garbage” floats on top and is separated out. That stuff (the “slag”, which is as hot as the molten steel) needs to be removed and transported away from the oven/furnace. The containers (or “pots”), which weigh between 1 and 55 metric tons and are used to haul the “hot as molten steel” slag around to the slag dumps, are made from – bingo! – steel. The special design expertise and knowledge of handling practices are what make GVA so unique in their business. If you don’t have the right gear, it’d all instantly melt into a dirty chunk of glowing scrap.
Because every steel plant has its very own requirements, GVA produces only custom slagpots in the exact numbers the customers need. It turns out that this process is not much less complicated than what I described above. If a customer makes an inquiry asking for a quotation, GVA needs to look at its stock designs to see whether and how much customization is needed; while that happens, they need to get quotations from their foundries, from the casting pattern manufacturers, oversized-street-cargo companies, sea-cargo companies and others. Sometimes they get all answers in time; sometimes they need to make a bit of a gamble and quote a price although they don’t know for sure, just to make the customer’s quotation deadline. Once they’ve got the contract and the slagpots are scheduled for production, everyone involved is acting on the belief that everything will work out fine. Patterns will be built, the casting areas will be reserved, trucks will be allocated, space on ships will be reserved and the customer makes plans to put the slagpot into production by a specific day. Nothing has actually happened yet; it’s just that the entire supply chain is set up based on the belief that things will go as planned. But almost inevitably, something (even if small) won’t work. If that happens, you need to manage. At GVA, a lot of that management is done today by keeping track of all these activities through “manual” controlling with check-lists and per-activity databases. That’s manageable only because the number of concurrent projects is in the hundreds, they’ve got a lot of experience dealing with such things and a project takes about 3-6 months to complete.
But what if you’ve got a couple of thousand of new “projects” queuing up each hour and the time to deal with all of this is getting closer and closer to zero?!? That’s about analogous to the financial trading scenarios where the long term goal is to bring down the settlement periods to zero, allowing real-time settlements of trades in order to limit and eventually eliminate the risk exposure of all involved parties.
Automating that sort of management is what agents and agile machines can help with, because they can be built to know how to deal with beliefs and unknowns while making weak or strong commitments about foreseeable future actions to others. That theory is really interesting and what’s even better is that quite a few companies actually use it in production today.
Now – what is the most appropriate infrastructure solution to implement a system that has these capabilities?
Before I go into that, let’s pick out and derive some observations and requirements from what I’ve been saying above:
· Activities are triggered by messages. Messages contain orders, acknowledgements, commitments, notifications and other things. Depending on time of day, mood of the market or the mood of Alan Greenspan you may be getting anything from close to nothing to bursts of thousands of messages a second, which might overtax your system, whose capacity is calculated based on an average “very good day for business” scenario to keep costs at bay. So queuing inbound messages at the gate and working on them on your own time seems mandatory. If messages are arriving on a queue, that’s good; if they arrive by “push” (HTTP Web Services), a push/pull translation through a queue is the right thing to do.
· Activities may take a while. Nobody in a business with any sense of organization goes somewhere, drops a paper on someone’s desk and waits patiently for them to complete the work on your stuff after they’ve completed whatever they’re doing. (Yes, branches of government tend to do that; see the first half sentence). So all of the communication is asynchronous by default. Sometimes you get a reply from the party you sent a message to, sometimes you get an answer to your message from someone much further down the processing chain, and sometimes the message is just absorbed by some black information hole somewhere and you’ll never hear about it again (but if the message doesn’t arrive, you’re in trouble). So the whole idea of request/response doesn’t apply here, and if it does, it’s a rather rare exception.
· Activities might only occur once two or more messages have arrived at one given node. There’s no trade without seller and buyer. There’s no marriage ceremony without both bride and groom. There’s no football (I mean that game played with the foot) game happening until both teams consisting of at least eleven players and the referee with his two assistants have arrived and that dreaded satellite link for the live TV coverage has been restored.
· Activities might not result in absolute things. An activity may rather be a process that goes through multiple states of weaker and firmer beliefs that yield commitments with varying levels of guarantee. Once a system is “pretty sure” or “sure” about whether the commitments it made are final, the activity completes and the beliefs become facts and the commitments become results.
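To make the first and third observations a bit more tangible, here is a minimal Python sketch of a push/pull translation through a queue, feeding a node that only acts once both sides of a trade have shown up. This is not FABRIQ code: the names `inbound`, `on_push`, `JoinNode` and `drain`, as well as the `trade_id` and `role` message fields, are made up purely for illustration, and an in-process queue stands in for a real message queue.

```python
import queue
from collections import defaultdict

# Hypothetical stand-in for a real message queue at the gate.
inbound = queue.Queue(maxsize=10_000)

def on_push(message):
    """Push side (e.g. called from an HTTP handler): enqueue and return at once."""
    inbound.put_nowait(message)

class JoinNode:
    """Runs an activity only once every required role has sent a message."""

    def __init__(self, required_roles, activity):
        self.required = frozenset(required_roles)
        self.activity = activity
        self.pending = defaultdict(dict)  # trade id -> {role: message}

    def handle(self, message):
        trade_id = message["trade_id"]
        parts = self.pending[trade_id]
        parts[message["role"]] = message
        if self.required.issubset(parts):
            del self.pending[trade_id]
            self.activity(parts)  # one-way: nothing is returned to the senders

def drain(node):
    """Pull side: work through whatever is queued, on the consumer's own time."""
    while True:
        try:
            message = inbound.get_nowait()
        except queue.Empty:
            return
        node.handle(message)

# Usage: the seller's message alone triggers nothing; the buyer's completes the pair.
completed = []
node = JoinNode({"buyer", "seller"}, completed.append)
on_push({"trade_id": 1, "role": "seller", "amount": 500_000})
drain(node)
on_push({"trade_id": 1, "role": "buyer", "amount": 500_000})
drain(node)
```

After the second `drain`, `completed` holds exactly one matched trade. A real system would of course need durable queues, timeouts for messages whose counterpart never arrives, and some notion of how firm each commitment is; none of that is attempted here.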
Since I know nothing but Microsoft (I’m kidding, I’m kidding), what are the possible infrastructure choices to realize such a system on the Windows platform with Microsoft tools?
Attempt 1: The Microsoft Message Queue! Excellent choice! That’s actually what Arvindra (and colleagues) picked for his last project in that sort of use-case scenario. The resulting system clocked a benchmarked capacity of a whopping 65 million trades per day on a cluster of eight 2-way Windows 2000 server nodes. The downside: once you commit your project to be on MSMQ and only on MSMQ, you are locked into MSMQ and it’s difficult or costly to loop in external systems. And you still need to write some substantial infrastructure around MSMQ to do anything.
Attempt 2: ASP.NET Web Services with WSE! Excellent choice! It’s based on open standards, interoperates well with the world, WSE v2.0 lets you do routing and policy-driven security of all sorts, and it is reasonably fast. The downside: ASP.NET has a very strong request/response bias (it is HTTP locked); with standalone WSE, which does let you do asynchronous processing, you are basically left out in the cold without a process model; and, just as with MSMQ, a big part of the infrastructure is still up to you.
Attempt 3: BizTalk Server 2002/2004! Now we’re talking business! It has various listener ports for all sorts of protocols, has a pluggable pipeline, all sorts of integration adapters, it is XML based and does Web Services very well. The orchestration engine is a state machine with excellent designer support that would allow building agents of the described type with some solid thinking. The downside: Throwing BizTalk at this problem may be much too little for its messaging engine, because we don’t need all the sophisticated message mapping (this isn’t EAI or B2B at heart) and a little too much for its orchestration engine, because the message flow in our queuing networks is by definition not really predictable enough to make orchestration shine. It could be done, but there would have to be a lot of custom coding and we’d still have to find a clever way of integrating the agents in a way that doesn’t cause the orchestration schedules to become entirely awkward. Also, beating Arvindra’s MSMQ/C++ app’s 65 million trades per day might be possible, but you’d need a reasonably large set of 19” racks for that. I am still a fan of BizTalk, but it’s not the perfect fit as the core infrastructure for the scenario. It is a great fit for reaching out into other systems, though.
Attempt 4: Indigo! I wish. Isn’t here, yet. This is for now. (Did anybody expect me to say “Enterprise Services” and “Queued Components”? Sorry for disappointing you; that’s lumped into the MSMQ portion.)
So, the bottom line is that I didn’t find the ONE (pun intended) infrastructure to readily provide what we need, but a lot of ready-made pieces to put together. Since we’re stuck with quite a bit of coding no matter what we do, we’d rather abstract away from transports and existing infrastructures and roll something that fits our needs exactly:
· ASP.NET provides us a great HTTP stack and has programmatic support for WSDL, providing us with message contract support. The XML support (doesn’t strictly belong to ASP.NET) doesn’t hurt, either.
· Enterprise Services on Windows Server 2003/Windows XP gives us a solid process and hosting model and a very fast synchronous transport.
· MSMQ provides us fast queuing, reliable delivery and transactional en-/dequeue support.
· The CLR itself has great support for configuration, provides us with isolated application domains and allows us to pull code dynamically from a centralized server to make things a bit “grid-like”. The class libraries give us enough pre-built pieces to build a custom process model with relative ease.
· The Microsoft Web Services Enhancements WSE v2.0 has support for the more solid and the less solid WS-* specifications, including WS-Addressing, WS-Policy, WS-Security and related specs, providing us with good WS service contract support.
· Indigo can’t help us with code, but it can help us with some good ideas. Indigo’s Message class is pretty nice.
· BizTalk lets us reach out into other systems that don’t talk our 2003-style Web Services lingo.
· Shadowfax gives us a best-practice infrastructure to build SOA endpoints to live inside our queuing network or even live inside a Shadowfax endpoint.
So, what again is “FABRIQ”?
It’s
· ASP.NET/WSE/MSMQ/ES Transport with a just so faint shade of Indigo blue.
· An abstract queuing network with a lightweight-transacted, composable pipeline of primitives, custom message handlers and agents.
· A service-oriented, one-way, message-centric architecture.
· High performance (despite the use of angle brackets) as a primary design and implementation driver.
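For readers who think better in code, here is one way the idea of a composable, one-way pipeline of handlers could look in Python. Treat it as a rough sketch of the concept only: the names `pipeline`, `deliver`, `validate`, `drop_test_traffic` and `stamp` are invented for this example, and the “transaction” is merely a copy that gets thrown away on failure, which is an assumption of mine and says nothing about how FABRIQ actually does it.

```python
import copy
from typing import Callable, List, Optional

Message = dict
# A handler returns the (possibly changed) message, or None to absorb it.
Handler = Callable[[Message], Optional[Message]]

def pipeline(*handlers: Handler) -> Handler:
    """Compose handlers into one handler, so pipelines can nest in pipelines."""
    def run(message: Message) -> Optional[Message]:
        for handler in handlers:
            message = handler(message)
            if message is None:  # a handler absorbed the message
                return None
        return message
    return run

def deliver(pipe: Handler, message: Message, outbox: List[Message]) -> bool:
    """One-way delivery: the sender gets no reply, only the outbox changes.

    Handlers work on a copy, so a failure leaves the original untouched and
    forwards nothing (a crude stand-in for a lightweight transaction).
    """
    try:
        result = pipe(copy.deepcopy(message))
    except Exception:
        return False  # the caller may retry or dead-letter the original
    if result is not None:
        outbox.append(result)
    return True

# Hypothetical handlers, for illustration only.
def validate(message: Message) -> Message:
    if message["amount"] <= 0:
        raise ValueError("amount must be positive")
    return message

def drop_test_traffic(message: Message) -> Optional[Message]:
    return None if message.get("test") else message

def stamp(message: Message) -> Message:
    message["stage"] = "accepted"
    return message

intake = pipeline(validate, drop_test_traffic)
full = pipeline(intake, stamp)  # a pipeline used as a primitive of another

outbox: List[Message] = []
deliver(full, {"amount": 500_000}, outbox)
```

The point of making a pipeline just another handler is that primitives, custom handlers and whole sub-pipelines can be stacked without the caller knowing the difference.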
If you’ve really read all the way to here, you can probably tell that I am having quite a bit of fun and am quite busy at the moment. Watch this space in the next 6 months for details on how we’re doing and how we’re doing it. If you think that we’re doing something useful for a scenario you have, let me know.