Tuesday, December 8, 2009

Reliability and the Risks of Using Enterprise Middleware

If you are building systems with high requirements for performance and reliability, it is important to be careful and selective, but even more important, sparing, in your use of general-purpose middleware solutions to solve your technical problems.

There are strong, obvious arguments in favor of using proven middleware, whether commercial off-the-shelf (COTS) software or open source solutions - arguments based on time-to-market, risk mitigation, and cost leverage:

Time-to-market
In most cases, it will take much less time to evaluate, acquire, install, configure and understand a commercial product or open source solution than to build your own plumbing. This is especially important early in the project when your focus should be on understanding and solving important business problems, delivering value early, getting something working in the customer’s hands as soon as possible for feedback and validation.

Risk mitigation
Somebody has already gone down this path, taken the time to understand a complex technical problem space, made some mistakes and learned from them. The results are in front of you. You can take advantage of what they have already learned, and focus on solving your customer’s business problems, rather than risking falling into a technical black hole.

Of course, you take on a different set of risks: that the solution may not be of high quality, that you may not get adequate support (from the vendor or the community), that you may be buying into a dead end.

Cost leverage
For open source solutions, the cost argument is obvious: you can take advantage of the time and knowledge invested by the community for close to nothing.

In the case of enterprise middleware, companies like Oracle and IBM have spent an awful lot of money hiring smart people, or buying companies that were created by smart people, invested millions of dollars into R&D and millions more into their support infrastructures. You get to take advantage of all of this through comparatively modest license and support fees.

The do-it-yourself, not-invented-here arguments for building instead of buying are essentially that your company is so different and your needs so unique that most of the money and time invested by Oracle and IBM, or the code built up by an open source community, does not apply to your situation; that you need something nobody else has anticipated, nobody else has built.

I can safely say that this is almost always bullshit: naïve arguments put forward by people who might be smart, but are too intellectually lazy or inexperienced to properly understand and frame the problem, to bother to look at the choice of solutions available, to appreciate the risks and costs involved in taking a proprietary path. But, when you are pushing the limits in performance and reliability, it may actually be true.

A fascinating study on software complexity by NASA’s Office of the Chief Engineer Technical Excellence Program examines a number of factors that contribute to complexity and risk in high reliability / safety critical software systems (in this case flight systems), and success factors in delivery of these systems. One of the factors that NASA examined was the risks and benefits of using commercial off the shelf software (COTS) solutions:
Finding:
Commercial off-the-shelf (COTS) software can provide valuable and well-tested functionality, but sometimes comes bundled with additional features that are not needed and cannot easily be separated. Since the unneeded features might interact with the needed features, they must be tested too, creating extra work.

Also, COTS software sometimes embodies assumptions about the operating environment that don’t apply well to [specific] applications. If the assumptions are not apparent or well documented, they will take time to discover. This creates extra work in testing; in some cases, a lot of extra work.

Recommendation:

Make-versus-buy decisions about COTS software should include an analysis of the COTS software to: (a) determine how well the desired components or features can be separated from everything else, and (b) quantify the effect on testing complexity. In that way, projects will have a better basis for make/buy and fewer surprises.

The costs and risks of using off-the-shelf solutions can be much greater than this, especially with enterprise middleware. Enterprise solutions offer considerable promise: power and scale, configuration for different environments, extensive management capabilities, interface plug-and-play… all backed up by deep support capabilities. But you must factor in the costs and complexities of properly setting up and working with these products, and of understanding the software and its limits: how much time and money must you invest in a technology before you know whether it is a good fit, whether it fulfills its promise?

Let’s use the example of an enterprise middleware database management product: Oracle’s Real Application Clusters (RAC), a maximum-availability database cluster solution.

Disclaimer: I am not an Oracle DBA, and I am not going to argue fine technical details here. I chose RAC because of recent and extensive experience working with this product, and because it is representative of the problems that teams can have working with enterprise middleware. I could have chosen other technologies from other projects - WebLogic Suite or WebSphere Application Server, say - but I didn’t.

The promise of RAC is to solve many of the problems of managing data and ensuring reliability in high-volume, high-availability distributed systems. RAC shares and manages data across multiple servers, masks failures and provides instant failover in an active-active cluster, and allows you to scale the system horizontally, adding more servers to the cluster as needed to handle increasing demands. RAC is a powerful data management solution, involving many software layers - clustering, storage management, data management, operations management - designed to solve a set of complex problems.

In particular, one of these technical problems is maintaining cache fusion across the cluster: fusing the in-memory data on each server together into a global, cluster-wide cache so that each server node in the cluster can access information locally as it changes on any other node.
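
To make that coordination cost concrete, here is a minimal sketch in Python of the shape of the problem - not how RAC’s cache fusion actually works, just the naive version of what any cluster-wide coherent cache has to do: every local write must be propagated to every other node before readers there can be trusted to see current data.

    # Toy illustration of cluster-wide cache coherence - NOT Oracle's
    # cache fusion protocol, just the shape of the problem. Every write
    # on one node must be propagated to all peers; the messaging cost
    # grows with cluster size and write rate.

    class CacheNode:
        def __init__(self, name):
            self.name = name
            self.local_cache = {}   # this node's in-memory copy
            self.peers = []         # other nodes in the cluster
            self.messages_sent = 0  # coherence traffic counter

        def join(self, cluster):
            self.peers = [n for n in cluster if n is not self]

        def write(self, key, value):
            self.local_cache[key] = value
            # Keep the global cache coherent: tell every peer.
            for peer in self.peers:
                peer.receive_update(key, value)
                self.messages_sent += 1

        def receive_update(self, key, value):
            self.local_cache[key] = value

        def read(self, key):
            return self.local_cache.get(key)

    nodes = [CacheNode("node%d" % i) for i in range(4)]
    for n in nodes:
        n.join(nodes)
    nodes[0].write("account:42", 100.0)
    print(nodes[3].read("account:42"))   # 100.0 - visible cluster-wide
    print(nodes[0].messages_sent)        # 3 messages for a single write

In this naive scheme a cluster of N nodes pays N-1 coherence messages for every write. Cache fusion is far more sophisticated than this, but the underlying cost of keeping every node’s view of the data consistent never goes to zero, and it grows with cluster size and write rate.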

As you would expect, there are limits to the speed and scaling of cluster-wide cache fusion, especially at high transaction rates, and this power and complexity come with costs. You need to invest in infrastructure - a highly reliable, high-performance network interconnect fabric and a shared storage subsystem - and in fundamental application changes: carefully and consistently partitioning data within the database, and carefully designing your indexes, to minimize the overhead of maintaining global cache state consistency. As the number of server nodes in the cluster increases (for scaling, or for higher availability), both this overhead and the cost of managing it increase.
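
One common mitigation is application-level partitioning: route each unit of work to the node that owns its data, so that most reads and writes stay node-local and generate little cross-node cache traffic. A rough sketch of the idea in Python follows - the hash routing, key names and node count are illustrative assumptions, not anything RAC-specific.

    # Illustrative application-level partitioning: route work by key so
    # that each cluster node mostly touches its own slice of the data.
    # The partitioning function and node list are assumptions for the
    # example, not an Oracle feature.

    import zlib

    NODES = ["node0", "node1", "node2", "node3"]

    def owner_of(partition_key):
        # A stable hash maps the same key to the same node every time,
        # so that key's hot blocks stay in that node's local cache.
        h = zlib.crc32(partition_key.encode("utf-8"))
        return NODES[h % len(NODES)]

    def route(transaction):
        # In a real system this would submit the transaction to the
        # owning node's connection pool; here we just name the node.
        return owner_of(transaction["account_id"])

    print(route({"account_id": "42", "amount": 100.0}))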

RAC is difficult to set up, tune and manage in production conditions. This is to be expected - the software does a lot for you. But it is especially difficult to set up, tune and manage effectively in high-volume environments with low tolerance for variability and latency, where predictable performance under sustained load, and predictable behavior in failure situations, are required. It takes a significant investment of time to understand the trade-offs in setting up and operating RAC, and to balance reliability and integrity against performance: choosing between automated and manual management options, testing and measuring system behavior, setting up and testing failover scenarios, carefully managing and monitoring system operations. All of this will require you to invest in test and certification labs, in training for your operations staff and DBAs, and in expert consulting and additional support from Oracle.
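
Failover testing in particular pays to automate, so that it can be re-run after every change. The skeleton below shows the general shape of such a test - kill a node, then measure how long until the cluster serves queries again. The kill_node() and query_ok() helpers are hypothetical hooks you would wire to your own lab tooling; nothing here is Oracle-specific.

    # Skeleton of an automated failover test: kill one node, then
    # measure how long the cluster takes to serve queries again.
    # kill_node() and query_ok() are hypothetical stand-ins for your
    # own lab's tooling.

    import time

    def kill_node(name):
        raise NotImplementedError("wire to your lab's VM/power control")

    def query_ok():
        raise NotImplementedError("wire to a test query against the cluster")

    def measure_failover(node, timeout_s=120.0, poll_s=0.5):
        kill_node(node)
        start = time.monotonic()
        while time.monotonic() - start < timeout_s:
            try:
                if query_ok():
                    return time.monotonic() - start  # seconds to recover
            except Exception:
                pass                                 # still failing over
            time.sleep(poll_s)
        raise RuntimeError("no failover within %.0f seconds" % timeout_s)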

To effectively work with enterprise technology like this, at or close to the limits of its design capabilities, you need to understand it in depth: this understanding comes from months of testing and tuning your system, working through support issues and fixing problems in the software, modifying your application and re-testing. The result is like a race car engine: highly optimized and efficient, running hot and fast, highly sensitive to change. Upgrades to your application or to the Oracle software must be reviewed carefully and extensively tested, including planning and testing rollback scenarios: you must be prepared to manage the very real risk that a software upgrade will change the behavior of the database engine, the cluster manager, the operations manager or some other layer, impacting the reliability or performance of the system.

Clearly one of the major risks of working with enterprise software is that it is difficult, if not impossible, to learn enough about the costs and limits of this technology early enough in the project - especially if you are pushing those limits. Hiring experienced specialists, bringing in expert consultants, investing in training, testing in the lab: all of this might not be enough. While you can get up and running much faster and cheaper than if you tried to solve so many technical problems yourself from the start, you face the risk that you may not understand the technology well enough - its design points and real limits - to make the necessary trade-offs, or to know whether those trade-offs will be acceptable to you or your customers. The danger is that you become over-invested in the solution, that you run out of time or resources to explore alternatives, that you give yourself no choice.

You are making a big bet when working with enterprise products. The alternative is to avoid making big bets, avoid having to find big solutions to big problems. Break your problems down, and find narrow, specific answers to these smaller, well-bounded problems. Look for lightweight, single-purpose solutions, and design the simplest possible solution to the problem if you have to build it yourself. Spread the risks out, attack your problems iteratively and incrementally.
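
As a deliberately small, hypothetical illustration of the principle: if all a component really needs is durable hand-off of records across restarts, a few dozen lines of append-only journaling may be a better-bounded bet than adopting a large general-purpose messaging or clustering product. The sketch below is one such narrow, single-purpose solution.

    # A minimal single-purpose durable journal: append records, replay
    # them on restart. A narrow alternative to a general-purpose
    # product when this is all the problem actually requires.

    import json, os

    class Journal:
        def __init__(self, path):
            self.path = path
            self.f = open(path, "a", encoding="utf-8")

        def append(self, record):
            self.f.write(json.dumps(record) + "\n")
            self.f.flush()
            os.fsync(self.f.fileno())  # durable before we acknowledge

        def replay(self):
            with open(self.path, "r", encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)

    j = Journal("orders.journal")
    j.append({"id": 1, "side": "buy", "qty": 100})
    print(list(j.replay()))

The point is not this particular journal: it is that a problem this well-bounded can be solved, understood and tested completely, without betting the project on a product whose limits you will only discover months later.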

In order to do this you need to understand the problem well - but whether you break the problem down or try to solve it with an enterprise product, you can’t avoid the need to understand the problem. Look carefully at the options available, at open source and commercial products, and look for the smallest, simplest approach that fits. Don’t over-specify or design yourself into a corner. Don’t force yourself to over-commit. And think twice, or three or four times, before looking at an enterprise solution as the answer.
