Release It! Design and Deploy Production-Ready Software
A single dramatic software failure can cost a company millions of dollars, but it can be avoided with simple changes to design and architecture.

They had to rush from their original gate to the reallocated gate.

They also measure customer complaints sent to the FAA about an airline. Eight hours after the outage started, Tom (not his real name), our account representative, called me to come down for a post-mortem. Because the failure occurred so soon after the database failover and maintenance, suspicion naturally condensed around that action. In fact, when Tom called me, he asked me to fly there to find out why the database failover caused this outage. Once I was airborne, I started reviewing the problem ticket and preliminary incident report on my laptop.

Had the database failover caused the outage? If not, what did? Of course, my presence there also served to demonstrate to the client that we were serious about responding to this outage. Not to mention, my investigation should also allay any fears about the local team whitewashing the incident.

They would never do such a thing, of course, but managing perception after a major incident can be just as important as managing the incident itself.

A post-mortem is like a murder mystery. You have a set of clues. Some are reliable, such as server logs copied from the time of the outage.

Some are unreliable, such as statements from people about what they saw. As with real witnesses, people will mix observations with speculation. They will present hypotheses as facts. The post-mortem can actually be harder to solve than a murder, because the body goes away. There is no corpse to autopsy, because the servers are back up and running. Whatever state they were in that caused the failure no longer exists. The failure might have left traces in the log files or monitoring data collected from that time, or it might not.

The clues can be very hard to see.

Manage perceptions after a major incident.

As I read the files, I made some notes about data to collect. From the application servers, I would need log files, thread dumps, and configuration files. From the database servers, I would need configuration files for the databases and the cluster server. I also made a note to compare the current configuration files to those from the nightly backup.

The backup ran before the outage, so that would tell me whether any configurations were changed between the backup and my investigation. In other words, that would tell me whether someone was trying to cover up a mistake.

All I wanted was a shower and a bed. What I got instead was a meeting with our account executive, who briefed me on developments while I had been incommunicado in the air. My day finally ended around 1 a.m. In the morning, fortified with quarts of coffee, I dug into the database cluster and RAID configurations. I was looking for common problems with clusters: not enough heartbeats, heartbeats going through switches that carry production traffic, servers set to use physical IP addresses instead of the virtual address, bad dependencies among managed packages, and so on.

I found nothing wrong. The engineering team had done a great job with the database cluster. Proven, textbook work. The local engineers had made copies of all the log files from the kiosk application servers during the outage. I was also able to get log files from the CF application servers. They still had log files from the time of the outage, since it was just the day before.

Better still, there were thread dumps in both sets of log files. As a longtime Java programmer, I love Java thread dumps for debugging application hangs. If you know how to read it, a thread dump makes the application an open book. You can tell what third-party libraries an application uses, what kind of thread pools it has, how many threads are in each one, and what background processing the application uses. It did not take long to decide that the problem had to be within CF.

Out of the forty threads allocated for handling requests from the individual kiosks, all forty were blocked inside SocketInputStream, trying vainly to read a response that would never come. (On UNIX systems, you can trigger a thread dump by sending the Java process signal 3 with kill -3. To use this on Windows, you must be at the console, with a Command Prompt window running the Java application, and press Ctrl-Break. Either way the dump goes to standard out, so log files produced with Log4J or java.util.logging will not contain it.) A small portion of a thread dump from JBoss 3.x appears below. The fragment shows two threads from the HTTP request-handling pool, each with a name starting with http. Thread number 25 is in a runnable state, whereas thread 24 is blocked in Object.wait().
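The thread names, addresses, and line numbers in the fragment below are only representative of what a JBoss 3.x dump of that era looked like, not the actual output from the incident:

```
"http-0.0.0.0-8080-25" daemon prio=1 tid=0x08a593f0 nid=0x57ac runnable
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        ...                      (intermediate HTTP connector frames elided)
        at java.lang.Thread.run(Thread.java:595)

"http-0.0.0.0-8080-24" daemon prio=1 tid=0x08a57c30 nid=0x57ab in Object.wait()
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:474)
        ...                      (thread-pool housekeeping frames elided)
        at java.lang.Thread.run(Thread.java:595)
```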

This trace clearly indicates that these are members of a thread pool. A thread that blocks inside a socket read like this stays blocked for as long as the remote system stays silent; as a result, the caller is vulnerable to problems in the remote server. In fact, every single thread on every application server was blocked at exactly the same line of code: attempting to check out a database connection from a resource pool. The next part would be dicey. I needed to look at that code, but the operations center had no access to the source control system.

Only binaries were deployed to the production environment. When I asked our account executive how we could get access to the source code, he was reluctant to take that step. Given the scale of the outage, you can imagine that there was plenty of blame floating in the air looking for someone to land on. Relations between the operations center and Development—never all that cozy—were more strained than usual. Everyone was on the defensive, wary of any attempt to point the finger of blame in their direction.

So, with no legitimate access to the source code, I did the only thing I could do. I took the binaries from production and decompiled them. This particular session bean turned out to be the only such facility that CF had implemented yet. At first glance, the decompiled method looked well constructed.
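A sketch of the method's general shape, assuming a plain JDBC lookup behind a pooled DataSource; the class name, method name, field, and query are all invented for illustration:

```java
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class FlightSearchSketch {
    private DataSource connectionPool;   // the resource pool (assumed to be injected)

    public void lookupByCity(String city) throws SQLException {
        Connection conn = null;
        Statement stmt = null;
        try {
            conn = connectionPool.getConnection();   // blocks here once the pool is empty
            stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT ...");   // placeholder query
            // ... marshal the rows into value objects ...
            rs.close();
        } finally {
            if (stmt != null) {
                stmt.close();   // can throw SQLException...
            }
            if (conn != null) {
                conn.close();   // ...in which case this line never runs and the connection leaks
            }
        }
    }
}
```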

The use of the try..finally block shows that someone was thinking about resource cleanup. In fact, this very cleanup block has appeared in some Java books on the market. Too bad it contains a fatal flaw. It turns out that java.sql.Statement.close() can throw a SQLException. It almost never does. Suppose the JDBC connection was created before the failover.

The IP address used to create the connection will have moved from one host to another, but the current state of TCP connections will not carry over to the second database host. Any socket writes will eventually throw an IOException after the operating system and network driver finally decide that the TCP connection is dead. That means every JDBC connection in the resource pool is an accident waiting to happen.

(My favorite tool for decompiling Java code is still JAD. It is fast and accurate, though it is beginning to creak and groan when used on Java 5 code.) Executing a query on one of these dead connections throws a SQLException, of course. But closing the statement will also throw a SQLException, because the driver attempts to tell the database server to release resources associated with that statement. In short, the driver is willing to create a Statement object that cannot be used. You might consider this a bug.

Many of the developers at the airline certainly made that accusation. The key lesson to be drawn here, though, is that the JDBC specification allows java.sql.Statement.close() to throw a SQLException. In the offending code, if closing the statement throws an exception, then the connection does not get closed, resulting in a resource leak. After forty of these calls, the resource pool is exhausted, and all future calls block while trying to check a connection out of the pool. That is exactly what I saw in the thread dumps from CF.
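The fix is to make sure a failed close on one resource cannot skip the close on the next. A minimal sketch, using the same hypothetical method as above (on Java 7 and later, try-with-resources achieves the same effect with less ceremony):

```java
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class SaferFlightSearchSketch {
    private DataSource connectionPool;   // the resource pool (assumed to be injected)

    public void lookupByCity(String city) throws SQLException {
        Connection conn = connectionPool.getConnection();
        try {
            Statement stmt = conn.createStatement();
            try {
                stmt.executeQuery("SELECT ...");   // placeholder query
                // ... marshal the rows into value objects ...
            } finally {
                try {
                    stmt.close();
                } catch (SQLException ignored) {
                    // a failed close must not keep us from closing the connection
                }
            }
        } finally {
            conn.close();   // always runs, so the connection goes back to (or out of) the pool
        }
    }
}
```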

Would a code review have caught this bug? Would more testing have prevented this bug? Once the problem was identified, the team performed a test in the stress test environment that did demonstrate the same error. Ultimately, it is just fantasy to expect every single bug like this one to be driven out.

Bugs will happen. They cannot be eliminated, so they must be survived instead. The worst problem here is that the bug in one system could propagate to all the other affected systems.

They cannot—must not—allow bugs to cause a chain of failures. Things happen in the real world that just do not happen in the lab, usually bad things. In the lab, all the tests are contrived by people who know what answer they expect to get. Enterprise software must be cynical. Cynical software expects bad things to happen and is never surprised when they do. It refuses to get too intimate with other systems, because it could get hurt.

As so often happens, the team got caught up in the excitement of new technology and advanced architecture. It had lots of great things to say about leverage and synergy. Poor stability carries significant real costs. The obvious cost is lost revenue. Trading systems can lose that much in a single missed transaction!

Tarnish to the brand might be less immediately obvious than lost customers, but try having your holiday-season operational problems reported in BusinessWeek. Millions of dollars in image advertising—touting online customer service—can be undone in a few hours by a batch of bad hard drives. Good stability does not necessarily cost a lot.

Confronted with one of these leverage points, two design paths might both satisfy the functional requirements and make it through QA. One will lead to hours of downtime every year, while the other will not. The amazing thing is that the highly stable design usually costs the same to implement as the unstable one.

A transaction is an abstract unit of work processed by the system. This is not the same as a database transaction. A single unit of work might encompass many database transactions.

Transactions are the reason that the system exists. A single system can process just one type of transaction, making it a dedicated system. A mixed workload is a combination of different transaction types processed by a system.

When I use the word system, I mean the complete, interdependent set of hardware, applications, and services required to process transactions for users. A system might be as small as a single application, or it might be a sprawling, multitier network of applications and servers.

I use system when I mean a collection of hosts, applications, network segments, power supplies, and so on, that process transactions from end to end. A robust system keeps processing transactions even when transient impulses, persistent stresses, or component failures disrupt normal processing; that is what most people mean when they just say stability. The terms impulse and stress come from mechanical engineering. An impulse is a rapid shock to the system. An impulse to the system is when something whacks it with a hammer. In contrast, stress to the system is a force applied to the system over an extended period.

A flash mob pounding the Xbox product detail page, thanks to a rumor about discounts, causes an impulse. Ten thousand new sessions, all arriving within one minute of each other, is very difficult to withstand.

Getting Slashdotted is an impulse. Dumping twelve million messages into a queue at midnight on November 21st is an impulse.

These are things that can fracture the system in the blink of an eye. In a mechanical system, a material changes shape when stress is applied. This change in shape is called the strain. Stress produces strain. The same thing happens with computer systems. Stress from, say, a slow credit card processor will cause strain to propagate to other parts of the system, which can produce odd effects. A system with longevity keeps processing transactions for a long time. What is a long time?

It depends. A useful working definition of a long time is the time between code deployments. Unless you want to live in western Montana, that is. In either case, some component of the system will start to fail before everything else does. Run longevity tests; they are the only way to catch longevity bugs.

Chiles refers to these as cracks in the system. Sludge buildup of this kind, such as memory leaks and other slow accumulations, will kill your system in production, and it is rarely caught during testing.

Testing makes problems visible so you can fix them, which is why I always thank my testers when they find bugs. If you do not test for memory leaks that show up only after seven days, you will have memory leaks after seven days.

The trouble is that applications never run long enough in the development environment to reveal their longevity bugs. How long do you usually keep an application server running in your development environment? These environments are not conducive to long-running tests, such as leaving the server running for a month under daily traffic. A load test runs for a specified period of time and then quits. Load-testing vendors charge large dollars per hour, so nobody asks them to keep the load running for a week at a time.

Your development team probably shares the corporate network, so you cannot disrupt such vital corporate activities as email and web browsing for days at a time. So, how do you find these kinds of bugs? The only way you can catch them before they bite you in production is to run your own longevity tests.

If you can, set aside a developer machine. Have it run JMeter, Marathon, or some other load-testing tool. Also, be sure to have the scripts slack for a few hours a day to simulate the slow period during the middle of the night. That will catch connection pool and firewall timeouts.
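A longevity driver does not need to be fancy. This toy sketch, with an invented target URL and arbitrary quiet hours, fires one request per second all day and goes nearly idle overnight so that idle-timeout bugs in connection pools and firewalls get a chance to surface:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.time.LocalTime;

public class LongevityDriver {
    public static void main(String[] args) throws Exception {
        URL target = new URL("http://test-env.example.com/ping");   // placeholder endpoint
        while (true) {
            int hour = LocalTime.now().getHour();
            boolean quietPeriod = hour >= 2 && hour < 5;   // simulated overnight lull
            if (!quietPeriod) {
                HttpURLConnection conn = (HttpURLConnection) target.openConnection();
                conn.setConnectTimeout(5_000);
                conn.setReadTimeout(5_000);
                conn.getResponseCode();   // fire the request; we only care that it completes
                conn.disconnect();
            }
            Thread.sleep(quietPeriod ? 60_000 : 1_000);
        }
    }
}
```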

If you cannot set aside a machine, at least try to test important parts while stubbing out the rest. If all else fails, production becomes your longevity testing environment by default.

Failures start small, like a hairline crack in a piece of metal. Under stress, that crack can begin to propagate, faster and faster. Eventually, the crack will propagate faster than the speed of sound, and the metal breaks with an explosive sound. The original trigger and the way the crack spreads to the rest of the system, together with the result of the damage, are collectively called a failure mode.

No matter what, your system will have a variety of failure modes. Denying the inevitability of failures robs you of your power to control and contain them. Just as auto engineers create crumple zones—areas designed to protect passengers by failing first—you can create safe failure modes that contain the damage and protect the rest of the system.

Chiles calls these protections crackstoppers. Like building crumple zones into cars to absorb impacts and keep passengers safe, you can decide what features of the system are indispensable and build in failure modes that keep cracks away from those features. If you do not design your failure modes, then you will get whatever unpredictable— and usually dangerous—ones happen to emerge.

The crack started at the improper handling of the SQLException, but it could have been stopped at many other points. The exhaustion of the connection pool happened independently in each application server instance.

The pool could have been configured to create more connections if it was exhausted. It could also have been configured to block callers for a limited time, instead of blocking forever when all connections were checked out. Either of these would have stopped the crack from propagating.
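As a sketch of the second option, here is roughly what a bounded checkout looks like with a pool in the spirit of Apache Commons DBCP 2; the class, property values, and connection URL are assumptions for illustration, since the pool actually in use at the airline is not known:

```java
import org.apache.commons.dbcp2.BasicDataSource;

public class BoundedPoolConfig {
    public static BasicDataSource createPool() {
        BasicDataSource pool = new BasicDataSource();
        pool.setUrl("jdbc:oracle:thin:@db-host:1521:CF");   // placeholder connection URL
        pool.setMaxTotal(40);            // the same forty connections per server
        pool.setMaxWaitMillis(5_000);    // block at most five seconds on checkout, then throw
        return pool;
    }
}
```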

At the next level up, a problem with one call in CF caused the calling applications on other hosts to fail. By default, RMI calls will never time out.

After that, the calls started blocking. The client could have been written to set a timeout on the RMI sockets, for example by installing a socket factory that calls Socket.setSoTimeout() on each new socket (a sketch follows below). The client could likewise have set a timeout on its HTTP requests, unless it used java.net.URL and java.net.URLConnection, that is. None of these were done, so the crack propagated from CF to all systems that used CF. At a still larger scale, the CF servers themselves could have been partitioned into more than one service group. That would keep a problem within one of the service groups from taking down all users of CF. In this case, all service groups would have cracked in the same way, but that would not always be the case.
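Here is a minimal sketch of that socket-factory idea for RMI clients. The class name and the ten-second timeout are invented for illustration; the point is only that every socket the RMI runtime creates gets a read timeout instead of the default of blocking forever:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.rmi.server.RMISocketFactory;

public class TimeoutRmiSocketFactory extends RMISocketFactory {
    private static final int READ_TIMEOUT_MILLIS = 10_000;   // arbitrary example value

    @Override
    public Socket createSocket(String host, int port) throws IOException {
        Socket socket = new Socket(host, port);
        socket.setSoTimeout(READ_TIMEOUT_MILLIS);   // a stalled RMI call now fails instead of hanging
        return socket;
    }

    @Override
    public ServerSocket createServerSocket(int port) throws IOException {
        return new ServerSocket(port);
    }

    public static void install() throws IOException {
        // Must run before the first remote call; affects the whole JVM.
        RMISocketFactory.setSocketFactory(new TimeoutRmiSocketFactory());
    }
}
```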

Handling the integration through asynchronous request/reply messaging is another way of stopping cracks from propagating into the rest of the enterprise. In that case, the caller would know that a reply might never arrive and would have to deal with that case as part of handling the protocol itself. Even more radically, the callers could be searching for flights by looking for entries in a tuplespace.

CF would keep the tuplespace populated with flight records. The more tightly coupled the architecture, the greater the chance that this coding error can propagate. Conversely, less tightly coupled architectures act as shock absorbers, diminishing the effects of this error instead of amplifying them. Any of these approaches could have stopped the SQLException problem from spreading to the rest of the airline. One small thing leads to another, which leads to another. Looking at the entire chain of failure after the fact, the failure seems inevitable.

If you tried to estimate the probability of that exact chain of events occurring, it would look incredibly improbable. But, it looks improbable only if you consider the probability of each event independently. A coin has no memory; each toss has the same probability, independent of previous tosses. The combination of events causing the failure is not independent. A failure in one point or layer actually increases the probability of other failures.

If the database gets slow, then the application servers are more likely to run out of memory. Because the layers are coupled, the events are not independent. At each step in the chain of failure, the crack can be accelerated, slowed, or stopped. High levels of complexity provide more directions for the cracks to propagate in. Tight coupling accelerates cracks. For instance, the tight coupling of EJB calls allowed a resource exhaustion problem in CF to create larger problems in its callers.

Coupling the request-handling threads to the external integration calls in those systems caused a remote problem to turn into downtime. You could try to hunt down and eliminate every such defect before release, but the exhaustive brute-force approach is impractical for anything but life-critical systems or Mars rovers. What if you actually have to deliver in this decade? You need to look at some patterns that let you create shock absorbers to relieve those stresses.

Each failure I have investigated was unique. They were mostly unique, anyway, since I try not to have the same failure happen twice! Over time, however, patterns of failure do emerge: a certain brittleness along an axis, a tendency for this problem to amplify that way.

These are the stability antipatterns. Chapter 4, Stability Antipatterns, deals with these patterns of failure. If there are systematic patterns of failure, you might imagine that some common solutions would apply. You would be correct. Chapter 5, Stability Patterns, deals with design and architecture patterns to defeat the antipatterns.

These patterns cannot prevent cracks in the system. Nothing can. There will always be some set of conditions that can trigger a crack. These patterns stop cracks from propagating. They help contain damage and preserve partial functionality instead of allowing total crashes.

It should come as no surprise that these patterns and antipatterns interact with each other. The antipatterns have a tendency to reinforce each other. Like matching garlic, silver, and fire to their respective movie monsters, each of the patterns alleviates specific problems.

I could make snide remarks about how little has changed, but that would be dishonest. Applications rarely crash these days, thanks in large part to the wide adoption of Java, PHP, Ruby, and other interpreted languages. Operating systems have generally gotten more stable and reliable due to the hard work of many thousands of programmers.

We used to think of a hundred concurrent users as representing a large system; now we think in the tens of thousands. Instead of application uptime in the hours, we now look for months of continuous uptime. Of course, this also means bigger challenges. As we integrate the world, tightly coupled systems are the rule rather than the exception.

Big systems serve more users by commanding more resources; but, in many failure modes, big systems fail faster than small systems. The size and the complexity of these systems push us to what Inviting Disaster [Chi01] calls the technology frontier, where the twin specters of highly interactive complexity and tight coupling conspire to turn rapidly moving cracks into full-blown failures.

Think of a refrigerator with one dial for the fridge and another for the freezer: the natural mental model is that each dial controls its own cooling unit. Adjusting the dials under that mental model resulted in frozen milk and thawed meat, because the actual mechanism was controlling the proportion of chilled air sent to each section. With the best of intentions, the operator can take an action, based on his own mental model of how the system functions, that triggers a completely unexpected linkage.

Such linkages contribute to problem inflation, turning a minor issue into a major incident. Hidden linkages in cooling monitoring and control systems are partly to blame for the Three Mile Island reactor incident. Tight coupling allows cracks in one part of the system to propagate themselves—or multiply themselves—across layer or system boundaries. In the physical world, you can think of a catwalk held up by four bolts threaded through a metal plate.

The catwalk, the nuts and bolts, the plate, and the ceiling are obviously tightly coupled. The failure of a single bolt will radically increase the stress on the other bolts, the ceiling, and the catwalk.

This increased stress makes it extremely likely that another component in the system will fail—probably the catwalk itself. In your systems, tight coupling can appear within application code, in calls between systems, or anyplace a resource has multiple consumers. These bad behaviors are to be avoided. In all cases, however, the main point to remember is that things will break. Assume the worst, because cracks happen (see Inviting Disaster [Chi01]). Antipatterns create, accelerate, or multiply cracks in the system.

If your projects are like mine, they have probably been enterprise integration projects that happen to have an HTML-based front end. Those projects were the impetus that finally forced many companies to integrate systems that have never played well together. Data extracts fly off toward CRM, fulfillment, booking, authorization, fraud checking, address normalization, scheduling, shipping, and so on.

Reports are generated (one hopes) showing business statistics to business people, technical statistics to technical people, and management statistics to management. Integration points are the number-one killer of systems. Every single one of those feeds presents a stability risk. Every socket, process, pipe, or remote procedure call can and will hang. Even database calls can hang, in ways obvious and subtle. Every feed into the system can hang it, crash it, or generate other impulses at the worst possible time.

Socket-Based Protocols

Many higher-level integration protocols run over sockets. In fact, pretty much everything except named pipes and shared-memory IPC is socket based. The higher protocols introduce their own failure modes, but they are all susceptible to failures at the socket layer. The simplest failure mode occurs when the remote system refuses connections. The calling system must deal with connection failures. Usually, this is not much of a problem, since everything from C to Java to Ruby has clear ways to indicate a connection failure—either a -1 return value in C or an exception in Java, C#, and Ruby.

It came time to identify all the production firewall rules so we could open holes in the firewall for authorized connections to the production system. When we asked about rules for the feeds in and out of the production environment, we were pointed at the project manager for enterprise integration. That was our second clue that this was not going to be a simple task. The first clue was that nobody else could tell us what all the feeds were.

The PM understood exactly what we needed. He pulled up his database of integrations and ran a custom report to give us the connection specifics. On one hand, I was impressed that he could produce the answer at all; on the other hand, I was dismayed that he needed a database to keep track of it! It probably comes as no surprise, then, that the site was plagued with stability problems when it launched. It was like having a newborn baby in the house; I was awakened every night at 3 a.m.

We kept documenting the spots where the app crashed and feeding them back to the maintenance team for correction. Every architecture diagram ever drawn has boxes and arrows, like the ones in Figure 4. Like a lot of other things we work with, this arrow is an abstraction for a network connection.

All you will ever see on the network itself are packets. Transmission Control Protocol (TCP) is an agreement about how to make something that looks like a continuous connection out of discrete packets. The connection starts when the caller (the client in this scenario, even though it is itself a server for other applications) sends a SYN packet to a port on the remote server. If nothing is listening to that port, the remote host replies with a reset packet to refuse the connection, and the calling application then gets an exception or a bad return value.

All this happens very quickly, in less than ten milliseconds if both machines are plugged into the same switch. Between electrons and a TCP connection, there are many layers of abstraction. Fortunately, we get to choose whichever level of abstraction is useful at any given point in time. When the remote application is listening but cannot accept connections as fast as they arrive, the pending connections pile up in the listen queue. Once that listen queue is full, further connection attempts are refused quickly.
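The depth of that listen queue is requested by the accepting application when it binds its server socket. A minimal Java sketch, with an arbitrary port, backlog, and artificial delay, makes the mechanism visible:

```java
import java.net.ServerSocket;
import java.net.Socket;

public class SlowAccepter {
    public static void main(String[] args) throws Exception {
        // Ask the OS for a listen queue of at most 50 pending connections
        // (the kernel may silently cap this value).
        ServerSocket server = new ServerSocket(8080, 50);
        while (true) {
            Socket client = server.accept();   // connections beyond the backlog wait or get refused
            Thread.sleep(5_000);               // simulate an application that accepts too slowly
            client.close();
        }
    }
}
```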

The listen queue is the worst place to be. While the socket is in that partially formed state, whichever thread called open is blocked inside the OS kernel until the remote application finally gets around to accepting the connection or until the connection attempt times out. Connection timeouts vary from one operating system to another, but they are usually measured in minutes! Nearly the same thing happens when the caller can connect and send its request but the server takes a long time to read the request and send a response.

The read call will just block until the server gets around to responding. In Java, the default is to block forever.
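A cynical caller bounds both the connect and the read. The sketch below uses an invented host, port, and timeout values; the important calls are connect() with a timeout argument and setSoTimeout():

```java
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class CynicalClient {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress("remote-host", 8080), 3_000); // give up connecting after 3 s
            socket.setSoTimeout(10_000);                                        // give up on any single read after 10 s
            InputStream in = socket.getInputStream();
            int firstByte = in.read();   // throws SocketTimeoutException instead of blocking forever
            System.out.println("got byte: " + firstByte);
        } catch (SocketTimeoutException e) {
            System.err.println("remote system too slow: " + e.getMessage());
        } finally {
            socket.close();
        }
    }
}
```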

You have to call Socket.setSoTimeout() if you ever want to break out of the blocking call. In that case, be prepared for an IOException. Fast network failures cause immediate exceptions in the calling code. Slow failures, such as a dropped ACK, let threads block for minutes before throwing exceptions. If all threads end up getting blocked, then for all practical purposes, the server is down. Clearly, a slow response is a lot worse than no response.

The 5 a.m. Problem

One of the sites I launched developed this very nasty pattern of hanging completely at almost exactly 5 a.m.

This was running on around thirty different instances, so something was happening to make all thirty different application server instances hang within a five-minute window (the resolution of our URL pinger). Restarting the application servers always cleared it up, so there was some transient effect that tipped the site over at that time. Unfortunately, that was just when traffic started to ramp up for the day. From midnight to 5 a.m., there was very little traffic.

On the third day this occurred, I took thread dumps from one of the afflicted application servers. The request-handling threads were all stuck inside the Oracle JDBC driver; we were using the thick-client driver for its superior failover features. In fact, once I eliminated the threads that were just blocked trying to enter a synchronized method, it looked as if the active threads were all in low-level socket read or write calls.

The next step was tcpdump and Ethereal. A handful of packets were being sent from the application servers to the database servers, but with no replies. Nothing was coming back from the database to the application servers. Yet, monitoring showed that the database was alive and healthy.

Ethereal has since been renamed Wireshark. Abstractions let us move quickly: we can talk about fetching a document from a URL without discussing the tedious details of connection setup, packet framing, acknowledgments, receive windows, and so on.

Whether for problem diagnosis or performance tuning, packet capture tools are the only way to understand what is really happening on the network. Wireshark can sniff packets on the wire, as tcpdump does, but it goes farther by unpacking the packets for us. Through its history, Wireshark has experienced numerous security flaws—some trivial, some serious. Beyond the security issues, Wireshark is a big, heavy GUI program.

That is a burden that should not be on the production servers. For these reasons, it is best to capture packets noninteractively using tcpdump and then move the capture file to a nonproduction environment for analysis. Consider a capture from my home network, viewed in Ethereal. The first packet shows an Address Resolution Protocol (ARP) request. This happens to be a question from my wireless bridge to my cable modem.

Packets five, six, and seven are the three-way handshake for a TCP connection setup. From there we can trace the entire conversation between my web browser and the web server. The outermost frame is an Ethernet packet, and the exact bytes of the entire packet appear in Ethereal's third pane.

Back at the site, our first priority is restoring service. We do data collection when we can, but not at the risk of breaking an SLA (financial penalties accompany the violation of an SLA).

Any deeper investigation would have to wait until it happened again. None of us doubted that it would happen again. Sure enough, the pattern repeated itself the next morning. Application servers locked up tight as a drum, with the threads inside the JDBC driver. This time the packet capture showed nothing at all crossing the wire. I had a hypothesis. I said before that socket connections are an abstraction. They exist only as objects in the memory of the computers at the endpoints.

Once established, a TCP connection can exist for days without a single packet being sent by either side (assuming you set suitably perverse timeouts in the kernel). Routes can change, and physical links can be severed and reconnected. There was a time when that all worked beautifully well. These days, a bunch of paranoid little bastions have broken the philosophy and implementation of the whole Net. A firewall is nothing but a specialized router. It routes packets from one set of physical ports to another. Inside each firewall, a set of access control lists define the rules about which connections it will allow.

The packet might be allowed (routed to the destination network), rejected (a TCP reset packet sent back to the origin), or ignored (dropped on the floor with no response at all).

[Figure: the firewall checks its rule set as it relays the SYN, SYN/ACK, and ACK packets between the calling application and the remote server.]

How is this related to my 5 a.m. problem? The key is that table of established connections inside the firewall. The firewall does not allow infinite-duration connections, even though TCP itself does allow them. If too much time elapses without a packet on a connection, the firewall assumes that the endpoints are dead or gone.

It just drops the connection from its table. But TCP was never designed for that kind of intelligent device in the middle of a connection. The endpoints assume their connection is valid for an indefinite length of time, even if no packets are crossing the wire.

After that point, any attempt to read or write from the socket on either end does not result in a TCP reset or an error due to a half-open socket.

That could let bad guys probe for active connections by spoofing source addresses. Instead of an error, the writing side's TCP stack simply retransmits and waits for an acknowledgment that will never come, until the operating system finally gives up. That limit varies: my Linux system, running a 2.6-series kernel, had one value, while the HP-UX servers we were using at the time had a thirty-minute timeout. The situation for reading from the socket is even worse. It could block forever. When I decompiled the resource pool class, I saw that it used a last-in, first-out strategy.

During the slow overnight times, traffic volume was light enough that one single database connection would get checked out of the pool, used, and checked back in. Then the next request would get the same connection, leaving the thirty-nine others to sit idle until traffic started to ramp up. They were idle well over the one-hour idle connection timeout configured into the firewall.

Once traffic started to ramp up, those thirty-nine connections per application server would get locked up immediately. Even if the one connection was still being used to serve pages, sooner or later it would be checked out by a thread that ended up blocked on a connection from one of the other pools.

Then the one good connection would be held by a blocked thread. Total site hang. Once we understood all the links in that chain of failure, we had to find a solution. The resource pool has the ability to test JDBC connections for validity before checking them out, typically by executing a test query; with these dead connections, though, the test query would hang too, so that would just make the request-handling thread hang anyway. We could also have the pool keep track of the idle time of the JDBC connection and discard any that were older than one hour.

Unfortunately, that involves sending a packet to the database server to tell it that the session is being torn down. Fortunately, a sharp DBA recalled just the thing. Oracle has a feature called dead connection detection that you can enable to discover when clients have crashed. When enabled, the database server sends a ping packet to the client at some periodic interval. If the client responds, then the database knows it is still alive.
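Dead connection detection is a server-side setting. Assuming it is switched on the usual way, through the database server's sqlnet.ora, the sketch looks like this (the ten-minute interval is an arbitrary example, not the value used at the site):

```
# sqlnet.ora on the database server (sketch)
# Probe idle client connections every 10 minutes; clean up if the client is gone.
SQLNET.EXPIRE_TIME = 10
```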

If the client fails to respond after a few retries, the database server assumes the client has crashed and frees up all the resources held by that connection. For our purposes, the important side effect was that the periodic ping packets counted as traffic on the connection, so the firewall never saw it go idle. Dead connection detection kept the connections alive, which let me sleep through the night.

HTTP Protocols

Service-oriented architectures are a hot topic these days, certainly if you listen to application server vendors.

Another commonly cited reason is more efficient use of data center resources by providing shared hardware for commonly used services. Other organizations desire the flexibility and nimbleness that SOA promises. Of course, all HTTP-based protocols use sockets, so they are vulnerable to all of the problems described previously. HTTP adds its own flavor of issue, mainly centered on the client library, typically the java.net.URL and java.net.URLConnection classes. The canonical example builds a query URL to hit Google; you have to downcast the connection object to get at the HTTP-specific methods.
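A reconstruction of that style of naive fetch, with an invented query string, looks like this. Nothing here bounds how long the connect or the read may take:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class NaiveGoogleFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.google.com/search?q=foo");              // build the query URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();    // downcast for HTTP-specific methods
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {   // blocks as long as the server cares to dribble
            System.out.println(line);
        }
        in.close();
    }
}
```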

More people should be using asynchronous message transport. If you wanted to set the socket timeout by calling Socket.setSoTimeout(), you were out of luck: URLConnection gives you no access to the underlying socket. The remote system could dribble back one byte per second for the next ten years, and your thread would still be stuck on that one call. A cynical system would never put up with such an unprotected call. Fortunately, other available HTTP clients allow much more control. Usually, software vendors provide client API libraries that have a lot of problems and often have hidden stability risks.

These libraries are just code, coming from regular developers. They have all the variability in quality, style, and safety that you see from any other random sampling of code. The worst part about these libraries is that you have so little control over them. About the best thing you can do is decompile the code, find issues, and report them as bugs. If you have enough clout to apply pressure to the vendor, then you might be able to get a bug fix to their client library, assuming, of course, that you are on the latest version of their software.

In the past, I have been known to fix their bugs and recompile my own version for temporary use while waiting for the patched version from the vendor. Whenever you have threads that need to synchronize on multiple resources, you have the potential for deadlock. Thread 1 holds lock A and needs lock B, while thread 2 has lock B and needs lock A.

The classic recipe for avoiding this deadlock is to make sure you always acquire the locks in the same order and release them in the reverse order (a sketch follows below). Of course, this helps only if you know that the thread will be acquiring both locks and you can control the order in which they are acquired. Suppose the vendor's client library delivers messages through a callback such as messageReceived(). Is it safe to synchronize inside that callback? No idea. Without knowing what thread messageReceived gets called on, you cannot be sure what monitors the thread will be holding. It could have a dozen synchronized methods on the stack already.
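For the part you do control, the lock-ordering recipe looks like this minimal sketch; the lock names and the protected work are placeholders:

```java
public class LockOrdering {
    private final Object lockA = new Object();
    private final Object lockB = new Object();

    public void doWork() {
        synchronized (lockA) {          // every caller takes A first...
            synchronized (lockB) {      // ...and only then B
                // touch both protected resources here
            }                           // B released first...
        }                               // ...then A, the reverse of acquisition
    }
}
```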

The vendor's callback, by contrast, is a deadlock minefield. Depending on the threading model inside the client library and how long your callback method takes, synchronizing the callback method could block threads inside the client library. Like a plugged drain, those blocked threads can cause threads calling send() to block.

Odds are that means request-handling threads will be tied up. As always, once all the request-handling threads are blocked, your application might as well be down. What can you do to make integration points safer? The most effective patterns to combat integration point failures are Circuit Breaker and Decoupling Middleware.

Testing helps, too. Cynical software should handle violations of form and function, such as badly formed headers or abruptly closed connections. To make sure your software is cynical enough, you should make a test harness, a simulator whose behavior you can control.

Combat Integration Point failures with the Circuit Breaker and Decoupling Middleware patterns.

Setting the test harness to spit back canned responses facilitates functional testing. It also provides isolation from the target system when you are testing. Finally, each such test harness should also allow you to simulate various kinds of system and network failure.
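A harness like that can start out as almost nothing. The sketch below, with an invented port and canned response, supports three switchable behaviors: answer normally, dribble the answer out slowly, or accept the connection and never answer at all:

```java
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class NastyHarness {
    enum Mode { CANNED, SLOW, HANG }

    public static void main(String[] args) throws Exception {
        Mode mode = Mode.valueOf(args.length > 0 ? args[0] : "CANNED");
        byte[] canned = "HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"
                .getBytes(StandardCharsets.US_ASCII);
        try (ServerSocket server = new ServerSocket(9999)) {
            while (true) {
                Socket client = server.accept();
                if (mode == Mode.CANNED) {
                    try (OutputStream out = client.getOutputStream()) {
                        out.write(canned);               // well-behaved response
                    }
                } else if (mode == Mode.SLOW) {
                    try (OutputStream out = client.getOutputStream()) {
                        for (byte b : canned) {          // dribble one byte per second
                            out.write(b);
                            out.flush();
                            Thread.sleep(1_000);
                        }
                    }
                } else {
                    // HANG: accept the connection and never respond or close it.
                }
            }
        }
    }
}
```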

This test harness will immediately help with functional testing. To test for stability, you also need to flip all the switches on the harness while the system is under considerable load. This load can come from a bunch of workstations running JMeter or Marathon, but it definitely requires much more than a handful of testers clicking around on their desktops.

Remember This

Beware this necessary evil. Every integration point will eventually fail in some way, and you need to be prepared for that failure.

Prepare for the many forms of failure. Integration point failures take several forms, ranging from various network errors to semantic errors.

Know when to open up abstractions. Debugging integration point failures usually requires peeling back a layer of abstraction. Failures are often difficult to debug at the application layer, because most of them violate the high-level protocols.

Packet sniffers and other network diagnostics can help.

Apply patterns to avert Integration Point problems. Defensive programming via Circuit Breaker, Timeouts, Decoupling Middleware, and Handshaking will all help you avoid the dangers of Integration Points.

Horizontal scaling refers to adding capacity by adding servers. This is the Google and Amazon approach. A web farm is an example of horizontal scaling—each server adds nearly the same amount of capacity as the previous server.

The alternative, vertical scaling, means building bigger and bigger servers: replacing x86 pizza boxes with four-way, eight-way, and then thirty-two-way servers. This is the approach Oracle would love to see you use. Each type of scaling works best under different circumstances. If your system scales horizontally, then you will have load-balanced farms or clusters where each server runs the same applications.

The multiplicity of machines provides you with fault tolerance through redundancy. A single machine or process can completely bonk while the remainder continues serving transactions. Still, even though horizontal clusters are not susceptible to single points of failure (except in the case of Attacks of Self-Denial; see that antipattern), they can exhibit a load-related failure mode all their own: the chain reaction.

When one node in a load-balanced group fails, the other nodes must pick up the slack. In an eight-server farm, for example, each node handles 12.5 percent of the total load. After one server pops off, each of the remaining seven servers must handle about 14.3 percent of the total. Even though each survivor takes on only 1.8 percent more of the total workload, its individual load increases by roughly 14 percent. If the first server failed because of some load-related condition, such as a memory leak or intermittent race condition, the surviving nodes become more likely to fail.

With each additional server that goes dark, the remaining stalwarts get more and more burdened and therefore are more and more likely to also go dark. A chain reaction occurs when there is some defect in an application—usually a resource leak or a load-related crash. Because the layer is homogeneous, that defect exists on every server, which means the only way you can eliminate the chain reaction is to fix the underlying defect.

Single point of failure (SPOF): any device, node, or cable that, when removed, results in the complete failure of a larger system. For example, a server with only one power supply and a network switch with no redundancy are both SPOFs.

Splitting a layer into multiple pools—as in the Bulkhead pattern—can sometimes help by splitting a single chain reaction into two separate chain reactions that occur at different rates. What effect could a chain reaction have on the rest of the system? Well, for one thing, a chain reaction failure in one layer can easily lead to a cascading failure in a calling layer. Chain reactions are sometimes caused by blocked threads. This happens when all the request-handling threads in an application get blocked and that application stops responding.

Incoming requests will then get distributed out to the applications on other servers in the same layer, increasing their chance of failure.

Remember This

One server down jeopardizes the rest. A chain reaction happens because the death of one server makes the others pick up the slack. The increased load makes them more likely to fail. A chain reaction will quickly bring an entire layer down. Other layers that depend on it must protect themselves, or they will go down in a cascading failure.

Hunt for resource leaks. Most of the time, a chain reaction happens when your application has a memory leak. As one server dies, the survivors take on its traffic, and the increased traffic means they leak memory faster.

One online retailer's search service illustrates the point. The retailer has a huge catalog—half a million SKUs in different categories. To handle all the customers during the holidays, it was running a dozen search engines sitting behind a hardware load balancer. The load balancer also performed health checks to discover which servers were alive and responsive so it could make sure to send queries only to search engines that were alive.

Those health checks turned out to be useful. The search engine had some bug that caused a memory leak. Under regular traffic (not a holiday season), the search engines would start to go dark right around noon. Because each engine had been taking the same proportion of load throughout the morning, they would all crash at about the same time. As each search engine went dark, the load balancer would send its share of the queries to the remaining servers, causing them to run out of memory even faster.

The gap between the first crash and the second would be five or six minutes. Between the second and third would be just three or four minutes.

The last two would go down within seconds of each other. This particular system also suffered from cascading failures and blocked threads. Losing the last search server caused the entire front end to lock up completely.

Until we got an effective patch from the vendor (which took months), we had to follow a daily regime of restarts that bracketed the peak hours.

Again, if one server goes down to a deadlock, the increased load on the others makes them more likely to hit the deadlock too. Use Circuit Breaker on the calling side for that.

We usually refer to the individual farms as layers. In a service-oriented architecture, these look even less like traditional layers and more like a directed acyclic graph.

System failures start with a crack. That crack comes from some fundamental problem. Various mechanisms can retard or stop the crack, which are the topics of the next chapter. Absent those mechanisms, the crack can progress and even be amplified by some structural problems. A cascading failure occurs when a crack in one layer triggers a crack in a calling layer.

An obvious example is a database failure. If an entire database cluster goes dark, then any application that calls the database is going to experience problems of some kind. If it handles the problems badly, then the application layer starts to fail too.

A cascading failure occurs when problems in one layer cause problems in callers.

In one such failure, each page request would attempt to create a new connection, get a SQLException, try to tear down the connection, get another SQLException, and then vomit a stack trace all over the user.

Cascading failures require some mechanism to transmit the failure from one layer to another. Cascading failures often result from resource pools that get drained because of a failure in a lower layer.
