Yesterday's bombing in Boston was horrible for those caught by the blast and terrible for their friends and families. The impact reached far beyond the bomb's blast radius.
As people across the country (and around the world) sought to find out more about the event, they went to media websites for reports on what was known and still unfolding, and sometimes found no response to their queries. Overnight analysis of test results sheds some light on just how significant the impact on those websites was -- and on what companies facing similar spikes in demand might do to keep data flowing.
Keynote Systems performs ongoing monitoring of a wide variety of websites. Among their regular reports is a weekly look at how the websites of 22 different media organizations perform. In response to yesterday's events, they released an overnight report on how those sites had handled the high demand. The company sent Enterprise Efficiency a copy of the report and I had a chance to talk with one of their performance experts. What I got was an interesting picture of how the news sites responded to the tragedy, and some solid information on how companies might minimize the impact on their own sites when demand suddenly shoots through the roof.
Keynote Web Statistics for April 15
Keynote's data shows a spike in response time and drop in reliability corresponding to the time of the explosions in Boston.
I spoke with Aaron Rudger, web performance expert (a title, not a description) with Keynote Systems. He told me that Keynote's index is built by performing the same query against 22 different websites on a regular schedule throughout the day. "We take measurements using our desktop web performance measurement technology, which on a regular, repeated basis captures a measurement of the individual site's availability and performance," he said. Yesterday, they saw something unusual in their test results.
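The measurement Rudger describes -- repeatedly fetching a page and recording whether it succeeded and how long it took -- can be sketched in a few lines. This is not Keynote's technology; the site list, timeout, and probe function here are illustrative assumptions.

```python
# Minimal sketch of a repeated availability/response-time probe of the kind
# Keynote describes. SITES and the timeout are placeholders, not Keynote's.
import time
import urllib.request

SITES = ["https://example.com/", "https://example.org/"]  # illustrative

def probe(url, timeout=10):
    """Fetch a page once; return (success, seconds elapsed)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            ok = resp.status == 200
    except OSError:
        ok = False  # DNS failure, timeout, connection refused, HTTP error...
    return ok, time.monotonic() - start

def run_once(sites):
    """One measurement pass; a scheduler would call this throughout the day."""
    return {url: probe(url) for url in sites}
```

Run on a schedule from many locations, results like these are what produce the availability and download-time curves in the chart above.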
The results in the image come from a test that requests each site's home page. The download time (in seconds) is shown in blue, while the success rate is shown in orange. Rudger said that there are a number of ways to begin peeling back the layers in order to understand the meaning of the data they receive:
One of the ways is looking at the dynamic, to see whether the response and uptime are consistent for every site we examine. In this case, we have [the test routine] running from 10 different places in the US: San Francisco, New York, Boston, and others are represented. In this case, we didn't see any appreciable difference in results across the agents, so it suggests that this wasn't primarily a network congestion issue within the ISPs in a certain region. We didn't, for example, see a degradation in the New England area. We saw performance degrade across the nation, which indicates that demand was the issue rather than the area.
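The comparison Rudger describes -- checking whether degradation shows up at one agent or at all of them -- amounts to a simple aggregation over per-location measurements. The sketch below is a hypothetical illustration of that logic; the city names and the 2x outlier threshold are assumptions, not Keynote's methodology.

```python
# Distinguish regional congestion from nationwide demand by comparing
# response times recorded by geographically distributed agents.
from statistics import mean

def classify_degradation(samples, ratio=2.0):
    """samples: {agent_city: [response_time_seconds, ...]}.
    If one region's average stands out by `ratio`x the overall average,
    suspect regional congestion; if all agents are slow together,
    suspect demand at the origin."""
    averages = {city: mean(times) for city, times in samples.items()}
    overall = mean(averages.values())
    outliers = [city for city, avg in averages.items() if avg > ratio * overall]
    return ("regional", outliers) if outliers else ("nationwide", [])
```

With Boston slow and other agents fast, this would flag a regional problem; yesterday's pattern -- every agent slow at once -- is the "nationwide" case that points at demand.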
I asked Rudger whether companies could predict the scale of a potential traffic spike and build an infrastructure to meet it, or whether some spikes are simply too great for any server to bear. He told me that it was obvious that some sites were able to deal with the demand better than others, but that it was more than a simple matter of building an infrastructure that can scale to meet demand.
Rudger said that Google News, for example, is a very different site from CNN.com: There are dramatic differences in the content provided and the way in which it is presented to the user, and dramatic differences in performance between the sites as well. He said that performance might well have to do with the infrastructure on which each site is built, but there can also be a huge performance impact that is a function of how the content is provided or rendered by the site. A site's performance, Rudger said, could be tied to the still images or video provided on the page.
One key, he said, is understanding the relationship between your site's performance and the performance of content that might be provided by a third party. "With many sites, being able to provide a good experience to the user means that, if you're depending on content from another provider, you have to deal with that connection as a point of failure," Rudger said. "The third-party provider could be the point of weakness that slows things down."
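One common way to keep a third-party connection from becoming the point of failure Rudger describes is to fetch that content with a hard timeout and fall back to cached or placeholder markup. The sketch below is a hypothetical illustration, not anything Keynote recommends specifically; the fallback string and timeout are assumptions.

```python
# Fetch third-party content with a strict timeout so a slow partner
# never stalls the rest of the page.
import urllib.request

FALLBACK_WIDGET = "<!-- third-party widget unavailable -->"

def fetch_third_party(url, timeout=2.0, fallback=FALLBACK_WIDGET):
    """Return third-party markup, or the fallback if the provider is
    slow or down -- degrading one widget instead of the whole page."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return fallback
```

The same idea applies client-side: loading third-party scripts asynchronously keeps them from blocking the page render.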
So what lessons can the CIO of a non-media company take away from the performance hit suffered by CNN, NBC, and the other websites yesterday? Rudger has a few key suggestions:
We definitely preach that, if there's any way you can reasonably construct a predictive high-demand scenario, treat a spike in demand as a near-certain occurrence and test to make sure the infrastructure is capable of meeting it. The other thing is the third-party dependencies: We emphasize that site owners need to understand what the dependencies are, employ mitigation strategies so they can decouple them when performance starts to degrade, and have a mechanism in place to monitor the impact of those third-party dependencies from a performance point of view.
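The decoupling mechanism Rudger recommends is often implemented as a simple circuit breaker: monitor a dependency's recent outcomes and switch it off when failures pile up, rather than letting it drag the site down during a spike. The sketch below is a minimal illustration under assumed thresholds, not a specific Keynote recommendation.

```python
# Circuit-breaker-style toggle for a third-party dependency: track the last
# N outcomes and disable the dependency once failures reach a cap.
from collections import deque

class DependencyBreaker:
    def __init__(self, window=10, max_failures=3):
        self.recent = deque(maxlen=window)  # last N outcomes (True = success)
        self.max_failures = max_failures

    def record(self, success):
        """Log one probe or request outcome for this dependency."""
        self.recent.append(success)

    def enabled(self):
        """Serve the dependency only while recent failures stay under the cap."""
        return list(self.recent).count(False) < self.max_failures
```

In practice the `record()` calls would come from the same monitoring that watches the dependency's performance, closing the loop Rudger describes: measure, detect degradation, decouple.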