Just in case you haven’t heard the news, Netflix has basically shut down due to some mysterious internal problem that has halted their ability to ship DVDs. This is fascinating news – it’s similar to having FedEx announce that they had halted delivery of packages because they couldn’t figure out where they were supposed to go. What’s going on here?
Netflix has 8.5 million subscribers who pay a monthly fee to rent DVD movies. The movies that they select are delivered by mail, and when they are returned by mail, the next movie in the subscriber’s queue is mailed to them. From an IT point of view, the heart of the company is their database(s). This truly is an information-based company. I have been a happy subscriber for at least 5 years now and I’ve never had a problem with getting my movies. For Netflix to come out and admit that they are having a problem (I’ve even received an email from them) must mean that the problem has existed for several days and they now felt the need to tell the world before people started wondering where their next movie was. Clearly this kind of outage is going to cost the company – Citi analyst Tony Wible is guessing that the tab will be $1.8 million to $3.6 million in revenue a day. Talk about a melt-down!
One possible source of Netflix’s problems might be the simple fact that they don’t appear to have a CIO! A quick search of both the company web site and Hoovers turned up no likely suspects. Hmm, perhaps this IT ship has nobody at the helm!
Back to the problem — I have no secret insight into how Netflix runs their business. However, Tom Dillon, who had been serving as the company’s COO as well as its CIO, gave some interviews and from these we can piece parts of the story together.
At the core of Netflix’s operations is the ability to automate as much of the process of sending and receiving DVDs as possible. Since the solution that they have in place to automate these tasks is proprietary, it is of course a trade secret. However, we do have some information. When a DVD comes in, the first thing that is done is to check it to make sure the right disk is in the right sleeve. Next, the serial number on the jacket is scanned.
Now that Netflix’s proprietary software knows what DVD it’s dealing with, it can consider the company’s total inventory of that title, the items on customers’ wish lists of movies they want to see, and a host of other factors. At this point, the DVD will either get sent out again, placed in inventory, or simply retired. When things are working correctly, Netflix says that it is able to check in a returned DVD and send out a new one within one day more than 90% of the time. The two challenges that Netflix has always been open about are scaling issues and bottlenecks. As Netflix rapidly grows both the number of subscribers it has to serve and the number of movies in its inventory, IT challenges are bound to occur. Additionally, bottlenecks in the DVD processing and delivery process can occur at any time. Dillon admitted that bottlenecks can’t be predicted and basically just have to be dealt with as they show up. If Netflix stumbles, the problem will quickly go from bad to worse, because they receive over 100,000 new disks a day.
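To make that check-in decision a little more concrete, here is a minimal sketch of what such logic might look like. Netflix's real system is a trade secret, so everything here is my own invention for illustration: the `Disposition` names, the `playable` flag, the `target_shelf_stock` threshold, and the order of the rules are all assumptions, and the real "host of other factors" is surely far richer.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    SHIP = "send out again"
    SHELVE = "place in inventory"
    RETIRE = "retire"


@dataclass
class ReturnedDisc:
    title: str
    playable: bool  # hypothetical result of a condition check at check-in


def decide(disc, queued_requests, copies_on_shelf, target_shelf_stock):
    """Pick what happens to a checked-in disc (illustrative rules only)."""
    if not disc.playable:
        # A damaged disc is of no use to anyone.
        return Disposition.RETIRE
    if queued_requests > 0:
        # Somebody is waiting for this title, so turn it around right away.
        return Disposition.SHIP
    if copies_on_shelf >= target_shelf_stock:
        # Nobody wants it and the shelf is already full (assumed rule).
        return Disposition.RETIRE
    return Disposition.SHELVE


if __name__ == "__main__":
    disc = ReturnedDisc(title="Some Movie", playable=True)
    print(decide(disc, queued_requests=3, copies_on_shelf=0, target_shelf_stock=5))
```

Even in a toy version like this, every branch depends on inventory and queue data being correct, which is why a database problem could plausibly bring the whole line to a halt.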
So what could have gone so terribly, terribly wrong here? I’m just taking a guess, but based on years of experience in IT I’m thinking that we’re looking at a cascading problem that was started by a software upgrade. A good guess as to how this all started is that some relatively minor piece of Netflix’s proprietary automation system got a routine update. Next, some sequence of events occurred that caused this updated software to fail or behave in some unexpected way. This problem then cascaded up and down the automated DVD processing line. Since Netflix is reporting that all of their distribution centers are impacted, this means that either a core system is down, or they performed the upgrade at each site at the same time.
As horrible as this must be for Netflix, it’s not the first time we’ve seen this type of problem: AT&T has had its frame-relay network shut down due to software issues, XM Radio took a hit for two days, JetBlue had its unromantic Valentine’s Day outage, and of course, any BlackBerry outage ends up being front page news. Netflix is reporting that they have their entire technical staff trying to fix this problem. This suggests that the original issue spread to multiple applications and may have damaged their corporate data.
Now that we’re done pointing fingers, what could Netflix have done to prevent this from happening in the first place?
- Have An “Undo” Button: Most IT shops keep an old version of each application in storage even after it has been replaced. If the updated software starts to cause problems, then the old version can be rolled back out and reinstalled.
- Don’t Put All Eggs In Single Basket: If indeed Netflix updated all of their distribution operations at the same time, then they were foolish indeed. Instead, upgrades should be done to a single site first in order to determine if there are any unknown issues. Dealing with a single site that is down is much easier than having all of your sites down.
- In Case Of Emergency, Go Manual: Although I love automation as much as the next IT worker, it’s always a good idea to know how to perform the automated tasks by hand just in case there is a day that this might be required. Since Netflix has reported that they are down, not just limping along, clearly they don’t have a manual process to fall back on.
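The first two suggestions above (keep an "undo" button, and upgrade one site before the rest) can be combined into a single routine. What follows is only a rough sketch under my own assumptions: `deploy` and `is_healthy` are hypothetical hooks standing in for whatever install and monitoring tools a shop actually has, and the site names and version strings are made up.

```python
def staged_rollout(sites, new_version, deploy, is_healthy):
    """Upgrade one site at a time; undo and stop at the first failure.

    sites       -- dict mapping site name to currently installed version
    deploy      -- hypothetical hook: deploy(site, version) installs a version
    is_healthy  -- hypothetical hook: is_healthy(site) is True when the site
                   is processing normally after the upgrade
    Returns the list of sites that were upgraded successfully.
    """
    upgraded = []
    for site in list(sites):
        previous = sites[site]        # keep the old version: the "undo" button
        deploy(site, new_version)
        sites[site] = new_version
        if not is_healthy(site):
            deploy(site, previous)    # roll back this one site
            sites[site] = previous
            break                     # the remaining sites are never touched
        upgraded.append(site)
    return upgraded


if __name__ == "__main__":
    # Made-up sites; pretend the upgrade misbehaves at the second one.
    sites = {"site-a": "1.0", "site-b": "1.0", "site-c": "1.0"}
    done = staged_rollout(
        sites,
        "1.1",
        deploy=lambda site, version: None,
        is_healthy=lambda site: site != "site-b",
    )
    print(done, sites)
```

In this made-up run only one site ever goes down, and it is put back on the old version right away. A real shop would probably also let the first site run for a while before moving on to the next, and might choose to roll back the sites that were already upgraded too.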
I’m confident that Netflix will solve this problem (but nobody there gets to sleep until they do!); however, afterwards they are going to have to make some changes in their IT shop in order to ensure that this never happens again. Good luck!
Have you ever been caught in a major software outage? How big was your inconvenience – small, medium, or large? Will Netflix’s glitch cause you to run to Blockbuster or are you willing to ride this one out? Leave a comment and let me know!