In late July, Netflix released an open-source tool known as Chaos Monkey. The tool is designed to deliberately cause failures in virtual machines (VMs) hosted on Amazon Web Services (AWS), in order to make Netflix’s website more resilient to everything from a localized system failure to a large-scale outage.
Unsurprisingly, when top companies like Netflix make a move, people pay attention. Chaos Monkey is no exception and has been a subject of debate since Netflix revealed its existence in 2010. The concept was controversial: Break your systems if you want to make them resilient. This, of course, fed the discussion.
What pushed Chaos Monkey into the spotlight recently was Netflix’s decision to release it into the wild, only a month after a major AWS outage. Now that many more people are talking about Chaos Monkey, however, a lot of misinformation is floating around.
The biggest misconception about Chaos Monkey is that it’s a security tool. This is understandable, as the concept of “break your own systems first” overlaps with the “white hat” hacker mentality of “hack yourself first.” However, there’s a big difference between turning off a server and finding holes in one that’s running.
Labeling something a security tool implies that it helps locate specific security holes, not overall system problems. A white hat hacking service would answer questions such as: Could your network handle a denial-of-service attack, and at what point would such an attack jeopardize your system? Do you have vulnerabilities that could expose information, and what could an attacker extract? Are there flaws in your architecture or implementation that would allow someone else to penetrate your network, and what would the damage be to your business?
Note that none of these questions are answered by turning off a server.
An example of a true security tool would be something like Metasploit by Rapid7, which is constantly updated by a large open-source community to include exploits for the most recently discovered vulnerabilities. The ultimate goal of a Metasploit user is to gain root-level access to the target system, at which point they’d have more control than most admins. The potential damage to the company can then be gauged by the criticality of the penetrated services. Similar tools test for vulnerabilities non-destructively, without actually cracking the system. These tools include Nessus by Tenable, Core Impact by Core Security, NeXpose by Rapid7, the Qualys service, IP360 from nCircle, and other products from HP, IBM, and more.
In contrast, Chaos Monkey is a tool whose strength is its relative simplicity: turn off a random server in your AWS environment at a random time. It keeps a log of what it has done, so the root cause of any resulting failure is known. What it does not do is fix any problems it may find. It also comes with no guarantee that any problem it causes will be easy to repair (in the short term) or fix (in the long term).
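The behavior described above (pick a random server, turn it off, log what happened) can be illustrated with a toy sketch. This is not Netflix's code and it does not talk to AWS: the `ToyChaosMonkey` class, the `terminate` callback, and the server names are all hypothetical, invented purely for illustration. The "random time" scheduling is left out to keep the example short.

```python
import random
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Optional


@dataclass
class ChaosLogEntry:
    """One record of a server that was turned off, and when."""
    instance_id: str
    terminated_at: datetime


@dataclass
class ToyChaosMonkey:
    """Illustrative only: a hypothetical stand-in, not Netflix's tool or an AWS client."""
    instances: List[str]                  # servers eligible to be turned off
    terminate: Callable[[str], None]      # hypothetical shutdown hook supplied by the caller
    rng: random.Random = field(default_factory=random.Random)
    log: List[ChaosLogEntry] = field(default_factory=list)

    def strike(self) -> Optional[str]:
        """Turn off one randomly chosen server and record which one it was."""
        if not self.instances:
            return None
        victim = self.rng.choice(self.instances)
        self.terminate(victim)
        self.instances.remove(victim)
        # The log is what lets the team trace a failure back to its cause.
        self.log.append(ChaosLogEntry(victim, datetime.now(timezone.utc)))
        return victim


if __name__ == "__main__":
    stopped: List[str] = []
    monkey = ToyChaosMonkey(
        instances=["web-1", "web-2", "db-1"],  # made-up server names
        terminate=stopped.append,              # pretend shutdown: just remember the name
    )
    victim = monkey.strike()
    print(f"terminated {victim}; log entries: {len(monkey.log)}")
```

Note what the sketch leaves out: nothing in it repairs the server or diagnoses why the rest of the system struggled without it. That work still falls to the team.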
The end goal of Chaos Monkey is like that of lifting weights: get stronger by causing yourself pain in a controlled fashion. As you fix the problems Chaos Monkey uncovers, your overall system gets stronger, but the process also forces your team to push themselves in uncomfortable ways, and many of them will be sore for days.
With this in mind, if you are deploying Chaos Monkey, it is essential that your team is mentally prepared to face failure. The same is true for security tools. Both practices can expose more vulnerabilities and create more work; however, they can also create a much stronger ecosystem.
Remember, Chaos Monkey breaks your own system not to find problems the way a security tool does, but to build a team and a set of responses that can deal with the unexpected. It forces the team to look beyond known problems and build a system that can recover quickly and prove its reliability through instability.