The Resilient Enterprise: Learning to Fail

Friday, June 22, 2012

Rafal Los


Let's talk a little bit more on The Resilient Enterprise, and why organizations need to learn to fail.  Failing isn't necessarily good, or bad... it just is.  

Failing is a fact of humanity and a fact of the universe.  

Failure is one of those immutable universal laws that is driven by chaos.  Furthermore, I think we can probably agree at some base level that failure is one of the only things we can count on as a universal constant.

Having said all that, you may think me all "doom and gloom" but don't misunderstand - I repeat that failure isn't bad or good - it just is.

After my conversations at HP Discover 2012 in Las Vegas with Genefa Murphy and Matt Morgan, some of HP's top minds when it comes to DevOps, I believe even more firmly that one of the things that will link the tribe (as Gene Kim calls it) of diverse people into a cohesive DevOps group is failure.

When things go badly, we as IT teams have two options, and unfortunately, just about everywhere I've ever worked, we consistently pick the same one.  See if you agree with me here...

Option 1 is by far the one most familiar to anyone who's ever worked an operations role in an enterprise.  Option 1 involves a large mass of people getting together, most likely virtually, huddling and trying to pass the blame off on one another.  We always start with the firewall, don't we?  It's always the firewall's fault somehow, or at least that's how I remember it.

There are two terrible habits at work here.  The first is the finger-pointing.  The second, if we get past the "Whose fault is it anyway?" game, is the fun situation where, out of the 30 people on the conference call, someone says "Now, (name) try (random option). Done? OK, does it work now?"... sound familiar to anyone?

This situation is exactly what the DevOps movement is trying to eradicate as much as possible, since it makes the whole process painful, creates animosity and makes us no better at figuring out failure the next time.  We simply repeat the same train wreck.

Option 2 is subtly different.  While we still have a group of people getting together, again most likely virtually, it's a different type of scenario.  I'm thinking of a situation where each team with a stake in the application or system sends a representative who is intimately familiar with the deployment and architecture.

It should be rare for people to meet for the first time here, or to be part of a "one ops team triages all" type of function.  If a project I'm responsible for falls over at 4am and I get woken up to fix it... odds are I can diagnose it and get it back to serving the business faster than some team whose job is simply to do "operations".

As a side effect, since these are all stakeholders, the tribe Gene talks about forms, and the knowledge of failure and repair stays within the group and maybe, just maybe, gets documented for future use.  Hopefully the failure is documented and we can build resiliency against that type of failure into the application, either now or in the future.

Adam Shostack told me we don't learn enough from our failures in IT.  I wholeheartedly agree, Adam.  Learning to fail and get back up is critical... and I think this makes the DevOps tribe idea that much more crucial and realistic.  I think this is where IT is heading, and watching as an outsider across these different functions, I'm noticing these types of patterns appear.

Enterprise resiliency is a brilliant concept that I'm sure has been talked about before but could not be any more crucial than it is today.  We're at an inflection point, and things must, absolutely must, evolve.

If the agile enterprise is to become a reality, not just something we talk about at conferences and write books about, then it needs to be a core ideal, served by every technical and non-technical function, with products and services that enable it.

The road to the agile enterprise starts with an awakening to DevOps.  Step 1: learning to fail, recover and move on.  Next time, I'll give you some of the how behind this idea.

Cross-posted from Following the White Rabbit
