The Resilient Enterprise: Taming Chaos with Automation

Wednesday, June 20, 2012

Rafal Los


In my last post I wrote that Stability is bad for your business... and apparently lots of you agreed.  

I saw some great tweets back and forth talking about how this is a concept many of you practice, and how it's both a terrifying idea and a necessary truth.

I was at day 1 of HP Discover in Las Vegas, and let me start off by telling you that the blog post was still in the back of my mind as I walked the show floor, stood in the back of the crowd, and watched our engineers do demos and explain technology.

The theme was "Make it Matter"... and on that theme I'm trying to make sure I make my blog posts matter to those of you that I met when walking around, and over Twitter. Those of you that are on board with the idea that stability is bad and a little chaos is good are looking for a way to make it real, tangible, and something you can implement.

While I don't yet have a product or service that will actively cause chaos in your organization to strengthen your resiliency, I did find some real ways you can manage the chaos right there on the showroom floor.  As I walked past the HP Software - Cloud pavilion area, I was drawn to the automation demos... chaos has an ugly way of showing up when you're not ready for it, and if you can't respond you're sunk.

Automation is the key here, I'm convinced of that.  Making decisions quickly, at large scale, is critical... what would be even more helpful is if automation took care of the easy decisions for you, and let you make the tough calls that it can't figure out - then learn from the decisions you made to make better ones in the future.  That would be cool... and productive.

Luckily I ran into my pal Mark who manages the demo teams and he showed me something that was essentially this concept in a product.  I don't want to go deep into the guts of Server Automation (SA), Operations Orchestration (OO) and Cloud Service Automation (CSA) but if you're managing large-scale elastic environments you can't live without these tools.  Period.

Here's how this all works, at a high level.  The security team designs policies, at whatever level it's comfortable with, that cover patching, system lock-down, software install/removal and so on, and attaches them to a group.

Now when someone wants to spin up a virtual machine, or a thousand virtual machines, in that group they automatically get an application of all those policies and procedures without having to think about it.  Goodbye direct control, hello governance.  Remember when the security team would do a run at securing every server that was deployed?  
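To make the idea concrete, here's a minimal sketch in Python of policies attached to a group and inherited by every VM provisioned into it. This is not the SA, OO or CSA API; the names `Policy`, `Group`, `VirtualMachine` and `provision` are all hypothetical and exist only to illustrate the concept.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    """A named security policy, e.g. patching or lock-down (hypothetical)."""
    name: str


@dataclass
class Group:
    """A group the security team attaches policies to (hypothetical)."""
    name: str
    policies: list[Policy] = field(default_factory=list)


@dataclass
class VirtualMachine:
    hostname: str
    applied: list[str] = field(default_factory=list)


def provision(group: Group, count: int) -> list[VirtualMachine]:
    """Create `count` VMs in `group`; each one receives every group policy."""
    vms = []
    for i in range(count):
        vm = VirtualMachine(hostname=f"{group.name}-{i:04d}")
        # The requester never picks policies; the group decides.
        vm.applied = [p.name for p in group.policies]
        vms.append(vm)
    return vms


# The security team defines the group once...
prototype = Group(
    "prototype",
    [Policy("patching"), Policy("lock-down"), Policy("software-baseline")],
)

# ...and a department spins up one VM or a thousand without thinking about it.
fleet = provision(prototype, 1000)
```

The point of the design is that the policy decision lives on the group, not on the person asking for machines. That's the shift from direct control to governance.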

Try that when a department needs to spin up 1,000 virtual machines to prototype a new project... never going to happen without this automation capability.  The thing is, as cool as this is we're not the only ones that do this.  We do it better, but we're not the only ones doing it... which is why the story goes on.

Now, chaos strikes.

Someone breaks your environment, or a system has a catastrophic hard disk or other hardware failure... or something else happens - chaos is unpredictable by nature.  The system notices a degradation of performance, or that a system vital on the telemetry monitoring platform has gone wonky, pops an alert, and then goes into action.

If it detects a hardware fault, odds are it can't fix the hardware itself, but it can provision a new virtual machine, move the workload over temporarily, and re-balance... crisis averted.

Whatever the incident, whatever the failure, the system can detect and respond in an automated fashion as long as it's within the realm of known things... when things fail or break in a completely new way that has never been seen before, the system will take corrective action to restore service to the best of its ability, then page a human to figure out the rest.

Now comes the cool part... you can teach the system to take some corrective action based on this new event next time it happens... it's really impressive to watch things fail, automation pick up the failure and correct and respond.  All this happens at the pace of business... and it all strives to not disrupt that business.
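Here's a rough sketch in Python of that detect, respond and learn loop. It's only an illustration of the pattern, not how OO actually implements it; the `Responder` class, the event names and the handler functions are all made up for the example.

```python
from __future__ import annotations

from typing import Callable

# A handler takes the affected host and returns a description of what it did.
Handler = Callable[[str], str]


class Responder:
    """Hypothetical runbook: known events are automated, new ones page a human."""

    def __init__(self) -> None:
        self.runbook: dict[str, Handler] = {}
        self.paged: list[str] = []

    def teach(self, event_type: str, handler: Handler) -> None:
        """Record the corrective action to use the next time this event occurs."""
        self.runbook[event_type] = handler

    def handle(self, event_type: str, host: str) -> str:
        handler = self.runbook.get(event_type)
        if handler is not None:
            # Known failure: respond automatically, no human needed.
            return handler(host)
        # Never seen before: restore service as best we can, then escalate.
        self.paged.append(f"{event_type} on {host}")
        return f"best-effort restore on {host}; paged a human"


def fail_over(host: str) -> str:
    return f"provisioned a new VM and moved the workload off {host}"


responder = Responder()
responder.teach("hardware_fault", fail_over)

print(responder.handle("hardware_fault", "web-01"))  # handled automatically
print(responder.handle("disk_latency", "db-02"))     # new event, pages a human

# A human works out the fix and teaches the system for next time.
responder.teach("disk_latency", lambda host: f"moved {host} to faster storage")
print(responder.handle("disk_latency", "db-02"))     # now handled automatically
```

The runbook is just a lookup from event type to action, so teaching the system a new response doesn't require touching the detection side at all.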

That's how we do things that matter... I'm impressed, and if you get a chance to see some of these SA and OO demos, do it.  I think you'll be impressed too.

Chaos doesn't have to be the end of your enterprise. Embrace it, welcome it, and be ready to detect and respond when it strikes.

Cross-posted from Following the White Rabbit
