Securing Operations
A Functional Analysis of Systems Engineering Today
Author: James Downs
Start with Configuration Management

I was ready to write this post as "Start with Monitoring". That's the usual plan, and I've done that in the past, but the more I've thought about the problem, the more I think monitoring can actually be a secondary service. Monitoring is about making sure the site is up and running, serving customers. But think about it: at a startup, everyone is obsessing about the product. If the site is critical and it goes down, everyone is going to know it. At an established company, if the site goes down and it's critical to the business, everyone is going to know it. On some of the high-traffic sites I've worked on, even with the best monitoring, customers often saw site issues before monitoring did.

Which leads me to my current thinking: monitoring isn't critical, MTTR (Mean Time To Recovery) is critical.

I don't think that statement is revolutionary, but simply knowing something is broken isn't sufficient. You need to know what is broken and why it's broken. Knowing what last changed, simply, easily, and hopefully immediately, is potentially the single most important piece of information.
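
As a rough sketch of what "knowing what last changed" could look like, the snippet below pulls the changes made shortly before an incident out of a change log, newest first. The `changes_before` helper, the log format, and the sample entries are all hypothetical; the assumption is only that changes are recorded somewhere with a timestamp.

```python
from datetime import datetime, timedelta


def changes_before(changes, incident_time, window=timedelta(hours=24)):
    """Return changes made within `window` before the incident, newest first.

    `changes` is an iterable of (timestamp, description) tuples.
    """
    earliest = incident_time - window
    recent = [c for c in changes if earliest <= c[0] <= incident_time]
    return sorted(recent, key=lambda c: c[0], reverse=True)


# Hypothetical change log, as a configuration management system might record it.
change_log = [
    (datetime(2013, 5, 1, 9, 0), "web: bump app to 1.4.2"),
    (datetime(2013, 5, 2, 14, 30), "db: raise max_connections to 500"),
    (datetime(2013, 5, 2, 16, 45), "lb: enable new health check"),
]

incident = datetime(2013, 5, 2, 17, 0)
for when, what in changes_before(change_log, incident, timedelta(hours=4)):
    print("%s %s" % (when.isoformat(), what))
```

The most recent change is listed first because it is usually the first suspect.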

I've watched a number of companies skip the work of configuration management because, you know, "agile", or [insert excuse]. The resulting scramble to figure out what got pushed, what got changed, or which settings differ from the other members of the application fleet shows that all the monitoring in the world doesn't tell you about those deltas.
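
To make "those deltas" concrete, here is a minimal sketch that compares settings across fleet members and reports where a host differs from what most of the fleet has. The host names, setting keys, and the `find_drift` helper are hypothetical, and a real configuration management tool does far more than this.

```python
from collections import Counter


def find_drift(fleet):
    """Return {host: {key: (actual, expected)}} for settings that differ
    from the most common value across the fleet.

    `fleet` maps a host name to a dict of setting name -> value.
    A setting missing on a host is treated as the value None.
    """
    all_keys = set()
    for settings in fleet.values():
        all_keys.update(settings)

    drift = {}
    for key in sorted(all_keys):
        values = [settings.get(key) for settings in fleet.values()]
        # The "expected" value is simply whatever most hosts have.
        expected, _count = Counter(values).most_common(1)[0]
        for host, settings in fleet.items():
            actual = settings.get(key)
            if actual != expected:
                drift.setdefault(host, {})[key] = (actual, expected)
    return drift


# Hypothetical fleet: web3 was hand-edited and never reconciled.
fleet = {
    "web1": {"max_clients": 256, "app_version": "1.4.2"},
    "web2": {"max_clients": 256, "app_version": "1.4.2"},
    "web3": {"max_clients": 512, "app_version": "1.4.1"},
}

for host, deltas in find_drift(fleet).items():
    for key, (actual, expected) in sorted(deltas.items()):
        print("%s: %s is %r, fleet has %r" % (host, key, actual, expected))
```

Treating the majority value as "expected" is a shortcut for the sketch; with configuration management in place, the expected value comes from the declared configuration instead.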

It just gets worse as the company gets larger.

If you don't have configuration management across your infrastructure, Get Some.

If you love Ruby, check out Chef or Puppet.
If you're into Python, check out SaltStack or bcfg2.
And for the old-school feel, check out CFEngine.

tags: bcfg2, cfengine, chef, configuration management, ops, puppet, saltstack
Using the HP Cloud Beta

With the Boto Toolkit

Boto works out of the box, just by overriding defaults in the connection setup. Not every feature is supported, but instance features seem to work, as do security groups and elastic IPs.

HP Cloud is running OpenStack Nova and API.

Compute Resources

Signup here:

Log into the console:

  • Activate Compute Regions

  • Go to Account Tab
    • Go to Your API Keys
    • You will need: AccessKeyID, SecretKey, TenantID, Endpoint
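  • In your Python Console
    import boto
    from boto import ec2
    region = ec2.regioninfo.RegionInfo(name=hpregion, endpoint=hpendpoint)
    # Assumed from here on: standard boto calls; hpaccesskey, hpsecretkey, keyname, and image_id
    # are placeholders, and HP Cloud may need extra connection arguments (TenantID, port, path)
    conn = boto.connect_ec2(hpaccesskey, hpsecretkey, region=region)
  • At this point, you should have an active connection to the HP Cloud
    # List all available machine images
    images = conn.get_all_images()
    # Create a keypair, save key.material into a local ssh keyfile, chmod 600
    key = conn.create_key_pair(keyname)
    print(key.material)
    # Launch an instance (64 bit CentOS)
    reservation = conn.run_instances(image_id, key_name=keyname)
  • After a short while, the instance will be available (you can also watch the status in the dashboard)

  • Gather the information for the instance.
    # Get the first reservation from all that are running
    reservation = conn.get_all_instances()[0]
    # Get the first instance for this reservation
    instance = reservation.instances[0]
    # And finally, this instance's IP address
    print(instance.ip_address)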
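
The "after a short while" step can be scripted rather than watched. A minimal sketch, assuming `instance` is a boto instance object such as the first instance of the reservation above, and relying on boto's `Instance.update()`, which refreshes the instance and returns its state. The `wait_until_running` helper, its timeout, and its polling interval are illustrative choices, not part of boto.

```python
import time


def wait_until_running(instance, timeout=300, interval=5, sleep=time.sleep):
    """Poll instance.update() until the state is 'running'.

    Returns True once the instance is running, False if `timeout`
    seconds of polling pass first. `sleep` is injectable for testing.
    """
    waited = 0
    while True:
        state = instance.update()  # asks the API for the current state
        if state == "running":
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

It would be called as `wait_until_running(instance)` right after launching, before reading the IP address.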

Block Storage

Signup at this url:

(waiting for approval)

tags: HP, boto, cloud, python
Lean Operations

Operations is rarely in the same enviable position in which development organizations find themselves. Operations generally must run and support current infrastructure and systems, and doesn't have the luxury of starting over or rebuilding from scratch.

Lean, with its emphasis on continuous improvement, especially incremental improvement, leads to the consideration of Lean Operations.

A possible first step is the introduction of Kanban.

However it's configured, the traditional large physical board with cards indicating tasks achieves a number of immediate goals:

Reveals Tasks: Organizations bogged down with interrupts and tasks need to expose the workload. It's likely that no one knows everything that is in the works. Putting it all onto a single board can help expose the to-do list of everyone on the team.

Reveals Work in Progress: Equally important is showing how many tasks are in various states of work. Limiting work in progress (WIP) is a common practice with Kanban. Too often, to-do lists become lists of things that never get done. Items that stay "in progress" for too long draw attention and can be addressed directly.

Communication Across the Group: When everyone can see what's going on, it's possible to move tasks from overloaded people to less loaded people. You can reduce duplication of effort, and help out especially with items that are holding people up.

Group Visibility: There's nothing like a large board covered with slips of paper to call attention to the amount of work your group has to do. Visibility is also important for changing culture. It's something for people to gather around and talk about. In my experience, it has always resulted in another group getting their own board and trying it out for themselves.

A physical board is also easy to reconfigure, while standing near it and talking about it. A feedback loop is critical to incremental improvement, and regular discussions of what's working and what needs fixing are important. Each change should have a goal, which should then be assessed. Some will work out, and some won't. Equally, what works for one group or at one company may work for you, or it might not.

The point is Incremental Improvement.

Tracking improvement means tracking some metrics. A couple of guys I used to work with loved the quote:

"That which is measured, improves."

It's a great quote, but it's not the whole story. Just as one can measure a bank account balance without necessarily seeing it "improve", it's possible to gather metrics on things that don't matter, or to find causality that doesn't exist (pirates vs. temperature). Metrics can also be gamed: measuring code check-ins results in trivial commits instead of a real measurement of code written.

In terms of Kanban, Lean, and Agile, you want to measure things like tasks, or stories, completed per week, and how long they take to go through the board. It might also be useful to track how much time tasks "sit around", not being worked on.
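
These three measurements are easy to compute once the dates on each card are written down. A minimal sketch, assuming hypothetical card records with `created`, `started`, and `done` dates; the field names and sample cards are made up for illustration.

```python
from collections import Counter
from datetime import date


def wait_days(card):
    """Days a card sat on the board before anyone started it."""
    return (card["started"] - card["created"]).days


def cycle_days(card):
    """Days from starting work on a card to finishing it."""
    return (card["done"] - card["started"]).days


def weekly_throughput(cards):
    """Count finished cards per ISO (year, week number)."""
    return Counter(tuple(card["done"].isocalendar())[:2] for card in cards)


def average(values):
    return sum(values) / float(len(values))


# Hypothetical cards transcribed from the board.
cards = [
    {"title": "rotate logs", "created": date(2013, 3, 1),
     "started": date(2013, 3, 4), "done": date(2013, 3, 6)},
    {"title": "patch kernel", "created": date(2013, 3, 4),
     "started": date(2013, 3, 5), "done": date(2013, 3, 8)},
    {"title": "add disk alerts", "created": date(2013, 3, 4),
     "started": date(2013, 3, 7), "done": date(2013, 3, 14)},
]

print("average wait:  %.1f days" % average([wait_days(c) for c in cards]))
print("average cycle: %.1f days" % average([cycle_days(c) for c in cards]))
for (year, week), count in sorted(weekly_throughput(cards).items()):
    print("%d week %d: %d finished" % (year, week, count))
```

Watching how these numbers move from week to week is more useful than any single value.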

Of course, a human factor should still be considered:

"Deming is often incorrectly quoted as saying, "You can't manage what you can't measure." In fact, he stated that one of the seven deadly diseases of management is running a company on visible figures alone."

In order to be embraced, the system needs to be accepted by the entire team (or company!) and is best evolved with the involvement of everyone. If people don't believe in the change, or the need for change, the system won't be completely embraced, and it won't work.

For more background and a concise introduction, a colleague pointed me to this book:

It's a quick read, and it's a good view of how one company has used Kanban on their project. There are also sections highlighting some of the specific concepts.

To take some headings from the book:

Learn Constantly; Engage Everyone; Solve Problems, Not Symptoms.

tags: kanban, lean, ops
AWS Outage and Operations

Anyone running in EC2 or with services in AWS should now be convinced that they can't do without "operations". Even with all the advantages of outsourcing data center space and hardware, and the savings on related personnel, it should be obvious that "systems engineering" is absolutely critical, and possibly more so in "the cloud" than ever before.

If you're not thinking about (and doing) systems engineering... you probably experienced an outage. And then you definitely had "NoOps".

UPDATE: Joe Stump's article, a little over a year later, is even more relevant:

tags: aws, ec2, noops, ops, system engineering
Upgrading Operations

The time to launch for software development, especially web development, has been progressively diminishing.

As I asked at the end of my previous post:
What (other than leaving out certain kinds of features) lets web and application developers develop at a faster rate?

Some keys to quick development are Frameworks and Libraries, as well as Methodologies and Practices.

You can probably quickly name a number of web development frameworks. Can you name any operational frameworks?

How much have development frameworks accelerated development?

Looking at one popular framework: From nothing to a CRUD-backed form in Rails is around 9 commands.

Rails is widely credited with accelerating development. It's easy to use, easy to learn, and even if you suffer performance problems from default options, it's easy to optimize specific parts of an application. "You spend time on features that get used, and you spend no further time optimizing ones that are not."

Rails also comes bundled with JavaScript libraries that help speed user interface development, and gives you AJAX features from the beginning. It's also possible to swap out the defaults for something more to your liking.

What methods and practices have helped accelerate development?
Some examples include: Agile Development, Minimum Viable Product (MVP), Continuous Integration (CI), Continuous Deployment (CD)

How do we do the same for Operations?

Some operational practices have always been focused on automation. "If I have to do it more than once, I write a script." This is central to good operations, and it can't stop there. Configuration management software has been around for decades, but it's not used widely enough. Operations needs to accelerate in the same ways that web development has accelerated.

Operations needs better Frameworks and Libraries, as well as Methodologies and Practices.

There are some available. Boto provides an interface to AWS, and it's a good library if you're doing CloudOps with Amazon. There's Clusto if you're running your own hardware (though it isn't limited to that). Both of them take a fair amount of work to use in a specific environment. Is either one an operational framework?

How about practices?

Too many engineering departments don't treat operational concerns as first-order problems. How many applications are deployed without sufficient metrics and/or alerting, logging, easy deploys, or administrative features? If the environment you work in promotes a "throw it over the wall" mentality, if the people who wrote the code don't get alerts for it, then You're Doing It Wrong. This is how distrust, and walls between "operations" and "engineering", get started. If you really believe in any of the new buzzwords, "DevOps", "NoOps", etc., then you're really embedding some "operations" with each of your development teams.

And if you really believe that running "in the cloud", or on AWS, means "NoOps", just realize that you've decided to outsource some operations. You might as well argue that because ADP prints your checks, or runs your direct deposit, you have "NoPayroll".

What about methodologies?

DevOps is a methodology, not a team. It's a mentality or methodology that should permeate an entire engineering organization.

Agile Operations

With all the focus on Web Developers, what have the Systems Engineers and Operations folks been up to?

tags: devops, ops, sysadmin, webops