When you work in software operations, there are plenty of practices and tools you can use to be more efficient at your job: keep all of your scripts and code in source control (just like developers do!), maintain proactive monitoring, work closely with your development team on operational requirements, read DevOps on Windows (ahem), and much more.
But even before all of that, what’s one task operators have to do frequently? Make changes to the production environment. And if you take away all the fancy tools and processes around implementing and managing these changes, there are three simple, non-technical steps you can take to help ensure a successful, and hopefully panic-free, change implementation.
1. Plan what you’re changing upfront
This may sound obvious, but doing something as simple as putting your change plan down in writing can be a huge help. Let’s say you’re responsible for moving some application from one server to another. You’ve done it a hundred times, in production or otherwise, but next time, before you do it, try writing it down. Your document doesn’t have to be some big bureaucratic process or a huge, highly polished document. It can be as simple as a list of what you’re going to do to implement a given change. And it doesn’t have to be the process document from now until the end of time for all cases, just what you plan to do this time.
Of course, ideally a task like moving an application isn’t a terribly complicated, multi-step, manual procedure. But either way, writing your plan down in a document (on a wiki, in some internal document management system, in an issue tracking ticket, or wherever works for you) is the perfect way to organize and solidify your thoughts on the process. And it has two great side effects – it can help eliminate a little bit more of your organization’s bus factor, and it can help create or add to an internal knowledge base if your plan document is in a location that is accessible to the whole operations team.
2. Plan for rollback
Bad Things™ happen, so always have a rollback plan ready ahead of time. In my experience, very few changes are completely irreversible. (And if and when changes are irreversible, you better damn well make sure that’s crystal clear to your team and all the stakeholders!) It’s only to your benefit to think about and document a rollback plan. In the heat of the moment, when something goes wrong during a change implementation, it’s nice to be able to refer to a document or procedure that clearly states your path out of the mess and at least brings the system back to the state it was in prior to the change.
And if you took the time to plan up front to begin with, odds are your rollback plan is basically already written – just reverse the steps in your original plan. Again, it doesn’t need to be a highly polished rollback procedure for all cases until the end of time, just a rollback plan for the task at hand. Of course, not every change is trivial enough for a quick document, but hopefully you get the idea of how planning out the change up front can guide a rollback plan.
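To make the "just reverse the steps" idea concrete, here is a minimal sketch in Python. It assumes each written step comes paired with a note on how to undo it; the `Step` class, the `rollback_plan` function, and the server names are all hypothetical, made up for illustration rather than taken from any real tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One line of a written change plan (hypothetical structure)."""
    action: str  # what you plan to do
    undo: str    # how you would reverse it


def rollback_plan(steps: List[Step], completed: Optional[int] = None) -> List[str]:
    """Return the undo notes, newest first, for the steps already carried out.

    `completed` is how many steps were finished before things went wrong;
    None means the whole plan was executed.
    """
    done = steps if completed is None else steps[:completed]
    return [step.undo for step in reversed(done)]


# Illustrative plan for moving an application between two made-up servers.
plan = [
    Step("Stop the app service on old-server", "Start the app service on old-server"),
    Step("Copy app files to new-server", "Delete app files from new-server"),
    Step("Point DNS alias at new-server", "Point DNS alias back at old-server"),
]

for line in rollback_plan(plan):
    print(line)
```

Passing `completed=2` would return only the undo notes for the first two steps, which is the situation you are in when a change fails partway through.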
3. Validate what you changed
In test-driven development, a common (nay, required?) practice when a production bug is discovered is for the developer to first write a unit test that reproduces the issue. Of course, the test will fail against the current production version’s code base because the issue is still in the code. Then, once the bug has been fixed and the test passes, the developer has greater confidence that the issue is truly resolved because their unit test validated the change for them.
In software operations, you can apply a similar concept to your system changes. How can you know your change was successful if you can’t validate it? Do you just hope it worked? Do you wait for a user to call and report the issue again? Ask questions of your teammates, system owners, or subject matter experts, no matter how stupid the questions sound; simply by validating your changes, you can increase your own confidence and the quality of your work.
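One way to borrow the failing-test idea for an operational change is sketched below in Python. It assumes you can express "did it work?" as a yes/no check; `validated_change`, the two helper functions, and the `environment` dictionary are hypothetical stand-ins for whatever your real check and change would be (a health URL, a service status, a config value).

```python
from typing import Callable


def validated_change(check: Callable[[], bool],
                     apply_change: Callable[[], None]) -> bool:
    """Run the check before and after the change.

    Returns True only when the check failed beforehand and passes
    afterwards, mirroring a unit test that fails until the fix is in.
    """
    failed_before = not check()   # the check should not pass yet
    apply_change()
    passes_after = check()        # now it should
    return failed_before and passes_after


# Stand-in for the real environment, purely for illustration.
environment = {"app_host": "old-server"}


def app_is_on_new_server() -> bool:
    return environment["app_host"] == "new-server"


def move_app() -> None:
    environment["app_host"] = "new-server"


print(validated_change(app_is_on_new_server, move_app))
```

Running the check before the change matters: a check that already passes tells you nothing about whether your change did anything.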
Again, these steps may sound like common sense, but a little bit of reflection on what you’re actually doing can go a long way towards increased quality when it comes to making changes to a software system.
And as an operator, always remember – your number one goal is the stability of the production environment, which ensures your organization can keep doing what it needs to do. You accomplish this goal not by blocking change to the environment, but by enabling it.