Does this sound familiar?
It’s 8:00 PM and you are watching your favorite TV show, Survivor, when out of nowhere you are interrupted by an on-call page. You log on to check the servers, and for some unknown reason “it” has hit the fan and the server is spiraling out of control. (By the way, if you don’t know what sound it makes when “it” hits the fan, just say MARINEEE out loud.) You grab your DBA hat and some Mountain Dew, and you settle in to take care of the problem and walk away the hero. But it’s not going as planned… It’s now 3:30 AM and for one reason or another you are still working. It appears that all the planets have aligned and you are facing one issue after another. Just as you think you have a handle on the cursed issue that is keeping you awake at this time of night, some new issue shows up and you realize that you will be enjoying your next breakfast at your desk.
If you have been a production DBA for some time, I assume that you have a few stories you could tell me. I know I have more than a few that I often share when I am speaking at events. I think these are the moments that define production support staff. It is difficult to fight off the frustration, the lack of sleep and the stress of having a production environment down. Seasoned administrators become the way they are because of these experiences, and they use them to better their skill sets. As the “crisis” is occurring, different decisions are being made…
- Do you apply a hotfix?
- Do you reboot the system?
- Is it time to look at restoring the whole thing?
- How much do you worry about collecting evidence to find the root cause compared to spending that time correcting the problem?
- And many, many more.
However, I believe there is one question that is more important than any other, and you need to ask yourself this question over and over again. Are you in the state of mind to make good decisions?
Pride leads most of us to answer yes by default. We don’t want to think there is a point at which we can no longer act and react logically. However, we have to remember that we are human, and no matter how much experience we have, we can fall victim to stress. When we do, we may be less likely to apply best practices such as documenting each step we take or double-checking each action before we move on to the next task. Stress is the natural fallout of a crisis. As the problem grows you may find yourself needing a backup, or trying to find a backup, and what if that backup is bad… the stress kicks up a notch. Life gets real when you start to operate without a safety net.
Example: A few years ago I was in one of those book studies a company required. The coach was going on about how people don’t make me mad, only I make me mad, and how we can control when we are mad. I tried to explain my point to the coach: human nature comes in, and there are situations that are going to raise emotions and make people mad. I provided the example of my father. I lost my father 21 years ago, just a few days after my 21st birthday, and this is a very sensitive subject for me to this day. At one point many years ago, someone insulted my father. Was I mad? Oh, you bet I was, and I don’t care what anyone says, the person who insulted my father made me mad. I can control my reaction to my emotion, but not the actual emotion itself.
Granted, the example is not a technical one, but with all the emotion I was feeling at that moment, would it have been the right time for me to make critical decisions? When stress reaches a new level, you have to stop and ask yourself if you are in the state of mind to make the critical decisions that need to be made in a time of crisis.
If you answer yes to the question and you can do so without any doubt, then proceed with applying good practices for troubleshooting and correcting issues that are in production.
If you answer no, or even “not sure”, then it is time to re-evaluate the situation. The easy answer is that it is time to step away from the situation and get someone to take over for you, but what if you can’t? What if you are the only resource, and stepping away means you are just going to prolong the problem? Then what do you do?
- First, you need to start triple-checking every move you make. You may think the code you are executing is in the right database, but check again. I have dropped a database in production by accident, and I can tell you that the second I hit execute was the same moment I realized I was connected to the wrong server. I will go one step further and say that I start disconnecting all my sessions, and I do not reconnect until I am ready to execute.
- After you have checked for the third time, write it down. For years people have been preaching “document what you do”, yet I very rarely find this to be the actual practice. But if you are in the state of mind where you should really be in bed… well, document everything before you do it. Why? The mere act of writing it down may trigger that thought in your head that says, wait, is this the correct thing to do?
- Nothing helps me more than explaining what I am doing to someone else. It doesn’t even have to be a DBA, or someone who has any understanding of what I am doing. The process of explaining it out loud helps me double-check that I am thinking about all the downstream impacts.
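The first two habits in the list above (check where you are connected, and write the step down before you run it) can be partly automated. What follows is only a rough sketch in Python, not a finished tool. The function name `guarded_execute`, the `runbook` list and the server names are all hypothetical, and how you find out which server and database a session is really connected to depends on your platform, so the sketch simply takes that information as an argument.

```python
def guarded_execute(statement, expected, actual, execute, runbook):
    """Run `statement` only if the session is connected where we think it is.

    expected / actual: (server, database) tuples. `actual` must come from the
        live session itself, not from memory (how you query it is up to you).
    execute: any callable that takes the statement and runs it.
    runbook: a list used as a simple written log of each step.

    All names here are hypothetical and for illustration only.
    """
    # Write the step down BEFORE doing anything, as the list above suggests.
    runbook.append(f"PLAN: {statement!r} on {expected[0]}/{expected[1]}")

    # Compare case-insensitively; many server and database names ignore case.
    wanted = tuple(part.lower() for part in expected)
    found = tuple(part.lower() for part in actual)
    if wanted != found:
        runbook.append(f"ABORTED: connected to {actual[0]}/{actual[1]}")
        raise RuntimeError(
            f"Refusing to run: expected {expected}, but connected to {actual}"
        )

    result = execute(statement)
    runbook.append(f"DONE: {statement!r}")
    return result


if __name__ == "__main__":
    runbook = []
    try:
        # We believe we are on dev01, but the session is really on prod01.
        guarded_execute(
            "DROP DATABASE scratch",
            expected=("dev01", "scratch"),
            actual=("prod01", "scratch"),
            execute=print,  # stand-in for a real database call
            runbook=runbook,
        )
    except RuntimeError as err:
        print(err)
    print(runbook)
```

A guard like this does not replace slowing down and reading the statement again, but it turns “I thought I was on the other server” into an error message instead of an outage, and it leaves a written trail behind.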
I am curious, is there something that you do? How do you protect yourself and, just as important, how do you protect what you are working on from mistakes?