Critical Decisions During A Crisis

Posted: May 15, 2014 in Career, Database Recovery

Does this sound familiar?

It’s 8:00 PM and you are watching your favorite TV show, Survivor, when out of nowhere you are interrupted by an on-call page. You log on to check the servers, and for some unknown reason “it” has hit the fan and the server is spiraling out of control (by the way, if you don’t know what sound it makes when “it” hits the fan, just say MARINEEE out loud). You grab your DBA hat and some Mountain Dew, and you settle in to take care of the problem and walk away the hero. But it’s not going as planned… It’s now 3:30 AM and for one reason or another you are still working. It appears that all the planets have aligned and you are facing one issue after another. Just as you think you have a handle on the cursed issue that is keeping you awake at this time of night, some new issue shows up and you realize that you will be enjoying your next breakfast at your desk.

If you have been a production DBA for some time, I assume you have a few stories you could tell me. I know I have more than a few that I often share when I am speaking at events. I think these are the moments that define production support staff. It is difficult to fight off the frustration, the lack of sleep and the stress of having a production environment down. Seasoned administrators become the way they are because of these experiences, and they use them to better their skill sets. As the “crisis” is occurring, different decisions are being made…

  • Do you apply a hotfix?
  • Do you reboot the system?
  • Is it time to look at restoring the whole thing?
  • How much do you worry about collecting evidence to find the root cause compared to spending that time correcting the problem?
  • And many, many more.

However, I believe there is one question that is more important than any other, and you need to ask yourself this question over and over again. Are you in the state of mind to make good decisions?

The self-pride that most of us have would default the answer to that question to yes. We don’t want to think there is a point at which we can’t act and react logically. However, we have to remember that we are human, and no matter how much experience we have, we can fall victim to stress. When we do, we may be less likely to apply best practices such as documenting each step we take or double-checking each action before we move on to the next task. The end result is that when a crisis occurs, stress is the natural fallout. As the problem grows you may find yourself needing a backup, or trying to find a backup, and what if that backup is bad… the stress kicks up a notch. Life gets real when you start to operate without a safety net.

Example: A few years ago I was in one of those book studies a company required. The coach was going on about how people don’t make me mad, only I make me mad, and how we can control when we are mad. I tried to explain my point to the coach: human nature comes into it, and there are situations that are simply going to raise emotions and make people mad. I gave the example of my father. I lost my father 21 years ago, just a few days after my 21st birthday, and this is a very sensitive subject for me to this day. At one point many years ago, someone insulted my father. Was I mad? Oh, you bet I was, and I don’t care what anyone says, the person who insulted my father made me mad. I can control my reaction to my emotion, but not the actual emotion itself.

Granted, the example is not a technical one, but with all the emotion going on during that discussion, was that the right time for me to make critical decisions? When stress reaches a new level, you have to stop and ask yourself whether you are in the state of mind to make the critical decisions that need to be made in a time of crisis.

If you answer yes to the question, and you can do so without any doubt, then proceed with applying good practices for troubleshooting and correcting issues in production.

If you answered no, or even “not sure”, then it is time to re-evaluate the situation. The easy answer is that it is time to step away and get someone to take over for you, but what if you can’t? What if you are the only resource, and stepping away means you are just going to prolong the problem? Then what do you do?

  • First, you need to start triple-checking every move you make. You may think the code you are executing is in the right database, but check again. I have dropped a database in production by accident, and I can tell you that the second I hit execute was the same moment I realized I was connected to the wrong server. I will go one step further: I start disconnecting all my sessions, and I will not reconnect until I am ready to execute.
  • After you have checked for the third time, write it down. For years people have been preaching “document what you do”, yet I very rarely find that this actually happens. But if you are in the mindset where you should really be in bed… well, document everything before you do it. Why? Well, the mere act of writing it down may trigger that thought in your head that says, “Wait, is this the correct thing to do?”
  • Nothing helps me more than explaining what I am doing to someone else. It doesn’t even have to be a DBA, or someone who has any understanding of what I am doing. The process of explaining it out loud helps me double-check that I am thinking about all the downstream impacts.
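For anyone who wants to make the first two habits mechanical, here is a minimal sketch in Python. It is an illustration only, not something from a real toolkit: `Target`, `WrongTargetError` and `confirm_target` are hypothetical names, and the sketch assumes you can ask the open session where it is actually connected (on SQL Server, for instance, `SELECT @@SERVERNAME, DB_NAME()` reports the server and current database). The idea is that nothing runs until the session, a hand-typed confirmation, and a written description of the step all line up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A server/database pair (hypothetical helper type for this sketch)."""
    server: str
    database: str

    def label(self) -> str:
        # Compare case-insensitively so "PROD-SQL01" and "prod-sql01" match.
        return f"{self.server}/{self.database}".lower()


class WrongTargetError(RuntimeError):
    """Raised when the session or the operator's confirmation does not match."""


def confirm_target(expected: Target, actual: Target, typed: str, planned_action: str) -> str:
    """Return a note for your run log, or raise if anything does not line up.

    expected       -- where the change is supposed to run
    actual         -- what the open session reports about itself
    typed          -- the operator retyping "server/database" by hand
    planned_action -- plain-language description, written before executing
    """
    # Check 1: the session really is connected where we think it is.
    if actual.label() != expected.label():
        raise WrongTargetError(
            f"connected to {actual.label()}, expected {expected.label()}"
        )
    # Check 2: the operator retyped the target instead of trusting memory.
    if typed.strip().lower() != expected.label():
        raise WrongTargetError("typed confirmation does not match the expected target")
    # Check 3: the step has been written down before it is executed.
    if not planned_action.strip():
        raise WrongTargetError("write down what you are about to do first")
    return f"[{expected.label()}] about to: {planned_action.strip()}"
```

In practice you would call `confirm_target` right before the risky statement and paste the returned note into whatever log you are keeping. Retyping the target by hand is deliberately a little annoying; the friction is the point when you are working at 3:30 AM.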

I am curious: is there something that you do? How do you protect yourself and, just as important, how do you protect what you are working on from mistakes?

 

Comments
  1. I’m fortunate that I work with a large group of talented DBAs. I have the luxury of being able to call someone for a second opinion before I make any major change, and believe me, I use it. We have a long-standing policy that if you need help you call, even if it is in the middle of the night. So on the odd occasion when it’s 3 AM and I’ve decided that yes, this is the correct course of action, I still make a call to one of my team members to double-check me. And you know what? They never object. They know that they can call me the next time they have a problem.

  2. Chris Yates says:

    For me, I guess it largely comes down to what kind of environment you fall in. Are you a single shop, a big shop, what size team? I fall into the Kenneth category; albeit we don’t have 25 DBAs sitting around, we do have more than one. Going back to your post before this one about a Marine unit: we have a cohesive unit who have each other’s six, and I know that if I am stuck I can give a call, and vice versa. It’s always going to be a fine line between when to press on with an issue and when to pull back and figure out what the issue is and where it is coming from. I’ve seen the best stay cool, calm, and collected, and you touched on the key topic ~ decision making. When is the right time to bring in reinforcements, and when is it time to motor through… and Survivor ~ c’mon man 🙂, and high five on the Mt. Dew. Another good one, sir.
