Archive for April 16, 2012

I have been on call for years. At times there was a declared on-call list, and at other times there was no such list. Either way, I have always been on call: if something is wrong with one of my databases, I am there. It’s my job, my career choice and my responsibility. I actually like having an on-call list, because it gives me a defined set of time when I am not the first line of defense for looking at error messages. Many people look at an on-call list and think “look at all this time I am having to work, change my plans, have my life impacted”. I look at a defined on-call schedule and think, “wow, I only have to work outside normal hours as the first level of troubleshooting on these days. Now I have all this time to go do whatever it is I want.” If there were no defined on-call schedule, I would have to look at every single message and determine whether it was a database issue, and then determine what action to take. If it was not a database issue, I would just have to wonder whether the person responsible for that area was going to look at the problem. In other words, I like a clear definition of when I am the guy who is watching out for what may be threatening, and a clear definition of when I can go camping in some remote location.

On a side note…

Maybe this is an old Marine thing coming out in my personality. I want to know that when I sleep someone has got my back, so when it is my time I will make sure others feel that same comfort. I have been spending a lot of time lately trying to determine why I have certain opinions and where I developed them.

Really what I want to accomplish with this post is a fact-finding mission of sorts. See, I have spent so many years on call, and my responsibilities while on call have been pretty simple. The responsibility has always been that I will be the first line of defense for pages, alerts, and phone calls. The more I think about this, the less I think that I am making the best use of my time. Sure, I may fix simple issues, or make sure that the servers are up and running. But what can I, or what should I, be doing to use the time that I am already spending on call to its fullest potential?

I really am curious what others are being asked to do when they are on-call.  If you have time, please leave a comment and let me know, even if it is something that I have mentioned.

I have had some very simple requirements before. For example, at one time I was asked to make sure that I was checking my email twice a day, but this was well before smart phones and email that follows you around everywhere you go. Now I just check email when my phone thing plays a random assortment of noises. I know this is shameful to admit, but I really do still carry a pager like the one Steve Jones (T|B) has pictured on his blog this week. (In all fairness, some of the places I go when I am not working can be a little out of cell area. And, well, nothing will wake me like the screaming buzz from this pager.)

Something new I am going to try next week when I start the on-call shift for a week is keeping a log.  I am going to keep track of each time someone calls, or I get a page.  My goal with this is to make sure that I am completing issues and communicating the completion of these issues to the source.  So if I get a call from employee xyz that says they cannot search something in the database, I want to know when they called, who called, what I did, and when I let them know the issue was fixed.  The other thing that I hope to accomplish with this log is having a better timeline when it comes to doing a post mortem.  Sure, when I am troubleshooting something I keep notes and copies of the logs, but what tells me when an issue was reported, or when the “all clear” was given?
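To make the idea concrete, here is a minimal sketch of what one entry in such a log could hold, assuming I kept it in a small Python script rather than a notebook. The class, field, and function names are all made up for illustration; the only things taken from the description above are the four facts I want to capture (when they called, who called, what I did, and when I gave the all clear).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class OnCallEntry:
    """One call or page received during an on-call shift (hypothetical layout)."""
    reported_at: datetime                     # when the call or page came in
    reported_by: str                          # who called, or which alert paged
    problem: str                              # what they said was wrong
    action_taken: str = ""                    # what I did about it
    all_clear_at: Optional[datetime] = None   # when I told them it was fixed

    def close(self, action_taken: str, when: datetime) -> None:
        """Record the fix and the time the reporter was told about it."""
        self.action_taken = action_taken
        self.all_clear_at = when

    @property
    def is_open(self) -> bool:
        """True until the all clear has been communicated back."""
        return self.all_clear_at is None

    def time_to_all_clear(self) -> Optional[timedelta]:
        """Elapsed time from report to all clear, or None if still open."""
        if self.all_clear_at is None:
            return None
        return self.all_clear_at - self.reported_at


def open_entries(log: List[OnCallEntry]) -> List[OnCallEntry]:
    """Issues where nobody has been told the problem is fixed yet."""
    return [entry for entry in log if entry.is_open]


def timeline(log: List[OnCallEntry]) -> List[OnCallEntry]:
    """Entries ordered by report time, for use in a post mortem."""
    return sorted(log, key=lambda entry: entry.reported_at)
```

The point of keeping `all_clear_at` separate from `action_taken` is that fixing an issue and telling the source it is fixed are two different steps, and `open_entries` shows me the ones where I have not closed the loop.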

I have learned over the last few years that the impression is so important, and in some cases the impression is more important than the facts. Some people do a really good job of explaining a technical problem to non-technical people, but there are some cases when it does not matter that the whole server was covered in water by the custodian in the datacenter. The customer only knows that the database was down. Who would have guessed that water could short out one of those server things?

So really I am curious as to what I am missing, even if it is not critical, just items that may make life easier.