The Science of Trouble Shooting

Posted: January 30, 2012 in Interviews, SQLServerPedia Syndication

Over the last 6 months or so I have spent more time trying to find the cause of a problem more than anything else.  The results are interesting, and I am starting to see a pattern that is proving to be successful.  It doesn’t matter if I am working on a performance issue, or just trying to figure out why something is not working.  The most common problem that I see time and time again is performance. For the most part as soon as you get the system up and running or features configured, they pretty much stay that way.  However, with performance it changes. It changes when the data changes; new releases can be one cause; another common cause is business.  When I say business, I am referring to growing your data, or the changes in services your company is offering (say for example your company has decided to see to a new market share, and the sales doubled). The short of it is when a business is not growing and it is just staying static there isn’t much long term hope for it.  Stock prices alone take drastic up turns and down turns just based on growth percentages. 

So you have been presented with a problem, let’s say it is a performance problem and you are hearing about it from someone at the help desk.  The complaint in the ticket says that the database is not performing like it should.  The rest of the information you receive from the ticket is not very helpful.  You have not been given the complete picture, you don’t even know what part of the database is slow, so how do you treat it as professional you are?  I have a few tips for you that I hope will help.  I have developed these tips over years of learning the things the hard way.

  1. Always get the description of the problem from the source.   Skip the help desk and the ticket system. If you can, go sit down with the person who is reporting the problem.  Let them show you where they see the performance problem. Let them show you how this problem creates problems in their life. A query that takes 3 seconds to return to a customer service person may not be considered a big deal.  But wait until you are on the phone with a customer screaming their head off wanting to know why they were charged an extra $3.00.  The 3 seconds is pure hell for the person answering the phone.  If you can make this a lot faster, then you can make the job a bit easier. By learning what else can be done, and then fixing these problems you are now the hero.
  2. I think it was 6th or 7th grade where I learned all about the scientific process, the part where you hypothesize, then run your tests, document your results, and then draw your conclusion.  I remember when I was learning about this, thinking when the heck I am going to use this.  I had no plans on being a scientist, so I just didn’t see where this was relevant.  Now, my work does not include cutting up frogs, but I have found a way to use those old science class skills.  When you are troubleshooting a problem make sure that you know what you are testing, what you want to test, run your test, measure your test and then record it.  Once you are done document your results and determine if you were right or wrong.  If you’re wrong, start again.  But make sure that you are only testing one thing at a time.
  3. Remove many potential issues in a test as quickly as you can.  Let’s reflect back to poor performance –  you need to determine as quickly as you can if it is the database. If it is, then where is that problem in the database?  One of the first places that you are going to want to look is your wait times.  Run a query against the sys.dm_os_wait_stats DMV.  If you can isolate a process like returning a client, and you can isolate that on a specific server,  stop all the other traffic , measure the wait stats, record them, run your process and then run them again.  The delta between the two runs will give you a good head start on where you should look

One thing to keep in mind, when you use this method to eliminate many potential problems, or even identify problems, be sure that you understand the complete problem.  Let’s say you have a high number of disk waits. These could be caused by many reasons.  Many DBA’s may start to point at the storage as part of the problem; however you may be setting yourself up to fail.  A lot of disk reads could indicate poor indexing, or poorly written queries. 

The way that you trouble shoot problems is going to say a lot about you.  If you keep calm and document well developed thoughts, I think your problem will be behind you in no time!  But if you sit around pointing fingers at everyone else and then freaking out when it might be a database issue, then you are going to have a stress filled day.  If you don’t change your process, it will be filled with a lot of stress for your career.  I would love to hear what trouble shooting technics you use, drop one of two in the comments section for me.

Comments
  1. Chris Yates says:

    Point one is key to me, a lot of times the issue that comes across on a ticket is not the true issue. It amazes me how “convoluted” things become as they trickle from person to person. This was a great read.

  2. spottedmahn says:

    Great post!

    I too use the scientific approach I learned in school as a young person all the time. Thanks for sharing.

    Michael DePouw

    • Chris Shaw says:

      Thanks SpottedMahn,

      I am often suprised when people miss some of the basic best practices. Something I should have mentioned when I was thinking about point number 3, is to not make to many changes at one time. It is always a wise idea to know what caused any one specific problem. If there is more than on change at a time, granted the issue may be fixed, but knowing what fixed it might be harder to figure out.

      Chris

  3. John says:

    Using scientific principles and practise when troubleshooting is something I do all the time. Its worth noting that even if a problem can’t be easily reproduced, or the user can’t remember application error messages, the following sources of information can be useful in tracking down the problem, each with different levels of detail:

    1. Application specific error messages and log files etc – the only one likely to be written in terms the end user is likely to understand.
    2. Windows event logs on client PC or web server if a web app.
    3. If app is web based, web server (IIS/Apache) error logs
    4. Database server (SQL Server or other platform) error logs
    5. SQL Server background trace data
    6. Windows event logs on the database server

  4. Matt says:

    I ask these questions:
    1. What action(s) did you take?
    2. What did you expect to happen?
    3. What did you observe?
    This usually gets me a lot better understanding of the sequence of the events than I’m initially provided with, e.g., “Application x isn’t working”.

    • Chris Shaw says:

      Matt,

      I really like the answer, “What did you expect to happen”, one of the issues that I think DBA’s have to work with each day is making sure that we set expectations correctly. I have worked in a couple places where they buy hardware that is a little older or may not be fully featured, some of these are just the impacts of a smaller budget. However, I find that many times the expectations are not in line with reality.

      If you buy a laptop on sale at Wal-Mart for $400.00 I would hope people would not expect it to perform like a $2,000.00 laptop. Exceeding expiations is a good thing most times. I am not saying that Wal-Mart has slow performance systems. I am just trying to relay that the expectations need to be set correctly.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s