We started work with a new client today; it appears that their SQL Server Performance is not what they expect it to be. I thought that it would be interesting to use this client as an example on my blog because the problem that they are seeing is not just isolated to them. We have had many clients with the same issues at SQL on Call and we have proven that Performance problems can be fixed. Through this series of blogs I will only refer to them as the client, for a couple of reasons. First and foremost is for the privacy of the client and the chances are that you know who this client is since they are a pretty large website. I can describe the issues that they are seeing so that you can understand where we are coming from and what we are looking at.
Issue: The client has one production clustered database system that has plenty of hardware behind it. They host a website off this server and the website has thousands of web pages. The original complaint was the page to page navigation was slow and when the server was monitored for any bottlenecks the hardware came up clean. The client stated that the slowness that they were experiencing was not occurring all the time but more like 70% of the time. When I logged on to the web page to reproduce the issue we were not able to reproduce the issue at first. I then tried 2 or 3 times throughout the day to see if there was a time when the server was being hit harder than others. I was only able to reproduce and validate the problem a couple of times, it was enough to see that there was an issue, but the issue was not related to time, load, dates or pages being called.
Step One: The first thing that I did as described in the Issue section of this entry was to validate the issue. Even if I was not able to validate the issue there was a concern on the client’s side that there was a problem so I let them know we have seen this in the past and have had successful results.
I had to have facts on how bad was bad. Look at it this way. If you are an accountant working in an office and you are the end user of a database that searches for billing information on the client that takes 10 seconds, then 10 seconds may not be that bad if you know that is what the system is going to behave like. If you look at it from a different angle where you the same account but now your job is to talk to a client who wants to know why his bill is messed up. The client on the other side of the phone is not very happy; he is screaming and yelling because he was double billed. You as the account are trying to pull up his information and now it’s going to take 10 seconds where there is going to be silence on the phone. The end result is that the same 10 seconds can feel like hours in one situation yet in another situation it may be no issue at all. (As an FYI the client we are working on is a call center this is just an example that came to mind.)
An interesting side note here. Once I worked with a company that had a call center and was trying to understand the whole time issue with someone on the phone. I started to think about it and I did a test with my Mother and did not let her know that I was doing it. What I did was I called her and would talk for a few minutes and then I would pause. I had a stop watch with me to see how long the silence would last before she would feel awkward and start talking. The average was about 3 seconds. So the 10 second thing could be a big issue.
My Actions: Since there was the issue of this not being consistent and we were not able to pinpoint any sort of collisions we needed to get answers to a few questions.
What and when, were the changes for 2 weeks before you noticed the problem up to now. If we can get a list of changes that were made we may be able to speed up the process by looking at the individual changes.
- We requested these changes from the client and are expecting a list soon.
Since the performance problems are not consistent we don’t know if the performance counters that they had may have missed any sort of issues that they were experiencing.
- We have set up a Perf Mon to watch the Disk, Memory, Processors, Cache Hit Ratio, along with number of users connected (as a baseline number). We set the perfmon to poll and record once every 60 seconds, this way we can leave the perfmon up and running for an extended period of time.
When was the last time that database was re-indexed.
- We were given the answer of last weekend, but we went in and validated this along with making sure that the statistics were being updated.
Has there been any major data inserts since the last time that database was re-indexed.
- The answer to this was no, nothing out of the normal all the inserts were about the same as what they have been doing for years. If there was an index problem we would have expected the performance problem across the board, but since the hard ware was in good shape there is the possibility that the times when the database was running well that the hardware was able to power through the problems.
At this point we are going to wait 24 hours to see what the perf mon picks up. The client was made aware of the fact that we were just collecting data. In other cases in the past they client could not wait. In those cases we take a little more aggressive actions, the caution at being more aggressive is that you hurt the performance worse by tacking on too much to an already taxed system. We are going to take a look at the Perf Mon in a day or two and validate that we do not have any issues with the hardware.