Recently one of the teams that I am on has been working on an issue with a major hardware manufacturer. This issue has been around for a number of months, and we have slowly been making adjustments, testing, and then making more adjustments. The process has been slow because the server is in production and we do not have the resources to try what we are doing on a test machine first. The only way this could be more difficult is if we were not able to recreate the issue or if the issue were intermittent. Recently the manufacturer has made a few statements that raised some serious questions for me. It appears that they want to make not only a hardware change but a configuration change at the same time. The reason they want to do this, I believe, is simply to get the problem corrected.
From my viewpoint, as I watch these changes being made and as we keep trying one potential fix after another, the faith that the company has in the existing hardware and staff is fading. I can’t help but think that if this whole thing were to start working today, a full post mortem would need to be done. But after a hardware change and a configuration change made together, there are no real pointers to what the actual problem was. If we cannot say for sure what caused the problem, how do we know that it won’t pop back up again? Let’s say that it is a hardware issue. We have many pieces of hardware of that same type. If we deploy more hardware of the same type, with the same bug in it, we could be hit by the same problem twice. And if we get hit by the same problem twice, does it not look as though we did not learn the first time?
The same could be said for configuration issues. I guess the point that I am trying to make is this: if you can’t say that you have resolved an issue and that you know exactly what caused it, then you could very likely see the issue again. Heck, you could see the issue again even if you do know what caused the problem. If you see, or more importantly, if your customers or end users see that you are having the same issues over and over again, what impression are you passing along? I can’t say enough about root cause analysis and about making changes in small increments. At the same time, impacting the customers as little as possible may be the number one priority. If that is the case, maybe making multiple changes at the same time is the better choice, because it corrects the problem sooner.