DR/HA Testing

Posted: July 2, 2012 in Database Recovery, SQLServerPedia Syndication

Consistently, my favorite projects over the last 15 years have been the HA/DR projects.  It doesn’t matter if it is Windows clustering, transaction log shipping or mirroring.  I have implemented many different solutions, and I am very close to putting SQL Server 2012 AlwaysOn into production.  My current project is a mirroring solution for a company.  Mirroring was the solution of choice because of the speed of the failovers and the ease of tying applications into mirroring.

The technology part of these solutions has a lot of supporting documentation to assist you in implementing it. If you are looking for something outside of the standard Books Online or Microsoft training videos, there are tons of blogs, articles and in-person training sessions to choose from.  The technologies are not all that difficult to implement. However, there are some aspects where experience helps you navigate around obstacles.  One of these is the point of my blog post today.

One of the key aspects of a technology solution that meets the DR/HA requirements is testing the solution to make sure it works when you are counting on it working.  If you are not testing your solutions, I am afraid that you may end up in a situation that is less than optimal for the requirements.  How the solution behaves at that moment shapes the organization’s overall opinion of it. For example…

My mother is not too sure what I do for a living, and well, that is OK.  I try to explain it, but once we get to some of the more technical aspects she asks me questions and, well, she gets lost in the explanation (as do I when she talks about investment brokers; I hear blah, blah, blah).  She knows that the database stores the data that supports the applications.  Yet often, when she is watching TV or working on her mobile phone and something doesn’t work, she thinks it is due to a database issue.  My wife is the same.  If we are shopping in a store and there is a misspelling on an LED scrolling sign, she jumps to the conclusion that it is because their DBA is not doing their job. (On a side note… I sure do appreciate the confidence in my ability, but that sure is a lot of expectation. When my printer doesn’t work, it goes in the trash; I don’t have the experience to fix it, or the time.)

When the company is expecting a failover to happen, and an issue that is not even related to the database keeps the project or the failover from working exactly as planned, can we expect the non-technical parts of the company to understand the technical issue behind the problem?

They only know one thing: the failover did not work, and the company is down.  If the power goes out at the DR facility and the facility is offline, or someone doing maintenance on a rack in the datacenter unplugs a router, a non-SQL issue can leave the impression that there is a problem with the failover.  Most people only care about one thing:

Does it work?

In short, does it really matter to the rest of the organization why a system is down or why a process has failed? Sure, they want the problem addressed so that it can be avoided in the future. But to a marketing person or a customer service manager it does not matter if a datacenter had a power issue or if the database did not come online like it should. Both problems look the same to them: broken.

My point, you may ask?

For me the most difficult part of putting a solution in place is expressing to the rest of the organization how important it is to test it, and how important it is to retest on a regular basis.  There are so many parts of the organization that need to know what is going on, what to expect, and how long everything is going to last.  The overall impression depends not only on everyone having their expectations correctly set, but on many people coming together to complete the many tasks that lead to a successful test. Without all of the different groups in the organization supporting and validating the test, how can anyone be sure that the complete solution works for the whole company?

I want to make sure that I express that the difficult part of doing an HA/DR solution is not putting the solution in place.  The difficulty lies in testing, and continued testing, to make sure that each and every time the behavior is as expected.  If you tested your solution once, maybe it is time to schedule another test.  If you put something in place, did you test it?  Did you test all aspects of it?  When I speak at events, I often ask how many people have done a full end-to-end test of the restore process.  Sure, as a DBA you need to be able to restore the database, but where did that backup file come from? Was it off-site?  How many people had to be included for a full test, and how long did it take?  Do you have to wait an hour for someone to drive to an off-site storage facility at 1:00 AM before the tape even gets to the point where someone on your staff can touch it?
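If you want to put numbers on those questions, one option is to write down when each step of a drill started and finished, and then look at where the time went. Here is a minimal Python sketch of that idea. The stage names, the timestamps and the `summarize_drill` helper are all made up for illustration; they do not come from a real drill or from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillStage:
    """One step of a restore drill, e.g. fetching the tape or running the restore."""
    name: str
    started: datetime
    finished: datetime

    @property
    def duration(self) -> timedelta:
        return self.finished - self.started


def summarize_drill(stages, last_backup_at, failure_at):
    """Return total elapsed time, the slowest stage, and the data-loss window."""
    total = max(s.finished for s in stages) - min(s.started for s in stages)
    slowest = max(stages, key=lambda s: s.duration)
    # Anything written after the last usable backup is lost in a restore.
    data_loss = failure_at - last_backup_at
    return total, slowest, data_loss


# Hypothetical drill: every timestamp below is invented for this example.
failure = datetime(2012, 7, 2, 1, 0)
t1 = failure + timedelta(hours=1)
t2 = t1 + timedelta(minutes=40)
t3 = t2 + timedelta(minutes=45)
t4 = t3 + timedelta(minutes=30)

stages = [
    DrillStage("drive to off-site storage", failure, t1),
    DrillStage("load tape and copy backup", t1, t2),
    DrillStage("restore database", t2, t3),
    DrillStage("validate applications", t3, t4),
]

total, slowest, data_loss = summarize_drill(
    stages,
    last_backup_at=failure - timedelta(hours=3),
    failure_at=failure,
)

print(f"Total time to recover: {total}")
print(f"Slowest stage: {slowest.name} ({slowest.duration})")
print(f"Potential data loss window: {data_loss}")
```

In this made-up drill the slowest stage is the drive to off-site storage, not the restore itself, which is exactly the kind of thing you only find out by running the whole process end to end.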

Do yourself a huge favor: answer all of these questions before you reach the point where you are relying on a plan that has not been tested.

  • Does the solution work?
  • What resources (buildings, power, data centers, backups…) do you need physical access to?
  • What resources are needed at all steps? (before, during and after)
  • What is the potential data loss?
  • How long is the impact to the company? To the customer?
  • What are the obstacles that need to be addressed?
    • How are these obstacles addressed?
  • Is the cost of the solution and the risk/impact ratio acceptable to the company?
  • Are the steps to failover/recovery documented?
    • Not just the obvious steps, such as get the backup tape and start the restore, but steps such as here is the call list and here is where we meet. The most important step, and one I have only seen mentioned once in a DR document, is validating that the people involved are safe and that it is safe for them to work. I could not imagine having to worry about my family and home while work is knocking at the door.
  • What needs to be addressed first? Are some applications more important than others?
  • Do the users of the database understand the risk?
  • What is the worst case scenario?

    Just think. I live in Colorado; my house is about 3 miles from the evacuation area of the Waldo Canyon fire, which destroyed well over 340 homes. I personally have already talked to 2 organizations, one that I used to work with. These companies had no time to get ready before they were told to get out of the area. One is a delivery company that had to continue operations; it did not matter that they did not have trucks, or even a building. An initial series of discussions with them revealed an obstacle that they had not even been thinking about (I am working hard to get more information on this for you). The more we learn from the past and from the obstacles that other organizations have faced, the better we can make our process.

One last note… We had 350 families who lost everything; more I am sure after all the counting is done.

I have at least 2 friends who were fighting that fire as firefighters.

I have 2 friends who are police officers fighting that fire.

Without the dedication of all these people, it would have been much worse. Thank you to everyone who pitched in and thank you to all their families as they were worried about their loved ones as they helped the rest of us.


Comments
  1. Chris Yates says:

    First, what a great topic. My gosh how this hits home. Currently, we do a yearly test on our core systems; do we have gaps – of course we do and perhaps I can share some of those off line at a different point in time. However, the topic merits a hard look at how typical businesses operate. I know from personal experience that we recently came to a DR situation; very very close. When that time did come you could see the panic in some of the eyes of all parties and all departments. We also had one of our nodes fail over to a node with one prod system already up and running so we had to have 2 apps on the same node in a prod environment. Needless to say it was an eye opening experience. Again thanks for the great post.

    Secondly, my father being a police officer for 32 years there is a camaraderie there. That is awfully close to home for you and I hope that the people of Colorado can pick up and move forward in picking up the pieces.

  2. jsterrett says:

    Great article Chris. It’s also an honor to say my brother is a firefighter. He actually works in the four corner region and is doing a lot of work in Colorado lately to pitch in to help.
