Archive for the ‘Database Recovery’ Category

This article is part 7 in a series on the top 10 most common mistakes that I have seen impacting SQL Server performance and resiliency. This post is not all-inclusive.

Most common mistake #7: Disaster Recovery Plans

Often, when people hear "disaster recovery plan," the first reaction is to start worrying about the cost. Disaster recovery plans don't have to be expensive; expensive disaster recovery plans come from strict requirements.

About 10 years ago, when I started as an independent consultant, one of my first clients contacted me to help build out a disaster recovery plan. After our initial discussion, I learned some consulting firms had forecasted hundred-thousand-dollar solutions. Many large companies would look at that number and call it a bargain; however, this client's company made less than $50,000 a year. The data changed about once a year, and if the database was down for a week or two, it was questionable whether anyone would even notice. It was easy to see that the hundred-thousand-dollar solution was extremely over-engineered.

Don’t ignore the basics

Disaster recovery solutions should start with two basic questions: what is the recovery point objective, and what is the recovery time objective?

  • RPO – Recovery Point Objective – To what point must the database be restored after a disaster? Another way to ask this question: how much data can be lost?
  • RTO – Recovery Time Objective – How much time can elapse after the disaster has occurred? Or, how long can your system be down?

Depending on the answers, additional questions will arise; however, these two questions help determine which potential solutions will work. SQL Server offers a number of solutions, from transaction log shipping to AlwaysOn Availability Groups.

Pay Attention to the Details

Whenever I visit a datacenter for a client, I make sure to take some time to review how the cages are wired. On more than one occasion I have seen servers with redundant power supplies that had both power cords plugged into the same circuit. This configuration protects you if one of the power supplies goes bad; however, if the circuit goes down, the redundant power supply is no help at all.

When executing a disaster recovery plan, ensure all the small details are checked. If there is a single point of failure in the system, Murphy is going to find it.

Test

I can tell you that the most common mistake I see on a regular basis with disaster recovery solutions is the lack of testing. Some testing is better than no testing, but the best testing mimics actual disasters. If there is a power outage for your servers and you have five minutes to get everything moved, do you know the steps to complete before the uninterruptible power supply loses its charge? What steps must you take if you don't have the five minutes? I was working with the chief technology officer of a major education facility, and he had a vendor telling him he was safe and didn't have to worry about it. His contract was for a 15-minute recovery point. So we reached out to the vendor and asked them to prove it.

The lesson here is to perform regular, realistic tests; if they don't work, find out why and fix it.

Top 10 Tips for SQL Server Performance and Resiliency

  1. Improper Backups
  2. Improper Security
  3. Improper Maintenance
  4. Not having a Baseline
  5. SQL Server Max Memory
  6. Change History

Does this sound familiar?

It's 8:00 PM and you are watching your favorite TV show, Survivor, when out of nowhere you are interrupted by an on-call page. You log on to check the servers, and for some unknown reason "it" has hit the fan and the server is spiraling out of control. (By the way, if you don't know what sound it makes when "it" hits the fan, just say MARINEEE out loud.) You grab your DBA hat and some Mountain Dew, and you settle in to take care of the problem and walk away the hero. But it's not going as planned… It's now 3:30 AM and for one reason or another you are still working. It appears that all the planets have aligned and you are facing one issue after another. Just as you think you have a handle on the cursed issue that is keeping you awake at this time of night, some new issue shows up, and you realize you will be enjoying your next breakfast at your desk.

If you have been a production DBA for some time, I assume you have a few stories you could tell me; I know I have more than a few that I often share when speaking at events. I think these are the moments that define production support staff. It is difficult to fight off the frustration, the lack of sleep, and the stress of having a production environment down. Seasoned administrators become the way they are through these experiences, and they use them to better their skill sets. As the crisis unfolds, different decisions have to be made…

  • Do you apply a hotfix?
  • Do you reboot the system?
  • Is it time to look at restoring the whole thing?
  • How much do you worry about collecting evidence to find the root cause compared to spending that time correcting the problem?
  • And many, many more.

However, I believe there is one question that is more important than any other, and you need to ask yourself this question over and over again: are you in the state of mind to make good decisions?

The self-pride that most of us have defaults the answer to that question to yes. We don't want to admit there is a point past which we can't act and react logically. However, we have to remember we are human, and no matter our experience, we can fall victim to stress. When we do, we may be less likely to apply best practices such as documenting each step we take or double-checking each action before we move to the next task. When a crisis occurs, stress is the natural fallout. As the problem increases and progresses, you may find yourself needing a backup, or trying to find a backup, and what if that backup is bad… the stress kicks up a notch. Life gets real when you start to operate without a safety net.

Example: A few years ago I was in one of those book studies a company required. The coach was going on about how people don't make me mad, only I make me mad, and we can control when we are mad. I tried to explain my point to the coach: human nature comes in, and sometimes there are situations where issues are simply going to raise emotions that make people mad. I provided the example of my father. I lost my father 21 years ago, just a few days after my 21st birthday, and this is a very sensitive subject for me to this day. At one point many years ago, someone insulted my father. Was I mad? Oh, you bet I was, and I don't care what anyone says: the person who insulted my father made me mad. I can control my reaction to my emotion, but not the actual emotion itself.

Granted, the example is not a technical one, but with all the emotion going on during that discussion, was that the right time for me to make critical decisions? When stress reaches a new level, you have to stop and ask yourself if you are in the state of mind to make the critical decisions that need to be made in a time of crisis.

If you answer yes to the question, and you can do so without any doubt, then proceed with applying good practices for troubleshooting and correcting issues in production.

If you answered no, or even not sure, then it is time to re-evaluate the situation. The easy answer is that it is time to step away and get someone to take over for you, but what if you can't? What if you are the only resource, and stepping away means you are just going to prolong the problem? Then what do you do?

  • First, start triple-checking every move you make. You may think the code you are executing is in the right database, but check again. I have dropped a production database by accident, and I can tell you that the second I hit execute was the same moment I realized I was connected to the wrong server. I will go one step further and say I now disconnect all my sessions and do not reconnect until I am ready to execute.
  • After you have checked for the third time, write it down. For years people have preached to document what you do, yet I very rarely find that to be the actual case. But if you are in a state of mind where you should really be in bed, document everything before you do it. Why? The mere act of writing it down may trigger the thought in your head that says: wait, is this the correct thing to do?
  • Nothing helps me more than explaining what I am doing to someone else. It doesn't even have to be a DBA, or someone who has any understanding of what I am doing. The process of explaining verbally helps me double-check that I am thinking about all the downstream impacts.

I am curious: is there something that you do? How do you protect yourself and, just as important, how do you protect what you are working on from mistakes?


Are you testing your backups? If so, how? Sure, there are a couple of options you can use to help validate backups, but does that mean you are 100% sure, without a doubt, that they can be restored? Have you ever seen a backup file fail to restore?

Last week I was working on a server that was having a bit of a storage issue. The vendor was pretty sure that after we took a couple of steps the storage would return to normal behavior. Management was not satisfied with the vendor's forecast of "pretty sure," and as a side note, I could not agree more with them. Management knew that a restore process was going to take some time, so rather than wait for the maintenance window when the storage fix was going to be attempted, a decision was made to start a restore on a different server. This restore was being done just in case… just in case the fix on the storage side did not work and they lost a database. So I was off and running with a restore database task.

About two hours into the restore I received an error stating "unexpected end of file." I wasn't sure what exactly caused the short file, but I did know it did not sound good, and I could not restore from it. After some research, it looked like the backup file I was restoring from was much smaller than the backup file SQL Server had created a few days before.

So here is my point…

It does not matter what you do on the backup side of the process: if you do not restore the file, you are at risk of having a bad backup file. And then what?

It is good practice to restore your backups on a frequent basis; I may even be so bold as to say you may want to restore each backup just to verify it. Either way, when you restore these backups it may not be good enough to copy the file to another server and restore it there. When you do a test restore, you should test it from end to end. This means that if you store your backups off-site, request a backup from that off-site location and restore that file. With that methodology you are testing not only that SQL Server created a good backup file, but that all the other pieces work as well. I like to keep only one week's worth of backups on local storage; as soon as a backup is done, it goes to tape or to another data center. There are a number of pieces in that backup process, and those same pieces are part of the restore process.
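
To make that concrete, here is a minimal T-SQL sketch of what a test restore might look like once a backup file has been pulled back from off-site storage. The file path, database name, and logical file names are all hypothetical; run RESTORE FILELISTONLY against the file first to get the real logical names.

-- Sanity checks on the file that came back from off-site storage.
-- VERIFYONLY catches truncated or damaged files (like my
-- "unexpected end of file"), but only a real restore proves the backup.
RESTORE HEADERONLY FROM DISK = N'D:\RestoreTest\SalesDb.bak';
RESTORE VERIFYONLY FROM DISK = N'D:\RestoreTest\SalesDb.bak';

-- The real test: restore to a scratch database on a test server.
RESTORE DATABASE SalesDb_RestoreTest
    FROM DISK = N'D:\RestoreTest\SalesDb.bak'
    WITH MOVE N'SalesDb'     TO N'D:\RestoreTest\SalesDb_Test.mdf',
         MOVE N'SalesDb_log' TO N'D:\RestoreTest\SalesDb_Test.ldf',
         STATS = 10;  -- progress message every 10 percent

-- Optionally, confirm the restored copy is physically consistent.
DBCC CHECKDB (SalesDb_RestoreTest) WITH NO_INFOMSGS;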

Do yourself a favor, test your backups. Do this by restoring your databases from end to end.

One of the key points I like to raise when I deliver my session on choosing the best disaster recovery solution is to pay extra attention to not over-designing the solution, and I think this is a great example of why… I was talking with a client today about a disaster recovery solution; the client had requested transaction log shipping for their environment. I could not help but think about all the newer whiz-bang solutions I could provide, and I was really excited to share all these options with them. All of these solutions could really enhance a disaster recovery plan, or so I thought.

During my visit with the client, I learned they have two data centers, one on the East Coast and one on the West. The link between the two centers is pretty good, but not always the best. The client is a growing company that would take a big hit if their database were down for a few hours. The cost of an Enterprise deployment of SQL Server would be a challenge for them, in addition to what sounds like a limitation of the hosting company in supporting Enterprise features (I am not sure about this aspect, but I am doing some more research).

As with most databases that are important and require a DR solution, the thought of losing any data is enough to make an executive sick to their stomach. However, in this case the cost/benefit ratio, or what I refer to as the sweet spot, allowed for some data loss and some time to recover. Transaction log shipping is going to meet all the requirements they have, and can be done with no additional costs. They are aware of the benefits of other solutions, so as they grow they know the options to shorten downtime or decrease the risk of data loss.

End result: take time to discuss with project owners what the expectations or requirements are, and then weigh the real expectations against the cost of implementation. There are cases where the shiniest solution will work but may not be the right choice. As a consultant, I need to make sure I am on the client's side; I need to understand the box they are working in and what is going to be the best option. Present the options and the reasons for a solution, as well as the reasons why they may want to steer clear of one.

In summary, don’t just try to implement the newest feature in the newest version. If there is a case for it, then more power to you. Put the right solution in place.


Consistently, my favorite projects over the last 15 years have been the HA/DR projects. It doesn't matter if it is Windows clustering, transaction log shipping, or mirroring. I have implemented many different solutions, and I am very close to putting SQL Server 2012 AlwaysOn into production. My current project is a mirroring solution for a company. Mirroring was the solution of choice because of the speed of the failovers and the ease of tying applications into mirroring.

The technology side of these solutions has a lot of support documentation around it to assist you in implementing them. If you are looking for something outside the standard Books Online or Microsoft training videos, there are tons of blogs, articles, and in-person training sessions to choose from. The technologies themselves are not all that difficult to implement. However, there are some aspects where experience helps you navigate around obstacles. One of these is the point of my blog post today.

One of the key aspects of a technology solution that meets DR/HA requirements is testing the solution to make sure it works when you are counting on it. If you are not testing your solutions, I am afraid you may end up in a situation that is less than optimal for the requirements. This shapes the organization's overall opinion of the solution that has been put in place. For example…

My mother is not too sure what I do for a living, and that is OK. I try to explain it, but once we get to the more technical aspects she gets lost in the explanation (as do I when she talks about investment brokers; I hear blah, blah, blah). She knows that the database stores the data that supports the applications. Yet often she will be watching TV or working on her mobile phone, and as soon as something doesn't work she thinks it is due to a database issue. My wife is the same. If we are shopping in a store and there is a misspelling on an LED scrolling sign, she jumps to the conclusion that their DBA is not doing their job. (On a side note… I sure do appreciate the confidence in my abilities, but that is a lot of expectation. When my printer doesn't work, it goes in the trash; I don't have the experience to fix it, or the time.)

When the company is expecting a failover to happen, and an issue that is not even related to the database causes the failover to not work exactly as planned, can we expect the non-technical parts of the company to understand the technical issue behind the problem?

They only know one thing: the failover did not work, and the company is down. If the power goes out at the DR facility and the facility is offline, or someone doing maintenance on a rack in the datacenter unplugs a router, a non-SQL issue can leave the impression that there is a problem with the failover. Most people only care about one thing:

Does it work?

In short, does it really matter to the rest of the organization why a system is down or why a process has failed? Sure, they want the problem addressed so it can be avoided in the future. But to a marketing person or a customer service manager, it does not matter if a datacenter had a power issue or if the database did not come online like it should. Both problems look the same to them: broke.

My point, you may ask?

For me, the most difficult part about putting a solution in place is expressing to the rest of the organization how important it is to test it, and how important it is to retest on a regular basis. There are so many parts of the organization that need to know what is going on, what to expect, and how long everything is going to last. The overall impression depends not only on everyone having the expectations correctly set, but on many people coming together to complete the many tasks that lead to a successful test. Without all the different parts of the organization supporting and validating the test, how can anyone be sure that the complete solution works for the whole company?

I want to make sure I express that the difficult part of an HA/DR solution is not putting the solution in place. The difficulty lies in testing, and continued testing, to make sure the behavior is as expected each and every time. If you tested your solution once, maybe it is time to schedule another test. If you put something in place, did you test it? Did you test all aspects of it? When I speak at events, I often ask how many people have done a full end-to-end test of the restore process. Sure, as a DBA you need to be able to restore the database, but where did that backup file come from? Was it off-site? How many people had to be included for a full test, and how long did it take? Do you have to wait an hour for someone to drive to an off-site storage facility at 1:00 AM before the tape even reaches the point where someone on your staff can touch it?

Do yourself a huge favor: answer all these questions before you reach the point where you are relying on a plan that has not been tested.

  • Does the solution work?
  • What resources (buildings, power, data centers, backups…) do you need physical access to?
  • What resources are needed at all steps? (before, during and after)
  • What is the potential data loss?
  • How long is the impact to the company and to the customer?
  • What are the obstacles that need to be addressed?
    • How are these obstacles addressed?
  • Is the cost of the solution and the risk/impact ratio acceptable to the company?
  • Are the steps to failover/recovery documented?
    • Not just the obvious steps such as "get the backup tape and start the restore," but steps such as: here is the call list, here is where we meet. And the most important step, one I have only seen mentioned in a single DR document: validation that your people are safe and are able to work. I could not imagine having to worry about my family and home while work is knocking at the door.
  • What needs to be addressed first? Are some applications more important than others?
  • Do the users of the database understand the risk?
  • What is the worst case scenario?

    Just think. I live in Colorado; my house is about 3 miles from the evacuation area where the Waldo Canyon fire destroyed well over 340 homes. I have personally already talked to two organizations, one of which I used to work with. These companies had no time to get ready before they were told to get out of the area. One is a delivery company that had to continue operations; it did not matter that they did not have trucks, or even a building. Initial discussions with them revealed an obstacle they had not even been thinking about (I am working hard to get more information on this for you). The more we learn from the past and from the obstacles other organizations have faced, the better we can make our own processes.

One last note… We had 350 families who lost everything; more, I am sure, after all the counting is done.

I have at least two friends who fought that fire as firefighters.

I have two friends who are police officers who worked that fire.

Without the dedication of all these people, it would have been much worse. Thank you to everyone who pitched in and thank you to all their families as they were worried about their loved ones as they helped the rest of us.


About a week or so ago I posted a few thoughts on database mirroring in SQL Server and some of the considerations that should be reviewed before relying on it as your failover solution. I love the features in mirroring, and as I write this I am looking at how I can start using the new features in SQL Server 2012. However, as much as I would like to upgrade today, I need to run a full round of testing with the developers I work with to make sure everyone understands the behavior. The key to any backup or HA/DR plan lies in the testing and planning you do; if you do a bunch of planning without any testing, I am afraid you are likely to run into issues. That is why I have created this post: there are a few things I have seen that are often missed with database mirroring.

Linked Servers

Problem – Over the years, a fair amount of code gets developed that relies on linked servers. If you have a server named Server A, and one of the stored procedures in one of the databases on that server links to a database on another server (Server B), you may run into an issue during a failover. Linked server information is a server-level configuration, not database-level, so the information about the linked server is not kept in any of the individual user databases.

Solution – This may sound like an overly simple solution, but it is what it is: make sure you have all of your user databases' external dependencies configured on both the principal and the mirror, as in the sketch below. The complexity increases when you have links referring to the mirrored solution.
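
As a rough sketch of what that looks like, the same linked server definition can be scripted on both servers so it exists wherever the database comes online. The server name and login credentials here are hypothetical:

-- Run on BOTH the principal and the mirror.
-- 'ServerB' is a hypothetical SQL Server instance referenced by
-- stored procedures inside the mirrored database.
EXEC sp_addlinkedserver
    @server     = N'ServerB',
    @srvproduct = N'SQL Server';

-- Map connections to a fixed remote SQL login (placeholder credentials).
EXEC sp_addlinkedsrvlogin
    @rmtsrvname  = N'ServerB',
    @useself     = N'FALSE',
    @rmtuser     = N'linked_user',
    @rmtpassword = N'********';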

External Dependencies

Problem – In like fashion to the linked servers, there are many external SQL Server items that may rely on the name of the server to connect. When the mirror takes over from the principal, the server that was the principal is no longer hosting the database in an online mode, even if other user databases on that server remain online. These links/connections do not just change on their own. This could include items such as SSIS packages, SQL Mail, and cross-database queries.

Solution – The deeper we look at these considerations, the easier I find it to recommend SQL Server 2012. There are a number of things you can do when it comes to adjusting for external dependencies: you can create/configure these items on both servers and, in turn, enable or disable them depending on where the database is live at the time, as the role check sketched below illustrates. Many of these issues are addressed in SQL Server 2012 with Contained Databases, Partially Contained Databases, and AlwaysOn Availability Groups.
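
If you go the enable/disable route, a quick role check can drive that logic. This is just a sketch, and the database name is hypothetical:

-- Returns PRINCIPAL or MIRROR for a mirrored database,
-- and NULL if the database is not mirrored at all.
SELECT d.name,
       m.mirroring_role_desc,
       m.mirroring_state_desc
FROM sys.database_mirroring AS m
JOIN sys.databases AS d
    ON d.database_id = m.database_id
WHERE d.name = N'SalesDb';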

No matter the solution you choose, the key to a successful DR solution is testing. I am not sure I can think of any critical process outside of technology that is not tested and then re-tested; this is why the military spends countless hours training and local fire departments run disaster drills. The time to learn that your disaster plan is weak in an area is not when the disaster happens.

I am really curious what considerations you may have seen that I have not noted. Send me an email and I can add yours to this list.


Database mirroring is a great option for many disaster scenarios, and if you use a witness server, automated failovers can be achieved. However, just like any other disaster recovery solution, database mirroring should be tested. Before you execute your first test of failing a database over from one server to another, you may want to take a few different items into consideration.

User Accounts

Problem – When a new SQL Server login is created on a server, a row is added to a system table in the master database.  The Unique Identifier on that row of data is called the SID.  The SID is unique based on the account and the server on which the login is created. 

When a login is then added to a database as a user, and the user is granted permissions to work with data inside a database, those permissions are added to the individual databases. Logins are related to the server while users are related to the database.  The relationship between the login and the user are based on the SID.

When a database mirror becomes the principal on a different server, the users in that database remain, because the user account information is stored in the user database. However, the relationship from the user in the database to the login on the server is broken. This is referred to as an orphaned user.

Solution – There are a couple of ways to correct an orphaned user. One method is to use the sp_change_users_login stored procedure. I don't use this method because it requires additional activity on your side, either running the procedure manually or via a SQL Server Agent job, and depending on how you have it configured, it may take some time before those accounts can log on to the database.

The solution I choose is to create all the logins on the mirror with the same SID they have on the principal. By having the same login SID on both the mirror and the principal, the relationship between the accounts remains intact. To create a new login while declaring the SID, use the CREATE LOGIN statement, as sketched below.
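
Here is a minimal sketch of that approach; the login name, password, and SID value are placeholders. Run the first query on the principal, then use the SID it returns when creating the login on the mirror:

-- On the principal: capture the login's SID.
SELECT name, sid
FROM sys.server_principals
WHERE name = N'app_login';

-- On the mirror: create the same login with the same SID so the
-- database user still maps to it after a failover.
CREATE LOGIN app_login
    WITH PASSWORD = N'StrongPasswordHere!',
         SID = 0x241C11948AEEB749B0128DA97053F769;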

SQL Server Agent Jobs

Problem – Many servers have jobs that reach into databases to manipulate data, or automated backups scheduled via the SQL Server Agent service. On mirrored copies of the database that are not online, these jobs will fail if this is not taken into consideration. Some may consider just adding the jobs to the server and letting them run and fail, so that when the mirror becomes the principal the jobs will already be scheduled and ready when the next execution time comes around. (I am not a big fan of this, if for nothing else because I monitor for failed jobs, and a false positive will eventually be ignored, so when the job actually needs to run and fails, the notification may be missed.)

Solution – Kevin Cox and Glenn Berry, two SQL Server professionals I respect a lot, posted a solution on the SQLCAT site that solves this issue. In short: add a job category to your server and assign all the database-dependent jobs to it. Once your database's status changes, the jobs in that category can be enabled, along the lines of the sketch below.
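
A rough sketch of that idea, assuming the database-dependent jobs live in a hypothetical category named 'MirrorFailover'; run this (manually, or from an alert on the mirroring state change) when the mirror becomes the principal:

-- Enable every Agent job in the 'MirrorFailover' category.
DECLARE @job_name sysname;

DECLARE job_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT j.name
    FROM msdb.dbo.sysjobs AS j
    JOIN msdb.dbo.syscategories AS c
        ON c.category_id = j.category_id
    WHERE c.name = N'MirrorFailover'
      AND j.enabled = 0;

OPEN job_cursor;
FETCH NEXT FROM job_cursor INTO @job_name;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC msdb.dbo.sp_update_job @job_name = @job_name, @enabled = 1;
    FETCH NEXT FROM job_cursor INTO @job_name;
END
CLOSE job_cursor;
DEALLOCATE job_cursor;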

Indexing

Problem – Running a reindex on a database that is mirrored can create a large number of transactions that need to be propagated over to the mirror. On servers located in different physical environments, this can place additional load on the bandwidth and on the server hosting the mirror.

Solution – It is important to know the impact of database mirroring on your database and on the servers hosting those databases. In the situation I am thinking of, the mirror server did not have the same hardware as the principal server, which meant that reindexing was a bigger load on the mirror than it was on the principal.

One option is to make sure that only the indexes that actually need to be reindexed are reindexed. This can be done by checking fragmentation before issuing the index statements, as in the sketch below.
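
As a minimal sketch, the check can be done with sys.dm_db_index_physical_stats, generating maintenance statements only for indexes that cross a fragmentation threshold (the 5/30 percent cutoffs below are common rules of thumb, not gospel):

-- Generate REORGANIZE/REBUILD commands only for fragmented indexes in
-- the current database, keeping mirroring traffic proportional to the
-- work that actually needs doing.
SELECT QUOTENAME(s.name) + N'.' + QUOTENAME(o.name) AS table_name,
       i.name AS index_name,
       ps.avg_fragmentation_in_percent,
       N'ALTER INDEX ' + QUOTENAME(i.name) + N' ON '
           + QUOTENAME(s.name) + N'.' + QUOTENAME(o.name)
           + CASE WHEN ps.avg_fragmentation_in_percent > 30
                  THEN N' REBUILD;'
                  ELSE N' REORGANIZE;'
             END AS maintenance_command
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id AND i.index_id = ps.index_id
JOIN sys.objects AS o ON o.object_id = ps.object_id
JOIN sys.schemas AS s ON s.schema_id = o.schema_id
WHERE ps.avg_fragmentation_in_percent > 5
  AND i.index_id > 0;  -- skip heaps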