Archive for the ‘Database Recovery’ Category

This article is part 7 in a series on the top 10 most common mistakes that I have seen impacting SQL Server performance and resiliency. This post is not all-inclusive.

Most common mistake #7: Disaster Recovery Plans

Often, when people hear "disaster recovery plan," the first reaction is to start worrying about the cost. Disaster recovery plans don't have to be expensive; expensive disaster recovery plans come from strict requirements.

About 10 years ago, when I started as an independent consultant, one of my first clients contacted me to help build out a disaster recovery plan. After our initial discussion I learned that some consulting firms had forecasted hundred-thousand-dollar solutions. Many large companies would look at that number and consider it a bargain; however, this client's company made less than $50k a year. The data changed about once a year, and if the database was down for a week or two it was questionable whether anyone would even notice. It was easy to see that the hundred-thousand-dollar solution was extremely over-engineered.

Don’t ignore the basics

Disaster recovery solutions should start with two basic questions: what is the recovery point objective, and what is the recovery time objective?

  • RPO – Recovery Point Objective – To what point must the database be restored after a disaster? Another way to ask this question: how much data can be lost?
  • RTO – Recovery Time Objective – How much time can elapse after the disaster has occurred? Or, how long can your system be down?

Depending on these answers, additional questions will arise; however, these two questions help determine which potential solutions will work. SQL Server offers a number of solutions, from transaction log shipping to AlwaysOn Availability Groups.

Pay Attention to the Details

Whenever I visit a datacenter for a client I make sure that I take some time to review how the cages are wired. On more than one occasion I have seen servers with redundant power supplies have both of the power cords plugged into one circuit. This configuration will protect you if one of the power supplies goes bad; however, if the circuit goes down, the redundant power supply isn't any help.

When executing a disaster recovery plan, ensure all the small details are checked. If there is a single point of failure in the system, Murphy is going to find it.

Test

The most common mistake I see on a regular basis with disaster recovery solutions is the lack of testing. Some testing is better than no testing, but the best testing mimics actual disasters. If there is a power outage for your servers and you have 5 minutes to get everything moved, do you know the steps to complete before the uninterruptible power supply loses its charge? What steps must you take if you don't have the 5 minutes? I was working with the chief technology officer for a major education facility, and he had a vendor telling him he was safe and didn't have to worry about it. His contract was for a 15-minute recovery point, so we reached out to the vendor and asked them to prove it.

The lesson here is to perform regular, realistic tests; if they don't work, find out why and fix it.

Top 10 Tips for SQL Server Performance and Resiliency

  1. Improper Backups
  2. Improper Security
  3. Improper Maintenance
  4. Not having a Baseline
  5. SQL Server Max Memory
  6. Change History

Does this sound familiar?

8:00 PM and you are watching your favorite TV show, Survivor, when out of nowhere you are interrupted by an on-call page. You log on to check the servers, and for some unknown reason "it" has hit the fan and the server is spiraling out of control. (By the way, if you don't know what sound it makes when "it" hits the fan, just say MARINEEE out loud.) You grab your DBA hat, some Mountain Dew, and you settle in to take care of the problem and walk away the hero. But it's not going just as planned… It's now 3:30 AM and for one reason or another you are still working. It appears that all the planets have aligned and you are facing one issue after another. Just as you think you have a handle on the cursed issue that is keeping you awake at this time of night, some new issue shows up and you realize you will be enjoying your next breakfast at your desk.

If you have been a production DBA for some time, I assume you have a few stories you could tell me; I know I have more than a few that I often share when I am speaking at events. I think these are the moments that define production support staff. It is difficult to fight off the frustration, the lack of sleep, and the stress of having a production environment down. Seasoned administrators become the way they are based on these experiences and use them to better their skill sets. As the "crisis" is occurring, different decisions are being made…

  • Do you apply a hotfix?
  • Do you reboot the system?
  • Is it time to look at restoring the whole thing?
  • How much do you worry about collecting evidence to find the root cause compared to spending that time correcting the problem?
  • And many, many more.

However, I believe there is one question that is more important than any other, and you need to ask yourself this question over and over again. Are you in the state of mind to make good decisions?

The self-pride that most of us have would default the answer to that question to yes. We don't want to think there is a point where we can no longer act and react logically. However, we have to remember we are human, and no matter our experience we can fall victim to the stress. When we do, we may be less likely to apply best practices such as documenting each step we take or double-checking each action before we move to the next task. When a crisis occurs, stress is the natural fallout. As the problem increases and progresses you may find yourself needing a backup, or trying to find a backup, and what if that backup is bad… the stress kicks up a notch. Life gets real when you start to operate without a safety net.

Example: A few years ago I was in one of those book studies a company required. The coach was going on about how people don't make me mad, only I make me mad, and we can control when we are mad. I tried to explain my point to the coach: human nature comes in, and sometimes there are situations where issues are simply going to raise emotions that make people mad. I provided the example of my father. I lost my father 21 years ago, just a few days after my 21st birthday, and this is a very sensitive subject for me to this day. At one point many years ago, someone insulted my father. Was I mad? Oh, you bet I was, and I don't care what anyone says, the person who insulted my father made me mad. I can control my reaction to my emotion, but not the actual emotion itself.

Granted, the example is not a technical one, but with all the emotion going on during that discussion, was that the right time for me to make critical decisions? When stress reaches a new level, you have to stop and ask yourself if you are in the state of mind to make the critical decisions that need to be made in the time of a crisis.

If you answer yes to the question, and you can do so without any doubt, then proceed with applying good practices for troubleshooting and correcting issues in production.

If you answered no, or even not sure, then it is time to re-evaluate the situation. The easy answer is to step away from the situation and get someone to take over for you, but what if you can't? What if you are the only resource, and stepping away means you are just going to prolong the problem? Then what do you do?

  • First, start triple-checking every move you make. You may think the code you are executing is in the right database, but check again (a quick sanity check like the one sketched after this list helps). I have dropped a database in production by accident, and I can tell you that the second I hit execute was the same moment I realized I was connected to the wrong server. I will go one step further and say I now disconnect all my sessions and will not reconnect until I am ready to execute.
  • After you have checked for the third time, write it down. For years people have been preaching "document what you do," yet I very rarely find this to be the case. If you are of the mindset where you should really be in bed, document everything before you do it. Why? The mere action of writing it down may trigger the thought in your head that says, wait, is this the correct thing to do?
  • Nothing helps me more than explaining what I am doing to someone else. It doesn't even have to be a DBA, or someone who has any understanding of what I am doing. The process of explaining it verbally helps me double-check that I am thinking about all the downstream impacts.
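If it helps, here is the kind of minimal sanity check I am talking about before running anything destructive; a sketch only, using only built-in functions:

    -- Confirm exactly where this session is connected before executing anything.
    SELECT @@SERVERNAME  AS server_name,    -- instance you are connected to
           DB_NAME()     AS database_name,  -- current database context
           SUSER_SNAME() AS login_name;     -- login you are running as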

I am curious: is there something that you do? How do you protect yourself, and just as important, how do you protect what you are working on from mistakes?

 

Are you testing your backups? If so, how? Sure, there are a couple of options you can use to help validate backups, but does that mean you are 100% sure, without a doubt, that the backup can be restored? Have you ever seen a backup file not restore?
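For reference, a minimal sketch of those built-in validation options (the database name and path are placeholders, not from the environment described here):

    -- Write the backup with checksums so page-level damage is caught at backup time.
    BACKUP DATABASE SalesDB
    TO DISK = N'D:\Backups\SalesDB.bak'
    WITH CHECKSUM, INIT;

    -- Verify the file is readable and the checksums still match.
    -- This does NOT prove the database will restore cleanly end to end.
    RESTORE VERIFYONLY
    FROM DISK = N'D:\Backups\SalesDB.bak'
    WITH CHECKSUM;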

Last week I was working on a server that was having a bit of a storage issue; the vendor was pretty sure that after we took a couple of steps the storage would return to normal behavior. Management was not satisfied with the vendor's forecast of "pretty sure," and as a side note, I could not agree more with them. Management knew that a restore process was going to take some time, so rather than wait for the maintenance window when the storage fix was going to be attempted, a decision was made to start a restore on a different server. This restore was being done as a just-in-case… just in case the fix on the storage side did not work and they lost a database. So I was off and running with a restore database task.

About two hours into the restore I received an error stating "unexpected end of file." I wasn't sure exactly what caused the short file, but I did know it did not sound good, and I could not restore from it. After some research it looked like the backup file I was restoring from was much smaller than the backup file SQL Server had created a few days before.

So here is my point…

It does not matter what you do on the backup side of the process; if you do not restore the file, you are at risk of having a bad backup file, and then what?

It is good practice to restore your backups on a frequent basis; I may even be so bold as to say you may want to restore each backup just to verify it. Either way, when you restore these backups it may not be good enough to copy the file to another server and restore it there. When you do a test restore, you should test it from end to end. This means that if you store your backups off-site, then request a backup from that off-site location and restore that file. If you restore your files with that methodology, you are testing not only that SQL Server created a good backup file, but that all the other pieces work as well. I like to back up and keep only one week's worth of backups on local storage. As soon as the backup is done, it either goes to tape or to another data center. So there are a number of pieces in that backup process, and those are the same pieces that are part of the restore process.
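A minimal sketch of what the end of that end-to-end test might look like once the file has come back from off-site storage (the path, logical file names, and database names are all made up for illustration):

    -- Restore the copy that came back from off-site storage onto a test server.
    RESTORE DATABASE SalesDB_Verify
    FROM DISK = N'E:\RestoreTest\SalesDB.bak'
    WITH MOVE 'SalesDB'     TO N'E:\RestoreTest\SalesDB_Verify.mdf',
         MOVE 'SalesDB_log' TO N'E:\RestoreTest\SalesDB_Verify.ldf',
         CHECKSUM, STATS = 10;

    -- A restore that completes is good; a clean consistency check is better.
    DBCC CHECKDB (N'SalesDB_Verify') WITH NO_INFOMSGS, ALL_ERRORMSGS;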

Do yourself a favor, test your backups. Do this by restoring your databases from end to end.

One of the key points I like to raise when I deliver my session on what the best disaster recovery solution is, is making sure you pay extra attention to not over-designing a solution, and I think this is a great example of why… I was talking with a client today about a disaster recovery solution; the client had requested transaction log shipping for their environment. I could not help but think about all the newer whiz-bang solutions I could provide them, and I was really excited to share all these options. All of these solutions I was ready to talk about could really enhance a disaster recovery plan, or so I thought.

During my visit with the client, I learned the client has two data centers, one on the East coast and one on the West. The link between the two centers is pretty good, but not always the best. The client is a growing company that would take a big hit if their database was down for a few hours. The cost of an Enterprise deployment of SQL Server would be a challenge for them, in addition to what sounds like a limitation of the hosting company in supporting Enterprise features (I am not sure about this aspect, but I am doing some more research).

As with most databases that are important and require a DR solution, the thought of losing any data is enough to make an executive sick to their stomach. However, in this case the cost/benefit ratio, or what I refer to as the sweet spot, would allow for some data loss and some time to recover. Transaction log shipping is going to meet all the requirements they have, and can be done with no additional cost. They are aware of the benefits of other solutions, so as they grow they know the options to shorten downtime or decrease the risk of data loss.

End result: take time to discuss with project owners what the expectations or requirements are, and then weigh the real expectations against the cost of implementation. There are cases where a solution that is all shiny will work, but may not be the right choice. As a consultant I need to make sure I am on the client's side; I need to understand the box they are working in and what is going to be the best option. Present the options and the reasons for the solutions, and the reasons why they may want to steer clear of a solution.

In summary, don’t just try to implement the newest feature in the newest version. If there is a case for it, then more power to you. Put the right solution in place.

 

Consistently, my favorite projects over the last 15 years have been the HA/DR projects. It doesn't matter if it is Windows clustering, transaction log shipping, or mirroring. I have implemented many different solutions, and I am very close to putting SQL Server 2012 AlwaysOn into production. My current project is working on a mirroring solution for a company. Mirroring was the solution of choice because of the speed of the failovers and the ease of tying applications into mirroring.

The technology part of these solutions has a lot of support documentation around it to assist you in implementing them. If you are looking for something outside of the standard Books Online or Microsoft training videos, there are tons of blogs, articles, and in-person training sessions to choose from. The technologies are not all that difficult to implement. However, there are some aspects where experience helps navigate around obstacles. One of these is the point of my blog post today.

One of the key aspects of a technology solution that meets the DR/HA requirements is testing the solution to make sure it works when you are counting on it. If you are not testing your solutions, I am afraid you may end up in a situation that is less than optimal for the requirements, and that shapes the general overall opinion about the solution that has been put in place. For example…

My mother is not too sure what I do for a living, and that is OK. I try to explain it, but once we get to some of the more technical aspects she gets lost in the explanation (as do I when she talks about investment brokers; I hear blah, blah, blah). She knows that the database stores the data that supports the applications. Yet often she will be watching TV or working on her mobile phone, and as soon as something doesn't work she thinks it is due to a database issue. My wife is the same. If we are shopping in a store and there is a misspelling on an LED scrolling sign, she jumps to the conclusion that it is because their DBA is not doing their job. (On a side note… I sure do appreciate the confidence in my ability, but that is a lot of expectation; when my printer doesn't work, it goes in the trash. I don't have the experience to fix it, or the time.)

When the company is expecting a failover to happen, and an issue that is not even related to the database causes the failover to not work exactly as planned, can we expect the non-technical parts of the company to understand the technical issue behind the problem?

They only know one thing: the failover did not work, and the company is down. If the power goes out at the DR facility and the facility is offline, or someone doing maintenance on a rack in the datacenter unplugs a router, a non-SQL-related issue can leave the impression that there is a problem with the failover. Most people only care about one thing:

Does it work?

In short, does it really matter to the rest of the organization why a system is down or why a process has failed? Sure, they want the problem addressed so that it can be avoided in the future. But to a marketing person or a customer service manager it does not matter whether a datacenter had a power issue or the database did not come online like it should. Both problems look the same to them: broken.

My point, you may ask?

For me, the most difficult part about putting a solution in place is expressing to the rest of the organization how important it is to test it, and how important it is to retest on a regular basis. There are so many parts of the organization that need to know what is going on, what to expect, and how long everything is going to last. The overall impression depends not only on everyone having the expectations correctly set, but on many people coming together to complete many tasks that lead to a successful test. Without all of the different organizations supporting and validating the test, how can anyone be sure the complete solution works for the whole company?

I want to make sure I express that the difficult part of doing an HA/DR solution is not putting the solution in place. The difficulty lies in testing, and continued testing, to make sure that each and every time the behavior is as expected. If you tested your solution once, maybe it is time to schedule another test. If you put something in place, did you test it? Did you test all aspects of it? When I speak at events, I often ask how many people have done a full end-to-end test of the restore process. Sure, as a DBA you need to be able to restore the database, but where did that backup file come from? Was it off-site? How many people had to be included for a full test, and how long did it take? Did you have to wait an hour for someone to drive to an off-site storage facility at 1:00 AM before the tape was even to the point where someone on your staff could touch it?

Do yourself a huge favor, answer all the questions before you have to get to the point where you are relying on your plan that has not been tested.

  • Does the solution work?
  • What resources (buildings, power, data centers, backups…) do you need physical access to?
  • What resources are needed at all steps? (before, during and after)
  • What is the potential data loss?
  • How long is the impact to the company or to the customer?
  • What are the obstacles that need to be addressed?
    • How are these obstacles addressed?
  • Is the cost of the solution and the risk/impact ratio acceptable to the company?
  • Are the steps to failover/recovery documented?
    • Not just the obvious steps such as get the backup tape and start the restore, but steps such as here is the call list, here is where we meet. And the most important step, which I have only seen mentioned once in a DR document: validating that the human resources are safe and that they are safe to work. I could not imagine having to be worried about my family and home while work is knocking at the door.
  • What needs to be addressed first? Are some applications more important than others?
  • Do the users of the database understand the risk?
  • What is the worst case scenario?

    Just think. I live in Colorado; my house is about 3 miles from the evacuation area where the Waldo Canyon fire destroyed well over 340 homes. I have personally already talked to two organizations, one of which I used to work with. These companies had no time to get ready before they were told to get out of the area. One is a delivery company that had to continue operations; it did not matter that they did not have trucks, or even a building. An initial series of discussions with them surfaced an obstacle they were not even thinking about (I am working hard to get more information on this for you). The more we learn from the past and the obstacles other organizations have faced, the better we can make our process.

One last note… We had 350 families who lost everything; more I am sure after all the counting is done.

I have at least 2 friends who were fighting that fire as firefighters.

I have 2 friends who are police officers fighting that fire.

Without the dedication of all these people, it would have been much worse. Thank you to everyone who pitched in and thank you to all their families as they were worried about their loved ones as they helped the rest of us.


About a week or so ago I posted a few thoughts on database mirroring in SQL Server and some of the considerations that should be reviewed before relying on it as your failover solution. I love the features in mirroring, and as I write this I am looking at how I can start using the new features in SQL Server 2012. However, as much as I would like to upgrade today, I need to run a full round of testing with the developers I work with to make sure everyone understands the behavior. The key part to any backup or HA/DR plan lies in the testing and the planning that you do. If you do a bunch of planning without any testing, I am afraid you are likely to run into some issues. This is why I have created this post: there are a few things I have seen that are often missed with database mirroring.

Linked Servers

Problem – Over the years a fair bit of code has been developed to support linked servers. If you have a server named Server A, and one of the stored procedures in one of the databases on that server links to a database on another server (Server B), you may run into an issue during a failover. Linked server information is a server-level configuration, not database level, so the information about the linked server is not kept in any of the individual user databases.

Solution – This may sound like an overly simple solution, but it is what it is. Make sure that all of your user databases' external dependencies are configured on both the principal and the mirror. The complexity increases when you have links referring to the mirrored database itself.
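As a rough sketch, that means creating the same linked server definition on both servers; the server names, provider, and login below are placeholders, not anything from a real environment:

    -- Run on BOTH the principal and the mirror so the dependency exists
    -- wherever the database comes online.
    EXEC sp_addlinkedserver
        @server     = N'ServerB',
        @srvproduct = N'',
        @provider   = N'SQLNCLI',
        @datasrc    = N'ServerB';

    EXEC sp_addlinkedsrvlogin
        @rmtsrvname  = N'ServerB',
        @useself     = N'FALSE',
        @locallogin  = NULL,
        @rmtuser     = N'linked_login',
        @rmtpassword = N'********';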

External Dependencies

Problem – In like fashion to linked servers, there are many external SQL Server items that may rely on the name of the server to connect. When the mirror takes over from the principal, the server that was the principal is no longer hosting the database in an online mode, even if another user database on that same server is still online. These links and connections do not just change on their own. This could include many items such as SSIS packages, SQL Mail, and cross-database queries.

Solution – The deeper we look at the considerations, the easier I find it to recommend SQL Server 2012. There are a number of things you can do when it comes to adjusting external dependencies: you can create and configure these items on both servers and, in turn, enable or disable them depending on where the database is live at the time. Many of these issues are addressed in SQL Server 2012 with Contained Databases, Partially Contained Databases, and AlwaysOn Availability Groups.
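One way to drive that enable/disable decision is to check which mirroring role the server currently holds for the database; a minimal sketch, with the database name being an example:

    -- Gate server-level work (jobs, SSIS packages, etc.) on where the database is live.
    IF EXISTS (SELECT 1
               FROM sys.database_mirroring
               WHERE database_id = DB_ID(N'SalesDB')
                 AND mirroring_role_desc = 'PRINCIPAL')
        PRINT 'SalesDB is the principal here - run the dependent work.';
    ELSE
        PRINT 'SalesDB is not the principal here - skip it.';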

No matter which solution you choose to use, the key to a successful DR solution is testing. I am not sure I can think of any critical process outside of technology that is not tested and then re-tested. This is why the military spends countless hours training and local fire departments run disaster drills. The time to learn that your disaster plan is weak in an area is not when the disaster happens.

I am really curious as to what considerations you may have seen that I have not noted. Send me an email and I can add yours to this list.


Database mirroring is a great option for many disaster scenarios, and if you use a witness server, automated failovers can be achieved. However, just like any other disaster recovery solution, database mirroring should be tested. Before you execute your first test of failing over a database from one server to another, you may want to take a few different items into consideration.

User Accounts

Problem – When a new SQL Server login is created on a server, a row is added to a system table in the master database.  The Unique Identifier on that row of data is called the SID.  The SID is unique based on the account and the server on which the login is created. 

When a login is then added to a database as a user, and the user is granted permissions to work with data inside a database, those permissions are added to the individual database. Logins are related to the server while users are related to the database. The relationship between the login and the user is based on the SID.

When a database mirror becomes the principal on a different server, the users in that database remain, because the user account information is stored in the user database. However, the relationship from the user in the database to the login on the server is broken. This is referred to as an orphaned user.
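A quick way to spot these after a failover is to look for database users whose SID has no matching server login; a sketch, with the name filters at the end being my own additions:

    -- Database users whose SID has no matching login on this server (orphans).
    SELECT dp.name AS orphaned_user
    FROM sys.database_principals AS dp
    LEFT JOIN sys.server_principals AS sp
           ON dp.sid = sp.sid
    WHERE sp.sid IS NULL
      AND dp.type = 'S'   -- SQL users
      AND dp.name NOT IN ('guest', 'INFORMATION_SCHEMA', 'sys', 'dbo');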

Solution – There are a couple of ways of correcting an orphaned user. One method is to correct it using the sp_change_users_login stored procedure. I don't use this method because it requires additional activity on your side, either running that procedure manually or via a SQL Server Agent job, and, depending on how you have it configured, it may take some time before those accounts can log on to the database.

The solution I choose is to create all the logins on the mirror with the same SID they have on the principal. By having the same login SID on both the mirror and the principal, the relationship remains intact. To create a new login while declaring the SID, use the CREATE LOGIN statement.
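Roughly, that looks like the following; the login name, password, and SID value are placeholders:

    -- On the principal: capture the SID of the existing login.
    SELECT name, sid
    FROM sys.server_principals
    WHERE name = N'app_login';

    -- On the mirror: create the same login with the same SID so the database
    -- user still maps correctly after a failover.
    CREATE LOGIN app_login
    WITH PASSWORD = N'StrongPasswordHere!',
         SID = 0x912C631E4FCB3A41B440E27622DA1234;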

SQL Server Agent Jobs

Problem – Many servers have jobs that reach into databases to manipulate data, or automated backups scheduled via the SQL Server Agent service. Against mirrored copies of the database that are not online, these jobs will fail if that is not taken into consideration. Some may consider just adding the jobs to the mirror server and letting them run and fail, so that when the mirror becomes the principal the jobs are already scheduled and ready when the next execution time comes around. (I am not a big fan of this, if for no other reason than that I monitor for failed jobs, and a false positive will eventually be ignored, so when the job is needed and fails, the notification may be missed.)

Solution – Kevin Cox and Glenn Berry, two SQL Server professionals I respect a lot, posted a solution on the SQLCAT site that solves this issue. In short: add a job category to your server and assign all the relevant jobs to it. Once your database status changes, the jobs in that category can be enabled.
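As a sketch of that pattern (the category and database names are assumptions on my part, not from the SQLCAT post), the jobs in the category can be enabled or disabled based on the current mirroring role:

    -- Enable the jobs in a category when this server is the principal, disable otherwise.
    DECLARE @enable int;
    SET @enable = CASE WHEN EXISTS (SELECT 1
                                    FROM sys.database_mirroring
                                    WHERE database_id = DB_ID(N'SalesDB')
                                      AND mirroring_role_desc = 'PRINCIPAL')
                       THEN 1 ELSE 0 END;

    DECLARE @job_id uniqueidentifier;
    DECLARE job_cursor CURSOR LOCAL FAST_FORWARD FOR
        SELECT j.job_id
        FROM msdb.dbo.sysjobs AS j
        JOIN msdb.dbo.syscategories AS c ON c.category_id = j.category_id
        WHERE c.name = N'Mirrored - SalesDB';

    OPEN job_cursor;
    FETCH NEXT FROM job_cursor INTO @job_id;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        EXEC msdb.dbo.sp_update_job @job_id = @job_id, @enabled = @enable;
        FETCH NEXT FROM job_cursor INTO @job_id;
    END
    CLOSE job_cursor;
    DEALLOCATE job_cursor;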

Indexing

Problem – Running a reindex on a database that is mirrored can create a large number of transactions that need to be propagated over to the mirror. On servers located in different physical environments, this places additional load on the bandwidth and on the server acting as the mirror.

Solution – It is important to know the impact of database mirroring on your database and on the servers hosting those databases. In the situation I am thinking about, the mirror server did not have the same hardware as the principal server, which meant that indexing was a bigger load on the mirror than it was on the principal.

One of the solution options is to make sure that only the indexes which actually need to be reindexed are reindexed. This can be done by checking fragmentation before issuing the index statements.
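Something along these lines can be used to see which indexes need attention before generating any rebuilds; the thresholds are the commonly quoted rules of thumb, not a hard rule:

    -- List indexes in the current database with their fragmentation and a suggested action.
    SELECT OBJECT_NAME(ips.object_id)          AS table_name,
           i.name                              AS index_name,
           ips.avg_fragmentation_in_percent,
           CASE
               WHEN ips.avg_fragmentation_in_percent >= 30 THEN 'REBUILD'
               WHEN ips.avg_fragmentation_in_percent >= 5  THEN 'REORGANIZE'
               ELSE 'leave alone'
           END AS suggested_action
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
      ON i.object_id = ips.object_id
     AND i.index_id  = ips.index_id
    WHERE ips.index_id > 0  -- skip heaps
    ORDER BY ips.avg_fragmentation_in_percent DESC;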

I am really surprised at how often I see questions like this. I have been answering questions on a few different forums and I see this more and more often. Here is the quick answer: I don't know. There really is no way to find the book answer to your question. Disaster recovery plans (DRPs) can be very deep or very cheap. I would hope that most people have a basic DRP in one fashion or another. This doesn't mean that it will work, though. Heck, it doesn't even mean that it will cover your database for more than one or two specific disasters. See, I think a lot of people don't realize that even RAID 5 can be considered part of your DRP. Let's think about this: a disaster is anything that could happen to your server that would cause it to be disrupted. Looking at this simple definition, we can consider anything from a bad hard drive or a failing power supply to an F-5 tornado ripping through your datacenter to be a disaster. Fair enough? Is the impact to the end user of the database any more or any less depending on the cause of the outage? The end result is that the database is either down or it is not.

When you consider that something as simple as a power supply going bad can be a disaster for your database, your perspective may change. Have you done anything to protect yourself from a bad hard drive or a bad power supply? Consider the server class of the machine your SQL Server is on: do you have multiple power supplies; do you have your data on storage that will protect you from a single hard drive going bad? If yes, then you have at least started on the right foot. There are levels of disasters. Some are small and easy to recover from; others are all-out, life-changing events. Yet when I see this question as often as I do, I think we may have lost a bit of common sense. If you are being asked to implement a disaster recovery plan, then you should have some sort of requirements around it. More often than not, you didn't get any requirements. What you got was a task, and the people and companies that rely on you expect that you are going to make the right decisions to help them out of a sticky situation. To have an effective DRP, you need to understand what kinds of disasters you are being asked to recover the database from so you can make a plan.

Your company needs to define Disaster. 

Let's look at an example. Say your company contracts with a disaster recovery company. This hypothetical company (we will call it DR Vendor) charges your company to be "on call," and they provide a location, separate from your datacenter, that is ready to be turned up with your data in just a few hours. The only thing DR Vendor requires is that you show up at the door with an authorized person and a backup in your hand. Sounds pretty simple, doesn't it? But what happens if your internet connection goes offline? One would think that could be really bad for a database that is serving up data to the internet. The website would be down, but is it worth turning up DR Vendor for something that may be fixed in a short period of time? Then again, how much time is considered a short period of time? A small web store that does a few hundred thousand a year in sales may not face the same financial impact that a huge website like Amazon does if it were offline for an hour.

If your company does not know how to define a disaster, then it is your job to ask them. Present them with questions that will get you the information you need: How long can you be down? How much data can you lose? What is considered down (if users can read but not write to the database, is that considered down)? Then there is the most important question: what is the budget for your DRP?

So I present you with the question we started with: What is the best disaster recovery plan? I don't know – you tell me, what do you require? Once you have the requirements you can start to design the best plan for the disaster. Once you have that documented, the rest becomes a matter of research to understand the potential solutions. A recovery plan can be something as simple as a backup or as complex as a hot standby site. Common answers to most of the previous questions sound like "we can never lose data," "99.999% uptime is the requirement," or "we need to be able to bring our site up in seconds if a natural disaster destroys our data center." But often those answers are altered once the costs are reviewed. This brings me to the next discussion point.

Does it meet common business logic?

Some databases require strict uptime or zero-data-loss requirements. Think of the implications of a bank losing 15 minutes' worth of data. Are these the same requirements that a local convenience store has? I was sitting in a local coffee shop the other day, and they were helping one customer about every 3 or 4 minutes. Let's say the average cup of joe is $4.00, so with an endless line of steady customers they would help about 20 people an hour; at $4.00 a coffee, they are looking at losing about $80.00 an hour, and that is if they could not serve any coffee at all. (This crosses the line into business continuity planning.) If they were to try to protect the business from 3 hours' worth of downtime, we are talking about $240.00 worth of sales. Sure, they may upset some customers to the point that they never come back, or they may miss the one customer that hour who is ordering for everyone in the office, but in general numbers, what is the potential loss compared to the potential cost? However, what if that coffee shop is a national chain, and the corporate office going offline causes every store to stop taking orders for 3 hours? At 1,000 stores we are talking almost $250,000.00. Make sure the loss that you want to protect against is logical to the business.

At one point I was working with a client who was telling me about another consultant who had recommended a DRP. The total cost was going to be close to $100,000. Depending on the company or organization, that can be pretty affordable. But the more I thought about it, the more questions I had. When he told me the requirement was that his 2 GB database could be offline for up to a week without impacting revenue, all of a sudden this $100,000 did not meet the business logic rule. The database requirements were light, so light that a laptop could have hosted the database. This problem was solved by a simple backup schedule, with tests that proved the database could be restored.

SQL Server offers a variety of options to recover your data, or to keep another copy of your data somewhere. These options come with different costs, not just financial ones, and some of those costs are linked directly to performance. The more information you can supply, or the more information you can get as a DBA, the better off you are going to be when you are faced with that disaster.

SQL Server Backups

Posted: October 17, 2008 in Database Recovery

Over the last month I have run across two companies, both of them new clients. I am glad they are on board with SQL on Call. What concerns me about this situation is that both of these new clients had a slight problem. OK, slight does not really cover it; they were major problems. Both of the clients contacted SQL on Call because they had a database that was marked as suspect. One of the clients was running SQL Server 2005, the other SQL Server 2000. Both clients had a torn page that caused the suspect database. These were problems, but not the major problem. The client running SQL Server 2005 was fortunate: I was able to find the torn page, identify which table was impacted, and it was an easy fix. The second client has not been so lucky up to this point (still researching the options).
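For context, identifying the damaged object in a suspect database usually starts with a full consistency report; a minimal sketch, with the database name being a placeholder:

    -- Report every consistency error so the damaged object can be identified.
    DBCC CHECKDB (N'ClientDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;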

Now, what I have described is a problem, and a big one. But there was a bigger one: neither of the databases had been backed up recently, and one had never been backed up at all. The good news is that the latter was the one I was able to fix in a very short period of time. The other had not been backed up in two months. They are a small company that is required to keep all of its data for an extended period of time. I have a couple of options left, and I still may be able to fix their database. But if I can't, they are out two months of work, and the legal action could be severe. The end result is that they do have hard copies of the data, but what if they didn't?

The point I am trying to make is that if you run a SQL Server database, you really need to make sure that your database is backed up. The database may contain information that is not replaceable. Just today I answered a question on LinkedIn about a database that serves as a time card keeper. If this database were lost, there would be no record of how many hours staff worked. Can you imagine the problems that would create? Backing up your database is one of many tasks that need to be done, but talking to some of my fellow SQL Server professionals, we find that many companies don't do it.
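Even a bare-minimum routine goes a long way; a sketch, assuming the FULL recovery model, with the database name and paths as placeholders:

    -- Full backup plus regular log backups; both written with checksums.
    BACKUP DATABASE TimeCardDB
    TO DISK = N'D:\Backups\TimeCardDB_full.bak'
    WITH CHECKSUM, INIT;

    BACKUP LOG TimeCardDB
    TO DISK = N'D:\Backups\TimeCardDB_log.trn'
    WITH CHECKSUM;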

Don't lose your company because you don't back up your data.

Sox Lessons Learned

Posted: October 6, 2008 in Database Recovery

Sorry for the short note today. I am working on finishing up my presentations for the SQL Connections show in Las Vegas the second week of November. I am doing two sessions there: one on what you should be doing your first 30 days on the job as a SQL Server DBA, and the other on lessons I have learned from going through a few SOX audits. This got me thinking about something that happened to me once that just floored me, and I thought I would put it up on my blog, since this is sort of a diary type of thing, minus the whole mushy stuff.

The year was 2005, if I remember correctly. The reason I think it was 2005 is that it was the second year I had to go through a SOX audit, and the first year they were required was 2004. Anyway, I had spent the morning in my office meeting with two auditors. Neither of them was very savvy with SQL Server, but they did understand that there needed to be guidelines on who can see what data in the database, and when. More importantly, not just who can read the data, but who can insert and change data. See, we had a meeting not three months before where, after years of work, I finally convinced the CTO that developers did not need, and should not get, access to write data in production. The CTO was a developer and was having a difficult time understanding the need to control and keep the environment safe (a whole separate story). The end result was that I finally had the CTO agreeing that there was no reason a developer should need to see production data. I was explaining this to the auditors and showing them how we had locked all the developers' accounts down to development. I thought this was a good thing, and that the VP would be happy we had dodged a bullet just a few months before the audit.

Well, it was not 30 minutes later that one of these young buck auditors went over to one of the developers in the company and asked him for some assistance. The developer agreed, seeing how the request was pretty simple, and he didn't think it would actually get done. The request was this: the developer was to ask that the permissions on his read-only account in production be changed to DBO access. Not just a write account, but DBO. The developer sent an e-mail, my overly efficient team was more than happy to meet the needs of a developer, the account was granted the permissions, and within 20 minutes I was standing front and center in the VP's office.

The end result, after talking to my team, was that they were tired of getting beatings every day from our development team. Normally the way this would have worked is that my DBA would have said no. Then the developer would go to his boss and tell management how my team was just being difficult. Then the CTO would come over and ask why my team was being a roadblock. Fair question; the problem was that the CTO cared more about pushing new things into production than about protecting the existing data. The policies of the company were written, but they were being enforced by an individual who had zero concern for the systems and security until something broke and he was being asked about it. I am not so much bitter about this as I am dumbfounded.

How do you balance the development of new tools against the stability of what is already in place?