The art of getting to the root cause of the problem…
There are many things that bug me. I am one of those people who likes to understand the real cause of an issue, and I don't typically take a quick answer as good enough. How many times have you dealt with a problem and quickly got to a resolution, only to find that a week later the issue is back? Perhaps something else breaks, and at first it doesn't seem related. In the constant day-to-day battles we all fight in the technical arena, it's often hard to see the big picture and how it all fits together. We often look past the real issue and go with the quick fix. This is common; we are all busy people. Our task lists are long, and the list of open requests is even longer.
Am I really going to use this stuff?
One thing I was exposed to at a previous employer that I will always carry with me was Six Sigma training. I don't practice it on a regular basis, and it's not something my current employer focuses on, but there is one tool from that training I still use consistently: the discipline of getting to the root of a problem using the 5 Whys technique. Five is not a set number; the point is to keep asking questions until you get to the root problem. If you want to actually impact something with change, you need to spend your time fixing the root problem. The first, second, and third answers to why x broke are most likely byproducts of the actual problem.
So many times when there is an issue, we take the most recent indication of the problem, act on it, and move on. This is a recipe for more work later. A good example comes from my days as a database developer writing decision support queries: "I just need this data one time." I quickly learned after being burned 50 times. Okay, not so quickly, but the point stands: SAVE THE QUERY. That person will be back with additional questions or requirements they forgot to ask for the first time. The same is true in problem solving. If you are simply fixing the issue of today, are you sure the real root problem was actually addressed?
Why are these SQL Backup files not getting deleted?
Here is a real example of this in action. Being responsible for a team that manages over 100 SQL Servers with thousands of databases, I often question things. Some people think I'm on Twitter all day, but I actually do some constructive work at times. Many times I'll see a trend, or notice a problem, and I ask my team what the issue is to better understand where we are spending our time. I mostly get great answers; however, there are times when I want to know more about the reason.
Recently we upgraded our backup software to the latest version, which of course we thought would fix all our outstanding issues, including the occasional backup files not being deleted. We have a pretty solid backup practice, and most days we have 100% backup compliance. Every so often, however, some old backup files don't get removed from disk. Sometime in the past, a report independent of the backup tool was created to scan the backup file system and check for dated files on our backup servers. Occasionally it reports that some files out there are more than 12 days old. It's typically the on-call DBA's responsibility to go delete these old files. They ensure we have good backups for the past few days and simply drop the old files.
What I didn't understand was why these files were not getting deleted. In our backup software there is a box to remove backup files older than x days; we have it set to 1 day, for example. So the question was: why were these files not getting removed? The answer I received consistently from my team was that it's the third-party backup tool's problem. The thought was: why spend the time to dig deeper? It was much easier to delete the file and move on. I'm not against that, but in this case I had seen the issue too many times and had a desire to know why.
Acting like a 3-year-old
I had some time available and decided I wanted to understand this better. I didn't like the answers I was getting, so I set out to dig deeper. Since 99.99% of the time the backup software deleted the files it was supposed to, I figured something else was up. I started by pinging folks on Twitter who are trusted SQL resources. Ted Krueger (Twitter) quickly responded with some ideas. What I ended up stumbling upon with his help was that the backup software looks for the backup file in the msdb.dbo.backupmediafamily table to determine which file to delete.
What I also found out was that the files were not being removed because no record existed in this table for the day in question. That seemed odd: there was a file on disk, but no record in the table. So I did what I was taught many years ago and asked why. Digging a bit deeper into the backup files, I noticed that the file that was not being removed was quite a bit smaller than a typical backup of that database. Why? I dug into my job alerts and found that, yes, the backup had actually failed on the night in question, so I naturally asked myself why. I then dug into the SQL Server error log and noticed quite a few I/O errors that night, right at the time the backup was running. Again I had to ask why.
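If you want to search your own error log the same way, the undocumented xp_readerrorlog procedure accepts search strings; this is a sketch of how I'd look for the I/O messages (parameter behavior can vary by SQL Server version, so verify on yours):

```sql
-- Search the current SQL Server error log for entries containing "I/O".
-- Parameters: log number (0 = current), log type (1 = SQL Server log),
-- first search string, second search string.
EXEC master.dbo.xp_readerrorlog 0, 1, N'I/O', NULL;
```

Matching the timestamps of those entries against the backup job's run time is what tied the failed backup to the I/O problem in my case.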
In the true nature of things, I still had not reached the root cause. I still need to find out why we occasionally get I/O errors that cause our backups to fail, and fix that problem. At this point, though, I had what I needed to answer the question of why the files were not being deleted, and I am comfortable that the backup software is doing its job exactly as it is supposed to.
Here is a query I used to help me identify the issue. We have a nice naming convention in our backup file names, so it was easy to spot a gap in the results; the day in question was missing:
SELECT TOP 1000 physical_device_name
FROM msdb.dbo.backupmediafamily
WHERE physical_device_name LIKE '%insert your backup file name here_mmddyyyy%'
ORDER BY physical_device_name DESC
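Another way to spot the gap, without relying on a file naming convention, is to join the backup history tables in msdb and look at the finish dates directly. A sketch, with a placeholder database name:

```sql
-- List recent full backups for one database; a missing day in
-- backup_finish_date marks the night the backup failed.
SELECT bs.database_name,
       bs.backup_finish_date,
       bmf.physical_device_name
FROM msdb.dbo.backupset AS bs
JOIN msdb.dbo.backupmediafamily AS bmf
    ON bs.media_set_id = bmf.media_set_id
WHERE bs.database_name = N'YourDatabase'  -- placeholder; substitute your database
  AND bs.type = 'D'                       -- 'D' = full database backup
ORDER BY bs.backup_finish_date DESC;
```

A file sitting on disk with no matching physical_device_name row in these results is exactly the orphan the backup software can't see, and therefore can't delete.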
So now I had my root cause for why files over two weeks old were sitting on disk and not being removed on subsequent successful backup runs. Along the way I learned a bit about how the backup process works behind the scenes. Here is a summary of what I took away from this experience:
- Keep asking why until you get to the true root cause.
- Not fixing the actual root cause just ends up creating more work. In this case we had someone code and maintain an entire separate report to find these files so we could delete them.
- When a backup failure happens, add a step to our process to remove the failed backup file so the report doesn't find it and create more work later.
- I still need to find out the cause of the I/O errors and fix that issue.
- Twitter is a great way to get SQL help; use the #SQLHelp hashtag in your question. More on Twitter hash tags.