DBPedias

Your Database Knowledge Community

David Levy

  1. Promised Links from SQL Saturday 119 Presentations

    If you attended one of the sessions, I promised to post the links we discussed, and here they are. I have done my best to include everything we talked about, but we covered a lot of ground. If you notice that a link I promised to share is missing, please let me know by posting a comment below.

  2. Video of Tim Chapman presenting the demo scripts in his own words: http://www.youtube.com/watch?v=H7tPyVA7tRY&lr=1
  4. SQL 2012 Performance Dashboards: http://www.microsoft.com/en-us/download/details.aspx?id=29063
  6. SQL 2005 Performance Dashboards: http://www.microsoft.com/en-us/download/details.aspx?id=22602
  8. Calculating Rows Per Page: http://msdn.microsoft.com/en-us/library/ms178085(v=SQL.110).aspx

    Again, if I missed anything please let me know and I will get it posted as soon as possible.

  10. SQL University Troubleshooting Week: Having a Plan for Every Situation

    Today’s SQL University post will highlight the need to have a methodology to address issues that we as IT professionals may encounter in the course of our day. We will start off by looking at why we need to have a plan for every situation, and then we will dig into a methodology that I have developed by stealing bits and pieces of other people’s approaches over my career.

    Most IT professionals lean on knowledge, instinct or some mix of the two to solve the problems that they encounter. Knowledge and instinct are powerful tools that develop with experience. In the case of knowledge, we can rely on the experience of those who are gracious enough to share it via books, blogs, etc. to make us all stronger. Instinct is a much harder tool to develop. There is no way to do a Bing search to see if you have experienced a similar situation before and how you reacted to it.

    What happens if we encounter a completely new problem that nobody has ever experienced before? Instinct and knowledge both require experience to move forward but there is none to draw from. At this point there is a real chance of getting caught up in what I like to call the Reaction Cycle. The Reaction Cycle is the technological quicksand that waits for us outside of our knowledge or when we are misled by our instincts. We apply an opposite force without thought, iteratively making things worse.

    Think of the Reaction Cycle in terms of a brand new junior DBA. One day, while the rest of the team is off at lunch, the junior DBA starts getting alerts from the production sales database. From the alerts it appears that the server briefly lost its connection to the SAN and now there is some database corruption. The junior DBA, seeing all of the disk errors, decides the box needs a reboot while most of the company is still at lunch and proceeds to kick it over. When the server comes back up there are multiple inaccessible databases. By this time help has arrived: the rest of the team has returned from lunch. They quickly decide that at this point the best option they have is to restore the databases. They will lose some data but since this happened over lunch it is not the end of the world. They all agree that this is much faster than trying to work through the database corruption; after all, they need to react quickly before too much money is lost. As they are restoring the sales database they discover they cannot apply transaction logs after 4 AM. A little further digging reveals that the junior DBA reacted to an alert about a log filling last night by truncating the database log. At this point the production database is gone and they have to go with what they have. Eight hours of sales are now gone due to the Reaction Cycle.

    Shrinking databases might be said to kill kittens but reacting can definitely kill a career. When something goes wrong, your first step is to collect information. What is wrong? What do the logs say? If you have an error message then what does Bing say about it?

    The next step is to process the information that was gathered. Based on the facts and data collected you can begin to formulate a response to what is going on. Sometimes this sends you back to collect more data. For example, if you decide you need to restore, you may want to try out the restore on a development server first to make sure you have all the steps down.

    Finally, after all the collecting and processing it is time to respond. What we are doing here is making a carefully planned move based upon the collection and processing we have done in the previous steps. We almost always know at least our next move, if not the next couple of moves, and we have thoughts on what may go wrong and how we would respond to that.

    After responding we start over with the collection phase and move into the processing phase. Did we get the results we wanted? Are things better or worse? We continue to cycle through this process until we find no response is necessary.
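
    To make the cycle concrete, here is a minimal Python sketch of the loop described above. The names `troubleshoot`, `collect`, `process` and `respond`, and the `max_cycles` safety cap, are assumptions made up for this illustration; they are not part of any tool. The toy scenario is a single locked login that one response fixes.

```python
from typing import Callable, Optional


def troubleshoot(
    collect: Callable[[], dict],
    process: Callable[[dict], Optional[str]],
    respond: Callable[[str], None],
    max_cycles: int = 10,  # hypothetical safety cap, not from the methodology
) -> int:
    """Cycle through Collect, Process, Respond until no response is needed.

    Returns the number of responses that were carried out.
    """
    responses = 0
    for _ in range(max_cycles):
        facts = collect()       # Collect: gather data, change nothing
        plan = process(facts)   # Process: turn the facts into a plan, or None
        if plan is None:        # no response necessary, so the issue is closed
            return responses
        respond(plan)           # Respond: carry out the planned move
        responses += 1
    raise RuntimeError("symptoms remain after max_cycles; time to escalate")


# Toy scenario: one locked login that a single response fixes.
state = {"login_locked": True}


def collect() -> dict:
    return dict(state)


def process(facts: dict) -> Optional[str]:
    return "unlock login" if facts["login_locked"] else None


def respond(plan: str) -> None:
    if plan == "unlock login":
        state["login_locked"] = False


cycles = troubleshoot(collect, process, respond)
```

    The point of the sketch is the shape: nothing is changed during collection, and every response is followed by another round of collection to confirm the result.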

    With that, it is time to unveil the Collect, Process, Respond methodology for troubleshooting. In simple terms, we want to gather all necessary data to develop a plan then execute it. Feel free to print the image below and hang it on your cube wall to remind you to use the methodology.
    [Image: Collect, Process, Respond]
    Now that we have covered why it is important to have a methodical approach to problem solving and taken a high level look at the Collect, Process, Respond methodology it is time to start digging into the individual phases.

    The first phase we are going to dig into is the Collect phase. The most important thing about the Collect phase is that only one person does the data collection per system. With multiple people collecting data from the same system, the risk of reacting to monitoring-induced symptoms goes up sharply. A good example of this is if someone fires up a Profiler trace on a server to collect data about an issue while someone else is looking at active user sessions. The person monitoring user sessions may see the sessions start to pile up but not trace them back to the source of the waits before the trace is stopped. This could lead to the pile-up of user sessions being misattributed as a symptom of the problem rather than a byproduct of troubleshooting. If it is truly necessary to have more than one person collect data from a system, then one person has to call the shots and everyone has to communicate well.

    The Collect phase is where we start scoping the issue or figuring out how wide to cast the net. I like to refer to it as ruling things in. We want to err on the side of ruling things in because it is easy for knowledgeable people to rule them out later. Remember our goal here is to figure out what is wrong.

    So how do we go about actually collecting data? I like to ask these questions:

  11. What are the Symptoms?
  12. What Locations are involved?
  13. What Systems are involved?
  14. What Changed?
  15. What is in the Logs?
  16. What are the Performance Indicators showing?

    The starting point for the Collect phase is to look at what the symptoms are. This is a key question because it helps us figure out what to collect. The symptoms may even tell us exactly what is wrong. If users are reporting a SQL Server error message about an account being locked, then it becomes easy to know what is going on and we can jump to the Process phase.

    Once we know the symptoms we want to look at what locations are involved. This is very important because it helps define the scope of further collection activities. If all users in a particular location are having issues then we would want to focus on what is unique to that location, but if all users in the company are having issues then we would want to look at what is common to all.

    Based on the answers to the previous questions it is now time to start looking at what systems might be involved. The goal of this step is to find all of the moving parts that make up whatever activity it is that is failing. This step often includes web servers, network load balancers and SANs so it is important to start bringing in other teams at this point.

    The last of the generalist tasks to get through before really digging in with the tools that we are comfortable with is to review the change control history. Many organizations have a calendar on a wiki or on SharePoint; sometimes there is even a log at the NOC in organizations large enough to have one. Worst case, talk to the primary on-calls or managers of the systems that may be involved to find out what changed lately. Spend some time here; almost everything that goes wrong is because of a change that someone implemented.

    Finally, we get down to the part that DBAs love. We get to bust out our magical tools that we are so used to using. I always save this for last because it is easy to get lost in the data, especially if it feels like it is leading us somewhere. This is where you would go trolling on any involved servers, looking for any data that supports what you have seen earlier or anything that does not match established baselines. In the absence of formal baselines, look for anything that does not look the way you are used to seeing it.

    There are a number of third party tools to automate information gathering. Microsoft has the Management Data Warehouse and the Performance Dashboards. The great thing about these tools is they tell you at a glance what the important metrics are and usually have some sort of indicator when things are bad.

    Remember to be careful not to react while collecting data; the idea here is to gather as much useful information as possible.

    At this point we have collected all of the information we think we need and it is time to move into the Process phase. This is most people’s favorite phase because it feels the most like solving the problem, but it is important to remember that we are not pushing any buttons or pulling any levers at this point. This phase is all about making a plan to address the issue. You want to come away from this phase with an action plan, an expected result and a plan to roll back whatever action you take in case it makes things worse.

    As we move from the Collect phase to the Process phase we need to ask these questions:

  18. Are there any obvious signs of trouble?
  19. Can the problem be linked to a change?

    There is a reason that I have listed these questions first. These are the things that will let us short-circuit out of the Collect phase. It pays to keep them in mind when working through the Collect phase to avoid prolonging outages with unnecessary analysis. A good example of a short-circuit out of the Collect phase would be if one of the symptoms was an error message stating that a SQL login was locked out. The problem is clear and the solution is simple and low risk. The follow-up monitoring is simply to make sure that the account does not lock out again. The trick here is to make sure that your intentions are in the right place. Be especially aware of decisions made to make you look good or avoid looking bad.

    Next we have to look at the data we collected to see if any patterns can be identified. As patterns emerge, theories will develop. It is important to create at least one test for each theory. I say at least one because we may have multiple moving parts and each moving part should get a test. Say that a Reporting Services report hosted in a client application is slow. We would first want to run the report by calling it directly in a web browser. If it is slow in the browser then we would want to start looking at the parts that make it up, eventually pulling out individual queries and running them in a query window. We will eventually get to what the problem is.

    A clearly defined problem almost always indicates what corrective action is necessary. To keep with the above example we may say the report is slow because the query to get customer orders for the last 10 years is missing an index. There are times when no matter how well defined the problem is the answer is not clear. In those cases having a clearly defined problem is invaluable in enlisting external help whether they are from another team or from a vendor. It is the only way to make sure you are asking for the right kind of help.

    Once a list of possible actions is developed it is important to stack-rank the possible solutions by likelihood of success. The goal is to try the action that is most likely to resolve the issue while exposing you to the smallest amount of risk. I always recommend trying things out in another environment first. It helps get the steps and timing down and it exposes weaknesses in the plan so that the plan can be properly ranked. Think about this in terms of a situation where you have unrecoverable database corruption and short-circuit all the way to this step. Would you immediately start a point-in-time restore to production, or would you try it out on another server first to make sure that your backup is good and that you have a bullet-proof script?
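
    As a rough illustration of stack-ranking, the Python sketch below sorts candidate actions by estimated likelihood of success and uses risk as the tie-breaker. The `Action` class, the scoring scales and the sample numbers are all hypothetical; how you weigh likelihood against risk is a judgment call that this sketch does not settle.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    name: str
    likelihood: float  # estimated chance of resolving the issue, 0.0 to 1.0
    risk: int          # rough risk score, higher means more exposure


def stack_rank(actions: List[Action]) -> List[Action]:
    # Most likely to succeed first; among equals, prefer the lower risk.
    return sorted(actions, key=lambda a: (-a.likelihood, a.risk))


# Made-up candidates for the slow report example.
candidates = [
    Action("reboot the server", likelihood=0.1, risk=4),
    Action("rewrite the report query", likelihood=0.8, risk=3),
    Action("add the missing index", likelihood=0.8, risk=2),
]
ranked = stack_rank(candidates)
ranked_names = [a.name for a in ranked]
```

    Rehearsing an action in another environment is one way to improve both estimates before the list is ranked.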

    The last thing to do before moving on to the Respond phase is to define how to measure whether the change helped or made things worse. I like to define a measure for each benefit and each risk that I identified while ranking the possible actions. I might say that adding this index will reduce reads from 10,000 to 6 or that adding this index may cause inserts into the table to take longer. I may also say that if the index does make inserts slower and page splits are noticeably higher then I may alter the fill factor of the index. It really pays to define success and failure here to make it clear when to stay, when to roll back and when to tweak the implementation.
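
    The index example above can be written down as a small decision function. This is only a sketch: the function name, the parameters and the 25% insert tolerance are assumptions chosen for illustration, and the before and after numbers would come from whatever you measured during the Collect phase.

```python
def evaluate_index_change(
    reads_before: int,
    reads_after: int,
    insert_ms_before: float,
    insert_ms_after: float,
    page_splits_before: int,
    page_splits_after: int,
    insert_tolerance: float = 1.25,  # assumed: up to 25% slower inserts is acceptable
) -> str:
    """Return 'stay', 'tweak' or 'roll back' using measures defined up front."""
    # Benefit: the index was supposed to reduce reads.
    if reads_after >= reads_before:
        return "roll back"
    # Risk: inserts may get slower, with extra page splits alongside.
    inserts_slower = insert_ms_after > insert_ms_before * insert_tolerance
    more_page_splits = page_splits_after > page_splits_before
    if inserts_slower and more_page_splits:
        return "tweak"  # for example, revisit the fill factor of the index
    # Assumption in this sketch: anything else counts as a success.
    return "stay"


# Reads drop from 10,000 to 6 and inserts are barely affected.
verdict = evaluate_index_change(10_000, 6, 2.0, 2.1, 40, 40)
```

    Writing the thresholds down before the change is what keeps the follow-up Collect phase honest.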

    More than anything, you really need to make sure you have thought things out and are doing what is right for the situation you are facing and not doing something like rebooting because that is what you always do first.

    The first step in the Respond phase is to communicate your intentions. Depending on the type of change you might just tell the rest of your team or you may have to go through change control to get approval to do something. The more involved or risky the action you are going to take, the more documentation you should have and the more people you should involve to make sure you are not missing anything. Think of it like pool, where the shot doesn’t count unless you call it.

    Next we make the change. To make the change we follow a written plan that we have hopefully rehearsed. Granted, unlocking a user’s account is something you have done 100 times, so you can say that is well rehearsed, but how many times have you rebuilt the passive node of a production database cluster? Use your best judgment here, erring on the side of being conservative.

    A single person should make the change so that the plan can be followed step by step. If something is missing from the plan then it should be added to the plan in case these steps need to be followed again or reversed to roll back the change.

    After all that, it is time to go back and start collecting data again. The issue is closed when there are no more symptoms to be addressed and no more fixes to be deployed.

    So there you have it, a flexible, scalable methodology for solving just about any problem that any of us might face in the IT world. Use it well.

  32. SQL University Troubleshooting Week: Keeping an Open Mind

    Office politics during a major event can be dangerous. It pays to be seen as contributing to solving the problem rather than being seen as a part of it. We may have the best intentions and know our systems inside and out but if we refuse to look into something because we are sure it is not our issue then we are going to be seen as difficult and argumentative. If it turns out we are wrong and it is our issue we could even be seen as hiding something.

    I like to use the example of the rip current to explain office politics during a system outage. Unfamiliar swimmers caught in rip currents typically make the mistake of swimming against the current trying to get directly back to shore. Many get exhausted before they get back to shore and drown. Swimmers familiar with rip currents will go with the flow, swimming parallel to the shore until they are out of the current then swim back to shore.

    Keeping with the rip current example, what happens when you fight people and say it is not your issue? Do they give up and go away, leaving you to what you were working on, or do they fight you harder to prove it is your issue? At times it may even seem like their goal is not to prove it is your issue but to prove that you are being arrogant and that it could be. Many times it becomes less about the issue at hand and more about winning an argument.

    The simple truth is that it is better to go with the flow because it is faster.

    An important part of going with the flow is to construct tests that prove that something is our issue. Notice how I say to prove it is our issue rather than prove it is not. The human brain is incredibly open to having tricks played on it; by constructing an affirmative test we trick our brain into trying to find a way to make the test work. We are not happy with just a single failure to recreate the issue; we need to change the test scenario and test more until we can recreate the issue or run out of test scenarios. Most importantly, we are engaged and working to understand what is really going on.
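
    The affirmative testing idea can be sketched as a simple loop: keep trying scenarios until one recreates the issue or the list runs out. The function `try_to_reproduce`, the scenario names and the stand-in check below are hypothetical and exist only to show the shape of the approach.

```python
from typing import Callable, Iterable, Optional


def try_to_reproduce(
    scenarios: Iterable[str],
    reproduces: Callable[[str], bool],
) -> Optional[str]:
    """Return the first scenario that recreates the issue, or None."""
    for scenario in scenarios:
        if reproduces(scenario):
            return scenario  # we proved it can be our issue
    return None              # every scenario was tried without recreating it


# Made-up scenarios; the lambda stands in for actually running each test.
scenarios = [
    "run the report as a service account",
    "run the report from the remote office",
    "run the report during the nightly load",
]
found = try_to_reproduce(scenarios, lambda s: "nightly load" in s)
not_found = try_to_reproduce(scenarios, lambda s: False)
```

    Either outcome is useful: a scenario that recreates the issue tells us where to dig, and an exhausted list is evidence we can share with the other teams.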

    Having a good attitude is key to success in the information technology field. Keeping an open mind is central to that. By going with the flow and looking at issues from the right perspective we can solve problems faster while becoming known for our skill and professionalism.

  34. SQL University Troubleshooting Week: Communication

    It should come as no surprise that the first topic I am covering this week is communication, because the first thing I think anyone should do is communicate that they are troubleshooting an issue. This post will cover why we should communicate, then dig into how to put together an initial alert. The rest of the post will be spent talking about how to communicate updates and the resolution.

    First and foremost, communication prevents your management from being caught by surprise when the VP of Sales calls to ask when they will be able to place orders again.

    Communication also prevents duplicated efforts. Many times when a system is down, trouble reports come in from everywhere and go to everyone, resulting in a situation where there is no clear problem definition or problem owner. Communicating who the problem owner is allows the information to flow to a central place, so the problem can be properly defined.

    Finally, communication allows people to speak up about a recent change. If another team made a change recently they may be able to identify aspects of the issue you are working on that may be related to what they did. This is not saying someone was doing something sneaky although that sometimes happens. Usually this means that something was done and communicated to all the right people but not fully understood by the people it was communicated to or lost in turnover between support rotations. Assume the best here because you need people to speak up sooner rather than later. Treating them badly when they do speak up will only cause trouble in the long run.

    So what is the best way to communicate that there is a system issue? It helps to have an email group that includes all IT on-call pagers, management and other key people. If you do not have one, I suggest setting one up solely for communicating large issues. It is important not to spam this list; treat it like pulling a fire alarm. It should only be used to communicate system issues from discovery through resolution, with updates at regular intervals or at large milestones throughout the process.

    It is also very important that the distribution list for these emails is IT only. People outside of IT may not know the intricacies of your particular implementation, which can lead to misinformation being spread. They are not doing this on purpose; they think they understand, and they like being involved in something exciting, helping get the word out and doing their part. Let your management craft the organizational communications, since they will have to answer for what is said in them.

    When sending alert emails keep the subject line general. If you give too much information in the subject line then people can assume it is not their issue and move on. We want to take advantage of that little pit in their stomach that everyone gets when something breaks to get them up to speed on the issue.

    Next, the body of the email should provide a broad overview of the issue, including what systems are impacted, any major symptoms including error messages, the number of people impacted and any location-specific information. It is very important to keep to the important points here. The body of the email must be short enough to be fully read while long enough to include all important information.

    The body of your email should also contain a listing of any resources you need. If you need people, then say something like “I will be contacting the primary on-call from network engineering” to get their attention. Never use a mass communication to say “I cannot find Mike from the server team. If anyone sees him please tell him I need his help on this issue.” It will make both of you look bad and make the person on the receiving end less likely to help.

    Finally, only state the facts when communicating an issue and never assign blame. This is such an important part of the communication. It is important to only state what you know. What you think is not important and who is to blame is even less important. In the end the person that fixes the problem will be asked to explain what went wrong. Chances are they either made the change that led up to the issue or know who did. If there was a hardware failure they will be able to explain it in-depth as well. If you have any doubt about what you are communicating then check with someone that knows more about that particular area.
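
    To make the shape of an initial alert concrete, here is a minimal Python sketch of the fields described above. It is only one possible layout: the `Alert` class, its field names and the sample values are assumptions made up for illustration, and nothing here sends an email.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Alert:
    """Initial alert email. All names here are illustrative assumptions."""

    subject: str                  # keep this general
    systems_impacted: list[str]
    symptoms: list[str]           # facts only, including exact error messages
    people_impacted: str          # for example "about 200 users"
    locations: str = "all locations"
    resources_needed: list[str] = field(default_factory=list)

    def body(self) -> str:
        """Build a short, facts-only body that can be read in full."""
        lines = [
            "Systems impacted: " + ", ".join(self.systems_impacted),
            "Symptoms: " + "; ".join(self.symptoms),
            "People impacted: " + self.people_impacted,
            "Locations: " + self.locations,
        ]
        if self.resources_needed:
            lines.append("Resources needed: " + "; ".join(self.resources_needed))
        return "\n".join(lines)


# Hypothetical example values, not taken from a real incident.
alert = Alert(
    subject="Order entry system issue",
    systems_impacted=["Order entry"],
    symptoms=["Users cannot place orders"],
    people_impacted="about 200 users",
    resources_needed=[
        "I will be contacting the primary on-call from network engineering"
    ],
)
print(alert.subject)
print(alert.body())
```

    There is deliberately no field for a suspected cause or a responsible person, which keeps the alert to what is known.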

    Once the first alert is sent you generally have 30 minutes to either fix the issue or convene a war room. Thirty minutes is a loose rule that I use because if the fix is easy then you will almost always identify the issue, develop a plan and fix it within that time. If 30 minutes go by and you are still trying to figure out what is wrong then it is time to ask for help. Either way, a follow-up alert should go out at the 30-minute mark to update everyone on the issue. The update should follow the same rules as the initial alert, although the subject line should be prefixed with “UPDATE: ”. Updates should continue at regular intervals until the issue is resolved.

    At some point all issues get resolved, and that also needs to be communicated. The resolution should use the subject line of the original alert prefixed with “RESOLVED: ”. Due to the potential for wide distribution, the alert should never mention anyone by name. It should contain a factual description of what the issue was and what was done to solve it, because facts are just facts and cannot convey opinions. Conclusions, on the other hand, can convey opinions. Put the facts out there and let people draw their own conclusions. All that really matters is that the people in a position to prevent such a thing recognize what happened and take action to either prevent it or mitigate its impact in the future.
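
    The subject line conventions and the 30-minute rule can be sketched in a few lines of Python. This is a minimal illustration, not a finished tool: the function names and the sample subject are hypothetical, the space after “RESOLVED:” is added here for readability, and reusing 30 minutes for every later update is an assumption, since the rule above only fixes the first follow-up.

```python
from datetime import datetime, timedelta

# Assumption: the loose 30-minute rule is reused as the regular update interval.
UPDATE_INTERVAL = timedelta(minutes=30)


def update_subject(original_subject: str) -> str:
    """Subject line for a follow-up alert."""
    return "UPDATE: " + original_subject


def resolved_subject(original_subject: str) -> str:
    """Subject line for the resolution notice."""
    return "RESOLVED: " + original_subject


def next_update_due(last_sent: datetime) -> datetime:
    """When the next update should go out if the issue is still open."""
    return last_sent + UPDATE_INTERVAL


# Hypothetical subject and send time.
subject = "Order entry system issue"
first_alert_sent = datetime(2011, 6, 1, 9, 0)
print(update_subject(subject), "due at", next_update_due(first_alert_sent))
print(resolved_subject(subject))
```

    Always prefixing the original subject, rather than writing a new one, keeps the whole thread easy to find and sort in a mail client.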

    I hope you can see why I think properly communicating issues is important. I have outlined a system that has worked well for me. I strongly believe that anyone handling communications in this manner will be recognized for their professionalism and leadership.

    What works for you? Please feel free to leave it in the comments below.

  36. SQL University Troubleshooting Week: Syllabus

    Welcome to SQL University Troubleshooting Week. For anyone unfamiliar with SQL University, it is a project created by Jorge Segarra (Blog|Twitter) to give people a free way to learn SQL Server from the ground up. The professors at SQL University are bloggers, with one or more of them getting a week to cover their topic.

    I am honored to close out the final week of the Spring 2011 semester with a topic that I really enjoy: troubleshooting. I enjoy solving problems, and that has caused me to get pulled into many situations where I could use and develop my troubleshooting skills. I hope to share the things I have learned here this week to speed everyone along in the process, hopefully helping you avoid some of the pitfalls that I ran into along the way.

    I tried to keep the posts short and easily digestible but I will warn you now that there are one or two that are a bit long. Here is what I have planned for this week:

  37. Communication – Why it is good to communicate during system issues and how to get the word out effectively.
  38. Keeping an Open Mind – Having the right attitude means being courageous enough to put ego aside, looking at things from a different perspective.
  39. Having a Plan for Every Situation – You may not know what is going to happen next but you can still have a plan to deal with it.
    I hope to cover something useful for every level this week. Even as I wrote these posts I was reminded of things I could have done better in recent situations. Look for the “Communication” post tomorrow with “Keeping an Open Mind” on Thursday and “Having a Plan for Every Situation” on Friday. Friday’s post is quite long so it will make for good weekend reading.

  46. SQL Saturday 67 Slides Are Now Available

    I recently debuted a new presentation, “What To Do When It All Goes So Wrong”. The presentation is designed to give Database Administrators a basic overview of the skills they need to handle virtually any crisis that may arise. While the target audience is DBAs, I feel that most IT Professionals can benefit from the concepts.

    The first delivery of the presentation went well, although I definitely have some ideas for how I can improve on it. Look for this deck to evolve a bit over time. The biggest area that I still feel needs work is the narrative around the emergency scenario that I created. Right now it does not tie as well as I would like with the concepts later on in the presentation. Look for the narrative to develop more as I get more opportunities to deliver this presentation.

    You can get to the deck from my Presentations page. Yep, that’s right, I have a presentations page now. It feels good to finally have enough content to warrant a dedicated page.

    Please have a look and feel free to leave any feedback you might have in the comments section on that page.
