DBPedias

Your Database Knowledge Community

Bob Pusateri

  1. The 1940 Census: A True NoSQL Database!

    On April 2 of this year, the National Archives released the complete population schedule of the 1940 census. These records were highly anticipated not only for their genealogical value but also because of their detailed information about an incredibly interesting period of U.S. History. This census captured the point in time where the country was finally starting to come out of the great depression but had not yet entered World War II. Many questions it included were new and designed to gauge the effects of the depression, with topics including income, education, unemployment, and migration. In 1940, millions were employed by the WPA, PWA, and other New Deal agencies, and the Farm Security Administration’s photography program had a small group of photographers traversing the nation capturing images of everyday American life. Some of my favorite photos come from this collection.

    Population Schedule Form

    Population Schedule Form (Click to Enlarge)

    As interesting as the Census is for all its historical and social reasons, there’s an equally awesome tale to be told of all its data and the technologies behind it. Setting up a table in SQL Server to store 310 million rows and aggregate results from them would be pretty easy today – many DBAs deal with tables that are orders of magnitude larger than that, but in 1940 it was a major undertaking involving thousands of workers. Today the census is still a non-trivial task, however I’d imagine most of that work goes into getting data from the population into a database, while calculating the results from that is relatively simple.

    ETL: Enumerate, Tabulate, Lock Up
    Prior to 1960, censuses weren’t mailed to your house like they are now. Instead every household was visited in person by an enumerator, a single person responsible for an “Enumeration District”, or “ED”. EDs varied wildly in size, and could consist of a single block in a large city, or an entire township in a more rural area. The enumerator would stop by and ask questions about each member of your household while writing the answers onto a population schedule form that measured 23.75″ wide by 18.5″ high. Yep, the entire database was on paper. Torn page detection must have been a very serious issue! When the enumerator had information about every last person in their district, they would send their data to the Census Office via log shipping Registered Mail.

    Once in Washington, an army of operators transferred each record from the population schedule forms to a punch card. Punch cards had been used to tabulate the census since 1890 and were still the best technology available fifty years later. The 1880 census was tabulated by hand, which took 7 years to complete. Knowing there had to be a better way to calculate results, former census employee Herman Hollerith set out to create a machine that could count results from data stored on punched cards. He won a contract to tabulate the 1890 census, which was completed in only 1 year. By 1900, he had formed the Tabulating Machine Company and greatly increased his fees, knowing he had a monopoly and the Census Office would have no other option than to pay them. By 1910, census employees had developed and patented their own counting machine to avoid using Hollerith’s. The Tabulating Machine Company, which by then had merged and changed names to the Computer Tabulating Recording Company, was nearly bankrupted by the loss of business. They eventually got their act together and were able to turn a profit. In 1924, Computer Tabulating Recording Company changed its name to International Business Machines Corporation.

    Tabulating by Machine

    Women in (1940) Technology: Tabulating By Machine

    After the records were copied to punch cards and tabulated by machine, the aggregated results were released immediately for uses like determining congressional seats and allocation of public funds. Since the population schedules contain information on individuals, they are held for 72 years before being released for research purposes. Rather than keep all 3.9 million pages of records on paper, the Census Office used the most compressed format available at the time, microfilm. Apparently they had not yet discovered the rather obscure and undocumented BACKUP CENSUS TO TAPE=’MICROFILM’ WITH COMPRESSION; command. The records released this year are images scanned from that microfilm.

    Indexing
    Since all the data consists of images, there’s no easy way to index them. Optical character recognition software is pretty good these days, but probably not good enough to pick out the handwriting in these images – most of which is in cursive. Instead everything was indexed by enumeration district, meaning you need to know where someone was living during April of 1940 before you can search for them. Many genealogy websites are now working on indexing this data by name, but it is not expected to be completed for a few months.

    Finding Your Family
    If you had relatives in the US in 1940 and know where they lived at that time, I highly recommend looking for them. Everything can be found for free at http://1940census.archives.gov. The first thing you’ll need to do is find which enumeration district they lived in. If you have an address, you are very much in luck. If you only have a general idea of where they were, then you’ll probably have to do a bit more work to find them. The census site lets you drill down by state, county, and city, and provides a list of EDs that apply. If you’re searching in an urban area you might need to use maps and/or descriptions returned by the search to narrow down exactly which ED they were in. If the official site isn’t finding anything for you, I’ve also had luck using Steve Morse’s 1940 ED Finder. Once armed with a list of relevant enumeration districts, you can view or download the population schedules from each district and look for people you recognize. You’ll probably end up looking through all the sheets because the entries on the forms aren’t always in order. My guess is that enumerators would start going down a street, skip houses where nobody was at home and then come back to them later.

    I was fortunate enough to find all of my family, and it’s really neat to be able to see a snapshot of their lives at a time when my grandparents were close to my age. It also gave me great appreciation for what a chore recordkeeping was in that era! Even if you have no relatives in this census I think it’s still worth taking a look at – it was very interesting to see what kinds of jobs people had, their education level, and how much they were paid. My family was in the suburbs of Chicago at that time, and probably 7 out of 10 people in their area worked in “telephone manufacturing” which would have been at Western Electric’s Hawthorne Works. My wife’s family was in a rural area downstate, and practically everyone worked on a farm and the few who didn’t were employed by the WPA. The best job title I saw when searching for her family was “chicken picker”.

    Best of luck if you end up searching for your ancestors. If you find any who were employed as a chicken picker, let me know!

    Fun Videos

  2. Compression Estimates That Don’t Make Sense

    “Statistics means never having to say you’re certain” is a well-known quote that would often come up in my AP Statistics class back in high school. The same can be implied for estimation, and while estimates need not be perfect, they should at least be logical. While estimating the effects of compressing data I recently came across a behavior that I don’t think Mr. Spock would approve of.

    The sp_estimate_data_compression_savings procedure is an excellent tool for estimating how compression will affect an index or partition. It works by sampling 5% of the appropriate pages and applies the desired level of compression on them in TempDB. The results are then extrapolated to give an estimated size for the entire index/partition with that level of compression applied. It’s not perfect, but again it’s an estimate. I recently found however that depending on the datatypes present in the table, that estimate can get very bad. I noticed this in SQL Server 2008R2, but it also shows up in 2012.

    For this demo, you’ll need the AdventureWorks database or one of its variants. I used AdventureWorks2008R2 and AdventureWorks2012 in my tests. You’ll also need tables a bit larger than AdventureWorks provides, so I use Adam Machanic’s bigAdventure script to create the dbo.bigTransactionHistory table we’ll be working with.

    Once dbo.bigTransactionHistory is set up, we’re going to focus on it’s nonclustered index which is created in the script along with the table. Let’s rebuild this index with a fill factor of 100 so there’s as little fragmentation as possible.

    ALTER INDEX IX_ProductID_TransactionDate ON dbo.bigTransactionHistory
    REBUILD WITH (FILLFACTOR=100);
    

    Now check the estimated size of it using all 3 compression settings: NONE, ROW, and PAGE.

    EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, NONE;
    
    EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, ROW;
    
    EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, PAGE;
    

    (Click to enlarge)
    Screenshot1

    The index is currently not compressed, and you can see that the first result set’s estimated size (with no compression) is close to the current size. It won’t be exactly the same and your values will vary, but the main idea here is that using no compression gives an estimate closest to the current value and row and page compression yield smaller values yet.

    Now let’s change things up by adding another column to bigTransactionHistory and giving it a value:

    ALTER TABLE dbo.bigTransactionHistory ADD vbm VARBINARY(MAX) NULL;
    
    UPDATE dbo.bigTransactionHistory SET vbm = CONVERT(VARBINARY(MAX), '0xAB');
    

    Then check the estimated sizes again just like before:

    EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, NONE;
    
    EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, ROW;
    
    EXEC sp_estimate_data_compression_savings 'dbo','bigTransactionHistory', 2, NULL, PAGE;
    

    Screenshot2

    Look at that first result (with no compression) – it’s now way smaller than the current size. It’s actually even smaller than the estimated sizes for row and page compression, and it all seems to be due to the VARBINARY(MAX) column. If you remove that column and rebuild the table, the estimated size with no compression will be back to where it should be.

    Strange, huh? I can only assume this has something to do with VARBINARY being a type that doesn’t get compressed, but I don’t see how that should affect the estimation.

    BOL states that “If the requested compression setting is same as the current compression setting, the stored procedure will return the estimated size with no data fragmentation and using the existing fill factor.” I can’t see the assumption of no fragmentation being responsible for the estimated size shrinking so much and being smaller than the estimates using compression. Compression can increase the size of an object in certain cases, but the differences here are just too great.

    I have submitted this on MS Connect – feel free to vote up if you can reproduce it!

  3. Preventing SSRS From Changing Text Size

    Every once in a while I find myself building a report in SQL Server Reporting Services. I know my way around SSRS enough to get a report looking the way I like, but I’m definitely not an expert in the subject. While developing a report recently I noticed some very strange behavior that I have a hard time believing is intended: SSRS is changing font sizes on me depending on what I’m viewing! I’m happy to provide screenshots and instructions to reproduce this behavior, and also the solutions I found. If you’re curious, I first observed this in SQL Server 2008 R2 SP1 CU2 (10.50.2772), though I’m guessing it appears in earlier builds as well. It is still present in the latest build, SQL Server 2008 R2 SP1 CU5 (10.50.2806).

    Let’s start off by creating a very simple table about soda pop soft drinks in a test database (mine is named “Sandbox”) and fill it with some data:

    USE [Sandbox]
    GO
    
    CREATE TABLE [dbo].[SoftDrinks](
    	[Name] [varchar](50) NOT NULL PRIMARY KEY,
    	[Style] [varchar](50) NOT NULL,
    	[CaffeineLevel] [varchar](10) NULL
    );
    GO
    
    INSERT INTO [dbo].[SoftDrinks] VALUES (N'50/50', N'Citrus', NULL);
    INSERT INTO [dbo].[SoftDrinks] VALUES (N'7 Up', N'Citrus', NULL);
    INSERT INTO [dbo].[SoftDrinks] VALUES (N'Coca-Cola Classic', N'Cola', N'Medium');
    INSERT INTO [dbo].[SoftDrinks] VALUES (N'Diet Pepsi', N'Diet', N'Medium');
    INSERT INTO [dbo].[SoftDrinks] VALUES (N'Diet Rite', N'Diet', N'Medium');
    INSERT INTO [dbo].[SoftDrinks] VALUES (N'Jolt', N'Cola', N'High');
    INSERT INTO [dbo].[SoftDrinks] VALUES (N'Mello Yello', N'Citrus', N'High');
    INSERT INTO [dbo].[SoftDrinks] VALUES (N'Mountain Dew', N'Citrus', N'High');
    INSERT INTO [dbo].[SoftDrinks] VALUES (N'Pepsi', N'Cola', N'Medium');
    INSERT INTO [dbo].[SoftDrinks] VALUES (N'RC Cola', N'Cola', N'Medium');
    

    Selecting from the table should now get you this:
    Query Result

    I should add that the values in the CaffeineLevel column are completely made up. I just wanted a few different values with some NULLs mixed in.

    Now that the table is set up, we need a report. To briefly describe the report I created, I’m using the List object with text boxes for Name and CaffeineLevel in the detail section. The report is grouped by Style and the groups are expandable and closed by default. The name of each group also displays a count of the number of records it contains. (Click to enlarge).
    Visual Studio

    To make things a bit quicker, you can download the report project here. It’s setup to use Windows authentication to connect to the Sandbox database on localhost. You may need to make adjustments to the connection information depending on your test environment. I wasn’t able to observe this behavior when previewing in Business Intelligence Development Studio, so you will need to deploy it to a report server.

    Once the report is deployed, run it and you should see the 3 Style values with counts, however the details aren’t visible:
    Report

    Open the Cola and Diet groups and you’ll see the values for Name and CaffeineLevel. Everything looks fine, right?
    Report

    Now open the Citrus group. Notice how the size of the text in the CaffeineLevel column shrinks?
    Report

    Closing the Citrus group will return the text sizes for the other groups to normal. For fun, let’s try this again with a non-null value for the first row of the CaffeineLevel column. Add another row to the SoftDrinks table:

    INSERT [dbo].[SoftDrinks] ([Name], [Style], [CaffeineLevel])
    VALUES (N'123_NotNull', N'Citrus', N'Low');
    

    Refresh the report and open all groups. You’ll see the font size doesn’t change, which should be expected.
    Report

    It appears that the cause of this is related to the top value in the column. If the top value is null, the font size gets smaller.

    Solutions

    I found a couple of ways to resolve this. The first is to remove the NULL values from the CaffeineLevel column by adjusting the query. I used the ISNULL() function and tried replacing the NULL values with an empty string, but that had no effect. Replacing null values with a single space did fix the problem, however.

    SELECT
       Name,
       Style,
       --CaffeineLevel -- (The original query)
       --ISNULL(CaffeineLevel,'') AS CaffeineLevel -- (This doesn't work)
       ISNULL(CaffeineLevel,' ') AS CaffeineLevel -- (This works)
    FROM dbo.SoftDrinks;
    

    Another solution is to filter out the NULL values by using an expression in SSRS. For this I used the IsNothing() function to replace NULL values with a string containing a single space, which resolved the issue. Much like above, replacing NULL values with an empty string had no effect. The expression I used was:
    =IIF(IsNothing(Fields!CaffeineLevel.Value)," ",Fields!CaffeineLevel.Value)

    Whichever of the above you choose, you’ll arive at the correct (and visually pleasing) result of having the font sizes be the same:
    Report

    Personally I prefer to alter the query rather than add an SSRS expression, as I am not a fan of seeing <<expr>> in report definitions when it’s not truly necessary.

    If anyone else has encountered this situation, I hope you find this solution helpful!

  4. SQLFriends: A Different Kind of Gathering

    Imagine you had the chance to get together with other data professionals in your area in a setting outside of a training session or user group meeting. There’s no formal agenda, just an excellent meal and some great conversation. Perhaps a respected expert is available to answer questions and bounce ideas off of. Does this sound appealing? If so, you should check out SQLFriends.

    This past Friday I had the pleasure of attending the first-ever SQLFriends event!

    SQLFriends is the brainchild of Aaron Lowe (blog | @Vendoran) who, as he described it, wanted to come up with a SQL Server community event based around networking instead of training. Events such as SQL Saturdays and conferences are a great source of training, but many people look forward to the social and networking aspects as well. So why not just have an event focused around networking? SQLFriends was created to fill that void.

    The first SQLFriends event was a sell-out lunch at Buca di Beppo in Chicago with 30 attendees and special guest Brent Ozar (blog | @BrentO). Brent answered all kinds of questions about training, certification, consulting, and anything else we could come up with. There was also the opportunity to network amongst ourselves and meet some new faces. Several attendees mentioned how happy they were to meet at lunchtime because they can’t fit evening user group meetings into their schedule.

    All in all I’d say the inaugural SQLFriends lunch was a great success. Thank you very much to Aaron for putting it together. If you’d like to find out about future events, head on over to their website at http://sqlfriends.org/ or follow them on Twitter at @SQLFriends.

  5. Speaking at SQL Saturday #118: Wisconsin 2012

    SQL Saturday 118I feel very fortunate to have been chosen as a speaker for SQL Saturday #118 in Wisconsin on April 21, 2012. After an excellent time speaking at MADPASS in February, I’m really looking forward to getting back to Madison for this event!

    I’ll be giving my presentation entitled The Skinny on Data Compression which covers the different types of data compression available in SQL Server and some of the details behind how they work. I’ll also talk about the pros and cons of compressing data and how to determine which flavor of compression is most appropriate for an object. This presentation was very popular in Kansas City and I’m expecting more of the same up North next month!

    Registration is still open as of when this post went live, so if you’d like to attend please sign up here. If you’re on twitter, follow the #sqlsat118 hashtag for the latest information.

    I can’t wait to head to up Madison. Hope to see you there!

  6. T-SQL and B-Ball: My 2012 NCAA Bracket

    March Madness is once again upon us, which means it’s time for one of the biggest time sinks of the year: filling out a bracket for the pseudo-sanctioned office NCAA pool. Along with the superbowl, this is a hallowed time when bosses turn a blind eye to all the cash that seems to be heading towards that person who has been designated as both trustworthy enough to manage the pool and cool enough to forget that gambling is against company policy.

    Super Bowl pools are usually pretty easy – pick a square (or two or three), hand over the cash, and the numbers get filled in later. Building an NCAA bracket on the other hand is a lot of work. Even when picking winners completely at random there are over 60 choices to make. For someone like myself who doesn’t watch nearly enough basketball to make educated guesses about most of the teams it’s a lot of wasted effort. I’d rather spend my time doing something more constructive, such as developing a formula in T-SQL to make my bracket picks for me. Much like last year I’ve done it again.

    This year’s bracket is based on 2 things. The first is the idea that a #16 seed team will never beat a #1 seed team. (Since it’s never happened yet it seems like a safe bet). After that, game winners are determined by a formula.

    All teams in the tournament are searched in Bing twice according to the following template:
    <School Name> <Team Name> Basketball rocks
    <School Name> <Team Name> Basketball sucks

    For example:
    “Illinois Fighting Illini Basketball rocks”
    and
    “Michigan Wolverines Basketball sucks”
    would both be valid queries.

    The total number of results returned by Bing for each query are recorded in a table, and to represent a game the following values will be compared for teams A and B:
    Team A: A.rocks / B.sucks
    Team B: B.rocks / A.sucks

    The team with the higher value wins the game.

    One flaw in last year’s formula was that it generated a value for each school which never changed. This meant the school with the highest generated value would be the one to win the championship. This time around the outcome differs based on the teams that are compared, which makes things much more interesting. Here’s the bracket that resulted (click to enlarge).
    Bracket

    If you’d like to try this yourself you can download the code here. Since search results are constantly changing, you’ll likely have a different outcome than I did. Also be forewarned that while this code does work, it was put together very quickly and is neither elegant nor efficient.

    Why Bing?
    In case you’re curious, I decided to use Bing search instead of Google because I’m a tightwad and it turns out Google’s Search API isn’t free. If you want to make more than 100 queries per day on Google, you need to pay. Since I needed to make over 100 queries to gather my data plus a few extra during development I went with Bing, which has a much friendlier policy of restricting your query rate instead of limiting the total number of queries per day.

    Enjoy the tournament! Should you use my method and come up with a winning bracket, please contact me to discuss my share of the winnings.

  7. Collecting Index Usage Statistics

    Knowing when and how often your indexes are used can really come in handy. Indexes that are never utilized aren’t worth keeping around, and knowing their usage patterns can be a big help when making decisions regarding things such as filegroup placement and compression settings. SQL Server 2005 brought some great advancements in the form of the sys.dm_db_index_usage_stats DMV, which returns statistics on index usage that are automatically being kept by the server. While it’s a step forward, I feel it still leaves a few things to be desired:

    • The values are cumulative since the server was last restarted
    • Once the server is shut down or restarted, there’s no way to get the previous values back

    CollectableWith this in mind, I came up with my own system for archiving non-cumulative index usage stats on a daily basis. Microsoft introduced Management Data Warehouse in SQL Server 2008 which provides a way to keep track of these statistics over time (along with many others), but I’ve been using my own method longer than that. This will also work in SQL Server Express Edition, which doesn’t support MDW. I’ll give descriptions of each part below, but if you just want the code right now download it here.

    Tables

    IndexUsageStats_LastCumulative
    As I mentioned before, the output from sys.dm_db_index_usage_stats is cumulative since the last time the server was restarted, and I was looking for a way to find daily values that weren’t cumulative. Restarting the server immediately after collecting the values would have accomplished this, but users probably wouldn’t be very happy :). Instead I just keep track of the previous cumulative values and subtract them from the current ones. If the output shows an index was used 7 times on Tuesday at Midnight, and the same number was 12 on Wednesday at Midnight, then it must have been used 5 times in that 24 hour period. IndexUsageStats_LastCumulative holds the previous cumulative values for all indexes to make this calculation possible.

    Names
    The only true way to identify an object in SQL Server is by its name, but DMVs typically return object IDs instead of names. It’s easy to translate between names and IDs, but over time names and IDs can change:
    - If an object is dropped and re-created with the same name, chances of it having the same ID are virtually zero.
    - If an object is renamed, its ID will not change, but of course its name will.
    Using either IDs or Names exclusively can make historical values become useless over time – you might have an ID that’s no longer valid or a name that changed a while back. To give myself more options for tracking changes over time, I store both. Since storing each object’s name in a character data type would consume a lot of space and be repetitive, I prefer to map object names to numeric values, and this table facilitates that.

    IndexUsageStats
    This table does the heavy lifting storing for our system – it contains all the historical index usage statistics by both object ID and name. I’ve found it to be a great candidate for compression if you’re running enterprise edition, but it’s still pretty compact even when uncompressed thanks to the Names table.

    Stored Procedures

    CollectIndexUsageStats
    The stored procedure that drives statistics collection. You’ll want to set up a job to run this – mine runs at midnight. Basically it iterates through each online database in the instance collecting usage stats and then compares them to the previous cumulative values and writes the differences to the IndexUsageStats table. You’ll find comments in-line.

    Views

    vw_IndexUsageStats
    Since the data contained in the IndexUsageStats table is all numeric, it’s not the easiest to browse. This view makes some joins back to the Names table so things are a little more user-friendly. Most of the queries I write are against this view. Object IDs aren’t returned here as they’re typically not needed. When they are necessary, I’ll write more detailed queries against IndexUsageStats.

    A few notes before the code:

    • As always, this code is offered as-is with no warranty whatsoever. Do not deploy this in a production environment before you have tested it and understand exactly what it does.
    • I recommend putting all these objects in their own database. If you already have a DBA database for management information that should work fine too. My database is called “iDBA” because I inherited it with that name.
    • This was written and tested on SQL Server 2008 and 2008R2. It can probably be tweaked to run in 2005 by removing the datetime2 datatype and writing substitute queries for the MERGE statement.
    -----------------------------------------------------------
    -- Name: CollectIndexUsageStats.sql
    --
    -- Description: Collects index usage statistics over time.
    --
    -- Author: Bob Pusateri, http://www.bobpusateri.com
    --
    -- THIS SCRIPT IS PROVIDED "AS-IS" WITHOUT ANY WARRANTY.
    -- DO NOT RUN THIS ON A PRODUCTION SYSTEM UNTIL YOU HAVE
    --   COMPLETE UNDERSTANDING OF THE TASKS IT PERFORMS AND
    --   HAVE TESTED IT ON A DEVELOPMENT SYSTEM.
    -----------------------------------------------------------
    
    -- I recommend putting these objects in a separate database
    -- (or your existing DBA database if you have one)
    --
    -- If your DBA database is not named 'iDBA', you'll want to replace it with your DB name
    --     It is referenced several times in the script below
    USE [iDBA];
    GO
    
    -- stores cumulative data from sys.dm_db_index_usage_stats DMV
    CREATE TABLE [dbo].[IndexUsageStats_LastCumulative] (
    	[ServerNameID] [int] NOT NULL,
    	[DatabaseID] [smallint] NOT NULL,
    	[ObjectID] [int] NOT NULL,
    	[IndexID] [int] NOT NULL,
    	[LoadTime] [datetime2](0) NOT NULL,
    	[User_Seeks] [bigint] NOT NULL,
    	[User_Scans] [bigint] NOT NULL,
    	[User_Lookups] [bigint] NOT NULL,
    	[User_Updates] [bigint] NOT NULL,
    	[System_Seeks] [bigint] NOT NULL,
    	[System_Scans] [bigint] NOT NULL,
    	[System_Lookups] [bigint] NOT NULL,
    	[System_Updates] [bigint] NOT NULL,
    	CONSTRAINT [PK_IUS_C] PRIMARY KEY CLUSTERED ([ServerNameID],[DatabaseID],[ObjectID],[IndexID])
    );
    GO
    
    -- used for Server/DB/Schema/Table/Index name mapping
    CREATE TABLE [dbo].[Names] (
    	[ID] [int] IDENTITY(1,1) NOT NULL,
    	[Value] [nvarchar](260) NOT NULL,
    	CONSTRAINT [PK_Names] PRIMARY KEY CLUSTERED ([ID])
    );
    GO
    
    -- stores historical usage statistics
    CREATE TABLE [dbo].[IndexUsageStats] (
    	[StatsDate] [datetime2](0) NOT NULL,
    	[ServerNameID] [int] NOT NULL,
    	[DatabaseID] [smallint] NOT NULL,
    	[ObjectID] [int] NOT NULL,
    	[IndexID] [int] NOT NULL,
    	[DatabaseNameID] [int] NOT NULL,
    	[SchemaNameID] [int] NOT NULL,
    	[TableNameID] [int] NOT NULL,
    	[IndexNameID] [int] NULL,
    	[User_Seeks] [bigint] NOT NULL,
    	[User_Scans] [bigint] NOT NULL,
    	[User_Lookups] [bigint] NOT NULL,
    	[User_Updates] [bigint] NOT NULL,
    	[System_Seeks] [bigint] NOT NULL,
    	[System_Scans] [bigint] NOT NULL,
    	[System_Lookups] [bigint] NOT NULL,
    	[System_Updates] [bigint] NOT NULL,
    	CONSTRAINT [PK_IUS] PRIMARY KEY CLUSTERED ([StatsDate],[ServerNameID],[DatabaseID],[ObjectID],[IndexID]),
    	CONSTRAINT [FK_IUS_Names_DB] FOREIGN KEY([DatabaseNameID]) REFERENCES [dbo].[Names] ([ID]),
    	CONSTRAINT [FK_IUS_Names_Index] FOREIGN KEY([IndexNameID]) REFERENCES [dbo].[Names] ([ID]),
    	CONSTRAINT [FK_IUS_Names_Schema] FOREIGN KEY([SchemaNameID]) REFERENCES [dbo].[Names] ([ID]),
    	CONSTRAINT [FK_IUS_Names_Table] FOREIGN KEY([TableNameID]) REFERENCES [dbo].[Names] ([ID]),
    	CONSTRAINT [CK_IUS_PositiveValues] CHECK ([User_Seeks]>=(0) AND [User_Scans]>=(0) AND [user_Lookups]>=(0)
    		AND [user_updates]>=(0) AND [system_seeks]>=(0) AND [system_scans]>=(0) AND [system_lookups]>=(0)
    		AND [system_updates]>=(0))
    );
    GO
    
    -- collects usage statistics
    -- I run this once daily (can be run more often if you like)
    CREATE PROCEDURE [dbo].[CollectIndexUsageStats]
    AS
    BEGIN
    	BEGIN TRY
    		SET NOCOUNT ON;
    
    		-- get current stats for all online databases
    
    		SELECT database_id, name
    		INTO #dblist
    		FROM sys.databases
    		WHERE [state] = 0
    			AND database_id != 2; -- skip TempDB
    
    		CREATE TABLE #t (
    			StatsDate DATETIME2(0),
    			ServerName SYSNAME,
    			DatabaseID SMALLINT,
    			ObjectID INT,
    			IndexID INT,
    			DatabaseName SYSNAME,
    			SchemaName SYSNAME,
    			TableName SYSNAME,
    			IndexName SYSNAME NULL,
    			User_Seeks BIGINT,
    			User_Scans BIGINT,
    			User_Lookups BIGINT,
    			User_Updates BIGINT,
    			System_Seeks BIGINT,
    			System_Scans BIGINT,
    			System_Lookups BIGINT,
    			System_Updates BIGINT
    		);
    
    		DECLARE @DBID INT;
    		DECLARE @DBNAME SYSNAME;
    		DECLARE @Qry NVARCHAR(2000);
    
    		-- iterate through each DB, generate & run query
    		WHILE (SELECT COUNT(*) FROM #dblist) > 0
    		BEGIN
    			SELECT TOP (1) @DBID=database_id, @DBNAME=[name]
    			FROM #dblist ORDER BY database_id;
    
    			SET @Qry = '
    				INSERT INTO #t
    				SELECT
    					SYSDATETIME() AS StatsDate,
    					@@SERVERNAME AS ServerName,
    					s.database_id AS DatabaseID,
    					s.object_id AS ObjectID,
    					s.index_id AS IndexID,
    					''' + @DBNAME + ''' AS DatabaseName,
    					c.name AS SchemaName,
    					o.name AS TableName,
    					i.name AS IndexName,
    					s.user_seeks,
    					s.user_scans,
    					s.user_lookups,
    					s.user_updates,
    					s.system_seeks,
    					s.system_scans,
    					s.system_lookups,
    					s.system_updates
    				FROM sys.dm_db_index_usage_stats s
    				INNER JOIN ' + @DBNAME + '.sys.objects o ON s.object_id = o.object_id
    				INNER JOIN ' + @DBNAME + '.sys.schemas c ON o.schema_id = c.schema_id
    				INNER JOIN ' + @DBNAME + '.sys.indexes i ON s.object_id = i.object_id and s.index_id = i.index_id
    				WHERE s.database_id = ' + CONVERT(NVARCHAR,@DBID) + ';
    				';
    
    			EXEC sp_executesql @Qry;
    
    			DELETE FROM #dblist WHERE database_id = @DBID;
    		END -- db while loop
    
    		DROP TABLE #DBList;
    
    		BEGIN TRAN;
    
    		-- create ids for Server Name by inserting new ones into dbo.Names
    		INSERT INTO iDBA.dbo.Names (Value)
    		SELECT DISTINCT RTRIM(LTRIM(t.ServerName)) AS ServerName
    		FROM #t t
    		LEFT JOIN iDBA.dbo.Names n ON t.ServerName = n.Value
    		WHERE n.ID IS NULL AND t.ServerName IS NOT NULL
    		ORDER BY RTRIM(LTRIM(t.ServerName));
    
    		-- same as above for DatabaseName
    		INSERT INTO iDBA.dbo.Names (Value)
    		SELECT DISTINCT RTRIM(LTRIM(t.DatabaseName)) AS DatabaseName
    		FROM #t t
    		LEFT JOIN iDBA.dbo.Names n ON t.DatabaseName = n.Value
    		WHERE n.ID IS NULL AND t.DatabaseName IS NOT NULL
    		ORDER BY RTRIM(LTRIM(t.DatabaseName));
    
    		-- SchemaName
    		INSERT INTO iDBA.dbo.Names (Value)
    		SELECT DISTINCT RTRIM(LTRIM(t.SchemaName)) AS SchemaName
    		FROM #t t
    		LEFT JOIN iDBA.dbo.Names n ON t.SchemaName = n.Value
    		WHERE n.ID IS NULL AND t.SchemaName IS NOT NULL
    		ORDER BY RTRIM(LTRIM(t.SchemaName));
    
    		-- TableName
    		INSERT INTO iDBA.dbo.Names (Value)
    		SELECT DISTINCT RTRIM(LTRIM(t.TableName)) AS TableName
    		FROM #t t
    		LEFT JOIN iDBA.dbo.Names n ON t.TableName = n.Value
    		WHERE n.ID IS NULL AND t.TableName IS NOT NULL
    		ORDER BY RTRIM(LTRIM(t.TableName));
    
    		-- IndexName
    		INSERT INTO iDBA.dbo.Names (Value)
    		SELECT DISTINCT RTRIM(LTRIM(t.IndexName)) AS IndexName
    		FROM #t t
    		LEFT JOIN iDBA.dbo.Names n ON t.IndexName = n.Value
    		WHERE n.ID IS NULL AND t.IndexName IS NOT NULL
    		ORDER BY RTRIM(LTRIM(t.IndexName));
    
    		-- Calculate Deltas
    		INSERT INTO iDBA.dbo.IndexUsageStats (StatsDate, ServerNameID, DatabaseID, ObjectID,
    			IndexID, DatabaseNameID, SchemaNameID, TableNameID, IndexNameID, User_Seeks, User_Scans,
    			User_Lookups, User_Updates, System_Seeks, System_Scans, System_Lookups, System_Updates)
    		SELECT
    			t.StatsDate,
    			s.ID AS ServerNameID,
    			t.DatabaseID,
    			t.ObjectID,
    			t.IndexID,
    			d.ID AS DatabaseNameID,
    			c.ID AS SchemaNameID,
    			b.ID AS TableNameID,
    			i.ID AS IndexNameID,
    			CASE
    				-- if the previous cumulative value is greater than the current one, the server has been reset
    				-- just use the current value
    				WHEN t.User_Seeks - ISNULL(lc.User_Seeks,0) < 0 THEN t.User_Seeks
    				-- if the prev value is less than the current one, then subtract to get the delta
    				ELSE t.User_Seeks - ISNULL(lc.User_Seeks,0)
    			END AS User_Seeks,
    			CASE
    				WHEN t.User_Scans - ISNULL(lc.User_Scans,0) < 0 THEN t.User_Scans
    				ELSE t.User_Scans - ISNULL(lc.User_Scans,0)
    			END AS User_Scans,
    			CASE
    				WHEN t.User_Lookups - ISNULL(lc.User_Lookups,0) < 0 THEN t.User_Lookups
    				ELSE t.User_Lookups - ISNULL(lc.User_Lookups,0)
    			END AS User_Lookups,
    			CASE
    				WHEN t.User_Updates - ISNULL(lc.User_Updates,0) < 0 THEN t.User_Updates
    				ELSE t.User_Updates - ISNULL(lc.User_Updates,0)
    			END AS User_Updates,
    			CASE
    				WHEN t.System_Seeks - ISNULL(lc.System_Seeks,0) < 0 THEN t.System_Seeks
    				ELSE t.System_Seeks - ISNULL(lc.System_Seeks,0)
    			END AS System_Seeks,
    			CASE
    				WHEN t.System_Scans - ISNULL(lc.System_Scans,0) < 0 THEN t.System_Scans
    				ELSE t.System_Scans - ISNULL(lc.System_Scans,0)
    			END AS System_Scans,
    			CASE
    				WHEN t.System_Lookups - ISNULL(lc.System_Lookups,0) < 0 THEN t.System_Lookups
    				ELSE t.System_Lookups - ISNULL(lc.System_Lookups,0)
    			END AS System_Lookups,
    			CASE
    				WHEN t.System_Updates - ISNULL(lc.System_Updates,0) < 0 THEN t.System_Updates
    				ELSE t.System_Updates - ISNULL(lc.System_Updates,0)
    			END AS System_Updates
    		FROM #t t
    		INNER JOIN iDBA.dbo.Names s ON t.ServerName = s.Value
    		INNER JOIN iDBA.dbo.Names d ON t.DatabaseName = d.Value
    		INNER JOIN iDBA.dbo.Names c ON t.SchemaName = c.Value
    		INNER JOIN iDBA.dbo.Names b ON t.TableName = b.Value
    		LEFT JOIN iDBA.dbo.Names i ON t.IndexName = i.Value
    		LEFT JOIN iDBA.dbo.IndexUsageStats_LastCumulative lc
    			ON s.ID = lc.ServerNameID
    			AND t.DatabaseID = lc.DatabaseID
    			AND t.ObjectID = lc.ObjectID
    			AND t.IndexID = lc.IndexID
    		ORDER BY StatsDate, ServerName, DatabaseID, ObjectID, IndexID;
    
    		-- Update last cumulative values with the current ones
    		MERGE INTO iDBA.dbo.IndexUsageStats_LastCumulative lc
    		USING #t t
    		INNER JOIN iDBA.dbo.Names s ON t.ServerName = s.Value
    		ON s.ID = lc.ServerNameID
    			AND t.DatabaseID = lc.DatabaseID
    			AND t.ObjectID = lc.ObjectID
    			AND t.IndexID = lc.IndexID
    		WHEN MATCHED THEN
    			UPDATE SET
    				lc.LoadTime = t.StatsDate,
    				lc.User_Seeks = t.User_Seeks,
    				lc.User_Scans = t.User_Scans,
    				lc.User_Lookups = t.User_Lookups,
    				lc.User_Updates = t.User_Updates,
    				lc.System_Seeks = t.System_Seeks,
    				lc.System_Scans = t.System_Scans,
    				lc.System_Lookups = t.System_Lookups,
    				lc.System_Updates = t.System_Updates
    		WHEN NOT MATCHED BY TARGET THEN
    			INSERT (ServerNameID, DatabaseID, ObjectID, IndexID, LoadTime, User_Seeks, User_Scans,
    				User_Lookups, User_Updates, System_Seeks, System_Scans,
    				System_Lookups, System_Updates)
    			VALUES (s.ID, t.DatabaseID, t.ObjectID, t.IndexID, t.StatsDate, t.User_Seeks, t.User_Scans,
    				t.User_Lookups, t.User_Updates, t.System_Seeks, t.System_Scans,
    				t.System_Lookups, t.System_Updates)
    		WHEN NOT MATCHED BY SOURCE
    			THEN DELETE;
    
    		COMMIT TRAN;
    
    	END TRY
    	BEGIN CATCH
    
    		IF @@TRANCOUNT > 0
    			ROLLBACK TRAN;
    
    		DECLARE @ErrorNumber INT;
    		DECLARE @ErrorSeverity INT;
    		DECLARE @ErrorState INT;
    		DECLARE @ErrorProcedure NVARCHAR(126);
    		DECLARE @ErrorLine INT;
    		DECLARE @ErrorMessage NVARCHAR(2048);
    
    		SELECT @ErrorNumber = ERROR_NUMBER(),
    			   @ErrorSeverity = ERROR_SEVERITY(),
    			   @ErrorState = ERROR_STATE(),
    			   @ErrorProcedure = ERROR_PROCEDURE(),
    			   @ErrorLine = ERROR_LINE(),
    			   @ErrorMessage = ERROR_MESSAGE();
    
    		RAISERROR (@ErrorMessage, @ErrorSeverity, @ErrorState);
    
    	END CATCH
    END
    GO
    
    -- displays usage statistics
    CREATE VIEW [dbo].[vw_IndexUsageStats]
    AS
    SELECT
    	s.StatsDate,
    	vn.Value AS ServerName,
    	dbn.Value AS DatabaseName,
    	sn.Value AS SchemaName,
    	tn.Value AS TableName,
    	dn.Value AS IndexName,
    	s.IndexID,
    	s.User_Seeks,
    	s.User_Scans,
    	s.User_Lookups,
    	s.User_Updates,
    	s.System_Seeks,
    	s.System_Scans,
    	s.System_Lookups,
    	s.System_Updates
    FROM dbo.IndexUsageStats s
    INNER JOIN dbo.Names vn ON s.ServerNameID = vn.ID
    INNER JOIN dbo.Names dbn ON s.DatabaseNameID = dbn.ID
    INNER JOIN dbo.Names sn ON s.SchemaNameID = sn.ID
    INNER JOIN dbo.Names tn ON s.TableNameID = tn.ID
    LEFT JOIN dbo.Names dn ON s.IndexNameID = dn.ID;
    GO
    
  8. 2012: New Year, New Goals

    Happy New YearHappy New Year to all! At the start of 2011, I laid out some goals for the year and did an update on them back in July. Now that it’s 2012 it’s a great time to take a look at what I accomplished (and fell short on) last year and also set some new targets. My 2011 goals were as follows:

    Speak More – I spoke at two SQL Saturdays this year, #67 in Chicago and #101 in Kansas City. I had hoped to make it to more than 2, but I still met my goal for the year and am hoping to continue the trend.

    Travel More – I flew to Kansas City when I spoke there, so I’ll consider this goal completed as well.

    Become eligible to take the MCM exam – I achieved this when I passed 70-451 back in April. Check!

    Attend training – I considered this done when I attended the SQLskills Immersion Event in Chicago in May.

    Keep up the blogging – I had kind of a mid-year lull in blog posts, but I like to think I finished off strong. I had 39 posts in 2011, but 45 in 2010. Worse yet, I didn’t even blog the entire year in 2010. Definitely a fail here.

    Learn IO – As I mentioned in July, the plan was for my team to get a new SAN this year. We now have the SAN, actually it was delivered just last week. Unfortunately with the holidays apparently nobody from the vendor was available to install it, so as far as I know it is currently in our datacenter but still in boxes. Hopefully it gets set up soon so I can start working with it, but in the meantime this is a fail as well.

    Help out more – I’m still around on twitter and chip in on #sqlhelp when I can, but more things going on at work keep me from being as active as I would like. I’m going to have to find a way to increase my activity. Fail.

    All in all I consider myself to be 4 for 7. I did some great things in 2011, but not quite as much as I had hoped. That being said, I think it’s ok that I didn’t do everything I wanted to as those things will automatically roll over into next year. Here’s what I hope to accomplish in 2012:

    Continue all the above – While I listed the above as goals, many of them are more of a journey than a destination. For example, training is a process that should have no end because there will always be something else out there to learn. The same goes for speaking, traveling, blogging and helping out. As for my goal of learning more about IO (and SAN management in particular) I should get that opportunity very soon once our new hardware is installed.

    More MCM Progress – Becoming eligible to take the Microsoft Certified Master exam was great, but that’s no stopping point. I need to keep up studying, practicing, and working towards passing things! I don’t feel comfortable saying I want to become an MCM in 2012, so instead my goal is to continue progressing toward it.

    Attend the PASS Summit – I should have done this long ago but never got the opportunity, so one of my goals for this year is to attend the PASS Summit. Perhaps this year I can be on the other side of all the tweets coming from the keynote! An added bonus would be if I could speak there, and I will definitely be submitting a few abstracts once the call for speakers opens.

    Blog smarter, not harder – While my previous goal of keeping up my blogging is all well and good, I don’t just want to focus on the number of posts I write – anyone can write a lot of short and useless things and call it a victory in terms of quantity. I’m generally happy with what I put up here – if I don’t like a post then it doesn’t go live, and I’ve left a few on the cutting room floor or delayed them so they can be improved. In 2012 I plan to continue this trend and ensure that when you come here you will find informational, helpful, and thought-provoking material. As always, I welcome your comments and do my best to respond to them in short order. If you, my readers, aren’t happy, then there’s really no point in me having a blog to begin with!

    So that’s my goals for the year. May you all have a happy, healthy, and prosperous 2012!

  9. T-SQL Tuesday #025: Trick or …?

    T-SQL Tuesday LogoThis month’s T-SQL Tuesday is hosted by Allen White (blog | @SQLRunr) and is an invitation to show off a nifty T-SQL trick you use to make your job easier. Here’s one of my favorites as of late:

    Some of the source systems we copy from shard their data across multiple similar tables. For example, instead of having 1 table named dbo.Sales, they might have 30 of them named dbo.Sales_001 to dbo.Sales_030. If there were a rhyme or reason to this sharding it might be seen as an advantage, but unfortunately neither myself nor others on the team have ever found a method to this madness, and the vendor will not divulge the way they do this. As if that’s not bad enough, additional tables can pop up out of nowhere, and not all these tables are guaranteed to contain the same columns. Table dbo.Sales_031 may pop up tomorrow and have only a subset of the columns from the previous 30 tables, or may contain new columns not present in any of the others.

    To keep this all straight, I have a process that compares the columns in all applicable tables and generates a view that combines their contents using UNION ALL. I’ve actually blogged about an aspect of this before, but today I have a function containing a nice little trick that’s proven very helpful for me. Since the tables I’m generating a view of aren’t guaranteed to have the same columns in them, a simple “SELECT *” from each of them won’t work because unioning tables together requires identical datatypes in the same order. Instead I have to generate a select statement that explicitly lists all columns for every table. T-SQL can easily accomplish this with cursors and loops, but then I found FOR XML PATH, which made life a lot simpler.

    FOR XML PATH has many features which are outside the scope of this post, but the one behavior we’re going to exploit today is that you can pass a row tag name that’s blank. When you do this, instead of sandwiching each row between tags, it simply concatenates all the row values together into one string. If you add commas, you’ve got a great way to list out all the columns of a table which you can then form into a SELECT statement. Without any further ado, here’s the demo code:

    -- create our demo table with a few columns
    CREATE TABLE dbo.MyTable (
       ID INT NOT NULL,
       Value VARCHAR(20),
       Notes NVARCHAR(MAX)
    );
    
    -- select the table's columns from system tables
    SELECT c.name
    FROM sys.columns c
    INNER JOIN sys.tables t ON c.object_id = t.object_id
    WHERE t.name = 'MyTable'
    ORDER BY c.column_id;
    
    -- now let's format it as XML
    SELECT c.name
    FROM sys.columns c
    INNER JOIN sys.tables t ON c.object_id = t.object_id
    WHERE t.name = 'MyTable'
    ORDER BY c.column_id
    FOR XML PATH;
    
    -- now pass a blank string argument to PATH
    -- note how the <row> tags are now removed but column tags remain
    SELECT c.name
    FROM sys.columns c
    INNER JOIN sys.tables t ON c.object_id = t.object_id
    WHERE t.name = 'MyTable'
    ORDER BY c.column_id
    FOR XML PATH('');
    
    -- casting it to NVARCHAR removes the tags for each column
    DECLARE @s NVARCHAR(MAX);
    
    -- now let's add a blank string argument to PATH
    SET @s = (
          SELECT ',' + QUOTENAME(c.name) -- comma to separate names
          FROM sys.columns c
          INNER JOIN sys.tables t ON c.object_id = t.object_id
          WHERE t.name = 'MyTable'
          ORDER BY c.column_id
          FOR XML PATH('') -- XML that really isn't
       );
    
    -- remove the leading comma
    SELECT SUBSTRING(@s, 2, LEN(@s));
    

    Wrap the final output with a SELECT and a table name and you’re good to go. There’s tons more uses for this than creating lists of columns – basically anytime you want to turn column values into a delimited list you can make it happen using this method. Happy coding!

  10. SQL Saturday #119 Open for Registration!

    SQL Saturday 119 LogoThe third time’s a charm! SQL Saturday #119 (Chicago 2012) has been posted on the SQL Saturday website! It will once again be held at the DeVry University campus in Addison, IL and will be on May 19, 2012.

    SQL Saturday in Chicago has a special place in my heart for a few reasons. The first SQL Saturday I ever attended was SQL Saturday #31 back in 2009, the very first one held in Chicago. I didn’t have this blog back then, but had an excellent time. Last year was SQL Saturday #67, which I did blog about. I was also very fortunate to be part of a great group of people who help put it together, and am back for more this time around. The rest of our team consists of:

    - Ted Krueger (blog | @onpnt)
    - Aaron Lowe (blog | @vendoran)
    - Bill Lescher (@blescher)
    - Rich Rousseau (@zigzag219)
    - Wendy Pastrick (blog | @wendy_dance)

    Register Today!

    Registration is open, so sign up today! We had a great turn out last year and we’re hoping to fill up again! The event itself is free, but there is a $10.00 charge for lunch (it won’t be pizza!)

    Why You Need To Be There

    • It’s free knowledge! Yes there is a charge for lunch, but that’s optional. The training itself costs nothing.
    • Meet other awesome DBAs! Most companies have very few DBAs, sometimes only one. This can lead to feelings of isolation and that the whole world of developers is against you. You’re not alone though – there’s tons of us out there! Community events are by far the best way to meet people in this field – where else can you hang out all day with a bunch of people as passionate about databases as you are?
    • Did I mention it’s free?

    Call for Speakers

    Have you spoken at work, a user group, or a SQL Saturday before? Have you never tried but are looking to take the plunge? The call for speakers is open until April 19, 2012, so please submit an abstract (or two or three) if you’re at all interested!

    I can’t wait to see you there in May!

  1. 1
  2. Next ›
  3. Last »