DBPedias

Your Database Knowledge Community

Valentino Vranken

  1. SSIS, Flat Files And Accents (é, è, …)

    This is a follow-up to my article on Loading Complex XML Using SSIS and XSLT.  In that article I demonstrated how you can convert complex XML into simple CSV using XSLT in SSIS.

    Dealing With Special Characters

    If you’ve followed the instructions in my article mentioned above and you need to deal with special characters such as the é and è encountered in the French language, you probably noticed that it wouldn’t really work as expected.  In fact, in your final result you may have ended up with the special characters being replaced with other, even more special, characters.  Obviously not good.

    Here’s an explanation on the reason why that happens, and also how to deal with it.

    Setting The Scene

    Imagine the following sample XML, representing a really huge book collection:

    
    <books>
        <book>
            <title>The Hitchhiker's Guide to the Galaxy</title>
            <author>Douglas Adams</author>
            <language>EN</language>
            <description>The Hitchhiker's Guide to the Galaxy is a science fiction comedy series created by Douglas Adams.</description>
        </book>
        <book>
            <title>Le Trône de fer</title>
            <author>George R.R. Martin</author>
            <language>FR</language>
            <description>Le Trône de fer est une série de romans de fantasy de George R. R. Martin, dont l'écriture et la parution sont en cours. Martin a commencé à l'écrire en 1991 et le premier volume est paru en 1996. Prévue à l'origine comme une trilogie, la série compte désormais cinq volumes publiés et deux autres sont attendus.</description>
        </book>
    </books>
    

    As you can see, the second book in the list is the French version of the first book in the A Song of Ice and Fire series by George R.R. Martin and as it goes with French, there are some accents in the description of the book.

    We’ll use the following XSLT to convert it to CSV:

    
    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs=http://www.w3.org/2001/XMLSchema xmlns:fn="http://www.w3.org/2005/xpath-functions">
      <xsl:output method="text" version="1.0" encoding="UTF-8" indent="no"/>
      <xsl:template match="/">
        <xsl:text>BookTitle;Author;Language;Description</xsl:text>
        <xsl:text>&#13;&#10;</xsl:text>
    
        <xsl:for-each select="books/book">
          <xsl:text>"</xsl:text>
          <xsl:value-of select="title"/>
          <xsl:text>";"</xsl:text>
          <xsl:value-of select="author"/>
          <xsl:text>";"</xsl:text>
          <xsl:value-of select="language"/>
          <xsl:text>";"</xsl:text>
          <xsl:value-of select="description"/>
          <xsl:text>"</xsl:text>
          <xsl:text>&#13;&#10;</xsl:text>
        </xsl:for-each>
    
      </xsl:template>
    </xsl:stylesheet>
    

    Using an XML Task in the Control Flow, as explained in my article, we’d get the following output:

    BookTitle;Author;Language;Description

    "The Hitchhiker’s Guide to the Galaxy";"Douglas Adams";"EN";"The Hitchhiker’s Guide to the Galaxy is a science fiction comedy series created by Douglas Adams."

    "Le Trône de fer";"George R.R. Martin";"FR";"Le Trône de fer (A Song of Ice and Fire) est une série de romans de fantasy de George R. R. Martin, dont l’écriture et la parution sont en cours. Martin a commencé à l’écrire en 1991 et le premier volume est paru en 1996. Prévue à l’origine comme une trilogie, la série compte désormais cinq volumes publiés et deux autres sont attendus."

    So far so good, all accents are still present!

    Then we’d import the file using a Flat File Source component in a Data Flow Task.  Here’s what the General page of the Flat File Connection Manager would look like:

    Flat File Connection Manager: General

    We’ve set double-quote as Text Qualifier and checked the Column names in the first data row textbox.

    Switching to the Columns page we’d get the following:

    Flat File Connection Manager: Columns - the Preview has messed up the accents!

    Hang on, that’s not right!  The Preview is not displaying our accents as expected!  Oh my, what’s going on here? Let’s call the code page detectives!

    A Mismatch Investigation

    Take a good look at the XSLT which we’ve used to convert the XML into CSV, especially the xsl:output line:

    <xsl:output method="text" version="1.0" encoding="UTF-8" indent="no"/>

    That line specifies that the text output should be encoded using the UTF-8 code page.

    Now take a good look at the General page in the screenshot earlier, more precisely this part:

    Code page: 1252 (ANSI - Latin I) is not what we need right now!

    Indeed, code page 1252 (ANSI – Latin I).  While the input is UTF-8.  Of course that results in a mismatch of certain characters, as demonstrated here.  The fix is fairly easy, just change the Code page setting to 65001 (UTF-8).

    Code page: 65001 (UTF-8) - much better!

    If we now switch back to the Columns page we should come to the following result:

    Flat File Connection Manager: Columns page preview with accents!

    Ah, sure looks better doesn’t it?  All accents are present as expected.

    But in case you thought that’s it, I’d advise you to think again.  Don’t worry, I’ll demonstrate what I mean.  Let’s do that by setting up a simple Data Flow.

    Setting Up The Data Flow

    Throw in a Flat File Source and specify our Flat File Connection Manager.  I also prefer to keep NULLs as they come in, using the Retain null values from the source as null values in the data flow checkbox.

    Flat File Source: Connection Manager

    If you click the Preview button you should get similar output as shown one screenshot earlier.

    Now hook this up to an OLE DB Destination that writes the incoming data into a table in your favorite database:

    OLE DB Destination is not happy :(

    As you can see, our destination is not entirely happy with all this.  Here are the details of one of the error messages:

    Validation error. Data Flow Task: Data Flow Task: The column "BookTitle" cannot be processed because more than one code page (65001 and 1252) are specified for it.

    Looks like once more we’ve got a code page conflict.  And we sure do. Clicking the Data Flow connector between the Flat File source and OLE DB destination shows us the following:

    Data Flow Path Editor shows that our strings are encoded using the 65001 code page.

    Each of our incoming string values is encoded using the 65001 (UTF-8) code page.  But our database was created using the Latin1_General_CI_AS collation.  So we’ve indeed got a code page conflict!

    Fear not, that’s easily remedied.  Add a Derived Column transformation in between the source and destination and convert each incoming string value using a cast expression such as this one:

    (DT_STR, 50, 1252)BookTitle_IN

    Note: whenever I need to manipulate incoming columns to create a second version of the same column, I rename the incoming column to TheColumn_IN.  The new version will be called TheColumn and preferably TheColumn is the name of the field in the destination table.  This makes it easy to distinguish all columns later down the flow.

    Here’s what the final version of the Derived Column looks like:

    Using the Derived Column transformation to cast the incoming strings into the correct code page.

     

    Next we’ll need to open the Destination and change the mapped fields to the new ones.  Because my new columns are called exactly the same as the fields in the destination table, I can do that easily.  In the Mappings page, all I need to do is right-click the grey background in between the two tables and click Select All Mappings, hit the Delete button, right-click again and click Map Items By Matching Names:

    Using Map Items By Matching Names, easy!

    With the data flow finished, let’s give our package a run!

    Flat File Source has got a length issue!

    Ouch, our source is not happy!  A closer examination of the Output pane brings us to the following error:

    Error: 0xC02020A1 at Data Flow Task, Flat File Source [16]: Data conversion failed. The data conversion for column "Description" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".

    Oh right, so far we haven’t bothered looking at the actual length of the data that we’re importing.  Actually, what is the length of our data flow columns??  Well, if you’ve been paying close attention you should have noticed the number 50 several times in the screenshots and expressions above.  That’s indeed the default length for text columns when importing a flat file.

    And if you scroll back up to the sample XML, you’ll notice that the content for the description is longer than 50 characters, thus causing our error!  Let’s find out how to get that solved!

    Fixing The Field Length Issue

    The first step in getting this fixed is opening up the Advanced page in the Flat File Connection Manager editor.

    Flat File Connection Manager: using the Advanced page to change field length.

    Then select the Description field and change its OutputColumnWidth property from 50 to 500.

    That will cause the source to generate a warning.  Remove this warning by opening and closing the source editor.  Click the Yes button in the popup that appears.

    The next step is changing the expression for the Description field in the Derived Column to this:

    (DT_STR,500,1252)Description_IN

    Indeed, the field length is one of the parameters in that cast.  The other numeric parameter is obviously the code page.

    Having done that you’ll notice that the destination will start complaining.  Of course, you’ll need to adapt the destination table to reflect the field length increase as well.  So change the table definition and open/close the destination editor to make it happy.

    Alright, let’s run the package once more!

    Finally the data flow is happy with it all and has inserted two records:

    That's more like it: all components colored green!

    And what does our table contain?  Let’s find out:

    All accents have been imported!

    That’s looking good for sure!

    Conclusion

    In this follow-up article I have demonstrated what might go wrong when you need to deal with special characters while importing flat files, and how to solve your possible issues.  In case you missed the original article, have a look through this link.

    Have fun!

    Valentino.

    References

    Wikipedia: UTF-8

    Share

  2. Building Reports With Dynamic Datasets

    On the forums I regularly encounter questions like:

    I have a report and want to show A, B, C but sometimes I want X, Y, Z and not A, B, C. How?

    Or, in other words:

    I have a parameter and based on the selection I want to return different fields in my dataset.  How can I do that?

    My initial reaction would be “I don’t think you can do that”.  But then I thought it would be useful in certain situations and decided to try it out. And guess what?  You can do that!  Here’s how.

    Just to make sure everyone is on track: this article is not about dynamicity in terms of rows returned, that would be filtered datasets and you’ll already find plenty of references on the internet.  This article is about a varying number of columns in the dataset, which is a little less straightforward.

    The database used in the examples is AdventureWorks2008R2, available at Codeplex.  And the screenshots are taken from SQL Server 2008 R2 x64 SP1.

    The Scenario

    The report we’re going to build should show a list of products sold.  But the report is used by different user groups: those who just want to see the sales numbers and those who also care about stock levels!

    So by default the report should show a list of items and number sold, but it should also be possible to render that same report whilst displaying the stock-related statistics.

    And to make it even a bit more interesting, by default the data should be ordered according to product number but in “stock level mode” the ordering should put those with the lowest current stock first.

    The Report

    Let’s get started!  The first step in creating a report is often the writing of a SELECT statement.  In this scenario we’ll be needing two of them, both in the same dataset.

    The Dataset

    Our dataset is going to need a parameter to be able to decide what type of user is requesting the report. Let’s call that parameter WithStockData, and its type will be Boolean.

    The layout of the dataset statement will be this:

    if @WithStockData = 1
    -- sales and stock data
    else
    -- regular sales data

    A simple IF statement, taking the parameter into account.  The parameter is a boolean value so when it’s True, it equals to 1.

    And here’s the full statement for the dataset:

    if @WithStockData = 1
        -- sales and stock data
        select P.ProductNumber, P.Name
            , P.SafetyStockLevel, P.ReorderPoint
            , SUM(SOD.OrderQty) SoldQuantity, SUM(I.Quantity) InventoryQuantity
            , SUM(I.Quantity) - SUM(SOD.OrderQty) CurrentStock
        from Production.Product P
        inner join Sales.SalesOrderDetail SOD on SOD.ProductID = P.ProductID
        inner join Production.ProductInventory I on I.ProductID = P.ProductID
        group by P.ProductNumber, P.Name, P.SafetyStockLevel, P.ReorderPoint
        order by SUM(I.Quantity) - SUM(SOD.OrderQty) asc
    else
        -- regular sales data
        select P.ProductNumber, P.Name
            , SUM(SOD.OrderQty) SoldQuantity
        from Production.Product P
        inner join Sales.SalesOrderDetail SOD on SOD.ProductID = P.ProductID
        group by P.ProductNumber, P.Name
        order by P.ProductNumber asc;

    Some data as returned by the SELECT in the then part:

    The sales and stock data: 7 fields in total

    And some data as returned by the query in the else part:

    The regular sales data only consists of three columns

    As you can clearly see, the first query returns seven fields while the second one contains only three.  You can also see that both results are ordered differently.

    Now, let’s get the fun started!  Create a new report, set up a data source that points to the AdventureWorks2008R2 database and create a dataset with the query above:

    Setting up the dataset

    Power tip: to create the dataset, do not right-click on the Datasets node in the Report Data pane, but right-click on the Data Source and then select Add Dataset.  That saves you some work because the Data Source will be pre-populated.  All you need to do is paste the query in the Query field and give it a decent Name.

    Click the OK button to close the Dataset Properties.

    Now open the new dataset in the Report Data pane and count its fields:

    Our dataset contains seven fields!

    You should come to seven!  How nice, all of our fields are there.  This is not always the case, but I’ll handle that later.

    The Parameters node in the Report Data pane should now contain a new parameter called @WithStockData:

    The @WithStockData parameter got created automatically

    Double-click it to get to its properties and change the Data type to Boolean (by default it’s Text).

    The WithStockData parameter properties

    If you want, you can also specify a default value.  The value should be either “true” or “false”:

    Our Boolean parameter is set to false as default

    With the dataset fully set up, let’s now move on to visualizing it.

    Displaying Dynamic Columns

    Put a Table on the report Design and set it up as follows:

    The Table contains seven columns

    All seven columns have been added, and to avoid whitespace in the table when some columns are hidden I’ve put the dynamic columns at the end.

    One step remains: even though the values won’t always be present, the columns will not disappear automatically.  To take care of that, we’ll enter an expression on the Hidden property of each column.  The expression looks like this:

    =Fields!InventoryQuantity.IsMissing

    Or visually:

    Using the IsMissing property to hide a column dynamically

    To set up the expression, click the grey area above the column title to select it and then locate the Hidden property in the Properties pane.

    Each column in a dataset has got the IsMissing property.  When its value is True, it means that the column is not present and should thus be hidden.

    And here’s what the rendered report looks like:

    Report with all columns displayed

    Once more, with the parameter set to False:

    The report showing sales data only

    The four stock-related columns are nicely hidden!  And the ordering is working as well because we’ve taken care of that in the dataset’s queries.

    Easy, huh?  Well, sure, but I’ve made sure that the process went as smoothly as possible.  It takes some knowledge on how SSRS actually works.  Let’s make this clear by adapting the example just a little.

    Understanding The Dataset

    Delete the current dataset and create a new one, using the following statement (ensure that you give it the same name as the original one):

    if @WithStockData = 0
        -- regular sales data
        select P.ProductNumber, P.Name
            , SUM(SOD.OrderQty) SoldQuantity
        from Production.Product P
        inner join Sales.SalesOrderDetail SOD on SOD.ProductID = P.ProductID
        group by P.ProductNumber, P.Name
        order by P.ProductNumber asc
    else
        -- sales and stock data
        select P.ProductNumber, P.Name
            , P.SafetyStockLevel, P.ReorderPoint
            , SUM(SOD.OrderQty) SoldQuantity, SUM(I.Quantity) InventoryQuantity
            , SUM(I.Quantity) - SUM(SOD.OrderQty) CurrentStock
        from Production.Product P
        inner join Sales.SalesOrderDetail SOD on SOD.ProductID = P.ProductID
        inner join Production.ProductInventory I on I.ProductID = P.ProductID
        group by P.ProductNumber, P.Name, P.SafetyStockLevel, P.ReorderPoint
        order by SUM(I.Quantity) - SUM(SOD.OrderQty) asc;

    The only difference with the previous version is that the IF condition is reversed and thus the two queries are swapped.

    Now render the report.  What do you see?

    An error occurred during local report processing.

    The definition of the report ‘/DynamicDataset’ is invalid.

    The Hidden expression for the text box ‘Textbox7’ refers to the field ‘InventoryQuantity’. Report item expressions can only refer to fields within the current dataset scope or, if inside an aggregate, the specified dataset scope.

    Letters in the names of fields must use the correct case.

    Oh my, it’s broken!

    Now take a good look at the available dataset fields:

    The dataset only has three columns, oh my!

    That’s right, only three!  Four of them have gone missing!  The reason for that is because SSRS uses the first SELECT query it encounters in the whole statement to determine the available fields.  It’s not able to automatically detect the different situations and create all the fields that can possibly be returned.

    One way to ensure all fields are created is to put the query that returns all possible fields as first query, which is what I initially did.  But of course that’s not always an option.

    Manually Adding Fields To A Dataset

    Luckily it’s possible to manually add fields to the dataset.  You can do this by clicking the Add button in the Fields page of the Dataset Properties and then selecting Query Field.

    Manually adding additional fields to the dataset

    So, add the four missing fields:

    The four additional fields added manually

    If you now render the report, it should behave exactly the same as in the initial version!

    What About Those Warnings?

    If you’re someone who pays attention to the Error List pane, you may have noticed some warnings.  Two for each dynamic field to be exact.

    Here’s an example of the rsMissingFieldInDataSet warning:

    Warning    1    [rsMissingFieldInDataSet] The dataset ‘dsProductSales’ contains a definition for the Field ‘SafetyStockLevel’. This field is missing from the returned result set from the data source.    C:\test\SSRS\SSRS2008\DynamicDataset.rdl    0    0

    And here’s the rsErrorReadingDataSetField warning:

    Warning    2    [rsErrorReadingDataSetField] The dataset ‘dsProductSales’ contains a definition for the Field ‘SafetyStockLevel’. The data extension returned an error during reading the field. There is no data for the field at position 4.    C:\test\SSRS\SSRS2008\DynamicDataset.rdl    0    0

    I’ve got a developer background, so I always try to remove all warnings.  So if you really want to get rid of those warnings too, even that’s possible.  But it will require some Custom Code writing.  I already covered that topic some years ago, when I wrote an article about Detecting Missing Fields.

    Conclusion

    With this article I believe to have demonstrated that datasets can be quite flexible, even though it doesn’t always seem so.

    Have fun!

    Valentino.

    Share

  3. Passing A DateTime Parameter Into The Execute SQL Task

    When you’ve used SSIS for a while, you may have run into the following situation already.  Or maybe today is your first time and that’s the reason that you’ve arrived here.

    “Huh, what’s he talking about?”, I hear you thinking.  Read on then. :-)

    The Scenario

    You’ve got a stored procedure or another SQL statement that needs to get called from the Execute SQL Task in the Control Flow of your package.  So far so good.  One of the parameters that needs to get passed into the statement is of the DateTime type.

    How would you do that?

    Parameter Mapping – Take 1

    Following the KISS principle, let’s say we’ve got the following really complex table in our database:

    create table dt ( dtVal datetime );

    And in our Execute SQL task we have this extremely complex INSERT statement:

    insert into dt values (?)

    The statement is expecting one parameter.  The parameter that I want to pass into it is System::StartTime which is of type DateTime as shown in the screenshot below.

    Show the system variables by activating the Show System Variables button

    “Hang on, how did you get the Variables window to display the system variables?”

    Ah, good question, by clicking that Show System Variables button, indicated with the red rectangle.

    So you set up the Parameter Mapping as follows, specifying DBTIMESTAMP as Data Type and zero as Parameter Name because it’s the first parameter in the statement:

    Execute SQL Task: Parameter Mapping

    Then you decide to give it a test run.  But alas, it throws you the following error:

    Error: 0xC002F210 at Execute SQL Task 1, Execute SQL Task: Executing the query “insert into dt values (?)” failed with the following error: “Invalid time format”. Possible failure reasons: Problems with the query, “ResultSet” property not set correctly, parameters not set correctly, or connection not established correctly.

    So we’ve got a datetime column in the table and we’ve got a DateTime package variable.  But alas, the Execute SQL Task is not happy with passing this value to the query.

    Now what?

    The SqlStatementSource Expression

    Let’s try another method then.  Instead of passing the parameter’s value through the Parameter Mapping page, we’ll set up an expression that constructs the whole INSERT statement, including the parameter’s value.

    Have a look at the following expression:

    "insert into dt values ('" +
    
    (DT_STR, 4, 1252) DATEPART("yyyy", @[System::StartTime]) + "-" +
    
    (DT_STR, 2, 1252) DATEPART("mm", @[System::StartTime]) + "-" +
    
    (DT_STR, 2, 1252) DATEPART("dd", @[System::StartTime]) + " " +
    
    (DT_STR, 2, 1252) DATEPART("hh", @[System::StartTime]) + ":" +
    
    (DT_STR, 2, 1252) DATEPART("mi", @[System::StartTime]) + ":" +
    
    (DT_STR, 2, 1252) DATEPART("ss", @[System::StartTime]) + "." +
    
    (DT_STR, 3, 1252) DATEPART("ms", @[System::StartTime]) + "')"

    It uses the DATEPART function to fetch parts of the System::StartTime variable and feed it into the INSERT statement using a format that works all the time (YYYY-MM-DD HH:MM:SS.MIL).  Here’s what it generated when I clicked the Evaluate Expression button in the Expression Builder:

    insert into dt values (’2011-5-31 17:59:37.0′)

    So where exactly would you specify that expression?  In the Execute SQL Task editor, open up the Expressions page.  Then click the Expressions item in the Misc list so that the button with the ellipsis appears.  Now click that button, select SqlStatementSource as property and click the Ellipsis button in the Expression field to get to the Expression Builder.

    Then you’ll end up with something like this:

    The Property Expressions Editor with an expression specified for the SqlStatementSource property

    Give the package another run.  If everything has been set up as expected, the Execute SQL Task should color green and a select on the table should give one record:

    Our test table contains one timestamp!

    Hang on, does it really have to be this complicated?

    Well, maybe not…

    Parameter Mapping – Take 2

    So let’s give the Parameter Mapping another go.

    Set up the Execute SQL Task just like in Take 1 above, with one small difference: select DATE instead of DBTIMESTAMP as Data Type for the parameter.

    Choose DATE as Data Type when passing a DateTime package variable into the Execute SQL Task

    Now give the package another run.  Look at that, it colors green and there’s an extra record in the table:

    An extra timestamp was written to the table

    DATE doesn’t seem like the most logical type to choose in this scenario, which is why most people won’t even consider it.  But it works!  Actually, “DATE” is not really the best name that could be given to this particular data type.  Here’s the description of DT_DATE (not to be confused with DT_DBDATE!) according to MSDN:

    A date structure that consists of year, month, day, hour, minute, seconds, and fractional seconds. The fractional seconds have a fixed scale of 7 digits.

    The DT_DATE data type is implemented using an 8-byte floating-point number. Days are represented by whole number increments, starting with 30 December 1899, and midnight as time zero. Hour values are expressed as the absolute value of the fractional part of the number. However, a floating point value cannot represent all real values; therefore, there are limits on the range of dates that can be presented in DT_DATE.

    On the other hand, DT_DBTIMESTAMP is represented by a structure that internally has individual fields for year, month, day, hours, minutes, seconds, and milliseconds. This data type has larger limits on ranges of the dates it can present.

    What this means is that you have to be careful when using this type.  Even though it works fine today, it may not run fine in a similar scenario that required different date ranges.  But obviously you’ve got that covered by your unit test scenarios!

    Conclusion

    In this article I have demonstrated how a DateTime package variable can be passed as parameter into the Execute SQL Task in more than one different way.  My method of preference is the one using the DATE type in the Parameter Mapping.

    Have fun!

    Valentino.

    References

    SSIS Execute SQL Task

    SSIS DatePart function

    KISS Principle

    SSIS Junkie: Datetime variables don’t always do what you expect

    Share

  4. Loading Complex XML Using SSIS

    In my previous article I showed you how the XML Source component can be used to load XML files into a SQL Server database, using fairly simple XML structures.  In this follow-up article I will demonstrate how to tackle the complex XML issue.

    The Complex XML Example

    You probably know that SSRS reports, RDLs, are actually XML files.  And they’re not the easiest types of XML files around.  To humans they are still readable but the structure can be quite complex.  So there we’ve got our example: an RDL.  More specifically I’ll be using the RDL that’s available for download in one of my earlier articles.

    The Goal

    Every good example has got a goal.  Our goal today is to retrieve a list of datasets and fields as defined in the RDL.  Shouldn’t be too difficult, right?

    Using The XML Source Component

    Let’s try to get this done through the XML Source component with which we’re very familiar by now.  You know the drill: drag an XML Source into your Data Flow, open it up and configure the XML and XSD locations.

    Note: to be able to do this I cheated a bit by manually manipulating the RDL a little.  More precisely I removed all the namespace references from the <report> tag and further down the XML (removed “rd:”).

    With both files configured, let’s have a look at the Columns page:

    The XML Source component handling a really complex XML file

    Look at that massive list of output flows!  In total I’ve gotten 45 of them, all for free!  Even if you’re up to the task of creating 45 output tables, do you really want to find out how to get these joined together?  To prevent creating that bunch of tables you may consider using the Merge Join component… 45 times in your data flow. Didn’t think so!

    Sure, it would run fine if you manage to get it all constructed.  But in my opinion this is just too silly to try out because there’s an interesting alternative.

    And that alternative is XSLT – eXtensible Stylesheet Language Transformations.

    Using XSLT

    With XSLT you describe what you want to retrieve from the XML document and what it should look like.  In this example we’ll be retrieving the list of datasets and their fields, in CSV format.  CSV stands for Comma-Separated Values, although I prefer the term “Character-Separated Values” as the separator is not always a comma.

    To be able to write correct XSLT, you need to know what the XML structure looks like.  Here are the first 31 lines of the sample RDL file mentioned earlier.

    <?xml version="1.0" encoding="utf-8"?>
    <Report>
      <AutoRefresh>0</AutoRefresh>
      <InitialPageName>A Very Unique Name</InitialPageName>
      <DataSources>
        <DataSource Name="srcContosoDW">
          <DataSourceReference>ContosoDW</DataSourceReference>
          <SecurityType>None</SecurityType>
          <DataSourceID>b7a3d32c-e95d-4acf-bb99-9d60755303ea</DataSourceID>
        </DataSource>
      </DataSources>
      <DataSets>
        <DataSet Name="dsProductList">
          <Query>
            <DataSourceName>srcContosoDW</DataSourceName>
            <CommandText>select DPC.ProductCategoryName, DPS.ProductSubcategoryName, DP.ProductName
    from dbo.DimProduct DP
    inner join dbo.DimProductSubcategory DPS
        on DPS.ProductSubcategoryKey = DP.ProductSubcategoryKey
    inner join dbo.DimProductCategory DPC
        on DPC.ProductCategoryKey = DPS.ProductCategoryKey;</CommandText>
          </Query>
          <Fields>
            <Field Name="ProductCategoryName">
              <DataField>ProductCategoryName</DataField>
              <TypeName>System.String</TypeName>
            </Field>
            <Field Name="ProductSubcategoryName">
              <DataField>ProductSubcategoryName</DataField>
              <TypeName>System.String</TypeName>
            </Field>

    As you can see, the main node is called Report.  Nested under Report we’ve got DataSets, which can have several DataSet elements.  Each DataSet has a set of Fields with one or more Field elements.  Using that information we come to the following XSLT.

    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions">
      <xsl:output method="text" version="1.0" encoding="UTF-8" indent="no"/>
      <xsl:template match="/">
        <xsl:text>DataSource;DataSet;Field</xsl:text>
        <xsl:text>&#13;&#10;</xsl:text>
    
        <xsl:for-each select="Report/DataSets/DataSet/Fields/Field">
          <xsl:text>"</xsl:text>
          <xsl:value-of select="../../Query/DataSourceName"/>
          <xsl:text>";"</xsl:text>
          <xsl:value-of select="../../@Name"/>
          <xsl:text>";"</xsl:text>
          <xsl:value-of select="@Name"/>
          <xsl:text>"</xsl:text>
          <xsl:text>&#13;&#10;</xsl:text>
        </xsl:for-each>
    
      </xsl:template>
    </xsl:stylesheet>

    So, what is the XSLT describing?  On line three, we say that the output should be text in UTF-8 encoding.  The “template match” on the fourth line takes the whole XML document into consideration, hence the forward slash.  Then on line five we start writing output through the xsl:text tag.  This is our header line.  As you can see we’re using the semi-colon as column separator in the CSV output.  Line six adds a CRLF (carriage-return + line feed) to the output.

    Then the fun part starts.  If you have experience with XPath, the way XSLT walks through the XML document should look familiar to you.

    The xsl:for-each tag loops over all the Fields in all the DataSets in the document.

    Using the xsl:value-of tag, we can fetch values out of the XML.  The first value being retrieved is the name of the data source that dataset is using.  (I’ve added the retrieval of the data source to demonstrate how element values are retrieved.)  The path to the DataSourceName element is Report/DataSets/DataSet/Query/ so we use the double-dot syntax to navigate two levels up in the XML tree.  The value of the element itself is retrieved by just using its name, as demonstrated in the XSLT above.

    The next value-of tag retrieves the Name attribute of the DataSet, hence the two levels up, and the final value-of fetches the Name attribute of the Field element.

    Now that the XSLT is clear for everyone, how do we apply it to our XML document?  Here comes the time for SSIS once more!

    Open up the BIDS with the Control Flow of an SSIS package active and throw in an XML Task component.

    The XML Task, one of the Control Flow Items in Integration Services

    Double-click the component to open up the XML Task Editor.  This is what it looks like by default:

    XML Task Editor: default settings

    As this is an all-round XML task that can handle several XML-related tasks, the first setting that we need to modify is called OperationType.  That’s not too complicated because it comes with a dropdown and XSLT is one of the possible values.

    The different operation types supported by the XML Task

    With XSLT selected, the editor transforms into the following:

    The XML Task Editor with XSLT as OperationType

    Now we need to configure where the task can find our XML file, through the Source property.  Click the Source textbox to make the dropdown appear and select <New File connection…>.

    You can create a new File Connection through the XML Task Editor

    In the File Connection Manager Editor, leave the Usage type at Existing file and select the RDL.

    Next up we’re going to specify where the task can find the XSLT that needs to be applied to the XML.  That can be done through the Second Operand settings.  As SecondOperandType, select File Connection.  Use the dropdown of the SecondOperand property to create a second new file connection that points to your XSLT file.

    With that set up as well, only one step remains.  The task still doesn’t know where the output should be saved.  Or that it actually should get saved.  So first switch the SaveOperationResult property to True.  As you can see, DestinationType is already set to File Connection, that’s what we need.  Use the dropdown of the Destination property to create a third new file connection.  This time however, Usage Type should be set to Create File.  Specify path and filename for the output file and click OK to close the File Connection Manager Editor.

    This is what our XML Task now looks like in the editor:

    The XML Task Editor with all input and output files specified, as expected for our XSLT experiment

    As shown above, I’ve called the output file DatasetInfo.csv.

    One more property that can be interesting is the OverwriteDestination property.  Setting it to True can ease the testing of your package if you need to execute it multiple times.  Which you’ll probably want when your XSLT is not giving the expected output.  Don’t forget to set it to False afterwards (depending on what behavior you actually expect from your package).

    Okay, now close the XML Task Editor and execute the package.  If you haven’t made any mistakes, the task should color green and you should have an extra file on your hard drive somewhere.  Here’s what the content of my DatasetInfo.csv looks like:

    DataSource;DataSet;Field

    “srcContosoDW”;”dsProductList”;”ProductCategoryName”

    “srcContosoDW”;”dsProductList”;”ProductSubcategoryName”

    “srcContosoDW”;”dsProductList”;”ProductName”

    “srcContosoDW”;”dsProductList”;”ProductCategoryColor”

    “srcContosoDW”;”dsProductList”;”EasterEgg”

    Look at that, a list of fields, all part of the dsProductList dataset.

    “Hang on, wasn’t this article going to demonstrate how to get complex XML files imported into our database?  And now you’re writing the data to a file?!”

    Well yeah, you’re right.  Unfortunately the XML Task does not offer the possibility to write to a table in a database.  So to get the data imported into your database you’ll need to set up a Data Flow that imports the CSV files.  But that shouldn’t be too difficult to achieve, right?

    Mission accomplished!

    Conclusion

    With this article I have shown how Integration Services can be used to retrieve data out of complex XML files, without actually using the XML Source component.  I hope you’ve enjoyed reading it as much as I had while writing.  Or maybe you know another interesting method to get complex XML imported.  Feel free to post comments!

    Have fun!

    Valentino.

    References

    XSLT (Wikipedia)

    CSV (Wikipedia)

    XML Task (MDSN)

    Share

  5. Loading XML Using SSIS

    SQL Server Integration Services can read XML files, that’s known by every BI developer.  (If you didn’t, don’t worry, I’m aiming this article at newcomers as well.)

    But how far can you go?  When does the XML Source component become unusable?  Let’s find out!

    To create the examples I’m using the following SQL Server version:

    Microsoft SQL Server 2008 R2 (RTM) – 10.50.1600.1 (X64)   Apr  2 2010 15:48:46   Copyright (c) Microsoft Corporation  Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7600: )

    Basic Example

    This first example is a really simple XML file containing a list of colors with their corresponding RGB code.

    <colors>
      <color RGB="FF0000">Red</color>
      <color RGB="00FF00">Green</color>
      <color RGB="0000FF">Blue</color>
      <color RGB="FFFFFF">White</color>
      <color RGB="000000">Black</color>
    </colors>

    Let’s import this into a database.  Open up the BIDS, create an SSIS project and throw a Data Flow Task into the package and open it up.

    The component that we’re now most interested in is the XML Source, one of the components in the Data Flow Sources category in the Toolbox.

    The XML Source component

    Add one of those to your Data Flow and double-click it to open up the XML Source Editor.

    The Data Access Mode should be set to XML file location, which is the default setting.  The other options are XML file from variable – useful if you’ve got the file path and name of the XML file in a variable – and XML data from variable – interesting if your XML data is actually stored in a variable.

    As XML Location, select the .xml file.  Our XML sample does not have an inline schema, so we can’t use that checkbox.  And we can’t click the OK button either, it’s grayed out.  The source component really expects a description of the XML structure before the editor can be closed.

    The bottom of the screen even shows a warning with the following message:

    XML Schema (XSD) is not specified. Select an existing XSD or click Generate XSD to create an XSD from the XML file.

    So, what are you waiting for,  Click the Generate XSD button to let the XML Source Editor generate the XSD schema for us.  Real easy, right?

    Remember where you save the file, and when it’s generated, select the .xsd file in the XSD location textbox.  As you can see, the OK button will become available.  But don’t click it just yet.

    Here’s what the XML Source Editor now looks like:

    XML Source Editor with an XML and XSD file specified

    Let’s now move on to the second page of the XML Source Editor, called Columns.  When you open it, you’ll receive the following popup with a couple of warnings:

    Warning gets displayed when opening the Columns page

    The editor is letting us know that the columns that are being generated do not have a maximum length specified.  So it’s setting them to Unicode (DT_WSTR) with a length of 255.  Click the OK button to get rid of that message and to be able to see the generated columns.

    Note: if your data elements or attributes may contain longer strings then you should have a look at modifying the length specification.  This can be done through the Advanced Editor, which is opened by right-clicking the XML Source.  The Input and Output Properties page is the one you’re after.

    The Columns page, showing the columns that the XML Source generated for us

    As you can see, our only attribute – RGB, is nicely put in a column with the same name.  The value of each <color> node however is not put in a column called Color.  By default, this value is put into a column called “text”.  Which is a weird name for a column in an SSIS data flow if you ask me.  The good thing is that you can just rename it by changing the Output Column value.

    Let’s test this out.  My favorite way is to add a Multicast component to the Data Flow, then add a Data Viewer on the connector (right-click the green arrow, select Data Viewers, click Add > OK > OK).  Now execute the package to get this result:

    Testing the output of the XML Source through the Data Viewer

    Mission accomplished, we’ve retrieved data from a very basic XML file!

    Adding Some Complexity

    Let’s move on to the second example of this article.  The difference with the previous example is that now we’ve got multiple nested structures to deal with.

    The example represents a list of book reviews, including some details on the books themselves.  A book can have multiple writers and obviously multiple reviews as well.

    <books>
        <book pages="300">
            <title>Microsoft SQL Server 2008 R2 Master Data Services</title>
            <category>Information Technology</category>
            <authors>
                <author>Jeremy Kashel</author>
                <author>Tim Kent</author>
                <author>Martyn Bullerwell</author>
            </authors>
            <reviews>
                <review>If you're looking for an excellent book on the new Master Data Services component of SQL Server 2008 R2, definitely check this one out!  To be released in June 2011 by Packt Publishing!</review>
            </reviews>
        </book>
        <book pages="832">
            <title>Inside Microsoft SQL Server 2008: T-SQL Querying</title>
            <category>Information Technology</category>
            <authors>
          <author>Itzik Ben-gan</author>
          <author>Lubor Kollar</author>
          <author>Dejan Sarka</author>
          <author>Steve Kass</author>
            </authors>
            <reviews>
                <review>Every "Inside SQL Server" book can be recommended, especially when written by Itzik!</review>
            </reviews>
        </book>
        <book pages="1137">
            <title>The Lord of the Rings</title>
            <category>Fantasy</category>
            <authors>
                <author>J.R.R. Tolkien</author>
            </authors>
            <reviews>
                <review>Like fantasy?  What are you waiting for then?  It's a classic!</review>
                <review>If you liked the movie, you'll love the book.</review>
            </reviews>
        </book>
    </books>

    Configure an XML Source so that it uses the books.xml file, generate the XSD and specify its location.  I’m not going into details on that, the procedure is the same as in our first example above.

    Now open up the Columns page to have a closer look at how the XML data is going to get imported.

    XML Source generates multiple=

    So how does the XML Source component deal with the multiple nested structures?  It generates multiple outputs!  If you select another output from that dropdown, you get to see its fields.

    To get a clear understanding of what exactly is going on, let’s connect each output with an OLE DB Destination component.  The target table can be generated based on the incoming fields by clicking the New button.  Replace the table name in the generated CREATE TABLE script with a clear one that fulfills your naming convention requirements – such as NO SPACES IN A TABLE NAME for instance – and hit the OK button.

    Destinatio table can be generated by using the New button in the OLE DB Destination Editor

    Now that the table is created, it will be automatically selected in the Name of the table or the view dropdown.  Don’t forget to visit the Mappings page so that the, well, mappings are created.  If no field names were modified in the CREATE TABLE script then all fields should be mapped automatically based on their names.

    With all five destinations added, execute the package.

    Each XML Source output is connected to an OLE DB Destination - executes fine

    So now we’ve loaded the data from our XML file into a database, but the data is spread over five tables.  How do we retrieve that data?  Join them together!

    select * from XML_book
    inner join XML_authors on XML_authors.book_Id = XML_book.book_Id
    inner join XML_author on XML_author.authors_Id = XML_authors.authors_Id
    inner join XML_reviews on XML_reviews.book_Id = XML_book.book_Id
    inner join XML_review on XML_review.reviews_Id = XML_reviews.reviews_Id;

    And the result looks like this:

    XML data imported into the SQL Server database

    Conclusion

    We have managed to flatten the data from an XML file containing multiple nested repeating nodes, nice huh?  But do you also feel the limitation using this method?  In terms of modern XML, this was still a fairly easy XML file and yet we already needed five tables to store the data.  Can you imagine what this will give with a really complex file?

    Watch this blog for the follow-up article where I will try to import data from a really complex XML file!

    Have fun!

    Valentino.

    References

    XML Source component

    Share

  6. Where The Sheets Have A Name

    Did you know that as of SQL Server Reporting Services 2008 R2 you can give the worksheets a customized name when exporting your report to Excel?  If you didn’t, or you did but never took the time to find out how you’d implement that, I’ll show you here and now!

    For this example I’ll be starting off from the report created in my earlier post on Cascading Calculated Fields.

    The result can be downloaded from my Skydrive through this link.

    The Scenario

    As you may recall, our report shows a list of all our company’s products.  When the report gets exported to Excel, each product category should get its own sheet.  So all products from the Audio category should be located in a sheet called “Audio”, all Games and Toys in a sheet called “Games and Toys”, and so on.

    The Report

    Starting Position

    Let’s first have a quick look what the export to Excel currently looks like, without any modifications to the report.

    Default export to Excel - all data in one sheet

    All records are being exported to just one sheet.  And, by default, the name of the sheet is the name of the report.  (I made a copy of my existing report and called it NamingExcelSheets.rdl.)

    In case you want to change the default name of the sheet, it’s possible.  On the report itself, there’s a property called InitialPageName.

    Use the InitialPageName property on the report to change the default name of the Excel sheet

    Fill in a value and here’s the result in Excel:

    The Excel sheet with its default name changed to a very unique name

    Adding The Category Group

    To be able to get the different categories into different sheets, we need to add a group on Category to the tablix in the report.

    With the tablix selected, right-click the Details line in the Row Groups pane and select Add Group > Parent Group.  Select ProductCategoryName as field to group by and activate the Add group header checkbox.

    Add group on Product Category

    Remove the group column that gets added automatically and move the header cells from the main header to the group header.  Delete the main header row so that you end up with something like this:

    Tablix with group on Product Category added

    Open up the Group Properties by double-clicking the new ProductCategoryName item in the Row Groups pane.  Select the Page Breaks page and activate the Between each instance of a group checkbox.  Doing this ensures that each group gets its own page in the report, and its own sheet in Excel.

    Adding page breaks between the instances of a group

    Let’s render the report and export to Excel to have a look at the effect of adding these page breaks.

    Report exported to Excel with each group in a separate sheet

    Indeed, every group has gotten its own worksheet.  However, they’ve also gotten the very original names such as Sheet1, Sheet2 and so on.

    Customizing The Names Of The Sheets

    On to the final part of the requirements: giving our own customized name to the generated Excel sheets.  This is actually really easy once you know how to do this.

    First select the ProductCategoryName group in the Row Groups pane so that its properties are displayed in the Properties pane.  In the Properties pane, locate the Group > PageName property and specify the following expression:

    =Fields!ProductCategoryName.Value

    That’s it, that’s all you need to do!  Don’t believe me?  Here’s what the export to Excel now looks like:

    Data exported to Excel, with customized sheetnames

    Conclusion

    As we’ve seen in this article, it really doesn’t take too much effort to implement a custom name for the worksheets when exporting a report to Excel.  Neat feature!

    Have fun!

    Valentino.

    References

    Understanding Pagination in Reporting Services (Report Builder 3.0 and SSRS)

    Share

  7. Cascading Calculated Fields (SSRS)

    When thinking about Reporting Services in combination with the word cascading, the first that jumps to mind is cascading parameters.  We all know that one parameter can have its list of values filtered by what’s selected in another parameter.

    But what about calculated fields?  Are those cascading as well?  Can we refer to a calculated field in the definition of another calculated field?  Let’s find out!

    For the example I’ll be using the ContosoDW sample data warehouse running on SQL Server 2008 R2, more precisely:

    Microsoft SQL Server 2008 R2 (RTM) – 10.50.1600.1 (X64)   Apr  2 2010 15:48:46   Copyright (c) Microsoft Corporation  Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7600: )

    The Example

    Scenario

    We’ve been asked to build a report that produces a product catalogue.  The report only has one requirement: as our company is well-known for its branding, each product category has got its own color and this color should be used as background color in the report.

    Report

    Let’s first get some data.  Here’s a fairly simple query that retrieves all products with their related category and subcategory from the ContosoDW database:

    select DPC.ProductCategoryName, DPS.ProductSubcategoryName, DP.ProductName
    from dbo.DimProduct DP
    inner join dbo.DimProductSubcategory DPS
        on DPS.ProductSubcategoryKey = DP.ProductSubcategoryKey
    inner join dbo.DimProductCategory DPC
        on DPC.ProductCategoryKey = DPS.ProductCategoryKey;

    After building a dataset in a new report I end up with this:

    The Report Data pane before adding any calculated fields

    Calculated Field Number One

    Because this post is about calculated fields, we’re now going to apply a little dirty trick of hard-coding the category names and their corresponding color into a calculated field.  I do not recommend this for professional reports where the colors should be coming from the database so that your reports are not impacted when extra categories are added or when the marketing department decides to change their vision.

    But for this example it’s perfect so let’s create a calculated field in the dataset.  That can be done by right-clicking the dataset and then selecting Add Calculated Field…

    Right-click the dataset to add a calculated field

    Give the field a clear name, such as ProductCategoryColor, and click the fx button to enter the following expression:

    =Switch(
        Fields!ProductCategoryName.Value = "Audio", "#FFD800",
        Fields!ProductCategoryName.Value = "Cameras and camcorders ", "#FF0000",
        Fields!ProductCategoryName.Value = "Cell phones", "#00FF00",
        Fields!ProductCategoryName.Value = "Computers", "#0000FF",
        Fields!ProductCategoryName.Value = "Games and Toys", "#FF00FF",
        Fields!ProductCategoryName.Value = "Home Appliances", "#FFFF00",
        Fields!ProductCategoryName.Value = "Music, Movies and Audio Books", "#00FFFF",
        Fields!ProductCategoryName.Value = "TV and Video", "#ABCD12"
    )

    Funny side note: do you notice that trailing space in the “Cameras and Camcorders” category?  It’s intentional!  Apparently that record has got a trailing space stored in the ProductCategoryName field in DimProductCategory.

    With the first calculated field created, add a table to your report to display the products.  Set the BackgroundColor property of the whole Details row to the newly-created calculated field.

    So far so good, here’s what the rendered report currently looks like:

    Rendered report with background color

    Calculated Field Number Two

    According to the business requirements we’re done creating the report.  However, it’s Friday early afternoon and we feel like having some fun.  And this post is about “cascading” calculated fields, so we need at least two of them.  Let’s create an Easter egg!

    The “fun” requirement is the following: if the product’s name starts with an A then the text color for that record should be the same as the background color, but with the Blue component set to FF.  For example, if the background color is #00FF00 (green) then the text color should become #00FFFF (cyan).

    Let’s create another calculated field in our dataset, called EasterEgg (don’t make it too hard for your colleagues to fix the weirdly behaving report).  Give it the following expression:

    =IIF(Left(Fields!ProductName.Value, 1) = "A",
        Left(Fields!ProductCategoryColor.Value, 5) + "FF",
        "Black")

    As you can see, we’re referring to the ProductCategoryColor field, the calculated field created earlier.

    Now set the Color property of the Details row to this new calculated field and Preview the report.

    Guess what?

    Rendered report with cascading calculated field

    It works!

    Conclusion

    If you’re in a situation where you’d like to add calculated fields to an existing dataset and one of those fields should use the value of another calculated field, you can do it!  Cascading calculated fields are working fine in Reporting Services.

    Have fun!

    Valentino.

    Share

  8. Finding Similar Strings With Fuzzy Logic Functions Built Into MDS

    This post is inspired by a presentation that’s available on the Microsoft TechEd Online website.  It’s called Master Data Management – Merging from Multiple Sources, and is presented by Dejan Sarka, one of the Solid Quality Mentors and writer of several SQL Server-related books.

    Even if you’re not interested in Master Data Services (MDS), the following will be good to know if you need to compare strings with each other for similarity and find the string that’s the closest match to your input string.

    Compare Strings Why?

    You may be wondering in what scenarios you’d be required to compare strings for similarity.  To clarify, I’ll give you an example.  Imagine you’re building a data warehouse (DWH).  This DWH receives data from several different source systems.  In two of those systems, you’ve got a list of customers.  To be able to populate your DimCustomer table and avoid duplicate customers, you need to implement some logic to detect that customer Smith in System A is the same customer Smith in System B.

    That’s when string similarity or fuzzy-logic functions come in handy.

    Compare Strings How?

    When strings are 100% equal, it’s obviously not difficult to find matching strings.  Just use the equals (=) operator in your query and you’ve matched them.  However, when strings are not 100% equal, due to typing errors or whatever cause, things get a little more complicated.  Perhaps customer Smith in System A is called Smyth, Smiht or even Smiths in System B while they are actually one and the same person.  That’s when we need to use additional logic, which we – not being rock star mathematicians ourselves – can hopefully find in built-in system functions.

    The Built-in Soundex() And Difference() Functions

    The standard functionality available in SQL Server to compare strings is fairly limited.  You’ve probably heard of the SOUNDEX() function, maybe even used in somehow already.  This function receives a string as parameter and calculates a four-digit code out of it.  When used on two similar strings, the two strings will produce the same code.  When they are not similar, you get two different codes.  And that’s it.

    Here’s an example:

    select SOUNDEX('Smith'), SOUNDEX('Smiht'), SOUNDEX('Washington')

    The result of that query is:

    Result of SOUNDEX query

    Is it perfect?  No, it’s not.  If you’d give it a value of ‘Smiths’, it would return S532, which is different from S530 even though there’s only one letter of difference between the two strings.

    Next to SOUNDEX() we’ve got the DIFFERENCE() function.  This function accepts two parameters and returns an integer between 0 and 4.  What this function does is it calculates the soundex value for both strings and returns the number of characters of the code that are matching.  In the case of a comparison of ‘Smith’ with ‘Smiths’, it would return 3 because three characters are matching (‘S53’).

    Let’s move on to an alternative solution.

    The Similarity Function In Master Data Services

    The MDS installation procedure goes through several steps to get all the required functionality installed.  One of those steps is the creation of a database.  What’s interesting about this database, even if you’re not interested in MDS or Master Data Management, are the custom functions that it contains.

    One of those functions is called Similarity, located in the mdq schema.  This function allows you to compare two strings with each other through a specified match algorithm.  What’s interesting here is that you can choose between those four different algorithms depending on your data.  In some cases a certain algorithm will be more interesting while in other cases the best algorithm will be another one.

    The value returned by the Similarity function is a float between zero and one, which makes it more precise than the soundex option.

    So how do you use the function?  Let’s have a look at its definition:

    ALTER FUNCTION [mdq].[Similarity](@input1 [nvarchar](4000), @input2 [nvarchar](4000),
    
      @method [tinyint], @containmentBias [float], @minScoreHint [float])
    RETURNS [float] WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT
    AS
    EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].
    
       [Microsoft.MasterDataServices.DataQuality.SqlClr].[Similarity]

    As you can see, this is not a standard T-SQL function but it’s been implemented in .NET through CLR Integration.  The function expects five parameters.  The first two parameters, @input1 and @input2, are the two strings that need to get compared.  The third parameter specifies the match algorithm that should be used.

    Here are the four algorithms as supported by the Similarity function:

    Value for @method Algorithm
    0 The Levenshtein edit distance algorithm
    1 The Jaccard similarity coefficient algorithm
    2 A form of the Jaro-Winkler distance algorithm
    3 Longest common subsequence algorithm

    The fourth parameter, @containmentBias, specifies how exact the fuzzy index should be when comparing strings of different lengths.  Values go from 0.0 to 1.0 with the lower number being the more precise one.  This only applies to the Jaccard and longest common subsequence algorithms.  The default is 0.85.

    The fifth parameter, @minScoreHint, influences the calculated scores returned by the Similarity function.  Valid values go from 0.0 to 1.0.  When a value greater than 0 is passed, any calculated score under that value will result in zero.

    Note: according to the Books Online, this fifth parameter is optional.  But it’s not.

    Note: also according to the Books Online, a value of 4 for @method would also be accepted.   In that case the function will use a date comparison algorithm, thus the two first parameters should be either DateTime values or valid dates that are strings specified in the format yyyy-mm-dd.  However, when testing this out I noticed that this is not working.  Further exploration of the MDS database led me to another function called SimilarityDate, also located in the mdq schema:

    ALTER FUNCTION [mdq].[SimilarityDate](@date1 [datetime], @date2 [datetime])
    RETURNS [float] WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT
    AS
    EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].
    
      [Microsoft.MasterDataServices.DataQuality.SqlClr].[SimilarityDate]

    From the looks of it, this function implements the functionality as explained in the BOL for @method = 4.  And this one is actually working!

    So how do you find out which of the four algorithms is the most interesting one in your situation?  You’ll have to try it out.  Take a sample data set and run the four algorithms on the data.  As I don’t have any real-world data to use here (it wouldn’t be legal anyway), I’ll demonstrate this using some data from the ContosoDW database.

    The following query uses the Customer dimension and combines that with a very small set of “sample” data that imitate some real-world problems like typos.

    with DistinctLastname as
    (
        select distinct LastName
        from DimCustomer
    ),
    NewData as
    (
        select 'Wahsington' as LastName2 --typo
        union select 'Wqshington' --QWERTY/AZERTY mixup
        union select 'Zqtson' --QWERTY/AZERTY mixup x2
    )
    select DistinctLastname.LastName, NewData.LastName2
    into #SampleData
    from DistinctLastname
    cross join NewData
    where LastName is not null;

    I’m only using the LastName column here.  In a real situation you’d probably want to combine that with FirstName and also some address-related data such as city and street.

    Up next is letting the algorithms loose on the data set:

    select LastName, LastName2,
        MDS.mdq.Similarity(LastName, LastName2, 0, 0.85, 0) as Levenshtein,
        MDS.mdq.Similarity(LastName, LastName2, 1, 0.85, 0) as Jaccard,
        MDS.mdq.Similarity(LastName, LastName2, 2, 0.85, 0) as JaroWinkler,
        MDS.mdq.Similarity(LastName, LastName2, 3, 0.85, 0) as LongestCommonSubsequence
    from #SampleData;

    Copy the results of that query to Excel for further, and easier, analysis.  You can easily sort your data in Excel, so that the highest calculated scores of a certain algorithm are located on top.

    Let’s first start by sorting on the Levenshtein results:

    Results ordered on Levenshtein value

    As you can see, the three values that we were looking for are located on top.  That’s a good sign!  Furthermore, the Washington values are quite high.  So based on my sample data this is possibly a good algorithm.

    How about the Jaccard results?

    Results for the Jaccard algorithm

    You can clearly see that the maximum values for the Jaccard algorithm are significantly lower than those of the other algorithms.  Furthermore, the correct value for Watson is scoring lower than Son.  Assuming our logic would select the best-scoring values when searching for “Watson”, it would select the incorrect value of Son.

    All this indicates that the Jaccard algorithm is not the best-suited one for our situation.

    So, what about Jaro-Winkler?

    Results for the Jaro-Winkler algorithm

    In this case we’ve got even higher maximum values compared to Levenshtein and the two values for Washington are located on top.  So far so good.  The correct value for Watson is located at position 7.  But as you can see, this is the first match for Zqtson, which means that the correct value would get selected by our matching logic.  Based on these numbers I would say that so far this is the best algorithm for the situation.

    One more to go: Longest Common Subsequence.

    Results for the Longest Common Subsequence algorithm

    Again the three correct values are located on top, just like the Levenshtein algorithm.  In fact, the calculated scores are very similar to the Levenshtein algorithm.  Quite logical: both calculations are using similar algorithms.

    Conclusion

    Based on the results above and using this very limited sample data set, I would select the Jaro-Winkler algorithm as being the most suitable for our situation.  But I do have to mention that you should really use larger data sets to be sure.

    Also, even though we can rely on a fuzzy-logic algorithm to find the correct match, the selected matches should be verified and approved manually.  Of course, all that can be part of a Master Data Management process.

    Note that for this post I only looked at fuzzy matching possibilities using just T-SQL.  In Integration Services there are a couple of components, such as the Fuzzy Lookup data flow component, that offer similar functionality.  If you’re dealing with ETL flows in SSIS, be sure to check that one out as well!

    Have fun!

    Valentino.

    References

    Microsoft Contoso BI Demo Dataset for Retail Industry

    Fuzzy String Searching

    Soundex algorithm

    Levenshtein Distance

    Jaccard Index

    Jaro-Winkler Distance

    Share

  9. Hide/Show Items Dependant On Export Format (SSRS)

    Now and then I encounter forum questions in the style of the following:

    I have a report with a title.  When rendered through the Report Manager and when exported to PDF, I want it to render as normal.  However, when exported to Excel I do not want to get the title.  How can I hide it?

    Because I don’t like re-inventing the wheel each time I decided to write a blog post about it.

    As of SQL Server 2008 R2, we’ve got a built-in global field that can help us out.  This field is called Globals!RenderFormat.  It has two properties: Name and IsInteractive.  Name represents the unique name that indicates the chosen renderer, and IsInteractive indicates whether or not the chosen report format is, well, interactive.

    Depending on the renderer, the values of the properties differ.  To be able to use the variable in an expression, we need to know its values for each rendering format.  Here’s the list of different possibilities:

    Renderer RenderFormat.Name RenderFormat.IsInteractive
    Preview in BIDS or rendered through Report Manager RPL True
    XML file with report data XML False
    CSV (comma delimited) CSV False
    TIFF file or Print button IMAGE False
    PDF PDF False
    MHTML (web archive) MHTML True
    Excel EXCEL False
    Word WORD False

    If these names for RenderFormat look familiar to you, you’re probably right.  Have a look at the rsreportserver.config file in the C:\Program Files\Microsoft SQL Server\MSRS10_50.SQL2008R2\Reporting Services\ReportServer folder.  Note that you may need to adapt the folder to your specific settings.  In my case my instance is called “SQL2008R2”.  Near the bottom of that configuration file you can find the <Render> node, located under <Extensions>.  The names that you see there are those used by the RenderFormat.Name property.

    Now that we know what values to test on, let’s get started.

    If we get back to the example of hiding a title, or textbox, when exporting to Excel, here’s what needs to happen.  Locate the Hidden property of the textbox that you want to hide, and give it the following expression:

    =IIF(Globals!RenderFormat.Name = "EXCEL", True, False)

    What we’re saying here is: if the RenderFormat is EXCEL, then the Hidden property should be set to True.  Which results in a hidden textbox whenever the report is exported to Excel!

    As Erik pointed out in the comments, in this particular case you don’t need the IIF() statement.  The result of the expression results in True when the expected value should be True, and False when False is expected.

    As a quick note: when building your expression through the expression builder, you’ll notice that the Intellisense doesn’t know the new RenderFormat field yet.  Do not worry about that, just continue typing and ignore any errors being indicated.  If you use the syntax as I highlighted above, it will work!  Well, unless you’re running an earlier version than SQL Server 2008 R2 of course.  In that case it won’t work.

    Intellisense doesn't know RenderFormat yet

    In contradiction to the Intellisense, the bottom part of the expression builder screen has been updated to show the new properties.  So if you don’t remember the syntax, you can just locate the field in the Built-in Fields category and give it a good double-click.

    RenderFormat is located in the Built-in Fields category

    Of course, the Excel example in this post is just one of many possibilities that this new field offers.  Is your company environment-friendly and does it want to prevent wasting paper?  Now it’s possible, just hide that 50-pages long table when the report is being rendered for print!

    Have fun!

    Valentino.

    References

    Globals!RenderFormat aka Renderer Dependent Report Layout by Robert Bruckner

    Share

  10. Calculating LastXMonths Aggregations Using T-SQL and SSIS

    With the holidays I haven’t been able to write much.  So I’ll make up for it with this +3000 words article.  If you’re reading this early in the morning, you’d better get a double espresso first ;-)

    In this article I will demonstrate a method that can be used to calculate aggregations over a certain period of time in the past, or LastXMonths aggregations as I’m calling them throughout the article.  I’ll be using T-SQL, SQL Server Integration Services and a relational database as source.  More specifically I will be using the Merge Join data transformation in SSIS, and Common Table Expressions in T-SQL.

    Version-wise I’m using SQL Server 2008 R2, but this method should work as of SQL Server 2005.  Furthermore I’m using the Contoso DWH, available for download at the Microsoft Download Center.  (In case you’re wondering, it’s the .BAK file.)

    You can download the finished SSIS package from my Skydrive.  (The file is called MergeJoin.dtsx.)

    The Scenario

    Let’s say we’ve got a relational database containing some sales figures.  Management has asked for sales-related data to be available somewhere for easy analysis.  Ideally a cube would be built for that purpose but as budgets are currently tight, a temporary solution needs to be provided meanwhile.  So it’s been decided that an additional table will be created, populated with the exact data as required by management.  This table should contain all details (number of items and amount of the sale) about products sold, grouped by the date of the sale, the zip code of the place where the sale occurred and the category of the product.

    Furthermore, each record should contain the sum of all sales of the last month for the zip code and product category of each particular record.  Two additional aggregations should calculate the sales for the last three months and last six months.

    A Simple Example

    To make sure we’re all on the same track on the requirements, here’s a small example to illustrate the expected outcome.

    Small example displaying the expected outcome of the process

    I’ve omitted the SalesAmount numbers for readability reasons.  The records are ordered chronologically, with the oldest first.  As you can see, the bottom record shows 16 as value for Last6MSalesQuantity.  This is the result of the SalesQuantity of the current record and the SalesQuantity of the previous record, which happens to fall within the timespan of the lowest record’s SaleDate going back six months.  The two other records do not fall within the six months timespan and are thus not included in the sum for the Last6MSalesQuantity of that bottom record.

    Fetching The Data Into A Table

    Our scenario requires that the sales figures are calculated and put into a new table.  Let’s first start with creating the queries to fetch the data.

    Step 1: The Daily Numbers

    The easiest part are the daily sales numbers.  These can be retrieved fairly easy from the Contoso data warehouse, just by using a GROUP BY clause as shown in the following query.

    --daily sales
    select DD.Datekey, DS.ZipCode, DPC.ProductCategoryName,
        SUM(FS.SalesAmount) SalesAmount_SUM,
        SUM(FS.SalesQuantity) SalesQuantity_SUM
    from dbo.FactSales FS
        inner join dbo.DimStore DS on DS.StoreKey = FS.StoreKey
        inner join dbo.DimProduct DP on DP.ProductKey = FS.ProductKey
        inner join dbo.DimProductSubcategory DPS
            on DPS.ProductSubcategoryKey = DP.ProductSubcategoryKey
        inner join dbo.DimProductCategory DPC
            on DPC.ProductCategoryKey = DPS.ProductSubcategoryKey
        inner join dbo.DimDate DD on DD.Datekey = FS.DateKey
    group by DD.Datekey, DS.ZipCode, DPC.ProductCategoryName
    order by DD.Datekey asc, DS.ZipCode asc, DPC.ProductCategoryName asc;
    

    Part of the result of that query looks like this:

    Result of the daily sales query

    Nothing special to mention so far so let’s continue to the next step.

    Step 2: The Monthly Numbers

    In this step, we’ll use the query from step 1 as base for the full query.  I’ll first show you the query and then provide you with some explanation of what’s going on.

    --LastMonth
    declare @numberOfMonths tinyint = 1;
    with DailySalesData as
    (
        select DD.Datekey, DS.ZipCode, DPC.ProductCategoryName,
            SUM(FS.SalesAmount) SalesAmount_SUM,
            SUM(FS.SalesQuantity) SalesQuantity_SUM
        from dbo.FactSales FS
            inner join dbo.DimStore DS on DS.StoreKey = FS.StoreKey
            inner join dbo.DimProduct DP on DP.ProductKey = FS.ProductKey
            inner join dbo.DimProductSubcategory DPS
                on DPS.ProductSubcategoryKey = DP.ProductSubcategoryKey
            inner join dbo.DimProductCategory DPC
                on DPC.ProductCategoryKey = DPS.ProductSubcategoryKey
            inner join dbo.DimDate DD on DD.Datekey = FS.DateKey
        group by DD.Datekey, DS.ZipCode, DPC.ProductCategoryName
    ),
    UniqueRecordsPerDay as
    (
        select Datekey, ZipCode, ProductCategoryName
        from DailySalesData
        group by Datekey, ZipCode, ProductCategoryName
    )
    select UR.Datekey, DSD.ZipCode, DSD.ProductCategoryName,
        SUM(DSD.SalesAmount_SUM) SalesAmount_SUM,
        SUM(DSD.SalesQuantity_SUM) SalesQuantity_SUM
    from DailySalesData DSD
        inner join UniqueRecordsPerDay UR
                on UR.ProductCategoryName = DSD.ProductCategoryName
            and UR.ZipCode = DSD.ZipCode
            and DSD.Datekey
                between DATEADD(month, -@numberOfMonths, UR.Datekey + 1)
                and UR.Datekey
    group by UR.Datekey, DSD.ZipCode, DSD.ProductCategoryName
    order by UR.Datekey asc, DSD.ZipCode asc, DSD.ProductCategoryName asc;

    The query uses a variable called @numberOfMonths.  This will allow us to use the same query for the totals of last month, as well as for the Last3M and the Last6M numbers.  All that’s needed is changing the variable to 3 or 6.

    But how does the query get to the results?  To start, it uses two CTEs (Common Table Expressions).  The first one is called DailySalesData.  And the query for that CTE should look familiar to you by now: it’s the one from step 1, without the ORDER BY clause.

    The second CTE is called UniqueRecordsPerDay and gives us one record for each unique date, zip code and product category as found in the Contoso data.  The DateKey, ZipCode and ProductCategoryName fields are our key grouping fields.  And this CTE is actually the key to calculating the monthly aggregated data, as I’ll explain next.

    What the main query does is the following.  It selects the data from the DailySalesData CTE and joins that with the unique records per day recordset.  All grouping key fields need to be included in the join.  However, as you can see, to add the DateKey into the join I’m not just using the equals operator but the BETWEEN keyword instead.  I’ve also used the DATEADD function to subtract the number of months as specified through the @numberOfMonths variable.  That statement is saying: “give me all records starting from DateKey, going back @numberOfMonths”.  The query again groups by the key fields to be able to sum the records up.

    This construction ensures that the SalesAmount_SUM and SalesQuantity_SUM fields represent the sum for the record’s zip code and product category and for the period as indicated by the @numberOfMonths variable.

    Step 3: Merging It All Together Into One Table

    Now that we know how to retrieve the data, we still need to get it into a table.  One option would be to use the INSERT statement on the daily records, followed by UPDATE statements to populate the monthly (1, 3, 6) aggregated columns.  However, I’m a BI guy so let’s use an SSIS package to get to the result (plus it allows me to illustrate the Merge Join data flow transformation :-) ).

    So open up the BIDS and create a new package.  Drop a Data Flow Task into the Control Flow and add a Connection Manager connecting to your Contoso DWH.  Then switch to the Data Flow page.

    Nothing special so far I believe.  Next we need to set up four Data Flow Sources: one for the daily figures, one for the monthly, one for the 3M and one for the 6M data.

    Setting Up The Data Sources

    Throw in an OLE DB Source component, configure it to use your connection manager and copy/paste the first query above into the command textbox.  Again nothing special, right?

    However, the Merge Join component expects its incoming data to be sorted.  That’s why I’ve included the ORDER BY clause in the queries above.  But that’s not all.  Connecting our data source to a Merge Join transformation without any additional change will result in an error such as the following:

    Validation error. Data Flow Task Merge Join [457]: The input is not sorted. The “input “Merge Join Left Input” (458)” must be sorted.

    To avoid this error, we need to explicitly inform our data flow that the data is actually ordered, and we need to give it all the details: on what fields has the data been ordered and in what order!  And that needs to be done through the Advanced Editor.

    So, right-click the OLE DB Source and select Show Advanced Editor.

    Right-click OLE DB Source to open up the Advanced Editor

    In the Advanced Editor, navigate to the last tab called Input and Output Properties and select the “OLE DB Source Output” node in the tree structure on the left.  Doing that will show the properties for the selected output and one of those properties is called IsSorted.  By default it is set to False.  Set it to True.

    Tip: double-clicking the label of the property will swap its value to the other value.  This can be useful in cases when you need to change several options but even here is saves a couple of clicks.  It’s all about optimization. :-)

    Advanced Editor on OLE DB Source: the IsSorted property

    At this moment the component knows that the incoming data is sorted, but it still doesn’t know on what fields.  To specify that, open up the OLE DB Source Output node, followed by the Output Columns node.  You’ll now see the list of fields.  As specified in the query, the data is ordered firstly on DateKey, secondly on ZipCode and thirdly on ProductCategoryName.

    Select DateKey to see its properties.

    Advanced Editor of OLE DB Source showing the SortKeyPosition property

    The property in which we’re interested here is called SortKeyPosition.  By default it is set to zero.  When the incoming data is sorted,  this property should reflect in what order the data is sorted, starting with one for the first field.  So in our case here the value should be set to 1.

    Set the SortKeyPosition property for ZipCode to 2 and for ProductCategoryName to 3.

    That’s one of the four OLE DB sources set up.  The other three will be easier as we can start from the first one.  So, copy and paste the source component, open it up by double-clicking it and replace the query with our second query from earlier, the one returning the monthly figures.  Ow, and give it a decent name but I’m sure you knew that.

    Create the third source component in the same way, but change the value for the @numberOfMonths variable to 3.  And again the same process for source number four, changing the variable’s value to 6.

    Here’s what we have so far:

    Four OLE DB sources set up - waiting to be merged

    Merging The Sources Into One Flow

    Next up is merging the incoming flows.  Drag a Merge Join data flow transformation under the Daily Sales source and connect the source to the Merge Join.  That will open the following Input Output Selection screen.

    Input Output Selection window

    A Merge Join expects two inputs: one is called the Left Input and the other is called the Right Input.  Select Merge Join Left Input as value for the Input dropdown.

    Close the popup window and connect the second source (with the monthly data) as well to the Merge Join.  There’s only one input remaining so this one is automatically the right input – no popup window is shown.

    Next we need to configure the Merge Join so that it merges the data as expected.  Open the Merge Join Transformation Editor by double-clicking the component.

    Merge Join Transformation Editor

    By default the Join type dropdown is set to Inner join.  In our situation that’s good enough.  In the case that only one record exists for a certain zip code and product category on a given day, the monthly data for this record will be the sum of just that one record but in any case: there’s always at least one record for each incoming flow to be combined with each other.

    As you can see, because both incoming flows are ordered in the same way, it automatically knows on which fields to put the join.

    By default, no output fields are created as the white bottom half of the screenshot indicates.

    Now I’ll show you a screenshot of the expected setup:

    Merge Join Transformation Editor set up as expected

    There are several ways to specify the output fields.  The first method is by using the dropdown in the Input column.  Selecting a value there will populate a dropdown in the column called Input Column (djeez, that was one column too much).  Here’s what that method looks like:

    Specifying the output fields by using the dropdowns

    Selecting a value in the second column will then give you a default value for the Output Alias.  This default can be freely modified.  As you may have guessed, this is not my preferred method – way too many comboboxes.

    Another method of specifying the output fields is by using the checkboxes in front of the fields in the top part of the window.  I believe the larger screenshot above says it all.  Just check the fields that you need and then change their default Output Alias to whatever suits you.   In my example here I only needed to modify the alias for the last two records.

    With our first Merge Join set up, only two are remaining.  So drag in a second Merge Join from the Toolbox, connect the output of the first join as Left Input on the second join and add the output of the third OLE DB source as Right Input.

    Interesting to note here is that the output of the Merge Join is sorted in the same manner as its inputs.  One way of verifying this is by right-clicking the connector between the two joins and choosing Edit.

    Right-click data flow connector and select Edit to open up Data Flow Path Editor

    That opens up the Data Flow Path Editor.

    Tip: double-clicking the connector will also open the editor!

    Examine the Metadata of the Data Flow Path to verify the sort order

    As you can see in the above screenshot, the metadata page shows a list of the available fields with some properties, such as the Sort Key Position.  Now if that doesn’t look familiar?! :-)

    So far, the second Merge Join has been added and connected but it hasn’t been configured yet.  The process is very similar to the way we’ve set up the first join.  Just select all fields from the left input by checking all the checkboxes and select the two SUM fields from the right input.  Don’t forget to give those SUM fields a clear name.

    Two joins done, one remaining.  Just drag one in and connect it with the second join plus the last remaining OLE DB source.  I won’t go into further details here, it’s exactly the same as I just explained for the second join.

    Here’s what the Data Flow should look like:

    The Data Flow with all the Merge Joins connected

    And here’s what the third Merge Join should look like:The third Merge Join as set up for the example

    An Error That You May Encounter

    When using sorted data flows and the Merge Join component, you may encounter the following error message:

    An error that you may encounter while using the Merge Join component

    And now in words for the search engines:

    The component has detected potential metadata corruption during validation.

    Error at Data Flow Task [SSIS.Pipeline]: The IsSorted property of output “Merge Join Output” (91) is set to TRUE, but the absolute values of the non-zero output column SortKeyPositions do not form a monotonically increasing sequence, starting at one.

    Yeah right, you had to read that twice, didn’t you?  And the best is yet to come:

    Due to limitations of the Advanced Editor dialog box, this component cannot be edited using this dialog box.

    So there’s a problem with your Merge Join but you cannot use the Advanced Editor to fix it, hmm, and you call that the ADVANCED editor?  Is there anything more advanced perhaps?  Well, actually, there is.  It’s called the Properties pane.  With the Merge Join selected, one of the properties there is called NumKeyColumns.  That property reflects on how many columns the incoming data is sorted.  And currently it contains the wrong value.  Changing its value to the correct number of columns will remove the error.

    Properties pane displaying the Merge Join's properties, including NumKeyColumns

    In case you’re wondering when you might encounter this particular problem, here’s how you can simulate it.  (Don’t forget to make a copy of the package before messing around with it.)

    With the package as it currently is, remove the ZipCode field from the first two sources by unchecking it in the Columns page of the OLE DB Source Editor.

    The sources are now complaining so open up their Advanced Editor and correct the SortKeyPosition of the ProductCategoryName field: it should become 2 instead of 3 because ZipCode was 2 and has been removed.

    Now try to open the first Merge Join.  The first time it will complain about invalid references so delete those.  With the references deleted, if you now try to open the Merge Join editor, you’ll see the error we’re discussing here.  To fix it, change the NumKeyColumns property of the Merge Join to 2 instead of 3.

    Adding The Destination Table

    Now there’s only one step remaining: adding a destination for our merged data.  So, throw in an OLE DB Destination and connect it with the output of the last Merge Join:

    An OLE DB Destination connected to the join that merges it all together

    I’ll just use a quick and dirty way of creating a new table in the database.  Open up the OLE DB Destination Editor by double-clicking it and select a Connection Manager in the dropdown.  Now click the New button next to the Name of the table or the view dropdown.

    That opens up the Create Table window, with a CREATE TABLE query pre-generated for you for free.  Isn’t that nice?  Change the name of the table to something nice (at least remove those spaces, yuk!!) and click OK.

    The Create Table window

    The new table is created at the moment that the OK button gets clicked.

    Right, so are we there?  Well, almost.  As you can see now in the next screenshot, the BIDS does not want us to click the OK button just yet.

    The OLE DB Destination Editor with the Mappings still missing

    To resolve that warning, just open the Mappings page.  As the names of the input columns are matching exactly with the names of the fields in the destination table, everything will be automagically configured at this moment.  So now you can close the window with the OK button.

    And that’s it!  Everything is set up to populate the new table with the aggregated figures, as requested by management.  To give it a run, right-click your package in the Solution Explorer and guess what… select Execute Package!  If everything has been configured as expected, you should get some green boxes soon.  And some data in the table, like this:

    The final result: sales figures aggregated over different periods in time

    Conclusion

    In this article I’ve demonstrated a way to aggregate data over different periods in time, using T-SQL and Integration Services.  Obviously this method does not replace the flexibility that one gets when analyzing data stored in an OLAP cube, but it can be a practical method when you quickly need to provide aggregated data for management.

    Have fun!

    Valentino.

    References

    Merge Join Data Flow Transformation

    Common Table Expressions (CTEs)

    DATEADD() function

  1. 1
  2. Next ›
  3. Last »