DBPedias

Your Database Knowledge Community

Josef Richberg

  1. Easily view errors from Script Components

    I've been doing quite a bit of work for my upcoming SQL University class next and in it is a Script Component.  These are the ones that live inside of a Data Flow task, which means: "No Breakpoints!".  What a pain that can be.  I had red boxes being tossed up left and right.  I knew I was getting errors, just not where.  Then it dawned on me: send the output down the data flow!  It was brilliantly simple.  In my example I use the FileName column for this.  I was debugging so I used whatever column was handy.  I set the column equal to the output of the exception te.ToString().

    Then to see the output, I simply connect a Data Viewer.

    No need to build elaborate mechanisms like pop-up boxes.

  2. SSIS OLEDB Command and procedure output params

    While designing a package last week with our consultant, Latha Chandri (twitter), I came across a need to call a stored procedure drom within a Data Flow Task.  I have never needed to do this before and while the call is simple there are a few quirks that I found and would like to share with you.

    I've created a sample procedure with a single input parameter (@incoming) and a single output parameter (@outgoing).

    In the OLEDB Command, I write the execute statement.

    I need to go to the Column Mapping tab to map data flow columns to the procedure parameters and I hit the first snag.

    While the component does know there are parameters, it makes no assumption as to how you are going to access or organize them, so it complains.  I can satisfy this complaint by marking each of the parameters with a ?, just like you do in just about any other SSIS component that takes parameters.  Here is the new sql command.

    Now, when I click over to the Column Mappings tab, I see this:

    I actually mapped the columns, but the cool thing is the component picked up the parameter names from the procedure!  I even get a freebie: @RETURN_VALUE.  You need to map that to a column in order to pass that down stream.  I'm getting ahead of myself, I explain more a little later in the blog.  As of this step, you are done.  This component will call the procedure and pass in the parameters.  

    "But wait!", you say. "If I look at the parameters at the top of the blog post, I see an input parameter and an OUTPUT parameter.  How do I get the output from the procedure as well as the return value?".  Good question.  The answer is counter intuitive.  

    The proc_output is defined in the mapping tab as an INPUT column, but in this case it is actually an output column.  I defined it as a derived column of the proper datatype prior to this step in the Data Flow Task.

    The proc_output column will contain the value returned by the procedure and I can use that downstream.  This goes for the return value as well.  The parameter results will overwrite those default values.  Ignore the column titles "Input Column' and 'Destination Column' and just think of this in terms of 'local variable'  and 'parameter name'.  Now, I want to show you one more thing before we go.

    The procedure used in this example is rather trivial and many real-world situations call for more complex procedures.  There are two ways in which you can call a stored procedure: 'Pass-by-Position' or 'Pass-by-Parameter'.

    'Pass-by-Parameter' is my favorite method of calling a stored procedure.  You list out the parameter names and pass them values.  I could call our test procedure as such:

    declare @out int

    exec dbo.usp_testoledbcmd @outgoing=@out OUTPUT

     Notice I do not include @incoming.  If it is defined within procedure with a default, I don't need to pass a parameter.  SQL Server will realize I have nothing for it and push in the default defined within the procedure.

    If wanted to do 'Pass-by-Position' I would write the statement like this:

    declare @out int

    exec dbo.usp_testoledbcmd 2,@out OUTPUT

    As you can see, in this case I added a 2 in front of the @out OUTPUT.  That is because when I 'Pass-by-Position' I can't skip one.  SQL Server will map each value in the execute statement to the corresponding variable position.  I don't like this at all and never use it (I don't see a need to).

    To show you how this looks when you try to mix the two methodologies, I did a simple test. 

    I originally tried this: @outgoing=?,?.  I wanted the value of the output parameter, but didn't want to have to type out the other parameter (actually the real procedure had 15 params and I only cared about 5).  Turns out, you can't do that.  You either 'Pass-by-Position' or 'Pass-by-Parameter'.  Here's the proof:

     

    Lesson of the Day: Inputs can be Outputs and it's all or nothing.

  3. Statistics IO parser in Ruby & Python

    Here I am working on tuning a rather large stored procedure; old school.  I have statistics IO and showplan turned on.  I work through the obvious missing indexes and improper joins and now I'm into the 'Let's try this technique', to improve different pieces.  The gauge for me is the  logical IO count in the statistics IO output.  The lower the IOs the better the query.  Since a change in one place might reduce the IO of that section, but might increase the IO of another section, I have to count up all the IOs.   An example of this would be adding/removing an index on say a temp table.  It might improve inserts, but kill you later on down the road during look-ups.  Needless to say I got really tired of counting the logical reads, adding them up and seeing if my change worked.   I had about 50 lines of stats output from the one procedure!  

    Wanting to flex a different portion of my brain, I decided to play around with a few programming languages.  I need a problem to solve in order to really learn a language and this was a perfect problem.  I needed to quickly count up all of the IOs for the run, but more importantly I need it broken out by table.  Having the total counts by table will enable me to quickly pinpoint areas that need the most attention.  The end result are two programs, one written in Ruby and the other in Python.   The command line for each is identical.  I'll go through how to use them, using the Ruby version in the example.

    I run my query in SSMS, with statistics io turned on, and save the output to a file: io_output.txt

     

    Since logical IOs are what I am interested in, I will look for "logical reads".  The command line for the program is quite simple:

    Voila!  Broken out by table, including a total for the entire query/procedure.

    Here is output looking for 'Scan count'.  Yes, it is case sensitive!

    Here is the Ruby version.     Here is the Python version.  

  4. Improved partition loading

    We have a table that contains just over 1.7 billion records and requires a daily feeding of around 9 million.  It is split into 10, roughly equal, partitions.  I originally just pushed the daily feeding into the main table from a staging table and it worked well, but thought to myself, as I nearly always do when developing: "Is this the best that can be done?".  The single source stuck in the back of my head, but at the time never made it forward.

    While working on a different project, I realized that sometimes pre-processing specific pieces of information provides an overall benefit to the process.  That's when the single source thought in the back of my head came sliding forward and smacked me.  Pre-process the single source using the partitions!

    Here is what the steps look like:

     

    The first Data Flow Task, "Pull Missing Records multi", extracts the 9 million records and divides them up into 10 Raw File Destination components, based upon the partition function.  

    I recreated the partition function, very easily, in a Conditional Split component.  I put them in the order they appeared in the function in SQL Server. 

     

     I decided to split the data loading into 3 Data Flow Tasks 4/4/3.  Oh, you might be thinking, "Hey, Joe!  That adds up to 11".  It does.  I have the default output just in case something is not caught by the partition function.  This has always been 0.  Now I have to read the data back in.

    Each of the OLEDB Destination Tasks looks like this:

    The important thing is NOT to select 'Table Lock', however you can choose whatever 'Maximum Insert Commit Size' suits your needs.  This improved our load times significantly, but of course your mileage will vary.  What I hope to have shown you is to think of problems in a less static way.  Things aren't always as rigid as they might appear.

  5. Behold, SSIS Advanced Editor for Data Source Components

    It's not everyday that one of your information vendors decides to change the format of its data feed.  Evidently, today is not every day.  It would be one thing to change the data type of an incoming field or even add/remove a field from the feed.  Nope, they decided to change from a .csv file to a full blow .xls with a human readable header telling me what the file is for.  Not sure why they would do this, since the file is over 150MB and over 27k rows long!  Who is going to look at it?  Now we have to read in an .xls file with no valid header information.  Its very nice when the Excel Source component reads in the first row and populates the external columns for you.  Its especially nice when the file contains 90 columns!  Even worse, when you have empty fields mixed with alphanumeric and numeric the helpful source component gives a best guess at what the column data type should be.  In our case it was terribly wrong and we needed a way to fix it.  Here is what the new header looks like:

    This is causes bunches of issues with the Excel Source.  I can no longer say 'first row has column names' and this spreadsheet has 90 columns!  Add to this the 'auto-magical' setting of column data types.  To accomplish this, the source reads the first, I believe, 100 rows and makes a determination.  Not helpful in this circumstance.  Let's see if the standard editor provides us with any relief.

    The columns are labeled F1-F90, because, as I stated before, I cannot use the first row as header information!  What I can do here is change the output name from F1 to say 'Reporting Period' and select what columns are read in and pushed down the data pipe.  I need to change data types.  No joy.

    Enter the Advanced Editor available on all Data Source components.  This often overlooked editor is tucked away in the right-click menu of the Data Source.  This editor has 4 tabs, we are interested in the last tab labeled Input and Output Properties.  

     

    Here is where you can wrestle the columns and data types from the component and do with them as you wish.  The source thinks the highlighted column is a Unicode string of length 255 named F1. It's really called 'Reporting Period' and should be a string of length 5, since the value is always 'Daily'.

    Realize we now have garbage in our stream.  We want to throw away everything prior to the first good data record (row 6 in the example above).  We simply added a Conditional Split component to weed out the bad header records. 

    Using the Advanced Editor, the vendor can change the header in any way, even modify the column names themselves and it won't affect the source component.  We have drastically reduced the dependency of the component and our SSIS package on 'proper' feed formatting, allowing us to recover quickly from feed 'adjustments'.

  6. SSIS : Using variables as a source in a Data Flow Component

    Currently, none of the existing Data Flow sources allow you to source variables.  Once again in my quest for performance, I look to keep as much data as possible in memory.  There are many uses for pushing data to external files (Raw Files, control files, etc), but there is a difference between doing so, because you need to and doing so because you have no choice.  In this post I will show you how to take data output from a stored procedure into local variables and then, using a Script Task Source, build an entire Data Flow Component.

    First up is getting the data.  I decided to use a stored procedure, since I can get a single record.  In this post I will use a set of variables, which correspond to a single record.  The stored procedure is created with output parameters.  

     

    The output parameters are mapped to SSIS variables:

     

    We have the result of a stored procedure in memory, through the use of variables.  Now I need a way to use them in a Data Flow and to do so means I need to either introduce them through a Derived Column or a Data Source; I chose Data Source.  To do this I need to create a custom Script Component.

    The first thing I do is create the output columns, which is how the columns gathered within the source component are introduced into the data stream.  I create a corresponding output for each of the variables filled in by the stored procedure.

     

    Now we have to introduce the variables to into the script component and there are two ways to do this.  The first way is to list them in the read-only section of the script component.

    I didn't do that this time.  I was carrying over a bit of prejudice from SSIS 2005 where I found it easier to work with variables through the VariableDispenser object.  I identify the variables I want to insert into the data stream and then pass them down, using the CreateNewOutputRows() method available to a Script Source Component.

    Each package variable is added to the corresponding output variable built off the Script Source Component for use by any component within the Data Flow Task.

    Now you might be thinking, "That's a neat trick, but why would I want to do that.  I want SSIS to pull as many records as possible and push them down then line  I want speed!"  So do I and for the package I built, this is the way to get it.  This is an enhancement to the document loader I built.  Pulling hundreds of .pdfs from a directory slowed down workflow considerably.  It was more efficient to work on a single document at a time.  Now you might be asking 'Why don't I just select from the queue table to get the next document to import?"  The answer is: You can't.  Stored procedures can't be called from within a Data Flow task and since I want to remove the record from the queue, I needed a stored procedure.

    This an enhancement to my Enhanced Threading Framework which I designed.  You can find the older slideshow here.  I am in the process of enhancing it, including a way to turn engines on and off, and improved performance.

  7. SSIS: Using a Script Component as a substitute OLEDB Destination

    I developed an SSIS package that loads invoices, statements, pro-formas, etc. into a SQL Server table.  The invoices are .pdf files and the business required that no identical documents were allowed to be stored.  To determine uniqueness I simply created a hash of each document and used it as the unique identifier.  I used SHA1 as the algorithm, which provided a 20-byte fingerprint and all was right with the world.  That is until the developer working on the project had difficulty using that fingerprint from the front end.

    An application was built that enabled users to meander through the data and view the .pdfs based upon criteria they selected.  Once they narrowed down the field of valid documents, they would click on one or more and it would be pulled from the system and displayed.  The problem was there seemed to be no way for the front-end to properly store the fingerprint, so when the user selected the document to view, the front-end would not send the proper fingerprint back to the server.  The 20-byte fingerprint seemed to get all messed up.  To the credit of the application group they tried everything, but to no avail.  This meant I needed to add an identity column to the table.  Easy-Peasy.  Then I remembered all of the ancillary data.

    There were a few tables that included additional data about the document, such things as the company it belonged to, date of the invoice, and all the things that make life easy for a user to narrow down the millions of documents they need to comb through to get the few they need.  Since the front-end would be combing through this, I could not longer use the fingerprint, but had to make sure the ancillary tables were all based upon the identity column, which I aptly named: Webunique.

    I hit a snag in the beginning on how to do this.  I couldn't use the OLEDB destination to insert the document since there was no way to get the identity column for that record, so I had to write my own using a Script Component.  

     I broke the Script Component out in three pieces and will play them in reverse order, which should make it easier to follow.  The first image will be of the ProcessInputRow.

    The key to extracting the identity column is combining the Insert statement and the select statement in one sql command, separated by semi-colons.  I then execute the combined statements using ExecuteScalar since I should get back a single value, which will be from identity select.  It is all in the same scope, which is important and also took me the longest to figure out.  Now that I have the identity in a variable, I need to pass that up and out of the Component.  This is done in the PostExecute method.

     

     To square things out, the image below shows the additional library (System.Data.SqlClient) needed as well as how to set up the connection string.  I hope this helps solve problems you may be having in the field or expands your knowledge of what can be done with Script Components.

  1. 1
  2. Next ›
  3. Last »