I am honored to say that I have been selected to present this year. I'll be giving a talk on my threading framework for SSIS: Unseen Minions: Improving parallel processing in SSIS. I am very excited and hope to see you there!
Today I had to write a little function that takes the longhand version of a fraction, say 8 3/4, converts it to a proper number (float), and then converts that number from inches to centimeters. While that is handy, I thought it would be best to include a few more variations on the theme. It handles the following input scenarios:
- Whole numbers -- '8'
- Fractions alone -- '1/4'
- Mixed numbers and decimals -- '8 1/4', '8-1/4', '8.25', '0.75', '.75'
- Invalid input -- 'Special', 'eight and 1/4'
I'd list the code, but I haven't found a good way to embed formatted code into this website, so instead I am linking the source code.
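That said, here is a minimal T-SQL sketch of the parsing logic. It is written for illustration rather than taken from the linked source, so treat the function name and the edge-case handling as assumptions:

```sql
-- Hypothetical sketch: parse a longhand fraction of inches ('8', '1/4',
-- '8 1/4', '8-1/4', '8.25', '.75') and return centimeters; NULL if invalid.
CREATE FUNCTION dbo.FractionalInchesToCm (@input varchar(50))
RETURNS float
AS
BEGIN
    DECLARE @s varchar(50) = LTRIM(RTRIM(ISNULL(@input, '')));
    SET @s = REPLACE(@s, '-', ' ');  -- normalize '8-1/4' to '8 1/4'

    DECLARE @whole float = 0, @frac float = 0;
    DECLARE @space int = CHARINDEX(' ', @s);
    DECLARE @slash int = CHARINDEX('/', @s);

    -- Mixed number: whole part before the space, fraction after it
    IF @space > 0 AND @slash > @space
    BEGIN
        IF LEFT(@s, @space - 1) LIKE '%[^0-9]%' RETURN NULL;
        SET @whole = CAST(LEFT(@s, @space - 1) AS float);
        SET @s = LTRIM(SUBSTRING(@s, @space + 1, 50));
        SET @slash = CHARINDEX('/', @s);
    END

    IF @slash > 0
    BEGIN
        -- Fraction alone ('1/4') or the tail of a mixed number
        DECLARE @num varchar(25) = LEFT(@s, @slash - 1);
        DECLARE @den varchar(25) = SUBSTRING(@s, @slash + 1, 50);
        IF @num LIKE '%[^0-9]%' OR @den LIKE '%[^0-9]%'
           OR @num = '' OR @den IN ('', '0') RETURN NULL;
        SET @frac = CAST(@num AS float) / CAST(@den AS float);
    END
    ELSE
    BEGIN
        -- Whole number or decimal ('8', '8.25', '.75')
        IF @s LIKE '%[^0-9.]%' OR @s IN ('', '.') RETURN NULL;
        SET @whole = CAST(@s AS float);
    END

    RETURN (@whole + @frac) * 2.54;  -- inches to centimeters
END
```

For example, SELECT dbo.FractionalInchesToCm('8 3/4') returns 22.225, and the invalid inputs above come back as NULL.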
Working with Excel in SSIS is not fun and is something I whole-heartedly discourage. With that said, I ran into a not-so-uncommon issue that I thought I would blog about once I found the answer. The issue is writing columns of data to Excel with a width greater than 255 characters, more specifically varchar(max).
Whenever I need to export data to Excel within SSIS, I like the approach of having a blank template that I copy to a work location, fill with data, rename, and ship off. The issue I ran into came when I had to change a column name in one of my templates. Upon saving the template and re-running the Data Flow Task, I got the error codes above. The complete message was as follows:
I delete the existing tab, called Excel_Destination, with a drop table statement. I make sure the Execute SQL Task has a connection type of EXCEL, a result set of type None, and the connection manager is tied to the Excel spreadsheet I am modifying.
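The statement itself is tiny; Excel_Destination is the tab name from above (the backquotes are my assumption here; they are likely optional for a name with no spaces):

```sql
DROP TABLE `Excel_Destination`
```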
The above statement will run and delete the offending tab. Now I have to rebuild it, which I do with a create table statement with a twist: the backquote (`). The backquotes surround the column names, and the table name is the tab name.
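Here is the shape of the statement; the column names and the first three types are placeholders I picked for illustration, but the memo column is the important part:

```sql
CREATE TABLE `Excel_Destination` (
    `Name` varchar(255),
    `Code` char(10),
    `Qty` int,
    `Notes` memo
)
```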
Now, you might have noticed that I used standard SQL data types (varchar, char, int). It turns out they all translate to varchar(255). The only column of interest to me is the memo data type, which is synonymous with varchar(max).
So long as I don't resave that template, the DT_NTEXT field will stay. If you need to make a change to the column names, do it programmatically or you will get the above error messages. Hope this saves some of you the headache I went through :)
There are countless ways to mix and match components in SSIS to simulate what SQL Server does in a query. Here I am going to show you how to replicate an existence check using SSIS components. The only time I use this type of configuration is when I have cross-server joins. I find it more efficient for SSIS to hop to different servers in discrete components than to rely on SQL Server to do linked server calls. Here is an example of a simple select statement that does a cross-server join to validate the data.
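The table and column names below are stand-ins I'm using for illustration, but the shape of the query is the point; the secondary server is reached through a linked server call:

```sql
-- dbo.BookOrders lives on the primary server; [otherserver] is a linked
-- server (table and column names are hypothetical).
SELECT o.isbn13, o.order_qty
FROM dbo.BookOrders AS o
WHERE EXISTS (SELECT 1
              FROM [otherserver].Catalog.dbo.Books AS b
              WHERE b.isbn13 = o.isbn13);
```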
There are two reasons I don't care for this method. First, your package will only have the data source for the primary server, so you will not know from looking at the package that it also depends on the secondary server ([otherserver]). Second, you are relying on the efficiency of linked servers to do the work. SSIS has a much cleaner mechanism: the Lookup Transformation. Here's how it would look if we took apart the above query and used SSIS components to do the same work:
Let's break this image down and see how we created a deconstructed SQL Server existence check in SSIS. In our OLE DB Source we remove the existence check and simply pull from the source table:
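Using the same hypothetical names as above, the source query loses the EXISTS clause entirely:

```sql
SELECT isbn13, order_qty
FROM dbo.BookOrders;
```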
Next we drop in a Lookup Transformation, and in it we put our subselect:
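Again with the hypothetical names from above; note there is no [otherserver] prefix, because the Lookup's own connection manager points at the secondary server:

```sql
SELECT isbn13
FROM dbo.Books;
```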
Now at this point each component points to its own Server.Database source, and each of those sources will be listed in the Connection Manager tab. This is one of the benefits of the pattern: it is self-documenting, with no servers hiding inside components. Now we have to connect the two components.
After connecting them we need to tell SSIS this is an existence check. The great thing about this component is that it can act as either an exists or a not exists. We do that in two steps. The first step is to tell the component to redirect all rows that don't match to a special output (highlighted in yellow). If we didn't, the default is to fail the component, and while there are many reasons you might want that, we don't here.
Before we move on to the final step I want to take a moment to explain a little more about this component. You will notice there are three choices at the top: Full Cache, Partial Cache, and No Cache. The default is Full Cache, which tells SSIS to select all records from the lookup table for the chosen columns and store them in memory for that component during the initialization phase of the package. The only two caveats are that the records must be unique (use a distinct if you have to) and the total number of records must fit in memory. If you've jumped ahead a little you'll be thinking, "I don't want to use this on very large tables." That's not totally accurate, as you can also use Partial Cache or No Cache. Partial Cache gives SSIS a limited amount of memory and does not prefill it. When the first record comes down from the source, the component checks its internal cache; if there is no match, it goes out to the database (using the provided query) and tries to retrieve a record matching the key. If there is a match, that record is stored in the cache, and the next time that key appears in the data stream the component will not have to go to the database. You can set the maximum amount of memory the component is allowed in this tab:
This works out well when the table you are checking AGAINST is large but the set of distinct keys coming from the source is small, as the Lookup Transformation will only cache the matches and only go to the database for keys it has not cached. This means that if you have 10,000 records in your source table but only 100 unique isbn13s, you will only make 100 calls through the lookup (assuming your cache is big enough to hold all 100 keys). If your 10,000 records were all unique, you would make 10,000 calls to the lookup table, which would be similar to No Cache.
With No Cache, the component does a lookup for each incoming record and only stores the last retrieved key. This can behave identically to Partial Cache if the values in the source table are unique. Even with all the potential gotchas, this combination can be more efficient than a cross-server join.
Finally, you connect the output of the Lookup component to the destination. This is where you can use the Lookup Transformation as either an Exists or a Not Exists. Here is the final product one more time:
We are looking to replicate an exists, and to do that we use the Lookup Match Output. This output carries the records that have a match in the lookup. If you wanted a not exists, you would simply use the Lookup No Match Output. You can even use both, getting a list of records that exist and a list of records that do not.
Have at it and see if this gives your SSIS packages the boost you were looking for.
During my SSWUG webcast I was asked to give some examples on how to tune the DefaultBufferMaxRows and DefaultBufferSize, so here goes.
When a source pulls in data, it places the rows into buffers. The default setting of DefaultBufferMaxRows=10,000 means the source will collect up to 10k records into a buffer and then 'dump' it into the pipeline for processing. Ignoring asynchronous/synchronous components, no more rows will be pulled in until those buffers are processed by your destination. While this sounds like a batch, and you can think of it that way, don't confuse it with the batching at a destination; they don't have to match. I don't believe there is a stated upper limit (I've done 1M).
DefaultBufferSize tells SSIS how much memory to set aside for each buffer of those 10k rows. Buffers are structures (I don't know what language SSIS is built in) that need memory allocated, but just how much? The default is 10MB (10485760 bytes), and the maximum size is 100MB (104857600 bytes). Here is how the parameter works. As SSIS fills a buffer it keeps track of the memory allocated. If you have really wide rows (over 10485760/10000 = 1048 bytes each), you will reach 10MB right around the 10k-row mark. If a row is 2096 bytes, you will hit the limit around 5k rows.
Now you've probably had the 'Ah Ha!' moment. So how do you use this information to improve the performance of your Data Flow Task?
- Find the avg size of the source record: 523 bytes (e.g.)
- Find how much memory you want to afford this task: 100MB (we'll give it the max)
- Find how many records your source can pull at any given time: 104857600 / 523 = 200492 (always round down)
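If you'd rather measure the average row size than guess, here is one way to do it in T-SQL; col1/col2/col3 are placeholders for your source columns (wrap nullable columns in ISNULL so NULLs don't skew the math):

```sql
-- Estimate the average source row size in bytes (placeholder columns).
SELECT AVG(DATALENGTH(col1) + DATALENGTH(col2) + DATALENGTH(col3)) AS avg_row_bytes
FROM dbo.SourceTable;

-- 100MB budget divided by a 523-byte row, rounded down:
SELECT FLOOR(104857600.0 / 523) AS max_rows_per_buffer;  -- 200492
```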
So given the max memory I can allow per task, setting my DefaultBufferMaxRows above 200492 would be a waste. There you have it, a simple mechanism to determine the proper settings. More often than not I set my task to 100MB and reset my buffers to fill it. Yes, there is a possibility that if you have many packages running simultaneously, each with many Data Flow Tasks, and all of them are at 100MB, you will run out of memory on your SSIS machine rather quickly, causing the packages to push buffers to disk. If that happens and you notice performance degrading, seemingly at random, look to lower this number. While it might seem counterintuitive to pull fewer records in each round, if I don't have to write buffers to disk, the total run time will be faster.