Today, Day 3 of SQL-U's SSIS session, we will go over the Data Flow Task exploring some of the components and digging into a little bit of performance tuning. Let's look at one of the Tasks from the 'Big Picture' you saw on Day one.
Essential you have a start, an optional middle, and an end. I say optional middle, because I have used SSIS to simply move data between two servers. This would have a source and destination. In this example we have some very common middle components which I will go over one at a time. First off, Data Source.
There are many data source choices, this one is for pulling data from an excel spreadsheet. I don't care for this very much. I'd prefer a flat file, '|' (pipe) delimited, if only because using Excel for source or destination, prevents SSIS from running in 64-bit mode. There are no 64-bit drivers for Excel.
This component is tied to an excel connection manager, which gives some basic information. The location of the file, what version of Excel and if the first record contains headers.
I now have some source information. I intend to inspect this information, make a few corrections, and then push this into either a SQL Server table, or an Excel spreadsheet and package variable. I'm jumping ahead a bit. One of the most commonly used components is the Derived Column component, which does just what it says; Derives new columns from old ones. Excel brings data in as Unicode and if you are looking to move this data into SQL Server it should be converted into non-Unicode or code page 1252, unless of course your columns are for some reason Unicode.
Some derivations are very simple, converting Unicode to codepage 1252. An example of this is the first column Adj_ISBN10. Since Excel brings string columns in as Unicode, length 255, I use Substring to take the first 10 characters of the source column named 'ISBN'' and then I cast it from Unicode to codepage 1252 (which is the Latin alphabet). Read the derived column from right to left. There is another example where I don't even take use a source column, I just create a column that is inserted into the data stream. Take 'C001_SaleUnitWTD', which has no source column, it is always an integer with a value of 0. There are dozens of Casts available within SSIS and I won't get into them here. Take some time and look at them in the 'Type Casts' section. Some will be familiar others will not (those you can look up online). Any column created in a Derived Column component is available to the stream from that point on.
At this point I have all of the columns from my source, modified a few, and added a few of my own. I now want to split the data up. To do this I use a Conditional Split component.
The Conditional Split component applies rules to the incoming data and based upon those rules, places the records that qualify in the data streams specified. In the above example I want to move every record that has both columns ISBN and ISBN13 are null. If this condition is met, the record will be placed in the Bad ISBN data stream. If you look at the data flow image at the begining of this session, you will notice there is no Bad ISBN data stream. Why is that? I don't care what happens to those records, I in essence toss them away. SSIS sees that I do not connect that stream to another component and politely throws out those records that meet that criteria. All the records that fail to meet that criteria get pushed in the Conditional Split Default Output that you see at the bottom, just above the Okay,Cancel,Help buttons. That is also the stream you see connecting the next component, the Script Component, Correct ISBNS. We will be skipping the internals of that since it was discussed in detail in Day Two's session. Word of caution about the Conditional Split component. If you have more than one condition, the component checks them in order from top to bottom or 1 through X. That is what the little column Order means. Make sure you have the most restrictive to least restrictive checks or you might weed out records that you planned on using later in the flow. Each condition will be given its own stream. What you do with that stream is entirely up to you.
Lookup transformation is a powerful component that enables you to take one or more columns from the incoming record and check to see if that exists in a lookup table or cache. In today's lesson we are going to show the table lookup, but I suggest you also look at the cache lookup which can improve the efficiency of queries against large data sets, data sets on slower machines or across slow networks.
The highlighted portion is an important one. An unusual feature of this component is by default it errors when a record does not exist in the lookup. I found that odd, since most people are doing a lookup to determine if the record exists or not and will take different actions based upon the answer to that question. My cache mode is full, which means I SSIS will read the entire table this lookup is based upon into memory. If the table is too large or the memory on your SSIS machine is restricted, you have the choice of a partial cache, which limits the number of records kept in memory to a specific number you specify or no cache, which does a lookup for every row.
Here I chose my <server>.<database> and the <view> to build my cache.
Below is how the Lookup works.
You match one or more columns from the data stream with one or more columns from the Lookup source, which in this case is table/view based. NOTE: The lookup is case and data size sensitive! I can't tell you the amount of time I spent trying to understand why I couldn't match C001_ISBN13 to F021_ISBN. C001_ISBN13 was initially sized not as a varchar(13), but as a varchar(15). This is done throughout our tables and it's for future growth. This caused a problem when I went to match the two. I changed the data size of the column in the stream to match the size of the column in the table and viola! Upon a match, I extract two columns from the cache table; F021_PubPrice and F021_DateOnSale. If I match I continue down the stream via the 'Lookup match output' if I cannot find the lookup I continue down the 'Lookup no match output'. Let's go down the 'Lookup match output'. This leads us to another Derived Column.
This Derived Column creates two columns based upon the information obtained from the Lookup. Continuing down this stream we end up at an OLEDB Destination. We match up the columns we want and the records will be pushed into the table defined.
Taking a step back up the stream, let's move follow the stream created when the Lookup finds no match.
The first component we encounter moving down the no match stream is a Row Count component.
The Row Count component counts the number of rows that pass through it. The number is stored in a package variable defined in the highlighted section. In this case the variable is NOP. We will use that later when we need to determine of there were any ISBNs that were not published by us.
Continuing down the stream we run into a Multicast component, whose job it is to take a single data stream and broadcast it to two or more streams. In this case I need the data stream to be sent to the custom destination and the client requested an excel containing information about the ISBNs (so they can be corrected). I find it very handy.
At the end of this stream we have the custom destination component we described yesterday and an excel destination. The excel destination is very similar to and OLEDB destination except you define an excel file vs a database table.
This wraps up Day 3 of the SSIS Fundamentals class. Tomorrow will be our fourth and final day. We will be going over the miscellaneous components, such as the email component and the precedence constraints.