Navigation
Follow Along

Powered by Squarespace
Tuesday
03Mar2009

Not Exists in SSIS

There are many times that I would like to insert a record into a table only if it doesn't exist. This is done easily in T-SQL:

insertTableA
selectB.RepID
fromTableBB
wherenotexists(select1
fromTableAA
whereA.RepID=B.RepID)

SSISis a little more complex, but once you get the hang of it, it's very powerful.

Create adataflowobject and within it place adataflowsource. Select from theSSIStoolbox, underdataflowtransformations, theLookuptask. Attach your task to thedataflowsource. Once you have that connected we move on to the next part.

Double-click on theLookuptask and it will open up a window with 3 tabs. First tab is the reference table, that is the table you will be checking. In the example above, that would beTableA. I personally like to select only the columns I need to validate what I am inserting (which would most likely be the 1 or more columns that make up the unique key). Once this is done, click on theColumnstab.

TheColumnsis where you link the reference table, (Table B)to the source table (TableAin the above example). Drag one or more columns from theAvailable InputColumson the left to the correspondingAvailableLookupColumnson the right. Click any one of the check boxes from the right side. It doesn't matter which one as we are only using this as a flag. Once you do this, you will see an entry in the grid below withLookupColumn,LookupOperation,Output Alias. This should be filled in with the column on the left, the corresponding column you clicked thecheckboxon to the right and in the middle it should say <add as new column>. Remember this column name. We aregoing to skip theAdvancedtab (I'll cover that in another post soon). Now, at the very bottom left of theLookuptask you will see a buttonConfigure Error Output, click on it.

Across the top you will see 5 columns, you want to focus on the thirdError. Right below that column you will see, by default,Fail Componentand if you click in it you will have 3 options:gnoreFailure, Redirect Row, Fail Component. You want to selectIgnore Failure. You might be asking why.

The purpose of theLookuptask is to find an existing record and if no record matches it is considered a failure. We alter that default behavior and tellSSISto pass the record through even if there is no match. We catch the failure with aConditional Splittask.

The output of theLookuptask goes into theConditional Split. Drag down theISNull()function and place it on the first line. Click on theColumnsfolder on the left side and find the column we selected up above in theLookuptab. Drag that column into theISNullfunction. I usually change the output name to something along the lines ofValid_RecordorNew_Record. What you have done here is basically tell theConditional Splittask to output any record with that has a null value, which translates into: Pass any record that doesn't already exist. You then connect the output of that condition to your next task and continue on programming. There you have it, usingSSISto input records that don't already exist.

Friday
20Feb2009

Table Variables and Parallel Processing

I have found an interesting quirk about table variables and sql server 2005. When I create a table variable to insert records into,all parallelism in the query plan is removed. I found that you can use table variables in joins and they themselves can partake in parallel operations. As an example.

declare@ordernumstable(ONumint)
select distinct OrderNumbers
from OrderTable

Select......
from Thistable t
where exists(select1
from @ordernums o
where o.ONum=t.ONum)

The above statement will allow for parallel scans of both the @ordernums table and This table (if the optimizer chooses).

However, if you do the following:

declare@Resultstable(ONumbersint,TotalSalesmoney)

insert@Results
select top(5)
ONum,sum(mysales)
fromThistable .........

Any parallel plans sql server might have thought of are thrown out the window. This took me a few hours to track down because I didn't even think of looking at the insert.

Sunday
08Feb2009

Parallel processing in SQL Server

There have been manyoccasionswhere I've needed to run multiple copies of the same process for performance reasons. Let's say for example that a stored procedure can process 10 records/sec, but I have 1,000,000 records to process. It would take 27.7 hours to process at the previous rate. If I could break up this process into discrete pieces I could run multiple copies and reduce the total run time in a linear fashion (that is if it takes 10 hours to run 1 copy, it would take 5 hours to run 2 copies...1 hour to run 10 copies). There is a computer operation known as, Modulo.

The Modulo operation, available in nearly all programming languages (% or Mod), returns the remainder of a division rounded up to the nearest whole number. Here are some examples:

0%3=0
1%3=1
2%3=2
3%3=0
4%3=1
5%3=2
6%3=0

If you look at the above pattern carefully, you will see that it repeats and that is the handy part of the modulo operation.

Let's go back to the example above and try to break up the task of working through a 1,000,000 records. I will create a table as follows:

Create table Base
(First_Name varchar(30),
Last_Name varchar(100),
workidint identity(1,1))

After you enter 1,000,000 records into the table Base,workidwill have values ranging from 1-1,000,000. Now you have a stored procedure which needs to take the first/last name combination and do some elaborate work on it (so much so that you can only do 10 records/second). Within the stored procedure you will no longer just do this:

select First_Name,Last_Name
from Base

You will do this:

select First_Name,Last_Name
from Base
where workid%x=y

To solve for and ywe need to figure out how many simultaneous processes. I say 5. That means x=5 and would range from 0-4. The simple formula would be, the range of y=0 through x-1.

select First_name, Last_Name
from Base
where workid%5=0

select First_name, Last_Name
from Base
where workid%5=1
......

How does this help? The first procedure to run, would pick up records workid%5=0or any multiple of 5 (5,10,15,20, etc). The second procedure would pick up records workid%5=1 ,records 1,6,11,etc. The final procedure would pick up workid%5=4 (4,9,14,etc). Work is evenly divided and never duplicated no matter how many or how few you use.

Make sure you understand the above technique and those in both previous (cryptographic hashes) and future posts. I will start combining many of these base techniques into more complex scenarios that I use in many of my SSIS processes.

Saturday
07Feb2009

Cryptographic Hash Functions

Cryptographic hash functions take an arbitrary number of input bytes and reduce it to a fixed size. This resultant size is dependent upon the function you use, be it MD4,MD5,SHA-1, etc. I prefer to use SHA-1 since it is the most secure of the 3 listed above. There are many other cryptographic functions, but they are not included in all libraries.

SHA-1 has an output size of 160 bits (20bytes) for an input as large as 2^64-1 bytes. The output size of 160 bits means that the chance of a two completely random input streams to produce the same hash is 1 in 2^80 ( or 2^(output size/2)). I have not yet worked on a data set that large, so it's good enough for me to use in this fashion; digital fingerprint. Let's say you have a customer table withfirstname,lastname,address1-address5, city,state,zip. You want to update the record every time something changes in any of the 10 columns. If you happen to have a trigger on the table, then you just do your updates based upon the records that changed, which is happily provided to you by the trigger. If on the other hand you are working with very large data sets or don't have the option of a trigger you need an effective way to determine which records have changed. You might not want to run a 10k record update to catch the 100 or so that actually changed. Here is where the hash comes into play. Depending upon the function you are using (either SQLServer's built-in HashBytes) or through SSIS and a .Net function, you want to take all of the columns you that can change and condense them into a single value. In the example above, which was allchar(x)fields, you can simply do firstname+lastname+address1+address2....Feed the entire concatenation into the function and you have a digital fingerprint. If anything in any of the fields change, then the 'fingerprint' will be different and a simple comparison (orig.digitalprint!=delta.digitalprint) enables you to identify the record. With SQLServer 2005 sp2,HashBytes, is deterministic so you can make a computed column which will calculate the HashByte of each record automatically. You can then run a sql statement which will do an update if the hashes match or an insert if they don't. This is a very good unique key, providing very random placement (the purpose of a cryptographic function is to be as truly random as possible) versus adding a incrementing column simply to provide a unique index.

Wednesday
04Feb2009

You can call me, Julian.

Here is a little bit of sql to help convert Julian dates to Gregorian dates. It seems as though some ERP systems use Julian dates to store their date information vs datetime. It works on a formula that determines the number of days since 1/1/1900.

Declare @JulianDate as int,
@GregorianDate as smalldatetime

select@GregorianDate=case
when@JulianDate > 0
thendateadd(dd,(@JulianDate%1000)-1,cast(dateadd(yy,(@JulianDate/1000),'01/01/1900') as smalldatetime))
else'1/1/1900'

The reason for making sure the @JulianDate variable is>0is to keep the result as a smalldatetime. If you run just the formula and @JulianDate = 0 the result is '12/31/1899', which is a datetime. I find wasting 4 bytes (smalldatetime:4 bytes, datetime:8 bytes) for this condition is senseless since julian dates don't store time, just the day.