Achievements

Sunday
Feb 8, 2009

Parallel processing in SQL Server

There have been many occasions where I've needed to run multiple copies of the same process for performance reasons. Let's say, for example, that a stored procedure can process 10 records/sec, but I have 1,000,000 records to process. At that rate it would take 27.7 hours to finish. If I could break this process into discrete pieces, I could run multiple copies and reduce the total run time in a linear fashion (that is, if it takes 10 hours to run 1 copy, it would take 5 hours to run 2 copies ... 1 hour to run 10 copies). There is an arithmetic operation that makes this easy: modulo.

The modulo operation, available in nearly all programming languages (% or Mod), returns the remainder of an integer division. Here are some examples:

0 % 3 = 0
1 % 3 = 1
2 % 3 = 2
3 % 3 = 0
4 % 3 = 1
5 % 3 = 2
6 % 3 = 0

If you look at the pattern carefully, you will see that the results cycle through 0 to 2 and then repeat; that repetition is the handy part of the modulo operation.
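The same operator is available directly in T-SQL, so the examples above can be checked from any query window:

```sql
-- % is the T-SQL modulo operator; 7 divided by 3 leaves a remainder of 1.
select 7 % 3 as remainder
```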

Let's go back to the example above and try to break up the task of working through 1,000,000 records. I will create a table as follows:

Create table Base
(First_Name varchar(30),
Last_Name varchar(100),
workid int identity(1,1))

After you enter 1,000,000 records into the table Base, workid will have values ranging from 1-1,000,000. Now you have a stored procedure which needs to take the first/last name combination and do some elaborate work on it (so much so that you can only do 10 records/second). Within the stored procedure you will no longer just do this:

select First_Name,Last_Name
from Base

You will do this:

select First_Name,Last_Name
from Base
where workid%x=y

To solve for x and y, we need to decide how many simultaneous processes to run. I say 5. That means x=5 and y would range from 0-4. The simple formula: y ranges from 0 through x-1.

select First_name, Last_Name
from Base
where workid%5=0

select First_name, Last_Name
from Base
where workid%5=1
......

How does this help? The first procedure to run would pick up the records where workid%5=0, i.e. the multiples of 5 (5, 10, 15, 20, etc.). The second procedure would pick up the records where workid%5=1 (1, 6, 11, etc.). The final procedure would pick up workid%5=4 (4, 9, 14, etc.). The work is evenly divided and never duplicated, no matter how many or how few processes you use.
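Putting it together, the worker can be a single stored procedure parameterized by the process count and slot. A sketch (the procedure name ProcessSlice and the placeholder body are mine; the table and columns come from the example above):

```sql
-- @x = total number of parallel copies, @y = this copy's slot (0 .. @x-1).
create procedure ProcessSlice
    @x int,
    @y int
as
begin
    select First_Name, Last_Name
    from Base
    where workid % @x = @y
    -- ...replace the select with the elaborate per-record work...
end
```

Run exec ProcessSlice 5, 0 through exec ProcessSlice 5, 4 from five separate connections and each copy works its own fifth of the table.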

Make sure you understand the above technique and those in both previous (cryptographic hashes) and future posts. I will start combining many of these base techniques into more complex scenarios that I use in many of my SSIS processes.

Saturday
Feb 7, 2009

Cryptographic Hash Functions

Cryptographic hash functions take an arbitrary number of input bytes and reduce them to a fixed size. The resultant size depends upon the function you use, be it MD4, MD5, SHA-1, etc. I prefer to use SHA-1 since it is the most secure of the 3 listed above. There are many other cryptographic functions, but they are not included in all libraries.

SHA-1 has an output size of 160 bits (20 bytes) for an input as large as 2^64-1 bytes. The output size of 160 bits means that the chance of two completely random input streams producing the same hash is roughly 1 in 2^80 (or 2^(output size/2), the birthday bound). I have not yet worked on a data set that large, so it's good enough for me to use in this fashion: as a digital fingerprint.

Let's say you have a customer table with firstname, lastname, address1-address5, city, state, zip. You want to update the record every time something changes in any of the 10 columns. If you happen to have a trigger on the table, then you just do your updates based upon the records that changed, which is happily provided to you by the trigger. If on the other hand you are working with very large data sets, or don't have the option of a trigger, you need an effective way to determine which records have changed. You might not want to run a 10k record update to catch the 100 or so that actually changed. Here is where the hash comes into play. Depending upon the function you are using (either SQL Server's built-in HashBytes, or a .NET function through SSIS), you want to take all of the columns that can change and condense them into a single value. In the example above, which was all char(x) fields, you can simply do firstname+lastname+address1+address2.... Feed the entire concatenation into the function and you have a digital fingerprint. If anything in any of the fields changes, then the 'fingerprint' will be different, and a simple comparison (orig.digitalprint != delta.digitalprint) identifies the record. With SQL Server 2005 SP2, HashBytes is deterministic, so you can make a computed column which calculates the hash of each record automatically. You can then run a sql statement which does an update where the hashes differ, or an insert where the record is new.
This also makes a very good unique key, providing very random placement (the purpose of a cryptographic function is to be as close to truly random as possible) versus adding an incrementing column simply to provide a unique index.
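A sketch of the computed-column approach (the table Customer, staging table Customer_Stage, and key CustomerID are illustrative names, not a real schema; HashBytes is the SQL Server 2005 built-in):

```sql
-- Persisted fingerprint computed from the changeable fields.
alter table Customer
add DigitalPrint as HashBytes('SHA1',
    firstname + lastname + address1 + address2 + city + state + zip) persisted

-- Touch only the rows whose fingerprint differs from the incoming copy.
update orig
set    orig.address1 = delta.address1   -- (set every changeable column here)
from   Customer as orig
join   Customer_Stage as delta on orig.CustomerID = delta.CustomerID
where  orig.DigitalPrint <> delta.DigitalPrint
```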

Wednesday
Feb 4, 2009

You can call me, Julian.

Here is a little bit of sql to help convert Julian dates to Gregorian dates. It seems as though some ERP systems use Julian dates to store their date information vs datetime. The formula splits the Julian value into the number of years since 1900 (the thousands) and the day of the year (the last three digits).

Declare @JulianDate as int,
        @GregorianDate as smalldatetime

select @GregorianDate = case
    when @JulianDate > 0
        then dateadd(dd, (@JulianDate % 1000) - 1,
             cast(dateadd(yy, @JulianDate / 1000, '01/01/1900') as smalldatetime))
    else '1/1/1900'
end

The reason for making sure the @JulianDate variable is > 0 is to keep the result as a smalldatetime. If you run just the formula with @JulianDate = 0, the result is '12/31/1899', which falls outside the smalldatetime range (it starts at 1/1/1900) and so forces a datetime. I find wasting 4 bytes (smalldatetime: 4 bytes, datetime: 8 bytes) for this condition senseless, since Julian dates don't store time, just the day.
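A quick worked example (the value 109039 is mine): it decodes as 1900 + 109 = 2009, day 39 of the year.

```sql
declare @JulianDate as int,
        @GregorianDate as smalldatetime

set @JulianDate = 109039   -- year 2009 (109 years past 1900), day 39

select @GregorianDate = dateadd(dd, (@JulianDate % 1000) - 1,
       cast(dateadd(yy, @JulianDate / 1000, '01/01/1900') as smalldatetime))

select @GregorianDate   -- 2009-02-08
```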

Tuesday
Feb 3, 2009

One-Way hashing, HashBytes, and SSIS

I helped develop an application that stores documents within SQL Server (2005); they are then retrieved for use by a web-based front end. I had the task of co-developing the back end pieces: table structures, indexes, relationships, and an SSIS package to manage the load. The design is straightforward: documents are stored in a varbinary column, with ancillary data stored in additional tables to facilitate look ups. The current document form is a .pdf of around 80k each. The current logic states that we keep only one copy of a given document, and the first copy we get is the one we keep. It is easy to maintain in that I do a look up in the ancillary tables prior to loading the "new" .pdf, and if the document exists, I don't bother loading the new one. Then came the curve.

Since this application stores various business documents (of any physical type: .pdf, .xls, .doc, etc.) and those documents have various business rules, a new rule came to bear. A new business document type had a rule that stated we needed to maintain a certain number of distinct copies. Unlike the previous document type, whose rule stated you keep the first copy, this one stated you had to maintain multiple, let's say the last ten. To satisfy this business need I had to figure a way to keep 10 DIFFERENT copies, not 10 identical copies. How do I determine if a binary file has changed from one version to the next? By using a cryptographic hash function (I'll explain more about cryptographic hash functions in a soon to come post). HashBytes to the rescue, or so I thought.

SQL Server 2005 has a function called HashBytes which takes in the name of a known hash function (say SHA-1) and an input of varbinary(max), and hands back a varbinary of the appropriate size (depending upon the hash function you use). Well, it seems as though BOL has a bit of a misprint: it turns out HashBytes has a max input size of 8000 bytes, which makes it unsuitable for me (remember the average size of a document is 80k). So this wouldn't be as simple as adding a computed column (which could only be done in SP2 anyway, since prior to that HashBytes was considered non-deterministic). Back to SSIS and the load process.
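For inputs under that limit the built-in works fine; a minimal illustration (the literal is mine):

```sql
-- Works: input is well under 8000 bytes. SHA-1 output is always 20 bytes.
declare @doc varbinary(8000)
set @doc = cast('a small document' as varbinary(8000))
select HashBytes('SHA1', @doc) as HashValue
```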

SSIS has two very powerful 'widgets': the script task and the script component. The script component is the one to solve this task. Within the widget I can create a VB.NET component, and VB.NET has a few cryptographic hash functions in the System.Security.Cryptography namespace. The code reads in a blob from the input stream and calculates a hash, which it then passes down the stream for the rest of the SSIS package to use.


Imports System.Security.Cryptography

Public Class ScriptMain
    Inherits UserComponent

    ' Reuse one SHA-1 provider for every row.
    Dim mySHA1 As New SHA1CryptoServiceProvider()

    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        Dim blobdata() As Byte
        Dim fSize As Int32

        ' Pull the entire blob out of the dataflow column...
        fSize = Convert.ToInt32(Row.c002data.Length)
        blobdata = Row.c002data.GetBlobData(0, fSize)
        ' ...and hand the 20-byte hash down the stream.
        Row.HashValue = mySHA1.ComputeHash(blobdata)
    End Sub

End Class

In my script component I have an incoming dataflow column called c002data, which contains the .pdf, and an outgoing dataflow column called HashValue. I get the entire blob out of the stream and into the blobdata array, which I then pass to the SHA-1 provider; it computes the hash and sends it on its merry way! I can now check whether the digital document has changed and, if it has, add it to the table. This is a simple, quick, and efficient way to manage binary data.
