Achievements

« SQL-U Day Two SSIS Fundamentals : Script Tasks and Components | Main | SQL-U Day One SSIS Fundamentals : The Big Picture »
Monday
May102010

T-SQL Tuesday 6: What about BLOB?

I built a system that ingests .PDF files which are of varying sizes (100-400k on avg) which represent various types of files (invoices,statements, pro-formas).  I can ingest any binary file of any size, but we currently only use .pdfs.  This has an interesting SSIS package (which I will discuss after the table structure)  The table structure has many columns, but the ones of interest for #TSQL2sDay are the following:

  • [C002_Data] [varbinary](max) NOT NULL  -- stores pdf
  • [C002_Fingerprint] [binary](20) NULL -- stores 20 byte hash of pdf
  • [C002_webunique] [int] IDENTITY(1,1) NOT NULL --some web apps have issues with the fingerprint

Users of the front end enter a bunch of criteria to narrow down the .pdf files they want to see. Then they click on the row in the front end and that grabs the webunique number which efficiently pulls the Data column for display.  One finger print, one binary (.pdf for now)

A portion SSIS package looks like this:

As you can the C002_FingerPrint column is used to not only pull the .pdf for viewing, but to validate that the .pdf going into the table is not duplicated.  Using SHA1, there is a 1 in 2^80 chance that two totally different .pdf files, or any binary file, run through the algorithm will have the same 20 byte binary hash.  This works incredibly well for us.  We don't have to parse through the binary file looking for bits and pieces and we can include a new binary (say a word or excel doc) without having to build a new library to peek inside.  The script used to calculate the hash is explained below.

The binary file is pulled in using the Import Column component and then passed down the stream to this script which then builds the hash using the C# crypto library.

Later on down the line we actually push the data into a sql table, not via the standard OLEDB Destination object, but via a custom Script Destination Component, so we can get an auto-increment number (seems not all of the .net front ends can easily grab a 20 byte binary, go figure).  I will go into this custom script in a future post.

PrintView Printer Friendly Version

EmailEmail Article to Friend

References (1)

References allow you to track sources for this article, as well as articles that were written in response to this article.
  • Response
    T-SQL Tuesday this month was all about LOB (large object) data. Thanks to all the great bloggers out

Reader Comments

There are no comments for this journal entry. To create a new comment, use the form below.

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>