Today we are going to do something a little different. We are going to dig into a more efficient and powerful way to scan directories for files and process them. The idea behind this method is to load the full path names of the files (and any ancillary data) into a SQL Server table and then build multiple Data Flow tasks to extract and process them. We will look at a script that will do this for you, as well as a sample Data Flow task that shows how to use it. Gloves and goggles, people. This isn't any less messy.
Here we have the Script Task, a close relative to the Script Component. This lives on the Control Flow of your package. There is no concept of source, destination, or transformation; it just is. When you create a new one, it comes with a single method: Main(). Cutting into the script task designed to read files, we have this:
You'll need to graft a few extra pieces onto the normal task, including the System.Diagnostics and System.IO namespaces. I prefer to put them in alphabetical order, as it helps when you are quickly scanning for libraries. System.IO is needed to talk to the file system. System.Diagnostics enables us to send errors to the Windows event log of the host machine. I find this handy when troubleshooting, so I put it in all my scripts (tasks and components). Script Components get this ability natively. If you remember from the previous dissections, there were these little dangling bits in the error section:
Having the ability to talk to the file system is all well and good, but you need a mechanism to relate back to the SSIS package. Lone components rarely live long outside of the host package, and this one is no exception. If you cut along the middle bluish line and move it to one side, you'll see a few variables tucked inside. The most important one is the green-highlighted "User::import_file". This tells the component where to place the list of matching files. It is this variable that will tie this component to the package. The other two variables you see there are directories to search. In this specimen there are two: User::cn_filepath and User::tr_filepath. Here is how they are used:
This script is looking for ".pdf" files contained within the directories provided by the variables. I know there are no subdirectories, so I simply loop through each directory and match the files by extension. For every directory you need to search, you add another loop. You can easily modify this to accommodate your needs. Another benefit is that I create a specific file layout, which you can see pointed out in red: <fullpath>|<filename>|J|TR. Later on you will see how this blends with a File Source component. Once again, dangling near the bottom is the write to the error log, this time using the EventLog object found in System.Diagnostics. Move this to the side and grab a fresh Data Flow task from the bucket.
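The actual task is C# living in the Script Task's Main(), reading the directory variables and writing its results to the file named by User::import_file. As a language-neutral sketch of that same loop-and-match logic (the helper name, the directory-to-tag mapping, and the assumption that each directory carries its own source tag are my illustration, not the post's code):

```python
import os

def scan_for_pdfs(dir_tags, import_file):
    """Scan each directory for .pdf files and write one pipe-delimited
    line per match to the import file: <fullpath>|<filename>|J|<tag>.

    dir_tags maps a directory path to its source tag (e.g. "TR" or "CN"),
    mirroring the User::tr_filepath / User::cn_filepath variables.
    """
    lines = []
    for directory, tag in dir_tags.items():
        # No subdirectories to worry about, so a flat listing is enough.
        for name in os.listdir(directory):
            if name.lower().endswith(".pdf"):
                full_path = os.path.join(directory, name)
                lines.append(f"{full_path}|{name}|J|{tag}")
    with open(import_file, "w") as fh:
        fh.write("\n".join(lines))
    return lines
```

Adding another directory to search is just another entry in the mapping, which is the same "add another loop" growth the post describes.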
This is the second phase of processing. Having a bunch of flat file paths does me no good. I will have to use a File Import component at some point, which means I will need a Data Flow task with my logic. If you go back to the first day of class, the large multi-colored Data Flow task is the one I use. We are going to take that one file, which is comprised of the information created up above (<fullpath>|<filename>|J|TR), and load it into a staging table. Along the way I am going to count the number of files that match either TR or CN; I can do it easily, as you can see. The end result gets pushed into a sort of queuing table called T003_PendingFiles.
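Inside the package that count is done with Data Flow components over the rows, but the tally itself is simple enough to sketch. A minimal version, assuming the pipe-delimited layout above (the helper name and returned dict shape are mine, not the post's):

```python
def count_by_source(manifest_lines):
    """Tally how many queued files carry each source tag (e.g. TR or CN).

    Each line is expected to follow the layout <fullpath>|<filename>|J|<tag>,
    so the tag is simply the last pipe-delimited field.
    """
    counts = {}
    for line in manifest_lines:
        tag = line.rsplit("|", 1)[-1]
        counts[tag] = counts.get(tag, 0) + 1
    return counts
```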
There is one more thing to do. The File Source needs to know what file to read in. Remember the variable "User::import_file"? That file is where the Script Task writes its entries, and that is where the File Source will read from. Here is how you tie the two together.
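One common way to make this tie-up in SSIS (the post's screenshots may show the exact clicks) is a property expression on the Flat File Connection Manager that the File Source uses, pointing its connection string at the variable:

```
Flat File Connection Manager → Properties → Expressions
    Property:   ConnectionString
    Expression: @[User::import_file]
```

With that in place, whatever path the Script Task drops into User::import_file is the file the source opens at run time.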
We are almost done. The one piece missing from this is HOW to process the files. All we have done here is get a list, add some ancillary data, and put it in a table. The HOW is a little easier now. Instead of having to put all of your logic in a For Loop task, you can replicate that logic in, say, X Data Flow tasks and add a select statement against the queue table as your source. You can now scale your package without having to break up files in weird ways to split your logic between For Loops. You can also add additional directories and file types without needing a Data Flow task for each. You have decoupled the processing from the collecting.
There you have it, yet another way to process flat files. This method can also work if you want to read the files as a source vs. importing them as I do. It would just take some doing to get all the variables connected.
I want to thank you all for attending this series on Advanced SSIS. I believe today will be our last day. I hope you've all learned something that you can take away that will help you solve current or future problems.