

Efficient way to read in filenames
Many times people will use the ForEach Loop Container component to import filenames into variables and then process them one at a time. This approach is very limited. I needed a way to import thousands of filenames, and doing this in a 'single threaded' manner was not an option. To alleviate this, I use a Script component to import the filenames into a table. I can then use this table as a source in a Data Flow task, which allows for parallel processing. I will show you the script I use, the resultant table, and how I use the table in parallel Data Flow tasks.
The script below reads through a series of directories (2 at this point) and creates a text file. I will not go into detail on the script itself at this time (if someone has questions, you can either post them as comments or email me and I will be happy to go into more depth).
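For readers who want to follow along without the screenshot, here is a minimal stand-alone sketch of what such a script might look like. The directory paths, the output file location, and the marker values are placeholders I've made up, and the original 2005 package would have used a VB.NET Script Task rather than this C# console version.

```csharp
using System.IO;

class FileListBuilder
{
    static void Main()
    {
        // Placeholder paths for the two directories being scanned.
        string[] directories = { @"C:\Import\DirA", @"C:\Import\DirB" };

        // Write one pipe-delimited line per file found.
        using (StreamWriter writer = new StreamWriter(@"C:\Import\FileList.txt"))
        {
            foreach (string dir in directories)
            {
                foreach (string fullPath in Directory.GetFiles(dir))
                {
                    string fileName = Path.GetFileName(fullPath);

                    // Four fields: full path, filename alone, and two markers
                    // used by the package (marker values here are placeholders).
                    writer.WriteLine("{0}|{1}|{2}|{3}", fullPath, fileName, "MARKER1", "MARKER2");
                }
            }
        }
    }
}
```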
The text file created is shown in the image below. Using a script I can also embed additional information, which I use later. There are four pieces of information in this text file, separated by a '|'. First is the full path, including the filename (used by the Import Column component; see Simple Series : SSIS Import Column). Second is the filename alone. The last two are markers used by the package.
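As a concrete (made-up) example, one line of the file could look like this, with the last two values standing in for the package markers:

```
C:\Import\DirA\orders_20090114.csv|orders_20090114.csv|MARKER1|MARKER2
```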
Now that we have a text file, we need to parse it and import the information we need into a table for later processing. I've put some comments into the image itself to help explain, and a conceptual sketch of the parse follows below. One thing to note, as stated at the bottom: the two OLE DB Destination components need to point to the same table, so make sure you DO NOT table lock and you set a limit on the Maximum insert commit size, or you will bottleneck on one side of the insert. You are now ready to pull in your files, in parallel!
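In the package the parsing is handled by a Flat File Source, but conceptually it is just a split on '|'. A minimal stand-alone C# sketch, again with a placeholder path:

```csharp
using System;
using System.IO;

class FileListParser
{
    static void Main()
    {
        // Read the pipe-delimited list produced by the script above (placeholder path).
        foreach (string line in File.ReadAllLines(@"C:\Import\FileList.txt"))
        {
            string[] fields = line.Split('|');

            // fields[0] = full path (used by the Import Column component)
            // fields[1] = filename alone
            // fields[2], fields[3] = markers used by the package
            Console.WriteLine("{0} -> {1}", fields[1], fields[0]);
        }
    }
}
```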
I have 8 parallel streams, but they are combined into two Data Flow tasks. I found that if I created 8 individual Data Flow tasks, memory usage was inefficient (max memory/max buffer size). I could control the memory usage more easily by combining, say, 4 parallel streams in a single Data Flow task.
Now we can get into the guts of the parallel read. The whole purpose of this technique is to read in thousands of files in a short period of time. I use a little trick to make sure that the 8 individual streams do not step on each other: the infamous % (mod) operator. Everyone remember this? Since I am using 8 parallel streams, I use x % 8. If I had 2 parallel streams I would use x % 2; 10 parallel streams would be x % 10. x will cycle between 0 and <number of streams - 1>.
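Here is a small illustration of that cycling, with hypothetical names (in the package itself each of the 8 sources filters on the equivalent of x % 8 = n for its own value of n, so every file is picked up by exactly one stream):

```csharp
using System;

class ModuloPartition
{
    static void Main()
    {
        int parallelStreams = 8;               // x % 8 for 8 streams, x % 2 for 2, etc.

        for (int x = 0; x < 20; x++)           // x stands in for a row number / file id
        {
            int stream = x % parallelStreams;  // cycles 0, 1, ..., 7, 0, 1, ...
            Console.WriteLine("file {0} -> stream {1}", x, stream);
        }
    }
}
```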
There you have it: 8 parallel streams all extracting the filename information, importing it into my data stream, and ultimately pushing it into another table for storage and extraction. Remember to select Table or view - fast load, DO NOT check Table lock, and put a value greater than 0 (and less than the maximum) in the Maximum insert commit size so you don't bottleneck.
Reader Comments (2)
Is this tested in SSIS 2005 or 2008?
Given the execution tree changes and parallel execution, I'm imagining this really being beneficial in SSIS 2008. Would you agree with that assessment, Josef? In that case, tablock may actually be set for minimal logging and retain the speed?
Thanks for the good article. Always nice to read your blog first thing in the morning :-)
The example I posted was taken from a package written in SSIS 2005. It seems that fast load in 2008 has tablock, check constraint, and rowsize=1000 by default. I don't know if there is a benefit to tablock if I have a rowsize of 1000, but I will look to test it out.
I'm glad you find my blog informative, that's what I strive for :)