I've been a bit busy for the last few weeks, with the holidays, PASS, working on SSIS (more to come), doing interviews, and giving presentations. Here is my interview with Sam Brace at SSWUG!
This is my first time here at PASS and I got swept away in all the excitement. I tweeted about yesterday's keynote but forgot to blog about it :) Today I will combine both keynote speeches into a single compressed version. :)
Day One -
The focus of today's keynote was on community. This is the largest PASS conference to date, with over 4k registered attendees. Last year PASS provided over 430k hours of technical training, and while that is a fantastic amount, they want to push it to 1MM. There is a lofty goal of 250k members, and we're part way there at 80k, an increase of just over 30% over last year's 60k. As if that were not lofty enough, PASS is looking to expand to 5 global regions!
You might have heard about the MVP SQL Server Deep Dives Vol. 2 book. What's so special about this book? All royalties go to charity. THAT is the SQL community. That is the community here at PASS. If you were here, you could have added some star power by getting it signed! Think about that. Dozens of MVPs in the room! Nowhere else; nowhere.
While this is my first time at PASS, it feels like a familiar experience. It is not only seeing familiar faces; PASS has done a good job of helping first-timers not just extract information from the conference, but immerse themselves in it. You should leave PASS with knowledge and a sense of community, which has an immeasurable ROI.
On a more technical note, they did announce some new information.
Microsoft is not just providing software; it understands that data grows and 'spreads'. To help you accommodate this organic 'life' of your data, there are tools across all areas: Cloud, BI, and Analytics.
Day Two -
Bill Graziano, in a kilt! He mentioned something about buckling under the immense peer pressure :)
Quentin Clark's Top 12 of SQL Server 2012:
1. The (9)'s
- SQL Server AlwaysOn
- HA for StreamInsight
- Integration Services as a Server
- ColumnStore Index -- Vertipaq engine now integrated into SQL Server.
3. Rapid Data Exploration
- Power View + PowerPivot
- Administration from SharePoint
- Alert Reporting
4. Managed Self-Service BI
5. Credible, Consistent Data
- BI Semantic Model
- Data Quality Services
- Master Data Services
6. Organizational Compliance
- Expanded Audit - User-defined, Filtering
- User-defined Server Roles
You can now pull out the specific information that you need to be aware of within your audits. This can be matched to the industry you are in, such as HIPAA.
7. Peace of Mind
- Production-simulated App Testing
- System Center Advisor & Management Packs -- Cloud based
- Expanded Support - Premier Mission Critical
8. Scalable Data Warehousing
- Optimized SQL Server Appliances
- HW + SW + Support - Just Add Power
- Choice of Hardware
The Parallel Data Warehouse is available from both HP and Dell. You can service 700TB of data across 480 nodes. Plug them in, turn them on, and... that's it! These are as close to plug-and-play as you are going to get. There is also a private cloud offering known as the Database Consolidation Appliance.
9. Fast Time to Solution
10. Extend Any Data, Anywhere
- New drivers for PHP, Java & Hadoop
- ODBC drivers for Linux and Change Data Capture for SSIS and Oracle
- Beyond Relational: FileTable, 2D spatial, Semantic Search.
I think the Semantic Search will be incredibly useful (maybe that's because I come from the publishing world).
11. Optimized Productivity
- SQL Server Data Tools (Juneau)
- Unified Across Database & BI
- Deployment & Targeting Freedom
12. Scale on Demand
- Deployment Across Public & Private
- Elastic Scale
Welcome to the third and last day of the Named Pipes seminar. Today we will get into the consumer portion of the framework. We will focus on getting a piece of information from the producer to the consumer and then moving some of that information into the Data Flow.
A consumer is built as a self-contained entity. The theory is that you build it not only to interact with the producer, but to work on the single unit of work passed to it. Here is the basic layout of a consumer. All you need to do is encapsulate all of your work in a For Loop, which controls whether you have work to do. To keep things neat, I wrap everything inside a Sequence Container.
We are going to use the For Loop like a while loop. So long as there is work to be done, the loop will be valid. Here is how we do that:
The variable workflow_1 is set to 1 when created. This validates the loop, and as you will see, the Script Component has control over it. Yes, I said Script Component. Many might be tempted to use a Script Task outside of the Data Flow component, and in previous incarnations I did just that, but I realized it was wasteful: I had to create a variable for each piece of information I wanted to retrieve from the pipe, for each of these consumers. In my original InvoiceArchive system I had 8 'consumers' (they are really engines and do not use the producer/consumer model as I have it here) with 7 variables each. I wanted to slim this down, so I have the Data Flow itself pull the values (through a Script Component). There is a downside to this, which I will get to later.
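Outside of SSIS, the For-Loop-as-while pattern looks like this. The sketch below is a minimal Python analog (the function and variable names are mine, not part of the package): the loop keeps running while workflow_1 is 1, and the loop body itself, like the Script Component, decides when to clear the flag.

```python
def run_consumer(get_work):
    """Analog of the SSIS For Loop used as a while loop.

    workflow_1 plays the role of the package variable: the loop
    condition (@workflow_1 == 1) is re-evaluated on every pass, and
    the body decides when to clear the flag and end the loop.
    """
    workflow_1 = 1          # set to 1 at creation, which validates the loop
    processed = []
    while workflow_1 == 1:
        item, workflow_1 = get_work()   # the 'Script Component' may clear the flag
        if item is not None:
            processed.append(item)      # the unit of work for this pass
    return processed
```

The key design point is that nothing outside the loop ever stops it; the worker that fetches the data is the one that knows when the data has run out.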
Here is a peek inside the simplistic DataFlow object.
Now we move into the Work 1 Source Script Component - Source. Once you've set up your script to use .NET 3.5 and imported the System.IO.Pipes and System.IO namespaces, we are off to the races.
You need to create an output column so you can pass the filename down the data path, and remember to include the workflow_1 variable as read/write so you can interact with it.
What is done here is very similar to what we do for the producer. We create a pipe object, this time a NamedPipeClientStream connecting to the "testpipe" pipe on the local server ("."). We then wait up to 600 ms. Remember how I said in yesterday's class that we need a way to let the consumers know when there is no more data; this is part of that mechanism. After 600 ms a System.TimeoutException is thrown, which is a good thing. As you will see, this gives the component the ability to check whether we were simply waiting too long or there really is no work to be done. If there is no time-out or error, I read a single line of information, which I know to be the full path of the file, and pass it down the flow as the FileName column. If we do have an error, I do some additional checking to see if the consumer should end or try again for more data.
Remember, if it takes longer than 600 ms to connect to the producer, a System.TimeoutException is thrown. The out_of_band variable is how the producer tells the consumers there is no more work. Only the producer can modify that variable, which starts out as 1. If there is any other error, I assume I have no work to do and set my variable to 0, which will fail the For Loop on the next pass.
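That decision tree -- time out, check out_of_band, retry or quit -- can be sketched in Python. This is an analog under my own names, with a queue standing in for the pipe: queue.Empty plays the role of System.TimeoutException, and out_of_band is passed in as a callable that reads the shared variable.

```python
import queue

TIMEOUT_S = 0.6  # analog of the 600 ms connection timeout

def try_get_work(pipe, out_of_band):
    """One pass of the consumer loop; returns (filename, workflow_1)."""
    try:
        # Waiting on the queue stands in for Connect(600) + ReadLine on
        # the NamedPipeClientStream; queue.Empty is our 'TimeoutException'.
        return pipe.get(timeout=TIMEOUT_S), 1
    except queue.Empty:
        # Timed out. If the producer has flagged 'no more work' (0),
        # clear workflow_1 to end the For Loop; otherwise leave it at 1
        # and try again on the next pass.
        return None, (1 if out_of_band() == 1 else 0)
    except Exception:
        # Any other error: assume there is no work and end the loop.
        return None, 0
```

Note that a time-out alone never ends the loop; only a time-out combined with the producer's flag does.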
There is one GOTCHA to be aware of: the Data Flow Task will run even if there is no data. Remember, it is valid for the consumer to be unable to connect to the producer. In the simple example that I have, the Import Column will pass through without issue. Make sure any other custom components or other objects within the workflow can handle that.
Yesterday, I set up the premise for using named pipes in SSIS. Today we are going to get into the actual coding required to get this system up and running. In a bit I will go over how to modify your Script objects to enable access to the libraries needed for named pipes. Both your producer and consumer scripts need the changes below. We will start with the producer Script Task in my example.
Named pipes are available as of .NET 3.5. The objects reside in the System.IO.Pipes namespace. When you first create a Script object (task or component), it defaults to .NET 2.0. To fix this you need to get to the script properties, and I don't think it's very intuitive. When you edit your Script object, open the properties window, right-click on the unusually named entry for your script, and select Properties (shown here):
This will open a larger pane. From this window you want to select the appropriate .NET Framework version, which is 3.5.
After you change the framework, a box will pop up telling you the script must be closed and re-opened; answer yes. Now you are ready to incorporate named pipes. Open the Script Task that will be your source and import two namespaces: System.IO and System.IO.Pipes.
We need to import System.IO so we can read from our directory list. This information will be pushed down the pipes for the consumers to work on. Now let's dissect a producer.
Essentially we are taking lines from within a text file and passing them literally 'down the pipe' to the consumers. The pipe is easy to set up; in this case we want a NamedPipeServerStream, not a NamedPipeClientStream. Client streams will be used by our consumers. I name the pipe so it can be found by the clients, and I make sure the direction is out only. There is no reason for the clients to talk to the source, in this case anyway.
Now that the pipe has been created you need a way to send the information down the pipe and to do this, you create a StreamWriter whose destination is the pipe. Now that you have the plumbing (pardon the pun) all set up, you enter a loop that says 'So long as I have a line in my source file wait for a client to connect to me'. Let's take a closer look at that loop.
The while loop will loop once for each line in the text file; nothing unusual about that. The next thing that happens is the Script object will listen on that pipe for a process to make a connection request. If no connection is ever made, this package will end only if it is killed or an error is thrown. Once a connection is established, the next line in the code is executed, which is the WriteLine. This reads a line from the text file via the ReadLine method and pushes it into the StreamWriter, which is hooked up to the pipe. We then flush the stream, because we are only sending a single line of information and don't want any of it to get tossed out when we disconnect. Why do we disconnect? I need to release the pipe and connect to the next consumer that is waiting for data. If I were to keep the connection open, I wouldn't be able to service any other consumers. We aren't done just yet. We have some housekeeping to do, either in the event of an error or when we have no more work to do.
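Stripped of the .NET plumbing, the producer loop is very small. Here is a Python analog (my own names; a queue.Queue(maxsize=1) loosely stands in for the pipe, since put() blocks until a consumer has taken the previous line, much as WaitForConnection blocks until a client attaches):

```python
import queue

def run_producer(lines, pipe):
    """Hand out one line per 'connection'.

    pipe is a queue.Queue(maxsize=1): each put() blocks until the slot
    is free, i.e. until a consumer has collected the previous line --
    loosely mirroring WaitForConnection / WriteLine / Flush / Disconnect
    on the NamedPipeServerStream.
    """
    for line in lines:      # while reader.ReadLine() still has data
        pipe.put(line)      # wait for a consumer, hand over exactly one line
```

The producer never processes anything itself; its whole job is to block until someone is ready, hand over one unit of work, and move on.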
Although the producer will wait indefinitely for someone to take work from it, consumers will not wait indefinitely for work to be sent. The reason for this: if I had one unit of work left but two consumers, one of them would get the work, the producer would gracefully exit, but the last consumer would be waiting on a pipe forever. I have a neat technique to avoid this situation.
In all of that code the important bit is the out_of_band variable. It defaults to a 1, meaning there is work to be done. When the producer exits, it sets this to a 0 letting all of the consumers know there is no more work to be had. We will get into how the consumers use this and how we prevent infinite loops in this framework. Tomorrow -- Consumers.
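The whole handshake -- the producer clearing out_of_band in its housekeeping, the consumers checking it after a time-out -- can be sketched end to end. This is a Python analog with my own names; a shared dict stands in for the package variable and a queue for the pipe.

```python
import queue
import threading

def produce(lines, pipe, flags):
    try:
        for line in lines:
            pipe.put(line)              # hand out one unit of work at a time
    finally:
        # Housekeeping: whether we finished cleanly or hit an error,
        # tell every consumer there is no more work to be had.
        flags["out_of_band"] = 0

def consume(pipe, flags, results):
    workflow_1 = 1
    while workflow_1 == 1:
        try:
            results.append(pipe.get(timeout=0.1))   # 'connect' with a time-out
        except queue.Empty:
            if flags["out_of_band"] == 0:
                workflow_1 = 0          # producer is done: exit gracefully
```

With one unit of work and two consumers -- the exact situation described above -- one consumer gets the work, and the other still exits once the flag is cleared instead of waiting on the pipe forever.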
There comes a time when you look at how something is being done and say, "Does it have to be that way?". SSIS has a few stock Data Flow Sources and they meet 90% of our needs. There is a situation where I have a single source of data, but I would like to have parallel paths of processing. This would be akin to the producer/consumer model. I would like to have a single producer and multiple consumers. The wrinkle is this data source has no way of managing itself, like a directory listing.
One of the most common problems I need to solve with SSIS is importing files from one or more directories. There are potentially thousands of files across dozens of directories, and sometimes a single-threaded process to pull them in takes longer than you'd like, or than you have. A very common technique, one that I use quite often, is to run through the directories and push each full path into a holding table. I can then use multiple SSIS packages, or a single package with multiple Data Flow components, to get the file information and import the file.
By placing the file location information into a SQL Server table I have pushed the responsibility of managing access of the source information onto SQL Server. I use this technique in my Invoice Archive process. This is all well and good when you have excess capacity on your SQL Server, but what if you don't?
I often see in SSIS blogs that to help with improving parallelism in your code or sorting data, push the data into temporary SQL Server tables and let SQL Server do some of the heavy lifting. Again, if you have excess capacity, this is a simple and often powerful technique. What if you don't? You can fall back onto a producer/consumer model and to do that, you use named pipes.
Without the use of SQL Server to manage the listing of files, I need a way to know what file is being processed, what files need to be processed, and that there is no duplication of effort. To do this I will have a single Script Task that manages which files are going to be worked on. This is called a producer. It will be the source of all work. There will be only one in this example; you can have multiple producers, and I will show you how later in the series.
Now I need a process that will do the work itself. In our example, the work consists of importing the file into a SQL Server table and populating some ancillary tables based upon information about the file itself. This process is called a consumer. You can, and should, have multiple consumers running, as it takes far more time to process a unit of work than to dish it out. If you were to look at a sample SSIS package, you would see a single producer and two consumers.
Now that I have my producer and consumers, I need a way to shuttle information between them. Named pipes are a way to pass information between two processes. Think of them like an electronic form of the mail tubes used at bank teller windows. You put a 'packet' of information into the tube and it gets sent from source to destination. In our situation the source is always the 'producer', and the packet is sent one way to the 'consumer'. The producer can service multiple consumers, but can only SERVICE one consumer at a time.
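As a taste of what's coming, the mail-tube handshake can be demonstrated with Python's multiprocessing.connection module, which uses named pipes on Windows (and sockets elsewhere). This is my own cross-platform sketch, not the SSIS code from this series: the producer services one consumer at a time by accepting a connection, sending one packet, disconnecting, and repeating.

```python
import threading
from multiprocessing.connection import Client, Listener

def producer(server, lines):
    # Service one consumer at a time: wait for a connection, hand over
    # exactly one 'packet', then disconnect so the next consumer can attach.
    for line in lines:
        conn = server.accept()
        conn.send(line)
        conn.close()

def consumer(address, results):
    # Connect to the producer's 'tube' and take one packet of work.
    with Client(address) as conn:
        results.append(conn.recv())

server = Listener(("localhost", 0), backlog=2)  # port 0: pick any free port
results = []
workers = [threading.Thread(target=consumer, args=(server.address, results))
           for _ in range(2)]
p = threading.Thread(target=producer, args=(server, ["a.txt", "b.txt"]))
p.start()
for w in workers:
    w.start()
p.join()
for w in workers:
    w.join()
server.close()
print(sorted(results))  # each consumer received exactly one packet
```

Even though both consumers are waiting at once, the producer only ever talks to one of them at a time; the other simply waits its turn, which is exactly the behavior we will build in SSIS.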
That's it for today. Tomorrow I will go into the details of how to set up the producers, consumers, and the named pipes for communication.