SSIS Data Flow Scripting

In the Registrar’s Office at Washington State University, we’ve decided to convert all of our Database Jobs to SQL Server Integration Services (SSIS), a sort of drag-and-drop programming system that allows for fairly easy combining of data from multiple sources. Internally, we use it to Import and Export Course and Student data for a cooperative program we have with the University of Idaho, load data nightly from various files generated on the Enterprise Mainframe, as well as stepping through the data in the database, and checking that data against a webservice. It’s a cool tool, and about the only thing I’ve seen in Microsoft SQL Server that is truly unique and special from other database systems I’m familiar with.

Of course, not all tasks can be completed drag-and-drop. Luckily, SSIS allows you to write your own tasks which can do custom behavior, or build scripting tasks, that allow you to write VB.NET code (hopefully C# is coming in MS-SQL 2008). This has been a godsend in a few circumstances, for instance in a situation where I have to be able to generate a header and a footer for a text file output by our Grade Export task. The problem that I’ve run into, is that our system requires me to output 10 different files, and the way I’ve structured the task, which was with the goal of keeping the scripts as simple as possible, forced me to create 10 different copies of the exact same script.

Hardly ideal. The way I see it, I have three options.

  1. Edit all 10 scripts, and make sure they remain the same. Highly Error-Prone
  2. Filter in the Script instead of the Data Flow. Now the script is highly complex, making it more error prone.
  3. Find a way to standardize the Script Task. Ideal solution, but can it be done?

Unfortunately, I’ve found no way in SSIS to define a script once, and share it between multiple copies of the script, that would be differentiated by the Connection Managers and Result Sets made available to them. At least, not to do it as a simple task. There are some possibilities, but regrettably they both involve writing a DLL which must exist in the system’s Global Assembly Cache (GAC). At least within Data Flows. In the Control Flow, you can run external programs, run ActiveX controls, or use CLR-based Stored Procedures.

In December 2005, Jeffrey Juday wrote a piece a few years back developer.com about linking a DLL into your SSIS package which could then be called within a script. While this is certainly a possibility, it feels a bit clunky to me. You have to link the DLL via a variable in the SSIS package, then make sure you enable the variable within the ScriptTask. It’s quite a few steps, really, and if you’re already going through the trouble of writing a Library anyway, why not try a different route? (Note: This approach may still have uses, like if you were using that Library Object to somehow track the state, and needed it to be persistent between Scripting Tasks.)

For my problem, it looks like it will simply make more sense to implement my own Data Flow Destination, that I can write and maintain once, which will take care of all the problems of dumping the data out to the file. Luckily, this code should be very similar to the existing scripting task. However, there is a lot of maintenance necessary to create an SSIS task, that in many cases hardly seems worthwhile for a single package, particularly because it will undoubtedly end up cluttering the GAC on any Development workstations and servers it is installed upon.

Other problems exist within SSIS’ model that makes things difficult. SSIS only processes a certain batch size of records at a time, a number around 10,000 records, which makes writing robust SSIS tasks more challenging. The script requires a sequence of script-global variables nd creative use of Pre- and Post-Execution methods to handle adding the headers and the footers in order to ensure that the execution path outputs the correct data, while still maintaining a decent performance (.NET Strings aren’t really designed to have 1.5 MiB added to them in 20 B chunks, the overhead in constantly increasing the string size caused the script to execute in ~10 minutes, instead of ~20 seconds. I know there is no good way to solve someone doing something stupid with the API like I was.). There are a lot of pitfalls to SSIS scripting, like this, that only come from experience. Who would have figured that the ProcessInput task could be called more than once? I certainly didn’t until stress-testing revealed otherwise.

Connection Models tend to feel a little bit restrictive. For instance, I’m using the FlatFile Connection Manager for this project, at least in the script tasks, which is great because unlike the File Connection Manager, it doesn’t seem to care if the file exists or not, it will gladly create a new file, or overwrite the old one. The File Connection Manager will only create a new file if configured to, and if configured to Create a New File, will error out if the file already exists, there appears to be no way to force it to continue anyway. I’ve got warnings on more than one of my Data Flows in various tasks because of the way I’m filtering out records, they have to get redirected to the Error output (to ensure they don’t taint later data), which raises a warning. I have no ability to redirect that error output to a Null Data sink in order to clear that warning, and assure the system that, yes, I do know what I’m doing. I’m working on such a Data Flow Destination, just as a means to assure the system that I’ve done something on purpose, and am aware of it’s consequences.

Like many of Microsoft’s products, SSIS has some really wonderful features, however, I find myself stumbling over the interface from time to time, building new modules can be difficult particularly for my Grade Export script mentioned above (why can’t I create a template script that I can have the other tasks inherit from?), there are a lot of side-effects that are exacerbated by the lack of documentation, and the system just isn’t as fast as pure SQL. Also, I learned the hard way that if you have to write a complex SQL query to get a data set, don’t even dream of modeling that complex query as a sequence of SSIS Tasks. It’s unlikely to work right, and the tasks simply have too many restrictions to be as flexible as pure SQL.

Of course, where SSIS shines, in bringing in data from non-SQL sources and outputting it, it really shines. Gone is the need for temporary tables to serve as a translation layer between two slightly different data formats. Gone is the need to write complex queries to pull data for export. Several of our tasks are far easier to read now that the data selection and translation are accomplished in different sets. Plus, using SSIS, I can process large amounts of data off of a single call into the database. The old version of the GradeUploader has to call the Database 12 times. Now, I have a total of four database calls, because I can pull in all the data in one fell swoop (I could reduce that to three, by eliminating one of my filters and doing the filtering in SQL). Rather than having to call SQL for each Campus and Grade Type, I’m able to call once for the grade data, and simply filter it to determine which file it gets outputted to. Plus, a new version of SSIS is supposed to accompany MS SQL Server 2008, which will hopefully alleviate some of the complaints I currently have with the technology.

So, would SSIS cause me to choose SQL Server over any other database system? Absolutely not. Even if Microsoft were to fix all the problems I have with the system, it would still be nothing more than a convenience, and as a Software Engineer, it’s probably less convenient for me than for just a DBA who needs to move that data from one place to another. Plus, I’m just not convinced that Microsoft SQL Server outperforms an open solution like PostgreSQL or MySQL. PostgreSQL in particular has a feature set very similar to MSSQL: Replication, Transactions, Multiple-Language Support for Stored Procedures, etc. I haven’t run a detailed comparison myself, and the most recent one I can find online is from 1996, but it should be noted that the comparison above is hardly fair, as it doesn’t test the database under load, and at least with MySQL, Postgres is not to outperform under load, but not under intermittent connections, plus Postgres is known to be slower on Windows (though it’s the only shared platform, and thus fair comparison point). A lot of Unix software performs poorly on Windows, though.

Ultimately, any performance gains that there might be in using MSSQL Server (and I’m going to run some tests once SQL Server 2008 is available), just aren’t worth the expense for me of buying and running a Windows Server. But then, the cost and expense of running MSSQL is far below that of other DBMSes, like Oracle. MSSQL is a fine product, and SSIS has some cool features that have made some tasks faster for me, but I’m still not sure that we couldn’t have saved quite a bit of money with a different product.