Friday 10 May 2013

SSIS and fleixible table design

A lot of databases “feature” an interesting design pattern that I’m calling a repeating table design. I’ve seen this used for campaigns and similar business problems and data structures. When a new task is created a new set of tables are created for this task. This can cause problems for the ETL developer as the table the ETL packages need to target change over time and cannot be predicted at design time. Today, we’ll look a way to allow SSIS to target this structure and overcome the difficulties it imposes. Our issues are: how do we consume this structure in SSIS? How do we read from these different tables using one dataflow? How do we ensure that any new tables matching this structure are automatically supported?

Where you're head? Image by MGSPiller
First, we’ll create a set of these tables. I’m using the using the AdventureWorks2008R2 version of adventure works DB for demo purposes. Apply the script below to get started.
SELECT
      S.*
INTO
      SalesOrderHeader_Mason_Roberts
FROM
      Sales.SalesOrderHeader S
WHERE
      S.CustomerID IN (
                  SELECT
                        C.CustomerID
                  FROM
                        SALES.Customer C
                  INNER JOIN
                        Person.Person P
                  ON C.PersonID = P.BusinessEntityID
                  WHERE
                        P.FirstName = 'Mason'
                  AND
                        P.LastName = 'Roberts'
            )    

SELECT
      S.*
INTO
      SalesOrderHeader_Dalton_Perez
FROM
      Sales.SalesOrderHeader S
WHERE
      S.CustomerID IN (
                  SELECT
                        C.CustomerID
                  FROM
                        SALES.Customer C
                  INNER JOIN
                        Person.Person P
                  ON C.PersonID = P.BusinessEntityID
                  WHERE
                        P.FirstName = 'Dalton'
                  AND
                        P.LastName = 'Perez'
            )    
           

SELECT
      S.*
INTO
      SalesOrderHeader_Jennifer_Simmons
FROM
      Sales.SalesOrderHeader S
WHERE
      S.CustomerID IN (
                  SELECT
                        C.CustomerID
                  FROM
                        SALES.Customer C
                  INNER JOIN
                        Person.Person P
                  ON C.PersonID = P.BusinessEntityID
                  WHERE
                        P.FirstName = 'Jennifer'
                  AND
                        P.LastName = 'Simmons'
            )    
           
SELECT
      S.*
INTO
      SalesOrderHeader_Jason_Griffin
FROM
      Sales.SalesOrderHeader S
WHERE
      S.CustomerID IN (
                  SELECT
                        C.CustomerID
                  FROM
                        SALES.Customer C
                  INNER JOIN
                        Person.Person P
                  ON C.PersonID = P.BusinessEntityID
                  WHERE
                        P.FirstName = 'Jason'
                  AND
                        P.LastName = 'Griffin' 
            )    




This creates copies of the SalesOrderHeader table for four customers. Our issue now is, how do we consume this structure in SSIS? How do we read from these different tables using one dataflow? How do we ensure that any new tables matching this structure are automatically supported?
These questions can be answered by using catalogue views to determine the set of tables required. This set can be saved in an object that the Foreach Loop Container can traverse. A dataflow inside the container can then select the appropriate table as the input for data source.
First, we need to get the list of tables required. We can use a query like this to get the list of tables.
SELECT
      name
FROM
      sys.tables
WHERE
      name LIKE 'SalesOrderHeader[_]%'



This list needs to be saved in an object the Foreach Loop Container can iterate through. There are a few options here but the easiest way is to populate an ADO object variable. To do this, first create a user variable setting it’s data type to object. Now create an Execute SQL task. Set the task’s result set to “Full result set” and the SQLStatement to your equivalent of the query above. On the Result Set tab ensure your object variable is selected and that its Result Name is set to 0. This will populate your variable with the required result set. This result set is an ADO dataset that the Foreach Loop Container can use.
Now, moving on to the Foreach Loop Container. First create a new variable. The Foreach Loop Container will populate this with the required table name as you run through the loop. Set its default value to one of your table names. You need a the default value for SSIS to generate its deeisgn time meta-data when the data source is created later in the process.
Okay, time to add the Foreach Loop Container and set it up. On the collection tab, set the Enumerator to “Foreach ADO Enumerator”. The options available in the “Enumerator Configuration” section will change. Use the ADO object source variable option to select your variable. On the variable mapping tab select the variable you’ve created for the table name. The value of this variable will change as the container works through the item in the ADO data set.
The collection tab should look like this:

The variable mapping looks like this…

Next, add a data flow, or whatever you need, to do the actual processing. Add a data source and configure it to use the named table. As usual, there are a few options here. I’ve set the data access mode to use Table name or view name variable option and selected my variable.  SSIS will use the table name you’ve specified when the variable was created to define the query’s meta-data. Select the columns that you need and they are available as in any standard dataflow. The data source will look something like this…

I also add a derived column transform to the dataflow to capture the name of the table the data is coming from. 
Outside these, everything else will be standard SSIS. As usual, there are a few more wrinkles for anybody living in the real work but for now we’re good to go.
Last point, this is not a good database design pattern. Just because you do this doesn’t mean you should, so please don’t take this post as an invitation to use this repeating table pattern.