DBPedias

Your Database Knowledge Community

Todd McDermid

  1. Inserting Records AND Getting The Identity in SSIS - Part 2

    Since I posted a ways back on retrieving the identity value for inserts in SQL Server Integration Services, I've learned many things - one of which is an inline solution for that problem I didn't know back then.  Check out that previous post for the business problem, then come back here for an "inline" solution using SSIS - without scripts - just regular SSIS tasks and components.
    It's a Love/Hate Relationship
    The OLE DB Command component in SSIS is a near-universally hated object due to it's inherent performance problems.  But it's so hard to hate when it can actually get the job done when nobody else can.
    Yes, our old nemesis is back to solve a problem for us... at the same price it's always asked.  Do you want your packages to run fast, but be complicated?  Or do you want them to run slow, but be easy to understand?  Sadly, there is not much middle ground.
    The Keys
    It turns out to be pretty easy, and I should have got it into my thick skull before, from all the hints I'd read.  A stored procedure with an OUTPUT parameter, and the OLE DB Command are the keys.  For this example, assume we're using a table that has only two columns: ID and TEXTVALUE, with ID being the automatically generated identity column.
    The Stored Procedure
    Unfortunately, this doesn't use much of the GUI power of SSIS, it's old hand-coding.  For our simple table, this would be the stored procedure definition:

    CREATE PROCEDURE InsertData
      
    @textvalue AS CHAR(10),
      
    @id AS INT OUTPUTAS
    INSERT INTO
    IdentityTable
      
    (textvalue)
      
    VALUES
      
    (@textvalue)
    SET
    @id = SCOPE_IDENTITY()

    The SSIS Data Flow
    Use a Derived Column to generate the ID column with a NULL(DT_I4) value in it.  (We need somewhere to place our identity value in.)


    Following that, we can work on the OLE DB Command component, using a statement like this:

    EXEC InsertData ?, ? OUTPUT


    Map the columns as you'd expect to map them.
    If you'd like I've got an SSIS 2008 package showing this technique.
    The Big Picture
    Sure, it works - but here are the drawbacks:
    It's going to perform badly. Horribly if the package is run on a different server from the database the stored procedure is on.  I can't stress that enough.  The OLE DB Command component is making a round-trip call to the database for each row.  Even on the same machine, that's an expensive operation.
    You're doing an RBAR (row by agonizing row) process which is much, much slower than a set-based process that SQL Server is more suited for.  (SSIS is fine with RBAR, it's designed that way.)  Neither SQL Server process nor the Integration Services runtime will be working very hard during this part of the data flow - I don't think even one core would get utilized fully due to the wait-states involved.
    If you're interested, here's a sample package to show this process.  It creates a database on your local machine, creates the sample table, generates one row, performs the insert (and shows the result to you), then tears the database down.
    Can we mitigate this RBAR flaw?  Stick around or subscribe your RSS reader to find out in an upcoming post.
  2. Inserting Records AND Getting The Identity in SSIS - Part 2

    Since I posted a ways back on retrieving the identity value for inserts in SQL Server Integration Services, I've learned many things - one of which is an inline solution for that problem I didn't know back then.  Check out that previous post for the business problem, then come back here for an "inline" solution using SSIS - without scripts - just regular SSIS tasks and components.
    It's a Love/Hate Relationship
    The OLE DB Command component in SSIS is a near-universally hated object due to it's inherent performance problems.  But it's so hard to hate when it can actually get the job done when nobody else can.
    Yes, our old nemesis is back to solve a problem for us... at the same price it's always asked.  Do you want your packages to run fast, but be complicated?  Or do you want them to run slow, but be easy to understand?  Sadly, there is not much middle ground.
    The Keys
    It turns out to be pretty easy, and I should have got it into my thick skull before, from all the hints I'd read.  A stored procedure with an OUTPUT parameter, and the OLE DB Command are the keys.  For this example, assume we're using a table that has only two columns: ID and TEXTVALUE, with ID being the automatically generated identity column.
    The Stored Procedure
    Unfortunately, this doesn't use much of the GUI power of SSIS, it's old hand-coding.  For our simple table, this would be the stored procedure definition:

    CREATE PROCEDURE InsertData
      
    @textvalue AS CHAR(10),
      
    @id AS INT OUTPUTAS
    INSERT INTO
    IdentityTable
      
    (textvalue)
      
    VALUES
      
    (@textvalue)
    SET
    @id = SCOPE_IDENTITY()

    The SSIS Data Flow
    Use a Derived Column to generate the ID column with a NULL(DT_I4) value in it.  (We need somewhere to place our identity value in.)


    Following that, we can work on the OLE DB Command component, using a statement like this:

    EXEC InsertData ?, ? OUTPUT


    Map the columns as you'd expect to map them.
    If you'd like I've got an SSIS 2008 package showing this technique.
    The Big Picture
    Sure, it works - but here are the drawbacks:
    It's going to perform badly. Horribly if the package is run on a different server from the database the stored procedure is on.  I can't stress that enough.  The OLE DB Command component is making a round-trip call to the database for each row.  Even on the same machine, that's an expensive operation.
    You're doing an RBAR (row by agonizing row) process which is much, much slower than a set-based process that SQL Server is more suited for.  (SSIS is fine with RBAR, it's designed that way.)  Neither SQL Server process nor the Integration Services runtime will be working very hard during this part of the data flow - I don't think even one core would get utilized fully due to the wait-states involved.
    If you're interested, here's a sample package to show this process.  It creates a database on your local machine, creates the sample table, generates one row, performs the insert (and shows the result to you), then tears the database down.
    Can we mitigate this RBAR flaw?  Stick around or subscribe your RSS reader to find out in an upcoming post.
  3. PASS Summit 2011 Content - It's Your Choice

    If you're reading this post and attending the PASS Summit 2011 in Seattle, you absoutely should vote for two things with regards to the Summit: Dr. DeWitt's keynote subject, and the Community Sessions.
    Dr. DeWitt's Keynote
    I've been to the Summit for the past two years (I think... it's hazy) and one of the absolute highlights has been Dr. DeWitt's presentations.  2009 was on columnar database technology, 2010 was about query optimization.  Believe me - he does not dumb things down, and he's not from marketing.  There's a reason everyone raves about his presentations - he talks to geeks the way we want to be talked to.
    Just like last year, PASS is asking you to help select Dr. DeWitt's topic.  The theme this year seems to be parallelism technology - big data or big iron.  Would you like the doctor to explain Hadoop and distributed data processing, or describe the coming impact of the increasing number of CPU cores and I/O changes?  If you're a DBA, I can see you being interested in the Moore's Law presentation.  But since I'm closer to the DBD demographic, I've voted for the Hadoop talk.
    Best quote from Dr. DeWitt last year (I'm paraphrasing): "Guys that drop out of query optimizing because it's too hard... go build rockets for NASA."
    Community Sessions
    Apparently, the PASS Summit Program Committee is taking a day off, because PASS is asking for our help to round out the Summit schedule.  (I kid - the PC did a great job this year and deserve a day off... even though I know they're not taking one.)  From various sources, the Program Committee has selected 20 candidate sessions that are competing for five slots.  Here are my choices, based solely on what I need to learn and want to see:
    Robert Cain on Project Juneau - The team I'm with needs to understand databases better.  It's really hard to help them get there when the code they write is so very disconnected from the database.  As I understand it, Juneau is a step towards integrating code and the database better - which would give me better tools to encourage using better code and a real database.
    Wendy Pastrick on Visualizing Indexes - My understanding of index internals is woefully deficient.  I certainly think I know the basics, but I could really use more.
    Chris Webb on SSAS Security - This one is probably over my head at the moment - but I know it's inevitable that I'll be called upon to deliver it.  My org is presently just too paranoid (IMO) about spilling too many secrets to parallel lines of business.
    John Welch on Advanced DW Scenarios - Some of what I want to be able to deliver in my DW seems to require what John's talking about here... but I don't know enough to say for sure, that's why I have to go.
    Jen Stirrup on Data Visualizations - This topic is both fascinating to me and severely frustrating.  I yearn for the ability to get rid of the complex reams of tabular reporting that require significant mental effort to decipher.  I've done my best with Excel 2007, but I need to expand my toolset to accomplish transforming my business' expectations.
    Vote Now
    If you're attending the Summit, please vote for the keynote and session schedule you'd like to see.  You've only got until the 20th of July.  Please don't make it a popularity contest about the speaker - because that means I'll lose :)  Yes, I happen to be listed twice!  I have a session on complex SSIS executions and another on processing slowly changing dimensions in the running.  If straightforward SQL Agent Jobs can't solve your SSIS execution scenario, I've got some techniques to help.  If John Welch's presentation on advanced DW scenarios is a little too advanced, maybe mine fits in your ballpark a little better.
    Whatever your interests, please do pick some.  It's a shame that so many great sessions are in competition with each other - but I'll be happy to be a loser if it means you get sessions that will do you some good!
  4. Presenting Upserts to PASS AppDev VC

    Hopefully this crowd knows a little about what they're getting into!  John Jakubowski (blog|twitter) has graciously asked me to present on upsert strategies with SSIS.  What are "upserts"?  A not-so-fancy mashup of the words "update" and "insert".
    If you struggle with incremental table loading performance - updating rows that already exist, and inserting rows that didn't - then you might want to drop by.
    When?
    July 26th, 4pmGMT/12pm ET/9am PT
    Where?
    Go to the PASS AppDev website, and find the LiveMeeting link.
    What?
    The full abstract:
    One of the most sought after techniques in Integration Services is how to perform "upserts" - propagating changes from one system to another that may require new records to be inserted, or existing records to be updated. Integration Services can't accomplish this with a single click, but it does contain all the tools required to construct packages that will achieve this goal. The task of synchronizing data will be broken down into two parts: detecting changes, and updating or inserting. Refinements to use of the Lookup and alternatives to the OLE DB Command transform will be explored.
  5. What is Pipeline Backpressure?

    First - I'll get the "what's a pipeline got to do with Integration Services" question out of the way.  The "pipeline" I'm referring to here is the word commonly used to describe the flow of data in the Data Flow Task.  That term doesn't seem to show up in any of the MSDN documentation, but the Microsoft bloggers just can't seem to talk enough about it.  I suppose it's an internal term that's leaked out into general usage.
    The SQL Server Integration Services' Data Flow Task reads data from one or more sources, flows that data through this "pipeline" where some parts of it will get transformed (changed), and then sends the data to one or more destinations.  The choice of the word "pipeline" will tend to get people thinking of a liquid in a pipe - and that's an apt metaphor for the phenomenon of backpressure.  The same kind of action takes place in SSIS.
    Fluid Backpressure
    When talking about a liquid, backpressure refers to the resistance that the fluid encounters flowing through a pipe.  When the fluid encounters a constricted section of the pipe, it can't flow as fast through that section.  But since (most) fluids are incompressible and most pipes are rigid, the reduction in flow rate at this constricted point means that the flow is "backed up" all the way to the start of the pipe.  If a kink is stopping the flow of water near the end of the hose, the tap isn't able to inject any more water into the hose - there's simply no space to put it.  The rate at which you can inject water into the hose is directly related to two things: how constricted the smallest part of the hose is, and how much the hose can "expand" prior to the blockage.  (If the hose can inflate like a balloon, you can still pump water in.)
    Data Flow Backpressure
    SSIS's Data Flow Task has very similar constraints.  The source - or "tap" - wants to inject data into the pipeline as fast as it can.  But that data has to flow through the components - the "pipe" - in order to reach the destination - the "end" of the hose.  Somewhat like a garden hose, the data flow pipeline can stretch to allow the source to inject more data into the pipeline than is ultimately getting through to the end.  But there's a limit, of course, just like with a hose.  If it's stretched too much, it will burst. 
    In SSIS's case, since it's a "fictitious" hose, the nice engineers that built it were able to set limits on how much of a balloon you're allowed to turn your hose into.  It's the RAM limit of the process and/or machine you're running your package on.  In most cases, taking advantage of SSIS' "stretchy hose" is a very bad thing - you'll easily consume all the available memory on your machine, and start spilling into virtual memory... and that's slow disk!
    Backpressure In Action
    Here's an SSIS package that demonstrates backpressure quite well. 

    You can build your own quite easily from what I'll describe here, in case you don't have SSIS 2008.  The first task in it is a Data Flow Task that looks like this:
    The first component hooks up to your local instance's master database, fetching a cross-joined resultset from sysobjects.  (It's the volume of rows that matter here - not the content.)  The second component is one of my Data Flow wiretap scripts that will dump timing information to the logging stream that we'll be able to watch in the Output window.  The last component is a script destination.  Normally, I'd just use a Union All or Derived Column to cap the flow, but those result in a lot of "unused column" warnings that would clutter up the logging stream.
    As you can see below, running this flow results in something like this pretty quickly.  In my case, under 1.5 seconds, just about as quick as SQL Server can deliver the rows to SSIS.  (So fast I didn't bother to capture an "in-flight" picture.)
    The next Data Flow Task in the package shows how an "impeded" data flow will behave.  In this flow, there's one more script component in there.
    The component inserted at the third position is intended to mimic a transformation that takes effort to perform.  This is the equivalent of an uncached Lookup, OLE DB Command, Merge Join, etc...  What does this look like at runtime?  Well, no problem taking a screen capture in mid-execution on this one...
    How does this image show the effects of backpressure?  It was taken over 20 seconds after the data flow started.  Let me explain in more detail.
    Our first "free-flowing" data flow loaded the data into SSIS' pipeline in under 1.5 seconds.  The second "impeded" data flow hasn't even finished loading the 100K rows after 20 seconds... why not?  Because the "sleep" transform is slowing things down.  So why isn't the source simply reading as fast as it can, and piling up the data at the slow transform's doorstep?  Because if it did that, it would consume memory.  The developers of SSIS thought that consuming memory in this situation would be completely unnecessary if they could just tell the source to "slow down" to a manageable level... so they did.
    Any "execution tree" (there's another bit of SSIS internals jargon) will experience this backpressure - forcing every component to operate no faster than the slowest.  One of the key parts of managing the performance of your packages is understanding pipeline backpressure.  It can mislead you about what's performing slowly in your flow, causing you to spend time and effort where it will have no effect.  Tweaking the DefaultBufferSize and DefaultBufferMaxRows properties can change some of the effects of backpressure - but doing so only delays the inevitable with sufficiently large data sets, and typically offers little meaningful benefit.  The real key is understanding your data flow's behaviour through decomposition.
    I have one more example data flow for you in the package, where I add a sort component to the middle of the flow.  How will this affect the flow of data?  The sleep component will definitely cause the end-to-end flow to be slow, but will the source stream quickly like the "free-flowing" example, or will it be throttled like the "impeded" flow?  I'll leave that up to you to explain...
  6. Code Snippet for an SSIS DataFlow Performance Wiretap

    I was recently tweaking a few Data Flows, and settled on a very small, but very useful script to help record execution timing inside the flow. 
    Decomposition Is Time Consuming
    You may know that I do recommend the decomposition technique to help diagnose performance problems in SSIS data flows.  That's the technique where you start recording execution times using fewer and fewer components in the data flow in order to get a differential execution time for particular components.  Even though it's not perfect, it does account for misleading effects of the pipeline backpressure mechanism.  (What I suggest later does not.)
    Unfortunately, performing a decomposition analysis can take quite a while.  It is time consuming to replace the destination component with a row count, run the flow, record the results, and continue up the flow.
    Adding a Performance Wiretap is Easy
    Especially when the code is completely copy-and-paste!  I couldn't measure any performance impact of including this script code, and didn't expect any.  It doesn't use any columns in the data flow, and the operations it does do are very lightweight.  You don't even have to change anything in the script itself - the events that are logged include the name you give the component in the data flow (which must be unique) so you can easily identify which of several wiretaps is reporting information.

    public class ScriptMain : UserComponent
    {
      #region Statistic collectors
      private DateTime _preExecuteTime;
      private DateTime _firstRowTime;
      private DateTime _lastRowTime;
      private int _numRows;
      #endregion

      #region PreExecute
      // This method is called before any rows are read from any sources
      // We reset all statistics here
      public override void PreExecute()
      {
        base.PreExecute();
        this._firstRowTime = DateTime.MinValue;
        this._lastRowTime = DateTime.MinValue;
        this._numRows = 0;
      }
      #endregion

      #region ProcessInput
      // This is NOT a regular method override you normally see in a script component
      // This override is called once for each input buffer (set of rows) that
      // are passed into this script component. We count rows here, and record
      // first and last "row seen times".
      public override void ProcessInput(int InputID, Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer Buffer)
      {
        base.ProcessInput(InputID, Buffer);
        #region Record the time the first row was seen
        if (this._firstRowTime == DateTime.MinValue)
        {
          this._firstRowTime = DateTime.Now;
        }
        #endregion
        #region If this buffer has any rows (the last buffer typically does NOT), record the time
        if (Buffer.RowCount > 0)
        {
          this._lastRowTime = DateTime.Now;
        }
        #endregion
        #region Add the number of rows in this buffer to the row count statistic
        this._numRows += Buffer.RowCount;
        #endregion
      }
      #endregion

      #region PostExecute
      public override void PostExecute()
      {
        bool fireAgain = true;
        base.PostExecute();
        #region Output statistics to package logging, as Information events
        ComponentMetaData.FireInformation(0, "",
          ComponentMetaData.Name + " first row seen at " + this._firstRowTime.ToString("hh:mm:ss.fff"),
          "", 0, ref fireAgain);
        ComponentMetaData.FireInformation(0, "",
          ComponentMetaData.Name + " last row seen at " + this._lastRowTime.ToString("hh:mm:ss.fff"),
          "", 0, ref fireAgain);
        TimeSpan duration = (this._lastRowTime - this._firstRowTime);
        ComponentMetaData.FireInformation(0, "",
          ComponentMetaData.Name + " watched " + this._numRows.ToString("###,###,##0") + " rows pass by in " + duration.TotalMilliseconds.ToString() + "ms",
          "", 0, ref fireAgain);
        #endregion
      }
      #endregion
    }


    So what does it do?  It records the time the first row passes it, the number of rows that go by, and the time the last row passes it.
    How Can This Help Diagnose Performance?
    If you want to know which section of your Data Flow takes the longest duration - place some of these scripts at various places in the flow.  After you run your Data Flow, you'll see how long each section takes - keepin gin mind their execution times will overlap each other.  (If you use two scripts, the second script's "first row" may very well be seen before the "last row" passes through the first script.)
    If you're wondering what kind of impact a certain change has in your Data Flow - place this script before and after the affected area.  Run your flow before you make the change, and again after.  The results will show you how much impact your change had (but remember it won't account for hidden backpressure effects!)
    If you're wondering which of several "branches" of a Data Flow that are combined using a Union All is the slowest, you can use these wiretaps to instrument each one.
    Once again - please don't think this is a replacement for correct use of the decomposition technique.  Component performance in the Data Flow depends heavily on other components coming before and after the area you're interested in - and even flows occurring in parallel in the same (or other) Data Flow.
  7. Use Connections Properly in an SSIS Script Task

    There are lots of scenarios in SSIS where you'd want to use a connection within a Script Task.  Unfortunately, the closest help documentation - the comments inside the Script Task itself - give you exceptionally poor advice:
    To use the connections collection use something like the following:
    ConnectionManager cm = Dts.Connections.Add("OLEDB");
    cm.ConnectionString = "Data Source=localhost;Initial Catalog=AdventureWorks;Provider=SQLNCLI10;Integrated Security=SSPI;Auto Translate=False;";

    Why is this advice bad?  Let me count the ways:

    1. The connection string is hardcoded inside the script.
    2. The connection can't be used with Package Configurations.
    3. The connection string (and name) imply that OLEDB connections can be used.
    4. As presented, the connection won't participate in MSDTC transactions.
    (And seriously, "use something like the following"?  Hey - I write the vague blog posts here.  They're supposed to be writing documentation. ;) )
    The poor advice boils down to the fact that it appears to create a "managed" connection, but doesn't truly do so.  In fact, if you attempt to use code in the comment as is - you're only halfway there.  How do you actually use this "ConnectionManager" object anyway?  There's no "Open" method, or Command object - just weird methods and properties that would be entirely unfamiliar to you even if you know .Net data access!  And to add insult to injury, even if you did know where to go from here, the example provided will only cause an error, and will never work.
    What is a Managed Connection?
    Lets start with describing what a "managed" connection is (in the context of SQL Server Integration Services).  It is simply a connection that's defined and controlled by an entry in the "Connection Managers" window at the bottom of the design surface in BIDS.
    Managed connections have the following benefits:
    • Their properties (primarily the connection string) can be manipulated with Package Configurations.
    • Their properties (again, the connection string) can be manipulated with Property Expressions.
    • They can participate in SSIS (MSDTC) transactions.
    • The connection is tested by default when the package is loaded, before any (potentially unrecoverable) work is started.
    But perhaps most importantly, using a managed connection within a script (properly) informs someone maintaining the package that a connection to an external resource is being used.
    How is the Advice Wrong?
    The comment in the Script Task counsels you to use Dts.Connections.Add to get a reference to a ConnectionManager object.  This is the correct type of object - it's a managed connection.  Unfortunately, using that syntax, you're creating a new one, not using one of the managed connections you've already defined in the package.  Since it's an entirely new connection, you have to set up the connection string.  You could pass the connection string in to the script using an SSIS variable, or read it from a file - but these are both non-standard practices.  Another ETL developer or administrator familiar with SSIS won't anticipate your decision to implement a connection that way.
    Additionally, since it's a brand new object, none of the regular design time facilities are available to you; Package Configurations, Property Expressions, the TransactionOption property, and runtime validation don't work.
    The Right Way to Use Managed Connections
    Ignore the comments in the Script Task - use managed connections properly by following these steps.
    1. Create an ADO.Net Connection
    Add a Connection Manager to your package the same way you'd make one for use in a Data Flow or Execute SQL Task.  Right-click in the Connection Managers pane and add a new ADO.Net connection.  You can use OLEDB type connections and other connection types - but they're more difficult to use and have fewer capabilities inside a Script.
    2. Retrieve a Reference to the Connection
    Inside your Script Task, instead of using Dts.Connections.Add, retrieve a reference to an existing connection manager by using the Dts.Connections indexer.  For example, if you happen to have a connection manager named "(local).master", then use this code to retrieve a reference to the ConnectionManager object:
    ConnectionManager cm = Dts.Connections["(local).master"];
    3. For Connections with Providers, Acquire a Connection
    (A "provider" means any connection that has code supporting and managing the connection - like an ADO, OLEDB, or Excel connection, but unlike a Flat File connection.)
    Even though you now have a reference to the ConnectionManager object, you need to acquire an actual connection to the resource, and request that connection be established in a way that SSIS can manage it.  This step is necessary to ensure that your work with the connection participates in transactions and connection pooling (or lack of it).
    The specific details for each connection type vary, but the general method is consistent.
    // For an ADO.Net ConnectionManager using a SqlClient provider
    System.Data.SqlClient.SqlConnection sqlConn = (System.Data.SqlClient.SqlConnection)cm.AcquireConnection(Dts.Transaction);
    // For an ADO.Net ConnectionManager using an ODBC provider
    System.Data.Odbc.OdbcConnection odbcConn = (System.Data.Odbc.OdbcConnection)cm.AcquireConnection(Dts.Transaction);
    Be aware that it is NOT necessary (or advisable) to use the Open method on the connection object you just retrieved... it's already open.  (Check the connection's State property if you don't believe me.)
    4. Use the Connection
    At this point, you can use the connection to retrieve datasets, execute commands, or do just about anything else you'd want to do with your connection.
    5. Release the Connection Properly
    Once you've finished your work, you need to clean up after yourself - but NOT by using the Close method on the connection!  You'll need to use the ReleaseConnection method of the ConnectionManager instead.
    How Hard Was That?
    Hopefully not so hard as to make you want to go the other direction and use completely unmanaged connections in your scripts!  To recap, here's a complete code listing of how to correctly use an ADO.Net managed connection for a SqlClient provider within a Script Task:
    bool fireAgain = true;
    ConnectionManager cm;
    System.Data.SqlClient.SqlConnection sqlConn;
    System.Data.SqlClient.SqlCommand sqlComm;
    int rowsAffected;

    // Retrieve the reference to the managed connection
    cm = Dts.Connections["(local).master"];
    // Request an open connection
    sqlConn = (System.Data.SqlClient.SqlConnection)cm.AcquireConnection(Dts.Transaction);
    Dts.Events.FireInformation(0, "", "Connection is: " + sqlConn.State.ToString(), "", 0, ref fireAgain);
    // Do your work
    sqlComm = new System.Data.SqlClient.SqlCommand("UPDATE YourTable SET YourColumn = 'SomeValue'", sqlConn);
    rowsAffected = sqlComm.ExecuteNonQuery();
    // Inform SSIS you're done your work
    cm.ReleaseConnection(sqlConn);
               
    Dts.Events.FireInformation(0, "", rowsAffected.ToString() + " rows updated.", "", 0, ref fireAgain);
               
    Dts.TaskResult = (int)ScriptResults.Success;
  8. Idiot's Guide to SSAS Attribute Relationships

    Yes, that's me - and this is a reference post because frankly, this stuff is too complicated to keep in my head for how frequently I use it.
    The Problem
    

    The SSAS Idiot
    When you get the "Errors in the OLAP storage engine: The attribute key cannot be found when processing." error returned when you're trying to process an SSAS cube, what does that really mean?  I have a previous post where I went into some more technical reasons that can cause this error, but there's usually a pretty simple cause related to how you've structured your dimension in the designer... your attribute relationships. Attribute Relationships
    Specifying good attribute relationships are something that the experts say are a very basic and necessary step to configuring a cube that performs well.  Attribute relationships tell SSAS that it can take some shortcuts when processing your cube.  What kind of shortcuts?  Something like telling it that since "British Columbia", "Alberta", "Saskatchewan", and all the other provinces are always in "Canada", if you (or a user) asks for a total sales number for Canada, it doesn't have to add up all the invoice line items for Canada - it can go add up the numbers it's stored already for total sales in each province.
    So yes, it's very good to specify attribute relationships!  It can turn a million row operation into a ten row operation, which results in better query performance, reduced storage requirements, faster cube processing, and more.  It's a best practice to relate attributes that you've placed in a dimension Hierarchy - because they're going to be used in aggregations very often.
    How Are Attribute Relationships Related to the Problem? (Pun Intended)
    In the example of provinces and countries I provided, it seems like you'd never have a problem - it's straightforward and simple, right?  Not so fast.  You could have bad data... or not such a simple relationship.  We'll look at both in turn.
    What do I mean by "bad data"?  I mean that your data warehouse could have business entities that have bad address information delivered to it by your OLTP system(s).  (You're shocked, I know.)  Bad how?  Perhaps address information is incomplete - it only has the country, they didn't bother to fill in the street, city, or province.  Bad, bad, OLTP data entry validation!  But it happens... all too often.  This means that a "blank" province could be located in multiple countries.
    What do I mean by "not a simple relationship"?  Let's take another example - cities.  There's only one Vancouver, right?  The one in British Columbia, Canada.  Not so fast... Vancouver, WA would be having words about that assertion.  Other city names are much more prevalent, and easily demonstrate this particular issue.
    Deciphering SSAS's Error Message
    OK - so you understand the problem when I explain it with my example... but how do you decipher SSAS's message to figure out how your attribute relationships and your data are causing problems?  Here's the long explanation - I'll cover it again more briefly at the conclusion of the post.  My error message looks like this:
    Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: 'dbo_vDIM_CUSTOMER', Column: 'ship_to_city', Value: 'Greenland'. The attribute is 'Ship To City'.
    To genericize it, I'll replace the actual values with placeholder names that I'll use going forward:
    Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: 'DimensionTable', Column: 'AttributeColumn', Value: 'AttributeValue'. The attribute is 'AttributeName'.
    Find the "Source Attribute"
    Breaking it down, this message means that the attribute in your SSAS dimension called AttributeName, and that's where the problem is.  We'll refer to this as the Source Attribute.
    Find the "Related Attribute"
    Open BIDS open the dimension that didn't process, and switch to the Attribute Relationships tab - the one with the graph that looks like this (based on my example):
    (Now, why the "Ship To City" attribute is in a separate box from the "Dim Customer Key" attribute, but the "Ship To State Province Code" isn't is not something I can explain, so I won't.)  Looking at the list view (down below), I see this:
    The important things to notice is that the Source Attribute (in my case) was Ship To City, and it's being "rolled up" or "is contained in" the Ship To State Province Code.  In your case, you need to find the Source Attribute in your relationship graph or list, and find out what other attributes it's related to.  (There might be more than one relationship where your Source Attribute is on the left-hand-side of an arrow.  You'll have to investigate each relationship individually.)
    Find the "Source Column" in the Dimension Table
    Your Source Attribute came from somewhere in your underlying dimensional database - find out where.  You're given a hint in the error message - the AttributeColumn.
    Find the "Related Column" in the Dimension Table
    Just like the Source Attribute, the Related Attribute comes from a dimensional database table - find out where.  You'll likely have to switch to the Dimension Structure tab of the dimension editor, select the Related Attribute, open the Properties window (F4), and look at the KeyColumns property.
    Now you're ready to query your data.
    Querying Your Data
    You now have three critical pieces of information: the Source Column, Related Column, and AttributeValue.  In my case, the Source Column was "ship_to_city" in my vDIM_CUSTOMER view, the Related Column was "ship_to_state_province_code" in the same view, and the offending AttributeValue was 'Greenland'.  To find out why SSAS is having a problem, I need to run this query:
    SELECT ship_to_city, ship_to_state_province_code, COUNT(*)
    FROM vDIM_CUSTOMER
    WHERE ship_to_city = 'Greenland'
    GROUP BY ship_to_city, ship_to_state_province_code

    To genericize the query and relate it to those placeholder names we've been using, it should look like this:
    SELECT [Source Attribute], [Related Attribute], COUNT(*)
    FROM [DimensionTable]
    WHERE [Source Attribute] = AttributeValue
    GROUP BY [Source Attribute], [Related Attribute]

    When I run that, I got 2 rows back - it shows quite clearly what my problem was:
    Yep - OLTP addresses where sometimes the city was specified on a customer record, but the state or province wasn't.  In your case, you should get at LEAST two rows from your query.  If you don't - then this isn't your problem... maybe you have one of these problems instead.
    Knowing Is Half The Battle
    Fantastic, now that you know where the problem is, you can fix it.  How?  One of two ways:
    1. Decide it's a "data problem" and rectify the underlying data.  Tell a business user to fix the offending address, tell the OLTP developer to add validation to customer address entry, and/or change your ETL process to clean the data.
    2. Decide the data will never get "clean" - because nobody will "own" the dirt, or because it's a valid part of the business.  This happens very often.
    In my case, the data you see is dirty.  But the offending business data is very old, and not worth cleaning up.  If I did spend the time to clean up the blanks, I would still be unable to process my cube, because I would find "Springfield" in multiple states, etc...  So even though it is dirty, it is also a business reality that cities don't uniquely belong to state/provinces.
    The Other Half - Fixing The Attributes
    So if I can't clean the data because it is (mostly) how it's supposed to be... how can I model it in SSAS?  I recognize that certain cities are related to multiple states and provinces... but I still want to (and should) help SSAS optimize it's processing.
    I'm sure there are several ways (there always are) - but the following works for me.  If an SSAS expert would like to chime in, please do!
    Recognize That "AttributeColumn" Isn't Unique
    We have to recognize that the Source Attribute doesn't uniquely identify the Reference Attribute.  Just because I say "Vancouver" doesn't always mean I'm referring to the city in British Columbia.  I'm going to have to uniquely identify cities in order to roll them up into states and provinces, just like you'll have to uniquely identify Source Attributes in order to roll them up into Reference Attributes.
    How do we do that?
    Inside BIDS, open the Dimension and select the Dimension Structure tab.  Select the Source Attribute from the list and open the Properties window (F4). 
    Select the KeyColumn property, and click the ellipsis button (...) to open the Key Column selection dialog.
    Pick another column that together with the Source Attribute will uniquely identify a Reference Attribute.  I've chosen to select the Reference Attribute column itself.  You may choose to select another column or columns that will uniquely identify the Reference Attribute.
    Since we've just identified two columns to be the key for this attribute, SSAS can't use the default "name" for attribute members (the KeyColumn itself).  It doesn't know how to display specific members of this attribute - you've got to tell it by editing the NameColumn property - just under the KeyColumns property we just changed.  You may decide to simply choose the Source Attribute column (as I did).
    The dimension will now successfully process... unless you've got another "duplicate attribute"!
    Success! Let's Recap
    The short version of what we just did was:
    We deconstructed the error message to identify the Source Attribute
    We examined the Attribute Relationship tab of the dimension to identify the Reference Attribute
    We verified that there was a many-to-one relationship between the Source and Reference Attribute
    We made the Source Attribute map one-to-one to the Reference Attribute by including the reference column in the KeyColumns for the Source Attribute.
    What I don't know is whether this is the "best" general advice to give, or if there's a better rule of thumb for resolving this issue.  Until I do know, I'll be referring back to this step-by-step quite often, I'm sure...
  9. How To: Use the Same Data Flow Column Twice As a Parameter in SSIS

    Every once in a while, you'll have a slightly more complex UPDATE statement in an OLE DB Command or Destination.  You'll use an UPDATE statement or a destination table that needsto use data from one column in your data flow several times.  A typical example is a range update:
    UPDATE table1
      
    SET column1 = ?
      WHERE datecolumn1 <= ?
        AND datecolumn1 > ?

    Yes, best practices say that you should probably avoid using the OLE DB Command entirely, but if you aren't updating many records and you don't mind the poor interface, it does get the job done.
    What You Can't Do
    The OLE DB Command interface asks for the statement (like above) that has question marks as placeholders for the values you're going to pass in.  The OLE DB and ADO.Net Destinations allow you to identify tables, or craft a SELECT statement to identify a table.  They also have a interface for associating those numbered parameters or destination columns to the columns you have available in the data flow.  One of the many faults of this interface is that it doesn't permit two parameters/destination columns to be associated with one column in the data flow - you just can't do it.

    Another Splice Won't Hurt...
    What You Can Do So if the interface only allows 1:1 associations of parameters in your statement to columns in the data flow... then you just need to duplicate a column in the data flow.  Add a Copy Column transform, or a Derived Column transform to the design surface just prior to the OLE DB Command or Destination transform to create a second column that contains the same data.  Now you've got what you need to configure your OLE DB Command or Destination properly.
  10. Runtime Interaction with SSIS Packages - Confirming Actions

    In developing my data warehouse ETL processes, there are some packages and SQL Agent jobs that are both necessary to the process of deploying a new version of the ETL system - and yet are extremely dangerous.  These packages "reset" certain control infomation to a known state, which is a destructive operation I want to make sure I think about twice before executing - especially in a QA or production environment.  If I executed this kind of package in development on my local machine, I could use Script Tasks to pop up MessageBoxes to ask for this kind of confirmation.

    Are You Sure?

    Of course, you can't do that when you're executing packages on a server!  There are also use cases beyond initial data loads.  You may allow your SSIS packages to be callable by a wide range of people or systems... but you may want some oversight or confirmation before those packages are actually run.  Here's one way to balance convenience of access and execution with some protection against unintended execution.
    The Design
    My basic thought process in finding an alternative to "confirmation" message boxes was:
    • I can't have a popup... how else can the server talk to me?
    • Ah - it can send me an email.  Bonus - it doesn't matter who kicks off the job, I can hardcode an email address in there so I get the message, or I can look up an email in a "responsibility" table so the email gets sent to a designated "process owner."
    • Now how do I talk back to the server?  I can't reply to the email - it doesn't have a mailbox.
    • Ah - it can watch a share for a file to appear that contains a confirmation to continue.
    • So how do I make sure that I don't confuse the package, inadvertently acknowledging confirmation for "request #2" when I intended to confirm "request #1"?  You know how that happens every once in a while - you're editing a document, and some other program "pops up" to ask you a question... but you don't stop typing in time, and your keystrokes end up selecting something and dismissing the dialog!
    • Ah - I can require the file contents to be something... special... and unique to that package execution.  Some kind of code that would be unlikely to mistype unless I spent some serious brain power on it.  Like a GUID.
    • But I don't want to have to navigate to the share, create a new text file, and hand-transcribe a GUID.  I'll probably type it in wrong and have to restart the process again.  (Faulty humans!)
    • Ah - I can attach the file to the email, so all I have to do is save it in the required location.
    Yeah - OK already.  So it takes me a while to figure this stuff out...
    The Implementation
    I'd put up a sample package - but I'm too lazy busy to strip out the sensitive and custom stuff from one of my packages, so I'll just describe it.
    Task 1: Add a Share
    Find a location on your network and create a share that you give permission to both you (as the package execution confirmer) as well as the account you're going to be executing the SSIS package under.
    Task 2: Script - Generate the Confirmation File Contents
    Add an SSIS variable called "Confirmation File Name" that contains the UNC path (not driver letter path) to where you want the confirmation file placed.
    Pass that variable into the Script Task, along with the MachineName and ExecutionInstanceGUID system variables.  I've chosen the ExecutionInstanceGUID as my "confirmation code" - you can choose something else entirely: a passphrase, or another GUID generated by Guid.NewGuid.  (It's more secure if you generate another GUID rather than use the execution instance - someone could guess your implementation and subvert the email step if they have access to the sysssislog table.)
    Paste in code somewhat like this:

    System.IO.StreamWriter confirmationFile
     
        (string)Dts.Variables["User::ConfirmationFilePathname"].Value);
    confirmationFile.WriteLine(
      Dts.Variables["System::ExecutionInstanceGUID"].Value);
    confirmationFile.Close();
    Dts.TaskResult = (int)ScriptResults.Success;
    = System.IO.File.CreateText(


    As you'll see in a bit, I've built the file in the exact spot that I intend to look for it... we'll take care of that.
    Task 3: Send the Confirmation
    Slap a Send Mail Task down, set it up to send you an email, and attach the file using the ConfirmationFilePathname variable.
    Task 4: Delete the Confirmation File
    As you pointed out earlier, if I "waited" for you to place that file in the location we want, it wouldn't work - because the file is already there!  So I need to delete it, making sure that I've got a clean spot for you to place your confirmation.
    (You could have created the file in a separate location so this isn't an issue - but deleting it was simpler to set up.)
    Task 5: Wait... and Watch
    You can use a Script to do this entirely - I happen to have used my File Properties Task instead.  It's up to you, but in this step, you'll be waiting for the file to reappear.  How long you wait, and how often you check is also up to you - but I suggest you don't wait indefinitely.  Have your package fail if a certain timeout is reached.  If you use a Script to watch for the file, you'll be using System.IO.File.Exists and System.Threading.Thread.Sleep to do that.
    Task 6: Read the File
    If you receive your email, decide to confirm the action, and save it to the right share, your package will arrive at this step.  It's now got to open the file and check the contents with another Script Task to make sure you're responding to the proper request.  Here's one with an appropriate level of error handling:

    bool fireAgain = true;
    string confirmationFileName =
      (string)Dts.Variables["User::ConfirmationFilePathname"].Value;
    Dts.TaskResult = (int)ScriptResults.Failure;string contents = "";
    try
    {
      System.IO.StreamReader confirmationFile;

      confirmationFile = System.IO.File.OpenText(confirmationFileName);
      
    Dts.Events.FireInformation(0, "",

        "Confirmation file opened.",
        "", 0, ref fireAgain);
      
    contents = confirmationFile.ReadLine();
      
    Dts.Events.FireInformation(0, "",

        "Confirmation file contents read.",
        "", 0, ref fireAgain);
      
    confirmationFile.Close();

    }
    catch (Exception ex)
    {
      Dts.Events.FireError
    (0, "",

        "Unable to open '" + confirmationFileName + "' to "
        
    + "check contents: " + ex.Message,

        "", 0);
    }

    if (contents == (string)Dts.Variables["System::ExecutionInstanceGUID"].Value)

    {
      Dts.Events.FireInformation
    (0, "",

        "Confirmation file contents validated.",
        "", 0, ref fireAgain);
      
    Dts.TaskResult = (int)ScriptResults.Success;

    }
    else
    {
      Dts.Events.FireError
    (0, "",

        "Confirmation file contents failed to "
        + "match expected code.", "", 0);
    }

    If the confirmation file contains the right code, it will return success, otherwise it will fail with an appropriate message.
    Task 7+: Do Your Dirty Work
    Your package just got the green light to trash the place (or do whatever critical action was requested)!
  1. « First
  2. ‹ Previous
  3. 2
  4. Next ›
  5. Last »