There are two basic techniques for updating one table with information from another. In this example, we are updating an OLAP dimension table with the results from an OLTP table.
 
Flush and Fill Technique
The first and easiest method is called the “flush and fill” technique. Basically, you remove all of the data from the dimension table and refill it with data from the OLTP table. This is the simplest method by far, but it can be extremely inefficient for tables with a lot of rows. The only thing you need to decide here is whether to use the DELETE command or the TRUNCATE command.
 
DELETE COMMAND
Did you know that if you run the following command…
 
DELETE FROM [dbo].[DimDeliverable]
 
…and the DimDeliverable table contains 5,000 rows, that the delete will be executed 5,000 times, once for each row, because rows are deleted and logged one by one? Also, if you use the DELETE command, your IDENTITY key does not automatically reseed itself. You would also need to explicitly reseed the identity field…

DBCC CHECKIDENT ('DimDeliverable', RESEED)

…or use this statement to reseed the identity field to start at a specific value.

DBCC CHECKIDENT ('DimDeliverable', RESEED, 1)
 
TRUNCATE COMMAND
An alternative to the DELETE command is the TRUNCATE command. The nice thing about the TRUNCATE command is that it is executed as a single operation, so it is much more efficient, and it also automatically reseeds your identity key. How does it work? The TRUNCATE command deallocates the data pages that internally store the data in SQL Server. One thing to note, however, is that while the TRUNCATE command is the quickest way to clear out a SQL table, it cannot be used on a table that is referenced by a FOREIGN KEY constraint. You would first need to remove the referencing constraints, as shown in the following example (the fact table and constraint names here are illustrative):

-- temporarily remove the referencing constraint
ALTER TABLE [dbo].[FactProject] DROP CONSTRAINT [FK_FactProject_DimDeliverable]

-- truncate table
TRUNCATE TABLE [dbo].[DimDeliverable]

-- re-add constraint
ALTER TABLE [dbo].[FactProject] ADD CONSTRAINT [FK_FactProject_DimDeliverable] FOREIGN KEY ([DeliverableKey]) REFERENCES [dbo].[DimDeliverable] ([DeliverableKey])
 
Incremental Loading Technique
This technique is much more efficient than the “flush and fill” technique in that it only updates the items that have changed. While it is more efficient, it is definitely not simpler, so you will probably still use “flush and fill” on your smaller dimension tables. SQL Server 2008 introduced a new command, MERGE, to support the “incremental loading” technique.
Type 1:  Updates are overwritten and history of changes is not kept. 
This statement does the following: 1) if a record does not exist, create it; 2) if a record exists but is different, update it; 3) if a record exists in the target but not in the source, then deactivate it.
 
USE CatapultDatawarehouse
GO
-- TYPE 1 (overwrites history): DimDeliverable from CATINDB06.CAT_DB.project_deliverable
BEGIN TRY -- perform in a transaction so that if it fails the changes are rolled back
      BEGIN TRAN
      MERGE DimDeliverable AS [Target] -- begin MERGE statement (MERGE statements end with a semicolon)
      USING CATINDB06.CAT_DB.project_deliverable AS [Source]
      ON [Target].[DeliverableId] = [Source].[deliverable_id]
      WHEN MATCHED AND -- record exists but values are different
      (
            [Target].[Name] <> [Source].[deliverable_name]
            OR [Target].[EstimatedHours] <> [Source].[est_hours]
            OR [Target].[EstimatedRate] <> [Source].[est_billing_rate]
            OR [Target].[EstimatedRevenue] <> [Source].[total_amount]
      )
      THEN UPDATE SET -- update records (Type 1 means record values are overwritten)
            [Target].[Name] = [Source].[deliverable_name]
            , [Target].[EstimatedHours] = [Source].[est_hours]
            , [Target].[EstimatedRate] = [Source].[est_billing_rate]
            , [Target].[EstimatedRevenue] = [Source].[total_amount]
      WHEN NOT MATCHED BY TARGET -- record does not exist
      THEN INSERT -- insert record
      (
            [DeliverableId]
            , [Name]
            , [EstimatedHours]
            , [EstimatedRate]
            , [EstimatedRevenue]
      )
      VALUES
      (
            [Source].[deliverable_id]
            , [Source].[deliverable_name]
            , [Source].[est_hours]
            , [Source].[est_billing_rate]
            , [Source].[total_amount]
      )
      WHEN NOT MATCHED BY SOURCE -- record exists in target but not source
      THEN DELETE -- delete from target
      OUTPUT $action, inserted.*, deleted.*; -- output results
      COMMIT TRAN
END TRY
BEGIN CATCH
      ROLLBACK TRAN
END CATCH
GO
 
Type 2:  Slowly Changing Dimensions track historical changes. 
In this case, the MERGE statement simply checks whether a row exists but contains changes. If so, the MERGE statement deactivates that row while keeping its data intact for historical purposes. A new row is then created and marked as the current (most up-to-date) version by an INSERT statement that is wrapped around the MERGE.
 
-- TYPE 2 (tracks history): DimDeliverable from CATINDB06.CAT_DB.project_deliverable
BEGIN TRY -- perform in a transaction so that if it fails the changes are rolled back
      BEGIN TRAN
      INSERT INTO DimDeliverable
      (
            [DeliverableId]
            , [Name]
            , [EstimatedHours]
            , [EstimatedRate]
            , [EstimatedRevenue]
            , [IsCurrentRow]
      )
      SELECT
            [deliverable_id]
            , [deliverable_name]
            , [est_hours]
            , [est_billing_rate]
            , [total_amount]
            , 1
      FROM
      (
            MERGE DimDeliverable AS [Target] -- begin MERGE statement (MERGE statements end with a semicolon)
            USING CATINDB06.CAT_DB.project_deliverable AS [Source]
            ON [Target].[DeliverableId] = [Source].[deliverable_id] AND [Target].[IsCurrentRow] = 1
            WHEN MATCHED AND -- record exists but values are different
            (
                  [Target].[Name] <> [Source].[deliverable_name]
                  OR [Target].[EstimatedHours] <> [Source].[est_hours]
                  OR [Target].[EstimatedRate] <> [Source].[est_billing_rate]
                  OR [Target].[EstimatedRevenue] <> [Source].[total_amount]
            )
            THEN UPDATE SET -- expire the current row (Type 2 keeps the old values for history)
                  [Target].[IsCurrentRow] = 0
                  , [Target].[ValidTo] = GETDATE()
            WHEN NOT MATCHED BY TARGET -- record does not exist
            THEN INSERT -- insert record
            (
                  [DeliverableId]
                  , [Name]
                  , [EstimatedHours]
                  , [EstimatedRate]
                  , [EstimatedRevenue]
                  , [IsCurrentRow]
            )
            VALUES
            (
                  [Source].[deliverable_id]
                  , [Source].[deliverable_name]
                  , [Source].[est_hours]
                  , [Source].[est_billing_rate]
                  , [Source].[total_amount]
                  , 1
            )
            WHEN NOT MATCHED BY SOURCE -- record exists in target but not source
            THEN DELETE -- delete from target
            OUTPUT $action AS Action, [Source].* -- output results
      ) AS MergeOutput
      WHERE
            MergeOutput.Action = 'UPDATE'
      ;
      COMMIT TRAN
END TRY
BEGIN CATCH
      ROLLBACK TRAN
END CATCH
GO
 

Indexing strategies on data warehouses can be extremely light, depending on usage. Most data warehouses are used solely to populate the SSAS database and are therefore not queried directly; in that case, the index strategy should be geared toward speeding up the ETL process. In cases where the data warehouse is queried directly, you would adjust your index strategy accordingly.

 

Basic Index Strategy

 

  • Dimension tables will have a single-column clustered Primary Key index on the surrogate key.
    • For a VERY LARGE (> 100k rows) dimension table, you may want to add a few indexes on the non-key columns most often used in queries.
    • For LARGE dimensions (~100k rows), you may want to consider indexing the source system key (remember that this key will not be unique if you have any Type 2 attributes in the dimension). This is not for query time, but for the surrogate key pipeline during ETL.
  • Fact tables will have a single-column clustered Primary Key index on the surrogate key, and will also have a single-column non-clustered index on each Foreign Key to a dimension table.
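The strategy above can be sketched in T-SQL. The DimEmployee and FactSales tables, along with every column and index name below, are illustrative assumptions, not tables from this post:

```sql
-- Dimension: single-column clustered PK on the surrogate key
CREATE TABLE dbo.DimEmployee
(
    EmployeeKey  INT IDENTITY(1,1) NOT NULL,
    EmployeeId   INT NOT NULL,          -- source system (business) key
    EmployeeName NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_DimEmployee PRIMARY KEY CLUSTERED (EmployeeKey)
);

-- For a large dimension: index the source system key to speed up the
-- surrogate key lookup during ETL (non-unique if Type 2 rows exist)
CREATE NONCLUSTERED INDEX IX_DimEmployee_EmployeeId
    ON dbo.DimEmployee (EmployeeId);

-- Fact: clustered PK on the surrogate key, plus one single-column
-- non-clustered index per foreign key to a dimension
CREATE TABLE dbo.FactSales
(
    SalesKey    INT IDENTITY(1,1) NOT NULL,
    EmployeeKey INT NOT NULL,
    SalesAmount MONEY NOT NULL,
    CONSTRAINT PK_FactSales PRIMARY KEY CLUSTERED (SalesKey),
    CONSTRAINT FK_FactSales_DimEmployee FOREIGN KEY (EmployeeKey)
        REFERENCES dbo.DimEmployee (EmployeeKey)
);

CREATE NONCLUSTERED INDEX IX_FactSales_EmployeeKey
    ON dbo.FactSales (EmployeeKey);
```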

When you are defining data warehouse dimensions, you need to ask yourself how changes to the information impact your fact tables.

 

Type 1: This is a straightforward refresh. When a value changes in the dimension table, the fields are simply overwritten and no history is kept for the column. One example of a Type 1 dimension attribute is an employee name. If, for example, employee Jane Doe gets married and changes her name to Jane Smith, you may choose to simply overwrite the name. All fact tables referencing Jane Doe will suddenly reference Jane Smith, as if she had always been Jane Smith. Your client may always want the most current employee name and, therefore, would not want to track historical changes.
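A Type 1 change like this can be sketched as a single in-place update; the DimEmployee table, its columns, and the key value 42 are hypothetical:

```sql
-- Type 1 sketch: overwrite in place, no history kept
UPDATE dbo.DimEmployee
SET EmployeeName = 'Jane Smith'
WHERE EmployeeId = 42;   -- hypothetical source system key for Jane

-- Every fact row joined to this dimension record now shows 'Jane Smith'
```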

 

Type 2: This is known as a slowly changing dimension, and all changes are recorded historically. Using the example above in which Jane Doe gets married and changes her name to Jane Smith, the company may choose to track these historical changes. When this type of change occurs, our dimension table will have two records for Jane, for example: JAN 2013 – MAR 2013 = Jane Doe, MAR 2013 – current = Jane Smith. Some of your fact table records will reference the Jane Doe record and others will reference the Jane Smith record. If you have a sales fact, sales from JAN – MAR will show Jane Doe as the representative, while sales after MAR will reference Jane Smith.
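A Type 2 change can be sketched as an expire-then-insert pair; the DimEmployee table and the ValidFrom, ValidTo, and IsCurrentRow columns are hypothetical names:

```sql
-- Type 2 sketch: expire the current row, then insert the new version
UPDATE dbo.DimEmployee
SET ValidTo = '2013-03-01', IsCurrentRow = 0
WHERE EmployeeId = 42 AND IsCurrentRow = 1;

INSERT INTO dbo.DimEmployee (EmployeeId, EmployeeName, ValidFrom, ValidTo, IsCurrentRow)
VALUES (42, 'Jane Smith', '2013-03-01', NULL, 1);

-- The dimension now holds both versions, e.g.:
-- EmployeeKey | EmployeeName | ValidFrom  | ValidTo    | IsCurrentRow
--          17 | Jane Doe     | 2013-01-01 | 2013-03-01 | 0
--          18 | Jane Smith   | 2013-03-01 | NULL       | 1
```

Fact rows loaded before March keep surrogate key 17 (Jane Doe); rows loaded afterward get surrogate key 18 (Jane Smith).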

 

Type 3: This is also a slowly changing dimension, but in this case the full history is not recorded. Jane Doe will always have just a single record. However, the previous value is recorded in an “OldValue” field, along with the date on which the value was changed. For example, when Jane Doe gets married, her record is updated to Jane Smith, the OldValue field is set to “Jane Doe”, and the change date is set to MAR 2013. All sales figures will start displaying “Jane Smith”, but the previous value of “Jane Doe” is still available. You could therefore have your sales report display “Jane Smith (previously Jane Doe)” if you wanted to.
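A Type 3 change can be sketched as a single update that shifts the current value into a “previous value” column; the DimEmployee table and the OldName and NameChangedDate columns are hypothetical names:

```sql
-- Type 3 sketch: keep a single row, remembering only the previous value
UPDATE dbo.DimEmployee
SET OldName         = EmployeeName,   -- preserve 'Jane Doe'
    EmployeeName    = 'Jane Smith',
    NameChangedDate = '2013-03-01'
WHERE EmployeeId = 42;

-- A report can then show both values:
-- SELECT EmployeeName + ' (previously ' + OldName + ')' FROM dbo.DimEmployee
```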