Archives for category: ETL

Indexing strategies on data warehouses can be extremely light, depending on its usage. Most data warehouses are used solely to populate the SSAS database and, therefore, are not queried directly. In this case, the index strategy would be geared towards speeding up the ETL process. In some cases, the data warehouse may be queried directly. If that is the case, then you would adjust your index strategy accordingly.

 

Basic Index Strategy

 

  • Dimension tables will have a clustered single-column Primary Key index on the Primary Keys.
    • For a VERY LARGE (> 100k rows) dimension table, you may want to add a few indexes on the non-key columns most often used in queries.
    • For LARGE dimensions (~100k rows), may want to consider indexing the source system key (remember that this key would not be unique if you have any Type 2 attributes in the dimension). This is not for query time, but for the surrogate key pipeline during ETL.

     

  • Fact tables will have a clustered single-column Primary Key index on the Primary Keys, and will also have a single-column non-clustered index on the Foreign Keys to the dimension tables.
Advertisements

When you are defining data warehouse dimensions, you need to ask yourself how changes to the information impact your fact tables.

 

Type 1: This is straightforward refresh. When a value changes in the dimension table, the fields are simply overwritten and history is not kept for the column. One example of a Type 1 dimension would be an employee name. If, for example, employee Jane Doe gets married and changes her name to Jane Smith, you may choose to simply overwrite the name. All fact tables referencing Jane Doe will suddenly reference Jane Smith as if she was always Jane Smith. Your client may always want the most current employee name and, therefore, would not want to track historical changes.

 

Type 2: This is known as a slowly changing dimension and all changes are recorded historically. Using the example above in which Jane Doe gets married and changes her name to Jane Smith, the company may choose to want to track these historical changes. When this type of change occurs, our dimension table will have two records for Jane, for example, JAN 2013 – MAR 2013 = Jane Doe, MAR 2013 – current = Jane Smith. Some of your fact table records will reference the Jane Doe record and others will reference the Jane Smith record. If you have a sales fact, sales from JAN – MAR will show Jane Doe as the representative, while sales after MAR will reference Jane Smith.

 

Type 3: This is also a slowly changing dimension, but in this case the full history is not recorded. In fact, Jane Doe will always maintain just a single record. However, the previous value will be recorded in an “OldValue” date, along with the date in which the value was changed. For example, when Jane Doe gets married, her record will be updated to Jane Smith. The OldValue field will be set to “Jane Doe” with a change date of MAR 2013. All sales figures will start displaying “Jane Smith” but access to the previous value of “Jane Doe” is still available. Therefore, you would always have your sales report display Jane Smith (previously Jane Doe) if you wanted to do so.