Foundation: The Dimensional Model

A basic introduction to Dimensional Modeling is provided here, however this is an extensive and highly technical topic. If you are new to the concept of Dimensional Modeling and are planning to build an analytical data model, I highly recommend you supplement your education on this topic with additional resources.

The Dimensional Model

One of the most tried and true architectures for analytical data models that is simple, performant, and resilient to change is the Dimensional Model. While the MDW uses an adapted version of the dimensional model, it shares most of the same techniques and concepts for modeling data.

The Normalized Model

In order to understand dimensional modeling, you must first understand the design shared by most application databases. For applications to effectively record data input by its users, it must reduce the number of locations where data is recorded. To do this application developers use a database design known as 3rd Normal form, or a normalized model. This approach reduces data redundancies by dividing data into many separate tables. This way an update performed by the application only touches the database in a single location. The image below illustrates the portion of the normalized AdventureWorks sample database that we will be working with in this training.

AdventureWorks sells products (bicycles, components and accessories) to their customers (both stores and individual people). The image above illustrates the many tables needed to record this information in a normalized model. A sales order is entered into the SalesOrderHeader table. These orders are sold to customers by a sales person. A single order may contain many line items each with its own quantity, price etc. which are entered into the SalesOrderDetail table. Each line item refers to a specific product in the product table.

While this normalized database is ideal for software applications and data input, it presents a serious problem when trying to analyze large amounts of data across many tables. To solve this problem, developers consolidate multiple tables into fewer tables in a process known as denormalization.

Denormalization

Denormalization is the process of combining all related data from multiple tables into a single table. This is typically done by repeating lower cardinality values in the primary table. Take the Product and Product Category tables as an example. The normalized model list Products, Product categories, and Product subcategories only once in their associated tables.

Once you identify how the tables are related, you can "join" the tables together, inserting the lower cardinality values directly into the product table. Lower cardinality values are repeated many times and all the contextual information is combined into a single table known as a Dimension.

This way a reporting tool, for example, can easily identify how many products fall into the Bicycles category without having to analyze three separate tables, improving query performance. In addition, users can easily find all the information related to products in a single table, without having to browse through many different tables, improving simplicity & usability.

Facts and Dimensions

Other core concepts in dimensional modeling are Fact and Dimension tables.

Fact Tables

Fact tables, or Transaction tables, store the measurements recorded during a specific business process, such as a sale or shipment. Because these tables typically record many transactions it's best to keep them as simple as possible. Keeping ONLY the measures in this table supports fast queries across these large tables. It's because of this that fact tables are usually described as skinny (only measures) and long (many records or transactions)

Let's use the example of Adventureworks selling two helmets. The fact table would include measures such as the date of the sale, the amount of the sale, and the quantity of 2. While not technically a measure, its also necessary to include the Sales Order number in this table as well.

Foreign Keys

Finally, we need some way to record the customer, who purchased the helmets, and what helmet they purchased. If we were to place all of the customer info and product info into this table, it could take all day to analyze 100,000 transactions. Instead we record the associated customer and product in this transaction through the use of keys, specifically foreign keys. So, we would have a customer foreign key that relates to the customer who purchased the helmets and a product foreign key related to the specific helmet that was purchased.

Dimension Tables

If we were to just have a fact table we would be missing a lot of important information. So Dimension tables are created in order to provide descriptive context. In this example the Dimension tables would be Customer and Product, containing all of the contextual information necessary to obtain a full picture of the transaction. These tables typically need to contain a lot of text to describe the various attributes of an item or customer. They also, are not updated nearly as frequently as a transaction table (you don't have nearly as many new products as you have sales). So these tables are usually described as wide (containing many descriptive fields) and short (having fewer records).

Primary Keys

Last, but not least, Dimension tables must include a Primary Key. This Primary Key uniquely identifies each and every member (item or customer) in the dimension table. So, when the Foreign Key value of 3 is used in the transaction table, it is referring to the Primary Key of 3 in the dimension table. This enables a single dimension member to be described only once in a dimension table, but referenced many times within the fact table. This is what's known as a one-to-many relationship.

The Star Schema

So far we have Fact tables with many records tracking measurements of business processes, Dimension tables updated occasionally with many text fields to provide descriptive context, and Primary and Foreign keys to relate them together. Typically in a true analytical scenario you would have additional dimensions, such as a date dimension, sales person dimension, and maybe a warehouse dimension. When all of these tables are brought together with the Fact table in the middle, it resembles a star, which is the final database construct, or schema, of the dimensional model.

Suggested Reading

The above information is only a primer of the most basic data warehousing concepts. However, if you are planning to build an enterprise data warehouse, you should consider supplementing your education on this topic with additional resources.

The Data Warehouse Toolkit by Ralph Kimball is considered to be one of the most definitive guides on the topic and goes into much greater detail with real world examples.