Jos, my
co-author for the "Building Pentaho Solutions" book just
pointed me to a recent article by Jeff Prenevost entitled
"The Problem with History".AbstractJeff's
topic, loading a hybrid Type 1 / Type 2 slowly changing dimension table is related to
data warehousing but maybe of interest outside of
that context as well.
As it turns out, the particular problem described by Jeff is
non-trivial, but can be solved quite elegantly in a single SQL
statment. This may be a compelling alternative to the multi-step,
multi-pass solution proposed in his article.Type 1 and Type 2
Slowly Changing DimensionsIn his article, Jeff describes a method
to load a slowly changing dimension (SCD) table from an audit
trail. This would be quite straight forward in case we are
dealing with a Type 2 slowly changing dimension. In that
case, each row in the audit trail would also yield one row in the
dimension table. If the dimension would be a Type 1 slowly changing dimension, the matter
would only be slightly more complicated - in this case only the
most recent version of each object would be loaded into the
dimension table.A Hybrid SCD Type 1/2The interesting thing about
the problem described in the article is that the dimension table
is a hybrid Type 1 / Type 2 dimension. That is, for some
attributes, history needs to be tracked in the dimension table
(Type 2 attributes), whereas only the most recent data is
required for other attributes (Type 1 attributes).Sample DataTo
make things tangible, here's a sample Employee audit trail:
| SCD Type 1 | SCD Type 2 |
+--------+--------+-------------+--------+-------+------------+------------+
| empkey | name | ssn | gender | state | valid_from | valid_to |
+--------+--------+-------------+--------+-------+------------+------------+
| 14 | Jo | 323-10-1116 | F | MI | 1998-12-03 | 1998-12-27 |
| 14 | Jo | 323-10-1119 | F | MI | 1998-12-28 | 2005-04-22 |
| 14 | Joe | 323-10-1119 | F | MI | 2005-04-23 | 2005-08-07 |
| 14 | Joseph | 323-10-1119 | M | MI | 2005-08-08 | 2006-02-12 |
| 14 | Joe | 323-10-1119 | M | MI | 2006-02-13 | 2006-07-04 |
| 14 | Joseph | 323-10-1119 | M | NY | 2006-07-05 | 2006-12-24 |
| 14 | Joseph | 323-10-1119 | M | MI | 2006-12-25 | NULL |
| 15 | Jim | 224-57-2726 | M | IL | 2002-01-16 | 2004-03-15 |
| 15 | James | 224-57-2726 | M | IL | 2004-03-16 | 2007-06-22 |
| 15 | James | 224-57-2726 | M | IN | 2007-06-23 | 2007-08-31 |
+--------+--------+-------------+--------+-------+------------+------------+
The data shows the history for the employees with
empkey
14 and 15. All rows with the same
empkey
value form a time line of data change events.
Each row represents a change event, updating some of the
employee's data. The valid_from
and
valid_to
columns are used to record when the data
change event occurred, so the values in these columns change for
each row. For the other columns, I used bold markup to
make it easier to spot the change.SCD Type 1 and 2 AttributesIn
Jeff's article, the columns name
and
ssn
end up as SCD Type 1 attributes, and the columns
gender
and state
as SCD Type 2
attributes. I used color highlighting to mark up groups of
consecutive rows where the values for the gender
and
state
columns did not change. After loading the
dimension table we must end up with one row for each such group,
capturing all change events for these columns. The
ssn
and name
must always contain the
most recent data. I highlighted the most recent row using grey
markup.Resulting Dimension TableThe data for the resulting
dimension table would look like this:
+--------+--------+--------+-------------+--------+-------+------------+------------+------------+
| dw_key | empkey | name | ssn | gender | state | valid_from | valid_to | is_current |
+--------+--------+--------+-------------+--------+-------+------------+------------+------------+
| 1 | 14 | Joseph | 323-10-1119 | F | MI | 1998-12-03 | 2005-08-07 | N |
| 2 | 14 | Joseph | 323-10-1119 | M | MI | 2005-08-08 | 2006-07-04 | N |
| 3 | 14 | Joseph | 323-10-1119 | M | NY | 2006-07-05 | 2006-12-24 | N |
| 4 | 14 | Joseph | 323-10-1119 | M | MI | 2006-12-25 | 9999-12-31 | Y |
| 5 | 15 | James | 224-57-2726 | M | IL | 2002-01-16 | 2007-06-22 | N |
| 6 | 15 | James | 224-57-2726 | M | IN | 2007-06-23 | 9999-12-31 | Y |
+--------+--------+--------+-------------+--------+-------+------------+------------+------------
As you can see, only the changes in the gender
and
state
columns are recorded, ignoring any changes in
the name
and ssn
columns. Instead,
these columns get the values of the most recent change.Jeff's
SolutionJeff's article describes a step-wise solution to this
problem. To give you an idea and some context, here's a quote
from the article describing the mindset in developing this
approach:
While an ETL tool like Informatica or DataStage would be handy,
any of this could be done fairly easily in straight SQL. We’ll
also keep all the steps simple, easy to understand and discrete.
It’s possible to create enormous heaps of nested SQL to do
everything in one statement, but it's best to keep everything
understandable. We create and label "tables," but whether you
choose to actually create real database tables, or the tables are
just record sets inside a flow, the process remains essentially
unchanged.The actual steps that make up the described solution
can be summarized as follows:
-
- Rank all rows according to the timeline
- Join consecutive rows by rank in case the there is a change
in the SCD attributes
- Break the joined rows back into two rows (Jeff calls this
'semi-pivot')
- Add the last and first version for each object to the
set
- Add another rank number to the result set
- At this point, the rows where the new rank number is odd
contain the
valid_from
date, and rows with an even number contain thevalid_to
date. Join each odd row to its consecutive even rows.
- Join to the last version of each object to fill in for the
SCD Type 1 attributes, and set a flag for the latest version of
the object.
I admit I have a hard time understanding Jeff's method, and I am
completely puzzled as to how he invented it. My apologies if my
summary isn't exactly crystal clear - I encourage you to read
the original article, "The Problem with
History" yourself if you are interested in the details.My
AlternativeFirst of all, let me say that I am impressed with
Jeff's creativity. I would never think of this solution and I am
not really sure I even understand it. I also am not categorically
opposed to Jeff's initial mindset: I too think that it is
sometimes better to avoid complex SQL statements in favor of a
series of relatively simple ones.
That said, when I first read Jeff's article, my first impression
was: "Wow...that are a lot of moving parts". My second impression
was: "Darn, that's a lot of multiple passes over the same data
set". All the time, I had this hunch that it wouldn't be to hard
to do it in two statements (one SELECT...INTO
, one
UPDATE
), or perhaps even in one
(SELECT...INTO
).PreparationI started by setting up
the test data in MySQL 5.1. I'm including the script here for the
readers' convenience:
CREATE TABLE hrsource (
empkey int NOT NULL
, name varchar(10) NOT NULL
, ssn char(11) NOT NULL
, gender char(1) NOT NULL
, state char(2) NOT NULL
, valid_from date NOT NULL
, valid_to date
, PRIMARY KEY(empkey, valid_from)
);
INSERT INTO hrsource
(empkey,name,ssn,gender,state,valid_from,valid_to) VALUES
(14, 'Jo' , '323-10-1116', 'F', 'MI', '1998-12-03', '1998-12-27')
,(14, 'Jo' , '323-10-1119', 'F', 'MI', '1998-12-28', '2005-04-22')
,(14, 'Joe' , '323-10-1119', 'F', 'MI', '2005-04-23', '2005-08-07')
,(14, 'Joseph', '323-10-1119', 'M', 'MI', '2005-08-08', '2006-02-12')
,(14, 'Joe' , '323-10-1119', 'M', 'MI', '2006-02-13', '2006-07-04')
,(14, 'Joseph', '323-10-1119', 'M', 'NY', '2006-07-05', '2006-12-24')
,(14, 'Joseph', '323-10-1119', 'M', 'MI', '2006-12-25', NULL)
,(15, 'Jim' , '224-57-2726', 'M', 'IL', '2002-01-16', '2004-03-15')
,(15, 'James' , '224-57-2726', 'M', 'IL', '2004-03-16', '2007-06-22')
,(15, 'James' , '224-57-2726', 'M', 'IN', '2007-06-23', '2007-08-31');
Of course, this is not a lot of data, but it is the data from the original article, dutifully hacked into my computer's keyboard by your's truly.SolutionWithout further ado, this is my solution for selecting the dimension's table dataset:
SELECT prv.empkey
, curr.name
, curr.ssn
, prv.gender
, prv.state
, MIN(prv.valid_from) valid_from
, COALESCE(
nxt.valid_from - INTERVAL 1 DAY
, '9999-12-31'
) valid_to
, CASE
WHEN nxt.valid_from IS NULL THEN 'Y'
ELSE 'N'
END is_current
FROM hrsource curr
INNER JOIN (
SELECT empkey
, MAX(valid_from) valid_from
FROM hrsource
GROUP BY empkey
) curr1
ON curr.empkey = curr1.empkey
AND curr.valid_from = curr1.valid_from
INNER JOIN hrsource prv
ON curr.empkey = prv.empkey
LEFT JOIN hrsource nxt
ON prv.empkey = nxt.empkey
AND prv.valid_to < nxt.valid_from
AND (prv.gender,prv.state)!= (nxt.gender,nxt.state)
LEFT JOIN hrsource inb
ON prv.empkey = inb.empkey
AND prv.valid_to < inb.valid_from
AND nxt.valid_from > inb.valid_to
AND (prv.gender,prv.state)!= (inb.gender,inb.state)
WHERE inb.empkey IS NULL
GROUP BY prv.empkey
, nxt.valid_from
Now I would certainly not qualify this as a simple statement. At
the same time, I feel this does not resemble the "enormous heaps
of nested SQL" so dreaded by Jeff. Let me explain how it works,
and you can judge for yourself.ExplanationI will know do a
step-by-step explanation of my statement. I hope it clears up any
doubt you might have as to how my solution solves the problem.SCD
Type 1 attributes: Isolating the Most Recent ChangeIn the
introduction, I mentioned that the loading the dimension table
would be easy in case it was either a Type 1 or a Type 2 slowly
changing dimension. Well, we know that, come what may, we will
always need the most recent row for each empkey
to
supply values for the Type 1 attributes name
and
ssn
. So, we start by attacking that part of the
problem.
The following fragment of my solution does exactly that:
FROM hrsource curr
INNER JOIN (
SELECT empkey
, MAX(valid_from) valid_from
FROM hrsource
GROUP BY empkey
) curr1
ON curr.empkey = curr1.empkey
AND curr.valid_from = curr1.valid_from
The subquery curr1
finds us the highest value for
the valid_from
column for each distinct
empkey
. The GROUP BY empkey
clause
ensures we get exactly one row for each distinct value of
empkey
. From all rows with the same
empkey
value, the MAX(valid_from)
bit
finds the largest valid_from
value.
The combination empkey, valid_from
is the primary
key of the employee audit trail. This means we can now use the
empkey, MAX(valid_from)
pair from the
curr1
subquery to point out the most recent row for
each distinct empkey
. The INNER JOIN
with the hrsource
table (dubbed curr
)
does exactly that.Grouping consecutive rows with identical Type 2
attributesWe must now deal with the second part of the problem,
the SCD Type 2 attributes.
If you look back at the color highlighting in the sample data
set, you may realize this is basically a grouping problem: for
each distinct empkey
we need to group rows with
identical values in the Type 2 columns gender
and
state
. From the point of view of the dimension
table, no change occurred within this group, so we should store
it as one row in the dimension table, and reconstruct the change
dates by taking the minimum valid_from
and maximum
valid_to
values from the group.
We must be careful though: we can't simply make groups of all
distinct combinations of the SCD Type 2 columns. Look for example
at the rows having 14
for the empkey
column. The rows with the valid_from
values
2005-08-08
, 2006-02-13
and
2006-12-25
all have the identical combination
(M
, MI
) in the Type 2 columns
gender
and state
respectively. However,
only the first two of these belong together in a group. The row
with 2006-12-25
in the valid_from
column does not belong to the group because another change event
occurred on 2006-07-05
, which lies between
2006-02-13
and 2006-12-25
.
So, we must sharpen up the definition of the problem. It is not
enough to make groups of identical combinations of Type 2
attributes: we must also demand that the rows in the group are
consecutive.
The first step in attacking this problem is generate combinations
of the audit trail rows so that we have the first row of each
group along the first row of the next group. The following
fragment solves part of that problem by combining each row
(alias: prv
) with all more recent rows (alias:
nxt
) that have the same empkey
value
but different values in the Type 2 columns gender
and state
:
INNER JOIN hrsource prv
ON curr.empkey = prv.empkey
LEFT JOIN hrsource nxt
ON prv.empkey = nxt.empkey -- timeline of same employee
AND prv.valid_to < nxt.valid_from -- nxt must be more recent than prv
AND (prv.gender,prv.state)!= (nxt.gender,nxt.state) -- type 2 attributes differ
(Please ignore the part I struck out - it's not relevant at this
point)
Note the usage of the LEFT JOIN
. It ensures that if
prv
is the most recent row, and there is by
definition no row in nxt
that is more recent, the
prv
row is still retained. Had we used an
INNER JOIN
, we would have lost that row, messing up
the result.
If we run this fragment in isolation, we can certainly see rows
that pair the first row of a group with the first row of the next
group. For example, if you run this query:
SELECT prv.empkey
, prv.valid_from prv_from, prv.valid_to prv_to
, nxt.valid_from nxt_from, nxt.valid_to nxt_to
, prv.gender prv_gender
, nxt.gender nxt_gender
FROM hrsource prv
LEFT JOIN hrsource nxt
ON prv.empkey = nxt.empkey
AND prv.valid_to < nxt.valid_from
AND (prv.gender, prv.state) != (nxt.gender, nxt.state)
We get results like this:
+--------+------------+------------+------------+------------+------------+------------+
| empkey | prv_from | prv_to | nxt_from | nxt_to | prv_gender | nxt_gender |
+--------+------------+------------+------------+------------+------------+------------+
| 14 | 1998-12-03 | 1998-12-27 | 2005-08-08 | 2006-02-12 | F | M |
| 14 | 1998-12-03 | 1998-12-27 | 2006-02-13 | 2006-07-04 | F | M |
| 14 | 1998-12-03 | 1998-12-27 | 2006-07-05 | 2006-12-24 | F | M |
. . . . . . . .
. . ...more rows... . . . .
. . . . . . . .
+--------+------------+------------+------------+------------+------------+------------+
The first row highlighted in green is a desired combination
because the valid_from
date of nxt
is
closest to that of prv
. It is this row from
nxt
that marks the end of the group of rows starting
with prv
. For the rows highlighted in red, the
nxt
part indicates a change that occurred beyond
that point and are thus not desired.
To get rid of these undesired rows, we simply have to demand that
no row indicating a change in the SCD Type 2 attributes occurs
between prv
and nxt
. In my query, the
following fragment is responsible for that part:
LEFT JOIN hrsource inb
ON prv.empkey = inb.empkey -- time line of same employee
AND prv.valid_to < inb.valid_from -- more recent than prv
AND (prv.gender,prv.state)!= (inb.gender,inb.state) -- Type 2 columns changed
AND nxt.valid_from > inb.valid_to -- less recent than nxt
WHERE inb.empkey IS NULL -- does not exist
This fragment essentially states that there must not be any row
in between prv
and nxt
that indicate a
change in the SCD type 2 attributes as compared to
prv
. Note that the first part of the join condition
is identical to the one we used to join nxt
to
prv
. The extra's are that we also ask that
inb
lies before nxt
. Because we used a
LEFT JOIN
, the result row is retained in case no
such inb
row exists. The WHERE inb.empkey IS
NULL
condition explicitly filters for these cases. By
definition, this means that the nxt
part in the
result row marks the end of whatever group the prv
belongs to.Rolling Up the Rows in their GroupsNow, if we put
these parts together we get something like this:
SELECT prv.empkey
, prv.valid_from prv_from, prv.valid_to prv_to
, nxt.valid_from nxt_from, nxt.valid_to nxt_to
, prv.gender prv_gender
, nxt.gender nxt_gender
FROM hrsource prv
LEFT JOIN hrsource nxt
ON prv.empkey = nxt.empkey
AND prv.valid_to < nxt.valid_from
AND (prv.gender, prv.state) != (nxt.gender, nxt.state)
LEFT JOIN hrsource inb
ON prv.empkey = inb.empkey
AND prv.valid_to < inb.valid_from
AND (prv.gender,prv.state) != (inb.gender,inb.state)
AND nxt.valid_from > inb.valid_to
WHERE inb.empkey IS NULL
Some of the results are here:
+--------+------------+------------+------------+------------+------------+------------+
| empkey | prv_from | prv_to | nxt_from | nxt_to | prv_gender | nxt_gender |
+--------+------------+------------+------------+------------+------------+------------+
| 14 | 1998-12-03 | 1998-12-27 | 2005-08-08 | 2006-02-12 | F | M |
| 14 | 1998-12-28 | 2005-04-22 | 2005-08-08 | 2006-02-12 | F | M |
| 14 | 2005-04-23 | 2005-08-07 | 2005-08-08 | 2006-02-12 | F | M |
. . . . . . . .
. . ...more rows... . . . .
. . . . . . . .
+--------+------------+------------+------------+------------+------------+------------+
Now, we still see the individual rows. However, because we have
paired them with the row that definitely marks the end of the
group it belongs to, we can now lump them together using
GROUP BY
. This explains the final bits of the query:
GROUP BY prv.empkey
, nxt.valid_from
For each value in empkey
, this GROUP BY
clause essentially rolls up the group of rows that showed no
change in the SCD Type 2 attributes until the occurrence of the
nxt
row. In the SELECT
list, we can now
use the MIN
function to find the date of the first
row in the group.
Note that the GROUP BY
list is the thinnest one
possible. The SELECT
list contains many more columns
that do not appear in the GROUP BY
list. However, it
is perfectly safe to do so in this case. For more details on this
matter, please read my Debunking Group By Myths
article.Glueing the SCD Type 1 and Type 2 Solutions
togetherWe developed the solution as two separate queries - one
to take care of the SCD Type 1 attributes, one for the SCD Type 2
attributes. These two are simply joined together over the
empkey
. The code to do this was displayed struck out
in the first query fragment that deals with SCD type 2
attributes.Calculating the Result columnsThe only part of the
query that was not discussed yet is the SELECT
list:
SELECT prv.empkey
, curr.name
, curr.ssn
, prv.gender
, prv.state
, MIN(prv.valid_from) valid_from
, COALESCE(
nxt.valid_from - INTERVAL 1 DAY
, '9999-12-31'
) valid_to
, CASE
WHEN nxt.valid_from IS NULL THEN 'Y'
ELSE 'N'
END is_current
Note that we select the name
and ssn
columns from the curr
table. This ensures these
columns will always reflect the data of the most recent change
which is typical for SCD Type 1 attributes. The
gender
and state
columns on the other
hand are drawn from prv
. They reflect the change
history for each individual employee.
The valid_from
column is aggregated using the
MIN
function. This essentially yields the date of
the first row of each group of consecutive rows having the same
empkey
and values in the SCD Type 2 columns.
The valid_to
date is calculated from the
valid_from
date of the first row of the next group
of rows that indicates a change of values in the SCD Type 2
columns. This is the relevant code:
nxt.valid_from - INTERVAL 1 DAY
Now, to be completely fair - this is in fact cheating. It implies
I trust the data to be such that the value for the
valid_to
date is always exactly one day less than
the value for the valid_from
date for the next row
in the history. This was certainly the case with the sample data,
but we cannot really rely on that. That said, it is not that hard
to get the 'real' valid_to
date, but it would make
this example a lot harder to understand.
Another thing about the calculation of valid_to
is
the treatment of NULL
values. For each history
associated with one distinct empkey
value, the last
row will have NULL
here, and this is substituted by
a large date instead. Dimension tables tend to avoid nullable
columns and in this case the maximum value '9999-12-31' expresses
that this row is the current row. Using that same logic, and
exrta flag column is added to indicate whether the row is the
current row.