"I never learned from a man who agreed with me." --Robert Heinlein
Here are some quotes from the original debates that spawned the
industry.
An article entitled "A Dimensional Modeling Manifesto" in the
August 1997 issue of DBMS magazine has been generating a lot of
misinformation within the data warehouse community. In this
article, author Ralph Kimball says he is "drawing a line between
dimensional modeling (DM) and entity relational (ER) modeling
techniques." Mr. Kimball also makes the statement, "DM is the
only viable technique for databases that are designed to support
end-user queries in a data warehouse. ER is very useful for the
transaction capture and data administration phases of
constructing a warehouse, but it should be avoided for the
end-user delivery." Let's look at why this is a technically
myopic statement.
Mr. Kimball's statement seems to imply that both modeling methods
have value in the warehouse, but that users should not run
queries against the ER-modeled database. In short, Kimball is
advocating a data mart-centric architecture, whereby all access
is through the marts. This is not a scalable architecture nor one
that can be responsive to rapidly changing business needs.
The first theme in Kimball's article that should be refuted is
that ER models have no business rules, only data rules, as if
this were a bad thing. Business rules come from the business, and
the business changes over time. The purpose of a data model is to
model the entities of the data and the relationships connecting
those entities. Those relationships are inherent in the data,
arising from the transactions.
By placing the data in the proper tables, as defined by logical
modeling techniques, the business integrity of the data
relationships is maintained. Once this is accomplished, then it
is up to users to decide how they would like to compare and
contrast those relationships and data elements to best address
the analysis at hand. This also leads to the power of data
mining, whereby advanced tools discover relationships that are
not easily detected but exist naturally in the business.
Another point repeated throughout the manifesto is that fact
tables should always be queried separately. This may be true with
the software and databases that Mr. Kimball likes to advocate,
but it is not necessarily true across the board. A robust
relational database engine will allow an analysis that needs
facts from several tables to be run as a single query.
According to Mr. Kimball's statement, if I needed to get
information regarding inventory, distribution, and sales of
particular items, then I should run this as three separate
queries and correlate the data with the use of intermediate
tables. This is not only wasteful processing; it also tends to
limit users' ability to pose advanced analyses.
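Armstrong's point can be sketched concretely. The following uses Python's sqlite3 with hypothetical sales and inventory fact tables (the names and columns are illustrative assumptions, not from the original exchange): a single SQL statement can combine facts from both tables, provided each fact table is first aggregated to a common grain before the join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical fact tables at different grains, sharing an item key.
cur.execute("CREATE TABLE sales (item TEXT, qty_sold INTEGER)")
cur.execute("CREATE TABLE inventory (item TEXT, qty_on_hand INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("widget", 5), ("widget", 3), ("gadget", 7)])
cur.executemany("INSERT INTO inventory VALUES (?, ?)",
                [("widget", 20), ("gadget", 12)])

# One query, not three: aggregate each fact table to the item grain
# in a derived table, then join the aggregates.
rows = cur.execute("""
    SELECT s.item, s.sold, i.on_hand
    FROM (SELECT item, SUM(qty_sold) AS sold
          FROM sales GROUP BY item) s
    JOIN (SELECT item, SUM(qty_on_hand) AS on_hand
          FROM inventory GROUP BY item) i
      ON s.item = i.item
    ORDER BY s.item
""").fetchall()

print(rows)  # [('gadget', 7, 12), ('widget', 8, 20)]
```

The derived-table aggregation is what keeps the single query safe; without it, the join itself would distort the measures.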
This point leads to another of Mr. Kimball's recurring themes:
Cost-based optimizers don't work and cannot be trusted to give
accurate or consistent results. While I agree that many
optimizers on the market don't work, I invite Kimball to try the
NCR Teradata optimizer. Unlike optimizers that grew out of the
OLTP arena, Teradata started as a parallel, relational system and
uses a cost-based, parallel-aware approach to construct a query
plan. The Teradata optimizer has more than 15 years of experience
in processing true decision-support analysis and, as such, is
many years ahead of other optimizers in the marketplace.
The final point to address here is the idea that users can't
navigate an ER model. This is a critical, but simple, point. The
users should not be navigating the system! This is a metadata and
application issue. End users should neither know nor care what
the model looks like or how to navigate it; the lack of good
tools is exactly the problem. As the tools evolve, you do not
want to limit the warehouse by a compromised data model.
Mr. Kimball concludes his thoughts by stating that the use of ER
modeling defeats the allure of the warehouse, which is intuitive
and high-performance retrieval of data. Here he perpetuates the
myth that decision-support queries need to have very low response
times. Data warehousing is not only about performance but also
about capability. True data warehousing allows the users to ask
questions previously unknown (or previously unanswerable). The
allure of data warehousing was, and is, the ability to ask any
question, at any time, and get a timely response. The discussion
became one of speed and compromise only when certain vendors
could not provide that capability.
In conclusion, I will agree that there is value to the use of
data marts and dimensional modeling within the entire
infrastructure of a complete warehouse. The use of such methods
should come after less intrusive and compromising optimization
techniques have been exhausted, rather than as the first resort.
I find it irresponsible for an "independent industry expert" such
as Mr. Kimball to portray his own apparently biased solution as
the only viable option when writing an article for a magazine
that purports to serve the entire industry.
Rob Armstrong
NCR Corp.
robert.armstrong@sandiegoca.ncr.com
Mr. Armstrong starts by stating that the Manifesto article has
been "generating a lot of misinformation in the data warehouse
community." In my humble opinion, the ideas in the Manifesto
represent the mainstream thinking in data warehousing today, and
Mr. Armstrong is reacting not to the actual issues but to the
fact that the Teradata technology has a difficult time with star
join queries. The following points speak to these issues more
specifically.
Mr. Armstrong says: "[The data mart architecture] is not a
scalable architecture nor one that can be responsive to rapidly
changing business needs." The data mart architecture is popular
precisely because it is scalable and can adapt to changing
business needs. Furthermore, because of its scalability, most
designers believe they can bring a project to completion, unlike
the galactic data warehouse approach that seems insurmountable.
See my article "Bringing up Supermarts," in the January issue of
DBMS for a more complete perspective on this issue.
Mr. Armstrong repeats the familiar mantra of entity-relationship
modeling. "The purpose of a data model is to model the entities
of the data and relationships connecting those entities." The
words are seductive, but they are a snare and a delusion. As I
said in the Manifesto article, the ER model only catalogs
low-level logical relationships between data elements but
provides little or no perspective on higher-level business
issues. For example, what do we do when our understanding of a
product or a customer changes over time? Not only does the
dimensional modeling framework have a specific set of techniques
for handling this situation, but all existing reporting and
querying applications are guaranteed to continue to work
gracefully.
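The specific technique Kimball alludes to for a product or customer changing over time is what he elsewhere calls a Type 2 slowly changing dimension. A minimal sketch in Python, with hypothetical field names: rather than overwriting the attribute, a change closes the current dimension row and appends a new version, so old facts keep joining to the history they occurred under.

```python
from datetime import date

# Hypothetical customer dimension: each row is one version of a customer.
dim_customer = [
    {"key": 1, "customer_id": "C42", "segment": "retail",
     "valid_from": date(2020, 1, 1), "valid_to": None},
]

def apply_type2_change(dim, customer_id, field, new_value, as_of):
    """Type 2 change: close the current row and append a new version."""
    current = next(r for r in dim
                   if r["customer_id"] == customer_id
                   and r["valid_to"] is None)
    if current[field] == new_value:
        return  # nothing changed
    current["valid_to"] = as_of
    new_row = dict(current,
                   key=max(r["key"] for r in dim) + 1,
                   valid_from=as_of, valid_to=None)
    new_row[field] = new_value
    dim.append(new_row)

apply_type2_change(dim_customer, "C42", "segment", "wholesale",
                   date(2023, 6, 1))
# Facts dated before 2023-06-01 join to the "retail" row;
# later facts join to the "wholesale" row.
print([(r["key"], r["segment"], r["valid_to"]) for r in dim_customer])
```

Because queries join facts to dimension rows by surrogate key, existing reports continue to work unchanged after the new version is added, which is the guarantee Kimball refers to.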
Mr. Armstrong's objection to the separate querying of inventory,
distribution, and sales tables is misplaced. The separate query
approach has immense practical advantages. The SQL required can
be generated much more reliably and through a wider range of
available user interfaces than a single monolithic query.
Monolithic SQL queries combining fact tables of different
cardinalities are extraordinarily dangerous and difficult to
write. In many practical situations, what looks like the correct
query will return the wrong results when the separate fact tables
are joined. This is a "warning" exercise I teach in my data
warehousing classes!
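Kimball's warning about fact tables of different cardinalities is easy to reproduce. A minimal sketch in Python's sqlite3, with hypothetical tables: joining two fact tables directly on their shared key fans the rows out, and what looks like the correct query returns inflated totals for both measures.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Two hypothetical fact tables at different cardinalities per item.
cur.execute("CREATE TABLE sales (item TEXT, qty_sold INTEGER)")
cur.execute("CREATE TABLE shipments (item TEXT, qty_shipped INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("widget", 5), ("widget", 3)])             # true total: 8
cur.executemany("INSERT INTO shipments VALUES (?, ?)",
                [("widget", 4), ("widget", 4), ("widget", 2)])  # true total: 10

# What looks like the correct query: join both fact tables, then SUM.
bad = cur.execute("""
    SELECT SUM(s.qty_sold), SUM(sh.qty_shipped)
    FROM sales s JOIN shipments sh ON s.item = sh.item
""").fetchone()

# The 2 x 3 join fans out to 6 rows, so both measures are overstated:
# each sales row is counted 3 times, each shipment row twice.
print(bad)  # (24, 20) instead of (8, 10)
```

This is exactly the failure mode that makes the separate-query-per-fact-table discipline attractive: each fact table is summed at its own grain before the results are combined.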
Separate SQL queries also can be "aggregate navigated"
separately. Aggregate navigation is a key data warehousing
technique for building cost-effective data marts and avoiding
brute-force relation scans of the data. I know of no general
approach for aggregate navigating monolithic SQL queries joining
multiple fact tables together.
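Aggregate navigation, as described here, can be sketched as a lookup: given the dimensions a query asks for, route it to the smallest pre-built aggregate table that still carries all of those dimensions. A minimal sketch in Python, with hypothetical table names and row counts:

```python
# Hypothetical aggregate tables: each entry lists the dimension
# columns it retains and its approximate row count.
AGGREGATES = [
    ("sales_by_day_product_store", {"day", "product", "store"}, 10_000_000),
    ("sales_by_month_product", {"month", "product"}, 200_000),
    ("sales_by_month", {"month"}, 36),
]

def navigate(requested_dims):
    """Pick the smallest aggregate whose dimensions cover the request."""
    candidates = [(rows, name) for name, dims, rows in AGGREGATES
                  if requested_dims <= dims]  # subset test
    if not candidates:
        raise LookupError("no aggregate covers %s" % sorted(requested_dims))
    return min(candidates)[1]

print(navigate({"month"}))             # sales_by_month
print(navigate({"month", "product"}))  # sales_by_month_product
```

The navigator works because each separate query names one fact table and one set of dimensions; a monolithic query spanning several fact tables has no single grain to match against, which is the difficulty Kimball points out.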
Separate SQL queries also invite the flexibility of having the
separate fact tables on different databases, different machines,
or even different database technologies. This cost-effective and
modular approach to assembling a data warehouse from small,
medium, and large separate data marts is a key direction in the
industry at this time.
Mr. Armstrong invites me to try the Teradata cost-based
optimizer. Well, I have. I have spent a significant fraction of
my consulting and product development career working on Teradata
machines, including a stint at the Teradata development and test
center in El Segundo, California. My conclusion after trying to
make dimensional models (star joins) run on the Teradata machines
was that Teradata's parallel processing algorithm was very
inefficient and tended to produce a "hot amp" where one of its
processors bottlenecked the entire machine. As a result, the
Teradata system engineers in most of the accounts I worked in
recommended against a dimensional model. Just last week, a
Teradata customer at a large bank told me that his system
exhibited this familiar Teradata behavior.
In the Manifesto I pointed out that end users are not capable of
navigating an ER model. Mr. Armstrong implicitly agrees with me
when he says, "The users should not be navigating the system." My
point, and the point that OLAP vendors make so effectively, is
that the dimensional model is itself intuitive and
understandable.
Mr. Armstrong accuses me of perpetuating the myth that
"decision-support queries need to have very low response times."
I don't believe that is a myth. Mr. Armstrong is telling the
marketplace that it is wrong to expect speedy response times. The
discipline of dimensional modeling offers a hope that with
tightly focused technology, rather than brute-force processing,
we can get the answers and do it fast.
Ralph Kimball,
www.rkimball.com