"I never learned from a man who agreed with me." --Robert Heinlein
Here are some quotes from the original debates that spawned the
industry.
An article entitled "A Dimensional Modeling Manifesto" in the
August 1997 issue of DBMS magazine has been generating a lot of
misinformation within the data warehouse community. In this
article, author Ralph Kimball says he is "drawing a line between
dimensional modeling (DM) and entity relational (ER) modeling
techniques." Mr. Kimball also makes the statement, "DM is the
only viable technique for databases that are designed to support
end-user queries in a data warehouse. ER is very useful for the
transaction capture and data administration phases of
constructing a warehouse, but it should be avoided for the
end-user delivery." Let's look at why this is a technically
myopic statement.
Mr. Kimball's statement seems to imply that both modeling methods
have value in the warehouse, but that users should not run
queries against the ER-modeled database. In short, Kimball is
advocating a data mart-centric architecture, whereby all access
is through the marts. This is not a scalable architecture nor one
that can be responsive to rapidly changing business needs.
The first theme in Kimball's article that should be refuted is
that ER models have no business rules, only data rules, as if
this were a bad thing. Business rules come from the business, and
the business changes over time. The purpose of a data model is to
model the entities of the data and the relationships connecting
those entities. Those relationships are inherent in the data,
arising from the transactions.
By placing the data in the proper tables, as defined by logical
modeling techniques, the business integrity of the data
relationships is maintained. Once this is accomplished, then it
is up to users to decide how they would like to compare and
contrast those relationships and data elements to best address
the analysis at hand. This also leads to the power of data
mining, whereby advanced tools discover relationships that are
not easily detected but exist naturally in the business.
Another point repeated throughout the manifesto is that fact
tables should always be queried separately. This may be true with
the software and databases that Mr. Kimball likes to advocate,
but it is not necessarily true across the board. A robust
relational database engine will allow an analysis that needs
facts from several tables to be run as a single query.
According to Mr. Kimball's statement, if I needed to get
information regarding inventory, distribution, and sales of
particular items, then I should run this as three separate
queries and correlate the data with the use of intermediate
tables. This is not only wasteful processing; it also tends to
limit users' ability to pose advanced analyses.
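Armstrong's point can be sketched concretely. The following uses Python's sqlite3 with hypothetical sales and inventory fact tables (the names and columns are illustrative assumptions, not from the original exchange): a single SQL statement can combine facts from both tables, provided each fact table is first aggregated to a common grain before the join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical fact tables at different grains, sharing an item key.
cur.execute("CREATE TABLE sales (item TEXT, qty_sold INTEGER)")
cur.execute("CREATE TABLE inventory (item TEXT, qty_on_hand INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("widget", 5), ("widget", 3), ("gadget", 7)])
cur.executemany("INSERT INTO inventory VALUES (?, ?)",
                [("widget", 20), ("gadget", 12)])

# One query, not three: aggregate each fact table to the item grain
# in a derived table, then join the aggregates.
rows = cur.execute("""
    SELECT s.item, s.sold, i.on_hand
    FROM (SELECT item, SUM(qty_sold) AS sold
          FROM sales GROUP BY item) s
    JOIN (SELECT item, SUM(qty_on_hand) AS on_hand
          FROM inventory GROUP BY item) i
      ON s.item = i.item
    ORDER BY s.item
""").fetchall()

print(rows)  # [('gadget', 7, 12), ('widget', 8, 20)]
```

The derived-table aggregation is what keeps the single query safe; without it, the join itself would distort the measures.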
This point leads to another of Mr. Kimball's recurring themes:
Cost-based optimizers don't work and cannot be trusted to give
accurate or consistent results. While I agree that many
optimizers on the market don't work, I invite Kimball to try the
NCR Teradata optimizer. Unlike optimizers that grew out of the
OLTP arena, Teradata started as a parallel, relational system and
uses a cost-based, parallel-aware approach to construct a query
plan. The Teradata optimizer has more than 15 years of experience
in processing true decision-support analysis and, as such, is
many years ahead of other optimizers in the marketplace.
The final point to address here is the idea that users can't
navigate an ER model. This is a critical, but simple, point. The
users should not be navigating the system! This is a metadata and
application issue. End users should neither know nor care what
the model looks like or how to navigate it; the lack of good
tools is exactly the problem. As the tools evolve, you do not
want to limit the warehouse by a compromised data model.
Mr. Kimball concludes his thoughts by stating that the use of ER
modeling defeats the allure of the warehouse, which is intuitive
and high-performance retrieval of data. Here he perpetuates the
myth that decision-support queries need to have very low response
times. Data warehousing is not only about performance but also
about capability. True data warehousing allows the users to ask
questions previously unknown (or previously unanswerable). The
allure of data warehousing was, and is, the ability to ask any
question, at any time, and get a timely response. The discussion
became one of speed and compromise only when certain vendors
could not provide that capability.
In conclusion, I will agree that there is value to the use of
data marts and dimensional modeling within the entire
infrastructure of a complete warehouse. The use of such methods
should come after less intrusive and compromising optimization
techniques have been exhausted, rather than as the first resort.
I find it irresponsible for an "independent industry expert" such
as Mr. Kimball to portray his own apparently biased solution as
the only viable option when writing an article for a magazine
that purports to serve the entire industry.
Rob Armstrong
NCR Corp.
robert.armstrong@sandiegoca.ncr.com
Mr. Armstrong starts by stating that the Manifesto article has
been "generating a lot of misinformation in the data warehouse
community." In my humble opinion, the ideas in the Manifesto
represent the mainstream thinking in data warehousing today, and
Mr. Armstrong is reacting not to the actual issues but to the
fact that the Teradata technology has a difficult time with star
join queries. The following points speak to these issues more
specifically.
Mr. Armstrong says: "[The data mart architecture] is not a
scalable architecture nor one that can be responsive to rapidly
changing business needs." The data mart architecture is popular
precisely because it is scalable and can adapt to changing
business needs. Furthermore, because of its scalability, most
designers believe they can bring a project to completion, unlike
the galactic data warehouse approach that seems insurmountable.
See my article "Bringing up Supermarts," in the January issue of
DBMS for a more complete perspective on this issue.
Mr. Armstrong repeats the familiar mantra of entity-relationship
modeling. "The purpose of a data model is to model the entities
of the data and relationships connecting those entities." The
words are seductive, but they are a snare and a delusion. As I
said in the Manifesto article, the ER model only catalogs
low-level logical relationships between data elements but
provides little or no perspective on higher-level business
issues. For example, what do we do when our understanding of a
product or a customer changes over time? Not only does the
dimensional modeling framework have a specific set of techniques
for handling this situation, but all existing reporting and
querying applications are guaranteed to continue to work
gracefully.
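The specific technique Kimball alludes to for a product or customer changing over time is what he elsewhere calls a Type 2 slowly changing dimension. A minimal sketch in Python, with hypothetical field names: rather than overwriting the attribute, a change closes the current dimension row and appends a new version, so old facts keep joining to the history they occurred under.

```python
from datetime import date

# Hypothetical customer dimension: each row is one version of a customer.
dim_customer = [
    {"key": 1, "customer_id": "C42", "segment": "retail",
     "valid_from": date(2020, 1, 1), "valid_to": None},
]

def apply_type2_change(dim, customer_id, field, new_value, as_of):
    """Type 2 change: close the current row and append a new version."""
    current = next(r for r in dim
                   if r["customer_id"] == customer_id
                   and r["valid_to"] is None)
    if current[field] == new_value:
        return  # nothing changed
    current["valid_to"] = as_of
    new_row = dict(current,
                   key=max(r["key"] for r in dim) + 1,
                   valid_from=as_of, valid_to=None)
    new_row[field] = new_value
    dim.append(new_row)

apply_type2_change(dim_customer, "C42", "segment", "wholesale",
                   date(2023, 6, 1))
# Facts dated before 2023-06-01 join to the "retail" row;
# later facts join to the "wholesale" row.
print([(r["key"], r["segment"], r["valid_to"]) for r in dim_customer])
```

Because queries join facts to dimension rows by surrogate key, existing reports continue to work unchanged after the new version is added, which is the guarantee Kimball refers to.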
Mr. Armstrong's objection to the separate querying of inventory,
distribution, and sales tables is misplaced. The separate query
approach has immense practical advantages. The SQL required can
be generated much more reliably and through a wider range of
available user interfaces than a single monolithic query.
Monolithic SQL queries combining fact tables of different
cardinalities are extraordinarily dangerous and difficult to
write. In many practical situations, what looks like the correct
query will return the wrong results when the separate fact tables
are joined. This is a "warning" exercise I teach in my data
warehousing classes!
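Kimball's warning about fact tables of different cardinalities is easy to reproduce. A minimal sketch in Python's sqlite3, with hypothetical tables: joining two fact tables directly on their shared key fans the rows out, and what looks like the correct query returns inflated totals for both measures.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Two hypothetical fact tables at different cardinalities per item.
cur.execute("CREATE TABLE sales (item TEXT, qty_sold INTEGER)")
cur.execute("CREATE TABLE shipments (item TEXT, qty_shipped INTEGER)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("widget", 5), ("widget", 3)])             # true total: 8
cur.executemany("INSERT INTO shipments VALUES (?, ?)",
                [("widget", 4), ("widget", 4), ("widget", 2)])  # true total: 10

# What looks like the correct query: join both fact tables, then SUM.
bad = cur.execute("""
    SELECT SUM(s.qty_sold), SUM(sh.qty_shipped)
    FROM sales s JOIN shipments sh ON s.item = sh.item
""").fetchone()

# The 2 x 3 join fans out to 6 rows, so both measures are overstated:
# each sales row is counted 3 times, each shipment row twice.
print(bad)  # (24, 20) instead of (8, 10)
```

This is exactly the failure mode that makes the separate-query-per-fact-table discipline attractive: each fact table is summed at its own grain before the results are combined.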
Separate SQL queries also can be "aggregate navigated"
separately. Aggregate navigation is a key data warehousing
technique for building cost-effective data marts and avoiding
brute-force relation scans of the data. I know of no general
approach for aggregate navigating monolithic SQL queries joining
multiple fact tables together.
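Aggregate navigation, as described here, can be sketched as a lookup: given the dimensions a query asks for, route it to the smallest pre-built aggregate table that still carries all of those dimensions. A minimal sketch in Python, with hypothetical table names and row counts:

```python
# Hypothetical aggregate tables: each entry lists the dimension
# columns it retains and its approximate row count.
AGGREGATES = [
    ("sales_by_day_product_store", {"day", "product", "store"}, 10_000_000),
    ("sales_by_month_product", {"month", "product"}, 200_000),
    ("sales_by_month", {"month"}, 36),
]

def navigate(requested_dims):
    """Pick the smallest aggregate whose dimensions cover the request."""
    candidates = [(rows, name) for name, dims, rows in AGGREGATES
                  if requested_dims <= dims]  # subset test
    if not candidates:
        raise LookupError("no aggregate covers %s" % sorted(requested_dims))
    return min(candidates)[1]

print(navigate({"month"}))             # sales_by_month
print(navigate({"month", "product"}))  # sales_by_month_product
```

The navigator works because each separate query names one fact table and one set of dimensions; a monolithic query spanning several fact tables has no single grain to match against, which is the difficulty Kimball points out.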
Separate SQL queries also invite the flexibility of having the
separate fact tables on different databases, different machines,
or even different database technologies. This cost-effective and
modular approach to assembling a data warehouse from small,
medium, and large separate data marts is a key direction in the
industry at this time.
Mr. Armstrong invites me to try the Teradata cost-based
optimizer. Well, I have. I have spent a significant fraction of
my consulting and product development career working on Teradata
machines, including a stint at the Teradata development and test
center in El Segundo, California. My conclusion after trying to
make dimensional models (star joins) run on the Teradata machines
was that Teradata's parallel processing algorithm was very
inefficient and tended to produce a "hot amp" where one of its
processors bottlenecked the entire machine. As a result, the
Teradata system engineers in most of the accounts I worked in
recommended against a dimensional model. Just last week, a
Teradata customer at a large bank told me that his system
exhibited this familiar Teradata behavior.
In the Manifesto I pointed out that end users are not capable of
navigating an ER model. Mr. Armstrong implicitly agrees with me
when he says, "The users should not be navigating the system." My
point, and the point that OLAP vendors make so effectively, is
that the dimensional model is itself intuitive and
understandable.
Mr. Armstrong accuses me of perpetuating the myth that
"decision-support queries need to have very low response times."
I don't believe that is a myth. Mr. Armstrong is telling the
marketplace that it is wrong to expect speedy response times. The
discipline of dimensional modeling offers a hope that with
tightly focused technology, rather than brute-force processing,
we can get the answers and do it fast.
Ralph Kimball,
www.rkimball.com