Mondrian is generally very smart in how it chooses
to implement queries. Over the last month or so, I have learned
some lessons about how hard can be to make Mondrian
smarter.
As a ROLAP engine (I prefer to call it 'ROLAP with
caching'), Mondrian's evaluation strategy has always been a blend
of in-memory processing, caching, and native SQL execution.
Naturally there is always SQL involved, because Mondrian doesn't
store any of its own data, but the question is how much of the
processing Mondrian pushes down to the DBMS and how much it does
itself, based on data in its cache.
The trends are towards native SQL execution. Data volumes are
growing across the board, Mondrian is being deployed to larger
enterprises with large data sets (in some cases displacing more
established, and expensive, engines). Mondrian cannot keep up
with the growth by simply pulling more data into memory and
throwing one or two more CPU cores at the problem.
Luckily a new breed of database engines, including Aster Data,
Greenplum,
Infobright, Kickfire, LucidDB, Netezza and Vertica, are helping to solve the data problem
with innovative architectures and algorithms. To exploit the
power of the database engine, Mondrian's ability to generate
native SQL is more important than ever.
I have spent the last few weeks struggling to make Mondrian
handle a particular case more efficiently. It was ultimately
unsuccessful, but it was a case where defeat teaches you more
than victory.
Here is the actual MDX query:
WITH
SET [COG_OQP_INT_s9] AS
'CROSSJOIN({[Store Size in SQFT].[Store
Sqft].MEMBERS},[COG_OQP_INT_s8])'
SET [COG_OQP_INT_s8] AS
'CROSSJOIN({[Yearly Income].[Yearly
Income].MEMBERS},[COG_OQP_INT_s7])'
SET [COG_OQP_INT_s7] AS
'CROSSJOIN({[Time].[Time].MEMBERS},
[COG_OQP_INT_s6])'
SET [COG_OQP_INT_s6] AS
'CROSSJOIN({[Store].[Store
Country].MEMBERS},[COG_OQP_INT_s5])'
SET [COG_OQP_INT_s5] AS
'CROSSJOIN({[Promotions].[Promotions].MEMBERS},
[COG_OQP_INT_s4])'
SET [COG_OQP_INT_s4] AS
'CROSSJOIN({[Promotion Media].[Promotion
Media].MEMBERS},[COG_OQP_INT_s3])'
SET [COG_OQP_INT_s3] AS
'CROSSJOIN({[Store Type].[Store Type].MEMBERS},
[COG_OQP_INT_s2])'
SET [COG_OQP_INT_s2] AS
'CROSSJOIN({[Marital Status].[Marital
Status].MEMBERS}, [COG_OQP_INT_s1])'
SET [COG_OQP_INT_s1] AS
'CROSSJOIN({[Gender].[Gender].MEMBERS},
{[Education Level].[Education
Level].MEMBERS})'
SELECT {[Measures].[Unit Sales]} ON AXIS(0),
NON EMPTY [COG_OQP_INT_s9] ON AXIS(1)
FROM [Sales]
WHERE ([Customers].[All Customers].[USA].[CA].[San
Francisco].[Karen Moreland])The query looks a bit fearsome, but
is quite likely to occur in practice as a business user slices
and dices on several attributes simultaneously. The rows axis is
a CrossJoin of ten dimensions, but because of the filtering
effect of the slicer (combined with NON EMPTY) the query
evaluates to a single row. The goal is to make Mondrian generate
a SQL statement to evaluate the axis.
Each way that I tried to write the logic, I ended up making
decisions that made other optimizations invalid. It was difficult
to make Mondrian see the big picture: that, although named sets
are not supposed to inherit the context where they evaluated, in
this case it was OK; and to recognize a complex expression (many
nested CrossJoin operators, slicer, and implicit non-empty
context), and convert the whole thing into a single SQL
statement. For instance, in one attempt I succeeded in generating
a SQL statement which evaluates very efficiently, but in so doing
I had to let the non-empty context of the evaluator leak into
places that it shouldn't... which broke quite a few existing
queries, in particular queries involving calculated sets.
There are several conclusions for Mondrian's architecture. One
conclusion is that we need to deal with filtering non-empty
tuples as part of the expression, not as a flag in the evaluator
(the data structure that contains, among other things, the set of
members that form the context for evaluating an
expression).
MDX has an operator, EXISTS, that specifies that empty tuples should be
removed from a set. Then we can reason about queries by applying
logic-preserving transformations (just the way that an RDBMS
query optimizer works), which should be safer than today's ad hoc
reasoning. For example, if I am a developer implementing an MDX
function and the evaluator has nonEmpty=true, am I required to
eliminate non-empty tuples or am I merely allowed to eliminate
them? (In other words, will my caller return the wrong result if
I forget to check the evaluator flag?) I often forget, so I
suspect that filtering of empty tuples is performed
inconsistently throughout the Mondrian code base; which is a
shame, because eliminating empty tuples early can do a lot for
performance.
I'd also like to use the same model for native SQL generation as
for other forms of expression compilation. Native SQL generation
currently happens at query execution time: when the function is
evaluated, it figures out whether it can possibly translate the
logic (and the constraints inherited from the evaluation context)
into SQL. That is currently unavoidable, because the nonEmpty
flag is only available in the evaluator, at query execution time.
And we need to do some work at query execution time, if only to
plug in the keys of the members in the current context as
predicates in the SQL statement. But I've seen several cases
where we need to be smarter.
One example is 'NON EMPTY [Level].Members' that always gets
translated into SQL even though the level only has two members
and they are in cache. Cost-based optimization would help
there.
Another example is where there are many layers of MDX functions —
say Filter on top of CrossJoin on top of Filter — and these could
be rolled into a single SQL statement. The right approach is to
build a SQL statement by accretion, but it is too expensive to do
every time the expression is evaluated.
Further, as we add more rules for recognizing MDX constructs that
can turn into SQL, we will reach decision points where we choose
to have to choose whether to apply rule A or rule B. Solutions
are (a) using costing to decide which rule to apply, and (b)
applying both rules and seeing which ultimately generates a
better outcome. Neither of these solutions are suitable for query
execution time: they need an optimization stage, as part of query
preparation.
It's ironic, considering I've been building SQL optimizers for
years (the first at Broadbase, and the second the optimizer for
the Eigenbase
project, which is used by both LucidDB and SQLstream) that I
have avoided giving Mondrian a true query optimizer for so long.
I know it's a lot of work to build an optimizer, and it's foolish
to start before you know what problem you need to solve.
Don't expect to see any changes in the short term; this kind of
architectural change doesn't happen fast. My struggle over the
past few weeks has been a big step in seeing the big picture, and
realize that the considerable pain and effort of unifying
Mondrian's query planning system is justified by the potential
benefits in performance.
Jan
08
2009