This is the first of a series of posts about business intelligence tools, particularly OLAP (or online analytical processing) tools using MySQL and other free open source software. OLAP tools are a part of the larger topic of business intelligence, a topic that has not had a lot of coverage on MPB. Because of this, I am going to start out talking about these topics in general, rather than getting right to gritty details of their performance.
I plan on covering the following topics:
- Introduction to OLAP and business intelligence. (this post)
- Identifying the differences between a data warehouse, and a data mart.
- Introduction to MDX queries and the kind of SQL which a ROLAP tool must generate to answer those queries.
- Performance challenges with larger databases, and some ways to help performance using aggregation.
- Using materialized views to automate that aggregation process.
- Comparing the performance of OLAP with and without aggregation over multiple MySQL storage engines at various data scales.
What is BI?
Chances are that you have heard the term business intelligence.
Business intelligence (or BI) is a term which encompasses many
different tools and methods for analyzing data, usually
presenting it in a way that is easily consumed by upper
management. This analysis is often used to determine how
effectively the business has been at meeting certain performance
goals, and to forecast how they will do in the future. To put it
another way the tools are designed to provide insight about the
business process, hence the name. Probably the most popular BI
activity for web sites is click analysis.
As far as BI is concerned, this series of posts focuses on OLAP analysis and in a lesser sense, on data warehousing. Data warehouses often provide the information upon which OLAP analysis is performed, but more on this in post #2.
OLAP? What is that?
OLAP is an acronym which stands for online analytical processing.
OLAP analysis, which is really just another name for
multidimensional analysis, consists of displaying summary
aggregations of the data broken down into different groups. A
typical OLAP analysis might show "sale total, by year, by sales
rep, by product category". OLAP analysis is usually used for
reporting on current data, looking at historical trends and
trying to make predictions about future trends.
Multidimensional Analysis
Multidimensional analysis is a form of statistical analysis. In
multidimensional analysis samples representing a particular
measure are compared or broken down into different
dimensions. For example, in a sales analysis, the "sale
amount" is a measure. Measures are always aggregated
values. That is, total sales might be expressed as SUM(sale_amt).
This is because the SUM of the individual sales will be grouped
along different dimensions, such as by year or by product. I'm
getting a little ahead of myself. Before we talk about measures
and dimensions, we should talk about the two ways in which this
information can be stored.
There are two main ways to store multidimensional data
for OLAP analysis
OLAP servers typically come in two basic flavors. Some servers
have specialized data stores which store data in a form which is
highly effective for multidimensional analysis. These servers are
termed MOLAP and they tend to have exceptional performance due to
their specialized data store. Almost all MOLAP solutions
pre-compute many (or even all) of the possible answers to
multi-dimensional queries. Palo
is an example of an open source version of this technology.
ESSbase is an example of closed source product.
MOLAP servers often feature extensive compression of data which
can improve performance. Loading data into a MOLAP server usually
takes a very long time because many of the answers in the cube
must be calculated. The extra time spent during the load is
usually called "processing" time.
A relational OLAP (or ROLAP) server uses data stored in an RDBMS. These systems trade the performance of a multidimensional store for the convenience of an RDBMS. These servers almost always query over a database which is structured as a STAR or snowflake type schema. To go back to the sales analysis example above, in a STAR schema the facts about the sales would be stored in the fact table, and the list of customers and products would be stored in separate dimension tables. Some ROLAP servers support the aggregation of data into additional tables, and can use the tables automatically. These servers can approach the performance of MOLAP with the convenience of ROLAP, but there are still challenges with this approach. The biggest challenges are the amount of time that it takes to keep the tables updated and in the complexity of the many scripts or jobs which might be necessary to keep the tables in sync. Part five of my series will introduce materialized views which attempt to address these challenges in a manageable way.
What makes a ROLAP so great?
An OLAP server usually returns information to the user as a
'pivot table' or 'pivot report'. While you could
create such a report in a spreadsheet, the ROLAP tool is designed
to deal with millions or even billions of rows of data, much more
than a spreadsheet can usually handle. MOLAP servers usually
require that all, or almost all of the data must fit it memory.
Another difference is the ease by which this analysis is
constructed. You don't necessarily have to write queries or drag
and drop a report together in order to analyze multidimensional
data using an OLAP tool.
Data summarized in pivot form:
ROLAP tools use star schema
As I said before, a sale amount would be considered a measure,
and it would usually be aggregated with SUM. The other
information about the sale, such as the product, when it was sold
and to whom it was sold would be defined in dimension tables. The
fact table contains columns which are joined to the dimension
tables, such as product_id and customer_id. These are often
defined as foreign keys from the fact table to the dimension
tables.
A note about degenerate dimensions:
Any values in the fact table that don't join to dimensions are
either considered degenerate dimensions or measures. In the
example below the status of the order is a degenerate dimension.
A degenerate dimension is stored as an ENUM in many cases. In the
example below that there is no actual dimension table which
includes the two different order statuses. Such a dimension would
add an extra join, which is expensive. Any yes/no field and/or
fields with a very low cardinality (such as gender or order
status) will probably be stored in the fact table instead of in a
dedicated dimension. In the "pivot data" example above, all the
dimensions are degenerate: gender, region, style, date.
Example star schema about sales.
Often a dimension will include redundant information to make
reporting easier, a process called "denormalization".
Hierarchical information may be stored in a single dimension. For
example, a dimension for products may include both the category
AND a sub-category. A time dimension includes year, month and
quarter. You can create multiple different hierarchies from a
single dimension. This allows 'drill down' into the dimension. By
default the data would be summarized by year, but you can drill
down to quarter or month level aggregation.
The screenshots here in the jPivot (an OLAP cube browser) documentation can give you a better idea about the display of data. The examples break down sales by product, by category, and by region.
The information is presented in such a fashion that it can be "drilled into" and "filtered on" to provide an easy to use interface to the underlying data. Graphical display of the data as pie, line or bar charts is possible.
Focusing on ROLAP.
This is the MySQL performance blog, and as such an in depth
discussion of MOLAP technology is not particularly warranted
here. Our discussion will focus on Mondrian. Mondrian is an open source ROLAP server
featuring an in-memory OLAP cache. Mondrian is part of the
Pentaho open source business intelligence suite.
Mondrian is also used by other projects such as Wabit and
Jaspersoft.
If you are using open source BI then you are probably already
using Mondrian. Closed source ROLAP servers include Microstrategy, Microsoft Analysis Services and Oracle BI.
Mondrian speaks MDX, olap4j and XML for analysis. This means that there is a very high chance that your existing BI tools (if you have them) will work with it. MDX is a query language that looks similar to SQL but is actually very different. Olap4j is an OLAP interface for java applications. XML for analysis (XMLA) is an industry standard analytical interface originally created by Microsoft, SAS and Hyperion.
Whats next?
Next we'll talk about the difference between data marts and data
warehouses. The former are usually used for OLAP analysis, but
they can be fundamentally related to a warehouse.
Entry posted by Justin Swanhart | 2 comments