We ran a quick scalability test of Calpont join behavior across
using a Star Schema Benchmark data set at a scale factor of 1000.
The Star Schema Benchmark transforms a TPC-H / DBT-3 data to a
more standardized data warehouse star schema data model, and the
1000 scale factor includes 6 billion rows in the primary fact
table. Information on the star schema bench (SSB) can be found at
http://www.cs.umb.edu/~xuedchen/research/publications/DataWarehousePerformanceDissertationProposal.pdf
.
-----------------------------------------------------------------------------------------------------------------------------------------
-- Note that these queries are run without any tuning or indices
created for these joins or filters.
-- Basically, this is just 1) Create tables (without index or
partition declarations).
-- 2) Load tables, fact table was loaded one month at a
time.
-- 3) Run queries.
-- In addition, there is no expectation that rows in different
tables are co-located.
-- Therefore, result should be similar across a wide variety of
join and predicate cases.
-----------------------------------------------------------------------------------------------------------------------------------------
The x axis show additional servers being included, the y axis is
elapsed time in seconds (log).
Base table cardinality is 30 million customers, 10 million
suppliers, 20 million parts, 2556 days, and 6 billion lineorders
(actually 5,999,989,709).
The system under test includes between 1 and 8 performance
modules (scalable component of Calpont architecture), each with
two quad core CPUs at 2.0 GHz and 16 GB memory. This is run with
a shared disk infrastructure and single fiber channel to each
performance module. The system could also be configured with
direct attached/on-board disk, or with dual fiber channel.
Filters used for each table and cardinality returned are included
here:
-- 30 million rows, find 360k (about 1.2%)
select count(*) from customer where c_city in
('INDONESIA1','INDONESIA2','INDONESIA3');
+----------+
| count(*) |
+----------+
| 360736 |
+----------+
-- 10 million rows, find 120k (about 1.2%)
select count(*) from supplier where s_city in
('INDONESIA1','INDONESIA2','INDONESIA3');
+----------+
| count(*) |
+----------+
| 119731 |
+----------+
-- 20 million rows, find 240k (about 12%)
select count(*) from part where p_mfgr in
('MFGR#1','MFGR#4')
and p_category not in
('MFGR#12','MFGR#11','MFGR#13','MFGR#14','MFGR#41','MFGR#42','MFGR#43');
+----------+
| count(*) |
+----------+
| 239495 |
+----------+
-- 2556 rows, find 365 (about 14%)
select count(*) from dateinfo where d_year = 1997;
+----------+
| count(*) |
+----------+
| 365 |
+----------+
-- 6 billions rows, find 910 million (about 15%)
select count(*) from lineorder where lo_orderdate between
19970101 and 19971231;
+-----------+
| count(*) |
+-----------+
| 910240598 |
+-----------+
This test measures a 5 table join operation, as well as
variations that remove one table from this query, yielding 4
distinct flavors. The first query is run with an empty cache, and
then re-run. The remaining queries are run immediately
afterwards.
-- --------------------------------------------------------
-- 5 tables in join
-- 360k, 120k, 240k, dates joined with 900 million
-- --------------------------------------------------------
select d_yearmonthnum, s_city, c_city, p_mfgr,
sum(lo_extendedprice), count(*)
from customer, supplier, part, dateinfo, lineorder
where c_custkey = lo_custkey and
s_suppkey = lo_suppkey and
p_partkey = lo_partkey and
d_datekey = lo_orderdate and
lo_orderdate between 19970101 and 19971231 and
c_city in ('INDONESIA1','INDONESIA2','INDONESIA3') and
s_city in ('INDONESIA1','INDONESIA2','INDONESIA3') and
p_mfgr in ('MFGR#1','MFGR#4') and
p_category in ('MFGR#12','MFGR#11','MFGR#43') and
d_year = 1997
group by 1,2,3,4
order by 1,2,3,4;