Scalable Star Schema Benchmark (SSB) Join Metrics


We ran a quick scalability test of Calpont join behavior across using a Star Schema Benchmark data set at a scale factor of 1000. The Star Schema Benchmark transforms a TPC-H / DBT-3 data to a more standardized data warehouse star schema data model, and the 1000 scale factor includes 6 billion rows in the primary fact table. Information on the star schema bench (SSB) can be found at http://www.cs.umb.edu/~xuedchen/research/publications/DataWarehousePerformanceDissertationProposal.pdf .

-----------------------------------------------------------------------------------------------------------------------------------------
-- Note that these queries are run without any tuning or indices created for these joins or filters.
-- Basically, this is just 1) Create tables (without index or partition declarations).
-- 2) Load tables, fact table was loaded one month at a time.
-- 3) Run queries.
-- In addition, there is no expectation that rows in different tables are co-located.
-- Therefore, result should be similar across a wide variety of join and predicate cases.
-----------------------------------------------------------------------------------------------------------------------------------------
 





The x axis show additional servers being included, the y axis is elapsed time in seconds (log).

 


Base table cardinality is 30 million customers, 10 million suppliers, 20 million parts, 2556 days, and 6 billion lineorders (actually 5,999,989,709).

The system under test includes between 1 and 8 performance modules (scalable component of Calpont architecture), each with two quad core CPUs at 2.0 GHz and 16 GB memory. This is run with a shared disk infrastructure and single fiber channel to each performance module. The system could also be configured with direct attached/on-board disk, or with dual fiber channel.

Filters used for each table and cardinality returned are included here:

-- 30 million rows, find 360k (about 1.2%)
select count(*) from customer where c_city in ('INDONESIA1','INDONESIA2','INDONESIA3');
+----------+
| count(*) |
+----------+
| 360736 |
+----------+

-- 10 million rows, find 120k (about 1.2%)
select count(*) from supplier where s_city in ('INDONESIA1','INDONESIA2','INDONESIA3');
+----------+
| count(*) |
+----------+
| 119731 |
+----------+

-- 20 million rows, find 240k (about 12%)
select count(*) from part where p_mfgr in ('MFGR#1','MFGR#4')
and p_category not in ('MFGR#12','MFGR#11','MFGR#13','MFGR#14','MFGR#41','MFGR#42','MFGR#43');
+----------+
| count(*) |
+----------+
| 239495 |
+----------+

-- 2556 rows, find 365 (about 14%)
select count(*) from dateinfo where d_year = 1997;
+----------+
| count(*) |
+----------+
| 365 |
+----------+

-- 6 billions rows, find 910 million (about 15%)
select count(*) from lineorder where lo_orderdate between 19970101 and 19971231;
+-----------+
| count(*) |
+-----------+
| 910240598 |
+-----------+

This test measures a 5 table join operation, as well as variations that remove one table from this query, yielding 4 distinct flavors. The first query is run with an empty cache, and then re-run. The remaining queries are run immediately afterwards.

-- --------------------------------------------------------
-- 5 tables in join
-- 360k, 120k, 240k, dates joined with 900 million
-- --------------------------------------------------------
select d_yearmonthnum, s_city, c_city, p_mfgr, sum(lo_extendedprice), count(*)
from customer, supplier, part, dateinfo, lineorder
where c_custkey = lo_custkey and
s_suppkey = lo_suppkey and
p_partkey = lo_partkey and
d_datekey = lo_orderdate and
lo_orderdate between 19970101 and 19971231 and
c_city in ('INDONESIA1','INDONESIA2','INDONESIA3') and
s_city in ('INDONESIA1','INDONESIA2','INDONESIA3') and
p_mfgr in ('MFGR#1','MFGR#4') and
p_category in ('MFGR#12','MFGR#11','MFGR#43') and
d_year = 1997
group by 1,2,3,4
order by 1,2,3,4;