Tools and Techniques for Index Design Webinar Questions Followup

I presented a webinar this week to give an overview of Tools and Techniques for Index Design. Even if you missed the webinar, you can register for it, and you’ll be emailed a link to the recording.

I’d like to invite folks who are interested in tools for query optimization to attend the new Percona online virtual training Analyzing SQL Queries with Percona Toolkit, October 29 – November 2, 2012. See for details on this training class and how to register to attend.

During my webinar, I got a number of good questions. Here are the questions (paraphrased), with my answers.

Q: Can you link to the book on index optimization you mentioned in the webinar?

Sure!  The book is Relational Database Index Design and the Optimizers by Tapio Lahdenmäki and Michael Leach.

Also Baron Schwartz has written a good review of this book.

Q: Given that the book was written in 2005, before MySQL 5.1 and 5.5 were released, is it still relevant?

That book is designed to be RDBMS-neutral, and the concepts it covers apply to other products besides MySQL. The logical principles still hold true: using an index really means leveraging a pre-sorted data structure to reduce the work a query needs to do.  The basic data structure of an index hasn’t changed in recent releases.  So yes, the principles of this book are still relevant to current versions of MySQL.

Q: How do the upcoming changes in MySQL 5.6 change the best practices for index design?

Fundamentally, index design best practices will not change.  MySQL 5.6 is still in beta as of this writing, but the new version promises some exciting new ways that MySQL can use indexes, such as:

Q: Is it always the case that no second-star index exists for a query with both a GROUP BY and an ORDER BY clause?

Right; you’re stuck, unless your query groups by the same column as its sort column. You can optimize by adding the grouping column to your index, so all rows in a given group are read together. But doing so fixed the order in which rows are accessed, so if you need to sort by another column, it must collect the interim result set and sort it as a separate step. This is when we see “Using temporary; Using filesort” in the EXPLAIN report.

Hopefully, the size of this interim result set is relatively small, after the GROUP BY reduces it to one row per distinct value in the grouping column. Then sorting it may be more likely to happen in memory.

Q: Can you comment on the QUBE method for index estimation?

QUBE stands for Quick Upper-Bound Estimate. This is a method for estimating the theoretical cost of a query, by counting how many lookups to index entries or table rows will be necessary for a given query in the worst case. This method tries to quantify the cost of a query, but ultimately it arrives at similar conclusions as the star system: define an index with all columns in the predicate, add columns to avoid sorting per query, and add further columns to cover the select-list. The difference is that QUBE helps you to predict quantitatively how much a new index is likely to improve the query.

Q: Could you provide some best practices for partitioning and its effect on index design?

The most important consideration is that “every unique key on the table must use every column in the table’s partitioning expression.”  It might even be impossible to partition a table while using UNIQUE KEY constraints the way that makes most sense for your table.

Logically, your index design will be the same with a partitioned table as with a non-partitioned table. But since each partition is physically stored as a separate table, each index contains only values of its respective subset of the data. The index is more shallow and quicker to search. If your query has predicates that allow partition pruning to reduce the scope of the search, the index search will be even more efficient than if you were searching for the same rows in an non-partitioned table.

Thanks to folks who attended my webinar, and thanks for the questions!