Last week I wrote up my thoughts about the Percona acquisition of
Tokutek from the perspective of TokuDB and TokuMX[se]. In this third blog of the trilogy
I'll cover the acquisition and the future of the Fractal Tree
Index. The Fractal Tree Index is the foundational technology upon
which all Tokutek products are built.
So what is a Fractal Tree Index? To quote the Wikipedia page:
"a Fractal Tree index is a tree data structure that keeps data
sorted and allows searches and sequential access in the same time
as a B-tree but with insertions and deletions that are
asymptotically faster than a B-tree." Fractal Tree Indexes
are really cool, they enable the following capabilities in TokuDB
and TokuMX[se]:
- Great compression
- High performance index maintenance
- ACID, MVCC
Lastly, I think it's important to disclose that I
worked at Tokutek for 3.5 years (08/2011 - 01/2015) as
VP/Engineering and I do not have any equity in Tokutek or
Percona.
Thoughts on Percona + Fractal Tree Indexes
Files, Files, Files
- Currently, each Fractal Tree Index is stored in it's own file. The benefit of this approach is that dropping an index instantly returns all space used by the index to the filesystem and the execution of the drop-index operation is very fast. The downside is the number of files on a server can become overwhelming with a large number of tables/collections.
- I think it would be a great feature to allow users the choice of file-per-index, file-per-table, and tablespaces.
- There is a Jira ticket for the effort.
Competition
- Things are heating up in the write-optimized storage engine space.
- This raises two concerns:
- Choice is great for the customer, but how much room is there for these technologies to differentiate from each other?
- Facebook is backing RocksDB and MongoDB is backing WiredTiger, both of these companies have vast resources. It's as much about the marketing as it is the technology.
Online Backup
- Tokutek's "hot backup" functionality is closed source and only provided with enterprise editions of TokuDB and TokuMX.
- The other backup solution is file system snapshots.
- I'm curious to see if Percona "open sources" hot backup or creates a different open source hot backup technology.
Compression
- As far as I know, the Fractal Tree Index currently supports quicklz, zlib, and lzma compression libraries.
- There are likely benefits to be had with Snappy, especially
on fast storage.
- BohuTANG from the community seems to have a it working.
- It might be interesting to experiment with index-prefix-compression, as WiredTiger has done. WiredTiger claims both on-disk and in-memory space savings using this technique.
Checkpointing
- Simply put, a checkpoint is a process by which a database gets to a known state on disk. Should the database server crash (or lose power) the recovery process requires starting from the checkpoint and playing forward all subsequent committed transactions.
- By default Fractal Tree Indexes checkpoint every 60 seconds.
- A checkpoint affects the server's performance in that it requires CPU, RAM, and IO.
- I assume some effort will go toward reducing the impact of a checkpoint on the running system.
Code Complexity
- I don't have specific numbers but I suspect that the Fractal Tree Index code base is more complicated than the other open source write-optimized storage engines.
- I'm happy to be proven wrong about this if anyone wants to present their findings based on lines of code or other accepted code metrics.
- On a side note WiredTiger has some very nice developer documentation, I'm not sure about the RocksDB documentation.
License
- From the Tokutek website
Tokutek’s patented Fractal Tree indexing technology is a result of ten years of research and development by experts in cache-oblivious algorithmics and is protected by multiple patents.
- The Fractal Tree Index is licensed as GNU GPL v2 plus a "Patent Rights Grant".
- Does the modified GPLv2 license affects potential users or developers?
Anti-use-cases
- I'd like to highlight a three "soft spots" in Fractal Tree Indexing, as in areas where there is room for improvement or simply things to be aware of.
- Leftmost deletion patterns
- Fractal Tree Indexes contain large message buffers in the upper levels. These buffers provide much allow for IO "avoidance" of certain operations.
- Consider a workload where 100 million rows are inserted using an auto-incrementing primary key, then 50 million rows are deleted.
- At this point the Fractal Tree Index will likely have "buffered" the deletes (deletes are just small messages), the inserted data is still in the leaf nodes.
- Queries against this deleted data will perform poorly, the extreme case is to restart your server and "select min(pk) from foo;"
- Partitioning is one way to deal with this pattern, rather than deleting the rows you'd merely drop them one partition at a time.
- Random primary key inserts
- Fractal Tree Indexes have a huge advantage, from an IO perspective, when maintaining non-unique indexes. Insert, update, and delete operations can be buffered.
- However, unique indexes must be checked for uniqueness, and thus an IO is required unless the node holding the key is in memory.
- Primary key indexes are always unique.
- So Fractal Tree Indexes perform much like a standard B-tree when randomly inserting into a primary key index. When the data set is larger than RAM, each insert will require an IO to check for uniqueness.
- Latency
- Fractal Tree Indexes employ two techniques to achieve
high compression
- Large block size - The default is 64KB, and can be set higher.
- Algorithms - LZMA > zlib > quicklz (and Snappy is likely coming soon)
- Compression comes at a cost, latency.
- The larger the node size, the longer the decompress operation.
- The higher the compression, the longer the decompress operation.
- Users are purchasing SSD/Flash for their IO performance, but they also want high compression because these devices are expensive.
- At the moment it's complicated to determine the best combination of node size and compression algorithm, creating user-facing metrics will be helpful.
- Fractal Tree Indexes employ two techniques to achieve
high compression
Human Resources
- As with TokuDB and TokuMX[se], I'm curious to see how much Percona is looking to grow the team. They've already posted on their jobs page for "C/C++ Developers for TokuDB, TokuMX, and Tokutek Products".
- Prioritizing resources between Fractal Tree Indexes, TokuDB, TokuMX, and TokuMXse will be tricky.
Please asks questions or comment below.