Percona Acquires Tokutek : My Thoughts #3 : Fractal Tree Indexes

Last week I wrote up my thoughts about the Percona acquisition of Tokutek from the perspective of TokuDB and TokuMX[se]. In this third blog of the trilogy I'll cover the acquisition and the future of the Fractal Tree Index. The Fractal Tree Index is the foundational technology upon which all Tokutek products are built.



 So what is a Fractal Tree Index? To quote the Wikipedia page:
"a Fractal Tree index is a tree data structure that keeps data sorted and allows searches and sequential access in the same time as a B-tree but with insertions and deletions that are asymptotically faster than a B-tree." Fractal Tree Indexes are really cool, they enable the following capabilities in TokuDB and TokuMX[se]:

  • Great compression
  • High performance index maintenance
  • ACID, MVCC

Lastly, I think it's important to disclose that I worked at Tokutek for 3.5 years (08/2011 - 01/2015) as VP/Engineering and I do not have any equity in Tokutek or Percona.


Thoughts on Percona + Fractal Tree Indexes
 
Files, Files, Files

  • Currently, each Fractal Tree Index is stored in it's own file. The benefit of this approach is that dropping an index instantly returns all space used by the index to the filesystem and the execution of the drop-index operation is very fast. The downside is the number of files on a server can become overwhelming with a large number of tables/collections.
  • I think it would be a great feature to allow users the choice of file-per-index, file-per-table, and tablespaces.
  • There is a Jira ticket for the effort.

Competition

Online Backup

  • Tokutek's "hot backup" functionality is closed source and only provided with enterprise editions of TokuDB and TokuMX.
  • The other backup solution is file system snapshots.
  • I'm curious to see if Percona "open sources" hot backup or creates a different open source hot backup technology.

Compression

  • As far as I know, the Fractal Tree Index currently supports quicklz, zlib, and lzma compression libraries.
  • There are likely benefits to be had with Snappy, especially on fast storage.
    • BohuTANG from the community seems to have a it working.
  • It might be interesting to experiment with index-prefix-compression, as WiredTiger has done. WiredTiger claims both on-disk and in-memory space savings using this technique.

Checkpointing

  • Simply put, a checkpoint is a process by which a database gets to a known state on disk. Should the database server crash (or lose power) the recovery process requires starting from the checkpoint and playing forward all subsequent committed transactions.
  • By default Fractal Tree Indexes checkpoint every 60 seconds.
  • A checkpoint affects the server's performance in that it requires CPU, RAM, and IO.
  • I assume some effort will go toward reducing the impact of a checkpoint on the running system.

Code Complexity

  • I don't have specific numbers but I suspect that the Fractal Tree Index code base is more complicated than the other open source write-optimized storage engines.
  • I'm happy to be proven wrong about this if anyone wants to present their findings based on lines of code or other accepted code metrics.
  • On a side note WiredTiger has some very nice developer documentation, I'm not sure about the RocksDB documentation.

License

Tokutek’s patented Fractal Tree indexing technology is a result of ten years of research and development by experts in cache-oblivious algorithmics and is protected by multiple patents.

  • The Fractal Tree Index is licensed as GNU GPL v2 plus a "Patent Rights Grant".
  • Does the modified GPLv2 license affects potential users or developers?

Anti-use-cases

  • I'd like to highlight a three "soft spots" in Fractal Tree Indexing, as in areas where there is room for improvement or simply things to be aware of.
  • Leftmost deletion patterns
    • Fractal Tree Indexes contain large message buffers in the upper levels. These buffers provide much allow for IO "avoidance" of certain operations.
    • Consider a workload where 100 million rows are inserted using an auto-incrementing primary key, then 50 million rows are deleted.
    • At this point the Fractal Tree Index will likely have "buffered" the deletes (deletes are just small messages), the inserted data is still in the leaf nodes.
    • Queries against this deleted data will perform poorly, the extreme case is to restart your server and "select min(pk) from foo;" 
    • Partitioning is one way to deal with this pattern, rather than deleting the rows you'd merely drop them one partition at a time.
  • Random primary key inserts
    • Fractal Tree Indexes have a huge advantage, from an IO perspective, when maintaining non-unique indexes. Insert, update, and delete operations can be buffered.
    • However, unique indexes must be checked for uniqueness, and thus an IO is required unless the node holding the key is in memory.
    • Primary key indexes are always unique.
    • So Fractal Tree Indexes perform much like a standard B-tree when randomly inserting into a primary key index. When the data set is larger than RAM, each insert will require an IO to check for uniqueness.
  • Latency
    • Fractal Tree Indexes employ two techniques to achieve high compression
      • Large block size - The default is 64KB, and can be set higher.
      • Algorithms - LZMA > zlib > quicklz (and Snappy is likely coming soon)
    • Compression comes at a cost, latency.
    • The larger the node size, the longer the decompress operation.
    • The higher the compression, the longer the decompress operation.
    • Users are purchasing SSD/Flash for their IO performance, but they also want high compression because these devices are expensive.
    • At the moment it's complicated to determine the best combination of node size and compression algorithm, creating user-facing metrics will be helpful.

Human Resources

  • As with TokuDB and TokuMX[se], I'm curious to see how much Percona is  looking to grow the team. They've already posted on their jobs page for "C/C++ Developers for TokuDB, TokuMX, and Tokutek Products".
  • Prioritizing resources between Fractal Tree Indexes, TokuDB, TokuMX, and TokuMXse will be tricky.

 Please asks questions or comment below.