UFS has been the main filesystem on Solaris until the arrival of ZFS in 2004. UFS has been around since the SunOS 4.x days and is still quite widely used on Solaris. Although OpenSolaris uses ZFS by default, many database users have mostly used UFS.
UFS is a buffered filesystem because by default it uses free memory to cache data. While buffering is useful for a large number of applications, database users have generally stayed away from it by using DirectIO. If you are using UFS with MySQL, you have the option of using UFS buffered or UFS DirectIO. In this blog I will try to describe some of the reasoning behind why UFS DirectIO should be used with MySQL.
Buffered UFS Limitations (with MySQL/InnoDB)
UFS buffers filesystem data using 8K pages (4K on x86) in the buffer cache. Innodb stores records in 16K size pages and issues reads of size 16k. This corresponds to two UFS page reads of 8K. The current read-ahead logic in UFS is triggered when it sees a sequential read of 2 UFS pages. The read ahead then issues large (1MB) reads. Since MySQL (like other databases) does random IO, this large read is unnecessary. In many cases , this large read will overwhelm the storage system leading to large service times, and ultimately bad performance.
DirectIO to the rescue
To get around this issue, we can use DirectIO. Directio is an overloaded term that basically means
- The filesystem does not cache data (read & write)
- If possible, data is transferred directly between the applications memory & device.
One of the main benefits of using UFS DirectIO is the avoidance of the single writer lock. POSIX dictates that writes are atomic i.e multiple writers to a same block need to be queued1. When using DirectIO this constraint is relaxed and it is up to the databases to ensure atomicity. Additionally, the copying of data directly to/from the applications memory saves CPU cycles, making transactions more efficient. However, using DirectIO means also that storage engines like MyISAM, that relies on the filesystem buffer cache, will be negatively affected. Storage engines like Innodb, will not be able to benefit from write coalescing.
For writes, Innodb provides an option of using regular writes followed by a fsync() at transaction commit time (the default), or using O_SYNC writes. Many customer use "regular writes followed by fsync()" because there is a possibility of better performance because of write coalescing. InnoDB also provides an unsafe option where, it does not flush the writes at the end of the transaction, but rather flushes it every second. The benefit of write coalescing is much higher here, but it comes at the cost of data integrity -- a crash can leave your database in an inconsistent state. For customers relying on this feature, using DirectIO will negate the effect of the write coalescing.
Luckily Innodb provides an option (innodb_flush_method=O_DIRECT) where directio is only enabled for the datafiles. Log files still use the default flush method. Using this option gives you the best of both worlds -- Use DirectIO, but still benefit from write coalescing for the logs. You still lose write coalescing for the datafiles, but since those writes happen asynchronously (the query does not wait for the datafile write to complete), it is less critical for performance. If you want to use DirectIO for the logs also, you can mount the filesystem with the forcedirectio option.
In our tests with Sysbench, we have observed that the penalty introduced by UFS read ahead is much more than the benefit of write coalescing.
How does this apply to ZFS
In case you are wondering about ZFS, it does not suffer from the main limitations of UFS, and as a result it has less need for DirectIO (which is not yet supported on ZFS). ZFS does not suffer from the main reasons behind why DirectIO is so beneficial with UFS as itdoes not have the single writer lock problem. ZFS uses range locks to ensure multiple writes are able to update different parts of the file simultaneously. It also has a highly scalable cache (called the ARC). I will blog about ZFS and MySQL some other time.
1Update 2/10/09: FrankB (UFS developer) gently reminded me that the POSIX behavior for concurrent writes is undefined and the single writer lock is an implementation artifact.