It is the third part of our serial blogs on In-storage Transparent Compression: A Catalyst for Innovating Infrastructure Software. Part 3: Reduce the write amplification of write-ahead logging (WAL).
Write-ahead logging (WAL) is almost universally used by database and filesystem to ensure operational atomicity and data durability. Its basic concept is quite simple: Data modifications are persisted to a dedicated log area before being applied to the data files. In typical implementations, data modifications are first accumulated as log records in an in-memory WAL buffer and later are flushed to the on-storage WAL file. To maximize the data durability, systems execute the memory-to-storage flush (using commands like fsync or fdatasync) at every transaction commit. To reduce the log-induced storage cost, conventional practice always tightly packs log records into the in-memory WAL buffer and hence the on-storage WAL file. As a result, multiple consecutive WAL buffer flushes may write to the same 4KB LBA block on the storage device, especially when transaction records are significantly smaller than 4KB and/or the workload concurrency is not high. This can be illustrated in Fig. 1: Suppose three transactions TRX-1, TRX-2, and TRX-3 (with log records L1, L2, and L3) commit at the time t1, t2, and t3, respectively. As illustrated in Fig. 1, at the time t1, 4KB data [L1, O] is flushed from the in-memory WAL buffer to the LBA 0x0001 on the storage device, where O represents an all-zero vector. Later, the log record L2 is appended into the WAL buffer, and at the time t2, the 4KB data [L1, L2, O] is flushed to the same LBA 0x0001 on the storage device. Similarly, at the time t3, the 4KB data [L1, L2, L3, O] is flushed to the same LBA 0x0001 on the storage device. As illustrated in Fig. 1, the same log record (e.g., L1 and L2) are written to the NAND flash memory multiple times, leading to large write amplification. It is well known that a large write amplification directly leads to a shorter flash memory lifetime and lower storage IOPS performance. One could reduce the WAL-induced write amplification by simply disabling the flush-per-transaction-commit (e.g., invoking memory-to-storage flush every minute), which however can no longer guarantee the data durability of recently committed transactions.
Leveraging the virtually variable-size block I/O enabled by in-storage transparent compression, one could apply a technique called sparse logging to significantly reduce the WAL-induced write amplification, without sacrificing the data durability. Its basic idea is simple: At each transaction commit, we always pad zeros into the in-memory WAL buffer to make its content 4KB-aligned. As a result, the next log record will be written into a new 4KB space in the WAL buffer. Therefore, each log record will be written to the NAND flash memory only once, leading to a lower write amplification compared with the conventional practice. This can be further illustrated in Fig. 2: Assuming the same scenario as shown above in Fig. 1, after the transaction TRX-1 commits at the time t1, we pad zeros into the WAL buffer and flush the 4KB data [L1, O] to the LBA 0x0001 on the storage device. Subsequently, we put the next log record L2 in a new 4KB space in the WAL buffer. At the time t2, the 4KB data [L2, O] is flushed to a new LBA 0x0002 on the storage device. Similarly, at the time t3, the 4KB data [L3, O] is flushed to another new LBA 0x0003 on the storage device. Clearly, each log record is written to the NAND flash memory only once. Meanwhile, all the zeros in each sparse 4KB block are compressed away by the in-storage transparent compression. This leads to a smaller WAL-induced write amplification, compared with the conventional practice.
For demonstration, we implemented the sparse logging method and compared its write amplification with the the WAL-induced write amplification in RocksDB and Wiredtiger (the default storage engine of MongoDB). All the experiments were carried out on our CSD 2000 drives that implement built-in transparent compression. We set the flush policy as per-transaction-commit (i.e., flush the dirty in-memory WAL buffer data to on-storage WAL file at every transaction commit). Fig. 3 shows the measured WAL-induced write amplification, where we considered different log record size (i.e., 128B, 32B, and 16B) and different number of threads that issue concurrent writes. The results show that the sparse logging method can reduce the WAL-induced write amplification by over 10x. For example, under 16B record size and a single client thread, WAL-induced write amplification in RocksDB and WiredTiger is 138 and 128, while the simple idea of sparse logging could make the WAL-induced write amplification drop to only 4, representing over 30x reduction.