
Reduce the write amplification of B+ tree

  • ScaleFlux 

This is the fourth part of our blog series on In-storage Transparent Compression: A Catalyst for Innovating Infrastructure Software. Part 4: Reducing the write amplification of B+ tree.

As the most widely used indexing data structure, the B+ tree powers almost all relational database management systems (RDBMSs) today (e.g., MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, and DB2). Recently, the log-structured merge tree (LSM-tree) has attracted significant interest as a contender to the B+ tree, mainly because of two widely cited advantages: lower storage cost and lower write amplification. Here write amplification is defined as the ratio between the total amount of data written to the storage device by the B+ tree or LSM-tree and the amount of user data written into the B+ tree or LSM-tree. For example, if a B+ tree writes a total of 8KB to the storage device to serve a user's write request of 128B, the write amplification is 8KB/128B = 64. In general, the write amplification of an LSM-tree can be at least 5~10x smaller than that of a B+ tree.

The arrival of in-storage transparent compression can straightforwardly reduce or even eliminate the storage cost gap between B+ tree and LSM-tree. For example, with a key-value pair size of 128B, we loaded a total of 150GB of key-value pairs into WiredTiger (the default storage engine of MongoDB, which uses a B+ tree) and RocksDB, both operating on our CSD 2000 drives that support in-storage transparent compression. The following table lists both the logical storage usage on the LBA space (i.e., before in-storage compression) and the physical storage usage (i.e., after in-storage compression). Since the LSM-tree is a more compact data structure than the B+ tree, RocksDB has a smaller logical storage footprint than WiredTiger (218GB vs. 280GB). Nevertheless, after in-storage transparent compression, WiredTiger consumes even less physical storage space than RocksDB, most likely due to the space amplification of the LSM-tree.
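The write-amplification arithmetic above is easy to sanity-check in a few lines (a minimal sketch; the function name is ours, not from any storage engine):

```python
def write_amplification(device_bytes_written, user_bytes_written):
    """Write amplification = total bytes written to the storage device
    divided by the bytes of user data written into the tree."""
    return device_bytes_written / user_bytes_written

# Example from the text: a B+ tree flushes a whole 8KB page
# to persist a single 128B user update.
print(write_amplification(8 * 1024, 128))  # 64.0
```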

Nevertheless, one cannot reduce the B+ tree vs. LSM-tree write amplification gap by simply deploying storage drives with built-in transparent compression. To further reduce or even close the write amplification gap, one must appropriately modify the B+ tree implementation to take advantage of in-storage transparent compression. This blog presents such a solution, motivated by a simple observation: for a B+ tree page, let D denote the difference between its in-memory image and on-storage image. If this difference is significantly smaller than the page size (a B+ tree page is typically 4KB, 8KB, or 16KB), we can largely reduce the write amplification by logging the page modification D, instead of writing the entire in-memory page image, to the storage device. This is conceptually the same as similarity-based data deduplication and delta encoding. Unfortunately, when a B+ tree runs on normal storage devices without built-in transparent compression, this simple idea is not practical due to significant operational overhead: given the 4KB block I/O interface, we must coalesce multiple D's from different pages into one 4KB LBA block to materialize the write amplification reduction. Accordingly, multiple D's associated with the same page will spread over multiple 4KB blocks on the storage device, which causes two problems: (1) For each page, the B+ tree must keep track of all its associated D's and periodically carry out garbage collection, leading to high storage management complexity. (2) To load a page from storage, the B+ tree must read the existing on-storage page image and multiple D's from multiple non-contiguous 4KB LBA blocks, leading to a long page load latency. Therefore, to the best of our knowledge, this simple design concept has not been used by real-world B+ tree implementations.
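To make the delta D concrete, here is an illustrative sketch that extracts the modified chunks of a page by comparing its on-storage and in-memory images at a fixed granularity (the function and the 128B granularity are our own illustration; real engines track dirty ranges in other ways):

```python
def page_delta(on_storage, in_memory, granularity=128):
    """Return the modification D as a list of (offset, data) pairs:
    the chunks where the in-memory page image differs from the
    on-storage image, compared at a fixed chunk granularity."""
    assert len(on_storage) == len(in_memory)
    delta = []
    for off in range(0, len(in_memory), granularity):
        chunk = in_memory[off:off + granularity]
        if chunk != on_storage[off:off + granularity]:
            delta.append((off, chunk))
    return delta

# An 8KB page with one modified 128B segment yields a 128B delta,
# far smaller than the 8KB full-page flush.
old = bytes(8192)
new = bytearray(old)
new[256:384] = b"\xff" * 128
d = page_delta(old, bytes(new))
print(len(d), d[0][0])  # 1 256
```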

Figure 1: Illustration of the page write when using (a) conventional design practice, and (b) using the proposed method that is enabled by in-storage transparent compression.

In-storage transparent compression for the first time makes the above simple idea practically viable. With the virtually variable-size I/O enabled by in-storage transparent compression, we no longer have to coalesce multiple D's from different pages into the same 4KB LBA block. For each B+ tree page, we dedicate one 4KB LBA block as a modification logging space to store all the D's associated with this page. When the B+ tree needs to persist an in-memory dirty page, instead of flushing the entire in-memory page image to storage, we only write its D's to the associated modification logging space. Since the total size of the D's inside each 4KB modification logging space is typically much smaller than 4KB, we achieve a much lower write amplification than flushing the entire in-memory page image. This is illustrated in Fig. 1: assume the B+ tree has a page size of 8KB and needs to persist one page at times t1, t2, and t3. Three 128B segments m1, m2, and m3 are modified during (0, t1), (t1, t2), and (t2, t3), respectively. As shown in Fig. 1(a), conventional design practice always flushes the entire 8KB page to the storage device each time, regardless of the exact amount of data modification, leading to a total write volume of 24KB. In comparison, as shown in Fig. 1(b), under the new design method enabled by in-storage transparent compression, at time t1 we write [m1, O] to the 4KB modification logging block, where O represents an all-zero vector that is compressed away inside the storage device. At time t2 we write [m1, m2, O] to the 4KB modification logging space, and at time t3 we write [m1, m2, m3, O].
Utilizing the virtually variable-size block I/O enabled by in-storage transparent compression, at time t1 we write 128B of useful data through the 4KB block I/O interface; at time t2 we write 2×128B = 256B; and at time t3 we write 3×128B = 384B. This leads to a total of 128B + 256B + 384B = 768B written to the physical storage media (after in-storage compression). Compared to the conventional practice shown in Fig. 1(a), this simple method reduces the B+ tree write amplification by a factor of 24KB/768B = 32.
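The zero-padded logging scheme and its arithmetic can be sketched as follows (our own idealized model: we approximate the post-compression size of each 4KB block by stripping its all-zero tail, which is exactly what an all-zero vector O compressing away amounts to):

```python
BLOCK = 4096  # one 4KB LBA block dedicated as a page's modification logging space

def build_log_block(mods):
    """Concatenate the accumulated modifications and zero-pad to one
    4KB block; the all-zero tail is what the drive's transparent
    compression removes before writing to the physical media."""
    payload = b"".join(mods)
    assert len(payload) <= BLOCK, "log full: fall back to a full-page flush"
    return payload + bytes(BLOCK - len(payload))

# Replay Fig. 1(b): three 128B modifications persisted at t1, t2, t3.
m1 = m2 = m3 = b"\x01" * 128
mods, physical = [], 0
for m in (m1, m2, m3):
    mods.append(m)
    block = build_log_block(mods)           # one 4KB logical write each time
    physical += len(block.rstrip(b"\x00"))  # idealized post-compression size
print(physical)              # 768 (= 128 + 256 + 384)
print(3 * 8192 // physical)  # 32: reduction vs. three full 8KB page flushes
```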

For demonstration, we implemented this method and compared it with RocksDB and WiredTiger. As mentioned above, RocksDB uses the LSM-tree data structure, while WiredTiger uses the B+ tree data structure. All experiments were carried out on our CSD 2000 drives that support built-in transparent compression. As discussed in our previous blog, WAL-induced write amplification can be significant under the flush-per-transaction-commit policy, so we disabled the WAL in all experiments to eliminate its impact on write amplification. We set the dataset size to 500GB and considered different record sizes (128B, 32B, and 16B). For the improved B+ tree and WiredTiger, we set the page size to 8KB. We ran random write-only workloads. As shown in Fig. 2, compared with RocksDB, the normal B+ tree (i.e., WiredTiger) has a much larger write amplification, while our improved B+ tree can essentially close the B+ tree vs. LSM-tree write amplification gap. For example, with a 32B record size and 4 client threads, the write amplification of RocksDB is 38, while that of WiredTiger is 268, i.e., 7.1x larger than RocksDB. In comparison, the write amplification of our improved B+ tree is only 73.7% of RocksDB's.

Figure 2: Measured write amplification.

The above results suggest that the arrival of in-storage transparent compression makes the two advantages of LSM-tree over B+ tree largely disappear. Meanwhile, compared with the LSM-tree, the B+ tree has many distinct advantages, such as higher read and scan performance and better support for concurrency and consistency. By mitigating the two major disadvantages of the B+ tree, in-storage transparent compression makes the B+ tree an (almost) perfect data structure. As storage hardware with built-in transparent compression becomes increasingly pervasive, the B+ tree will continue to serve as the data structure that empowers data management infrastructure.



ScaleFlux is the pioneer in deploying Computational Storage at scale. Computational Storage is the foundation for modern, data-driven infrastructure that enables responsive performance, affordable scaling, and agile platforms for compute and storage I/O intensive applications. Founded in 2014, ScaleFlux is a well-funded startup with leaders proven to deploy complex computing and solid-state storage solutions in volume.