Learn from Dr Tong Zhang, ScaleFlux Chief Scientist, how the use of transparent compression in the flash drive controller and FTL combines to improve the operation of the drive and offset the lower endurance and performance attributes of QLC NAND.
QLC NAND flash memory
NAND flash memory has a long history of applying the multi-bits-per-cell technique to reduce its bit cost, and most solid-state drive (SSD) users are already familiar with the terms MLC (2 bits/cell), TLC (3 bits/cell), and QLC (4 bits/cell). Now it is rumored that PLC (5 bits/cell) NAND flash memory is coming soon. Each time the number of bits per cell increases by one, the flash memory operational noise margin degrades by almost 2x, which inevitably leads to lower cycling endurance and longer flash memory read/write latency. Not surprisingly, therefore, every time NAND flash memory marched from n bits/cell to (n+1) bits/cell, it was greeted with widespread concern and skepticism about its applicability to enterprise-grade SSDs, which demand high cycling endurance and high random IOPS (I/O operations per second). Those concerns were consistently overcome by the commercial success of enterprise-grade MLC SSDs and TLC SSDs over the years. As NAND flash memory manufacturers ramp up QLC products, the same concerns and skepticism are naturally rising again. Will history repeat itself, or is the situation different this time? Indeed, compared with MLC and TLC, there are much stronger reasons to question the viability of enterprise-grade QLC SSDs:
- Cycling endurance: QLC NAND flash memory offers ~1,000 program/erase cycles, in contrast to the ~5,000 cycles of its TLC counterpart. With 4x intra-SSD write amplification under full-LBA-span random write workloads, 1,000-cycle endurance translates into less than 0.14 DWPD (drive writes per day) over 5 years (a worked sketch of this arithmetic follows this list), which clearly raises concerns about QLC’s viability for many enterprise and data center users.
- Random IOPS: As the most important enterprise-grade SSD performance metric, random IOPS is fundamentally limited by the minimum of (i) the frontend I/O interface bandwidth and (ii) the backend aggregate flash memory data access bandwidth (see the second sketch after this list). The latter further depends on the intra-SSD architectural parallelism (e.g., the number of channels and flash memory dies) and the per-die data access latency. Today most TLC SSDs still operate over a PCIe Gen3 x4 interface whose bandwidth is typically lower than the backend flash memory bandwidth. Hence their random IOPS is limited by the PCIe Gen3 x4 interface, which explains why 3.2TB and 6.4TB NVMe SSDs on today’s market have almost the same random IOPS even though the 6.4TB SSDs have twice as many flash memory dies (and hence higher architectural parallelism). This also explains why SSD random IOPS was not impacted by the MLC-to-TLC transition, even though MLC flash memory has shorter read/write latency than its TLC counterpart. However, because of the longer read/write latency of QLC flash memory (especially the write latency), even under the same PCIe Gen3 x4 interface, the TLC-to-QLC transition shifts the limiting factor of random IOPS from the frontend I/O interface bandwidth to the backend QLC flash memory data access bandwidth. As a result, unlike the MLC-to-TLC transition, the TLC-to-QLC transition will cause noticeable SSD random IOPS degradation. The arrival of the PCIe Gen4 I/O interface will make the random IOPS gap between TLC SSDs and QLC SSDs even larger.
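To make the endurance arithmetic concrete, here is a minimal sketch (in Python) of how cycling endurance maps to DWPD, using the figures assumed above (~1,000 P/E cycles, 4x write amplification, a 5-year lifetime). It is illustrative only, not a drive-qualification model.

```python
def dwpd(pe_cycles, write_amplification, years):
    """Drive writes per day sustainable over the given lifetime.

    pe_cycles:           NAND program/erase cycle endurance
    write_amplification: physical bytes written per host byte written
    """
    days = years * 365
    # Total host writes allowed (in units of full-drive writes) = pe_cycles / WA,
    # spread evenly over the drive's lifetime:
    return pe_cycles / (write_amplification * days)

# QLC under full-LBA-span random writes at standard 7% over-provisioning (WA ~ 4x):
print(dwpd(pe_cycles=1_000, write_amplification=4.0, years=5))   # ~0.14 DWPD
# TLC (~5,000 cycles) under the same assumptions, for comparison:
print(dwpd(pe_cycles=5_000, write_amplification=4.0, years=5))   # ~0.68 DWPD
```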
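The random IOPS limit described in the second bullet can likewise be expressed as a min() of the two bandwidths. The sketch below uses hypothetical, placeholder bandwidth numbers (not measured figures from any specific drive) purely to show how the bottleneck shifts from the host interface to the flash backend when per-die throughput drops.

```python
def random_iops_limit(frontend_gb_s, dies, per_die_mb_s, io_size_kib=4):
    """Random IOPS bound: min(frontend interface bandwidth,
    aggregate backend flash bandwidth) divided by the I/O size."""
    backend_gb_s = dies * per_die_mb_s / 1000
    limit_gb_s = min(frontend_gb_s, backend_gb_s)
    return limit_gb_s * 1e9 / (io_size_kib * 1024)

# Hypothetical TLC drive on PCIe Gen3 x4 (~3.5 GB/s usable): the backend exceeds
# the interface, so random IOPS is interface-bound.
print(random_iops_limit(frontend_gb_s=3.5, dies=128, per_die_mb_s=40))
# Hypothetical QLC drive with slower dies on the same interface: the backend is
# now the bottleneck, so random IOPS drops despite the identical interface.
print(random_iops_limit(frontend_gb_s=3.5, dies=128, per_die_mb_s=20))
```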
As further illustrated in Fig. 1, the above discussion suggests that, despite a ~30% cost saving over TLC SSDs, QLC SSDs could have a very hard time competing with their TLC older brothers in the enterprise market. So, should we simply write off QLC flash memory from the enterprise market? Not so fast. We believe QLC can still play an important role in the enterprise market, and the key is to close the endurance/performance gap via transparent compression.
Fig. 2 illustrates the architecture of SSDs with built-in transparent compression: the SSD internally carries out hardware-based, high-speed data compression/decompression on the I/O path, transparently to the host. Despite pervasive data compressibility, many applications and systems do not explicitly compress their data, in order to avoid compression-induced CPU overhead and system performance degradation; this is particularly true for enterprise-grade applications (e.g., relational databases) that are dominated by random data access patterns.
CSD 2000: 2nd-generation computational storage drive (CSD)
ScaleFlux recently launched its 2nd-generation computational storage drive (CSD) product, the CSD 2000, which integrates transparent compression and uses the strong GZIP compression algorithm to maximize the compression ratio. By seamlessly exploiting runtime data compressibility, the CSD 2000 can improve various cost/performance metrics without any changes to the existing software ecosystem. Now let’s see how transparent compression can make QLC practically viable for the enterprise market:
- Using transparent compression to improve endurance: It is well known that one can reduce SSD write amplification by increasing the intra-SSD storage space over-provisioning. For example, by increasing the over-provisioning from the standard 7% to 33%, we can reduce the write amplification from 4x to 1.6x. In addition to reducing the storage cost, transparent compression can opportunistically create higher storage space over-provisioning, which contributes to lowering write amplification and hence improving the DWPD. Suppose we format a QLC CSD 2000 drive that contains 8TB of NAND flash memory as 16TB of usable storage capacity, and the average runtime data compression ratio is 2.5:1; transparent compression then enables the CSD 2000 drive to internally operate at 33% over-provisioning (i.e., 1.6x write amplification). Since the data volume of each write request is reduced by 2.5x via transparent compression, the overall write amplification of the CSD 2000 drive is around 0.65 (a worked sketch of this arithmetic follows this list). Under the same 1,000-cycle endurance of QLC flash memory, the CSD 2000 drive can sustain 0.46 DWPD over 5 years while offering 16TB of drive storage capacity. In short, transparent compression not only reduces the storage cost by 2x (i.e., 16TB usable storage capacity with only 8TB of physical flash memory), but also improves the DWPD by over 3x (i.e., from 0.14 to 0.46).
- Using transparent compression to improve random IOPS: By directly compressing data on the I/O path, the CSD 2000 effectively amplifies the backend flash memory access bandwidth, as illustrated in Fig. 3 (and in the second sketch after this list). In the context of TLC flash memory and a PCIe Gen3 interface, SSD random IOPS is mainly limited by the frontend I/O interface bandwidth, so an amplified backend flash memory access bandwidth does not translate into higher random IOPS. In contrast, the TLC-to-QLC transition makes SSD random IOPS limited by the backend flash memory access bandwidth. Therefore, by amplifying the backend flash memory access bandwidth, transparent compression can directly improve the random IOPS of QLC SSDs. As a result, the QLC CSD 2000 can achieve random IOPS much closer to that of normal TLC SSDs. Upon the arrival of PCIe Gen4, SSD random IOPS becomes largely determined by the backend flash memory access bandwidth. Accordingly, transparent compression can fully exploit the runtime data compressibility to amplify the backend flash memory access bandwidth and hence improve random IOPS performance for both TLC and QLC.
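Here is a minimal sketch of the endurance arithmetic in the first bullet above, assuming the figures quoted there (8TB of physical flash exposed as 16TB usable, a 2.5:1 average compression ratio, and write amplification falling from ~4x at 7% over-provisioning to ~1.6x at 33%). The exact DWPD depends on the over-provisioning-to-write-amplification mapping one assumes, so the result is a ballpark rather than a precise figure.

```python
def dwpd_with_compression(pe_cycles, physical_tb, usable_tb,
                          compression_ratio, wa_after_compression, years=5):
    """DWPD relative to the *usable* capacity of a drive with transparent compression.

    compression_ratio:    average host-data compression ratio (2.5 means 2.5:1)
    wa_after_compression: write amplification experienced by the compressed data
    """
    days = years * 365
    # Physical flash bytes written per host byte written:
    effective_wa = wa_after_compression / compression_ratio      # 1.6 / 2.5 ~= 0.65
    total_host_writes_tb = physical_tb * pe_cycles / effective_wa
    return total_host_writes_tb / (usable_tb * days)

# Baseline QLC drive: no compression, 7% over-provisioning, WA ~ 4x.
print(dwpd_with_compression(1_000, physical_tb=8, usable_tb=8,
                            compression_ratio=1.0, wa_after_compression=4.0))  # ~0.14
# CSD 2000-style drive: 8TB flash formatted as 16TB, 2.5:1 compression,
# effective 33% over-provisioning bringing WA down to ~1.6x.
print(dwpd_with_compression(1_000, physical_tb=8, usable_tb=16,
                            compression_ratio=2.5, wa_after_compression=1.6))  # ~0.4-0.5
```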
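The bandwidth-amplification effect in the second bullet can be folded into the same min() model used in the earlier sketch: transparent compression multiplies the effective backend bandwidth by the compression ratio. Again, the bandwidth figures are hypothetical placeholders, chosen only to show where the bottleneck sits under Gen3 versus Gen4.

```python
def random_iops_limit(frontend_gb_s, backend_gb_s, compression_ratio=1.0, io_size_kib=4):
    """Random IOPS bound when transparent compression amplifies the backend:
    min(frontend interface bandwidth, backend bandwidth * compression ratio)."""
    effective_backend_gb_s = backend_gb_s * compression_ratio
    return min(frontend_gb_s, effective_backend_gb_s) * 1e9 / (io_size_kib * 1024)

GEN3_X4, GEN4_X4 = 3.5, 7.0   # hypothetical usable GB/s per host interface
QLC_BACKEND = 2.5             # hypothetical aggregate QLC backend GB/s

# QLC without compression: backend-bound, even on the slower Gen3 interface.
print(random_iops_limit(GEN3_X4, QLC_BACKEND))
# QLC with 2.5:1 transparent compression: the backend is amplified to ~6.25 GB/s,
# so the drive becomes interface-bound again on Gen3 ...
print(random_iops_limit(GEN3_X4, QLC_BACKEND, compression_ratio=2.5))
# ... while on Gen4 the amplified backend bandwidth is exposed almost directly.
print(random_iops_limit(GEN4_X4, QLC_BACKEND, compression_ratio=2.5))
```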
As further illustrated in Fig. 4, the above discussion suggests that, in addition to reducing the QLC SSD vs. HDD cost gap, transparent compression can also effectively tackle the two obstacles (i.e., cycling endurance and random IOPS) that have prevented QLC flash memory from being deployed for enterprise applications.
Dr Tong Zhang is the Chief Scientist and Co-Founder of ScaleFlux. He is also a professor at Rensselaer Polytechnic Institute (RPI), specializing in databases, filesystems, and data storage technologies.