A direct approach to planning and operating next-gen SSDs. NVMe SSDs with computational storage technology install and work like any other SSD, but there are a few things to know to take full advantage of their capabilities at scale.
Learn more by reading our Guide.
Good morning and welcome to learning how to deploy computational storage with your friends here at ScaleFlux. You may have noticed that I look a little different from my profile photo. I’m not Mat. I am J.B. Baker, the VP of Product Management. I’m filling in for Mat, who was called away to customer meetings.
Today we’re going to go through:
- a level set on “what is computational storage”;
- some of the terminology that we’re going to use;
- a demo of how to configure extended capacity;
- the interactions and tradeoffs between utilization, performance, and endurance;
- a demo of how capacity utilization plays out in the drives;
- and a wrap-up with Q&A.
So, just a quick reminder up front of “what is computational storage.” You can always go to snia.org for the official definition. A computational storage drive (CSD), which is what ScaleFlux is focused on, integrates compute capabilities directly into the drive. It offloads some work from the CPU and host DRAM and performs that work right down in the drive, with the intent of improving overall system efficiency. This is all about getting more out of your IT spend and your power budget: more capacity, more performance, more endurance or lifespan, and more uptime (or, if that is your focus, less downtime through fewer points of failure).
Where can you deploy computational storage?
To level set, here are some of the general storage and SSD terms that we’ll use in the demos:
- capacity utilization or fill rate: This is the percentage of your drives’ physical storage capacity that you use before saying, “Hey, you know what, I don’t want to fill this drive anymore. I’m going to deploy more storage or write elsewhere.”
- write amplification: This is an SSD term for the amount of data actually written to the NAND for every gigabyte of data sent from the host. We’ll draw this out for you in a couple of minutes.
- namespace: an NVMe term. This is effectively a logical partition on your NVMe SSD.
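To make the write amplification definition concrete, here is a minimal Python sketch that computes the WA factor from two counters covering the same interval. It assumes you can read host bytes written (for example, from the standard NVMe SMART / Health log’s Data Units Written field, where one unit is 1,000 × 512 bytes) and NAND bytes written, which typically comes from a vendor-specific log. The counter values in the example are hypothetical.

```python
# Minimal sketch: write amplification (WA) = NAND bytes written / host bytes written.


def data_units_to_bytes(data_units: int) -> int:
    """Convert an NVMe SMART 'Data Units Written' value to bytes.

    Per the NVMe spec, one data unit is 1,000 units of 512 bytes.
    """
    return data_units * 1000 * 512


def write_amplification(host_bytes: int, nand_bytes: int) -> float:
    """Return the WA factor over an interval; 1.0 means no extra NAND writes."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return nand_bytes / host_bytes


if __name__ == "__main__":
    host = data_units_to_bytes(2_000_000)  # hypothetical SMART reading (~1.02 TB)
    nand = 4_608_000_000_000               # hypothetical vendor-log NAND counter
    print(f"WA = {write_amplification(host, nand):.2f}")
```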
Terms that are more specific to computational storage include:
- extended capacity or capacity multiplier (as we refer to it on the marketing side): this is formatting the drive to a logical capacity that exceeds its physical user capacity. In other words, setting the drive up to store more data than the physical amount of NAND on the drive.
- thin-provisioned namespace or extended namespace: this ties right back into extended capacity. It is a namespace whose logical capacity is larger than the physical capacity behind it. In NVMe terms, the namespace size (NSZE) is greater than the namespace capacity (NCAP).
- transparent compression in the drives: data compression that happens without the host or the application needing to take any action to initiate or trigger it. Compression happens automatically as data is sent to the drive: the controller compresses the data before writing it to the NAND and decompresses it on reads. No integration, library substitution, OS modifications, application modifications, or user commands are needed to make this happen. Hence, we consider it “transparent.”
- IT infrastructure management and monitoring tools: server, storage, and networking monitoring and management tools, such as Nagios and Prometheus, that let you allocate resources and track how your infrastructure is performing.
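The extended capacity and thin-provisioned namespace definitions come down to simple block arithmetic. This hypothetical Python sketch derives a namespace size (NSZE) from a capacity multiplier. The 4096-byte block size, the choice to set NCAP to the drive’s physical user capacity, and the 2x multiplier are all illustrative assumptions, not ScaleFlux-specified values.

```python
# Hypothetical sizing for an extended-capacity (thin-provisioned) namespace.
# Assumptions: 4096-byte logical blocks, NCAP set to the drive's physical user
# capacity, and a multiplier chosen from your own compressibility estimate.

BLOCK_SIZE = 4096  # bytes per logical block (assumed LBA format)


def extended_namespace(physical_bytes: int, multiplier: float,
                       block_size: int = BLOCK_SIZE) -> dict:
    """Return NSZE (logical size) and NCAP (physical backing), both in blocks."""
    if multiplier < 1:
        raise ValueError("a multiplier below 1 would shrink the namespace")
    ncap = physical_bytes // block_size  # whole blocks of physical capacity
    nsze = int(ncap * multiplier)        # logical blocks exposed to the host
    return {"nsze": nsze, "ncap": ncap, "thin": nsze > ncap}


if __name__ == "__main__":
    ns = extended_namespace(3_840_000_000_000, multiplier=2.0)  # 3.84 TB drive
    print(ns)
    print(f"logical size: {ns['nsze'] * BLOCK_SIZE / 1e12:.2f} TB")
```

A multiplier of 1.0 yields an ordinary, fully provisioned namespace (NSZE equal to NCAP).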
Okay, that’s it for the intro. Let’s go to Mat’s recorded demo of how to configure extended capacity, or a thin-provisioned namespace.
[Watch the demo after 00:06:44:14]
As the demo shows, that was quick. It’s simple and straightforward to set up the drive in extended capacity mode, or with a thin-provisioned namespace, and it uses only standard commands from the NVMe command set. Once it’s set up, you do have to start monitoring one extra parameter, which is potentially the one change to your process. Because you’ve created a namespace that is larger than the physical capacity of the drive, you need to monitor namespace usage: the portion of the physical capacity assigned to that namespace that is in use. That is in addition to monitoring free space in your file system.
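The setup and the extra monitoring step can be sketched in Python around the Linux nvme-cli tool. This is a dry-run sketch under stated assumptions: it only builds the `create-ns`, `attach-ns` and `ns-rescan` command lines (with an optional `delete-ns` for reformatting, which destroys data) and computes NUSE / NCAP from `nvme id-ns -o json` output. The device paths, NSID, controller ID, alert thresholds, and the assumption that NUSE on an extended namespace tracks physical (post-compression) usage are hypothetical; check your drive’s documentation before relying on them.

```python
import json
import shlex
import subprocess


def namespace_commands(ctrl: str, nsze: int, ncap: int, flbas: int = 0,
                       cntlid: int = 0, nsid: int = 1,
                       replace_existing: bool = False) -> list[list[str]]:
    """Build (but do not run) nvme-cli commands for a thin-provisioned namespace.

    Assumptions: create-ns assigns `nsid` (it prints the real NSID, so check),
    and `cntlid` is the controller ID reported by `nvme id-ctrl`.
    """
    cmds = []
    if replace_existing:
        # Migrate data off the namespace BEFORE this step: it destroys the data.
        cmds.append(["nvme", "delete-ns", ctrl, f"--namespace-id={nsid}"])
    cmds += [
        ["nvme", "create-ns", ctrl, f"--nsze={nsze}", f"--ncap={ncap}", f"--flbas={flbas}"],
        ["nvme", "attach-ns", ctrl, f"--namespace-id={nsid}", f"--controllers={cntlid}"],
        ["nvme", "ns-rescan", ctrl],
    ]
    return cmds


def read_id_ns(ns_dev: str) -> str:
    """Return `nvme id-ns` output as JSON text for a namespace block device."""
    result = subprocess.run(["nvme", "id-ns", ns_dev, "-o", "json"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def namespace_fill(id_ns_json: str) -> float:
    """NUSE / NCAP: share of the namespace's assigned capacity in use (assumed)."""
    ns = json.loads(id_ns_json)
    return int(ns["nuse"]) / int(ns["ncap"])


def fill_status(fill: float, warn: float = 0.80, crit: float = 0.90) -> str:
    """Map a fill fraction to an alert level; thresholds are placeholders."""
    if fill >= crit:
        return "CRITICAL"
    if fill >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for cmd in namespace_commands("/dev/nvme0", nsze=1_875_000_000, ncap=937_500_000):
        print(shlex.join(cmd))  # dry run: print instead of executing
    sample = '{"nsze": 1875000000, "ncap": 937500000, "nuse": 800000000}'
    fill = namespace_fill(sample)
    print(f"{fill:.1%} used -> {fill_status(fill)}")
```

Printing the commands rather than executing them keeps the sketch safe to run; a monitoring plug-in would call `read_id_ns` on a schedule and feed `namespace_fill` into its alerting.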
Moving on, let’s talk about capacity utilization. ScaleFlux ran a couple of polls on this topic earlier this year, and over 60% of respondents said they cap their capacity utilization at 70% or lower. If you set your threshold at 70%, then for every gigabyte of data you store, you have to deploy about 1.43 gigabytes of capacity. You’re paying roughly 43% more per gigabyte of data because you’re reserving 30% of your capacity as free space. That reserve addresses the concerns highlighted in the second poll: 68% of respondents cited performance, reliability, and endurance as the key drivers for setting their fill rates.
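The overhead of a fill-rate cap is just 1/fill. Here is a minimal Python sketch of that arithmetic for a few example thresholds.

```python
def capacity_to_deploy(data_gb: float, max_fill: float) -> float:
    """Physical capacity needed to store data_gb without exceeding max_fill."""
    if not 0 < max_fill <= 1:
        raise ValueError("max_fill must be in (0, 1]")
    return data_gb / max_fill


if __name__ == "__main__":
    for fill in (0.5, 0.7, 0.9):
        deployed = capacity_to_deploy(1.0, fill)
        print(f"fill cap {fill:.0%}: deploy {deployed:.2f} GB per GB stored "
              f"({deployed - 1:.0%} extra)")
```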
Now, we absolutely agree that those are valid concerns. There is a very tight interaction between utilization rate, performance, and endurance. I’m going to try to draw this out for you to explain this relationship.
These curves are well established; you can look them up in various research papers. On the y axis, we have write amplification. On the x axis, we have over-provisioning (OP), or how much free space the drive has set aside. This curve also maps to overall utilization. When free space is very low, write amplification is very high: it’s an asymptotic curve that rises rapidly below 7% OP and flattens out beyond 28%. I’ll point out three key points (please excuse me for not drawing this to scale): 7% OP, 28% OP, and 100% OP. I chose 7% because it lines up well with most of the one drive write per day (1 DWPD) class drives out there, and it’s a kink in the curve: go lower than this and write amplification really starts to climb. At 7% OP, write amplification is around 4.5. So for every gigabyte of data you write to the drive, the NAND sees four and a half gigabytes written over its lifetime as the drive moves data around and does garbage collection. 28% lines up well with three drive writes per day (3 DWPD) class drives; here, write amplification is somewhere around 2 to 2.5, depending on the drive. At 100% OP, there’s always free space to write to, and write amplification is about one or a little over one.
The write amp ties directly into your performance and endurance. I’m going to draw the opposite graph here, the mirror image. This red line is your write performance and also your endurance with my lovely chicken scratch handwriting. It’s really a mirror image of the write amp curve. So, the higher your write amp, the worse your write performance is going to be because the drive is writing more on the back end versus what you’re trying to send from the host. Also, the worse your endurance is going to be because for every gigabyte of data, you’re consuming more of the lifetime of the NAND.
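As a first-order approximation, every host byte costs WA bytes of NAND writes, so both the lifetime host-write budget and the sustained host write bandwidth (when NAND programming is the bottleneck) scale as 1/WA. The Python sketch below applies that to the approximate whiteboard points; 2.25 is an assumed midpoint of the 2 to 2.5 range, and the NAND endurance and bandwidth figures are hypothetical.

```python
# First-order model: lifetime host writes and sustained host write bandwidth
# both scale as 1/WA. WA values approximate the whiteboard points (2.25 is an
# assumed midpoint of 2 to 2.5); NAND budget and bandwidth are hypothetical.

WA_BY_OP = {0.07: 4.5, 0.28: 2.25, 1.00: 1.0}


def host_write_budget_tb(nand_budget_tb: float, wa: float) -> float:
    """Lifetime host writes the NAND can absorb at a given write amplification."""
    return nand_budget_tb / wa


def host_write_bandwidth(nand_write_mbps: float, wa: float) -> float:
    """Sustained host write bandwidth when NAND programming is the bottleneck."""
    return nand_write_mbps / wa


if __name__ == "__main__":
    NAND_BUDGET_TB = 9_000   # hypothetical total NAND write endurance
    NAND_WRITE_MBPS = 3_600  # hypothetical aggregate NAND program bandwidth
    for op, wa in WA_BY_OP.items():
        print(f"OP {op:>4.0%}  WA {wa:<4}  "
              f"host TBW {host_write_budget_tb(NAND_BUDGET_TB, wa):>6.0f}  "
              f"host MB/s {host_write_bandwidth(NAND_WRITE_MBPS, wa):>5.0f}")
```

Real drives are also limited by controller, interface, and garbage-collection scheduling, so this only captures the direction and rough scale of the effect.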
Now, how does this relate to utilization rate? If we flip low and high on the x axis, the graph shows what happens at various utilization rates of the physical capacity. At about 50% utilization, you’re at 100% OP: you’ve used 50% of the drive and have 50% free, so the free space equals 100% of the used space. As utilization rises, you move down the write performance curve and start to crater both write performance and endurance. So all those concerns about endurance, performance, and reliability at higher utilization are valid.
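The flip between utilization and OP can be written as OP = (1 - u) / u, where u is the used fraction of physical capacity and OP is free space relative to used space. A small Python sketch of both directions follows. It treats user-visible free space as effective over-provisioning and ignores the drive’s built-in spare area, which is a simplification.

```python
def op_from_utilization(u: float) -> float:
    """Effective OP = free space / used space, for utilization u in (0, 1]."""
    if not 0 < u <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return (1 - u) / u


def utilization_from_op(op: float) -> float:
    """Inverse mapping: the utilization at which free/used equals op."""
    if op < 0:
        raise ValueError("op must be non-negative")
    return 1 / (1 + op)


if __name__ == "__main__":
    for op in (0.07, 0.28, 1.00):
        print(f"OP {op:.0%} <-> utilization {utilization_from_op(op):.1%}")
    print(f"50% utilization -> OP {op_from_utilization(0.5):.0%}")
```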
But when the drive compresses data, that data consumes less physical NAND, so the drive keeps more free space for the same amount of stored data. That lets it maintain high performance and better endurance even at very high utilization rates. Now that I’ve explained the theory, we’ll show you a demo of this running on a ScaleFlux drive and another NVMe SSD.
[2nd demo after 00:14:58:29]
As Mat demonstrated, you can get significantly higher throughput with a CSD: about three times the write IOPS at about one third of the latency, thanks to compression right in the drive, even at a very high storage utilization rate. So transparent compression is a key way to mitigate the concerns raised about write performance, endurance, reliability, and application performance.
We also promised in our abstract to talk about endurance. An endurance demo would take longer than we have time for in this video, because it can take weeks to show the full impact. We’re going to work up a time-lapse version of it.
But, as I drew on the whiteboard, write amplification and NAND utilization interact. When we compress the data, we not only reduce the amount of data written initially (for example, at a 2:1 compression ratio, two gigabytes of host data become one gigabyte on that initial, or “hot,” write), but we also reduce the write amplification on the back end, that is, the “cold” writes that happen during garbage collection. We’ll follow up with a short video on that later.
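To see why compression helps at high utilization, the Python sketch below computes physical utilization and effective OP for a given compression ratio, plus the NAND bytes for the initial write. It assumes every block compresses at the same ratio and ignores mapping and metadata overhead; the drive and data sizes are hypothetical.

```python
# Sketch assuming a uniform compression ratio and no metadata overhead.


def physical_utilization(data_gb: float, physical_gb: float, ratio: float) -> float:
    """Fraction of physical NAND consumed by data_gb of host data."""
    if physical_gb <= 0 or ratio <= 0:
        raise ValueError("physical_gb and ratio must be positive")
    return data_gb / ratio / physical_gb


def effective_op(data_gb: float, physical_gb: float, ratio: float) -> float:
    """Free space relative to used space after compression."""
    u = physical_utilization(data_gb, physical_gb, ratio)
    if u > 1:
        raise ValueError("data does not fit at this compression ratio")
    if u == 0:
        return float("inf")
    return (1 - u) / u


def hot_write_nand_gb(host_gb: float, ratio: float) -> float:
    """NAND written on the initial (hot) write, before any garbage collection."""
    return host_gb / ratio


if __name__ == "__main__":
    PHYSICAL_GB, DATA_GB = 3840, 3072  # hypothetical: 80% logically full
    for ratio in (1.0, 2.0, 3.0):
        u = physical_utilization(DATA_GB, PHYSICAL_GB, ratio)
        print(f"{ratio:.0f}:1  physical use {u:.0%}  effective OP "
              f"{effective_op(DATA_GB, PHYSICAL_GB, ratio):.0%}")
    print(f"2 GB host write at 2:1 -> {hot_write_nand_gb(2, 2.0):.0f} GB to NAND")
```

Higher effective OP moves the drive toward the flat, low-WA end of the write amplification curve, which is where the reduction in back-end (cold) writes comes from.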
At this point, configuring extended capacity and using transparent compression should look pretty easy.
If you’re curious about how to evaluate computational storage for your use case and what else to factor into your thought process, the first step is to be selective about which SSD you choose, rather than just taking whatever SSD the OEM delivers. Many users think, “All these SSDs are fast, right? So it doesn’t matter which one I buy.” As we just showed, that’s not really true anymore. Next, consider how much complexity this will add to your system. Hopefully we’ve shown here that it’s fairly simple to configure and deploy. Customers who have deployed extended capacity tell us that once they got past their initial fears in that first meeting and deployed it, they found it very easy to integrate into their server management. We also offer plug-ins for Nagios and Prometheus. If you use another management tool, work with us; we may be able to provide a plug-in for that as well.
Just realize that your choice of SSD can have a real impact on the overall ROI of your data center. With that reduction in write amplification, you can get significantly longer life out of your SSDs, potentially extending the life of some of your infrastructure beyond the usual 3 to 5 years and into the seven-year range. CSDs can also support your other initiatives, whether those are around sustainability, power efficiency, modernizing your database, or reducing how much space you consume to deliver your workloads.
In order to test this in your environment, you can reach out to us at www.scaleflux.com or hit us up on LinkedIn and we’ll be happy to discuss with you further.
- How do I know what to set the drive to or how compressible is my data?
- We get this one all the time. It’s pretty easy to estimate how compressible your data is, and there are a couple of ways to do it. You could take a sample of your data set and run it through GZIP, which gives you a pretty good estimate of what you would see with a ScaleFlux drive. We also have a tool we can send you to run on a sample of your data set. You’re not sharing the data with us; it stays on-prem. In general, the feedback we get is that on databases, the typical compression ratio is anywhere from 2:1 to 5:1, so people have been very comfortable extending capacity to 2x to 3x.
- What do I need to do to install the drives? Is there a ScaleFlux driver?
- No, there’s no ScaleFlux driver. Our prior generations of product were FPGA-based and did use a ScaleFlux driver. But from the 3000 onward, we use a ScaleFlux-designed SoC (system-on-chip), and with that move we transitioned to the in-box NVMe drivers. So there is no software or driver to install to use the ScaleFlux drive: you plug it in, and it gets recognized as an NVMe SSD.
- If I want to reformat a drive to a different capacity, how do I do that?
- As Mat showed, you create a new namespace, and all of this is managed with standard NVMe commands. First, migrate your data off the drive. Then delete the existing namespace and create one or more new namespaces at the desired capacity. Just to reiterate: that is no different from what you would do with any other NVMe SSD.
- What happens to my data if I run out of physical space on the drive?
- Good question. We’ve had some people intentionally try to break things to see what happens! This is exactly why you need to monitor the namespace capacity and NUSE (namespace utilization) information on your drive, so that you don’t run out of space. We do offer a couple of options for the drive’s behavior as it approaches full: we can throttle the drive’s performance as it nears full so that you don’t get a surprise, and we can dip into the over-provisioning so that, if you run over by a little, you can still issue commands to reclaim space and clear out the drive.
- Do all drives need to be configured the same, or can I mix and match?
- You can absolutely configure drives to different capacities. Each drive you deploy can be configured to a different capacity, and you can mix and match ScaleFlux drives with other vendors’ drives.
- How do I not run off the road?
- Well, that’s essentially the same question as “how do I avoid running out of space?” You need to monitor the NUSE attribute on your drive.
- If one of my drives fails and I have to swap it, how do I know what configuration to use if extended capacity was enabled?
- You’ll have to rely on the information you have, for example in your file system, about how big a namespace you created on that drive. When you install a new ScaleFlux drive, configure it to at least the same capacity point. If you’re buying another vendor’s drive for some reason, you’ll need at least enough physical capacity to match that overall namespace, so you may have to buy a bigger drive.
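The compressibility answer in this Q&A suggests running a sample of your data through GZIP. Here is a minimal Python sketch of that estimate. It compresses the sample in fixed 4 KiB chunks, an assumption meant to approximate per-block compression in a drive rather than ScaleFlux’s actual algorithm or block size, and it counts any chunk that gzip would enlarge at its raw size (assuming the drive stores incompressible blocks as-is). Treat the result as a rough estimate, not a prediction.

```python
import gzip
import sys


def compression_ratio(data: bytes, chunk_size: int = 4096, level: int = 6) -> float:
    """Estimate uncompressed/compressed size by gzipping fixed-size chunks.

    Each chunk is compressed independently (assumed to mimic per-block
    compression); a chunk that gzip would enlarge is counted at its raw size.
    """
    if not data:
        raise ValueError("sample is empty")
    compressed_total = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        packed = gzip.compress(chunk, compresslevel=level)
        compressed_total += min(len(packed), len(chunk))
    return len(data) / compressed_total


def read_sample(path: str, max_bytes: int = 256 * 1024 * 1024) -> bytes:
    """Read up to max_bytes from the start of a file (a simple, biased sample)."""
    with open(path, "rb") as f:
        return f.read(max_bytes)


if __name__ == "__main__" and len(sys.argv) > 1:
    ratio = compression_ratio(read_sample(sys.argv[1]))
    print(f"estimated compression ratio: {ratio:.2f}:1")
```

Run it against a copy of a representative data file. Compressing the whole sample at once (a much larger `chunk_size`) usually reports a higher ratio than per-block compression, so the chunked figure is the more conservative of the two.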
I encourage you to visit us at www.scaleflux.com, where we have additional information about the drives and many other resources on how to use and configure the capacity, including a nice white paper written by our senior director of Technical Marketing.
All right. Well, thank you for joining us this morning and have a great day.