Getting more from less is a common theme across corporate functional teams. The IT and Infrastructure & Operations teams we engage with frequently highlight this challenge. While the exact terminology varies across companies and industries, the problem comes down to the same thing: efficiency in the face of increasing demands on the IT infrastructure.
Examples of what we hear from IT teams
The demands on my servers and network are increasing, but I don’t have…
- Budget to just scale out more of what I’m using today
- Enough power in the rack to just scale up
- Enough space in my datacenter to scale out (or enough budget to go add another datacenter)
- Enough OpEx budget to push everything to the cloud
- Enough people to handle complex architectures
- In-house developers to re-write applications
I need help!
When we dig further into these teams’ architectures and operating rules, we find that rethinking compression can be a significant tool in alleviating the pains of growing demands, not just on storage but on the rest of the infrastructure as well.
Step 1: Why bother compressing data?
The first step in rethinking compression is to define clearly why you are compressing data in the first place. The two most obvious reasons are:
1. To reduce the storage capacity you need to buy
2. To reduce the bandwidth consumed to transfer the data across the network
If those were the only considerations, simply compressing all data before sending it to storage (whether that’s SSD, HDD, or tape) and before sending it across the network would be enough to meet those goals. However, digging one layer deeper brings up other considerations: how the data will be used, what it costs to perform the compression, how much throughput you need, and how the act of compression affects SLAs, system efficiency, and user experience (all aspects of performance). Security is another angle: how does your choice of compression tie into your data security needs?
Step 2: Choosing a compression solution that suits your goals and constraints.
Selecting a compression solution is a critical decision. Each aspect of the solution comes with tradeoffs. So, consider these questions in assessing which solution is optimal for your situation:
1. Which compression algorithm?
2. What compression granularity?
3. Do I use software or hardware compression?
4. Where should I perform the compression function?
5. How do my security & compression choices interact?
6. How do I manage the effective capacity?
Which compression algorithm?
There are many choices for compression algorithms and levels within them. Common choices for general-purpose, lossless compression include LZ4, GZIP, ZSTD, and ZLIB (GZIP and ZLIB are both built on the DEFLATE algorithm). LZ4 is a “lightweight” algorithm. It does not achieve as much space savings, measured by compression ratio (the ratio of original data size to compressed data size), as GZIP, ZSTD, and ZLIB. But it is less computationally intensive, meaning that a CPU core can perform LZ4 compression significantly faster than it can run the “heavyweight” algorithms.
The tradeoff: the performance and cost of doing the compression vs. the space saved on disk. In an experiment run in ScaleFlux’s labs, the compression throughput per core with LZ4 was up to 13x the throughput per core with ZLIB Level 6. However, LZ4’s compression ratio was much lower: 1.4:1 vs. 2:1. (Note: the test data set was the Canterbury Corpus files.)
Many similar comparisons are available online, such as:
- Smaller and faster data compression with Zstandard – Engineering at Meta (fb.com)
- Squash Compression Benchmark (quixdb.github.io)
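To get a feel for the ratio-versus-throughput tradeoff on your own data, a rough sketch like the one below can help. It uses the zlib module from Python's standard library at three levels as a stand-in for "lighter" and "heavier" settings, since LZ4 and ZSTD need third-party packages. The sample generator, the function names, and the single-run timing are illustrative assumptions, not the setup used in the lab experiment described above.

```python
import random
import time
import zlib


def make_sample(size: int = 1 << 20, seed: int = 0) -> bytes:
    """Build repeatable, text-like sample data (a stand-in for a real corpus)."""
    rng = random.Random(seed)
    words = [b"server", b"network", b"storage", b"rack", b"budget",
             b"latency", b"throughput", b"database", b"compress", b"window"]
    out = bytearray()
    while len(out) < size:
        out += rng.choice(words) + b" "
    return bytes(out[:size])


def measure(data: bytes, level: int) -> tuple[float, float]:
    """Return (compression ratio, throughput in MB/s) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero timer delta
    ratio = len(data) / len(compressed)      # original size : compressed size
    throughput = len(data) / elapsed / 1e6   # one core, one run; not a rigorous benchmark
    return ratio, throughput


if __name__ == "__main__":
    sample = make_sample()
    for level in (1, 6, 9):
        ratio, mbps = measure(sample, level)
        print(f"zlib level {level}: ratio {ratio:.2f}:1, {mbps:.0f} MB/s")
```

Swapping `make_sample()` for a file read from your own workload gives numbers that are far more relevant than any synthetic data. Treat the throughput figures as rough indications only, since a single timed run is sensitive to whatever else the machine is doing.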
What is Compression Granularity?
Here, we’re talking about how much data you feed into the compressor at one time. You may hear this referred to as the “compression window” or “look back window.”
Using larger chunks of data in each compression operation can result in a higher compression ratio. This can be great for compressing files to be shipped across the network or for long term storage on networked storage arrays.
Using smaller chunks of data per compression operation may sacrifice some space savings. The advantage of smaller compression windows comes when reading the data back. High-performance applications, such as transactional databases, access smaller blocks of data at a time (4, 8, and 16KB being common I/O sizes). These applications benefit from aligning the compression window with their read I/O sizes to avoid read amplification.
For example, let’s say the compression window was set to 1MB to maximize storage savings. In order for the database to pull the 8KB of data that it needs for an operation, the system would need to read back and decompress the entire 1MB just to access that relevant 8KB. Operating in this manner is inefficient. So, again there’s a tradeoff to be made between space savings and other system-level metrics.
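The read amplification in this example can be sketched in a few lines. The code below compresses a buffer in independent chunks of a given window size, then serves an 8KB read by decompressing only the chunks that overlap the request. It uses Python's built-in zlib module; `compress_chunks` and `read_range` are hypothetical helper names for illustration, and a real storage system would also need an index of where each compressed chunk lives on disk.

```python
import zlib


def compress_chunks(data: bytes, window: int) -> list[bytes]:
    """Compress each `window`-sized chunk independently so it can be read alone."""
    return [zlib.compress(data[i:i + window]) for i in range(0, len(data), window)]


def read_range(chunks: list[bytes], window: int,
               offset: int, length: int) -> tuple[bytes, int]:
    """Return (requested bytes, total bytes decompressed to serve the request)."""
    first = offset // window                 # first chunk overlapping the request
    last = (offset + length - 1) // window   # last chunk overlapping the request
    plain = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    start = offset - first * window
    return plain[start:start + length], len(plain)


if __name__ == "__main__":
    KB = 1024
    data = bytes(range(256)) * (4 * KB)      # 1 MB of sample data
    for window in (8 * KB, 1024 * KB):
        chunks = compress_chunks(data, window)
        payload, touched = read_range(chunks, window, offset=40 * KB, length=8 * KB)
        assert payload == data[40 * KB:48 * KB]
        print(f"window {window // KB} KB: decompressed {touched // KB} KB "
              f"to serve 8 KB ({touched // (8 * KB)}x read amplification)")
```

With the window aligned to the 8KB read, only the 8KB that was asked for gets decompressed. With the 1MB window, the same read decompresses 128 times as much data as the application needs.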
Figure 2 below shows an example of the impact of the compression window on the total data reduction. We ran the Canterbury Corpus data set through various compression algorithms with two different compression windows (8KB and 16KB). The 16KB compression window yielded slightly more data reduction.
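A comparison along these lines is easy to try on your own data. The sketch below compresses the same buffer in independent 8KB and 16KB chunks and reports the overall compression ratio for each window. It assumes Python's built-in zlib module at its default level and a synthetic, text-like buffer rather than the Canterbury Corpus, so the numbers it prints will differ from those in the figure; `chunked_ratio` is a hypothetical helper name.

```python
import random
import zlib


def chunked_ratio(data: bytes, window: int, level: int = 6) -> float:
    """Overall compression ratio when each `window`-sized chunk is compressed alone.

    Expects non-empty data.
    """
    compressed = sum(
        len(zlib.compress(data[i:i + window], level))
        for i in range(0, len(data), window)
    )
    return len(data) / compressed            # original size : total compressed size


if __name__ == "__main__":
    rng = random.Random(0)                   # fixed seed so runs are repeatable
    words = [b"server", b"network", b"storage", b"rack", b"budget",
             b"latency", b"throughput", b"database"]
    data = b" ".join(rng.choice(words) for _ in range(150_000))
    for window_kb in (8, 16):
        ratio = chunked_ratio(data, window_kb * 1024)
        print(f"{window_kb} KB window: {ratio:.2f}:1")
```

Because every chunk starts with an empty history and carries its own header, smaller windows generally give the compressor less to work with, which is the effect the figure illustrates.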
In summary, improving IT infrastructure efficiency is a constant challenge. Compression can be a valuable tool in dealing with this challenge. The choices of compression algorithm and compression window have tradeoffs between performance and total data reduction achieved. In the next installment, we’ll discuss additional decisions to make in selecting your compression solution (and the tradeoffs associated with those decisions).
Stay tuned for the next chapter…