Redis Compression
Why compress data in Redis
Redis does not natively include any mechanism for compressing data. Since all data is stored in memory, efficient memory usage is crucial to avoid the need for large instances. However, some lists in Xalok, such as *:publish:current_lists, have a quadratic cost (roughly the number of news items times the number of lists), which results in high memory consumption. It therefore becomes necessary to find a mechanism to compress this information.
ZSTD Dictionary Compression
Zstandard (ZSTD) is a modern compression algorithm designed to balance high compression ratios with fast decompression speeds. One of its key features is dictionary-based compression, which makes it particularly effective for datasets with a lot of repeated patterns, like the *:publish:current_lists lists in Xalok.
Key Benefits of ZSTD:
- Efficient Memory Usage: Redis stores all data in memory, so reducing the memory footprint is critical. ZSTD's dictionary compression excels at handling repetitive data patterns by referencing a pre-built dictionary that contains common sequences.
- High Compression Ratio: Lists like *:publish:current_lists can have a quadratic memory cost due to a large number of repeated items across multiple lists. ZSTD can dramatically reduce this overhead by compressing recurring patterns more effectively than standard compression techniques.
- Speed: ZSTD offers fast decompression, ensuring that while we save memory, we still maintain high read performance when accessing this data.
- Customization: You can create a custom dictionary tailored to the specific patterns in the *:publish:current_lists dataset, which further optimizes compression efficiency.
Applying ZSTD compression to *:publish:current_lists would significantly reduce memory usage by leveraging the repetitive nature of the data, all while maintaining high performance in Redis. Since the compression process is independent of any business logic, it is completely decoupled and can be modified or removed in the future without impacting other parts of the system.
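To illustrate the mechanism before diving into the integration, here is a minimal sketch using the PHP zstd extension (the same extension used later in this document). The dictionary file and the sample list are placeholders:
// minimal sketch: dictionary-based compression with the PHP zstd extension
// dict.zstd and the sample list below are placeholders
$dict = file_get_contents('dict.zstd');
$list = 'latest:category:1634,latest:category:1634:full,latest,...';

$compressed = zstd_compress_dict($list, $dict);
$restored = zstd_uncompress_dict($compressed, $dict);

printf("original: %d bytes, compressed: %d bytes\n", strlen($list), strlen($compressed));
assert($restored === $list);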
How to compress new data
In order to automatically compress and decompress the data stored in Redis, we can wrap the existing getter and setter methods with a compression service. This approach ensures that the logic for compression and decompression is handled transparently, without requiring changes in other parts of the codebase.
Here’s how you can adapt the existing methods:
Original Getter and Setter
The original methods simply retrieve and store data without any compression:
protected function _getLists($key)
{
    return $this->redis->hget($this->prefix . static::CURRENT_LIST_SET_NAME, $key);
}

protected function _setLists($key, $value)
{
    return $this->redis->hset($this->prefix . static::CURRENT_LIST_SET_NAME, $key, $value);
}
Adding Compression
By wrapping these methods with a compression service, we can ensure that data is compressed when it is stored and decompressed when it is retrieved:
protected function _getLists($key)
{
    return $this->compressionService->decompress_publish_current_list(
        $this->redis->hget($this->prefix . static::CURRENT_LIST_SET_NAME, $key)
    );
}

protected function _setLists($key, $value)
{
    return $this->redis->hset($this->prefix . static::CURRENT_LIST_SET_NAME, $key,
        $this->compressionService->compress_publish_current_list($value)
    );
}
BOM-like Marker for Asynchronous and Offline Compression
In our system, we use a BOM-like (Byte Order Mark) mechanism to determine whether a record in Redis is compressed or not. This allows us to gradually compress data asynchronously, compress it offline, or stop compressing without impacting the service. By adding a small marker to the beginning of compressed data, we can easily identify whether the data needs to be decompressed when retrieved.
How it works:
1. Compressing with the BOM:
When storing data, if compression is enabled and a BOM marker is configured, the system will prepend this marker to the compressed data. This enables us to flag that the data is compressed.
// if a BOM is configured, prepend it to the compressed data
if ($bom) {
    $compressed = $bom . $compressed;
}
2. Decompressing based on the BOM:
When retrieving data from Redis, we check if the BOM marker is present. If the marker is found at the beginning of the data, we know that it is compressed and proceed to decompress it. If the marker is not present, we assume the data is stored in its original, uncompressed form.
// if BOM is enabled
if (!is_null($bom)) {
    // if the data does not match the BOM as a prefix, assume it is not compressed
    if (strncmp($bom, $input, strlen($bom)) !== 0) {
        return $input;
    }
    // strip the BOM from the input before decompression
    $input = substr($input, strlen($bom));
}
Benefits:
- Asynchronous Compression: Data can be compressed or left uncompressed without any immediate impact on the system. This allows for a smooth transition to compressed storage without service interruptions.
- Offline Compression: Data can be compressed in the background or during off-peak times, and the BOM will help distinguish between compressed and uncompressed records.
- Flexible Management: Since compression and decompression are controlled via the presence of the BOM, it is possible to modify or disable compression at any time without requiring a complete overhaul of the existing data or Redis entries.
This approach ensures that we can gradually introduce or roll back compression while maintaining system performance and reliability.
Configuring a Compression Group
Each compression group in our system is designed to use a dedicated dictionary, allowing for optimal compression of specific data types. While dictionaries can be reused across different types of data, we've chosen to name compression methods like compress_publish_current_list for clarity and to reflect the specific use case. These methods are simply wrappers around the generic compress and decompress functions that handle the actual compression and decompression using the provided dictionary.
Compression Methods and Wrappers
The compression service defines specific methods like compress_publish_current_list to make the implementation clearer for each data type. Under the hood, these methods simply call the more generic compress and decompress functions with the appropriate dictionary and compression level:
public function compress_publish_current_list($data)
{
    // if compression is not enabled the data is stored unmodified; if mixed
    // (compressed/uncompressed) data is stored, the BOM tells them apart
    if (!$this->publish_current_lists_enabled) {
        return $data;
    }
    return $this->compress(
        $this->publish_current_lists_dictionary,
        $this->publish_current_lists_BOM,
        $this->default_compression_level,
        $data
    );
}

public function decompress_publish_current_list($data)
{
    return $this->decompress(
        $this->publish_current_lists_dictionary,
        $this->publish_current_lists_BOM,
        $data
    );
}
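The generic compress and decompress functions themselves are not shown in the excerpt above. A minimal sketch, assuming the php-zstd extension functions zstd_compress_dict and zstd_uncompress_dict plus the BOM logic described in the previous section, could look like this:
protected function compress($dictionary, $bom, $level, $data)
{
    // note: the level argument of zstd_compress_dict is only supported by
    // recent php-zstd versions; older ones take (data, dict) only
    $compressed = zstd_compress_dict($data, $dictionary, $level);
    // if a BOM is configured, prepend it so readers can detect compressed records
    if ($bom) {
        $compressed = $bom . $compressed;
    }
    return $compressed;
}

protected function decompress($dictionary, $bom, $input)
{
    if (!is_null($bom)) {
        // if the data does not start with the BOM, assume it is not compressed
        if (strncmp($bom, $input, strlen($bom)) !== 0) {
            return $input;
        }
        // strip the BOM before decompression
        $input = substr($input, strlen($bom));
    }
    return zstd_uncompress_dict($input, $dictionary);
}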
Zstandard Dictionary Configuration
By default, the service wf_cms.services.compression_service receives a Zstandard (ZSTD) dictionary, which is passed as a base64-encoded string. This dictionary is essential for achieving optimal compression, especially when dealing with repetitive data structures like lists.
To create a dictionary from multiple examples of typical data, you can use the following Linux commands. Note that zstd treats each input file as one training sample, so the example files are passed directly rather than concatenated into a single file:
# Generate a ZSTD dictionary from the training samples (one file per sample)
zstd --train example1.txt example2.txt example3.txt -o dictionary.zstd
# Encode the dictionary in base64 (without line wrapping) for usage in the configuration
base64 -w 0 dictionary.zstd > dictionary_base64.txt
The resulting base64-encoded dictionary can then be provided to the compression service as part of the configuration.
BOM, Compression Status, and Compression Level
In addition to the dictionary, each compression group also has its own BOM (Byte Order Mark). The BOM is used to determine whether the data has been compressed. Even if compression is disabled (false), the service will still decompress any previously compressed data, and subsequent writes will store the data uncompressed. This allows for smooth transitions between compressed and uncompressed states without breaking the system.
There is also a global, configurable compression level, which determines how aggressively the data should be compressed. Higher compression levels result in smaller data sizes but may require more processing power.
Configuration File
These configurations are specified in the XML configuration file located at:
Wf/Bundle/CmsBaseBundle/Resources/config/compression_service.xml
An example entry might look like this:
<services>
    <service id="wf_cms.services.compression_service">
        <argument>%publish_current_lists_compression_dictionary%</argument>
        <argument>%publish_current_lists_compression_bom%</argument>
        <argument>%publish_current_lists_compression_enabled%</argument>
        <argument>%redis_compression_level%</argument>
    </service>
</services>
Here, the compression service receives:
- The ZSTD dictionary in base64 format
- The BOM for identifying compressed data
- A flag indicating whether compression is currently enabled
- The global compression level
This configuration ensures that each data group is managed effectively in terms of compression, while allowing for easy adjustments based on service requirements.
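For reference, a constructor matching those four arguments might look like the following sketch. The exact constructor is not shown in this document; the property names simply follow the ones used by the wrapper methods above:
public function __construct($dictionaryBase64, $bom, $enabled, $level)
{
    // the dictionary is injected base64-encoded and decoded once at construction
    $this->publish_current_lists_dictionary = base64_decode($dictionaryBase64);
    $this->publish_current_lists_BOM = $bom;
    $this->publish_current_lists_enabled = $enabled;
    $this->default_compression_level = $level;
}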
Management and diagnosis
Reading a value from Redis can be done by running:
$ redis-cli --raw hget foo:publish:current_lists lists:190
latest:category:1634,latest:category:1634:full,latest,latest:secondary:category:1634:full,...
If we enable compression and save the same content, we get:
$ redis-cli --raw hget foo:publish:current_lists lists:190
Z|p�@!��x�1634
&B�б
To read the content we need the dictionary, which we can recreate directly from the base64 string:
$ echo 'N6Qw7AbYKFQjE...eTpjYXRlZ28=' | base64 -d > dict.zstd
Now we can decompress, but remember to:
- remove the BOM prefix.
- remove the trailing newline emitted by redis-cli.
$ redis-cli --raw hget foo:publish:current_lists lists:190 \
| tail -c+2 | head -c-1 | zstd -D dict.zstd --decompress -q
latest:category:1634,latest:category:1634:full,latest,latest:secondary:category:1634:full,...
Of course, we can write a simple inline PHP script:
$ redis-cli --raw hget foo:publish:current_lists lists:190 \
| php -r '$x = file_get_contents("php://stdin"); echo zstd_uncompress_dict(substr($x, 1, -1), file_get_contents("dict.zstd"));'
latest:category:1634,latest:category:1634:full,latest,latest:secondary:category:1634:full,...
PHP zstd
zstd is a PHP extension and must be enabled, e.g. for the console:
$ sudo phpenmod zstd
At the deployment level:
$ grep -Ri zstd /etc/php/
/etc/php/7.4/mods-available/zstd.ini:extension=zstd.so
/etc/php/7.4/fpm/conf.d/30-zstd.ini:extension=zstd.so
/etc/php/7.4/cli/conf.d/30-zstd.ini:extension=zstd.so
Full compression / decompression
It may be required or desirable to compress or decompress the entire list; to do so, you can run:
$ ./app/admin/console wf:redis:compression rewrite-publish-lists --from-page-id=1 --to-page-id=200
200/200 [============================] 100% 1 sec/1 sec
Done.
This process only reads the current list from Redis and, if it exists, rewrites the content back to Redis. No database queries are involved, since the content is expected to have contiguous ids except, perhaps, for some holes.
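Conceptually, the command is just a read/rewrite loop over the page-id range. A simplified sketch in terms of the _getLists/_setLists methods shown earlier (the actual command implementation may differ):
for ($pageId = $fromPageId; $pageId <= $toPageId; $pageId++) {
    // _getLists returns the decompressed value, or a falsy value for holes
    $lists = $this->_getLists('lists:' . $pageId);
    if (!$lists) {
        continue;
    }
    // writing back through _setLists applies the current compression settings
    $this->_setLists('lists:' . $pageId, $lists);
}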
Parallelize recompression (or uncompression)
# split the page-id space into one contiguous range per CPU
maxPageId=3328707
cpus=4
size=$((maxPageId / cpus + 1))
for i in `seq 1 $cpus`
do
    a=$(((i - 1) * size + 1))
    b=$((i * size))
    echo "$a ~ $b"
    # launch one worker per range in the background
    ./app/admin/console wf:redis:compression rewrite-publish-lists --from-page-id=$a --to-page-id=$b &
done
wait
Statistics and usage cases
For only 197 contents, each with only its main category, we go from compression disabled:
www-data@ubuntu-jammy:/var/www/sites/enabled/cope$ ./app/admin/console wf:redis:compression rewrite-publish-lists --from-page-id=1 --to-page-id=200
200/200 [============================] 100% < 1 sec/< 1 sec
Done.
www-data@ubuntu-jammy:/var/www/sites/enabled/cope$ redis-cli memory usage foo:publish:current_lists
(integer) 91936
to compression enabled:
www-data@ubuntu-jammy:/var/www/sites/enabled/cope$ ./app/admin/console wf:redis:compression rewrite-publish-lists --from-page-id=1 --to-page-id=200
200/200 [============================] 100% 1 sec/1 sec
Done.
www-data@ubuntu-jammy:/var/www/sites/enabled/cope$ redis-cli memory usage foo:publish:current_lists
(integer) 20550
The memory usage is reduced to 22.4% of the original (about a 4.5x compression factor).
Usage Case: Abside
Station | Before (bytes) | After (bytes) | After/Before % |
---|---|---|---|
Cope | 2070426473 | 403793940 | 19.51% |
RockFM | 25568640 | 6551632 | 25.63% |
Cadena100 | 62190358 | 15561608 | 25.02% |
MegaStar | 16066472 | 4009504 | 24.96% |
Note that the custom dictionary was built from Cope samples, which explains Cope's better ratio: a site-specific dictionary is relevant.
Recompressing ~3e6 contents took about ~6h using a single thread.