
Redis Compression

Why compress data in Redis

Redis does not natively include any mechanism for compressing data. Since all data is stored in memory, efficient memory usage is crucial to avoid the need for large instances. However, some lists in Xalok, such as *:publish:current_lists, have a roughly quadratic memory cost (proportional to the number of news items times the number of lists they appear in), which results in high memory consumption. It therefore becomes necessary to find a mechanism to compress this information.

ZSTD Dictionary Compression

Zstandard (ZSTD) is a modern compression algorithm designed to balance high compression ratios with fast decompression speeds. One of its key features is dictionary-based compression, which makes it particularly effective for datasets with a lot of repeated patterns—like the *:publish:current_lists in Xalok.

Key Benefits of ZSTD:

  • Efficient Memory Usage: Redis stores all data in memory, so reducing the memory footprint is critical. ZSTD's dictionary compression excels at handling repetitive data patterns by referencing a pre-built dictionary that contains common sequences.
  • High Compression Ratio: Lists like *:publish:current_lists can have a quadratic memory cost due to a large number of repeated items across multiple lists. ZSTD can dramatically reduce this overhead by compressing recurring patterns more effectively than standard compression techniques.
  • Speed: ZSTD offers fast decompression, ensuring that while we save memory, we still maintain high read performance when accessing this data.
  • Customization: You can create a custom dictionary tailored to the specific patterns in the *:publish:current_lists dataset, which further optimizes compression efficiency.

Applying ZSTD compression to *:publish:current_lists would significantly reduce memory usage by leveraging the repetitive nature of the data, all while maintaining high performance in Redis. Since the compression process is independent of any business logic, it is completely decoupled and can be modified or removed in the future without impacting other parts of the system.

How to compress new data

In order to automatically compress and decompress the data stored in Redis, we can wrap the existing getter and setter methods with a compression service. This approach ensures that the logic for compression and decompression is handled transparently, without requiring changes in other parts of the codebase.

Here’s how you can adapt the existing methods:

Original Getter and Setter

The original methods simply retrieve and store data without any compression:

php
protected function _getLists($key)
{
    return $this->redis->hget($this->prefix . static::CURRENT_LIST_SET_NAME, $key);
}

protected function _setLists($key, $value)
{
    return $this->redis->hset($this->prefix . static::CURRENT_LIST_SET_NAME, $key, $value);
}

Adding Compression

By wrapping these methods with a compression service, we can ensure that data is compressed when it is stored and decompressed when it is retrieved:

php
protected function _getLists($key)
{
    return $this->compressionService->decompress_publish_current_list(
        $this->redis->hget($this->prefix . static::CURRENT_LIST_SET_NAME, $key)
    );
}

protected function _setLists($key, $value)
{
    return $this->redis->hset($this->prefix . static::CURRENT_LIST_SET_NAME, $key,
        $this->compressionService->compress_publish_current_list($value)
    );
}

BOM-like Marker for Asynchronous and Offline Compression

In our system, we use a BOM-like (Byte Order Mark) mechanism to determine whether a record in Redis is compressed or not. This allows us to gradually compress data asynchronously, compress it offline, or stop compressing without impacting the service. By adding a small marker to the beginning of compressed data, we can easily identify whether the data needs to be decompressed when retrieved.

How it works:

1. Compressing with the BOM:

When storing data, if compression is enabled and a BOM marker is configured, the system will prepend this marker to the compressed data. This enables us to flag that the data is compressed.

php
// if a BOM is configured, prepend it to flag the data as compressed
if($bom) {
    $compressed = $bom . $compressed;
}

2. Decompressing based on the BOM:

When retrieving data from Redis, we check if the BOM marker is present. If the marker is found at the beginning of the data, we know that it is compressed and proceed to decompress it. If the marker is not present, we assume the data is stored in its original, uncompressed form.

php
// if BOM is enabled
if(!is_null($bom)) {
    // if the data does not match the BOM as a prefix, assume it is not compressed
    if(strncmp($bom, $input, strlen($bom)) !== 0) {
        return $input;
    }
    // strip the BOM from the input before decompression
    $input = substr($input, strlen($bom));
}
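
Putting the two fragments together, the full BOM-aware pair might look roughly like the sketch below. The function names are placeholders, and `gzcompress`/`gzuncompress` stand in for the actual ZSTD dictionary calls (e.g. `zstd_compress_dict` / `zstd_uncompress_dict`), since the real internal implementation is not shown here:

```php
<?php
// Sketch only: gzcompress/gzuncompress stand in for the ZSTD dictionary calls.
function compress_with_bom(?string $bom, string $data): string
{
    $compressed = gzcompress($data);   // real code: zstd_compress_dict(...)
    // if a BOM is configured, prepend it to flag the record as compressed
    if ($bom) {
        $compressed = $bom . $compressed;
    }
    return $compressed;
}

function decompress_with_bom(?string $bom, string $input): string
{
    if (!is_null($bom)) {
        // no BOM prefix => the record was stored uncompressed, return as-is
        if (strncmp($bom, $input, strlen($bom)) !== 0) {
            return $input;
        }
        // strip the BOM before decompressing
        $input = substr($input, strlen($bom));
    }
    return gzuncompress($input);       // real code: zstd_uncompress_dict(...)
}
```

With a BOM of, say, `"\x01"`, compressed records round-trip correctly, while legacy uncompressed records pass through untouched.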

Benefits:

  • Asynchronous Compression: Data can be compressed or left uncompressed without any immediate impact on the system. This allows for a smooth transition to compressed storage without service interruptions.
  • Offline Compression: Data can be compressed in the background or during off-peak times, and the BOM will help distinguish between compressed and uncompressed records.
  • Flexible Management: Since compression and decompression are controlled via the presence of the BOM, it is possible to modify or disable compression at any time without requiring a complete overhaul of the existing data or Redis entries.

This approach ensures that we can gradually introduce or roll back compression while maintaining system performance and reliability.

Configuring a Compression Group

Each compression group in our system is designed to use a dedicated dictionary, allowing for optimal compression of specific data types. While dictionaries can be reused across different types of data, we've chosen to name compression methods like compress_publish_current_list for clarity and to reflect the specific use case. These methods are simply wrappers around the generic compress and decompress functions that handle the actual compression and decompression using the provided dictionary.

Compression Methods and Wrappers

The compression service defines specific methods like compress_publish_current_list to make the implementation clearer for each data type. Under the hood, these methods simply call the more generic compress and decompress functions with the appropriate dictionary and compression level:

php
public function compress_publish_current_list($data)
{
    // if compression is not enabled the data is stored unmodified;
    // the BOM lets mixed (compressed/uncompressed) data coexist safely
    if(!$this->publish_current_lists_enabled) {
        return $data;
    }

    return $this->compress(
        $this->publish_current_lists_dictionary,
        $this->publish_current_lists_BOM,
        $this->default_compression_level,
        $data
    );
}

public function decompress_publish_current_list($data)
{
    return $this->decompress(
        $this->publish_current_lists_dictionary,
        $this->publish_current_lists_BOM,
        $data
    );
}

Zstandard Dictionary Configuration

By default, the service wf_cms.services.compression_service receives a Zstandard (ZSTD) dictionary, which is passed as a base64-encoded string. This dictionary is essential for achieving optimal compression, especially when dealing with repetitive data structures like lists.

To create a dictionary from multiple examples of typical data, you can use the following Linux command:

bash
# Collect multiple examples of typical data, one sample per file
# (zstd treats each input file as a separate training sample,
# so do not concatenate them into a single file)
zstd --train example1.txt example2.txt example3.txt -o dictionary.zstd

# Encode the dictionary in base64 (-w0 avoids line wrapping) for the configuration
base64 -w0 dictionary.zstd > dictionary_base64.txt

The resulting base64-encoded dictionary can then be provided to the compression service as part of the configuration.

BOM, Compression Status, and Compression Level

In addition to the dictionary, each compression group also has its own BOM (Byte Order Mark). The BOM is used to determine whether the data has been compressed. Even if compression is disabled (false), the service will still decompress any previously compressed data, and subsequent writes will store the data uncompressed. This allows for smooth transitions between compressed and uncompressed states without breaking the system.

There is also a globally configurable compression level, which determines how aggressively the data should be compressed. Higher compression levels result in smaller data sizes but may require more processing power.

Configuration File

These configurations are specified in the XML configuration file located at:

Wf/Bundle/CmsBaseBundle/Resources/config/compression_service.xml

An example entry might look like this:

xml
<services>
    <service id="wf_cms.services.compression_service">
        <argument>%publish_current_lists_compression_dictionary%</argument>
        <argument>%publish_current_lists_compression_bom%</argument>
        <argument>%publish_current_lists_compression_enabled%</argument>
        <argument>%redis_compression_level%</argument>
    </service>
</services>

Here, the compression service receives:

  • The ZSTD dictionary in base64 format
  • The BOM for identifying compressed data
  • A flag indicating whether compression is currently enabled
  • The global compression level

This configuration ensures that each data group is managed effectively in terms of compression, while allowing for easy adjustments based on service requirements.
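
For illustration, the service constructor might wire up these four arguments as sketched below. The class and property names are hypothetical (the source of wf_cms.services.compression_service is not shown here), and `gzcompress` stands in for the ZSTD dictionary call; only the disabled-compression pass-through mirrors behaviour described above:

```php
<?php
// Hypothetical sketch of how the four XML arguments could be wired up.
class CompressionService
{
    private string $dictionary;  // decoded ZSTD dictionary
    private ?string $bom;        // marker prepended to compressed records
    private bool $enabled;       // %publish_current_lists_compression_enabled%
    private int $level;          // %redis_compression_level%

    public function __construct(string $dictionaryB64, ?string $bom, bool $enabled, int $level)
    {
        $this->dictionary = base64_decode($dictionaryB64);
        $this->bom = $bom;
        $this->enabled = $enabled;
        $this->level = $level;
    }

    public function compress_publish_current_list(string $data): string
    {
        // when disabled, write through uncompressed; the BOM keeps reads safe
        if (!$this->enabled) {
            return $data;
        }
        // real code would call e.g. zstd_compress_dict() with $this->dictionary
        return $this->bom . gzcompress($data, $this->level);
    }
}
```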

Management and diagnosis

Reading a value from Redis can be done running:

shell
$ redis-cli --raw hget foo:publish:current_lists lists:190
latest:category:1634,latest:category:1634:full,latest,latest:secondary:category:1634:full,...

If we enable compression and save the same content, we get:

shell
$ redis-cli --raw hget foo:publish:current_lists lists:190
Z|p�@!��x�1634
          &B�б

To read the content we need the dictionary; we can recreate the file directly from the base64 string:

shell
$ echo 'N6Qw7AbYKFQjE...eTpjYXRlZ28=' | base64 -d > dict.zstd

Now we can decompress, but remember to:

  1. remove the BOM prefix.
  2. remove the trailing newline emitted by redis-cli.
shell
$ redis-cli --raw hget foo:publish:current_lists lists:190 \
  | tail -c+2 | head -c-1 | zstd -D dict.zstd --decompress -q
latest:category:1634,latest:category:1634:full,latest,latest:secondary:category:1634:full,...

Of course, we can also write a simple inline PHP script:

shell
$ redis-cli --raw hget foo:publish:current_lists lists:190 \
  | php -r '$x = file_get_contents("php://stdin"); echo zstd_uncompress_dict(substr($x, 1, -1), file_get_contents("dict.zstd"));'
latest:category:1634,latest:category:1634:full,latest,latest:secondary:category:1634:full,...

PHP zstd

zstd is a PHP module and must be enabled, e.g. for the CLI:

shell
$ sudo phpenmod zstd

At deployment level:

shell
$ grep -Ri zstd /etc/php/
/etc/php/7.4/mods-available/zstd.ini:extension=zstd.so
/etc/php/7.4/fpm/conf.d/30-zstd.ini:extension=zstd.so
/etc/php/7.4/cli/conf.d/30-zstd.ini:extension=zstd.so

Full compression / decompression

It may be necessary or desirable to compress or decompress the entire list; to do so, you can run:

shell
$ ./app/admin/console wf:redis:compression rewrite-publish-lists --from-page-id=1 --to-page-id=200
 200/200 [============================] 100%  1 sec/1 sec
Done.

This process only reads the current lists from Redis and, if they exist, rewrites the content back to Redis. No database queries are involved, since the page ids are expected to be contiguous except, perhaps, for some holes.

Parallelize recompression (or uncompression)

shell
# split the page-id range into one contiguous chunk per CPU
maxPageId=3328707
cpus=4
size=$((maxPageId / cpus + 1))
for i in $(seq 1 $cpus)
do
  a=$(((i - 1) * size + 1))
  b=$((i * size))
  echo "$a ~ $b"
  # e.g. launch one worker per chunk in the background:
  # ./app/admin/console wf:redis:compression rewrite-publish-lists --from-page-id=$a --to-page-id=$b &
done

Statistics and usage cases

For only 197 contents, each with just the main category, we go from compression disabled:

shell
www-data@ubuntu-jammy:/var/www/sites/enabled/cope$ ./app/admin/console wf:redis:compression rewrite-publish-lists --from-page-id=1 --to-page-id=200
 200/200 [============================] 100% < 1 sec/< 1 sec
Done.
www-data@ubuntu-jammy:/var/www/sites/enabled/cope$ redis-cli memory usage foo:publish:current_lists
(integer) 91936

to compression enabled:

shell
www-data@ubuntu-jammy:/var/www/sites/enabled/cope$ ./app/admin/console wf:redis:compression rewrite-publish-lists --from-page-id=1 --to-page-id=200
 200/200 [============================] 100%  1 sec/1 sec
Done.
www-data@ubuntu-jammy:/var/www/sites/enabled/cope$ redis-cli memory usage foo:publish:current_lists
(integer) 20550

Reduced to 22.4% of the original size (≈4.5× compression, i.e. ≈77.6% of the memory saved).

Usage Case Abside

Station      Before        After        % After/Before
Cope         2070426473    403793940    19.51%
RockFM       25568640      6551632      25.63%
Cadena100    62190358      15561608     25.02%
MegaStar     16066472      4009504      24.96%

Note that the custom dictionary was trained using Cope samples, which explains the better ratio for Cope: a dictionary tailored to the data matters.

Recompressing ~3 million contents took about 6 hours using a single thread.