It's not clear from the docs or your blog post how to retrain dictionaries. If my reading of the code is correct, when transparent compression is used the dictionary is created once during the first maintenance run. For manual compression, one could train and manage one's own dictionaries; presumably old ones could be cleaned up by ensuring no references to them remain?
I assume it's not possible to use a new dictionary for data that was compressed with another dictionary?
Thanks!
If you use the "base" set of functions (zstd_train_dict, zstd_compress, zstd_dcompress), then you can handle dictionaries however you want (store them as blobs in a separate table, identified however you want). You'll also have to do the "reference counting" yourself.
If you use transparent compression, dictionaries are created based on a few factors (there's a minimal sketch after this list):

- The dict chooser expression has to return a non-null value. If it returns NULL, the corresponding rows are not compressed. This means you can delay compression for specific rows based on the values of other table columns.
- If you return '[nodict]' from the dict chooser, the row is compressed without a dictionary.
- If the amount of data that would be used to train a dictionary is too small, the data stays uncompressed. This is computed separately for each dictionary (each unique value returned from dict_chooser). The heuristic for the target dictionary size is total_bytes_in_dict_group * config.dict_size_ratio (default 1%). If that value is below config.min_dict_size_bytes_for_training (default 5000 bytes), no dictionary is trained. So by default your data will only be compressed once a group contains at least 500 kB.
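For reference, a minimal sketch of a transparent setup with a dict chooser. The table/column names and the monthly grouping are made up; the config keys shown are the ones mentioned above (with their defaults spelled out), and the two-argument zstd_incremental_maintenance call is as I understand it from the README:

```sql
-- Enable transparent compression on a hypothetical events.payload column.
-- Rows younger than a month get NULL from the chooser and stay uncompressed;
-- older rows are grouped (and get one dictionary) per calendar month.
SELECT zstd_enable_transparent('{
    "table": "events",
    "column": "payload",
    "compression_level": 19,
    "dict_chooser": "CASE WHEN created_at < date(''now'', ''-1 month'') THEN strftime(''%Y-%m'', created_at) ELSE NULL END",
    "min_dict_size_bytes_for_training": 5000,
    "dict_size_ratio": 0.01
}');

-- Dictionaries are trained and rows compressed during maintenance,
-- one dictionary per distinct non-null chooser value (subject to the
-- size heuristic described above).
SELECT zstd_incremental_maintenance(60, 0.5);
```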
Right now, there's no integrated functionality to "retrain" dictionaries or to decompress and recompress data. That would be future functionality, though I think if you choose your dict_chooser well, retraining shouldn't gain you much. You can work around the lack of this feature by decompressing everything and then enabling compression again.