A tool that sorts lines of text into groups, and into baskets (bins) within each group, according to a regex pattern and some additional rules, and then selects lines from every basket. The lines of text (samples) are typically file paths or other backup archive IDs.
It may be useful when backups take up too much space and some of the old ones could be deleted.
Backup tools usually include such functionality, but it might be too simple, or maybe you just want to prune some .tgz backups.
The user defines:
- the regular expression that splits every sample name into named regex match groups and optionally declares their types,
- basket group rules that:
  - first classify a sample into one of the basket groups,
  - then put the sample into one of the group's baskets (named according to the specified rule),
  - select some samples from every basket.
The sample name pattern regex should use (?P<name>...) named groups. If a group name contains __ (double underscore), it is parsed as name__type. Supported types are int and dt (for datetime).
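The name__type convention can be illustrated with a minimal sketch. This is not the tool's own code: the parse_sample function and the CONVERTERS table are hypothetical, and the assumption that dt values are ISO dates is mine.

```python
import re
from datetime import datetime

# Hypothetical converters for the two documented types.
# Assumption: "dt" values are ISO-formatted dates such as 2022-03-14.
CONVERTERS = {
    "int": int,
    "dt": datetime.fromisoformat,
}


def parse_sample(pattern, sample):
    """Return a dict of typed fields, or None if the sample does not match."""
    match = re.match(pattern, sample)
    if match is None:
        return None
    fields = {}
    for group_name, raw in match.groupdict().items():
        # "date__dt" -> name "date", type "dt"; no "__" means a plain string.
        name, sep, type_name = group_name.partition("__")
        convert = CONVERTERS[type_name] if sep else str
        fields[name] = convert(raw)
    return fields


PATTERN = r"(?P<service>\w+)-(?P<date__dt>\d+-\d+-\d+).*"
fields = parse_sample(PATTERN, "foo-2022-03-14.tar.gz")
print(fields)
```

Here the service group stays a string, while the date group is converted to a datetime so that it can later be compared and formatted.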
A basket group is specified in the following way (with every part being optional): filter:basket_pattern:selection_method.
The filter might be something like date>=2023-01-01.
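As a rough sketch of what evaluating such a filter could look like: the code below is an assumption, not the tool's implementation. Only the >= operator appears in this README; the other operators, the parse_filter name, and the ISO date format are hypothetical.

```python
import operator
import re
from datetime import datetime

# Assumption: only ">=" is shown in the README; the rest are illustrative.
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_filter(expression):
    """Turn 'field<op>value' into a predicate over a dict of typed fields."""
    match = re.fullmatch(r"(\w+)(>=|<=|==|>|<)(.+)", expression)
    if match is None:
        raise ValueError(f"cannot parse filter: {expression!r}")
    field, op, raw = match.groups()
    compare = OPERATORS[op]

    def predicate(fields):
        value = fields[field]
        # Convert the right-hand side to the type of the field being tested.
        if isinstance(value, datetime):
            threshold = datetime.fromisoformat(raw)
        elif isinstance(value, int):
            threshold = int(raw)
        else:
            threshold = raw
        return compare(value, threshold)

    return predicate


is_recent = parse_filter("date>=2023-01-01")
print(is_recent({"date": datetime(2023, 2, 1)}))
print(is_recent({"date": datetime(2022, 12, 31)}))
```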
The basket pattern defines how baskets are named; it might be something like ${service}-${date__Y}-${date__m}. It uses Python's template string syntax. The template may reference groups defined in the sample name pattern.
Fields referencing groups of datetime type can have a suffix corresponding to one of the strftime format codes, e.g. date__Y for the year.
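A minimal sketch of how a basket pattern could be expanded with string.Template from the standard library. The BasketFields helper is hypothetical and only illustrates the suffix idea; the tool itself may resolve suffixes differently.

```python
from datetime import datetime
from string import Template


class BasketFields(dict):
    """Hypothetical helper: resolves `name__X` as strftime('%X') of `name`."""

    def __missing__(self, key):
        # Called by dict lookup when `key` (e.g. "date__Y") is not stored.
        name, sep, code = key.partition("__")
        value = self.get(name)
        if sep and isinstance(value, datetime):
            return value.strftime("%" + code)
        raise KeyError(key)


fields = BasketFields(service="foo", date=datetime(2022, 5, 3))
name = Template("${service}-${date__Y}-${date__m}").substitute(fields)
print(name)  # foo-2022-05
```

Every sample of the foo service dated in May 2022 expands to the same name, so all of them land in the same basket.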
The selection method specifies how many files to keep (or to delete, if inversion is used). It may be something like 3 to keep 3 files from every basket.
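The keep-N-per-basket idea, and its inversion, can be sketched as follows. This is an illustration only: the select function is hypothetical, and the choice to keep the last N samples in sorted order is an assumption, since this README does not say which samples of a basket are preferred.

```python
from collections import defaultdict


def select(samples, basket_of, keep):
    """Keep up to `keep` samples per basket (assumption: the last ones in sorted order)."""
    baskets = defaultdict(list)
    for sample in samples:
        baskets[basket_of(sample)].append(sample)
    kept = []
    for name in sorted(baskets):
        members = sorted(baskets[name])
        kept.extend(members[max(len(members) - keep, 0):])
    return kept


samples = [
    "foo-2022-05-01.tar.gz",
    "foo-2022-05-20.tar.gz",
    "foo-2022-06-02.tar.gz",
    "bar-2022-05-11.tar.gz",
]
# Baskets named like foo-2022-05: here simply the first 11 characters.
kept = select(samples, lambda s: s[:11], keep=1)
# Inversion: report the samples that were NOT selected, i.e. deletion candidates.
to_delete = [s for s in samples if s not in kept]
print(kept)
print(to_delete)
```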
Let's assume you have a list of files (e.g. generated by the find command line tool) like the following (see examples/files.txt for the full list):
foo-2022-03-14.tar.gz
bar-2022-04-16.tar.gz
[...]
Then run something like the following to:
- keep all files dated 2023-01-01 or later (first basket group),
- for older files (second basket group), keep only one file per month per service (baskets named like foo-2022-05):
./baskets.py \
-i -o lines \
-b 'date>=2023-01-01' \
-b ':${service}-${date__Y}-${date__m}:1' \
'(?P<service>\w+)-(?P<date__dt>\d+-\d+-\d+).*' \
< examples/files.txt
This will output a list of files to delete, which you can pass to something like xargs rm.
More examples yet to come...
- BorgBackup prune command, https://borgbackup.readthedocs.io/en/stable/usage/prune.html
- Obnam backup tool in Python (the first time I saw something like this), http://git.liw.fi/obnam/tree/manual/en/080-forgetting.mdwn; BTW, there is Obnam2 in Rust under way https://obnam.org/