Releases: facebook/rocksdb

RocksDB 8.5.4

27 Sep 05:20

8.5.4 (2023-09-26)

Bug Fixes

  • Fixed a bug where compaction read under non direct IO still falls back to RocksDB internal prefetching after file system's prefetching returns non-OK status other than Status::NotSupported()

Behavior Changes

  • For non direct IO, eliminate the file system prefetching attempt for compaction read when Options::compaction_readahead_size is 0

RocksDB 8.5.3

01 Sep 21:01

Please note 8.5.1 includes a fix for a persisted database corruption in an unlikely edge case. Upgrading to a version including this fix, like this one, is highly recommended!

8.5.3 (2023-09-01)

Bug Fixes

  • Fixed a race condition in GenericRateLimiter that could cause it to stop granting requests

8.5.2 (2023-08-31)

Bug fixes

  • Fix a bug where an iterator may return incorrect results for DeleteRange() users if there was an error reading from a file.

8.5.1 (2023-08-31)

Bug fixes

  • Fix a bug where, if there is an error reading from offset 0 of a file from L1+ and the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results.

8.5.0 (2023-07-21)

Public API Changes

  • Removed recently added APIs GeneralCache and MakeSharedGeneralCache() as our plan changed to stop exposing a general-purpose cache interface. The old forms of these APIs, Cache and NewLRUCache(), are still available, although general-purpose caching support will be dropped eventually.

Behavior Changes

  • Option periodic_compaction_seconds no longer supports FIFO compaction: setting it has no effect on FIFO compactions. FIFO compaction users should only set option ttl instead.
  • Move prefetching responsibility to page cache for compaction read for non directIO use case

Performance Improvements

  • In case of direct_io, if the buffer passed by the caller is already aligned, RandomAccessFileRead::Read will avoid reallocating a new buffer and use the passed aligned buffer instead, reducing memcpy.
  • Small efficiency improvement to HyperClockCache by reducing chance of compiler-generated heap allocations

Bug Fixes

  • Fix use_after_free bug in async_io MultiReads when the underlying FS has enabled kFSBuffer. With kFSBuffer, the underlying FS passes its own buffer instead of using the RocksDB scratch in FSReadRequest. Right now it's an experimental feature.
  • Fix a bug in FileTTLBooster that can cause users with a large number of levels (more than 65) to see errors like "runtime error: shift exponent .. is too large.." (#11673).
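
The "shift exponent .. is too large" message quoted above is the kind of report an undefined-behavior sanitizer prints when a value is shifted by at least its bit width. The sketch below is not the RocksDB fix. It only illustrates the hazard with a hypothetical helper, `SaturatingShiftLeft`, that clamps the result instead of shifting out of range.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not a RocksDB API): computes base << shift, but
// saturates instead of invoking undefined behavior or silently overflowing.
inline uint64_t SaturatingShiftLeft(uint64_t base, unsigned shift) {
  constexpr unsigned kBits = std::numeric_limits<uint64_t>::digits;  // 64
  constexpr uint64_t kMax = std::numeric_limits<uint64_t>::max();
  if (base == 0) return 0;
  // Shifting a 64-bit value by 64 or more is undefined behavior in C++.
  if (shift >= kBits) return kMax;
  // If any of the top `shift` bits are set, the result would not fit.
  if (shift > 0 && (base >> (kBits - shift)) != 0) return kMax;
  return base << shift;
}
```

Code that derives a shift amount from a level index could route the shift through a guard like this, so that an unusually large number of levels degrades to a clamped value rather than undefined behavior.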

RocksDB 8.4.4

01 Sep 20:54

8.4.4 (2023-09-01)

Bug Fixes

  • Fix a bug where, if there is an error reading from offset 0 of a file from L1+ and the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results.
  • Fix a bug where an iterator may return incorrect results for DeleteRange() users if there was an error reading from a file.
  • Fixed a race condition in GenericRateLimiter that could cause it to stop granting requests

8.4.3 (2023-07-27)

Bug Fixes

  • Fix use_after_free bug in async_io MultiReads when the underlying FS has enabled kFSBuffer. With kFSBuffer, the underlying FS passes its own buffer instead of using the RocksDB scratch in FSReadRequest.

8.4.0 (2023-06-26)

New Features

  • Add FSReadRequest::fs_scratch, a data buffer allocated and provided by the underlying FileSystem to RocksDB during reads, for when the FS wants to provide its own buffer with data instead of using the RocksDB-provided FSReadRequest::scratch. This can reduce CPU usage by avoiding a copy from the file system's buffer to the RocksDB buffer. More details on how to use/enable it are in file_system.h. Right now it is supported only for MultiReads (async + sync) with non direct IO.
  • Start logging non-zero user-defined timestamp sizes in WAL to signal user key format in subsequent records and use it during recovery. This change will break recovery from WAL files written by early versions that contain user-defined timestamps. The workaround is to ensure there are no WAL files to recover (i.e. by flushing before close) before upgrade.
  • Added new property "rocksdb.obsolete-sst-files-size-property" that reports the size of SST files that have become obsolete but have not yet been deleted or scheduled for deletion
  • Start to record the value of the flag AdvancedColumnFamilyOptions.persist_user_defined_timestamps in the Manifest and table properties for an SST file when it is created, and use the recorded flag when creating a table reader for the SST file. This flag is only explicitly recorded if it's false.
  • Add a new option OptimisticTransactionDBOptions::shared_lock_buckets that enables sharing mutexes for validating transactions between DB instances, for better balancing memory efficiency and validation contention across DB instances. Different column families and DBs also now use different hash seeds in this validation, so that the same set of key names will not contend across DBs or column families.
  • Add a new ticker rocksdb.files.marked.trash.deleted to track the number of trash files deleted by background thread from the trash queue.
  • Add an API NewTieredVolatileCache() in include/rocksdb/cache.h to allocate an instance of a block cache with a primary block cache tier and a compressed secondary cache tier. A cache of this type distributes memory reservations against the block cache, such as WriteBufferManager, table reader memory etc., proportionally across both the primary and compressed secondary cache.
  • Add WaitForCompact() to wait for all flush and compaction jobs to finish. Jobs to wait for include unscheduled ones (queued, but not scheduled yet).
  • Add WriteBatch::Release() that releases the batch's serialized data to the caller.

Public API Changes

  • Add C API rocksdb_options_add_compact_on_deletion_collector_factory_del_ratio.
  • Change the FileSystem::use_async_io() API to the SupportedOps API in order to extend it to various operations supported by the underlying FileSystem. Right now it contains FSSupportedOps::kAsyncIO and FSSupportedOps::kFSBuffer. More details about FSSupportedOps are in file_system.h.
  • Add new tickers: rocksdb.error.handler.bg.error.count, rocksdb.error.handler.bg.io.error.count, rocksdb.error.handler.bg.retryable.io.error.count to replace the misspelled ones: rocksdb.error.handler.bg.errro.count, rocksdb.error.handler.bg.io.errro.count, rocksdb.error.handler.bg.retryable.io.errro.count ('error' instead of 'errro'). Users should switch to use the new tickers before 9.0 release as the misspelled old tickers will be completely removed then.
  • Overload the API CreateColumnFamilyWithImport() to support creating a ColumnFamily by importing multiple ColumnFamilies. It requires that the CFs do not overlap in user key range.
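
For dashboards or alerting rules keyed on ticker name strings, the rename above can be handled with a small lookup table. This is a minimal sketch: the helper name `CorrectedTickerName` is hypothetical, and only the ticker strings themselves come from the note above.

```cpp
#include <string>
#include <unordered_map>

// Maps a deprecated, misspelled ticker name to its corrected replacement.
// Names that are not in the table are returned unchanged.
// (Hypothetical helper for illustration; not part of RocksDB.)
inline std::string CorrectedTickerName(const std::string& name) {
  static const std::unordered_map<std::string, std::string> kRenames = {
      {"rocksdb.error.handler.bg.errro.count",
       "rocksdb.error.handler.bg.error.count"},
      {"rocksdb.error.handler.bg.io.errro.count",
       "rocksdb.error.handler.bg.io.error.count"},
      {"rocksdb.error.handler.bg.retryable.io.errro.count",
       "rocksdb.error.handler.bg.retryable.io.error.count"},
  };
  auto it = kRenames.find(name);
  return it == kRenames.end() ? name : it->second;
}
```

Applying such a mapping at the point where metrics are exported lets consumers move to the corrected names ahead of the 9.0 removal.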

Behavior Changes

  • Change the default value for option level_compaction_dynamic_level_bytes to true. This affects users who use leveled compaction and do not set this option explicitly. These users may see additional background compactions following DB open. These compactions help to shape the LSM according to level_compaction_dynamic_level_bytes such that the size of each level Ln is approximately size of Ln-1 * max_bytes_for_level_multiplier. Turning on this option has other benefits too: see more detail in wiki: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size and in option comment in advanced_options.h (#11525).
  • For Leveled Compaction users, CompactRange() will now always try to compact to the last non-empty level. (#11468)
  • For Leveled Compaction users, CompactRange() with bottommost_level_compaction = BottommostLevelCompaction::kIfHaveCompactionFilter will behave similarly to kForceOptimized in that it will skip files created during this manual compaction when compacting files in the bottommost level. (#11468)
  • RocksDB will try to drop range tombstones during non-bottommost compaction when it is safe to do so. (#11459)
  • When a DB is opened with allow_ingest_behind=true (currently only Universal compaction is supported), files in the last level, i.e. the ingested files, will not be included in any compaction. (#11489)
  • Statistics rocksdb.sst.read.micros scope is expanded to all SST reads except for file ingestion and column family import (some compaction reads were previously excluded).
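
The level-size relationship described for level_compaction_dynamic_level_bytes (each level Ln is approximately the size of Ln-1 times max_bytes_for_level_multiplier) can be worked through with a little arithmetic. The sketch below is a simplification under stated assumptions: `LevelTargets` is a hypothetical function, it starts from the last level's size and divides upward, and it ignores the additional rules RocksDB applies when choosing levels and bounding their sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified illustration: given the size of the last level and the
// multiplier, derive a target size for each level above it so that
// target[n] is roughly target[n-1] * multiplier.
// Returns targets ordered from the top-most computed level to the last level.
inline std::vector<uint64_t> LevelTargets(uint64_t last_level_bytes,
                                          uint64_t multiplier,
                                          int num_levels) {
  assert(multiplier > 0 && num_levels > 0);
  std::vector<uint64_t> targets(static_cast<std::size_t>(num_levels));
  uint64_t size = last_level_bytes;
  for (int i = num_levels - 1; i >= 0; --i) {
    targets[static_cast<std::size_t>(i)] = size;
    size /= multiplier;  // The level above targets 1/multiplier of this one.
  }
  return targets;
}
```

For example, with a last level of 1000 units, a multiplier of 10, and three levels, the targets come out as 10, 100, and 1000 units.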

Bug Fixes

  • Reduced cases of illegally using Env::Default() during static destruction by never destroying the internal PosixEnv itself (except for builds checking for memory leaks). (#11538)
  • Fix extra prefetching during seek in async_io when BlockBasedTableOptions.num_file_reads_for_auto_readahead is 1, which led to more reads than required.
  • Fix a bug where compactions that are qualified to be run as 2 subcompactions were only run as one subcompaction.
  • Fix a use-after-move bug in block.cc.

RocksDB 8.3.3

01 Sep 20:48

8.3.3 (2023-09-01)

Bug Fixes

  • Fix a bug where, if there is an error reading from offset 0 of a file from L1+ and the file is not the first file in the sorted run, data can be lost in compaction and read/scan can return incorrect results.
  • Fix a bug where an iterator may return incorrect results for DeleteRange() users if there was an error reading from a file.
  • Fixed a race condition in GenericRateLimiter that could cause it to stop granting requests

RocksDB 8.3.2

23 Jun 23:45

8.3.2 (2023-06-14)

Bug Fixes

  • Reduced cases of illegally using Env::Default() during static destruction by never destroying the internal PosixEnv itself (except for builds checking for memory leaks). (#11538)

8.3.1 (2023-06-07)

Performance Improvements

  • Fixed higher read QPS during DB::Open() when reading files created prior to #11406, especially when reading many small files (size < 52 MB) and a partitioned filter or index is used.

8.3.0 (2023-05-19)

New Features

  • Introduced a new option block_protection_bytes_per_key, which can be used to enable per key-value integrity protection for in-memory blocks in block cache (#11287).
  • Added JemallocAllocatorOptions::num_arenas. Setting num_arenas > 1 may mitigate mutex contention in the allocator, particularly in scenarios where block allocations commonly bypass jemalloc tcache.
  • Improve the operational safety of publishing a DB or SST files to many hosts by using different block cache hash seeds on different hosts. The exact behavior is controlled by new option ShardedCacheOptions::hash_seed, which also documents the solved problem in more detail.
  • Introduced a new option CompactionOptionsFIFO::file_temperature_age_thresholds that allows FIFO compaction to compact files to different temperatures based on key age (#11428).
  • Added a new ticker stat to count how many times RocksDB detected a corruption while verifying a block checksum: BLOCK_CHECKSUM_MISMATCH_COUNT.
  • New statistics rocksdb.file.read.db.open.micros that measures read time of block-based SST tables or blob files during db open.
  • New statistics tickers for various iterator seek behaviors and relevant filtering, as *_LEVEL_SEEK_*. (#11460)

Public API Changes

  • EXPERIMENTAL: Add new API DB::ClipColumnFamily to clip the keys in a CF to a certain range. It will physically delete all keys outside the range, including tombstones.
  • Add MakeSharedCache() construction functions to various cache Options objects, and deprecated the NewWhateverCache() functions with long parameter lists.
  • Changed the meaning of various Bloom filter stats (prefix vs. whole key), with iterator-related filtering only being tracked in the new *_LEVEL_SEEK_* stats. (#11460)

Behavior changes

  • For x86, CPU features are no longer detected at runtime nor in build scripts, but in source code using common preprocessor defines. This will likely unlock some small performance improvements on some newer hardware, but could hurt performance of the kCRC32c checksum, which is no longer the default, on some "portable" builds. See PR #11419 for details.

Bug Fixes

  • Delete an empty WAL file on DB open if the log number is less than the min log number to keep
  • Delete temp OPTIONS file on DB open if there is a failure to write it out or rename it

Performance Improvements

  • Improved the I/O efficiency of prefetching SST metadata by recording more information in the DB manifest. Opening files written with previous versions will still rely on heuristics for how much to prefetch (#11406).

RocksDB 8.1.1

20 Apr 22:02

8.1.1 (2023-04-06)

Bug Fixes

  • In the DB::VerifyFileChecksums API, ensure that file system reads of SST files are equal to the readahead_size in ReadOptions, if specified. Previously, each read was 2x the readahead_size.

8.1.0 (2023-03-18)

Behavior changes

  • Compaction output file cutting logic now considers range tombstone start keys. For example, an SST partitioner may now receive a PartitionerRequest for range tombstone start keys.
  • If the async_io ReadOption is specified for MultiGet or NewIterator on a platform that doesn't support IO uring, the option is ignored and synchronous IO is used.

Bug Fixes

  • Fixed an issue for backward iteration when user defined timestamp is enabled in combination with BlobDB.
  • Fixed a couple of cases where a Merge operand encountered during iteration wasn't reflected in the internal_merge_count PerfContext counter.
  • Fixed a bug in CreateColumnFamilyWithImport()/ExportColumnFamily() which did not support range tombstones (#11252).
  • Fixed a bug where an excluded column family from an atomic flush contains unflushed data that should've been included in this atomic flush (i.e, data of seqno less than the max seqno of this atomic flush), leading to potential data loss in this excluded column family when WriteOptions::disableWAL == true (#11148).

New Features

  • Add statistics rocksdb.secondary.cache.filter.hits, rocksdb.secondary.cache.index.hits, and rocksdb.secondary.cache.data.hits
  • Added a new PerfContext counter internal_merge_point_lookup_count which tracks the number of Merge operands applied while serving point lookup queries.
  • Add new statistics rocksdb.table.open.prefetch.tail.read.bytes, rocksdb.table.open.prefetch.tail.{miss|hit}
  • Add support for SecondaryCache with HyperClockCache (HyperClockCacheOptions inherits secondary_cache option from ShardedCacheOptions)
  • Add new db properties rocksdb.cf-write-stall-stats and rocksdb.db-write-stall-stats, and APIs to examine them in a structured way. In particular, users of GetMapProperty() with property kCFWriteStallStats/kDBWriteStallStats can now use the functions in WriteStallStatsMapKeys to find stats in the map.

Public API Changes

  • Changed various functions and features in Cache that are mostly relevant to custom implementations or wrappers. In particular, asynchronous lookup functionality is moved from Lookup() to a new StartAsyncLookup() function.

RocksDB 7.10.2

02 Mar 01:00

7.10.2 (2023-02-10)

Bug Fixes

  • Fixed a bug in DB open/recovery from a compressed WAL that was caused due to incorrect handling of certain record fragments with the same offset within a WAL block.

7.10.1 (2023-02-01)

Bug Fixes

  • Fixed a data race on ColumnFamilyData::flush_reason caused by concurrent flushes.
  • Fixed DisableManualCompaction() and CompactRangeOptions::canceled to cancel compactions even when they are waiting on conflicting compactions to finish
  • Fixed a bug in which a successful GetMergeOperands() could transiently return Status::MergeInProgress()
  • Return the correct error (Status::NotSupported()) to MultiGet caller when ReadOptions::async_io flag is true and IO uring is not enabled. Previously, Status::Corruption() was being returned when the actual failure was lack of async IO support.

7.10.0 (2023-01-23)

Behavior changes

  • Make best-efforts recovery verify SST unique ID before Version construction (#10962)
  • Introduce epoch_number and sort L0 files by epoch_number instead of largest_seqno. epoch_number represents the order of a file being flushed or ingested/imported. Compaction output file will be assigned with the minimum epoch_number among input files'. For L0, larger epoch_number indicates newer L0 file.
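
The two epoch_number rules stated above (a larger epoch_number means a newer L0 file, and a compaction output takes the minimum epoch_number among its inputs) can be sketched as follows. `FileMeta`, `SortL0NewestFirst`, and `OutputEpochNumber` are hypothetical names for illustration, not RocksDB types, and presenting the files newest first is an assumption made here for readability.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for per-file metadata.
struct FileMeta {
  uint64_t file_number;
  uint64_t epoch_number;
};

// Orders L0 files newest first: a larger epoch_number means a newer file.
inline void SortL0NewestFirst(std::vector<FileMeta>& files) {
  std::sort(files.begin(), files.end(),
            [](const FileMeta& a, const FileMeta& b) {
              return a.epoch_number > b.epoch_number;
            });
}

// A compaction output takes the minimum epoch_number among its inputs.
inline uint64_t OutputEpochNumber(const std::vector<FileMeta>& inputs) {
  assert(!inputs.empty());
  uint64_t min_epoch = inputs.front().epoch_number;
  for (const FileMeta& f : inputs) {
    min_epoch = std::min(min_epoch, f.epoch_number);
  }
  return min_epoch;
}
```

Because the ordering depends only on when a file was flushed or ingested, it does not change when sequence number ranges of different files overlap.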

Bug Fixes

  • Fixed a regression in iterator where range tombstones after iterate_upper_bound are processed.
  • Fixed a memory leak in MultiGet with async_io read option, caused by IO errors during table file open
  • Fixed a bug, present since #10348 or 7.8.0, in which multi-level FIFO compaction deletes one file in non-L0 even when CompactionOptionsFIFO::max_table_files_size is not exceeded.
  • Fixed a bug caused by DB::SyncWAL() affecting track_and_verify_wals_in_manifest. Without the fix, application may see "open error: Corruption: Missing WAL with log number" while trying to open the db. The corruption is a false alarm but prevents DB open (#10892).
  • Fixed a BackupEngine bug in which RestoreDBFromLatestBackup would fail if the latest backup was deleted and there is another valid backup available.
  • Fix L0 file misorder corruption caused by ingesting files whose seqnos overlap with those of memtable entries, by introducing epoch_number. Before the fix, force_consistency_checks=true may catch the corruption before it's exposed to readers, in which case writes returning Status::Corruption would be expected. This also replaces the previous incomplete fix (#5958) for the same corruption with a new and more complete fix.
  • Fixed a bug in LockWAL() leading to re-locking mutex (#11020).
  • Fixed a heap use after free bug in async scan prefetching when the scan thread and another thread try to read and load the same seek block into cache.
  • Fixed a heap use after free in async scan prefetching if dictionary compression is enabled, in which case sync read of the compression dictionary gets mixed with async prefetching
  • Fixed a data race in which CompactRange() with change_level=true acts on a range overlapping an ongoing file ingestion under level compaction. This results either in overlapping file ranges at a certain level, a corruption caught by force_consistency_checks=true, or potentially in two identical keys, both with seqno 0, in two different levels (i.e., new data ends up in a lower/older level). The latter is caught by an assertion in debug builds but passes silently in release builds, where reads return wrong results. This fix is general, so it also replaces previous fixes to a similar problem for CompactFiles() (#4665), general CompactRange() and auto compaction (commits 5c64fb6 and 87dfc1d).
  • Fixed a bug in compaction output cutting where small output files were produced because TTL file cutting states were not being updated (#11075).

New Features

  • When an SstPartitionerFactory is configured, CompactRange() now automatically selects for compaction any files overlapping a partition boundary that is in the compaction range, even if no actual entries are in the requested compaction range. With this feature, manual compaction can be used to (re-)establish SST partition points when SstPartitioner changes, without a full compaction.
  • Add BackupEngine feature to exclude files from backup that are known to be backed up elsewhere, using CreateBackupOptions::exclude_files_callback. To restore the DB, the excluded files must be provided in alternative backup directories using RestoreOptions::alternate_dirs.

Public API Changes

  • Substantial changes have been made to the Cache class to support internal development goals. Direct use of Cache class members is discouraged and further breaking modifications are expected in the future. SecondaryCache has some related changes and implementations will need to be updated. (Unlike Cache, SecondaryCache is still intended to support user implementations, and disruptive changes will be avoided.) (#10975)
  • Add MergeOperationOutput::op_failure_scope for merge operator users to control the blast radius of merge operator failures. Existing merge operator users do not need to make any change to preserve the old behavior

Performance Improvements

  • Updated xxHash source code, which should improve kXXH3 checksum speed, at least on ARM (#11098).
  • Improved CPU efficiency of DB reads, from block cache access improvements (#10975).

RocksDB 8.0.0

18 Mar 00:15

8.0.0 (2023-02-19)

Behavior changes

  • ReadOptions::verify_checksums=false disables checksum verification for more reads of non-CacheEntryRole::kDataBlock blocks.
  • In case of scan with async_io enabled, if posix doesn't support IOUring, Status::NotSupported error will be returned to the users. Initially that error was swallowed and reads were switched to synchronous reads.

Bug Fixes

  • Fixed a data race on ColumnFamilyData::flush_reason caused by concurrent flushes.
  • Fixed an issue in Get and MultiGet when user-defined timestamps are enabled in combination with BlobDB.
  • Fixed some atypical behaviors for LockWAL() such as allowing concurrent/recursive use and not expecting UnlockWAL() after non-OK result. See API comments.
  • Fixed a feature interaction bug where for blobs GetEntity would expose the blob reference instead of the blob value.
  • Fixed DisableManualCompaction() and CompactRangeOptions::canceled to cancel compactions even when they are waiting on conflicting compactions to finish
  • Fixed a bug in which a successful GetMergeOperands() could transiently return Status::MergeInProgress()
  • Return the correct error (Status::NotSupported()) to MultiGet caller when ReadOptions::async_io flag is true and IO uring is not enabled. Previously, Status::Corruption() was being returned when the actual failure was lack of async IO support.
  • Fixed a bug in DB open/recovery from a compressed WAL that was caused due to incorrect handling of certain record fragments with the same offset within a WAL block.

Feature Removal

  • Remove RocksDB Lite.
  • The feature block_cache_compressed is removed. Statistics related to it are removed too.
  • Remove deprecated Env::LoadEnv(). Use Env::CreateFromString() instead.
  • Remove deprecated FileSystem::Load(). Use FileSystem::CreateFromString() instead.
  • Removed the deprecated version of these utility functions and the corresponding Java bindings: LoadOptionsFromFile, LoadLatestOptions, CheckOptionsCompatibility.
  • Remove the FactoryFunc from the LoadObject method from the Customizable helper methods.

Public API Changes

  • Moved rarely-needed Cache class definition to new advanced_cache.h, and added a CacheWrapper class to advanced_cache.h. Minor changes to SimCache API definitions.
  • Completely removed the following deprecated/obsolete statistics: the tickers BLOCK_CACHE_INDEX_BYTES_EVICT, BLOCK_CACHE_FILTER_BYTES_EVICT, BLOOM_FILTER_MICROS, NO_FILE_CLOSES, STALL_L0_SLOWDOWN_MICROS, STALL_MEMTABLE_COMPACTION_MICROS, STALL_L0_NUM_FILES_MICROS, RATE_LIMIT_DELAY_MILLIS, NO_ITERATORS, NUMBER_FILTERED_DELETES, WRITE_TIMEDOUT, BLOB_DB_GC_NUM_KEYS_OVERWRITTEN, BLOB_DB_GC_NUM_KEYS_EXPIRED, BLOB_DB_GC_BYTES_OVERWRITTEN, BLOB_DB_GC_BYTES_EXPIRED, BLOCK_CACHE_COMPRESSION_DICT_BYTES_EVICT as well as the histograms STALL_L0_SLOWDOWN_COUNT, STALL_MEMTABLE_COMPACTION_COUNT, STALL_L0_NUM_FILES_COUNT, HARD_RATE_LIMIT_DELAY_COUNT, SOFT_RATE_LIMIT_DELAY_COUNT, BLOB_DB_GC_MICROS, and NUM_DATA_BLOCKS_READ_PER_LEVEL. Note that as a result, the C++ enum values of the still supported statistics have changed. Developers are advised to not rely on the actual numeric values.
  • Deprecated IngestExternalFileOptions::write_global_seqno and change default to false. This option only needs to be set to true to generate a DB compatible with RocksDB versions before 5.16.0.
  • Remove deprecated APIs GetColumnFamilyOptionsFrom{Map|String}(const ColumnFamilyOptions&, ..), GetDBOptionsFrom{Map|String}(const DBOptions&, ..), GetBlockBasedTableOptionsFrom{Map|String}(const BlockBasedTableOptions& table_options, ..) and GetPlainTableOptionsFrom{Map|String}(const PlainTableOptions& table_options,..).
  • Added a subcode of Status::Corruption, Status::SubCode::kMergeOperatorFailed, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions

Build Changes

  • The make build now builds a shared library by default instead of a static library. Use LIB_MODE=static to override.

New Features

  • Compaction filters are now supported for wide-column entities by means of the FilterV3 API. See the comment of the API for more details.
  • Added do_not_compress_roles to CompressedSecondaryCacheOptions to disable compression on certain kinds of block. Filter blocks are now not compressed by CompressedSecondaryCache by default.
  • Added a new MultiGetEntity API that enables batched wide-column point lookups. See the API comments for more details.

RocksDB 7.9.2

17 Jan 18:51

7.9.2 (2022-12-21)

Bug Fixes

  • Fixed a heap use after free bug in async scan prefetching when the scan thread and another thread try to read and load the same seek block into cache.

7.9.1 (2022-12-08)

Bug Fixes

  • Fixed a regression in iterator where range tombstones after iterate_upper_bound are processed.
  • Fixed a memory leak in MultiGet with async_io read option, caused by IO errors during table file open

Behavior changes

  • Make best-efforts recovery verify SST unique ID before Version construction (#10962)

7.9.0 (2022-11-21)

Performance Improvements

  • Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).

Bug Fixes

  • Fix memory corruption error in scans if async_io is enabled. Memory corruption happened if an IOError while reading the data left one buffer empty and another buffer, already in progress of an async read, was submitted for reading again.
  • Fix failed memtable flush retry bug that could cause wrongly ordered updates, which would surface to writers as Status::Corruption in case of force_consistency_checks=true (default). It affects use cases that enable both parallel flush (max_background_flushes > 1 or max_background_jobs >= 8) and non-default memtable count (max_write_buffer_number > 2).
  • Fixed an issue where the READ_NUM_MERGE_OPERANDS ticker was not updated when the base key-value or tombstone was read from an SST file.
  • Fixed a memory safety bug when using a SecondaryCache with block_cache_compressed. block_cache_compressed no longer attempts to use SecondaryCache features.
  • Fixed a regression in scan for async_io. During seek, valid buffers were getting cleared causing a regression.
  • Tiered Storage: fixed excessive keys written to penultimate level in non-debug builds.
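
The failed memtable flush retry fix above names a specific combination of settings. A quick way to audit a configuration against that description is a predicate like the one below. This is a hedged sketch: `FlushConfig` and `MatchesAffectedFlushConfig` are hypothetical and only mirror the three option names and thresholds quoted in the note. They are not RocksDB structs or APIs.

```cpp
// Hypothetical mirror of the three options named in the note above.
struct FlushConfig {
  int max_background_flushes;
  int max_background_jobs;
  int max_write_buffer_number;
};

// Returns true when the configuration matches the affected use case:
// parallel flush enabled AND a non-default memtable count.
inline bool MatchesAffectedFlushConfig(const FlushConfig& c) {
  const bool parallel_flush =
      c.max_background_flushes > 1 || c.max_background_jobs >= 8;
  const bool non_default_memtable_count = c.max_write_buffer_number > 2;
  return parallel_flush && non_default_memtable_count;
}
```

Deployments for which the predicate returns true are the ones the note singles out, so they have the most to gain from upgrading to a release containing the fix.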

New Features

  • Add basic support for user-defined timestamp to Merge (#10819).
  • Add stats for ReadAsync time spent and async read errors.
  • Basic support for the wide-column data model is now available. Wide-column entities can be stored using the PutEntity API, and retrieved using GetEntity and the new columns API of iterator. For compatibility, the classic APIs Get and MultiGet, as well as iterator's value API return the value of the anonymous default column of wide-column entities; also, GetEntity and iterator's columns return any plain key-values in the form of an entity which only has the anonymous default column. Merge (and GetMergeOperands) currently also apply to the default column; any other columns of entities are unaffected by Merge operations. Note that some features like compaction filters, transactions, user-defined timestamps, and the SST file writer do not yet support wide-column entities; also, there is currently no MultiGet-like API to retrieve multiple entities at once. We plan to gradually close the above gaps and also implement new features like column-level operations (e.g. updating or querying only certain columns of an entity).
  • Marked HyperClockCache as a production-ready alternative to LRUCache for the block cache. HyperClockCache greatly improves hot-path CPU efficiency under high parallel load or high contention, with some documented caveats and limitations. As much as 4.5x higher ops/sec vs. LRUCache has been seen in db_bench under high parallel load.
  • Add periodic diagnostics to info_log (LOG file) for HyperClockCache block cache if performance is degraded by bad estimated_entry_charge option.

Public API Changes

  • Marked block_cache_compressed as a deprecated feature. Use SecondaryCache instead.
  • Added a SecondaryCache::InsertSaved() API, with default implementation depending on Insert(). Some implementations might need to add a custom implementation of InsertSaved(). (Details in API comments.)

RocksDB 7.8.3

15 Dec 18:56

7.8.3 (2022-11-29)

  • Revert an internal change in 7.8.0 associated with some memory usage churn.

7.8.2 (2022-11-27)

Behavior changes

  • Make best-efforts recovery verify SST unique ID before Version construction (#10962)
  • Fix failed memtable flush retry bug that could cause wrongly ordered updates, which would surface to writers as Status::Corruption in case of force_consistency_checks=true (default). It affects use cases that enable both parallel flush (max_background_flushes > 1 or max_background_jobs >= 8) and non-default memtable count (max_write_buffer_number > 2).
  • Tiered Storage: fixed excessive keys written to penultimate level in non-debug builds.

Bug Fixes

  • Fixed a regression in scan for async_io. During seek, valid buffers were getting cleared causing a regression.
  • Fixed a performance regression in iterator where range tombstones after iterate_upper_bound are processed.

7.8.1 (2022-11-02)

Bug Fixes

  • Fix memory corruption error in scans if async_io is enabled. Memory corruption happened if an IOError while reading the data left one buffer empty and another buffer, already in progress of an async read, was submitted for reading again.

7.8.0 (2022-10-22)

New Features

  • DeleteRange() now supports user-defined timestamp.
  • Provide support for async_io with tailing iterators when ReadOptions.tailing is enabled during scans.
  • Tiered Storage: allow data moving up from the last level to the penultimate level if the input level is penultimate level or above.
  • Added DB::Properties::kFastBlockCacheEntryStats, which is similar to DB::Properties::kBlockCacheEntryStats, except returns cached (stale) values in more cases to reduce overhead.
  • FIFO compaction now supports migrating from a multi-level DB via DB::Open(). During the migration phase, the FIFO compaction picker picks the SST file with the smallest starting key in the bottom-most non-empty level. Note that during the migration phase, the file purge order will only be an approximation of "FIFO", as files in a lower level might sometimes contain newer keys than files in an upper level.
  • Added an option ignore_max_compaction_bytes_for_input to ignore max_compaction_bytes limit when adding files to be compacted from input level. This should help reduce write amplification. The option is enabled by default.
  • Tiered Storage: allow data moving up from the last level even if it's a last level only compaction, as long as the penultimate level is empty.
  • Add a new option IOOptions.do_not_recurse that can be used by underlying file systems to skip recursing through sub directories and list only files in GetChildren API.
  • Add option preserve_internal_time_seconds to preserve the time information for the latest data, which can be used to determine the age of data when preclude_last_level_data_seconds is enabled. The time information is attached to the SST in the table property rocksdb.seqno.time.map, which can be parsed by the ldb or sst_dump tool.

Bug Fixes

  • Fix a bug in io_uring_prep_cancel in the AbortIO API for posix, which expects sqe->addr to match the submitted read request; the wrong parameter was being passed.
  • Fixed a regression in iterator performance when the entire DB is a single memtable introduced in #10449. The fix is in #10705 and #10716.
  • Fixed an optimistic transaction validation bug caused by DBImpl::GetLatestSequenceForKey() returning non-latest seq for merge (#10724).
  • Fixed a bug in iterator refresh which could segfault for DeleteRange users (#10739).
  • Fixed a bug causing manual flush with flush_opts.wait=false to stall when database has stopped all writes (#10001).
  • Fixed a bug in iterator refresh that was not freeing up SuperVersion, which could cause excessive resource pinning (#10770).
  • Fixed a bug where RocksDB could be doing compaction endlessly when allow_ingest_behind is true and the bottommost level is not filled (#10767).
  • Fixed a memory safety bug in experimental HyperClockCache (#10768)
  • Fixed some cases where ldb update_manifest and ldb unsafe_remove_sst_file are not usable because they were requiring the DB files to match the existing manifest state (before updating the manifest to match a desired state).

Performance Improvements

  • Try to align the compaction output file boundaries to the next level ones, which can reduce more than 10% compaction load for the default level compaction. The feature is enabled by default, to disable, set AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size to false. As a side effect, it can create SSTs larger than the target_file_size (capped at 2x target_file_size) or smaller files.
  • Improve RoundRobin TTL compaction, which is going to be the same as normal RoundRobin compaction to move the compaction cursor.
  • Fix a small CPU regression caused by a change that UserComparatorWrapper was made Customizable, because Customizable itself has small CPU overhead for initialization.
  • Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).

Behavior Changes

  • Sanitize min_write_buffer_number_to_merge to 1 if atomic flush is enabled to prevent unexpected data loss when WAL is disabled in a multi-column-family setting (#10773).

Public API changes

  • Make kXXH3 checksum the new default, because it is faster on common hardware, especially with kCRC32c affected by a performance bug in some versions of clang (#9891). DBs written with this new setting can be read by RocksDB 6.27 and newer.
  • Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Introduced an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter. More details in include/rocksdb/block_cache_trace_writer.h.