rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.

GPL-2.0 License

Stars: 28K
Committers: 977

rocksdb - RocksDB 8.1.1

Published by ltamasi over 1 year ago

8.1.1 (04/06/2023)

Bug Fixes

  • In the DB::VerifyFileChecksums API, ensure that file system reads of SST files are equal to the readahead_size in ReadOptions, if specified. Previously, each read was 2x the readahead_size.

8.1.0 (03/18/2023)

Behavior changes

  • Compaction output file cutting logic now considers range tombstone start keys. For example, an SST partitioner may now receive a PartitionerRequest for range tombstone start keys.
  • If the async_io ReadOption is specified for MultiGet or NewIterator on a platform that doesn't support IO uring, the option is ignored and synchronous IO is used.

Bug Fixes

  • Fixed an issue for backward iteration when user defined timestamp is enabled in combination with BlobDB.
  • Fixed a couple of cases where a Merge operand encountered during iteration wasn't reflected in the internal_merge_count PerfContext counter.
  • Fixed a bug in CreateColumnFamilyWithImport()/ExportColumnFamily() which did not support range tombstones (#11252).
  • Fixed a bug where a column family excluded from an atomic flush contains unflushed data that should have been included in this atomic flush (i.e., data with seqno less than the max seqno of this atomic flush), leading to potential data loss in the excluded column family when WriteOptions::disableWAL == true (#11148).

New Features

  • Add statistics rocksdb.secondary.cache.filter.hits, rocksdb.secondary.cache.index.hits, and rocksdb.secondary.cache.data.hits.
  • Added a new PerfContext counter internal_merge_point_lookup_count which tracks the number of Merge operands applied while serving point lookup queries.
  • Add new statistics rocksdb.table.open.prefetch.tail.read.bytes, rocksdb.table.open.prefetch.tail.{miss|hit}
  • Add support for SecondaryCache with HyperClockCache (HyperClockCacheOptions inherits secondary_cache option from ShardedCacheOptions)
  • Add new db properties rocksdb.cf-write-stall-stats and rocksdb.db-write-stall-stats, and APIs to examine them in a structured way. In particular, users of GetMapProperty() with property kCFWriteStallStats/kDBWriteStallStats can now use the functions in WriteStallStatsMapKeys to find stats in the map.

Public API Changes

  • Changed various functions and features in Cache that are mostly relevant to custom implementations or wrappers. In particular, asynchronous lookup functionality is moved from Lookup() to a new StartAsyncLookup() function.
rocksdb - RocksDB 8.0.0

Published by ajkr over 1 year ago

8.0.0 (02/19/2023)

Behavior changes

  • ReadOptions::verify_checksums=false disables checksum verification for more reads of non-CacheEntryRole::kDataBlock blocks.
  • For scans with async_io enabled, if posix doesn't support IOUring, a Status::NotSupported error is now returned to the user. Previously that error was swallowed and reads were switched to synchronous reads.

Bug Fixes

  • Fixed a data race on ColumnFamilyData::flush_reason caused by concurrent flushes.
  • Fixed an issue in Get and MultiGet when user-defined timestamps are enabled in combination with BlobDB.
  • Fixed some atypical behaviors for LockWAL() such as allowing concurrent/recursive use and not expecting UnlockWAL() after non-OK result. See API comments.
  • Fixed a feature interaction bug where for blobs GetEntity would expose the blob reference instead of the blob value.
  • Fixed DisableManualCompaction() and CompactRangeOptions::canceled to cancel compactions even when they are waiting on conflicting compactions to finish
  • Fixed a bug in which a successful GetMergeOperands() could transiently return Status::MergeInProgress()
  • Return the correct error (Status::NotSupported()) to MultiGet caller when ReadOptions::async_io flag is true and IO uring is not enabled. Previously, Status::Corruption() was being returned when the actual failure was lack of async IO support.
  • Fixed a bug in DB open/recovery from a compressed WAL that was caused due to incorrect handling of certain record fragments with the same offset within a WAL block.

Feature Removal

  • Remove RocksDB Lite.
  • The feature block_cache_compressed is removed. Statistics related to it are removed too.
  • Remove deprecated Env::LoadEnv(). Use Env::CreateFromString() instead.
  • Remove deprecated FileSystem::Load(). Use FileSystem::CreateFromString() instead.
  • Removed the deprecated version of these utility functions and the corresponding Java bindings: LoadOptionsFromFile, LoadLatestOptions, CheckOptionsCompatibility.
  • Remove the FactoryFunc from the LoadObject method from the Customizable helper methods.

Public API Changes

  • Moved rarely-needed Cache class definition to new advanced_cache.h, and added a CacheWrapper class to advanced_cache.h. Minor changes to SimCache API definitions.
  • Completely removed the following deprecated/obsolete statistics: the tickers BLOCK_CACHE_INDEX_BYTES_EVICT, BLOCK_CACHE_FILTER_BYTES_EVICT, BLOOM_FILTER_MICROS, NO_FILE_CLOSES, STALL_L0_SLOWDOWN_MICROS, STALL_MEMTABLE_COMPACTION_MICROS, STALL_L0_NUM_FILES_MICROS, RATE_LIMIT_DELAY_MILLIS, NO_ITERATORS, NUMBER_FILTERED_DELETES, WRITE_TIMEDOUT, BLOB_DB_GC_NUM_KEYS_OVERWRITTEN, BLOB_DB_GC_NUM_KEYS_EXPIRED, BLOB_DB_GC_BYTES_OVERWRITTEN, BLOB_DB_GC_BYTES_EXPIRED, BLOCK_CACHE_COMPRESSION_DICT_BYTES_EVICT as well as the histograms STALL_L0_SLOWDOWN_COUNT, STALL_MEMTABLE_COMPACTION_COUNT, STALL_L0_NUM_FILES_COUNT, HARD_RATE_LIMIT_DELAY_COUNT, SOFT_RATE_LIMIT_DELAY_COUNT, BLOB_DB_GC_MICROS, and NUM_DATA_BLOCKS_READ_PER_LEVEL. Note that as a result, the C++ enum values of the still supported statistics have changed. Developers are advised to not rely on the actual numeric values.
  • Deprecated IngestExternalFileOptions::write_global_seqno and change default to false. This option only needs to be set to true to generate a DB compatible with RocksDB versions before 5.16.0.
  • Remove deprecated APIs GetColumnFamilyOptionsFrom{Map|String}(const ColumnFamilyOptions&, ..), GetDBOptionsFrom{Map|String}(const DBOptions&, ..), GetBlockBasedTableOptionsFrom{Map|String}(const BlockBasedTableOptions& table_options, ..) and GetPlainTableOptionsFrom{Map|String}(const PlainTableOptions& table_options,..).
  • Added a subcode of Status::Corruption, Status::SubCode::kMergeOperatorFailed, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions

Build Changes

  • The make build now builds a shared library by default instead of a static library. Use LIB_MODE=static to override.

New Features

  • Compaction filters are now supported for wide-column entities by means of the FilterV3 API. See the comment of the API for more details.
  • Added do_not_compress_roles to CompressedSecondaryCacheOptions to disable compression on certain kinds of block. Filter blocks are now not compressed by CompressedSecondaryCache by default.
  • Added a new MultiGetEntity API that enables batched wide-column point lookups. See the API comments for more details.
rocksdb - RocksDB 7.10.2

Published by hx235 over 1 year ago

7.10.2 (2023-02-10)

Bug Fixes

  • Fixed a bug in DB open/recovery from a compressed WAL that was caused due to incorrect handling of certain record fragments with the same offset within a WAL block.

7.10.1 (2023-02-01)

Bug Fixes

  • Fixed a data race on ColumnFamilyData::flush_reason caused by concurrent flushes.
  • Fixed DisableManualCompaction() and CompactRangeOptions::canceled to cancel compactions even when they are waiting on conflicting compactions to finish
  • Fixed a bug in which a successful GetMergeOperands() could transiently return Status::MergeInProgress()
  • Return the correct error (Status::NotSupported()) to MultiGet caller when ReadOptions::async_io flag is true and IO uring is not enabled. Previously, Status::Corruption() was being returned when the actual failure was lack of async IO support.

7.10.0 (2023-01-23)

Behavior changes

  • Make best-efforts recovery verify SST unique ID before Version construction (#10962)
  • Introduce epoch_number and sort L0 files by epoch_number instead of largest_seqno. epoch_number represents the order of a file being flushed or ingested/imported. Compaction output file will be assigned with the minimum epoch_number among input files'. For L0, larger epoch_number indicates newer L0 file.
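
The two epoch_number rules stated above (L0 files are ordered by epoch_number, larger meaning newer, and a compaction output takes the minimum epoch_number of its inputs) can be sketched with the standard library. `FileMeta`, `SortL0NewestFirst` and `OutputEpochNumber` are hypothetical names used for illustration; this is not RocksDB's internal file metadata.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for per-file metadata.
struct FileMeta {
  uint64_t file_number;
  uint64_t epoch_number;  // order in which the file was flushed or ingested
};

// Orders L0 files newest-first: a larger epoch_number means a newer file.
inline void SortL0NewestFirst(std::vector<FileMeta>& files) {
  std::sort(files.begin(), files.end(),
            [](const FileMeta& a, const FileMeta& b) {
              return a.epoch_number > b.epoch_number;
            });
}

// A compaction output is assigned the minimum epoch_number of its inputs.
inline uint64_t OutputEpochNumber(const std::vector<FileMeta>& inputs) {
  assert(!inputs.empty());
  uint64_t min_epoch = inputs.front().epoch_number;
  for (const FileMeta& f : inputs) {
    min_epoch = std::min(min_epoch, f.epoch_number);
  }
  return min_epoch;
}
```

Ordering by an explicit epoch rather than by largest_seqno keeps the newest-first order well defined even when an ingested file's sequence numbers overlap with those of other files.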

Bug Fixes

  • Fixed a regression in iterator where range tombstones after iterate_upper_bound are processed.
  • Fixed a memory leak in MultiGet with async_io read option, caused by IO errors during table file open
  • Fixed a bug where multi-level FIFO compaction deletes one file in non-L0 even when CompactionOptionsFIFO::max_table_files_size is not exceeded since #10348 or 7.8.0.
  • Fixed a bug caused by DB::SyncWAL() affecting track_and_verify_wals_in_manifest. Without the fix, application may see "open error: Corruption: Missing WAL with log number" while trying to open the db. The corruption is a false alarm but prevents DB open (#10892).
  • Fixed a BackupEngine bug in which RestoreDBFromLatestBackup would fail if the latest backup was deleted and there is another valid backup available.
  • Fixed L0 file misorder corruption caused by ingesting files whose seqnos overlap with memtable entries', by introducing epoch_number. Before the fix, force_consistency_checks=true may catch the corruption before it's exposed to readers, in which case writes returning Status::Corruption would be expected. This also replaces the previous incomplete fix (#5958) for the same corruption with a new and more complete fix.
  • Fixed a bug in LockWAL() leading to re-locking mutex (#11020).
  • Fixed a heap use after free bug in async scan prefetching when the scan thread and another thread try to read and load the same seek block into cache.
  • Fixed a heap use after free in async scan prefetching if dictionary compression is enabled, in which case sync read of the compression dictionary gets mixed with async prefetching
  • Fixed a data race in which CompactRange() with change_level=true acts on a range overlapping with an ongoing file ingestion under level compaction. This would result either in overlapping file ranges at a certain level, a corruption caught by force_consistency_checks=true, or potentially in two identical keys, both with seqno 0, in two different levels (i.e., new data ends up in a lower/older level). The latter is caught by an assertion in debug builds but goes unnoticed in release builds, where reads return wrong results. This fix is general, so it also replaces previous fixes to a similar problem for CompactFiles() (#4665), general CompactRange() and auto compaction (commits 5c64fb6 and 87dfc1d).
  • Fixed a bug in compaction output cutting where small output files were produced because TTL file cutting states were not being updated (#11075).

New Features

  • When an SstPartitionerFactory is configured, CompactRange() now automatically selects for compaction any files overlapping a partition boundary that is in the compaction range, even if no actual entries are in the requested compaction range. With this feature, manual compaction can be used to (re-)establish SST partition points when SstPartitioner changes, without a full compaction.
  • Add BackupEngine feature to exclude files from backup that are known to be backed up elsewhere, using CreateBackupOptions::exclude_files_callback. To restore the DB, the excluded files must be provided in alternative backup directories using RestoreOptions::alternate_dirs.

Public API Changes

  • Substantial changes have been made to the Cache class to support internal development goals. Direct use of Cache class members is discouraged and further breaking modifications are expected in the future. SecondaryCache has some related changes and implementations will need to be updated. (Unlike Cache, SecondaryCache is still intended to support user implementations, and disruptive changes will be avoided.) (#10975)
  • Add MergeOperationOutput::op_failure_scope for merge operator users to control the blast radius of merge operator failures. Existing merge operator users do not need to make any change to preserve the old behavior

Performance Improvements

  • Updated xxHash source code, which should improve kXXH3 checksum speed, at least on ARM (#11098).
  • Improved CPU efficiency of DB reads, from block cache access improvements (#10975).
rocksdb - RocksDB 7.9.2

Published by anand1976 almost 2 years ago

7.9.2 (2022-12-21)

Bug Fixes

  • Fixed a heap use after free bug in async scan prefetching when the scan thread and another thread try to read and load the same seek block into cache.

7.9.1 (2022-12-08)

Bug Fixes

  • Fixed a regression in iterator where range tombstones after iterate_upper_bound are processed.
  • Fixed a memory leak in MultiGet with async_io read option, caused by IO errors during table file open

Behavior changes

  • Make best-efforts recovery verify SST unique ID before Version construction (#10962)

7.9.0 (2022-11-21)

Performance Improvements

  • Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).

Bug Fixes

  • Fixed a memory corruption error in scans when async_io is enabled. The corruption occurred when an IOError during a read left one buffer empty while the other buffer, already in the middle of an async read, was submitted for reading again.
  • Fix failed memtable flush retry bug that could cause wrongly ordered updates, which would surface to writers as Status::Corruption in case of force_consistency_checks=true (default). It affects use cases that enable both parallel flush (max_background_flushes > 1 or max_background_jobs >= 8) and non-default memtable count (max_write_buffer_number > 2).
  • Fixed an issue where the READ_NUM_MERGE_OPERANDS ticker was not updated when the base key-value or tombstone was read from an SST file.
  • Fixed a memory safety bug when using a SecondaryCache with block_cache_compressed. block_cache_compressed no longer attempts to use SecondaryCache features.
  • Fixed a regression in scan for async_io. During seek, valid buffers were getting cleared causing a regression.
  • Tiered Storage: fixed excessive keys written to penultimate level in non-debug builds.

New Features

  • Add basic support for user-defined timestamp to Merge (#10819).
  • Add stats for ReadAsync time spent and async read errors.
  • Basic support for the wide-column data model is now available. Wide-column entities can be stored using the PutEntity API, and retrieved using GetEntity and the new columns API of iterator. For compatibility, the classic APIs Get and MultiGet, as well as iterator's value API return the value of the anonymous default column of wide-column entities; also, GetEntity and iterator's columns return any plain key-values in the form of an entity which only has the anonymous default column. Merge (and GetMergeOperands) currently also apply to the default column; any other columns of entities are unaffected by Merge operations. Note that some features like compaction filters, transactions, user-defined timestamps, and the SST file writer do not yet support wide-column entities; also, there is currently no MultiGet-like API to retrieve multiple entities at once. We plan to gradually close the above gaps and also implement new features like column-level operations (e.g. updating or querying only certain columns of an entity).
  • Marked HyperClockCache as a production-ready alternative to LRUCache for the block cache. HyperClockCache greatly improves hot-path CPU efficiency under high parallel load or high contention, with some documented caveats and limitations. As much as 4.5x higher ops/sec vs. LRUCache has been seen in db_bench under high parallel load.
  • Add periodic diagnostics to info_log (LOG file) for HyperClockCache block cache if performance is degraded by bad estimated_entry_charge option.
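
The compatibility rules for wide-column entities described above (classic reads return the value of the anonymous default column, and plain key-values appear as entities that have only that column) can be modeled in a few lines. The sketch does not call RocksDB: `Columns`, `ClassicValue` and `AsEntity` are hypothetical names, and representing the anonymous default column by an empty column name is an assumption of this model.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of an entity: a list of (column name, value) pairs.
// The anonymous default column is modeled as the pair with an empty name.
using Columns = std::vector<std::pair<std::string, std::string>>;

// Models a classic Get-style read of an entity: only the default
// column's value is returned (empty string if the entity has none).
inline std::string ClassicValue(const Columns& entity) {
  for (const auto& column : entity) {
    if (column.first.empty()) return column.second;
  }
  return std::string();
}

// Models a GetEntity-style read of a plain key-value: the value is
// presented as an entity with only the anonymous default column.
inline Columns AsEntity(const std::string& plain_value) {
  Columns entity;
  entity.emplace_back(std::string(), plain_value);
  return entity;
}
```

In this model the two views round-trip: reading a plain value as an entity and then reading that entity with the classic accessor gives back the original value.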

Public API Changes

  • Marked block_cache_compressed as a deprecated feature. Use SecondaryCache instead.
  • Added a SecondaryCache::InsertSaved() API, with default implementation depending on Insert(). Some implementations might need to add a custom implementation of InsertSaved(). (Details in API comments.)
rocksdb - RocksDB 7.8.3

Published by cbi42 almost 2 years ago

7.8.3 (2022-11-29)

  • Revert an internal change in 7.8.0 associated with some memory usage churn.

7.8.2 (2022-11-27)

Behavior changes

  • Make best-efforts recovery verify SST unique ID before Version construction (#10962)
  • Fix failed memtable flush retry bug that could cause wrongly ordered updates, which would surface to writers as Status::Corruption in case of force_consistency_checks=true (default). It affects use cases that enable both parallel flush (max_background_flushes > 1 or max_background_jobs >= 8) and non-default memtable count (max_write_buffer_number > 2).
  • Tiered Storage: fixed excessive keys written to penultimate level in non-debug builds.

Bug Fixes

  • Fixed a regression in scan for async_io. During seek, valid buffers were getting cleared causing a regression.
  • Fixed a performance regression in iterator where range tombstones after iterate_upper_bound are processed.

7.8.1 (2022-11-02)

Bug Fixes

  • Fixed a memory corruption error in scans when async_io is enabled. The corruption occurred when an IOError during a read left one buffer empty while the other buffer, already in the middle of an async read, was submitted for reading again.

7.8.0 (2022-10-22)

New Features

  • DeleteRange() now supports user-defined timestamp.
  • Provide support for async_io with tailing iterators when ReadOptions.tailing is enabled during scans.
  • Tiered Storage: allow data moving up from the last level to the penultimate level if the input level is penultimate level or above.
  • Added DB::Properties::kFastBlockCacheEntryStats, which is similar to DB::Properties::kBlockCacheEntryStats, except returns cached (stale) values in more cases to reduce overhead.
  • FIFO compaction now supports migrating from a multi-level DB via DB::Open(). During the migration phase, the FIFO compaction picker will:
      • pick the SST file with the smallest starting key in the bottom-most non-empty level.
      • Note that during the migration phase, the file purge order will only be an approximation of "FIFO", as files in lower levels might sometimes contain newer keys than files in upper levels.
  • Added an option ignore_max_compaction_bytes_for_input to ignore max_compaction_bytes limit when adding files to be compacted from input level. This should help reduce write amplification. The option is enabled by default.
  • Tiered Storage: allow data moving up from the last level even if it's a last level only compaction, as long as the penultimate level is empty.
  • Add a new option IOOptions.do_not_recurse that can be used by underlying file systems to skip recursing through sub directories and list only files in GetChildren API.
  • Add option preserve_internal_time_seconds to preserve the time information for the latest data, which can be used to determine the age of data when preclude_last_level_data_seconds is enabled. The time information is attached to the SST file in the table property rocksdb.seqno.time.map, which can be parsed by the ldb or sst_dump tools.
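
The file selection rule stated for the FIFO migration phase above (pick the SST file with the smallest starting key in the bottom-most non-empty level) can be sketched with the standard library alone. This is not RocksDB's picker: `SstFile` and `PickMigrationVictim` are hypothetical names, keys are compared bytewise, and treating "only L0 is populated" as the end of the migration phase is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified file metadata (not a RocksDB type).
struct SstFile {
  std::string name;
  std::string smallest_key;
};

// levels[0] is L0; the last element is the bottom-most level.
// Returns the file to purge next during migration, or nullptr when no
// non-L0 level holds any file (assumed here to mean migration is done).
inline const SstFile* PickMigrationVictim(
    const std::vector<std::vector<SstFile>>& levels) {
  // Scan from the bottom-most level upwards, stopping before L0.
  for (std::size_t i = levels.size(); i-- > 1;) {
    const std::vector<SstFile>& level = levels[i];
    if (level.empty()) continue;
    // Within the first non-empty level found, take the smallest start key.
    const SstFile* victim = &level.front();
    for (const SstFile& f : level) {
      if (f.smallest_key < victim->smallest_key) victim = &f;
    }
    return victim;
  }
  return nullptr;
}
```

Because the choice is driven by key order rather than by file age, the purge order is only an approximation of FIFO, as the note in the list above points out.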

Bug Fixes

  • Fixed a bug in the use of io_uring_prep_cancel in the AbortIO API for posix: it expects sqe->addr to match the submitted read request, but the wrong parameter was being passed.
  • Fixed a regression in iterator performance when the entire DB is a single memtable introduced in #10449. The fix is in #10705 and #10716.
  • Fixed an optimistic transaction validation bug caused by DBImpl::GetLatestSequenceForKey() returning non-latest seq for merge (#10724).
  • Fixed a bug in iterator refresh which could segfault for DeleteRange users (#10739).
  • Fixed a bug causing manual flush with flush_opts.wait=false to stall when database has stopped all writes (#10001).
  • Fixed a bug in iterator refresh that was not freeing up SuperVersion, which could cause excessive resource pinning (#10770).
  • Fixed a bug where RocksDB could be doing compaction endlessly when allow_ingest_behind is true and the bottommost level is not filled (#10767).
  • Fixed a memory safety bug in experimental HyperClockCache (#10768)
  • Fixed some cases where ldb update_manifest and ldb unsafe_remove_sst_file are not usable because they were requiring the DB files to match the existing manifest state (before updating the manifest to match a desired state).

Performance Improvements

  • Try to align the compaction output file boundaries to those of the next level, which can reduce compaction load by more than 10% for the default level compaction. The feature is enabled by default; to disable it, set AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size to false. As a side effect, it can create SSTs larger than the target_file_size (capped at 2x target_file_size) or smaller files.
  • Improve RoundRobin TTL compaction, which is going to be the same as normal RoundRobin compaction to move the compaction cursor.
  • Fix a small CPU regression caused by a change that UserComparatorWrapper was made Customizable, because Customizable itself has small CPU overhead for initialization.
  • Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).

Behavior Changes

  • Sanitize min_write_buffer_number_to_merge to 1 if atomic flush is enabled to prevent unexpected data loss when WAL is disabled in a multi-column-family setting (#10773).

Public API changes

  • Make kXXH3 checksum the new default, because it is faster on common hardware, especially with kCRC32c affected by a performance bug in some versions of clang (https://github.com/facebook/rocksdb/issues/9891). DBs written with this new setting can be read by RocksDB 6.27 and newer.
  • Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Introduced an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter. More details in include/rocksdb/block_cache_trace_writer.h.
rocksdb - RocksDB 7.7.8

Published by cbi42 almost 2 years ago

7.7.8 (2022-11-27)

Bug Fixes

  • Fix failed memtable flush retry bug that could cause wrongly ordered updates, which would surface to writers as Status::Corruption in case of force_consistency_checks=true (default). It affects use cases that enable both parallel flush (max_background_flushes > 1 or max_background_jobs >= 8) and non-default memtable count (max_write_buffer_number > 2).
  • Tiered Storage: fixed excessive keys written to penultimate level in non-debug builds.
  • Fixed a regression in iterator where range tombstones after iterate_upper_bound are processed.

7.7.7 (2022-11-15)

Bug Fixes

  • Fixed a regression in scan for async_io. During seek, valid buffers were getting cleared causing a regression.

7.7.6 (2022-11-03)

Bug Fixes

  • Fixed a memory corruption error in scans when async_io is enabled. The corruption occurred when an IOError during a read left one buffer empty while the other buffer, already in the middle of an async read, was submitted for reading again.

7.7.5 (2022-10-28)

Bug Fixes

  • Fixed an iterator performance regression for delete range users when scanning through a consecutive sequence of range tombstones (#10877).

7.7.4 (2022-10-28)

Bug Fixes

  • Fixed a case of calling malloc_usable_size on result of operator new[].
rocksdb - RocksDB 7.7.3

Published by pdillinger about 2 years ago

7.7.3 (2022-10-11)

Bug Fixes

  • Fixed a memory safety bug in experimental HyperClockCache (#10768)
rocksdb - RocksDB 7.7.2

Published by ajkr about 2 years ago

7.7.2 (2022-10-05)

Bug Fixes

  • Fixed a bug in iterator refresh that was not freeing up SuperVersion, which could cause excessive resource pinning (#10770).
  • Fixed a bug where RocksDB could be doing compaction endlessly when allow_ingest_behind is true and the bottommost level is not filled (#10767).

Behavior Changes

  • Sanitize min_write_buffer_number_to_merge to 1 if atomic flush is enabled to prevent unexpected data loss when WAL is disabled in a multi-column-family setting (#10773).

7.7.1 (2022-09-26)

Bug Fixes

  • Fixed an optimistic transaction validation bug caused by DBImpl::GetLatestSequenceForKey() returning non-latest seq for merge (#10724).
  • Fixed a bug in iterator refresh which could segfault for DeleteRange users (#10739).

7.7.0 (2022-09-18)

Bug Fixes

  • Fixed a hang when an operation such as GetLiveFiles or CreateNewBackup is asked to trigger and wait for memtable flush on a read-only DB. Such indirect requests for memtable flush are now ignored on a read-only DB.
  • Fixed bug where FlushWAL(true /* sync */) (used by GetLiveFilesStorageInfo(), which is used by checkpoint and backup) could cause parallel writes at the tail of a WAL file to never be synced.
  • Fixed periodic_task being unable to re-register the same task type, which could cause SetOptions() to fail to update periodic task intervals such as stats_dump_period_sec and stats_persist_period_sec.
  • Fixed a bug in the rocksdb.prefetched.bytes.discarded stat. It was counting the prefetch buffer size, rather than the actual number of bytes discarded from the buffer.
  • Fixed a bug where the directory containing CURRENT can be left unsynced after CURRENT is updated to point to the latest MANIFEST, which risks losing the unsynced update to CURRENT.
  • Update rocksdb.multiget.io.batch.size stat in non-async MultiGet as well.
  • Fix a bug in key range overlap checking with concurrent compactions when user-defined timestamp is enabled. User-defined timestamps should be EXCLUDED when checking if two ranges overlap.
  • Fixed a bug where the blob cache prepopulating logic did not consider the secondary cache (see #10603).
  • Fixed the rocksdb.num.sst.read.per.level, rocksdb.num.index.and.filter.blocks.read.per.level and rocksdb.num.level.read.per.multiget stats in the MultiGet coroutines
  • Fixed a bug in the use of io_uring_prep_cancel in the AbortIO API for posix: it expects sqe->addr to match the submitted read request, but the wrong parameter was being passed.
  • Fixed a regression in iterator performance when the entire DB is a single memtable introduced in #10449. The fix is in #10705 and #10716.

Public API changes

  • Add rocksdb_column_family_handle_get_id, rocksdb_column_family_handle_get_name to get name, id of column family in C API
  • Add a new stat rocksdb.async.prefetch.abort.micros to measure time spent waiting for async prefetch reads to abort

Java API Changes

  • Add CompactionPriority.RoundRobin.
  • Revert to using the default metadata charge policy when creating an LRU cache via the Java API.

Behavior Change

  • DBOptions::verify_sst_unique_id_in_manifest is now an on-by-default feature that verifies SST file identity whenever they are opened by a DB, rather than only at DB::Open time.
  • Previously, when the option migration tool (OptionChangeMigration()) migrated to FIFO compaction, it compacted all the data into a single SST file and moved it to L0. This could be a problem for some users: the giant file might soon be deleted to satisfy max_table_files_size, leaving the DB almost empty. The behavior is changed so that the files are cut into smaller ones, but these files might not follow the data insertion order. As a result, after the migration, FIFO compaction might not drop migrated data in insertion order.
  • When a block is first found in CompressedSecondaryCache, only a dummy block is inserted into the primary cache and the block is not erased from CompressedSecondaryCache. A standalone handle is returned to the caller. Only if the block is found again in CompressedSecondaryCache before the dummy block is evicted is it erased from CompressedSecondaryCache and inserted into the primary cache.
  • When a block is first evicted from the primary cache to CompressedSecondaryCache, only a dummy block is inserted into CompressedSecondaryCache. Only if the block is evicted again before the dummy block is evicted from the cache is it treated as a hot block and inserted into CompressedSecondaryCache.
  • Improved the estimation of memory used by cached blobs by taking into account the size of the object owning the blob value and also the allocator overhead if malloc_usable_size is available (see #10583).
  • Blob values now have their own category in the cache occupancy statistics, as opposed to being lumped into the "Misc" bucket (see #10601).
  • Change the optimize_multiget_for_io experimental ReadOptions flag to default on.
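
The dummy-block rule for evictions into CompressedSecondaryCache described above amounts to a "second eviction admits" policy. The toy model below illustrates only that rule; it is not RocksDB's implementation. `TwoHitSecondaryCache`, `OnPrimaryEviction` and `HasRealValue` are hypothetical names, and capacity limits (including eviction of the dummy entries themselves) are deliberately left out.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical toy model of the admission rule (not a RocksDB class).
class TwoHitSecondaryCache {
 public:
  // Called when the primary cache evicts `key`.
  // Returns true if the real value was stored, false if only a dummy
  // placeholder was recorded.
  bool OnPrimaryEviction(const std::string& key, const std::string& value) {
    auto it = entries_.find(key);
    if (it == entries_.end()) {
      // First eviction: remember the key, but do not pay for the payload.
      entries_[key] = Entry{true, std::string()};
      return false;
    }
    // Evicted again while the placeholder is still present: treat as hot.
    it->second = Entry{false, value};
    return true;
  }

  // True only if the real payload (not a dummy) is stored for `key`.
  bool HasRealValue(const std::string& key) const {
    auto it = entries_.find(key);
    return it != entries_.end() && !it->second.is_dummy;
  }

 private:
  struct Entry {
    bool is_dummy;
    std::string value;
  };
  std::unordered_map<std::string, Entry> entries_;
};
```

The point of the design is that a block evicted only once never costs compression work or secondary cache space; only blocks that cycle through the primary cache repeatedly are admitted.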

New Features

  • RocksDB performs internal auto prefetching when it notices 2 sequential reads and readahead_size is not specified. The new option num_file_reads_for_auto_readahead in BlockBasedTableOptions indicates after how many sequential reads internal auto prefetching should start (default is 2).
  • Added new perf context counters block_cache_standalone_handle_count, block_cache_real_handle_count, compressed_sec_cache_insert_real_count, compressed_sec_cache_insert_dummy_count, compressed_sec_cache_uncompressed_bytes, and compressed_sec_cache_compressed_bytes.
  • Memory for blobs which are to be inserted into the blob cache is now allocated using the cache's allocator (see #10628 and #10647).
  • HyperClockCache is an experimental, lock-free Cache alternative for block cache that offers much improved CPU efficiency under high parallel load or high contention, with some caveats. As much as 4.5x higher ops/sec vs. LRUCache has been seen in db_bench under high parallel load.
  • CompressedSecondaryCacheOptions::enable_custom_split_merge is added for enabling the custom split and merge feature, which splits the compressed value into chunks so that they may better fit jemalloc bins.
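
The role of num_file_reads_for_auto_readahead in the list above can be illustrated with a small counter that detects sequential reads. This is a sketch under stated assumptions, not RocksDB's prefetch buffer: `SequentialReadTracker` and `OnRead` are hypothetical names, "sequential" is taken to mean that a read starts exactly where the previous one ended, and whether the real implementation begins prefetching on the read that reaches the threshold or on the following one is not specified by the notes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tracker (not a RocksDB class) that counts back-to-back
// sequential reads of one file.
class SequentialReadTracker {
 public:
  explicit SequentialReadTracker(uint64_t num_file_reads_for_auto_readahead = 2)
      : threshold_(num_file_reads_for_auto_readahead) {}

  // Records a read of `len` bytes at `offset`. Returns true once the
  // number of consecutive sequential reads has reached the threshold.
  bool OnRead(uint64_t offset, uint64_t len) {
    assert(len > 0);
    if (num_reads_ > 0 && offset == next_offset_) {
      ++num_reads_;
    } else {
      num_reads_ = 1;  // first read, or a jump: restart the count
    }
    next_offset_ = offset + len;
    return num_reads_ >= threshold_;
  }

 private:
  uint64_t threshold_;
  uint64_t num_reads_ = 0;
  uint64_t next_offset_ = 0;
};
```

Raising the threshold in this model delays the signal, which mirrors the intent of the option: workloads with short sequential bursts can avoid triggering readahead that would be wasted.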

Performance Improvements

  • Iterator performance is improved for DeleteRange() users. Internally, the iterator will skip to the end of a range tombstone when possible, instead of looping through each key and checking individually whether it is range deleted.
  • Eliminated some allocations and copies in the blob read path. Also, PinnableSlice now only points to the blob value and pins the backing resource (cache entry or buffer) in all cases, instead of containing a copy of the blob value. See #10625 and #10647.
  • For scans with async_io enabled, a few optimizations have been added to issue more asynchronous requests in parallel in order to avoid synchronous prefetching.
  • DeleteRange() users should see improvement in get/iterator performance from mutable memtable (see #10547).
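
The iterator optimization for DeleteRange() users described above (jumping to the end of a range tombstone instead of testing every covered key) can be sketched over a sorted key list. This is an illustration only: `RangeTombstone` and `VisibleKeys` are hypothetical names, keys are compared bytewise, tombstones are assumed sorted by start key and non-overlapping, and sequence numbers are ignored, so every tombstone is treated as covering every key in its range.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical tombstone covering the half-open key range [start, end).
struct RangeTombstone {
  std::string start;
  std::string end;
};

// Returns the keys not covered by any tombstone.
// `keys` must be sorted ascending; `tombstones` must be sorted by start
// and non-overlapping (assumptions of this sketch).
inline std::vector<std::string> VisibleKeys(
    const std::vector<std::string>& keys,
    const std::vector<RangeTombstone>& tombstones) {
  assert(std::is_sorted(keys.begin(), keys.end()));
  std::vector<std::string> visible;
  auto key_it = keys.begin();
  auto ts_it = tombstones.begin();
  while (key_it != keys.end()) {
    // Discard tombstones that end at or before the current key.
    while (ts_it != tombstones.end() && ts_it->end <= *key_it) ++ts_it;
    if (ts_it != tombstones.end() && ts_it->start <= *key_it) {
      // The current key is deleted: seek straight to the first key at or
      // past the tombstone's end rather than stepping key by key.
      key_it = std::lower_bound(key_it, keys.end(), ts_it->end);
      continue;
    }
    visible.push_back(*key_it);
    ++key_it;
  }
  return visible;
}
```

With a tombstone covering many keys, the binary search replaces a linear walk over the deleted range, which is the source of the improvement the notes describe.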
rocksdb - 7.6.0 (2022-08-19)

Published by gitbw95 about 2 years ago

New Features

  • Added prepopulate_blob_cache to ColumnFamilyOptions. If enabled, warm/hot blobs which are already in memory are prepopulated into the blob cache at the time of flush. On a flush, blobs that are in memory (in memtables) get flushed to the device. If using Direct IO, additional IO is incurred to read these blobs back into memory again, which is avoided by enabling this option. This further helps if the workload exhibits high temporal locality, where most of the reads go to recently written data. It also helps with remote file systems, since they involve network traffic and higher latencies.
  • Support using secondary cache with the blob cache. When creating a blob cache, the user can set a secondary blob cache by configuring secondary_cache in LRUCacheOptions.
  • Charge memory usage of blob cache when the backing cache of the blob cache and the block cache are different. If an operation reserving memory for blob cache exceeds the available space left in the block cache at some point (i.e., causing a cache full under LRUCacheOptions::strict_capacity_limit = true), creation will fail with Status::MemoryLimit(). To opt in to this feature, enable charging CacheEntryRole::kBlobCache in BlockBasedTableOptions::cache_usage_options.
  • Improve subcompaction range partition so that it is likely to be more even. A more even distribution of subcompactions will improve compaction throughput for some workloads. All input files' index blocks are used to sample some anchor key points, from which we pick positions to partition the input range. This would introduce some CPU overhead in the compaction preparation phase, if subcompaction is enabled, but it should be a small fraction of the CPU usage of the whole compaction process. This also brings a behavior change: the subcompaction number is much more likely to be maxed out than before.
  • Add CompactionPri::kRoundRobin, a compaction picking mode that cycles through all the files with a compact cursor in a round-robin manner. This feature is available since 7.5.
  • Provide support for subcompactions for user_defined_timestamp.
  • Added an option memtable_protection_bytes_per_key that turns on memtable per key-value checksum protection. Each memtable entry will be suffixed by a checksum that is computed during writes, and verified in reads/compaction. Detected corruption will be logged and a corruption status returned to the user.
  • Added a blob-specific cache priority level - bottom level. Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. The user can specify the new option low_pri_pool_ratio in LRUCacheOptions to configure the ratio of capacity reserved for low priority cache entries (and therefore the remaining ratio is the space reserved for the bottom level), or configure the new argument low_pri_pool_ratio in NewLRUCache() to achieve the same effect.
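The per key-value checksum protection described for memtable_protection_bytes_per_key follows a suffix-and-verify pattern. The following is a minimal standalone sketch of that pattern, not RocksDB's implementation: AppendChecksum and VerifyEntry are hypothetical helpers, and FNV-1a is only a stand-in for whatever checksum RocksDB actually computes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// FNV-1a 64-bit hash, used here purely as a stand-in checksum.
uint64_t Fnv1a64(const std::string& data) {
  uint64_t h = 14695981039346656037ULL;  // FNV offset basis
  for (unsigned char c : data) {
    h ^= c;
    h *= 1099511628211ULL;  // FNV prime
  }
  return h;
}

const std::size_t kChecksumBytes = 8;  // assumed protection width per entry

// Write path: suffix the payload with its checksum (little-endian bytes).
std::string AppendChecksum(const std::string& payload) {
  std::string entry = payload;
  uint64_t sum = Fnv1a64(payload);
  for (std::size_t i = 0; i < kChecksumBytes; ++i) {
    entry.push_back(static_cast<char>((sum >> (8 * i)) & 0xff));
  }
  return entry;
}

// Read path: recompute the checksum and compare it with the stored suffix.
// Returns false on mismatch, which a caller would surface as corruption.
bool VerifyEntry(const std::string& entry, std::string* payload) {
  if (entry.size() < kChecksumBytes) return false;
  std::string body = entry.substr(0, entry.size() - kChecksumBytes);
  uint64_t stored = 0;
  for (std::size_t i = 0; i < kChecksumBytes; ++i) {
    unsigned char b =
        static_cast<unsigned char>(entry[body.size() + i]);
    stored |= static_cast<uint64_t>(b) << (8 * i);
  }
  if (stored != Fnv1a64(body)) return false;
  *payload = body;
  return true;
}
```

An entry produced by AppendChecksum verifies successfully, and flipping a bit of the payload afterwards makes VerifyEntry return false.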

Public API changes

  • Removed Customizable support for RateLimiter and removed its CreateFromString() and Type() functions.
  • CompactRangeOptions::exclusive_manual_compaction is now false by default. This ensures RocksDB does not introduce artificial parallelism limitations by default.
  • Tiered Storage: change bottommost_temperature to last_level_temperature. The old option name is kept only for migration; please use the new option. The behavior is changed to apply temperature to the last_level SST files only.
  • Added a new experimental ReadOption flag called optimize_multiget_for_io, which when set attempts to reduce MultiGet latency by spawning coroutines for keys in multiple levels.

Bug Fixes

  • Fix a bug starting in 7.4.0 in which some fsync operations might be skipped in a DB after any DropColumnFamily on that DB, until it is re-opened. This can lead to data loss on power loss. (For custom FileSystem implementations, this could lead to FSDirectory::Fsync or FSDirectory::Close after the first FSDirectory::Close; also, valgrind could report a call to close() with fd=-1.)
  • Fix a bug where GenericRateLimiter could revert the bandwidth set dynamically using SetBytesPerSecond() when a user configures a structure enclosing it, e.g., using GetOptionsFromString() to configure an Options that references an existing RateLimiter object.
  • Fix race conditions in GenericRateLimiter.
  • Fix a bug in FIFOCompactionPicker::PickTTLCompaction where the total_size calculation might cause underflow.
  • Fix a data race bug in hash linked list memtable. With this bug, a read request might temporarily miss an old record in the memtable in a race condition to the hash bucket.
  • Fix a bug that best_efforts_recovery may fail to open the db with mmap read.
  • Fixed a bug where blobs read during compaction would pollute the cache.
  • Fixed a data race in LRUCache when used with a secondary_cache.
  • Fixed a bug where blobs read by iterators would be inserted into the cache even with the fill_cache read option set to false.
  • Fixed the segfault caused by AllocateData() in CompressedSecondaryCache::SplitValueIntoChunks() and MergeChunksIntoValueTest.
  • Fixed a bug in BlobDB where a mix of inlined and blob values could result in an incorrect value being passed to the compaction filter (see #10391).
  • Fixed a memory leak bug in stress tests caused by FaultInjectionSecondaryCache.

Behavior Change

  • Added checksum handshake during the copying of decompressed WAL fragment. This together with #9875, #10037, #10212, #10114 and #10319 provides end-to-end integrity protection for write batch during recovery.
  • To minimize the internal fragmentation caused by the variable size of the compressed blocks in CompressedSecondaryCache, the original block is split according to the jemalloc bin size in Insert() and then merged back in Lookup().
  • PosixLogger is removed and by default EnvLogger will be used for info logging. The behavior of the two loggers should be very similar when using the default Posix Env.
  • Remove [min|max]_timestamp from VersionEdit for now since they are not tracked in MANIFEST anyway but consume two empty std::string (up to 64 bytes) for each file. Should they be added back in the future, we should store them more compactly.
  • Improve universal tiered storage compaction picker to avoid extra major compaction triggered by size amplification. If preclude_last_level_data_seconds is enabled, the size amplification is calculated within non-last-level data only, which skips the last level and uses the penultimate level as the size base.
  • If an error is hit when writing to a file (append, sync, etc), RocksDB is more strict with not issuing more operations to it, except closing the file, with exceptions of some WAL file operations in error recovery path.
  • A WriteBufferManager constructed with allow_stall == false will no longer trigger write stall implicitly by thrashing until memtable count limit is reached. Instead, a column family can continue accumulating writes while that CF is flushing, which means memory may increase. Users who prefer stalling writes must now explicitly set allow_stall == true.
  • Add CompressedSecondaryCache into the stress tests.
  • Block cache keys have changed, which will cause any persistent caches to miss between versions.
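The split-on-insert and merge-on-lookup behavior described for CompressedSecondaryCache can be sketched in isolation. This is an illustrative sketch under stated assumptions, not the library's code: the bin sizes in kBinSizes are made up (real jemalloc size classes differ), and SplitIntoChunks and MergeChunks are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical allocator bin sizes in bytes, in ascending order.
const std::vector<std::size_t> kBinSizes = {128, 256, 512, 1024};

// Insert side: repeatedly carve off the largest bin that fits in the
// remaining bytes; a tail smaller than the smallest bin is its own chunk.
std::vector<std::string> SplitIntoChunks(const std::string& value) {
  std::vector<std::string> chunks;
  std::size_t pos = 0;
  while (pos < value.size()) {
    std::size_t remaining = value.size() - pos;
    std::size_t len = remaining;
    auto it = std::upper_bound(kBinSizes.begin(), kBinSizes.end(), remaining);
    if (it != kBinSizes.begin()) {
      len = *(it - 1);  // largest bin size <= remaining
    }
    assert(len > 0);  // guarantees the loop makes progress
    chunks.push_back(value.substr(pos, len));
    pos += len;
  }
  return chunks;
}

// Lookup side: concatenate the chunks back into the original value.
std::string MergeChunks(const std::vector<std::string>& chunks) {
  std::string merged;
  for (const auto& chunk : chunks) {
    merged += chunk;
  }
  return merged;
}
```

With these assumed bins, a 1700-byte value is stored as chunks of 1024, 512, 128, and 36 bytes, and merging them returns the original bytes.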

Performance Improvements

  • Instead of constructing FragmentedRangeTombstoneList during every read operation, it is now constructed once and stored in immutable memtables. This improves speed of querying range tombstones from immutable memtables.
  • When using iterators with the integrated BlobDB implementation, blob cache handles are now released immediately when the iterator's position changes.
  • MultiGet can now do more IO in parallel by reading data blocks from SST files in multiple levels, if the optimize_multiget_for_io ReadOption flag is set.
rocksdb - RocksDB 7.5.3

Published by siying about 2 years ago

7.5.2 (2022-08-02)

Bug Fixes

  • Fix a bug starting in 7.4.0 in which some fsync operations might be skipped in a DB after any DropColumnFamily on that DB, until it is re-opened. This can lead to data loss on power loss. (For custom FileSystem implementations, this could lead to FSDirectory::Fsync or FSDirectory::Close after the first FSDirectory::Close; also, valgrind could report a call to close() with fd=-1.)

7.5.1 (2022-08-01)

Bug Fixes

  • Fix a bug where rate_limiter_parameter is not passed into PartitionedFilterBlockReader::GetFilterPartitionBlock.

7.5.0 (2022-07-15)

New Features

  • Mempurge option flag experimental_mempurge_threshold is now a ColumnFamilyOptions and can now be dynamically configured using SetOptions().
  • Support backward iteration when ReadOptions::iter_start_ts is set.
  • Provide support for ReadOptions.async_io with direct_io to improve Seek latency by using async IO to parallelize child iterator seek and doing asynchronous prefetching on sequential scans.
  • Added support for blob caching in order to cache frequently used blobs for BlobDB.
    • User can configure the new ColumnFamilyOptions blob_cache to enable/disable blob caching.
    • Either sharing the backend cache with the block cache or using a completely separate cache is supported.
    • A new abstraction interface called BlobSource for blob read logic gives all users access to blobs, whether they are in the blob cache, secondary cache, or (remote) storage. Blobs can be potentially read both while handling user reads (Get, MultiGet, or iterator) and during compaction (while dealing with compaction filters, Merges, or garbage collection) but eventually all blob reads go through Version::GetBlob or, for MultiGet, Version::MultiGetBlob (and then get dispatched to the interface -- BlobSource).
  • Add experimental tiered compaction feature AdvancedColumnFamilyOptions::preclude_last_level_data_seconds, which makes sure the new data inserted within preclude_last_level_data_seconds won't be placed on cold tier (the feature is not complete).
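The BlobSource item above describes a single read path that can serve a blob from the blob cache, the secondary cache, or storage. A rough standalone sketch of such a tiered lookup follows. It is not the BlobSource interface: TieredBlobReader, BlobTier, and the choice to copy lower-tier hits into the primary cache are assumptions made for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Which tier answered a lookup (illustrative names, not RocksDB types).
enum class BlobTier { kBlobCache, kSecondaryCache, kStorage };

class TieredBlobReader {
 public:
  using Store = std::unordered_map<std::string, std::string>;

  TieredBlobReader(Store secondary_cache, Store storage)
      : secondary_cache_(std::move(secondary_cache)),
        storage_(std::move(storage)) {}

  // Looks up blob_id tier by tier. On success, fills *value and *tier and
  // returns true. A hit below the primary cache is copied into it so the
  // next read of the same blob is served from memory.
  bool Get(const std::string& blob_id, std::string* value, BlobTier* tier) {
    auto it = blob_cache_.find(blob_id);
    if (it != blob_cache_.end()) {
      *value = it->second;
      *tier = BlobTier::kBlobCache;
      return true;
    }
    it = secondary_cache_.find(blob_id);
    if (it != secondary_cache_.end()) {
      *tier = BlobTier::kSecondaryCache;
    } else {
      it = storage_.find(blob_id);
      if (it == storage_.end()) {
        return false;  // blob does not exist in any tier
      }
      *tier = BlobTier::kStorage;
    }
    *value = it->second;
    blob_cache_[blob_id] = it->second;
    return true;
  }

 private:
  Store blob_cache_;
  Store secondary_cache_;
  Store storage_;
};
```

In this sketch, the first read of a blob that exists only in storage reports kStorage, and a second read of the same blob reports kBlobCache.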

Public API changes

  • Add metadata related structs and functions in C API, including
    • rocksdb_get_column_family_metadata() and rocksdb_get_column_family_metadata_cf() to obtain rocksdb_column_family_metadata_t.
    • rocksdb_column_family_metadata_t and its get functions & destroy function.
    • rocksdb_level_metadata_t and its get functions & destroy function.
    • rocksdb_file_metadata_t and its get functions & destroy function.
  • Add suggest_compact_range() and suggest_compact_range_cf() to C API.
  • When using block cache strict capacity limit (LRUCache with strict_capacity_limit=true), DB operations now fail with Status code kAborted subcode kMemoryLimit (IsMemoryLimit()) instead of kIncomplete (IsIncomplete()) when the capacity limit is reached, because Incomplete can mean other specific things for some operations. In more detail, Cache::Insert() now returns the updated Status code and this usually propagates through RocksDB to the user on failure.
  • NewClockCache calls temporarily return an LRUCache (with similar characteristics as the desired ClockCache). This is because ClockCache is being replaced by a new version (the old one had unknown bugs) but this is still under development.
  • Add two functions int ReserveThreads(int threads_to_be_reserved) and int ReleaseThreads(int threads_to_be_released) to the Env class. In the default implementation, both return 0. A newly added xxxEnv class that inherits Env should implement these two functions for thread reservation/releasing features.
  • Removed Customizable support for RateLimiter and removed its CreateFromString() and Type() functions.

Bug Fixes

  • Fix a bug in which backup/checkpoint can include a WAL deleted by RocksDB.
  • Fix a bug where concurrent compactions might cause unnecessary further write stalling. In some cases, this might cause write rate to drop to minimum.
  • Fix a bug in Logger where, if dbname and db_log_dir are on different filesystems, dbname creation would fail with respect to the db_log_dir path, returning an error and failing to open the DB.
  • Fix a CPU and memory efficiency issue introduced by https://github.com/facebook/rocksdb/pull/8336 which made InternalKeyComparator configurable as an unintended side effect.
  • Fix a bug where GenericRateLimiter could revert the bandwidth set dynamically using SetBytesPerSecond() when a user configures a structure enclosing it, e.g., using GetOptionsFromString() to configure an Options that references an existing RateLimiter object.

Behavior Change

  • In leveled compaction with dynamic levelling, the level multiplier is no longer adjusted due to an oversized L0. Instead, the compaction score is adjusted by increasing the level's size target by the incoming bytes from upper levels. This would deprioritize compactions from upper levels if more data from L0 is coming. This is to fix some unnecessary full stalling due to drastic changes of level targets, while not wasting write bandwidth for compaction while writes are overloaded.
  • For track_and_verify_wals_in_manifest, revert to the original behavior before #10087: syncing of live WAL file is not tracked, and we track only the synced sizes of closed WALs. (PR #10330).
  • WAL compression now computes/verifies checksum during compression/decompression.

Performance Improvements

  • Rather than doing a total sort against all files in a level, SortFileByOverlappingRatio() now only finds the top 50 files based on score. This can improve write throughput for the use cases where data is loaded in increasing key order and there are a lot of files in one LSM-tree, where applying compaction results is the bottleneck.
  • In leveled compaction, L0->L1 trivial move will allow more than one file to be moved in one compaction. This would allow L0 files to be moved down faster when data is loaded in sequential order, making slowdown or stop condition harder to hit. Also seek L0->L1 trivial move when only some files qualify.
  • In leveled compaction, try to trivial move more than one file if possible, up to 4 files or max_compaction_bytes. This is to allow higher write throughput for some use cases where data is loaded in sequential order, where applying compaction results is the bottleneck.
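The first item above replaces a full sort with a search for only the best-scoring files. The general technique can be sketched with std::partial_sort. This is not the body of SortFileByOverlappingRatio(): FileScore and TopFilesByScore are hypothetical, and the sketch assumes that a lower overlap ratio means a better candidate.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-file score record; not a RocksDB type.
struct FileScore {
  uint64_t file_number;
  double overlap_ratio;  // assumption: lower is a better compaction pick
};

// Returns the `limit` best files in ascending ratio order. partial_sort
// orders only the first `limit` positions, which costs roughly
// O(n log limit) instead of O(n log n) for sorting the whole level.
std::vector<FileScore> TopFilesByScore(std::vector<FileScore> files,
                                       std::size_t limit) {
  std::size_t n = std::min(limit, files.size());
  auto middle = files.begin() + static_cast<std::ptrdiff_t>(n);
  std::partial_sort(files.begin(), middle, files.end(),
                    [](const FileScore& a, const FileScore& b) {
                      return a.overlap_ratio < b.overlap_ratio;
                    });
  files.erase(middle, files.end());  // drop everything outside the top n
  return files;
}
```

Calling TopFilesByScore(files, 50) on a level with thousands of files returns at most 50 entries, and it returns all files in sorted order when the level holds fewer than the limit.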
rocksdb - RocksDB 7.4.5

Published by pdillinger about 2 years ago

7.4.5 (2022-09-02)

Bug Fixes

  • Fix a bug starting in 7.4.0 in which some fsync operations might be skipped in a DB after any DropColumnFamily on that DB, until it is re-opened. This can lead to data loss on power loss. (For custom FileSystem implementations, this could lead to FSDirectory::Fsync or FSDirectory::Close after the first FSDirectory::Close; also, valgrind could report a call to close() with fd=-1.)
rocksdb - RocksDB 7.4.4

Published by pdillinger about 2 years ago

7.4.4 (2022-07-19)

Public API changes

  • Removed Customizable support for RateLimiter and removed its CreateFromString() and Type() functions.

Bug Fixes

  • Fix a bug where GenericRateLimiter could revert the bandwidth set dynamically using SetBytesPerSecond() when a user configures a structure enclosing it, e.g., using GetOptionsFromString() to configure an Options that references an existing RateLimiter object.

7.4.3 (2022-07-13)

Behavior Changes

  • For track_and_verify_wals_in_manifest, revert to the original behavior before #10087: syncing of live WAL file is not tracked, and we track only the synced sizes of closed WALs. (PR #10330).

7.4.2 (2022-06-30)

Bug Fixes

  • Fix a bug in Logger where, if dbname and db_log_dir are on different filesystems, dbname creation would fail with respect to the db_log_dir path, returning an error and failing to open the DB.

7.4.1 (2022-06-28)

Bug Fixes

  • Pass rate_limiter_priority through filter block reader functions to FileSystem.

7.4.0 (2022-06-19)

Bug Fixes

  • Fixed a bug in calculating key-value integrity protection for users of in-place memtable updates. In particular, the affected users would be those who configure protection_bytes_per_key > 0 on WriteBatch or WriteOptions, and configure inplace_callback != nullptr.
  • Fixed a bug where a snapshot taken during SST file ingestion would be unstable.
  • Fixed a bug for non-TransactionDB with avoid_flush_during_recovery = true and TransactionDB where, in case of a crash, min_log_number_to_keep may not change on recovery, and persisting a new MANIFEST with advanced log_numbers for some column families results in a "column family inconsistency" error on second recovery. As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL. If a future recovery starts from the new MANIFEST, then it means the new WAL is successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point. If a future recovery starts from the old MANIFEST, it means writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.
  • Fixed a bug where RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. Now writes to MANIFEST are persisted only after recovery is successful.
  • Fix a race condition in WAL size tracking which is caused by an unsafe iterator access after container is changed.
  • Fix unprotected concurrent accesses to WritableFileWriter::filesize_ by DB::SyncWAL() and DB::Put() in two write queue mode.
  • Fix a bug in WAL tracking. Before this PR (#10087), calling SyncWAL() on the only WAL file of the db will not log the event in MANIFEST, thus allowing a subsequent DB::Open even if the WAL file is missing or corrupted.
  • Fix a bug that could return wrong results with index_type=kHashSearch and using SetOptions to change the prefix_extractor.
  • Fixed a bug in WAL tracking with wal_compression. WAL compression writes a kSetCompressionType record which is not associated with any sequence number. As a result, WalManager::GetSortedWalsOfType() will skip these WALs and not return them to the caller, e.g. Checkpoint, Backup, causing the operations to fail.
  • Avoid a crash if the IDENTITY file is accidentally truncated to empty. A new DB ID will be generated and written on Open.
  • Fixed a possible corruption for users of manual_wal_flush and/or FlushWAL(true /* sync */), together with track_and_verify_wals_in_manifest == true. For those users, losing unsynced data (e.g., due to power loss) could make future DB opens fail with a Status::Corruption complaining about missing WAL data.
  • Fixed a bug in WriteBatchInternal::Append() where WAL termination point in write batch was not considered and the function appends an incorrect number of checksums.
  • Fixed a crash bug introduced in 7.3.0 affecting users of MultiGet with kDataBlockBinaryAndHash.
  • Add some fixes in async_io which was doing extra prefetching in shorter scans.

Public API changes

  • Add new API GetUnixTime in Snapshot class which returns the unix time at which Snapshot is taken.
  • Add transaction get_pinned and multi_get to C API.
  • Add two-phase commit support to C API.
  • Add rocksdb_transaction_get_writebatch_wi and rocksdb_transaction_rebuild_from_writebatch to C API.
  • Add rocksdb_options_get_blob_file_starting_level and rocksdb_options_set_blob_file_starting_level to C API.
  • Add blobFileStartingLevel and setBlobFileStartingLevel to Java API.
  • Add SingleDelete for DB in C API
  • Add User Defined Timestamp in C API.
    • rocksdb_comparator_with_ts_create to create timestamp aware comparator
    • Put, Get, Delete, SingleDelete, and MultiGet APIs have corresponding timestamp-aware APIs with the suffix with_ts
    • Also added C APIs for Transaction, SstFileWriter, and Compaction
  • The contract for implementations of Comparator::IsSameLengthImmediateSuccessor has been updated to work around a design bug in auto_prefix_mode.
  • The API documentation for auto_prefix_mode now notes some corner cases in which it returns different results than total_order_seek, due to design bugs that are not easily fixed. Users using built-in comparators and keys at least the size of a fixed prefix length are not affected.
  • Obsoleted the NUM_DATA_BLOCKS_READ_PER_LEVEL stat and introduced the NUM_LEVEL_READ_PER_MULTIGET and MULTIGET_COROUTINE_COUNT stats
  • Introduced WriteOptions::protection_bytes_per_key, which can be used to enable key-value integrity protection for live updates.

New Features

  • Add FileSystem::ReadAsync API in io_tracing
  • Add blob garbage collection parameters blob_garbage_collection_policy and blob_garbage_collection_age_cutoff to both force-enable and force-disable GC, as well as selectively override age cutoff when using CompactRange.
  • Add an extra sanity check in GetSortedWalFiles() (also used by GetLiveFilesStorageInfo(), BackupEngine, and Checkpoint) to reduce risk of successfully created backup or checkpoint failing to open because of missing WAL file.
  • Add a new column family option blob_file_starting_level to enable writing blob files during flushes and compactions starting from the specified LSM tree level.
  • Add support for timestamped snapshots (#9879)
  • Provide support for AbortIO in posix to cancel submitted asynchronous requests using io_uring.
  • Add support for rate-limiting batched MultiGet() APIs

Behavior changes

  • DB::Open(), DB::OpenAsSecondary() will fail if a Logger cannot be created (#9984)
  • Removed support for reading Bloom filters using obsolete block-based filter format. (Support for writing such filters was dropped in 7.0.) For good read performance on old DBs using these filters, a full compaction is required.
  • Per KV checksum in write batch is verified before a write batch is written to WAL to detect any corruption to the write batch (#10114).

Performance Improvements

  • When compiled with folly (Meta-internal integration; experimental in open source build), improve the locking performance (CPU efficiency) of LRUCache by using folly DistributedMutex in place of standard mutex.
rocksdb - RocksDB 7.4.3

Published by pdillinger over 2 years ago

7.4.3 (2022-07-13)

Behavior Changes

  • For track_and_verify_wals_in_manifest, revert to the original behavior before #10087: syncing of live WAL file is not tracked, and we track only the synced sizes of closed WALs. (PR #10330).

rocksdb - RocksDB 7.3.1

Published by ltamasi over 2 years ago

7.3.1 (2022-06-08)

Bug Fixes

  • Fix a bug in WAL tracking. Before this PR (#10087), calling SyncWAL() on the only WAL file of the db will not log the event in MANIFEST, thus allowing a subsequent DB::Open even if the WAL file is missing or corrupted.
  • Fixed a bug for non-TransactionDB with avoid_flush_during_recovery = true and TransactionDB where, in case of a crash, min_log_number_to_keep may not change on recovery, and persisting a new MANIFEST with advanced log_numbers for some column families results in a "column family inconsistency" error on second recovery. As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL. If a future recovery starts from the new MANIFEST, then it means the new WAL is successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point. If a future recovery starts from the old MANIFEST, it means writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.
  • Fixed a bug where RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. Now writes to MANIFEST are persisted only after recovery is successful.

7.3.0 (2022-05-20)

Bug Fixes

  • Fixed a bug where manual flush would block forever even though flush options had wait=false.
  • Fixed a bug where RocksDB could corrupt DBs with avoid_flush_during_recovery == true by removing valid WALs, leading to Status::Corruption with message like "SST file is ahead of WALs" when attempting to reopen.
  • Fixed a bug in async_io path where incorrect length of data is read by FilePrefetchBuffer if data is consumed from two populated buffers and request for more data is sent.
  • Fixed a CompactionFilter bug. Compaction filter used to use Delete to remove keys, even if the keys should be removed with SingleDelete. Mixing Delete and SingleDelete may cause undefined behavior.
  • Fixed a bug in WritableFileWriter::WriteDirect and WritableFileWriter::WriteDirectWithChecksum. The rate_limiter_priority specified in ReadOptions was not passed to the RateLimiter when requesting a token.
  • Fixed a bug which might cause process crash when I/O error happens when reading an index block in MultiGet().

New Features

  • DB::GetLiveFilesStorageInfo is ready for production use.
  • Add new stats PREFETCHED_BYTES_DISCARDED, which records the number of prefetched bytes discarded by RocksDB FilePrefetchBuffer on destruction, and POLL_WAIT_MICROS, which records wait time for FS::Poll API completion.
  • RemoteCompaction supports table_properties_collector_factories override on compaction worker.
  • Start tracking SST unique id in MANIFEST, which will be used to verify with SST properties during DB open to make sure the SST file is not overwritten or misplaced. A db option verify_sst_unique_id_in_manifest is introduced to enable/disable the verification (default is false). If enabled, all SST files will be opened during DB open to verify the unique id, so it's recommended to use it with max_open_files = -1 to pre-open the files.
  • Added the ability to concurrently read data blocks from multiple files in a level in batched MultiGet. This can be enabled by setting the async_io option in ReadOptions. Using this feature requires a FileSystem that supports ReadAsync (PosixFileSystem is not supported yet for this), and for RocksDB to be compiled with folly and c++20.
  • Add FileSystem::ReadAsync API in io_tracing.

Public API changes

  • Add rollback_deletion_type_callback to TransactionDBOptions so that write-prepared transactions know whether to issue a Delete or SingleDelete to cancel a previous key written during prior prepare phase. The PR aims to prevent mixing SingleDeletes and Deletes for the same key that can lead to undefined behaviors for write-prepared transactions.
  • EXPERIMENTAL: Add new API AbortIO in file_system to abort the read requests submitted asynchronously.
  • CompactionFilter::Decision has a new value: kRemoveWithSingleDelete. If CompactionFilter returns this decision, then CompactionIterator will use SingleDelete to mark a key as removed.
  • Renamed CompactionFilter::Decision::kRemoveWithSingleDelete to kPurge since the latter sounds more general and hides the implementation details of how compaction iterator handles keys.
  • Added ability to specify functions for Prepare and Validate to OptionTypeInfo. Added methods to OptionTypeInfo to set the functions via an API. These methods are intended for RocksDB plugin developers for configuration management.
  • Added a new immutable DB option, enforce_single_del_contracts. If set to false (default is true), compaction will NOT fail due to a single delete followed by a delete for the same key. The purpose of this temporary option is to help existing use cases migrate.
  • Introduce BlockBasedTableOptions::cache_usage_options and use that to replace BlockBasedTableOptions::reserve_table_builder_memory and BlockBasedTableOptions::reserve_table_reader_memory.
  • Changed GetUniqueIdFromTableProperties to return a 128-bit unique identifier, which will be the standard size now. The old functionality (192-bit) is available from GetExtendedUniqueIdFromTableProperties. Both functions are no longer "experimental" and are ready for production use.
  • In IOOptions, mark prio as deprecated for future removal.
  • In file_system.h, mark IOPriority as deprecated for future removal.
  • Add an option, CompressionOptions::use_zstd_dict_trainer, to indicate whether zstd dictionary trainer should be used for generating zstd compression dictionaries. The default value of this option is true for backward compatibility. When this option is set to false, zstd API ZDICT_finalizeDictionary is used to generate compression dictionaries.
  • The Seek API, which positions every LevelIterator on the correct data block in the correct SST file, can now be parallelized if the ReadOptions.async_io option is enabled.
  • Add new stat number_async_seek in PerfContext that indicates number of async calls made by seek to prefetch data.
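
The SingleDelete-related entries above (rollback_deletion_type_callback, kPurge, enforce_single_del_contracts) all revolve around not mixing SingleDelete and Delete for the same key. A minimal sketch of that one rule follows. It is a toy model written for illustration, not RocksDB code: OpKind and MixesDeleteAndSingleDelete are hypothetical names, and the real SingleDelete contract has further conditions that this check ignores.

```cpp
#include <vector>

// Hypothetical operation kinds applied to ONE user key, oldest first.
// These are illustration-only types, not part of the RocksDB API.
enum class OpKind { kPut, kDelete, kSingleDelete };

// Returns true if the history of a single key contains both a Delete and a
// SingleDelete, which is the mix the notes above describe as leading to
// undefined behavior. Assumption: any co-occurrence counts as a violation.
bool MixesDeleteAndSingleDelete(const std::vector<OpKind>& history) {
  bool saw_delete = false;
  bool saw_single_delete = false;
  for (OpKind op : history) {
    if (op == OpKind::kDelete) saw_delete = true;
    if (op == OpKind::kSingleDelete) saw_single_delete = true;
  }
  return saw_delete && saw_single_delete;
}
```

A history of Put followed by SingleDelete passes this check, while Put, SingleDelete, Put, Delete is flagged.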

Bug Fixes

  • RocksDB called the FileSystem::Poll API during FilePrefetchBuffer destruction, which hurt performance because it waited for read requests whose results were no longer needed. RocksDB now calls FileSystem::AbortIO to abort those requests instead, which fixes the performance issue.
  • Fixed unnecessary block cache contention when queries within a MultiGet batch and across parallel batches access the same data block, which previously could cause severely degraded performance in this unusual case. (In more typical MultiGet cases, this fix is expected to yield a small or negligible performance improvement.)

Behavior changes

  • Enforce the existing contract of SingleDelete so that SingleDelete cannot be mixed with Delete because it leads to undefined behavior. Fix a number of unit tests that violate the contract but happen to pass.
  • ldb --try_load_options now defaults to true if --db is specified and a new DB is not being created. The user can still explicitly disable it with --try_load_options=false (or explicitly enable it with --try_load_options).
  • During Flush write or Compaction write/read, the WriteController is used to determine whether DB writes are stalled or slowed down. The priority (Env::IOPriority) can then be determined accordingly and be passed in IOOptions to the file system.
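
The ldb --try_load_options default described above can be read as a small decision rule. The sketch below only restates that rule. FlagState and ResolveTryLoadOptions are hypothetical names and this is not ldb source code.

```cpp
// Hypothetical tri-state for a command-line flag: not given, or given
// explicitly as false or true. Illustration only, not an ldb type.
enum class FlagState { kUnset, kFalse, kTrue };

// Resolves the effective value of --try_load_options:
//  - an explicit value on the command line always wins;
//  - otherwise it is true only if --db was given and no new DB is created.
bool ResolveTryLoadOptions(FlagState explicit_flag, bool db_specified,
                           bool creating_new_db) {
  if (explicit_flag == FlagState::kTrue) return true;
  if (explicit_flag == FlagState::kFalse) return false;
  return db_specified && !creating_new_db;
}
```
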
rocksdb - RocksDB 7.2.2

Published by jay-zhuang over 2 years ago

7.2.2 (2022-04-28)

Bug Fixes

  • Fixed a bug in async_io path where incorrect length of data is read by FilePrefetchBuffer if data is consumed from two populated buffers and request for more data is sent.

7.2.1 (2022-04-26)

Bug Fixes

  • Fixed a bug where RocksDB could corrupt DBs with avoid_flush_during_recovery == true by removing valid WALs, leading to Status::Corruption with message like "SST file is ahead of WALs" when attempting to reopen.
  • RocksDB called the FileSystem::Poll API during FilePrefetchBuffer destruction, which hurt performance because it waited for read requests whose results were no longer needed. RocksDB now calls FileSystem::AbortIO to abort those requests instead, which fixes the performance issue.

7.2.0 (2022-04-15)

Bug Fixes

  • Fixed a bug that caused RocksDB failures when the DB was accessed using a UNC path.
  • Fixed a race condition when 2PC is disabled and WAL tracking in the MANIFEST is enabled. The race condition is between two background flush threads trying to install flush results, causing a WAL deletion not tracked in the MANIFEST. A future DB open may fail.
  • Fixed a heap use-after-free race with DropColumnFamily.
  • Fixed a bug that rocksdb.read.block.compaction.micros cannot track compaction stats (#9722).
  • Fixed file_type, relative_filename and directory fields returned by GetLiveFilesMetaData(), which were added in inheriting from FileStorageInfo.
  • Fixed a bug affecting track_and_verify_wals_in_manifest. Without the fix, application may see "open error: Corruption: Missing WAL with log number" while trying to open the db. The corruption is a false alarm but prevents DB open (#9766).
  • Fix segfault in FilePrefetchBuffer with async_io as it doesn't wait for pending jobs to complete on destruction.
  • Fix ERROR_HANDLER_AUTORESUME_RETRY_COUNT stat whose value was set wrong in portal.h
  • Fixed a bug for non-TransactionDB with avoid_flush_during_recovery = true and TransactionDB where in case of crash, min_log_number_to_keep may not change on recovery and persisting a new MANIFEST with advanced log_numbers for some column families, results in "column family inconsistency" error on second recovery. As a solution the corrupted WALs whose numbers are larger than the corrupted wal and smaller than the new WAL will be moved to archive folder.
  • Fixed a bug in RocksDB DB::Open() which may create and write to two new MANIFEST files even before recovery succeeds. Now writes to MANIFEST are persisted only after recovery is successful.

New Features

  • For db_bench, when --seed=0 or --seed is not set, the current time is used as the seed value. Previously it used the value 1000.
  • For db_bench, when --benchmark lists multiple tests and each test uses a seed for an RNG, the seeds will no longer be repeated across tests.
  • Added an option to dynamically charge an updated estimate of the memory usage of block-based table readers to the block cache, if a block cache is available. To enable this feature, set BlockBasedTableOptions::reserve_table_reader_memory = true.
  • Add new stat ASYNC_READ_BYTES, which records the number of bytes read during async read calls. Users can use it to check whether the async code path is being used by RocksDB's internal automatic prefetching for sequential reads.
  • Enable async prefetching if ReadOptions.readahead_size is set along with ReadOptions.async_io in FilePrefetchBuffer.
  • Add event listener support on remote compaction compactor side.
  • Added a dedicated integer DB property rocksdb.live-blob-file-garbage-size that exposes the total amount of garbage in the blob files in the current version.
  • RocksDB does internal auto prefetching if it notices sequential reads. It starts with readahead size initial_auto_readahead_size which now can be configured through BlockBasedTableOptions.
  • Add a merge operator that allows users to register specific aggregation functions so that they can do aggregation using different aggregation types for different keys. See comments in include/rocksdb/utilities/agg_merge.h for actual usage. The feature is experimental and the format is subject to change and we won't provide a migration tool.
  • Meta-internal / Experimental: Improve CPU performance by replacing many uses of std::unordered_map with folly::F14FastMap when RocksDB is compiled together with Folly.
  • Experimental: Add CompressedSecondaryCache, a concrete implementation of rocksdb::SecondaryCache, that integrates with compression libraries (e.g. LZ4) to hold compressed blocks.
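
The db_bench seed rule in the list above (a zero or unset --seed falls back to the current time) can be sketched as follows. EffectiveSeed is a hypothetical helper, not db_bench source, and using microseconds since the epoch is an assumption: the note only says that the current time is used.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: returns the seed a benchmark run would use.
// A non-zero flag value is used as-is; zero (the "not set" case) is replaced
// by the current time. Assumption: time is taken in microseconds since epoch.
uint64_t EffectiveSeed(uint64_t seed_flag) {
  if (seed_flag != 0) {
    return seed_flag;
  }
  const auto since_epoch =
      std::chrono::system_clock::now().time_since_epoch();
  return static_cast<uint64_t>(
      std::chrono::duration_cast<std::chrono::microseconds>(since_epoch)
          .count());
}
```

One consequence of this rule is that runs are no longer reproducible by default, so a fixed non-zero --seed is needed to repeat a run exactly.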

Behavior changes

  • Disallow usage of commit-time-write-batch for write-prepared/write-unprepared transactions if TransactionOptions::use_only_the_last_commit_time_batch_for_recovery is false to prevent two (or more) uncommitted versions of the same key in the database. Otherwise, bottommost compaction may violate the internal key uniqueness invariant of SSTs if the sequence numbers of both internal keys are zeroed out (#9794).
  • Make DB::GetUpdatesSince() return NotSupported early for write-prepared/write-unprepared transactions, as the API contract indicates.

Public API changes

  • Exposed APIs to examine results of block cache stats collections in a structured way. In particular, users of GetMapProperty() with property kBlockCacheEntryStats can now use the functions in BlockCacheEntryStatsMapKeys to find stats in the map.
  • Add fail_if_not_bottommost_level to IngestExternalFileOptions so that ingestion will fail if the file(s) cannot be ingested to the bottommost level.
  • Add output parameter is_in_sec_cache to SecondaryCache::Lookup(). It is to indicate whether the handle is possibly erased from the secondary cache after the Lookup.
rocksdb - RocksDB 7.1.2

Published by hx235 over 2 years ago

7.1.2 (2022-04-19)

Bug Fixes

  • Fixed a bug that caused RocksDB failures when the DB was accessed using a UNC path.
  • Fixed a race condition when 2PC is disabled and WAL tracking in the MANIFEST is enabled. The race condition is between two background flush threads trying to install flush results, causing a WAL deletion not tracked in the MANIFEST. A future DB open may fail.
  • Fixed a heap use-after-free race with DropColumnFamily.
  • Fixed a bug that rocksdb.read.block.compaction.micros cannot track compaction stats (#9722).
  • Fixed file_type, relative_filename and directory fields returned by GetLiveFilesMetaData(), which were added in inheriting from FileStorageInfo.
  • Fixed a bug affecting track_and_verify_wals_in_manifest. Without the fix, application may see "open error: Corruption: Missing WAL with log number" while trying to open the db. The corruption is a false alarm but prevents DB open (#9766).
rocksdb - RocksDB 7.1.1

Published by hx235 over 2 years ago

7.1.1 (2022-04-07)

Bug Fixes

  • Fix segfault in FilePrefetchBuffer with async_io as it doesn't wait for pending jobs to complete on destruction.

7.1.0 (2022-03-23)

New Features

  • Allow WriteBatchWithIndex to index a WriteBatch that includes keys with user-defined timestamps. The index itself does not have timestamp.
  • Add support for user-defined timestamps to write-committed transaction without API change. The TransactionDB layer APIs do not allow timestamps because we require that all user-defined-timestamps-aware operations go through the Transaction APIs.
  • Added BlobDB options to ldb
  • BlockBasedTableOptions::detect_filter_construct_corruption can now be dynamically configured using DB::SetOptions.
  • Automatically recover from retryable read IO errors during background flush/compaction.
  • Experimental support for preserving file Temperatures through backup and restore, and for updating DB metadata for outside changes to file Temperature (UpdateManifestForFilesState or ldb update_manifest --update_temperatures).
  • Experimental support for async_io in ReadOptions which is used by FilePrefetchBuffer to prefetch some of the data asynchronously, if reads are sequential and auto readahead is enabled by rocksdb internally.

Bug Fixes

  • Fixed a major performance bug in which Bloom filters generated by pre-7.0 releases are not read by early 7.0.x releases (and vice-versa) due to changes to FilterPolicy::Name() in #9590. This can severely impact read performance and read I/O on upgrade or downgrade with existing DB, but not data correctness.
  • Fixed a data race on versions_ between DBImpl::ResumeImpl() and threads waiting for recovery to complete (#9496)
  • Fixed a bug caused by a race among flush, incoming writes and taking snapshots. Queries to snapshots created under this race condition can return incorrect results, e.g. resurfacing deleted data.
  • Fixed a bug where DB flush uses options.compression even if options.compression_per_level is set.
  • Fixed a bug where DisableManualCompaction may assert when disabling an unscheduled manual compaction.
  • Fixed a race condition when cancelling a manual compaction with DisableManualCompaction. Also, DB close can cancel the manual compaction thread.
  • Fixed a potential timer crash when opening and closing DBs concurrently.
  • Fixed a race condition for alive_log_files_ in non-two-write-queues mode. The race is between the write_thread_ in WriteToWAL() and another thread executing FindObsoleteFiles(). The race condition will be caught if __glibcxx_requires_nonempty is enabled.
  • Fixed a bug where Iterator::Refresh() reads stale keys after DeleteRange() is performed.
  • Fixed a race condition when disabling and re-enabling manual compaction.
  • Fixed automatic error recovery failure in atomic flush.
  • Fixed a race condition when mmaping a WritableFile on POSIX.

Public API changes

  • Added pure virtual FilterPolicy::CompatibilityName(), which is needed for fixing major performance bug involving FilterPolicy naming in SST metadata without affecting Customizable aspect of FilterPolicy. This change only affects those with their own custom or wrapper FilterPolicy classes.
  • options.compression_per_level is dynamically changeable with SetOptions().
  • Added WriteOptions::rate_limiter_priority. When set to something other than Env::IO_TOTAL, the internal rate limiter (DBOptions::rate_limiter) will be charged at the specified priority for writes associated with the API to which the WriteOptions was provided. Currently the support covers automatic WAL flushes, which happen during live updates (Put(), Write(), Delete(), etc.) when WriteOptions::disableWAL == false and DBOptions::manual_wal_flush == false.
  • Add DB::OpenAndTrimHistory API. This API will open the DB and trim data to the timestamp specified by trim_ts (data with a timestamp larger than the specified trim bound will be removed). This API should only be used when recovering timestamp-enabled column families. If a column family doesn't have timestamp enabled, this API won't trim any data on that column family. This API is not compatible with the avoid_flush_during_recovery option.
  • Remove BlockBasedTableOptions.hash_index_allow_collision, which already had no effect.
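
The trimming rule stated for DB::OpenAndTrimHistory above (entries with a timestamp larger than trim_ts are removed) can be illustrated with a small in-memory model. This is a sketch only, not the RocksDB implementation: VersionedEntry and TrimHistory are hypothetical names, and timestamps are modeled as plain 64-bit integers for simplicity.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one timestamped version of a key.
// Illustration only, not a RocksDB type.
struct VersionedEntry {
  std::string key;
  uint64_t timestamp;
  std::string value;
};

// Removes every entry whose timestamp is strictly larger than trim_ts and
// keeps the rest in their original order.
void TrimHistory(std::vector<VersionedEntry>& entries, uint64_t trim_ts) {
  entries.erase(std::remove_if(entries.begin(), entries.end(),
                               [trim_ts](const VersionedEntry& e) {
                                 return e.timestamp > trim_ts;
                               }),
                entries.end());
}
```

In this model an entry whose timestamp equals trim_ts is kept, matching the "larger than" wording of the note.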
rocksdb - RocksDB 7.0.4

Published by ajkr over 2 years ago

7.0.4 (2022-03-29)

Bug Fixes

  • Fixed a race condition when disabling and re-enabling manual compaction.
  • Fixed a race condition for alive_log_files_ in non-two-write-queues mode. The race is between the write_thread_ in WriteToWAL() and another thread executing FindObsoleteFiles(). The race condition will be caught if __glibcxx_requires_nonempty is enabled.
  • Fixed a race condition when mmaping a WritableFile on POSIX.
  • Fixed a race condition when 2PC is disabled and WAL tracking in the MANIFEST is enabled. The race condition is between two background flush threads trying to install flush results, causing a WAL deletion not tracked in the MANIFEST. A future DB open may fail.
  • Fixed a heap use-after-free race with DropColumnFamily.
  • Fixed a bug that rocksdb.read.block.compaction.micros cannot track compaction stats (#9722).
rocksdb - RocksDB 6.29.5

Published by ajkr over 2 years ago

6.29.5 (2022-03-29)

Bug Fixes

  • Fixed a race condition for alive_log_files_ in non-two-write-queues mode. The race is between the write_thread_ in WriteToWAL() and another thread executing FindObsoleteFiles(). The race condition will be caught if __glibcxx_requires_nonempty is enabled.
  • Fixed a race condition when mmaping a WritableFile on POSIX.
  • Fixed a race condition when 2PC is disabled and WAL tracking in the MANIFEST is enabled. The race condition is between two background flush threads trying to install flush results, causing a WAL deletion not tracked in the MANIFEST. A future DB open may fail.
  • Fixed a heap use-after-free race with DropColumnFamily.
  • Fixed a bug that rocksdb.read.block.compaction.micros cannot track compaction stats (#9722).
rocksdb - RocksDB 7.0.3

Published by jay-zhuang over 2 years ago

7.0.3 (2022-03-25)

Bug Fixes

  • Fixed a major performance bug in which Bloom filters generated by pre-7.0 releases are not read by early 7.0.x releases (and vice-versa) due to changes to FilterPolicy::Name() in #9590. This can severely impact read performance and read I/O on upgrade or downgrade with existing DB, but not data correctness.
  • Fixed a bug where Iterator::Refresh() reads stale keys after DeleteRange() is performed.

Public API changes

  • Added pure virtual FilterPolicy::CompatibilityName(), which is needed for fixing major performance bug involving FilterPolicy naming in SST metadata without affecting Customizable aspect of FilterPolicy. For source code, this change only affects those with their own custom or wrapper FilterPolicy classes, but does break compiled library binary compatibility in a patch release.
  • Since RocksDB 7, RocksJava now requires Java 8 (previously Java 7).