A library that provides an embeddable, persistent key-value store for fast storage.
GPL-2.0 License
Published by ajkr over 2 years ago
- Fixed a bug where `Iterator::Refresh()` reads stale keys after `DeleteRange()` is performed.
- Fixed a race condition when cancelling a manual compaction with `DisableManualCompaction`. Also, DB close can now cancel the manual compaction thread.
- Fixed a data race on `versions_` between `DBImpl::ResumeImpl()` and threads waiting for recovery to complete (#9496).
- Fixed a bug in `DB::GetMergeOperands()`.

Published by ajkr over 2 years ago
- Fixed a race condition when cancelling a manual compaction with `DisableManualCompaction`. Also, DB close can now cancel the manual compaction thread.
- Fixed a data race on `versions_` between `DBImpl::ResumeImpl()` and threads waiting for recovery to complete (#9496).
- Fixed a bug in `DB::GetMergeOperands()`.
- Switched to using `std::vector` instead of `std::map` for storing the metadata objects for blob files, which can improve performance for certain workloads, especially when the number of blob files is high.
- Users must now call `close()` on their RocksJava objects. See #9523.
- Added `ReadOptions::rate_limiter_priority`. When set to something other than `Env::IO_TOTAL`, the internal rate limiter (`DBOptions::rate_limiter`) will be charged at the specified priority for file reads associated with the API to which the `ReadOptions` was provided.
- Removed `BackupableDBOptions`; use backup_engine.h and `BackupEngineOptions` instead. Similar renamings are in the C and Java APIs.
- Removed `UtilityDB::OpenTtlDB`; use db_ttl.h and `DBWithTTL::Open` instead.
- Changed `Cache::CreateCallback` from `void*` to `const void*`.
- Removed `rocksdb_filterpolicy_create()` from the C API, as the only C API support for custom filter policies is now obsolete.
- Renamed `SizeApproximationOptions.include_memtabtles` to `SizeApproximationOptions.include_memtables` (typo fix).
- Deprecated `CompactionService::Start()` and `CompactionService::WaitForComplete()`. Please use `CompactionService::StartV2()` and `CompactionService::WaitForCompleteV2()` instead, which provide the same information plus extra data like priority, db_id, etc.
- `ColumnFamilyOptions::OldDefaults` and `DBOptions::OldDefaults` are marked deprecated, as they are no longer maintained.
- Added `OnSubcompactionBegin()` and `OnSubcompactionCompleted()`.
- Added temperature information to `FileOperationInfo` in the event listener API.
- `NewSequentialFile()`: backup and checkpoint operations need to open the source files with `NewSequentialFile()`, which will have the temperature hints. Other operations are not covered.
- `ReadOptions::total_order_seek` no longer affects `DB::Get()`. The original motivation for this interaction has been obsolete since RocksDB has been able to detect whether the current prefix extractor is compatible with that used to generate table files, probably since RocksDB 5.14.0.
- Added `BlockBasedTableOptions::detect_filter_construct_corruption` for detecting corruption during Bloom filter (format_version >= 5) and Ribbon filter construction.
- The `rocksdb.blob-stats` DB property.
- New tickers `LAST_LEVEL_READ_*`, `NON_LAST_LEVEL_READ_*`.
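As a sketch of the `ReadOptions::rate_limiter_priority` feature described above (the rate, DB path, and key are illustrative; the code assumes the RocksDB library is available):

```cpp
#include <cassert>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/rate_limiter.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // A shared rate limiter at 10 MB/s; IO charged to it is throttled.
  options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(10 << 20));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rl_demo", &db);
  assert(s.ok());

  // With a priority other than Env::IO_TOTAL (the default), file reads
  // done on behalf of this Get() are charged to options.rate_limiter.
  rocksdb::ReadOptions read_options;
  read_options.rate_limiter_priority = rocksdb::Env::IO_USER;

  std::string value;
  s = db->Get(read_options, "some_key", &value);  // NotFound is fine here
  delete db;
  return 0;
}
```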
Published by anand1976 over 2 years ago
Note: The next release will be major release 7.0. See https://github.com/facebook/rocksdb/issues/9390 for more info.

- Added new trace filter types to `TraceFilterType`: `kTraceFilterIteratorSeek`, `kTraceFilterIteratorSeekForPrev`, and `kTraceFilterMultiGet`. They can be set in `TraceOptions` to filter out the operation types after which they are named.
- Added `TraceOptions::preserve_write_order`. When enabled, it guarantees that write records are traced in the same order they are logged to the WAL and applied to the DB. By default it is disabled (false) to match the legacy behavior and prevent regression.
- `Options::OldDefaults` is marked deprecated, as it is no longer maintained.
- Changed `BlockBasedTableOptions::block_size` from `size_t` to `uint64_t`.
- Disallowed using `Iterator::Refresh()` together with `DB::DeleteRange()`, which are incompatible and have always risked causing the refreshed iterator to return incorrect results.
- `DB::DestroyColumnFamilyHandle()` will return Status::InvalidArgument() if called with `DB::DefaultColumnFamily()`.
- Added `Options::DisableExtraChecks()`, which can be used to improve peak write performance by disabling checks that should not be necessary in the absence of software logic errors or CPU+memory hardware errors. (Default options are slowly moving toward some performance overheads for extra correctness checking.)
- Switched to `fcntl(F_FULLFSYNC)` on OS X and iOS.
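A minimal sketch of opting into `Options::DisableExtraChecks()` (only sensible when the surrounding software and hardware are trusted; DB opening is elided):

```cpp
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  // Disable checks that exist only to catch software logic errors or
  // CPU/memory hardware errors, trading safety for peak write speed.
  options.DisableExtraChecks();
  // ... open the DB with `options` as usual ...
  return 0;
}
```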
Published by akankshamahajan15 over 2 years ago
- Fixed a bug in the `ObjectRegistry`. The bug could result in failure to save the OPTIONS file.
- `FaultInjectionTestFS`.
- Block cache keys no longer use `FSRandomAccessFile::GetUniqueId()` (previously used when available), so a filesystem recycling unique ids can no longer lead to incorrect results or a crash (#7405). For files generated by RocksDB >= 6.24, the cache keys are stable across DB::Open and DB directory move / copy / import / export / migration, etc. Although collisions are still theoretically possible, they are (a) impossible in many common cases, (b) not dependent on environmental factors, and (c) much less likely than a CPU miscalculation while executing RocksDB.
- Added a `checker` argument that performs additional checking on timestamp sizes.
- Replaced `TableProperties::properties_offsets` with the uint64_t property `external_sst_file_global_seqno_offset` to save table properties' memory.
- Removed `TableProperties.getPropertiesOffsets()` as it exposed internal details to external users.

Published by riversand963 almost 3 years ago
- Fixed a bug in the `ObjectRegistry`. The bug could result in failure to save the OPTIONS file.
- Fixed a bug that caused `GetSortedWalFiles()` to fail randomly with an error like `IO error: 001234.log: No such file or directory`.
- `BlockBasedTableOptions::reserve_table_builder_memory = true`.
- `blob_compaction_readahead_size`.
- Prevented `CompactRange()` with `CompactRangeOptions::change_level == true` from possibly causing corruption to the LSM state (overlapping files within a level) when run in parallel with another manual compaction. Note that setting `force_consistency_checks == true` (the default) would cause the DB to enter read-only mode in this scenario and return `Status::Corruption`, rather than committing any corruption.
- `RecordTick(stats_, WRITE_WITH_WAL)` was called in two places; this fix removes the extra `RecordTick`s and fixes the corresponding test case.
- `GenericRateLimiter::Request`.
- It is now an error to configure `BlockBasedTableOptions` such that insertion into one of {`block_cache`, `block_cache_compressed`, `persistent_cache`} can show up in another of these. (RocksDB expects to be able to use the same key for different physical data among tiers.)
- Users of the `Env::Priority::BOTTOM` pool will no longer see RocksDB schedule automatic compactions exceeding the DB's compaction concurrency limit. For details on the per-DB compaction concurrency limit, see the API docs of `max_background_compactions` and `max_background_jobs`.
- `NUM_FILES_IN_SINGLE_COMPACTION` was only counting the first input level's files; now it includes all input files.
- `TransactionUtil::CheckKeyForConflicts` can also perform conflict-checking based on user-defined timestamps in addition to sequence numbers.
- Removed `GenericRateLimiter`'s previously enforced minimum refill bytes per period.
- Marked `WriteBufferManager` as `final` because it is not intended for extension.
- Added `FSDirectory::FsyncWithDirOptions()`, which provides extra information, like the directory fsync reason, in `DirFsyncOptions`. File systems like btrfs use that to skip the directory fsync when creating a new file, or, when renaming a file, to fsync the target file instead of the directory, which improves `DB::Open()` speed by ~20%.
- `DB::Open()` is no longer blocked by obsolete file purge if `DBOptions::avoid_unnecessary_blocking_io` is set to true.
- In systems with `gettid()`, info log ("LOG" file) lines now print a system-wide thread ID from `gettid()` instead of the process-local `pthread_self()`. For all users, the thread ID format is changed from hexadecimal to decimal integer.
- In systems with `pthread_setname_np()`, the background thread names no longer contain an ID suffix. For example, "rocksdb:bottom7" (and all other threads in the `Env::Priority::BOTTOM` pool) are now named "rocksdb:bottom". Previously, large thread pools could breach the name size limit (e.g., naming "rocksdb:bottom10" would fail).
- Deprecated `ReadOptions::iter_start_seqnum` and `DBOptions::preserve_deletes`; please try using the user-defined timestamp feature instead. The options will be removed in a future release; currently, a warning message is logged when they are used.
- Improved `BlockBasedTableBuilder` for the `FullFilter` and `PartitionedFilter` case (#9070).
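A minimal sketch of the `DBOptions::avoid_unnecessary_blocking_io` setting mentioned above (the path is illustrative; requires linking against RocksDB):

```cpp
#include <cassert>
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Keep potentially slow IO, such as obsolete file purging, off the
  // calling thread; with this set, DB::Open() is not blocked by the purge.
  options.avoid_unnecessary_blocking_io = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/open_demo", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```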
Published by siying almost 3 years ago
- Fixed `DisableManualCompaction()` to cancel compactions even when they are waiting on automatic compactions to drain due to `CompactRangeOptions::exclusive_manual_compactions == true`.
- Fixed the contract of `Env::ReopenWritableFile()` and `FileSystem::ReopenWritableFile()` to specify that any existing file must not be deleted or truncated.
- Fixed a bug in `IngestExternalFiles()` with files for multiple column families. The bug could have introduced a delay in ingested file keys becoming visible after `IngestExternalFiles()` returned. Furthermore, mutations to ingested file keys while they were invisible could have been dropped (not necessarily immediately).
- Fixed a race condition for users of `WriteBufferManager` who constructed it with `allow_stall == true`. The race condition led to undefined behavior (in our experience, typically a process crash).
- `WriteBufferManager::SetBufferSize()` can now be called with `new_size == 0` to dynamically disable memory limiting.
- Made `DB::close()` thread-safe.
- Added tickers `REMOTE_COMPACT_READ_BYTES`, `REMOTE_COMPACT_WRITE_BYTES`.
- Added `class CacheDumper` and `CacheDumpedLoader` at rocksdb/utilities/cache_dump_load.h. Note that this feature is subject to potential change in the future; it is still experimental.
- Added `blob_garbage_collection_force_threshold`, which can be used to trigger compactions targeting the SST files which reference the oldest blob files when the ratio of garbage in those blob files meets or exceeds the specified threshold. This can reduce space amplification with skewed workloads where the affected SST files might not otherwise get picked up for compaction.
- Added `GetUniqueIdFromTableProperties`. Only SST files from RocksDB >= 6.24 support unique IDs.
- Added `GetMapProperty()` support for "rocksdb.dbstats" (`DB::Properties::kDBStats`). As a map property, it includes DB-level internal stats accumulated over the DB's lifetime, such as user-write-related stats and uptime.
- Added `file_temperature` to `IngestExternalFileArg` such that when ingesting SST files, we are able to indicate the temperature of this batch of files.
- If `DB::Close()` failed with a non-aborted status, calling `DB::Close()` again will return the original status instead of Status::OK.
- Added a `lowest_used_cache_tier` option to `DBOptions` (immutable) and pass it to BlockBasedTableReader. By default it is `CacheTier::kNonVolatileBlockTier`, which means we always use both the block cache (kVolatileTier) and the secondary cache (kNonVolatileBlockTier). By setting it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache.
- Java: `keyMayExist()` supports `ByteBuffer`.
Published by ajkr about 3 years ago
- Fixed a bug in `IngestExternalFiles()` with files for multiple column families. The bug could have introduced a delay in ingested file keys becoming visible after `IngestExternalFiles()` returned. Furthermore, mutations to ingested file keys while they were invisible could have been dropped (not necessarily immediately).
- Fixed a race condition for users of `WriteBufferManager` who constructed it with `allow_stall == true`. The race condition led to undefined behavior (in our experience, typically a process crash).
- `WriteBufferManager::SetBufferSize()` can now be called with `new_size == 0` to dynamically disable memory limiting.
- Fixed `DisableManualCompaction()` to cancel compactions even when they are waiting on automatic compactions to drain due to `CompactRangeOptions::exclusive_manual_compactions == true`.
- Fixed the contract of `Env::ReopenWritableFile()` and `FileSystem::ReopenWritableFile()` to specify that any existing file must not be deleted or truncated.

Published by ltamasi about 3 years ago
- Fixed `prepopulate_block_cache = kFlushOnly` to only apply to flushes rather than to all generated files.
- Added `db_name`, `db_id`, `session_id`, which could help the user uniquely identify a compaction job between DB instances and sessions.
- `VerifyChecksum()` and `VerifyFileChecksums()` queries.
- Added DB properties `rocksdb.num-blob-files`, `rocksdb.blob-stats`, `rocksdb.total-blob-file-size`, and `rocksdb.live-blob-file-size`. The existing property `rocksdb.estimate-live-data-size` was also extended to include live bytes residing in blob files.
- Added new RateLimiter IO priorities `Env::IO_USER`, `Env::IO_MID`. `Env::IO_USER` will have superior priority over all other RateLimiter IOPriorities, without being subject to the fair scheduling constraint.
- `SstFileWriter` now supports `Put`s and `Delete`s with user-defined timestamps. Note that the ingestion logic itself is not timestamp-aware yet.
- Added `OnBlobFileCreationStarted`, `OnBlobFileCreated`, and `OnBlobFileDeleted` to the `EventListener` class of listener.h. These notify listeners during creation/deletion of individual blob files in Integrated BlobDB. Blob file creation-finished and deletion events are also logged in the LOG file.
- `DB::MultiGet` using `MultiRead`.
- Added `CompactionServiceJobStatus::kUseLocal` to instruct RocksDB to run the compaction locally instead of waiting for the remote compaction result.
- Added `RateLimiter::GetTotalPendingRequest(int64_t* total_pending_requests, const Env::IOPriority pri)` for the total number of requests that are pending for bytes in the rate limiter.
- `strict_capacity_limit=true` for the block cache, in addition to existing conditions that can trigger unbuffering.
- Changed `SstFileMetaData::size` from `size_t` to `uint64_t`.
- Extended `FlushJobInfo` and `CompactionJobInfo` in listener.h to provide information about the blob files generated by a flush/compaction and garbage collected during compaction in Integrated BlobDB. Added struct members `blob_file_addition_infos` and `blob_file_garbage_infos` that contain this information.
- Extended `output_file_names` of the `CompactFiles` API to also include the paths of the blob files generated by the compaction in Integrated BlobDB.
- `BackupEngine` functions now return `IOStatus` instead of `Status`. Most existing code should be compatible with this change, but some calls might need to be updated.
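For context on the `SstFileWriter` item above, here is a minimal (non-timestamped) write flow; the new timestamp-aware `Put`/`Delete` overloads additionally take a timestamp slice (path and keys illustrative):

```cpp
#include <cassert>
#include "rocksdb/env.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_writer.h"

int main() {
  rocksdb::Options options;
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);

  rocksdb::Status s = writer.Open("/tmp/example.sst");
  assert(s.ok());
  // Keys must be added in ascending order of the options' comparator.
  s = writer.Put("key1", "value1");
  assert(s.ok());
  s = writer.Delete("key2");
  assert(s.ok());
  s = writer.Finish();
  assert(s.ok());
  return 0;
}
```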
Published by ltamasi about 3 years ago
- Fixed destruction of `ColumnFamilyData` objects. The earlier logic unlocked the DB mutex before destroying the thread-local `SuperVersion` pointers, which could result in a process crash if another thread managed to get a reference to the `ColumnFamilyData` object.
- Stopped calling `RenameFile()` on a non-existent info log file ("LOG") when opening a new DB. Such a call was guaranteed to fail, though it did not impact applications since we swallowed the error. We also stopped swallowing errors when renaming the "LOG" file.
- Fixed a bug where `OnFlushCompleted` was not called for atomic flush.
- Fixed a bug in the `MultiGet` API when used with keys spanning multiple column families and `sorted_input == false`.
- `options.allow_fallocate=false`.
- `ReplayOptions` in `Replayer::Replay()`, or `--trace_replay_fast_forward` in db_bench.
- Added `LiveSstFilesSizeAtTemperature` to retrieve SST file sizes at different temperatures.
- Tickers `BLOB_DB_BLOB_FILE_BYTES_READ`, `BLOB_DB_GC_NUM_KEYS_RELOCATED`, and `BLOB_DB_GC_BYTES_RELOCATED`, as well as the histograms `BLOB_DB_COMPRESSION_MICROS` and `BLOB_DB_DECOMPRESSION_MICROS`.
- The C API's `rocksdb_filterpolicy_create_ribbon` is unchanged, but a new `rocksdb_filterpolicy_create_ribbon_hybrid` is added.
- Added `DB::NewDefaultReplayer()` to create a default Replayer instance. Added `TraceReader::Reset()` to restart reading a trace file. Created trace_record.h, trace_record_result.h, and utilities/replayer.h files to access the decoded Trace records, replay them, and query the actual operation results.
- Fixed a bug when `SetDBOptions()` does not change any option value.
- `StringAppendOperator` additionally accepts a string as the delimiter.
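For reference, the batched `MultiGet` overload involved in the fix above (`sorted_input == false`) is used roughly like this (path and keys illustrative):

```cpp
#include <cassert>
#include <vector>
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/mget_demo", &db);
  assert(s.ok());

  // Batched MultiGet: results are written into caller-provided arrays.
  std::vector<rocksdb::Slice> keys{"k1", "k2"};
  std::vector<rocksdb::PinnableSlice> values(keys.size());
  std::vector<rocksdb::Status> statuses(keys.size());
  db->MultiGet(rocksdb::ReadOptions(), db->DefaultColumnFamily(),
               keys.size(), keys.data(), values.data(), statuses.data(),
               /*sorted_input=*/false);
  delete db;
  return 0;
}
```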
Published by ltamasi about 3 years ago
- Stopped calling `RenameFile()` on a non-existent info log file ("LOG") when opening a new DB. Such a call was guaranteed to fail, though it did not impact applications since we swallowed the error. We also stopped swallowing errors when renaming the "LOG" file.
- Fixed a bug in the `MultiGet` API when used with keys spanning multiple column families and `sorted_input == false`.
- Fixed destruction of `ColumnFamilyData` objects. The earlier logic unlocked the DB mutex before destroying the thread-local `SuperVersion` pointers, which could result in a process crash if another thread managed to get a reference to the `ColumnFamilyData` object.
- Fixed a bug where `OnFlushCompleted` was not called for atomic flush.
- Fixed the `manifest_dump` `ldb` command.
- `GetLiveFilesMetaData()` now populates the `temperature`, `oldest_ancester_time`, and `file_creation_time` fields of its `LiveFileMetaData` results when the information is available. Previously these fields always contained zero, indicating unknown.
- Fixed a bug where `Get()` could return Status::OK() and an empty value for a non-existent key when `read_options.read_tier = kBlockCacheTier`.
- Fixed a bug where `get_context` didn't accumulate to statistics when a query failed.
- Added a new ldb command, `list_live_files_metadata`, that shows the live SST files, as well as their LSM storage level and the column family they belong to.
- Changed from `int` to `uint64_t` to support sub-compaction ids.
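A sketch of reading the newly populated `LiveFileMetaData` fields (path illustrative):

```cpp
#include <cassert>
#include <vector>
#include "rocksdb/db.h"
#include "rocksdb/metadata.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/meta_demo", &db);
  assert(s.ok());

  std::vector<rocksdb::LiveFileMetaData> files;
  db->GetLiveFilesMetaData(&files);
  for (const auto& f : files) {
    // These fields are populated when available; zero still means unknown.
    (void)f.temperature;
    (void)f.oldest_ancester_time;  // note: field name is spelled this way
    (void)f.file_creation_time;
  }
  delete db;
  return 0;
}
```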
Published by ajkr over 3 years ago
- `GetLiveFilesMetaData()` now populates the `temperature`, `oldest_ancester_time`, and `file_creation_time` fields of its `LiveFileMetaData` results when the information is available. Previously these fields always contained zero, indicating unknown.
- Fixed a bug where `DeleteFilesInRange()` may cause an ongoing compaction to report a corruption exception, or assert in debug builds. There is no actual data loss or corruption that we found.
- Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use Ribbon filters instead of Bloom, or `ribbonfilter` in place of `bloomfilter` in the configuration string.
- Allowed `DBWithTTL` to use the `DeleteRange` API just like other DBs. `DeleteRangeCF()`, which executes `WriteBatchInternal::DeleteRange()`, has been added to the handler in `DBWithTTLImpl::Write()` to implement it.
- Added a `cancel` field to `CompactRangeOptions`, allowing individual in-process manual range compactions to be cancelled.
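A sketch of switching to Ribbon filters as described above (the bits-per-key value is illustrative):

```cpp
#include <memory>
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

int main() {
  rocksdb::BlockBasedTableOptions table_options;
  // Drop-in replacement for NewBloomFilterPolicy(10.0): similar accuracy,
  // less memory, more CPU to construct.
  table_options.filter_policy.reset(rocksdb::NewRibbonFilterPolicy(10.0));

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  // ... open the DB with `options` as usual ...
  return 0;
}
```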
Published by akankshamahajan15 over 3 years ago
- Fixed a bug where the `GetLiveFiles()` output included a non-existent file called "OPTIONS-000000". Backups and checkpoints, which use `GetLiveFiles()`, failed on DBs impacted by this bug. Read-write DBs were impacted when the latest OPTIONS file failed to write and `fail_if_options_file_error == false`. Read-only DBs were impacted when no OPTIONS files existed.
- DB properties `rocksdb.cur-size-active-mem-table`, `rocksdb.cur-size-all-mem-tables`, and `rocksdb.size-all-mem-tables`.
- `ColumnFamilyOptions::sample_for_compression` now takes effect for the creation of all block-based tables. Previously it only took effect for block-based tables created by flush.
- `CompactFiles()` can no longer compact files from a lower level to an upper level, which has the risk of corrupting the DB (details: #8063). The validation is also added to all compactions.
- Use `strerror_r()` to get error messages.
- When the `Env` has the high-pri thread pool disabled (`Env::GetBackgroundThreads(Env::Priority::HIGH) == 0`).
- Allowed `DBOptions::max_open_files` to be set to a non-negative integer with `ColumnFamilyOptions::compaction_style = kCompactionStyleFIFO`.
- DB properties `rocksdb.cur-size-active-mem-table`, `rocksdb.cur-size-all-mem-tables`, and `rocksdb.size-all-mem-tables`.
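The memtable-size properties listed above can be read via `DB::GetProperty()`; a sketch (path illustrative):

```cpp
#include <cassert>
#include <string>
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/prop_demo", &db);
  assert(s.ok());

  std::string v;
  // Active memtable, active + unflushed immutable memtables, and
  // those plus pinned immutable memtables, respectively (in bytes).
  assert(db->GetProperty("rocksdb.cur-size-active-mem-table", &v));
  assert(db->GetProperty("rocksdb.cur-size-all-mem-tables", &v));
  assert(db->GetProperty("rocksdb.size-all-mem-tables", &v));
  delete db;
  return 0;
}
```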
- Use `yield` instead of `wfe` to relax the CPU, gaining better performance.
- Added `TableProperties::slow_compression_estimated_data_size` and `TableProperties::fast_compression_estimated_data_size`. When `ColumnFamilyOptions::sample_for_compression > 0`, they estimate what `TableProperties::data_size` would have been if the "fast" or "slow" (see the `ColumnFamilyOptions::sample_for_compression` API doc for definitions) compression had been used instead.
- Added `FlushReason::kWalFull`, which is reported when a memtable is flushed due to the WAL reaching its size limit; those flushes were previously reported as `FlushReason::kWriteBufferManager`. Also, changed the reason for flushes triggered by the write buffer manager to `FlushReason::kWriteBufferManager`; they were previously reported as `FlushReason::kWriteBufferFull`.

Published by zhichao-cao over 3 years ago
- Write stalls now begin only once `delayed_write_rate` is actually exceeded, with an initial burst allowance of 1 millisecond worth of bytes. Also, beyond the initial burst allowance, `delayed_write_rate` is now more strictly enforced, especially with multiple column families.
- Changed the default of `BackupableDBOptions::share_files_with_checksum` to `true` and deprecated `false` because of the potential for data loss. Note that accepting this change in behavior can temporarily increase backup data usage because files are not shared between backups using the two different settings. Also removed the obsolete option kFlagMatchInterimNaming.
- Added `FilterBlobByKey()` to `CompactionFilter`. Subclasses can override this method so that compaction filters can determine whether the actual blob value has to be read during compaction. Use the new `kUndetermined` in `CompactionFilter::Decision` to indicate that further action is necessary for the compaction filter to make a decision.

Published by jay-zhuang over 3 years ago
- Fixed a bug where `WRITE_PREPARED` and `WRITE_UNPREPARED` TransactionDB `MultiGet()` may return uncommitted data with a snapshot.
- `TransactionDB` returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees. There are certain cases where range deletion can still be used on such DBs; see the API doc on `TransactionDB::DeleteRange()` for details.
- `OptimisticTransactionDB` now returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees.
- In `DB::VerifyFileChecksums()`, we now fail with `Status::InvalidArgument` if the name of the checksum generator used for verification does not match the name of the checksum generator used for protecting the file when it was created.
- Fixed a bug in `ErrorHandler::SetBGError`.
- `WalAddition` and `WalDeletion` records could not be tolerated by older versions; fixed this by changing their encoded format to be ignorable by older versions.

Published by jay-zhuang over 3 years ago
- `TransactionDB` returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees. There are certain cases where range deletion can still be used on such DBs; see the API doc on `TransactionDB::DeleteRange()` for details.
- `OptimisticTransactionDB` now returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees.
- `WalAddition` and `WalDeletion` records could not be tolerated by older versions; fixed this by changing their encoded format to be ignorable by older versions.
- Writing a merge operand without a configured `merge_operator` now fails immediately, causing the DB to enter read-only mode. Previously, failure was deferred until the `merge_operator` was needed by a user read or a background operation.
- Fixed a bug in `ErrorHandler::SetBGError`.
- Fixed a WAL-gap issue when `WALRecoveryMode::kPointInTimeRecovery` is used. Gaps are still possible when WALs are truncated exactly on record boundaries; for complete protection, users should enable `track_and_verify_wals_in_manifest`.
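A sketch of enabling the WAL protection mentioned above (DB opening elided):

```cpp
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  // Track synced WAL numbers and sizes in the MANIFEST so that a missing
  // or truncated WAL is detected at recovery and reported as an error.
  options.track_and_verify_wals_in_manifest = true;
  options.wal_recovery_mode =
      rocksdb::WALRecoveryMode::kPointInTimeRecovery;
  // ... open the DB with `options` as usual ...
  return 0;
}
```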
- Fixed the handling of `read_amp_bytes_per_bit` during OPTIONS file parsing on big-endian architectures. Without this fix, the original code introduced in PR7659, when running on a big-endian machine, can mistakenly store read_amp_bytes_per_bit (a uint32) in little-endian format; future access to `read_amp_bytes_per_bit` will give wrong values. Little-endian architectures are not affected.
- `CompactRange` and `GetApproximateSizes`.
- Added `track_and_verify_wals_in_manifest`. If `true`, the log numbers and sizes of the synced WALs are tracked in the MANIFEST; then, during DB recovery, if a synced WAL is missing from disk, or the WAL's size does not match the recorded size in the MANIFEST, an error will be reported and the recovery will be aborted. Note that this option does not work with the secondary instance.

Published by ajkr over 3 years ago
- `TransactionDB` returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees. There are certain cases where range deletion can still be used on such DBs; see the API doc on `TransactionDB::DeleteRange()` for details.
- `OptimisticTransactionDB` now returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees.

Published by ajkr almost 4 years ago