rocksdb - RocksDB 6.15.2

Published by ramvadiv almost 4 years ago

6.15.2 (2020-12-22)

Bug Fixes

Fix failing RocksJava test compilation and add CI jobs
Fix jemalloc compilation issue on macOS
Fix build issues - compatibility with older gcc, older jemalloc libraries, docker warning when building i686 binaries

6.15.1 (2020-12-01)

Bug Fixes

Truncated WALs ending in incomplete records can no longer produce gaps in the recovered data when WALRecoveryMode::kPointInTimeRecovery is used. Gaps are still possible when WALs are truncated exactly on record boundaries.
Fix a bug where compressed blocks read by MultiGet are not inserted into the compressed block cache when use_direct_reads = true.

6.15.0 (2020-11-13)

Bug Fixes

Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index leading to wrong read results or other unexpected behavior.
Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partition reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads, leading to wrong read results.
Since 6.12, memtable lookup should report unrecognized value_type as corruption (#7121).
Since 6.14, fix false positive flush/compaction Status::Corruption failure when paranoid_file_checks == true and range tombstones were written to the compaction output files.
Since 6.14, fix a bug that could cause a stalled write to crash with a mix of slowdown and no_slowdown writes (WriteOptions.no_slowdown=true).
Fixed a bug which causes a hang when closing the DB if refit level is set in an opt build. The cause was that ContinueBackgroundWork() was called inside an assert statement, which is a no-op in opt builds. It was introduced in 6.14.
Fixed a bug which causes Get() to return incorrect result when a key's merge operand is applied twice. This can occur if the thread performing Get() runs concurrently with a background flush thread and another thread writing to the MANIFEST file (PR6069).
Reverted a behavior change silently introduced in 6.14.2, in which the effects of the ignore_unknown_options flag (used in option parsing/loading functions) changed.
Reverted a behavior change silently introduced in 6.14, in which options parsing/loading functions began returning NotFound instead of InvalidArgument for option names not available in the present version.
Fixed MultiGet bugs in which it did not return valid data when a user-defined timestamp is used.
Fixed a potential bug caused by evaluating TableBuilder::NeedCompact() before TableBuilder::Finish() in compaction job. For example, the NeedCompact() method of CompactOnDeletionCollector returned by built-in CompactOnDeletionCollectorFactory requires BlockBasedTable::Finish() to return the correct result. The bug can cause a compaction-generated file not to be marked for future compaction based on deletion ratio.
Fixed a seek issue with prefix extractor and timestamp.
Fixed a bug of encoding and parsing BlockBasedTableOptions::read_amp_bytes_per_bit as a 64-bit integer.
Fixed the logic of populating the native data structure for read_amp_bytes_per_bit during OPTIONS file parsing on big-endian architectures. Without this fix, the original code introduced in PR7659, when running on a big-endian machine, can mistakenly store read_amp_bytes_per_bit (a uint32) in little-endian format. Future accesses to read_amp_bytes_per_bit will then give wrong values. Little-endian architectures are not affected.
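
The two read_amp_bytes_per_bit fixes above concern how a 32-bit value is stored and read back. The following is a minimal sketch of the general technique, not the code from the fix itself: a fixed-width value is written byte by byte so that the result does not depend on host byte order, and a parsed 64-bit value is narrowed by value rather than by copying raw bytes. The names EncodeFixed32LE, DecodeFixed32LE and NarrowToUint32 are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a RocksDB function): store a uint32 as four bytes
// in little-endian order, independent of the host byte order.
inline void EncodeFixed32LE(unsigned char* dst, uint32_t value) {
  dst[0] = static_cast<unsigned char>(value & 0xff);
  dst[1] = static_cast<unsigned char>((value >> 8) & 0xff);
  dst[2] = static_cast<unsigned char>((value >> 16) & 0xff);
  dst[3] = static_cast<unsigned char>((value >> 24) & 0xff);
}

// Hypothetical helper: read the four bytes back into a native uint32.
inline uint32_t DecodeFixed32LE(const unsigned char* src) {
  return static_cast<uint32_t>(src[0]) |
         (static_cast<uint32_t>(src[1]) << 8) |
         (static_cast<uint32_t>(src[2]) << 16) |
         (static_cast<uint32_t>(src[3]) << 24);
}

// Narrow a parsed 64-bit number to a 32-bit field by value. Copying the first
// four bytes of the 64-bit object instead would pick up the low half only on
// little-endian hosts.
inline bool NarrowToUint32(uint64_t parsed, uint32_t* out) {
  if (parsed > UINT32_MAX) return false;  // value does not fit in 32 bits
  *out = static_cast<uint32_t>(parsed);
  return true;
}
```

With these helpers, encoding 0x01020304 always produces the bytes 04 03 02 01, and decoding them returns the same value on any architecture.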

Public API Change

Deprecate BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache and BlockBasedTableOptions::pin_top_level_index_and_filter. These options still take effect until users migrate to the replacement APIs in BlockBasedTableOptions::metadata_cache_options. Migration guidance can be found in the API comments on the deprecated options.
Add new API DB::VerifyFileChecksums to verify SST file checksum with corresponding entries in the MANIFEST if present. Current implementation requires scanning and recomputing file checksums.

Behavior Changes

The dictionary compression settings specified in ColumnFamilyOptions::compression_opts now additionally affect files generated by flush and compaction to non-bottommost level. Previously those settings at most affected files generated by compaction to bottommost level, depending on whether ColumnFamilyOptions::bottommost_compression_opts overrode them. Users who relied on dictionary compression settings in ColumnFamilyOptions::compression_opts affecting only the bottommost level can keep the behavior by moving their dictionary settings to ColumnFamilyOptions::bottommost_compression_opts and setting its enabled flag.
When the enabled flag is set in ColumnFamilyOptions::bottommost_compression_opts, those compression options now take effect regardless of the value in ColumnFamilyOptions::bottommost_compression. Previously, those compression options only took effect when ColumnFamilyOptions::bottommost_compression != kDisableCompressionOption. Now, they additionally take effect when ColumnFamilyOptions::bottommost_compression == kDisableCompressionOption (such a setting causes bottommost compression type to fall back to ColumnFamilyOptions::compression_per_level if configured, and otherwise fall back to ColumnFamilyOptions::compression).
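
The fallback order for the bottommost compression type described above can be sketched as follows. This is an illustration only: Compression and CfOptionsSketch are hypothetical stand-ins for the RocksDB types, and the choice of which compression_per_level entry applies (the one for the bottommost level, clamped to the last entry) is an assumption that the note does not spell out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; only the option names come from the note above.
enum class Compression { kNone, kSnappy, kZSTD, kDisableCompressionOption };

struct CfOptionsSketch {
  Compression compression = Compression::kSnappy;
  std::vector<Compression> compression_per_level;  // empty = not configured
  Compression bottommost_compression = Compression::kDisableCompressionOption;
};

// Resolve the compression type used for the bottommost level.
// Assumption: when compression_per_level is configured, the entry for the
// bottommost level is used, clamped to the last entry of the vector.
Compression BottommostCompressionType(const CfOptionsSketch& o,
                                      size_t bottommost_level) {
  if (o.bottommost_compression != Compression::kDisableCompressionOption) {
    return o.bottommost_compression;
  }
  if (!o.compression_per_level.empty()) {
    size_t idx = std::min(bottommost_level, o.compression_per_level.size() - 1);
    return o.compression_per_level[idx];
  }
  return o.compression;
}
```

Per the note, once the enabled flag is set in bottommost_compression_opts, those options apply whichever of the three branches supplies the compression type.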

New Features

An EXPERIMENTAL new Bloom alternative that saves about 30% space compared to Bloom filters, with about 3-4x construction time and similar query times is available using NewExperimentalRibbonFilterPolicy.

rocksdb - RocksDB 6.14.6

Published by ajkr almost 4 years ago

6.14.6 (2020-12-01)

Bug Fixes

Truncated WALs ending in incomplete records can no longer produce gaps in the recovered data when WALRecoveryMode::kPointInTimeRecovery is used. Gaps are still possible when WALs are truncated exactly on record boundaries.

rocksdb - RocksDB 6.14.5

Published by akankshamahajan15 almost 4 years ago

6.14.5 (2020-11-15)

Bug Fixes

Fix a bug of encoding and parsing BlockBasedTableOptions::read_amp_bytes_per_bit as a 64-bit integer.

6.14.4 (2020-11-05)

Bug Fixes

Fixed a potential bug caused by evaluating TableBuilder::NeedCompact() before TableBuilder::Finish() in compaction job. For example, the NeedCompact() method of CompactOnDeletionCollector returned by built-in CompactOnDeletionCollectorFactory requires BlockBasedTable::Finish() to return the correct result. The bug can cause a compaction-generated file not to be marked for future compaction based on deletion ratio.

6.14.3 (2020-10-30)

Bug Fixes

Reverted a behavior change silently introduced in 6.14.2, in which the effects of the ignore_unknown_options flag (used in option parsing/loading functions) changed.
Reverted a behavior change silently introduced in 6.14, in which options parsing/loading functions began returning NotFound instead of InvalidArgument for option names not available in the present version.

6.14.2 (2020-10-21)

Bug Fixes

Fixed a bug which causes a hang when closing the DB if refit level is set in an opt build. The cause was that ContinueBackgroundWork() was called inside an assert statement, which is a no-op in opt builds. It was introduced in 6.14.

6.14.1 (2020-10-13)

Bug Fixes

Since 6.12, memtable lookup should report unrecognized value_type as corruption (#7121).
Since 6.14, fix false positive flush/compaction Status::Corruption failure when paranoid_file_checks == true and range tombstones were written to the compaction output files.
Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index leading to wrong read results or other unexpected behavior.
Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partition reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads, leading to wrong read results.

6.14 (2020-10-09)

Bug fixes

Fixed a bug after a CompactRange() with CompactRangeOptions::change_level set fails due to a conflict in the level change step, which caused all subsequent calls to CompactRange() with CompactRangeOptions::change_level set to incorrectly fail with a Status::NotSupported("another thread is refitting") error.
Fixed a bug that the bottom most level compaction could still be a trivial move even if BottommostLevelCompaction.kForce or kForceOptimized is set.

Public API Change

The methods to create and manage EncryptedEnv have been changed. The EncryptionProvider is now passed to NewEncryptedEnv as a shared pointer, rather than a raw pointer. Similarly, the CTREncryptionProvider now takes a shared pointer, rather than a reference, to a BlockCipher. CreateFromString methods have been added to BlockCipher and EncryptionProvider to provide a single API by which different ciphers and providers can be created, respectively.
The internal classes (CTREncryptionProvider, ROT13BlockCipher, CTRCipherStream) associated with the EncryptedEnv have been moved out of the public API. To create a CTREncryptionProvider, one can either use EncryptionProvider::NewCTRProvider, or EncryptionProvider::CreateFromString("CTR"). To create a new ROT13BlockCipher, one can either use BlockCipher::NewROT13Cipher or BlockCipher::CreateFromString("ROT13").
The EncryptionProvider::AddCipher method has been added to allow keys to be added to an EncryptionProvider. This API will allow future providers to support multiple cipher keys.
Add a new option "allow_data_in_errors". When set, it lets users opt in to error messages that contain corrupted keys/values. Corrupt keys and values will then be logged in messages, logs, statuses, etc., giving users useful information about the affected data. The option is false by default to prevent user data from being exposed, so data is currently redacted from logs, messages, and statuses by default.
AdvancedColumnFamilyOptions::force_consistency_checks is now true by default, for more proactive DB corruption detection at virtually no cost (estimated two extra CPU cycles per million on a major production workload). Corruptions reported by these checks now mention "force_consistency_checks" in case a false positive corruption report is suspected and the option needs to be disabled (unlikely). Since existing column families have a saved setting for force_consistency_checks, only new column families will pick up the new default.

General Improvements

The settings of the DBOptions and ColumnFamilyOptions are now managed by Configurable objects (see New Features). The same convenience methods to configure these options still exist but the backend implementation has been unified under a common implementation.

New Features

Methods to configure, serialize, and compare objects (such as TableFactory) are exposed directly through the Configurable base class (from which these objects inherit). This change will allow for better and more thorough configuration management and retrieval in the future. The options for a Configurable object can be set via the ConfigureFromMap, ConfigureFromString, or ConfigureOption method. The serialized version of the options of an object can be retrieved via the GetOptionString, ToString, or GetOption methods. The list of options supported by an object can be obtained via the GetOptionNames method. The "raw" object (such as the BlockBasedTableOptions) for an option may be retrieved via the GetOptions method. Configurable options can be compared via the AreEquivalent method. The settings within a Configurable object may be validated via the ValidateOptions method. The object may be initialized (at which point only mutable options may be updated) via the PrepareOptions method.
Introduce options.check_flush_compaction_key_order with a default value of true. With this option, during flush and compaction, key order will be checked when writing to each SST file. If the order is violated, the flush or compaction will fail.
Added is_full_compaction to CompactionJobStats, so that the information is available through the EventListener interface.
Add more stats for MultiGet in Histogram to get number of data blocks, index blocks, filter blocks and sst files read from file system per level.
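
The key order check behind options.check_flush_compaction_key_order can be pictured with the sketch below. It is an illustration under stated assumptions, not RocksDB's implementation: plain std::string comparison stands in for the DB's comparator, and KeysInOrder is a hypothetical name. The note does not say whether equal adjacent keys count as a violation, so this sketch rejects only a key that sorts before its predecessor.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical writer-side check: returns false as soon as a key sorts
// before the key written just ahead of it.
// Assumption: lexicographic std::string comparison replaces the comparator
// a real DB would use.
bool KeysInOrder(const std::vector<std::string>& keys) {
  for (size_t i = 1; i < keys.size(); ++i) {
    if (keys[i] < keys[i - 1]) return false;  // out-of-order key found
  }
  return true;
}
```

A caller would treat a false result the way the note describes: the flush or compaction producing the file fails instead of writing a misordered SST file.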

rocksdb - RocksDB 6.12.7

Published by ajkr about 4 years ago

6.12.7 (2020-10-14)

Other

Fix build issue to enable RocksJava release for ppc64le

rocksdb - RocksDB 6.13.3

Published by ajkr about 4 years ago

6.13.3 (2020-10-14)

Bug Fixes

Fix a bug that could cause a stalled write to crash with a mix of slowdown and no_slowdown writes (WriteOptions.no_slowdown=true).

6.13.2 (2020-10-13)

Bug Fixes

Fix false positive flush/compaction Status::Corruption failure when paranoid_file_checks == true and range tombstones were written to the compaction output files.

6.13.1 (2020-10-12)

Bug Fixes

Since 6.12, memtable lookup should report unrecognized value_type as corruption (#7121).
Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index leading to wrong read results or other unexpected behavior.
Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partition reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads, leading to wrong read results.

6.13 (2020-09-24)

Bug fixes

Fix a performance regression introduced in 6.4 that makes an upper bound check for every Next() even if keys are within a data block that is within the upper bound.
Fix a possible corruption to the LSM state (overlapping files within a level) when a CompactRange() for refitting levels (CompactRangeOptions::change_level == true) and another manual compaction are executed in parallel.
Sanitize recycle_log_file_num to zero when the user attempts to enable it in combination with WALRecoveryMode::kTolerateCorruptedTailRecords. Previously the two features were allowed together, which compromised the user's configured crash-recovery guarantees.
Fix a bug where a level refitting in CompactRange() might race with an automatic compaction that puts the data to the target level of the refitting. The bug has been there for years.
Fixed a bug in version 6.12 in which BackupEngine::CreateNewBackup could fail intermittently with non-OK status when backing up a read-write DB configured with a DBOptions::file_checksum_gen_factory.
Fix useless no-op compactions scheduled upon snapshot release when options.disable_auto_compactions = true.
Fix a bug where, when max_write_buffer_size_to_maintain is set, immutable flushed memtable destruction is delayed until the next super version is installed. A memtable is not added to the delete list because of the reference held by the super version, and the super version doesn't switch because of the empty delete list. So memory usage keeps increasing beyond write_buffer_size + max_write_buffer_size_to_maintain.
Avoid converting MERGES to PUTS when allow_ingest_behind is true.
Fix compression dictionary sampling together with SstFileWriter. Previously, the dictionary would be trained/finalized immediately with zero samples. Now, the whole SstFileWriter file is buffered in memory and then sampled.
Fix a bug with avoid_unnecessary_blocking_io=1 and creating backups (BackupEngine::CreateNewBackup) or checkpoints (Checkpoint::Create). With this setting and WAL enabled, these operations could randomly fail with non-OK status.
Fix a bug in which bottommost compaction continues to advance the underlying InternalIterator to skip tombstones even after shutdown.
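
The recycle_log_file_num sanitization mentioned in the list above amounts to overriding one option when it conflicts with another. A minimal sketch, assuming hypothetical stand-in types (DbOptionsSketch, WALRecoveryModeSketch) and covering only the single combination the note names:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the WAL recovery modes.
enum class WALRecoveryModeSketch {
  kTolerateCorruptedTailRecords,
  kAbsoluteConsistency,
  kPointInTimeRecovery,
  kSkipAnyCorruptedRecords,
};

// Hypothetical mirror of the two options involved.
struct DbOptionsSketch {
  size_t recycle_log_file_num = 0;
  WALRecoveryModeSketch wal_recovery_mode =
      WALRecoveryModeSketch::kPointInTimeRecovery;
};

// Returns a sanitized copy of the options. Only the combination named in the
// note is handled; this sketch makes no claim about other recovery modes.
DbOptionsSketch SanitizeSketch(DbOptionsSketch o) {
  if (o.wal_recovery_mode ==
          WALRecoveryModeSketch::kTolerateCorruptedTailRecords &&
      o.recycle_log_file_num > 0) {
    o.recycle_log_file_num = 0;  // log recycling is turned off
  }
  return o;
}
```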

New Features

A new field std::string requested_checksum_func_name is added to FileChecksumGenContext, which enables the checksum factory to create generators for a suite of different functions.
Added a new subcommand, ldb unsafe_remove_sst_file, which removes a lost or corrupt SST file from a DB's metadata. This command involves data loss and must not be used on a live DB.

Performance Improvements

Reduce thread number for multiple DB instances by re-using one global thread for statistics dumping and persisting.
Reduce write-amp in heavy write bursts in kCompactionStyleLevel compaction style with level_compaction_dynamic_level_bytes set.
BackupEngine incremental backups no longer read DB table files that are already saved to a shared part of the backup directory, unless share_files_with_checksum is used with kLegacyCrc32cAndFileSize naming (discouraged).
- For share_files_with_checksum, we are confident there is no regression (vs. pre-6.12) in detecting DB or backup corruption at backup creation time, mostly because the old design did not leverage this extra checksum computation for detecting inconsistencies at backup creation time.
- For share_table_files without "checksum" (not recommended), there is a regression in detecting fundamentally unsafe use of the option, greatly mitigated by file size checking (under "Behavior Changes"). Almost no reason to use share_files_with_checksum=false should remain.
- DB::VerifyChecksum and BackupEngine::VerifyBackup with checksum checking are still able to catch corruptions that CreateNewBackup does not.

Public API Change

Expose kTypeDeleteWithTimestamp in EntryType and update GetEntryType() accordingly.
Added file_checksum and file_checksum_func_name to TableFileCreationInfo, which can pass the table file checksum information through the OnTableFileCreated callback during flush and compaction.
A warning is added to DB::DeleteFile() API describing its known problems and deprecation plan.
Add a new stats level, i.e. StatsLevel::kExceptTickers (PR7329) to exclude tickers even if application passes a non-null Statistics object.
Added a new status code IOStatus::IOFenced() for the Env/FileSystem to indicate that writes from this instance are fenced off. Like any other background error, this error is returned to the user in Put/Merge/Delete/Flush calls and can be checked using Status::IsIOFenced().

Behavior Changes

The default return status of the file abstraction FSRandomAccessFile.Prefetch() is changed from OK to NotSupported. If the user's derived file class doesn't implement prefetch, RocksDB will create an internal prefetch buffer to improve read performance.
When a retryable IO error happens during Flush (manifest write error is excluded) and WAL is disabled, it was originally mapped to kHardError. Now, it is mapped to a soft error, so the DB will not stall writes unless the memtable is full. At the same time, when auto resume is triggered to recover from the retryable IO error during Flush, SwitchMemtable is not called, to avoid generating too many small immutable memtables. If WAL is enabled, there is no behavior change.
When considering whether a table file is already backed up in a shared part of backup directory, BackupEngine would already query the sizes of source (DB) and pre-existing destination (backup) files. BackupEngine now uses these file sizes to detect corruption, as at least one of (a) old backup, (b) backup in progress, or (c) current DB is corrupt if there's a size mismatch.

Others

Errors in prefetching partitioned index blocks will no longer be swallowed. They will fail the query and return the IOError to users.

rocksdb - RocksDB 6.12.6

Published by ajkr about 4 years ago

6.12.6 (2020-10-13)

Bug Fixes

Fix false positive flush/compaction Status::Corruption failure when paranoid_file_checks == true and range tombstones were written to the compaction output files.

6.12.5 (2020-10-12)

Bug Fixes

Since 6.12, memtable lookup should report unrecognized value_type as corruption (#7121).
Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index leading to wrong read results or other unexpected behavior.
Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partition reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads, leading to wrong read results.

6.12.4 (2020-09-18)

Public API Change

Reworked BackupableDBOptions::share_files_with_checksum_naming (new in 6.12) with some minor improvements and to better support those who were extracting file sizes from backup file names.

6.12.3 (2020-09-16)

Bug fixes

Fixed a bug in size-amp-triggered and periodic-triggered universal compaction, where the compression settings for the first input level were used rather than the compression settings for the output (bottom) level.

6.12.2 (2020-09-14)

Public API Change

BlobDB now exposes the start of the expiration range of TTL blob files via the GetLiveFilesMetaData API.

6.12.1 (2020-08-20)

Bug fixes

BackupEngine::CreateNewBackup could fail intermittently with non-OK status when backing up a read-write DB configured with a DBOptions::file_checksum_gen_factory. This issue has been worked-around such that CreateNewBackup should succeed, but (until fully fixed) BackupEngine might not see all checksums available in the DB.

6.12 (2020-07-28)

Public API Change

Encryption file classes now exposed for inheritance in env_encryption.h
File I/O listener is extended to cover more I/O operations. Now class EventListener in listener.h contains new callback functions: OnFileFlushFinish(), OnFileSyncFinish(), OnFileRangeSyncFinish(), OnFileTruncateFinish(), and OnFileCloseFinish().
FileOperationInfo now reports duration measured by std::chrono::steady_clock and start_ts measured by std::chrono::system_clock instead of start and finish timestamps measured by system_clock. Note that system_clock is called before steady_clock in program order when an operation starts.
DB::GetDbSessionId(std::string& session_id) is added. session_id stores a unique identifier that gets reset every time the DB is opened. This DB session ID should be unique among all open DB instances on all hosts, and should be unique among re-openings of the same or other DBs. This identifier is recorded in the LOG file on the line starting with "DB Session ID:".
DB::OpenForReadOnly() now returns Status::NotFound when the specified DB directory does not exist. Previously the error returned depended on the underlying Env. This change is available in all 6.11 releases as well.
A parameter verify_with_checksum is added to BackupEngine::VerifyBackup, which is false by default. If it is true, BackupEngine::VerifyBackup verifies checksums and file sizes of backup files. Pass false for verify_with_checksum to maintain the previous behavior and performance of BackupEngine::VerifyBackup, by only verifying sizes of backup files.
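
The FileOperationInfo timing change above pairs a wall-clock start time with a monotonic duration. The sketch below shows that pattern with the standard library only. OpTimingSketch and TimeOperation are hypothetical names, not part of the listener API.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical record, loosely modeled on the two fields the note names.
struct OpTimingSketch {
  std::chrono::system_clock::time_point start_ts;  // wall-clock start time
  std::chrono::nanoseconds duration;               // monotonic elapsed time
};

// Runs `op` and times it. system_clock is sampled before steady_clock at the
// start, matching the program order described in the note.
template <typename Op>
OpTimingSketch TimeOperation(Op&& op) {
  const auto start_ts = std::chrono::system_clock::now();
  const auto begin = std::chrono::steady_clock::now();
  std::forward<Op>(op)();
  const auto end = std::chrono::steady_clock::now();
  return {start_ts,
          std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin)};
}
```

Measuring the duration with steady_clock keeps it non-negative even if the wall clock is adjusted while the operation runs, which two system_clock timestamps cannot guarantee.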

Behavior Changes

Best-efforts recovery ignores CURRENT file completely. If CURRENT file is missing during recovery, best-efforts recovery still proceeds with MANIFEST file(s).
In best-efforts recovery, an error that is not Corruption or IOError::kNotFound or IOError::kPathNotFound was overwritten silently. This is fixed by checking all non-OK cases and returning early.
When file_checksum_gen_factory is set to GetFileChecksumGenCrc32cFactory(), BackupEngine will compare the crc32c checksums of table files computed when creating a backup to the expected checksums stored in the DB manifest, and will fail CreateNewBackup() on mismatch (corruption). If the file_checksum_gen_factory is not set or set to any other customized factory, there is no checksum verification to detect if SST files in a DB are corrupt when read, copied, and independently checksummed by BackupEngine.
When a DB sets stats_dump_period_sec > 0, either as the initial value for DB open or as a dynamic option change, the first stats dump is staggered in the following X seconds, where X is an integer in [0, stats_dump_period_sec). Subsequent stats dumps are still spaced stats_dump_period_sec seconds apart.
When the paranoid_file_checks option is true, a hash of all keys and values is generated when the SST file is written, and then the values are read back in to validate the file. A corruption is signaled if the two hashes do not match.
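
The write-time versus read-back hash comparison described for paranoid_file_checks can be sketched as follows. FNV-1a is used purely as a stand-in, since the note does not say which hash function RocksDB uses. Entry, HashEntries and ReadBackMatches are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, std::string>;  // (key, value)

// FNV-1a over one byte string, chained from a running state `h`. The length
// is mixed in first so that ("ab", "c") and ("a", "bc") hash differently.
inline uint64_t Fnv1aMix(uint64_t h, const std::string& bytes) {
  const uint64_t kPrime = 1099511628211ULL;
  h ^= static_cast<uint64_t>(bytes.size());
  h *= kPrime;
  for (unsigned char c : bytes) {
    h ^= c;
    h *= kPrime;
  }
  return h;
}

// Order-sensitive hash over every key and value in the file.
uint64_t HashEntries(const std::vector<Entry>& entries) {
  uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
  for (const auto& kv : entries) {
    h = Fnv1aMix(h, kv.first);
    h = Fnv1aMix(h, kv.second);
  }
  return h;
}

// Compares the hash taken while writing with the hash of the data read back.
// A false result corresponds to the corruption signal in the note.
bool ReadBackMatches(const std::vector<Entry>& written,
                     const std::vector<Entry>& read_back) {
  return HashEntries(written) == HashEntries(read_back);
}
```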

Bug fixes

Compressed block cache was automatically disabled with read-only DBs by mistake. Now it is fixed: compressed block cache is in effect with read-only DBs too.
Fix a bug of wrong iterator result if another thread finishes an update and a DB flush between two statements.
Disable file deletion after MANIFEST write/sync failure until db re-open or Resume() so that subsequent re-open will not see MANIFEST referencing deleted SSTs.
Fix a bug when index_type == kTwoLevelIndexSearch in PartitionedIndexBuilder to update FlushPolicy to point to internal key partitioner when it changes from user-key mode to internal-key mode in index partition.
Make compaction report InternalKey corruption while iterating over the input.
Fix a bug which may cause MultiGet to be slow because it may read more data than requested, but this won't affect correctness. The bug was introduced in 6.10 release.
Fail recovery and report once hitting a physical log record checksum mismatch, while reading MANIFEST. RocksDB should not continue processing the MANIFEST any further.

New Features

DB identity (db_id) and DB session identity (db_session_id) are added to table properties and stored in SST files. SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as DB::GetDbSessionId. The session ID for SstFileWriter (resp., Repairer) resets every time SstFileWriter::Open (resp., Repairer::Run) is called.
Added experimental option BlockBasedTableOptions::optimize_filters_for_memory for reducing allocated memory size of Bloom filters (~10% savings with Jemalloc) while preserving the same general accuracy. To have an effect, the option requires format_version=5 and malloc_usable_size. Enabling this option is forward and backward compatible with existing format_version=5.
BackupableDBOptions::share_files_with_checksum_naming is added with new default behavior for naming backup files with share_files_with_checksum, to address performance and backup integrity issues. See API comments for details.
Added an auto resume function to automatically recover the DB from a background retryable IO error. When a retryable IO error happens during flush and WAL write, the error is mapped to a hard error and the DB will be in read mode. When a retryable IO error happens during compaction, the error is mapped to a soft error and the DB remains in write/read mode. The auto resume function creates a thread for a DB to call DB->ResumeImpl() to try to recover from a retryable IO error during flush and WAL write. Compaction is rescheduled by itself if a retryable IO error happens. Auto resume may itself hit another retryable IO error during the recovery, in which case the recovery fails. Retrying the auto resume may solve the issue, so max_bgerror_resume_count decides how many resume cycles are tried in total. If it is <=0, auto resume for retryable IO errors is disabled. The default is INT_MAX, which leads to infinite auto resume. bgerror_resume_retry_interval decides the time interval between two auto resumes.
Option max_subcompactions can be set dynamically using DB::SetDBOptions().
Added experimental ColumnFamilyOptions::sst_partitioner_factory to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).
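
The interaction of max_bgerror_resume_count and bgerror_resume_retry_interval in the auto resume feature above can be pictured as a bounded retry loop. The sketch below is illustrative only and is not how RocksDB implements it: AutoResumeSketch, ResumeResult and the two callbacks are hypothetical, and the unit of the interval is left to the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical result type for the sketch.
struct ResumeResult {
  bool recovered;
  int attempts;
};

// Tries to resume up to max_bgerror_resume_count times, waiting
// bgerror_resume_retry_interval between failed attempts. A count <= 0
// disables auto resume, as the note states.
ResumeResult AutoResumeSketch(int max_bgerror_resume_count,
                              uint64_t bgerror_resume_retry_interval,
                              const std::function<bool()>& try_resume,
                              const std::function<void(uint64_t)>& wait) {
  ResumeResult r{false, 0};
  if (max_bgerror_resume_count <= 0) return r;  // auto resume disabled
  while (r.attempts < max_bgerror_resume_count) {
    ++r.attempts;
    if (try_resume()) {  // recovery succeeded
      r.recovered = true;
      break;
    }
    if (r.attempts < max_bgerror_resume_count) {
      wait(bgerror_resume_retry_interval);  // pause before the next cycle
    }
  }
  return r;
}
```

Passing the wait step as a callback keeps the sketch testable without real sleeping. A real caller would sleep for the configured interval there.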

Performance Improvements

Eliminate key copies for internal comparisons while accessing ingested block-based tables.
Reduce key comparisons during random access in all block-based tables.
BackupEngine avoids unnecessary repeated checksum computation for backing up a table file to the shared_checksum directory when using share_files_with_checksum_naming = kUseDbSessionId (new default), except on SST files generated before this version of RocksDB, which fall back on using kLegacyCrc32cAndFileSize.

rocksdb - RocksDB 6.11.6

Published by ajkr about 4 years ago

6.11.6 (2020-10-12)

Bug Fixes

Fixed a bug in the following combination of features: indexes with user keys (format_version >= 3), indexes are partitioned (index_type == kTwoLevelIndexSearch), and some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache). The bug could cause keys to be truncated when read from the index leading to wrong read results or other unexpected behavior.
Fixed a bug when indexes are partitioned (index_type == kTwoLevelIndexSearch), some index partitions are pinned in memory (BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache), and partition reads could be mixed between block cache and directly from the file (e.g., with enable_index_compression == 1 and mmap_read == 1, partitions that were stored uncompressed due to poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads, leading to wrong read results.

6.11.5 (2020-07-23)

Bug Fixes

Memtable lookup should report unrecognized value_type as corruption (#7121).

rocksdb - RocksDB 6.11.4

Published by ajkr over 4 years ago

6.11.4 (2020-07-15)

Bug Fixes

Make compaction report InternalKey corruption while iterating over the input.

6.11.3 (2020-07-09)

Bug Fixes

Fix a bug when index_type == kTwoLevelIndexSearch in PartitionedIndexBuilder to update FlushPolicy to point to internal key partitioner when it changes from user-key mode to internal-key mode in index partition.
Disable file deletion after MANIFEST write/sync failure until db re-open or Resume() so that subsequent re-open will not see MANIFEST referencing deleted SSTs.

6.11.1 (2020-06-23)

Bug Fixes

Best-efforts recovery ignores CURRENT file completely. If CURRENT file is missing during recovery, best-efforts recovery still proceeds with MANIFEST file(s).
In best-efforts recovery, an error that is not Corruption or IOError::kNotFound or IOError::kPathNotFound will be overwritten silently. Fix this by checking all non-ok cases and returning early.
Compressed block cache was automatically disabled with read-only DBs by mistake. Now it is fixed: compressed block cache will be in effect with read-only DBs too.
Fail recovery and report once hitting a physical log record checksum mismatch, while reading MANIFEST. RocksDB should not continue processing the MANIFEST any further.
Fix a bug of wrong iterator result if another thread finishes an update and a DB flush between two statements.

Public API Change

DB::OpenForReadOnly() now returns Status::NotFound when the specified DB directory does not exist. Previously the error returned depended on the underlying Env.

6.11 (2020-06-12)

Bug Fixes

Fix consistency checking error swallowing in some cases when options.force_consistency_checks = true.
Fix possible false NotFound status from batched MultiGet using index type kHashSearch.
Fix corruption caused by enabling delete triggered compaction (NewCompactOnDeletionCollectorFactory) in universal compaction mode, along with parallel compactions. The bug can result in two parallel compactions picking the same input files, resulting in the DB resurrecting older and deleted versions of some keys.
Fix a use-after-free bug in best-efforts recovery. column_family_memtables_ needs to point to valid ColumnFamilySet.
Let best-efforts recovery ignore corrupted files during table loading.
Fix corrupt key read from ingested file when iterator direction switches from reverse to forward at a key that is a prefix of another key in the same file. It is only possible in files with a non-zero global seqno.
Fix abnormally large estimate from GetApproximateSizes when a range starts near the end of one SST file and near the beginning of another. Now GetApproximateSizes consistently and fairly includes the size of SST metadata in addition to data blocks, attributing metadata proportionally among the data blocks based on their size.
Fix potential file descriptor leakage in PosixEnv's IsDirectory() and NewRandomAccessFile().
Fix false negative from the VerifyChecksum() API when there is a checksum mismatch in an index partition block in a BlockBasedTable format table file (index_type is kTwoLevelIndexSearch).
Fix sst_dump to return non-zero exit code if the specified file is not a recognized SST file or fails requested checks.
Fix incorrect results from batched MultiGet for duplicate keys, when the duplicate key matches the largest key of an SST file and the value type for the key in the file is a merge value.
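
The GetApproximateSizes fix in the list above attributes SST metadata to data blocks in proportion to their size. A minimal sketch of that proportional idea follows; the function name and its three parameters are hypothetical and are not RocksDB APIs, and the real implementation may round or account differently.

```cpp
#include <cstdint>

// Hypothetical helper (not a RocksDB API): scales the data-block bytes covered
// by a key range so that the file's metadata is shared proportionally among
// the data blocks.
uint64_t ApproxSizeWithMetadataSketch(uint64_t data_bytes_in_range,
                                      uint64_t total_data_bytes,
                                      uint64_t file_size) {
  if (total_data_bytes == 0) {
    return 0;  // No data blocks, so nothing to attribute.
  }
  // Fraction of the file's data blocks that the range covers.
  const double fraction = static_cast<double>(data_bytes_in_range) /
                          static_cast<double>(total_data_bytes);
  // The same fraction of the whole file (data plus metadata) is reported.
  return static_cast<uint64_t>(fraction * static_cast<double>(file_size));
}
```

Under this model, a range that touches only a small slice at the end of one file and a small slice at the start of the next receives only a correspondingly small share of each file's metadata.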

Public API Change

Flush(..., column_family) may return Status::ColumnFamilyDropped() instead of Status::InvalidArgument() if column_family is dropped while processing the flush request.
BlobDB now explicitly disallows using the default column family's storage directories as blob directory.
DeleteRange now returns Status::InvalidArgument if the range's end key comes before its start key according to the user comparator. Previously the behavior was undefined.
ldb now uses options.force_consistency_checks = true by default and "--disable_consistency_checks" is added to disable it.
DB::OpenForReadOnly no longer creates files or directories if the named DB does not exist, unless create_if_missing is set to true.
The consistency checks that validate LSM state changes (table file additions/deletions during flushes and compactions) are now stricter, more efficient, and no longer optional, i.e. they are performed even if force_consistency_checks is false.
Disable delete triggered compaction (NewCompactOnDeletionCollectorFactory) in universal compaction mode and num_levels = 1 in order to avoid a corruption bug.
pin_l0_filter_and_index_blocks_in_cache no longer applies to L0 files larger than 1.5 * write_buffer_size to give more predictable memory usage. Such L0 files may exist due to intra-L0 compaction, external file ingestion, or user dynamically changing write_buffer_size (note, however, that files that are already pinned will continue being pinned, even after such a dynamic change).
In point-in-time wal recovery mode, fail database recovery in case of IOError while reading the WAL to avoid data loss.
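
The DeleteRange argument check described in the list above can be pictured with a small stand-alone sketch. It does not use RocksDB types: SketchStatus, UserCompare, and ValidateDeleteRangeSketch are hypothetical names, and treating an empty range (start equal to end) as valid is an assumption based on the wording "comes before".

```cpp
#include <functional>
#include <string>

// Hypothetical stand-in for a status code; not rocksdb::Status.
enum class SketchStatus { kOk, kInvalidArgument };

// Three-way user comparator: negative if a < b, zero if equal, positive if a > b.
using UserCompare = std::function<int(const std::string&, const std::string&)>;

// Rejects a range whose end key orders before its start key under the
// user comparator. Assumption: an empty range (begin == end) is accepted.
SketchStatus ValidateDeleteRangeSketch(const std::string& begin,
                                       const std::string& end,
                                       const UserCompare& cmp) {
  if (cmp(end, begin) < 0) {
    return SketchStatus::kInvalidArgument;
  }
  return SketchStatus::kOk;
}

// Bytewise ordering and its reverse, to show that validity depends on the
// comparator rather than on raw byte order.
int BytewiseCompareSketch(const std::string& a, const std::string& b) {
  return a.compare(b);
}
int ReverseBytewiseCompareSketch(const std::string& a, const std::string& b) {
  return b.compare(a);
}
```

With the bytewise comparator the range ["b", "a") is rejected, while with the reverse comparator the same pair is a valid range.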

New Features

sst_dump adds a new --readahead_size argument. Users can specify the read size when scanning the data. sst_dump also tries to prefetch the tail part of the SST files, which usually saves some I/Os as well.
Generate file checksum in SstFileWriter if Options.file_checksum_gen_factory is set. The checksum and checksum function name are stored in ExternalSstFileInfo after the sst file write is finished.
Add a value_size_soft_limit in read options which limits the cumulative value size of keys read in batches in MultiGet. Once the cumulative value size of found keys exceeds read_options.value_size_soft_limit, all the remaining keys are returned with status Abort without further finding their values. By default the value_size_soft_limit is std::numeric_limits<uint64_t>::max().
Enable SST file ingestion with file checksum information when calling IngestExternalFiles(const std::vector<IngestExternalFileArg>& args). Added files_checksums and files_checksum_func_names to IngestExternalFileArg such that user can ingest the sst files with their file checksum information. Added verify_file_checksum to IngestExternalFileOptions (default is True). To be backward compatible, if DB does not enable file checksum or user does not provide checksum information (vectors of files_checksums and files_checksum_func_names are both empty), verification of file checksum is always successful. If DB enables file checksum, DB will always generate the checksum for each ingested SST file during Prepare stage of ingestion and store the checksum in Manifest, unless verify_file_checksum is False and checksum information is provided by the application. In this case, we only verify the checksum function name and directly store the ingested checksum in Manifest. If verify_file_checksum is set to True, DB will verify the ingested checksum and function name with the generated ones. Any mismatch will fail the ingestion. Note that, if IngestExternalFileOptions::write_global_seqno is True, the seqno will be changed in the ingested file. Therefore, the checksum of the file will be changed. In this case, a new checksum will be generated after the seqno is updated and be stored in the Manifest.
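
The value_size_soft_limit behavior described in the list above can be modeled without RocksDB. In the sketch below, the std::map stands in for the database, and LookupStatus, LookupResult, and MultiGetWithSoftLimitSketch are hypothetical names. It also assumes keys are processed in the order given, which the real batched MultiGet does not promise.

```cpp
#include <cstdint>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Hypothetical status codes; not rocksdb::Status.
enum class LookupStatus { kOk, kNotFound, kAborted };

struct LookupResult {
  LookupStatus status;
  std::string value;
};

// Looks up keys in order. Once the cumulative size of found values exceeds
// the soft limit, every remaining key is reported as aborted without a lookup.
std::vector<LookupResult> MultiGetWithSoftLimitSketch(
    const std::map<std::string, std::string>& store,
    const std::vector<std::string>& keys,
    uint64_t value_size_soft_limit = std::numeric_limits<uint64_t>::max()) {
  std::vector<LookupResult> results;
  results.reserve(keys.size());
  uint64_t cumulative_size = 0;
  bool limit_exceeded = false;
  for (const std::string& key : keys) {
    if (limit_exceeded) {
      results.push_back({LookupStatus::kAborted, std::string()});
      continue;
    }
    auto it = store.find(key);
    if (it == store.end()) {
      results.push_back({LookupStatus::kNotFound, std::string()});
      continue;
    }
    cumulative_size += it->second.size();
    // The key that crosses the limit is still returned; only later keys abort.
    results.push_back({LookupStatus::kOk, it->second});
    if (cumulative_size > value_size_soft_limit) {
      limit_exceeded = true;
    }
  }
  return results;
}
```

Because the limit is soft, the total size returned can overshoot it by up to one value; callers that need a hard cap would have to check sizes themselves.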

Performance Improvements

Eliminate redundant key comparisons during random access in block-based tables.

rocksdb - RocksDB 6.10.2

Published by anand1976 over 4 years ago

6.10.2 (2020-06-05)

Bug fix

Fix false negative from the VerifyChecksum() API when there is a checksum mismatch in an index partition block in a BlockBasedTable format table file (index_type is kTwoLevelIndexSearch).

rocksdb - RocksDB 6.10.1

Published by anand1976 over 4 years ago

6.10.1 (2020-05-27)

Bug fix

Remove "u''" in TARGETS file.
Fix db_stress_lib target in buck.

6.10 (2020-05-02)

Behavior Changes

Disable delete triggered compaction (NewCompactOnDeletionCollectorFactory) in universal compaction mode and num_levels = 1 in order to avoid a corruption bug.

Bug Fixes

Fix wrong result being read from ingested file. May happen when a key in the file happens to be a prefix of another key also in the file. The issue can further cause more data corruption. The issue exists with rocksdb >= 5.0.0 since DB::IngestExternalFile() was introduced.
Finish implementation of BlockBasedTableOptions::IndexType::kBinarySearchWithFirstKey. It's now ready for use. Significantly reduces read amplification in some setups, especially for iterator seeks.
Fix a bug by updating CURRENT file so that it points to the correct MANIFEST file after best-efforts recovery.
Fixed a bug where ColumnFamilyHandle objects were not cleaned up in case an error happened during BlobDB's open after the base DB had been opened.
Fix a potential undefined behavior caused by trying to dereference nullable pointer (timestamp argument) in DB::MultiGet.
Fix a bug caused by not including user timestamp in MultiGet LookupKey construction. This can lead to wrong query result since the trailing bytes of a user key, if not shorter than timestamp, will be mistaken for user timestamp.
Fix a bug caused by using wrong compare function when sorting the input keys of MultiGet with timestamps.
Upgraded version of bzip library (1.0.6 -> 1.0.8) used with RocksJava to address potential vulnerabilities if an attacker can manipulate compressed data saved and loaded by RocksDB (not normal). See issue #6703.
Fix consistency checking error swallowing in some cases when options.force_consistency_checks = true.
Fix possible false NotFound status from batched MultiGet using index type kHashSearch.
Fix corruption caused by enabling delete triggered compaction (NewCompactOnDeletionCollectorFactory) in universal compaction mode, along with parallel compactions. The bug can result in two parallel compactions picking the same input files, resulting in the DB resurrecting older and deleted versions of some keys.
Fix a use-after-free bug in best-efforts recovery. column_family_memtables_ needs to point to valid ColumnFamilySet.
Let best-efforts recovery ignore corrupted files during table loading.
Fix a bug when making options.bottommost_compression, options.compression_opts and options.bottommost_compression_opts dynamically changeable: the modified values are not written to option files or returned back to users when being queried.
Fix a bug where index key comparisons were unaccounted in PerfContext::user_key_comparison_count for lookups in files written with format_version >= 3.
Fix many bloom.filter statistics not being updated in batch MultiGet.

Public API Change

Add a ConfigOptions argument to the APIs dealing with converting options to and from strings and files. The ConfigOptions is meant to replace some of the options (such as input_strings_escaped and ignore_unknown_options) and allow for more parameters to be passed in the future without changing the function signature.
Add NewFileChecksumGenCrc32cFactory to the file checksum public API, such that the builtin Crc32c based file checksum generator factory can be used by applications.
Add IsDirectory to Env and FS to indicate if a path is a directory.
ldb now uses options.force_consistency_checks = true by default and "--disable_consistency_checks" is added to disable it.
Add ReadOptions::deadline to allow users to specify a deadline for MultiGet requests

New Features

Added support for pipelined & parallel compression optimization for BlockBasedTableBuilder. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set CompressionOptions::parallel_threads greater than 1 to enable compression parallelism. This feature is experimental for now.
Provide an allocator for memkind to be used with block cache. This is to work with memory technologies (Intel DCPMM is one such technology currently available) that require different libraries for allocation and management (such as PMDK and memkind). The high capacities available make it possible to provision large caches (up to several TBs in size) beyond what is achievable with DRAM.
Option max_background_flushes can be set dynamically using DB::SetDBOptions().
Added functionality in sst_dump tool to check the compressed file size for different compression levels and print the time spent on compressing files with each compression type. Added arguments --compression_level_from and --compression_level_to to report size of all compression levels and one compression_type must be specified with it so that it will report compressed sizes of one compression type with different levels.
Added statistics for redundant insertions into block cache: rocksdb.block.cache.*add.redundant. (There is currently no coordination to ensure that only one thread loads a table block when many threads are trying to access that same table block.)

Performance Improvements

Improve performance of batch MultiGet with partitioned filters, by sharing block cache lookups to applicable filter blocks.
Reduced memory copies when fetching and uncompressing compressed blocks from sst files.

rocksdb - RocksDB v6.8.1

Published by over 4 years ago

6.8.1 (2020-03-30)

Behavior changes

Since RocksDB 6.8.0, ttl-based FIFO compaction can drop a file whose oldest key becomes older than options.ttl while others have not. This fix reverts this and makes ttl-based FIFO compaction use the file's flush time as the criterion. This fix also requires that max_open_files = -1 and compaction_options_fifo.allow_compaction = false to function properly.

6.8.0 (2020-02-24)

Java API Changes

Major breaking changes to Java comparators, toward standardizing on ByteBuffer for performant, locale-neutral operations on keys (#6252).
Added overloads of common API methods using direct ByteBuffers for keys and values (#2283).

Bug Fixes

Fix incorrect results while block-based table uses kHashSearch, together with Prev()/SeekForPrev().
Fix a bug that prevents opening a DB after two consecutive crashes with TransactionDB, where the first crash recovers from a corrupted WAL with kPointInTimeRecovery but the second cannot.
Fixed issue #6316 that can cause a corruption of the MANIFEST file in the middle when writing to it fails due to no disk space.
Add DBOptions::skip_checking_sst_file_sizes_on_db_open. It disables potentially expensive checking of all sst file sizes in DB::Open().
BlobDB now ignores trivially moved files when updating the mapping between blob files and SSTs. This should mitigate issue #6338 where out of order flush/compaction notifications could trigger an assertion with the earlier code.
Batched MultiGet() ignores IO errors while reading data blocks, causing it to potentially continue looking for a key and returning stale results.
WriteBatchWithIndex::DeleteRange returns Status::NotSupported. Previously it returned success even though reads on the batch did not account for range tombstones. The corresponding language bindings now cannot be used. In C, that includes rocksdb_writebatch_wi_delete_range, rocksdb_writebatch_wi_delete_range_cf, rocksdb_writebatch_wi_delete_rangev, and rocksdb_writebatch_wi_delete_rangev_cf. In Java, that includes WriteBatchWithIndex::deleteRange.
Assign new MANIFEST file number when caller tries to create a new MANIFEST by calling LogAndApply(..., new_descriptor_log=true). This bug can cause MANIFEST being overwritten during recovery if options.write_dbid_to_manifest = true and there are WAL file(s).

Performance Improvements

Perform readahead when reading from option files. Inside DB, options.log_readahead_size will be used as the readahead size. In other cases, a default 512KB is used.

Public API Change

The BlobDB garbage collector now emits the statistics BLOB_DB_GC_NUM_FILES (number of blob files obsoleted during GC), BLOB_DB_GC_NUM_NEW_FILES (number of new blob files generated during GC), BLOB_DB_GC_FAILURES (number of failed GC passes), BLOB_DB_GC_NUM_KEYS_RELOCATED (number of blobs relocated during GC), and BLOB_DB_GC_BYTES_RELOCATED (total size of blobs relocated during GC). On the other hand, the following statistics, which are not relevant for the new GC implementation, are now deprecated: BLOB_DB_GC_NUM_KEYS_OVERWRITTEN, BLOB_DB_GC_NUM_KEYS_EXPIRED, BLOB_DB_GC_BYTES_OVERWRITTEN, BLOB_DB_GC_BYTES_EXPIRED, and BLOB_DB_GC_MICROS.
Disable recycle_log_file_num when inconsistent recovery modes are requested: kPointInTimeRecovery and kAbsoluteConsistency.

New Features

Added the checksum for each SST file generated by Flush or Compaction. Added sst_file_checksum_func to Options such that users can plug in their own SST file checksum function by overriding the FileChecksumFunc class. If the user does not set sst_file_checksum_func, SST file checksum calculation will not be enabled. The checksum information includes a uint32_t checksum value and a checksum function name (string). The checksum information is stored in FileMetadata in version store and also logged to MANIFEST. A new tool is added to LDB such that users can dump out a list of file checksum information from MANIFEST (stored in an unordered_map).
db_bench now supports value_size_distribution_type, value_size_min, value_size_max options for generating random variable-sized values. Added blob_db_compression_type option for BlobDB to enable blob compression.
Replace the RocksDB namespace "rocksdb" with the macro ROCKSDB_NAMESPACE, which, if not defined, is defined as "rocksdb" in the header file rocksdb_namespace.h.
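
The ROCKSDB_NAMESPACE item above describes a common preprocessor pattern: a macro names the namespace and falls back to a default when the build does not define it. The following is a sketch of that pattern only, not a copy of rocksdb_namespace.h; SketchLibraryName, ActiveNamespaceNameSketch, and the SKETCH_STRINGIFY macros are hypothetical.

```cpp
#include <string>

// Fall back to the default namespace name unless the build overrides it,
// for example with -DROCKSDB_NAMESPACE=my_rocksdb.
#ifndef ROCKSDB_NAMESPACE
#define ROCKSDB_NAMESPACE rocksdb
#endif

namespace ROCKSDB_NAMESPACE {
// Hypothetical function standing in for any library symbol.
inline std::string SketchLibraryName() { return "sketch"; }
}  // namespace ROCKSDB_NAMESPACE

// Two-level stringification so the macro is expanded before being quoted.
#define SKETCH_STRINGIFY_IMPL(x) #x
#define SKETCH_STRINGIFY(x) SKETCH_STRINGIFY_IMPL(x)

// Reports which namespace name is active in this translation unit.
inline std::string ActiveNamespaceNameSketch() {
  return SKETCH_STRINGIFY(ROCKSDB_NAMESPACE);
}
```

Code that refers to symbols through ROCKSDB_NAMESPACE:: keeps compiling whichever name is chosen, which is what lets two differently named builds of the library be linked into one binary.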

rocksdb - RocksDB v6.7.3

Published by siying over 4 years ago

6.7.3 (2020-03-18)

Bug Fixes

Fix a data race that might, with small probability, cause a crash when calling DB::GetCreationTimeOfOldestFile(). The bug was introduced in the 6.6 release.

6.7.2 (2020-02-24)

Bug Fixes

Fixed a bug of IO Uring partial result handling introduced in 6.7.0.

6.7.1 (2020-02-13)

Bug Fixes

Fixed issue #6316 that can cause a corruption of the MANIFEST file in the middle when writing to it fails due to no disk space.
Batched MultiGet() ignores IO errors while reading data blocks, causing it to potentially continue looking for a key and returning stale results.

6.7.0 (2020-01-21)

Public API Change

Added a rocksdb::FileSystem class in include/rocksdb/file_system.h to encapsulate file creation/read/write operations, and an option DBOptions::file_system to allow a user to pass in an instance of rocksdb::FileSystem. If it is a non-null value, it will take precedence over DBOptions::env for file operations. A new API rocksdb::FileSystem::Default() returns a platform default object. The DBOptions::env option and Env::Default() API will continue to be used for threading and other OS related functions, and where DBOptions::file_system is not specified, for file operations. For storage developers who are accustomed to rocksdb::Env, the interface in rocksdb::FileSystem is new and will probably undergo some changes as more storage systems are ported to it from rocksdb::Env. As of now, no env other than Posix has been ported to the new interface.
A new rocksdb::NewSstFileManager() API that allows the caller to pass in separate Env and FileSystem objects.
Changed Java API for RocksDB.keyMayExist functions to use Holder<byte[]> instead of StringBuilder, so that retrieved values need not decode to Strings.
A new OptimisticTransactionDBOptions Option that allows users to configure occ validation policy. The default policy changes from kValidateSerial to kValidateParallel to reduce mutex contention.

Bug Fixes

Fix a bug that can cause unnecessary bg thread to be scheduled (#6104).
Fix crash caused by concurrent CF iterations and drops (#6147).
Fix a race condition for cfd->log_number_ between manifest switch and memtable switch (PR 6249) when number of column families is greater than 1.
Fix a bug on fractional cascading index when multiple files at the same level contain the same smallest user key, and those user keys are for merge operands. In this case, Get() the exact key may miss some merge operands.
Declare kHashSearch index type feature-incompatible with index_block_restart_interval larger than 1.
Fixed an issue where the thread pools were not resized upon setting max_background_jobs dynamically through the SetDBOptions interface.
Fix a bug that can cause write threads to hang when a slowdown/stall happens and there is a mix of writers with WriteOptions::no_slowdown set/unset.
Fixed an issue where an incorrect "number of input records" value was used to compute the "records dropped" statistics for compactions.

New Features

It is now possible to enable periodic compactions for the base DB when using BlobDB.
BlobDB now garbage collects non-TTL blobs when enable_garbage_collection is set to true in BlobDBOptions. Garbage collection is performed during compaction: any valid blobs located in the oldest N files (where N is the number of non-TTL blob files multiplied by the value of BlobDBOptions::garbage_collection_cutoff) encountered during compaction get relocated to new blob files, and old blob files are dropped once they are no longer needed. Note: we recommend enabling periodic compactions for the base DB when using this feature to deal with the case when some old blob files are kept alive by SSTs that otherwise do not get picked for compaction.
db_bench now supports the garbage_collection_cutoff option for BlobDB.
MultiGet() can use IO Uring to parallelize reads from the same SST file. This feature is disabled by default. It can be enabled with the environment variable ROCKSDB_USE_IO_URING.
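
The BlobDB garbage collection item above defines the eligible set as the oldest N blob files, where N is the number of non-TTL blob files multiplied by garbage_collection_cutoff. A small sketch of that arithmetic follows. OldestFilesEligibleForGcSketch is a hypothetical name, and the sketch assumes that lower file numbers are older and that a fractional N is truncated, neither of which the note states.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the file numbers of the oldest N non-TTL blob files, where
// N = (number of files) * cutoff.
// Assumptions: lower file number means older file; N is truncated.
std::vector<uint64_t> OldestFilesEligibleForGcSketch(
    std::vector<uint64_t> non_ttl_blob_files, double cutoff) {
  assert(cutoff >= 0.0 && cutoff <= 1.0);
  std::sort(non_ttl_blob_files.begin(), non_ttl_blob_files.end());
  const std::size_t n = static_cast<std::size_t>(
      static_cast<double>(non_ttl_blob_files.size()) * cutoff);
  non_ttl_blob_files.resize(n);  // Keep only the oldest n files.
  return non_ttl_blob_files;
}
```

With four blob files and a cutoff of 0.5, the two oldest files are candidates: valid blobs found in them during compaction are relocated, and the files are dropped once nothing references them.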

rocksdb - RocksDB 5.18.4

Published by siying over 4 years ago

Special release for ARM. (Note: the originally tagged commit for this release was wrong but the tag has been updated a couple of times. You might need to delete your copy of the tag with git tag -d v5.18.4 to get the new one. See https://git-scm.com/docs/git-tag#_on_re_tagging)

rocksdb - RocksDB v6.6.4

Published by gfosco over 4 years ago

Rocksdb Change Log

6.6.4 (2020-01-31)

Bug Fixes

Fixed issue #6316 that can cause a corruption of the MANIFEST file in the middle when writing to it fails due to no disk space.

rocksdb - RocksDB v6.6.3

Published by gfosco over 4 years ago

6.6.3 (2020-01-24)

Bug Fixes

Fix a bug that can cause write threads to hang when a slowdown/stall happens and there is a mix of writers with WriteOptions::no_slowdown set/unset.

6.6.2 (2020-01-13)

Bug Fixes

Fixed a bug where non-L0 compaction input files were not considered to compute the creation_time of new compaction outputs.

6.6.1 (2020-01-02)

Bug Fixes

Fix a bug in WriteBatchWithIndex::MultiGetFromBatchAndDB, which is called by Transaction::MultiGet, caused by stale pointer access when the number of keys is > 32.
Fixed two performance issues related to memtable history trimming. First, a new SuperVersion is now created only if some memtables were actually trimmed. Second, trimming is only scheduled if there is at least one flushed memtable that is kept in memory for the purposes of transaction conflict checking.
BlobDB no longer updates the SST to blob file mapping upon failed compactions.
Fix a bug in which a snapshot read through an iterator could be affected by a DeleteRange after the snapshot (#6062).
Fixed a bug where BlobDB was comparing the ColumnFamilyHandle pointers themselves instead of only the column family IDs when checking whether an API call uses the default column family or not.
Delete superversions in BackgroundCallPurge.
Fix use-after-free and double-deleting files in BackgroundCallPurge().

6.6.0 (2019-11-25)

Bug Fixes

Fix data corruption caused by output of intra-L0 compaction on ingested file not being placed in correct order in L0.
Fix a data race between Version::GetColumnFamilyMetaData() and Compaction::MarkFilesBeingCompacted() for access to being_compacted (#6056). The current fix acquires the db mutex during Version::GetColumnFamilyMetaData(), which may cause regression.
Fix a bug in DBIter where the is_blob_ state isn't updated when iterating backward using seek.
Fix a bug when format_version=3, partitioned filters, and prefix search are used in conjunction. The bug could result in ::Seek(prefix) returning NotFound for an existing prefix.
Revert the feature "Merging iterator to avoid child iterator reseek for some cases (#5286)" since it might cause strange results when reseek happens with a different iterator upper bound.
Fix a bug causing a crash during ingest external file when background compaction causes a severe error (file not found).
Fix a bug when partitioned filters and prefix search are used in conjunction, ::SeekForPrev could return invalid for an existing prefix. ::SeekForPrev might be called by the user, or internally on ::Prev, or within ::Seek if the return value involves Delete or a Merge operand.
Fix OnFlushCompleted fired before flush result persisted in MANIFEST when there's concurrent flush job. The bug exists since OnFlushCompleted was introduced in rocksdb 3.8.
Fixed an sst_dump crash on some plain table SST files.
Fixed a memory leak in some error cases of opening plain table SST files.
Fix a bug when a crash happens while calling WriteLevel0TableForRecovery for multiple column families, leading to a column family's log number greater than the first corrupted log number when the DB is being opened in PointInTime recovery mode during next recovery attempt (#5856).

New Features

Universal compaction to support options.periodic_compaction_seconds. A full compaction will be triggered if any file is over the threshold.
GetLiveFilesMetaData and GetColumnFamilyMetaData now expose the file number of SST files as well as the oldest blob file referenced by each SST.
A batched MultiGet API (DB::MultiGet()) that supports retrieving keys from multiple column families.
Full and partitioned filters in the block-based table use an improved Bloom filter implementation, enabled with format_version 5 (or above) because previous releases cannot read this filter. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single (full) filter. For example, the new Bloom filter has the same false positive rate at 9.55 bits per key as the old one at 10 bits per key, and a lower false positive rate at 16 bits per key than the old one at 100 bits per key.
Added AVX2 instructions to USE_SSE builds to accelerate the new Bloom filter and XXH3-based hash function on compatible x86_64 platforms (Haswell and later, ~2014).
Support options.ttl or options.periodic_compaction_seconds with options.max_open_files = -1. File's oldest ancestor time and file creation time will be written to manifest. If it is available, this information will be used instead of creation_time and file_creation_time in table properties.
Setting options.ttl for universal compaction now has the same meaning as setting periodic_compaction_seconds.
SstFileMetaData also returns file creation time and oldest ancestor time.
The sst_dump command line tool recompress command now displays how many blocks were compressed and how many were not, in particular how many were not compressed because the compression ratio was not met (12.5% threshold for GoodCompressionRatio), as seen in the number.block.not_compressed counter stat since version 6.0.0.
Block cache usage now takes into account the per-entry metadata overhead. This results in more accurate management of memory. A side effect of this feature is that fewer items fit into a block cache of the same size, which can result in higher cache miss rates. This can be remedied by increasing the block cache size or passing kDontChargeCacheMetadata to its constructor to restore the old behavior.
When using BlobDB, a mapping is maintained and persisted in the MANIFEST between each SST file and the oldest non-TTL blob file it references.
db_bench now supports and by default issues non-TTL Puts to BlobDB. TTL Puts can be enabled by specifying a non-zero value for the blob_db_max_ttl_range command line parameter explicitly.
sst_dump now supports printing BlobDB blob indexes in a human-readable format. This can be enabled by specifying the decode_blob_index flag on the command line.
A number of new information elements are now exposed through the EventListener interface. For flushes, the file numbers of the new SST file and the oldest blob file referenced by the SST are propagated. For compactions, the level, file number, and the oldest blob file referenced are passed to the client for each compaction input and output file.

Public API Change

RocksDB release 4.1 or older will not be able to open DB generated by the new release. 4.2 was released on Feb 23, 2016.
TTL Compactions in Level compaction style now initiate successive cascading compactions on a key range so that it reaches the bottom level quickly on TTL expiry. creation_time table property for compaction output files is now set to the minimum of the creation times of all compaction inputs.
With FIFO compaction style, options.periodic_compaction_seconds will have the same meaning as options.ttl. Whichever is stricter will be used. With the default options.periodic_compaction_seconds value and options.ttl's default of 0, RocksDB will give a default of 30 days.
Added an API GetCreationTimeOfOldestFile(uint64_t* creation_time) to get the file_creation_time of the oldest SST file in the DB.
FilterPolicy now exposes additional API to make it possible to choose filter configurations based on context, such as table level and compaction style. See LevelAndStyleCustomFilterPolicy in db_bloom_filter_test.cc. While most existing custom implementations of FilterPolicy should continue to work as before, those wrapping the return of NewBloomFilterPolicy will require overriding new function GetBuilderWithContext(), because calling GetFilterBitsBuilder() on the FilterPolicy returned by NewBloomFilterPolicy is no longer supported.
An unlikely usage of FilterPolicy is no longer supported. Calling GetFilterBitsBuilder() on the FilterPolicy returned by NewBloomFilterPolicy will now cause an assertion violation in debug builds, because RocksDB has internally migrated to a more elaborate interface that is expected to evolve further. Custom implementations of FilterPolicy should work as before, except those wrapping the return of NewBloomFilterPolicy, which will require a new override of a protected function in FilterPolicy.
NewBloomFilterPolicy now takes bits_per_key as a double instead of an int. This permits finer control over the memory vs. accuracy trade-off in the new Bloom filter implementation and should not change source code compatibility.
The option BackupableDBOptions::max_valid_backups_to_open is now only used when opening BackupEngineReadOnly. When opening a read/write BackupEngine, anything but the default value logs a warning and is treated as the default. This change ensures that backup deletion has proper accounting of shared files to ensure they are deleted when no longer referenced by a backup.
Deprecate snap_refresh_nanos option.
Added DisableManualCompaction/EnableManualCompaction to stop and resume manual compaction.
Add TryCatchUpWithPrimary() to StackableDB in non-LITE mode.
Add a new Env::LoadEnv() overloaded function to return a shared_ptr to Env.
Flush sets file name to "(nil)" for OnTableFileCreationCompleted() if the flush does not produce any L0. This can happen if the file is empty and thus deleted by RocksDB.

Default Option Changes

Changed the default value of periodic_compaction_seconds to UINT64_MAX - 1 which allows RocksDB to auto-tune periodic compaction scheduling. When using the default value, periodic compactions are now auto-enabled if a compaction filter is used. A value of 0 will turn off the feature completely.
Changed the default value of ttl to UINT64_MAX - 1 which allows RocksDB to auto-tune ttl value. When using the default value, TTL will be auto-enabled to 30 days, when the feature is supported. To revert to the old behavior, you can explicitly set it to 0.
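
The two default changes above use UINT64_MAX - 1 as a sentinel meaning "let RocksDB choose". A sketch of how such a sentinel could be resolved for ttl follows. ResolveTtlSketch and its constants are hypothetical names, and returning 0 (disabled) when the feature is unsupported is an assumption.

```cpp
#include <cstdint>
#include <limits>

// Sentinel described in the notes: UINT64_MAX - 1 means "auto-tune".
constexpr uint64_t kAutoTuneSentinelSketch =
    std::numeric_limits<uint64_t>::max() - 1;
// 30 days expressed in seconds.
constexpr uint64_t kThirtyDaysInSecondsSketch = 30ULL * 24 * 60 * 60;

// Maps the configured ttl to the value that would actually be applied.
// Assumption: when the feature is unsupported, the sentinel resolves to 0.
uint64_t ResolveTtlSketch(uint64_t configured_ttl, bool feature_supported) {
  if (configured_ttl == kAutoTuneSentinelSketch) {
    return feature_supported ? kThirtyDaysInSecondsSketch : 0;
  }
  return configured_ttl;  // Explicit values, including 0 (off), are kept.
}
```

Using a sentinel distinct from 0 is what lets an explicit 0 keep its old meaning of "off" while an untouched option picks up the new auto-tuned default.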

Performance Improvements

For 64-bit hashing, RocksDB is standardizing on a slightly modified preview version of XXH3. This function is now used for many non-persisted hashes, along with fastrange64() in place of the modulus operator, and some benchmarks show a slight improvement.
Level iterator now invalidates the iterator more often in prefix seek when the level is filtered out by prefix bloom.
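
The first item above mentions fastrange64() replacing the modulus operator. The general multiply-shift range reduction behind that name can be sketched as below. This is an illustration of the technique, not RocksDB's code: FastRange64Sketch is a hypothetical name, and __uint128_t is a GCC/Clang extension.

```cpp
#include <cstdint>

// Maps a 64-bit hash to [0, range) by taking the high 64 bits of the
// 128-bit product hash * range, avoiding an integer division.
// Requires a compiler that provides __uint128_t (GCC or Clang).
inline uint64_t FastRange64Sketch(uint64_t hash, uint64_t range) {
  const __uint128_t wide = static_cast<__uint128_t>(hash) * range;
  return static_cast<uint64_t>(wide >> 64);
}
```

Unlike hash % range, the result depends mainly on the high bits of the hash, so this reduction is only appropriate when the hash is well mixed across all 64 bits.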

rocksdb - RocksDB v6.5.3

Published by gfosco almost 5 years ago

Rocksdb Change Log

6.5.3 (2020-01-10)

Bug Fixes

Fixed two performance issues related to memtable history trimming. First, a new SuperVersion is now created only if some memtables were actually trimmed. Second, trimming is only scheduled if there is at least one flushed memtable that is kept in memory for the purposes of transaction conflict checking.

rocksdb - RocksDB v6.5.2

Published by gfosco almost 5 years ago

6.5.2 (2019-11-15)

Bug Fixes

Fix an assertion failure in MultiGet() when BlockBasedTableOptions::no_block_cache is true and there is no compressed block cache.
Fix a buffer overrun problem in BlockBasedTable::MultiGet() when compression is enabled and no compressed block cache is configured.
If a call to BackupEngine::PurgeOldBackups or BackupEngine::DeleteBackup suffered a crash, power failure, or I/O error, files could be left over from old backups that could only be purged with a call to GarbageCollect. Any call to PurgeOldBackups, DeleteBackup, or GarbageCollect should now suffice to purge such files.

6.5.1 (2019-10-16)

Bug Fixes

Revert the feature "Merging iterator to avoid child iterator reseek for some cases (#5286)" since it might cause strange results when reseek happens with a different iterator upper bound.
Fix a bug in BlockBasedTableIterator that might return incorrect results when reseek happens with a different iterator upper bound.
Fix a bug when partitioned filters and prefix search are used in conjunction, ::SeekForPrev could return invalid for an existing prefix. ::SeekForPrev might be called by the user, or internally on ::Prev, or within ::Seek if the return value involves Delete or a Merge operand.

6.5.0 (2019-09-13)

Bug Fixes

Fixed a number of data races in BlobDB.
Fix a bug where the compaction snapshot refresh feature is not disabled as advertised when snap_refresh_nanos is set to 0.
Fix bloom filter lookups by the MultiGet batching API when BlockBasedTableOptions::whole_key_filtering is false, by checking that a key is in the prefix_extractor domain and extracting the prefix before looking up.
Fix a bug in file ingestion caused by incorrect file number allocation when the number of column families involved in the ingestion exceeds 2.

New Features

Introduced DBOptions::max_write_batch_group_size_bytes to configure the maximum number of bytes written in a single batch of WAL or memtable writes. The limit is followed when the leader write size is larger than 1/8 of this limit.
VerifyChecksum() by default will issue readahead. Allow ReadOptions to be passed in to those functions to override the readahead size. For checksum verification before external SST file ingestion, a new option, IngestExternalFileOptions.verify_checksums_readahead_size, is added for this readahead setting.
When options.force_consistency_check is used in RocksDB, the error is now passed back to the user instead of crashing the process.
Add an option memtable_insert_hint_per_batch to WriteOptions. If it is true, each WriteBatch will maintain its own insert hints for each memtable in concurrent write. See include/rocksdb/options.h for more details.
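
The 1/8 rule for max_write_batch_group_size_bytes can be sketched as a small standalone function. The notes only state that the limit is followed when the leader write is larger than 1/8 of it; the handling of smaller leaders below (capping the group at the leader size plus 1/8 of the limit, so a small write is not delayed by a large group) is an assumption about the write-thread logic, and WriteGroupSizeLimit is a hypothetical name rather than a RocksDB function.

```cpp
#include <cstdint>

// Returns the byte budget for a write group led by a write of
// `leader_size` bytes, given the configured limit.
// Hypothetical helper for illustration; not part of the RocksDB API.
uint64_t WriteGroupSizeLimit(uint64_t leader_size,
                             uint64_t max_write_batch_group_size_bytes) {
  const uint64_t one_eighth = max_write_batch_group_size_bytes / 8;
  if (leader_size > one_eighth) {
    // Stated in the notes: the configured limit applies.
    return max_write_batch_group_size_bytes;
  }
  // Assumption: a small leader only lets the group grow by 1/8 of the limit.
  return leader_size + one_eighth;
}
```

For a 1 MiB limit, a 200000-byte leader gets the full 1048576-byte budget, while under the stated assumption a 1000-byte leader is capped at 1000 + 131072 = 132072 bytes.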

Public API Change

Added max_write_buffer_size_to_maintain option to better control memory usage of immutable memtables.
Added a lightweight API GetCurrentWalFile() to get last live WAL filename and size. Meant to be used as a helper for backup/restore tooling in a larger ecosystem such as MySQL with a MyRocks storage engine.
The MemTable Bloom filter, when enabled, now always uses cache locality. Options::bloom_locality now only affects the PlainTable SST format.

Performance Improvements

Improve the speed of the MemTable Bloom filter, reducing the write overhead of enabling it by 1/3 to 1/2, with similar benefit to read performance.

rocksdb - RocksDB v6.4.6

Published by gfosco almost 5 years ago

Rocksdb Change Log

6.4.6 (2019-10-16)

Bug Fixes

Fix a bug when partitioned filters and prefix search are used in conjunction, ::SeekForPrev could return invalid for an existing prefix. ::SeekForPrev might be called by the user, or internally on ::Prev, or within ::Seek if the return value involves Delete or a Merge operand.

6.4.5 (2019-10-01)

Bug Fixes

Revert the feature "Merging iterator to avoid child iterator reseek for some cases (#5286)" since it might cause strange results when reseek happens with a different iterator upper bound.
Fix a bug in BlockBasedTableIterator that might return incorrect results when reseek happens with a different iterator upper bound.

6.4.4 (2019-09-17)

Fix a bug introduced in 6.3 which could cause wrong results in a corner case when prefix bloom filter is used and the iterator is reseeked.

6.4.2 (2019-09-03)

Bug Fixes

Fix a bug in file ingestion caused by incorrect file number allocation when the number of column families involved in the ingestion exceeds 2.

6.4.1 (2019-08-20)

Bug Fixes

Fix a bug where the compaction snapshot refresh feature is not disabled as advertised when snap_refresh_nanos is set to 0.

6.4.0 (2019-07-30)

Default Option Change

LRUCacheOptions.high_pri_pool_ratio is set to 0.5 (previously 0.0) by default, which means that by default midpoint insertion is enabled. The same change is made for the default value of the high_pri_pool_ratio argument in NewLRUCache(). When the block cache is not explicitly created, the small block cache created by BlockBasedTable still has this option set to 0.0.
Change BlockBasedTableOptions.cache_index_and_filter_blocks_with_high_priority's default value from false to true.

Public API Change

Filter and compression dictionary blocks are now handled similarly to data blocks with regards to the block cache: instead of storing objects in the cache, only the blocks themselves are cached. In addition, filter and compression dictionary blocks (as well as filter partitions) no longer get evicted from the cache when a table is closed.
Due to the above refactoring, block cache eviction statistics for filter and compression dictionary blocks are temporarily broken. We plan to reintroduce them in a later phase.
The semantics of the per-block-type block read counts in the performance context now match those of the generic block_read_count.
Errors related to the retrieval of the compression dictionary are now propagated to the user.
db_bench adds a "benchmark" stats_history, which prints out the whole stats history.
Overload GetAllKeyVersions() to support non-default column family.
Added new APIs ExportColumnFamily() and CreateColumnFamilyWithImport() to support export and import of a Column Family. https://github.com/facebook/rocksdb/issues/3469
ldb sometimes uses a string-append merge operator if no merge operator is passed in. This is to allow users to print keys from a DB with a merge operator.
Replaces the old Registrar with ObjectRegistry to allow users to create custom objects from strings; also adds LoadEnv() to Env.
Added new overload of GetApproximateSizes which gets SizeApproximationOptions object and returns a Status. The older overloads are redirecting their calls to this new method and no longer assert if the include_flags doesn't have either of INCLUDE_MEMTABLES or INCLUDE_FILES bits set. It's recommended to use the new method only, as it is more type safe and returns a meaningful status in case of errors.

New Features

Add argument --secondary_path to ldb to open the database as the secondary instance. This would keep the original DB intact.
Compression dictionary blocks are now prefetched and pinned in the cache (based on the customer's settings) the same way as index and filter blocks.
Added DBOptions::log_readahead_size which specifies the number of bytes to prefetch when reading the log. This is mostly useful for reading a remotely located log, as it can save the number of round-trips. If 0 (default), then the prefetching is disabled.
Support loading custom objects in unit tests. In the affected unit tests, RocksDB will create custom Env objects based on environment variable TEST_ENV_URI. Users need to make sure custom object types are properly registered. For example, a static library should expose a RegisterCustomObjects function. By linking the unit test binary with the static library, the unit test can execute this function.

Performance Improvements

Reduce iterator key comparison for upper/lower bound check.
Improve performance of row_cache: make reads with newer snapshots than data in an SST file share the same cache key, except in some transaction cases.
The compression dictionary is no longer copied to a new object upon retrieval.

Bug Fixes

Fix ingested file and directory not being fsynced.
Return TryAgain status in place of Corruption when new tail is not visible to TransactionLogIterator.
Fixed a regression where the fill_cache read option also affected index blocks.
Fixed an issue where using cache_index_and_filter_blocks==false affected partitions of partitioned indexes/filters as well.

rocksdb - RocksDB v6.3.6

Published by gfosco about 5 years ago

Rocksdb Change Log

6.3.6 (2019-10-01)

Revert the feature "Merging iterator to avoid child iterator reseek for some cases (#5286)" since it might cause strange results when reseek happens with a different iterator upper bound.

6.3.5 (2019-09-17)

Fix a bug introduced in 6.3 which could cause wrong results in a corner case when prefix bloom filter is used and the iterator is reseeked.

6.3.4 (2019-09-03)

Bug Fixes

Fix a bug in file ingestion caused by incorrect file number allocation when the number of column families involved in the ingestion exceeds 2.

6.3.3 (2019-08-20)

Bug Fixes

Fix a bug where the compaction snapshot refresh feature is not disabled as advertised when snap_refresh_nanos is set to 0.

6.3.2 (2019-08-15)

Public API Change

The semantics of the per-block-type block read counts in the performance context now match those of the generic block_read_count.

Bug Fixes

Fixed a regression where the fill_cache read option also affected index blocks.
Fixed an issue where using cache_index_and_filter_blocks==false affected partitions of partitioned indexes as well.

6.3.1 (2019-07-24)

Bug Fixes

Fix auto rolling bug introduced in 6.3.0, which causes segfault if log file creation fails.

6.3.0 (2019-06-18)

Public API Change

DB::Close() now returns an Aborted() error when there are unreleased snapshots. Users can retry after all snapshots are released.
Index blocks are now handled similarly to data blocks with regards to the block cache: instead of storing objects in the cache, only the blocks themselves are cached. In addition, index blocks no longer get evicted from the cache when a table is closed, can now use the compressed block cache (if any), and can be shared among multiple table readers.
Partitions of partitioned indexes no longer affect the read amplification statistics.
Due to the above refactoring, block cache eviction statistics for indexes are temporarily broken. We plan to reintroduce them in a later phase.
options.keep_log_file_num will be enforced strictly all the time. File names of all log files will be tracked, which may take a significant amount of memory if options.keep_log_file_num is large and either of options.max_log_file_size or options.log_file_time_to_roll is set.
Add initial support for Get/Put with user timestamps. Users can specify timestamps via ReadOptions and WriteOptions when calling DB::Get and DB::Put.
Accessing a partition of a partitioned filter or index through a pinned reference is no longer considered a cache hit.
Add C bindings for secondary instance, i.e. DBImplSecondary.
Rate limited deletion of WALs is only enabled if DBOptions::wal_dir is not set, or explicitly set to db_name passed to DB::Open and DBOptions::db_paths is empty, or same as db_paths[0].path

New Features

Add an option snap_refresh_nanos (default to 0) to periodically refresh the snapshot list in compaction jobs. Assign to 0 to disable the feature.
Add an option unordered_write which trades snapshot guarantees for higher write throughput. When used with WRITE_PREPARED transactions with two_write_queues=true, it offers higher throughput without compromising guarantees.
Allow DBImplSecondary to remove memtables with obsolete data after replaying MANIFEST and WAL.
Add an option failed_move_fall_back_to_copy (default is true) for external SST ingestion. When move_files is true and hard link fails, ingestion falls back to copy if failed_move_fall_back_to_copy is true. Otherwise, ingestion reports an error.
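
The interaction between move_files and failed_move_fall_back_to_copy amounts to a small decision table, sketched below as standalone code. IngestAction and DecideIngestAction are hypothetical names used only for illustration; the sketch models the behavior described above and does not call into RocksDB.

```cpp
// Possible outcomes for a single external SST file during ingestion.
// Hypothetical type for illustration; not part of the RocksDB API.
enum class IngestAction { kLink, kCopy, kError };

// Models the described behavior: a hard link is attempted only when
// move_files is true, and a failed link either falls back to a copy
// or is reported as an error depending on the new option.
IngestAction DecideIngestAction(bool move_files, bool hard_link_succeeds,
                                bool failed_move_fall_back_to_copy) {
  if (!move_files) {
    return IngestAction::kCopy;  // Files are copied when not moving.
  }
  if (hard_link_succeeds) {
    return IngestAction::kLink;
  }
  return failed_move_fall_back_to_copy ? IngestAction::kCopy
                                       : IngestAction::kError;
}
```

With the default of true, a failed hard link degrades to a copy; setting the option to false surfaces the failure to the caller instead.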

Performance Improvements

Reduce binary search when the iterator reseeks into the same data block.
DBIter::Next() can skip user key checking if previous entry's seqnum is 0.
Merging iterator to avoid child iterator reseek for some cases
Log Writer will flush after finishing the whole record, rather than a fragment.
Lower MultiGet batching API latency by reading data blocks from disk in parallel

General Improvements

Added new status code kColumnFamilyDropped to distinguish between Column Family Dropped and DB Shutdown in progress.
Improve ColumnFamilyOptions validation when creating a new column family.

Bug Fixes

Fix a bug in WAL replay of secondary instance by skipping write batches with older sequence numbers than the current last sequence number.
Fix flush's/compaction's merge processing logic which allowed Puts covered by range tombstones to reappear. Note Puts may exist even if the user only ever called Merge() due to an internal conversion during compaction to the bottommost level.
Fix/improve memtable earliest sequence assignment and WAL replay so that WAL entries of unflushed column families will not be skipped after replaying the MANIFEST and increasing db sequence due to another flushed/compacted column family.
Fix a bug caused by secondary not skipping the beginning of new MANIFEST.
On DB open, delete WAL trash files left behind in wal_dir