A library that provides an embeddable, persistent key-value store for fast storage.
GPL-2.0 License
Published by ramvadiv almost 4 years ago
- Truncated WALs ending in incomplete records can no longer produce gaps in the recovered data when `WALRecoveryMode::kPointInTimeRecovery` is used. Gaps are still possible when WALs are truncated exactly on record boundaries.
- Fixed a bug in the following combination of features: indexes with user keys (`format_version >= 3`), indexes are partitioned (`index_type == kTwoLevelIndexSearch`), and some index partitions are pinned in memory (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache`). The bug could cause keys to be truncated when read from the index, leading to wrong read results or other unexpected behavior.
- Fixed a bug in the following combination of features: indexes are partitioned (`index_type == kTwoLevelIndexSearch`), some index partitions are pinned in memory (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache`), and partition reads could be mixed between block cache and directly from the file (e.g., with `enable_index_compression == 1` and `mmap_read == 1`, partitions that were stored uncompressed due to a poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads, leading to wrong read results.
- Fixed a false positive flush/compaction `Status::Corruption` failure when `paranoid_file_checks == true` and range tombstones were written to the compaction output files.
- Fixed a bug that could cause a stalled write to crash with mixed slowdown and no-slowdown writes (`WriteOptions.no_slowdown=true`).
- Reverted a behavior change silently introduced in 6.14.2, in which the effects of the `ignore_unknown_options` flag (used in option parsing/loading functions) changed.
- Reverted a behavior change silently introduced in 6.14, in which options parsing/loading functions began returning `NotFound` instead of `InvalidArgument` for option names not available in the present version.
- Fixed a potential bug caused by evaluating `TableBuilder::NeedCompact()` before `TableBuilder::Finish()` in the compaction job. For example, the `NeedCompact()` method of the `CompactOnDeletionCollector` returned by the built-in `CompactOnDeletionCollectorFactory` requires `BlockBasedTable::Finish()` to return the correct result. The bug could cause a compaction-generated file not to be marked for future compaction based on deletion ratio.
- Fixed the logic of populating the native data structure for `read_amp_bytes_per_bit` during OPTIONS file parsing on big-endian architectures. Without this fix, the original code introduced in PR7659, when running on a big-endian machine, could mistakenly store `read_amp_bytes_per_bit` (a uint32) in little-endian format; future accesses to `read_amp_bytes_per_bit` would give wrong values. Little-endian architectures are not affected.
- Deprecated the setting of `BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache` and `BlockBasedTableOptions::pin_top_level_index_and_filter`. These options still take effect until users migrate to the replacement APIs in `BlockBasedTableOptions::metadata_cache_options`. Migration guidance can be found in the API comments on the deprecated options.
- Added `DB::VerifyFileChecksums` to verify SST file checksums against the corresponding entries in the MANIFEST, if present. The current implementation requires scanning and recomputing file checksums.
- The dictionary compression settings specified in `ColumnFamilyOptions::compression_opts` now additionally affect files generated by flush and compaction to non-bottommost levels. Previously those settings at most affected files generated by compaction to the bottommost level, depending on whether `ColumnFamilyOptions::bottommost_compression_opts` overrode them. Users who relied on the dictionary compression settings in `ColumnFamilyOptions::compression_opts` affecting only the bottommost level can keep that behavior by moving their dictionary settings to `ColumnFamilyOptions::bottommost_compression_opts` and setting its `enabled` flag.
- When the `enabled` flag is set in `ColumnFamilyOptions::bottommost_compression_opts`, those compression options now take effect regardless of the value of `ColumnFamilyOptions::bottommost_compression`. Previously, those compression options only took effect when `ColumnFamilyOptions::bottommost_compression != kDisableCompressionOption`. Now, they additionally take effect when `ColumnFamilyOptions::bottommost_compression == kDisableCompressionOption` (such a setting causes the bottommost compression type to fall back to `ColumnFamilyOptions::compression_per_level` if configured, and otherwise to fall back to `ColumnFamilyOptions::compression`).
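The migration described above, keeping dictionary compression limited to the bottommost level, can be sketched as follows. This is a minimal sketch, not the project's official example; the dictionary sizes are illustrative, and the helper function name is invented for illustration:

```cpp
#include "rocksdb/options.h"

// Sketch: keep dictionary compression limited to the bottommost level by
// moving dictionary settings into bottommost_compression_opts and setting
// its `enabled` flag, as described above. Sizes are illustrative.
rocksdb::ColumnFamilyOptions BottommostOnlyDictionaryOptions() {
  rocksdb::ColumnFamilyOptions cf_opts;
  // Clear dictionary settings from the general compression options so that
  // flushes and non-bottommost compactions no longer use a dictionary.
  cf_opts.compression_opts.max_dict_bytes = 0;
  // Re-apply the dictionary settings for the bottommost level only.
  cf_opts.bottommost_compression_opts.max_dict_bytes = 16 * 1024;
  cf_opts.bottommost_compression_opts.zstd_max_train_bytes = 100 * 16 * 1024;
  cf_opts.bottommost_compression_opts.enabled = true;  // required for these to take effect
  return cf_opts;
}
```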
Published by ajkr almost 4 years ago
- Truncated WALs ending in incomplete records can no longer produce gaps in the recovered data when `WALRecoveryMode::kPointInTimeRecovery` is used. Gaps are still possible when WALs are truncated exactly on record boundaries.
Published by akankshamahajan15 almost 4 years ago
- Fixed a potential bug caused by evaluating `TableBuilder::NeedCompact()` before `TableBuilder::Finish()` in the compaction job. For example, the `NeedCompact()` method of the `CompactOnDeletionCollector` returned by the built-in `CompactOnDeletionCollectorFactory` requires `BlockBasedTable::Finish()` to return the correct result. The bug could cause a compaction-generated file not to be marked for future compaction based on deletion ratio.
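For context, the deletion-ratio marking mentioned above comes from attaching the built-in collector factory to the options. A minimal sketch, assuming the public `NewCompactOnDeletionCollectorFactory` helper; the window size and deletion trigger values are illustrative:

```cpp
#include "rocksdb/options.h"
#include "rocksdb/utilities/table_properties_collectors.h"

// Sketch: attach the built-in CompactOnDeletionCollectorFactory so that
// files containing many deletions are marked for compaction via
// NeedCompact(). The window size and trigger below are illustrative.
rocksdb::Options MakeOptionsWithDeletionCollector() {
  rocksdb::Options options;
  options.table_properties_collector_factories.emplace_back(
      rocksdb::NewCompactOnDeletionCollectorFactory(
          /*sliding_window_size=*/128 * 1024,
          /*deletion_trigger=*/64 * 1024));
  return options;
}
```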
- Reverted a behavior change silently introduced in 6.14.2, in which the effects of the `ignore_unknown_options` flag (used in option parsing/loading functions) changed.
- Reverted a behavior change silently introduced in 6.14, in which options parsing/loading functions began returning `NotFound` instead of `InvalidArgument` for option names not available in the present version.
- Fixed a false positive flush/compaction `Status::Corruption` failure when `paranoid_file_checks == true` and range tombstones were written to the compaction output files.
- Fixed a bug in the following combination of features: indexes with user keys (`format_version >= 3`), indexes are partitioned (`index_type == kTwoLevelIndexSearch`), and some index partitions are pinned in memory (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache`). The bug could cause keys to be truncated when read from the index, leading to wrong read results or other unexpected behavior.
- Fixed a bug in the following combination of features: indexes are partitioned (`index_type == kTwoLevelIndexSearch`), some index partitions are pinned in memory (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache`), and partition reads could be mixed between block cache and directly from the file (e.g., with `enable_index_compression == 1` and `mmap_read == 1`, partitions that were stored uncompressed due to a poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads, leading to wrong read results.
- Fixed a bug in which a `CompactRange()` call with `CompactRangeOptions::change_level` set fails due to a conflict in the level change step, which caused all subsequent calls to `CompactRange()` with `CompactRangeOptions::change_level` set to incorrectly fail with a `Status::NotSupported("another thread is refitting")` error.
- … when `BottommostLevelCompaction.kForce` or `kForceOptimized` is set.
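The manual-compaction path involved in the refitting fix above looks roughly like this. A minimal sketch; `db` is assumed to be an open `rocksdb::DB*`, and the helper name and target level are illustrative:

```cpp
#include "rocksdb/db.h"

// Sketch: a manual compaction that refits the output to a given level,
// the operation involved in the "another thread is refitting" fix above.
// `db` must be an open rocksdb::DB*; the target level is illustrative.
rocksdb::Status CompactAllToLevel(rocksdb::DB* db, int target_level) {
  rocksdb::CompactRangeOptions cro;
  cro.change_level = true;          // refit output files...
  cro.target_level = target_level;  // ...to this level
  // nullptr begin/end keys compact the entire key range.
  return db->CompactRange(cro, /*begin=*/nullptr, /*end=*/nullptr);
}
```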
Published by ajkr about 4 years ago
Fixed a build issue to enable the RocksJava release for ppc64le.
Published by ajkr about 4 years ago
- Fixed a bug that could cause a stalled write to crash with mixed slowdown and no-slowdown writes (`WriteOptions.no_slowdown=true`).
- Fixed a false positive flush/compaction `Status::Corruption` failure when `paranoid_file_checks == true` and range tombstones were written to the compaction output files.
- Fixed a bug in the following combination of features: indexes with user keys (`format_version >= 3`), indexes are partitioned (`index_type == kTwoLevelIndexSearch`), and some index partitions are pinned in memory (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache`). The bug could cause keys to be truncated when read from the index, leading to wrong read results or other unexpected behavior.
- Fixed a bug in the following combination of features: indexes are partitioned (`index_type == kTwoLevelIndexSearch`), some index partitions are pinned in memory (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache`), and partition reads could be mixed between block cache and directly from the file (e.g., with `enable_index_compression == 1` and `mmap_read == 1`, partitions that were stored uncompressed due to a poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads, leading to wrong read results.
- Fixed a bug that could occur when `CompactRange()` for refitting levels (`CompactRangeOptions::change_level == true`) and another manual compaction are executed in parallel.
- RocksDB now sets `recycle_log_file_num` to zero when the user attempts to enable it in combination with `WALRecoveryMode::kTolerateCorruptedTailRecords`. Previously the two features were allowed together, which compromised the user's configured crash-recovery guarantees.
- Fixed compression dictionary generation with `SstFileWriter`. Previously, the dictionary would be trained/finalized immediately with zero samples. Now, the whole `SstFileWriter` file is buffered in memory and then sampled.
- Fixed a bug involving the combination of `avoid_unnecessary_blocking_io=1` and creating backups (`BackupEngine::CreateNewBackup`) or checkpoints (`Checkpoint::Create`). With this setting and WAL enabled, these operations could randomly fail with a non-OK status.
- A new field, `std::string requested_checksum_func_name`, is added to `FileChecksumGenContext`, which enables the checksum factory to create generators for a suite of different functions.
- Added a new subcommand, `ldb unsafe_remove_sst_file`, which removes a lost or corrupt SST file from a DB's metadata. This command involves data loss and must not be used on a live DB.
- … `kCompactionStyleLevel` compaction style with `level_compaction_dynamic_level_bytes` set.
- … `share_files_with_checksum` is used with `kLegacyCrc32cAndFileSize` naming (discouraged).
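The backup options discussed in the surrounding entries can be wired up roughly as follows. A minimal sketch against the `BackupableDBOptions`-era API; `db` is assumed to be an open DB, and the backup directory path and helper name are illustrative:

```cpp
#include "rocksdb/db.h"
#include "rocksdb/utilities/backupable_db.h"

// Sketch: create a backup with share_files_with_checksum enabled, the
// recommended naming scheme discussed above. `db` must be an open DB;
// the backup directory path is illustrative.
rocksdb::Status BackupDatabase(rocksdb::DB* db) {
  rocksdb::BackupableDBOptions backup_opts("/tmp/rocksdb_backup");
  backup_opts.share_files_with_checksum = true;

  rocksdb::BackupEngine* backup_engine = nullptr;
  rocksdb::Status s = rocksdb::BackupEngine::Open(
      rocksdb::Env::Default(), backup_opts, &backup_engine);
  if (!s.ok()) return s;
  s = backup_engine->CreateNewBackup(db);
  delete backup_engine;
  return s;
}
```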
- For `share_files_with_checksum`, we are confident there is no regression (vs. pre-6.12) in detecting DB or backup corruption at backup creation time, mostly because the old design did not leverage this extra checksum computation for detecting inconsistencies at backup creation time.
- For `share_table_files` without "checksum" (not recommended), there is a regression in detecting fundamentally unsafe use of the option, greatly mitigated by file size checking (under "Behavior Changes"). Almost no reason to use `share_files_with_checksum=false` should remain.
- `DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking are still able to catch corruptions that `CreateNewBackup` does not.
- … the `DB::DeleteFile()` API, describing its known problems and deprecation plan.
- The default return status of `FSRandomAccessFile.Prefetch()` is changed from `OK` to `NotSupported`. If a user's inherited file class doesn't implement prefetching, RocksDB will create an internal prefetch buffer to improve read performance.
Published by ajkr about 4 years ago
- Fixed a false positive flush/compaction `Status::Corruption` failure when `paranoid_file_checks == true` and range tombstones were written to the compaction output files.
- Fixed a bug in the following combination of features: indexes with user keys (`format_version >= 3`), indexes are partitioned (`index_type == kTwoLevelIndexSearch`), and some index partitions are pinned in memory (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache`). The bug could cause keys to be truncated when read from the index, leading to wrong read results or other unexpected behavior.
- Fixed a bug in the following combination of features: indexes are partitioned (`index_type == kTwoLevelIndexSearch`), some index partitions are pinned in memory (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache`), and partition reads could be mixed between block cache and directly from the file (e.g., with `enable_index_compression == 1` and `mmap_read == 1`, partitions that were stored uncompressed due to a poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads, leading to wrong read results.
- Revised `BackupableDBOptions::share_files_with_checksum_naming` (new in 6.12), with some minor improvements and to better support those who were extracting file sizes from backup file names.
- … the `GetLiveFilesMetaData` API.
- `EventListener` in listener.h contains new callback functions: `OnFileFlushFinish()`, `OnFileSyncFinish()`, `OnFileRangeSyncFinish()`, `OnFileTruncateFinish()`, and `OnFileCloseFinish()`.
- `FileOperationInfo` now reports `duration`, measured by `std::chrono::steady_clock`, and `start_ts`, measured by `std::chrono::system_clock`, instead of start and finish timestamps measured by `system_clock`. Note that `system_clock` is called before `steady_clock` in program order at operation starts.
- `DB::GetDbSessionId(std::string& session_id)` is added. `session_id` stores a unique identifier that gets reset every time the DB is opened. This DB session ID should be unique among all open DB instances on all hosts, and should be unique among re-openings of the same or other DBs. This identifier is recorded in the LOG file on the line starting with "DB Session ID:".
- `DB::OpenForReadOnly()` now returns `Status::NotFound` when the specified DB directory does not exist. Previously the error returned depended on the underlying `Env`. This change is available in all 6.11 releases as well.
- A new option, `verify_with_checksum`, is added to `BackupEngine::VerifyBackup`; it is false by default. If it is true, `BackupEngine::VerifyBackup` verifies checksums and file sizes of backup files. Pass false for `verify_with_checksum` to maintain the previous behavior and performance of `BackupEngine::VerifyBackup`, which only verifies the sizes of backup files.
- When `file_checksum_gen_factory` is set to `GetFileChecksumGenCrc32cFactory()`, BackupEngine will compare the crc32c checksums of table files computed when creating a backup to the expected checksums stored in the DB manifest, and will fail `CreateNewBackup()` on mismatch (corruption). If `file_checksum_gen_factory` is not set, or is set to any other customized factory, there is no checksum verification to detect whether SST files in a DB are corrupt when read, copied, and independently checksummed by BackupEngine.
- When `stats_dump_period_sec > 0`, either as the initial value for DB open or as a dynamic option change, the first stats dump is staggered in the following X seconds, where X is an integer in `[0, stats_dump_period_sec)`. Subsequent stats dumps are still spaced `stats_dump_period_sec` seconds apart.
- DB identity (`db_id`) and DB session identity (`db_session_id`) are added to table properties and stored in SST files. SST files generated by SstFileWriter and Repairer have DB identity "SST Writer" and "DB Repairer", respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`. The session ID for SstFileWriter (resp. Repairer) resets every time `SstFileWriter::Open` (resp. `Repairer::Run`) is called.
- A new option, `BackupableDBOptions::share_files_with_checksum_naming`, is added with new default behavior for naming backup files with `share_files_with_checksum`, to address performance and backup integrity issues. See API comments for details.
- `max_subcompactions` can be set dynamically using `DB::SetDBOptions()`.
- … the `shared_checksum` directory when using `share_files_with_checksum_naming = kUseDbSessionId` (new default), except for SST files generated before this version of RocksDB, which fall back on using `kLegacyCrc32cAndFileSize`.
Published by ajkr about 4 years ago
- Fixed a bug in the following combination of features: indexes with user keys (`format_version >= 3`), indexes are partitioned (`index_type == kTwoLevelIndexSearch`), and some index partitions are pinned in memory (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache`). The bug could cause keys to be truncated when read from the index, leading to wrong read results or other unexpected behavior.
- Fixed a bug in the following combination of features: indexes are partitioned (`index_type == kTwoLevelIndexSearch`), some index partitions are pinned in memory (`BlockBasedTableOptions::pin_l0_filter_and_index_blocks_in_cache`), and partition reads could be mixed between block cache and directly from the file (e.g., with `enable_index_compression == 1` and `mmap_read == 1`, partitions that were stored uncompressed due to a poor compression ratio would be read directly from the file via mmap, while partitions that were stored compressed would be read from block cache). The bug could cause index partitions to be mistakenly considered empty during reads, leading to wrong read results.
Published by ajkr over 4 years ago
- `DB::OpenForReadOnly()` now returns `Status::NotFound` when the specified DB directory does not exist. Previously the error returned depended on the underlying `Env`.
- … `Status::InvalidArgument` if the range's end key comes before its start key according to the user comparator. Previously the behavior was undefined.
- … `force_consistency_checks` is `false`.
- `pin_l0_filter_and_index_blocks_in_cache` no longer applies to L0 files larger than `1.5 * write_buffer_size`, to give more predictable memory usage. Such L0 files may exist due to intra-L0 compaction, external file ingestion, or the user dynamically changing `write_buffer_size` (note, however, that files that are already pinned will continue being pinned, even after such a dynamic change).
Published by anand1976 over 4 years ago
Published by anand1976 over 4 years ago
Published by siying over 4 years ago
- Added `OptimisticTransactionDBOptions`, an option that allows users to configure the OCC validation policy. The default policy changes from `kValidateSerial` to `kValidateParallel` to reduce mutex contention.
- Users can now change `max_background_jobs` dynamically through the `SetDBOptions` interface.
- BlobDB now supports garbage collection of blobs when `enable_garbage_collection` is set to `true` in `BlobDBOptions`. Garbage collection is performed during compaction: any valid blobs located in the oldest N files (where N is the number of non-TTL blob files multiplied by the value of `BlobDBOptions::garbage_collection_cutoff`) encountered during compaction get relocated to new blob files, and old blob files are dropped once they are no longer needed. Note: we recommend enabling periodic compactions for the base DB when using this feature, to deal with the case when some old blob files are kept alive by SSTs that otherwise do not get picked for compaction.
- `db_bench` now supports the `garbage_collection_cutoff` option for BlobDB.
Published by siying over 4 years ago
Special release for ARM. (Note: the originally tagged commit for this release was wrong, but the tag has been updated a couple of times. You might need to delete your copy of the tag with `git tag -d v5.18.4` to get the new one. See https://git-scm.com/docs/git-tag#_on_re_tagging)
Published by gfosco over 4 years ago
Published by gfosco over 4 years ago
- … `creation_time` of new compaction outputs.
- … compares the `ColumnFamilyHandle` pointers themselves instead of only the column family IDs when checking whether an API call uses the default column family or not.
- `GetLiveFilesMetaData` and `GetColumnFamilyMetaData` now expose the file number of SST files as well as the oldest blob file referenced by each SST.
- The `sst_dump` command line tool's `recompress` command now displays how many blocks were compressed and how many were not, in particular how many were not compressed because the compression ratio was not met (12.5% threshold for GoodCompressionRatio), as seen in the `number.block.not_compressed` counter stat since version 6.0.0.
- `db_bench` now supports, and by default issues, non-TTL Puts to BlobDB. TTL Puts can be enabled by explicitly specifying a non-zero value for the `blob_db_max_ttl_range` command line parameter.
- `sst_dump` now supports printing BlobDB blob indexes in a human-readable format. This can be enabled by specifying the `decode_blob_index` flag on the command line.
- The `creation_time` table property for compaction output files is now set to the minimum of the creation times of all compaction inputs.
- … For an example, see `LevelAndStyleCustomFilterPolicy` in db_bloom_filter_test.cc. While most existing custom implementations of FilterPolicy should continue to work as before, those wrapping the return of NewBloomFilterPolicy will require overriding the new function `GetBuilderWithContext()`, because calling `GetFilterBitsBuilder()` on the FilterPolicy returned by NewBloomFilterPolicy is no longer supported.
- Removed the `snap_refresh_nanos` option.
- Changed the default value of `periodic_compaction_seconds` to `UINT64_MAX - 1`, which allows RocksDB to auto-tune periodic compaction scheduling. When using the default value, periodic compactions are now auto-enabled if a compaction filter is used. A value of `0` will turn off the feature completely.
- Changed the default value of `ttl` to `UINT64_MAX - 1`, which allows RocksDB to auto-tune the TTL value. When using the default value, TTL will be auto-enabled to 30 days, when the feature is supported. To revert to the old behavior, you can explicitly set it to 0.
Published by gfosco almost 5 years ago
Published by gfosco almost 5 years ago
- Disabled the snapshot refresh feature: `snap_refresh_nanos` is set to 0.
- Added a new option, `memtable_insert_hint_per_batch`, to `WriteOptions`. If it is true, each `WriteBatch` will maintain its own insert hints for each memtable in concurrent writes. See include/rocksdb/options.h for more details.
Published by gfosco almost 5 years ago
- Disabled the snapshot refresh feature: `snap_refresh_nanos` is set to 0.
- Added an option, `--secondary_path`, to ldb to open the database as a secondary instance. This keeps the original DB intact.
- … a `RegisterCustomObjects` function. By linking the unit test binary with the static library, the unit test can execute this function.
Published by gfosco about 5 years ago
- Disabled the snapshot refresh feature: `snap_refresh_nanos` is set to 0.
- Added an option, `snap_refresh_nanos` (default 0), to periodically refresh the snapshot list in compaction jobs. Assign 0 to disable the feature.
- Added an option, `unordered_write`, which trades snapshot guarantees for higher write throughput. When used with WRITE_PREPARED transactions with `two_write_queues=true`, it offers higher throughput with no compromise on guarantees.
- Added an option, `failed_move_fall_back_to_copy` (default is true), for external SST file ingestion. When `move_files` is true and the hard link fails, ingestion falls back to copy if `failed_move_fall_back_to_copy` is true. Otherwise, ingestion reports an error.
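The ingestion options above combine roughly as follows. A minimal sketch; `db` is assumed to be an open `rocksdb::DB*`, and the helper name and SST file path are illustrative:

```cpp
#include "rocksdb/db.h"

// Sketch: ingest an external SST file, preferring a hard-link "move" but
// falling back to a copy if the link fails, per the option described above.
// `db` must be an open rocksdb::DB*; the file path is illustrative.
rocksdb::Status IngestWithFallback(rocksdb::DB* db) {
  rocksdb::IngestExternalFileOptions ifo;
  ifo.move_files = true;                     // try to hard-link the file in
  ifo.failed_move_fall_back_to_copy = true;  // copy if the hard link fails
  return db->IngestExternalFile({"/tmp/data.sst"}, ifo);
}
```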
- Fixed a bug that allowed `Put`s covered by range tombstones to reappear. Note `Put`s may exist even if the user only ever called `Merge()`, due to an internal conversion during compaction to the bottommost level.
Published by gfosco about 5 years ago
Published by gfosco about 5 years ago
- Fixed a bug that allowed `Put`s covered by range tombstones to reappear. Note `Put`s may exist even if the user only ever called `Merge()`, due to an internal conversion during compaction to the bottommost level.
- Added an option, `strict_bytes_per_sync`, that causes a file-writing thread to block rather than exceed the limit on bytes pending writeback specified by `bytes_per_sync` or `wal_bytes_per_sync`.
- Added an option, `snap_refresh_nanos` (default 0.5s), to periodically refresh the snapshot list in compaction jobs. Assign 0 to disable the feature.
- Fixed an assertion failure, `IsFlushPending() == true`, caused by one background thread releasing the DB mutex in `~ColumnFamilyData` and another thread clearing the `flush_requested_` flag.
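The writeback throttling described above is plain options configuration. A minimal sketch; the helper name and byte limits are illustrative:

```cpp
#include "rocksdb/options.h"

// Sketch: bound the bytes pending writeback and make writers block rather
// than exceed that bound, per the strict_bytes_per_sync option above.
// The byte limits are illustrative.
rocksdb::Options MakeThrottledWritebackOptions() {
  rocksdb::Options options;
  options.bytes_per_sync = 1 * 1024 * 1024;      // sync data files every 1MB
  options.wal_bytes_per_sync = 1 * 1024 * 1024;  // same for WAL files
  options.strict_bytes_per_sync = true;          // block instead of exceeding
  return options;
}
```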