A library that provides an embeddable, persistent key-value store for fast storage.
Dual-licensed under GPL-2.0 and Apache-2.0.
* Added a `CompactForTieringCollectorFactory` to auto trigger compaction for the tiering use case.
* Added the `GetEntityForUpdate` API (see the transaction sketch after this list).
* Added `rocksdb_writebatch_update_timestamps` and `rocksdb_writebatch_wi_update_timestamps` in the C API.
* Added `rocksdb_iter_refresh` in the C API.
* Added `rocksdb_writebatch_create_with_params` and `rocksdb_writebatch_wi_create_with_params` to create WB and WBWI with all options in the C API.
* Deprecated `LogFile` and `VectorLogPtr` in favor of the new names `WalFile` and `VectorWalPtr`.
* … (`level0_file_num_compaction_trigger`) #12477.
* … `background_close_inactive_wals`.
* Fixed the `ldb dump_wal` command for `PutEntity` records so it prints the key and correctly resets the hexadecimal formatting flag after printing the wide-column entity.
* Fixed a bug where `PutEntity` records were handled incorrectly while rebuilding transactions during recovery.
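To make the `GetEntityForUpdate` entry above concrete, here is a minimal sketch of a wide-column read-modify-write in a pessimistic transaction. The signatures of `GetEntityForUpdate` and `Transaction::PutEntity` shown here are assumptions modeled on the existing `GetForUpdate`/`PutEntity` family; consult the transaction headers for the authoritative declarations.

```cpp
#include <cassert>

#include "rocksdb/utilities/transaction_db.h"
#include "rocksdb/wide_columns.h"

using namespace ROCKSDB_NAMESPACE;

int main() {
  Options options;
  options.create_if_missing = true;
  TransactionDB* txn_db = nullptr;
  Status s = TransactionDB::Open(options, TransactionDBOptions(),
                                 "/tmp/txn_example", &txn_db);
  assert(s.ok());

  Transaction* txn = txn_db->BeginTransaction(WriteOptions());
  PinnableWideColumns entity;
  // Assumed signature, mirroring GetForUpdate: lock the key and read its
  // wide-column entity under the transaction.
  s = txn->GetEntityForUpdate(ReadOptions(), txn_db->DefaultColumnFamily(),
                              "user:1", &entity);
  if (s.ok()) {
    // Copy, modify, and write the entity back within the same transaction.
    WideColumns updated(entity.columns().begin(), entity.columns().end());
    s = txn->PutEntity(txn_db->DefaultColumnFamily(), "user:1", updated);
  }
  if (s.ok()) {
    s = txn->Commit();
  }
  delete txn;
  delete txn_db;
  return s.ok() ? 0 : 1;
}
```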
Published by ajkr 4 months ago

* Added support for the `GetEntity` API.
* Added a new `Iterator` property, "rocksdb.iterator.is-value-pinned", for checking whether the `Slice` returned by `Iterator::value()` can be used until the `Iterator` is destroyed (see the sketch after this list).
* Added support for the `MultiGetEntity` API.
* Added support for the `PutEntity` API. Support for read APIs and other write policies (WritePrepared, WriteUnprepared) will be added later.
* Fixed a bug affecting DBs with `DBOptions::allow_2pc == true` (all `TransactionDB`s except `OptimisticTransactionDB`) that have exactly one column family. Due to a missing WAL sync, attempting to open the DB could have returned a `Status::Corruption` with a message like "SST file is ahead of WALs".
* Fixed a race condition with `ColumnFamilyOptions::inplace_update_support == true` between user overwrites and reads on the same key.
* Fixed a bug where `CompactFiles()` could compact files whose ranges conflict with other ongoing compactions when `preclude_last_level_data_seconds > 0` is used.
* Fixed a `Status::Corruption` reported when reopening a DB that used `DBOptions::recycle_log_file_num > 0` and `DBOptions::wal_compression != kNoCompression`.
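A short sketch of querying the new iterator property from the list above. `Iterator::GetProperty()` is the pre-existing mechanism; the property name comes from this release, while the "1"/"0" encoding of the returned value is an assumption, so treat this as illustrative.

```cpp
#include <string>

#include "rocksdb/db.h"

using namespace ROCKSDB_NAMESPACE;

// Returns true if it is safe to hold on to it->value()'s Slice until the
// iterator is destroyed, per "rocksdb.iterator.is-value-pinned".
bool IsValuePinned(Iterator* it) {
  std::string prop;
  Status s = it->GetProperty("rocksdb.iterator.is-value-pinned", &prop);
  return s.ok() && prop == "1";  // "1"/"0" encoding is an assumption
}
```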
Published by anand1976 5 months ago

* Added `deadline` and `max_size_bytes` options for CacheDumper to exit early.
* Added `GetEntityFromBatchAndDB` to `WriteBatchWithIndex`, which can be used for wide-column point lookups with read-your-own-writes consistency. Similarly to `GetFromBatchAndDB`, the API can combine data from the write batch with data from the underlying database if needed. See the API comments for more details, and the sketch after this list.
* Added `MultiGetEntityFromBatchAndDB` to `WriteBatchWithIndex`, which can be used for batched wide-column point lookups with read-your-own-writes consistency. Similarly to `MultiGetFromBatchAndDB`, the API can combine data from the write batch with data from the underlying database if needed. See the API comments for more details.
* Added the `SstFileReader::NewTableIterator` API to support programmatically reading an SST file as a raw table file.
* Added a `wait_for_purge` option to `WaitForCompactOptions` to make the `WaitForCompact()` API wait for background purge to complete.
* Deprecated `CompactionOptions::compression` since `CompactionOptions`' API for configuring compression was incomplete, unsafe, and likely unnecessary.
* … `OptionChangeMigration()` to migrate from non-FIFO to FIFO compaction.
* … `Options::compaction_options_fifo.max_table_files_size > 0` can cause `max_table_files_size` …
* `BlockBasedTableOptions::block_align` is now incompatible (i.e., APIs will return `Status::InvalidArgument`) with more ways of enabling compression: `CompactionOptions::compression`, `ColumnFamilyOptions::compression_per_level`, and `ColumnFamilyOptions::bottommost_compression`.
* Changed the default of `CompactionOptions::compression` to `kDisableCompressionOption`, which means the compression type is determined by the `ColumnFamilyOptions`.
* `BlockBasedTableOptions::optimize_filters_for_memory` is now set to true by default. When `partition_filters=false`, this could lead to somewhat increased average RSS memory usage by the block cache, but this "extra" usage is within the allowed memory budget and should make memory usage more consistent (by minimizing internal fragmentation for more kinds of blocks).
* … when `SetDumpFilter()` is not called.
* `CompactRange()` with `CompactRangeOptions::change_level = true` and `CompactRangeOptions::target_level = 0` that ends up moving more than 1 file from non-L0 to L0 will return `Status::Aborted()`.
* Fixed a bug causing `VerifyFileChecksums()` to return false-positive corruption under `BlockBasedTableOptions::block_align=true`.
* … the `NewIterators()` API.
* Fixed a bug when using `DeleteRange()` together with `ColumnFamilyOptions::memtable_insert_with_hint_prefix_extractor`. The impact of this bug would likely be corruption or crashing.
* Fixed a bug in `DisableManualCompactions()` where compactions waiting to be scheduled due to conflicts would not be canceled promptly.
* Fixed a regression with `ColumnFamilyOptions::max_successive_merges > 0` where the CPU overhead for deciding whether to merge could have increased unless the user had set the option `ColumnFamilyOptions::strict_max_successive_merges`.
* Fixed a bug in `MultiGet()` and `MultiGetEntity()` together with blob files (`ColumnFamilyOptions::enable_blob_files == true`), where an error looking up one of the keys could cause the results to be wrong for other keys for which the statuses were `Status::OK`.
* … `DataVerificationInfo::checksum` upon file creation.
* … `PinnableWideColumns`.
* … `SstFileManager`'s slow deletion feature even if it's configured.
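A sketch of the new read-your-own-writes wide-column lookup described above. The parameter order of `GetEntityFromBatchAndDB` is an assumption modeled on `GetFromBatchAndDB`; see the `write_batch_with_index.h` comments for the authoritative signature.

```cpp
#include "rocksdb/db.h"
#include "rocksdb/utilities/write_batch_with_index.h"
#include "rocksdb/wide_columns.h"

using namespace ROCKSDB_NAMESPACE;

void Example(DB* db) {
  WriteBatchWithIndex batch;
  // Stage an update in the batch; it is not yet visible in the DB itself.
  batch.Put(db->DefaultColumnFamily(), "k1", "v1");

  // Read through the batch first, falling back to the DB if needed, so the
  // staged "k1" above is visible to this lookup (assumed parameter order).
  PinnableWideColumns result;
  Status s = batch.GetEntityFromBatchAndDB(db, ReadOptions(),
                                           db->DefaultColumnFamily(), "k1",
                                           &result);
  // On success, result.columns() holds the wide-column entity.
}
```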
Published by ajkr 6 months ago

* Fixed `SstFileMetaData` to prevent throwing `java.lang.NoSuchMethodError`.
* Fixed a regression with `ColumnFamilyOptions::max_successive_merges > 0` where the CPU overhead for deciding whether to merge could have increased unless the user had set the option `ColumnFamilyOptions::strict_max_successive_merges`.
Published by jowlyzhang 6 months ago
* Added `GetMergeOperandsOptions::continue_cb` to give users the ability to end `GetMergeOperands()`'s lookup process before all merge operands are found.
* Added support for the `default_write_temperature` CF option and for opening an `SstFileWriter` with a temperature.
* `WriteBatchWithIndex` now supports wide-column point lookups via the `GetEntityFromBatch` API. See the API comments for more details.
* Added `Iterator::GetProperty("rocksdb.iterator.write-time")` to allow users to get data's approximate write unix time, and support for writing data with a specific write time via the `WriteBatch::TimedPut` API.
* Best-efforts recovery (`best_efforts_recovery == true`) may now be used together with atomic flush (`atomic_flush == true`). The all-or-nothing recovery guarantee for atomically flushed data will be upheld.
* Removed `bottommost_temperature`, already replaced by `last_level_temperature`.
* For `WriteCommittedTransaction::GetForUpdate`, if the column family enables user-defined timestamps (UDTs), it was previously mandated that the argument `do_validate` cannot be false, and UDT-based validation had to be done with a user-set read timestamp. UDT-based validation is now optional if the user sets `do_validate` to false and does not set a read timestamp. In that case, `GetForUpdate` skips UDT-based validation and it is the user's responsibility to enforce the UDT invariant. Do NOT skip this UDT-based validation if there is no way to enforce the UDT invariant; ways to enforce it on the user side include managing a monotonically increasing timestamp, committing transactions in a single thread, etc.
* Added a new PerfLevel `kEnableWait` to measure time spent by user threads blocked in RocksDB other than on the mutex, such as a write thread waiting to be added to a write group or a write thread that is delayed or stalled.
* `RateLimiter`'s API no longer requires the burst size to be the refill size. Users of `NewGenericRateLimiter()` can now provide the burst size in `single_burst_bytes` (see the sketch after the benchmark tables below). Implementors of `RateLimiter::SetSingleBurstBytes()` need to adapt their implementations to match the changed API doc.
* Moved `write_memtable_time` to the newly introduced PerfLevel `kEnableWait`.
* `RateLimiter`s created by `NewGenericRateLimiter()` no longer modify the refill period when `SetSingleBurstBytes()` is called.
* … `ColumnFamilyOptions::max_successive_merges` when the key's merge operands are all found in memory, unless `strict_max_successive_merges` is explicitly set.
* Fixed `kBlockCacheTier` reads to return `Status::Incomplete` when I/O is needed to fetch a merge chain's base value from a blob file.
* Fixed `kBlockCacheTier` reads to return `Status::Incomplete` on a table cache miss rather than incorrectly returning an empty value.
* The Java `multiGet()` variants now take advantage of the underlying batched `multiGet()` performance improvements.

Before:
```
Benchmark                          (columnFamilyTestType)  (keyCount)  (keySize)  (multiGetSize)  (valueSize)   Mode  Cnt     Score    Error  Units
MultiGetBenchmarks.multiGetList10  no_column_family        10000       16         100             64           thrpt   25  6315.541 ±  8.106  ops/s
MultiGetBenchmarks.multiGetList10  no_column_family        10000       16         100             1024         thrpt   25  6975.468 ± 68.964  ops/s
```

After:
```
Benchmark                          (columnFamilyTestType)  (keyCount)  (keySize)  (multiGetSize)  (valueSize)   Mode  Cnt     Score    Error  Units
MultiGetBenchmarks.multiGetList10  no_column_family        10000       16         100             64           thrpt   25  7046.739 ± 13.299  ops/s
MultiGetBenchmarks.multiGetList10  no_column_family        10000       16         100             1024         thrpt   25  7654.521 ± 60.121  ops/s
```
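A sketch of decoupling burst size from refill size per the `RateLimiter` entries above. The first five `NewGenericRateLimiter()` parameters are the long-standing ones; the trailing `single_burst_bytes` argument is assumed from this release note.

```cpp
#include "rocksdb/options.h"
#include "rocksdb/rate_limiter.h"

using namespace ROCKSDB_NAMESPACE;

void ConfigureRateLimiter(Options& options) {
  options.rate_limiter.reset(NewGenericRateLimiter(
      /*rate_bytes_per_sec=*/16 << 20,   // 16MB/s sustained
      /*refill_period_us=*/100 * 1000,   // default refill period
      /*fairness=*/10,                   // default fairness
      RateLimiter::Mode::kWritesOnly,
      /*auto_tuned=*/false,
      /*single_burst_bytes=*/4 << 20));  // assumed new trailing parameter
  // The burst size can also be adjusted later, now without changing the
  // refill period:
  Status s = options.rate_limiter->SetSingleBurstBytes(8 << 20);
  (void)s;
}
```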
Published by ajkr 6 months ago
* Fixed `SstFileMetaData` to prevent throwing `java.lang.NoSuchMethodError`.
Published by ajkr 6 months ago
* Fixed `SstFileMetaData` to prevent throwing `java.lang.NoSuchMethodError`.
Published by jowlyzhang 7 months ago
* Make `SstFileWriter` create SST files without persisting user-defined timestamps when the `Option.persist_user_defined_timestamps` flag is set to false.
* … `DeleteFilesInRanges` and `GetPropertiesOfTablesInRange`.
* Removed deprecated option `access_hint_on_compaction_start`.
* Removed deprecated option `ColumnFamilyOptions::check_flush_compaction_key_order`.
* Removed the default `WritableFile::GetFileSize` and `FSWritableFile::GetFileSize` implementation that returns 0 and made it pure virtual, so that subclasses are forced to explicitly provide an implementation (see the sketch after this list).
* Removed deprecated option `ColumnFamilyOptions::level_compaction_dynamic_file_size`.
* Removed the `force` option to the `EnableFileDeletions` API because it is unsafe with no known legitimate use.
* Removed deprecated option `ColumnFamilyOptions::ignore_max_compaction_bytes_for_input`.
* `sst_dump --command=check` now compares the number of records in a table with `num_entries` in the table property, and reports corruption if there is a mismatch. The API `SstFileDumper::ReadSequential()` is updated to optionally do this verification. (#12322)
* … `DBImpl::RenameTempFileToOptionsFile`.
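Because `FSWritableFile::GetFileSize` is now pure virtual per the list above, custom `FileSystem` implementations must supply it. A minimal sketch using the non-owning `FSWritableFileWrapper` helper; the size-tracking approach is just one way to satisfy the contract.

```cpp
#include <cstdint>
#include <memory>

#include "rocksdb/file_system.h"

using namespace ROCKSDB_NAMESPACE;

// Wraps a target file and tracks the written size itself so GetFileSize()
// can be answered without querying the underlying file.
class SizeTrackingFile : public FSWritableFileWrapper {
 public:
  explicit SizeTrackingFile(std::unique_ptr<FSWritableFile>&& target)
      : FSWritableFileWrapper(target.get()), owned_target_(std::move(target)) {}

  IOStatus Append(const Slice& data, const IOOptions& options,
                  IODebugContext* dbg) override {
    IOStatus s = FSWritableFileWrapper::Append(data, options, dbg);
    if (s.ok()) {
      size_ += data.size();
    }
    return s;
  }

  uint64_t GetFileSize(const IOOptions& /*options*/,
                       IODebugContext* /*dbg*/) override {
    return size_;  // there is no default implementation to fall back on now
  }

 private:
  std::unique_ptr<FSWritableFile> owned_target_;
  uint64_t size_ = 0;
};
```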
Published by pdillinger 8 months ago

* Added new statistics: `rocksdb.sst.write.micros` measures the time of each write to an SST file; `rocksdb.file.write.{flush|compaction|db.open}.micros` measure the time of each write to an SST table (currently only the block-based table format) and blob file for flush, compaction and db open.
* Added `kVerify` to `enum class FileOperationType` in listener.h. Update your `switch` statements as needed (see the sketch below).
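A small example of the update required by the entry above: any exhaustive `switch` over `FileOperationType` needs a new case. The names other than `kVerify` are pre-existing enum values.

```cpp
#include "rocksdb/listener.h"

using namespace ROCKSDB_NAMESPACE;

const char* FileOperationName(FileOperationType type) {
  switch (type) {
    case FileOperationType::kRead:
      return "read";
    case FileOperationType::kWrite:
      return "write";
    case FileOperationType::kVerify:  // new value in this release
      return "verify";
    default:
      return "other";
  }
}
```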
* Deprecated the following options: `level_compaction_dynamic_file_size`, `ignore_max_compaction_bytes_for_input`, `check_flush_compaction_key_order`, `flush_verify_memtable_count`, `compaction_verify_record_count`, `fail_if_options_file_error`, and `enforce_single_del_contracts`.
* `rocksdb.blobdb.blob.file.write.micros` expands to also measure the time writing the header and footer. Therefore the COUNT may be higher and values may be smaller than before. For stacked BlobDB, it no longer measures the time of explicitly flushing the blob file.
* `rocksdb.blobdb.blob.file.synced` now includes blob files that failed to get synced, and `rocksdb.blobdb.blob.file.bytes.written` includes blob bytes that failed to get written.
* … `BackupEngine`, `sst_dump`, or `ldb`.
* Fixed an issue with the `preclude_last_level_data_seconds` option that could interfere with expected data tiering.

Published by pdillinger 8 months ago
Published by ltamasi 9 months ago
* Introduced wide-column support in `WriteBatchWithIndex`. This includes the `PutEntity` API and support for wide columns in the existing read APIs (`GetFromBatch`, `GetFromBatchAndDB`, `MultiGetFromBatchAndDB`, and `BaseDeltaIterator`).
* `TablePropertiesCollectorFactory` may now return a `nullptr` collector to decline processing a file, reducing callback overheads in such cases (see the sketch after this list).
* `HyperClockCacheOptions::eviction_effort_cap` controls the space-time trade-off of the response. The default should be generally well-balanced, with no measurable effect on normal operation.
* `RocksDB.get([ColumnFamilyHandle columnFamilyHandle,] ReadOptions opt, ByteBuffer key, ByteBuffer value)` now accepts indirect buffer parameters as well as direct buffer parameters.
* `RocksDB.put([ColumnFamilyHandle columnFamilyHandle,] WriteOptions writeOpts, final ByteBuffer key, final ByteBuffer value)` now accepts indirect buffer parameters as well as direct buffer parameters.
* Added `RocksDB.merge([ColumnFamilyHandle columnFamilyHandle,] WriteOptions writeOptions, ByteBuffer key, ByteBuffer value)` methods with the same parameter options as `put(...)`; direct and indirect buffers are supported.
* Added `RocksIterator.key(byte[] key [, int offset, int len])` methods which retrieve the iterator key into the supplied buffer.
* Added `RocksIterator.value(byte[] value [, int offset, int len])` methods which retrieve the iterator value into the supplied buffer.
* Deprecated `get(final ColumnFamilyHandle columnFamilyHandle, final ReadOptions readOptions, byte[])` in favour of `get(final ReadOptions readOptions, final ColumnFamilyHandle columnFamilyHandle, byte[])`, which has consistent parameter ordering with other methods in the same class.
* Added `Transaction.get(ReadOptions opt, [ColumnFamilyHandle columnFamilyHandle,] byte[] key, byte[] value)` methods which retrieve the requested value into the supplied buffer.
* Added `Transaction.get(ReadOptions opt, [ColumnFamilyHandle columnFamilyHandle,] ByteBuffer key, ByteBuffer value)` methods which retrieve the requested value into the supplied buffer.
* Added `Transaction.getForUpdate(ReadOptions readOptions, [ColumnFamilyHandle columnFamilyHandle,] byte[] key, byte[] value, boolean exclusive [, boolean doValidate])` methods which retrieve the requested value into the supplied buffer.
* Added `Transaction.getForUpdate(ReadOptions readOptions, [ColumnFamilyHandle columnFamilyHandle,] ByteBuffer key, ByteBuffer value, boolean exclusive [, boolean doValidate])` methods which retrieve the requested value into the supplied buffer.
* Added a `Transaction.getIterator()` method as a convenience which defaults the `ReadOptions` value supplied to the existing `Transaction.iterator()` methods. This mirrors the existing `RocksDB.iterator()` method.
* Added `Transaction.put([ColumnFamilyHandle columnFamilyHandle,] ByteBuffer key, ByteBuffer value [, boolean assumeTracked])` methods which supply the key and the value to be written in `ByteBuffer` parameters.
* Added `Transaction.merge([ColumnFamilyHandle columnFamilyHandle,] ByteBuffer key, ByteBuffer value [, boolean assumeTracked])` methods which supply the key and the value to be written/merged in `ByteBuffer` parameters.
* Added `Transaction.mergeUntracked([ColumnFamilyHandle columnFamilyHandle,] ByteBuffer key, ByteBuffer value)` methods which supply the key and the value to be written/merged in `ByteBuffer` parameters.
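A minimal sketch of declining files from a `TablePropertiesCollectorFactory`, per the entry above. The class names and the column-family-id filter condition are purely illustrative.

```cpp
#include "rocksdb/table_properties.h"

using namespace ROCKSDB_NAMESPACE;

class NoopCollector : public TablePropertiesCollector {
 public:
  Status AddUserKey(const Slice& /*key*/, const Slice& /*value*/,
                    EntryType /*type*/, SequenceNumber /*seq*/,
                    uint64_t /*file_size*/) override {
    return Status::OK();
  }
  Status Finish(UserCollectedProperties* /*properties*/) override {
    return Status::OK();
  }
  UserCollectedProperties GetReadableProperties() const override { return {}; }
  const char* Name() const override { return "NoopCollector"; }
};

class SelectiveCollectorFactory : public TablePropertiesCollectorFactory {
 public:
  TablePropertiesCollector* CreateTablePropertiesCollector(
      TablePropertiesCollectorFactory::Context context) override {
    // Returning nullptr declines the file entirely, so no per-entry
    // callbacks are issued for it (hypothetical filter condition).
    if (context.column_family_id != 0) {
      return nullptr;
    }
    return new NoopCollector();
  }
  const char* Name() const override { return "SelectiveCollectorFactory"; }
};
```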
Published by jaykorean 10 months ago

* Made the `EnableFileDeletion` API not default to force enabling. Users that rely on the old default behavior and still want to force enable file deletions need to pass `true` to `EnableFileDeletion` explicitly (see the snippet after this list).
* During off-peak hours defined by `daily_offpeak_time_utc`, the compaction picker will select a larger number of files for periodic compaction. This selection will include files that are projected to expire by the next off-peak start time, ensuring that these files are not chosen for periodic compaction outside of off-peak hours.
* If a write fails after `DB::StartTrace()`, the subsequent trace writes are skipped to avoid writing to a file that has previously seen an error. In this case, `DB::EndTrace()` will also return a non-OK status with info about the previously occurred error in its status message.
* … `TablePropertiesCollector::Finish()` once.
* With `WAL_ttl_seconds > 0`, archived WALs are now processed for deletion at least every `WAL_ttl_seconds / 2` seconds. Previously it could be less frequent in case of small `WAL_ttl_seconds` values when size-based expiration (`WAL_size_limit_MB > 0`) was simultaneously enabled.
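For callers affected by the default change above, the explicit form looks like this (shown with the plural `EnableFileDeletions` method name of the C++ `DB` class):

```cpp
#include "rocksdb/db.h"

using namespace ROCKSDB_NAMESPACE;

Status BackupWindow(DB* db) {
  // Prevent SST files from being deleted while they are being copied.
  Status s = db->DisableFileDeletions();
  if (!s.ok()) return s;
  // ... copy live files for a backup ...
  // The default no longer force-enables; pass true explicitly to restore the
  // old behavior of overriding any outstanding DisableFileDeletions() calls.
  return db->EnableFileDeletions(/*force=*/true);
}
```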
Published by hx235 11 months ago

* Added new tickers `rocksdb.fifo.{max.size|ttl}.compactions` to count FIFO compactions that drop files for different reasons.
* Added `DBOptions::daily_offpeak_time_utc` in "HH:mm-HH:mm" format. This information will be used for resource optimization in the future (see the snippet after this list).
* Added `SetSingleBurstBytes()` for the RocksDB rate limiter.
* The default value of `DBOptions::fail_if_options_file_error` changed from `false` to `true`. Operations that set in-memory options (e.g., `DB::Open*()`, `DB::SetOptions()`, `DB::CreateColumnFamily*()`, and `DB::DropColumnFamily()`) but fail to persist the change will now return a non-OK `Status` by default.
* … when `Options::compaction_readahead_size` is 0.
* … `Status::NotSupported()`.
* … `max_successive_merges` logic.
* … `create_missing_column_families=true` and many column families.
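Setting the off-peak window is a one-liner. The "HH:mm-HH:mm" format comes from the entry above; the particular times here are arbitrary examples.

```cpp
#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

Status OpenWithOffpeak(DB** db) {
  Options options;
  options.create_if_missing = true;
  // Off-peak window in UTC, used by RocksDB for resource optimization.
  options.daily_offpeak_time_utc = "23:30-04:00";
  return DB::Open(options, "/tmp/offpeak_example", db);
}
```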
Published by ajkr 11 months ago

* Added a new ticker `COMPACTION_CPU_TOTAL_TIME` that records cumulative compaction CPU time. This ticker is updated regularly while a compaction is running.
* Added the `GetEntity()` API for ReadOnly DB and Secondary DB.
* Added a new iterator API `Iterator::Refresh(const Snapshot *)` that allows an iterator to be refreshed while using the input snapshot to read.
* Added a new read option `merge_operand_count_threshold`. When the number of merge operands applied during a successful point lookup exceeds this threshold, the query will return a special OK status with a new subcode `kMergeOperandThresholdExceeded`. Applications might use this signal to take action to reduce the number of merge operands for the affected key(s), for example by running a compaction (see the sketch after this list).
* For `NewRibbonFilterPolicy()`, made the `bloom_before_level` option mutable through the Configurable interface and the SetOptions API, allowing dynamic switching between all-Bloom and all-Ribbon configurations, and configurations in between. See the comments on `NewRibbonFilterPolicy()`.
* Added the `NewTieredCache()` API in rocksdb/cache.h.
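A sketch of consuming the new signal from the merge-operand entry above. It assumes the option lives on `ReadOptions` and that the subcode is spelled `Status::kMergeOperandThresholdExceeded`; both names are taken from this note, not verified against the headers.

```cpp
#include <string>

#include "rocksdb/db.h"

using namespace ROCKSDB_NAMESPACE;

void PointLookupWithThreshold(DB* db) {
  ReadOptions read_options;
  read_options.merge_operand_count_threshold = 100;  // new read option

  std::string value;
  Status s = db->Get(read_options, "counter", &value);
  if (s.ok() && s.subcode() == Status::kMergeOperandThresholdExceeded) {
    // Still a successful read, but the merge chain is long; consider
    // compacting this key range to collapse the operands.
  }
}
```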
* Added `FullMergeV3` to `MergeOperator`. `FullMergeV3` supports wide columns both as base value and merge result, which enables the application to perform more general transformations during merges. For backward compatibility, the default implementation implements the earlier logic of applying the merge operation to the default column of any wide-column entities. Specifically, if there is no base value or the base value is a plain key-value, the default implementation falls back to `FullMergeV2`. If the base value is a wide-column entity, the default implementation invokes `FullMergeV2` to perform the merge on the default column, and leaves any other columns unchanged.
* Added new fields to `CompactionFilter::Context`. See `CompactionFilter::Context::input_start_level` and `CompactionFilter::Context::input_table_properties` for more.
* `Options::compaction_readahead_size`'s default value is changed from 0 to 2MB.
* The ZSTD `acceleration` parameter is configurable by setting the negated value in `CompressionOptions::level`. For example, `CompressionOptions::level=-10` will set `acceleration=10` (see the snippet after this list).
* The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache`, is provided to dynamically update the total capacity, the ratio of compressed cache, and the admission policy.
* The `NewTieredVolatileCache()` API in rocksdb/cache.h has been renamed to `NewTieredCache()`.
* … when `Options::compaction_readahead_size` is explicitly set to 0.
* … when `Options::compaction_readahead_size` is 0.
* … `IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_87732_4230653031040984171/000013.sst`.
* … `LRUCache` before returning, thus incurring locking overhead. With this fix, inserts and lookups are no-ops and do not add any overhead.
* Fixed a bug in `MultiGet` for cleaning up SuperVersion acquired with locking db mutex.
* Fixed a bug in `GenericRateLimiter` that could cause it to stop granting requests.
* Fixed a bug where `rocksdb.file.read.verify.file.checksums.micros` is not populated.
* … `Status::NotSupported()`.
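The negated-level convention above as a snippet. `CompressionOptions::level` is the pre-existing field; per this release, a negative value with ZSTD is interpreted as an acceleration setting.

```cpp
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

void ConfigureFastZstd(Options& options) {
  options.compression = kZSTD;
  // Negative level selects ZSTD acceleration: level = -10 means
  // acceleration = 10 (faster compression, lower ratio).
  options.compression_opts.level = -10;
}
```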
Published by anand1976 about 1 year ago
* … `Status::NotSupported()`.
* … when `Options::compaction_readahead_size` is 0.
* … `IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_87732_4230653031040984171/000013.sst`.
* … `LRUCache` before returning, thus incurring locking overhead. With this fix, inserts and lookups are no-ops and do not add any overhead.
* Fixed a bug where `rocksdb.file.read.verify.file.checksums.micros` is not populated.
* Added a new option `default_temperature` that is used for file-reading accounting purposes, such as IO statistics, for files that don't have an explicitly set temperature.
* `Options::compaction_readahead_size`'s default value is changed from 0 to 2MB.
* … when `Options::compaction_readahead_size` is explicitly set to 0.
* … A new option `compaction_verify_record_count` is introduced for this purpose and is enabled by default.
* Added a new option `bottommost_file_compaction_delay` to allow specifying the delay of bottommost-level single-file compactions.
* Added a new option `memtable_max_range_deletions` that limits the number of range deletions in a memtable. RocksDB will try to do an automatic flush after the limit is reached. (#11358)
* Added a new `timeout` in microseconds option to `WaitForCompactOptions` to allow timely termination of prolonged waiting in scenarios like recurring recoverable errors, such as out-of-space situations and continuous write streams that sustain ongoing flush and compactions (see the sketch after this list).
* New statistics `rocksdb.file.read.{get|multiget|db.iterator|verify.checksum|verify.file.checksums}.micros` measure the read time of block-based SST tables or blob files during db open, `Get()`, `MultiGet()`, using a db iterator, `VerifyFileChecksums()` and `VerifyChecksum()`. They require a stats level greater than `StatsLevel::kExceptDetailedTimers`.
* Added an option to `WaitForCompactOptions` to call Close() after waiting is done.
* Added a new option `CompressionOptions::checksum` for enabling ZSTD's checksum feature to detect corruption during decompression.
* Marked `Options::access_hint_on_compaction_start`-related APIs as deprecated. See #11631 for alternative behavior.
* `rocksdb.sst.read.micros` now includes time spent on multi read and async read into the file.
* Periodic compaction (`periodic_compaction_seconds`) will be set to 30 days by default if block-based table is used.
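A sketch combining the `WaitForCompactOptions` additions above. The field names follow these notes; the `timeout` type is assumed to be `std::chrono::microseconds`, and `close_db` is an assumed name for the "call Close() after waiting" option.

```cpp
#include <chrono>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

Status WaitThenClose(DB* db) {
  WaitForCompactOptions wait_options;
  // Stop waiting after 60 seconds instead of blocking indefinitely on
  // recurring recoverable errors (type assumed).
  wait_options.timeout = std::chrono::microseconds(60 * 1000 * 1000);
  // Close the DB once waiting is done (field name assumed).
  wait_options.close_db = true;
  return db->WaitForCompact(wait_options);
}
```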
Published by hx235 about 1 year ago

* … `Status::NotSupported()`.
* … when `Options::compaction_readahead_size` is 0.

Published by ajkr about 1 year ago
* Fixed a bug in `GenericRateLimiter` that could cause it to stop granting requests.
* Removed `GeneralCache` and `MakeSharedGeneralCache()` as our plan changed to stop exposing a general-purpose cache interface. The old forms of these APIs, `Cache` and `NewLRUCache()`, are still available, although general-purpose caching support will be dropped eventually.
* `periodic_compaction_seconds` no longer supports FIFO compaction: setting it has no effect on FIFO compactions. FIFO compaction users should only set the option `ttl` instead.

Published by ajkr about 1 year ago
* Fixed a bug in `GenericRateLimiter` that could cause it to stop granting requests.
* Record the flag `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` in the Manifest and table properties for an SST file when it is created, and use the recorded flag when creating a table reader for the SST file. This flag is only explicitly recorded if it is false.
* Added a new ticker `rocksdb.files.marked.trash.deleted` to track the number of trash files deleted by the background thread from the trash queue.
* Added an API `WaitForCompact()` to wait for all flush and compaction jobs to finish. Jobs to wait for include the unscheduled (queued, but not scheduled yet).
* Added `WriteBatch::Release()`, which releases the batch's serialized data to the caller.
* Added `rocksdb_options_add_compact_on_deletion_collector_factory_del_ratio`.
* Added new tickers `rocksdb.error.handler.bg.error.count`, `rocksdb.error.handler.bg.io.error.count`, and `rocksdb.error.handler.bg.retryable.io.error.count` to replace the misspelled ones: `rocksdb.error.handler.bg.errro.count`, `rocksdb.error.handler.bg.io.errro.count`, `rocksdb.error.handler.bg.retryable.io.errro.count` ('error' instead of 'errro'). Users should switch to the new tickers before the 9.0 release, as the misspelled old tickers will be completely removed then.
* Changed the default value of `level_compaction_dynamic_level_bytes` to true. This affects users who use leveled compaction and do not set this option explicitly. These users may see additional background compactions following DB open. These compactions help to shape the LSM according to `level_compaction_dynamic_level_bytes`, such that the size of each level Ln is approximately the size of Ln-1 * `max_bytes_for_level_multiplier`. Turning on this option has other benefits too; see more detail in the wiki: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size and in the option comment in advanced_options.h (#11525). To keep the old behavior, see the snippet after this entry.
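Given the default change above, leveled-compaction users who want the old LSM shaping behavior can opt out explicitly:

```cpp
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

void OptOutOfDynamicLevelBytes(Options& options) {
  // The default flipped to true in this release; setting it back to false
  // avoids the extra shaping compactions after DB open described above.
  options.level_compaction_dynamic_level_bytes = false;
}
```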
* `CompactRange()` will now always try to compact to the last non-empty level. (#11468)
* `CompactRange()` with `bottommost_level_compaction = BottommostLevelCompaction::kIfHaveCompactionFilter` will behave similarly to `kForceOptimized` in that it will skip files created during this manual compaction when compacting files in the bottommost level. (#11468)
* With `allow_ingest_behind=true` (currently only Universal compaction is supported), files in the last level, i.e. the ingested files, will not be included in any compaction. (#11489)
* `rocksdb.sst.read.micros` scope is expanded to all SST reads except for file ingestion and column family import (some compaction reads were previously excluded).

Published by ajkr about 1 year ago
* Fixed a bug in `GenericRateLimiter` that could cause it to stop granting requests.

Published by jowlyzhang over 1 year ago
* Added a new option `block_protection_bytes_per_key`, which can be used to enable per key-value integrity protection for in-memory blocks in block cache (#11287).
* Added `JemallocAllocatorOptions::num_arenas`. Setting `num_arenas > 1` may mitigate mutex contention in the allocator, particularly in scenarios where block allocations commonly bypass the jemalloc tcache.
* … `ShardedCacheOptions::hash_seed`, which also documents the solved problem in more detail.
* Added a new option `CompactionOptionsFIFO::file_temperature_age_thresholds` that allows FIFO compaction to compact files to different temperatures based on key age (#11428).
* Added a new ticker `BLOCK_CHECKSUM_MISMATCH_COUNT`.
* Added a new statistic `rocksdb.file.read.db.open.micros` that measures the read time of block-based SST tables or blob files during db open.
* … `_LEVEL_SEEK_`*. (#11460)
* Added `DB::ClipColumnFamily` to clip the keys in a CF to a certain range. It will physically delete all keys outside the range, including tombstones.
* Added `MakeSharedCache()` construction functions to various cache Options objects, and deprecated the `NewWhateverCache()` functions with long parameter lists (see the sketch after this list).
* … `_LEVEL_SEEK_`* stats. (#11460)
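A sketch of the new options-object construction style from the `MakeSharedCache()` entry above, using `HyperClockCacheOptions` as the example. The constructor parameters shown are pre-existing; treating `estimated_entry_charge = 0` as a request for automatic sizing is an assumption.

```cpp
#include "rocksdb/cache.h"
#include "rocksdb/table.h"

using namespace ROCKSDB_NAMESPACE;

BlockBasedTableOptions MakeTableOptions() {
  // New style: configure an Options object, then call MakeSharedCache(),
  // instead of a NewWhateverCache() function with a long parameter list.
  HyperClockCacheOptions cache_options(/*capacity=*/1 << 30,
                                       /*estimated_entry_charge=*/0);
  BlockBasedTableOptions table_options;
  table_options.block_cache = cache_options.MakeSharedCache();
  return table_options;
}
```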