A library that provides an embeddable, persistent key-value store for fast storage.
GPL-2.0 License
Published by ajkr over 2 years ago
- Fixed a bug where `Iterator::Refresh()` reads stale keys after `DeleteRange()` is performed.
- Fixed a race condition when cancelling a manual compaction with `DisableManualCompaction`. Also, DB close can now cancel the manual compaction thread.
- Fixed a data race on `versions_` between `DBImpl::ResumeImpl()` and threads waiting for recovery to complete (#9496).
- Fixed a bug in `DB::GetMergeOperands()`.

Published by ajkr over 2 years ago
- Fixed a race condition when cancelling a manual compaction with `DisableManualCompaction`. Also, DB close can now cancel the manual compaction thread.
- Fixed a data race on `versions_` between `DBImpl::ResumeImpl()` and threads waiting for recovery to complete (#9496).
- Fixed a bug in `DB::GetMergeOperands()`.
- Switched to using `std::vector` instead of `std::map` for storing the metadata objects for blob files, which can improve performance for certain workloads, especially when the number of blob files is high.
- Users must now call `close()` on their RocksJava objects. See #9523.
- Added `ReadOptions::rate_limiter_priority`. When set to something other than `Env::IO_TOTAL`, the internal rate limiter (`DBOptions::rate_limiter`) will be charged at the specified priority for file reads associated with the API to which the `ReadOptions` was provided.
- Removed `BackupableDBOptions`; use backup_engine.h and `BackupEngineOptions` instead. Similar renamings are in the C and Java APIs.
- Removed `UtilityDB::OpenTtlDB`; use db_ttl.h and `DBWithTTL::Open` instead.
- Changed `Cache::CreateCallback` from `void*` to `const void*`.
- Removed `rocksdb_filterpolicy_create()` from the C API, as the only C API support for custom filter policies is now obsolete.
- Renamed `SizeApproximationOptions.include_memtabtles` to `SizeApproximationOptions.include_memtables` (typo fix).
- Deprecated `CompactionService::Start()` and `CompactionService::WaitForComplete()`. Please use `CompactionService::StartV2()` and `CompactionService::WaitForCompleteV2()` instead, which provide the same information plus extra data like priority, db_id, etc.
- `ColumnFamilyOptions::OldDefaults` and `DBOptions::OldDefaults` are marked deprecated, as they are no longer maintained.
- Added `OnSubcompactionBegin()` and `OnSubcompactionCompleted()`.
- Added temperature information to `FileOperationInfo` in the event listener API.
- `NewSequentialFile()`: backup and checkpoint operations need to open the source files with `NewSequentialFile()`, which will have the temperature hints. Other operations are not covered.
- `ReadOptions::total_order_seek` no longer affects `DB::Get()`. The original motivation for this interaction has been obsolete since RocksDB has been able to detect whether the current prefix extractor is compatible with that used to generate table files, probably since RocksDB 5.14.0.
- Added `BlockBasedTableOptions::detect_filter_construct_corruption` for detecting corruption during Bloom filter (format_version >= 5) and Ribbon filter construction.
- The `rocksdb.blob-stats` DB property.
- New tickers `LAST_LEVEL_READ_*`, `NON_LAST_LEVEL_READ_*`.
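As a sketch of the `ReadOptions::rate_limiter_priority` feature described above (the rate, DB path, and key are illustrative; the code assumes the RocksDB library is available):

```cpp
#include <cassert>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/rate_limiter.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // A shared rate limiter at 10 MB/s; IO charged to it is throttled.
  options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(10 << 20));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rl_demo", &db);
  assert(s.ok());

  // With a priority other than Env::IO_TOTAL (the default), file reads
  // done on behalf of this Get() are charged to options.rate_limiter.
  rocksdb::ReadOptions read_options;
  read_options.rate_limiter_priority = rocksdb::Env::IO_USER;

  std::string value;
  s = db->Get(read_options, "some_key", &value);  // NotFound is fine here
  delete db;
  return 0;
}
```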
Published by anand1976 over 2 years ago
Note: The next release will be major release 7.0. See https://github.com/facebook/rocksdb/issues/9390 for more info.

- Added new trace filter types to `TraceFilterType`: `kTraceFilterIteratorSeek`, `kTraceFilterIteratorSeekForPrev`, and `kTraceFilterMultiGet`. They can be set in `TraceOptions` to filter out the operation types after which they are named.
- Added `TraceOptions::preserve_write_order`. When enabled, it guarantees that write records are traced in the same order they are logged to the WAL and applied to the DB. By default it is disabled (false) to match the legacy behavior and prevent regression.
- `Options::OldDefaults` is marked deprecated, as it is no longer maintained.
- Changed `BlockBasedTableOptions::block_size` from `size_t` to `uint64_t`.
- Disallowed using `Iterator::Refresh()` together with `DB::DeleteRange()`, which are incompatible and have always risked causing the refreshed iterator to return incorrect results.
- `DB::DestroyColumnFamilyHandle()` will return Status::InvalidArgument() if called with `DB::DefaultColumnFamily()`.
- Added `Options::DisableExtraChecks()`, which can be used to improve peak write performance by disabling checks that should not be necessary in the absence of software logic errors or CPU+memory hardware errors. (Default options are slowly moving toward some performance overheads for extra correctness checking.)
- Switched to `fcntl(F_FULLFSYNC)` on OS X and iOS.
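A minimal sketch of opting into `Options::DisableExtraChecks()` (only sensible when the surrounding software and hardware are trusted; DB opening is elided):

```cpp
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  // Disable checks that exist only to catch software logic errors or
  // CPU/memory hardware errors, trading safety for peak write speed.
  options.DisableExtraChecks();
  // ... open the DB with `options` as usual ...
  return 0;
}
```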
Published by akankshamahajan15 over 2 years ago
- Fixed a bug in the `ObjectRegistry`. The bug could result in failure to save the OPTIONS file.
- `FaultInjectionTestFS`.
- Block cache keys no longer use `FSRandomAccessFile::GetUniqueId()` (previously used when available), so a filesystem recycling unique ids can no longer lead to incorrect results or a crash (#7405). For files generated by RocksDB >= 6.24, the cache keys are stable across DB::Open and DB directory move / copy / import / export / migration, etc. Although collisions are still theoretically possible, they are (a) impossible in many common cases, (b) not dependent on environmental factors, and (c) much less likely than a CPU miscalculation while executing RocksDB.
- Added a `checker` argument that performs additional checking on timestamp sizes.
- Replaced `TableProperties::properties_offsets` with the uint64_t property `external_sst_file_global_seqno_offset` to save table properties' memory.
- Removed `TableProperties.getPropertiesOffsets()` as it exposed internal details to external users.

Published by riversand963 almost 3 years ago
- Fixed a bug in the `ObjectRegistry`. The bug could result in failure to save the OPTIONS file.
- Fixed a bug that caused `GetSortedWalFiles()` to fail randomly with an error like `IO error: 001234.log: No such file or directory`.
- `BlockBasedTableOptions::reserve_table_builder_memory = true`.
- `blob_compaction_readahead_size`.
- Prevented `CompactRange()` with `CompactRangeOptions::change_level == true` from possibly causing corruption to the LSM state (overlapping files within a level) when run in parallel with another manual compaction. Note that setting `force_consistency_checks == true` (the default) would cause the DB to enter read-only mode in this scenario and return `Status::Corruption`, rather than committing any corruption.
- `RecordTick(stats_, WRITE_WITH_WAL)` was called in two places; this fix removes the extra `RecordTick`s and fixes the corresponding test case.
- `GenericRateLimiter::Request`.
- It is now an error to configure `BlockBasedTableOptions` such that insertion into one of {`block_cache`, `block_cache_compressed`, `persistent_cache`} can show up in another of these. (RocksDB expects to be able to use the same key for different physical data among tiers.)
- Users of the `Env::Priority::BOTTOM` pool will no longer see RocksDB schedule automatic compactions exceeding the DB's compaction concurrency limit. For details on the per-DB compaction concurrency limit, see the API docs of `max_background_compactions` and `max_background_jobs`.
- `NUM_FILES_IN_SINGLE_COMPACTION` was only counting the first input level's files; now it includes all input files.
- `TransactionUtil::CheckKeyForConflicts` can also perform conflict-checking based on user-defined timestamps in addition to sequence numbers.
- Removed `GenericRateLimiter`'s previously enforced minimum refill bytes per period.
- Marked `WriteBufferManager` as `final` because it is not intended for extension.
- Added `FSDirectory::FsyncWithDirOptions()`, which provides extra information, like the directory fsync reason, in `DirFsyncOptions`. File systems like btrfs use that to skip the directory fsync when creating a new file, or, when renaming a file, to fsync the target file instead of the directory, which improves `DB::Open()` speed by ~20%.
- `DB::Open()` is no longer blocked by obsolete file purge if `DBOptions::avoid_unnecessary_blocking_io` is set to true.
- In systems with `gettid()`, info log ("LOG" file) lines now print a system-wide thread ID from `gettid()` instead of the process-local `pthread_self()`. For all users, the thread ID format is changed from hexadecimal to decimal integer.
- In systems with `pthread_setname_np()`, the background thread names no longer contain an ID suffix. For example, "rocksdb:bottom7" (and all other threads in the `Env::Priority::BOTTOM` pool) are now named "rocksdb:bottom". Previously, large thread pools could breach the name size limit (e.g., naming "rocksdb:bottom10" would fail).
- Deprecated `ReadOptions::iter_start_seqnum` and `DBOptions::preserve_deletes`; please try using the user-defined timestamp feature instead. The options will be removed in a future release; currently, a warning message is logged when they are used.
- Improved `BlockBasedTableBuilder` for the `FullFilter` and `PartitionedFilter` case (#9070).
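A minimal sketch of the `DBOptions::avoid_unnecessary_blocking_io` setting mentioned above (the path is illustrative; requires linking against RocksDB):

```cpp
#include <cassert>
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Keep potentially slow IO, such as obsolete file purging, off the
  // calling thread; with this set, DB::Open() is not blocked by the purge.
  options.avoid_unnecessary_blocking_io = true;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/open_demo", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```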
Published by siying almost 3 years ago
- Fixed `DisableManualCompaction()` to cancel compactions even when they are waiting on automatic compactions to drain due to `CompactRangeOptions::exclusive_manual_compactions == true`.
- Fixed the contract of `Env::ReopenWritableFile()` and `FileSystem::ReopenWritableFile()` to specify that any existing file must not be deleted or truncated.
- Fixed a bug in `IngestExternalFiles()` with files for multiple column families. The bug could have introduced a delay in ingested file keys becoming visible after `IngestExternalFiles()` returned. Furthermore, mutations to ingested file keys while they were invisible could have been dropped (not necessarily immediately).
- Fixed a race condition for users of `WriteBufferManager` who constructed it with `allow_stall == true`. The race condition led to undefined behavior (in our experience, typically a process crash).
- `WriteBufferManager::SetBufferSize()` can now be called with `new_size == 0` to dynamically disable memory limiting.
- Made `DB::close()` thread-safe.
- Added tickers `REMOTE_COMPACT_READ_BYTES`, `REMOTE_COMPACT_WRITE_BYTES`.
- Added `class CacheDumper` and `CacheDumpedLoader` at rocksdb/utilities/cache_dump_load.h. Note that this feature is subject to potential change in the future; it is still experimental.
- Added `blob_garbage_collection_force_threshold`, which can be used to trigger compactions targeting the SST files which reference the oldest blob files when the ratio of garbage in those blob files meets or exceeds the specified threshold. This can reduce space amplification with skewed workloads where the affected SST files might not otherwise get picked up for compaction.
- Added `GetUniqueIdFromTableProperties`. Only SST files from RocksDB >= 6.24 support unique IDs.
- Added `GetMapProperty()` support for "rocksdb.dbstats" (`DB::Properties::kDBStats`). As a map property, it includes DB-level internal stats accumulated over the DB's lifetime, such as user-write-related stats and uptime.
- Added `file_temperature` to `IngestExternalFileArg` such that when ingesting SST files, we are able to indicate the temperature of this batch of files.
- If `DB::Close()` failed with a non-aborted status, calling `DB::Close()` again will return the original status instead of Status::OK.
- Added a `lowest_used_cache_tier` option to `DBOptions` (immutable) and pass it to BlockBasedTableReader. By default it is `CacheTier::kNonVolatileBlockTier`, which means we always use both the block cache (kVolatileTier) and the secondary cache (kNonVolatileBlockTier). By setting it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache.
- Java: `keyMayExist()` supports `ByteBuffer`.
Published by ajkr about 3 years ago
- Fixed a bug in `IngestExternalFiles()` with files for multiple column families. The bug could have introduced a delay in ingested file keys becoming visible after `IngestExternalFiles()` returned. Furthermore, mutations to ingested file keys while they were invisible could have been dropped (not necessarily immediately).
- Fixed a race condition for users of `WriteBufferManager` who constructed it with `allow_stall == true`. The race condition led to undefined behavior (in our experience, typically a process crash).
- `WriteBufferManager::SetBufferSize()` can now be called with `new_size == 0` to dynamically disable memory limiting.
- Fixed `DisableManualCompaction()` to cancel compactions even when they are waiting on automatic compactions to drain due to `CompactRangeOptions::exclusive_manual_compactions == true`.
- Fixed the contract of `Env::ReopenWritableFile()` and `FileSystem::ReopenWritableFile()` to specify that any existing file must not be deleted or truncated.

Published by ltamasi about 3 years ago
- Fixed `prepopulate_block_cache = kFlushOnly` to only apply to flushes rather than to all generated files.
- Added `db_name`, `db_id`, `session_id`, which could help the user uniquely identify a compaction job between DB instances and sessions.
- `VerifyChecksum()` and `VerifyFileChecksums()` queries.
- Added DB properties `rocksdb.num-blob-files`, `rocksdb.blob-stats`, `rocksdb.total-blob-file-size`, and `rocksdb.live-blob-file-size`. The existing property `rocksdb.estimate-live-data-size` was also extended to include live bytes residing in blob files.
- Added new RateLimiter IO priorities `Env::IO_USER`, `Env::IO_MID`. `Env::IO_USER` will have superior priority over all other RateLimiter IOPriorities, without being subject to the fair scheduling constraint.
- `SstFileWriter` now supports `Put`s and `Delete`s with user-defined timestamps. Note that the ingestion logic itself is not timestamp-aware yet.
- Added `OnBlobFileCreationStarted`, `OnBlobFileCreated`, and `OnBlobFileDeleted` to the `EventListener` class of listener.h. These notify listeners during creation/deletion of individual blob files in Integrated BlobDB. Blob file creation-finished and deletion events are also logged in the LOG file.
- `DB::MultiGet` using `MultiRead`.
- Added `CompactionServiceJobStatus::kUseLocal` to instruct RocksDB to run the compaction locally instead of waiting for the remote compaction result.
- Added `RateLimiter::GetTotalPendingRequest(int64_t* total_pending_requests, const Env::IOPriority pri)` for the total number of requests that are pending for bytes in the rate limiter.
- `strict_capacity_limit=true` for the block cache, in addition to existing conditions that can trigger unbuffering.
- Changed `SstFileMetaData::size` from `size_t` to `uint64_t`.
- Extended `FlushJobInfo` and `CompactionJobInfo` in listener.h to provide information about the blob files generated by a flush/compaction and garbage collected during compaction in Integrated BlobDB. Added struct members `blob_file_addition_infos` and `blob_file_garbage_infos` that contain this information.
- Extended `output_file_names` of the `CompactFiles` API to also include the paths of the blob files generated by the compaction in Integrated BlobDB.
- `BackupEngine` functions now return `IOStatus` instead of `Status`. Most existing code should be compatible with this change, but some calls might need to be updated.
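For context on the `SstFileWriter` item above, here is a minimal (non-timestamped) write flow; the new timestamp-aware `Put`/`Delete` overloads additionally take a timestamp slice (path and keys illustrative):

```cpp
#include <cassert>
#include "rocksdb/env.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_writer.h"

int main() {
  rocksdb::Options options;
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);

  rocksdb::Status s = writer.Open("/tmp/example.sst");
  assert(s.ok());
  // Keys must be added in ascending order of the options' comparator.
  s = writer.Put("key1", "value1");
  assert(s.ok());
  s = writer.Delete("key2");
  assert(s.ok());
  s = writer.Finish();
  assert(s.ok());
  return 0;
}
```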
Published by ltamasi about 3 years ago
- Fixed destruction of `ColumnFamilyData` objects. The earlier logic unlocked the DB mutex before destroying the thread-local `SuperVersion` pointers, which could result in a process crash if another thread managed to get a reference to the `ColumnFamilyData` object.
- Stopped calling `RenameFile()` on a non-existent info log file ("LOG") when opening a new DB. Such a call was guaranteed to fail, though it did not impact applications since we swallowed the error. We also stopped swallowing errors when renaming the "LOG" file.
- Fixed a bug where `OnFlushCompleted` was not called for atomic flush.
- Fixed a bug in the `MultiGet` API when used with keys spanning multiple column families and `sorted_input == false`.
- `options.allow_fallocate=false`.
- `ReplayOptions` in `Replayer::Replay()`, or `--trace_replay_fast_forward` in db_bench.
- Added `LiveSstFilesSizeAtTemperature` to retrieve SST file sizes at different temperatures.
- Tickers `BLOB_DB_BLOB_FILE_BYTES_READ`, `BLOB_DB_GC_NUM_KEYS_RELOCATED`, and `BLOB_DB_GC_BYTES_RELOCATED`, as well as the histograms `BLOB_DB_COMPRESSION_MICROS` and `BLOB_DB_DECOMPRESSION_MICROS`.
- The C API's `rocksdb_filterpolicy_create_ribbon` is unchanged, but a new `rocksdb_filterpolicy_create_ribbon_hybrid` is added.
- Added `DB::NewDefaultReplayer()` to create a default Replayer instance. Added `TraceReader::Reset()` to restart reading a trace file. Created trace_record.h, trace_record_result.h, and utilities/replayer.h files to access the decoded Trace records, replay them, and query the actual operation results.
- Fixed a bug when `SetDBOptions()` does not change any option value.
- `StringAppendOperator` additionally accepts a string as the delimiter.
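For reference, the batched `MultiGet` overload involved in the fix above (`sorted_input == false`) is used roughly like this (path and keys illustrative):

```cpp
#include <cassert>
#include <vector>
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/mget_demo", &db);
  assert(s.ok());

  // Batched MultiGet: results are written into caller-provided arrays.
  std::vector<rocksdb::Slice> keys{"k1", "k2"};
  std::vector<rocksdb::PinnableSlice> values(keys.size());
  std::vector<rocksdb::Status> statuses(keys.size());
  db->MultiGet(rocksdb::ReadOptions(), db->DefaultColumnFamily(),
               keys.size(), keys.data(), values.data(), statuses.data(),
               /*sorted_input=*/false);
  delete db;
  return 0;
}
```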
Published by ltamasi about 3 years ago
- Stopped calling `RenameFile()` on a non-existent info log file ("LOG") when opening a new DB. Such a call was guaranteed to fail, though it did not impact applications since we swallowed the error. We also stopped swallowing errors when renaming the "LOG" file.
- Fixed a bug in the `MultiGet` API when used with keys spanning multiple column families and `sorted_input == false`.
- Fixed destruction of `ColumnFamilyData` objects. The earlier logic unlocked the DB mutex before destroying the thread-local `SuperVersion` pointers, which could result in a process crash if another thread managed to get a reference to the `ColumnFamilyData` object.
- Fixed a bug where `OnFlushCompleted` was not called for atomic flush.
- Fixed the `manifest_dump` `ldb` command.
- `GetLiveFilesMetaData()` now populates the `temperature`, `oldest_ancester_time`, and `file_creation_time` fields of its `LiveFileMetaData` results when the information is available. Previously these fields always contained zero, indicating unknown.
- Fixed a bug where `Get()` could return Status::OK() and an empty value for a non-existent key when `read_options.read_tier = kBlockCacheTier`.
- Fixed a bug where `get_context` didn't accumulate to statistics when a query failed.
- Added a new ldb command, `list_live_files_metadata`, that shows the live SST files, as well as their LSM storage level and the column family they belong to.
- Changed from `int` to `uint64_t` to support sub-compaction ids.
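A sketch of reading the newly populated `LiveFileMetaData` fields (path illustrative):

```cpp
#include <cassert>
#include <vector>
#include "rocksdb/db.h"
#include "rocksdb/metadata.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/meta_demo", &db);
  assert(s.ok());

  std::vector<rocksdb::LiveFileMetaData> files;
  db->GetLiveFilesMetaData(&files);
  for (const auto& f : files) {
    // These fields are populated when available; zero still means unknown.
    (void)f.temperature;
    (void)f.oldest_ancester_time;  // note: field name is spelled this way
    (void)f.file_creation_time;
  }
  delete db;
  return 0;
}
```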
Published by ajkr over 3 years ago
- `GetLiveFilesMetaData()` now populates the `temperature`, `oldest_ancester_time`, and `file_creation_time` fields of its `LiveFileMetaData` results when the information is available. Previously these fields always contained zero, indicating unknown.
- Fixed a bug where `DeleteFilesInRange()` may cause an ongoing compaction to report a corruption exception, or assert in debug builds. There is no actual data loss or corruption that we found.
- Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use Ribbon filters instead of Bloom, or `ribbonfilter` in place of `bloomfilter` in the configuration string.
- Allowed `DBWithTTL` to use the `DeleteRange` API just like other DBs. `DeleteRangeCF()`, which executes `WriteBatchInternal::DeleteRange()`, has been added to the handler in `DBWithTTLImpl::Write()` to implement it.
- Added a `cancel` field to `CompactRangeOptions`, allowing individual in-process manual range compactions to be cancelled.
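A sketch of switching to Ribbon filters as described above (the bits-per-key value is illustrative):

```cpp
#include <memory>
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

int main() {
  rocksdb::BlockBasedTableOptions table_options;
  // Drop-in replacement for NewBloomFilterPolicy(10.0): similar accuracy,
  // less memory, more CPU to construct.
  table_options.filter_policy.reset(rocksdb::NewRibbonFilterPolicy(10.0));

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  // ... open the DB with `options` as usual ...
  return 0;
}
```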
Published by akankshamahajan15 over 3 years ago
- Fixed a bug where the `GetLiveFiles()` output included a non-existent file called "OPTIONS-000000". Backups and checkpoints, which use `GetLiveFiles()`, failed on DBs impacted by this bug. Read-write DBs were impacted when the latest OPTIONS file failed to write and `fail_if_options_file_error == false`. Read-only DBs were impacted when no OPTIONS files existed.
- DB properties `rocksdb.cur-size-active-mem-table`, `rocksdb.cur-size-all-mem-tables`, and `rocksdb.size-all-mem-tables`.
- `ColumnFamilyOptions::sample_for_compression` now takes effect for the creation of all block-based tables. Previously it only took effect for block-based tables created by flush.
- `CompactFiles()` can no longer compact files from a lower level to an upper level, which has the risk of corrupting the DB (details: #8063). The validation is also added to all compactions.
- Use `strerror_r()` to get error messages.
- When the `Env` has the high-pri thread pool disabled (`Env::GetBackgroundThreads(Env::Priority::HIGH) == 0`).
- Allowed `DBOptions::max_open_files` to be set to a non-negative integer with `ColumnFamilyOptions::compaction_style = kCompactionStyleFIFO`.
- DB properties `rocksdb.cur-size-active-mem-table`, `rocksdb.cur-size-all-mem-tables`, and `rocksdb.size-all-mem-tables`.
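The memtable-size properties listed above can be read via `DB::GetProperty()`; a sketch (path illustrative):

```cpp
#include <cassert>
#include <string>
#include "rocksdb/db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/prop_demo", &db);
  assert(s.ok());

  std::string v;
  // Active memtable, active + unflushed immutable memtables, and
  // those plus pinned immutable memtables, respectively (in bytes).
  assert(db->GetProperty("rocksdb.cur-size-active-mem-table", &v));
  assert(db->GetProperty("rocksdb.cur-size-all-mem-tables", &v));
  assert(db->GetProperty("rocksdb.size-all-mem-tables", &v));
  delete db;
  return 0;
}
```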
- Use `yield` instead of `wfe` to relax the CPU, gaining better performance.
- Added `TableProperties::slow_compression_estimated_data_size` and `TableProperties::fast_compression_estimated_data_size`. When `ColumnFamilyOptions::sample_for_compression > 0`, they estimate what `TableProperties::data_size` would have been if the "fast" or "slow" (see the `ColumnFamilyOptions::sample_for_compression` API doc for definitions) compression had been used instead.
- Added `FlushReason::kWalFull`, which is reported when a memtable is flushed due to the WAL reaching its size limit; those flushes were previously reported as `FlushReason::kWriteBufferManager`. Also, changed the reason for flushes triggered by the write buffer manager to `FlushReason::kWriteBufferManager`; they were previously reported as `FlushReason::kWriteBufferFull`.

Published by zhichao-cao over 3 years ago
- Write stalls now begin only once `delayed_write_rate` is actually exceeded, with an initial burst allowance of 1 millisecond worth of bytes. Also, beyond the initial burst allowance, `delayed_write_rate` is now more strictly enforced, especially with multiple column families.
- Changed the default of `BackupableDBOptions::share_files_with_checksum` to `true` and deprecated `false` because of the potential for data loss. Note that accepting this change in behavior can temporarily increase backup data usage because files are not shared between backups using the two different settings. Also removed the obsolete option kFlagMatchInterimNaming.
- Added `FilterBlobByKey()` to `CompactionFilter`. Subclasses can override this method so that compaction filters can determine whether the actual blob value has to be read during compaction. Use the new `kUndetermined` in `CompactionFilter::Decision` to indicate that further action is necessary for the compaction filter to make a decision.

Published by jay-zhuang over 3 years ago
- Fixed a bug where `WRITE_PREPARED` and `WRITE_UNPREPARED` TransactionDB `MultiGet()` may return uncommitted data with a snapshot.
- `TransactionDB` returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees. There are certain cases where range deletion can still be used on such DBs; see the API doc on `TransactionDB::DeleteRange()` for details.
- `OptimisticTransactionDB` now returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees.
- In `DB::VerifyFileChecksums()`, we now fail with `Status::InvalidArgument` if the name of the checksum generator used for verification does not match the name of the checksum generator used for protecting the file when it was created.
- Fixed a bug in `ErrorHandler::SetBGError`.
- `WalAddition` and `WalDeletion` records could not be tolerated by older versions; fixed this by changing their encoded format to be ignorable by older versions.

Published by jay-zhuang over 3 years ago
- `TransactionDB` returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees. There are certain cases where range deletion can still be used on such DBs; see the API doc on `TransactionDB::DeleteRange()` for details.
- `OptimisticTransactionDB` now returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees.
- `WalAddition` and `WalDeletion` records could not be tolerated by older versions; fixed this by changing their encoded format to be ignorable by older versions.
- Writing a merge operand without a configured `merge_operator` now fails immediately, causing the DB to enter read-only mode. Previously, failure was deferred until the `merge_operator` was needed by a user read or a background operation.
- Fixed a bug in `ErrorHandler::SetBGError`.
- Fixed a WAL-gap issue when `WALRecoveryMode::kPointInTimeRecovery` is used. Gaps are still possible when WALs are truncated exactly on record boundaries; for complete protection, users should enable `track_and_verify_wals_in_manifest`.
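A sketch of enabling the WAL protection mentioned above (DB opening elided):

```cpp
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  // Track synced WAL numbers and sizes in the MANIFEST so that a missing
  // or truncated WAL is detected at recovery and reported as an error.
  options.track_and_verify_wals_in_manifest = true;
  options.wal_recovery_mode =
      rocksdb::WALRecoveryMode::kPointInTimeRecovery;
  // ... open the DB with `options` as usual ...
  return 0;
}
```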
- Fixed the handling of `read_amp_bytes_per_bit` during OPTIONS file parsing on big-endian architectures. Without this fix, the original code introduced in PR7659, when running on a big-endian machine, can mistakenly store read_amp_bytes_per_bit (a uint32) in little-endian format; future access to `read_amp_bytes_per_bit` will give wrong values. Little-endian architectures are not affected.
- `CompactRange` and `GetApproximateSizes`.
- Added `track_and_verify_wals_in_manifest`. If `true`, the log numbers and sizes of the synced WALs are tracked in the MANIFEST; then, during DB recovery, if a synced WAL is missing from disk, or the WAL's size does not match the recorded size in the MANIFEST, an error will be reported and the recovery will be aborted. Note that this option does not work with the secondary instance.

Published by ajkr over 3 years ago
- `TransactionDB` returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees. There are certain cases where range deletion can still be used on such DBs; see the API doc on `TransactionDB::DeleteRange()` for details.
- `OptimisticTransactionDB` now returns error `Status`es from calls to `DeleteRange()` and calls to `Write()` where the `WriteBatch` contains a range deletion. Previously such operations may have succeeded while not providing the expected transactional guarantees.

Published by ajkr almost 4 years ago