Skip to content

Commit f6d7ec1

Browse files
pdillingerfacebook-github-bot
authored andcommitted
Ignore total_order_seek in DB::Get (#9427)
Summary: Apparently setting total_order_seek=true for DB::Get was intended to allow accurate read semantics if the current prefix extractor doesn't match what was used to generate SST files on disk. But since prefix_extractor was made a mutable option in 5.14.0, we have been able to detect this case and provide the correct semantics regardless of the total_order_seek option. Since that time, the option has only made Get() slower in a reasonably common case: prefix_extractor unchanged and whole_key_filtering=false. So this change primarily removes unnecessary effect of total_order_seek on Get. Also cleans up some related comments. Also adds a -total_order_seek option to db_bench and canonicalizes handling of ReadOptions in db_bench so that command line options have the expected association with library features. (There is potential for change in regression test behavior, but the old behavior is likely indefensible, or some other inconsistency would need to be fixed.) TODO in follow-up work: there should be no reason for Get() to depend on current prefix extractor at all. Pull Request resolved: #9427 Test Plan: Unit tests updated. Performance (using db_bench update) Create DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12 -whole_key_filtering=0` Test with and without `-total_order_seek` on `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=readrandom -num=10000000 -duration=40 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12` Before this change, total_order_seek=false: 25188 ops/sec Before this change, total_order_seek=true: 1222 ops/sec (~20x slower) After this change, total_order_seek=false: 24570 ops/sec After this change, total_order_seek=true: 25012 ops/sec (indistinguishable) Reviewed By: siying Differential Revision: D33753458 Pulled By: pdillinger fbshipit-source-id: bf892f34907a5e407d9c40bd4d42f0adbcbe0014
1 parent c7ce03d commit f6d7ec1

File tree

6 files changed

+97
-86
lines changed

6 files changed

+97
-86
lines changed

HISTORY.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@
2323

2424
### Behavior Changes
2525
* Disallow the combination of DBOptions.use_direct_io_for_flush_and_compaction == true and DBOptions.writable_file_max_buffer_size == 0. This combination can cause WritableFileWriter::Append() to loop forever, and it does not make much sense in direct IO.
26+
* `ReadOptions::total_order_seek` no longer affects `DB::Get()`. The original motivation for this interaction has been obsolete since RocksDB has been able to detect whether the current prefix extractor is compatible with that used to generate table files, probably RocksDB 5.14.0.
2627

2728
## 6.29.0 (01/21/2022)
2829
Note: The next release will be major release 7.0. See https://github.com/facebook/rocksdb/issues/9390 for more info.

db/db_bloom_filter_test.cc

Lines changed: 30 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -217,11 +217,23 @@ TEST_F(DBBloomFilterTest, GetFilterByPrefixBloomCustomPrefixExtractor) {
217217
(*(get_perf_context()->level_to_perf_context))[0].bloom_filter_useful);
218218

219219
ro.total_order_seek = true;
220-
ASSERT_TRUE(db_->Get(ro, "foobarbar", &value).IsNotFound());
221-
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 2);
220+
// NOTE: total_order_seek no longer affects Get()
221+
ASSERT_EQ("NOT_FOUND", Get("foobarbar"));
222+
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 3);
222223
ASSERT_EQ(
223-
2,
224+
3,
225+
(*(get_perf_context()->level_to_perf_context))[0].bloom_filter_useful);
226+
227+
// No bloom on extractor changed
228+
#ifndef ROCKSDB_LITE
229+
ASSERT_OK(db_->SetOptions({{"prefix_extractor", "capped:10"}}));
230+
ASSERT_EQ("NOT_FOUND", Get("foobarbar"));
231+
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 3);
232+
ASSERT_EQ(
233+
3,
224234
(*(get_perf_context()->level_to_perf_context))[0].bloom_filter_useful);
235+
#endif // ROCKSDB_LITE
236+
225237
get_perf_context()->Reset();
226238
}
227239
}
@@ -268,11 +280,23 @@ TEST_F(DBBloomFilterTest, GetFilterByPrefixBloom) {
268280
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 2);
269281

270282
ro.total_order_seek = true;
271-
ASSERT_TRUE(db_->Get(ro, "foobarbar", &value).IsNotFound());
272-
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 2);
283+
// NOTE: total_order_seek no longer affects Get()
284+
ASSERT_EQ("NOT_FOUND", Get("foobarbar"));
285+
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 3);
273286
ASSERT_EQ(
274-
2,
287+
3,
288+
(*(get_perf_context()->level_to_perf_context))[0].bloom_filter_useful);
289+
290+
// No bloom on extractor changed
291+
#ifndef ROCKSDB_LITE
292+
ASSERT_OK(db_->SetOptions({{"prefix_extractor", "capped:10"}}));
293+
ASSERT_EQ("NOT_FOUND", Get("foobarbar"));
294+
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 3);
295+
ASSERT_EQ(
296+
3,
275297
(*(get_perf_context()->level_to_perf_context))[0].bloom_filter_useful);
298+
#endif // ROCKSDB_LITE
299+
276300
get_perf_context()->Reset();
277301
}
278302
}

include/rocksdb/options.h

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1499,8 +1499,7 @@ struct ReadOptions {
14991499
// used in the table. Some table format (e.g. plain table) may not support
15001500
// this option.
15011501
// If true when calling Get(), we also skip prefix bloom when reading from
1502-
// block based table. It provides a way to read existing data after
1503-
// changing implementation of prefix extractor.
1502+
// block based table, which only affects Get() performance.
15041503
// Default: false
15051504
bool total_order_seek;
15061505

table/block_based/block_based_table_reader.cc

Lines changed: 21 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -122,9 +122,8 @@ void ReleaseCachedEntry(void* arg, void* h) {
122122
cache->Release(handle, false /* force_erase */);
123123
}
124124

125-
// For hash based index, return true if prefix_extractor and
126-
// prefix_extractor_block mismatch, false otherwise. This flag will be used
127-
// as total_order_seek via NewIndexIterator
125+
// For hash based index, return false if table_properties->prefix_extractor_name
126+
// and prefix_extractor both exist and match, otherwise true.
128127
inline bool PrefixExtractorChangedHelper(
129128
const TableProperties* table_properties,
130129
const SliceTransform* prefix_extractor) {
@@ -616,19 +615,21 @@ Status BlockBasedTable::Open(
616615
"version of RocksDB?");
617616
}
618617

619-
// We've successfully read the footer. We are ready to serve requests.
620-
// Better not mutate rep_ after the creation. eg. internal_prefix_transform
621-
// raw pointer will be used to create HashIndexReader, whose reset may
622-
// access a dangling pointer.
623618
BlockCacheLookupContext lookup_context{TableReaderCaller::kPrefetch};
624619
Rep* rep = new BlockBasedTable::Rep(ioptions, env_options, table_options,
625620
internal_comparator, skip_filters,
626621
file_size, level, immortal_table);
627622
rep->file = std::move(file);
628623
rep->footer = footer;
629624
rep->hash_index_allow_collision = table_options.hash_index_allow_collision;
625+
626+
// We've successfully read the footer. We are ready to serve requests.
627+
// Better not mutate rep_ after the creation. eg. internal_prefix_transform
628+
// raw pointer will be used to create HashIndexReader, whose reset may
629+
// access a dangling pointer.
630630
// We need to wrap data with internal_prefix_transform to make sure it can
631631
// handle prefix correctly.
632+
// FIXME: is changed prefix_extractor handled anywhere for hash index?
632633
if (prefix_extractor != nullptr) {
633634
rep->internal_prefix_transform.reset(
634635
new InternalKeySliceTransform(prefix_extractor.get()));
@@ -2212,8 +2213,7 @@ FragmentedRangeTombstoneIterator* BlockBasedTable::NewRangeTombstoneIterator(
22122213
}
22132214

22142215
bool BlockBasedTable::FullFilterKeyMayMatch(
2215-
const ReadOptions& read_options, FilterBlockReader* filter,
2216-
const Slice& internal_key, const bool no_io,
2216+
FilterBlockReader* filter, const Slice& internal_key, const bool no_io,
22172217
const SliceTransform* prefix_extractor, GetContext* get_context,
22182218
BlockCacheLookupContext* lookup_context) const {
22192219
if (filter == nullptr || filter->IsBlockBased()) {
@@ -2228,13 +2228,14 @@ bool BlockBasedTable::FullFilterKeyMayMatch(
22282228
may_match =
22292229
filter->KeyMayMatch(user_key_without_ts, prefix_extractor, kNotValid,
22302230
no_io, const_ikey_ptr, get_context, lookup_context);
2231-
} else if (!read_options.total_order_seek &&
2232-
!PrefixExtractorChanged(prefix_extractor) &&
2231+
} else if (!PrefixExtractorChanged(prefix_extractor) &&
22332232
prefix_extractor->InDomain(user_key_without_ts) &&
22342233
!filter->PrefixMayMatch(
22352234
prefix_extractor->Transform(user_key_without_ts),
22362235
prefix_extractor, kNotValid, no_io, const_ikey_ptr,
22372236
get_context, lookup_context)) {
2237+
// FIXME ^^^: there should be no reason for Get() to depend on current
2238+
// prefix_extractor at all. It should always use table_prefix_extractor.
22382239
may_match = false;
22392240
}
22402241
if (may_match) {
@@ -2245,8 +2246,7 @@ bool BlockBasedTable::FullFilterKeyMayMatch(
22452246
}
22462247

22472248
void BlockBasedTable::FullFilterKeysMayMatch(
2248-
const ReadOptions& read_options, FilterBlockReader* filter,
2249-
MultiGetRange* range, const bool no_io,
2249+
FilterBlockReader* filter, MultiGetRange* range, const bool no_io,
22502250
const SliceTransform* prefix_extractor,
22512251
BlockCacheLookupContext* lookup_context) const {
22522252
if (filter == nullptr || filter->IsBlockBased()) {
@@ -2269,8 +2269,9 @@ void BlockBasedTable::FullFilterKeysMayMatch(
22692269
PERF_COUNTER_BY_LEVEL_ADD(bloom_filter_useful, filtered_keys,
22702270
rep_->level);
22712271
}
2272-
} else if (!read_options.total_order_seek &&
2273-
!PrefixExtractorChanged(prefix_extractor)) {
2272+
} else if (!PrefixExtractorChanged(prefix_extractor)) {
2273+
// FIXME ^^^: there should be no reason for MultiGet() to depend on current
2274+
// prefix_extractor at all. It should always use table_prefix_extractor.
22742275
filter->PrefixesMayMatch(range, prefix_extractor, kNotValid, false,
22752276
lookup_context);
22762277
RecordTick(rep_->ioptions.stats, BLOOM_FILTER_PREFIX_CHECKED, before_keys);
@@ -2308,9 +2309,8 @@ Status BlockBasedTable::Get(const ReadOptions& read_options, const Slice& key,
23082309
read_options.snapshot != nullptr;
23092310
}
23102311
TEST_SYNC_POINT("BlockBasedTable::Get:BeforeFilterMatch");
2311-
const bool may_match =
2312-
FullFilterKeyMayMatch(read_options, filter, key, no_io, prefix_extractor,
2313-
get_context, &lookup_context);
2312+
const bool may_match = FullFilterKeyMayMatch(
2313+
filter, key, no_io, prefix_extractor, get_context, &lookup_context);
23142314
TEST_SYNC_POINT("BlockBasedTable::Get:AfterFilterMatch");
23152315
if (!may_match) {
23162316
RecordTick(rep_->ioptions.stats, BLOOM_FILTER_USEFUL);
@@ -2496,8 +2496,8 @@ void BlockBasedTable::MultiGet(const ReadOptions& read_options,
24962496
BlockCacheLookupContext lookup_context{
24972497
TableReaderCaller::kUserMultiGet, tracing_mget_id,
24982498
/*get_from_user_specified_snapshot=*/read_options.snapshot != nullptr};
2499-
FullFilterKeysMayMatch(read_options, filter, &sst_file_range, no_io,
2500-
prefix_extractor, &lookup_context);
2499+
FullFilterKeysMayMatch(filter, &sst_file_range, no_io, prefix_extractor,
2500+
&lookup_context);
25012501

25022502
if (!sst_file_range.empty()) {
25032503
IndexBlockIter iiter_on_stack;
@@ -3110,12 +3110,6 @@ Status BlockBasedTable::CreateIndexReader(
31103110
InternalIterator* meta_iter, bool use_cache, bool prefetch, bool pin,
31113111
BlockCacheLookupContext* lookup_context,
31123112
std::unique_ptr<IndexReader>* index_reader) {
3113-
// kHashSearch requires non-empty prefix_extractor but bypass checking
3114-
// prefix_extractor here since we have no access to MutableCFOptions.
3115-
// Add need_upper_bound_check flag in BlockBasedTable::NewIndexIterator.
3116-
// If prefix_extractor does not match prefix_extractor_name from table
3117-
// properties, turn off Hash Index by setting total_order_seek to true
3118-
31193113
switch (rep_->index_type) {
31203114
case BlockBasedTableOptions::kTwoLevelIndexSearch: {
31213115
return PartitionIndexReader::Create(this, ro, prefetch_buffer, use_cache,
@@ -3133,6 +3127,7 @@ Status BlockBasedTable::CreateIndexReader(
31333127
std::unique_ptr<Block> metaindex_guard;
31343128
std::unique_ptr<InternalIterator> metaindex_iter_guard;
31353129
bool should_fallback = false;
3130+
// FIXME: is changed prefix_extractor handled anywhere for hash index?
31363131
if (rep_->internal_prefix_transform.get() == nullptr) {
31373132
ROCKS_LOG_WARN(rep_->ioptions.logger,
31383133
"No prefix extractor passed in. Fall back to binary"

table/block_based/block_based_table_reader.h

Lines changed: 4 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -439,15 +439,13 @@ class BlockBasedTable : public TableReader {
439439
BlockCacheLookupContext* lookup_context,
440440
std::unique_ptr<IndexReader>* index_reader);
441441

442-
bool FullFilterKeyMayMatch(const ReadOptions& read_options,
443-
FilterBlockReader* filter, const Slice& user_key,
442+
bool FullFilterKeyMayMatch(FilterBlockReader* filter, const Slice& user_key,
444443
const bool no_io,
445444
const SliceTransform* prefix_extractor,
446445
GetContext* get_context,
447446
BlockCacheLookupContext* lookup_context) const;
448447

449-
void FullFilterKeysMayMatch(const ReadOptions& read_options,
450-
FilterBlockReader* filter, MultiGetRange* range,
448+
void FullFilterKeysMayMatch(FilterBlockReader* filter, MultiGetRange* range,
451449
const bool no_io,
452450
const SliceTransform* prefix_extractor,
453451
BlockCacheLookupContext* lookup_context) const;
@@ -505,8 +503,8 @@ class BlockBasedTable : public TableReader {
505503
void DumpKeyValue(const Slice& key, const Slice& value,
506504
std::ostream& out_stream);
507505

508-
// Returns true if prefix_extractor is compatible with that used in building
509-
// the table file.
506+
// Returns false if prefix_extractor exists and is compatible with that used
507+
// in building the table file, otherwise true.
510508
bool PrefixExtractorChanged(const SliceTransform* prefix_extractor) const;
511509

512510
// A cumulative data block file read in MultiGet lower than this size will

0 commit comments

Comments
 (0)