-
Couldn't load subscription status.
- Fork 3.4k
HBASE-27686: Recovery of BucketCache and Prefetched data after RS Crash #5080
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a great, yet simple solution for the persistent cache map. We could also solve the problem of evicted blocks by setting cacheDirty to true after we remove cache keys from the map here.
| if (ioEngine.isPersistent()){ | ||
| setCacheDirty(true); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point we are not guaranteed to have added the block, as it's just sent to the writer threads. We should be doing this here instead.
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/BucketCachePersister.java
Show resolved
Hide resolved
| private final BucketCacheStats cacheStats = new BucketCacheStats(); | ||
|
|
||
| private final String persistencePath; | ||
| boolean isCacheDirty; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use AtomicBoolean?
| this.backingMap = new ConcurrentHashMap<>((int) blockNumCapacity); | ||
|
|
||
| if (ioEngine.isPersistent() && persistencePath != null) { | ||
| BucketCachePersister cachePersister = new BucketCachePersister(this, 1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this interval configurable. Maybe worth decreasing this default.
| Thread.sleep(1000); | ||
| } | ||
| // Kill the RS | ||
| cluster.killRegionServer(cluster.getRegionServer(0).getServerName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had noticed that even when using this killRegionServer, we still go through the normal shutdown path for block cache, so we can't really rely on this. But since the main logic we want to test is that updates happen while the RS is running, we could check for the file existence without killing the RS. Also, can we validate the contents of the persisted file? At this point, I think we should certify that all the cache entries are there.
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are almost there, just some minor comments on the tests.
| prefetchPersistencePath = testDir + "/prefetch.persistence"; | ||
| conf.set(CacheConfig.PREFETCH_PERSISTENCE_PATH_KEY, prefetchPersistencePath); | ||
| FileSystem fs = HFileSystem.get(conf); | ||
| long BUCKETCACHE_PERSIST_INTERVAL = 3000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we decrease this interval to a sub-second value and make this test more expedite?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So writeStoreFile() and readStoreFile() methods take at least 2400 millis. So I used a value higher than that.
| prefetchPersistencePath = testDir + "/prefetch.persistence"; | ||
| conf.set(CacheConfig.PREFETCH_PERSISTENCE_PATH_KEY, prefetchPersistencePath); | ||
| FileSystem fs = HFileSystem.get(conf); | ||
| long BUCKETCACHE_PERSIST_INTERVAL = 3000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use camel case for local variable naming.
| Path storeFile2 = writeStoreFile("TestPrefetch1", conf, cacheConf, fs); | ||
| readStoreFile(storeFile, 0, fs, cacheConf, conf, bucketCache); | ||
| readStoreFile(storeFile2, 0, fs, cacheConf, conf, bucketCache); | ||
| Thread.sleep(BUCKETCACHE_PERSIST_INTERVAL - startTime + System.currentTimeMillis() + 100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So readStoreFile makes sure to only return once prefetch is complete. Do we still this extra (- startTime + System.currentTimeMillis()), plus the 100 milis to pass this test?
| final int writerQLen = BucketCache.DEFAULT_WRITER_QUEUE_ITEMS; | ||
|
|
||
| @Test | ||
| public void testPrefetchPersistenceCrash() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two test methods are almost identical. Can we do some code reuse with some lambdas for the different pre-conditions/predicates?
|
Lets also update the flag for evictions. That way, we would sort the need for update on evictions as well. |
e252244 to
52b230b
Compare
|
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there, just a few more nits and fix spotbugs warnings.
| cache.setCacheInconsistent(false); | ||
| } | ||
| } catch (IOException | InterruptedException e) { | ||
| LOG.info("Exception in BucketCachePersister" + e.getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should be a WARN.
| private static final Logger LOG = LoggerFactory.getLogger(BucketCachePersister.class); | ||
|
|
||
| public BucketCachePersister(BucketCache cache, long intervalMillis) { | ||
| this.cache = cache; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Let's set a meaningful thread name for this one.
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
…sh (#5080) Signed-off-by: Wellington Ramos Chevreuil <[email protected]> (cherry picked from commit 58cb1f4)
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
|
💔 -1 overall
This message was automatically generated. |
…tched data after RS Crash (apache#5080) Signed-off-by: Wellington Ramos Chevreuil <[email protected]> (cherry picked from commit 58cb1f4) Change-Id: I0d446b51f16f11d91ab8a6619ce5b01f8de7da91
https://issues.apache.org/jira/browse/HBASE-27686