-
Notifications
You must be signed in to change notification settings - Fork 28.9k
Clean up CacheManager et al. #1083
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
We only unroll the serialized form of each partition for this case, because the deserialized form may be much larger and may not fit in memory. This commit also abstracts out part of the logic of getOrCompute to make it more readable.
Previously we never returned the updated blocks in MemoryStore's putBytes. This is a simple bug with a simple fix.
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. All automated tests passed. |
|
All automated tests passed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might make sense to remove this assume; in case we add a new storage level in the future, this won't hold any more and because this code is so far away from the storage level code, we will likely forget to update this location.
|
BTW can you construct a unit test for this in CacheManagerSuite? Would be good also to add a unit test to test the lock (which existed earlier but had no test for it). Thanks. |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. All automated tests passed. |
|
All automated tests passed. |
|
It's worth noting that the special handling of the memory serialized storage level actually introduces a regression. In particular, now we add an extra step to deserialize the bytes in the end, which could be slow for large partitions. This special handling will most likely be superseded by a more general solution for SPARK-1777, which avoids unrolling an entire partition if there is not enough space for it, regardless of the storage level. For now, I will put this PR on hold. |
This special handling sacrifices CPU cycles for memory usage by introducing an additional step to deserialize the serialized bytes put into BlockManager. This may cause a performance regression in some cases. For now, let's keep the functionality the same as before, and only include style changes in this PR. This is a precursor to another incoming PR that changes the way unroll RDD partitions.
|
Merged build triggered. |
|
Merged build started. |
|
I have removed the special handling of the memory serialized case for the aforementioned reason. However, I would like get the style changes in master as I have a separate WIP PR that builds on top of this one. |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. All automated tests passed. |
1 similar comment
|
Merged build finished. All automated tests passed. |
|
All automated tests passed. |
|
All automated tests passed. |
|
Merged build triggered. |
|
Merged build started. |
|
LGTM. |
|
Merged build finished. All automated tests passed. |
|
All automated tests passed. |
|
Merged in master. Thanks! |
**UPDATE** I have removed the special handling for `StorageLevel.MEMORY_*_SER` for now, because it introduces a potential performance regression. With the latest changes, this PR should include mainly style (code readability) fixes. The only functionality change is the update in `MemoryStore#putBytes` to actually return updated blocks, though this is a minor bug fix. Now this is mainly a precursor to another PR (once again). --------- *Old comment* The deserialized version of a partition may occupy much more space than the serialized version. Therefore, if a partition is to be cached with `StorageLevel.MEMORY_*_SER`, we don't need to fully unroll it into an `ArrayBuffer`, but instead we can unroll it into a potentially much smaller `ByteBuffer`. This may save us from OOMs in this case. Author: Andrew Or <[email protected]> Closes apache#1083 from andrewor14/unroll-them-partitions and squashes the following commits: 7048aa0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions 3d9a366 [Andrew Or] Minor change for readability d12b95f [Andrew Or] Remove unused imports (minor) a4c387b [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions cf5f565 [Andrew Or] Remove special handling for MEM_*_SER 0091ec0 [Andrew Or] Address review feedback 44ef282 [Andrew Or] Actually return updated blocks in putBytes 2941c89 [Andrew Or] Clean up BlockStore (minor) a8f181d [Andrew Or] Add special handling for StorageLevel.MEMORY_*_SER
**UPDATE** I have removed the special handling for `StorageLevel.MEMORY_*_SER` for now, because it introduces a potential performance regression. With the latest changes, this PR should include mainly style (code readability) fixes. The only functionality change is the update in `MemoryStore#putBytes` to actually return updated blocks, though this is a minor bug fix. Now this is mainly a precursor to another PR (once again). --------- *Old comment* The deserialized version of a partition may occupy much more space than the serialized version. Therefore, if a partition is to be cached with `StorageLevel.MEMORY_*_SER`, we don't need to fully unroll it into an `ArrayBuffer`, but instead we can unroll it into a potentially much smaller `ByteBuffer`. This may save us from OOMs in this case. Author: Andrew Or <[email protected]> Closes apache#1083 from andrewor14/unroll-them-partitions and squashes the following commits: 7048aa0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions 3d9a366 [Andrew Or] Minor change for readability d12b95f [Andrew Or] Remove unused imports (minor) a4c387b [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions cf5f565 [Andrew Or] Remove special handling for MEM_*_SER 0091ec0 [Andrew Or] Address review feedback 44ef282 [Andrew Or] Actually return updated blocks in putBytes 2941c89 [Andrew Or] Clean up BlockStore (minor) a8f181d [Andrew Or] Add special handling for StorageLevel.MEMORY_*_SER
UPDATE
I have removed the special handling for
StorageLevel.MEMORY_*_SERfor now, because it introduces a potential performance regression. With the latest changes, this PR should include mainly style (code readability) fixes. The only functionality change is the update inMemoryStore#putBytesto actually return updated blocks, though this is a minor bug fix.Now this is mainly a precursor to another PR (once again).
Old comment
The deserialized version of a partition may occupy much more space than the serialized version. Therefore, if a partition is to be cached with
StorageLevel.MEMORY_*_SER, we don't need to fully unroll it into anArrayBuffer, but instead we can unroll it into a potentially much smallerByteBuffer. This may save us from OOMs in this case.