@andrewor14
Contributor

UPDATE

I have removed the special handling for StorageLevel.MEMORY_*_SER for now, because it introduces a potential performance regression. With the latest changes, this PR should include mainly style (code readability) fixes. The only functionality change is the update in MemoryStore#putBytes to actually return updated blocks, though this is a minor bug fix.

Now this is mainly a precursor to another PR (once again).


Old comment

The deserialized version of a partition may occupy much more space than the serialized version. Therefore, if a partition is to be cached with StorageLevel.MEMORY_*_SER, we don't need to fully unroll it into an ArrayBuffer, but instead we can unroll it into a potentially much smaller ByteBuffer. This may save us from OOMs in this case.
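The idea in the old comment can be illustrated with a minimal Python sketch. This is not Spark's Scala code: `unroll_deserialized`, `unroll_serialized`, and `read_serialized` are hypothetical names, and `pickle` stands in for whatever serializer is configured. The first function keeps every element alive as an object (analogous to an `ArrayBuffer`), while the second keeps only one growing byte buffer (analogous to a `ByteBuffer`).

```python
import io
import pickle
from typing import Any, Iterable, Iterator, List


def unroll_deserialized(partition: Iterable[Any]) -> List[Any]:
    """Materialize every element as a live object (hypothetical helper)."""
    return list(partition)


def unroll_serialized(partition: Iterable[Any]) -> bytes:
    """Serialize elements one at a time into a single byte buffer.

    Only the buffer is retained; each element can be garbage collected
    as soon as it has been written.
    """
    buf = io.BytesIO()
    for element in partition:
        pickle.dump(element, buf)
    return buf.getvalue()


def read_serialized(data: bytes) -> Iterator[Any]:
    """Lazily deserialize the elements written by unroll_serialized."""
    buf = io.BytesIO(data)
    while True:
        try:
            yield pickle.load(buf)
        except EOFError:
            # No more pickled records in the buffer.
            return
```

Note that reading the values back from the serialized form requires a separate deserialization pass, which is the CPU cost traded for the smaller memory footprint.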

We only unroll the serialized form of each partition for this case,
because the deserialized form may be much larger and may not fit in
memory.

This commit also abstracts out part of the logic of getOrCompute to
make it more readable.
Previously we never returned the updated blocks in MemoryStore's
putBytes. This is a simple bug with a simple fix.
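As a rough illustration of why the return value matters, here is a minimal Python sketch, assuming that "updated blocks" means blocks evicted to make room for the new one. `TinyMemoryStore`, `PutResult`, and `dropped_blocks` are hypothetical names for this sketch and are not taken from the Spark source. If `put_bytes` discarded the `dropped` list instead of returning it, the caller would have no way to learn which blocks had left memory.

```python
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PutResult:
    size: int  # bytes stored for the new block (0 if it was not stored)
    dropped_blocks: List[Tuple[str, int]] = field(default_factory=list)


class TinyMemoryStore:
    """Toy in-memory block store with oldest-first eviction (illustrative only)."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.blocks: "OrderedDict[str, bytes]" = OrderedDict()

    def put_bytes(self, block_id: str, data: bytes) -> PutResult:
        # A block larger than the whole store can never fit; evict nothing.
        if len(data) > self.max_bytes:
            return PutResult(0, [])
        # Replacing an existing block frees its space first.
        if block_id in self.blocks:
            self.used_bytes -= len(self.blocks.pop(block_id))
        dropped: List[Tuple[str, int]] = []
        # Evict the oldest blocks until the new one fits.
        while self.used_bytes + len(data) > self.max_bytes:
            old_id, old_data = self.blocks.popitem(last=False)
            self.used_bytes -= len(old_data)
            dropped.append((old_id, len(old_data)))
        self.blocks[block_id] = data
        self.used_bytes += len(data)
        # Report the evicted blocks to the caller instead of discarding them.
        return PutResult(len(data), dropped)
```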
@andrewor14 changed the title from "[SPARK-1201] Do not fully materialize partitions for StorageLevel.MEMORY_*_SER" to "[SPARK-1201] Do not fully materialize partitions for MEMORY_SER" on Jun 14, 2014
@andrewor14 changed the title from "[SPARK-1201] Do not fully materialize partitions for MEMORY_SER" to "[SPARK-1201] Do not fully materialize partitions for MEMORY_*_SER" on Jun 14, 2014
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15783/

Contributor

It might make sense to remove this assume: if we add a new storage level in the future, it will no longer hold, and because this code is so far from the storage level code, we will likely forget to update this location.

@rxin
Contributor

rxin commented Jun 14, 2014

BTW can you construct a unit test for this in CacheManagerSuite?

It would also be good to add a unit test for the lock (which existed earlier but had no test). Thanks.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15822/

@andrewor14
Contributor Author

It's worth noting that the special handling of the memory serialized storage level actually introduces a regression. In particular, we now add an extra step at the end to deserialize the bytes, which could be slow for large partitions.

This special handling will most likely be superseded by a more general solution for SPARK-1777, which avoids unrolling an entire partition if there is not enough space for it, regardless of the storage level. For now, I will put this PR on hold.
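A hypothetical sketch of that more general idea, in Python rather than Spark's Scala and not the actual SPARK-1777 implementation: unroll the partition incrementally, track an estimated size, and fall back to a lazy iterator as soon as the estimate exceeds the available space. The name `unroll_safely`, the `free_bytes` parameter, and the pluggable `size_of` estimator are all assumptions made for this illustration.

```python
import itertools
import sys
from typing import Any, Callable, Iterable, Iterator, List, Tuple, Union


def unroll_safely(
    partition: Iterable[Any],
    free_bytes: int,
    size_of: Callable[[Any], int] = sys.getsizeof,
) -> Tuple[bool, Union[List[Any], Iterator[Any]]]:
    """Unroll a partition only while it still fits in free_bytes.

    Returns (True, list) if the whole partition fit, otherwise
    (False, iterator) where the iterator still yields every element
    in the original order.
    """
    it = iter(partition)
    unrolled: List[Any] = []
    used = 0
    for element in it:
        used += size_of(element)
        if used > free_bytes:
            # Not enough space: stop materializing and hand back the
            # elements seen so far followed by the untouched remainder.
            return False, itertools.chain(unrolled, [element], it)
        unrolled.append(element)
    return True, unrolled
```

The check is independent of how the values would later be stored, which is what makes the approach apply regardless of the storage level.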

This special handling sacrifices CPU cycles for memory usage by
introducing an additional step to deserialize the serialized bytes
put into BlockManager. This may cause a performance regression in
some cases.

For now, let's keep the functionality the same as before, and only
include style changes in this PR. This is a precursor to another
incoming PR that changes the way we unroll RDD partitions.
@andrewor14 changed the title from "[SPARK-1201] Do not fully materialize partitions for MEMORY_*_SER" to "[Minor] Clean up CacheManager, BlockStore, and MemoryStore" on Jun 20, 2014
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@andrewor14
Contributor Author

I have removed the special handling of the memory serialized case for the aforementioned reason. However, I would like to get the style changes into master, as I have a separate WIP PR that builds on top of this one.

@andrewor14 changed the title from "[Minor] Clean up CacheManager, BlockStore, and MemoryStore" to "Clean up CacheManager, BlockStore, and MemoryStore" on Jun 20, 2014
@andrewor14 changed the title from "Clean up CacheManager, BlockStore, and MemoryStore" to "[Minor] Clean up CacheManager, BlockStore, and MemoryStore" on Jun 20, 2014
@andrewor14 changed the title from "[Minor] Clean up CacheManager, BlockStore, and MemoryStore" to "[Minor] Clean up CacheManager et al." on Jun 20, 2014
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15933/

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15935/

@andrewor14 changed the title from "[Minor] Clean up CacheManager et al." to "Clean up CacheManager et al." on Jun 20, 2014
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@rxin
Contributor

rxin commented Jun 20, 2014

LGTM.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15972/

@rxin
Contributor

rxin commented Jun 21, 2014

Merged in master. Thanks!

@asfgit asfgit closed this in 01125a1 Jun 21, 2014
@andrewor14 andrewor14 deleted the unroll-them-partitions branch June 21, 2014 01:18
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
**UPDATE**

I have removed the special handling for `StorageLevel.MEMORY_*_SER` for now, because it introduces a potential performance regression. With the latest changes, this PR should include mainly style (code readability) fixes. The only functionality change is the update in `MemoryStore#putBytes` to actually return updated blocks, though this is a minor bug fix.

Now this is mainly a precursor to another PR (once again).

---------
*Old comment*

The deserialized version of a partition may occupy much more space than the serialized version. Therefore, if a partition is to be cached with `StorageLevel.MEMORY_*_SER`, we don't need to fully unroll it into an `ArrayBuffer`, but instead we can unroll it into a potentially much smaller `ByteBuffer`. This may save us from OOMs in this case.

Author: Andrew Or <[email protected]>

Closes apache#1083 from andrewor14/unroll-them-partitions and squashes the following commits:

7048aa0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions
3d9a366 [Andrew Or] Minor change for readability
d12b95f [Andrew Or] Remove unused imports (minor)
a4c387b [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions
cf5f565 [Andrew Or] Remove special handling for MEM_*_SER
0091ec0 [Andrew Or] Address review feedback
44ef282 [Andrew Or] Actually return updated blocks in putBytes
2941c89 [Andrew Or] Clean up BlockStore (minor)
a8f181d [Andrew Or] Add special handling for StorageLevel.MEMORY_*_SER
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
wangyum pushed a commit that referenced this pull request May 26, 2023