You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[SPARK-39983][CORE][SQL] Do not cache unserialized broadcast relations on the driver
### What changes were proposed in this pull request?
This PR addresses the issue raised in https://issues.apache.org/jira/browse/SPARK-39983 - broadcast relations should not be cached on the driver as they are not needed and can cause significant memory pressure (in one case the relation was 60MB )
The PR adds a new SparkContext.broadcastInternal method with parameter serializedOnly allowing the caller to specify that the broadcasted object should be stored only in serialized form. The current behavior is to also cache an unserialized form of the object.
The PR changes the broadcast implementation in TorrentBroadcast to honor the serializedOnly flag and not store the unserialized value, unless the execution is in a local mode (single process). In that case the broadcast cache is effectively shared between driver and executors and thus the unserialized value needs to be cached to satisfy the executor-side of the functionality.
### Why are the changes needed?
The broadcast relations can be fairly large (observed 60MB one) and are not needed in unserialized form on the driver.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new unit test to BroadcastSuite verifying the low-level broadcast functionality in respect to the serializedOnly flag.
Added a new unit test to BroadcastExchangeSuite verifying that broadcasted relations are not cached on the driver.
Closes#37413 from alex-balikov/SPARK-39983-broadcast-no-cache.
Lead-authored-by: Alex Balikov <[email protected]>
Co-authored-by: Josh Rosen <[email protected]>
Signed-off-by: Josh Rosen <[email protected]>
0 commit comments