-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-20280][CORE] FileStatusCache Weigher integer overflow #17591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)).toInt | ||
| val estimate = (SizeEstimator.estimate(key) + SizeEstimator.estimate(value)) / weightScale | ||
| if (estimate > Int.MaxValue) { | ||
| throw new IllegalStateException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be an error. It's just a weight. Capping at Int.MaxValue is no problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's capped, the cache might use more memory than configured with spark.sql.hive.filesourcePartitionFileCacheSize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, though, this can only happen if the filesourcePartitionFileCacheSize is at least 64GB and some object is at least 64GB. The effect is to possibly cache things longer than they should, which seems better than failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I guess it's better to fail later than sooner. I made it a warning instead.
| * than the size of one [[FileStatus]]). | ||
| * so it will support objects up to 64GB in size. | ||
| */ | ||
| private val weightScale = 32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than make it a member variable, just a local variable in the initializer for cache?
|
Test build #75660 has finished for PR 17591 at commit
|
|
Test build #75663 has finished for PR 17591 at commit
|
| } | ||
| val fileStatusCache = FileStatusCache.getOrCreate(spark) | ||
| fileStatusCache.putLeafFiles(new Path("/tmp", "abc"), files.toArray) | ||
| // scalastyle:off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets remove this comment block, the JIRA should be used for tracking these things.
| .removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() { | ||
| override def onRemoval(removed: RemovalNotification[(ClientId, Path), Array[FileStatus]]) | ||
| private val cache: Cache[(ClientId, Path), Array[FileStatus]] = { | ||
| /* [[Weigher]].weigh returns Int so we could only cache objects < 2GB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: Could you use java style comments // here?
| } | ||
| } | ||
| }) | ||
| .removalListener(new RemovalListener[(ClientId, Path), Array[FileStatus]]() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kinda hard to read. Can we just initialize the weighter and the listener in separate variables?
|
LGTM - pending jenkins |
|
Test build #75665 has finished for PR 17591 at commit
|
|
Merging to master/2.1. Thanks! |
## What changes were proposed in this pull request? Weigher.weigh needs to return Int but it is possible for an Array[FileStatus] to have size > Int.maxValue. To avoid this, the size is scaled down by a factor of 32. The maximumWeight of the cache is also scaled down by the same factor. ## How was this patch tested? New test in FileIndexSuite Author: Bogdan Raducanu <[email protected]> Closes #17591 from bogdanrdc/SPARK-20280. (cherry picked from commit f6dd8e0) Signed-off-by: Herman van Hovell <[email protected]>
## What changes were proposed in this pull request? Weigher.weigh needs to return Int but it is possible for an Array[FileStatus] to have size > Int.maxValue. To avoid this, the size is scaled down by a factor of 32. The maximumWeight of the cache is also scaled down by the same factor. ## How was this patch tested? New test in FileIndexSuite Author: Bogdan Raducanu <[email protected]> Closes apache#17591 from bogdanrdc/SPARK-20280.
What changes were proposed in this pull request?
Weigher.weigh needs to return Int but it is possible for an Array[FileStatus] to have size > Int.maxValue. To avoid this, the size is scaled down by a factor of 32. The maximumWeight of the cache is also scaled down by the same factor.
How was this patch tested?
New test in FileIndexSuite