SPARK-1057 (alternative) Remove fastutil #266
Merged build triggered. Build is starting -or- tests failed to complete.
Merged build started. Build is starting -or- tests failed to complete.
Merged build finished. Build is starting -or- tests failed to complete.
Build is starting -or- tests failed to complete.
Jenkins, retest this please.
Merged build triggered. Build is starting -or- tests failed to complete.
Merged build started. Build is starting -or- tests failed to complete.
Merged build finished. Build is starting -or- tests failed to complete.
Build is starting -or- tests failed to complete.
Keep this code the way it was before; I think it was there for some stress tests that passed in lots of data, to make sure the parsing is not the bottleneck. Just switch the map over.
(I mean the while stuff above in particular, not the map.iterator)
Hey Sean, I think this would be good to include. Made a few comments throughout it.
Put a blank line before this
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13796/
Hey @srowen does
@pwendell As I saw it, the reason it was used was for the ability to access the internal
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. All automated tests passed. |
|
All automated tests passed. |
|
For the OpenHashMap null lower bound, it should be fine to drop the lower bound (based on my 30 sec check). The original intention was that if the key is primitive (non-null), PrimitiveKeyOpenHashMap.scala should be used. Maybe we can have a factory method to help users choose. |
|
Cool, this looks good. I'll rerun the tests because Jenkins had some false positives in the past few days. |
|
Jenkins, test this please |
|
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14065/ |
|
Jenkins, test this please. |
|
Merged build triggered. |
|
Merged build started. |
|
Merged build finished. All automated tests passed. |
|
All automated tests passed. |
|
Thanks - I've merged this! Decided to pull it into 1.0 as well. |
|
I did not notice this earlier. We are already hitting cases of the byte output stream failing due to the 2G limit.
(This is for discussion at this point -- I'm not suggesting this should be committed.)

This is what removing fastutil looks like. Much of it is straightforward, like using `java.io` buffered stream classes, and Guava for murmurhash3.

Uses of the `FastByteArrayOutputStream` were a little trickier. In only one case though do I think the change to use `java.io` actually entails an extra array copy.

The rest is using `OpenHashMap` and `OpenHashSet`. These are now written in terms of more scala-like operations.

`OpenHashMap` is where I made three non-trivial changes to make it work, and they need review:

- It is no longer private
- The key must be a `ClassTag`
- Unless a lot of other code changes, the key type can't enforce being a supertype of `Null`

It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective. But what about those last changes?

Author: Sean Owen

Closes #266 from srowen/SPARK-1057-alternate and squashes the following commits:

2601129 [Sean Owen] Fix Map return type error not previously caught
ec65502 [Sean Owen] Updates from matei's review
00bc81e [Sean Owen] Remove use of fastutil and replace with use of java.io, spark.util and Guava classes

(cherry picked from commit 165e06a)
Signed-off-by: Patrick Wendell
This does seem pretty bad ....
I think we can replace it with a custom impl - where we decide that it is ok to "waste" some memory within some threshold in case the copy is much more expensive - particularly given that most of this is almost immediately used and thrown away.
I will submit one soon.
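The "waste some memory within a threshold" idea could be sketched roughly as follows. This is only an illustration and not the implementation that was later submitted: the class `ThresholdByteArrayOutputStream`, the method `toByteBuffer`, and the `maxWasteFraction` parameter are hypothetical names. It relies only on the protected `buf` and `count` fields of `java.io.ByteArrayOutputStream`.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical sketch; not part of Spark, fastutil, or the JDK. */
class ThresholdByteArrayOutputStream extends ByteArrayOutputStream {
    ThresholdByteArrayOutputStream(int initialCapacity) {
        super(initialCapacity);
    }

    /**
     * Returns the written bytes. If the unused tail of the internal buffer is
     * at most maxWasteFraction of its capacity, the buffer is wrapped without
     * copying (the view is only valid until the next write). Otherwise a
     * trimmed copy is made so the oversized buffer is not kept alive.
     */
    synchronized ByteBuffer toByteBuffer(double maxWasteFraction) {
        int unused = buf.length - count;
        if (unused <= maxWasteFraction * buf.length) {
            return ByteBuffer.wrap(buf, 0, count);          // zero-copy view
        }
        return ByteBuffer.wrap(Arrays.copyOf(buf, count));  // trimmed copy
    }

    /** True if the given buffer is a view over this stream's own array. */
    synchronized boolean sharesBufferWith(ByteBuffer b) {
        return b.array() == buf;
    }
}

class ThresholdBufferDemo {
    public static void main(String[] args) {
        ThresholdByteArrayOutputStream out = new ThresholdByteArrayOutputStream(100);
        for (int i = 0; i < 90; i++) {
            out.write(i);
        }
        ByteBuffer result = out.toByteBuffer(0.25);
        System.out.println("shared backing array: " + out.sharesBufferWith(result));
    }
}
```

The trade-off is that a zero-copy view pins the whole backing array in memory, which is acceptable when the result is consumed and discarded quickly, as the comment above suggests.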
Hold up a sec -- the array copy is not new. It was merely hidden in the call to
(This is for discussion at this point -- I'm not suggesting this should be committed.)

This is what removing fastutil looks like. Much of it is straightforward, like using `java.io` buffered stream classes, and Guava for murmurhash3.

Uses of the `FastByteArrayOutputStream` were a little trickier. In only one case though do I think the change to use `java.io` actually entails an extra array copy.

The rest is using `OpenHashMap` and `OpenHashSet`. These are now written in terms of more scala-like operations.

`OpenHashMap` is where I made three non-trivial changes to make it work, and they need review:

- It is no longer private
- The key must be a `ClassTag`
- Unless a lot of other code changes, the key type can't enforce being a supertype of `Null`

It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective.
But what about those last changes?
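
For readers unfamiliar with where an array copy can come from when using `java.io`, here is a minimal sketch. It shows only the standard behaviour of `java.io.ByteArrayOutputStream.toByteArray()`, which returns a newly allocated copy of the internal buffer. The helper class `ExposedByteArrayOutputStream` is hypothetical and exists only to make the protected `buf` field observable; it is not code from this pull request.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper used only to observe the protected internal buffer. */
class ExposedByteArrayOutputStream extends ByteArrayOutputStream {
    ExposedByteArrayOutputStream(int initialCapacity) {
        super(initialCapacity);
    }

    /** True if the given array is the stream's own backing array. */
    synchronized boolean isBackedBy(byte[] array) {
        return array == buf;
    }
}

class BufferCopyDemo {
    /** Builds a stream holding the bytes 0..n-1. */
    static ExposedByteArrayOutputStream filled(int n) {
        ExposedByteArrayOutputStream out = new ExposedByteArrayOutputStream(64);
        for (int i = 0; i < n; i++) {
            out.write(i);
        }
        return out;
    }

    public static void main(String[] args) {
        ExposedByteArrayOutputStream out = filled(10);
        byte[] first = out.toByteArray();
        byte[] second = out.toByteArray();
        // Each call allocates a fresh array of exactly size() bytes.
        System.out.println("same array twice: " + (first == second));
        System.out.println("is internal buffer: " + out.isBackedBy(first));
        System.out.println("length: " + first.length);
    }
}
```

Because each call copies `size()` bytes, code that previously read a stream's backing array directly may pay for one extra copy after switching to `toByteArray()`.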