SPARK-1057 (alternative) Remove fastutil #266

srowen · 2014-03-28T21:57:25Z

(This is for discussion at this point -- I'm not suggesting this should be committed.)

This is what removing fastutil looks like. Much of it is straightforward, like using java.io buffered stream classes, and Guava for murmurhash3.

Uses of the FastByteArrayOutputStream were a little trickier. In only one case though do I think the change to use java.io actually entails an extra array copy.

The rest is using OpenHashMap and OpenHashSet. These are now written in terms of more scala-like operations.

OpenHashMap is where I made three non-trivial changes to make it work, and they need review:

It is no longer private
The key must be a ClassTag
Unless a lot of other code changes, the key type can't enforce being a supertype of Null

It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective.

But what about those last changes?

AmplabJenkins · 2014-03-28T22:02:22Z

Merged build triggered. Build is starting -or- tests failed to complete.

AmplabJenkins · 2014-03-28T22:02:29Z

Merged build started. Build is starting -or- tests failed to complete.

AmplabJenkins · 2014-03-28T22:57:00Z

Merged build finished. Build is starting -or- tests failed to complete.

AmplabJenkins · 2014-03-28T22:57:00Z

Build is starting -or- tests failed to complete.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13561/

srowen · 2014-03-30T20:28:37Z

Jenkins, retest this please.

AmplabJenkins · 2014-03-30T20:32:24Z

Merged build triggered. Build is starting -or- tests failed to complete.

AmplabJenkins · 2014-03-30T20:32:32Z

Merged build started. Build is starting -or- tests failed to complete.

AmplabJenkins · 2014-03-30T21:26:05Z

Merged build finished. Build is starting -or- tests failed to complete.

AmplabJenkins · 2014-03-30T21:26:06Z

Build is starting -or- tests failed to complete.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13587/

mateiz · 2014-04-05T01:32:45Z

streaming/src/main/scala/org/apache/spark/streaming/util/RawTextHelper.scala

Keep this code the way it was before, I think it was there for some stress tests that passed in lots of data, to make sure the parsing is not the bottleneck. Just switch the map over.

(I mean the while stuff above in particular, not the map.iterator)

mateiz · 2014-04-05T01:36:50Z

Hey Sean, I think this would be good to include. Made a few comments throughout it.

mateiz · 2014-04-05T01:37:42Z

core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala

Put a blank line before this

AmplabJenkins · 2014-04-05T16:47:23Z

Merged build triggered.

AmplabJenkins · 2014-04-05T16:47:30Z

Merged build started.

AmplabJenkins · 2014-04-05T17:42:18Z

Merged build finished.

AmplabJenkins · 2014-04-05T17:42:18Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13796/

pwendell · 2014-04-05T18:20:30Z

Hey @srowen does FastBufferedOutputStream offer any performance advantage over BufferedOutputStream that might cause regressions? We use these in a few places during the shuffle that are fairly performance sensitive (though maybe since it all ends up going through the OS/page cache the bottleneck isn't in Spark anyways). Just wondering if you had any knowledge of specific differences.

srowen · 2014-04-05T18:36:33Z

@pwendell As I saw it, the reason it was used was for the ability to access the internal byte[] buffer directly rather than a copy. However in 2 of 3 cases, it ended up copying anyway (the call to trim()). The third case was in Serializer.scala, in SerializerInstance.serializeMany. But even there, I think the copying of the byte[] is offset by the fact that it can be merely wrap()ed by a ByteBuffer and avoids ByteBuffer.allocate() -- another copy in disguise. From my read, that was the potential difference, and I don't think there ends up being a meaningful difference. I can't say I'm 100% certain, but feel pretty sure.

AmplabJenkins · 2014-04-05T20:22:23Z

Merged build triggered.

AmplabJenkins · 2014-04-05T20:22:31Z

Merged build started.

AmplabJenkins · 2014-04-05T21:30:16Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-04-05T21:30:16Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13800/

rxin · 2014-04-06T19:34:11Z

For the OpenHashMap null lower bound, it should be fine to drop the lower bound (based on my 30 sec check).

The original intention was that if the key is primitive (non-null), PrimitiveKeyOpenHashMap.scala should be used. Maybe we can have a factory method to help users choose.

mateiz · 2014-04-11T00:45:32Z

Cool, this looks good. I'll rerun the tests because Jenkins had some false positives in the past few days.

mateiz · 2014-04-11T00:45:37Z

Jenkins, test this please

AmplabJenkins · 2014-04-11T21:49:10Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14065/

pwendell · 2014-04-12T03:30:49Z

Jenkins, test this please.

AmplabJenkins · 2014-04-12T03:33:14Z

Merged build triggered.

AmplabJenkins · 2014-04-12T03:33:23Z

Merged build started.

AmplabJenkins · 2014-04-12T04:29:07Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-04-12T04:29:07Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14071/

pwendell · 2014-04-12T05:47:29Z

Thanks - I've merged this! Decided to pull it into 1.0 as well.

mridulm · 2014-04-12T06:22:15Z

I did not notice this earlier.
The toByteArray method is insanely expensive for anything nontrivial.
A better solution would be to replace use of ByteArrayOutputStream with an inhouse variant which allows us direct access to the byte[] - if we dont want to use fastutil.

Already we are hitting cases of the byteoutputstream failing due to 2G limit.
This PR will make us create two copies of the same : the performance implication of this is terrible

(This is for discussion at this point -- I'm not suggesting this should be committed.) This is what removing fastutil looks like. Much of it is straightforward, like using `java.io` buffered stream classes, and Guava for murmurhash3. Uses of the `FastByteArrayOutputStream` were a little trickier. In only one case though do I think the change to use `java.io` actually entails an extra array copy. The rest is using `OpenHashMap` and `OpenHashSet`. These are now written in terms of more scala-like operations. `OpenHashMap` is where I made three non-trivial changes to make it work, and they need review: - It is no longer private - The key must be a `ClassTag` - Unless a lot of other code changes, the key type can't enforce being a supertype of `Null` It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective. But what about those last changes? Author: Sean Owen <[email protected]> Closes #266 from srowen/SPARK-1057-alternate and squashes the following commits: 2601129 [Sean Owen] Fix Map return type error not previously caught ec65502 [Sean Owen] Updates from matei's review 00bc81e [Sean Owen] Remove use of fastutil and replace with use of java.io, spark.util and Guava classes (cherry picked from commit 165e06a) Signed-off-by: Patrick Wendell <[email protected]>

rxin · 2014-04-12T06:47:49Z

core/src/main/scala/org/apache/spark/scheduler/Task.scala

This does seem pretty bad ....

mridulm · 2014-04-12T07:02:35Z

I think we can replace it with a custom impl - where we decide that it is ok to "waste" some memory within some threshold in case the copy is much more expensive - particularly given that most of this is almost immediately used and thrown away.
For example, if the size > X bytes and wastage (capacity - size) < Y% of capacity.
What we save from reallocating and compaction is not worth the effort.

rxin · 2014-04-12T07:16:51Z

i will submit one soon

srowen · 2014-04-12T08:52:01Z

Hold up a sec -- the array copy is not new. It was merely hidden in the call to trim() before, or to ByteBuffer.allocate(). Yes, it's better to avoid it if possible. The problem is not just about getting at a byte[] but one of the right size. So at least I think this is no worse than it was before, no emergency just yet. Might be best to really fix this up per Mridul's suggestion to use Seq[ByteBuffer].

(This is for discussion at this point -- I'm not suggesting this should be committed.) This is what removing fastutil looks like. Much of it is straightforward, like using `java.io` buffered stream classes, and Guava for murmurhash3. Uses of the `FastByteArrayOutputStream` were a little trickier. In only one case though do I think the change to use `java.io` actually entails an extra array copy. The rest is using `OpenHashMap` and `OpenHashSet`. These are now written in terms of more scala-like operations. `OpenHashMap` is where I made three non-trivial changes to make it work, and they need review: - It is no longer private - The key must be a `ClassTag` - Unless a lot of other code changes, the key type can't enforce being a supertype of `Null` It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective. But what about those last changes? Author: Sean Owen <[email protected]> Closes apache#266 from srowen/SPARK-1057-alternate and squashes the following commits: 2601129 [Sean Owen] Fix Map return type error not previously caught ec65502 [Sean Owen] Updates from matei's review 00bc81e [Sean Owen] Remove use of fastutil and replace with use of java.io, spark.util and Guava classes

)

Add ansible functional testing jobs against shade or openstacksdk

…33] Backport insert operation lock (apache#197) * [HADP-40184]Backport insert operation lock (#15) [HADP-31946] Fix data duplicate on application retry and support concurrent write to different partitions in the same table.[HADP-33040][HADP-33041] Optimize merging staging files to output path and detect conflict with HDFS file lease. HADP-34738] During commitJob, merge paths with multi threads (apache#218) [HADP-36251] Enhance the concurrent lock mechanism for insert operation (apache#272) [HADP-37137] Add option to disable insert operation lock to write partitioned table (apache#286) * [HADP-46224] Do not overwrite the lock file when creating lock (apache#133) * [HADP-46868] Fix Spark merge path race condition (apache#161) * [HADP-50903] Ignore the error message if insert operation lock file has been deleted (apache#271) * [HADP-50733] Enhance the error message on picking insert operation lock failure (apache#267) * Fix * Fix * Fix * fix * Fix * Fix * Fix * Fix * Fix * [HADP-50574] Support to create the lock file for EC enabled path (apache#263) * [HADP-50574][FOLLOWUP] Add parameter type when getting overwrite method (apache#265) * [HADP-50574][FOLLOWUP] Add UT for creating ec disabled lock file and use underlying DistributedFileSystem for ViewFileSystem (apache#266) * Fix * Fix * Fix * [HADP-34612][FOLLOWUP] Do not show the insert local error by removing the being written stream from dfs client (apache#288) * Enabled Hadoop 3 --------- Co-authored-by: fwang12 <[email protected]>

srowen mentioned this pull request Mar 30, 2014

SPARK-1057: Upgrade fastutil to 6.5.11 #215

Closed

mateiz reviewed Apr 5, 2014
View reviewed changes

core/src/main/scala/org/apache/spark/util/collection/AppendOnlyMap.scala Outdated

Copy link

Contributor

mateiz Apr 5, 2014

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put a blank line before this

asfgit closed this in 165e06a Apr 12, 2014

rxin reviewed Apr 12, 2014
View reviewed changes

core/src/main/scala/org/apache/spark/scheduler/Task.scala

Copy link

Contributor

rxin Apr 12, 2014

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does seem pretty bad ....

rxin mentioned this pull request Apr 12, 2014

Added a FastByteArrayOutputStream that exposes the underlying array to avoid unnecessary mem copy. #397

Closed

srowen deleted the SPARK-1057-alternate branch April 12, 2014 14:20

erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request May 8, 2017

Correct readme links (apache#266)

8ccb305

erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request Jul 28, 2017

Correct readme links (apache#266)

c4f17b7

mccheah pushed a commit to mccheah/spark that referenced this pull request Oct 12, 2017

Move log capture to script so pipe will propagate failures (apache#266)

fdd2e01

Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018

Add edit window for history server and remove dead bash code (apache#266

5e68281

)

bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019

Merge pull request apache#266 from liu-sheng/ansible-support

bdfc7a8

Add ansible functional testing jobs against shade or openstacksdk

turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025

[CARMEL-6752] Support more subexpression elimination cases (apache#266)

2be1564

SPARK-1057 (alternative) Remove fastutil #266

SPARK-1057 (alternative) Remove fastutil #266

Uh oh!

Conversation

srowen commented Mar 28, 2014

Uh oh!

AmplabJenkins commented Mar 28, 2014

Uh oh!

AmplabJenkins commented Mar 28, 2014

Uh oh!

AmplabJenkins commented Mar 28, 2014

Uh oh!

AmplabJenkins commented Mar 28, 2014

Uh oh!

srowen commented Mar 30, 2014

Uh oh!

AmplabJenkins commented Mar 30, 2014

Uh oh!

AmplabJenkins commented Mar 30, 2014

Uh oh!

AmplabJenkins commented Mar 30, 2014

Uh oh!

AmplabJenkins commented Mar 30, 2014

Uh oh!

mateiz Apr 5, 2014

Choose a reason for hiding this comment

Uh oh!

mateiz Apr 5, 2014

Choose a reason for hiding this comment

Uh oh!

mateiz commented Apr 5, 2014

Uh oh!

mateiz Apr 5, 2014

Choose a reason for hiding this comment

Uh oh!

AmplabJenkins commented Apr 5, 2014

Uh oh!

AmplabJenkins commented Apr 5, 2014

Uh oh!

AmplabJenkins commented Apr 5, 2014

Uh oh!

AmplabJenkins commented Apr 5, 2014

Uh oh!

pwendell commented Apr 5, 2014

Uh oh!

srowen commented Apr 5, 2014

Uh oh!

AmplabJenkins commented Apr 5, 2014

Uh oh!

AmplabJenkins commented Apr 5, 2014

Uh oh!

AmplabJenkins commented Apr 5, 2014

Uh oh!

AmplabJenkins commented Apr 5, 2014

Uh oh!

rxin commented Apr 6, 2014

Uh oh!

mateiz commented Apr 11, 2014

Uh oh!

mateiz commented Apr 11, 2014

Uh oh!

AmplabJenkins commented Apr 11, 2014

Uh oh!

pwendell commented Apr 12, 2014

Uh oh!

AmplabJenkins commented Apr 12, 2014

Uh oh!

AmplabJenkins commented Apr 12, 2014

Uh oh!

AmplabJenkins commented Apr 12, 2014

Uh oh!

AmplabJenkins commented Apr 12, 2014

Uh oh!

pwendell commented Apr 12, 2014

Uh oh!

mridulm commented Apr 12, 2014

Uh oh!

rxin Apr 12, 2014

Choose a reason for hiding this comment

Uh oh!