Skip to content

Conversation

@srowen
Copy link
Member

@srowen srowen commented Mar 28, 2014

(This is for discussion at this point -- I'm not suggesting this should be committed.)

This is what removing fastutil looks like. Much of it is straightforward, like using java.io buffered stream classes, and Guava for murmurhash3.

Uses of the FastByteArrayOutputStream were a little trickier. In only one case though do I think the change to use java.io actually entails an extra array copy.

The rest is using OpenHashMap and OpenHashSet. These are now written in terms of more scala-like operations.

OpenHashMap is where I made three non-trivial changes to make it work, and they need review:

  • It is no longer private
  • The key must be a ClassTag
  • Unless a lot of other code changes, the key type can't enforce being a supertype of Null

It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective.

But what about those last changes?

@AmplabJenkins
Copy link

Merged build triggered. Build is starting -or- tests failed to complete.

@AmplabJenkins
Copy link

Merged build started. Build is starting -or- tests failed to complete.

@AmplabJenkins
Copy link

Merged build finished. Build is starting -or- tests failed to complete.

@AmplabJenkins
Copy link

Build is starting -or- tests failed to complete.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13561/

@srowen
Copy link
Member Author

srowen commented Mar 30, 2014

Jenkins, retest this please.

@AmplabJenkins
Copy link

Merged build triggered. Build is starting -or- tests failed to complete.

@AmplabJenkins
Copy link

Merged build started. Build is starting -or- tests failed to complete.

@AmplabJenkins
Copy link

Merged build finished. Build is starting -or- tests failed to complete.

@AmplabJenkins
Copy link

Build is starting -or- tests failed to complete.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13587/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep this code the way it was before, I think it was there for some stress tests that passed in lots of data, to make sure the parsing is not the bottleneck. Just switch the map over.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I mean the while stuff above in particular, not the map.iterator)

@mateiz
Copy link
Contributor

mateiz commented Apr 5, 2014

Hey Sean, I think this would be good to include. Made a few comments throughout it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put a blank line before this

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13796/

@pwendell
Copy link
Contributor

pwendell commented Apr 5, 2014

Hey @srowen does FastBufferedOutputStream offer any performance advantage over BufferedOutputStream that might cause regressions? We use these in a few places during the shuffle that are fairly performance sensitive (though maybe since it all ends up going through the OS/page cache the bottleneck isn't in Spark anyways). Just wondering if you had any knowledge of specific differences.

@srowen
Copy link
Member Author

srowen commented Apr 5, 2014

@pwendell As I saw it, the reason it was used was for the ability to access the internal byte[] buffer directly rather than a copy. However in 2 of 3 cases, it ended up copying anyway (the call to trim()). The third case was in Serializer.scala, in SerializerInstance.serializeMany. But even there, I think the copying of the byte[] is offset by the fact that it can be merely wrap()ed by a ByteBuffer and avoids ByteBuffer.allocate() -- another copy in disguise. From my read, that was the potential difference, and I don't think there ends up being a meaningful difference. I can't say I'm 100% certain, but feel pretty sure.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13800/

@rxin
Copy link
Contributor

rxin commented Apr 6, 2014

For the OpenHashMap null lower bound, it should be fine to drop the lower bound (based on my 30 sec check).

The original intention was that if the key is primitive (non-null), PrimitiveKeyOpenHashMap.scala should be used. Maybe we can have a factory method to help users choose.

@mateiz
Copy link
Contributor

mateiz commented Apr 11, 2014

Cool, this looks good. I'll rerun the tests because Jenkins had some false positives in the past few days.

@mateiz
Copy link
Contributor

mateiz commented Apr 11, 2014

Jenkins, test this please

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14065/

@pwendell
Copy link
Contributor

Jenkins, test this please.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14071/

@pwendell
Copy link
Contributor

Thanks - I've merged this! Decided to pull it into 1.0 as well.

@mridulm
Copy link
Contributor

mridulm commented Apr 12, 2014

I did not notice this earlier.
The toByteArray method is insanely expensive for anything nontrivial.
A better solution would be to replace use of ByteArrayOutputStream with an inhouse variant which allows us direct access to the byte[] - if we dont want to use fastutil.

Already we are hitting cases of the byteoutputstream failing due to 2G limit.
This PR will make us create two copies of the same : the performance implication of this is terrible

asfgit pushed a commit that referenced this pull request Apr 12, 2014
(This is for discussion at this point -- I'm not suggesting this should be committed.)

This is what removing fastutil looks like. Much of it is straightforward, like using `java.io` buffered stream classes, and Guava for murmurhash3.

Uses of the `FastByteArrayOutputStream` were a little trickier. In only one case though do I think the change to use `java.io` actually entails an extra array copy.

The rest is using `OpenHashMap` and `OpenHashSet`.  These are now written in terms of more scala-like operations.

`OpenHashMap` is where I made three non-trivial changes to make it work, and they need review:

- It is no longer private
- The key must be a `ClassTag`
- Unless a lot of other code changes, the key type can't enforce being a supertype of `Null`

It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective.

But what about those last changes?

Author: Sean Owen <[email protected]>

Closes #266 from srowen/SPARK-1057-alternate and squashes the following commits:

2601129 [Sean Owen] Fix Map return type error not previously caught
ec65502 [Sean Owen] Updates from matei's review
00bc81e [Sean Owen] Remove use of fastutil and replace with use of java.io, spark.util and Guava classes
(cherry picked from commit 165e06a)

Signed-off-by: Patrick Wendell <[email protected]>
@asfgit asfgit closed this in 165e06a Apr 12, 2014
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does seem pretty bad ....

@mridulm
Copy link
Contributor

mridulm commented Apr 12, 2014

I think we can replace it with a custom impl - where we decide that it is ok to "waste" some memory within some threshold in case the copy is much more expensive - particularly given that most of this is almost immediately used and thrown away.
For example, if the size > X bytes and wastage (capacity - size) < Y% of capacity.
What we save from reallocating and compaction is not worth the effort.

@rxin
Copy link
Contributor

rxin commented Apr 12, 2014

i will submit one soon

@srowen
Copy link
Member Author

srowen commented Apr 12, 2014

Hold up a sec -- the array copy is not new. It was merely hidden in the call to trim() before, or to ByteBuffer.allocate(). Yes, it's better to avoid it if possible. The problem is not just about getting at a byte[] but one of the right size. So at least I think this is no worse than it was before, no emergency just yet. Might be best to really fix this up per Mridul's suggestion to use Seq[ByteBuffer].

@srowen srowen deleted the SPARK-1057-alternate branch April 12, 2014 14:20
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
(This is for discussion at this point -- I'm not suggesting this should be committed.)

This is what removing fastutil looks like. Much of it is straightforward, like using `java.io` buffered stream classes, and Guava for murmurhash3.

Uses of the `FastByteArrayOutputStream` were a little trickier. In only one case though do I think the change to use `java.io` actually entails an extra array copy.

The rest is using `OpenHashMap` and `OpenHashSet`.  These are now written in terms of more scala-like operations.

`OpenHashMap` is where I made three non-trivial changes to make it work, and they need review:

- It is no longer private
- The key must be a `ClassTag`
- Unless a lot of other code changes, the key type can't enforce being a supertype of `Null`

It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective.

But what about those last changes?

Author: Sean Owen <[email protected]>

Closes apache#266 from srowen/SPARK-1057-alternate and squashes the following commits:

2601129 [Sean Owen] Fix Map return type error not previously caught
ec65502 [Sean Owen] Updates from matei's review
00bc81e [Sean Owen] Remove use of fastutil and replace with use of java.io, spark.util and Guava classes
erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request May 8, 2017
erikerlandson pushed a commit to erikerlandson/spark that referenced this pull request Jul 28, 2017
mccheah pushed a commit to mccheah/spark that referenced this pull request Oct 12, 2017
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
Add ansible functional testing jobs against shade or openstacksdk
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
turboFei added a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…33] Backport insert operation lock (apache#197)

* [HADP-40184]Backport insert operation lock (#15)

[HADP-31946] Fix data duplicate on application retry and support concurrent write to different partitions in the same table.[HADP-33040][HADP-33041] Optimize merging staging files to output path and detect conflict with HDFS file lease.
HADP-34738] During commitJob, merge paths with multi threads (apache#218)
[HADP-36251] Enhance the concurrent lock mechanism  for insert operation (apache#272)
[HADP-37137] Add option to disable insert operation lock to write partitioned table (apache#286)

* [HADP-46224] Do not overwrite the lock file when creating lock (apache#133)

* [HADP-46868] Fix Spark merge path race condition (apache#161)

* [HADP-50903] Ignore the error message if insert operation lock file has been deleted (apache#271)

* [HADP-50733] Enhance the error message on picking insert operation lock failure (apache#267)

* Fix

* Fix

* Fix

* fix

* Fix

* Fix

* Fix

* Fix

* Fix

* [HADP-50574] Support to create the lock file for EC enabled path (apache#263)

* [HADP-50574][FOLLOWUP] Add parameter type when getting overwrite method (apache#265)

* [HADP-50574][FOLLOWUP] Add UT for creating ec disabled lock file and use underlying DistributedFileSystem for ViewFileSystem (apache#266)

* Fix

* Fix

* Fix

* [HADP-34612][FOLLOWUP] Do not show the insert local error by removing the being written stream from dfs client (apache#288)

* Enabled Hadoop 3

---------

Co-authored-by: fwang12 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants