Conversation

@tejasapatil
Contributor

What changes were proposed in this pull request?

SPARK-21595 reported excessive spilling to disk because the default spill threshold for ExternalAppendOnlyUnsafeRowArray is quite small for the WINDOW operator. The old behaviour of the WINDOW operator (pre #16909) was to hold data in an array for the first 4096 records, after which it switched to UnsafeExternalSorter and started spilling to disk once spark.shuffle.spill.numElementsForceSpillThreshold was reached (or earlier if memory ran short due to too many consumers).

Currently, both the switch from the in-memory buffer to UnsafeExternalSorter and the point at which UnsafeExternalSorter spills to disk are controlled by a single threshold in ExternalAppendOnlyUnsafeRowArray. This PR separates the two so that each can be tuned independently.
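
For illustration, a minimal sketch (not part of this patch's diff) of how the two knobs could then be tuned independently for the window operator; the config keys below are assumed from the per-operator naming discussed in this PR and the values are examples only, so check the merged patch for the exact names:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical tuning: the in-memory buffer size and the force-spill point
// are now two separate, independently tunable thresholds.
val spark = SparkSession.builder()
  .appName("window-buffer-tuning")
  // rows kept in the plain in-memory buffer before switching to UnsafeExternalSorter
  .config("spark.sql.windowExec.buffer.in.memory.threshold", 4096L)
  // rows held by UnsafeExternalSorter before it is forced to spill to disk
  .config("spark.sql.windowExec.buffer.spill.threshold", 1024L * 1024L)
  .getOrCreate()
```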

How was this patch tested?

Added unit tests

@tejasapatil
Contributor Author

@hvanhovell : let me know what you think about this.

@SparkQA

SparkQA commented Aug 4, 2017

Test build #80236 has finished for PR 18843 at commit 8e3bfb7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

retest this please

@SparkQA

SparkQA commented Aug 4, 2017

Test build #80247 has finished for PR 18843 at commit 8e3bfb7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil force-pushed the SPARK-21595 branch 2 times, most recently from 9f66038 to 398ccaf on August 4, 2017 17:23
@SparkQA

SparkQA commented Aug 4, 2017

Test build #80255 has finished for PR 18843 at commit 9f66038.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 4, 2017

Test build #80256 has finished for PR 18843 at commit 398ccaf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

retest this please

@hvanhovell
Contributor

LGTM - pending jenkins

Contributor

Typo? This should be numRowsInMemoryBufferThreshold. We may spill before reaching numRowsSpillThreshold if there is not enough memory.

Contributor Author

Yes, it was a typo. Corrected it.

Contributor

Can we just have one config for both window and SMJ? Ideally we could say that this config is for ExternalAppendOnlyUnsafeRowArray.

Contributor Author

@tejasapatil Aug 10, 2017

I am fine with that. We could even go a step further and have just two configs: an in-memory threshold and a spill threshold on ExternalAppendOnlyUnsafeRowArray itself, shared by all of its clients (currently SMJ, cartesian product, and Window). That way we would have consistency across all clients for both knobs. One downside is backward compatibility: the spill threshold was already defined at the per-operator level and people might be using it in prod.

Let me know what you think about that.

Contributor

OK, let's keep them separate for each operator.

Contributor

Is this a reasonable default value? Won't it lead to OOM, according to the documentation?

Contributor

It is the current value. I suppose you want to be able to tune it if you have to. Not all of us are running Spark at FB scale :)...

Contributor Author

Before ExternalAppendOnlyUnsafeRowArray was introduced, SMJ used to hold its in-memory data in Scala's ArrayBuffer, which is backed by an array that can be at most Int.MaxValue in size, so this default keeps things as they were before.
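
For concreteness, a rough sketch of what keeping the old behaviour means in config terms (the SMJ key below is assumed to follow the same per-operator pattern as the window configs; illustrative only):

```scala
// Assuming an existing SparkSession bound to `spark`.
// Effectively unlimited in-memory buffering for SMJ, matching the old
// ArrayBuffer behaviour, which was capped only by the array's Int.MaxValue size.
spark.conf.set("spark.sql.sortMergeJoinExec.buffer.in.memory.threshold", Int.MaxValue.toString)
```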

Contributor

got it

@SparkQA

SparkQA commented Aug 9, 2017

Test build #80451 has finished for PR 18843 at commit 398ccaf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

@tejasapatil left a comment

Updated the PR as per review comments by @cloud-fan. I haven't made changes for all of his comments; in those places I replied to continue the discussion.

@SparkQA

SparkQA commented Aug 10, 2017

Test build #80504 has finished for PR 18843 at commit a69969c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Shall we introduce a similar config for cartesian product?

Contributor Author

Sure. Since there was no in-memory buffer for cartesian product before, I am using a conservative value of 4096 for the in-memory buffer threshold. The spill threshold stays at UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD, as before.
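
As a quick, illustrative sketch of what that amounts to (the config key below is assumed to mirror the other operators, and the value shown is the default described above, so overriding it is only needed for tuning):

```scala
// Assuming an existing SparkSession bound to `spark`.
// Small in-memory buffer for cartesian product; the spill threshold is left at
// UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD (not overridden here).
spark.conf.set("spark.sql.cartesianProductExec.buffer.in.memory.threshold", "4096")
```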

@cloud-fan
Contributor

LGTM except one question, thanks for the fix!

@cloud-fan
Contributor

LGTM, pending jenkins

@cloud-fan
Contributor

retest this please

@tejasapatil
Contributor Author

jenkins test this please

@SparkQA

SparkQA commented Aug 11, 2017

Test build #80536 has finished for PR 18843 at commit ab5cd2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

Merging to master/2.2. Thanks!

asfgit pushed a commit that referenced this pull request Aug 11, 2017
…nalAppendOnlyUnsafeRowArray

Author: Tejas Patil <[email protected]>

Closes #18843 from tejasapatil/SPARK-21595.

(cherry picked from commit 9443999)
Signed-off-by: Herman van Hovell <[email protected]>
@asfgit closed this in 9443999 on Aug 11, 2017
dilipbiswal pushed a commit to dilipbiswal/spark that referenced this pull request Dec 20, 2017
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018