Skip to content

Conversation

@ilganeli
Copy link

This patch will do the following based on discussion here: https://issues.apache.org/jira/browse/SPARK-8890:

  1. Attempt to create new outputWriter for each partition (current behavior)
  2. When maximum is exceeded, pause output of rows.
  3. Sort all remaining data such that we can process one partition at at time.
  4. One partition at a time, create an outputWriter and write all rows associated with that key
  5. Close outputWriter for that key and open a new outputWriter, continue from step 4.

@SparkQA
Copy link

SparkQA commented Jul 19, 2015

Test build #37775 has finished for PR 7514 at commit f1479e5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 19, 2015

Test build #37776 has finished for PR 7514 at commit 7e355c4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 19, 2015

Test build #37777 has finished for PR 7514 at commit 4588f82.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 19, 2015

Test build #37780 has finished for PR 7514 at commit 9e15cdc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ilganeli
Copy link
Author

@rxin Where would be the best place to add a test for this functionality?

@SparkQA
Copy link

SparkQA commented Jul 20, 2015

Test build #37839 has finished for PR 7514 at commit 035f537.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 23, 2015

Test build #38260 has finished for PR 7514 at commit cdce9d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ilganeli
Copy link
Author

@rxin @davies @JoshRosen Hey all, could I please get a review of these updates? I'd love to get this fix in.

@SparkQA
Copy link

SparkQA commented Jul 31, 2015

Test build #39237 has finished for PR 7514 at commit d1aa55a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class TernaryExpression extends Expression

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is expensive, we should avoid that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a preferred way to do this? I could have the HashSet be created once to avoid creating it every time and clear it between calls?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we make sure that only visit the items once, then the rows will not be outputted twice.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that after a sort, everything is reorganized so we may end up traversing some elements that have already been processed, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the iterator can only be consumed once, so we only sort the items that have not been visited.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, so just use an `ExternalSorter`` based off that iterator to do the sort to avoid potential memory problems.

@SparkQA
Copy link

SparkQA commented Jul 31, 2015

Test build #39239 has finished for PR 7514 at commit 9d5a70e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class TernaryExpression extends Expression

@marmbrus
Copy link
Contributor

marmbrus commented Aug 6, 2015

Hey, thanks for working on this! Since we are really close to the first 1.5 RC I went ahead and tried an alternative solution based on our external sorter. #8010

Comments welcome :)

@JoshRosen
Copy link
Contributor

Hey @ilganeli @marmbrus, is this PR still relevant after #8010? If not, could we close it?

@ilganeli ilganeli closed this Oct 15, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants