Skip to content

Conversation

@davies
Copy link
Contributor

@davies davies commented Nov 25, 2015

This PR improve the performance of CartesianProduct by caching the result of right plan.

After this patch, the query time of TPC-DS Q65 go down to 4 seconds from 28 minutes (420X faster).

cc @nongli

@SparkQA
Copy link

SparkQA commented Nov 25, 2015

Test build #46683 has finished for PR 9969 at commit 162268c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class ChainedIterator extends UnsafeSorterIterator\n * class UnsafeCartesianRDD(rdd1 : RDD[UnsafeRow], rdd2 : RDD[UnsafeRow])\n

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document what the difference is between this iterator and the sorted iterator? Is it simply that one is sorted and the other is not?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davies are you trying to save a in-memory sort here?

@SparkQA
Copy link

SparkQA commented Nov 25, 2015

Test build #46702 has finished for PR 9969 at commit a94204b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class ChainedIterator extends UnsafeSorterIterator\n * class UnsafeCartesianRDD(rdd1 : RDD[UnsafeRow], rdd2 : RDD[UnsafeRow])\n

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the UnsafeExternalSorter preserve records order if it spills?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we may also need to update CartesianProduct strategy to put smaller child at right side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed it in #7417, right now it's not clear that which metric could be used as the size of table, that could be another story.

Even the right table is larger than left, this approach is still much better than current one (building the partition is usually much expensive than loading them from memory or disk), it also fix another problem that the right table could be nondeterministic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan For the first question, yes.

@SparkQA
Copy link

SparkQA commented Nov 26, 2015

Test build #2117 has finished for PR 9969 at commit d3edd4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 26, 2015

Test build #46755 has finished for PR 9969 at commit 074f2a7.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class ChainedIterator extends UnsafeSorterIterator\n * class UnsafeCartesianRDD(left : RDD[UnsafeRow], right : RDD[UnsafeRow], numFieldsOfRight: Int)\n

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesnt work if iterators contain empty iterators. Fix or assert that can't be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It checked that the iterators is not empty

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not what i mean.

If iterators contains an empty one. So iterators is:
(1, 2) : empty : (3, 4)

When you move to the second iterator (current is empty) you will stop and not iterate over the iterator containing (3,4)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, thanks, will fix it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For UnsafeExternalSorter, it's not possible to have an empty iterator in the middle, they are spilled files. It's still good to be defensive for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea. I figured it would not be empty but I agree about being defensive. If the implementation of UnsafeExternalSorter changes, we don't want to debug this.

Davies Liu added 2 commits November 30, 2015 11:07
Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala
@nongli
Copy link
Contributor

nongli commented Nov 30, 2015

LGTM

@asfgit asfgit closed this in 8df584b Nov 30, 2015
@SparkQA
Copy link

SparkQA commented Nov 30, 2015

Test build #46898 has finished for PR 9969 at commit 91c7824.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):\n * class ChainedIterator extends UnsafeSorterIterator\n * case class Count(children: Seq[Expression]) extends DeclarativeAggregate\n * class UnsafeCartesianRDD(left : RDD[UnsafeRow], right : RDD[UnsafeRow], numFieldsOfRight: Int)\n

@SparkQA
Copy link

SparkQA commented Nov 30, 2015

Test build #2131 has finished for PR 9969 at commit d3edd4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dujunling
Copy link

After this patch, the query time of TPC-DS Q65 go down to 4 seconds from 28 minutes (420X faster).
@davies ,How many data did you used?

@davies
Copy link
Contributor Author

davies commented May 5, 2016

Scale factor 1 and 10 (1G and 10G).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants