[SPARK-16589][PYTHON] Chained cartesian produces incorrect number of records #14248

zero323 · 2016-07-18T17:07:23Z

What changes were proposed in this pull request?

If either self or other is serialized using CartesianSerializer, it is reserialized with the default serializer before _jrdd.cartesian is executed.

How was this patch tested?

Using existing unit tests as well as an additional test case that addresses SPARK-16589.
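As a rough illustration of what such a test can compare against, here is a Spark-free sketch of the expected result of chaining cartesian twice over the same values. The helper name `expected_chained_cartesian` and the input values are hypothetical and are not taken from the patch; the actual test code in the PR may look different.

```python
from itertools import product

def expected_chained_cartesian(xs):
    """Local reference for rdd.cartesian(rdd).cartesian(rdd) over the values xs."""
    pairs = product(xs, xs)
    # Each record of the chained product is ((a, b), c).
    return [(pair, x) for pair in pairs for x in xs]

xs = [0, 1]  # hypothetical input values
expected = expected_chained_cartesian(xs)
print(len(expected))  # should equal len(xs) ** 3
```

Comparing the collected RDD contents against a reference like this checks the content of the result, not only its size.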

SparkQA · 2016-07-18T17:55:00Z

Test build #62475 has finished for PR 14248 at commit 6ad588e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-18T18:33:45Z

Test build #62476 has finished for PR 14248 at commit 38374e3.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-18T19:14:56Z

Test build #62477 has finished for PR 14248 at commit db4546d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2016-07-19T18:06:20Z

Is this the best way to fix this? E.g., does the cartesian serializer only have problems when chained with other cartesian products, or is this a problem that might make sense to look at in the CartesianSerializer itself instead?
Anyway, cc @JoshRosen

zero323 · 2016-07-19T19:44:08Z

@holdenk I had the same doubts, and to be honest I am not sure what the right approach is here.

On a side note, chained cartesian didn't work at all before 1.3 and has been broken since 1.4, and for some reason the results are not even consistent between Python versions. So in general the problem may be much more complex than this.

holdenk · 2016-07-22T00:58:09Z

In that case maybe we should consider investigating it a bit more before we fix this one specific case?

zero323 · 2016-07-22T22:16:12Z

@holdenk Can we move this discussion to JIRA?

… records

## What changes were proposed in this pull request?

Fixes a bug in the python implementation of rdd cartesian product related to batching that showed up in repeated cartesian products with seemingly random results. The root cause being multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.

`CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it could be used without additional checks.

`PairDeserializer` no longer extends `CartesianDeserializer` as it was not really proper. If wanted a new common super class could be added. Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization.

## How was this patch tested?

Additional unit tests (sourced from #14248) plus one for testing a cartesian with zip.

Author: Andrew Ray <[email protected]>

Closes #16121 from aray/fix-cartesian.

(cherry picked from commit 3c68944)
Signed-off-by: Davies Liu <[email protected]>
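The root cause described in this commit message can be modelled without Spark. The sketch below is a simplified, hypothetical model and not PySpark's serializer code: it assumes the stream carries three frames (a batch for each of the three inputs of a chained cartesian) and that the key side and the value side both read from that one shared stream. All function names are invented for illustration.

```python
from itertools import product

def load_batches(stream):
    """Yield one batch (a list) per frame read from the shared stream."""
    for batch in stream:
        yield batch

def load_pairs_unbatched(stream):
    """Inner cartesian that ignores batching: yields single pairs."""
    for keys, vals in zip(load_batches(stream), load_batches(stream)):
        for pair in product(keys, vals):
            yield pair

def load_pairs_batched(stream):
    """Inner cartesian that yields one whole batch per pair of frames."""
    for keys, vals in zip(load_batches(stream), load_batches(stream)):
        yield list(product(keys, vals))

def chained_naive(stream):
    # Keys advance one item at a time, values one frame at a time,
    # so the two readers fall out of step on the shared stream.
    for key, vals in zip(load_pairs_unbatched(stream), load_batches(stream)):
        for pair in product([key], vals):
            yield pair

def chained_batch_aware(stream):
    # Both sides advance one batch at a time, so the reads stay aligned.
    for keys, vals in zip(load_pairs_batched(stream), load_batches(stream)):
        for pair in product(keys, vals):
            yield pair

frames = [[0, 1], [0, 1], [0, 1]]  # assumed frame layout: A, B, C
naive = list(chained_naive(iter(frames)))
fixed = list(chained_batch_aware(iter(frames)))
print(len(naive), len(fixed))
```

In this model the naive reader stops early because the value side asks for another frame after every single key, while the batch-aware reader yields all eight records.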


Reserialize RDDs using CartesianSerializer when using cartesian

6ad588e

zero323 changed the title ~~Reserialize RDDs using CartesianSerializer when using cartesian~~ [SPARK-16589][PYTHON] Chained cartesian produces incorrect number of records Jul 18, 2016

Tests if chaining produces correct content, not only size

db4546d

zero323 force-pushed the SPARK-16589 branch from 38374e3 to db4546d Compare July 18, 2016 18:32

zero323 closed this Oct 7, 2016

aray mentioned this pull request Dec 2, 2016

[SPARK-16589][PYTHON] Chained cartesian produces incorrect number of records #16121

Closed

zero323 deleted the SPARK-16589 branch April 6, 2017 11:05
