
Conversation

@zero323
Member

@zero323 zero323 commented Jul 18, 2016

What changes were proposed in this pull request?

If either `self` or `other` is serialized using `CartesianSerializer`, it is reserialized with the default serializer before executing `_jrdd.cartesian`.

How was this patch tested?

Using existing unit tests, as well as an additional test case added to address SPARK-16589.
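The miscount described in SPARK-16589 can be illustrated with a small plain-Python model (hypothetical names and layout, not PySpark's actual classes): one way batching-unaware logic goes wrong is pairing batch items element-wise instead of taking the per-batch cartesian product, which silently drops records.

```python
from itertools import product

# Model of a batched pair of streams: corresponding key batches and
# value batches (hypothetical layout, not PySpark's wire format).
key_batches = [[1, 2], [3, 4]]
val_batches = [['a', 'b'], ['c', 'd']]

def correct_records(key_batches, val_batches):
    # Respect batch boundaries: cartesian product *within* each batch pair.
    for kb, vb in zip(key_batches, val_batches):
        yield from product(kb, vb)

def buggy_records(key_batches, val_batches):
    # Bug model: batching ignored, items paired element-wise instead,
    # so (|kb| * |vb|) - min(|kb|, |vb|) records vanish per batch pair.
    for kb, vb in zip(key_batches, val_batches):
        yield from zip(kb, vb)

print(len(list(correct_records(key_batches, val_batches))))  # 8
print(len(list(buggy_records(key_batches, val_batches))))    # 4
```

The correct version yields all 2 * 2 pairs per batch pair; the buggy model loses half of them, matching the "incorrect number of records" symptom.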

@zero323 zero323 changed the title Reserialize RDDs using CartesianSerializer when using cartesian [SPARK-16589][PYTHON] Chained cartesian produces incorrect number of records Jul 18, 2016
@SparkQA

SparkQA commented Jul 18, 2016

Test build #62475 has finished for PR 14248 at commit 6ad588e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2016

Test build #62476 has finished for PR 14248 at commit 38374e3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2016

Test build #62477 has finished for PR 14248 at commit db4546d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Jul 19, 2016

Is this the best way to fix this? E.g., does the CartesianSerializer only have problems when chained with other cartesian products, or is this a problem that might make sense to address in the CartesianSerializer itself instead?
Anyway, cc @JoshRosen

@zero323
Member Author

zero323 commented Jul 19, 2016

@holdenk I had the same doubts and, to be honest, I am not sure what the right approach is here.

On a side note, chained cartesian didn't work at all before 1.3 and has been broken since 1.4, and for some reason the results are not even consistent between Python versions. So in general the problem can be much more complex than this.

@holdenk
Contributor

holdenk commented Jul 22, 2016

In that case, maybe we should consider investigating it a bit more before we fix this one specific case?

@zero323
Member Author

zero323 commented Jul 22, 2016

@holdenk Can we move this discussion to JIRA?

@zero323 zero323 closed this Oct 7, 2016
asfgit pushed a commit that referenced this pull request Dec 8, 2016
… records

## What changes were proposed in this pull request?

Fixes a bug in the Python implementation of RDD cartesian product related to batching, which showed up in repeated cartesian products with seemingly random results. The root cause was multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.

`CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one-line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it can be used without additional checks.

`PairDeserializer` no longer extends `CartesianDeserializer`, as that relationship was never really proper. If desired, a new common superclass could be added.

Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization.

## How was this patch tested?

Additional unit tests (sourced from #14248), plus one testing a cartesian with zip.

Author: Andrew Ray <[email protected]>

Closes #16121 from aray/fix-cartesian.

(cherry picked from commit 3c68944)
Signed-off-by: Davies Liu <[email protected]>
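The refactoring the commit message describes can be sketched as a simplified plain-Python model (the class names mirror `pyspark/serializers.py`, but streams are modeled here as already-parsed Python lists rather than framed binary streams, and the two halves of the cartesian stream are assumed pre-split):

```python
from itertools import chain, product

class Serializer:
    def load_stream(self, stream):
        raise NotImplementedError

    def _load_stream_without_unbatching(self, stream):
        # Default now always returns an iterable of batches:
        # each plain item becomes a batch of one.
        return ([item] for item in self.load_stream(stream))

class BatchedSerializer(Serializer):
    def _load_stream_without_unbatching(self, stream):
        # A batched "stream" is modeled here as an iterable of lists.
        return iter(stream)

    def load_stream(self, stream):
        # The one-line implementation the deserializer below borrows:
        # flatten batches back into individual records.
        return chain.from_iterable(self._load_stream_without_unbatching(stream))

class CartesianDeserializer(Serializer):
    # Extends only Serializer: it is meant for deserialization alone.
    def __init__(self, key_ser, val_ser):
        self.key_ser, self.val_ser = key_ser, val_ser

    def _load_stream_without_unbatching(self, stream):
        key_stream, val_stream = stream  # model assumption: pre-split halves
        key_batches = self.key_ser._load_stream_without_unbatching(key_stream)
        val_batches = self.val_ser._load_stream_without_unbatching(val_stream)
        # Take the product per corresponding batch pair, so batch
        # boundaries are respected and no records are dropped.
        return (product(kb, vb) for kb, vb in zip(key_batches, val_batches))

    load_stream = BatchedSerializer.load_stream

# Usage: two batched streams of two batches each.
deser = CartesianDeserializer(BatchedSerializer(), BatchedSerializer())
pairs = list(deser.load_stream(([[1, 2], [3]], [['a'], ['b', 'c']])))
print(pairs)  # [(1, 'a'), (2, 'a'), (3, 'b'), (3, 'c')]
```

Because every `_load_stream_without_unbatching` now yields an iterable of batches, chained cartesian products compose without special-casing which serializer produced the input.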
asfgit pushed a commit that referenced this pull request Dec 8, 2016
… records

robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
… records

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
… records

@zero323 zero323 deleted the SPARK-16589 branch April 6, 2017 11:05