Skip to content

Conversation

@kiszk
Copy link
Member

@kiszk kiszk commented Jul 8, 2017

What changes were proposed in this pull request?

This PR fixes a wrong comparison for BinaryType. This PR enables unsigned comparison and unsigned prefix generation for an array for BinaryType. Previous implementations uses signed operations.

How was this patch tested?

Added a test suite in OrderingSuite.

@SparkQA
Copy link

SparkQA commented Jul 8, 2017

Test build #79381 has finished for PR 18571 at commit ac86eed.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jul 8, 2017

Aren't byte comparisons signed though elsewhere? as well as other integer type comparison? this would become inconsistent. Maybe I'm missing the reason this should be different.

@kiszk
Copy link
Member Author

kiszk commented Jul 8, 2017

IIUC, BinaryType should be handled as unsigned byte while Java and Scala have only signed byte array.
That is why we have to do unsigned comparison of BinaryType.
The places where byte comparison is used rather than for BinaryType, I think that we do not change comparisons.

What do you think?
cc: @gatorsmile @maropu

@maropu
Copy link
Member

maropu commented Jul 9, 2017

I checked other dbms (dbms-like) systems handled binary as unassigned;

postgres=# SELECT E'\\x00'::BYTEA < E'\\xff'::BYTEA;
 ?column? 
----------
 t
(1 row)

hive> SELECT unhex('00') < unhex('ff');
OK
true

mysql> SELECT x'00' < x'ff';
+---------------+
| x'00' < x'ff' |
+---------------+
|             1 |
+---------------+

In some parts of the Spark implementation, they also handle binary as unassigned (https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java#L36). I feel, based on the principle of least astonishment, we'd be better to handle binary as unassigned in user-facing pieces.

@kiszk
Copy link
Member Author

kiszk commented Jul 9, 2017

@maropu, thank you for your comment.


test("SPARK-21344: BinaryType comparison does signed byte array comparison") {
val b1 = Array[Byte](1) // 0x01
val b2 = Array[Byte](-1) // 0xff
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @kiszk
Can we have another test case with more than one byte for code coverage?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comment, done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it looks better to check the boundary values like -128 and 127.

@SparkQA
Copy link

SparkQA commented Jul 9, 2017

Test build #79425 has finished for PR 18571 at commit 0fea784.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

long p = 0;
for (int i = 0; i < minLen; ++i) {
p |= (128L + Platform.getByte(bytes, Platform.BYTE_ARRAY_OFFSET + i))
p |= ((long)Platform.getByte(bytes, Platform.BYTE_ARRAY_OFFSET + i) & 0xff)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe we need a single space like ((long) Platform....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. done.

@SparkQA
Copy link

SparkQA commented Jul 10, 2017

Test build #79436 has finished for PR 18571 at commit 814d4c4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 10, 2017

Test build #79439 has finished for PR 18571 at commit a84776b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Jul 10, 2017

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Jul 10, 2017

Test build #79450 has finished for PR 18571 at commit a84776b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Jul 10, 2017

@gatorsmile could you please review this?

@gatorsmile
Copy link
Member

SELECT x'00' < x'0f'
SELECT x'00' < x'ff'

Could you add the above queries to a new comparator.sql file?

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@SparkQA
Copy link

SparkQA commented Jul 14, 2017

Test build #79600 has finished for PR 18571 at commit e33ec8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Jul 14, 2017

@gatorsmile thank you for your comment, done

(Array[Byte](1, 1, 1, 1, 1, 1, 1, 1, 1), Array[Byte](1, 1, 1, 1, 1, 1, 1, 1, -1))
)
data.foreach { case (b1, b2) =>
val rowOrdering = InterpretedOrdering.forSchema(Seq(BinaryType, BinaryType))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, don't you just have one binary field for each row? Why you specify two fields to compare?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, updated.

BoundReference(1, BinaryType, nullable = true).asc :: Nil)
val rowType = StructType(
StructField("b1", BinaryType, nullable = true) ::
StructField("b2", BinaryType, nullable = true) :: Nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You specify two binary fields as row schema. But actually your input just has one binary field (i.e., Row(b1)).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misunderstood, updated.

long p = 0;
for (int i = 0; i < minLen; ++i) {
p |= (128L + Platform.getByte(bytes, Platform.BYTE_ARRAY_OFFSET + i))
p |= ((long) Platform.getByte(bytes, Platform.BYTE_ARRAY_OFFSET + i) & 0xff)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we don't have unit test for this. Shall we add a ByteArraySuite?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the unit test for this. I added some cases.

@SparkQA
Copy link

SparkQA commented Jul 14, 2017

Test build #79612 has finished for PR 18571 at commit 08776e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

It sounds like all the comments from @viirya have been resolved. Will merge it tonight. Thanks!

@viirya
Copy link
Member

viirya commented Jul 14, 2017

LGTM

asfgit pushed a commit that referenced this pull request Jul 15, 2017
…rison

## What changes were proposed in this pull request?

This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations.

## How was this patch tested?

Added a test suite in `OrderingSuite`.

Author: Kazuaki Ishizaki <[email protected]>

Closes #18571 from kiszk/SPARK-21344.

(cherry picked from commit ac5d5d7)
Signed-off-by: gatorsmile <[email protected]>
asfgit pushed a commit that referenced this pull request Jul 15, 2017
…rison

## What changes were proposed in this pull request?

This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations.

## How was this patch tested?

Added a test suite in `OrderingSuite`.

Author: Kazuaki Ishizaki <[email protected]>

Closes #18571 from kiszk/SPARK-21344.

(cherry picked from commit ac5d5d7)
Signed-off-by: gatorsmile <[email protected]>
asfgit pushed a commit that referenced this pull request Jul 15, 2017
…rison

## What changes were proposed in this pull request?

This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations.

## How was this patch tested?

Added a test suite in `OrderingSuite`.

Author: Kazuaki Ishizaki <[email protected]>

Closes #18571 from kiszk/SPARK-21344.

(cherry picked from commit ac5d5d7)
Signed-off-by: gatorsmile <[email protected]>
@gatorsmile
Copy link
Member

Thanks! Merging to master/2.2/2.1/2.0

@asfgit asfgit closed this in ac5d5d7 Jul 15, 2017
asfgit pushed a commit that referenced this pull request Jul 16, 2017
… representation

## What changes were proposed in this pull request?
SPARK 2.0 does not support hex literal. Thus, the test case failed after backporting #18571

## How was this patch tested?
N/A

Author: gatorsmile <[email protected]>

Closes #18643 from gatorsmile/fixTestFailure2.0.
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…rison

## What changes were proposed in this pull request?

This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations.

## How was this patch tested?

Added a test suite in `OrderingSuite`.

Author: Kazuaki Ishizaki <[email protected]>

Closes apache#18571 from kiszk/SPARK-21344.

(cherry picked from commit ac5d5d7)
Signed-off-by: gatorsmile <[email protected]>
jzhuge pushed a commit to jzhuge/spark that referenced this pull request Aug 20, 2018
…rison

This PR fixes a wrong comparison for `BinaryType`. This PR enables unsigned comparison and unsigned prefix generation for an array for `BinaryType`. Previous implementations uses signed operations.

Added a test suite in `OrderingSuite`.

Author: Kazuaki Ishizaki <[email protected]>

Closes apache#18571 from kiszk/SPARK-21344.

(cherry picked from commit ac5d5d7)
Signed-off-by: gatorsmile <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants