Skip to content

Conversation

@kiszk
Copy link
Member

@kiszk kiszk commented Apr 18, 2018

What changes were proposed in this pull request?

The PR adds the SQL function array_except. The behavior of the function is based on Presto's one.

This function returns returns an array of the elements in array1 but not in array2.

Note: The order of elements in the result is not defined.

How was this patch tested?

Added UTs.

@SparkQA
Copy link

SparkQA commented Apr 18, 2018

Test build #89524 has finished for PR 21103 at commit ce18ce0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class ArraySetUtils extends BinaryExpression with ExpectsInputTypes
  • case class ArrayExcept(array1: Expression, array2: Expression) extends ArraySetUtils

@SparkQA
Copy link

SparkQA commented Apr 19, 2018

Test build #89540 has finished for PR 21103 at commit 6813ba4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Apr 19, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 19, 2018

Test build #89561 has finished for PR 21103 at commit 6813ba4.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 19, 2018

Test build #89587 has finished for PR 21103 at commit 2d16556.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class ArraySetUtils extends BinaryExpression with ExpectsInputTypes

@gatorsmile
Copy link
Member

cc @ueshin

@SparkQA
Copy link

SparkQA commented Apr 19, 2018

Test build #89589 has finished for PR 21103 at commit 2551c09.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 20, 2018

Test build #89605 has finished for PR 21103 at commit cdb6257.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayExcept(left: Expression, right: Expression) extends ArraySetUtils

@SparkQA
Copy link

SparkQA commented Apr 20, 2018

Test build #89629 has finished for PR 21103 at commit ad9f576.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Apr 20, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Apr 20, 2018

Test build #89651 has finished for PR 21103 at commit ad9f576.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 26, 2018

Test build #92322 has finished for PR 21103 at commit ad9f576.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

kiszk added a commit to kiszk/spark that referenced this pull request Jun 27, 2018
fix test failure
@SparkQA
Copy link

SparkQA commented Jul 14, 2018

Test build #92997 has finished for PR 21103 at commit 5c06922.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayExcept(left: Expression, right: Expression) extends ArraySetLike

@kiszk
Copy link
Member Author

kiszk commented Jul 14, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Jul 14, 2018

Test build #93001 has finished for PR 21103 at commit 5c06922.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayExcept(left: Expression, right: Expression) extends ArraySetLike

@SparkQA
Copy link

SparkQA commented Jul 19, 2018

Test build #93270 has finished for PR 21103 at commit 26e3257.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayUnion(left: Expression, right: Expression) extends ArraySetLike

@SparkQA
Copy link

SparkQA commented Jul 29, 2018

Test build #93750 has finished for PR 21103 at commit 3639b5b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Jul 29, 2018

retest this please

| {
| $javaTypeName $value = $array2.$getter;
| $hsJavaTypeName $hsValue = $genHsValue;
| $hs.add$postFix($hsValue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why wrap them inside {...}?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah i see. Maybe a better way is

def withArray2NullCheck(body: String) = if (right.dataType.asInstanceOf[ArrayType].containsNull) {
  s"""
    |if ($array2.isNullAt($i)) {
    |  $notFoundNullElement = false;
    |} else {
    |  $body
    |}""".stripMargin
} else {
  body
}

Copy link
Member Author

@kiszk kiszk Jul 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part will be part of else clause of if statement depends on $array2NullCheck (i.e. if (right.dataType.asInstanceOf[ArrayType].containsNull) is true).

}

nullSafeCodeGen(ctx, ev, (array1, array2) => {
if (openHashElementType != "") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a better way to organize it

if (elementTypeSupportEquals) {
  ...
  nullSafeCodeGen(...)
} else {
  nullSafeCodeGen(...)
}

|$hs = new $openHashSet$postFix($classTag);
|$notFoundNullElement = true;
|int $pos = 0;
|for (int $i = 0; $i < $array2.numElements(); $i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add array2 to the hash set again?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We add some elements in array1to the hash set at L4192.
Since we want to create a hashset that have elements only of array2, we reallocate a hashset at L4198 and add elements of array2 again here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then why we do the first iteration? only calculating the size?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, only calculating array size for allocating a UnsafeArrayData or GenericArrayData.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this worth? I feel using an ArrayBuffer like the interpreted path is faster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To use 'ArrayBuffer' involves boxing. Let me consider 'ArrayBuffer.ofInt' or others with special null handling.

@SparkQA
Copy link

SparkQA commented Jul 30, 2018

Test build #93759 has finished for PR 21103 at commit 3639b5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val (postFix, openHashElementType, hsJavaTypeName, genHsValue,
getter, setter, javaTypeName, primitiveTypeName, arrayDataBuilder) =
elementType match {
case BooleanType | ByteType | ShortType | IntegerType =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically we can have special handling for boolean type(at most 2 outputs), but I think it's very rare to see boolean type in array_except. Let's leave it unless users require it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree that it is very rare. Let us remove the special handling code for BooleanType.

@SparkQA
Copy link

SparkQA commented Jul 31, 2018

Test build #93822 has finished for PR 21103 at commit 49b5ab3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 31, 2018

Test build #93819 has finished for PR 21103 at commit 4d943c8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Jul 31, 2018

Test build #93825 has finished for PR 21103 at commit 282445c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Jul 31, 2018

Test build #93830 has finished for PR 21103 at commit 282445c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 31, 2018

Test build #93831 has finished for PR 21103 at commit 72ef664.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 31, 2018

Test build #93836 has finished for PR 21103 at commit 93e7979.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Aug 1, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Aug 1, 2018

Test build #93851 has finished for PR 21103 at commit 93e7979.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Aug 1, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Aug 1, 2018

Test build #93870 has finished for PR 21103 at commit 93e7979.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member Author

kiszk commented Aug 1, 2018

cc @ueshin @cloud-fan

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 95a9d5e Aug 1, 2018
asfgit pushed a commit that referenced this pull request Aug 7, 2018
## What changes were proposed in this pull request?

This PR refactors `ArrayUnion` based on [this suggestion](#21103 (comment)).
1. Generate optimized code for all of the primitive types except `boolean`
1. Generate code using `ArrayBuilder` or `ArrayBuffer`
1. Leave only a generic path in the interpreted path

## How was this patch tested?

Existing tests

Author: Kazuaki Ishizaki <[email protected]>

Closes #21937 from kiszk/SPARK-23914-follow.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants