Conversation

Contributor

@mn-mikke commented Mar 19, 2018

What changes were proposed in this pull request?

The PR adds logic for easy concatenation of multiple array columns and covers the following (a short usage sketch follows the list):

  • The Concat expression has been extended to support array columns
  • A Python wrapper has been added
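
A minimal usage sketch (assuming a spark-shell session with implicits in scope; output shown approximately):

import org.apache.spark.sql.functions.concat

val df = Seq((Seq(1, 2), Seq(3, 4))).toDF("a", "b")
df.select(concat('a, 'b)).show()
// +------------+
// |concat(a, b)|
// +------------+
// |[1, 2, 3, 4]|
// +------------+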

How was this patch tested?

New tests added into:

  • CollectionExpressionsSuite
  • DataFrameFunctionsSuite
  • typeCoercion/native/concat.sql

Codegen examples

Primitive-type elements

val df = Seq(
  (Seq(1, 2), Seq(3, 4)),
  (Seq(1, 2, 3), null)
).toDF("a", "b")
df.filter('a.isNotNull).select(concat('a, 'b)).debugCodegen()

Result:

/* 033 */         boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
/* 034 */         ArrayData inputadapter_value = inputadapter_isNull ?
/* 035 */         null : (inputadapter_row.getArray(0));
/* 036 */
/* 037 */         if (!(!inputadapter_isNull)) continue;
/* 038 */
/* 039 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
/* 040 */
/* 041 */         ArrayData[] project_args = new ArrayData[2];
/* 042 */
/* 043 */         if (!false) {
/* 044 */           project_args[0] = inputadapter_value;
/* 045 */         }
/* 046 */
/* 047 */         boolean inputadapter_isNull1 = inputadapter_row.isNullAt(1);
/* 048 */         ArrayData inputadapter_value1 = inputadapter_isNull1 ?
/* 049 */         null : (inputadapter_row.getArray(1));
/* 050 */         if (!inputadapter_isNull1) {
/* 051 */           project_args[1] = inputadapter_value1;
/* 052 */         }
/* 053 */
/* 054 */         ArrayData project_value = new Object() {
/* 055 */           public ArrayData concat(ArrayData[] args) {
/* 056 */             for (int z = 0; z < 2; z++) {
/* 057 */               if (args[z] == null) return null;
/* 058 */             }
/* 059 */
/* 060 */             long project_numElements = 0L;
/* 061 */             for (int z = 0; z < 2; z++) {
/* 062 */               project_numElements += args[z].numElements();
/* 063 */             }
/* 064 */             if (project_numElements > 2147483632) {
/* 065 */               throw new RuntimeException("Unsuccessful try to concat arrays with " + project_numElements +
/* 066 */                 " elements due to exceeding the array size limit 2147483632.");
/* 067 */             }
/* 068 */
/* 069 */             long project_size = UnsafeArrayData.calculateSizeOfUnderlyingByteArray(
/* 070 */               project_numElements,
/* 071 */               4);
/* 072 */             if (project_size > 2147483632) {
/* 073 */               throw new RuntimeException("Unsuccessful try to concat arrays with " + project_size +
/* 074 */                 " bytes of data due to exceeding the limit 2147483632 bytes" +
/* 075 */                 " for UnsafeArrayData.");
/* 076 */             }
/* 077 */
/* 078 */             byte[] project_array = new byte[(int)project_size];
/* 079 */             UnsafeArrayData project_arrayData = new UnsafeArrayData();
/* 080 */             Platform.putLong(project_array, 16, project_numElements);
/* 081 */             project_arrayData.pointTo(project_array, 16, (int)project_size);
/* 082 */             int project_counter = 0;
/* 083 */             for (int y = 0; y < 2; y++) {
/* 084 */               for (int z = 0; z < args[y].numElements(); z++) {
/* 085 */                 if (args[y].isNullAt(z)) {
/* 086 */                   project_arrayData.setNullAt(project_counter);
/* 087 */                 } else {
/* 088 */                   project_arrayData.setInt(
/* 089 */                     project_counter,
/* 090 */                     args[y].getInt(z)
/* 091 */                   );
/* 092 */                 }
/* 093 */                 project_counter++;
/* 094 */               }
/* 095 */             }
/* 096 */             return project_arrayData;
/* 097 */           }
/* 098 */         }.concat(project_args);
/* 099 */         boolean project_isNull = project_value == null;

Non-primitive-type elements

val df = Seq(
  (Seq("aa" ,"bb"), Seq("ccc", "ddd")),
  (Seq("x", "y"), null)
).toDF("a", "b")
df.filter('a.isNotNull).select(concat('a, 'b)).debugCodegen()

Result:

/* 033 */         boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
/* 034 */         ArrayData inputadapter_value = inputadapter_isNull ?
/* 035 */         null : (inputadapter_row.getArray(0));
/* 036 */
/* 037 */         if (!(!inputadapter_isNull)) continue;
/* 038 */
/* 039 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
/* 040 */
/* 041 */         ArrayData[] project_args = new ArrayData[2];
/* 042 */
/* 043 */         if (!false) {
/* 044 */           project_args[0] = inputadapter_value;
/* 045 */         }
/* 046 */
/* 047 */         boolean inputadapter_isNull1 = inputadapter_row.isNullAt(1);
/* 048 */         ArrayData inputadapter_value1 = inputadapter_isNull1 ?
/* 049 */         null : (inputadapter_row.getArray(1));
/* 050 */         if (!inputadapter_isNull1) {
/* 051 */           project_args[1] = inputadapter_value1;
/* 052 */         }
/* 053 */
/* 054 */         ArrayData project_value = new Object() {
/* 055 */           public ArrayData concat(ArrayData[] args) {
/* 056 */             for (int z = 0; z < 2; z++) {
/* 057 */               if (args[z] == null) return null;
/* 058 */             }
/* 059 */
/* 060 */             long project_numElements = 0L;
/* 061 */             for (int z = 0; z < 2; z++) {
/* 062 */               project_numElements += args[z].numElements();
/* 063 */             }
/* 064 */             if (project_numElements > 2147483632) {
/* 065 */               throw new RuntimeException("Unsuccessful try to concat arrays with " + project_numElements +
/* 066 */                 " elements due to exceeding the array size limit 2147483632.");
/* 067 */             }
/* 068 */
/* 069 */             Object[] project_arrayObjects = new Object[(int)project_numElements];
/* 070 */             int project_counter = 0;
/* 071 */             for (int y = 0; y < 2; y++) {
/* 072 */               for (int z = 0; z < args[y].numElements(); z++) {
/* 073 */                 project_arrayObjects[project_counter] = args[y].getUTF8String(z);
/* 074 */                 project_counter++;
/* 075 */               }
/* 076 */             }
/* 077 */             return new org.apache.spark.sql.catalyst.util.GenericArrayData(project_arrayObjects);
/* 078 */           }
/* 079 */         }.concat(project_args);
/* 080 */         boolean project_isNull = project_value == null;

expression[MapValues]("map_values"),
expression[Size]("size"),
expression[SortArray]("sort_array"),
expression[ConcatArrays]("concat_arrays"),
Member

Why not reuse concat?

concat(array1, array2, ..., arrayN) -> array ?

Contributor Author

I've already played with this option in my mind, but I'm not sure how concat would be categorized.

Currently, concat is defined as a pure string operation:
/**
 * @group string_funcs
 * @since 1.5.0
 */
@scala.annotation.varargs
def concat(exprs: Column*): Column

The functionality in this PR, however, belongs rather to the collection_funcs group.

Having just one function for both expressions would be elegant, but can you advise what group should be assigned to concat?
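
For reference, a hedged sketch of what the moved definition might look like if concat were grouped under collection functions (wording illustrative, not the final docs):

/**
 * Concatenates multiple input columns together into a single column.
 * The function works with strings, binary and compatible array columns.
 *
 * @group collection_funcs
 * @since 1.5.0
 */
@scala.annotation.varargs
def concat(exprs: Column*): Column = withExpr { Concat(exprs.map(_.expr)) }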

Member

How about moving it to collection functions?

Contributor Author

Ok, I will merge the functions into one. Are you fine with having one expression class per concatenation type?

I'm afraid that if I incorporate all the logic into one expression class, the code will become messy since each codegen and evaluation path has a different nature.

@maropu (Member) commented Mar 19, 2018

Thanks for this work! One question: why do you think we need to support this API natively in Spark? Do other libraries support this as first-class?

@mn-mikke (Contributor Author)

@maropu What other libraries do you mean? I'm not aware of any library providing this functionality on top of Spark SQL.

When using Spark SQL as an ETL tool for structured and nested data, people are forced to use UDFs for transforming arrays since the current API for array columns is lacking. This approach brings several drawbacks:

  • bad code readability
  • Catalyst is blind when performing optimizations
  • impossibility to track data lineage of the transformation (a key aspect for the financial industry, see Spline and Spline paper)

So my colleagues and I decided to extend the current Spark SQL API with well-known collection functions like concat, flatten, zipWithIndex, etc. We don't want to keep this functionality just in our fork of Spark, but would like to share it with others.
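
To make the UDF drawback above concrete, a hedged sketch contrasting the workaround with the proposed native function (column names and data are illustrative):

import org.apache.spark.sql.functions.{concat, udf}

val df = Seq((Seq(1, 2), Seq(3, 4))).toDF("a", "b")

// UDF-based workaround: opaque to Catalyst optimizations and to lineage tracking.
val concatUdf = udf((a: Seq[Int], b: Seq[Int]) => a ++ b)
df.select(concatUdf('a, 'b))

// Proposed native function: a regular Catalyst expression the optimizer can reason about.
df.select(concat('a, 'b))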

* @param f a function that accepts a sequence of non-null evaluation result names of children
* and returns Java code to compute the output.
*/
protected def nullSafeCodeGen(
Contributor

This method looks almost the same as the one in BinaryExpression. Can you avoid the code duplication?

Member

We will combine it with concat

Contributor Author

@WeichenXu123 I do agree that there are strong similarities in the code.

If you take a look at UnaryExpression, BinaryExpression and TernaryExpression, you will see that the methods responsible for null-safe evaluation and code generation are the same except for the number of parameters. My intention has been to generalize these methods into the NullSafeEvaluation trait and remove the original methods in a different PR once the trait is in. I didn't want to create a big-bang PR because of one additional function in the API.
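
For context, a rough sketch of the kind of generalization described (names and shape are illustrative, not the PR's final code):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Expression

trait NullSafeEvaluation extends Expression {

  // Evaluates all children and returns null if any of them evaluates to null,
  // mirroring what UnaryExpression/BinaryExpression/TernaryExpression do for a fixed arity.
  override def eval(input: InternalRow): Any = {
    val values = children.map(_.eval(input))
    if (values.contains(null)) null else nullSafeEval(values)
  }

  // Called only when all children evaluated to non-null values.
  protected def nullSafeEval(inputs: Seq[Any]): Any
}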

@maropu (Member) Apr 3, 2018

I feel it's ok to discuss this in follow-up activities because it is less related to this PR. So, can you make this PR as minimal as possible?

Contributor Author

Ok, will try.

@gatorsmile (Member)

@maropu Maybe you can help @mn-mikke review this PR? I will open an umbrella JIRA for the built-in functions we plan to do in Apache Spark 2.4. In that list, we have multiple functions for operating on nested data.

@maropu (Member) commented Mar 23, 2018

ok, I'll check later!

trait UserDefinedExpression

/**
* The trait covers logic for performing null save evaluation and code generation.
Member

typo: null safe.

* override this.
*/
override def eval(input: InternalRow): Any =
{
Member

Spark usually uses a style like:

override def eval(input: InternalRow): Any = {
  val values = children.map(_.eval(input))
  if (values.contains(null)) {
    null
  } else {
    nullSafeEval(values)
  }
}

You could follow the style of the rest of the code.

Member

There are other places where the brace {} style doesn't follow the Spark code style. We should keep the code style consistent.

Contributor Author

I think I fixed all the style differences.

Member

It seems the style fix was missed here.

*/
override def eval(input: InternalRow): Any =
{
val values = children.map(_.eval(input))
Member

We probably don't need to evaluate all children. As soon as any child evaluates to null, we can just return null.
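
A hedged sketch of the short-circuiting variant this suggests (illustrative only):

override def eval(input: InternalRow): Any = {
  val values = new Array[Any](children.length)
  var i = 0
  while (i < children.length) {
    values(i) = children(i).eval(input)
    if (values(i) == null) return null  // stop as soon as any child evaluates to null
    i += 1
  }
  nullSafeEval(values)
}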


override def checkInputDataTypes(): TypeCheckResult = {
val arrayCheck = checkInputDataTypesAreArrays
if(arrayCheck.isFailure) arrayCheck
Member

Style issue:

if (...) {
  ...
} else {
  ...
}

/**
* The trait covers logic for performing null save evaluation and code generation.
*/
trait NullSafeEvaluation extends Expression
Member

Do we need to bring in NullSafeEvaluation? If only ConcatArrays uses it, we may not need to add it.

$resultCode
""" /: children.zip(gens)) {
case (acc, (child, gen)) =>
gen.code + ctx.nullSafeExec(child.nullable, gen.isNull)(acc)
Member

For example, for a binary expression, doesn't this generate code like:

rightGen.code + ctx.nullSafeExec(right.nullable, rightGen.isNull) {
  leftGen.code + ctx.nullSafeExec(left.nullable, leftGen.isNull) {
    ${ev.isNull} = false; // resultCode could change nullability.
    $resultCode
  }
}

For deterministic expressions the evaluation order doesn't matter, but for non-deterministic ones I'm a little concerned that it may cause unexpected changes.

* Concatenates multiple arrays into one.
*/
@ExpressionDescription(
usage = "_FUNC_(expr, ...) - Concatenates multiple arrays into one.",
Member

Please also state that the element types of the arrays must be the same.

val primitiveValueTypeName = CodeGenerator.primitiveTypeName(elementType)
val assignments = elements.map { el =>
s"""
|for(int z = 0; z < $el.numElements(); z++) {
Member

Style: for (

val assignments = elements.map { el =>
s"""
|for(int z = 0; z < $el.numElements(); z++) {
| if($el.isNullAt(z)) {
Member

Style: if ().

|Object[] $arrayName = new Object[$numElemName];
|int $counter = 0;
|$assignments
|$arrayDataName = new $genericArrayClass($arrayName);
Member

Can't we concatenate complex elements into UnsafeArrayData?

Member

+1, can we reuse the UnsafeArrayWriter logic for this case?

Contributor Author

I really like this idea! I think it would require moving the complex-type insertion logic from InterpretedUnsafeProjection directly into UnsafeDataWriter, introducing write methods for complex-type fields that way. I'm not sure whether such a big refactoring task is still in the scope of this PR.

I also see that we could improve the codegen of CreateArray in the same way.

Member

You couldn't use UnsafeArrayData in the complex case?

Contributor Author

Yeah, currently there are no write methods on UnsafeArrayWriter or set methods on UnsafeArrayData that we could leverage for complex types. In theory, we could follow the same approach as in InterpretedUnsafeProjection and write each complex type to a byte array, then insert the produced byte array into the target UnsafeArrayData. Since this logic could be utilized from more places (e.g. CreateArray), it should first be encapsulated into UnsafeArrayWriter or UnsafeArrayData. What do you think?

* @group collection_funcs
* @since 2.4.0
*/
def concat_arrays(columns: Column*): Column = withExpr { ConcatArrays(columns.map(_.expr)) }
@maropu (Member) Mar 26, 2018

Do we need to add this function in sql/functions here? It seems we might recommend users to use these kinds of functions via selectExpr, so is it okay to add this only in FunctionRegistry in terms of code simplicity and maintainability? Thoughts? @viirya @gatorsmile

@maropu (Member) commented Mar 26, 2018

Should we handle differently (but compatibly) typed arrays in this function?

scala> sql("select concat_arrays(array(1L, 2L), array(3, 4))").show
org.apache.spark.sql.AnalysisException: cannot resolve 'concat_arrays(array(1L, 2L), array(3, 4))' due to data type mismatch: input to function concat_arrays should all be the same type, but it's [array<bigint>, array<int>]; line 1 pos 7;
'Project [unresolvedalias(concat_arrays(array(1, 2), array(3, 4)), None)]
+- OneRowRelation

Also, could you add more tests for this case in SQLQueryTestSuite? We can probably add a new test file like concat_arrays.sql in typeCoercion/native.
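
For illustration, a couple of coercion cases such a test file might exercise, sketched via spark.sql with the function's final name; the expected widenings are assumptions based on the usual numeric promotion rules:

// bigint and int element types: the int array is expected to be widened to array<bigint>
spark.sql("SELECT concat(array(1L, 2L), array(3, 4))").show()

// int and double element types: both are expected to be widened to array<double>
spark.sql("SELECT concat(array(1, 2), array(3.5D, 4.5D))").show()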

> SELECT _FUNC_(array(1, 2, 3), array(4, 5), array(6));
[1,2,3,4,5,6]
""")
case class ConcatArrays(children: Seq[Expression]) extends Expression with NullSafeEvaluation {
Member

Can we add a common base class (e.g., ConcatLike) for handling nested ConcatArrays in the optimizer (CombineConcat)?
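
A hedged sketch of the suggested shared abstraction (names are illustrative; the concrete optimizer rule is omitted):

import org.apache.spark.sql.catalyst.expressions.Expression

// Marker trait that both the string Concat and ConcatArrays could mix in, so an
// optimizer rule such as CombineConcat can treat them uniformly.
trait ConcatLike extends Expression

object ConcatFlattening {
  // Flattens nested concatenations: concat(concat(a, b), c) becomes concat(a, b, c).
  def flattenChildren(children: Seq[Expression]): Seq[Expression] =
    children.flatMap {
      case c: ConcatLike => flattenChildren(c.children)
      case other => Seq(other)
    }
}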

@maropu (Member) commented Mar 26, 2018

Also, PostgreSQL has the function array_cat for concatenating arrays, so it might be better to make the behaviour the same as the PostgreSQL one:
https://www.postgresql.org/docs/10/static/functions-array.html

Examples:
> SELECT _FUNC_(array(1, 2, 3), array(4, 5), array(6));
[1,2,3,4,5,6]
""")
Member

Shall we add since too?

...
       [1,2,3,4,5,6]
  """,
  since = "2.4.0")

else TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), s"function $prettyName")
}

private def checkInputDataTypesAreArrays(): TypeCheckResult =
Member

Can we just put this in checkInputDataTypes?

override def dataType: ArrayType =
children
.headOption.map(_.dataType.asInstanceOf[ArrayType])
.getOrElse(ArrayType.defaultConcreteType.asInstanceOf[ArrayType])
Member

Should we allow empty children? I can't think of a use case for now, so we had better disallow it first.

Contributor Author

I definitely share your opinion, but I think we should be consistent across the whole Spark SQL API. Functions like concat and concat_ws accept empty children as well.

Member

Hmm, but then this is array<null> when the children are empty. It seems CreateArray's type is array<string> in this case.

Contributor Author

Ok, changing the return type to array<string> when no children are provided. I've also created the JIRA ticket SPARK-23798, since I don't see any reason why it couldn't return a default concrete type in this case. Hope I didn't miss anything.
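
A minimal sketch of the adjusted override being discussed (illustrative, assuming ArrayType and StringType are in scope):

override def dataType: ArrayType =
  children
    .headOption.map(_.dataType.asInstanceOf[ArrayType])
    .getOrElse(ArrayType(StringType))  // fall back to array<string> when there are no children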

Collection function: Concatenates multiple arrays into one.
:param cols: list of column names (string) or list of :class:`Column` expressions that have
the same data type.
Member

Shall we note that cols are expected to be of array type?

@mn-mikke (Contributor Author)

Merged concat and concat_arrays functions into one via an unresolved expression and subsequent resolution. Do you have any objections to this approach?
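
A hedged sketch of the placeholder side of that approach (simplified; the analyzer rule that rewrites it is omitted, and the class name follows the build output below):

import org.apache.spark.sql.catalyst.analysis.UnresolvedException
import org.apache.spark.sql.catalyst.expressions.{Expression, Unevaluable}
import org.apache.spark.sql.types.DataType

case class UnresolvedConcat(children: Seq[Expression]) extends Expression with Unevaluable {
  override def dataType: DataType = throw new UnresolvedException(this, "dataType")
  override def nullable: Boolean = throw new UnresolvedException(this, "nullable")
  override lazy val resolved: Boolean = false
}

// During resolution the placeholder would be replaced by a concrete expression:
//   all children of ArrayType -> ConcatArrays(children)
//   otherwise                 -> Concat(children) (the existing string/binary concatenation)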

@gatorsmile (Member)

ok to test

@gatorsmile (Member)

@mn-mikke Could you update the PR title?

@SparkQA commented Mar 26, 2018

Test build #88596 has finished for PR 20858 at commit bb46c3d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedConcat(children: Seq[Expression]) extends Expression

@mn-mikke (Contributor Author)

retest please

@mn-mikke changed the title from "[SPARK-23736][SQL] Implementation of the concat_arrays function concatenating multiple array columns into one" to "[SPARK-23736][SQL] Extending the concat function to support array columns" on Mar 26, 2018
@SparkQA commented Mar 26, 2018

Test build #88598 has finished for PR 20858 at commit 11205af.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mn-mikke (Contributor Author)

retest please

@SparkQA commented Mar 27, 2018

Test build #88605 has finished for PR 20858 at commit 753499d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

[Row(arr=[1, 2, 3, 4, 5]), Row(arr=None)]
"""
sc = SparkContext._active_spark_context
return Column(sc._jvm.functions.concat(_to_seq(sc, cols, _to_java_column)))
Member

Why did we move this down?

Contributor Author

The whole file is divided into sections according to groups of functions. Based on @gatorsmile's suggestion, the concat function should be categorized as a collection function, so I moved it to comply with the file structure.

@mn-mikke (Contributor Author)

It seems that we experienced the same problem with the failing "RateSourceV2Suite.basic microbatch execution" test reported here.

@SparkQA commented Apr 12, 2018

Test build #89286 has finished for PR 20858 at commit 6bb33e6.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 12, 2018

Test build #89302 has finished for PR 20858 at commit 944e0c9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 13, 2018

Test build #89323 has finished for PR 20858 at commit 7f5124b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

("UnsafeArrayData", arrayData),
("int[]", counter)))

s"""new Object() {
Member

nit:

s"""
   |new Object() {
...

if (inputs.contains(null)) {
null
} else {
val elements = inputs.flatMap(_.asInstanceOf[ArrayData].toObjectArray(elementType))
@kiszk (Member) Apr 13, 2018

Can we always allocate a concatenated array? I think the total array element size may overflow in some cases.

arguments = Seq(
(s"${javaType}[]", "args"),
("UnsafeArrayData", arrayData),
("int[]", counter)))
Member

I guess we can simply use for-loop here?

for (int $idx = 0; $idx < ${children.length}; $idx++) {
  for (int z = 0; z < args[$idx].numElements(); z++) {
    ...
  }
}

|int[] $tempVariableName = new int[]{0};
|$assignmentSection
|final int $numElementsConstant = $tempVariableName[0];
""".stripMargin,
Member

I guess we can simply use for-loop here?

int $tempVariableName = 0;
for (int $idx = 0; $idx < ${children.length}; $idx++) {
  $tempVariableName += args[$idx].numElements();
}
final int $numElementsConstant = $tempVariableName;

|boolean[] $isNullVariable = new boolean[]{false};
|$assignmentSection;
|if ($isNullVariable[0]) return null;
""".stripMargin
Member

I guess we can simply use for-loop here?

for (int $idx = 0; $idx < ${children.length}; $idx++) {
  if (args[$idx] == null) {
    return null;
  }
}

We can return as soon as we find a null in this case.

val assignmentSection = ctx.splitExpressions(
expressions = assignments,
funcName = "complexArrayConcat",
arguments = Seq((s"${javaType}[]", "args"), ("Object[]", arrayData), ("int[]", counter)))
Member

I guess we can simply use for-loop here?

for (int $idx = 0; $idx < ${children.length}; $idx++) {
  for (int z = 0; z < args[$idx].numElements(); z++) {
    ...
  }
}

val assignments = (0 until children.length).map { idx =>
s"""
|for (int z = 0; z < args[$idx].numElements(); z++) {
| $arrayData[$counter[0]] = ${CodeGenerator.getValue(s"args[$idx]", elementType, "z")};
Member

Do we need to check for null?

Contributor Author

Here we operate only on non-primitive types, where null is treated as a regular value, so the null check shouldn't be necessary.
The added tests should cover this scenario.

@SparkQA commented Apr 16, 2018

Test build #89402 has finished for PR 20858 at commit 600ae89.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 17, 2018

Test build #89456 has finished for PR 20858 at commit f2a67e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


override def dataType: DataType = children.map(_.dataType).headOption.getOrElse(StringType)

lazy val javaType: String = CodeGenerator.javaType(dataType)
Member

We can move this into the doGenCode() method.

Contributor Author

Good point! But I think it would be better to reuse javaType also in genCodeForPrimitiveArrays and genCodeForNonPrimitiveArrays.

ev.copy(s"""
$initCode
$codes
${javaType} ${ev.value} = $concatenator.concat($args);
Member

nit: $javaType

@ueshin (Member) commented Apr 18, 2018

LGTM except for nits.

@SparkQA commented Apr 18, 2018

Test build #89495 has finished for PR 20858 at commit 8a125d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 18, 2018

Test build #89504 has finished for PR 20858 at commit 5a4cc8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HasCollectSubModels(Params):
  • class Summarizer(object):
  • class SummaryBuilder(JavaWrapper):
  • class CrossValidator(Estimator, ValidatorParams, HasParallelism, HasCollectSubModels,
  • class TrainValidationSplit(Estimator, ValidatorParams, HasParallelism, HasCollectSubModels,
  • case class Reverse(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
  • case class ArrayMin(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
  • class ArrayDataIndexedSeq[T](arrayData: ArrayData, dataType: DataType) extends IndexedSeq[T]
  • abstract class MemoryStreamBase[A : Encoder](sqlContext: SQLContext) extends BaseStreamingSource
  • class ContinuousMemoryStream[A : Encoder](id: Int, sqlContext: SQLContext)
  • case class GetRecord(offset: ContinuousMemoryStreamPartitionOffset)
  • class ContinuousMemoryStreamDataReaderFactory(
  • class ContinuousMemoryStreamDataReader(
  • case class ContinuousMemoryStreamOffset(partitionNums: Map[Int, Int])
  • case class ContinuousMemoryStreamPartitionOffset(partition: Int, numProcessed: Int)

@SparkQA commented Apr 19, 2018

Test build #89560 has finished for PR 20858 at commit f7bdcf7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ArrayPosition(left: Expression, right: Expression)

@SparkQA commented Apr 19, 2018

Test build #89573 has finished for PR 20858 at commit 36d5d25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ElementAt(left: Expression, right: Expression) extends GetMapValueUtil
  • abstract class GetMapValueUtil extends BinaryExpression with ImplicitCastInputTypes
  • case class GetMapValue(child: Expression, key: Expression)

@ueshin (Member) commented Apr 20, 2018

Thanks! Merging to master.


override def foldable: Boolean = children.forall(_.foldable)

override def eval(input: InternalRow): Any = dataType match {
Contributor

This pattern match will probably cause a significant regression in the interpreted (non-codegen) mode, due to the way Scala pattern matching is implemented.

Contributor Author

Thanks! I've created #22471 to call the pattern matching only once.

WDYT about Reverse? It looks like a similar problem.
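
For reference, a hedged sketch of the fix's shape: the type dispatch is resolved once into a function instead of pattern matching on dataType for every row (simplified; the size-limit checks of the real implementation are omitted, and this is not necessarily what the follow-up PR does):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types.{ArrayType, BinaryType, StringType}
import org.apache.spark.unsafe.types.{ByteArray, UTF8String}

// Chosen once per expression instance, not re-evaluated per row.
@transient private lazy val concatImpl: Seq[Any] => Any = dataType match {
  case BinaryType =>
    inputs => ByteArray.concat(inputs.map(_.asInstanceOf[Array[Byte]]): _*)
  case StringType =>
    inputs => UTF8String.concat(inputs.map(_.asInstanceOf[UTF8String]): _*)
  case ArrayType(elementType, _) =>
    inputs =>
      new GenericArrayData(inputs.flatMap(_.asInstanceOf[ArrayData].toObjectArray(elementType)))
}

override def eval(input: InternalRow): Any = {
  val inputs = children.map(_.eval(input))
  if (inputs.contains(null)) null else concatImpl(inputs)
}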
