Conversation

@HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Oct 19, 2019

What changes were proposed in this pull request?

There's a case where MapObjects has a lambda function which creates a nested struct - unsafe data nested inside a safe data struct. In this case, MapObjects doesn't copy the row returned from the lambda function (as the outermost data type is a safe data struct), so the nested unsafe data is never copied.

The culprit is that UnsafeProjection.toUnsafeExprs converts CreateNamedStruct to CreateNamedStructUnsafe (the only place where CreateNamedStructUnsafe is used), which temporarily mixes safe and unsafe data. This conversion may not be needed at all, at least logically, since UnsafeProjection will eventually assemble these evaluations into an UnsafeRow anyway.
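To make the aliasing concrete, here is a minimal hand-written sketch (not the generated code itself, just an illustration of how the reused row writer behaves): the lambda writes each element through a reused UnsafeRowWriter, and without a copy() every reference returned by getRow() points at the same mutable buffer.

import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter

// Illustrative only: the same UnsafeRow instance is returned on every call,
// so all "elements" end up reading whatever was written last.
val writer = new UnsafeRowWriter(1)
val rows = (1 to 3).map { i =>
  writer.reset()
  writer.zeroOutNullBytes()
  writer.write(0, i)
  writer.getRow()   // without .copy(), this is the same object on each iteration
}
// rows.map(_.getInt(0)) == Seq(3, 3, 3) rather than Seq(1, 2, 3)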

Before the patch

/* 105 */   private ArrayData MapObjects_0(InternalRow i) {
/* 106 */     boolean isNull_1 = i.isNullAt(0);
/* 107 */     ArrayData value_1 = isNull_1 ?
/* 108 */     null : (i.getArray(0));
/* 109 */     ArrayData value_0 = null;
/* 110 */
/* 111 */     if (!isNull_1) {
/* 112 */
/* 113 */       int dataLength_0 = value_1.numElements();
/* 114 */
/* 115 */       ArrayData[] convertedArray_0 = null;
/* 116 */       convertedArray_0 = new ArrayData[dataLength_0];
/* 117 */
/* 118 */
/* 119 */       int loopIndex_0 = 0;
/* 120 */
/* 121 */       while (loopIndex_0 < dataLength_0) {
/* 122 */         value_MapObject_lambda_variable_1 = (int) (value_1.getInt(loopIndex_0));
/* 123 */         isNull_MapObject_lambda_variable_1 = value_1.isNullAt(loopIndex_0);
/* 124 */
/* 125 */         ArrayData arrayData_0 = ArrayData.allocateArrayData(
/* 126 */           -1, 1L, " createArray failed.");
/* 127 */
/* 128 */         mutableStateArray_0[0].reset();
/* 129 */
/* 130 */
/* 131 */         mutableStateArray_0[0].zeroOutNullBytes();
/* 132 */
/* 133 */
/* 134 */         if (isNull_MapObject_lambda_variable_1) {
/* 135 */           mutableStateArray_0[0].setNullAt(0);
/* 136 */         } else {
/* 137 */           mutableStateArray_0[0].write(0, value_MapObject_lambda_variable_1);
/* 138 */         }
/* 139 */         arrayData_0.update(0, (mutableStateArray_0[0].getRow()));
/* 140 */         if (false) {
/* 141 */           convertedArray_0[loopIndex_0] = null;
/* 142 */         } else {
/* 143 */           convertedArray_0[loopIndex_0] = arrayData_0 instanceof UnsafeArrayData? arrayData_0.copy() : arrayData_0;
/* 144 */         }
/* 145 */
/* 146 */         loopIndex_0 += 1;
/* 147 */       }
/* 148 */
/* 149 */       value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(convertedArray_0);
/* 150 */     }
/* 151 */     globalIsNull_0 = isNull_1;
/* 152 */     return value_0;
/* 153 */   }

After the patch

/* 104 */   private ArrayData MapObjects_0(InternalRow i) {
/* 105 */     boolean isNull_1 = i.isNullAt(0);
/* 106 */     ArrayData value_1 = isNull_1 ?
/* 107 */     null : (i.getArray(0));
/* 108 */     ArrayData value_0 = null;
/* 109 */
/* 110 */     if (!isNull_1) {
/* 111 */
/* 112 */       int dataLength_0 = value_1.numElements();
/* 113 */
/* 114 */       ArrayData[] convertedArray_0 = null;
/* 115 */       convertedArray_0 = new ArrayData[dataLength_0];
/* 116 */
/* 117 */
/* 118 */       int loopIndex_0 = 0;
/* 119 */
/* 120 */       while (loopIndex_0 < dataLength_0) {
/* 121 */         value_MapObject_lambda_variable_1 = (int) (value_1.getInt(loopIndex_0));
/* 122 */         isNull_MapObject_lambda_variable_1 = value_1.isNullAt(loopIndex_0);
/* 123 */
/* 124 */         ArrayData arrayData_0 = ArrayData.allocateArrayData(
/* 125 */           -1, 1L, " createArray failed.");
/* 126 */
/* 127 */         Object[] values_0 = new Object[1];
/* 128 */
/* 129 */
/* 130 */         if (isNull_MapObject_lambda_variable_1) {
/* 131 */           values_0[0] = null;
/* 132 */         } else {
/* 133 */           values_0[0] = value_MapObject_lambda_variable_1;
/* 134 */         }
/* 135 */
/* 136 */         final InternalRow value_3 = new org.apache.spark.sql.catalyst.expressions.GenericInternalRow(values_0);
/* 137 */         values_0 = null;
/* 138 */         arrayData_0.update(0, value_3);
/* 139 */         if (false) {
/* 140 */           convertedArray_0[loopIndex_0] = null;
/* 141 */         } else {
/* 142 */           convertedArray_0[loopIndex_0] = arrayData_0 instanceof UnsafeArrayData? arrayData_0.copy() : arrayData_0;
/* 143 */         }
/* 144 */
/* 145 */         loopIndex_0 += 1;
/* 146 */       }
/* 147 */
/* 148 */       value_0 = new org.apache.spark.sql.catalyst.util.GenericArrayData(convertedArray_0);
/* 149 */     }
/* 150 */     globalIsNull_0 = isNull_1;
/* 151 */     return value_0;
/* 152 */   }

Why are the changes needed?

This patch fixes the bug described above.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

A unit test was added which fails on the master branch and passes with this PR.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Oct 19, 2019

Credit to @aaronlewism for reporting this issue and providing a simple reproducer.

Please let me know if we would rather check the type recursively and call copy() on all nested unsafe data structs.
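For reference, a rough sketch of the reproducer shape being discussed, reconstructed from the test added later in this PR (assumes a SparkSession named spark with implicits in scope; the column name and aliases are illustrative):

import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.objects.MapObjects
import org.apache.spark.sql.functions.{array, struct}
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

// Run with whole-stage codegen disabled to hit the MapObjects path described above.
// items: Seq[Int] => items.map { item => Seq(Struct(item)) }
val df = spark.sparkContext.parallelize(Seq(Seq(1, 2, 3))).toDF("items")
val result = df.select(
  new Column(MapObjects(
    (item: Expression) => array(struct(new Column(item))).expr,
    df("items").expr,
    IntegerType)) as "items_in_array")
  .collect()
// Expected: Row(Seq(Seq(Row(1)), Seq(Row(2)), Seq(Row(3)))); before the fix the
// nested structs alias the same buffer and all read back the last element.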

@SparkQA

SparkQA commented Oct 19, 2019

Test build #112311 has finished for PR 26173 at commit 16e4954.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Oct 19, 2019

Test build #112313 has finished for PR 26173 at commit 16e4954.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 19, 2019

Test build #112314 has finished for PR 26173 at commit 8b2be71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines 316 to 324
val runInsideLoop = expressions.exists {
  case _: LambdaVariable => true
  case _ => false
}
val extractValueCode = if (runInsideLoop) {
  s"$rowWriter.getRow().copy()"
} else {
  s"$rowWriter.getRow()"
}


Nice solution! Correct me if I'm wrong, but I believe this means it should be safe to get rid of the top-level copy from applying makeCopyIfInstanceOf in MapObjects

Contributor Author

I'm not an expert in the SQL area (limited knowledge), so I'm not sure this is the only such path - I'd feel safer if SQL experts jumped in and reviewed the approach. If we're uncertain and don't feel safe fixing only the discovered case, I may need to make further changes.

Contributor

Agree that we should fix CodegenContext.generateExpressionAndCopyIfUnsafe. To simplify codegen, maybe we can add copyUnsafeData to InternalRow/ArrayData/MapData.
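A minimal sketch of what such a copyUnsafeData helper could look like (the name comes from the comment above; the signature and recursion are assumptions, and this is not what was ultimately merged):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeArrayData, UnsafeRow}
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

// Hypothetical helper: deep-copy any unsafe containers nested inside a value,
// guided by its DataType. MapData handling is omitted for brevity.
def copyUnsafeData(value: Any, dataType: DataType): Any = (value, dataType) match {
  case (null, _) => null
  case (r: UnsafeRow, _) => r.copy()
  case (a: UnsafeArrayData, _) => a.copy()
  case (r: InternalRow, st: StructType) =>
    new GenericInternalRow(st.fields.zipWithIndex.map { case (f, i) =>
      copyUnsafeData(r.get(i, f.dataType), f.dataType)
    })
  case (a: ArrayData, ArrayType(elementType, _)) =>
    new GenericArrayData(a.toObjectArray(elementType).map(copyUnsafeData(_, elementType)))
  case (other, _) => other
}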

Contributor Author

Just to follow up on the review comment clearly: there's no CodegenContext.generateExpressionAndCopyIfUnsafe. Did you mean makeCopyIfInstanceOf inside MapObjects.doGenCode, or are you suggesting we create a new method CodegenContext.generateExpressionAndCopyIfUnsafe?

@HeartSaVioR
Contributor Author

cc. @cloud-fan @maropu @viirya

}

test("SPARK-29503 nest unsafe struct inside safe array") {
withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false") {
Member

Is this issue encountered only when whole-stage codegen is disabled?

Contributor Author

At least yes for the provided reproducer. For other cases I'm not sure.

Member

I see. With whole-stage codegen, CreateNamedStruct is not converted to CreateNamedStructUnsafe, so the nested struct is not an unsafe one.

Contributor Author

Ah, OK, got it. Thanks for the explanation.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Oct 21, 2019

I've changed the approach to what @cloud-fan had suggested.

It has to pass a data type in order to check for nested structs and copy the unsafe data, so the data type needs to be serialized to JSON and deserialized inside the method. Hopefully this is OK given we only do it once per code generation for MapObjects.
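A tiny illustration of the round-trip being described (illustrative only; the actual call site in the generated code differs):

import org.apache.spark.sql.types._

val dt: DataType = ArrayType(StructType(Seq(StructField("col1", IntegerType))))
val json = dt.json                      // serialized once at codegen time
val restored = DataType.fromJson(json)  // deserialized inside the helper method
assert(restored == dt)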

I'd like to see whether the direction is OK - if we agree on it, I'll refine the code, add some UTs for the new methods, and change the title and description of the PR.

In case we need to roll back, the previous commit head is here:
https://github.com/HeartSaVioR/spark/commits/8b2be7116f61df17c7e538d026a7b82b94bf0f12

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112391 has finished for PR 26173 at commit 39d0a2e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112395 has finished for PR 26173 at commit 07514a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

Escaping the JSON doesn't seem to work consistently when the generated code is compiled.

val typeToJson = StringEscapeUtils.escapeJava(StringEscapeUtils.escapeJson(dataType.json))

DataFrameComplexTypeSuite.SPARK-29503 nest unsafe struct inside safe array

arrayData_0.copyUnsafeData("{\\\"type\\\":\\\"array\\\",\\\"elementType\\\":{\\\"type\\\":\\\"struct\\\",\\\"fields\\\":[{\\\"name\\\":\\\"col1\\\",\\\"type\\\":\\\"integer\\\",\\\"nullable\\\":true,\\\"metadata\\\":{}}]},\\\"containsNull\\\":false}")
/* 140 */         if (false) {
/* 141 */           convertedArray_0[loopIndex_0] = null;
/* 142 */         } else {
/* 143 */           convertedArray_0[loopIndex_0] = arrayData_0.copyUnsafeData("{\"type\":\"array\",\"elementType\":{\"type\":\"struct\",\"fields\":[{\"name\":\"col1\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]},\"containsNull\":false}");
/* 144 */         }

ALSSuite.recommendForAllUsers with k <, = and > num_items

serializefromobject_value_2.copyUnsafeData("{\\\"type\\\":\\\"struct\\\",\\\"fields\\\":[{\\\"name\\\":\\\"_1\\\",\\\"type\\\":\\\"integer\\\",\\\"nullable\\\":false,\\\"metadata\\\":{}},{\\\"name\\\":\\\"_2\\\",\\\"type\\\":{\\\"type\\\":\\\"array\\\",\\\"elementType\\\":\\\"float\\\",\\\"containsNull\\\":false},\\\"nullable\\\":true,\\\"metadata\\\":{}}]}")
/* 112 */         if (serializefromobject_isNull_2) {
/* 113 */           serializefromobject_convertedArray_0[serializefromobject_loopIndex_0] = null;
/* 114 */         } else {
/* 115 */           serializefromobject_convertedArray_0[serializefromobject_loopIndex_0] = serializefromobject_value_2.copyUnsafeData("{"type":"struct","fields":[{"name":"_1","type":"integer","nullable":false,"metadata":{}},{"name":"_2","type":{"type":"array","elementType":"float","containsNull":false},"nullable":true,"metadata":{}}]}");
/* 116 */         }

val typeToJson = StringEscapeUtils.escapeJson(dataType.json)

DataFrameComplexTypeSuite.SPARK-29503 nest unsafe struct inside safe array

arrayData_0.copyUnsafeData("{\"type\":\"array\",\"elementType\":{\"type\":\"struct\",\"fields\":[{\"name\":\"col1\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]},\"containsNull\":false}")
/* 140 */         if (false) {
/* 141 */           convertedArray_0[loopIndex_0] = null;
/* 142 */         } else {
/* 143 */           convertedArray_0[loopIndex_0] = arrayData_0.copyUnsafeData("{"type":"array","elementType":{"type":"struct","fields":[{"name":"col1","type":"integer","nullable":true,"metadata":{}}]},"containsNull":false}");
/* 144 */         }

ALSSuite.recommendForAllUsers with k <, = and > num_items

serializefromobject_value_2.copyUnsafeData("{\"type\":\"struct\",\"fields\":[{\"name\":\"_1\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{}},{\"name\":\"_2\",\"type\":{\"type\":\"array\",\"elementType\":\"float\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}}]}")
/* 112 */         if (serializefromobject_isNull_2) {
/* 113 */           serializefromobject_convertedArray_0[serializefromobject_loopIndex_0] = null;
/* 114 */         } else {
/* 115 */           serializefromobject_convertedArray_0[serializefromobject_loopIndex_0] = serializefromobject_value_2.copyUnsafeData("{"type":"struct","fields":[{"name":"_1","type":"integer","nullable":false,"metadata":{}},{"name":"_2","type":{"type":"array","elementType":"float","containsNull":false},"nullable":true,"metadata":{}}]}");
/* 116 */         }

val typeToJson = StringEscapeUtils.escapeJava(dataType.json)

DataFrameComplexTypeSuite.SPARK-29503 nest unsafe struct inside safe array

arrayData_0.copyUnsafeData("{\"type\":\"array\",\"elementType\":{\"type\":\"struct\",\"fields\":[{\"name\":\"col1\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]},\"containsNull\":false}")
/* 140 */         if (false) {
/* 141 */           convertedArray_0[loopIndex_0] = null;
/* 142 */         } else {
/* 143 */           convertedArray_0[loopIndex_0] = arrayData_0.copyUnsafeData("{"type":"array","elementType":{"type":"struct","fields":[{"name":"col1","type":"integer","nullable":true,"metadata":{}}]},"containsNull":false}");
/* 144 */         }

ALSSuite.recommendForAllUsers with k <, = and > num_items

serializefromobject_value_2.copyUnsafeData("{\"type\":\"struct\",\"fields\":[{\"name\":\"_1\",\"type\":\"integer\",\"nullable\":false,\"metadata\":{}},{\"name\":\"_2\",\"type\":{\"type\":\"array\",\"elementType\":\"float\",\"containsNull\":false},\"nullable\":true,\"metadata\":{}}]}")
/* 112 */         if (serializefromobject_isNull_2) {
/* 113 */           serializefromobject_convertedArray_0[serializefromobject_loopIndex_0] = null;
/* 114 */         } else {
/* 115 */           serializefromobject_convertedArray_0[serializefromobject_loopIndex_0] = serializefromobject_value_2.copyUnsafeData("{"type":"struct","fields":[{"name":"_1","type":"integer","nullable":false,"metadata":{}},{"name":"_2","type":{"type":"array","elementType":"float","containsNull":false},"nullable":true,"metadata":{}}]}");
/* 116 */         }

So double escaping seems to be the only working solution for the new UT, but it consistently fails in ALSSuite - the \\ escape is gone. Not sure why.

@HeartSaVioR
Contributor Author

Using the catalog string instead of JSON seems to work for both tests. Let me see the build result.

s"$value instanceof ${clazz.getSimpleName}? ${value}.copy() : $value"
// Make a copy of the unsafe data if the result contains any
def makeCopyUnsafeData(dataType: DataType, value: String) = {
s"""${value}.copyUnsafeData("${dataType.catalogString}")"""
Member

Is this safe for codegen? For a complex type, will we bump into a very long string here?

Contributor Author

I guess it depends on what we'd classify as a very long string. The JSON form can get very long much more easily; the catalog string stays shorter. I agree the approach is questionable, but even reading a value from a field of an InternalRow requires the schema, so I'm not sure there's an alternative approach here.
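For a rough feel of the size difference, a quick illustrative snippet (not from the PR):

import org.apache.spark.sql.types._

val dt = ArrayType(StructType(Seq(StructField("col1", IntegerType))), containsNull = false)
dt.catalogString  // "array<struct<col1:int>>" - compact
dt.json           // nested JSON carrying name/type/nullable/metadata per field - far longer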

Contributor Author

We may be able to let CodeGenerator walk the schema and generate code based on it, but I suspect the overall amount of generated code could become larger, since the generated code cannot leverage the schema at runtime and we would have to generate code for accessing and replacing every field recursively.

Also, a string literal has a 64k limit, which isn't exactly short - the same limit applies to method length. We might even be able to apply a trick here: define the string as a static field (or a field of the class), split it into chunks under 64k, and concatenate them into a single String instance to get around the limitation.

Contributor Author

So the approach depends on how long a catalog string we expect for a very complex type.

  1. If we expect the size to be negligible, it would work as it is. We might refine it a bit by defining a local variable and reading from there, but that may require additional conditional logic.
  2. If we expect it to be under 64k but still sizable, and we don't want it to eat into the method-length limit, we may need to define it as a static field.
  3. If we expect it to exceed 64k, it has to be a static field, and we also need to break the catalog string into chunks and concatenate them into a single String instance.

Do we have a huge and complicated but realistic (production) schema to test this with?

Member

@viirya viirya left a comment

I feel the current approach is overkill. I also suspect a possible performance regression. Can we avoid converting CreateNamedStruct to CreateNamedStructUnsafe in this special case?

@HeartSaVioR
Contributor Author

HeartSaVioR commented Oct 21, 2019

Would it work if we replaced CreateNamedStructUnsafe with CreateNamedStruct in the lambda expression of MapObjects?

I'm not sure we wouldn't miss anything, since that only deals with the struct type - but what we've discovered so far is the struct type, so the approach makes sense to me for resolving SPARK-29503. Just as a view from a non-expert in SQL, the current approach feels safer to me, in the sense of defensive programming.

I'll wait to hear more voices before making the change so we don't go back and forth. Thanks for the suggestion!

@SparkQA

SparkQA commented Oct 21, 2019

Test build #112413 has finished for PR 26173 at commit 4c38ffc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 22, 2019

Test build #112426 has finished for PR 26173 at commit 6187d99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cloud-fan commented Oct 22, 2019

It's safer to make the copy support mixed safe and unsafe data, but I agree with the perf concern from @viirya.

How many places do we create mixed safe and unsafe data? It looks to me like there is only one: UnsafeProjection.toUnsafeExprs only converts create_struct, not create_array or create_map.

Furthermore, I don't quite understand why we need CreateNamedStructUnsafe. Can we simply remove it?

@HeartSaVioR
Contributor Author

I quickly checked the usage of CreateNamedStructUnsafe, and it is only used in UnsafeProjection.toUnsafeExprs. If UnsafeProjection converts the results of evaluating these expressions into an UnsafeRow (which it should), I don't see a strict need to care about the individual expressions, at least logically.

So I'm seeing the same thing: CreateNamedStructUnsafe doesn't seem to be needed at all. As a side effect, UnsafeProjection.toUnsafeExprs isn't needed either.

I've just removed them and rebased the branch to contain only that change. The new test passes with the change, so let's see the build result.

Old commit hash to revert: 6187d99
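For reference, roughly what the removed conversion looked like (a reconstruction from the discussion above, not copied verbatim from the Spark source):

import org.apache.spark.sql.catalyst.expressions.{CreateNamedStruct, CreateNamedStructUnsafe, Expression}

// Sketch of the helper in the UnsafeProjection companion object: the only place
// that produced CreateNamedStructUnsafe, and what this PR removes.
def toUnsafeExprs(exprs: Seq[Expression]): Seq[Expression] = {
  exprs.map(_.transform {
    case CreateNamedStruct(children) => CreateNamedStructUnsafe(children)
  })
}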

@HeartSaVioR HeartSaVioR changed the title [SPARK-29503][SQL] Copy result row from RowWriter in GenerateUnsafeProjection when the expression is lambdaFunction in MapObject [SPARK-29503][SQL] Remove conversion CreateNamedStruct to CreateNamedStructUnsafe Oct 22, 2019
@HeartSaVioR
Contributor Author

OFFTOPIC: I noticed that ComplexTypes.scala uses CRLF instead of LF, which shows ^M in git diff and is a bit annoying. We don't seem to have set any rule on this, but personally I think it would be better to make it consistent. Have we discussed this before?

@viirya
Member

viirya commented Oct 22, 2019

Looks much simpler. If there are no other concerns, this approach looks good.

@SparkQA

SparkQA commented Oct 23, 2019

Test build #112485 has finished for PR 26173 at commit d8d3900.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CreateNamedStruct(children: Seq[Expression]) extends Expression

Member

@maropu maropu left a comment

Sorry, I'm late. The current one looks OK to me, too. I left some super minor comments.

assert(result.size === 1)
assert(getValueInsideDepth(result.head, 0) === 1)
assert(getValueInsideDepth(result.head, 1) === 2)
assert(getValueInsideDepth(result.head, 2) === 3)
Member

nit: just check assert(result === Row(Seq(Seq(Row(1)), Seq(Row(2)), Seq(Row(3)))) :: Nil)?

Contributor

+1


test("SPARK-29503 nest unsafe struct inside safe array") {
withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> "false") {
val exampleDS = spark.sparkContext.parallelize(Seq(Seq(1, 2, 3))).toDF("items")
Member

super nit: exampleDF?

Contributor

or simply df

@maropu
Member

maropu commented Oct 23, 2019

Ur, for better commit logs, can you describe the change to the generated code for MapObjects in the PR description?

@HeartSaVioR
Contributor Author

Thanks for reviewing. Just addressed review comments.

import scala.collection.mutable

import org.apache.spark.sql.catalyst.DefinedByConstructorParams
import org.apache.spark.sql.catalyst.expressions.{Expression, GenericRowWithSchema}
Member

@maropu maropu Oct 23, 2019

nit: plz remove GenericRowWithSchema


package org.apache.spark.sql

import scala.collection.mutable
Member

@maropu maropu Oct 23, 2019

nit: plz remove this import.

@SparkQA

SparkQA commented Oct 23, 2019

Test build #112508 has finished for PR 26173 at commit 4460233.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 23, 2019

Test build #112509 has finished for PR 26173 at commit f10042a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Oct 23, 2019

Test build #112520 has finished for PR 26173 at commit f10042a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in bfbf282 Oct 23, 2019
@HeartSaVioR
Contributor Author

Thanks all for reviewing and merging!

@HeartSaVioR HeartSaVioR deleted the SPARK-29503 branch October 23, 2019 19:47
// items: Seq[Int] => items.map { item => Seq(Struct(item)) }
val result = df.select(
new Column(MapObjects(
(item: Expression) => array(struct(new Column(item))).expr,
Member

Hm, while the fix seems fine to me too, was this the only reproducer? MapObjects is meant for internal purposes - it's under the catalyst package.

Contributor Author

@HeartSaVioR HeartSaVioR Oct 29, 2019

I haven't spent extra time trying to find another one (as this seems like a clean and simple reproducer). I'm not sure it stops being a valid reproducer just because it pulls in the catalyst package - catalyst could analyze a query and inject it where necessary in any case.

I noticed you'd like to revisit #25745 - that was WIP and didn't show any performance-gain numbers. I'd rather choose "safeness" over "speed", especially since we haven't established that there's a meaningful difference between the two. This was the only case where MapObjects could contain an unsafe struct; allowing it lets safe and unsafe data get mixed up, leading to corner cases like this one.

Member

@HyukjinKwon HyukjinKwon Oct 29, 2019

Yeah, I am not against this change. In that sense I think this fix is fine, but I wanted to know whether it actually affects any user-facing surface.

I was also wondering if we could benefit from #25745, since some investigation was already done there toward using the unsafe variant throughout instead.

Contributor Author

My guess about #25745 is that it was based on the assumption that replacing CreateNamedStruct with CreateNamedStructUnsafe is safe, since we already had one path doing that - and this observation broke the assumption. IMHO, now that we've found it isn't safe, the improvement has to prove its safety before we take its benefits into account.

@dongjoon-hyun
Member

Hi, All.
This seems to have existed since 2.1.1 - do you think we can backport this?

@maropu
Member

maropu commented Jan 26, 2020

Is this an internal bug? (Referring to @HyukjinKwon's suggestion: https://github.com/apache/spark/pull/26173/files#r339915780)
If so, I think it's less worth backporting.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 26, 2020

Do you mean that this PR doesn't fix anything user-facing in the final results?

Is this an internal bug?

@maropu
Member

maropu commented Jan 26, 2020

If users don't use internal objects (MapObjects in this case), then yes, I think so.

@dongjoon-hyun
Member

@maropu Please see the example in the JIRA.

  • It's already exposed to the users.
  • catalyst being internal means that we can change it freely without deprecations or the like.

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 26, 2020

The Databricks Delta library also uses our internals, so it had some problems with 3.0.0. It's a frequent situation.

@maropu
Member

maropu commented Jan 26, 2020

Ah, I see. Probably I don't clearly understand which components correctness should cover (I thought it should cover explicitly public APIs only).

@dongjoon-hyun
Member

dongjoon-hyun commented Jan 26, 2020

No~ Thank you for the feedback, @maropu. The reason I'm asking around (or pushing more) is that there was no consensus on it. Since the RC1 failure, I've been collecting opinions, including yours.

@maropu
Member

maropu commented Jan 26, 2020

Ah, I see, OK. Thanks for the kind explanation! Yeah, it's important to discuss this more.

@HeartSaVioR
Contributor Author

I also agree that catalyst has been treated as a non-public API, but it seems there's no harm in porting this back to the 2.4 branch. (And even though we treat catalyst as a non-public API, it looks like end users don't recognize that and use it anyway.) No strong opinion though - OK either way. I can follow up if we decide to backport.
