[SPARK-23581][SQL] Add interpreted unsafe projection #20750
Conversation
```scala
 * @param expressions that produces the resulting fields. These expressions must be bound
 * to a schema.
 */
class InterpretedUnsafeProjection(expressions: Array[Expression]) extends UnsafeProjection {
```
The current implementation takes a two-step approach: first it evaluates the expressions and puts them in an intermediate row, and then it converts this row to an UnsafeRow. We could also just create a converter from InternalRow to UnsafeRow and punt the projection work off to an InterpretedMutableProjection.
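The two-step approach can be sketched in a simplified, hypothetical form (none of these class or method names are Spark's; the row-to-UnsafeRow conversion in step 2 is elided):

```java
import java.util.List;
import java.util.function.Function;

// Simplified model of the two-step approach: step 1 evaluates each bound
// expression against the input row into an intermediate Object[] row;
// step 2 would hand that row to a row-to-UnsafeRow converter. The converter
// is elided here, so we simply return the intermediate row.
class TwoStepProjection {
    private final List<Function<Object[], Object>> expressions;

    TwoStepProjection(List<Function<Object[], Object>> expressions) {
        this.expressions = expressions;
    }

    Object[] apply(Object[] input) {
        // Step 1: evaluate the expressions into an intermediate row.
        Object[] intermediate = new Object[expressions.size()];
        for (int i = 0; i < expressions.size(); i++) {
            intermediate[i] = expressions.get(i).apply(input);
        }
        // Step 2 (elided): convert `intermediate` to the unsafe format.
        return intermediate;
    }
}
```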
```scala
  checkEvalutionWithUnsafeProjection(expression, expected, inputRow, InterpretedUnsafeProjection)
}

protected def checkEvalutionWithUnsafeProjection(
```
nit: typo in Evalu(a)tion
Sorry for the basic question, but I don't see where the switch between the interpreted and the generated version is added. Maybe I am just missing something, since I am not an expert on how Spark switches between execution modes, but I expected it in the …
Test build #88008 has finished for PR 20750 at commit
@mgaido91 you are right. It is on purpose. I do not like to introduce these things in one big bang (this makes it hard to review and might make it hard to merge). In this PR we just introduce the …
Test build #88006 has finished for PR 20750 at commit
# Conflicts: # sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala
Test build #88009 has finished for PR 20750 at commit
```scala
/**
 * Returns an [[UnsafeProjection]] for given sequence of bound Expressions.
 */
protected def createProjection(exprs: Seq[Expression]): UnsafeProjection
```
seems no place is calling this?
Yeah, that was pretty stupid. It is fixed now :)
Test build #88017 has finished for PR 20750 at commit
# Conflicts: # sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala
```scala
    writer.write(i, null.asInstanceOf[Decimal], precision, scale)
  }
}
case (_, true) if dt.defaultSize == 1 =>
```
it's a little tricky to depend on the default size, can we match ByteType directly here?
Sure
Test build #88064 has finished for PR 20750 at commit
Test build #88062 has finished for PR 20750 at commit
```java
}

@Override
public void setNullByte(int ordinal) {
```
What is the reason these methods have been introduced?
These methods are needed for writing UnsafeArrayData: we fill the slot with 0s if we set it to null. The slot size in UnsafeArrayData depends on the data type we are storing in it.
I wanted to avoid writing a lot of duplicate code in the InterpretedUnsafeProjection, which is why I added these methods to the UnsafeWriter parent class, and this is also why they are in the UnsafeRowWriter.
I see, but I am not sure about having only some of these methods here. I mean, in UnsafeArrayData we also have setNullDouble, setNullFloat, etc. It seems a bit weird to me that some of them are defined at the parent level and some others are not. It's not a big deal, but for consistency I'd prefer to have all of them. What do you think?
We could also name them differently e.g.: setNull1Byte, setNull2Bytes, setNull4Bytes & setNull8Bytes
I am not sure that is a good idea, because then everyone writing code would need to know exactly how many bytes each type is. I prefer the current approach. I would rather either reintroduce the setNullAt method with a match in UnsafeArrayData's implementation or leave it as it is now.
This is pretty low level stuff, so you should know how many bytes things contain at this point. I'd rather leave it as it is. Doing a type match on such a hot code path doesn't seem like a good idea.
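As an illustration of the width-dependent null handling being discussed, here is a toy sketch over a ByteBuffer (not Spark's actual UnsafeArrayWriter): setting null must zero exactly as many bytes as the element type's slot occupies.

```java
import java.nio.ByteBuffer;

// Toy fixed-width slot writer: the slot width in the value region depends on
// the element type, so "set null" must zero exactly that many bytes. This is
// what motivates per-width methods such as setNullByte/Short/Int/Long (or
// the setNull1/2/4/8Bytes naming proposed below).
class SlotWriter {
    private final ByteBuffer buf;
    private final int slotSize; // 1, 2, 4 or 8 bytes, fixed per element type

    SlotWriter(int numSlots, int slotSize) {
        this.buf = ByteBuffer.allocate(numSlots * slotSize);
        this.slotSize = slotSize;
    }

    void setNull(int ordinal) {
        int offset = ordinal * slotSize;
        switch (slotSize) { // a generated caller would pick one branch statically
            case 1: buf.put(offset, (byte) 0); break;
            case 2: buf.putShort(offset, (short) 0); break;
            case 4: buf.putInt(offset, 0); break;
            case 8: buf.putLong(offset, 0L); break;
            default: throw new IllegalArgumentException("bad slot size " + slotSize);
        }
    }

    void setInt(int ordinal, int value) { buf.putInt(ordinal * slotSize, value); }
    int getInt(int ordinal) { return buf.getInt(ordinal * slotSize); }
}
```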
Test build #88084 has finished for PR 20750 at commit
```diff
- public void setOffsetAndSize(int ordinal, long currentCursor, int size) {
+ public void setOffsetAndSize(int ordinal, long currentCursor, long size) {
```
shall we check if size fits in an integer?
I think we can safely change the signature to take two ints.
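For context on why the size must fit in an int: a variable-length field stores offset and size together in a single long word, along the lines of this sketch (the general idea rather than Spark's exact layout):

```java
// Sketch of an offset-and-size word: the high 32 bits hold the field's
// (relative) offset and the low 32 bits its size, which is why both
// values must individually fit in an int.
class OffsetAndSize {
    static long pack(int offset, int size) {
        return (((long) offset) << 32) | (size & 0xFFFFFFFFL);
    }

    static int offset(long word) {
        return (int) (word >>> 32);
    }

    static int size(long word) {
        return (int) word; // truncating cast keeps the low 32 bits
    }
}
```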
```java
 * Base class for writing Unsafe* structures.
 */
public abstract class UnsafeWriter {
  public abstract void setNullByte(int ordinal);
```
This looks pretty weird. At the first glance I'm wondering why we don't have setBoolean/Float/Double, then I realized we don't need to, because we just need a way to set null for 1/2/4/8 bytes.
maybe it's better to name them setNull1/2/4/8Bytes, and ask the UnsafeArrayWriter to follow
See my previous discussion with @mgaido91. I am fine either way; I can also add the missing methods and be done with it, that will just make the interpreted code path a bit messier.
I feel setNull1/2/4/8Bytes is better. It's also easy to codegen, just setNull${dt.defaultSize}Bytes.
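The codegen convenience mentioned here is just string interpolation over the element type's default size; a minimal sketch (the helper name and generated call are illustrative, not Spark's actual codegen):

```java
// Toy sketch of generating a width-specific null-setter call: the generated
// source splices the element type's default size into the method name,
// so no runtime type match is needed.
class SetNullCodegen {
    static String genSetNull(int defaultSize, int ordinal) {
        return "writer.setNull" + defaultSize + "Bytes(" + ordinal + ");";
    }
}
```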
Actually this was my mistake... I thought Platform.setInt(0) is different from Platform.setFloat(0.0f), and that's why I introduced a setNull method for each primitive type.
```scala
override protected def createProjection(exprs: Seq[Expression]): UnsafeProjection = {
  // We need to make sure that we do not reuse stateful non deterministic expressions.
  val cleanedExpressions = exprs.map(_.transform {
    case s: StatefulNondeterministic => s.freshCopy()
```
Why is it not a problem for the codegen version?
In codegen the state is put in the generated class; if you happen to visit the same expression twice, the state is added twice and is not shared during evaluation. In interpreted mode the Expression will be the same object, and the same state will be modified twice during evaluation.
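The hazard can be seen with a toy stateful "expression" (illustrative names, not Spark's): evaluating the same instance in two projection slots couples their state, while a fresh copy per occurrence keeps them independent.

```java
// Toy stateful expression: eval() mutates internal state, so sharing one
// instance across two projection slots leaks state between them.
// freshCopy() returns an uninitialized instance, decoupling the occurrences.
class CounterExpr {
    private long count = 0;

    long eval() {
        return count++;
    }

    CounterExpr freshCopy() {
        return new CounterExpr();
    }
}
```

Sharing one instance yields 0 then 1 across two slots; a fresh copy per slot yields 0 from each, which is what the `freshCopy()` call in `createProjection` above guarantees.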
```scala
 * A stateful nondeterministic expression. These expressions contain state
 * that is not stored in the parameter list.
 */
trait StatefulNondeterministic extends Nondeterministic {
```
In Hive, stateful and deterministic are orthogonal. If we wanna add this new trait, I think it's time to figure out the correct semantics. Shall we have a new trait called Stateful, or add an assumption that stateful functions must be nondeterministic?
From the Hive doc:

> A stateful UDF is considered to be non-deterministic, irrespective of what deterministic() returns.

This is correct.
Maybe we can just call it Stateful while still extending Nondeterministic, and in the doc we say that stateful expressions imply it's nondeterministic.
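The proposed shape can be sketched as an interface hierarchy (Java stand-ins for the Scala traits; names follow the discussion, not necessarily the final code):

```java
// Sketch: Stateful extends Nondeterministic, so the type system itself
// records the assumption that a stateful expression is always
// nondeterministic.
interface Nondeterministic { }

interface Stateful extends Nondeterministic {
    // Return a fresh, uninitialized copy of this stateful expression.
    Stateful freshCopy();
}
```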
Test build #88243 has finished for PR 20750 at commit
cloud-fan left a comment
looks pretty good!
```scala
/**
 * Return a fresh uninitialized copy of the stateful expression.
 */
def freshCopy(): Stateful = this
```
I think it's better to not provide this default implementation, to avoid mistakes in the future.
```scala
 */
private def generateStructWriter(
    bufferHolder: BufferHolder,
    rowWriter: UnsafeRowWriter,
```
We can add UnsafeWriter#getBufferHolder, so that we don't need to pass 2 parameters.
Yeah we could do that. I think we need to refactor the writers a little bit anyway, but I would like to do that in a follow-up.
```scala
if (!v.isNullAt(i)) {
  unsafeWriter(v, i)
} else {
  writer.setNull8Bytes(i)
```
null type will hit this branch, can we add a test to make sure it works?
ah it's consistent with the codegen version. Maybe we should fix it later.
or just leave it, as array of null type doesn't make sense and maybe no one will do this.
There is a test that actually tests this for arrays: https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala#L80
Test build #88270 has finished for PR 20750 at commit
Test build #88312 has finished for PR 20750 at commit
Thanks, merging to master!
@cloud-fan has some issue with his Mac, so I will be merging :)... Thanks for the reviews!
Author: Herman van Hovell <[email protected]>
Closes apache#20750 from hvanhovell/SPARK-23581.
What changes were proposed in this pull request?

We currently can only create unsafe rows using code generation. This is a problem for situations in which code generation fails: there is no fallback, and as a result we cannot execute the query.

This PR adds an interpreted version of `UnsafeProjection`. The implementation is modeled after `InterpretedMutableProjection`: it stores the expression results in a `GenericInternalRow` and then uses a conversion function to convert the `GenericInternalRow` into an `UnsafeRow`.

This PR does not implement the actual code-generated-to-interpreted fallback logic. This will be done in a follow-up.

How was this patch tested?

I am piggybacking on existing `UnsafeProjection` tests, and I have added an interpreted version for each of these.