[SPARK-16213][SQL] Reduce runtime overhead of a program that creates an primitive array in DataFrame #13909

kiszk · 2016-06-26T04:07:29Z

What changes were proposed in this pull request?

This PR reduces runtime overhead of a program the creates an primitive array in DataFrame by using the similar approach to #15044. Generated code performs boxing operation in an assignment from InternalRow to an Object[] temporary array (at Lines 051 and 061 in the generated code before without this PR). If we know that type of array elements is primitive, we apply the following optimizations:

Eliminate a pair of isNullAt() and a null assignment
Allocate an primitive array instead of Object[] (eliminate boxing operations)
Create UnsafeArrayData by using UnsafeArrayWriter to keep a primitive array in a row format instead of doing non-lightweight operations in constructor of GenericArrayData
The PR also performs the same things for CreateMap.

Here are performance results of DataFrame programs by up to 17.9x over without this PR.

Without SPARK-16043
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
Read a primitive array in DataFrame:     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Int                                           3805 / 4150          0.0      507308.9       1.0X
Double                                        3593 / 3852          0.0      479056.9       1.1X

With SPARK-16043
Read a primitive array in DataFrame:     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Int                                            213 /  271          0.0       28387.5       1.0X
Double                                         204 /  223          0.0       27250.9       1.0X

Note : #15780 is enabled for these measurements

An motivating example

val df = sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
df.selectExpr("Array(value + 1.1d, value + 2.2d)").show

Generated code without this PR

/* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private scala.collection.Iterator[] inputs;
/* 008 */   private scala.collection.Iterator inputadapter_input;
/* 009 */   private UnsafeRow serializefromobject_result;
/* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
/* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
/* 012 */   private Object[] project_values;
/* 013 */   private UnsafeRow project_result;
/* 014 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
/* 015 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter;
/* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter project_arrayWriter;
/* 017 */
/* 018 */   public GeneratedIterator(Object[] references) {
/* 019 */     this.references = references;
/* 020 */   }
/* 021 */
/* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 023 */     partitionIndex = index;
/* 024 */     this.inputs = inputs;
/* 025 */     inputadapter_input = inputs[0];
/* 026 */     serializefromobject_result = new UnsafeRow(1);
/* 027 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
/* 028 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
/* 029 */     this.project_values = null;
/* 030 */     project_result = new UnsafeRow(1);
/* 031 */     this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 32);
/* 032 */     this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1);
/* 033 */     this.project_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
/* 034 */
/* 035 */   }
/* 036 */
/* 037 */   protected void processNext() throws java.io.IOException {
/* 038 */     while (inputadapter_input.hasNext()) {
/* 039 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 040 */       double inputadapter_value = inputadapter_row.getDouble(0);
/* 041 */
/* 042 */       final boolean project_isNull = false;
/* 043 */       this.project_values = new Object[2];
/* 044 */       boolean project_isNull1 = false;
/* 045 */
/* 046 */       double project_value1 = -1.0;
/* 047 */       project_value1 = inputadapter_value + 1.1D;
/* 048 */       if (false) {
/* 049 */         project_values[0] = null;
/* 050 */       } else {
/* 051 */         project_values[0] = project_value1;
/* 052 */       }
/* 053 */
/* 054 */       boolean project_isNull4 = false;
/* 055 */
/* 056 */       double project_value4 = -1.0;
/* 057 */       project_value4 = inputadapter_value + 2.2D;
/* 058 */       if (false) {
/* 059 */         project_values[1] = null;
/* 060 */       } else {
/* 061 */         project_values[1] = project_value4;
/* 062 */       }
/* 063 */
/* 064 */       final ArrayData project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
/* 065 */       this.project_values = null;
/* 066 */       project_holder.reset();
/* 067 */
/* 068 */       project_rowWriter.zeroOutNullBytes();
/* 069 */
/* 070 */       if (project_isNull) {
/* 071 */         project_rowWriter.setNullAt(0);
/* 072 */       } else {
/* 073 */         // Remember the current cursor so that we can calculate how many bytes are
/* 074 */         // written later.
/* 075 */         final int project_tmpCursor = project_holder.cursor;
/* 076 */
/* 077 */         if (project_value instanceof UnsafeArrayData) {
/* 078 */           final int project_sizeInBytes = ((UnsafeArrayData) project_value).getSizeInBytes();
/* 079 */           // grow the global buffer before writing data.
/* 080 */           project_holder.grow(project_sizeInBytes);
/* 081 */           ((UnsafeArrayData) project_value).writeToMemory(project_holder.buffer, project_holder.cursor);
/* 082 */           project_holder.cursor += project_sizeInBytes;
/* 083 */
/* 084 */         } else {
/* 085 */           final int project_numElements = project_value.numElements();
/* 086 */           project_arrayWriter.initialize(project_holder, project_numElements, 8);
/* 087 */
/* 088 */           for (int project_index = 0; project_index < project_numElements; project_index++) {
/* 089 */             if (project_value.isNullAt(project_index)) {
/* 090 */               project_arrayWriter.setNullDouble(project_index);
/* 091 */             } else {
/* 092 */               final double project_element = project_value.getDouble(project_index);
/* 093 */               project_arrayWriter.write(project_index, project_element);
/* 094 */             }
/* 095 */           }
/* 096 */         }
/* 097 */
/* 098 */         project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor);
/* 099 */       }
/* 100 */       project_result.setTotalSize(project_holder.totalSize());
/* 101 */       append(project_result);
/* 102 */       if (shouldStop()) return;
/* 103 */     }
/* 104 */   }
/* 105 */ }

Generated code with this PR

/* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private scala.collection.Iterator[] inputs;
/* 008 */   private scala.collection.Iterator inputadapter_input;
/* 009 */   private UnsafeRow serializefromobject_result;
/* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
/* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
/* 012 */   private UnsafeArrayData project_arrayData;
/* 013 */   private UnsafeRow project_result;
/* 014 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
/* 015 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter;
/* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter project_arrayWriter;
/* 017 */
/* 018 */   public GeneratedIterator(Object[] references) {
/* 019 */     this.references = references;
/* 020 */   }
/* 021 */
/* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 023 */     partitionIndex = index;
/* 024 */     this.inputs = inputs;
/* 025 */     inputadapter_input = inputs[0];
/* 026 */     serializefromobject_result = new UnsafeRow(1);
/* 027 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
/* 028 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
/* 029 */
/* 030 */     project_result = new UnsafeRow(1);
/* 031 */     this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 32);
/* 032 */     this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1);
/* 033 */     this.project_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
/* 034 */
/* 035 */   }
/* 036 */
/* 037 */   protected void processNext() throws java.io.IOException {
/* 038 */     while (inputadapter_input.hasNext()) {
/* 039 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 040 */       double inputadapter_value = inputadapter_row.getDouble(0);
/* 041 */
/* 042 */       byte[] project_array = new byte[32];
/* 043 */       project_arrayData = new UnsafeArrayData();
/* 044 */       Platform.putLong(project_array, 16, 2);
/* 045 */       project_arrayData.pointTo(project_array, 16, 32);
/* 046 */
/* 047 */       boolean project_isNull1 = false;
/* 048 */
/* 049 */       double project_value1 = -1.0;
/* 050 */       project_value1 = inputadapter_value + 1.1D;
/* 051 */       if (false) {
/* 052 */         project_arrayData.setNullAt(0);
/* 053 */       } else {
/* 054 */         project_arrayData.setDouble(0, project_value1);
/* 055 */       }
/* 056 */
/* 057 */       boolean project_isNull4 = false;
/* 058 */
/* 059 */       double project_value4 = -1.0;
/* 060 */       project_value4 = inputadapter_value + 2.2D;
/* 061 */       if (false) {
/* 062 */         project_arrayData.setNullAt(1);
/* 063 */       } else {
/* 064 */         project_arrayData.setDouble(1, project_value4);
/* 065 */       }
/* 066 */       project_holder.reset();
/* 067 */
/* 068 */       // Remember the current cursor so that we can calculate how many bytes are
/* 069 */       // written later.
/* 070 */       final int project_tmpCursor = project_holder.cursor;
/* 071 */
/* 072 */       if (project_arrayData instanceof UnsafeArrayData) {
/* 073 */         final int project_sizeInBytes = ((UnsafeArrayData) project_arrayData).getSizeInBytes();
/* 074 */         // grow the global buffer before writing data.
/* 075 */         project_holder.grow(project_sizeInBytes);
/* 076 */         ((UnsafeArrayData) project_arrayData).writeToMemory(project_holder.buffer, project_holder.cursor);
/* 077 */         project_holder.cursor += project_sizeInBytes;
/* 078 */
/* 079 */       } else {
/* 080 */         final int project_numElements = project_arrayData.numElements();
/* 081 */         project_arrayWriter.initialize(project_holder, project_numElements, 8);
/* 082 */
/* 083 */         for (int project_index = 0; project_index < project_numElements; project_index++) {
/* 084 */           if (project_arrayData.isNullAt(project_index)) {
/* 085 */             project_arrayWriter.setNullDouble(project_index);
/* 086 */           } else {
/* 087 */             final double project_element = project_arrayData.getDouble(project_index);
/* 088 */             project_arrayWriter.write(project_index, project_element);
/* 089 */           }
/* 090 */         }
/* 091 */       }
/* 092 */
/* 093 */       project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor);
/* 094 */       project_result.setTotalSize(project_holder.totalSize());
/* 095 */       append(project_result);
/* 096 */       if (shouldStop()) return;
/* 097 */     }
/* 098 */   }
/* 099 */ }

How was this patch tested?

Added unit tests into DataFrameComplexTypeSuite

SparkQA · 2016-06-26T04:39:33Z

Test build #61250 has finished for PR 13909 at commit 37e4ce2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-26T10:32:15Z

Test build #61264 has finished for PR 13909 at commit 681be95.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jaceklaskowski · 2016-06-26T17:56:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala

This can be shorter (simpler?) with the pattern matching on assignment "trick", i.e.

val ArrayType(dt, _) = dataType

thank you for good catch.

viirya · 2016-06-27T03:51:22Z

How much overhead we can reduce with this?

SparkQA · 2016-06-27T04:07:22Z

Test build #61282 has finished for PR 13909 at commit 4238268.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-06-27T06:13:07Z

Test build #61285 has finished for PR 13909 at commit c2da5bd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2016-06-27T07:40:02Z

@viirya Good question. I should have a benchmark. When #13758 is merged and https://issues.apache.org/jira/browse/SPARK-16223 is solved, I will prepare benchmark result.

kiszk · 2016-11-09T08:19:02Z

@cloud-fan @hvanhovell could you please review this instead of #13758?
cc: @viirya

It would be good to close #13758 when this would be merged.

This reverts commit c82fbf3.

SparkQA · 2016-11-09T08:33:22Z

Test build #68396 has finished for PR 13909 at commit 990d6c8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-11-09T08:38:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala

This line should not be here now. values can be Object[] or ${javaDataType}[], depending isPrimitiveArray.

Good catch. Sorry for overlooking this

viirya · 2016-11-09T08:39:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala

nit: Can we move two array classes to each branch?

thank, done

is it done?

viirya · 2016-11-09T08:46:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala

s"new $arrayClass($keyArray)".

viirya · 2016-11-09T08:48:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala

keyDt -> valueDt.

Thank you very much for good catch.

viirya · 2016-11-09T08:54:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala

isPrimitiveArrayKey?

Thank you very much for good catch again.

viirya · 2016-11-09T08:58:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala

I have a thought about this.

As the key of a map can't be null, do we need to check isNull here and only use primitive array for key if all key evaluation are not null?

I think we can simplify it to val isPrimitiveArrayKey = ctx.isPrimitiveType(keyDt).

And we add the generated codes to check if key is null when evalKeys.forall(_.isNull == "false") is false for primitive key array case.

I theoretically agree with you. The key of a map should not be null.
However, in the generate code, we see the following stuff. Thus, I thought that there are some cases that eval.isNull may have null.

if (${eval.isNull}) { throw new RuntimeException("Cannot use null as map key!"); } else { ...

If we can understandeval.isNull is never true, I really like your optimization. Do you have any thought?

Do you mean eval.isNull may be false? I think it can't be null.

It could be. So we can only insert the code to check and throw exception only if evalKeys.forall(_.isNull == "false") is false.

viirya · 2016-11-09T09:03:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala

These two blocks to generate codes for key and value look very similar.

Can we abstract a private method to do this and use parameters to decide generating key or value array codes?

SparkQA · 2016-11-09T09:47:45Z

Test build #68399 has finished for PR 13909 at commit 425522c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-11-09T12:40:39Z

Test build #68403 has finished for PR 13909 at commit 19a4e1a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-11-09T12:58:20Z

Looks like this also hits the 64KB problem...

kiszk · 2016-11-09T13:06:08Z

@viirya What do you mean "the 64KB problem"?
Oh, probably, you mean 64KB bytecode size.

viirya · 2016-11-09T13:09:46Z

The exception on jenkins tests:

[info] java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB

Too many statements to assign array elements. So the method grows beyond 64KB.

kiszk · 2016-11-09T13:12:50Z

Yes, for this test. With #15780, this failure did not occur.
The root cause for other failures is to put a benchmark with test() instead of ignore().

cloud-fan · 2016-12-27T01:01:13Z

Should use dataType.containsNull for allowNull?

no we shouldn't, containsNull is kind of an optimizing hint that guarantees the value will not be null, and it's guaranteed by the source, not here in CreateArray.

ueshin · 2016-12-27T02:14:29Z

@cloud-fan Hmm, I agree that it's guaranteed by the source, but I thought that's why we should use containsNull for allowNull to guarantee the value will not be null when containsNull == false in this array for the caller of this expression. ditto for CreateMap.

cloud-fan · 2016-12-27T02:39:08Z

@ueshin That will be a big mess if we validate the null here and there, I think we should only do it at the source side. If we do not trust the source side, there are a lot of other places that need to add validation logic.

SparkQA · 2016-12-27T03:03:39Z

Test build #70614 has finished for PR 13909 at commit dcce4c5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2016-12-27T03:50:41Z

@cloud-fan Hmm, I didn't mean we don't trust the source side.

Currently we DO validate the null regardless of allowNull value.
The difference by the allowNull is setting null or throwing an exception.
I think which we should do when containsNull == false is throwing an exception.

(Anyway, eval.isNull of most of children will be "false" when child.nullable == false, so Janino will remove the $isNullAssignment clause.

cloud-fan · 2016-12-27T11:28:40Z

I think the validation logic should be consistent across the system, you can't add the null check in CreateArray just because it's easy to do. Currently we do null check based on nullable property only at source side and we should follow it. And the previous code doesn't have null check either.

BTW, this change has external influence, the error message for null map key has been changed.

cloud-fan · 2016-12-27T11:30:09Z

Well I don't think we should waste time arguing this minor problem and block this PR, how about we revert it and send a new PR to add it?

ueshin · 2016-12-27T11:50:33Z

I see, I agree that we shouldn't block this pr.

cloud-fan · 2016-12-27T12:23:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala

+   * @param ctx a [[CodegenContext]]
+   * @param elementType data type of underlying array elements
+   * @param elementsCode a set of [[ExprCode]] for each element of an underlying array
+   * @param allowNull if to assign null value to an array element is allowed


to be more accurate, let's call it isMapKey

cloud-fan · 2016-12-27T14:22:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala

+   * @param ctx a [[CodegenContext]]
+   * @param elementType data type of underlying array elements
+   * @param elementsCode a set of [[ExprCode]] for each element of an underlying array
+   * @param isMapKey if throw an exception when to assign a null value to an array element


if true, throw an exception when the element is null

SparkQA · 2016-12-27T16:36:58Z

Test build #70636 has finished for PR 13909 at commit 2f67ac2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-27T18:33:57Z

Test build #70640 has finished for PR 13909 at commit c986361.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2016-12-27T23:49:36Z

@cloud-fan I rethought after changing allowNull to isMapKey, and finally I see what you said.

And we should revert the error message changes 4a0409a, right?

cloud-fan · 2016-12-28T06:01:53Z

yea we should! thanks for pointing this out!

SparkQA · 2016-12-28T07:02:41Z

Test build #70657 has started for PR 13909 at commit cfe2e3d.

cloud-fan · 2016-12-28T11:49:47Z

retest this please

SparkQA · 2016-12-28T14:28:46Z

Test build #70668 has finished for PR 13909 at commit cfe2e3d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-12-28T15:27:28Z

LGTM

BTW, @kiszk have you updated the benchmark with newest change, especially after the UnsafeArrayData change?

kiszk · 2016-12-28T17:50:02Z

I updated the benchmark result and generated code.
FYI: Since I update the benchmark result once with UnsafeBufferWrite, the performance is almost same.

cloud-fan · 2016-12-29T03:00:27Z

thanks, merging to master!

…an primitive array in DataFrame ## What changes were proposed in this pull request? This PR reduces runtime overhead of a program the creates an primitive array in DataFrame by using the similar approach to apache#15044. Generated code performs boxing operation in an assignment from InternalRow to an `Object[]` temporary array (at Lines 051 and 061 in the generated code before without this PR). If we know that type of array elements is primitive, we apply the following optimizations: 1. Eliminate a pair of `isNullAt()` and a null assignment 2. Allocate an primitive array instead of `Object[]` (eliminate boxing operations) 3. Create `UnsafeArrayData` by using `UnsafeArrayWriter` to keep a primitive array in a row format instead of doing non-lightweight operations in constructor of `GenericArrayData` The PR also performs the same things for `CreateMap`. Here are performance results of [DataFrame programs](https://github.com/kiszk/spark/blob/6bf54ec5e227689d69f6db991e9ecbc54e153d0a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/PrimitiveArrayBenchmark.scala#L83-L112) by up to 17.9x over without this PR. ``` Without SPARK-16043 OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 4.4.11-200.fc22.x86_64 Intel Xeon E3-12xx v2 (Ivy Bridge) Read a primitive array in DataFrame: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Int 3805 / 4150 0.0 507308.9 1.0X Double 3593 / 3852 0.0 479056.9 1.1X With SPARK-16043 Read a primitive array in DataFrame: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------ Int 213 / 271 0.0 28387.5 1.0X Double 204 / 223 0.0 27250.9 1.0X ``` Note : apache#15780 is enabled for these measurements An motivating example ``` java val df = sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF df.selectExpr("Array(value + 1.1d, value + 2.2d)").show ``` Generated code without this PR ``` java /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private scala.collection.Iterator[] inputs; /* 008 */ private scala.collection.Iterator inputadapter_input; /* 009 */ private UnsafeRow serializefromobject_result; /* 010 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder; /* 011 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter; /* 012 */ private Object[] project_values; /* 013 */ private UnsafeRow project_result; /* 014 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder; /* 015 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter; /* 016 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter project_arrayWriter; /* 017 */ /* 018 */ public GeneratedIterator(Object[] references) { /* 019 */ this.references = references; /* 020 */ } /* 021 */ /* 022 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 023 */ partitionIndex = index; /* 024 */ this.inputs = inputs; /* 025 */ inputadapter_input = inputs[0]; /* 026 */ serializefromobject_result = new UnsafeRow(1); /* 027 */ this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0); /* 028 */ this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1); /* 029 */ this.project_values = null; /* 030 */ project_result = new UnsafeRow(1); /* 031 */ this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 32); /* 032 */ this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1); /* 033 */ this.project_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(); /* 034 */ /* 035 */ } /* 036 */ /* 037 */ protected void processNext() throws java.io.IOException { /* 038 */ while (inputadapter_input.hasNext()) { /* 039 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); /* 040 */ double inputadapter_value = inputadapter_row.getDouble(0); /* 041 */ /* 042 */ final boolean project_isNull = false; /* 043 */ this.project_values = new Object[2]; /* 044 */ boolean project_isNull1 = false; /* 045 */ /* 046 */ double project_value1 = -1.0; /* 047 */ project_value1 = inputadapter_value + 1.1D; /* 048 */ if (false) { /* 049 */ project_values[0] = null; /* 050 */ } else { /* 051 */ project_values[0] = project_value1; /* 052 */ } /* 053 */ /* 054 */ boolean project_isNull4 = false; /* 055 */ /* 056 */ double project_value4 = -1.0; /* 057 */ project_value4 = inputadapter_value + 2.2D; /* 058 */ if (false) { /* 059 */ project_values[1] = null; /* 060 */ } else { /* 061 */ project_values[1] = project_value4; /* 062 */ } /* 063 */ /* 064 */ final ArrayData project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_values); /* 065 */ this.project_values = null; /* 066 */ project_holder.reset(); /* 067 */ /* 068 */ project_rowWriter.zeroOutNullBytes(); /* 069 */ /* 070 */ if (project_isNull) { /* 071 */ project_rowWriter.setNullAt(0); /* 072 */ } else { /* 073 */ // Remember the current cursor so that we can calculate how many bytes are /* 074 */ // written later. /* 075 */ final int project_tmpCursor = project_holder.cursor; /* 076 */ /* 077 */ if (project_value instanceof UnsafeArrayData) { /* 078 */ final int project_sizeInBytes = ((UnsafeArrayData) project_value).getSizeInBytes(); /* 079 */ // grow the global buffer before writing data. /* 080 */ project_holder.grow(project_sizeInBytes); /* 081 */ ((UnsafeArrayData) project_value).writeToMemory(project_holder.buffer, project_holder.cursor); /* 082 */ project_holder.cursor += project_sizeInBytes; /* 083 */ /* 084 */ } else { /* 085 */ final int project_numElements = project_value.numElements(); /* 086 */ project_arrayWriter.initialize(project_holder, project_numElements, 8); /* 087 */ /* 088 */ for (int project_index = 0; project_index < project_numElements; project_index++) { /* 089 */ if (project_value.isNullAt(project_index)) { /* 090 */ project_arrayWriter.setNullDouble(project_index); /* 091 */ } else { /* 092 */ final double project_element = project_value.getDouble(project_index); /* 093 */ project_arrayWriter.write(project_index, project_element); /* 094 */ } /* 095 */ } /* 096 */ } /* 097 */ /* 098 */ project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor); /* 099 */ } /* 100 */ project_result.setTotalSize(project_holder.totalSize()); /* 101 */ append(project_result); /* 102 */ if (shouldStop()) return; /* 103 */ } /* 104 */ } /* 105 */ } ``` Generated code with this PR ``` java /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private scala.collection.Iterator[] inputs; /* 008 */ private scala.collection.Iterator inputadapter_input; /* 009 */ private UnsafeRow serializefromobject_result; /* 010 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder; /* 011 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter; /* 012 */ private UnsafeArrayData project_arrayData; /* 013 */ private UnsafeRow project_result; /* 014 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder; /* 015 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter project_rowWriter; /* 016 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter project_arrayWriter; /* 017 */ /* 018 */ public GeneratedIterator(Object[] references) { /* 019 */ this.references = references; /* 020 */ } /* 021 */ /* 022 */ public void init(int index, scala.collection.Iterator[] inputs) { /* 023 */ partitionIndex = index; /* 024 */ this.inputs = inputs; /* 025 */ inputadapter_input = inputs[0]; /* 026 */ serializefromobject_result = new UnsafeRow(1); /* 027 */ this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0); /* 028 */ this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1); /* 029 */ /* 030 */ project_result = new UnsafeRow(1); /* 031 */ this.project_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result, 32); /* 032 */ this.project_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder, 1); /* 033 */ this.project_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter(); /* 034 */ /* 035 */ } /* 036 */ /* 037 */ protected void processNext() throws java.io.IOException { /* 038 */ while (inputadapter_input.hasNext()) { /* 039 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); /* 040 */ double inputadapter_value = inputadapter_row.getDouble(0); /* 041 */ /* 042 */ byte[] project_array = new byte[32]; /* 043 */ project_arrayData = new UnsafeArrayData(); /* 044 */ Platform.putLong(project_array, 16, 2); /* 045 */ project_arrayData.pointTo(project_array, 16, 32); /* 046 */ /* 047 */ boolean project_isNull1 = false; /* 048 */ /* 049 */ double project_value1 = -1.0; /* 050 */ project_value1 = inputadapter_value + 1.1D; /* 051 */ if (false) { /* 052 */ project_arrayData.setNullAt(0); /* 053 */ } else { /* 054 */ project_arrayData.setDouble(0, project_value1); /* 055 */ } /* 056 */ /* 057 */ boolean project_isNull4 = false; /* 058 */ /* 059 */ double project_value4 = -1.0; /* 060 */ project_value4 = inputadapter_value + 2.2D; /* 061 */ if (false) { /* 062 */ project_arrayData.setNullAt(1); /* 063 */ } else { /* 064 */ project_arrayData.setDouble(1, project_value4); /* 065 */ } /* 066 */ project_holder.reset(); /* 067 */ /* 068 */ // Remember the current cursor so that we can calculate how many bytes are /* 069 */ // written later. /* 070 */ final int project_tmpCursor = project_holder.cursor; /* 071 */ /* 072 */ if (project_arrayData instanceof UnsafeArrayData) { /* 073 */ final int project_sizeInBytes = ((UnsafeArrayData) project_arrayData).getSizeInBytes(); /* 074 */ // grow the global buffer before writing data. /* 075 */ project_holder.grow(project_sizeInBytes); /* 076 */ ((UnsafeArrayData) project_arrayData).writeToMemory(project_holder.buffer, project_holder.cursor); /* 077 */ project_holder.cursor += project_sizeInBytes; /* 078 */ /* 079 */ } else { /* 080 */ final int project_numElements = project_arrayData.numElements(); /* 081 */ project_arrayWriter.initialize(project_holder, project_numElements, 8); /* 082 */ /* 083 */ for (int project_index = 0; project_index < project_numElements; project_index++) { /* 084 */ if (project_arrayData.isNullAt(project_index)) { /* 085 */ project_arrayWriter.setNullDouble(project_index); /* 086 */ } else { /* 087 */ final double project_element = project_arrayData.getDouble(project_index); /* 088 */ project_arrayWriter.write(project_index, project_element); /* 089 */ } /* 090 */ } /* 091 */ } /* 092 */ /* 093 */ project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor); /* 094 */ project_result.setTotalSize(project_holder.totalSize()); /* 095 */ append(project_result); /* 096 */ if (shouldStop()) return; /* 097 */ } /* 098 */ } /* 099 */ } ``` ## How was this patch tested? Added unit tests into `DataFrameComplexTypeSuite` Author: Kazuaki Ishizaki <[email protected]> Author: Liang-Chi Hsieh <[email protected]> Closes apache#13909 from kiszk/SPARK-16213.

kiszk mentioned this pull request Jun 26, 2016

[SPARK-16043][SQL] Prepare GenericArrayData implementation specialized for a primitive array #13758

Closed

jaceklaskowski reviewed Jun 26, 2016
View reviewed changes

kiszk force-pushed the SPARK-16213 branch from 681be95 to 4238268 Compare June 27, 2016 03:39

kiszk added a commit to kiszk/spark that referenced this pull request Nov 8, 2016

add another use cases from apache#13909

c82fbf3

kiszk force-pushed the SPARK-16213 branch from c2da5bd to 990d6c8 Compare November 9, 2016 07:50

kiszk added a commit to kiszk/spark that referenced this pull request Nov 9, 2016

Revert "add another use cases from apache#13909"

7697e5f

This reverts commit c82fbf3.

viirya reviewed Nov 9, 2016

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala Outdated

Copy link

Member

viirya Nov 9, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s"new $arrayClass($keyArray)".

viirya reviewed Nov 9, 2016

View reviewed changes

addressed review comment

dcce4c5

cloud-fan reviewed Dec 27, 2016

View reviewed changes

address a review comment

2f67ac2

cloud-fan reviewed Dec 27, 2016

View reviewed changes

address a review comment

c986361

revert a change of an exception message

cfe2e3d

asfgit closed this in 93f3556 Dec 29, 2016

kiszk mentioned this pull request Jan 4, 2017

[SPARK-16042][SQL] Eliminate nullcheck code at projection for an array type #13757

Closed

[SPARK-16213][SQL] Reduce runtime overhead of a program that creates an primitive array in DataFrame #13909

[SPARK-16213][SQL] Reduce runtime overhead of a program that creates an primitive array in DataFrame #13909

Uh oh!

Conversation

kiszk commented Jun 26, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

SparkQA commented Jun 26, 2016

Uh oh!

SparkQA commented Jun 26, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viirya commented Jun 27, 2016

Uh oh!

SparkQA commented Jun 27, 2016

Uh oh!

SparkQA commented Jun 27, 2016

Uh oh!

kiszk commented Jun 27, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kiszk commented Nov 9, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SparkQA commented Nov 9, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viirya Nov 9, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kiszk Nov 9, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SparkQA commented Nov 9, 2016

Uh oh!

SparkQA commented Nov 9, 2016

Uh oh!

viirya commented Nov 9, 2016

Uh oh!

kiszk commented Nov 9, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

viirya commented Nov 9, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kiszk commented Nov 9, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kiszk commented Jun 26, 2016 •

edited

Loading

kiszk commented Jun 27, 2016 •

edited

Loading

kiszk commented Nov 9, 2016 •

edited

Loading

viirya Nov 9, 2016 •

edited

Loading

kiszk Nov 9, 2016 •

edited

Loading

kiszk commented Nov 9, 2016 •

edited

Loading

viirya commented Nov 9, 2016 •

edited

Loading

kiszk commented Nov 9, 2016 •

edited

Loading

ueshin commented Dec 27, 2016 •

edited

Loading

cloud-fan Dec 27, 2016 •

edited

Loading