Conversation

kiszk (Member) commented Jun 26, 2016

What changes were proposed in this pull request?

This PR optimizes the generated projection code for a primitive-type array. Although a primitive-type array requires no null checks and its data occupies a contiguous region, the current generated code performs a null check and a copy for each element (lines 075-082 in the generated code before applying this PR). This PR makes the following changes:

  1. Eliminate null checks for each array element
  2. Perform a bulk data copy using Platform.copyMemory
  3. Eliminate primitive array allocation in GenericArrayData once [SPARK-16043][SQL] Prepare GenericArrayData implementation specialized for a primitive array #13758 is merged
  4. Eliminate setting the sparse index for UnsafeArrayData once [SPARK-15962][SQL] Introduce implementation with a dense format for UnsafeArrayData #13680 is merged

These optimizations are performed in a helper method, UnsafeArrayWriter.writePrimitive<PrimitiveType>Array() (line 075 in the generated code after applying this PR).

For now, items 3 and 4 are not enabled, but the code is ready.
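As a rough illustration of items 1 and 2, the sketch below shows the core idea behind such a helper: once the array is known to contain no nulls, the per-element loop can be replaced by a single Platform.copyMemory call over the contiguous payload. This is only a minimal sketch under those assumptions, not the code added by this PR; PrimitiveArrayCopySketch, buffer, and cursor are placeholder names standing in for the writer's BufferHolder state.

import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.unsafe.Platform

object PrimitiveArrayCopySketch {
  // Bulk-copy a null-free double array into `buffer` at `cursor` and return the new cursor.
  // Assumes the caller has verified there are no null elements and has already grown
  // `buffer` by values.length * 8 bytes (the role BufferHolder.grow plays in generated code).
  def writePrimitiveDoubleArray(input: ArrayData, buffer: Array[Byte], cursor: Int): Int = {
    val values = input.toDoubleArray()      // materialize the backing double[]; no per-element null checks
    val numBytes = values.length * 8L
    Platform.copyMemory(
      values, Platform.DOUBLE_ARRAY_OFFSET,          // source: start of the double[] payload
      buffer, Platform.BYTE_ARRAY_OFFSET + cursor,   // destination: the row buffer at the current cursor
      numBytes)
    cursor + numBytes.toInt
  }
}

Per-type variants for int, long, float, and the other primitive types would follow the same pattern with the corresponding element size and array offset.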

Benchmark program

  def writeArray(iters: Int): Unit = {
    import sparkSession.implicits._

    val n = 1024 * 1024
    val rows = 15

    val benchmark = new Benchmark("Write primitive array", n)

    // DataFrame with a single column holding an Array[Int] per row
    val intDF = sparkSession.sparkContext.parallelize(0 until rows, 1)
      .map(_ => Array.tabulate(n)(i => i)).toDF()
    intDF.count() // force creation of the DataFrame

    benchmark.addCase("Write int array in DataFrame", numIters = iters) { _ =>
      intDF.selectExpr("value as a").collect()
    }

    // DataFrame with a single column holding an Array[Double] per row
    val doubleDF = sparkSession.sparkContext.parallelize(0 until rows, 1)
      .map(_ => Array.tabulate(n)(i => i.toDouble)).toDF()
    doubleDF.count() // force creation of the DataFrame

    benchmark.addCase("Write double array in DataFrame", numIters = iters) { _ =>
      doubleDF.selectExpr("value as a").collect()
    }

    benchmark.run()
  }

  test("Write an array in DataFrame") {
    writeArray(5)
  }

An example program

import sparkSession.implicits._

val df = sparkSession.sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF()
df.selectExpr("Array(value + 1.1d, value + 2.2d)").collect()

Generated code before applying this PR

/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */     while (inputadapter_input.hasNext()) {
/* 030 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 031 */       double inputadapter_value = inputadapter_row.getDouble(0);
/* 032 */
/* 033 */       final boolean project_isNull = false;
/* 034 */       this.project_values = new Object[2];
/* 035 */       double project_value1 = -1.0;
/* 036 */       project_value1 = inputadapter_value + 1.1D;
/* 037 */       if (false) {
/* 038 */         project_values[0] = null;
/* 039 */       } else {
/* 040 */         project_values[0] = project_value1;
/* 041 */       }
/* 042 */
/* 043 */       double project_value4 = -1.0;
/* 044 */       project_value4 = inputadapter_value + 2.2D;
/* 045 */       if (false) {
/* 046 */         project_values[1] = null;
/* 047 */       } else {
/* 048 */         project_values[1] = project_value4;
/* 049 */       }
/* 050 */
/* 051 */       final ArrayData project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
/* 052 */       this.project_values = null;
/* 053 */       project_holder.reset();
/* 054 */
/* 055 */       project_rowWriter.zeroOutNullBytes();
/* 056 */
/* 057 */       if (project_isNull) {
/* 058 */         project_rowWriter.setNullAt(0);
/* 059 */       } else {
/* 060 */         // Remember the current cursor so that we can calculate how many bytes are
/* 061 */         // written later.
/* 062 */         final int project_tmpCursor = project_holder.cursor;
/* 063 */
/* 064 */         if (project_value instanceof UnsafeArrayData) {
/* 065 */           final int project_sizeInBytes = ((UnsafeArrayData) project_value).getSizeInBytes();
/* 066 */           // grow the global buffer before writing data.
/* 067 */           project_holder.grow(project_sizeInBytes);
/* 068 */           ((UnsafeArrayData) project_value).writeToMemory(project_holder.buffer, project_holder.cursor);
/* 069 */           project_holder.cursor += project_sizeInBytes;
/* 070 */
/* 071 */         } else {
/* 072 */           final int project_numElements = project_value.numElements();
/* 073 */           project_arrayWriter.initialize(project_holder, project_numElements, 8);
/* 074 */
/* 075 */           for (int project_index = 0; project_index < project_numElements; project_index++) {
/* 076 */             if (project_value.isNullAt(project_index)) {
/* 077 */               project_arrayWriter.setNullAt(project_index);
/* 078 */             } else {
/* 079 */               final double project_element = project_value.getDouble(project_index);
/* 080 */               project_arrayWriter.write(project_index, project_element);
/* 081 */             }
/* 082 */           }
/* 083 */
/* 084 */         }
/* 085 */
/* 086 */         project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor);
/* 087 */         project_rowWriter.alignToWords(project_holder.cursor - project_tmpCursor);
/* 088 */       }
/* 089 */       project_result.setTotalSize(project_holder.totalSize());
/* 090 */       append(project_result);
/* 091 */       if (shouldStop()) return;
/* 092 */     }
/* 093 */   }
/* 094 */ }

Generated code after applying this PR

/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */     while (inputadapter_input.hasNext()) {
/* 030 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 031 */       double inputadapter_value = inputadapter_row.getDouble(0);
/* 032 */
/* 033 */       final boolean project_isNull = false;
/* 034 */       this.project_values = new Object[2];
/* 035 */       double project_value1 = -1.0;
/* 036 */       project_value1 = inputadapter_value + 1.1D;
/* 037 */       if (false) {
/* 038 */         project_values[0] = null;
/* 039 */       } else {
/* 040 */         project_values[0] = project_value1;
/* 041 */       }
/* 042 */
/* 043 */       double project_value4 = -1.0;
/* 044 */       project_value4 = inputadapter_value + 2.2D;
/* 045 */       if (false) {
/* 046 */         project_values[1] = null;
/* 047 */       } else {
/* 048 */         project_values[1] = project_value4;
/* 049 */       }
/* 050 */
/* 051 */       final ArrayData project_value = new org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
/* 052 */       this.project_values = null;
/* 053 */       project_holder.reset();
/* 054 */
/* 055 */       project_rowWriter.zeroOutNullBytes();
/* 056 */
/* 057 */       if (project_isNull) {
/* 058 */         project_rowWriter.setNullAt(0);
/* 059 */       } else {
/* 060 */         // Remember the current cursor so that we can calculate how many bytes are
/* 061 */         // written later.
/* 062 */         final int project_tmpCursor = project_holder.cursor;
/* 063 */
/* 064 */         if (project_value instanceof UnsafeArrayData) {
/* 065 */           final int project_sizeInBytes = ((UnsafeArrayData) project_value).getSizeInBytes();
/* 066 */           // grow the global buffer before writing data.
/* 067 */           project_holder.grow(project_sizeInBytes);
/* 068 */           ((UnsafeArrayData) project_value).writeToMemory(project_holder.buffer, project_holder.cursor);
/* 069 */           project_holder.cursor += project_sizeInBytes;
/* 070 */
/* 071 */         } else {
/* 072 */           final int project_numElements = project_value.numElements();
/* 073 */           project_arrayWriter.initialize(project_holder, project_numElements, 8);
/* 074 */
/* 075 */           project_arrayWriter.writePrimitiveDoubleArray(project_value);
/* 076 */         }
/* 077 */
/* 078 */         project_rowWriter.setOffsetAndSize(0, project_tmpCursor, project_holder.cursor - project_tmpCursor);
/* 079 */         project_rowWriter.alignToWords(project_holder.cursor - project_tmpCursor);
/* 080 */       }
/* 081 */       project_result.setTotalSize(project_holder.totalSize());
/* 082 */       append(project_result);
/* 083 */       if (shouldStop()) return;
/* 084 */     }
/* 085 */   }
/* 086 */ }

How was this patch tested?

Added test suites to DataFrameComplexTypeSuite.
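For reference, the sketch below shows the kind of round-trip check such a suite might perform; it is not the actual test code added by this PR, and it assumes the suite's sparkSession and ScalaTest-style test helper are in scope.

test("projection of a primitive double array") {
  import sparkSession.implicits._

  // Two input rows, each projected to a two-element double array.
  val df = sparkSession.sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF()
  val rows = df.selectExpr("Array(value + 1.1d, value + 2.2d) as a").collect()

  assert(rows.length == 2)
  assert(rows(0).getSeq[Double](0) == Seq(1.1d, 2.2d))
}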


SparkQA commented Jun 26, 2016

Test build #61256 has finished for PR 13911 at commit b1f6289.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 26, 2016

Test build #61262 has finished for PR 13911 at commit 333f7b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 2, 2016

Test build #61651 has finished for PR 13911 at commit 2e8fb0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 14, 2016

Test build #62312 has finished for PR 13911 at commit 4b70df9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 3, 2016

Test build #68066 has finished for PR 13911 at commit 88aad46.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 3, 2016

Test build #68073 has finished for PR 13911 at commit 25ddd8e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Nov 4, 2016

Test build #68140 has finished for PR 13911 at commit 6204298.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

kiszk (Member, Author) commented Nov 8, 2016

This was implemented using a different approach in #15044.

kiszk closed this on Nov 8, 2016