[SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader #14045

viirya · 2016-07-04T09:44:42Z

What changes were proposed in this pull request?

The vectorized Parquet reader does not currently support complex types such as ArrayType, MapType and StructType. We should support them so that more queries benefit from the performance improvement introduced by the vectorized Parquet reader. This patch adds support for ArrayType and StructType first.

Main changes

Obtain repetition and definition levels while converting the Parquet schema

We convert the Parquet schema to Catalyst DataTypes in ParquetSchemaConverter. Because we need the repetition and definition levels to reconstruct complex types from Parquet data, this PR obtains those levels for complex types and attaches them to the Catalyst ArrayType, MapType and StructType. Accordingly, this PR adds a metadata attribute to these Catalyst DataTypes. This PR tries to avoid modifying these Catalyst DataTypes, but a column vector in the vectorized reader corresponds to a Catalyst DataType rather than to the Parquet schema, so we cannot access the repetition and definition levels for a column unless we attach them to its DataType.
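
To illustrate how the levels can be derived from a schema, here is a minimal sketch. It is not the code in this PR and does not use the parquet-mr classes: Field, Levels and LevelCalculator are hypothetical names made up for the example. It only assumes the standard Parquet rule that every repeated field on a path adds one to the maximum repetition level, and every non-required (optional or repeated) field adds one to the maximum definition level.

```scala
// Hypothetical, simplified stand-ins for Parquet schema nodes (not parquet-mr types).
sealed trait Repetition
case object Required extends Repetition
case object Optional extends Repetition
case object Repeated extends Repetition

final case class Field(name: String, repetition: Repetition, children: List[Field] = Nil)

// Maximum repetition and definition level of one schema path.
final case class Levels(maxRepetition: Int, maxDefinition: Int)

object LevelCalculator {
  // Walks the schema and returns the levels for every path, including group nodes,
  // so they could later be attached to the matching complex type.
  def levels(
      field: Field,
      parent: Levels = Levels(0, 0),
      prefix: String = ""): Map[String, Levels] = {
    // A repeated field raises the repetition level by one.
    val rep = parent.maxRepetition + (if (field.repetition == Repeated) 1 else 0)
    // Any field that may be absent (optional or repeated) raises the definition level by one.
    val defn = parent.maxDefinition + (if (field.repetition == Required) 0 else 1)
    val path = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    val here = Levels(rep, defn)
    field.children.foldLeft(Map(path -> here)) { (acc, child) =>
      acc ++ levels(child, here, path)
    }
  }
}
```

For a standard three-level list, Field("xs", Optional, List(Field("list", Repeated, List(Field("element", Optional))))), the sketch gives Levels(0, 1) for xs, Levels(1, 2) for xs.list and Levels(1, 3) for xs.list.element.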

Attach VectorizedColumnReader to ColumnVector

In a flat schema each ColumnVector is a data column, so previously the relation between VectorizedColumnReader and ColumnVector was one-to-one. Now only a ColumnVector that represents a data column has a corresponding VectorizedColumnReader. When it is time to read a batch, a ColumnVector with a complex type delegates to its child ColumnVectors.
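
The delegation can be pictured with the following sketch. It is only a simplified model under assumed names (Column, DataColumn, ComplexColumn, LeafReader and CountingReader are all hypothetical) and does not reflect the actual ColumnVector or VectorizedColumnReader classes in Spark.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical reader for one leaf (data) column.
trait LeafReader {
  def readBatch(total: Int, into: DataColumn): Unit
}

// Hypothetical column hierarchy: only DataColumn owns a reader.
sealed trait Column {
  def readBatch(total: Int): Unit
}

final class DataColumn(val name: String, reader: LeafReader) extends Column {
  val values: ArrayBuffer[Int] = ArrayBuffer.empty[Int]
  // A data column reads through its own reader.
  def readBatch(total: Int): Unit = reader.readBatch(total, this)
}

final class ComplexColumn(val name: String, val children: List[Column]) extends Column {
  // A struct or array column has no reader of its own and forwards to its children.
  def readBatch(total: Int): Unit = children.foreach(_.readBatch(total))
}

// Toy reader that produces consecutive integers, standing in for real Parquet decoding.
final class CountingReader(start: Int) extends LeafReader {
  private var next = start
  def readBatch(total: Int, into: DataColumn): Unit = {
    for (_ <- 0 until total) {
      into.values += next
      next += 1
    }
  }
}
```

Calling readBatch(3) on a ComplexColumn with two DataColumn children fills each child with three values, while the complex column itself never touches a reader.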

Implement constructing complex records in VectorizedColumnReader

The readBatch method in VectorizedColumnReader is the main method that reads data into a ColumnVector. Previously it simply loaded the required number of values according to the data type of the column vector. Now, after the data is loaded into the column, we need to construct complex records in its parent column, which could be an ArrayType, MapType or StructType. The information needed to restore the data as complex types is encoded in Parquet's repetition and definition levels. The new method constructComplexRecords in VectorizedColumnReader implements this logic. Basically, constructComplexRecords counts the consecutive values and adds an array to the parent column whenever the repetition level indicates that a new record starts. constructComplexRecords also needs to handle null values. A null value could mean a null record at the root level, an empty array or struct. This method distinguishes these cases and sets the parent column accordingly.
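
As a rough illustration of the idea (not the constructComplexRecords code itself), the sketch below rebuilds an optional array of required ints from its level streams. ArrayAssembler and assemble are hypothetical names, and the level meanings in the comments are assumptions that hold only for this one schema shape. Nested arrays, structs, and level runs split across pages are not handled.

```scala
object ArrayAssembler {
  // Assumed schema: optional group xs (LIST) { repeated group list { required int32 element } }
  //   definition level 0 = xs is null, 1 = xs is an empty array, 2 = an element is present
  //   repetition level 0 = first entry of a new record, 1 = another element of the same array
  def assemble(
      repLevels: Seq[Int],
      defLevels: Seq[Int],
      values: Seq[Int]): Vector[Option[Vector[Int]]] = {
    require(repLevels.length == defLevels.length, "one level pair per entry")
    val records = Vector.newBuilder[Option[Vector[Int]]]
    var current: Option[Vector[Int]] = None
    var started = false
    val valueIter = values.iterator
    for ((rep, dfn) <- repLevels.zip(defLevels)) {
      if (rep == 0) {
        // A new record begins: emit the previous one, then open the next.
        if (started) records += current
        started = true
        current = dfn match {
          case 0 => None                          // null array
          case 1 => Some(Vector.empty)            // empty array
          case _ => Some(Vector(valueIter.next())) // first element
        }
      } else {
        // Same record: append one more element (valid input has an open array here).
        current = current.map(_ :+ valueIter.next())
      }
    }
    if (started) records += current
    records.result()
  }
}
```

For the rows [1, 2, 3], null, [] and [4], the repetition levels are 0, 1, 1, 0, 0, 0, the definition levels are 2, 2, 2, 0, 1, 2, and the value stream is 1, 2, 3, 4. Passing these to assemble returns Vector(Some(Vector(1, 2, 3)), None, Some(Vector()), Some(Vector(4))).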

Benchmark

val N = 10000
withParquetTable((0 until N).map { i =>
  ((i to i + 1000).toList, (i to i + 100).map(_.toString).toList,
    (i to i + 1000).map(_.toDouble / 2).toList,
    ((0 to 10).map(_.toString).toList, (0 to 10).map(_.toString).toList))
}, "t") {
  val benchmark = new Benchmark("Vectorization Parquet for nested types", N)
  benchmark.addCase("Vectorization Parquet reader", 10) { iter =>
    sql("SELECT _1[10], _2[20], _3[30], _4._1[5], _4._2[5] FROM t").collect()
  }
  benchmark.run()
}

Disabled vectorization:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Vectorization Parquet for nested types:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Vectorization Parquet reader                  1706 / 2207          0.0      170580.8       1.0X

Enabled vectorization:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Vectorization Parquet for nested types:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Vectorization Parquet reader                   789 /  972          0.0       78919.4       1.0X

How was this patch tested?

Existing unit tests.

…d-column9 Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java

viirya · 2016-07-04T09:46:57Z

Submitted to see jenkins test results. Benchmark will be run later.

viirya · 2016-07-04T10:55:58Z

retest this please.

SparkQA · 2016-07-04T11:48:54Z

Test build #61720 has finished for PR 14045 at commit d5e5a60.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-04T11:53:05Z

Test build #61718 has finished for PR 14045 at commit d5e5a60.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-05T03:32:14Z

Test build #61740 has finished for PR 14045 at commit 5c4c1c8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-05T04:26:13Z

Test build #61743 has finished for PR 14045 at commit 114a69b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-06T05:37:21Z

Test build #61813 has finished for PR 14045 at commit 4dca939.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-06T05:49:52Z

Test build #61814 has finished for PR 14045 at commit ded41b2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-06T09:14:29Z

Test build #61828 has finished for PR 14045 at commit bf61a75.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-07-06T09:16:08Z

ok. 3 failed tests remaining...

SparkQA · 2016-07-07T09:00:49Z

Test build #61902 has finished for PR 14045 at commit a8f121b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-07T09:18:42Z

Test build #61900 has finished for PR 14045 at commit d719480.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-07-11T17:19:42Z

Test build #62094 has finished for PR 14045 at commit 9a8b062.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-07-12T01:00:40Z

ping @liancheng @yhuai @rxin I think this is ready now. Can you review this?

SparkQA · 2016-07-12T03:00:56Z

Test build #62130 has finished for PR 14045 at commit 42f53de.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-13T15:04:25Z

Test build #62244 has finished for PR 14045 at commit 17f3b82.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-15T04:26:25Z

Test build #62363 has finished for PR 14045 at commit 1788d4c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-15T08:49:26Z

Test build #62370 has finished for PR 14045 at commit 3b8c3ce.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-07-20T04:41:34Z

Test build #62573 has finished for PR 14045 at commit 545a57a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-07-20T05:12:24Z

ping @liancheng @yhuai @rxin Can you review this? I think that we should support complex types in vectorization to extend the coverage of performance improvement. Thanks!

SparkQA · 2016-07-20T05:53:26Z

Test build #62576 has finished for PR 14045 at commit cc35cab.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2016-07-20T08:01:36Z

@viirya Thanks for your work! This would be very useful. I'll help review this one soon after finishing my 2.0 tasks at hand!

viirya · 2016-07-20T08:20:24Z

@liancheng Thank you!

viirya · 2016-07-27T08:43:20Z

I am going to refactor this a lot.

viirya · 2016-07-28T07:18:22Z

Closed this PR in favor of the refactored one: #14388.

viirya added 3 commits July 4, 2016 17:09

Support ArrayType and StructType in vectorization parquet reader.

38be47e

Remove commented code.

d5e5a60

viirya changed the title ~~[SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader~~ [SPARK-16362][SQL][WIP] Support ArrayType and StructType in vectorization Parquet reader Jul 4, 2016

Fix test.

5c4c1c8

Fix test.

114a69b

viirya added 2 commits July 6, 2016 12:52

For array type 2, don't take repeated type in computing repetition le…

4dca939

…vel.

supportBatch should check unsupported MapType recursively.

ded41b2

Definition level of array type 2 should include repeated type.

bf61a75

viirya added 2 commits July 7, 2016 15:25

Fix test.

d719480

Merge remote-tracking branch 'upstream/master' into parquet-vectorize…

a8f121b

…d-column9 Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala

viirya added 3 commits July 8, 2016 11:51

Support getBoolean.

e3f74bd

Consider more cases when the value is null.

5d5e933

Merge remote-tracking branch 'upstream/master' into parquet-vectorize…

9a8b062

…d-column9 Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala

Add more comments.

42f53de

viirya changed the title ~~[SPARK-16362][SQL][WIP] Support ArrayType and StructType in vectorization Parquet reader~~ [SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader Jul 12, 2016

viirya added 3 commits July 13, 2016 16:32

Fix null capacity issue.

60f2d7c

Use primitive array instead of HashMap.

1b37fe8

Remove unused method.

17f3b82

viirya changed the title ~~[SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader~~ [SPARK-16362][SQL][WIP] Support ArrayType and StructType in vectorization Parquet reader Jul 14, 2016

Repetition level encoding will be split across pages. So constructing…

1788d4c

… complex columns should take care of it.

viirya changed the title ~~[SPARK-16362][SQL][WIP] Support ArrayType and StructType in vectorization Parquet reader~~ [SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader Jul 15, 2016

Fix a bug.

3b8c3ce

viirya changed the title ~~[SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader~~ [SPARK-16362][SQL][WIP] Support ArrayType and StructType in vectorization Parquet reader Jul 15, 2016

viirya added 2 commits July 20, 2016 10:24

Merge remote-tracking branch 'upstream/master' into parquet-vectorize…

545a57a

…d-column9 Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala

Improve the algorithm.

cc35cab

viirya changed the title ~~[SPARK-16362][SQL][WIP] Support ArrayType and StructType in vectorization Parquet reader~~ [SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader Jul 20, 2016

viirya mentioned this pull request Jul 20, 2016

[SPARK-16632][SQL] Use Spark requested schema to guide vectorized Parquet reader initialization #14278

Closed

viirya closed this Jul 28, 2016

viirya deleted the parquet-vectorized-column9 branch December 27, 2023 18:33
