
@viirya
Member

@viirya viirya commented Jul 4, 2016

What changes were proposed in this pull request?

The vectorized Parquet reader does not currently support complex types such as ArrayType, MapType and StructType. Supporting them extends the performance improvement of the vectorized reader to more queries. This patch adds support for ArrayType and StructType first.

Main changes

  • Obtain repetition and definition levels while converting the Parquet schema

    ParquetSchemaConverter converts the Parquet schema to Catalyst DataTypes. Because the repetition and definition levels are needed to reconstruct complex values from Parquet data, this PR obtains these levels for complex types during the conversion and attaches them to the Catalyst ArrayType, MapType and StructType as a new metadata attribute. This PR tries to avoid modifying the Catalyst DataTypes, but a column vector in the vectorized reader corresponds to a Catalyst DataType rather than to the Parquet schema, so the levels for each column are not accessible unless they are attached to the corresponding DataType.

  • Attach VectorizedColumnReader to ColumnVector

    In a flat schema each ColumnVector is a data column, so previously VectorizedColumnReader and ColumnVector were in one-to-one correspondence. Now only a ColumnVector that represents a data column has a corresponding VectorizedColumnReader. When a batch is read, a ColumnVector of complex type delegates to its child ColumnVectors.

  • Implement constructing complex records in VectorizedColumnReader

    readBatch in VectorizedColumnReader is the main method for reading data into a ColumnVector. Previously it simply loaded the required number of values according to the data type of the column vector. Now, after the data is loaded into the column, complex records must also be constructed in its parent column, which can be an ArrayType, MapType or StructType. The information needed to restore the complex values is encoded in Parquet's repetition and definition levels. The new method constructComplexRecords in VectorizedColumnReader implements this logic: it counts consecutive values and adds an array to the parent column whenever the repetition level indicates the start of a new record. It also has to handle null values, which can mean a null record at the root level, an empty array or struct. The method distinguishes these cases and sets each one correctly.
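The role of the levels in the first and third items can be illustrated with a minimal sketch. It is not the code in this patch: the class LevelsSketch, its methods, and the example column (an optional list of required ints) are assumptions made for illustration. It relies on the standard Parquet rule that the maximum definition level of a column counts the optional and repeated fields on its path, and the maximum repetition level counts the repeated fields. The sketch also assumes well-formed level arrays.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the ParquetSchemaConverter or VectorizedColumnReader code.
final class LevelsSketch {
    enum Repetition { REQUIRED, OPTIONAL, REPEATED }

    // Max definition level: number of non-required fields on the path to the leaf.
    static int maxDefinitionLevel(List<Repetition> path) {
        int level = 0;
        for (Repetition r : path) {
            if (r != Repetition.REQUIRED) level++;
        }
        return level;
    }

    // Max repetition level: number of repeated fields on the path to the leaf.
    static int maxRepetitionLevel(List<Repetition> path) {
        int level = 0;
        for (Repetition r : path) {
            if (r == Repetition.REPEATED) level++;
        }
        return level;
    }

    // Rebuilds the rows of an "optional list of required ints" column, i.e. a path of
    // OPTIONAL (list) -> REPEATED (entries) -> REQUIRED (element), so maxDef is 2.
    // `values` holds only the elements that are actually present.
    static List<List<Integer>> assemble(int[] rep, int[] def, int[] values, int maxDef) {
        List<List<Integer>> rows = new ArrayList<List<Integer>>();
        List<Integer> current = null;
        int next = 0; // index of the next unread entry in `values`
        for (int i = 0; i < rep.length; i++) {
            if (rep[i] == 0) {
                // Repetition level 0 starts a new record; definition level 0 means a null list.
                current = (def[i] == 0) ? null : new ArrayList<Integer>();
                rows.add(current);
            }
            if (def[i] == maxDef) {
                // Fully defined: an element exists. A lower level leaves the list empty.
                current.add(values[next++]);
            }
        }
        return rows;
    }
}
```

For the path OPTIONAL, REPEATED, REQUIRED the two helpers return 2 and 1. With rep = {0, 1, 0, 0, 0}, def = {2, 2, 0, 1, 2} and values = {1, 2, 3}, assemble yields [[1, 2], null, [], [3]]: a two-element array, a null record, an empty array, and a one-element array.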

Benchmark

val N = 10000
// Each row: (array of ints, array of strings, array of doubles, struct of two string arrays).
withParquetTable((0 until N).map { i =>
  ((i to i + 1000).toList, (i to i + 100).map(_.toString).toList,
    (i to i + 1000).map(_.toDouble / 2).toList,
    ((0 to 10).map(_.toString).toList, (0 to 10).map(_.toString).toList))
}, "t") {
  val benchmark = new Benchmark("Vectorization Parquet for nested types", N)
  benchmark.addCase("Vectorization Parquet reader", 10) { iter =>
    // Read single elements from the arrays and from the arrays nested in the struct.
    sql("SELECT _1[10], _2[20], _3[30], _4._1[5], _4._2[5] FROM t").collect()
  }
  benchmark.run()
}

Disabled vectorization:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Vectorization Parquet for nested types:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Vectorization Parquet reader                  1706 / 2207          0.0      170580.8       1.0X

Enabled vectorization:

Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Vectorization Parquet for nested types:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Vectorization Parquet reader                   789 /  972          0.0       78919.4       1.0X
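
Both tables show 1.0X in the Relative column because each run contains a single case. The comparison between the two runs can be worked out directly from the reported times. The helper below is a hypothetical illustration and not part of Spark's Benchmark utility.

```java
// Hypothetical helper for comparing two separate benchmark runs.
final class SpeedupSketch {
    // Ratio of baseline time to candidate time; values above 1 mean the candidate is faster.
    static double speedup(double baselineMs, double candidateMs) {
        return baselineMs / candidateMs;
    }

    public static void main(String[] args) {
        // Best and average times (ms) taken from the two result tables.
        System.out.printf("best: %.2fx%n", speedup(1706, 789));
        System.out.printf("avg:  %.2fx%n", speedup(2207, 972));
    }
}
```

With the reported numbers this gives roughly 2.16x on best time and 2.27x on average time for the vectorized reader on this query.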

How was this patch tested?

Existing unit tests.

viirya added 3 commits July 4, 2016 17:09
…d-column9

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
	sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
@viirya
Member Author

viirya commented Jul 4, 2016

Submitted to see the Jenkins test results. The benchmark will be run later.

@viirya
Member Author

viirya commented Jul 4, 2016

retest this please.

@SparkQA

SparkQA commented Jul 4, 2016

Test build #61720 has finished for PR 14045 at commit d5e5a60.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 4, 2016

Test build #61718 has finished for PR 14045 at commit d5e5a60.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader [SPARK-16362][SQL][WIP] Support ArrayType and StructType in vectorization Parquet reader Jul 4, 2016
@SparkQA

SparkQA commented Jul 5, 2016

Test build #61740 has finished for PR 14045 at commit 5c4c1c8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 5, 2016

Test build #61743 has finished for PR 14045 at commit 114a69b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 6, 2016

Test build #61813 has finished for PR 14045 at commit 4dca939.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 6, 2016

Test build #61814 has finished for PR 14045 at commit ded41b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 6, 2016

Test build #61828 has finished for PR 14045 at commit bf61a75.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 6, 2016

ok. 3 failed tests remaining...

viirya added 2 commits July 7, 2016 15:25
…d-column9

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
@SparkQA

SparkQA commented Jul 7, 2016

Test build #61902 has finished for PR 14045 at commit a8f121b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 7, 2016

Test build #61900 has finished for PR 14045 at commit d719480.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

viirya added 3 commits July 8, 2016 11:51
…d-column9

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
@SparkQA

SparkQA commented Jul 11, 2016

Test build #62094 has finished for PR 14045 at commit 9a8b062.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-16362][SQL][WIP] Support ArrayType and StructType in vectorization Parquet reader [SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader Jul 12, 2016
@viirya
Member Author

viirya commented Jul 12, 2016

ping @liancheng @yhuai @rxin I think this is ready now. Can you review this?

@SparkQA

SparkQA commented Jul 12, 2016

Test build #62130 has finished for PR 14045 at commit 42f53de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 13, 2016

Test build #62244 has finished for PR 14045 at commit 17f3b82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader [SPARK-16362][SQL][WIP] Support ArrayType and StructType in vectorization Parquet reader Jul 14, 2016
@viirya viirya changed the title [SPARK-16362][SQL][WIP] Support ArrayType and StructType in vectorization Parquet reader [SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader Jul 15, 2016
@SparkQA

SparkQA commented Jul 15, 2016

Test build #62363 has finished for PR 14045 at commit 1788d4c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 15, 2016

Test build #62370 has finished for PR 14045 at commit 3b8c3ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader [SPARK-16362][SQL][WIP] Support ArrayType and StructType in vectorization Parquet reader Jul 15, 2016
viirya added 2 commits July 20, 2016 10:24
…d-column9

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala
@viirya viirya changed the title [SPARK-16362][SQL][WIP] Support ArrayType and StructType in vectorization Parquet reader [SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader Jul 20, 2016
@SparkQA

SparkQA commented Jul 20, 2016

Test build #62573 has finished for PR 14045 at commit 545a57a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Jul 20, 2016

ping @liancheng @yhuai @rxin Can you review this? I think we should support complex types in the vectorized reader so that more queries benefit from the performance improvement. Thanks!

@SparkQA

SparkQA commented Jul 20, 2016

Test build #62576 has finished for PR 14045 at commit cc35cab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

@viirya Thanks for your work! This would be very useful. I'll help review this one soon after finishing my 2.0 tasks at hand!

@viirya
Member Author

viirya commented Jul 20, 2016

@liancheng Thank you!

@viirya
Member Author

viirya commented Jul 27, 2016

I am going to refactor this a lot.

@viirya
Member Author

viirya commented Jul 28, 2016

Closed this PR in favor of the refactored one: #14388.

@viirya viirya closed this Jul 28, 2016
@viirya viirya deleted the parquet-vectorized-column9 branch December 27, 2023 18:33