[SPARK-16362][SQL] Support ArrayType and StructType in vectorization Parquet reader #14045
…d-column9 Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
|
Submitted to see jenkins test results. Benchmark will be run later. |
|
retest this please. |
|
Test build #61720 has finished for PR 14045 at commit
|
|
Test build #61718 has finished for PR 14045 at commit
|
|
Test build #61740 has finished for PR 14045 at commit
|
|
Test build #61743 has finished for PR 14045 at commit
|
|
Test build #61813 has finished for PR 14045 at commit
|
|
Test build #61814 has finished for PR 14045 at commit
|
|
Test build #61828 has finished for PR 14045 at commit
|
|
ok. 3 failed tests remaining... |
…d-column9 Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
|
Test build #61902 has finished for PR 14045 at commit
|
|
Test build #61900 has finished for PR 14045 at commit
|
…d-column9 Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
|
Test build #62094 has finished for PR 14045 at commit
|
|
ping @liancheng @yhuai @rxin I think this is ready now. Can you review this? |
|
Test build #62130 has finished for PR 14045 at commit
|
|
Test build #62244 has finished for PR 14045 at commit
|
… complex columns should take care of it.
|
Test build #62363 has finished for PR 14045 at commit
|
|
Test build #62370 has finished for PR 14045 at commit
|
…d-column9 Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonGenerator.scala
|
Test build #62573 has finished for PR 14045 at commit
|
|
ping @liancheng @yhuai @rxin Can you review this? I think that we should support complex types in vectorization to extend the coverage of performance improvement. Thanks! |
|
Test build #62576 has finished for PR 14045 at commit
|
|
@viirya Thanks for your work! This would be very useful. I'll help review this one soon after finishing my 2.0 tasks at hand! |
|
@liancheng Thank you! |
|
I am going to refactor this a lot. |
|
Closed this PR in favor of the refactored one: #14388. |
What changes were proposed in this pull request?
The vectorized Parquet reader currently does not support complex types such as ArrayType, MapType and StructType. We should support them to extend the coverage of the performance improvement introduced by the vectorized Parquet reader. This patch adds support for ArrayType and StructType first.
Main changes
Obtain repetition and definition levels when converting the Parquet schema
We convert the Parquet schema to Catalyst DataTypes in ParquetSchemaConverter. Because we need repetition and definition level information to reconstruct complex types from Parquet data, this PR obtains the repetition and definition levels for complex types and attaches them to the Catalyst ArrayType, MapType and StructType. Accordingly, this PR adds a metadata attribute to these Catalyst DataTypes. This PR tries to avoid modifying these Catalyst DataTypes; however, because a column vector in the vectorized reader corresponds to a Catalyst DataType rather than to the Parquet schema, we can't access the repetition and definition levels for each column unless we attach them to the corresponding DataType.
Attach VectorizedColumnReader to ColumnVector
Because in a flat schema each ColumnVector is actually a data column, the relation between VectorizedColumnReader and ColumnVector was previously one-to-one. Now only a ColumnVector that represents a data column has a corresponding VectorizedColumnReader. When it is time to read a batch, a ColumnVector with a complex type delegates to its child ColumnVector.
Implement constructing complex records in VectorizedColumnReader
readBatch in VectorizedColumnReader is the main method for reading data into a ColumnVector. Previously it simply loaded the required number of values according to the data type of the column vector. Now, after the data is loaded into the column, we need to construct complex records in its parent column, which can be an ArrayType, MapType or StructType. The information needed to restore the data as complex types is encoded in Parquet's repetition and definition levels. The new method constructComplexRecords in VectorizedColumnReader implements this restoration logic. Basically, constructComplexRecords counts consecutive values and adds an array to the parent column whenever the repetition level indicates that a new record starts. constructComplexRecords also needs to handle null values. A null value can mean a null record at the root level, an empty array or struct. The method distinguishes these cases and sets the parent column accordingly.
Benchmark
Disabled vectorization:
Enabled vectorization:
How was this patch tested?
Existing unit tests.
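To illustrate the record reconstruction described under "Implement constructing complex records", here is a minimal Java sketch. It is not the code of constructComplexRecords in this PR. It assumes a single Parquet column of the form `optional group a (LIST) { repeated group list { required int32 element; } }`, so the maximum repetition level is 1 and the maximum definition level is 2. The class name ListAssemblySketch, the method assemble and the DEF_* constants are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: rebuild nullable int arrays from Parquet-style levels.
class ListAssemblySketch {
    // Definition levels for the assumed schema (optional list, required element).
    static final int DEF_NULL_LIST = 0;   // the list itself is null
    static final int DEF_EMPTY_LIST = 1;  // the list is defined but has no elements
    static final int DEF_ELEMENT = 2;     // an element value is present

    static List<List<Integer>> assemble(int[] repLevels, int[] defLevels, int[] values) {
        List<List<Integer>> records = new ArrayList<>();
        List<Integer> current = null;
        int valueIdx = 0; // values holds only the entries whose level is DEF_ELEMENT

        for (int i = 0; i < repLevels.length; i++) {
            if (repLevels[i] == 0) {
                // Repetition level 0 marks the start of a new top-level record.
                current = (defLevels[i] == DEF_NULL_LIST) ? null : new ArrayList<>();
                records.add(current);
            }
            if (defLevels[i] == DEF_ELEMENT) {
                // Repetition level 1 (or the first element) extends the current list.
                current.add(values[valueIdx++]);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        int[] rep = {0, 1, 0, 0, 0, 1, 1};
        int[] def = {2, 2, 0, 1, 2, 2, 2};
        int[] vals = {1, 2, 3, 4, 5};
        // Prints [[1, 2], null, [], [3, 4, 5]]
        System.out.println(assemble(rep, def, vals));
    }
}
```

The sketch shows why the levels are needed alongside the values: a null array and an empty array both contribute no value, and only the definition level tells them apart. Nested arrays, structs and nullable elements would need higher maximum levels and more cases than shown here.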