[SPARK-16362][SQL] Support ArrayType and StructType in vectorized Parquet reader #14388
Conversation
|
Test build #62957 has finished for PR 14388 at commit
|
|
@viirya |
|
@maver1ck Thanks for reporting this! I will take a look. Can you show me the schema you tested and what the data looks like? Thanks. |
…t-complex-type Conflicts: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
|
Hi @maver1ck Can you try the latest changes on your production workflow? Thank you! |
|
@viirya |
|
Test build #63669 has finished for PR 14388 at commit
|
|
Test build #63677 has finished for PR 14388 at commit
|
|
retest this please. |
|
Test build #63688 has finished for PR 14388 at commit
|
|
@maver1ck Any results from the test? Thank you. |
|
ping @maver1ck |
|
@viirya If I do a simple |
|
@mallman Thanks for reporting this. It is helpful. I will investigate it. |
|
@mallman I ran a simple test but can't reproduce the issue. The following benchmark code selects an array column and adds an |
|
@viirya I'll see what I can do. If nothing else, I may be able to share a private data file over S3 if you promise not to share it with anyone else. |
|
@mallman Thanks! I promise not to share it with others. |
|
@viirya I sent an email with a link to a test file to your public GitHub email address. |
|
@mallman Thanks. I will not share that file. |
|
@viirya Any progress on this? |
|
@mallman Not yet. I have been working on another PR recently. I will come back to this once that is resolved. |
|
This change does not seem easy to maintain. I would like to close this for now and may reopen it later. |
What changes were proposed in this pull request?
The vectorized Parquet reader currently does not support complex types such as ArrayType, MapType and StructType. We should support them so that the performance improvement introduced by the vectorized Parquet reader covers more schemas. This patch adds ArrayType and StructType first.
Main changes
Obtain repetition and definition level information for Parquet schema
In order to support complex types in the vectorized Parquet reader, we need the repetition and definition level information of the Parquet schema, which is used to encode the structure of complex types. This PR introduces a class to capture this encoding: RepetitionDefinitionInfo. This PR also introduces a few classes to capture the Parquet schema structure: ParquetField, ParquetStruct, ParquetArray and ParquetMap. A new method getParquetStruct is added to ParquetSchemaConverter; it creates a ParquetStruct object that captures the structure and metadata. The ParquetStruct has the same schema structure as the required schema used to guide Parquet reading. It is used to provide the corresponding repetition and definition levels for the fields in the required schema.
Attach VectorizedColumnReader to ColumnVector
Because in a flat schema each ColumnVector is actually a data column, the relation between VectorizedColumnReader and ColumnVector was previously one-to-one. Now only a ColumnVector representing a data column has a corresponding VectorizedColumnReader. When it is time to read a batch, a ColumnVector with a complex type delegates to its child ColumnVector.
Implement constructing complex records in VectorizedColumnReader
The readBatch in VectorizedColumnReader is the main method to read data into ColumnVector. Previously it simply loaded the required number of values according to the data type of the column vector. Now, after the data is loaded into the column, we need to construct complex records in its parent column, which could be an ArrayType, MapType or StructType. The information needed to restore the data as complex types is encoded in the repetition and definition levels in Parquet. The new method constructComplexRecords in VectorizedColumnReader implements the logic to restore the complex data. Basically, constructComplexRecords counts the consecutive values and adds an array to the parent column when the repetition level indicates that a new record starts. In addition, constructComplexRecords needs to consider null values. A null value could mean a null record at the root level, an empty array or struct. The method considers these different cases and sets the values accordingly.
Benchmark
Disabled vectorization:
Enabled vectorization:
How was this patch tested?
Jenkins tests.
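To illustrate the record-construction step described under "Main changes", here is a minimal Python sketch of how an array column can be rebuilt from repetition and definition levels. It is not the code from this PR: the function name assemble_arrays, the level constants, and the assumed schema (a single optional array of optional int32 in the standard three-level Parquet list encoding) are all hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence

# Assumed schema (standard three-level Parquet list encoding):
#   optional group a (LIST) { repeated group list { optional int32 element; } }
# Under this assumption the maximum repetition level is 1 and the maximum
# definition level is 3.
DEF_NULL_ARRAY = 0    # the array itself is null
DEF_EMPTY_ARRAY = 1   # the array is defined but has no elements
DEF_NULL_ELEMENT = 2  # an element slot exists but the element is null
DEF_VALUE = 3         # an element value is present


def assemble_arrays(
    rep_levels: Sequence[int],
    def_levels: Sequence[int],
    values: Sequence[int],
) -> List[Optional[List[Optional[int]]]]:
    """Rebuild array records from levels; `values` holds only non-null elements."""
    if len(rep_levels) != len(def_levels):
        raise ValueError("repetition and definition levels must align")

    records: List[Optional[List[Optional[int]]]] = []
    value_iter = iter(values)

    for rep, dl in zip(rep_levels, def_levels):
        if rep == 0:
            # Repetition level 0 marks the start of a new top-level record.
            if dl == DEF_NULL_ARRAY:
                records.append(None)
                continue
            records.append([])
            if dl == DEF_EMPTY_ARRAY:
                continue
        elif not records or records[-1] is None:
            raise ValueError("continuation entry without an open array")

        # The entry contributes exactly one element to the current array.
        if dl == DEF_VALUE:
            records[-1].append(next(value_iter))
        elif dl == DEF_NULL_ELEMENT:
            records[-1].append(None)
        else:
            raise ValueError(f"unexpected definition level {dl}")

    return records


if __name__ == "__main__":
    # Encodes the four records [1, 2], null, [], [null, 3].
    rep = [0, 1, 0, 0, 0, 1]
    dl = [3, 3, 0, 1, 2, 3]
    print(assemble_arrays(rep, dl, [1, 2, 3]))
    # prints [[1, 2], None, [], [None, 3]]
```

The sketch builds Python lists directly, whereas a column-oriented reader would more likely record offsets and lengths in the parent column. It is meant only to show how a null record, an empty array and a null element are told apart by their definition levels.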