-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-6775] [SPARK-6776] [SQL] [WIP] Refactors converters for converting Parquet records to row objects #5422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dates are represented as integers inside Catalyst. Since SpecificMutableRow is only used inside Catalyst, it should be safe to specialize DateType here.
|
Test build #29858 has finished for PR 5422 at commit
|
|
Test build #29859 has finished for PR 5422 at commit
|
02d78c4 to
2529b76
Compare
|
Test build #29868 has finished for PR 5422 at commit
|
|
Test build #29869 has finished for PR 5422 at commit
|
|
Test build #31021 has finished for PR 5422 at commit
|
|
Closing this as it's superceded by #6617 and upcoming follow-up PRs. |
This PR simplifies the
CatalystConverterclass hierarchy, and implements backwards-compatibility rules for them. The new class hierarchy basically resemblesAvroIndexedRecordConverterin parquet-avro.This is still in WIP status. I'm opening this PR to check whether it passes Jenkins.
TODO:
Major changes:
Replaces
CatalystConvertertrait with a much simplerParentContainerUpdater.Now instead of extending the original
CatalystConvertertrait, every converter class accepts an updater whose's responsible to set the converted value to some parent container. For example, appending array elements to a parent array buffer, appending a key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row.This simplifies the design since converters don't need to care about details of their parent converters anymore.
Replaces
CatalystRootConverter,CatalystGroupConverterandCatalystPrimitiveRowConverterwithCatalystRowConverterSpecifically, now all row objects are represented with
SpecificMutableRowduring conversion.Removes
CatalystArrayContainsNullConverterandCatalystNativeArrayConverter, revampedCatalystArrayConverterCatalystNativeArrayConvertershould be designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal.The new
CatalystArrayConverterhandles both nullable and non-nullable array elements.Implemented backwards-compatibility rules in
CatalystArrayConverterWhen Parquet records are being converted, schema of Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see backwards-compatibility rules for MAP), we only need to deal with LIST (namely array) here.