-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput #37864
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… PythonArrowOutput
|
cc. @HyukjinKwon @ueshin Please take a look, thanks! |
| * Python (Arrow) to JVM (output type being deserialized from ColumnarBatch). | ||
| */ | ||
| private[python] trait PythonArrowOutput { self: BasePythonRunner[_, ColumnarBatch] => | ||
| private[python] trait PythonArrowOutput[OUT <: AnyRef] { self: BasePythonRunner[_, OUT] => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qq: should it be <: AnyRef?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We assign null to the OUT type (although that's a trick) hence need to be AnyRef at least if I understand correctly.
| */ | ||
| private[python] trait PythonArrowInput { self: BasePythonRunner[Iterator[InternalRow], _] => | ||
| private[python] trait PythonArrowInput[IN] { self: BasePythonRunner[IN, _] => | ||
| protected val sqlConf = SQLConf.get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we don't need this (in #37863, it's not used too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK didn't notice this. Will remove this. Thanks for the pointer.
| root: VectorSchemaRoot, | ||
| writer: ArrowStreamWriter, | ||
| dataOut: DataOutputStream, | ||
| inputIterator: Iterator[Iterator[InternalRow]]): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: indent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice finding!
65a7576 to
beecc9e
Compare
|
I'll merge this PR once build is green. |
|
Thanks for reviewing! Merging to master. |
What changes were proposed in this pull request?
This PR proposes to change PythonArrowInput and PythonArrowOutput to be more generic to cover the complex data type on both input and output. This is a baseline work for #37863.
Why are the changes needed?
The traits PythonArrowInput and PythonArrowOutput can be further generalized to cover complex data type on both input and output. E.g. Not all operators would have simple InternalRow as input data to pass to Python worker and vice versa for output data.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests.