Skip to content

Conversation

@HeartSaVioR
Copy link
Contributor

What changes were proposed in this pull request?

This PR proposes to change PythonArrowInput and PythonArrowOutput to be more generic to cover the complex data type on both input and output. This is a baseline work for #37863.

Why are the changes needed?

The traits PythonArrowInput and PythonArrowOutput can be further generalized to cover complex data type on both input and output. E.g. Not all operators would have simple InternalRow as input data to pass to Python worker and vice versa for output data.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

@HeartSaVioR
Copy link
Contributor Author

@HeartSaVioR
Copy link
Contributor Author

cc. @HyukjinKwon @ueshin Please take a look, thanks!

* Python (Arrow) to JVM (output type being deserialized from ColumnarBatch).
*/
private[python] trait PythonArrowOutput { self: BasePythonRunner[_, ColumnarBatch] =>
private[python] trait PythonArrowOutput[OUT <: AnyRef] { self: BasePythonRunner[_, OUT] =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq: should it be <: AnyRef?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We assign null to the OUT type (although that's a trick) hence need to be AnyRef at least if I understand correctly.

*/
private[python] trait PythonArrowInput { self: BasePythonRunner[Iterator[InternalRow], _] =>
private[python] trait PythonArrowInput[IN] { self: BasePythonRunner[IN, _] =>
protected val sqlConf = SQLConf.get
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we don't need this (in #37863, it's not used too)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK didn't notice this. Will remove this. Thanks for the pointer.

@HyukjinKwon HyukjinKwon changed the title [SPARK-40414][SQL][PYSPARK] More generic type on PythonArrowInput and PythonArrowOutput [SPARK-40414][SQL][PYTHON] More generic type on PythonArrowInput and PythonArrowOutput Sep 13, 2022
Comment on lines 115 to 118
root: VectorSchemaRoot,
writer: ArrowStreamWriter,
dataOut: DataOutputStream,
inputIterator: Iterator[Iterator[InternalRow]]): Unit = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice finding!

@HeartSaVioR
Copy link
Contributor Author

I'll merge this PR once build is green.

@HeartSaVioR
Copy link
Contributor Author

Thanks for reviewing! Merging to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants