[SPARK-40131][PYTHON] Support NumPy ndarray in built-in functions #37635
Conversation
Can one of the admins verify this patch?
python/pyspark/sql/types.py
Outdated
>>> from py4j.java_collections import ListConverter
>>> ndarr = np.array([1, 2])
>>> ListConverter().can_convert(ndarr)
True
python/pyspark/sql/types.py
Outdated
The Java type of the array is required in order to create a Java array. So tpe_dict is created to map Python types to Java types.
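A minimal sketch of that idea, assuming an active SparkContext gateway; the tpe_dict entries below are illustrative, not the PR's actual mapping:

from pyspark import SparkContext

gateway = SparkContext._gateway
assert gateway is not None

# Illustrative mapping from Python scalar types to Py4J Java primitive classes.
tpe_dict = {
    int: gateway.jvm.long,
    float: gateway.jvm.double,
}

plist = [1, 2, 3]                            # e.g. obj.tolist()
jtpe = tpe_dict[type(plist[0])]
jarr = gateway.new_array(jtpe, len(plist))   # typed Java array of the right length
for i in range(len(plist)):
    jarr[i] = plist[i]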
Force-pushed from b780b46 to 00370b5
May I get a review? Thanks! @HyukjinKwon @ueshin @zhengruifeng
python/pyspark/sql/types.py
Outdated
Shouldn't we map this type from NumPy dtype?
Since plist = obj.tolist(), plist is a list of Python scalars; see https://numpy.org/doc/stable/reference/generated/numpy.ndarray.tolist.html.
So tpe_dict maps Python types to Java types.
That's consistent with NumpyScalarConverter.convert, which calls obj.item(); see https://numpy.org/doc/stable/reference/generated/numpy.ndarray.item.html.
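For example, a quick illustration of the tolist/item behavior described above:

>>> import numpy as np
>>> arr = np.array([1, 2], dtype=np.int64)
>>> plist = arr.tolist()
>>> type(plist[0])  # elements are plain Python scalars; the NumPy dtype is gone
<class 'int'>
>>> arr[0].item()   # what NumpyScalarConverter.convert relies on
1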
Let me know if there is a better approach :)
Hm, unlike obj.item, where we have to pass a Python primitive type (so the type precision cannot be specified on the JVM side), here we can use a more precise element size in the JVM array.
I think it's better to have the correct type for the elements... Ideally we should make obj.item respect the NumPy dtype too.
And I believe we already have the type mapping defined in pandas API on Spark somewhere, IIRC.
Makes sense!
One limitation is that np.dtype("int8") can't be mapped to gateway.jvm.byte: if we create jarr accordingly and then do the per-element assignment, jarr[i] = plist[i] raises TypeError: 'bytes' object does not support item assignment.
So both int8 and int16 are mapped to gateway.jvm.short.
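A rough reproduction of the limitation, assuming an active SparkContext gateway; the failure is as reported in this comment, not re-verified here:

from pyspark import SparkContext

gateway = SparkContext._gateway
assert gateway is not None

# A Java byte[] created through Py4J surfaces as an immutable bytes-like
# object on the Python side, so per-element assignment fails.
jarr = gateway.new_array(gateway.jvm.byte, 3)
jarr[0] = 1   # TypeError: 'bytes' object does not support item assignment

# Widening to short avoids the problem.
jarr = gateway.new_array(gateway.jvm.short, 3)
jarr[0] = 1   # OK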
Force-pushed from b829383 to bc90498
Rebased to resolve conflicts. Only bc90498 is new after the review.
python/pyspark/sql/types.py
Outdated
gateway = SparkContext._gateway
assert gateway is not None
plist = obj.tolist()
tpe_np_to_java = {
nit, what about moving this dict outside of convert, so it can be reused
We cannot import SparkContext at the module level, and we may want to do a nullability check for SparkContext._gateway. So _from_numpy_type_to_java_type is introduced instead for code reuse. Let me know if you have a better idea :)
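A sketch of what such a helper could look like; the name and signature come from this thread, and aside from the int8/int16 -> short mapping discussed above, the dtype entries are assumptions:

from typing import Optional

import numpy as np
from py4j.java_gateway import JavaClass, JavaGateway


def _from_numpy_type_to_java_type(nt: "np.dtype", gateway: JavaGateway) -> Optional[JavaClass]:
    # int8 is widened to short because a Java byte[] is not element-assignable
    # through Py4J (see the discussion above).
    if nt in (np.dtype("int8"), np.dtype("int16")):
        return gateway.jvm.short
    if nt == np.dtype("int32"):
        return gateway.jvm.int
    if nt == np.dtype("int64"):
        return gateway.jvm.long
    if nt == np.dtype("float32"):
        return gateway.jvm.float
    if nt == np.dtype("float64"):
        return gateway.jvm.double
    # Unsupported dtype: let the caller raise a TypeError.
    return None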
HyukjinKwon left a comment
LGTM except @zhengruifeng's comment.
itholic left a comment
Looks good otherwise
python/pyspark/sql/types.py
Outdated
if jtpe is None:
    raise TypeError("The type of array scalar is not supported")
Can we have a test for this?
oops, yeah. let's add one negative test.
Sounds good!
Optimized the TypeError message as well.
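The negative test could look roughly like this; the test class, the unsupported dtype chosen, and where the error surfaces are assumptions for illustration:

import numpy as np

from pyspark.sql import functions as F
from pyspark.testing.sqlutils import ReusedSQLTestCase


class NumpyArrayConverterTests(ReusedSQLTestCase):
    def test_ndarray_with_unsupported_dtype(self):
        # A complex dtype has no Java primitive counterpart (assumed),
        # so converting the ndarray should raise a TypeError.
        with self.assertRaises(TypeError):
            F.lit(np.array([1 + 2j, 3 + 4j]))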
python/pyspark/sql/types.py
Outdated
return None


def _from_numpy_type_to_java_type(nt: "np.dtype", gateway: JavaGateway) -> Optional[JavaClass]:
You can actually add this as a NumpyArrayConverter class attribute.
Did you mean an instance method?
Thank you all! Merged to master.
What changes were proposed in this pull request?
Support NumPy ndarray in built-in functions (pyspark.sql.functions) by introducing a Py4J input converter, NumpyArrayConverter. The converter converts an ndarray to a Java array. The mapping between ndarray dtype and Java primitive type is defined in the converter; as noted in the review above, both int8 and int16 map to short.
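A sketch of the general shape of such a converter and its registration, using Py4J's input-converter protocol (can_convert/convert) and register_input_converter; the details here are illustrative rather than the PR's exact implementation:

import numpy as np
from py4j.protocol import register_input_converter


class NumpyArrayConverter:
    def can_convert(self, obj):
        # Only handle one-dimensional ndarrays in this sketch (assumption).
        return isinstance(obj, np.ndarray) and obj.ndim == 1

    def convert(self, obj, gateway_client):
        from pyspark import SparkContext

        gateway = SparkContext._gateway
        assert gateway is not None
        # _from_numpy_type_to_java_type is the dtype -> Java type helper
        # sketched earlier in the thread.
        jtpe = _from_numpy_type_to_java_type(obj.dtype, gateway)
        if jtpe is None:
            raise TypeError("The type of array scalar is not supported")
        plist = obj.tolist()
        jarr = gateway.new_array(jtpe, len(plist))
        for i in range(len(plist)):
            jarr[i] = plist[i]
        return jarr


# Register ahead of Py4J's default ListConverter, which would otherwise
# also claim ndarrays (see the can_convert doctest earlier in the thread).
register_input_converter(NumpyArrayConverter(), prepend=True)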
Why are the changes needed?
As part of SPARK-39405 for NumPy support in SQL.
Does this PR introduce any user-facing change?
Yes. NumPy ndarray is supported in built-in functions.
Take lit for example.
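A sketch of the usage this enables, assuming an active SparkSession named spark (the resulting column type is not shown here):

>>> import numpy as np
>>> from pyspark.sql.functions import lit
>>> df = spark.range(1).select(lit(np.array([1, 2, 3])))  # ndarray literal becomes an array column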
How was this patch tested?
Unit tests.