
Conversation

Member

@HyukjinKwon HyukjinKwon commented Oct 6, 2018

What changes were proposed in this pull request?

We are facing some problems with type conversions between Python data and SQL types in UDFs (Pandas UDFs as well).
It's even difficult to identify the problems (see #20163 and #22610).

This PR documents the type conversion table internally. Some of the conversions look buggy, and we should fix them.

```python
import sys
import array
import datetime
from decimal import Decimal

from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import udf

if sys.version >= '3':
    long = int

data = [
    None,
    True,
    1,
    long(1),
    "a",
    u"a",
    datetime.date(1970, 1, 1),
    datetime.datetime(1970, 1, 1, 0, 0),
    1.0,
    array.array("i", [1]),
    [1],
    (1,),
    bytearray([65, 66, 67]),
    Decimal(1),
    {"a": 1},
    Row(kwargs=1),
    Row("namedtuple")(1),
]

types = [
    BooleanType(),
    ByteType(),
    ShortType(),
    IntegerType(),
    LongType(),
    StringType(),
    DateType(),
    TimestampType(),
    FloatType(),
    DoubleType(),
    ArrayType(IntegerType()),
    BinaryType(),
    DecimalType(10, 0),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
]


df = spark.range(1)
results = []
count = 0
total = len(types) * len(data)
spark.sparkContext.setLogLevel("FATAL")
for t in types:
    result = []
    for v in data:
        try:
            row = df.select(udf(lambda: v, t)()).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n  Python Value: [%s(%s)]\n  Result Python Value: [%s]" % (
            t.simpleString(), str(v), type(v).__name__, ret_str)
        count += 1
        print("%s/%s:\n  %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))

schema = ["SQL Type \\ Python Value(Type)"] + list(map(lambda v: "%s(%s)" % (str(v), type(v).__name__), data))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False)
print("\n".join(map(lambda line: "    # %s  # noqa" % line, strings.strip().split("\n"))))

This table was generated under Python 2, but the code above is Python 3 compatible as well.
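As one example of how to read the table, a minimal sketch (assuming a PySpark shell where `spark` is defined): a Python value that does not match the declared return type is silently converted to null rather than raising.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# A str returned from a UDF declared as bigint comes back as null
# (the a(str) column of the bigint row in the table), with no error.
spark.range(1).select(udf(lambda: "a", LongType())()).first()
# e.g. Row(<lambda>()=None)
```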

How was this patch tested?

Manually tested, plus a lint check.

@HyukjinKwon
Member Author

cc @cloud-fan, @viirya and @BryanCutler, WDYT?


SparkQA commented Oct 6, 2018

Test build #97046 has finished for PR 22655 at commit 3084be1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Oct 6, 2018

Thanks for pinging me. I'll look into this tonight or tomorrow.

# +-----------------------------+--------------+----------+------+-------+------+----------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+--------------+----------+--------------+-------------+-------------+ # noqa
# |SQL Type \ Python Value(Type)|None(NoneType)|True(bool)|1(int)|1(long)|a(str)|a(unicode)| 1970-01-01(date)|1970-01-01 00:00:00(datetime)|1.0(float)|array('i', [1])(array)|[1](list)| (1,)(tuple)|ABC(bytearray)|1(Decimal)|{'a': 1}(dict)|Row(a=1)(Row)|Row(a=1)(Row)| # noqa
# +-----------------------------+--------------+----------+------+-------+------+----------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+--------------+----------+--------------+-------------+-------------+ # noqa
# | null| None| None| None| None| None| None| None| None| None| None| None| None| None| None| None| X| X| # noqa
Member

Seems we have to document what X means in this table?
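For context, a minimal sketch of where X comes from (assuming the harness's `df` above): the value raised an exception during conversion, which the script's `except` clause records as "X". For example, per the bigint row, a Row value fails:

```python
from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

try:
    # A Row returned from a UDF declared as bigint raises; the harness
    # catches any exception and records "X" in the table.
    df.select(udf(lambda: Row(kwargs=1), LongType())()).first()
except Exception:
    print("X")
```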

# | smallint| None| None| 1| 1| None| None| None| None| None| None| None| None| None| None| None| X| X| # noqa
# | int| None| None| 1| 1| None| None| None| None| None| None| None| None| None| None| None| X| X| # noqa
# | bigint| None| None| 1| 1| None| None| None| None| None| None| None| None| None| None| None| X| X| # noqa
# | string| None| true| 1| 1| a| a|java.util.Gregori...| java.util.Gregori...| 1.0| [I@7f1970e1| [1]|[Ljava.lang.Objec...| [B@284838a9| 1| {a=1}| X| X| # noqa
Member

Does true mean the string 'true'? Shall we add quotes for strings?
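(A quick check, a minimal sketch assuming a PySpark shell: the cell is indeed the Python string 'true'.)

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# True under a string return type comes back as the string 'true',
# per the True(bool) column of the string row.
spark.range(1).select(udf(lambda: True, StringType())()).first()
# e.g. Row(<lambda>()=u'true')
```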

Member

Is [B@284838a9 meaningful in this table?

Member Author

Hmmmmm ... I see, the type is not clear here. Let me think about this a bit more.

[B@284838a9 is quite buggy behaviour - we should fix it. So I was thinking of documenting it internally, since we already spent much time figuring out how it works for each case individually (at #20163).
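A minimal repro sketch of that case (assuming a PySpark shell where `spark` is defined):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# The bytearray is not decoded to a string; the JVM byte[] toString
# leaks through, giving an identity hash like '[B@284838a9' instead of 'ABC'.
spark.range(1).select(udf(lambda: bytearray([65, 66, 67]), StringType())()).first()
```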

# Please see SPARK-25666's PR to see the codes in order to generate the table below.
#
# +-----------------------------+--------------+----------+------+-------+------+----------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+--------------+----------+--------------+-------------+-------------+ # noqa
# |SQL Type \ Python Value(Type)|None(NoneType)|True(bool)|1(int)|1(long)|a(str)|a(unicode)| 1970-01-01(date)|1970-01-01 00:00:00(datetime)|1.0(float)|array('i', [1])(array)|[1](list)| (1,)(tuple)|ABC(bytearray)|1(Decimal)|{'a': 1}(dict)|Row(a=1)(Row)|Row(a=1)(Row)| # noqa
Member

Any difference between the last two Row(a=1)(Row) columns?

Member Author

Right, one was Row(a=1) and the other one was the namedtuple approach, Row("a")(1). Let me try to update it.
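For reference, the two construction styles behind those columns:

```python
from pyspark.sql import Row

Row(a=1)     # keyword-argument style
Row("a")(1)  # namedtuple style: Row("a") builds a row class, and
             # calling it with 1 also yields Row(a=1)
```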

@HyukjinKwon HyukjinKwon changed the title [SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs [WIP][SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs Oct 7, 2018
@cloud-fan
Contributor

It's useful to have this table, thanks!

Shall we discuss the expected behavior here or in another JIRA ticket?

@HyukjinKwon
Member Author

Let me make this table for Pandas UDFs too and then open another JIRA (or a mailing list thread) to discuss this further. I need more investigation to propose the desired behaviour targeting 3.0.
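For whoever picks this up, a hypothetical, untested sketch of how the harness's inner loop could be adapted for scalar Pandas UDFs (a scalar Pandas UDF consumes and returns a pandas.Series, so the constant has to be broadcast to the batch length):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Hypothetical replacement for the udf(...) line in the harness:
# broadcast the constant v across the input batch so the returned
# Series matches the input length.
row = df.select(pandas_udf(lambda s: pd.Series([v] * len(s)), t)(df.id)).first()
```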

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs [SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs Oct 8, 2018

SparkQA commented Oct 8, 2018

Test build #97097 has finished for PR 22655 at commit 3aa0103.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 8, 2018

Test build #97098 has finished for PR 22655 at commit 6ee69a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

I am getting this in. This is an ongoing effort, and for now it just documents the conversions internally.

@asfgit asfgit closed this in a853a80 Oct 8, 2018
@HyukjinKwon
Member Author

Merged to master.

@HyukjinKwon
Member Author

@viirya and @BryanCutler, do you guys have some time to take on the Pandas one? I don't think I will have time within the next couple of weeks. If you do, I would appreciate it if you could go ahead; otherwise, I will start on it after a couple of weeks.


viirya commented Oct 12, 2018

@HyukjinKwon I can take some time to do something similar for Pandas UDFs.

@HyukjinKwon HyukjinKwon deleted the SPARK-25666 branch October 16, 2018 12:43
@BryanCutler
Member

Thanks @viirya !

@HyukjinKwon
Member Author

Hey @viirya, I happened to find some time to work on it - I submitted PR #22795.


viirya commented Oct 22, 2018

@HyukjinKwon Cool! Thanks!

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
[SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs

Closes apache#22655 from HyukjinKwon/SPARK-25666.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
