
Conversation

Member

@HyukjinKwon HyukjinKwon commented Oct 6, 2018

What changes were proposed in this pull request?

We are facing some problems with type conversions between Python data and SQL types in UDFs (Pandas UDFs as well).
It's even difficult to identify the problems (see #20163 and #22610).

This PR documents the type conversion table internally. Some of the conversions look buggy, and we should fix them.

```python
import sys
import array
import datetime
from decimal import Decimal

from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql.functions import udf

if sys.version >= '3':
    long = int

data = [
    None,
    True,
    1,
    long(1),
    "a",
    u"a",
    datetime.date(1970, 1, 1),
    datetime.datetime(1970, 1, 1, 0, 0),
    1.0,
    array.array("i", [1]),
    [1],
    (1,),
    bytearray([65, 66, 67]),
    Decimal(1),
    {"a": 1},
    Row(kwargs=1),
    Row("namedtuple")(1),
]

types = [
    BooleanType(),
    ByteType(),
    ShortType(),
    IntegerType(),
    LongType(),
    StringType(),
    DateType(),
    TimestampType(),
    FloatType(),
    DoubleType(),
    ArrayType(IntegerType()),
    BinaryType(),
    DecimalType(10, 0),
    MapType(StringType(), IntegerType()),
    StructType([StructField("_1", IntegerType())]),
]


df = spark.range(1)
results = []
count = 0
total = len(types) * len(data)
spark.sparkContext.setLogLevel("FATAL")
for t in types:
    result = []
    for v in data:
        try:
            row = df.select(udf(lambda: v, t)()).first()
            ret_str = repr(row[0])
        except Exception:
            ret_str = "X"
        result.append(ret_str)
        progress = "SQL Type: [%s]\n  Python Value: [%s(%s)]\n  Result Python Value: [%s]" % (
            t.simpleString(), str(v), type(v).__name__, ret_str)
        count += 1
        print("%s/%s:\n  %s" % (count, total, progress))
    results.append([t.simpleString()] + list(map(str, result)))

schema = ["SQL Type \\ Python Value(Type)"] + list(map(lambda v: "%s(%s)" % (str(v), type(v).__name__), data))
strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False)
print("\n".join(map(lambda line: "    # %s  # noqa" % line, strings.strip().split("\n"))))

This table was generated under Python 2, but the code above is Python 3 compatible as well.
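As one example of how to read the table, a minimal sketch (assuming a PySpark shell where `spark` is defined): a Python value that does not match the declared return type is silently converted to null rather than raising.

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# A str returned from a UDF declared as bigint comes back as null
# (the a(str) column of the bigint row in the table), with no error.
spark.range(1).select(udf(lambda: "a", LongType())()).first()
# e.g. Row(<lambda>()=None)
```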

How was this patch tested?

Manually tested, plus a lint check.

@HyukjinKwon
Member Author

cc @cloud-fan, @viirya and @BryanCutler, WDYT?


SparkQA commented Oct 6, 2018

Test build #97046 has finished for PR 22655 at commit 3084be1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Oct 6, 2018

Thanks for pinging me. I'll look into this tonight or tomorrow.

# +-----------------------------+--------------+----------+------+-------+------+----------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+--------------+----------+--------------+-------------+-------------+ # noqa
# |SQL Type \ Python Value(Type)|None(NoneType)|True(bool)|1(int)|1(long)|a(str)|a(unicode)| 1970-01-01(date)|1970-01-01 00:00:00(datetime)|1.0(float)|array('i', [1])(array)|[1](list)| (1,)(tuple)|ABC(bytearray)|1(Decimal)|{'a': 1}(dict)|Row(a=1)(Row)|Row(a=1)(Row)| # noqa
# +-----------------------------+--------------+----------+------+-------+------+----------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+--------------+----------+--------------+-------------+-------------+ # noqa
# | null| None| None| None| None| None| None| None| None| None| None| None| None| None| None| None| X| X| # noqa
Member

Seems we have to document what X means in this table?
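For context, a minimal sketch of where X comes from (assuming the harness's `df` above): the value raised an exception during conversion, which the script's `except` clause records as "X". For example, per the bigint row, a Row value fails:

```python
from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

try:
    # A Row returned from a UDF declared as bigint raises; the harness
    # catches any exception and records "X" in the table.
    df.select(udf(lambda: Row(kwargs=1), LongType())()).first()
except Exception:
    print("X")
```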

# | smallint| None| None| 1| 1| None| None| None| None| None| None| None| None| None| None| None| X| X| # noqa
# | int| None| None| 1| 1| None| None| None| None| None| None| None| None| None| None| None| X| X| # noqa
# | bigint| None| None| 1| 1| None| None| None| None| None| None| None| None| None| None| None| X| X| # noqa
# | string| None| true| 1| 1| a| a|java.util.Gregori...| java.util.Gregori...| 1.0| [I@7f1970e1| [1]|[Ljava.lang.Objec...| [B@284838a9| 1| {a=1}| X| X| # noqa
Member

Does true mean the string 'true'? Shall we add quotes for strings?
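(A quick check, a minimal sketch assuming a PySpark shell: the cell is indeed the Python string 'true'.)

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# True under a string return type comes back as the string 'true',
# per the True(bool) column of the string row.
spark.range(1).select(udf(lambda: True, StringType())()).first()
# e.g. Row(<lambda>()=u'true')
```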

Member

Is [B@284838a9 meaningful in this table?

Member Author

Hmmmmm ... I see, the type is not clear here. Let me think about this a bit more.

[B@284838a9 is quite buggy behaviour - we should fix it. So I was thinking of documenting it internally, since we already spent much time figuring out how it works for each case individually (at #20163).
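A minimal repro sketch of that case (assuming a PySpark shell where `spark` is defined):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# The bytearray is not decoded to a string; the JVM byte[] toString
# leaks through, giving an identity hash like '[B@284838a9' instead of 'ABC'.
spark.range(1).select(udf(lambda: bytearray([65, 66, 67]), StringType())()).first()
```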

# Please see SPARK-25666's PR to see the codes in order to generate the table below.
#
# +-----------------------------+--------------+----------+------+-------+------+----------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+--------------+----------+--------------+-------------+-------------+ # noqa
# |SQL Type \ Python Value(Type)|None(NoneType)|True(bool)|1(int)|1(long)|a(str)|a(unicode)| 1970-01-01(date)|1970-01-01 00:00:00(datetime)|1.0(float)|array('i', [1])(array)|[1](list)| (1,)(tuple)|ABC(bytearray)|1(Decimal)|{'a': 1}(dict)|Row(a=1)(Row)|Row(a=1)(Row)| # noqa
Member

Any difference between the last two Row(a=1)(Row) columns?

Member Author

Right, one was Row(a=1) and the other one was the namedtuple approach, Row("a")(1). Let me try to update it.
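For reference, the two construction styles behind those columns:

```python
from pyspark.sql import Row

Row(a=1)     # keyword-argument style
Row("a")(1)  # namedtuple style: Row("a") builds a row class, and
             # calling it with 1 also yields Row(a=1)
```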

@HyukjinKwon HyukjinKwon changed the title [SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs [WIP][SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs Oct 7, 2018
@cloud-fan
Contributor

It's useful to have this table, thanks!

Shall we discuss the expected behavior here or in another JIRA ticket?

@HyukjinKwon
Member Author

Let me make this table for Pandas UDFs too and then open another JIRA (or a mailing list thread) to discuss this further. I need more investigation to propose the desired behaviour targeting 3.0.
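For whoever picks this up, a hypothetical, untested sketch of how the harness's inner loop could be adapted for scalar Pandas UDFs (a scalar Pandas UDF consumes and returns a pandas.Series, so the constant has to be broadcast to the batch length):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Hypothetical replacement for the udf(...) line in the harness:
# broadcast the constant v across the input batch so the returned
# Series matches the input length.
row = df.select(pandas_udf(lambda s: pd.Series([v] * len(s)), t)(df.id)).first()
```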

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs [SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs Oct 8, 2018

SparkQA commented Oct 8, 2018

Test build #97097 has finished for PR 22655 at commit 3aa0103.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 8, 2018

Test build #97098 has finished for PR 22655 at commit 6ee69a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

I am getting this in. This is an ongoing effort, and for now it just documents the conversions internally.

@asfgit asfgit closed this in a853a80 Oct 8, 2018
@HyukjinKwon
Member Author

Merged to master.

@HyukjinKwon
Member Author

@viirya and @BryanCutler, do you guys have some time to take on the Pandas one? I don't think I will have time within the next couple of weeks. If you do, I would appreciate it if you could go ahead; otherwise, I will start on it after a couple of weeks.


viirya commented Oct 12, 2018

@HyukjinKwon I can take some time to do something similar for Pandas UDFs.

@HyukjinKwon HyukjinKwon deleted the SPARK-25666 branch October 16, 2018 12:43
@BryanCutler
Member

Thanks @viirya !

@HyukjinKwon
Member Author

Hey @viirya, I happened to find some time to work on it - I submitted PR #22795.


viirya commented Oct 22, 2018

@HyukjinKwon Cool! Thanks!

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
[SPARK-25666][PYTHON] Internally document type conversion between Python data and SQL types in normal UDFs

Closes apache#22655 from HyukjinKwon/SPARK-25666.

Authored-by: hyukjinkwon <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
