[SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs #23797
Conversation
Test build #102388 has finished for PR 23797 at commit

Test build #102390 has finished for PR 23797 at commit

Test build #102392 has finished for PR 23797 at commit
A few comments:

- I think the import path should be `pyspark.sql.avro.functions` to be consistent.
- You should probably fix `dev/sparktestsupport/modules.py` to include the avro artifact in the PySpark tests.
- I am not sure which way is better. The options I can currently think of:
  a) somehow provide a Python file that can be used via `--py-files` (considering Avro is a separate source)
  b) we can add some code within Apache Spark, like the current way. We could throw a proper exception after checking whether the Avro classes are loadable or not (a minimal sketch follows this list).

Let me think a bit more.
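For option (b), here is a hedged sketch of what such a loadability check could look like, assuming the Py4J gateway is reachable through `spark._jvm`; the probed class name (`org.apache.spark.sql.avro.AvroFileFormat`) and the error message are illustrative assumptions, not the final implementation.

```python
from pyspark.sql import SparkSession

def _require_avro(spark):
    """Fail fast with a helpful message when the external Avro module is absent."""
    try:
        # Probe the JVM: Class.forName raises if spark-avro is not on the classpath.
        # The class name is an assumption chosen for illustration.
        spark._jvm.java.lang.Class.forName("org.apache.spark.sql.avro.AvroFileFormat")
    except Exception:
        raise ImportError(
            "Avro is an external data source module; deploy it with, e.g., "
            "--packages org.apache.spark:spark-avro_<scala-version>:<spark-version>")

spark = SparkSession.builder.getOrCreate()
_require_avro(spark)
```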
* Added avro artifact
* Some refactoring
* Formatting fixes
I'm doing further testing...

retest this please

retest this please

Test build #102500 has finished for PR 23797 at commit

Test build #102511 has finished for PR 23797 at commit

Test build #102512 has finished for PR 23797 at commit

retest this please

Test build #102520 has finished for PR 23797 at commit

retest this please

Test build #102530 has finished for PR 23797 at commit

retest this please

Test build #102537 has finished for PR 23797 at commit

retest this please

Test build #102542 has finished for PR 23797 at commit

retest this please

Test build #102550 has finished for PR 23797 at commit

retest this please

Test build #102558 has finished for PR 23797 at commit
I was also thinking about something like that but couldn't really come up with anything that isn't horribly complex from the user's perspective. Feel free to share if anybody has a good idea.
Test build #102644 has finished for PR 23797 at commit
The current approach looks okay to me too. Its advantage is that it is consistent with the existing way. Providing a separate Python file somewhere requires extra setup (like …)
@gaborgsomogyi Thanks for the work!
Okay, I'm gonna take a close look soon. @gaborgsomogyi, please get rid of the WIP tag if you think it's ready for review.
@HyukjinKwon not much to add, so removed the WIP tag.

Test build #103147 has finished for PR 23797 at commit

Looks good to me otherwise.

Test build #103203 has finished for PR 23797 at commit
python/pyspark/sql/avro/functions.py (outdated):

```python
    .master("local[4]")\
    .appName("sql.avro.functions tests")\
    .getOrCreate()
sc = spark.sparkContext
```
Sorry, last nit. Looks like we don't need this either.
```
schema must match the read data, otherwise the behavior is undefined: it may fail or return
arbitrary result.

Note: Avro is built-in but external data source module since Spark 2.4. Please deploy the
```
Could we maybe improve the wording here? "built-in but external" might be a bit confusing. What do you think of something like: it's a supported but optional data source that requires special deployment?
This part is carried over from the original feature, which was introduced in 2.4. I think users have already gotten used to it. If you still think it's worth changing, I suggest modifying the original feature as well.
Test build #103283 has finished for PR 23797 at commit
Merged to master, given the generally positive feedback, and that the approach matches what PySpark already does.
```python
# Search jar in the project dir using the jar name_prefix for both sbt build and maven
# build because the artifact jars are in different directories.
sbt_build = glob.glob(os.path.join(
    project_full_path, "target/scala-*/%s*.jar" % jar_name_prefix))
```
Hi, all.
This causes Python UT failures which block other PRs. Please see #24268.
… for Kinesis assembly

## What changes were proposed in this pull request?

After [SPARK-26856](#23797), `Kinesis` Python UT fails with a `Found multiple JARs` exception due to a wrong pattern.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/104171/console

```
Exception: Found multiple JARs: .../spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar, .../spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar; please remove all but one
```

It's because the pattern was changed in a wrong way.

**Original**
```python
kinesis_asl_assembly_dir, "target/scala-*/%s-*.jar" % name_prefix))
kinesis_asl_assembly_dir, "target/%s_*.jar" % name_prefix))
```

**After SPARK-26856**
```python
project_full_path, "target/scala-*/%s*.jar" % jar_name_prefix))
project_full_path, "target/%s*.jar" % jar_name_prefix))
```

The actual kinesis assembly jar files look like the following.

**SBT Build**
```
-rw-r--r--  1 dongjoon  staff  87459461 Apr  1 19:01 spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar
-rw-r--r--  1 dongjoon  staff       309 Apr  1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar
-rw-r--r--  1 dongjoon  staff       309 Apr  1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar
```

**MAVEN Build**
```
-rw-r--r--  1 dongjoon  staff  8.6K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-sources.jar
-rw-r--r--  1 dongjoon  staff  8.6K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-test-sources.jar
-rw-r--r--  1 dongjoon  staff  8.7K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar
-rw-r--r--  1 dongjoon  staff   21M Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar
```

In addition, after SPARK-26856, the utility function `search_jar` is shared to find `avro` jar files, which are identical for both `sbt` and `mvn`. To sum up, the current jar pattern parameter cannot handle both `kinesis` and `avro` jars. This PR splits the single pattern into two patterns.

## How was this patch tested?

Manual. Please note that this will remove only the `Found multiple JARs` exception. Kinesis tests need more configuration to run locally.

```
$ build/sbt -Pkinesis-asl test:package streaming-kinesis-asl-assembly/assembly
$ export ENABLE_KINESIS_TESTS=1
$ python/run-tests.py --python-executables python2.7 --module pyspark-streaming
```

Closes #24268 from dongjoon-hyun/SPARK-26856.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
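To make the split-pattern fix concrete, here is a hedged sketch of the approach described in that commit message. The separate sbt/maven prefixes mirror the snippets above, but the surrounding scaffolding (imports, `SPARK_HOME` resolution, the suffix filtering) is assumed for illustration rather than taken verbatim from the PR.

```python
import glob
import os

def search_jar(project_relative_path, sbt_jar_name_prefix, mvn_jar_name_prefix):
    # Assumption: the project directory is resolved relative to SPARK_HOME.
    project_full_path = os.path.join(
        os.environ["SPARK_HOME"], project_relative_path)

    # sbt and maven place artifact jars in different directories and name them
    # slightly differently, so each build gets its own glob pattern.
    sbt_build = glob.glob(os.path.join(
        project_full_path, "target/scala-*/%s*.jar" % sbt_jar_name_prefix))
    maven_build = glob.glob(os.path.join(
        project_full_path, "target/%s*.jar" % mvn_jar_name_prefix))

    # Drop source/test artifacts so only the assembly jar can match.
    jar_paths = sbt_build + maven_build
    return [jar for jar in jar_paths if not jar.endswith(
        ("-sources.jar", "-tests.jar", "-test-sources.jar"))]
```

With separate prefixes, the kinesis caller can pass a prefix ending in `-` for the sbt name and one ending in `_` for the maven name, so only one assembly jar matches per build instead of both.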
Sorry for the newbie question: which package should I include so that these functions are available? I tried this: `pyspark --packages org.apache.spark:spark-avro_2.12:2.4.4`
Hi, @javadi82. This is a new feature of 3.0.0. You can see … Please try in …
Avro is a built-in but external data source module since Spark 2.4, but the `from_avro` and `to_avro` APIs are not yet supported in PySpark. In this PR I've made them available from PySpark. Please see the Python API examples I've added.

```
cd docs/
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
```

Manual webpage check.

Closes apache#23797 from gaborgsomogyi/SPARK-26856.
Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 3729efb)
What changes were proposed in this pull request?
Avro is a built-in but external data source module since Spark 2.4, but the `from_avro` and `to_avro` APIs are not yet supported in PySpark. In this PR I've made them available from PySpark.
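For readers who want a feel for the new functions, here is a hedged usage sketch; the DataFrame, column names, and Avro schema are illustrative, and it assumes Spark was started with the external Avro module on the classpath (e.g. `--packages org.apache.spark:spark-avro_<scala-version>:<spark-version>`).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct
from pyspark.sql.avro.functions import from_avro, to_avro

spark = SparkSession.builder.appName("avro-functions-sketch").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Encode all columns as a single Avro-serialized binary column.
avro_df = df.select(to_avro(struct("id", "name")).alias("value"))

# Decoding needs the Avro schema in JSON format; the unions with "null"
# reflect that the columns above are nullable. This schema is illustrative.
json_schema = """{
  "type": "record", "name": "rec",
  "fields": [
    {"name": "id",   "type": ["long", "null"]},
    {"name": "name", "type": ["string", "null"]}
  ]
}"""

decoded = avro_df.select(from_avro("value", json_schema).alias("rec"))
decoded.select("rec.*").show()
```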
How was this patch tested?
Please see the Python API examples I've added.

```
cd docs/
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
```

Manual webpage check.