Conversation

@gaborgsomogyi
Contributor

@gaborgsomogyi gaborgsomogyi commented Feb 15, 2019

What changes were proposed in this pull request?

Avro has been a built-in but external data source module since Spark 2.4, but the `from_avro` and `to_avro` APIs are not yet supported in PySpark.

In this PR I've made them available from PySpark.

How was this patch tested?

Please see the Python API examples that I've added.

```
cd docs/
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
```
Manual webpage check.
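For readers following along, the new API can be exercised roughly as below. This is a minimal sketch, not code from the PR: the record schema, column names, and package coordinates are illustrative, and the spark-avro artifact must be on the classpath.

```python
import json

# Hypothetical Avro reader schema for the sketch (not taken from the PR).
USER_SCHEMA = json.dumps({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})


def avro_roundtrip(df):
    # to_avro encodes a struct column to Avro binary; from_avro decodes it
    # back using the JSON reader schema. Requires the spark-avro artifact,
    # e.g.: pyspark --packages org.apache.spark:spark-avro_2.12:3.0.0
    from pyspark.sql.avro.functions import from_avro, to_avro
    from pyspark.sql.functions import struct

    encoded = df.select(to_avro(struct("name", "age")).alias("avro"))
    return encoded.select(from_avro("avro", USER_SCHEMA).alias("user"))
```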

@SparkQA

SparkQA commented Feb 15, 2019

Test build #102388 has finished for PR 23797 at commit 0161e50.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 15, 2019

Test build #102390 has finished for PR 23797 at commit 1733afe.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 15, 2019

Test build #102392 has finished for PR 23797 at commit ea67dca.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi gaborgsomogyi changed the title [SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs [WIP][SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs Feb 15, 2019
Member

@HyukjinKwon HyukjinKwon left a comment

A few comments:

  1. I think the import path should be pyspark.sql.avro.functions to be consistent.

  2. Probably, you should fix dev/sparktestsupport/modules.py to include the avro artifact in PySpark tests.

  3. I am not sure which way is better. What I can currently think of:
    a) somehow provide a Python file that can be used via --py-files (considering arrow is a separate source)
    b) we can add some code within Apache Spark, like the current way. We could throw a proper exception after checking whether some Avro classes are loadable or not.

Let me think a bit more ..
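Option (b) could look roughly like the following sketch. The probed class name and the error message are assumptions for illustration, not the code that was eventually merged:

```python
def require_avro(sc):
    # Probe the JVM (through py4j's sc._jvm gateway) for an Avro class and
    # fail with a clear message if the spark-avro artifact is not on the
    # classpath. The class name below is illustrative.
    try:
        sc._jvm.java.lang.Class.forName(
            "org.apache.spark.sql.avro.AvroDataToCatalyst")
    except Exception:
        raise ImportError(
            "Spark avro libraries not found in class path. Try adding the "
            "spark-avro package, e.g. via --packages.")
```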

* Added avro artifact
* Some refactoring
* Formatting fixes
@gaborgsomogyi
Contributor Author

I'm doing further testing...

@gaborgsomogyi
Contributor Author

retest this please

1 similar comment
@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Feb 19, 2019

Test build #102500 has finished for PR 23797 at commit 331cfcd.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • Spark %(lib_name)s libraries not found in class path. Try one of the following.

@SparkQA

SparkQA commented Feb 19, 2019

Test build #102511 has finished for PR 23797 at commit 7a8cc33.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 19, 2019

Test build #102512 has finished for PR 23797 at commit d524937.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 20, 2019

Test build #102520 has finished for PR 23797 at commit d524937.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Feb 20, 2019

Test build #102530 has finished for PR 23797 at commit d524937.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Feb 20, 2019

Test build #102537 has finished for PR 23797 at commit d524937.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Feb 20, 2019

Test build #102542 has finished for PR 23797 at commit d524937.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Feb 20, 2019

Test build #102550 has finished for PR 23797 at commit d524937.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 20, 2019

Test build #102558 has finished for PR 23797 at commit d524937.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor Author

> providing this Python file separately somewhere .. (so that it can be used via py-files).

I was also thinking about such a thing but couldn't really come up with anything that isn't horribly complex from the user's perspective. Feel free to share if anybody has a good idea.

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102644 has finished for PR 23797 at commit 89ff143.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Feb 26, 2019

> Hmm, @gengliangwang, @cloud-fan and @viirya, do you maybe have an idea about how we include the Avro function APIs on the Python side? I think it's reasonable to include a Python API since we have it in Java/Scala too.

The current approach looks okay to me too. Its advantage is that it is consistent with the existing way.

Providing a separate Python file somewhere requires extra setup (like py-files) when users want to use the Python API.

@gengliangwang
Member

@gaborgsomogyi Thanks for the work!
I am not very familiar with PySpark. The approach in this PR is user-friendly and LGTM.

@HyukjinKwon
Member

Okie, I'm gonna take a close look soon. @gaborgsomogyi, please get rid of the WIP tag if you think it's ready for review.

@gaborgsomogyi gaborgsomogyi changed the title [WIP][SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs [SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs Feb 27, 2019
@gaborgsomogyi
Contributor Author

@HyukjinKwon not much to add, so I removed the WIP tag.

@SparkQA

SparkQA commented Mar 7, 2019

Test build #103147 has finished for PR 23797 at commit 65f523b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Looks good to me otherwise.

@SparkQA

SparkQA commented Mar 8, 2019

Test build #103203 has finished for PR 23797 at commit 74aaa6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.master("local[4]")\
.appName("sql.avro.functions tests")\
.getOrCreate()
sc = spark.sparkContext
Member

Sorry, last nit. Looks like we don't need this either.

schema must match the read data, otherwise the behavior is undefined: it may fail or return
arbitrary result.
Note: Avro is built-in but external data source module since Spark 2.4. Please deploy the
Contributor

Could we maybe improve the wording here? "built-in but external" might be a bit confusing. What do you think of something like: it's a supported but optional data source that requires special deployment?

Contributor Author

This part is taken over from the original feature, which was introduced in 2.4. I think users are already used to it. If you still think it's worth it, I suggest modifying the original feature as well.

@SparkQA

SparkQA commented Mar 11, 2019

Test build #103283 has finished for PR 23797 at commit f3f0348.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master, given the positive feedback in general, and that the approach matches what PySpark does.

# Search jar in the project dir using the jar name_prefix for both sbt build and maven
# build because the artifact jars are in different directories.
sbt_build = glob.glob(os.path.join(
project_full_path, "target/scala-*/%s*.jar" % jar_name_prefix))
Member

Hi, All.
This causes Python UT failures which block other PRs. Please see #24268.

HyukjinKwon pushed a commit that referenced this pull request Apr 2, 2019
… for Kinesis assembly

## What changes were proposed in this pull request?

After [SPARK-26856](#23797), `Kinesis` Python UT fails with `Found multiple JARs` exception due to a wrong pattern.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/104171/console
```
Exception: Found multiple JARs:
.../spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar,
.../spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar;
please remove all but one
```

This is because the pattern was changed incorrectly.

**Original**
```python
kinesis_asl_assembly_dir, "target/scala-*/%s-*.jar" % name_prefix))
kinesis_asl_assembly_dir, "target/%s_*.jar" % name_prefix))
```
**After SPARK-26856**
```python
project_full_path, "target/scala-*/%s*.jar" % jar_name_prefix))
project_full_path, "target/%s*.jar" % jar_name_prefix))
```

The actual kinesis assembly jar files look like the followings.

**SBT Build**
```
-rw-r--r--  1 dongjoon  staff  87459461 Apr  1 19:01 spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar
-rw-r--r--  1 dongjoon  staff       309 Apr  1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar
-rw-r--r--  1 dongjoon  staff       309 Apr  1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar
```

**MAVEN Build**
```
-rw-r--r--   1 dongjoon  staff   8.6K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-sources.jar
-rw-r--r--   1 dongjoon  staff   8.6K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-test-sources.jar
-rw-r--r--   1 dongjoon  staff   8.7K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar
-rw-r--r--   1 dongjoon  staff    21M Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar
```

In addition, after SPARK-26856, the utility function `search_jar` is shared to find `avro` jar files, whose names are identical for both `sbt` and `mvn` builds. To sum up, the current single jar-pattern parameter cannot handle both `kinesis` and `avro` jars. This PR splits the single pattern into two patterns.

## How was this patch tested?

Manual. Please note that this will remove only `Found multiple JARs` exception. Kinesis tests need more configurations to run locally.
```
$ build/sbt -Pkinesis-asl test:package streaming-kinesis-asl-assembly/assembly
$ export ENABLE_KINESIS_TESTS=1
$ python/run-tests.py --python-executables python2.7 --module pyspark-streaming
```

Closes #24268 from dongjoon-hyun/SPARK-26856.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@javadi82

javadi82 commented Jan 25, 2020

Sorry for the newbie question: which package should I include so that these functions are available?

I tried this: `pyspark --packages org.apache.spark:spark-avro_2.12:2.4.4`

```
>>> from pyspark.sql.avro.functions import from_avro, to_avro
ImportError: No module named avro.functions
>>> import pyspark.sql.avro.functions
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named avro.functions
```

@dongjoon-hyun
Member

Hi, @javadi82. This is a new feature of 3.0.0. You can see the Fix Version/s field.

Please try in 3.0.0-preview2.

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/

Using Python version 3.7.6 (default, Jan 10 2020 13:37:46)
SparkSession available as 'spark'.
>>> from pyspark.sql.avro.functions import from_avro, to_avro
>>>
```

moritzmeister pushed a commit to moritzmeister/spark that referenced this pull request Feb 12, 2021
Avro has been a built-in but external data source module since Spark 2.4, but the `from_avro` and `to_avro` APIs are not yet supported in PySpark.

In this PR I've made them available from PySpark.

Please see the Python API examples that I've added.

```
cd docs/
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
```
Manual webpage check.

Closes apache#23797 from gaborgsomogyi/SPARK-26856.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 3729efb)