[SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs #23797
Conversation
Test build #102388 has finished for PR 23797 at commit

Test build #102390 has finished for PR 23797 at commit

Test build #102392 has finished for PR 23797 at commit
A few comments:

- I think the import path should be `pyspark.sql.avro.functions` to be consistent.
- You should probably fix `dev/sparktestsupport/modules.py` to include the avro artifact in the PySpark tests.
- I am not sure which way is better. The options I can currently think of:
  a) somehow provide a Python file that can be used via `--py-files` (considering Avro is a separate source)
  b) we can add some code within Apache Spark, like the current way. We could throw a proper exception after checking whether the Avro classes are loadable or not (a minimal sketch follows this list).

Let me think a bit more.
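For option (b), here is a hedged sketch of what such a loadability check could look like, assuming the Py4J gateway is reachable through `spark._jvm`; the probed class name (`org.apache.spark.sql.avro.AvroFileFormat`) and the error message are illustrative assumptions, not the final implementation.

```python
from pyspark.sql import SparkSession

def _require_avro(spark):
    """Fail fast with a helpful message when the external Avro module is absent."""
    try:
        # Probe the JVM: Class.forName raises if spark-avro is not on the classpath.
        # The class name is an assumption chosen for illustration.
        spark._jvm.java.lang.Class.forName("org.apache.spark.sql.avro.AvroFileFormat")
    except Exception:
        raise ImportError(
            "Avro is an external data source module; deploy it with, e.g., "
            "--packages org.apache.spark:spark-avro_<scala-version>:<spark-version>")

spark = SparkSession.builder.getOrCreate()
_require_avro(spark)
```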
* Added avro artifact
* Some refactoring
* Formatting fixes
I'm doing further testing...

retest this please

retest this please

Test build #102500 has finished for PR 23797 at commit

Test build #102511 has finished for PR 23797 at commit

Test build #102512 has finished for PR 23797 at commit

retest this please

Test build #102520 has finished for PR 23797 at commit

retest this please

Test build #102530 has finished for PR 23797 at commit

retest this please

Test build #102537 has finished for PR 23797 at commit

retest this please

Test build #102542 has finished for PR 23797 at commit

retest this please

Test build #102550 has finished for PR 23797 at commit

retest this please

Test build #102558 has finished for PR 23797 at commit
I was also thinking about something like that but couldn't really come up with anything that isn't horribly complex from the user's perspective. Feel free to share if anybody has a good idea.
Test build #102644 has finished for PR 23797 at commit
The current approach looks okay to me too. Its advantage is that it is consistent with the existing way. Providing a separate Python file somewhere requires extra setup (like …)
@gaborgsomogyi Thanks for the work!
Okay, I'm gonna take a close look soon. @gaborgsomogyi, please get rid of the WIP tag if you think it's ready for review.
@HyukjinKwon not much to add, so removed the WIP tag.

Test build #103147 has finished for PR 23797 at commit

Looks good to me otherwise.

Test build #103203 has finished for PR 23797 at commit
python/pyspark/sql/avro/functions.py (outdated):

```python
    .master("local[4]")\
    .appName("sql.avro.functions tests")\
    .getOrCreate()
sc = spark.sparkContext
```
Sorry, last nit. Looks like we don't need this either.
```
schema must match the read data, otherwise the behavior is undefined: it may fail or return
arbitrary result.

Note: Avro is built-in but external data source module since Spark 2.4. Please deploy the
```
Could we maybe improve the wording here? "built-in but external" might be a bit confusing. What do you think of something like: it's a supported but optional data source that requires special deployment?
This part is carried over from the original feature, which was introduced in 2.4. I think users have already gotten used to it. If you still think it's worth changing, I suggest modifying the original feature as well.
Test build #103283 has finished for PR 23797 at commit
Merged to master, given the generally positive feedback, and that the approach matches what PySpark already does.
```python
# Search jar in the project dir using the jar name_prefix for both sbt build and maven
# build because the artifact jars are in different directories.
sbt_build = glob.glob(os.path.join(
    project_full_path, "target/scala-*/%s*.jar" % jar_name_prefix))
```
Hi, all.
This causes Python UT failures which block other PRs. Please see #24268.
… for Kinesis assembly

## What changes were proposed in this pull request?

After [SPARK-26856](#23797), `Kinesis` Python UT fails with a `Found multiple JARs` exception due to a wrong pattern.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/104171/console

```
Exception: Found multiple JARs: .../spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar, .../spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar; please remove all but one
```

It's because the pattern was changed in a wrong way.

**Original**
```python
kinesis_asl_assembly_dir, "target/scala-*/%s-*.jar" % name_prefix))
kinesis_asl_assembly_dir, "target/%s_*.jar" % name_prefix))
```

**After SPARK-26856**
```python
project_full_path, "target/scala-*/%s*.jar" % jar_name_prefix))
project_full_path, "target/%s*.jar" % jar_name_prefix))
```

The actual kinesis assembly jar files look like the following.

**SBT Build**
```
-rw-r--r--  1 dongjoon  staff  87459461 Apr  1 19:01 spark-streaming-kinesis-asl-assembly-3.0.0-SNAPSHOT.jar
-rw-r--r--  1 dongjoon  staff       309 Apr  1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar
-rw-r--r--  1 dongjoon  staff       309 Apr  1 18:58 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar
```

**MAVEN Build**
```
-rw-r--r--  1 dongjoon  staff  8.6K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-sources.jar
-rw-r--r--  1 dongjoon  staff  8.6K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-test-sources.jar
-rw-r--r--  1 dongjoon  staff  8.7K Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT-tests.jar
-rw-r--r--  1 dongjoon  staff   21M Apr  1 18:55 spark-streaming-kinesis-asl-assembly_2.12-3.0.0-SNAPSHOT.jar
```

In addition, after SPARK-26856, the utility function `search_jar` is shared to find `avro` jar files, which are identical for both `sbt` and `mvn`. To sum up, the current jar pattern parameter cannot handle both `kinesis` and `avro` jars. This PR splits the single pattern into two patterns.

## How was this patch tested?

Manual. Please note that this will remove only the `Found multiple JARs` exception. Kinesis tests need more configuration to run locally.

```
$ build/sbt -Pkinesis-asl test:package streaming-kinesis-asl-assembly/assembly
$ export ENABLE_KINESIS_TESTS=1
$ python/run-tests.py --python-executables python2.7 --module pyspark-streaming
```

Closes #24268 from dongjoon-hyun/SPARK-26856.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
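To make the split-pattern fix concrete, here is a hedged sketch of the approach described in that commit message. The separate sbt/maven prefixes mirror the snippets above, but the surrounding scaffolding (imports, `SPARK_HOME` resolution, the suffix filtering) is assumed for illustration rather than taken verbatim from the PR.

```python
import glob
import os

def search_jar(project_relative_path, sbt_jar_name_prefix, mvn_jar_name_prefix):
    # Assumption: the project directory is resolved relative to SPARK_HOME.
    project_full_path = os.path.join(
        os.environ["SPARK_HOME"], project_relative_path)

    # sbt and maven place artifact jars in different directories and name them
    # slightly differently, so each build gets its own glob pattern.
    sbt_build = glob.glob(os.path.join(
        project_full_path, "target/scala-*/%s*.jar" % sbt_jar_name_prefix))
    maven_build = glob.glob(os.path.join(
        project_full_path, "target/%s*.jar" % mvn_jar_name_prefix))

    # Drop source/test artifacts so only the assembly jar can match.
    jar_paths = sbt_build + maven_build
    return [jar for jar in jar_paths if not jar.endswith(
        ("-sources.jar", "-tests.jar", "-test-sources.jar"))]
```

With separate prefixes, the kinesis caller can pass a prefix ending in `-` for the sbt name and one ending in `_` for the maven name, so only one assembly jar matches per build instead of both.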
Sorry for the newbie question: which package should I include so that these functions are available? I tried this: `pyspark --packages org.apache.spark:spark-avro_2.12:2.4.4`
Hi, @javadi82. This is a new feature of 3.0.0. You can see … Please try in …
Avro is a built-in but external data source module since Spark 2.4, but the `from_avro` and `to_avro` APIs are not yet supported in PySpark. In this PR I've made them available from PySpark. Please see the Python API examples I've added.

```
cd docs/
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
```

Manual webpage check.

Closes apache#23797 from gaborgsomogyi/SPARK-26856.
Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 3729efb)
What changes were proposed in this pull request?
Avro is a built-in but external data source module since Spark 2.4, but the `from_avro` and `to_avro` APIs are not yet supported in PySpark. In this PR I've made them available from PySpark.
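For readers who want a feel for the new functions, here is a hedged usage sketch; the DataFrame, column names, and Avro schema are illustrative, and it assumes Spark was started with the external Avro module on the classpath (e.g. `--packages org.apache.spark:spark-avro_<scala-version>:<spark-version>`).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct
from pyspark.sql.avro.functions import from_avro, to_avro

spark = SparkSession.builder.appName("avro-functions-sketch").getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Encode all columns as a single Avro-serialized binary column.
avro_df = df.select(to_avro(struct("id", "name")).alias("value"))

# Decoding needs the Avro schema in JSON format; the unions with "null"
# reflect that the columns above are nullable. This schema is illustrative.
json_schema = """{
  "type": "record", "name": "rec",
  "fields": [
    {"name": "id",   "type": ["long", "null"]},
    {"name": "name", "type": ["string", "null"]}
  ]
}"""

decoded = avro_df.select(from_avro("value", json_schema).alias("rec"))
decoded.select("rec.*").show()
```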
How was this patch tested?
Please see the Python API examples I've added.

```
cd docs/
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
```

Manual webpage check.