[SPARK-15954][SQL][PySpark][TEST] Fix TestHiveContext interaction with PySpark issue #13737
Conversation
cc @MLnick: I factored out the test fix from the doc-change PR.
Test build #60709 has finished for PR 13737 at commit
re-ping @MLnick and cc @jkbradley / @rxin / @sameeragarwal - see the parent PR at #12938 for some discussion around the fix.
Thanks, LGTM. It'd be great to add a comment here about the fallback and its implications on PySpark tests to prevent future regressions. Also, out of curiosity, do we know what the underlying cause of flakiness is?
Why does Python need to load these test resources? I think the proper fix is to get rid of that dependency. Otherwise we are making the test harness more complicated and more tightly coupled.
@sameeragarwal I'm not super sure, but my guess is that the resources are cleaned up at some point during testing (and in a local dev build they aren't normally packaged). As for removing the test-resources requirement from PySpark: how about I look at doing that in a follow-up PR, since for now it's difficult for PySpark developers to test and this is blocking another outstanding PR?
@sameeragarwal Updated the comment.
Test build #60894 has finished for PR 13737 at commit
@rxin From what I can see, the "quick" fix isn't as simple as it looks, so the real fix will likely be a little more involved. Perhaps, as @holdenk says, it's best to go ahead with this PR and do the decoupling separately, just given where we are in the release cycle.
ping @rxin - I think we should fix this before cutting a new RC.
Can you check what files the Python tests are using? My understanding is that there should only be a small number of places in Python that use these test files.
@rxin Temporarily disabling registration, it seems the files aren't actually used from the Python tests; rather the
I'm suggesting just removing the dependency on TestHiveContext in Python altogether. It shouldn't be that difficult. Don't add hacks just to work around some init, if we can get rid of the problem.
BTW, Jenkins seems to be fine with Python? What's the problem?
@rxin Ah sorry, I thought you wanted to remove the dependency around the file-loading functionality - that makes a bit more sense. I can take a look at that possibility in more detail tomorrow if we don't want the simple fix. At first glance, though, I'm not convinced removing the Python dependency on TestHiveContext is that simple. The problem seems to be that the files aren't always packaged as resources when you build locally (and in Jenkins it was somehow a race condition some of the time). It seemed better to try and get a small working fix in quickly and do the deep dive later.
@rxin try running
@MLnick @rxin Yeah, so after poking at it a bit today I don't see a good way to disentangle this - we presumably want to make sure that PySpark works well with a Hive-based Spark session (even if the tests weren't testing Hive-specific functionality). If we would rather fix it by disabling the loading of tables from Python: I did make some changes so that we could disable loading the test tables / required files for Python-based tests (which might be good regardless, since loading the files presumably takes some time and the Scala test tables aren't used in the Python tests).
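The idea of disabling test-table loading for Python-based tests could look roughly like this. This is an illustrative Python sketch, not Spark's actual TestHiveContext code; the class name, table names, and the `load_test_tables` flag are all assumptions.

```python
class TestHiveSessionSketch:
    """Illustrative stand-in for a Hive-enabled test session; not Spark's
    actual TestHiveContext. The load_test_tables flag is hypothetical."""

    # Hypothetical canned test tables the Scala suites would register.
    TEST_TABLES = ["src", "srcpart", "srcpart1"]

    def __init__(self, load_test_tables=True):
        self.registered_tables = []
        # Scala test suites want the canned tables; Python tests only need
        # a working Hive-enabled session, so they can skip this step.
        if load_test_tables:
            for name in self.TEST_TABLES:
                self.registered_tables.append(name)

# Scala-style usage loads the tables; Python-style usage skips them.
scala_session = TestHiveSessionSketch()
python_session = TestHiveSessionSketch(load_test_tables=False)
```

Skipping registration would also shave some time off the Python test runs, since the canned files wouldn't need to be located or loaded at all.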
Ah OK. I just looked into this and submitted a fix: #14005
Closing this since it seems like @rxin has a fix. |
What changes were proposed in this pull request?
SPARK-15745 made TestHive unreliable from PySpark test cases; to support those tests, we should allow both resource-based and system-property-based lookup when loading the Hive test files.
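A minimal sketch of the dual lookup described above, written in Python for illustration (the actual fix is in Spark's Scala test harness): try a classpath-style resource lookup first, then fall back to a directory supplied via a system property. The function and parameter names here are hypothetical, not Spark's API.

```python
import os

def find_hive_test_file(name, resource_lookup, fallback_dir=None):
    """Locate a Hive test data file.

    resource_lookup models a classpath resource lookup (it may return None
    when the file wasn't packaged into the build); fallback_dir models a
    directory passed via a system property. Both are illustrative names.
    """
    # First choice: the file packaged as a resource on the classpath.
    path = resource_lookup(name)
    if path is not None and os.path.isfile(path):
        return path
    # Fallback: look under the property-provided directory, for local
    # dev builds where the resources were never packaged.
    if fallback_dir is not None:
        candidate = os.path.join(fallback_dir, name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

With both paths in place, a missing packaged resource no longer fails the PySpark tests as long as the source tree's copy of the file can be found via the fallback.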
How was this patch tested?
Existing PySpark tests now pass.