Conversation

@dongjoon-hyun (Member) commented Apr 24, 2018

What changes were proposed in this pull request?

When PyArrow or Pandas are not available, the corresponding PySpark tests are skipped automatically. Currently, PySpark tests fail when Spark is built without -Phive. This PR aims to skip Hive-related PySpark tests when -Phive is not given.

BEFORE

$ build/mvn -DskipTests clean package
$ python/run-tests.py --python-executables python2.7 --modules pyspark-sql
File "/Users/dongjoon/spark/python/pyspark/sql/readwriter.py", line 295, in pyspark.sql.readwriter.DataFrameReader.table
...
IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':"
**********************************************************************
   1 of   3 in pyspark.sql.readwriter.DataFrameReader.table
***Test Failed*** 1 failures.

AFTER

$ build/mvn -DskipTests clean package
$ python/run-tests.py --python-executables python2.7 --modules pyspark-sql
...
Tests passed in 138 seconds

Skipped tests in pyspark.sql.tests with python2.7:
...
    test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) ... skipped 'Hive is not available.'

How was this patch tested?

This is a test-only change. First, this should pass Jenkins. Then, manually run the following.

build/mvn -DskipTests clean package
python/run-tests.py --python-executables python2.7 --modules pyspark-sql
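For reference, the skip mechanism described above can be sketched as follows. This is a minimal, hypothetical illustration rather than the actual patch: the real tests probe Hive availability through the JVM, whereas `hive_available` here is a stand-in probe, and the test body is a placeholder.

```python
import unittest


def hive_available():
    # Stand-in probe for illustration only. Spark's actual tests try to
    # instantiate a Hive-backed context and treat failure as "not built
    # with -Phive"; here we merely check that a module imports.
    try:
        import pyspark.sql  # assumption: substitute for the real Hive probe
        return True
    except ImportError:
        return False


class HiveSparkSubmitTests(unittest.TestCase):
    def setUp(self):
        # Skipping in setUp marks every test in the class as skipped
        # with the given message, matching the output quoted above.
        if not hive_available():
            self.skipTest("Hive is not available.")

    def test_hivecontext(self):
        self.assertTrue(True)  # placeholder body
```

Whether the probe fails or succeeds, the suite still reports success; skipped tests are counted but do not fail the run.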

@dongjoon-hyun (Member, Author)

Hi, @holdenk. Could you review this PR when you have some time?

@SparkQA commented Apr 24, 2018

Test build #89781 has finished for PR 21141 at commit dc81cf9.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) commented Apr 24, 2018 via email

@HyukjinKwon (Member) commented Apr 24, 2018

Actually, I think #20909 tries to fix a similar thing.

If both fix similar things, this approach looks a bit more preferable. @bersprockets, if that sounds the same to you, we could maybe do the doctest skips in #20909 and get this merged separately?

@SparkQA commented Apr 24, 2018

Test build #89782 has finished for PR 21141 at commit 54fdfd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Will this result in the right kind of message, particularly the kind that @HyukjinKwon is checking for in PR #21107?

Member

Yea, actually, I think it would be nicer in setUp for a better(?) message. I replaced unittest.SkipTest with self.skipTest per https://docs.python.org/2/library/unittest.html#unittest.SkipTest and https://docs.python.org/2/library/unittest.html#unittest.TestCase.skipTest
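For illustration, both APIs referenced in those docs produce the same skip result; calling `self.skipTest(...)` is just the method-call form of raising `unittest.SkipTest`. The class and test names below are hypothetical.

```python
import unittest


class RaisesSkipTest(unittest.TestCase):
    # Hypothetical example: raising the exception directly also works,
    # since skipTest() is a convenience wrapper that raises it.
    def test_one(self):
        raise unittest.SkipTest("Hive is not available.")


class CallsSkipTest(unittest.TestCase):
    def test_one(self):
        # The form preferred in this PR discussion: reads as a method call.
        self.skipTest("Hive is not available.")
```

Both tests end up in the runner's skipped list with the same reason string.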

Member

I think we should see if @holdenk likes this or not too while we are here.

Contributor

I do like more information on skipped tests.

Member Author

Yep. According to the latest convention of #21107, it will be displayed like this.

Skipped tests in pyspark.sql.tests with python2.7:
...
    test_hivecontext (pyspark.sql.tests.HiveSparkSubmitTests) ... skipped 'Hive is not available.'

Contributor

Since this is test code it's probably OK, but this assumes that someone hasn't packaged Spark as an assembly JAR. I like the approach taken in #20909 for checking.

Member

Perhaps look for TestHiveContext as in other test cases:

newJObject("org.apache.spark.sql.hive.test.TestHiveContext", ssc, FALSE)

jtestHive = sparkContext._jvm.org.apache.spark.sql.hive.test.TestHiveContext(jsc, False)

Member Author

This PR will follow #20909, like the other occurrence in HiveContextSQLTests in this file.


@holdenk (Contributor) commented Apr 26, 2018

Also, weirdly, when I run this locally without Hive built I get some UDF registration exceptions (could be unrelated). Do you get that?

@bersprockets (Contributor)

@holdenk Yes, see this Jira. If you build with sbt, you also need to run build/sbt sql/test:compile to get the UDFs from the test files.

@dongjoon-hyun (Member, Author)

Thank you for the review, @HyukjinKwon, @holdenk, @bersprockets.

I didn't notice SPARK-23776 when I chose SPARK-23853. I think we can merge those PRs now.

@bersprockets, could you update your readwriter.py like this PR?

@HyukjinKwon (Member)

I am okay either way, but I thought @bersprockets agreed on doing this separately here? The doctest changes need more review, and I think this one alone can be merged separately.

@bersprockets I'll leave it to your preference, since you are the author of the original PR.

@bersprockets (Contributor)

@dongjoon-hyun @HyukjinKwon My PR is no longer addressing the issue described in its associated Jira (SPARK-23776), which is that developers don't know what to do when they run the PySpark tests and get a failure with a UDF registration error (or a missing Hive assembly error), as Holden experienced earlier. My PR morphed into a "skip the tests for missing components" change.

After these "skip tests" PRs go through, I will revisit this. In the meantime, feel free to use/ignore whatever is in #20909.

@dongjoon-hyun (Member, Author)

I see. Thanks, @bersprockets. I'll proceed with this PR according to your and other people's comments.

@SparkQA commented Apr 29, 2018

Test build #89974 has finished for PR 21141 at commit 271e152.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member, Author)

The PR is updated now. Could you review this again, @holdenk, @HyukjinKwon, @felixcheung, @bersprockets?

@HyukjinKwon (Member) left a comment

LGTM

@dongjoon-hyun (Member, Author)

Thank you for the review and approval, @HyukjinKwon.

@bersprockets (Contributor)

My experience here is limited. Still, it also looks good to me.

@dongjoon-hyun (Member, Author)

Thank you, @bersprockets .

@HyukjinKwon (Member)

Merged to master and branch-2.3.

asfgit pushed a commit that referenced this pull request May 1, 2018
…`-Phive`

Author: Dongjoon Hyun <[email protected]>

Closes #21141 from dongjoon-hyun/SPARK-23853.

(cherry picked from commit b857fb5)
Signed-off-by: hyukjinkwon <[email protected]>
@asfgit closed this in b857fb5 May 1, 2018
@dongjoon-hyun (Member, Author)

Thank you, @HyukjinKwon!

@dongjoon-hyun deleted the SPARK-23853 branch May 1, 2018 05:51