31 changes: 30 additions & 1 deletion python/pyspark/sql/tests/test_arrow.py
@@ -22,7 +22,7 @@
import unittest
import warnings

from pyspark.sql import Row
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import *
from pyspark.testing.sqlutils import ReusedSQLTestCase, have_pandas, have_pyarrow, \
@@ -421,6 +421,35 @@ def run_test(num_records, num_parts, max_records, use_delay=False):
run_test(*case)


@unittest.skipIf(
not have_pandas or not have_pyarrow,
pandas_requirement_message or pyarrow_requirement_message)
class MaxResultArrowTests(unittest.TestCase):
# These tests are separate as 'spark.driver.maxResultSize' configuration
# is a static configuration to Spark context.

@classmethod
def setUpClass(cls):
cls.spark = SparkSession.builder \
.master("local[4]") \
.appName(cls.__name__) \
.config("spark.driver.maxResultSize", "10k") \
.getOrCreate()

# Explicitly enable Arrow and disable fallback.
cls.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
cls.spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
@dongjoon-hyun (Member), Aug 27, 2019:

Oh, do we need these for the test cases? These are the opposite of the default values. In the reported JIRA's scenario, we didn't change the default configuration. Did we change the default values for those configurations in 3.0.0?

Member Author:

spark.sql.execution.arrow.enabled now has an alias spark.sql.execution.arrow.pyspark.enabled as of d6632d1

And spark.sql.execution.arrow.pyspark.fallback.enabled was just set to narrow down the test scope. The correct behaviour should be a failure in the Arrow-optimized code path (although in the JIRA's case, it fails in the Arrow-optimized path first, and then fails again in the non-Arrow-optimized path).

Contributor:

I think it's better to have a test that fails with default settings, for branch 2.4.
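
One possible shape for such a test, as a hedged sketch (the class and test names are hypothetical, not from this PR): a branch-2.4 variant that only lowers the static spark.driver.maxResultSize and leaves every Arrow configuration at its default, so the failure has to surface with stock settings.

import unittest

from pyspark.sql import SparkSession


class MaxResultDefaultConfTests(unittest.TestCase):
    # Hypothetical branch-2.4 variant: no Arrow configurations are touched,
    # only the static driver-side result size limit is lowered.

    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder \
            .master("local[4]") \
            .appName(cls.__name__) \
            .config("spark.driver.maxResultSize", "10k") \
            .getOrCreate()

    @classmethod
    def tearDownClass(cls):
        if hasattr(cls, "spark"):
            cls.spark.stop()

    def test_exception_by_max_results_default_conf(self):
        # Serialized results from 100 partitions should exceed the 10k limit
        # and raise, rather than return partial or empty data.
        with self.assertRaisesRegexp(Exception, "is bigger than"):
            self.spark.range(0, 10000, 1, 100).toPandas()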

Member Author:

(The issue SPARK-28881 happened because it passed in the Arrow-optimized code path and returned partial or empty data.)

Member Author:

I am not going to use aliases when I port this to branch-2.4 because, technically, spark.sql.execution.arrow.enabled and spark.sql.execution.arrow.fallback.enabled are supposed to be removed in the future.
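
A hedged sketch of how the equivalent session setup might look on branch-2.4, where spark.sql.execution.arrow.enabled and spark.sql.execution.arrow.fallback.enabled are the native configuration names (an assumption about the port, not something in this diff):

from pyspark.sql import SparkSession

# Assumed branch-2.4 variant: same static result-size limit, but the Arrow
# settings use the pre-3.0 names rather than the *.pyspark.* ones.
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("MaxResultArrowTests") \
    .config("spark.driver.maxResultSize", "10k") \
    .getOrCreate()

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")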

Member Author:

These configurations just narrow down the testing scope, and test_arrow.py is supposed to test this scope (as can be seen in ArrowTests).


@classmethod
def tearDownClass(cls):
if hasattr(cls, "spark"):
cls.spark.stop()

def test_exception_by_max_results(self):
with self.assertRaisesRegexp(Exception, "is bigger than"):
self.spark.range(0, 10000, 1, 100).toPandas()


class EncryptionArrowTests(ArrowTests):

@classmethod
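
For reference, a hedged example of running only the new test class with the plain unittest runner (assumes a local PySpark development checkout with python/ and py4j on PYTHONPATH; not part of this change):

import unittest

# Module path taken from the diff header above; the class is the one added here.
from pyspark.sql.tests.test_arrow import MaxResultArrowTests

if __name__ == "__main__":
    suite = unittest.TestLoader().loadTestsFromTestCase(MaxResultArrowTests)
    unittest.TextTestRunner(verbosity=2).run(suite)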