Commit 1d18096

BryanCutler authored and HyukjinKwon committed
[SPARK-32162][PYTHON][TESTS] Improve error message of Pandas grouped map test with window
### What changes were proposed in this pull request?

Improve the error message in the test `GroupedMapInPandasTests.test_grouped_over_window_with_key` to show the incorrect values.

### Why are the changes needed?

This test failure has come up often in Arrow testing because it tests a struct with timestamp values through a Pandas UDF. The current error message is not helpful, as it doesn't show the incorrect values, only that the test failed. This change instead raises an assertion error that includes the incorrect values on failure.

Before:

```
======================================================================
FAIL: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 588, in test_grouped_over_window_with_key
    self.assertTrue(all([r[0] for r in result]))
AssertionError: False is not true
```

After:

```
======================================================================
ERROR: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
----------------------------------------------------------------------
...
AssertionError: {'start': datetime.datetime(2018, 3, 20, 0, 0), 'end': datetime.datetime(2018, 3, 25, 0, 0)}, != {'start': datetime.datetime(2020, 3, 20, 0, 0), 'end': datetime.datetime(2020, 3, 25, 0, 0)}
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Improved existing test

Closes apache#28987 from BryanCutler/pandas-grouped-map-test-output-SPARK-32162.

Authored-by: Bryan Cutler <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
1 parent 59a7087 commit 1d18096
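The core of the change — replacing a bare truth-value check with per-record asserts that carry the mismatching values — can be sketched outside Spark. This is a minimal illustration; `records` and `expected_key` below are made-up names and data, not the test's own.

```python
# Minimal sketch of the assertion-message improvement this commit makes.
# `records` and `expected_key` are illustrative stand-ins, not the real test data.

records = [(0, 1), (1, 2), (2, 2)]   # (id, group) pairs
expected_key = {0: 1, 1: 2, 2: 2}    # id -> expected group

# Before: on failure this only reports "AssertionError: False is not true".
assert all(expected_key[i] == g for i, g in records)

# After: a failure names the offending values directly, e.g. "2 != 3".
for i, g in records:
    assert expected_key[i] == g, "{} != {}".format(expected_key[i], g)

print("checks passed")
```

The message argument to `assert` is only evaluated on failure, so the improved form costs nothing on the passing path.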

File tree

1 file changed: +35 −22 lines changed

python/pyspark/sql/tests/test_pandas_grouped_map.py

Lines changed: 35 additions & 22 deletions
```diff
@@ -545,13 +545,13 @@ def f(pdf):
 
     def test_grouped_over_window_with_key(self):
-        data = [(0, 1, "2018-03-10T00:00:00+00:00", False),
-                (1, 2, "2018-03-11T00:00:00+00:00", False),
-                (2, 2, "2018-03-12T00:00:00+00:00", False),
-                (3, 3, "2018-03-15T00:00:00+00:00", False),
-                (4, 3, "2018-03-16T00:00:00+00:00", False),
-                (5, 3, "2018-03-17T00:00:00+00:00", False),
-                (6, 3, "2018-03-21T00:00:00+00:00", False)]
+        data = [(0, 1, "2018-03-10T00:00:00+00:00", [0]),
+                (1, 2, "2018-03-11T00:00:00+00:00", [0]),
+                (2, 2, "2018-03-12T00:00:00+00:00", [0]),
+                (3, 3, "2018-03-15T00:00:00+00:00", [0]),
+                (4, 3, "2018-03-16T00:00:00+00:00", [0]),
+                (5, 3, "2018-03-17T00:00:00+00:00", [0]),
+                (6, 3, "2018-03-21T00:00:00+00:00", [0])]
 
         expected_window = [
             {'start': datetime.datetime(2018, 3, 10, 0, 0),
@@ -562,30 +562,43 @@ def test_grouped_over_window_with_key(self):
              'end': datetime.datetime(2018, 3, 25, 0, 0)},
         ]
 
-        expected = {0: (1, expected_window[0]),
-                    1: (2, expected_window[0]),
-                    2: (2, expected_window[0]),
-                    3: (3, expected_window[1]),
-                    4: (3, expected_window[1]),
-                    5: (3, expected_window[1]),
-                    6: (3, expected_window[2])}
+        expected_key = {0: (1, expected_window[0]),
+                        1: (2, expected_window[0]),
+                        2: (2, expected_window[0]),
+                        3: (3, expected_window[1]),
+                        4: (3, expected_window[1]),
+                        5: (3, expected_window[1]),
+                        6: (3, expected_window[2])}
+
+        # id -> array of group with len of num records in window
+        expected = {0: [1],
+                    1: [2, 2],
+                    2: [2, 2],
+                    3: [3, 3, 3],
+                    4: [3, 3, 3],
+                    5: [3, 3, 3],
+                    6: [3]}
 
         df = self.spark.createDataFrame(data, ['id', 'group', 'ts', 'result'])
         df = df.select(col('id'), col('group'), col('ts').cast('timestamp'), col('result'))
 
-        @pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
         def f(key, pdf):
             group = key[0]
             window_range = key[1]
-            # Result will be True if group and window range equal to expected
-            is_expected = pdf.id.apply(lambda id: (expected[id][0] == group and
-                                                   expected[id][1] == window_range))
-            return pdf.assign(result=is_expected)
 
-        result = df.groupby('group', window('ts', '5 days')).apply(f).select('result').collect()
+            # Make sure the key with group and window values are correct
+            for _, i in pdf.id.iteritems():
+                assert expected_key[i][0] == group, "{} != {}".format(expected_key[i][0], group)
+                assert expected_key[i][1] == window_range, \
+                    "{} != {}".format(expected_key[i][1], window_range)
 
-        # Check that all group and window_range values from udf matched expected
-        self.assertTrue(all([r[0] for r in result]))
+            return pdf.assign(result=[[group] * len(pdf)] * len(pdf))
+
+        result = df.groupby('group', window('ts', '5 days')).applyInPandas(f, df.schema)\
+            .select('id', 'result').collect()
+
+        for r in result:
+            self.assertListEqual(expected[r[0]], r[1])
 
     def test_case_insensitive_grouping_column(self):
         # SPARK-31915: case-insensitive grouping column should work.
```
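The new driver-side loop at the end of the diff can be exercised without Spark or pandas: each collected `(id, result)` row is compared against the expected per-id list, and the assertion message shows both sides on a mismatch. The `expected` and `collected` values below are illustrative stand-ins, not real `.collect()` output.

```python
# Plain-Python sketch of the patched test's driver-side check:
# compare each collected (id, result) row against an expected mapping,
# with an assertion message that reports both values on mismatch.
# `expected` and `collected` are illustrative stand-ins.

expected = {0: [1], 1: [2, 2], 2: [2, 2]}
collected = [(0, [1]), (1, [2, 2]), (2, [2, 2])]  # (id, result) rows

for row_id, result in collected:
    assert expected[row_id] == result, \
        "{} != {}".format(expected[row_id], result)

print("all rows matched")
```

In the real test, `self.assertListEqual` plays this role and produces a similarly value-bearing message through unittest's reporting.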

0 commit comments