[SPARK-49718][PS] Switch Scatter plot to sampled data

zhengruifeng · dongjoon-hyun · commit 6d1815eceea2 · 2024-09-19T12:31:48.000-07:00
### What changes were proposed in this pull request?
Switch the `Scatter` plot to sampled data.

### Why are the changes needed?
When the data distribution is correlated with the row order, the first n rows are not representative of the whole dataset. For example:

```
import pandas as pd
import numpy as np
import pyspark.pandas as ps

# ps.set_option("plotting.max_rows", 10000)
np.random.seed(123)
pdf = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD')).sort_values("A")
psdf = ps.DataFrame(pdf)
psdf.plot.scatter(x='B', y='A')
```

all 10k datapoints:
![image](https://github.com/user-attachments/assets/72cf7e97-ad10-41e0-a8a6-351747d5285f)

before (first 1k datapoints):
![image](https://github.com/user-attachments/assets/1ed50d2c-7772-4579-a84c-6062542d9367)

after (sampled 1k datapoints):
![image](https://github.com/user-attachments/assets/6c684cba-4119-4c38-8228-2bedcdeb9e59)

### Does this PR introduce _any_ user-facing change?
Yes. Scatter plots are now drawn from sampled rows instead of the first n rows.

### How was this patch tested?
CI and manual testing.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #48164 from zhengruifeng/ps_scatter_sampling.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
diff --git a/python/pyspark/pandas/plot/core.py b/python/pyspark/pandas/plot/core.py
@@ -479,7 +479,7 @@ class PandasOnSparkPlotAccessor(PandasObject):
         "pie": TopNPlotBase().get_top_n,
         "bar": TopNPlotBase().get_top_n,
         "barh": TopNPlotBase().get_top_n,
-        "scatter": TopNPlotBase().get_top_n,
+        "scatter": SampledPlotBase().get_sampled,
         "area": SampledPlotBase().get_sampled,
         "line": SampledPlotBase().get_sampled,
     }
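
The effect of moving `"scatter"` from `get_top_n` to `get_sampled` can be illustrated without a Spark session. The sketch below is a pandas-only stand-in, not the code in `core.py`: `head` plays the role of the top-n path, `DataFrame.sample` plays the role of the sampled path, and `max_rows` is a local variable assumed to mirror the `plotting.max_rows` option. The real implementation works on Spark DataFrames and may sample differently (for example by fraction rather than by exact count).

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the PR description: 10k rows sorted by column A,
# so the row order is strongly correlated with the value of A.
np.random.seed(123)
pdf = pd.DataFrame(np.random.randn(10000, 4), columns=list("ABCD")).sort_values("A")

max_rows = 1000  # assumption: stand-in for the "plotting.max_rows" option

# Top-n style selection: keep the first max_rows rows in their current order.
top_n = pdf.head(max_rows)

# Sampled style selection: draw max_rows rows uniformly at random,
# independent of the row order.
sampled = pdf.sample(n=max_rows, random_state=0)

# Compare how much of the full range of A each selection covers.
for name, frame in [("full", pdf), ("top_n", top_n), ("sampled", sampled)]:
    print(f"{name:8s} A in [{frame['A'].min():.2f}, {frame['A'].max():.2f}]")
```

Because the data is sorted by `A`, the top-n selection only contains the smallest values of `A`, while the random sample spans roughly the same range as the full dataset. This matches the "before" and "after" plots in the PR description.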
