[SPARK-49718][PS] Switch Scatter plot to sampled data
### What changes were proposed in this pull request?
Switch `Scatter` plot to sampled data
### Why are the changes needed?
When the data distribution is related to the row order, the first n rows are not representative of the whole dataset.
For example:
```
import pandas as pd
import numpy as np
import pyspark.pandas as ps
# ps.set_option("plotting.max_rows", 10000)
np.random.seed(123)
pdf = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD')).sort_values("A")
psdf = ps.DataFrame(pdf)
psdf.plot.scatter(x='B', y='A')
```
all 10k datapoints:

before (first 1k datapoints):

after (sampled 1k datapoints):
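
The effect can also be checked numerically without Spark. The following is a minimal sketch using only the Python standard library; it is not the sampling code used by this PR, and the names `values`, `head`, and `sampled` are hypothetical stand-ins for the sorted column `A`, the first 1k rows, and a random 1k-row sample.

```python
import random
import statistics

random.seed(123)

# Stand-in for column "A": 10,000 standard-normal values, sorted ascending
# (mirrors .sort_values("A") in the example above).
values = sorted(random.gauss(0.0, 1.0) for _ in range(10_000))

n = 1_000
head = values[:n]                    # first n rows (behaviour before the change)
sampled = random.sample(values, n)   # uniform random sample of n rows (illustrative only)

full_mean = statistics.fmean(values)
head_mean = statistics.fmean(head)
sample_mean = statistics.fmean(sampled)

print(f"all rows:     mean={full_mean:+.3f}")
print(f"first {n}:   mean={head_mean:+.3f}, max={max(head):+.3f}")
print(f"sampled {n}: mean={sample_mean:+.3f}")
```

Because the data is sorted, the first 1k rows cover only the lowest tenth of the values, so their mean sits far below the overall mean, while the random sample stays close to it.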

### Does this PR introduce _any_ user-facing change?
Yes, scatter plots are now drawn from sampled data instead of the first n rows.
### How was this patch tested?
CI and manual testing.
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #48164 from zhengruifeng/ps_scatter_sampling.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>