[SPARK-49718][PS] Switch `Scatter` plot to sampled data #48164

zhengruifeng · 2024-09-19T09:22:55Z

What changes were proposed in this pull request?

Switch the `Scatter` plot to use sampled data instead of the first n rows.

Why are the changes needed?

When the data distribution is correlated with the row order, the first n rows are not representative of the whole dataset.

For example:

import pandas as pd
import numpy as np
import pyspark.pandas as ps

# ps.set_option("plotting.max_rows", 10000)
np.random.seed(123)

pdf = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD')).sort_values("A")
psdf = ps.DataFrame(pdf)

psdf.plot.scatter(x='B', y='A')

all 10k datapoints:

before (first 1k datapoints):

after (sampled 1k datapoints):
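
The same effect can be checked numerically with a rough sketch in plain pandas. This is only an illustration and does not reproduce the pyspark.pandas plotting internals: the names `head_part` and `sampled_part`, the fixed sample size, and the `random_state` value are assumptions made for this example.

```python
import numpy as np
import pandas as pd

np.random.seed(123)

# Same sorted frame as in the example above, built with plain pandas.
pdf = pd.DataFrame(np.random.randn(10000, 4), columns=list("ABCD")).sort_values("A")

# "Before": the first 1k rows of a frame sorted by A cover only the lowest A values.
head_part = pdf.head(1000)

# "After" (illustrative): a uniform random sample of 1k rows spans the whole range of A.
sampled_part = pdf.sample(n=1000, random_state=0)

summary = pd.DataFrame(
    {
        "all": pdf["A"].describe(),
        "first_1k": head_part["A"].describe(),
        "sampled_1k": sampled_part["A"].describe(),
    }
)
print(summary.loc[["mean", "min", "max"]])
```

With the frame sorted by `A`, the first 1k rows should all sit in the lower tail of `A`, while the sampled rows should have a mean and range close to those of the full dataset.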

Does this PR introduce any user-facing change?

yes

How was this patch tested?

CI and manual tests.

Was this patch authored or co-authored using generative AI tooling?

no

dongjoon-hyun

+1, LGTM.

dongjoon-hyun · 2024-09-19T19:32:07Z

Merged to master. Thank you all.

xinrong-meng · 2024-09-20T00:38:19Z

Late LGTM, thank you!

### What changes were proposed in this pull request?

Switch `Scatter` plot to sampled data

### Why are the changes needed?

when the data distribution has relationship with the order, the first n rows will not be representative of the whole dataset

for example:

```
import pandas as pd
import numpy as np
import pyspark.pandas as ps

# ps.set_option("plotting.max_rows", 10000)
np.random.seed(123)

pdf = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD')).sort_values("A")
psdf = ps.DataFrame(pdf)

psdf.plot.scatter(x='B', y='A')
```

all 10k datapoints:
![image](https://github.com/user-attachments/assets/72cf7e97-ad10-41e0-a8a6-351747d5285f)

before (first 1k datapoints):
![image](https://github.com/user-attachments/assets/1ed50d2c-7772-4579-a84c-6062542d9367)

after (sampled 1k datapoints):
![image](https://github.com/user-attachments/assets/6c684cba-4119-4c38-8228-2bedcdeb9e59)

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

ci and manually test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#48164 from zhengruifeng/ps_scatter_sampling.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

fix

fa1ce1e

github-actions bot added PYTHON PANDAS API ON SPARK labels Sep 19, 2024

zhengruifeng requested review from HyukjinKwon and xinrong-meng September 19, 2024 09:23

HyukjinKwon approved these changes Sep 19, 2024

View reviewed changes

dongjoon-hyun approved these changes Sep 19, 2024

View reviewed changes

dongjoon-hyun closed this in 6d1815e Sep 19, 2024

zhengruifeng deleted the ps_scatter_sampling branch September 20, 2024 00:17
