zhengruifeng (Contributor) commented Sep 19, 2024

What changes were proposed in this pull request?

Switch the Scatter plot from the first n rows to sampled data.

Why are the changes needed?

When the data distribution is correlated with row order, the first n rows are not representative of the whole dataset.

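For example:

```python
import pandas as pd
import numpy as np
import pyspark.pandas as ps

# Uncomment to raise the plotting row limit so that all 10,000 rows are drawn.
# ps.set_option("plotting.max_rows", 10000)
np.random.seed(123)

# Sort by "A" so that row order is correlated with the values being plotted.
pdf = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD')).sort_values("A")
psdf = ps.DataFrame(pdf)

psdf.plot.scatter(x='B', y='A')
```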

all 10k datapoints:
image

before (first 1k datapoints):
image

after (sampled 1k datapoints):
image
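
The difference between the "before" and "after" plots can also be checked numerically. The following is a minimal, pure-Python sketch of the idea, not the sampling code used by pandas API on Spark: it uses the standard library's `random.sample` as a stand-in for the sampling step, and the variable names (`first_n`, `sampled`) are illustrative only.

```python
import random
import statistics

random.seed(123)

# 10,000 standard-normal values, sorted ascending to mimic sort_values("A").
values = sorted(random.gauss(0.0, 1.0) for _ in range(10_000))

n = 1_000
first_n = values[:n]                # "before": the first n rows of the sorted data
sampled = random.sample(values, n)  # "after": n rows drawn uniformly at random

full_mean = statistics.fmean(values)
first_mean = statistics.fmean(first_n)
sampled_mean = statistics.fmean(sampled)

# The first n rows cover only the lower tail; the sample tracks the full data.
print(f"all rows     mean={full_mean:+.3f}")
print(f"first {n}   mean={first_mean:+.3f}  max={max(first_n):+.3f}")
print(f"sampled {n} mean={sampled_mean:+.3f}")
```

Because the data is sorted, the first 1,000 values all come from the lower tail of the distribution, whereas the random sample's mean stays close to that of the full dataset.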

Does this PR introduce any user-facing change?

Yes. Scatter plots now draw sampled rows rather than the first n rows.

How was this patch tested?

CI and manual testing.

Was this patch authored or co-authored using generative AI tooling?

No.

dongjoon-hyun (Member) left a comment

+1, LGTM.

dongjoon-hyun (Member) commented

Merged to master. Thank you all.

zhengruifeng deleted the ps_scatter_sampling branch September 20, 2024 00:17

xinrong-meng (Member) commented

Late LGTM, thank you!

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024

### What changes were proposed in this pull request?
Switch the `Scatter` plot from the first n rows to sampled data.

### Why are the changes needed?
When the data distribution is correlated with row order, the first n rows are not representative of the whole dataset.

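For example:
```python
import pandas as pd
import numpy as np
import pyspark.pandas as ps

# Uncomment to raise the plotting row limit so that all 10,000 rows are drawn.
# ps.set_option("plotting.max_rows", 10000)
np.random.seed(123)

# Sort by "A" so that row order is correlated with the values being plotted.
pdf = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD')).sort_values("A")
psdf = ps.DataFrame(pdf)

psdf.plot.scatter(x='B', y='A')
```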

all 10k datapoints:
![image](https://github.com/user-attachments/assets/72cf7e97-ad10-41e0-a8a6-351747d5285f)

before (first 1k datapoints):
![image](https://github.com/user-attachments/assets/1ed50d2c-7772-4579-a84c-6062542d9367)

after (sampled 1k datapoints):
![image](https://github.com/user-attachments/assets/6c684cba-4119-4c38-8228-2bedcdeb9e59)

### Does this PR introduce _any_ user-facing change?
Yes. Scatter plots now draw sampled rows rather than the first n rows.

### How was this patch tested?
CI and manual testing.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48164 from zhengruifeng/ps_scatter_sampling.

Authored-by: Ruifeng Zheng
Signed-off-by: Dongjoon Hyun
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024