@xinrong-meng xinrong-meng commented Oct 14, 2024

What changes were proposed in this pull request?

Support box plots with the plotly backend on both Spark Connect and Spark Classic.

Why are the changes needed?

While Pandas on Spark supports plotting, PySpark currently lacks this feature. The proposed API will enable users to generate visualizations. This will provide users with an intuitive, interactive way to explore and understand large datasets directly from PySpark DataFrames, streamlining the data analysis workflow in distributed environments.

For more details, see the PySpark Plotting API Specification (in progress).

Part of https://issues.apache.org/jira/browse/SPARK-49530.

Does this PR introduce any user-facing change?

Yes. Box plots are supported as shown below.

>>> data = [
...     ("A", 50, 55),
...     ("B", 55, 60),
...     ("C", 60, 65),
...     ("D", 65, 70),
...     ("E", 70, 75),
...     # outliers
...     ("F", 10, 15),
...     ("G", 85, 90),
...     ("H", 5, 150),
... ]
>>> columns = ["student", "math_score", "english_score"]
>>> sdf = spark.createDataFrame(data, columns)
>>> fig1 = sdf.plot.box(column=["math_score", "english_score"])
>>> fig1.show()  # see below
>>> fig2 = sdf.plot(kind="box", column="math_score")
>>> fig2.show()  # see below

fig1: [image: box plots of math_score and english_score]

fig2: [image: box plot of math_score]
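
For readers unfamiliar with what a box plot summarizes, here is a minimal pure-Python sketch of the conventional Tukey statistics (quartiles, 1.5 × IQR fences, whiskers, outliers) applied to the `english_score` values from the example above. The function `box_stats` and its `whis` parameter are hypothetical names for illustration only. This is not the code in this PR, and the PR's implementation may compute quantiles differently (for example, approximately on distributed data).

```python
from statistics import quantiles


def box_stats(values, whis=1.5):
    """Summarize values the way a conventional (Tukey) box plot does.

    Hypothetical helper for illustration; not part of PySpark.
    """
    data = sorted(values)
    # Quartiles via linear interpolation. Other quantile methods
    # (or approximate quantiles) give slightly different numbers.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Whiskers extend to the most extreme points still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    # Anything beyond the fences is drawn as an individual outlier point.
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# english_score column from the example data above
english_scores = [55, 60, 65, 70, 75, 15, 90, 150]
stats = box_stats(english_scores)
print(stats)
```

Under these assumptions, the values 15 and 150 fall outside the fences and would be drawn as outlier points, while the whiskers stop at 55 and 90.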

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

return sdf_result.first()


def _invoke_internal_function_over_columns(name: str, *cols: "ColumnOrName") -> Column:
Contributor

TODO for myself: find a proper place for this helper function so it can be used by both PySpark DataFrames and pandas API on Spark (ps).

Member Author

Thank you!

Contributor

@zhengruifeng zhengruifeng left a comment

LGTM pending tests

@xinrong-meng xinrong-meng changed the title from "[WIP][SPARK-49929][PYTHON][CONNECT] Support box plots" to "[SPARK-49929][PYTHON][CONNECT] Support box plots" Oct 14, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review October 14, 2024 06:27
@xinrong-meng
Member Author

Merged to master, thank you!
