[SPARK-48756][CONNECT][PYTHON] Support for df.debug() in Connect Mode
#47153
Conversation
```
===========
Spark Connect - Execution Info and Debug
===========
```
The `===========` lines should match the title's length, otherwise Sphinx warns and complains about it.
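For example, the over- and underline must be at least as long as the title text:

```rst
=========================================
Spark Connect - Execution Info and Debug
=========================================
```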
Done
```
the usage. In addition, it makes sure that the captured metrics are properly collected
as part of the execution info.
.. versionadded:: 4.0.0
```
Suggested change:

```
.. versionadded:: 4.0.0
```

Otherwise the HTML output is malformed.
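A minimal sketch of the likely intent (an assumption, since the suggestion's visible text is unchanged): reST directives must be separated from the preceding paragraph by a blank line, or Sphinx renders them incorrectly:

```rst
as part of the execution info.

.. versionadded:: 4.0.0
```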
```diff
-from pyspark.errors import PySparkValueError
+from pyspark.errors import PySparkValueError, PySparkTypeError
 from pyspark.sql import Observation, Column
+
+
```
Per PEP 8.
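Presumably this refers to PEP 8's two blank lines around top-level definitions (an assumption; the quoted hunk only shows blank lines added after the imports):

```python
from pyspark.errors import PySparkValueError, PySparkTypeError
from pyspark.sql import Observation, Column


# Two blank lines separate the imports from the first top-level definition.
class DataDebugOp:  # name taken from this PR's review thread
    ...
```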
```python
    @classmethod
    def count_values(cls) -> "DataDebugOp":
        return DataDebugOp("count_values", F.count(F.lit(1)).alias("count_values"))
```
So this is a wrapper around the observe API. I think it does not simplify a lot vs. the existing usage:

```python
from pyspark.sql import Observation
from pyspark.sql.functions import col, count, lit, max

observation = Observation("my metrics")
observed_df = df.observe(observation, count(lit(1)).alias("count"), max(col("age")))
observed_df.count()  # metrics become available once an action runs
observation.get
```

and this won't work for streaming.
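For context on the streaming caveat: in Structured Streaming, observed metrics surface through a query listener rather than `Observation.get`. A minimal sketch using existing PySpark APIs (the observation name "my metrics" is carried over from the example above):

```python
from pyspark.sql.streaming import StreamingQueryListener

class MyListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # observedMetrics maps observation name -> Row of metric values
        row = event.progress.observedMetrics.get("my metrics")
        if row is not None:
            print(row)

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(MyListener())
```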
itholic left a comment
Good catch. Let me address `_capture_call_site` as well.
```python
        self._execution_info.setObservations(self._plan.observations)
        return self._execution_info

    def debug(self, *other: List["DataDebugOp"]) -> "DataFrame":
```
If the usage is:

```python
spark.range(100).debug(DataDebugOp.max_value("id"), DataDebugOp.count_null_values("id"))
```

instead of:

```python
spark.range(100).debug([DataDebugOp.max_value("id"), DataDebugOp.count_null_values("id")])
```

then the signature should be:

```diff
-    def debug(self, *other: List["DataDebugOp"]) -> "DataFrame":
+    def debug(self, *other: "DataDebugOp") -> "DataFrame":
```
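For reference, PEP 484 specifies that the annotation on a `*args` parameter names the type of each individual argument, so inside the body the parameter is a tuple of that type:

```python
class DataFrame:
    def debug(self, *other: "DataDebugOp") -> "DataFrame":
        # `other` is typed as Tuple[DataDebugOp, ...]; each positional
        # argument must be a single DataDebugOp, not a list of them.
        ...
```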
| """ | ||
| ... | ||
|
|
||
| def debug(self) -> "DataFrame": |
The signatures should all be the same:

```diff
-    def debug(self) -> "DataFrame":
+    def debug(self, *other: "DataDebugOp") -> "DataFrame":
```
```python
            message_parameters={"member": "queryExecution"},
        )

    def debug(self) -> "DataFrame":
```
Ditto.

```diff
-    def debug(self) -> "DataFrame":
+    def debug(self, *other: "DataDebugOp") -> "DataFrame":
```
```python
        data debug operations.
        """

    @classmethod
```
nit: `@staticmethod` if `cls` is not used?
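A sketch of the nit applied to the earlier `count_values` helper (assuming `cls` is indeed unused in the body):

```python
@staticmethod
def count_values() -> "DataDebugOp":
    # No cls parameter needed since the class is referenced by name.
    return DataDebugOp("count_values", F.count(F.lit(1)).alias("count_values"))
```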
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
At times users want to evaluate the properties of their data flow graph and understand how certain transformations behave. Today this is more complex than necessary. Even though the `df.observe()` API has been around since Spark 3.3, its usage is not widespread.

To give users a more visible API for understanding the data flow execution in Spark, this patch adds a new method to the DataFrame API called `df.debug()`. By default, debug will do the following:

- create a new observation named `debug:<uuid>`
- attach a `count(1)` observation to it

After the execution, users can access the observation using the execution info property of the DataFrame.
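A hedged usage sketch, combining the default behavior with the helper ops that appear in the review thread; the `executionInfo` accessor name is an assumption based on the "execution info property" mentioned above:

```python
# Sketch only: DataDebugOp helpers are taken from this PR's review thread,
# and `executionInfo` is an assumed name for the execution info property.
df = spark.range(100).debug(
    DataDebugOp.max_value("id"),
    DataDebugOp.count_null_values("id"),
)
df.collect()  # observations are captured when the plan executes
print(df.executionInfo)  # includes the default debug:<uuid> count(1) observation
```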
The debug string contains the reference to the observation, the call site, and the observed values.

In addition to the count, we have defined several additional debug observations that can be easily injected.
Produces the following output:
Why are the changes needed?
User support.
Does this PR introduce any user-facing change?
Yes, it adds a new method, `df.debug()`.
How was this patch tested?
New unit tests.
Was this patch authored or co-authored using generative AI tooling?
No