Skip to content

Conversation

@soxofaan
Copy link
Contributor

What changes were proposed in this pull request?

See https://issues.apache.org/jira/browse/SPARK-41989 for in depth explanation

Short summary: pyspark/pandas/__init__.py uses, at import time, logging.warning() which might silently call logging.basicConfig().
So by importing pyspark.pandas (directly or indirectly) a user might unknowingly break their own logging setup (e.g. when based on logging.basicConfig() or related). logging.getLogger(...).warning() does not trigger this behavior.

Does this PR introduce any user-facing change?

User-defined logging setups will be more predictable.

How was this patch tested?

Manual testing so far.
I'm not sure it's worthwhile to cover this with a unit test

@HyukjinKwon
Copy link
Member

Thanks for the fix @soxofaan. I think it was a mistake, cc @itholic

@soxofaan soxofaan force-pushed the SPARK-41989-pyspark-pandas-logging-setup branch from 25aab03 to 242ff01 Compare January 12, 2023 08:54
@HyukjinKwon
Copy link
Member

I don't think there's any test for this, and the linter passed. Therefore, merging it now.

Merged to master.

@HyukjinKwon
Copy link
Member

I merged it to branch-3.3 and branch-3.2 too.

HyukjinKwon pushed a commit that referenced this pull request Jan 12, 2023
See https://issues.apache.org/jira/browse/SPARK-41989 for in depth explanation

Short summary: `pyspark/pandas/__init__.py` uses, at import time,  `logging.warning()`  which might silently call `logging.basicConfig()`.
So by importing `pyspark.pandas` (directly or indirectly) a user might unknowingly break their own logging setup (e.g. when based on  `logging.basicConfig()` or related). `logging.getLogger(...).warning()`  does not trigger this behavior.

User-defined logging setups will be more predictable.

Manual testing so far.
I'm not sure it's worthwhile to cover this with a unit test

Closes #39516 from soxofaan/SPARK-41989-pyspark-pandas-logging-setup.

Authored-by: Stefaan Lippens <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 04836ba)
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Jan 12, 2023
See https://issues.apache.org/jira/browse/SPARK-41989 for in depth explanation

Short summary: `pyspark/pandas/__init__.py` uses, at import time,  `logging.warning()`  which might silently call `logging.basicConfig()`.
So by importing `pyspark.pandas` (directly or indirectly) a user might unknowingly break their own logging setup (e.g. when based on  `logging.basicConfig()` or related). `logging.getLogger(...).warning()`  does not trigger this behavior.

User-defined logging setups will be more predictable.

Manual testing so far.
I'm not sure it's worthwhile to cover this with a unit test

Closes #39516 from soxofaan/SPARK-41989-pyspark-pandas-logging-setup.

Authored-by: Stefaan Lippens <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 04836ba)
Signed-off-by: Hyukjin Kwon <[email protected]>
vicennial pushed a commit to vicennial/spark that referenced this pull request Jan 17, 2023
### What changes were proposed in this pull request?

See https://issues.apache.org/jira/browse/SPARK-41989 for in depth explanation

Short summary: `pyspark/pandas/__init__.py` uses, at import time,  `logging.warning()`  which might silently call `logging.basicConfig()`.
So by importing `pyspark.pandas` (directly or indirectly) a user might unknowingly break their own logging setup (e.g. when based on  `logging.basicConfig()` or related). `logging.getLogger(...).warning()`  does not trigger this behavior.

### Does this PR introduce _any_ user-facing change?

User-defined logging setups will be more predictable.

### How was this patch tested?

Manual testing so far.
I'm not sure it's worthwhile to cover this with a unit test

Closes apache#39516 from soxofaan/SPARK-41989-pyspark-pandas-logging-setup.

Authored-by: Stefaan Lippens <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @soxofaan and @HyukjinKwon .
This seems to cause a production down because some applications already rely on this import logging. After this patch, PySpark jobs fail.

@dongjoon-hyun
Copy link
Member

FYI, cc @viirya , @huaxingao , @kazuyukitanimura , @sunchao

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jan 31, 2023

Hmmm ... the change decouples logging from PySpark in order to do nothing with other logging systems, meaning that it removes the relation with the existing logging system. So I wonder how this patch affects things.

If it matters, we can try with the original fix proposed (https://github.com/apache/spark/compare/25aab030e562153dbf7d11e8d2dadd8fd10ecc50..242ff01536c6e8307224256faa5e3d197e780da9).

For a bit of context:

  • PySpark haven't got its own logging system (for released versions). This was only place that logging was used in the released versions.
  • In master branch, now for ML Torch integration and Spark Connect, we have them.

So, it would be helpful to know how this patch affected some PySpark jobs so we can properly fix them in master branch and other branches too.

@soxofaan
Copy link
Contributor Author

soxofaan commented Feb 2, 2023

@dongjoon-hyun

because some applications already rely on this import logging. After this patch, PySpark jobs fail.

Do you really mean that removing import logging from pyspark/pandas/__init__.py is the cause of breakage? So you have code that does from pyspark.pandas import logging instead of just import logging? Or maybe you have from pyspark.pandas import * and your implementation assumes this also imports logging?

Or do you mean that the removal of the logging.warn() call (and it's hidden logging.basicConfig side-effect) is causing breakage?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Feb 2, 2023

Thank you for the replies, @HyukjinKwon and @soxofaan . Please forget about my previous comments. After diggiing more about the PySpark 3.2 apps, it turns out that it was like a false alarm. Initially, there was a concern about the side-effect of the removed import statements, but this fix is a correct one and the affected applications should fix their python scripts if they are affected. Sorry about the false alarm.

@HyukjinKwon
Copy link
Member

Thanks for confirming!

sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
See https://issues.apache.org/jira/browse/SPARK-41989 for in depth explanation

Short summary: `pyspark/pandas/__init__.py` uses, at import time,  `logging.warning()`  which might silently call `logging.basicConfig()`.
So by importing `pyspark.pandas` (directly or indirectly) a user might unknowingly break their own logging setup (e.g. when based on  `logging.basicConfig()` or related). `logging.getLogger(...).warning()`  does not trigger this behavior.

User-defined logging setups will be more predictable.

Manual testing so far.
I'm not sure it's worthwhile to cover this with a unit test

Closes apache#39516 from soxofaan/SPARK-41989-pyspark-pandas-logging-setup.

Authored-by: Stefaan Lippens <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 04836ba)
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants