[SPARK-29160][CORE] Use UTF-8 explicitly for reading/writing event log file #25845
Conversation
| As I commented in  | 
| We might be able to remedy the backward-incompatible change by adding a new option that lets ReplayListenerBus use the default character set to read the file, though I'm not 100% sure it's a good workaround. | 
Looks good but let me leave it to @vanzin
| @HeartSaVioR, if you're concerned about compatibility, you could leave a note in the migration guide at https://github.com/apache/spark/blob/master/docs/core-migration-guide.md . I guess most machines use UTF-8 by default though. | 
| +1 for the migration note. cc @gatorsmile . | 
| +1 to document it in the migration note. | 
| Should we provide a config for this? If we only state this change in the migration guide, I'm not sure how end users can react if they want to use the previous character set. Or is this so rare that we can ignore it? | 
| Test build #110968 has finished for PR 25845 at commit  | 
| There are three ways to deal with this: 
 I guess Spark hasn't provided an additional tool to migrate existing logs, so 1) doesn't sound like the preferred one. If we're concerned about backward compatibility, 2) seems to be the only option; the remaining one is to allow a custom character set or to use the default charset. | 
| Btw, I found other spots creating a PrintWriter without an explicit character set. I haven't touched them, as I'd like to restrict/minimize the effects of the change. See the sketch below for the difference being discussed. | 
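For context, the difference the comment points at looks roughly like the following. This is a minimal sketch, not code from the PR; the file name is made up for illustration.

```scala
import java.io.{FileOutputStream, OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets

// Relies on the JVM default charset (file.encoding), which varies per machine.
val implicitCharsetWriter = new PrintWriter("application.log")

// Pins the output to UTF-8 regardless of the JVM default.
val explicitCharsetWriter = new PrintWriter(
  new OutputStreamWriter(new FileOutputStream("application.log"), StandardCharsets.UTF_8))

explicitCharsetWriter.println("event written as UTF-8")
explicitCharsetWriter.close()
implicitCharsetWriter.close()
```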
| The build failure seems to be a flaky one:  | 
```diff
     maybeTruncated: Boolean = false,
     eventsFilter: ReplayEventsFilter = SELECT_ALL_FILTER): Unit = {
-    val lines = Source.fromInputStream(logData).getLines()
+    val lines = Source.fromInputStream(logData, StandardCharsets.UTF_8.name()).getLines()
```
You could avoid converting the charset to string and looking it up again by:
```scala
val lines = Source.fromInputStream(logData)(Codec.UTF8).getLines()
```
Nice suggestion! Will address.
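For readers following along, the two forms differ only in how the charset is supplied to Source. This is an illustrative sketch with a made-up input stream, not code from the PR.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import scala.io.{Codec, Source}

val bytes = "a single event log line".getBytes(StandardCharsets.UTF_8)

// Original change: pass the charset by name, which Source looks up again internally.
val byName =
  Source.fromInputStream(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8.name()).getLines()

// Suggested form: pass a Codec directly, avoiding the name lookup round-trip.
val byCodec =
  Source.fromInputStream(new ByteArrayInputStream(bytes))(Codec.UTF8).getLines()

assert(byName.toList == byCodec.toList) // both decode the stream as UTF-8
```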
| retest this please | 
| Test build #110979 has finished for PR 25845 at commit  | 
| Build failure: known flaky test - https://issues.apache.org/jira/browse/SPARK-25903 | 
| retest this, please | 
Looks good in principle
| Test build #111001 has finished for PR 25845 at commit  | 
| Uh, shall we talk about which option is the preferred one? That needs to be decided before describing it in the migration note. | 
| I think we can just leave it without other options. It's rather a corner case, and I think it's fine to break such things since we're moving to Spark 3. | 
| Ok for me. | 
| Great :) Please let me know if someone still wants to provide a backward-compatible option. (It can even be addressed separately.) | 
| For me, the only remaining thing looks like a note in the migration guide to give users prior warning, isn't it? It would be great if this PR included that. | 
| Yeah, I thought someone might say it would be worth providing a backward-compatible option, so I waited a couple of days to hear more voices, but it doesn't look like it. I'll mention the change in the migration note. Thanks for reminding me! | 
| Ya. I agree with you. | 
+1, LGTM.
| Test build #111106 has finished for PR 25845 at commit  | 
nit: 'Spark History Server' -> 'Spark History Server (SHS)'
Is SHS commonly used in the community?
| Just rephrased SHS to Spark History Server, as I guess it's the official term and more commonly used. I guess SHS is also a widely used term, but that might be the view of a contributor rather than an end user. | 
| Test build #111113 has finished for PR 25845 at commit  | 
| Merged to master. | 
| Thanks all for reviewing and merging! | 
What changes were proposed in this pull request?
Credit to @vanzin, who found and commented on this while reviewing #25670 (comment).
This patch proposes to specify UTF-8 explicitly when reading/writing the event log file.
Why are the changes needed?
The event log file is currently read/written using the default character set of the JVM process, which opens the chance of problems when reading event log files produced on other machines. Spark's de facto standard character set is UTF-8, so it should be set explicitly.
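To make the failure mode concrete, here is a minimal sketch (the sample string is invented, not a real event log record): a non-ASCII character written under one machine's default charset does not round-trip when read back under a different default.

```scala
import java.nio.charset.{Charset, StandardCharsets}

val eventLine = """{"Event":"SparkListenerJobStart","Description":"étape"}"""

// Written on a machine whose default charset is ISO-8859-1 ...
val writtenBytes = eventLine.getBytes(Charset.forName("ISO-8859-1"))

// ... and read back on a machine whose default charset is UTF-8.
val readBack = new String(writtenBytes, StandardCharsets.UTF_8)

println(readBack == eventLine) // false: the accented character is corrupted
```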
Does this PR introduce any user-facing change?
Yes, if end users have been running Spark processes with a default charset other than UTF-8, especially in their driver JVM processes. No otherwise.
How was this patch tested?
Existing UTs, as ReplayListenerSuite contains "end-to-end" event logging/reading tests (both uncompressed/compressed).