[SPARK-29160][CORE] Use UTF-8 explicitly for reading/writing event log file #25845
Conversation
| As I commented in  | 
| We might be able to remedy the backward-incompatible change by adding a new option that lets ReplayListenerBus use the default character set to read the file, though I'm not 100% sure it's a good workaround. | 
Looks good but let me leave it to @vanzin
| @HeartSaVioR, if you're concerned about compatibility, you could leave a note in the migration guide at https://github.com/apache/spark/blob/master/docs/core-migration-guide.md . I guess most machines use UTF-8 by default though. | 
| +1 for the migration note. cc @gatorsmile . | 
| +1 to document it in the migration note. | 
| Should we provide a config for this? If we only state this change in the migration guide, I'm not sure how end users can react if they want to use the previous character set. Or is this so rare that we can ignore it? | 
| Test build #110968 has finished for PR 25845 at commit  | 
| There are three ways to deal with this: 
 I guess Spark hasn't provided an additional tool to migrate existing logs, so 1) doesn't sound like the preferred one. If we're concerned about backward compatibility, 2) seems to be the only option; the remaining one is to allow a custom character set or to use the default charset. | 
| Btw, I found other spots creating a PrintWriter without an explicit character set. I haven't touched them, as I'd like to restrict/minimize the effects of the change. See the sketch below for the difference being discussed. | 
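For context, the difference the comment points at looks roughly like the following. This is a minimal sketch, not code from the PR; the file name is made up for illustration.

```scala
import java.io.{FileOutputStream, OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets

// Relies on the JVM default charset (file.encoding), which varies per machine.
val implicitCharsetWriter = new PrintWriter("application.log")

// Pins the output to UTF-8 regardless of the JVM default.
val explicitCharsetWriter = new PrintWriter(
  new OutputStreamWriter(new FileOutputStream("application.log"), StandardCharsets.UTF_8))

explicitCharsetWriter.println("event written as UTF-8")
explicitCharsetWriter.close()
implicitCharsetWriter.close()
```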
| The build failure seems to be a flaky one:  | 
```diff
     maybeTruncated: Boolean = false,
     eventsFilter: ReplayEventsFilter = SELECT_ALL_FILTER): Unit = {
-    val lines = Source.fromInputStream(logData).getLines()
+    val lines = Source.fromInputStream(logData, StandardCharsets.UTF_8.name()).getLines()
```
You could avoid converting the charset to string and looking it up again by:
```scala
val lines = Source.fromInputStream(logData)(Codec.UTF8).getLines()
```
Nice suggestion! Will address.
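For readers following along, the two forms differ only in how the charset is supplied to Source. This is an illustrative sketch with a made-up input stream, not code from the PR.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import scala.io.{Codec, Source}

val bytes = "a single event log line".getBytes(StandardCharsets.UTF_8)

// Original change: pass the charset by name, which Source looks up again internally.
val byName =
  Source.fromInputStream(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8.name()).getLines()

// Suggested form: pass a Codec directly, avoiding the name lookup round-trip.
val byCodec =
  Source.fromInputStream(new ByteArrayInputStream(bytes))(Codec.UTF8).getLines()

assert(byName.toList == byCodec.toList) // both decode the stream as UTF-8
```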
| retest this please | 
| Test build #110979 has finished for PR 25845 at commit  | 
| Build failure: known flaky test - https://issues.apache.org/jira/browse/SPARK-25903 | 
| retest this, please | 
Looks good in principle
| Test build #111001 has finished for PR 25845 at commit  | 
| Uh, shall we talk about which option is the preferred one? That needs to be decided before describing it in the migration note. | 
| I think we can just leave it without other options. It's rather a corner case, and I think it's fine to break such things since we're moving to Spark 3. | 
| Ok for me. | 
| Great :) Please let me know if someone still wants to provide a backward-compatible option. (It can even be addressed separately.) | 
| For me, the only remaining thing looks like a note in the migration guide to give users prior warning, isn't it? It would be great if this PR included that. | 
| Yeah, I thought someone might say it would be worth providing a backward-compatible option, so I waited a couple of days to hear more voices, but it doesn't look like it. I'll mention the change in the migration note. Thanks for reminding me! | 
| Ya. I agree with you. | 
+1, LGTM.
| Test build #111106 has finished for PR 25845 at commit  | 
nit: 'Spark History Server' -> 'Spark History Server (SHS)'
Is SHS commonly used in the community?
| Just rephrased SHS to Spark History Server, as I guess it's the official term and more commonly used. I guess SHS is also a widely used term, but that might be the view of a contributor rather than an end user. | 
| Test build #111113 has finished for PR 25845 at commit  | 
| Merged to master. | 
| Thanks all for reviewing and merging! | 
What changes were proposed in this pull request?
Credit to @vanzin, who found and commented on this while reviewing #25670 (comment).
This patch proposes to specify UTF-8 explicitly when reading/writing the event log file.
Why are the changes needed?
The event log file is currently read/written using the default character set of the JVM process, which opens the chance of problems when reading event log files produced on other machines. Spark's de facto standard character set is UTF-8, so it should be set explicitly.
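To make the failure mode concrete, here is a minimal sketch (the sample string is invented, not a real event log record): a non-ASCII character written under one machine's default charset does not round-trip when read back under a different default.

```scala
import java.nio.charset.{Charset, StandardCharsets}

val eventLine = """{"Event":"SparkListenerJobStart","Description":"étape"}"""

// Written on a machine whose default charset is ISO-8859-1 ...
val writtenBytes = eventLine.getBytes(Charset.forName("ISO-8859-1"))

// ... and read back on a machine whose default charset is UTF-8.
val readBack = new String(writtenBytes, StandardCharsets.UTF_8)

println(readBack == eventLine) // false: the accented character is corrupted
```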
Does this PR introduce any user-facing change?
Yes, if end users have been running Spark processes with a default charset other than UTF-8, especially in their driver JVM processes. No otherwise.
How was this patch tested?
Existing UTs, as ReplayListenerSuite contains "end-to-end" event logging/reading tests (both uncompressed/compressed).