
Conversation

@HeartSaVioR (Contributor) commented Sep 19, 2019

What changes were proposed in this pull request?

Credit to @vanzin, who found and commented on this while reviewing #25670 (see his review comment there).

This patch proposes to specify UTF-8 explicitly while reading/writing event log files.

Why are the changes needed?

The event log file is currently read/written using the default character set of the JVM process, which can cause problems when reading event log files produced on other machines. Spark's de facto standard character set is UTF-8, so it should be set explicitly.
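
As an illustration of the failure mode (a minimal sketch, not the actual Spark code; the JSON line and the ISO-8859-1 charset are just examples), the same UTF-8 bytes decode differently once the reading JVM's default charset is not UTF-8:

```scala
import java.nio.charset.{Charset, StandardCharsets}

// A JSON event line containing a non-ASCII character, as an event log might.
val event = """{"Event":"SparkListenerJobStart","Job Name":"étape"}"""
val utf8Bytes = event.getBytes(StandardCharsets.UTF_8)

// Decoding with the charset the bytes were written in round-trips cleanly...
val decodedAsUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8)
// ...while a JVM whose default charset is e.g. ISO-8859-1 produces mojibake.
val decodedAsLatin1 = new String(utf8Bytes, Charset.forName("ISO-8859-1"))

println(decodedAsUtf8)   // ... "Job Name":"étape"
println(decodedAsLatin1) // ... "Job Name":"Ã©tape"
```

Pinning both the writer and the reader to UTF-8 removes the dependence on whichever JVM happens to run and its file.encoding.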

Does this PR introduce any user-facing change?

Yes, if end users have been running Spark processes (especially their driver JVMs) with a default charset other than UTF-8. No otherwise.

How was this patch tested?

Existing UTs; ReplayListenerSuite contains end-to-end event logging/reading tests (both uncompressed and compressed).

@HeartSaVioR (Contributor Author)

As I noted in the "Does this PR introduce any user-facing change?" section, this may not be a backward compatible change for some users. Please take this into consideration. Thanks!

@HeartSaVioR (Contributor Author)

We might be able to remedy the backward-incompatible change by adding a new option that lets ReplayListenerBus read the file with the default character set, though I'm not 100% sure it's a good workaround.

@HyukjinKwon (Member) left a comment

Looks good but let me leave it to @vanzin

@HyukjinKwon (Member)

@HeartSaVioR, if you're concerned about compatibility, you could leave a note in the migration guide at https://github.com/apache/spark/blob/master/docs/core-migration-guide.md. I guess most machines use UTF-8 by default, though.

@dongjoon-hyun (Member)

+1 for the migration note. cc @gatorsmile.
Also, cc @MaxGekk

@gatorsmile (Member)

+1 to document it in the migration note.

@viirya (Member) commented Sep 19, 2019

Should we provide a config for this? I'm not sure that just stating this change in the migration guide is enough: how can end users react if they want to keep the previous character set? Or is this rare enough that we can ignore it?

@SparkQA commented Sep 19, 2019

Test build #110968 has finished for PR 25845 at commit 71bf026.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor Author)

There are three ways to deal with this:

  1. provide a simple tool that reads an input file with a given character set and rewrites it to UTF-8
  2. expose a new config to allow a custom character set (or even just the default character set) in ReplayListenerBus
  3. don't deal with this

I guess Spark hasn't provided an additional migration tool for existing files before, so 1. doesn't sound like the preferred option (a rough sketch of what it could look like follows below). If we are concerned about backward compatibility, 2. seems to be the only option; the remaining question there is whether to allow a custom character set or just to use the default charset.
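
For reference, option 1. would amount to something like the following hypothetical re-encoding helper (not part of this PR; the function name, paths, and source charset are placeholders, and compressed logs would need to be decompressed first):

```scala
import java.nio.charset.{Charset, StandardCharsets}
import java.nio.file.{Files, Paths}

// Hypothetical helper sketching option 1: rewrite an existing event log from
// a known source charset into UTF-8 so the new reader can consume it.
// Compressed event logs would have to be decompressed before this step.
def reencodeToUtf8(src: String, dst: String, srcCharset: String): Unit = {
  val bytes = Files.readAllBytes(Paths.get(src))
  val text = new String(bytes, Charset.forName(srcCharset))
  Files.write(Paths.get(dst), text.getBytes(StandardCharsets.UTF_8))
}

// Example invocation with placeholder paths and charset.
reencodeToUtf8("/tmp/eventlog-old", "/tmp/eventlog-utf8", "ISO-8859-1")
```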

@HeartSaVioR (Contributor Author)

Btw, I found other spots creating PrintWriter without an explicit character set. I haven't touched them, as I'd like to restrict/minimize the effects of this change (a sketch of what such a change would look like follows below).
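
For illustration only (a sketch with a placeholder path, not the untouched call sites themselves), making such a PrintWriter explicit about its encoding looks like this:

```scala
import java.io.{FileOutputStream, OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets

// new PrintWriter(file) would silently use the JVM default charset;
// wrapping an OutputStreamWriter pins the output to UTF-8.
val out = new FileOutputStream("/tmp/some-output.txt")
val writer = new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))
writer.println("written as UTF-8 regardless of file.encoding")
writer.close()
```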

@HeartSaVioR (Contributor Author)

The build failure seems to be a flaky one: Build step 'Execute shell' marked build as failure

@MaxGekk (Member) commented Sep 19, 2019

UFT-8 -> UTF-8 in the description: "Spark's de facto standard character set is UFT-8, so it should be explicitly set to"

  maybeTruncated: Boolean = false,
  eventsFilter: ReplayEventsFilter = SELECT_ALL_FILTER): Unit = {
- val lines = Source.fromInputStream(logData).getLines()
+ val lines = Source.fromInputStream(logData, StandardCharsets.UTF_8.name()).getLines()
A Member left a comment:

You could avoid converting the charset to a string and looking it up again by:

val lines = Source.fromInputStream(logData)(Codec.UTF8).getLines()

The Contributor Author replied:

Nice suggestion! Will address.
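
For reference, a self-contained version of the suggested form (a sketch only: the imports are assumed, and a plain FileInputStream with a placeholder path stands in for the event log stream to make the snippet runnable):

```scala
import java.io.FileInputStream
import scala.io.{Codec, Source}

// Stand-in for the event log input stream handed to ReplayListenerBus.
val logData = new FileInputStream("/tmp/eventlog-demo.txt")

// Passing the Codec implicitly avoids converting the charset to a name
// and looking it up again.
val lines = Source.fromInputStream(logData)(Codec.UTF8).getLines()
lines.foreach(println)
logData.close()
```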

@HyukjinKwon (Member)

retest this please

@SparkQA commented Sep 19, 2019

Test build #110979 has finished for PR 25845 at commit 71bf026.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor Author)

Build failure: known flaky test - https://issues.apache.org/jira/browse/SPARK-25903

@HeartSaVioR (Contributor Author)

retest this, please

@srowen (Member) left a comment

Looks good in principle

@SparkQA commented Sep 19, 2019

Test build #111001 has finished for PR 25845 at commit caad54d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor Author) commented Sep 20, 2019

Uh, shall we talk about which option is preferred? That needs to be decided before describing it in the migration note.

@HyukjinKwon (Member)

I think we can just leave it without other options. It's a rather corner case, and I think it's fine to break such stuff since we're moving to Spark 3.

@viirya (Member) commented Sep 20, 2019

Ok for me.

@HeartSaVioR (Contributor Author)

Great :) Please let me know if someone still wants to provide a backward compatible option. (It can even be addressed separately.)

@dongjoon-hyun (Member) commented Sep 21, 2019

For me, the only remaining thing looks like a note in the migration guide to give users prior warning, isn't it? It would be great if this PR included that.

@HeartSaVioR (Contributor Author) commented Sep 21, 2019

Yeah, I thought someone might say it would be worth providing a backward-compatible option, so I waited a couple of days to hear more voices, but it doesn't look like it. I'll mention the change in the migration note. Thanks for the reminder!

@dongjoon-hyun (Member)

Ya. I agree with you.

@dongjoon-hyun (Member) left a comment

+1, LGTM.

@SparkQA commented Sep 21, 2019

Test build #111106 has finished for PR 25845 at commit c39b06f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member) left a comment

nit: 'Spark History Server' -> 'Spark History Server (SHS)'

Is SHS commonly used in the community?

@HeartSaVioR (Contributor Author)

I just rephrased SHS to Spark History Server, as I guess that's the official term and more commonly used. SHS is probably also a widely used term, but that may be the view of contributors rather than end users.

@SparkQA commented Sep 21, 2019

Test build #111113 has finished for PR 25845 at commit 94558a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

Merged to master.

@HeartSaVioR (Contributor Author)

Thanks all for reviewing and merging!
