
Conversation


@gaborgsomogyi commented May 16, 2019

What changes were proposed in this pull request?

Kafka parameters are logged in several places, and the following parameters have to be redacted:

  • Delegation token
  • ssl.truststore.password
  • ssl.keystore.password
  • ssl.key.password

This PR contains:

  • Spark's central redaction framework used to redact passwords (spark.redaction.regex); a sketch of the combined approach follows this list
  • Custom redaction added to handle sasl.jaas.config (delegation token)
  • Redaction code added to consumer/producer code
  • Test refactor
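
For reference, here is a minimal sketch (an assumed shape, not the exact PR code) of how the two redaction paths fit together: sasl.jaas.config gets custom handling because its secret lives inside the value, while every other parameter goes through Utils.redact with the spark.redaction.regex pattern:

```scala
import org.apache.kafka.common.config.SaslConfigs
import org.apache.spark.SparkEnv
import org.apache.spark.internal.config.SECRET_REDACTION_PATTERN
import org.apache.spark.util.Utils

// Sketch only; assumes it lives inside the org.apache.spark package tree,
// since SECRET_REDACTION_PATTERN is private[spark].
def redactParams(params: Map[String, Object]): Map[String, String] = {
  val redactionPattern = Some(SparkEnv.get.conf.get(SECRET_REDACTION_PATTERN))
  params.map { case (key, value) =>
    if (key.equalsIgnoreCase(SaslConfigs.SASL_JAAS_CONFIG)) {
      // The JAAS config carries key=value pairs inside its value, so only
      // the password portion is masked to keep the rest readable.
      (key, value.asInstanceOf[String]
        .replaceAll("password=\".*\"", "password=\"[REDACTED]\""))
    } else {
      // All other parameters rely on Spark's central redaction, which masks
      // values whose keys match spark.redaction.regex.
      val (_, redacted) = Utils.redact(redactionPattern, Seq((key, value.toString))).head
      (key, redacted)
    }
  }
}
```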

How was this patch tested?

Existing + additional unit tests.

  | useTicketCache=true
  | serviceName="${clusterConf.kerberosServiceName}";
- """.stripMargin.replace("\n", "")
+ """.stripMargin.replace("\n", "").trim
@gaborgsomogyi (Contributor Author)

This is an unrelated fix. There were additional spaces at the end of the generated string, and this change removes them. Worth mentioning that the code works both ways, but I thought it was just ugly.
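
A tiny illustration of the issue (with hypothetical string content):

```scala
val params =
  """
  |com.example.LoginModule required
  | serviceName="kafka";
  """.stripMargin.replace("\n", "")
// stripMargin only strips up to the '|' on lines that contain one, so the
// spaces indenting the closing triple quote survive the newline removal and
// end up as trailing spaces. Appending .trim removes them.
```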

  | password="$password";
- """.stripMargin.replace("\n", "")
  logDebug(s"Scram JAAS params: ${params.replaceAll("password=\".*\"", "password=\"[hidden]\"")}")
+ """.stripMargin.replace("\n", "").trim
@gaborgsomogyi (Contributor Author)

This is an unrelated fix. There were additional spaces at the end of the generated string, and this change removes them. Worth mentioning that the code works both ways, but I thought it was just ugly.


SparkQA commented May 16, 2019

Test build #105457 has finished for PR 24627 at commit 7e445f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please.


SparkQA commented May 18, 2019

Test build #105517 has finished for PR 24627 at commit 7e445f7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please.


SparkQA commented May 18, 2019

Test build #105518 has finished for PR 24627 at commit 7e445f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val redactionPattern = SparkEnv.get.conf.get(SECRET_REDACTION_PATTERN)
params.map { case (key, value) =>
  if (key.equalsIgnoreCase(SaslConfigs.SASL_JAAS_CONFIG)) {
    (key, redactJaasParam(value.asInstanceOf[String]))
@dongjoon-hyun (Member)

Hi, @gaborgsomogyi.
Is this the only reason why we cannot use Utils.redact directly?

@gaborgsomogyi (Contributor Author)

@dongjoon-hyun
Not sure what you mean by "only reason". The short answer is yes.

In a little more detail: SaslConfigs.SASL_JAAS_CONFIG has a different format than any other property. A normal property looks like the following:
Key=ssl.truststore.password, Value=secret
SaslConfigs.SASL_JAAS_CONFIG, however, has the following syntax:
Key=sasl.jaas.config, Value=org.apache.kafka.common.security.scram.ScramLoginModule required tokenauth=true serviceName="kafka" username="admin" password="admin-secret";
Utils.redact makes a malformed and unreadable string out of it.
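
A minimal sketch of the custom handling (redactJaasParam is the helper name visible in the diff above, and the password pattern matches the one in the earlier logDebug line; treat the body as an illustration rather than the exact PR code):

```scala
object KafkaRedactionSketch {
  // Masks only the quoted value following password=, so the login module,
  // serviceName and the other fields stay readable in the logs.
  def redactJaasParam(param: String): String =
    param.replaceAll("password=\".*\"", "password=\"[REDACTED]\"")
}
```

Applied to the value above, everything survives except the secret: ... serviceName="kafka" username="admin" password="[REDACTED]";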

@HeartSaVioR (Contributor)

I know this patch just applies redaction to all the places that currently print the configuration, but I feel the Kafka module is too verbose about configuration, especially KafkaConfigUpdater. Assuming the debug log level is turned on, you would mainly want to track which consumers are being acquired, evicted, or closed; the logging in KafkaConfigUpdater doesn't help with that.

Let's revisit the purpose of logging. I guess we are logging to find out when a consumer/producer is created, and when it's evicted/closed (either tracking each instance, or counting per key). Though we use the whole configuration as the cache key, we (as humans) cannot check every key-value pair to find the same instance. Some pairs act as differentiators, and we (again, as humans) just assume the other pairs are not different.

So IMHO we don't even need to print out all the configurations. Especially if we really want to track each instance, an explicit ID (as #19096 introduces) would work better and be much more helpful for debugging.

@gaborgsomogyi (Contributor Author)

> So IMHO we don't even need to print out all the configurations.

Without printing out all the configurations, how do you think one can find out why a consumer/producer creation fails?

> An explicit ID (as #19096 introduces) would work better and be much more helpful for debugging.

I agree, introducing an ID would be enough, but only once the consumer/producer is already created. In that case it's not necessary to log all the configurations. This is the next PR in my queue, but I didn't want to mix up the two changes.

> The Kafka module is too verbose about configuration, especially KafkaConfigUpdater.

Yeah, this can be discussed.


HeartSaVioR commented May 21, 2019

> So IMHO we don't even need to print out all the configurations.

> Without printing out all the configurations, how do you think one can find out why a consumer/producer creation fails?

Maybe I was not clear; more clearly, I don't think we should print out the configurations all the time. Does the Kafka module support configuration changes at runtime? If not, a data source needs to print the configuration only once, instead of on every creation/termination of an instance.

@gaborgsomogyi (Contributor Author)

> Maybe I was not clear; more clearly, I don't think we should print out the configurations all the time.

Same understanding. So this means redaction is needed.

> Does the Kafka module support configuration changes at runtime?

I don't think so, but I'll sit down with the Kafka guys to get a deeper understanding, and based on that we can go forward. That said, my intention is to reduce the log size drastically and introduce ID tracking, but I suggest doing it in an additional PR to keep the focus.

@HeartSaVioR (Contributor)

I meant the Kafka module is spark-sql-kafka, as we are concerned about the config in spark-sql-kafka. Do we assume the config can be changed after the query is run? I have simply been guessing that we don't support it, but the risk of supporting it may be just a cache miss, so I'm not sure it is unsupported.

@gaborgsomogyi (Contributor Author)

From the Spark side we're not changing the configuration; otherwise caching wouldn't make sense.

@HeartSaVioR (Contributor)

As this patch fixes the issue the current code has, IMHO we can go with it and apply the new improvement ideas in other issues/PRs. I can help with applying the new ideas as well.

@gaborgsomogyi (Contributor Author)

@dongjoon-hyun do you have any other comments?

@gaborgsomogyi (Contributor Author)

not much movement, maybe @vanzin?

@vanzin (Contributor) left a comment

These logs do seem a bit verbose, but since they're already there...


SparkQA commented May 30, 2019

Test build #105974 has finished for PR 24627 at commit c591632.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jun 3, 2019

Test build #106101 has finished for PR 24627 at commit 67aa337.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


vanzin commented Jun 3, 2019

merging to master.
