Skip to content

Conversation

@dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Jul 17, 2024

What changes were proposed in this pull request?

This PR aims to redact awsAccessKeyId by including accesskey pattern.

  • Apache Spark 4.0.0-preview1
    There is no point to redact fs.s3a.access.key because the same value is exposed via fs.s3.awsAccessKeyId like the following. We need to redact all.
$ AWS_ACCESS_KEY_ID=A AWS_SECRET_ACCESS_KEY=B bin/spark-shell

Screenshot 2024-07-17 at 12 45 44

Why are the changes needed?

Since Apache Spark 1.1.0, AWS_ACCESS_KEY_ID is propagated like the following. However, Apache Spark does not redact them all consistently.

hadoopConf.set("fs.s3.awsAccessKeyId", keyId, source + ENV_VAR_AWS_ACCESS_KEY)
hadoopConf.set("fs.s3n.awsAccessKeyId", keyId, source + ENV_VAR_AWS_ACCESS_KEY)
hadoopConf.set("fs.s3a.access.key", keyId, source + ENV_VAR_AWS_ACCESS_KEY)
hadoopConf.set("fs.s3.awsSecretAccessKey", accessKey, source + ENV_VAR_AWS_SECRET_KEY)
hadoopConf.set("fs.s3n.awsSecretAccessKey", accessKey, source + ENV_VAR_AWS_SECRET_KEY)
hadoopConf.set("fs.s3a.secret.key", accessKey, source + ENV_VAR_AWS_SECRET_KEY)

Does this PR introduce any user-facing change?

Users may see more redactions on configurations whose name contains accesskey case-insensitively. However, those configurations are highly likely to be related to the credentials.

How was this patch tested?

Pass the CIs with the newly added test cases.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the CORE label Jul 17, 2024
@dongjoon-hyun
Copy link
Member Author

cc @tgravescs , @mridulm , @yaooqinn

Copy link
Contributor

@tgravescs tgravescs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine to me

@dongjoon-hyun
Copy link
Member Author

Thank you, @tgravescs .

@dongjoon-hyun
Copy link
Member Author

Could you review this PR, @viirya ?

@dongjoon-hyun
Copy link
Member Author

Thank you, @viirya !

@viirya
Copy link
Member

viirya commented Jul 17, 2024

Thanks @dongjoon-hyun

dongjoon-hyun added a commit that referenced this pull request Jul 17, 2024
…pattern

### What changes were proposed in this pull request?

This PR aims to redact `awsAccessKeyId` by including `accesskey` pattern.

- **Apache Spark 4.0.0-preview1**
There is no point to redact `fs.s3a.access.key` because the same value is exposed via `fs.s3.awsAccessKeyId` like the following. We need to redact all.

```
$ AWS_ACCESS_KEY_ID=A AWS_SECRET_ACCESS_KEY=B bin/spark-shell
```

![Screenshot 2024-07-17 at 12 45 44](https://github.com/user-attachments/assets/e3040c5d-3eb9-4944-a6d6-5179b7647426)

### Why are the changes needed?

Since Apache Spark 1.1.0, `AWS_ACCESS_KEY_ID` is propagated like the following. However, Apache Spark does not redact them all consistently.
- #450

https://github.com/apache/spark/blob/5d16c3134c442a5546251fd7c42b1da9fdf3969e/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L481-L486

### Does this PR introduce _any_ user-facing change?

Users may see more redactions on configurations whose name contains `accesskey` case-insensitively. However, those configurations are highly likely to be related to the credentials.

### How was this patch tested?

Pass the CIs with the newly added test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47392 from dongjoon-hyun/SPARK-48930.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 1e17c39)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Jul 17, 2024
…pattern

### What changes were proposed in this pull request?

This PR aims to redact `awsAccessKeyId` by including `accesskey` pattern.

- **Apache Spark 4.0.0-preview1**
There is no point to redact `fs.s3a.access.key` because the same value is exposed via `fs.s3.awsAccessKeyId` like the following. We need to redact all.

```
$ AWS_ACCESS_KEY_ID=A AWS_SECRET_ACCESS_KEY=B bin/spark-shell
```

![Screenshot 2024-07-17 at 12 45 44](https://github.com/user-attachments/assets/e3040c5d-3eb9-4944-a6d6-5179b7647426)

### Why are the changes needed?

Since Apache Spark 1.1.0, `AWS_ACCESS_KEY_ID` is propagated like the following. However, Apache Spark does not redact them all consistently.
- #450

https://github.com/apache/spark/blob/5d16c3134c442a5546251fd7c42b1da9fdf3969e/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L481-L486

### Does this PR introduce _any_ user-facing change?

Users may see more redactions on configurations whose name contains `accesskey` case-insensitively. However, those configurations are highly likely to be related to the credentials.

### How was this patch tested?

Pass the CIs with the newly added test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47392 from dongjoon-hyun/SPARK-48930.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 1e17c39)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member Author

Merged to all live release branches.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-48930 branch July 17, 2024 22:57
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
…pattern

### What changes were proposed in this pull request?

This PR aims to redact `awsAccessKeyId` by including `accesskey` pattern.

- **Apache Spark 4.0.0-preview1**
There is no point to redact `fs.s3a.access.key` because the same value is exposed via `fs.s3.awsAccessKeyId` like the following. We need to redact all.

```
$ AWS_ACCESS_KEY_ID=A AWS_SECRET_ACCESS_KEY=B bin/spark-shell
```

![Screenshot 2024-07-17 at 12 45 44](https://github.com/user-attachments/assets/e3040c5d-3eb9-4944-a6d6-5179b7647426)

### Why are the changes needed?

Since Apache Spark 1.1.0, `AWS_ACCESS_KEY_ID` is propagated like the following. However, Apache Spark does not redact them all consistently.
- apache#450

https://github.com/apache/spark/blob/5d16c3134c442a5546251fd7c42b1da9fdf3969e/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L481-L486

### Does this PR introduce _any_ user-facing change?

Users may see more redactions on configurations whose name contains `accesskey` case-insensitively. However, those configurations are highly likely to be related to the credentials.

### How was this patch tested?

Pass the CIs with the newly added test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47392 from dongjoon-hyun/SPARK-48930.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants