Conversation

@vanzin
Contributor

@vanzin commented Apr 19, 2014

This is related to SPARK-1459 / PR #375. Without this fix,
FileLogger.createLogDir() may try to create the log dir on
HDFS, while createWriter() will try to open the log file on
the local file system, leading to interesting errors and
confusion.
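For illustration, a minimal sketch of the idea behind the fix (hypothetical paths; not the actual FileLogger code): qualify the log path against the default filesystem once, so directory creation and file writes resolve to the same filesystem.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
// FileSystem.get honors the configured default fs name
// (fs.default.name in older Hadoop, fs.defaultFS in newer).
val fs = FileSystem.get(conf)

// Qualify the raw path once; both calls below then target the same FS.
val logDir = new Path("/tmp/spark-events")
  .makeQualified(fs.getUri, fs.getWorkingDirectory)

fs.mkdirs(logDir)                              // e.g. on HDFS
val out = fs.create(new Path(logDir, "app-1")) // same FS, not the local one
out.close()
```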

@AmplabJenkins

Can one of the admins verify this patch?

Contributor

FileSystem has a function that gives you the default filesystem without having to look at the configs directly:

```java
public static URI getDefaultUri(Configuration conf)
```
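For example, a minimal sketch of using that API (assuming a standard Hadoop `Configuration`; not code from this patch):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val hadoopConf = new Configuration()
// Resolves the configured default filesystem for you, with no
// direct reads of fs.default.name / fs.defaultFS.
val defaultFsUri = FileSystem.getDefaultUri(hadoopConf)
println(defaultFsUri) // e.g. hdfs://namenode:8020, or file:/// by default
```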

@pwendell
Contributor

Jenkins, test this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14319/

@pwendell
Contributor

The ordering of initialization in the DAGScheduler is very sensitive. Could this patch instead just move the Hadoop configuration up?
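To illustrate the sensitivity (a schematic sketch with hypothetical stand-in types, not the real DAGScheduler): Scala evaluates constructor fields top to bottom, so a field can only be moved up if everything it reads is declared above it.

```scala
// Hypothetical stand-ins, not Spark's real types.
class HadoopConf
class EventLogger(conf: HadoopConf)

class SchedulerSketch {
  // Declared first, so it is initialized before anything that uses it...
  private val hadoopConf = new HadoopConf
  // ...otherwise this field would observe a null hadoopConf.
  private val eventLogger = new EventLogger(hadoopConf)
}
```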

Contributor

nit: style should be

```scala
private[spark] class EventLoggingListener(
    appName: String,
    ...
  extends ...
```

@andrewor14
Contributor

Thanks for the changes @vanzin. I think the logic is much more straightforward now. This LGTM.

@pwendell
Contributor

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14378/

@pwendell
Contributor

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14398/

@pwendell
Contributor

Thanks - merged this.

@asfgit asfgit closed this in dd1b7a6 Apr 23, 2014
asfgit pushed a commit that referenced this pull request Apr 23, 2014
This is related to SPARK-1459 / PR #375. Without this fix,
FileLogger.createLogDir() may try to create the log dir on
HDFS, while createWriter() will try to open the log file on
the local file system, leading to interesting errors and
confusion.

Author: Marcelo Vanzin <[email protected]>

Closes #450 from vanzin/event-file-2 and squashes the following commits:

592cdb3 [Marcelo Vanzin] Honor default fs name when initializing event logger.
(cherry picked from commit dd1b7a6)

Signed-off-by: Patrick Wendell <[email protected]>
pwendell pushed a commit to pwendell/spark that referenced this pull request May 12, 2014
apache#450.

Only run ResubmitFailedStages event after a fetch fails

Previously, the ResubmitFailedStages event was called every
200 milliseconds, leading to a lot of unnecessary event processing
and clogged DAGScheduler logs.

Author: Kay Ousterhout <[email protected]>

== Merge branch commits ==

commit e603784b3a562980e6f1863845097effe2129d3b
Author: Kay Ousterhout <[email protected]>
Date:   Wed Feb 5 11:34:41 2014 -0800

    Re-add check for empty set of failed stages

commit d258f0ef50caff4bbb19fb95a6b82186db1935bf
Author: Kay Ousterhout <[email protected]>
Date:   Wed Jan 15 23:35:41 2014 -0800

    Only run ResubmitFailedStages event after a fetch fails

    Previously, the ResubmitFailedStages event was called every
    200 milliseconds, leading to a lot of unnecessary event processing
    and clogged DAGScheduler logs.
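A self-contained sketch of the scheduling change described above (all names hypothetical; not the actual DAGScheduler code): replace a recurring 200 ms resubmit timer with a one-shot resubmit scheduled only when the first fetch failure arrives.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable

object ResubmitSketch {
  private val timer = Executors.newSingleThreadScheduledExecutor()
  private val failedStages = mutable.Set[Int]()

  // Old behavior, for contrast: fire the event every 200 ms regardless
  // of whether anything has failed.
  // timer.scheduleAtFixedRate(() => post(), 200, 200, TimeUnit.MILLISECONDS)

  // New behavior: schedule one delayed resubmit on the first fetch
  // failure; subsequent failures piggyback on the pending event.
  def onFetchFailed(stageId: Int): Unit = synchronized {
    if (failedStages.isEmpty) {
      timer.schedule(new Runnable { def run(): Unit = post() },
        200, TimeUnit.MILLISECONDS)
    }
    failedStages += stageId
  }

  private def post(): Unit = synchronized {
    println(s"resubmitting failed stages: ${failedStages.mkString(", ")}")
    failedStages.clear()
  }
}
```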
@vanzin vanzin deleted the event-file-2 branch May 23, 2014 22:39
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
This is related to SPARK-1459 / PR apache#375. Without this fix,
FileLogger.createLogDir() may try to create the log dir on
HDFS, while createWriter() will try to open the log file on
the local file system, leading to interesting errors and
confusion.

Author: Marcelo Vanzin <[email protected]>

Closes apache#450 from vanzin/event-file-2 and squashes the following commits:

592cdb3 [Marcelo Vanzin] Honor default fs name when initializing event logger.
gzm55 pushed a commit to MediaV/spark that referenced this pull request Jul 17, 2014
apache#450.

Only run ResubmitFailedStages event after a fetch fails

Previously, the ResubmitFailedStages event was called every
200 milliseconds, leading to a lot of unnecessary event processing
and clogged DAGScheduler logs.

Author: Kay Ousterhout <[email protected]>

== Merge branch commits ==

commit e603784b3a562980e6f1863845097effe2129d3b
Author: Kay Ousterhout <[email protected]>
Date:   Wed Feb 5 11:34:41 2014 -0800

    Re-add check for empty set of failed stages

commit d258f0ef50caff4bbb19fb95a6b82186db1935bf
Author: Kay Ousterhout <[email protected]>
Date:   Wed Jan 15 23:35:41 2014 -0800

    Only run ResubmitFailedStages event after a fetch fails

    Previously, the ResubmitFailedStages event was called every
    200 milliseconds, leading to a lot of unnecessary event processing
    and clogged DAGScheduler logs.
(cherry picked from commit 0b448df)

Signed-off-by: Patrick Wendell <[email protected]>
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Jan 8, 2015
apache#450.

Only run ResubmitFailedStages event after a fetch fails

Previously, the ResubmitFailedStages event was called every
200 milliseconds, leading to a lot of unnecessary event processing
and clogged DAGScheduler logs.

Author: Kay Ousterhout <[email protected]>

== Merge branch commits ==

commit e603784b3a562980e6f1863845097effe2129d3b
Author: Kay Ousterhout <[email protected]>
Date:   Wed Feb 5 11:34:41 2014 -0800

    Re-add check for empty set of failed stages

commit d258f0ef50caff4bbb19fb95a6b82186db1935bf
Author: Kay Ousterhout <[email protected]>
Date:   Wed Jan 15 23:35:41 2014 -0800

    Only run ResubmitFailedStages event after a fetch fails

    Previously, the ResubmitFailedStages event was called every
    200 milliseconds, leading to a lot of unnecessary event processing
    and clogged DAGScheduler logs.
(cherry picked from commit 0b448df)

Signed-off-by: Patrick Wendell <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Jul 17, 2024
…pattern

### What changes were proposed in this pull request?

This PR aims to redact `awsAccessKeyId` by including `accesskey` pattern.

- **Apache Spark 4.0.0-preview1**
There is no point in redacting `fs.s3a.access.key` because the same value is exposed via `fs.s3.awsAccessKeyId`, as shown below. We need to redact all of them.

```
$ AWS_ACCESS_KEY_ID=A AWS_SECRET_ACCESS_KEY=B bin/spark-shell
```

![Screenshot 2024-07-17 at 12 45 44](https://github.com/user-attachments/assets/e3040c5d-3eb9-4944-a6d6-5179b7647426)

### Why are the changes needed?

Since Apache Spark 1.1.0, `AWS_ACCESS_KEY_ID` has been propagated as shown below. However, Apache Spark does not redact all of these values consistently.
- #450

https://github.com/apache/spark/blob/5d16c3134c442a5546251fd7c42b1da9fdf3969e/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L481-L486

### Does this PR introduce _any_ user-facing change?

Users may see more redactions on configurations whose names contain `accesskey` case-insensitively. However, those configurations are highly likely to be related to credentials.

### How was this patch tested?

Pass the CIs with the newly added test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47392 from dongjoon-hyun/SPARK-48930.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
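A hedged sketch of the substring-based, case-insensitive key redaction described above (the pattern and placeholder text are illustrative, not Spark's exact defaults):

```scala
// Redact values whose config key matches a secret-looking pattern,
// e.g. "accesskey", anywhere in the key and case-insensitively.
val redactionPattern = "(?i)secret|password|token|accesskey".r

def redact(conf: Seq[(String, String)]): Seq[(String, String)] =
  conf.map { case (key, value) =>
    if (redactionPattern.findFirstIn(key).isDefined) (key, "*********(redacted)")
    else (key, value)
  }

// fs.s3.awsAccessKeyId now matches via the "accesskey" alternative.
redact(Seq(
  "fs.s3.awsAccessKeyId" -> "AKIA...",
  "spark.app.name"       -> "demo"
)).foreach(println)
```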
dongjoon-hyun added a commit that referenced this pull request Jul 17, 2024
…pattern

### What changes were proposed in this pull request?

This PR aims to redact `awsAccessKeyId` by including `accesskey` pattern.

- **Apache Spark 4.0.0-preview1**
There is no point in redacting `fs.s3a.access.key` because the same value is exposed via `fs.s3.awsAccessKeyId`, as shown below. We need to redact all of them.

```
$ AWS_ACCESS_KEY_ID=A AWS_SECRET_ACCESS_KEY=B bin/spark-shell
```

![Screenshot 2024-07-17 at 12 45 44](https://github.com/user-attachments/assets/e3040c5d-3eb9-4944-a6d6-5179b7647426)

### Why are the changes needed?

Since Apache Spark 1.1.0, `AWS_ACCESS_KEY_ID` has been propagated as shown below. However, Apache Spark does not redact all of these values consistently.
- #450

https://github.com/apache/spark/blob/5d16c3134c442a5546251fd7c42b1da9fdf3969e/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L481-L486

### Does this PR introduce _any_ user-facing change?

Users may see more redactions on configurations whose names contain `accesskey` case-insensitively. However, those configurations are highly likely to be related to credentials.

### How was this patch tested?

Pass the CIs with the newly added test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47392 from dongjoon-hyun/SPARK-48930.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 1e17c39)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request Jul 17, 2024
…pattern

### What changes were proposed in this pull request?

This PR aims to redact `awsAccessKeyId` by including `accesskey` pattern.

- **Apache Spark 4.0.0-preview1**
There is no point in redacting `fs.s3a.access.key` because the same value is exposed via `fs.s3.awsAccessKeyId`, as shown below. We need to redact all of them.

```
$ AWS_ACCESS_KEY_ID=A AWS_SECRET_ACCESS_KEY=B bin/spark-shell
```

![Screenshot 2024-07-17 at 12 45 44](https://github.com/user-attachments/assets/e3040c5d-3eb9-4944-a6d6-5179b7647426)

### Why are the changes needed?

Since Apache Spark 1.1.0, `AWS_ACCESS_KEY_ID` has been propagated as shown below. However, Apache Spark does not redact all of these values consistently.
- #450

https://github.com/apache/spark/blob/5d16c3134c442a5546251fd7c42b1da9fdf3969e/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L481-L486

### Does this PR introduce _any_ user-facing change?

Users may see more redactions on configurations whose names contain `accesskey` case-insensitively. However, those configurations are highly likely to be related to credentials.

### How was this patch tested?

Pass the CIs with the newly added test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47392 from dongjoon-hyun/SPARK-48930.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 1e17c39)
Signed-off-by: Dongjoon Hyun <[email protected]>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
…pattern

### What changes were proposed in this pull request?

This PR aims to redact `awsAccessKeyId` by including `accesskey` pattern.

- **Apache Spark 4.0.0-preview1**
There is no point in redacting `fs.s3a.access.key` because the same value is exposed via `fs.s3.awsAccessKeyId`, as shown below. We need to redact all of them.

```
$ AWS_ACCESS_KEY_ID=A AWS_SECRET_ACCESS_KEY=B bin/spark-shell
```

![Screenshot 2024-07-17 at 12 45 44](https://github.com/user-attachments/assets/e3040c5d-3eb9-4944-a6d6-5179b7647426)

### Why are the changes needed?

Since Apache Spark 1.1.0, `AWS_ACCESS_KEY_ID` has been propagated as shown below. However, Apache Spark does not redact all of these values consistently.
- apache#450

https://github.com/apache/spark/blob/5d16c3134c442a5546251fd7c42b1da9fdf3969e/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L481-L486

### Does this PR introduce _any_ user-facing change?

Users may see more redactions on configurations whose names contain `accesskey` case-insensitively. However, those configurations are highly likely to be related to credentials.

### How was this patch tested?

Pass the CIs with the newly added test cases.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47392 from dongjoon-hyun/SPARK-48930.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…the database location as table location placeholder (apache#450)