@dannycjones
Contributor

@dannycjones dannycjones commented Jun 21, 2022

Description of PR

As noted in the ticket, this PR attempts to improve the committer docs, with the benefit of a fresh pair of eyes from someone who has not worked with the committers before.

I've tried to ensure that the Table of Contents makes more sense too.

How was this patch tested?

Reading :)

I did try to build the HTML, but I couldn't get it to pick up the new markdown.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@dannycjones
Contributor Author

dannycjones commented Jun 21, 2022

Note that with the current patch, the table of contents has changed.

FROM (trunk):

# Committing work to S3 with the "S3A Committers"
### January 2021 Update
## Introduction: The Commit Problem
### Background : Hadoop's "Commit Protocol"
## Meet the S3A Committers
### The Staging Committer
## Conflict Resolution in the Staging Committers
### The Magic Committer
#### Which Committer to Use?
## Switching to an S3A Committer
## Using the Directory and Partitioned Staging Committers
## The "Partitioned" Staging Committer
### Notes
## Using the Magic committer
### FileSystem client setup
### Enabling the committer
## Common Committer Options
## Staging committer (Directory and Partitioned) options
### Common Committer Options
### Staging Committer Options
### Disabling magic committer path rewriting
## <a name="concurrent-jobs"></a> Concurrent Jobs writing to the same destination
## Troubleshooting

TO (5e8cdf0):

# Committing work to S3 with the "S3A Committers"
### January 2021 Update
## Introduction: The Commit Problem
### Background: Hadoop's "Commit Protocol"
## Meet the S3A Committers
### The Staging Committers
#### Conflict Resolution in the Staging Committers
### The Magic Committer
### Which Committer to Use?
## Switching to an S3A Committer
## Using the Staging Committers
### The "Partitioned" Staging Committer
### Notes on using Staging Committers
## Using the Magic committer
### FileSystem client setup
### Enabling the committer
## Committer Options Reference
### Common S3A Committer Options
### Staging committer (Directory and Partitioned) options
### Disabling magic committer path rewriting
## <a name="concurrent-jobs"></a> Concurrent Jobs writing to the same destination
## Troubleshooting

@dannycjones dannycjones marked this pull request as ready for review June 21, 2022 10:33
@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:----------|--------:|:--------|:--------|
| +0 🆗 | reexec | 1m 0s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 🆗 | markdownlint | 0m 0s | | markdownlint was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 41m 7s | | trunk passed |
| +1 💚 | mvnsite | 0m 53s | | trunk passed |
| +1 💚 | shadedclient | 64m 54s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 36s | | the patch passed |
| -1 ❌ | blanks | 0m 0s | /blanks-eol.txt | The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply |
| +1 💚 | mvnsite | 0m 37s | | the patch passed |
| +1 💚 | shadedclient | 23m 21s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | asflicense | 0m 44s | | The patch does not generate ASF License warnings. |
| | | 92m 26s | | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4478/1/artifact/out/Dockerfile |
| GITHUB PR | #4478 |
| Optional Tests | dupname asflicense mvnsite codespell detsecrets markdownlint |
| uname | Linux f71388264706 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 5e8cdf09694ef5b1a2f1cd33016a068ea9bd469c |
| Max. process+thread count | 534 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4478/1/console |
| versions | git=2.25.1 maven=3.6.3 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@dannycjones dannycjones force-pushed the HADOOP-18304-improve-committers-doc branch from 5e8cdf0 to ffdc6bd Compare June 21, 2022 11:44
@hadoop-yetus

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:----------|--------:|:--------|:--------|
| +0 🆗 | reexec | 1m 1s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 🆗 | markdownlint | 0m 0s | | markdownlint was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 42m 40s | | trunk passed |
| +1 💚 | mvnsite | 1m 3s | | trunk passed |
| +1 💚 | shadedclient | 69m 9s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 40s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | mvnsite | 0m 44s | | the patch passed |
| +1 💚 | shadedclient | 24m 19s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | asflicense | 0m 43s | | The patch does not generate ASF License warnings. |
| | | 97m 50s | | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4478/2/artifact/out/Dockerfile |
| GITHUB PR | #4478 |
| Optional Tests | dupname asflicense mvnsite codespell detsecrets markdownlint |
| uname | Linux 8c6d918a1efa 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / ffdc6bd |
| Max. process+thread count | 567 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4478/2/console |
| versions | git=2.25.1 maven=3.6.3 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@dannycjones
Contributor Author

@ahmarsuhail - would you be able to review this week?

Contributor

@ahmarsuhail ahmarsuhail left a comment

looks good, just some minor formatting changes

files which do not contain relevant data.

What the partitioned committer does is, where the tooling permits, allows callers
to add data to an existing partitioned layout*.
Contributor

Consider rephrasing. "If tool permits, the partitioned committer allows callers to add data to an existing partitioned layout." also remove * before the full stop

Contributor Author

For this one, I'm not sure if the * refers to something. I will remove it for now. Happy to add it back.

it would be overridden. Set `fs.s3a.committer.staging.unique-filenames` to `true`
If the file `log-20170228.avro` in the example above already existed, it would be overwritten.

Set `fs.s3a.committer.staging.unique-filenames` to `true`
Contributor

Indentation isn't right here. I think it makes sense to have this as part of point 3, consider reverting this change

Contributor Author

I believe indenting like this lets you put the sentence on a new line while keeping it part of the previous point.

That being said, it is not obvious from the markdown, and I cannot test the output HTML, so I'll revert.
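
For readers following this thread, here is a minimal Python sketch of the overwrite behaviour under discussion. It is a toy in-memory model, not the S3A committer's implementation: the `commit` helper, the dict standing in for the object store, the `job_id` values, and the way the ID is spliced into the filename are all assumptions made for illustration. Only the option name `fs.s3a.committer.staging.unique-filenames` and the `log-20170228.avro` example come from the doc text quoted above.

```python
import os
import uuid


def commit(store, dest_dir, filename, data, unique_filenames=False, job_id=None):
    """Toy model: write one file into a destination "directory" of a dict-backed store.

    When unique_filenames is True, a job ID is inserted before the extension
    (hypothetical naming scheme), so two jobs writing the same logical filename
    produce two distinct keys instead of one overwriting the other.
    """
    if unique_filenames:
        job_id = job_id or uuid.uuid4().hex
        stem, ext = os.path.splitext(filename)
        filename = f"{stem}-{job_id}{ext}"
    key = f"{dest_dir}/{filename}"
    store[key] = data  # a plain PUT: any existing object at this key is replaced
    return key


# Without unique filenames, the second commit replaces the first.
plain = {}
commit(plain, "logs/year=2017", "log-20170228.avro", b"first job")
commit(plain, "logs/year=2017", "log-20170228.avro", b"second job")

# With unique filenames, both outputs survive side by side.
unique = {}
commit(unique, "logs/year=2017", "log-20170228.avro", b"first job",
       unique_filenames=True, job_id="job1")
commit(unique, "logs/year=2017", "log-20170228.avro", b"second job",
       unique_filenames=True, job_id="job2")
```

In this model `plain` ends up holding a single key with the second job's data, while `unique` holds one key per job.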

@dannycjones
Contributor Author

@ahmarsuhail, I've updated based on your feedback. Thanks for reviewing it with such detail.

I've put the changes for things like a missing `.` in a separate commit, cd60050, in case we don't want to touch too many lines unnecessarily.

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:----------|--------:|:--------|:--------|
| +0 🆗 | reexec | 0m 57s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 🆗 | markdownlint | 0m 0s | | markdownlint was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 49m 4s | | trunk passed |
| +1 💚 | mvnsite | 1m 35s | | trunk passed |
| -1 ❌ | shadedclient | 86m 56s | | branch has errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 49s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | mvnsite | 0m 52s | | the patch passed |
| -1 ❌ | shadedclient | 5m 10s | | patch has errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +0 🆗 | asflicense | 0m 34s | | ASF License check generated no output? |
| | | 96m 21s | | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4478/3/artifact/out/Dockerfile |
| GITHUB PR | #4478 |
| Optional Tests | dupname asflicense mvnsite codespell detsecrets markdownlint |
| uname | Linux e68a2eb56e51 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / cd60050 |
| Max. process+thread count | 546 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4478/3/console |
| versions | git=2.25.1 maven=3.6.3 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:----------|--------:|:--------|:--------|
| +0 🆗 | reexec | 1m 20s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 🆗 | markdownlint | 0m 0s | | markdownlint was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| | | | | _ trunk Compile Tests _ |
| -1 ❌ | mvninstall | 4m 13s | /branch-mvninstall-root.txt | root in trunk failed. |
| +1 💚 | mvnsite | 4m 7s | | trunk passed |
| +1 💚 | shadedclient | 36m 24s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 36s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | mvnsite | 0m 37s | | the patch passed |
| +1 💚 | shadedclient | 22m 48s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | asflicense | 0m 45s | | The patch does not generate ASF License warnings. |
| | | 63m 47s | | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4478/4/artifact/out/Dockerfile |
| GITHUB PR | #4478 |
| Optional Tests | dupname asflicense mvnsite codespell detsecrets markdownlint |
| uname | Linux dbf5984762f5 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / cd97bad |
| Max. process+thread count | 608 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4478/4/console |
| versions | git=2.25.1 maven=3.6.3 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

Contributor

@ahmarsuhail ahmarsuhail left a comment

+1, looks good!

@hadoop-yetus

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:----------|--------:|:--------|:--------|
| +0 🆗 | reexec | 0m 51s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 🆗 | markdownlint | 0m 0s | | markdownlint was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 41m 24s | | trunk passed |
| +1 💚 | mvnsite | 0m 40s | | trunk passed |
| +1 💚 | shadedclient | 64m 19s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 31s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 💚 | mvnsite | 0m 32s | | the patch passed |
| +1 💚 | shadedclient | 22m 42s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | asflicense | 0m 34s | | The patch does not generate ASF License warnings. |
| | | 90m 42s | | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4478/7/artifact/out/Dockerfile |
| GITHUB PR | #4478 |
| Optional Tests | dupname asflicense mvnsite codespell detsecrets markdownlint |
| uname | Linux 0e08d197efe0 4.15.0-192-generic #203-Ubuntu SMP Wed Aug 10 17:40:03 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / faec974 |
| Max. process+thread count | 528 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4478/7/console |
| versions | git=2.25.1 maven=3.6.3 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@dannycjones
Contributor Author

Would you or a colleague be able to take a look at this PR in the next week, @mehakmeet?

@mehakmeet mehakmeet self-requested a review October 6, 2022 12:18
@mehakmeet
Contributor

Yes @dannycjones, I'll check this by tomorrow morning IST.

Contributor

@mehakmeet mehakmeet left a comment

LGTM, bar a few nits.

```



Contributor

nit: blank lines

Contributor Author

will drop those!


Now that Amazon S3 is consistent, the magic committer is enabled by default.
Now that [Amazon S3 is consistent](https://aws.amazon.com/s3/consistency/),
the magic committer is enabled by default.
Contributor

The file committer is still the default committer, right? So what does the magic committer being "enabled by default" mean here?

Contributor Author

Yes, I believe that's correct.

I believe what we're saying here is that S3A's "magic path rewriting", where it only stages the writes to `__magic/` directories, is now enabled by default.

I will update this to be clearer, something like:

Suggested change
the magic committer is enabled by default.
the magic directory path rewriting is enabled by default.
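
To make the "magic path rewriting" idea concrete, here is a simplified Python sketch of how a write addressed to a path under `__magic/` could be mapped to its final destination. This is a hedged illustration rather than S3A's actual code: the function name, the example bucket, the job and task directory names, and the handling of a `__base` marker are all assumptions for this sketch.

```python
MAGIC = "__magic"  # directory name mentioned in the thread
BASE = "__base"    # assumed marker: path elements after it are kept relative to the destination


def final_destination(path):
    """Return where data written to a path under __magic would finally appear.

    Paths without a __magic element are returned unchanged. Otherwise the
    destination is the parent of __magic plus either everything after a
    __base element, or just the filename when no __base is present.
    """
    parts = path.split("/")
    if MAGIC not in parts:
        return path
    magic_index = parts.index(MAGIC)
    dest_parent = parts[:magic_index]
    children = parts[magic_index + 1:]
    if BASE in children:
        relative = children[children.index(BASE) + 1:]
    else:
        relative = children[-1:]
    return "/".join(dest_parent + relative)


# Hypothetical paths, for illustration only.
with_base = final_destination(
    "s3a://bucket/dest/__magic/job-1/task-1/__base/year=2017/part-0000")
without_base = final_destination(
    "s3a://bucket/dest/__magic/job-1/task-1/part-0000")
```

Under these assumptions, `with_base` resolves to `s3a://bucket/dest/year=2017/part-0000` and `without_base` to `s3a://bucket/dest/part-0000`; in the real committer the data only becomes visible there once the job commits.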

This then is the problem which the S3A committers address:

*How to safely and reliably commit work to Amazon S3 or compatible object store*
>*How to safely and reliably commit work to Amazon S3 or compatible object store.*
Contributor

What are we quoting here?

Contributor Author

Fair point, it is not quoting anyone (or maybe Steve). I will let it join the previous sentence as I think that makes more sense.

@dannycjones dannycjones requested a review from mehakmeet October 7, 2022 09:18
@hadoop-yetus

🎊 +1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:----------|--------:|:--------|:--------|
| +0 🆗 | reexec | 1m 7s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 🆗 | markdownlint | 0m 0s | | markdownlint was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 43m 41s | | trunk passed |
| +1 💚 | mvnsite | 0m 56s | | trunk passed |
| +1 💚 | shadedclient | 69m 27s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 41s | | the patch passed |
| +1 💚 | blanks | 0m 1s | | The patch has no blanks issues. |
| +1 💚 | mvnsite | 0m 42s | | the patch passed |
| +1 💚 | shadedclient | 23m 32s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | asflicense | 0m 42s | | The patch does not generate ASF License warnings. |
| | | 97m 3s | | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4478/8/artifact/out/Dockerfile |
| GITHUB PR | #4478 |
| Optional Tests | dupname asflicense mvnsite codespell detsecrets markdownlint |
| uname | Linux 84475ee88fc4 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / a56691d |
| Max. process+thread count | 533 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-aws U: hadoop-tools/hadoop-aws |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4478/8/console |
| versions | git=2.25.1 maven=3.6.3 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

Contributor

@mehakmeet mehakmeet left a comment

+1

@mehakmeet mehakmeet merged commit 6207ac4 into apache:trunk Oct 19, 2022
@mehakmeet
Contributor

Thanks @dannycjones. Can you please open a backport to branch-3.3? I'll hit merge after a green Yetus run there.

dannycjones added a commit to dannycjones/hadoop that referenced this pull request Oct 19, 2022
@dannycjones dannycjones deleted the HADOOP-18304-improve-committers-doc branch October 19, 2022 09:52
@dannycjones
Contributor Author

Thanks Mehakmeet! I've opened #5043 for backport to branch-3.3.

steveloughran pushed a commit that referenced this pull request Oct 19, 2022
asfgit pushed a commit that referenced this pull request Oct 19, 2022
HarshitGupta11 pushed a commit to HarshitGupta11/hadoop that referenced this pull request Nov 28, 2022