-
Notifications
You must be signed in to change notification settings - Fork 9.1k
HADOOP-16207 testMR failures #1115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
98fb648 to
5c10ce9
Compare
|
Testing: S3 Ireland. Not retested since rebasing to trunk; will do when I have 1h to spare. Currently this patch is not going to fix the testMR failures, what it will do is speed up test runs slightly, save the logs locally, and provide some more details on what it means when the MR job returns "false" |
|
Latest iteration should be more informative on a failure; I think I'd like to go one step further and see if I could actually put the logs of the failing MR app into the test log, which is a matter of:
|
af32ecd to
27faf9a
Compare
|
Tested: S3 Ireland. No failures on my test run. I do hope this modified test run will pick up on any failures which have been happening on other PRs, e.g #1123, so we can then track down what the failure was. |
|
Note that this test deletes four committer tests but only parameterizes three: directory, partitioned and magic. We don't do an explicit Staging committer, just its two subclasses. That's because those are the actual committers people are instructed to use, and we save one test run by cutting it. |
27faf9a to
54a8264
Compare
This patch is the preamble to any fix: working out what is wrong. With a side benefit of running the tests much faster. -Replaces the subclasses of the various committers with a paramterization of a single test. This is done via subclasses wihch define the specific test parameters and validate the output. -Which allows the tests to share the same cluster setup, rather than doing it once for each test -Makes the creation of an HDFS FS optional. As well as massively speeding up test setup, this means the logs of the test runs get collected. Oh, and did I say it's faster? Change-Id: I6c89424071e08cdb58ea5bf3e33418ada0e27b01
Change-Id: I30afdd3f1114ceded55ca43e47c701cff9ccdc54
Change-Id: Idd74dcef11da04c884f5d73347501c82a6380068 TODO: include the appId in the errors, as now there is >1 app in the same cluster, you need that.
Those about test names can be ignored, as we are using the numbering for executing tests in order Change-Id: I998077eb6147737fd6deba4d5a0c5c802f4a23dd
+ fix IDE hints Change-Id: I28ac4c744677bb6e5eae0c95dda0bafc7fbda208
…between this PR and trunk Change-Id: I9a10e1852e79a427f8e635da224d496cdf1b5d4e
Tracked down bug where staging dir was in a temp path under the NM's dir tree, so even though the fs was shared, only tasks running on the same NM had any output to commit. fix uses java tmp dir in junit process to set the staging dir property; also adds assertions for size of success file count so the problem is picked up on the specific job, rather than have the follow on job fail with a -1. Change-Id: Iab0ebdbf8c87518c4274eb2d9bf8468670b957d7 but: runs are now creating a tmp/ dir under hadoop-aws for staging files. This is wrong -we should be picking up the specific one for that test fork
* turn off all checks the disk free space. Without those, as soon as you get down to 40GB free, all minicluster tests fail. Way back we fixed this for mini HDFS, I believe -but not here. * create the test cluster on demand in the first setup() call which doesn't already have a cluster. * use an explicit absolute staging directory for consistent access across all nodes in the mini cluster. Change-Id: I41687100f68dd15625677d1ff7d1203854dc9061
54a8264 to
f915a07
Compare
|
🎊 +1 overall
This message was automatically generated. |
…on to the terasort tests. This is mainly done by changes to the shared superclass and making AbstractCommitTerasortIT non abstract and parameterized by committer name. Change-Id: Id48b5bf3a16ba693ad7bc15836e30325855bd0dc
|
💔 -1 overall
This message was automatically generated. |
pattern in the POM Change-Id: I999240c2139f0f12ca53e9425faf9c9b97ba2ca1
|
Full scale test: |
|
💔 -1 overall
This message was automatically generated. |
This isn't to directly debug problems in this commit, but it lines up the eest runs to collect more data needed to troubleshoot intermittent test failures. Change-Id: I8f4c7258b62a45a7a1c376e0315b429dde08117c
|
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The title/commit message changes to 'Speed up MiniCluster jobs'?
...tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/commit/AbstractYarnClusterITest.java
Show resolved
Hide resolved
| #log4j.logger.org.apache.hadoop.fs.s3a.s3guard=DEBUG | ||
| # if set to debug, this will log the PUT/DELETE operations on a store | ||
| #log4j.logger.org.apache.hadoop.fs.s3a.s3guard.Operations=DEBUG | ||
| log4j.logger.org.apache.hadoop.fs.s3a.s3guard.Operations=DEBUG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intentional? or left over from debugging.
| public void setup() throws Exception { | ||
| super.setup(); | ||
| requireScaleTestsEnabled(); | ||
| prepareToTerasort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In prepareToTerasort - the config is modified via getYarn().getConfig.set() ...
This is going to change the config for every test that makes use of this cluster. Likely better to create a new Configuration instance for the Job being submitted, and setting these config parameters at the job level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point -I'll clone
...tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/commit/AbstractYarnClusterITest.java
Show resolved
Hide resolved
| * suite. | ||
| */ | ||
| @SuppressWarnings("StaticNonFinalField") | ||
| private static ClusterBinding clusterBinding; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how well this is going to work, given there's at least 2 tests which are inheriting from this class.
Either could end up creating the cluster, and then terminating the cluster in '@afterclass' without knowing what the other test is doing - and I think this will lead to failures depending on the timing of executing ITestTerasortOnS3A and ITestS3ACommitterMRJob.
These tests likely need there own clusters, and teardown methods. Or maybe the parameter can be made non-static, so that individual test classes don't mess with each other. (They get their own clusters as a result). Afaik, the individual test classes will still be able to share clusters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I will need to look a bit closer at the JUnit runner and when there is after class methods are called. Before I did this cleanup I did actually have the bindings in each subclass, but at least in the test run this week it did actually work with everything in the superclass.
If I'm mistaken in my assumptions, I can just put the cleanup @afterclass into each of the subclasses, and call out in the javadocs that you have to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Junit @AfterClass javadocs
The @AfterClass methods declared in superclasses will be run after those of the current
class, unless they are shadowed in the current class.
Nothing to worry about then; good to remember that ordering too. BeforeClass and Before are the opposite: superclasses run before the subclasses. What isn't documented is the ordering of before/after methods within the same class...I wouldn't rely on anything there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand the concern I raised a little better now.
Will the two derived tests ever run in parallel on the same VM?
- If so, there's a problem with the static cluster being in the base class - since the cluster will be shared and shutdown when either of the tests complete.
If parallel tests will only use separate VMs - we should be OK.
If a single VM is used to run the tests serially - the "clusterBinding=null" is an important change, which I think you've already made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now we have parameterized tests there will never be >1 CommitMRJob
test running in parallel. That was a problem we had before, where you could have multiple ITest*CommitMRJob test runs spawning so many processes that your dev box comes to a halt if you do a many threaded run. With these settings the delegation token test runs can overlap with the CommitMRJob sequence -i plan to follow up this patch with parameterization there too.
As all the tests have opted not to use a mini HDFS cluster we avoid the extra overhead of those processes too
| * JUnit runs these test suites one parameterized binding at a time. | ||
| * </li> | ||
| * <li> | ||
| * The test suites are declared to be executed in ascending order, so |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little confused by the intent of this flow - i.e. test_000, test_100, test_200, test_500 in order.
- What happens if test_000 fails - will the rest of the tests still be run for the specific parameter?
- The names - test_000, test_100 etc don't say a lot about what is actually being done in the methods.
It may be easier to have a single test method - which takes care of setup, validation, etc - and allow override points within that for the individual committers. That would lead to a single test failing and not attempting to run the others after a previous dependent failure.
Also, mentioned making the cluster non-static in another comment. Given parameterized tests instantiate a new instance of the class each time - maybe just the '@afterclass' teerdown and cluster setup needs to move to this class (still as static).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test suite is tagged as @FixMethodOrder(MethodSorters.NAME_ASCENDING); they run in sequence. naming convention is as used elsewhere.
As to why the gaps between numbers: lets us put new values in without
renaming things. You never coded in Basic or APL?
Having separate tests
- isolates test method timeouts
- lets you debug individual operations.
- makes it clear which operation failed.
yes, the state of one is dependent on the others succeeding, which is why they try to have checks for the state of the store before they run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking: we could set markers on tests as they complete and so skip the successors. I'll do that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not concerned about the gap in numbering at all.
I understand the ordering enforced by the annotation. The terrasort test has this as well - and it makes sense to me there, at least to some extent, since each test is doing something independently, which can be verified (e.g. run teragen, then terasort, then teravalidate).
In this case - some of the methods are just performing setup operations for the actual test.
This could be structured as.
@Test testMRJob() {
committerTestBinding.test_000();
committerTestBinding.test_100();
.. .. ..
}
That would make sure exactly one test runs, and fails the moment there's an error from any of the per committer bindings. Is also easier to debug I think - for someone new, if they see a failure on test_000, test_100, test_500 etc - they would not try to figure out each of the failures independently. Will also have context on the failures via the methodName + parameter.
IAC, that's my take on this. Have a preference for a single test approach, but not tied to it. Your plan to skip any successors solves the main concern as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am satisfied with what we have here -it is consistent with all the other large sequential test runs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. The change that you mentioned about not running tests if a previous test fails - if that's already in, I'm +1 for the patch.
* track stages executed and use an assume() to skip subsequent tests on a failure. Note: this makes it impossible to execute a single stage in the debugger. * Terasort options are set in `applyCustomConfigOptions` for the jobconf * use unique paths for the different parameterised runs (I'd missed that) * javadocs * terasort size options (partitions, rows, ...) set in constants. Change-Id: I42a4ce46562641e8d942a81460a67eb01acf87aa
|
Here is a new revision. Tests in progress.
|
|
Also removed all changes which went near the committer code itself, to avoid conflict with #1442. This is a test code only patch now |
|
💔 -1 overall
This message was automatically generated. |
|
The change that you mentioned about not running tests if a previous test fails - if that's already in, I'm +1 for the patch. |
|
thanks -merged in! |
This patch is the preamble to any fix: working out what is wrong.
With a side benefit of running the tests much faster.
of a single test. This is done via subclasses wihch define the specific test
parameters and validate the output.
it once for each test
speeding up test setup, this means the logs of the test runs get collected.
Oh, and did I say it's faster? Not by much; it's still ~3 minutes, but I've cut the file size down on a scale test run (leave that for other testers).
Now all the logs from test runs go into target/yarn-${timestamp}; once you can find the relevant stdouts from the MR App master then you can see what's gone on. Added some details to the log messages to make it more meaningful
Testing: S3 Ireland with/without s3guard (dynamo)