
Conversation

@pp-mo
Member

@pp-mo pp-mo commented Jun 14, 2022

This promises to be an ultra-simple fix for our latest CI problems,
as noted here

Wellll, a sufficient stop-gap anyway.
It could well still be the right time to adopt #4503

@pp-mo pp-mo requested a review from trexfeathers June 14, 2022 13:21
@bjlittle bjlittle self-assigned this Jun 14, 2022
@bjlittle bjlittle self-requested a review June 14, 2022 13:33
Member

@bjlittle bjlittle left a comment


@pp-mo LGTM 👍

Let's see what Cirrus CI says...

@pp-mo
Member Author

pp-mo commented Jun 14, 2022

Following previous musings + the information in #4794 (comment):

@lbdreyer suggested that you can see, on the Cirrus memory graphs, that memory usage has increased.
In cirrus.yml this has so far been configured at 4 GB, but increased to 8 GB specifically for the test run.
It seems that we get tasks cancelled when usage goes over 8 GB.
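
For anyone unfamiliar with where that limit lives, here is a minimal, illustrative sketch of how a per-task memory limit is declared in a Cirrus CI config; the image name and values are assumptions for illustration, not copied from our actual cirrus.yml:

```yaml
# Illustrative .cirrus.yml fragment (NOT the real Iris config):
# each task's container declares its own cpu/memory budget, and Cirrus
# kills the task ("OOMKilled") if it exceeds the memory it asked for.
test_task:
  container:
    image: python:3.10   # assumed image, for illustration only
    cpu: 2
    memory: 8G           # the limit this PR experiments with raising (e.g. to 10G)
  tests_script:
    - python -m pytest
```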

Numbers look roughly like this ...

| PR | Date | Peak memory / GB | Tests passing? |
| --- | --- | --- | --- |
| #4600 | 2022-05-25 14:56 | 5.81 | ✔️ |
| #4767 | 2022-05-27 16:01 | 6.36 | ✔️ |
| #4775 | 2022-05-30 10:54 | 6.59 | ✔️ |
| #4727 | 2022-06-06 11:51 | 6.14 | ✔️ |
| #4783 | 2022-06-06 12:23 | 6.19 | ✔️ |
| #4782 | 2022-06-06 13:25 | 6.89 | ✔️ |
| #4784 | 2022-06-06 17:40 | 6.71 | ✔️ |
| #4778 | 2022-06-07 17:40 | 7.02 | ✔️ |
| #4785 | 2022-06-08 13:35 | 6.27 | ✔️ |
| #4792 | 2022-06-10 16:50 | 8.00 | |
| #4794 | 2022-06-13 14:21 | 8.00 | |

Exactly why the memory usage should have notably increased is unclear.
But according to tests performed with #4790, it is not due to code or lockfile changes.

@bjlittle
Member

> Exactly why the memory usage should have notably increased is unclear.
> But according to tests performed with #4790, it is not due to code or lockfile changes.

@pp-mo Could be something... could be nothing, but it would be good to get a handle on this.

That said, it may be that the run order of the tests has changed, for whatever reason, resulting in a spike?

Just a suggestion for the observable change in behaviour, based on nothing more than speculation 🤔

@pp-mo
Member Author

pp-mo commented Jun 14, 2022

> a spike?

Well, I've now added a few older ones to the table.
It really looks like a growing trend to me.
And, as I said, with #4790 the problem was observed on a re-spin, even with all the same lockfiles + base code.

I will try re-spinning the oldest one from the above list, which also has the lowest memory burden (#4600, previously peaked at 5.81 GB) ...

@pp-mo
Member Author

pp-mo commented Jun 14, 2022

I will try re-spinning the oldest one from the above list, which also has the lowest memory burden (#4600, previously peaked at 5.93 GB) ...

Here's a screenshot of the original test run against #4600 (since I think a re-spin will destroy this info).

(Screenshot: original test run for #4600)

@bjlittle
Member

According to https://github.com/SciTools/iris-test-data/releases, the test-data footprint continues to bloat.

Just a fact, not necessarily a smoking gun.

@bjlittle
Member

As an aside, I see that the benchmarking use of our test data is behind the curve... still on 2.5 rather than the latest offering, 2.9.

Not that it should matter here.

@pp-mo
Member Author

pp-mo commented Jun 14, 2022

> According to https://github.com/SciTools/iris-test-data/releases, the test-data footprint continues to bloat.

But I thought that, since we now only download versioned builds of iris-test-data, this should not affect the re-spins of "older PRs" that I have been doing?
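
(To spell out what I mean by "versioned builds": the download is pinned to a tagged release, so a re-spin of an old PR should fetch exactly the same archive it did originally. The variable name, tag format and download step below are only an illustrative sketch, not our actual configuration.)

```yaml
# Illustrative sketch only: pin iris-test-data to a tagged release so that
# re-spins are reproducible. Variable name and tag format are assumptions.
env:
  IRIS_TEST_DATA_VERSION: "2.5"
fetch_test_data_script:
  - wget -q https://github.com/SciTools/iris-test-data/archive/refs/tags/v${IRIS_TEST_DATA_VERSION}.zip
  - unzip -q v${IRIS_TEST_DATA_VERSION}.zip
```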

@pp-mo
Member Author

pp-mo commented Jun 14, 2022

Here is the re-spin of #4600, mentioned above:
(Screenshot: newer re-spin test run for #4600)

I'm not clear why we "only" have a peak of 7.39 GB, when we've previously seen 8.00.
But it is failing with tasks cancelled + the "Container errored with 'OOMKilled'" message.

So, it does look like 5.93 GB --> 7.39 GB,
with no env or code changes.

@pp-mo
Member Author

pp-mo commented Jun 14, 2022

It seems that the magic of asking for 10 GB has maybe triggered another difficulty.
Tests were a bit slow to start, but have now been running for ~50 mins, apparently in the "run tests" step, but still not finished and with no output visible??

(UPDATE: I was confused; it was the doctests that had problems.)

@pp-mo
Member Author

pp-mo commented Jun 14, 2022

> asking for 10 GB has maybe triggered another difficulty.

I'm going to cancel + respin it ...

@rcomer
Member

rcomer commented Jun 14, 2022

Doctest memory also exploded at #4796 and #4797.

@pp-mo
Member Author

pp-mo commented Jun 14, 2022

> @pp-mo I'm going to cancel + respin it ...

> @rcomer Doctest memory also exploded at #4796 and #4797.

Ok, the tests did now run
(but only on the second try?? UPDATE: I was wrong about this).
But the doctests blew their 4 GB limit.

There is definitely something going on here. Possibly it's something on the Cirrus platform side?

@pp-mo
Member Author

pp-mo commented Jun 14, 2022

I know this is still a kluge (3941fe3), but a temporary quick win might still be worth it right now.

@rcomer
Member

rcomer commented Jun 16, 2022

An observation: tests in this PR seem to be running with 32 workers, which seems rather a lot. In the #4503 GHA version, it's only 2 workers.

@pp-mo
Member Author

pp-mo commented Jun 16, 2022

> An observation: tests in this PR seem to be running with 32 workers, which seems rather a lot. In the #4503 GHA version, it's only 2 workers.

FYI, I think we are not proceeding with this.
Hoping to go with #4503 instead now.

@rcomer
Member

rcomer commented Jun 16, 2022

> An observation: tests in this PR seem to be running with 32 workers, which seems rather a lot. In the #4503 GHA version, it's only 2 workers.

> FYI, I think we are not proceeding with this. Hoping to go with #4503 instead now.

Yep, it's just nice to have an explanation. #4767 only used 15 workers, so maybe they have been creeping up for some reason.
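
(For the record, a hedged sketch of why the worker count matters, assuming pytest-xdist drives the parallel run: `-n auto` starts one worker per detected CPU, so a large Cirrus container can end up with something like 32 workers, each adding to the memory peak, whereas an explicit `-n 2` caps the count. Names and values below are illustrative, not taken from our actual config.)

```yaml
# Illustrative sketch (assumes pytest-xdist; not the real Iris or GHA config).
test_task:
  container:
    cpu: 2
    memory: 8G
  tests_script:
    # "-n auto" would spawn one worker per detected CPU (e.g. 32 on a big box);
    # an explicit "-n 2" caps the worker count, trading speed for a lower memory peak.
    - python -m pytest -n 2
```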

@trexfeathers
Contributor

Closed by #4503

@trexfeathers trexfeathers removed their request for review June 17, 2022 11:30
@pp-mo pp-mo deleted the test_10gb branch April 28, 2023 13:42