-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Improve Docker image caching and testing #78552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve Docker image caching and testing #78552
Conversation
Firstly: we tag our Docker images with various pieces of information, including a timestamp for when the image was built. However, this makes it impossible completely cache the image. When developing the Docker images, it's very tedious to completely rebuild an image for every single change. Therefore, provided we're not building a proper release build, we fix the build time to midnight so that the Docker build cache is usable. Secondly: the `DockerBuildTask` outputs a marker file to indicate that an image has been built, but that isn't enough for a meaningful up-to-date check by Gradle. Improve this by fetching the newly-built image's hash, and writing that to the output file. Thirdly: improve the Docker tests to make them more ergonomic, and also disble `ingest.geoip.downloader.enabled` by default.
|
Pinging @elastic/es-delivery (Team:Delivery) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can simplify this to only include the change to how we handle noCache, the other changes in practice don't have meaningful effects.
Firstly: we tag our Docker images with various pieces of information,
including a timestamp for when the image was built. However, this makes
it impossible completely cache the image.
Not exactly. It makes it impossible to cache those layers, but since we're following best practice and adding these labels at the end of our Dockerfile, we can still cache all the expensive stuff. In practice, rebuilding the layers that simply add the labels is trivial and amounts to almost no time at all.
Secondly: the DockerBuildTask outputs a marker file to indicate that
an image has been built, but that isn't enough for a meaningful
up-to-date check by Gradle.
In this case the content of the marker file is mostly irrelevant. It's simply there to tell Gradle "hey, you've run this task". The only time Gradle actually cares about the content of the file is if something were to modify it, but the only thing that will do that is a clean, which will just delete it, and in any case, any modifications (regardless of the original content) would simply trigger the task to run again.
The marker file is just there to be a placeholder for the fact that the "output" of this task is not a simple file or directory that Gradle can track. The bit that's important is not the output, it's the inputs. Which is to say that so long as that marker file exists, and the inputs haven't changed since the last execution, Gradle will continue to treat the task as UP-TO-DATE. The reality is that basically any code change (aside from plugins) will change the ES distro, so we have to rebuild the Docker image. This already works, so there's no improvement to make here.
I did some experiments locally. All I did was build the image, make a code change, then rebuild. Comparing this PR to simply changing master to use noCache = false the results are identical.
This pull request: 48.6s
master w/ cache: 48.2s
So the problem wasn't we were inefficiently leveraging Docker caching, it's that we weren't using it at all. I'd simply reduce this to just set noCache = false, but instead of doing this conditionally based on whether it's a snapshot build, I'd use BuildParams.isCi. I think we always want to rebuild in CI, even for snapshot build. What we really care about is local developer builds being quicker, which using the Docker cache will definitely improve.
|
The reason for writing the image SHA to the output file is so that Gradle knows to re-run tasks that depend on the Docker image - specifically the Cloud ESS image, which depends on the Cloud image. I was finding that when I edited the Cloud image, it was rebuilt but the Cloud ESS one was skipped. This is non-obvious and means you need to remove the marker file or run a clean to rebuild the ESS image. As for the timestamp stuff - you're right that these are quick tasks and we do them at the end of the build in order to leverage the layer cache, but they still change the output SHA, which means that if you're running tests, the export image step still runs, and that's not lightening quick. So minor cleanups, comment changes or formatting changes in the Dockerfile slow down the build / test cycle. I was hitting this myself, so it is a DX issue, albeit a pretty minor one. I'll make the |
Right. This would only come into play if the output of the normal docker image was an input to the cloud image task but I don't see where that's happening. I assume you're referring to the ESS image, which depends on the cloud image. There's a task dependency there, but we aren't wiring up any inputs, so as it is, updating the marker file to include a hash wouldn't change any behavior. So to get this working the way we want we'll want to make the output for the cloud build image an input to the ess cloud image. Also, like I said above, it's not super critical what the content of that marker file is, only that it changes to indicate to downstream tasks that they need to run again. So it can just be a timestamp for all we care, and that'll save us to work of having to shell out to |
|
Nevermind, I see where you're adding the input now. That makes sense. The point about the content of the marker file stands though. It doesn't have to be a hash, it could be anything that simply changes on every run. The only benefit of using a hash was that if you say, made a change, built, then reverted and built again, the build cache could kick in since we'd already built that input. Docker tasks aren't cacheable obviously because the "output" isn't a file on disk so it's not applicable here. |
|
The image hash is a digest of the contents, so surely it's the most appropriate data to put in the marker file? Anything else is just arbitrary. If I adjust the |
|
@mark-vieira I'd like to get this merged so that I can reduce the diff in #77544. |
Good point. That's a scenario I hadn't considered. You are definitely correct, using the hash is the more "correct" and it does better handle some scenarios. I'm good with keeping the implementation as is. Thanks, Rory! |
|
Backported to |
Firstly: we tag our Docker images with various pieces of information, including a timestamp for when the image was built. However, this makes it impossible completely cache the image. When developing the Docker images, it's very tedious to completely rebuild an image for every single change. Therefore, provided we're not building a proper release build, we fix the build time to midnight so that the Docker build cache is usable. Secondly: the `DockerBuildTask` outputs a marker file to indicate that an image has been built, but that isn't enough for a meaningful up-to-date check by Gradle. Improve this by fetching the newly-built image's hash, and writing that to the output file. Thirdly: improve the Docker tests to make them more ergonomic, and also disable `ingest.geoip.downloader.enabled` by default. Fourthly: add missing test coverage for sourcing settings from env vars.
…' into feature/data_stream_support_routing * wjp/feature/data_stream_support_routing: (44 commits) Revert "Adjust /_cat/templates not to request all metadata (elastic#78812)" Allow indices lookup to be built lazily (elastic#78745) [DOCS] Document default security in alpha2 (elastic#78227) Add cluster applier stats (elastic#77552) Fix failing URLDecodeProcessorTests::testProcessor test (elastic#78690) Upgrade to lucene snapshot ba75dc5e6bf (elastic#78817) Adjust /_cat/templates not to request all metadata (elastic#78812) Simplify build plugin license handling (elastic#77009) Fix SearchableSnapshotsBlobStoreCacheIntegTests.testBlobStoreCache (elastic#78616) Improve Docker image caching and testing (elastic#78552) Load knn vectors format with mmapfs (elastic#78724) Fix date math zone test to use negative minutes (elastic#78796) Changing name of shards field in node/stats api to shard_stats (elastic#78531) [DOCS] Fix system index refs in restore tutorial (elastic#78582) Add previously removed settings back for 8.0 (elastic#78784) TSDB: Fix template name in test Add a system property to forcibly format everything (elastic#78768) Revert "Adding config so that some tests will break if over-the-wire encryption fails (elastic#78409)" (elastic#78787) Must date math test failure Adding config so that some tests will break if over-the-wire encryption fails (elastic#78409) ...
Firstly: we tag our Docker images with various pieces of information,
including a timestamp for when the image was built. However, this makes
it impossible completely cache the image. When developing the Docker
images, it's very tedious to completely rebuild an image for every
single change. Therefore, provided we're not building a proper release
build, we fix the build time to midnight so that the Docker build cache
is usable.
Secondly: the
DockerBuildTaskoutputs a marker file to indicate thatan image has been built, but that isn't enough for a meaningful
up-to-date check by Gradle. Improve this by fetching the newly-built
image's hash, and writing that to the output file.
Thirdly: improve the Docker tests to make them more ergonomic, and also
disble
ingest.geoip.downloader.enabledby default.Fourthly: add missing test coverage for sourcing settings from env vars.