Skip to content

Conversation

@pugnascotia
Copy link
Contributor

Firstly: we tag our Docker images with various pieces of information,
including a timestamp for when the image was built. However, this makes
it impossible completely cache the image. When developing the Docker
images, it's very tedious to completely rebuild an image for every
single change. Therefore, provided we're not building a proper release
build, we fix the build time to midnight so that the Docker build cache
is usable.

Secondly: the DockerBuildTask outputs a marker file to indicate that
an image has been built, but that isn't enough for a meaningful
up-to-date check by Gradle. Improve this by fetching the newly-built
image's hash, and writing that to the output file.

Thirdly: improve the Docker tests to make them more ergonomic, and also
disble ingest.geoip.downloader.enabled by default.

Fourthly: add missing test coverage for sourcing settings from env vars.

Firstly: we tag our Docker images with various pieces of information,
including a timestamp for when the image was built. However, this makes
it impossible completely cache the image. When developing the Docker
images, it's very tedious to completely rebuild an image for every
single change. Therefore, provided we're not building a proper release
build, we fix the build time to midnight so that the Docker build cache
is usable.

Secondly: the `DockerBuildTask` outputs a marker file to indicate that
an image has been built, but that isn't enough for a meaningful
up-to-date check by Gradle. Improve this by fetching the newly-built
image's hash, and writing that to the output file.

Thirdly: improve the Docker tests to make them more ergonomic, and also
disble `ingest.geoip.downloader.enabled` by default.
@pugnascotia pugnascotia added >enhancement >non-issue >test Issues or PRs that are addressing/adding tests :Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts v8.0.0 v7.16.0 labels Oct 1, 2021
@pugnascotia pugnascotia requested a review from breskeby October 1, 2021 10:17
@elasticmachine elasticmachine added the Team:Delivery Meta label for Delivery team label Oct 1, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

Copy link
Contributor

@mark-vieira mark-vieira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify this to only include the change to how we handle noCache, the other changes in practice don't have meaningful effects.

Firstly: we tag our Docker images with various pieces of information,
including a timestamp for when the image was built. However, this makes
it impossible completely cache the image.

Not exactly. It makes it impossible to cache those layers, but since we're following best practice and adding these labels at the end of our Dockerfile, we can still cache all the expensive stuff. In practice, rebuilding the layers that simply add the labels is trivial and amounts to almost no time at all.

Secondly: the DockerBuildTask outputs a marker file to indicate that
an image has been built, but that isn't enough for a meaningful
up-to-date check by Gradle.

In this case the content of the marker file is mostly irrelevant. It's simply there to tell Gradle "hey, you've run this task". The only time Gradle actually cares about the content of the file is if something were to modify it, but the only thing that will do that is a clean, which will just delete it, and in any case, any modifications (regardless of the original content) would simply trigger the task to run again.

The marker file is just there to be a placeholder for the fact that the "output" of this task is not a simple file or directory that Gradle can track. The bit that's important is not the output, it's the inputs. Which is to say that so long as that marker file exists, and the inputs haven't changed since the last execution, Gradle will continue to treat the task as UP-TO-DATE. The reality is that basically any code change (aside from plugins) will change the ES distro, so we have to rebuild the Docker image. This already works, so there's no improvement to make here.

I did some experiments locally. All I did was build the image, make a code change, then rebuild. Comparing this PR to simply changing master to use noCache = false the results are identical.

This pull request: 48.6s
master w/ cache: 48.2s

So the problem wasn't we were inefficiently leveraging Docker caching, it's that we weren't using it at all. I'd simply reduce this to just set noCache = false, but instead of doing this conditionally based on whether it's a snapshot build, I'd use BuildParams.isCi. I think we always want to rebuild in CI, even for snapshot build. What we really care about is local developer builds being quicker, which using the Docker cache will definitely improve.

@pugnascotia
Copy link
Contributor Author

The reason for writing the image SHA to the output file is so that Gradle knows to re-run tasks that depend on the Docker image - specifically the Cloud ESS image, which depends on the Cloud image. I was finding that when I edited the Cloud image, it was rebuilt but the Cloud ESS one was skipped. This is non-obvious and means you need to remove the marker file or run a clean to rebuild the ESS image.

As for the timestamp stuff - you're right that these are quick tasks and we do them at the end of the build in order to leverage the layer cache, but they still change the output SHA, which means that if you're running tests, the export image step still runs, and that's not lightening quick. So minor cleanups, comment changes or formatting changes in the Dockerfile slow down the build / test cycle. I was hitting this myself, so it is a DX issue, albeit a pretty minor one.

I'll make the BuildParams.isCi change.

@mark-vieira
Copy link
Contributor

The reason for writing the image SHA to the output file is so that Gradle knows to re-run tasks that depend on the Docker image - specifically the Cloud ESS image, which depends on the Cloud image. I was finding that when I edited the Cloud image, it was rebuilt but the Cloud ESS one was skipped.

Right. This would only come into play if the output of the normal docker image was an input to the cloud image task but I don't see where that's happening. I assume you're referring to the ESS image, which depends on the cloud image. There's a task dependency there, but we aren't wiring up any inputs, so as it is, updating the marker file to include a hash wouldn't change any behavior.

So to get this working the way we want we'll want to make the output for the cloud build image an input to the ess cloud image. Also, like I said above, it's not super critical what the content of that marker file is, only that it changes to indicate to downstream tasks that they need to run again. So it can just be a timestamp for all we care, and that'll save us to work of having to shell out to docker to grab the image sha.

@mark-vieira
Copy link
Contributor

Nevermind, I see where you're adding the input now. That makes sense. The point about the content of the marker file stands though. It doesn't have to be a hash, it could be anything that simply changes on every run. The only benefit of using a hash was that if you say, made a change, built, then reverted and built again, the build cache could kick in since we'd already built that input. Docker tasks aren't cacheable obviously because the "output" isn't a file on disk so it's not applicable here.

@pugnascotia
Copy link
Contributor Author

The image hash is a digest of the contents, so surely it's the most appropriate data to put in the marker file? Anything else is just arbitrary. If I adjust the Dockerfile in a way that doesn't change the layers (commenting, formatting), then downstream tasks shouldn't be affected, so it makes sense to me to output the image hash.

@pugnascotia
Copy link
Contributor Author

@mark-vieira I'd like to get this merged so that I can reduce the diff in #77544.

@mark-vieira
Copy link
Contributor

mark-vieira commented Oct 6, 2021

If I adjust the Dockerfile in a way that doesn't change the layers (commenting, formatting), then downstream tasks shouldn't be affected, so it makes sense to me to output the image hash.

Good point. That's a scenario I hadn't considered. You are definitely correct, using the hash is the more "correct" and it does better handle some scenarios. I'm good with keeping the implementation as is. Thanks, Rory!

@pugnascotia pugnascotia merged commit 67e310e into elastic:master Oct 7, 2021
@pugnascotia pugnascotia deleted the docker-caching-improvements branch October 7, 2021 08:19
@pugnascotia
Copy link
Contributor Author

Backported to 7.x in f75ba8b.

pugnascotia added a commit that referenced this pull request Oct 7, 2021
Firstly: we tag our Docker images with various pieces of information,
including a timestamp for when the image was built. However, this makes
it impossible completely cache the image. When developing the Docker
images, it's very tedious to completely rebuild an image for every
single change. Therefore, provided we're not building a proper release
build, we fix the build time to midnight so that the Docker build cache
is usable.

Secondly: the `DockerBuildTask` outputs a marker file to indicate that
an image has been built, but that isn't enough for a meaningful
up-to-date check by Gradle. Improve this by fetching the newly-built
image's hash, and writing that to the output file.

Thirdly: improve the Docker tests to make them more ergonomic, and also
disable `ingest.geoip.downloader.enabled` by default.

Fourthly: add missing test coverage for sourcing settings from env vars.
wjp719 added a commit to wjp719/elasticsearch that referenced this pull request Oct 7, 2021
…' into feature/data_stream_support_routing

* wjp/feature/data_stream_support_routing: (44 commits)
  Revert "Adjust /_cat/templates not to request all metadata (elastic#78812)"
  Allow indices lookup to be built lazily (elastic#78745)
  [DOCS] Document default security in alpha2 (elastic#78227)
  Add cluster applier stats (elastic#77552)
  Fix failing URLDecodeProcessorTests::testProcessor test (elastic#78690)
  Upgrade to lucene snapshot ba75dc5e6bf (elastic#78817)
  Adjust /_cat/templates not to request all metadata (elastic#78812)
  Simplify build plugin license handling (elastic#77009)
  Fix SearchableSnapshotsBlobStoreCacheIntegTests.testBlobStoreCache (elastic#78616)
  Improve Docker image caching and testing (elastic#78552)
  Load knn vectors format with mmapfs (elastic#78724)
  Fix date math zone test to use negative minutes (elastic#78796)
  Changing name of shards field in node/stats api to shard_stats (elastic#78531)
  [DOCS] Fix system index refs in restore tutorial (elastic#78582)
  Add previously removed settings back for 8.0 (elastic#78784)
  TSDB: Fix template name in test
  Add a system property to forcibly format everything (elastic#78768)
  Revert "Adding config so that some tests will break if over-the-wire encryption fails (elastic#78409)" (elastic#78787)
  Must date math test failure
  Adding config so that some tests will break if over-the-wire encryption fails (elastic#78409)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Delivery/Packaging RPM and deb packaging, tar and zip archives, shell and batch scripts >enhancement >non-issue Team:Delivery Meta label for Delivery team >test Issues or PRs that are addressing/adding tests v7.16.0 v8.0.0-beta1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants