[WIP][SPARK-46051][INFRA] Cache python deps for linter and documentation #43953
.github/workflows/build_and_test.yml
Outdated
This step is also used in 3.3/3.4/3.5, so move it to the "Install dependencies for documentation generation for branch-3.3, branch-3.4, branch-3.5" step.
dongjoon-hyun
left a comment
+1 for another attempt at this. Thank you for continuing to improve this, @zhengruifeng.
cc @LuciferYang, too, because he found the last outage on the release branches.
LuciferYang
left a comment
+1, LGTM
77a14d5 to ba62fd1
please hold on, it seems there are env conflicts between
ba62fd1 to 439c712
unfortunately, there is conflict in
018762e to 7201d76
refer to miniconda dockerfile fix version
779a341 to 4b30f7d
# See also https://issues.apache.org/jira/browse/SPARK-35375.
RUN conda create -n doc python=3.9

RUN conda run -n doc pip install \
Why are we listing out individual dependencies here vs. listing them in a requirements file (or equivalent) that we can use across Docker, GitHub Actions, and local build scripts? Don't we want our build and test dependencies to be consistent?
I made a past attempt at this over in #27928. It failed because, in addition to building a shared set of build and test dependencies, it also pinned transitive build and test dependencies, which the reviewers weren't keen on. But we can separate the two ideas from each other.
IMO there should be a single requirements file for build and test dependencies (whether or not it pins transitive dependencies is a separate issue), and that file should be used everywhere. What do you think?
I also don't follow why we need to pull in conda. What is it getting us over vanilla pip?
Hi @nchammas, actually I want to give up on this PR to cache the dependencies for linter and documentation, since there is still a weird issue when building the documentation with conda in CI (while it works fine locally).
The reason this PR tried conda was that there were env conflicts between PySpark and lint/doc; that is, installing the lint/doc dependencies breaks some tests.
As to why not use a requirements file in CI, I guess one problem may be that a modification to the requirements file won't automatically refresh the cached testing image?
As to why not use a requirements file in CI, I guess one problem may be that a modification to the requirements file won't automatically refresh the cached testing image?
That shouldn't be the case. Assuming you COPY the requirements file into the image, changing the file will invalidate the cache:
The first encountered COPY instruction will invalidate the cache for all following instructions from the Dockerfile if the contents of <src> have changed. This includes invalidating the cache for RUN instructions.
Also:
For the ADD and COPY instructions, the modification time and size file metadata is used to determine whether cache is valid. During cache lookup, cache is invalidated if the file metadata has changed for any of the files involved.
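As a minimal sketch of the pattern described above (not the Dockerfile from this PR), the requirements file can be copied into the image before the install step. The base image, the file name requirements.txt, and the /tmp path are assumptions made for illustration only.

```dockerfile
# Sketch only: base image, file name, and paths are assumptions, not taken from this PR.
FROM python:3.9-slim

# Instructions above this point are served from cache as long as they are unchanged.

# Copying the requirements file ties the cache of every later instruction to this file.
COPY requirements.txt /tmp/requirements.txt

# This step is rebuilt whenever requirements.txt changes, because the COPY above
# invalidates the cache for all instructions that follow it.
RUN python -m pip install --no-cache-dir -r /tmp/requirements.txt
```

With this layout, editing requirements.txt should be enough to trigger a rebuild of the install layer, and the same file could also be passed to pip install -r in GitHub Actions and local build scripts.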
Good to know. You are right.
What changes were proposed in this pull request?
Cache Python dependencies for linter and documentation.
Why are the changes needed?
1. To avoid unnecessary installation: some packages were installed multiple times.
2. To centralize the installations: only the Dockerfile should need to be modified in the future.
Does this PR introduce any user-facing change?
No, infra-only.
How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
No.