
Conversation

@zhengruifeng
Contributor

What changes were proposed in this pull request?

Cache python deps for linter and documentation

Why are the changes needed?

1. To avoid unnecessary installation: some packages were installed multiple times.
2. To centralize the installations: in the future, only the Dockerfile should need to be modified.
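
The centralization described in point 2 might look like the following Dockerfile fragment. This is only an illustrative sketch — the base image, package names, and pins below are hypothetical, not the actual contents of the PR:

```dockerfile
# Hypothetical sketch: install all lint/doc Python deps in one place,
# so CI jobs reuse the cached image layer instead of re-installing
# the same packages in several workflow steps.
FROM ubuntu:22.04

RUN apt-get update && apt-get install -y python3-pip

# Illustrative package list and pins; the real Dockerfile defines its own.
RUN pip install \
    'flake8==3.9.0' \
    'mypy==0.982' \
    'sphinx==4.5.0' \
    'jinja2<3.1'
```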

Does this PR introduce any user-facing change?

No, infra-only.

How was this patch tested?

CI.

Was this patch authored or co-authored using generative AI tooling?

No.

Contributor Author

@zhengruifeng zhengruifeng Nov 22, 2023


This step is also used in 3.3/3.4/3.5, so it is moved to `Install dependencies for documentation generation for branch-3.3, branch-3.4, branch-3.5`.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1 for another attempt. Thank you for continuing to improve this, @zhengruifeng.

cc @LuciferYang , too, because he found the last outage on the release branches.

Contributor

@LuciferYang LuciferYang left a comment


+1, LGTM

@zhengruifeng zhengruifeng force-pushed the infra_move_dep_dockerfile branch from 77a14d5 to ba62fd1 Compare November 22, 2023 09:07
@zhengruifeng
Contributor Author

Please hold on; it seems there are env conflicts between PySpark and lint.
I am converting this to a draft and will look into it.

@zhengruifeng zhengruifeng marked this pull request as draft November 22, 2023 09:11
@zhengruifeng zhengruifeng changed the title [SPARK-46051][INFRA] Cache python deps for linter and documentation [WIP][SPARK-46051][INFRA] Cache python deps for linter and documentation Nov 22, 2023
@zhengruifeng zhengruifeng force-pushed the infra_move_dep_dockerfile branch from ba62fd1 to 439c712 Compare November 22, 2023 09:46
@zhengruifeng
Contributor Author

Unfortunately, there is a conflict in jinja2 that causes multiple PySpark tests to fail.
I plan to test miniconda later.
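
The jinja2 conflict described above boils down to two consumers needing incompatible version ranges of the same package. A minimal, self-contained Python sketch of that kind of check (the pins below are illustrative, not the actual constraints in the Spark build):

```python
# Hypothetical sketch of a version-range conflict like the jinja2 one:
# one toolchain needs a version below some ceiling, another needs a floor,
# and a single installed version must satisfy both.

def parse(version):
    """Turn a dotted version string like '3.0.3' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

def satisfies(installed, minimum=None, below=None):
    """Check whether `installed` meets a `>= minimum` and/or `< below` pin."""
    v = parse(installed)
    if minimum is not None and v < parse(minimum):
        return False
    if below is not None and v >= parse(below):
        return False
    return True

# Illustrative pins: the doc toolchain wants jinja2 < 3.1 while tests
# want >= 3.0, so 3.0.x works for both but 3.1.x breaks the doc build.
print(satisfies("3.0.3", minimum="3.0", below="3.1"))  # True
print(satisfies("3.1.2", minimum="3.0", below="3.1"))  # False
```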

@zhengruifeng zhengruifeng force-pushed the infra_move_dep_dockerfile branch 5 times, most recently from 018762e to 7201d76 Compare November 27, 2023 09:21
@zhengruifeng zhengruifeng force-pushed the infra_move_dep_dockerfile branch from 779a341 to 4b30f7d Compare November 27, 2023 09:48
# See also https://issues.apache.org/jira/browse/SPARK-35375.
RUN conda create -n doc python=3.9

RUN conda run -n doc pip install \
Contributor

@nchammas nchammas Dec 14, 2023


Why are we listing out individual dependencies here vs. listing them in a requirements file (or equivalent) that we can use across Docker, GitHub Actions, and local build scripts? Don't we want our build and test dependencies to be consistent?

I made a past attempt at this over in #27928. It failed because, in addition to building a shared set of build and test dependencies, it also pinned transitive build and test dependencies, which the reviewers weren't keen on. But we can separate the two ideas from each other.

IMO there should be a single requirements file for build and test dependencies (whether or not it pins transitive dependencies is a separate issue), and that file should be used everywhere. What do you think?

I also don't follow why we need to pull in conda. What is it getting us over vanilla pip?

Contributor Author


Hi @nchammas, actually I want to give up on this PR to cache the dependencies for linter and documentation, since there is still a weird issue building the documentation with conda in CI (while it works well locally).

The reason this PR tried conda was that there were env conflicts between PySpark and lint/doc; that is, installing the lint/doc dependencies would break some tests.

As to why not use a requirements file in CI, I guess one problem may be that a modification to the requirements file won't automatically refresh the cached testing image?

Contributor


As to why not use a requirements file in CI, I guess one problem may be that a modification to the requirements file won't automatically refresh the cached testing image?

That shouldn't be the case. Assuming you COPY the requirements file into the image, changing the file will invalidate the cache:

The first encountered COPY instruction will invalidate the cache for all following instructions from the Dockerfile if the contents of <src> have changed. This includes invalidating the cache for RUN instructions.

Also:

For the ADD and COPY instructions, the modification time and size file metadata is used to determine whether cache is valid. During cache lookup, cache is invalidated if the file metadata has changed for any of the files involved.
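
The pattern the quoted docs describe might look like the following Dockerfile fragment. The file paths and base image here are illustrative, not the actual Spark layout:

```dockerfile
# Hypothetical sketch of COPY-based cache invalidation: whenever
# requirements.txt changes, the COPY layer's cache is invalidated,
# which also invalidates the RUN layer below it, so the image is
# rebuilt with the new dependencies instead of serving a stale cache.
FROM python:3.9-slim

COPY dev/requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt
```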

Contributor Author


Good to know. You are right.

@zhengruifeng zhengruifeng deleted the infra_move_dep_dockerfile branch December 15, 2023 12:07