[SPARK-31167][BUILD] Refactor how we track Python test/build dependencies #27928
Conversation
cc @HyukjinKwon. I moved the requirements changes from #27912 to this PR.

Test build #119884 has finished for PR 27928 at commit

Jenkins, retest this please.

Test build #119896 has finished for PR 27928 at commit
dev/requirements.txt (Outdated)
Unidecode==0.04.19
sphinx
sphinx==2.3.1
numpy==1.18.1
I quickly googled and skimmed how other projects handle their requirements here and here. I still think it's more usual to specify a range rather than a specific version.
I am still not sure it's right to pin the versions yet. There's a trade-off to pinning. I think the Spark community is big enough to handle the issues that come from using the latest versions too. I would only pin a version when an issue difficult to fix is found.
cc'ing a few committers, somewhat arbitrarily, who might be interested here: @srowen, @holdenk, @dongjoon-hyun, @BryanCutler. WDYT about this?
I understand the trade-off mentioned by @HyukjinKwon. Actually, I tried to pin once before and dropped my PR because of that. +1 for @HyukjinKwon's advice: I would only pin a version when an issue difficult to fix is found.
The only exception is spark-rm/Dockerfile. We need to use specific versions and manage them explicitly there.
@HyukjinKwon - When looking at project dependencies, there is an important distinction between projects that are used as libraries and projects that are used as stand-alone applications.
If your project is a library, then you know others are importing you alongside other dependencies too. To minimize the chance of transitive dependency conflicts, you want to be flexible in how you specify your dependencies.
When your project is a stand-alone application, you don't have to worry about such things. You can pin every dependency to a specific version to get the most predictable and reliable build and runtime behavior.
In our case, the Spark build environment is more akin to a stand-alone application than a library. We don't need to worry about downstream users struggling with dependency conflicts. We can get the most stable build behavior by pinning everything, and there is no downside as far as I can tell.
I'll use Trio as an example again to illustrate my point:
- Trio is a library that others will typically import alongside many other dependencies. So in Trio's setup.py they are very flexible in how they specify their dependencies.
- Trio's test environment, on the other hand, is only used by Trio contributors. So Trio locks down every test requirement using pip-tools.
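To make that contrast concrete, here is a rough sketch; the package names and version numbers are illustrative and not Trio's or Spark's actual dependency lists.

    # Library-style specifiers (flexible, to minimize transitive dependency conflicts):
    attrs >= 19.2.0
    sortedcontainers

    # Dev/application-style specifiers (fully pinned, e.g. as emitted by pip-compile from pip-tools):
    attrs==20.3.0
    sortedcontainers==2.3.0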
There's a trade-off here as well: pinning is the most stable, but bugs that have already been fixed upstream can linger in the pinned versions, and it can still mess up a developer's environment. Arguably there are not many Python-dedicated developers in the Spark community who fluently use pyenv, virtualenv, or conda; I think most of them just use pip and a local installation.
I quickly skimmed requirements.txt files for dev or docs here. I skimmed the top 20 projects and found 6 instances:
- 3 of them were not pinned.
- 2 of them were partially pinned.
- 1 of them was completely pinned.
I agree that dev environments tend more toward specifying versions; however, I think it's still more common not to pin.
Would it help then if we expanded dev/README.md to show how to set up a virtual environment? I'm willing to do that.
If we don't want to ask devs to use virtual environments at all, then perhaps we need to fork dev/requirements.txt and have a version that pins everything, for use in CI and releases, and a version that pins nothing, for use by devs who don't use virtual environments.
Another alternative is the compromise currently standing in this PR, with some versions specified as major.minor.*.
And yet another alternative (which I personally wouldn't favor, but I know it's common) is to Dockerize the whole development environment, but that's a lot of work.
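For what it's worth, a minimal sketch of the virtual-environment setup such a README section might describe, assuming Python 3's built-in venv module (the environment path is arbitrary):

    python3 -m venv ~/spark-dev-env           # create an isolated environment
    source ~/spark-dev-env/bin/activate       # activate it in the current shell
    pip install -r dev/requirements.txt       # install the build/test dependencies into it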
I would suggest a note in the README to ensure people know how they can isolate this env if needed, plus pinning the minor version only? Is that close enough for consensus?
OK, I'll do that. I'll recommend in dev/README.md that users create a virtual environment and demonstrate how to do that. I'll also update the version specifiers to all be of the form major.minor.*.
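As a rough illustration of the major.minor.* form, using packages already discussed in this thread (the exact versions are whatever the PR settles on):

    flake8==3.7.*     # any 3.7.x bugfix release satisfies this
    sphinx==2.3.*
    numpy==1.18.*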
Hmm, on second thought, perhaps the README should be left to a separate PR. We already have advice on setting up a development environment in at least a couple of places, like Useful Developer Tools and docs/README.md.
Perhaps we should consolidate that advice over on the Useful Developer Tools page, since it fits in with the information already there.
Either that, or let's agree on some other approach to take. But I think we can defer any dev documentation changes to a follow-up PR.
I also would prefer not to pin specific versions and agree with #27928 (comment). It is good to have the community try the latest versions to surface any issues, but we should be very clear about which versions have been used in our CI, whether that's captured in a virtualenv or just noted in a readme, so there is always an obvious fallback version.
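One possible way to make the CI versions obvious, assuming CI installs from dev/requirements.txt, would be to record the resolved versions after installation; the output file name here is hypothetical:

    pip install -r dev/requirements.txt
    pip freeze > dev/ci-resolved-versions.txt   # hypothetical file; records the exact versions CI ran with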
dev/requirements.txt (Outdated)
@@ -1,5 +1,10 @@
flake8==3.5.0
flake8==3.7.*
While I still think we should pin every version here, perhaps this approach is a compromise we can agree on.
==3.7.* means pip will install the latest bugfix release in the 3.7 series. If you already have any 3.7 version installed, even if it's not the latest one, pip will consider the requirement satisfied and won't do anything. To force pip to upgrade to the latest 3.7 bugfix release when you already have a compatible version installed, request it via pip install --upgrade.
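A short sketch of that behavior, using flake8 only as the example package:

    pip install 'flake8==3.7.*'             # installs the newest 3.7.x if no 3.7 version is present
    pip install 'flake8==3.7.*'             # no-op if some 3.7.x is already installed
    pip install --upgrade 'flake8==3.7.*'   # upgrades to the newest 3.7.x bugfix release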
If we can assume these dependencies follow SemVer, we would be better off using wildcards on minor versions ...
I mentioned it elsewhere, but I'll mention it again here: linters like flake8 and pycodestyle introduce new checks in minor/feature releases. There is a very high chance that every new check they introduce will flag new problems and fail the build.
In fact, we saw exactly that behavior with pydocstyle just before we removed it. And I experienced this with pycodestyle in Flintrock before pinning the version.
I don't understand the point of waiting for the build to break before pinning or severely limiting the versions for libraries like these.
I'm still pretty OK with pinning, but, sure, allow taking later maintenance releases?

Test build #119934 has finished for PR 27928 at commit

Jenkins, retest this please.

Test build #119950 has finished for PR 27928 at commit

Test build #120116 has finished for PR 27928 at commit
srowen left a comment
Seems reasonable to me?
@srowen, with this change, we will have to maintain and keep ...

For the dependencies below, we have been testing via GitHub Actions. So far, I couldn't find any issues related to the versions. We shouldn't pin ...

The dependencies below are release-specific and tricky to test. I suspect it's better to let new dev people test them out and pin the versions later when it's needed?

FYI, there look to be some more occurrences, such as:
conda create -y -p "$VIRTUALENV_PATH" python=$python numpy pandas pip setuptools
pip install --upgrade pip wheel numpy
Maybe this is the disconnect between our points of view, because so far I haven't really been following your objections to pinning. Assuming we pin every library, why do we have to keep dev/requirements.txt up to date? As long as we can build the docs, run tests, and do whatever else we need to do as part of regular development, that file can remain frozen as-is for years. It's only when we specifically want to use some new feature of, say, Sphinx, that we need to bump versions. But that will happen very rarely, I imagine not more than once every couple of years. Does that address your concern? Why do you think we'd need to touch that file more often than once in a long while?
But the specification of numpy in ... Maybe we can improve this by replacing numpy in ...

A separate issue I raised earlier is that, if we don't want to pin our build/test dependencies, we need to figure out what to do about the Spark Docker image and CI. Either those will also source the unpinned requirements from the same file, or we go back to having the requirements specified in duplicate: with pinned versions for Docker and CI, and without pinned versions for developers. Obviously, I'd prefer to pin everything and keep it in one place, but if you want to go one of these routes, I guess I'll do that. I just want to understand and try to address your objections before going there.
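For illustration only, the kind of consolidation being discussed might look roughly like this; the "before" line is quoted from the occurrences above, while the "after" form is just one possible sketch, not the PR's actual change:

    # Before: packages listed inline in a script
    pip install --upgrade pip wheel numpy

    # After: the script defers to the shared requirements file
    pip install --upgrade pip wheel
    pip install -r dev/requirements.txt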
@BryanCutler @HyukjinKwon I'm not sure where the consensus lands here. I think part of your argument is that you do want to take some breakage that comes with automatic version updates. That seems a bit less valuable than relative stability, to me. I suppose I end up neutral, and we have some slight preferences for and against then? Is there any change here with broader support?

Yes. I want to take some breakage that comes with automatic version updates, also given that we haven't faced many issues by doing that. And yes, it seems like there are some differing preferences. Can we not touch the versions here and just keep the refactoring alone in this PR for now? I like the part about putting dependencies into dev/requirements.txt.
What does everyone think of the current approach taken in this PR? Are we good to go now (pending a committer test of the release script to avoid a repeat of #27534)?

Checking in again here. Is there something I can do to move this along?

I'm still generally OK with pinning things "lightly"; I am not sure how to test it though, as I'm also not an RM.
I manually copied the pinned requirements to the required location and kicked off the Docker build, basically trying to mimic parts of the release process. It ran all the way through successfully on my laptop.

@HyukjinKwon and @cloud-fan, are you still interested in this PR? Would either of you be able to test the release script changes?

If you'd prefer, I can back out any changes to the release machinery, just to be conservative. It's not ideal, but if that would help get this PR over the line, then I'll do it.

Test build #121414 has finished for PR 27928 at commit

Test build #122250 has finished for PR 27928 at commit
Test build #122267 has finished for PR 27928 at commit
Test build #123391 has finished for PR 27928 at commit
Now that 3.0 has been released, do we want to revisit this? It's ready to go, as far as I'm concerned, save for a little testing by a maintainer just to be safe.
Test build #124788 has finished for PR 27928 at commit

Retest this please.

Test build #124818 has finished for PR 27928 at commit

Retest this please.

Test build #124917 has finished for PR 27928 at commit

Seems unrelated to this PR. Retest this please.

I think this PR is failing to garner enough interest. If there's nothing I can do at this time to make it more appealing, I'll close it. I think the idea is a good one, but it needs buy-in from a release manager.

Hm, I'm mostly in favor of it. Hyukjin, if you're somewhat negative on it, OK, let's shelve this. If you were fairly neutral, I think we can do this.

Happy to revisit this when there is more support from a release manager.
What changes were proposed in this pull request?
This PR (SPARK-31167) refactors how we track various Python test and build dependencies so they are pinned and internally consistent.
Why are the changes needed?
This should make it easier to bump dependencies (since they are specified in fewer places) and promote more consistent build behavior across Docker, Jenkins, GitHub, and local development environments.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
I built the Docker image just far enough to confirm that the dependencies are installed correctly:
The GitHub workflow will be tested as part of this PR.