Spark JDK 17 for python 3.10 #896
Conversation
-FROM docker.io/apache/spark:3.5.4-python3
+FROM docker.io/apache/spark:3.5.4-java17-python3
Java 17 WFM.
Q: Is this Scala 2.13? I'd assume so, because there are separate images that have "scala2.12" in their tag name - but no images with "scala2.13".
So this is Scala 2.12. Spark defaults to 2.12 for their images.
USER root
RUN apt update
-RUN apt-get install -y diffutils wget curl python3.8-venv
+RUN apt-get install -y diffutils wget curl python3.10-venv
Would 3.12 or 3.13 work? Those versions still get bugfixes (not just security fixes).
It will. Locally I am using Python 3.13. However, if we want to use the official Spark image with a different Python version, we will need to compile Python from source. In my previous PR that reworks the test cases to pytest (paused for now; I will pick it up again soon), I used a Python base image and built our own Spark image on top of it. In that case we are not locked to whatever Python the Spark image ships with, and there is no need to compile from source, since setting up Spark is just installing a piece of software. Both approaches work; a rough sketch of the Python-base-image approach is included below.
It really comes down to this: if we want to use the official Spark image and avoid compiling from source, we are tied to the specific Python version it ships with (e.g. CentOS 7, which is also EOL, defaults to Python 2, and python3 there refers to 3.8, although a different Python 3 can be set up via another repo or compiled from source). With the official Spark images, the JDK 11 base image defaults to Python 3.8 and the JDK 17 one defaults to Python 3.10.
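For context, here is a minimal sketch of that Python-base-image approach. The base image tag, Spark/Hadoop versions, and download URL below are assumptions chosen for illustration, not what this PR uses:

# Sketch only: build our own Spark image on top of an official Python base,
# so the Python version is not tied to what the Spark image ships with.
FROM python:3.13-slim

ARG SPARK_VERSION=3.5.4
ARG HADOOP_PROFILE=hadoop3

# Spark still needs a JRE; Debian bookworm (the python:3.13-slim base) provides OpenJDK 17.
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-17-jre-headless curl \
    && rm -rf /var/lib/apt/lists/*

# Install Spark by unpacking a pre-built distribution (no compilation involved).
RUN curl -fsSL https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-${HADOOP_PROFILE}.tgz \
    | tar -xz -C /opt \
    && ln -s /opt/spark-${SPARK_VERSION}-bin-${HADOOP_PROFILE} /opt/spark

ENV SPARK_HOME=/opt/spark
ENV PATH="${SPARK_HOME}/bin:${PATH}"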
A bit more context for those base images: the official Spark JDK 11 image is based on eclipse-temurin:11-jre-focal, which is built on top of ubuntu:20.04, while the official Spark JDK 17 image is based on eclipse-temurin:17.0.3_7-jdk-jammy, which is built on top of ubuntu:22.04. Here is what python3 resolves to on each of them when using apt with the default repo (so to get a different version without compiling from source, we can also point to a different branch of the repo, or to a different repo entirely; see the example after the output below):
$ docker run -it ubuntu:20.04 /bin/bash
root@48dfe9519115:/# apt-cache madison python3
python3 | 3.8.2-0ubuntu2 | http://archive.ubuntu.com/ubuntu focal/main amd64 Packages
$ docker run -it ubuntu:22.04 /bin/bash
root@1a6950f03ad8:/# apt-cache madison python3
python3 | 3.10.6-1~22.04.1 | http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
python3 | 3.10.6-1~22.04 | http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages
python3 | 3.10.4-0ubuntu2 | http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages
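As a purely illustrative example of the "different repo" route (an assumption about one option, not something this PR does), the deadsnakes PPA publishes newer CPython packages for jammy, so a Python 3.12 interpreter could be installed without compiling from source:

$ docker run -it ubuntu:22.04 /bin/bash
root@<container>:/# apt update && apt install -y software-properties-common
root@<container>:/# add-apt-repository -y ppa:deadsnakes/ppa
root@<container>:/# apt install -y python3.12 python3.12-venv

Whether an extra PPA is acceptable is of course its own maintenance question.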
All good. Just a question ;)
However, I'd generally stay away from already EOL'd versions and soon-to-be EOL versions.
Understood. If preferred, I can open a PR that uses a base Python image and builds Spark on top of it. That way we can use the latest versions of both (though we would no longer be using the official Spark image, since they don't provide that kind of support). There is a similar request in Apache Iceberg as well, but their preferred route is to use the official Spark image whenever possible.
Let me know what you think. I can merge this one if there are no other concerns.
Nah, no need for more effort at this point IMO. It seems like a lot of initial and ongoing maintenance work for a low win. Sticking with the official Spark image is fine for me; I don't see a pressing need to add more burden.
Thanks for the effort to look into this!
Anytime.
flyrain left a comment
LGTM. This also reminds me that other places still use Spark 3.5.3, e.g. https://github.com/polaris-catalog/polaris/blob/23385a00d89bc43e6b869eb0ea65ecb97b6c9bb7/regtests/run.sh#L22-L22. We need to update those as well. Not a blocker for this PR, though.
Yes, I can take care of those a bit later this week as well. Now that the main branch is a lot more stable, I will pick up the work again to refactor the Python test cases and move all of them to pytest.
Due to the upcoming update of Poetry from version 1.x to 2.x (#873), the minimum Python version requirement will increase to 3.9 (3.8 is EOL already as of 2025). For more details, refer to Poetry PR #9692.
This upgrade presents compatibility issues with the current Spark image (docker.io/apache/spark:3.5.4-python3), which is based on eclipse-temurin:11-jre-focal and uses JDK 11 with Python 3.8. Official Spark images are only available with JDK 11 and JDK 17.
To resolve this, I propose we switch to the JDK 17 version of the Spark image, which uses eclipse-temurin:17-jammy and includes Python 3.10. This pull request updates our Spark image accordingly.
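As a quick sanity check (an illustrative command, not part of this change), the interpreter bundled in the proposed image can be inspected by overriding the entrypoint; it should report a 3.10.x version:

$ docker run --rm --entrypoint python3 docker.io/apache/spark:3.5.4-java17-python3 --version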