Conversation

@MonkeyCanCode (Contributor) commented Jan 28, 2025

Due to the upcoming update of Poetry from version 1.x to 2.x (#873), the minimum Python version requirement will increase to 3.9 (3.8 reached EOL in October 2024). For more details, refer to Poetry PR #9692.

This upgrade presents compatibility issues with the current Spark image (docker.io/apache/spark:3.5.4-python3), which is based on eclipse-temurin:11-jre-focal and uses JDK 11 with Python 3.8. Official Spark images are only available with JDK 11 and JDK 17.

To resolve this, I propose we switch to the JDK 17 version of the Spark image, which uses eclipse-temurin:17-jammy and includes Python 3.10. This pull request updates our Spark image accordingly.
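
As a quick sanity check (a hypothetical session on my part; the exact patch version may differ), the proposed image should report Python 3.10:

$ docker run --rm docker.io/apache/spark:3.5.4-java17-python3 python3 --version
Python 3.10.12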

-FROM docker.io/apache/spark:3.5.4-python3
+FROM docker.io/apache/spark:3.5.4-java17-python3
Member:

Java 17 WFM.

Member:

Q: Is this Scala 2.13? I'd assume so, because there are separate images that have "scala2.12" in their tag name - but no images with "scala2.13".

Contributor Author:

So this is Scala 2.12. Spark defaults to 2.12 for its images.
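
One way to confirm (a hypothetical check, assuming the standard /opt/spark layout of the official images) is to look for the bundled scala-library jar:

$ docker run --rm docker.io/apache/spark:3.5.4-java17-python3 sh -c 'ls /opt/spark/jars | grep scala-library'
scala-library-2.12.18.jar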

 USER root
 RUN apt update
-RUN apt-get install -y diffutils wget curl python3.8-venv
+RUN apt-get install -y diffutils wget curl python3.10-venv
Member:

Would 3.12 or 3.13 work? Those versions still get bugfixes (not just security fixes).

Contributor Author (@MonkeyCanCode, Jan 28, 2025):

It will. Locally, I am using Python 3.13. However, if we want to use the official Spark image with a different Python version, we would need to compile Python from source. In my previous PR to rework the test cases to pytest (paused for now; I will pick it up again soon), I used a Python base image and built our own Spark image on top of it (in that case, we are not locked to whatever Python the Spark image ships, and there is no need to compile from source; setting up Spark is just installing another piece of software). Both approaches work.

It really comes down to this: if we want to use the official Spark image and avoid compiling from source, we are stuck with the specific Python version it ships (e.g. CentOS 7, which is also EOL, defaults to Python 2 and its python3 package resolves to 3.8, though a different Python 3 can be set up there via a different repo or compiled from source). In this case, the JDK 11 base image used by Spark defaults to Python 3.8 and the JDK 17 one defaults to Python 3.10.
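
For illustration, a minimal sketch of that alternative approach (assumptions on my part: the python:3.12-slim tag, the Hadoop 3 binary distribution from the Apache archive, and Debian's default headless JRE; this is not what this PR does):

FROM python:3.12-slim

ARG SPARK_VERSION=3.5.4

# Spark needs a JRE; also install curl to fetch the tarball
RUN apt-get update && \
    apt-get install -y --no-install-recommends default-jre-headless curl && \
    rm -rf /var/lib/apt/lists/*

# Install Spark as plain software instead of inheriting an official Spark image
RUN curl -fsSL https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz \
    | tar xz -C /opt && \
    ln -s /opt/spark-${SPARK_VERSION}-bin-hadoop3 /opt/spark

ENV SPARK_HOME=/opt/spark
ENV PATH="${SPARK_HOME}/bin:${PATH}"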

Contributor Author:

A bit more context on those base images: the official Spark JDK 11 image is based on eclipse-temurin:11-jre-focal, which is built on top of ubuntu:20.04, while the official Spark JDK 17 image is based on eclipse-temurin:17.0.3_7-jdk-jammy, which is built on top of ubuntu:22.04. Here is what python3 resolves to on each of them when using apt with the default repos (to get a different version without compiling from source, we could also point to a different branch of the repo, or to a different repo entirely):

$ docker run -it ubuntu:20.04 /bin/bash
root@48dfe9519115:/# apt-cache madison python3
   python3 | 3.8.2-0ubuntu2 | http://archive.ubuntu.com/ubuntu focal/main amd64 Packages

$ docker run -it ubuntu:22.04 /bin/bash
root@1a6950f03ad8:/# apt-cache madison python3
   python3 | 3.10.6-1~22.04.1 | http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
   python3 | 3.10.6-1~22.04 | http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages
   python3 | 3.10.4-0ubuntu2 | http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages
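
For example, one commonly used alternate repo on jammy is the deadsnakes PPA (my assumption as an illustration; not something the official images use):

$ docker run -it ubuntu:22.04 /bin/bash
root@1a6950f03ad8:/# apt update && apt install -y software-properties-common
root@1a6950f03ad8:/# add-apt-repository -y ppa:deadsnakes/ppa
root@1a6950f03ad8:/# apt install -y python3.12 python3.12-venv
root@1a6950f03ad8:/# python3.12 --version
Python 3.12.x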

Member:

All good. Just a question ;)

However, I'd generally stay away from already EOL'd versions and soon-to-be EOL versions.

Contributor Author:

Understood. If preferred, I can do a follow-up PR that uses a base Python image and builds Spark on top. That way we can use the latest Python version (but we would no longer be using the official Spark image, since they don't offer that kind of variant). There is a similar request over at Apache Iceberg as well, but their preferred route is to use the official Spark image whenever possible.

Let me know what you think. I can merge this one if there are no other concerns.

Member:

Nah, no need to put in more effort at this point IMO. It seems like a lot of initial and ongoing maintenance work for a low win. Sticking with the official Spark image is fine with me; I don't see a pressing need to add more burden.

Member:

Thanks for the effort to look into this!

Contributor Author:

Anytime.

@flyrain (Contributor) left a comment:

LGTM. This also reminds me that other places still use Spark 3.5.3, like here: https://github.com/polaris-catalog/polaris/blob/23385a00d89bc43e6b869eb0ea65ecb97b6c9bb7/regtests/run.sh#L22-L22. We need to update those as well. Not a blocker for this PR though.

@MonkeyCanCode (Contributor Author):

> LGTM. This also reminds me that other places still use Spark 3.5.3, like here: https://github.com/polaris-catalog/polaris/blob/23385a00d89bc43e6b869eb0ea65ecb97b6c9bb7/regtests/run.sh#L22-L22. We need to update those as well. Not a blocker for this PR though.

Yes, I can take care of those a bit later this week as well. Now that the main branch is a lot more stable, I will pick up the work again to refactor the Python test cases and move all of them to pytest.

@MonkeyCanCode merged commit 7168326 into apache:main on Jan 30, 2025
5 checks passed
@MonkeyCanCode deleted the spark_jdk17 branch on June 25, 2025