45 changes: 29 additions & 16 deletions .github/workflows/build_and_test.yml
@@ -689,17 +689,13 @@ jobs:
# Should delete this section after SPARK 3.5 EOL.
python3.9 -m pip install 'flake8==3.9.0' pydata_sphinx_theme 'mypy==0.982' 'pytest==7.1.3' 'pytest-mypy-plugins==1.9.3' numpydoc 'jinja2<3.0.0' 'black==22.6.0'
python3.9 -m pip install 'pandas-stubs==1.2.0.53' ipython 'grpcio==1.59.3' 'grpc-stubs==1.24.11' 'googleapis-common-protos-stubs==2.2.0'
- name: Install Python linter dependencies
if: inputs.branch != 'branch-3.3' && inputs.branch != 'branch-3.4' && inputs.branch != 'branch-3.5'
run: |
# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
# Jinja2 3.0.0+ causes error when building with Sphinx.
# See also https://issues.apache.org/jira/browse/SPARK-35375.
python3.9 -m pip install 'flake8==3.9.0' pydata_sphinx_theme 'mypy==0.982' 'pytest==7.1.3' 'pytest-mypy-plugins==1.9.3' numpydoc 'jinja2<3.0.0' 'black==23.9.1'
python3.9 -m pip install 'pandas-stubs==1.2.0.53' ipython 'grpcio==1.59.3' 'grpc-stubs==1.24.11' 'googleapis-common-protos-stubs==2.2.0'
- name: Python linter
python3.9 -m pip install 'protobuf==4.25.1' 'mypy-protobuf==3.3.0'
- name: Python linter for branch-3.3, branch-3.4, branch-3.5
if: inputs.branch == 'branch-3.3' || inputs.branch == 'branch-3.4' || inputs.branch == 'branch-3.5'
run: PYTHON_EXECUTABLE=python3.9 ./dev/lint-python
- name: Python linter
if: inputs.branch != 'branch-3.3' && inputs.branch != 'branch-3.4' && inputs.branch != 'branch-3.5'
run: conda run -n doc ./dev/lint-python
- name: Install dependencies for Python code generation check
if: inputs.branch != 'branch-3.3' && inputs.branch != 'branch-3.4'
run: |
@@ -708,11 +704,13 @@ jobs:
mkdir -p $HOME/buf
tar -xvzf buf-Linux-x86_64.tar.gz -C $HOME/buf --strip-components 1
rm buf-Linux-x86_64.tar.gz
python3.9 -m pip install 'protobuf==4.25.1' 'mypy-protobuf==3.3.0'
- name: Python code generation check for branch-3.5
if: inputs.branch == 'branch-3.5'
run: PATH=$PATH:$HOME/buf/bin PYTHON_EXECUTABLE=python3.9 ./dev/connect-check-protos.py
- name: Python code generation check
if: inputs.branch != 'branch-3.3' && inputs.branch != 'branch-3.4'
run: if test -f ./dev/connect-check-protos.py; then PATH=$PATH:$HOME/buf/bin PYTHON_EXECUTABLE=python3.9 ./dev/connect-check-protos.py; fi
# Should delete this section after SPARK 3.5 EOL.
if: inputs.branch != 'branch-3.3' && inputs.branch != 'branch-3.4' && inputs.branch != 'branch-3.5'
run: PATH=$PATH:$HOME/buf/bin conda run -n doc ./dev/connect-check-protos.py
# Should delete this section after SPARK 3.5 EOL.
- name: Install JavaScript linter dependencies for branch-3.3, branch-3.4, branch-3.5
if: inputs.branch == 'branch-3.3' || inputs.branch == 'branch-3.4' || inputs.branch == 'branch-3.5'
run: |
@@ -743,8 +741,6 @@ jobs:
Rscript -e "install.packages(c('devtools', 'testthat', 'knitr', 'rmarkdown', 'markdown', 'e1071', 'roxygen2', 'ggplot2', 'mvtnorm', 'statmod'), repos='https://cloud.r-project.org/')"
Rscript -e "devtools::install_version('pkgdown', version='2.0.1', repos='https://cloud.r-project.org')"
Rscript -e "devtools::install_version('preferably', version='0.4', repos='https://cloud.r-project.org')"
- name: Install dependencies for documentation generation
[Review comment] @zhengruifeng (Contributor, Author), Nov 22, 2023:

This step is also used in 3.3/3.4/3.5, so move it to "Install dependencies for documentation generation for branch-3.3, branch-3.4, branch-3.5".

run: |
# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
# Jinja2 3.0.0+ causes error when building with Sphinx.
@@ -755,13 +751,30 @@ jobs:
python3.9 -m pip install ipython_genutils # See SPARK-38517
python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' pyarrow pandas 'plotly>=4.8'
python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
- name: Install dependencies for documentation generation
run: |
gem install bundler
cd docs
bundle install
- name: R linter
run: ./dev/lint-r
- name: Run documentation build for branch-3.3, branch-3.4, branch-3.5
if: inputs.branch == 'branch-3.3' || inputs.branch == 'branch-3.4' || inputs.branch == 'branch-3.5'
run: |
if [ -f "./dev/is-changed.py" ]; then
# Skip PySpark and SparkR docs while keeping Scala/Java/SQL docs
pyspark_modules=`cd dev && python3.9 -c "import sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if m.name.startswith('pyspark')))"`
if [ `./dev/is-changed.py -m $pyspark_modules` = false ]; then export SKIP_PYTHONDOC=1; fi
if [ `./dev/is-changed.py -m sparkr` = false ]; then export SKIP_RDOC=1; fi
fi
cd docs
bundle exec jekyll build
- name: Run documentation build
if: inputs.branch != 'branch-3.3' && inputs.branch != 'branch-3.4' && inputs.branch != 'branch-3.5'
shell: 'script -q -e -c "bash {0}"'
run: |
# See also https://github.com/conda/conda/issues/7980
source $(conda info --base)/etc/profile.d/conda.sh && conda activate doc
if [ -f "./dev/is-changed.py" ]; then
# Skip PySpark and SparkR docs while keeping Scala/Java/SQL docs
pyspark_modules=`cd dev && python3.9 -c "import sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if m.name.startswith('pyspark')))"`
57 changes: 57 additions & 0 deletions dev/infra/Dockerfile
@@ -139,3 +139,60 @@ RUN python3.12 -m pip install 'grpcio==1.59.3' 'grpcio-status==1.59.3' 'protobuf
# TODO(SPARK-46078) Use official one instead of nightly build when it's ready
RUN python3.12 -m pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
RUN python3.12 -m pip install torcheval


# Refer to https://github.com/ContinuumIO/docker-images/blob/main/miniconda3/debian/Dockerfile
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh -q && \
bash miniconda.sh -b -p /opt/miniconda3 && \
rm miniconda.sh && \
ln -s /opt/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
ln -s /opt/miniconda3/bin/conda /usr/local/bin/conda && \
find /opt/miniconda3/ -follow -type f -name '*.a' -delete && \
find /opt/miniconda3/ -follow -type f -name '*.js.map' -delete && \
conda clean -afy

# Additional Python deps for linter and documentation; delete this section if another Python version is used.
# Since there may be conflicts between envs, conda is used to manage them.
# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
# Jinja2 3.0.0+ causes error when building with Sphinx.
# See also https://issues.apache.org/jira/browse/SPARK-35375.
RUN conda create -n doc python=3.9

RUN conda run -n doc pip install \
[Review comment] @nchammas (Contributor), Dec 14, 2023:

Why are we listing out individual dependencies here vs. listing them in a requirements file (or equivalent) that we can use across Docker, GitHub Actions, and local build scripts? Don't we want our build and test dependencies to be consistent?

I made a past attempt at this over in #27928. It failed because, in addition to building a shared set of build and test dependencies, it also pinned transitive build and test dependencies, which the reviewers weren't keen on. But we can separate the two ideas from each other.

IMO there should be a single requirements file for build and test dependencies (whether or not it pins transitive dependencies is a separate issue), and that file should be used everywhere. What do you think?

I also don't follow why we need to pull in conda. What is it getting us over vanilla pip?
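
A minimal sketch of the single-requirements-file idea raised in this comment; the file name `dev/requirements-doc.txt` is hypothetical and not part of this PR:

```Dockerfile
# Hypothetical: one pinned requirements file shared by every environment.
# Changing the file invalidates the COPY layer and everything after it.
COPY dev/requirements-doc.txt /tmp/requirements-doc.txt
RUN conda run -n doc pip install -r /tmp/requirements-doc.txt
```

The GitHub Actions workflow and local build scripts would install from the same file (e.g. `python3.9 -m pip install -r dev/requirements-doc.txt`), so all three environments stay pinned identically.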

[Review reply] @zhengruifeng (Contributor, Author):


Hi @nchammas, actually I want to give up this PR (which caches the dependencies for the linter and documentation), since there is still a weird issue when building the documentation with conda in CI (while it works fine locally).

The reason this PR tried conda is that there were environment conflicts between PySpark and lint/doc: installing the lint/doc dependencies would break some tests.

As to why we don't use a requirements file in CI, I guess one problem may be that modifying the requirements file won't automatically refresh the cached testing image?
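
The environment conflict described here can be sketched as two isolated interpreters in the workflow; the step names and paths below are illustrative, not taken from this PR:

```yaml
# Lint/doc tooling lives in the prebuilt "doc" conda env...
- name: Python linter
  run: conda run -n doc ./dev/lint-python
# ...while test jobs keep the system python3.9, whose site-packages
# the lint/doc dependencies can no longer break.
- name: Run PySpark tests
  run: python3.9 ./dev/run-tests
```

`conda run -n doc` executes a command with the env's interpreter without requiring `conda activate`, which is why the workflow steps in this diff can stay one-liners.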

[Review reply] @nchammas (Contributor):


> As to why not use requirement file in CI, I guess a problem maybe, the modification in requirement file won't automatically refresh the cached testing image?

That shouldn't be the case. Assuming you COPY the requirements file into the image, changing the file will invalidate the cache:

> The first encountered COPY instruction will invalidate the cache for all following instructions from the Dockerfile if the contents of <src> have changed. This includes invalidating the cache for RUN instructions.

Also:

> For the ADD and COPY instructions, the modification time and size file metadata is used to determine whether cache is valid. During cache lookup, cache is invalidated if the file metadata has changed for any of the files involved.
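
A short illustration of the quoted invalidation rule (the requirements file name is hypothetical):

```Dockerfile
RUN conda create -n doc python=3.9         # cached as long as earlier layers are unchanged
COPY dev/requirements-doc.txt /tmp/req.txt # cache key includes the file's metadata
# If requirements-doc.txt changed, this layer and every later one are rebuilt:
RUN conda run -n doc pip install -r /tmp/req.txt
```

So a requirements-file change does refresh the cached image layers that depend on it, which is the point being made in reply to the comment above.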

[Review reply] @zhengruifeng (Contributor, Author):


Good to know. You are right.

'black==23.9.1' \
'docutils<0.18.0' \
'flake8==3.9.0' \
'googleapis-common-protos-stubs==2.2.0' \
'grpc-stubs==1.24.11' \
'grpcio==1.59.3' \
'ipython' \
'ipython_genutils' \
'jinja2<3.0.0' \
'markupsafe==2.0.1' \
'matplotlib' \
'mkdocs' \
'mypy-protobuf==3.3.0' \
'mypy==0.982' \
'nbsphinx' \
'numpy>=1.20.0' \
'numpydoc' \
'pandas' \
'pandas-stubs==1.2.0.53' \
'plotly>=4.8' \
'protobuf==4.25.1' \
'pyarrow' \
'pydata_sphinx_theme>=0.13' \
'pytest-mypy-plugins==1.9.3' \
'pytest==7.1.3' \
'pyzmq<24.0.0' \
'sphinx-copybutton' \
'sphinx==4.2.0' \
'sphinx_plotly_directive'
RUN conda run -n doc pip install \
'torch<=2.0.1' \
'torchvision' --index-url https://download.pytorch.org/whl/cpu
RUN conda run -n doc pip install \
'torcheval'

RUN conda clean -afy