Skip to content

Conversation

@dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Feb 14, 2024

What changes were proposed in this pull request?

This PR aims to spin off yarn and connect as a new test pipeline for the following:

  • To stabilize more by off-loading
  • To re-trigger easily in case of failures.
  • To isolate yarn module change and avoid triggering other module's tests like Kafka module.
  • To isolate connect module change and avoid triggering other module's tests like Kafka module.

Why are the changes needed?

These two modules are known to be flaky in various GitHub Action CI pipelines.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs in this PR.

Screenshot 2024-02-14 at 15 44 56

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the INFRA label Feb 14, 2024
@dongjoon-hyun dongjoon-hyun marked this pull request as draft February 14, 2024 23:19
@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review February 14, 2024 23:45
@dongjoon-hyun
Copy link
Member Author

Could you review this test pipeline adjustment PR, @HyukjinKwon ?

connect, protobuf
kubernetes, hadoop-cloud, spark-ganglia-lgpl, protobuf
- >-
yarn, connect
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm ... Does separating the build help the stabilization? I am not against this change but splitting this will cause 30mins time for building. Although it runs in parallel, individual fork has some resource limitation.

Copy link
Member Author

@dongjoon-hyun dongjoon-hyun Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, I also was reluctant due to that building time. Specifically, It will take 34 minutes in total run time. The second goal is to reduce the re-trigger time. Previously, YARN/Connect/Kafka modules build a bad synergy because their failure rates are multiplied. And, I needed to re-trigger and wait over 70 ~ 90minutes for this pipeline.

Screenshot 2024-02-14 at 16 39 19

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, the streaming, ... pipeline run time is down to 62m. So, the total increased overhead is around 20min.

Screenshot 2024-02-14 at 16 47 28

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm.. I think we should actually better fix the flakiness .. I am fine with this as a temporary change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I hoped so too. For now, SparkSessionE2ESuite flakiness and YarnClusterSuite seems to be abandoned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my side, we don't use YARN and Spark Connect.

@dongjoon-hyun
Copy link
Member Author

Merged to master for now.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-47051 branch February 15, 2024 05:12
dongjoon-hyun added a commit that referenced this pull request May 6, 2024
### What changes were proposed in this pull request?

We have been providing a dedicated test environment for `yarn` and `connect` module because they are flaky.
- #45107

However, they are still flaky. So, this PR aims to run `yarn` test only in PR builders (if needed) and Daily CIs (always).
- Reduce the irrelevant re-tries by triggering `YARN CI` only when we need to test `YARN` module.
- Protect YARN CI from `connect` flakiness by providing an independent GitHub Action environment in PR Builders and Daily CIs.
- Lastly, commit builder will offload YARN module tests to the daily CIs

### Why are the changes needed?

- PR builders provide an extensive test coverage with YARN testing.
- Daily CIs with YARN tests
   - NON-ANSI CI: https://github.com/apache/spark/actions/workflows/build_non_ansi.yml (1AM)
   - Java 21 SBT CI: https://github.com/apache/spark/actions/workflows/build_java21.yml (4AM)
   - RockDB UI CI: https://github.com/apache/spark/actions/workflows/build_rockdb_as_ui_backend.yml (6AM)
   - Maven Java 17 CI: https://github.com/apache/spark/actions/workflows/build_maven.yml (1PM)
   - Maven Java 21 CI: https://github.com/apache/spark/actions/workflows/build_maven_java21.yml (2PM)
   - Maven Java 21 on AppleSilicon CI: https://github.com/apache/spark/actions/workflows/build_maven_java21_macos14.yml (8PM every two days)

- YARN CI has been flaky in GitHub Action environment and requires irrelevant re-tries very frequently.
    - https://github.com/apache/spark/actions/runs/8962451417/job/24611353908 (2024-05-05)
    - https://github.com/apache/spark/actions/runs/8962440192/job/24611326971 (2024-05-05)

```
[info] *** 6 TESTS FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.deploy.yarn.YarnClusterSuite
[error] (yarn / Test / test) sbt.TestsFailedException: Tests unsuccessful
```

  <img width="544" alt="Screenshot 2024-05-05 at 20 12 28" src="https://github.com/apache/spark/assets/9700541/cbf9fb03-fc4c-4513-b5e5-158c3c9a085a">

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46395 from dongjoon-hyun/SPARK-48137.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit that referenced this pull request May 7, 2024
### What changes were proposed in this pull request?

This PR aims to merge connect back to the original test pipeline to reduce the maximum concurrency of GitHub Action by one.
- https://infra.apache.org/github-actions-policy.html
  > All workflows SHOULD have a job concurrency level less than or equal to 15.

### Why are the changes needed?

This is a partial recover from the following.
- #45107

We stabilized the root cause of #45107 via the following PRs. In addition we will disable a flaky test case if exists.

- #46395
- #46396
- #46425

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46441 from dongjoon-hyun/SPARK-48174.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants