
Conversation

@turboFei (Member) commented Nov 1, 2024

🔍 Description

Issue References 🔗

I found that `spark.sql("show databases").isEmpty` triggers a Spark job with community Spark 3.5.2 (Spark 3.1 does not).


Describe Your Solution 🔧

Use `dataFrame.take(1).isEmpty` instead of `dataFrame.isEmpty`, as sketched below.
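
A minimal sketch of the difference, assuming an active Spark session `spark`; the job-triggering behavior is as observed in this PR on Spark 3.5.2:

```scala
val df = spark.sql("show databases")

// Dataset.isEmpty re-plans the query (roughly limit(1) followed by a
// count aggregation), which defeats the driver-side fast path for
// command results and, on Spark 3.5.2, submits a job even for this
// pure metadata statement.
df.isEmpty

// take(1) on a command result is served from the rows the driver
// already holds (the CommandResult node), so no job is submitted.
df.take(1).isEmpty
```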

Types of changes 🔖

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Test Plan 🧪

Behavior Without This Pull Request ⚰️

Behavior With This Pull Request 🎉

Related Unit Tests

No job is triggered for `take(1).isEmpty`.


Checklist 📝

Be nice. Be informative.

@turboFei turboFei closed this Nov 1, 2024
@turboFei turboFei changed the title from "do not trigger job" to "Prevent trigger spark job when executing the initialization sql" Nov 1, 2024
@turboFei turboFei changed the title from "Prevent trigger spark job when executing the initialization sql" to "Prevent triggering spark job when executing the initialization sql" Nov 1, 2024
@turboFei turboFei reopened this Nov 1, 2024
@bowenliang123 (Contributor) left a comment:

LGTM. Nice catch.

```diff
- protected val initJobId: Int = if (SPARK_ENGINE_RUNTIME_VERSION >= "4.0") 0 else 1
+ // KYUUBI #6789 makes it avoid triggering job
+ // protected val initJobId: Int = if (SPARK_ENGINE_RUNTIME_VERSION >= "4.0") 0 else 1
+ protected val initJobId: Int = 0
```
@turboFei (Member, Author) commented:

What do you think, @pan3793?

@turboFei (Member, Author) commented:
In our use case, I set the initial/minimum executor count to 0 for notebook connections.

If the initialization SQL triggers a job, at least one executor has to be allocated before the connection is ready.
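
For example, with standard Spark dynamic allocation settings (illustrative values, not quoted from this PR):

```properties
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.initialExecutors=0
spark.dynamicAllocation.minExecutors=0
```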

@turboFei (Member, Author) commented Nov 1, 2024:

And for notebook connections, we initialize the Spark driver in a temporary queue and eventually move it to the user's queue, so the connection should become ready quickly even when the user's queue has no available resources.

But if a job is triggered during initialization, connection setup can be slow when the user's queue has no available resources.

```diff
  debug(s"Execute initialization sql: $sql")
  try {
-   spark.sql(sql).isEmpty
+   spark.sql(sql).take(1).isEmpty
```
A contributor commented:

`spark.sql(sql).take(1)` has two issues:

  • it does a row conversion from `InternalRow` to `Row`
  • it does not do column pruning

How about using the same approach as apache/spark#45373 to improve it?

@turboFei (Member, Author) commented:

Not sure whether it would be easy to maintain this workaround on the Kyuubi side.

```scala
}

/** SPARK-47270: Returns an optimized plan for CommandResult, converted to `LocalRelation`. */
def commandResultOptimized[T](dataset: Dataset[T]): Dataset[T] = {
```
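
The excerpt above is truncated in the diff view. A sketch of how the body could look, following apache/spark#45373 (SPARK-47270); this is an illustration, not necessarily the PR's final code, and it relies on Spark `private[sql]` APIs (`CommandResult`, `LocalRelation`, `Dataset.apply`, `exprEnc`), so it would have to live under an `org.apache.spark.sql` sub-package, as Kyuubi's `SparkDatasetHelper` does:

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.plans.logical.{CommandResult, LocalRelation}

/** SPARK-47270: Returns an optimized plan for CommandResult, converted to `LocalRelation`. */
def commandResultOptimized[T](dataset: Dataset[T]): Dataset[T] = {
  dataset.logicalPlan match {
    case c: CommandResult =>
      // The command already ran eagerly on the driver; rebuild the dataset
      // on a LocalRelation over the collected rows so that later actions
      // (take, isEmpty, ...) never submit a job.
      Dataset(dataset.sparkSession, LocalRelation(c.output, c.rows))(dataset.exprEnc)
    case _ => dataset
  }
}
```

With such a helper, the engine could call `commandResultOptimized(spark.sql(sql)).take(1).isEmpty`, keeping the no-job behavior while avoiding the row-conversion and column-pruning issues noted above.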
A member commented:

I'm -0 on this change; I would consider it a Spark-side issue. Spark's master branch is undergoing a major refactoring, and I'm worried about accessing Spark's non-public API on the engine's core code path.

Users who want to avoid triggering an executor launch can either patch their Spark or set the initialization SQL to something like `SET spark.app.id`, as shown below.
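
For instance, assuming Kyuubi's `kyuubi.engine.initialize.sql` property (whose default is `SHOW DATABASES`):

```properties
# kyuubi-defaults.conf
# SET runs entirely on the driver, so no executors are needed.
kyuubi.engine.initialize.sql=SET spark.app.id
```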

@turboFei (Member, Author) commented:

OK, I will cherry-pick the PR into our managed Spark 3.5.

@turboFei (Member, Author) commented Nov 5, 2024:

I prefer to backport the Spark PR to our managed Spark.

@turboFei turboFei closed this Nov 5, 2024
@turboFei turboFei deleted the do_not_trigger_job branch November 5, 2024 03:02