
Conversation

@amaliujia (Contributor) commented Nov 1, 2022

What changes were proposed in this pull request?

This PR adds the `range` API to the Python client's `RemoteSparkSession`, with tests.

This PR also updates `start`, `end`, and `step` to `int64` in the Connect proto.

Why are the changes needed?

Improve API coverage.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

UT

@amaliujia (Author)

R: @HyukjinKwon @zhengruifeng

Contributor:
I think we can use `step: int = 1`.
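The suggestion above can be sketched as a Python signature with a defaulted `step`. This is a hypothetical illustration of the shape being discussed, not the actual client implementation; the body just echoes its arguments so the defaulting is visible.

```python
from typing import Optional


class RemoteSparkSession:
    """Hypothetical stand-in for the Connect Python client's session."""

    def range(
        self,
        start: int,
        end: int,
        step: int = 1,  # the suggested default value
        num_partitions: Optional[int] = None,
    ) -> dict:
        # The real client would build a Connect proto Range relation here;
        # returning a dict keeps the sketch self-contained.
        return {
            "start": start,
            "end": end,
            "step": step,
            "num_partitions": num_partitions,
        }
```

With this signature, `range(10, 20)` picks up `step == 1` without the caller spelling it out.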

Contributor:

Furthermore, I think we can make `step` a required field in the proto.

@amaliujia (Author), Nov 1, 2022:

Hmm, we are not marking `step` as required because the Scala-side implementation does not treat it as required and instead gives it a default value:

  private def transformRange(rel: proto.Range): LogicalPlan = {
    val start = rel.getStart
    val end = rel.getEnd
    // Default step to 1 when the optional field is unset.
    val step = if (rel.hasStep) {
      rel.getStep.getStep
    } else {
      1
    }
    // Default numPartitions to the session's leaf-node parallelism.
    val numPartitions = if (rel.hasNumPartitions) {
      rel.getNumPartitions.getNumPartitions
    } else {
      session.leafNodeDefaultParallelism
    }
    logical.Range(start, end, step, numPartitions)
  }

Same for numPartitions.
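The server-side defaulting quoted above can be mirrored in a small Python sketch. This is illustrative only: the dict stands in for the proto message (key presence approximates `hasStep`/`hasNumPartitions`), and `DEFAULT_PARALLELISM` is a hypothetical stand-in for `session.leafNodeDefaultParallelism`.

```python
DEFAULT_PARALLELISM = 8  # stand-in for session.leafNodeDefaultParallelism


def resolve_range(rel: dict) -> tuple:
    """Resolve a Range relation, filling server-side defaults for
    optional fields that the client left unset."""
    start = rel["start"]
    end = rel["end"]
    # step defaults to 1 when absent, as in the Scala code above.
    step = rel.get("step", 1)
    # num_partitions falls back to the session's default parallelism.
    num_partitions = rel.get("num_partitions", DEFAULT_PARALLELISM)
    return (start, end, step, num_partitions)
```

The point is that the client can omit both optional fields entirely and still get a well-defined plan on the server.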

Contributor:

What about adding a new case like `range(start=10, end=20)` and checking that:

1. `step` is set to 1 (if we make the default value 1);
2. `num_partitions` is not set?

Author:

I added this test case, but it only checks that `step` and `num_partitions` are not set.

Right now there is a division of responsibility between client and server:

  1. Clients take care of required fields, meaning they must make sure the required fields are set.
  2. The server side takes care of default values for optional fields.

This reduces the implementation load on both sides:

  1. Clients do not need to worry about default values for optional fields unless the default value is already exposed in the DataFrame API.
  2. The server side does not care whether a required field is set (clients enforce that), but it tracks the default values for optional fields. This also prevents different clients from setting different default values. The default values are documented in the proto:
    // it is set, or 2) spark default parallelism.
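Under this convention, the client only sets fields the caller actually provided. A minimal sketch of that behavior and the test described above, using a dict as a hypothetical stand-in for the generated proto (key presence approximating proto field presence):

```python
def build_range_plan(start, end, step=None, num_partitions=None):
    """Hypothetical client-side builder: set only caller-provided fields,
    leaving defaults to the server per the convention described above."""
    plan = {"start": start, "end": end}
    if step is not None:
        plan["step"] = step
    if num_partitions is not None:
        plan["num_partitions"] = num_partitions
    return plan


# range(start=10, end=20) should leave both optional fields unset.
plan = build_range_plan(10, 20)
assert "step" not in plan
assert "num_partitions" not in plan
```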

@zhengruifeng (Contributor):

Merged into master.

@HyukjinKwon (Member) left a comment:

LGTM2

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
Closes apache#38460 from amaliujia/SPARK-40981.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
