
Conversation

@gatorsmile (Member) commented Mar 5, 2020

What changes were proposed in this pull request?

Based on the discussion in the mailing-list thread [Proposal] Modification to Spark's Semantic Versioning Policy, this PR adds back the following APIs, whose maintenance costs are relatively small.

  • HiveContext
  • createExternalTable APIs
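As a rough illustration of why the maintenance cost is considered small (this is not the actual Spark source; the `Catalog`/`Table` names below are hypothetical stand-ins), a restored deprecated entry point like `createExternalTable` can simply warn and forward to the newer API:

```python
import warnings

class Table:
    """Stand-in for a catalog table entry (hypothetical, for illustration)."""
    def __init__(self, name, path):
        self.name = name
        self.path = path

class Catalog:
    """Stand-in catalog exposing the new API plus a restored deprecated alias."""
    def __init__(self):
        self.tables = {}

    def create_table(self, name, path):
        # The "new" API: the single real implementation.
        table = Table(name, path)
        self.tables[name] = table
        return table

    def create_external_table(self, name, path):
        # The restored deprecated entry point: warn, then delegate.
        warnings.warn("create_external_table is deprecated; use create_table.",
                      DeprecationWarning)
        return self.create_table(name, path)
```

Because the deprecated method holds no logic of its own, keeping it around costs little beyond the delegation line and a test.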

Why are the changes needed?

Avoid breaking APIs that are commonly used.

Does this PR introduce any user-facing change?

Adding back the APIs that were removed in the 3.0 branch does not introduce user-facing changes, because Spark 3.0 has not been released yet.

How was this patch tested?

A new test suite was added for the createExternalTable APIs.

@SparkQA commented Mar 6, 2020

Test build #119423 has finished for PR 27815 at commit 294337d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveContext(SQLContext):

@dongjoon-hyun (Member) left a comment

Unfortunately, HiveContext requires more than alias-level maintenance. Technically, it is a layer carried over from the 1.6.x era.
cc @marmbrus and @gatorsmile

@dongjoon-hyun (Member)

Retest this please.

@SparkQA commented Mar 9, 2020

Test build #119546 has finished for PR 27815 at commit 294337d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveContext(SQLContext):

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28980][FOLLOW-UP] Add back HiveContext and createExternalTable [SPARK-31088][SQL] Add back HiveContext and createExternalTable Mar 9, 2020
@SparkQA commented Mar 9, 2020

Test build #119555 has finished for PR 27815 at commit ff2d91f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DeprecatedCreateExternalTableSuite extends SharedSparkSession

:param jhiveContext: An optional JVM Scala HiveContext. If set, we do not instantiate a new
:class:`HiveContext` in the JVM, instead we make all calls to this object.
.. note:: Deprecated in 2.0.0. Use SparkSession.builder.enableHiveSupport().getOrCreate().
Member

Please rethink this. Although the other removed APIs are being restored, I believe we should not restore this one.
cc @marmbrus

Contributor

Can you justify this position using the rubric we agreed upon?

@dongjoon-hyun (Member) commented Mar 11, 2020

I believe @gatorsmile should use the new rubric to justify why the maintenance cost of this is relatively small. We are reviewing this PR.
Also, cc @rxin since he is the release manager for 3.0.0.

Contributor

Look at the code in this PR. I see a bunch of alias methods that call other methods and a few minimal tests. What part of this is hard to maintain?

@dongjoon-hyun (Member) commented Mar 11, 2020

class HiveContext(SQLContext) itself is not a simple alias. You know that. It is a whole layer carried over from the Spark 1.6.x era.

Contributor

I see 63 lines of code in that file. That class itself looks like an alias, not a whole separate implementation. What am I missing?

Member

It seems we have different definitions of an alias. Are you saying that all wrappers are aliases?

Contributor

If you are just delegating to the "correct"/new implementation, that sounds like an alias to me. If we were maintaining a large parallel/duplicate implementation, that would be a different story.
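The distinction being debated can be made concrete. A sketch (with made-up names; this is not Spark code): an "alias" in this sense is a wrapper whose methods all one-line-delegate to the maintained implementation, as opposed to a parallel implementation that duplicates logic. The optional-engine constructor mirrors the jhiveContext docstring quoted above, which reuses an existing backing object instead of creating a new one.

```python
class NewEngine:
    """Stand-in for the maintained implementation (hypothetical)."""
    def run_sql(self, query):
        return f"result of {query}"

class LegacyContext:
    """Alias-style wrapper: every call delegates; no duplicated logic."""
    def __init__(self, engine=None):
        # Reuse an existing engine if one is supplied; otherwise create one.
        self._engine = engine if engine is not None else NewEngine()

    def sql(self, query):
        # Pure delegation: the maintenance burden is one line per method.
        return self._engine.run_sql(query)
```

Under this definition, the wrapper's behavior can never drift from the new implementation, which is the crux of the "easy to maintain" argument.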

@gatorsmile gatorsmile marked this pull request as ready for review March 20, 2020 04:51
@SparkQA commented Mar 20, 2020

Test build #120078 has finished for PR 27815 at commit c3a45e0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 21, 2020

Test build #4996 has finished for PR 27815 at commit c3a45e0.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@gatorsmile (Member, Author)

CC @dongjoon-hyun @marmbrus @rxin @srowen This PR is ready for review.

@SparkQA commented Mar 24, 2020

Test build #120242 has finished for PR 27815 at commit 1f1a570.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final class ANOVASelector @Since("3.1.0")(@Since("3.1.0") override val uid: String)
  • final class VarianceThresholdSelector @Since("3.1.0")(@Since("3.1.0") override val uid: String)
  • // case class Foo(i: Int) extends AnyVal will return type Int instead of Foo.

@srowen (Member) left a comment

Same comment as elsewhere: keeping aliases is easy, so it's hard to feel strongly about it either way. I see the value in not changing APIs even across major versions. But why were these deprecated in the first place if you intended never to remove them, and why has that come up only now? Going forward, I assume we are simply not going to deprecate anything, as nothing will realistically be removed.

@SparkQA commented Mar 24, 2020

Test build #4997 has finished for PR 27815 at commit 1f1a570.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final class ANOVASelector @Since("3.1.0")(@Since("3.1.0") override val uid: String)
  • final class VarianceThresholdSelector @Since("3.1.0")(@Since("3.1.0") override val uid: String)
  • // case class Foo(i: Int) extends AnyVal will return type Int instead of Foo.

@dongjoon-hyun (Member)

I also still strongly believe that we should keep this out of 3.0.0.

@gatorsmile (Member, Author) commented Mar 24, 2020

Adoption of Spark 3.0 is much more important than the removal of these deprecated methods. I strongly disagreed with removing these APIs in Spark 3.0 in the first place.

@gatorsmile (Member, Author)

retest this please

@gatorsmile (Member, Author)

To eventually remove these deprecated APIs, we need at least to build migration tools that let end users find usages of the deprecated APIs. As far as I know, such tooling does not exist yet. In general, I believe we need a systematic way to reduce the migration pain and make it easier for end users to upgrade to the latest version of Apache Spark.

@SparkQA commented Mar 25, 2020

Test build #120288 has finished for PR 27815 at commit 1f1a570.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final class ANOVASelector @Since("3.1.0")(@Since("3.1.0") override val uid: String)
  • final class VarianceThresholdSelector @Since("3.1.0")(@Since("3.1.0") override val uid: String)
  • // case class Foo(i: Int) extends AnyVal will return type Int instead of Foo.

@srowen (Member) commented Mar 25, 2020

@gatorsmile where did you disagree with this removal originally?

@gatorsmile (Member, Author)

Let me correct the typo and rewrite my original reply to make my point clear.

Based on the latest discussion in the community, I strongly disagree with the initial removal of these APIs in Spark 3.0, and I think we need to add them back. If we want to remove them, we need more public discussion in the community. This PR just adds the removed APIs back and defers the related discussions to future releases.

@dongjoon-hyun (Member)

Hi, all.
Although I have been against this reverting PR until now, I will no longer be against it if @gatorsmile has his reasons to support it.

@gatorsmile (Member, Author) commented Mar 25, 2020

There are three major reasons why I think we should keep HiveContext:

  • SQLContext will not be removed in the near future, because it is part of the DSV1 APIs; HiveContext is just the corresponding part with Hive metastore support.
  • SQLContext/HiveContext is used as a function parameter in many internal and public libraries. Replacing it with SparkSession is not straightforward and would require massive updates when end users upgrade Spark from 2.x to 3.x.
  • SQLContext/HiveContext is documented in various tutorials and books.
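The second point is the crux for downstream libraries: helper functions written against the old context types keep working only if those classes still exist. A hypothetical sketch (the classes and `load_users` helper below are illustrative stand-ins, not Spark or any real library's code):

```python
class SQLContext:
    """Stand-in for the legacy entry point kept for compatibility."""
    def table(self, name):
        return f"rows of {name}"

class HiveContext(SQLContext):
    """Kept as a thin subclass so old signatures and isinstance checks still work."""
    pass

def load_users(ctx: SQLContext):
    # Library code written against Spark 2.x-style signatures: it accepts
    # the legacy context type. Removing SQLContext/HiveContext would force
    # every such call site and signature to be rewritten.
    return ctx.table("users")
```

With the classes kept, a 2.x-era call like `load_users(HiveContext())` continues to work unchanged on the new version.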

@srowen (Member) commented Mar 25, 2020

@gatorsmile these changes happened almost a year ago. I'd like to understand why you have advanced these arguments only now. Has something changed?

@gatorsmile (Member, Author)

@srowen As in previous releases, we are auditing the APIs and checking all the changes made for the upcoming 3.0 release. During the audit we found this, and multiple Spark committers (including @marmbrus, @zsxwing, @yhuai) challenged this API change in our offline discussion. That is why I submitted this PR to add them back.

@cloud-fan (Contributor) left a comment

LGTM. Since we keep SQLContext as well, it is consistent to also keep HiveContext.

gatorsmile added a commit that referenced this pull request Mar 27, 2020
### What changes were proposed in this pull request?
Based on the discussion in the mailing-list thread [[Proposal] Modification to Spark's Semantic Versioning Policy](http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-Modification-to-Spark-s-Semantic-Versioning-Policy-td28938.html), this PR adds back the following APIs, whose maintenance costs are relatively small.

- HiveContext
- createExternalTable APIs

### Why are the changes needed?

Avoid breaking the APIs that are commonly used.

### Does this PR introduce any user-facing change?
Adding back the APIs that were removed in the 3.0 branch does not introduce user-facing changes, because Spark 3.0 has not been released yet.

### How was this patch tested?

Add a new test suite for the createExternalTable APIs.

Closes #27815 from gatorsmile/addAPIsBack.

Lead-authored-by: gatorsmile <[email protected]>
Co-authored-by: yi.wu <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit b9eafcb)
Signed-off-by: gatorsmile <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020


7 participants