@cloud-fan commented Jul 11, 2019:

What changes were proposed in this pull request?

The V2SessionCatalog has 2 functionalities:

  1. work as an adapter: provide v2 APIs and translate calls to the SessionCatalog.
  2. allow users to extend it, so that they can add hooks to apply custom logic before calling methods of the builtin catalog (session catalog).

To leverage the second functionality, users must extend V2SessionCatalog, which is an internal class, and there is no documentation explaining this usage.

This PR does 2 things:

  1. refine the documentation of the config spark.sql.catalog.session.
  2. add a public abstract class CatalogExtension for users to write implementations (a sketch follows the TODO list below).

TODOs for followup PRs:

  1. discuss if we should allow users to completely overwrite the v2 session catalog with a new one.
  2. discuss renaming the session catalog, so that it's less likely to conflict with existing namespace names.
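
For illustration, here is a minimal self-contained sketch of the extension hook idea. The traits, method signatures, and class names below are simplified stand-ins, not the actual Spark API:

// Simplified stand-in for the real TableCatalog interface (illustration only).
trait TableCatalog {
  def createTable(ident: String): Unit
  def dropTable(ident: String): Boolean
}

// The idea behind CatalogExtension: a public base class that delegates every
// call to the built-in v2 session catalog, so subclasses override only the
// methods they want to hook.
abstract class CatalogExtension extends TableCatalog {
  private var delegate: TableCatalog = _

  // Spark would inject the built-in session catalog here (hypothetical wiring).
  def setDelegateCatalog(catalog: TableCatalog): Unit = { delegate = catalog }

  override def createTable(ident: String): Unit = delegate.createTable(ident)
  override def dropTable(ident: String): Boolean = delegate.dropTable(ident)
}

// A user extension adding custom logic before table creation (hypothetical).
class AuditingSessionCatalog extends CatalogExtension {
  override def createTable(ident: String): Unit = {
    println(s"creating table $ident")  // custom hook runs first...
    super.createTable(ident)           // ...then the builtin catalog does the work
  }
}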

How was this patch tested?

existing tests

Contributor Author:

Now this end-to-end test checks the actual behavior that end users will hit, since we don't set a fake session catalog.

Contributor Author:

Changed the test name a little, because we will always use the default catalog when the config is set, no matter what the table provider is.

Contributor Author:

Note that this was never supported. The test could pass before because we set a fake session catalog that returns writable tables.

Contributor:

To clarify, this doesn't work because the ORC source is broken.

The test case is still valid, so I don't see a reason why we should get rid of it. I also think that overriding the v2 session catalog implementation is a valid use case, and the tests rely on it.

@cloud-fan (Contributor Author):

cc @rdblue @brkyvz @jose-torres @jzhuge

@SparkQA commented Jul 11, 2019:

Test build #107498 has finished for PR 25104 at commit cba0e25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(conf: SQLConf, val v2SessionCatalog: TableCatalog)
  • class NoopV2SessionCatalog extends TableCatalog
  • class V2SessionCatalog(catalog: SessionCatalog, conf: SQLConf) extends TableCatalog

Contributor:

Why use the Java-like "get"? I think def catalog(name: String) is sufficient.

I try to avoid adding get to names. It doesn't help understanding unless the method is a getter -- and this isn't -- so why include an extra word?
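
As a concrete illustration of the preferred style (stand-in types, not the actual signatures):

trait CatalogPlugin  // stand-in for the real plugin interface

trait CatalogManager {
  def catalog(name: String): CatalogPlugin      // preferred: named for what it returns
  // def getCatalog(name: String): CatalogPlugin  // the extra "get" adds nothing
}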

Contributor:

Same here. No need for get.

@rdblue commented Jul 11, 2019:

This looks okay overall, but it breaks the test for v2 providers with the session catalog. I don't think this can be committed until that is resolved. @cloud-fan, can you test whether one of the test providers will work in place of ORC v2 in this test?

The second issue is that this removes the ability to override the v2 session catalog implementation. I think we want to support that so you can modify the behavior of the v2 session catalog, but continue to use the v1 catalog for v1 cases. @cloud-fan, do you have any idea about how to make this possible?

I think @brkyvz is interested in overriding the v2 session catalog as well.

@brkyvz (Contributor) left a comment:

@rdblue I still have some confusion around the default.catalog conf and catalog.session.

  • spark.sql.default.catalog: Name of the default v2 catalog, used when a catalog is not identified in queries
  • spark.sql.catalog.session: Name of the default v2 catalog, used when a catalog is not identified in queries.

The description of both is identical. The default v2 catalog can be configured on a session basis just like the v2_session_catalog. Why can't DEFAULT_V2_CATALOG be used to configure the session catalog?

Contributor:

The session catalog needs to be configurable. This is how custom data sources / table formats will plug in.

Contributor:

Since the nice error message handling is gone, if there's an error, what does it look like?

Contributor Author:

The error message handling is in defaultCatalog() and v2SessionCatalog.

@rdblue commented Jul 11, 2019:

@brkyvz, the description must have been copied by mistake. The session property controls what implementation we use for the v2 session catalog. See my comment above for the difference between the v2 session catalog and the default catalog.

@cloud-fan force-pushed the session-catalog branch 2 times, most recently from 6a6b362 to 4f89b88, on July 12, 2019 at 12:52.
@cloud-fan changed the title from "[SPARK-28341][SQL] remove session catalog config" to "[SPARK-28341][SQL] refine the v2 session catalog config" on Jul 12, 2019.
Contributor:

Why does this need to be a StaticConf? Wouldn't this make it impossible to plug in your own session catalog when using the Spark Shell, for example?

Contributor:

I think it makes sense to make this a static conf because the v2 session catalog cannot change after a session loads it for the first time. Static confs can still be set using --conf, right?
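
For example, since a static conf cannot change once the session is up, a custom implementation would be supplied before the session is created, via --conf on spark-submit or at build time as below; com.example.MySessionCatalog is a hypothetical class name:

import org.apache.spark.sql.SparkSession

// Equivalent to passing --conf spark.sql.catalog.session=com.example.MySessionCatalog
// on spark-submit; must be set before the session exists.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.session", "com.example.MySessionCatalog")
  .getOrCreate()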

@SparkQA commented Jul 12, 2019:

Test build #107596 has finished for PR 25104 at commit 4f89b88.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(conf: SQLConf, val sessionCatalog: SessionCatalog) extends Logging
  • .doc("The implementation class of the v2 session catalog, which is a wrapper of the internal" +
  • class V2SessionCatalog extends TableCatalog with RequiresSessionCatalog

Contributor:

This has a method, so it isn't a marker interface.

This should note that this is for implementations that want to use Spark's session catalog for storage, which is needed for writing replacements for V2SessionCatalog. Maybe also mention that this will go away when the SessionCatalog is removed.

Contributor:

If this doesn't replace the existing catalog initialization method, then it should have a different name. How about setSessionCatalog?

Contributor:

This should not fall back to using an implementation other than the one the user configured. If the configured v2 session catalog cannot be instantiated, then the error should be thrown each time there is an attempt to use it.

Contributor:

This should not swallow all non-fatal errors. It should catch CatalogNotFoundException only.

@SparkQA commented Jul 12, 2019:

Test build #107598 has finished for PR 25104 at commit 5f0c5e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CatalogManager(conf: SQLConf, val sessionCatalog: SessionCatalog) extends Logging
  • .doc("The implementation class of the v2 session catalog, which is a wrapper of the internal" +
  • class V2SessionCatalog extends TableCatalog with RequiresSessionCatalog

Contributor:

Why is this TableCatalog and not CatalogPlugin?

Contributor:

This should reference a JIRA issue to track the fix.

Contributor:

This should use the original test case to minimize changes.

Contributor:

It is odd that this requires passing in the session when getTestCatalog does not. I would prefer not using this function in places where the session is different.

Contributor:

Is this import used? Same with V2SessionCatalog. And check HiveSessionStateBuilder, too.

@SparkQA commented Sep 5, 2019:

Test build #110162 has finished for PR 25104 at commit 155c37f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 5, 2019:

Test build #110169 has finished for PR 25104 at commit 75e2ca7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 5, 2019:

Test build #110170 has finished for PR 25104 at commit f89f25a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 5, 2019:

Test build #110184 has finished for PR 25104 at commit 494e031.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 6, 2019:

Test build #110218 has finished for PR 25104 at commit 8240062.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author):

retest this please

try {
  catalogs.getOrElseUpdate(CatalogManager.SESSION_CATALOG_NAME, loadV2SessionCatalog())
} catch {
  case NonFatal(_) => defaultSessionCatalog
}
Contributor:

I'd log the error. The user asked for a specific catalog, but we're giving them the default. There's no way to figure out the discrepancy without seeing the error.
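
A sketch of the suggested change (hypothetical, not the merged code; it assumes scala.util.control.NonFatal is imported and the enclosing class mixes in Spark's Logging trait):

try {
  catalogs.getOrElseUpdate(CatalogManager.SESSION_CATALOG_NAME, loadV2SessionCatalog())
} catch {
  case NonFatal(e) =>
    // Surface the failure before falling back, so users can tell why the
    // catalog they configured is not the one being used.
    logError("Failed to load the configured v2 session catalog; " +
      "falling back to the default implementation.", e)
    defaultSessionCatalog
}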

@brkyvz (Contributor) left a comment:

LGTM except for the one case around logging the error. Should we also make V2SessionCatalog private[spark] after this change?

@SparkQA commented Sep 6, 2019:

Test build #110248 has finished for PR 25104 at commit 8240062.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz commented Sep 6, 2019:

Another question: could we have some guidance on how a StagingTableCatalog can leverage the CatalogExtension API?

@cloud-fan (Contributor Author):

If V2SessionCatalog implements StagingTableCatalog one day, it should still call createTable, dropTable, etc. somewhere, and those calls can still be extended by CatalogExtension.

If users want to override the staging logic, they can implement StagingTableCatalog themselves.
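
A toy sketch of that layering, with hypothetical stand-in types rather than Spark's API: even if table creation is staged, the staged commit funnels back through the overridable createTable, so an extension's hook still fires.

trait StagedTable { def commitStagedChanges(): Unit }

abstract class SessionCatalogBase {
  def createTable(ident: String): Unit  // the hook point an extension overrides

  // Hypothetical staging API: creation is deferred to commit time, but the
  // commit still goes through createTable, so extensions see it.
  def stageCreate(ident: String): StagedTable = new StagedTable {
    def commitStagedChanges(): Unit = createTable(ident)
  }
}

class PrintingCatalog extends SessionCatalogBase {
  def createTable(ident: String): Unit = println(s"create $ident")
}

// new PrintingCatalog().stageCreate("t").commitStagedChanges()  // prints "create t"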

@SparkQA commented Sep 9, 2019:

Test build #110328 has finished for PR 25104 at commit 05db860.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor Author):

retest this please

@SparkQA commented Sep 9, 2019:

Test build #110336 has finished for PR 25104 at commit 05db860.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan closed this in abec6d7 on Sep 9, 2019.
@cloud-fan (Contributor Author):

thanks for the review, merging to master!

@cloud-fan (Contributor Author):

BTW, V2SessionCatalog is in the execution package, which is meant to be private, like the catalyst package.

@rdblue commented Sep 11, 2019:

@cloud-fan, if execution is intended to be private, should we move those classes into catalyst?

@cloud-fan (Contributor Author):

We can do it if V2SessionCatalog doesn't refer to anything in sql/core.

PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019

Closes apache#25104 from cloud-fan/session-catalog.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>