Add options to the bootstrap command to specify a schema file #1942
Conversation
dimas-b left a comment:
The general idea LGTM. I've got some comments, though :)
```java
if (isBootstrap) {
  // If this is set, we are bootstrapping
  if (schemaOptions.isPresent()) {
```
Since we're touching this code, I'd like to avoid having bootstrap logic on the main runtime call path (in server). Do you think we could refactor that to execute DDL only in the admin tool?
I tried my hand at this in 6fee80f, let me know if that looks like what you had in mind
Looking at it again: I only pulled the script execution up one level, into bootstrapRealms, rather than into the admin tool itself -- though it only runs as a result of that admin-tool code path. If we want to pull the logic out of even bootstrapRealms, I think we may need some refactoring of MetaStoreManagerFactory, which I would prefer to avoid doing in this PR.
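To illustrate the structure being discussed, here is a very rough sketch (all names are hypothetical, not the actual Polaris classes): the schema script runs only inside the bootstrap path that the admin tool invokes, so the regular server call path never executes DDL.

```java
import java.util.List;

final class BootstrapFlowSketch {

  interface SchemaScriptRunner {
    void runSchemaScript(String realm);
  }

  // Called only from the admin tool's bootstrap command.
  static void bootstrapRealms(List<String> realms, SchemaScriptRunner runner) {
    for (String realm : realms) {
      runner.runSchemaScript(realm); // DDL happens here, once per realm
    }
  }

  // Regular server startup: no schema DDL on this path.
  static void startServer() {
    // ... create persistence sessions and serve requests; bootstrapRealms is never called here
  }
}
```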
```java
} else {
  String schemaSuffix;
  switch (schemaOptions.schemaVersion()) {
    case SchemaOptions.LATEST -> schemaSuffix = "schema-v1.sql";
```
suggestion: make schemaVersion() optional and default to latest.
In SchemaOptions it is:

```java
@Value.Default
default String schemaVersion() {
  return LATEST;
}
```
I mean: in the bootstrap code, default to whatever version happens to be latest when no explicit version is requested, but avoid having the LATEST constant.
Alternatively: use the actual current schema version as the value of the LATEST constant.
wait... that does not carry over to NoSQL 🤔
Do we really need the LATEST constant now?
I was thinking it's nice to have so that users can be explicit about taking the LATEST schema -- the alternative is just that you get LATEST by not setting the version, right?
Besides the ability to be explicit about wanting the LATEST, I think the constant is potentially useful in the contrived scenario where a persistence implementation wants to default to a non-LATEST schema.
Fair enough... but in this case I'd prefer to go back to the non-null schemaVersion() with LATEST as the default in the Immutable class... Sorry about the back-and-forth 🤦
Well, in that case we actually lose this:

> the contrived scenario where a persistence wants to default to a non-LATEST schema

Since there is no way for the value to be unset, it's either LATEST or some number. But I don't feel too strongly about this, as I think that scenario is pretty unlikely. I'm happy to either leave the default or remove LATEST; I'll try getting rid of LATEST for now.
right 🤦
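To make the two shapes discussed above concrete, here is a minimal sketch assuming the Immutables annotation processor is available; the interface names are illustrative, not the final SchemaOptions:

```java
import java.util.Optional;
import org.immutables.value.Value;

// Shape A: keep an explicit LATEST marker and default to it.
@Value.Immutable
interface SchemaOptionsWithLatest {
  String LATEST = "LATEST";

  @Value.Default
  default String schemaVersion() {
    return LATEST;
  }
}

// Shape B (what the thread converges on): drop the constant entirely; an absent
// value means "use whatever schema version is currently latest".
@Value.Immutable
interface SchemaOptionsWithOptionalVersion {
  Optional<String> schemaVersion();
}
```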
Resolved comment thread on runtime/admin/src/main/java/org/apache/polaris/admintool/BootstrapCommand.java
Resolved (outdated) comment thread on polaris-core/src/main/java/org/apache/polaris/core/persistence/bootstrap/SchemaOptions.java
```java
        String.format("%s/schema-v2.sql", DatabaseType.H2.getDisplayName()));
ClassLoader classLoader = DatasourceOperations.class.getClassLoader();
InputStream scriptStream =
    classLoader.getResourceAsStream(
```
nit: this stream normally needs to be closed, AFAIK.
Ditto the above -- I think we just weren't closing it before, but I've added an explicit close inside executeScript to fix this.
Why not a try-with-resources block?
Per the javadoc, I think it's better to have this method take on the responsibility for closing the stream after it's used rather than scatter try-with-resources amongst the callers. The loan pattern would be better, but we're already getting pretty far away from the intent of this change.
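For reference, a minimal sketch of the loan pattern mentioned above, using an entirely hypothetical helper (not existing Polaris code): the helper owns opening and closing the classpath resource and lends the stream to the caller's function, so callers never touch close() themselves.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.function.Function;

final class ResourceLoan {

  // Opens the named classpath resource, passes it to `use`, and always closes it.
  static <T> T withResource(ClassLoader loader, String name, Function<InputStream, T> use) {
    try (InputStream in = loader.getResourceAsStream(name)) {
      if (in == null) {
        throw new IllegalArgumentException("Resource not found: " + name);
      }
      return use.apply(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```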
Resolved (outdated) comment thread on ...-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/DatasourceOperations.java
```diff
   /**
-   * Execute SQL script.
+   * Execute SQL script and close the associated input stream
```
WDYT about adding @WillClose to the stream parameter?
That sounds like a good idea... is this the one from jsr305? It doesn't seem that we have that dependency currently, and we're not using it anywhere else in the project.
yes, that jar 👍 TBH, I was surprised we did not have any existing use cases 😅
I noticed this is in the banned-dependencies.txt?

```
# Contains old javax.* annotations that we do not want
com.google.code.findbugs:jsr305
```
TIL... well, I'm fine with not using it in this case.
Me too! I learned this when I went to add the dep 😅
If there is another, similar annotation we start using in the future, that would be nice, as this one does sound pretty helpful.
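For reference, a minimal sketch of what the @WillClose idea would have looked like if jsr305 were not banned (the interface and method shape are assumed, not the actual DatasourceOperations API):

```java
import java.io.InputStream;
import java.sql.SQLException;
import javax.annotation.WillClose;

interface ScriptExecutor {

  /** Execute SQL script and close the associated input stream. */
  void executeScript(@WillClose InputStream scriptInputStream) throws SQLException;
}
```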
```java
  });
} finally {
  try {
    scriptInputStream.close();
```
I'm pretty sure the try on line 88 closes this stream too... Could you double check?
Currently no, the only resource is the Statement:

```java
try (Statement statement = connection.createStatement())
```

But if we added the BufferedReader there too, it might close the stream. Do you want to do that instead?
Yeah, if possible... I think try-with-resources is more intuitive and we get better exception propagation OOTB.
Do we still need this call to .close() now that we have the BufferedReader inside the try-with-resources?
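A minimal sketch of the shape this thread converges on (the method body and names are assumed, not the actual DatasourceOperations code, and splitting statements on ';' is a simplification): with the BufferedReader inside the try-with-resources, closing the reader also closes the underlying stream, so the extra scriptInputStream.close() in the finally block becomes redundant.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

final class ExecuteScriptSketch {

  static void executeScript(Connection connection, InputStream scriptInputStream)
      throws IOException, SQLException {
    try (Statement statement = connection.createStatement();
        BufferedReader reader =
            new BufferedReader(
                new InputStreamReader(scriptInputStream, StandardCharsets.UTF_8))) {
      StringBuilder current = new StringBuilder();
      String line;
      while ((line = reader.readLine()) != null) {
        current.append(line).append('\n');
        if (line.trim().endsWith(";")) {
          statement.execute(current.toString()); // run one statement at a time
          current.setLength(0);
        }
      }
    }
    // No explicit scriptInputStream.close() needed: closing the reader closed the stream.
  }
}
```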
Resolved (outdated) comment thread on ...lational-jdbc/src/main/java/org/apache/polaris/persistence/relational/jdbc/DatabaseType.java
dimas-b left a comment:
LGTM with one minor comment about possibly unnecessary .close() call (old comment thread)
Sorry, looks like the old auto-merge went in with your comment still outstanding @dimas-b! That was set a while ago, before this InputStream business.

No worries, but if you do not mind, let's remove the unnecessary close call. I'm not so much worried about calling it twice, but the surrounding try/catch code makes the method a bit more complex than necessary :)

Sounds good @dimas-b, please check PR #1982, and sorry again about the auto-merge!
From the follow-up PR #1982: #1942 changed the way that the bootstrap init script is handled, but it added an extra `InputStream.close` call that shouldn't be needed after the BufferedReader [here](https://github.com/apache/polaris/pull/1942/files#diff-de43b240b5b5e07aba7e89f5515a417cefd908845b85432f3fcc0819911f3e2eR89) is closed. This PR removes that extra call.
This PR was later referenced by a downstream merge commit (last merged commit 8fa6bf2) that pulled it in alongside a long list of other upstream Polaris changes.
Original PR description: Instead of always using the hardcoded `schema-v1.sql` file, it would be nice if users could specify a file to bootstrap from. This is especially relevant after #1686, which proposes to add a new "version" of the schema.