Skip to content

Commit bb44c84

Browse files
authored
NoSQL: Persistence API (#2965)
Provides the persistence API parts for the NoSQL persistence work. Most of this PR are purely interfaces and annotations. It consists of a low-level `Persistence` interface to read and write individual `Obj`ects and the corresponding pluggable `ObjType`s. The API module also contains upstream SPIs for database specific implementations and downstream APIs for atomic commits and indexes. Also some CDI specific infrastructure helper annotations. Unit tests cover the few pieces of actual executable code in this change. This change also adds a README, which references functionality and modules that are not in this PR, but already provide an overview of the overall interactions.
1 parent 639340b commit bb44c84

File tree

68 files changed

+6041
-0
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

68 files changed

+6041
-0
lines changed

bom/build.gradle.kts

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -48,6 +48,8 @@ dependencies {
4848
api(project(":polaris-nodes-impl"))
4949
api(project(":polaris-nodes-spi"))
5050

51+
api(project(":polaris-persistence-nosql-api"))
52+
5153
api(project(":polaris-config-docs-annotations"))
5254
api(project(":polaris-config-docs-generator"))
5355

gradle/libs.versions.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -75,6 +75,7 @@ jakarta-validation-api = { module = "jakarta.validation:jakarta.validation-api",
7575
jakarta-ws-rs-api = { module = "jakarta.ws.rs:jakarta.ws.rs-api", version = "4.0.0" }
7676
javax-servlet-api = { module = "javax.servlet:javax.servlet-api", version = "4.0.1" }
7777
junit-bom = { module = "org.junit:junit-bom", version = "5.14.1" }
78+
junit-pioneer = { module = "org.junit-pioneer:junit-pioneer", version = "2.3.0" }
7879
keycloak-admin-client = { module = "org.keycloak:keycloak-admin-client", version = "26.0.7" }
7980
jcstress-core = { module = "org.openjdk.jcstress:jcstress-core", version = "0.16" }
8081
jmh-core = { module = "org.openjdk.jmh:jmh-core", version.ref = "jmh" }

gradle/projects.main.properties

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -63,4 +63,5 @@ polaris-nodes-api=persistence/nosql/nodes/api
6363
polaris-nodes-impl=persistence/nosql/nodes/impl
6464
polaris-nodes-spi=persistence/nosql/nodes/spi
6565
# persistence / database agnostic
66+
polaris-persistence-nosql-api=persistence/nosql/persistence/api
6667
polaris-persistence-nosql-varint=persistence/nosql/persistence/varint
Lines changed: 252 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,252 @@
1+
<!--
2+
Licensed to the Apache Software Foundation (ASF) under one
3+
or more contributor license agreements. See the NOTICE file
4+
distributed with this work for additional information
5+
regarding copyright ownership. The ASF licenses this file
6+
to you under the Apache License, Version 2.0 (the
7+
"License"); you may not use this file except in compliance
8+
with the License. You may obtain a copy of the License at
9+
10+
http://www.apache.org/licenses/LICENSE-2.0
11+
12+
Unless required by applicable law or agreed to in writing,
13+
software distributed under the License is distributed on an
14+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
KIND, either express or implied. See the License for the
16+
specific language governing permissions and limitations
17+
under the License.
18+
-->
19+
20+
# Database agnostic persistence framework
21+
22+
The NoSQL persistence API and functional implementations are based on the assumption that all databases targeted as
23+
backing stores for Polaris support "compare and swap" operations on a single row.
24+
These CAS operations are the only requirement.
25+
26+
Since some databases do enforce hard size limits, for example, DynamoDB has a hard 400kB row size limit.
27+
MariaDB/MySQL has a default 512kB packet size limit.
28+
Other databases have row-size recommendations around similar sizes.
29+
Polaris persistence respects those limits and recommendations using a common hard limit of 350kB.
30+
31+
Objects exposed via the `Persistence` interface are typed Java objects that must be immutable and serializable using
32+
Jackson.
33+
Each type is described via an implementation of the `ObjType` interface using a name, which must be unique
34+
in Polaris, and a target Java type: the Jackson serializable Java type.
35+
Object types are registered using the Java service API using `ObjType`.
36+
The actual java target types must extend the `Obj` interface.
37+
The (logical) key for each `Obj` is a composite of the `ObjType.id()` and a `long` ID (64-bit signed int),
38+
combined using the `ObjId` composite type.
39+
40+
The "primary key" of each object in a database is always _realmId + object-ID_, where realm-ID is a string and
41+
object-ID is a 64-bit integer.
42+
This allows, but does not enforce, storing multiple realms in one backend database.
43+
44+
Data in/for/of each Polaris realm (think: _tenant_) is isolated using the realm's ID (string).
45+
The base `Persistence` API interface is always scoped to exactly one realm ID.
46+
47+
## Supporting more databases
48+
49+
The code to support a particular database is isolated in a project, for example `polaris-persistence-nosql-inmemory` and
50+
`polaris-persistence-nosql-mongodb`.
51+
52+
When adding another database, it must also be wired up to Quarkus in `polaris-persistence-nosql-cdi-quarkus` preferably
53+
using Quarkus extensions, added to the `polaris-persitence-corectness` tests and available in
54+
`polaris-persistence-nosql-benchmark` for low-level benchmarks.
55+
56+
## Named pointers
57+
58+
Polaris represents a catalog for data lakehouses, which means that the information of and for catalog entities like
59+
Iceberg tables, views, and namespaces must be consistent, even if multiple catalog entities are changes in a single
60+
atomic operation.
61+
62+
Polaris leverages a concept called "Named pointers."
63+
The state of the whole catalog is referenced via the so-called HEAD (think: Git HEAD),
64+
which _points to_ all catalog entities.
65+
This state is persisted as an `Obj` with an index of the catalog entities,
66+
the ID of that "current catalog state `Obj`" is maintained in one named pointer.
67+
68+
Named pointers are also used for other purposes than catalog entities, for example, to maintain realms or
69+
configurations.
70+
71+
## Committing changes
72+
73+
Changes are persisted using a commit mechanism, providing atomic changes across multiple entities against one named
74+
pointer.
75+
The logic implementation ensures that even high-frequency concurrent changes do neither let clients fail
76+
nor cause timeouts.
77+
The behavior and achievable throughput depend on the database being used; some databases perform
78+
_much_ better than others.
79+
80+
A use-case agnostic "committer" abstraction exists to ease implementing committing operations.
81+
For catalog operations, there is a more specialized abstraction.
82+
83+
## `long` IDs
84+
85+
Polaris NoSQL persistence uses so-called Snowflake IDs, which are 64-bit integers that represent a timestamp, a
86+
node-ID, and a sequence number.
87+
The epoch of these timestamps is 2025-03-01-00:00:00.0 GMT.
88+
Timestamps occupy 41 bits at millisecond precision, which lasts for about 69 years.
89+
Node-IDs are 10 bits, which allows 1024 concurrently active "JVMs running Polaris."
90+
Twelve (12) bits are used by the sequence number, which then allows each node to generate 4096 IDs per
91+
millisecond.
92+
One bit is reserved for future use.
93+
94+
Node IDs are leased by every "JVM running Polaris" for a period of time.
95+
The ID generator implementation guarantees that no IDs will be generated for a timestamp that exceeds the "lease time."
96+
Leases can be extended.
97+
The implementation leverages atomic database operations (CAS) for the lease implementation.
98+
99+
ID generators must not use timestamps before or after the lease period, nor must they re-use an older timestamp.
100+
This requirement is satisfied using a monotonic clock implementation.
101+
102+
## Caching
103+
104+
Since most `Obj`s are by default assumed to be immutable, caching is very straight forward and does not require any
105+
coordination, which simplifies the design and implementation quite a bit.
106+
107+
## Strong vs. eventual consistency
108+
109+
Polaris NoSQL persistence offers two ways to persist `Obj`s: strongly consistent and eventually consistent.
110+
The former is slower than the latter.
111+
112+
Since Polaris NoSQL persistence respects the hard size limitations mentioned above, it cannot persist the serialized
113+
representation of objects that exceed those limits in a single database row.
114+
However, some objects legibly exceed those limits.
115+
Polaris NoSQL persistence allows such "big object serializations" and writes those into multiple database rows,
116+
with the restriction that this is only supported for eventually consistent write operations.
117+
The serialized representation for strong consistency writes must always be within the hard limit.
118+
119+
## Indexes
120+
121+
The state of a data-lakehouse catalog can contain many thousand, potentially a few 100,000, tables/views/namespaces.
122+
Even space-efficient serialization of an index for that many entries can exceed the "common hard 350kB limit."
123+
New changes end in the index, which is "embedded" in the "current catalog state `Obj`".
124+
If the respective index size limit of this "embedded" index is being approached,
125+
the index is spilled out to separate rows in the database.
126+
The implementation is built to split and combine when needed.
127+
128+
## Change log / events / notifications
129+
130+
The commit mechanism described above builds a commit log.
131+
All changes can be inspected via that log in exactly the order in which those happened (think: `git log`).
132+
Since the log of changes is already present, it is possible to retrieve the changes from some point in time or
133+
commit log ID.
134+
This allows clients to receive all changes that have happened since the last known commit ID,
135+
offering a mechanism to poll for changes.
136+
Since the necessary `Obj`s are immutable,
137+
such change-log-requests likely hit already cached data and rather not the database.
138+
139+
## Clean up old commits / unused data
140+
141+
Despite the beauty of having a "commit log" and all metadata representation in the backing database,
142+
the size of that database would always grow.
143+
144+
Purging unused table/view metadata memoized in the database is one piece.
145+
Purging old commit log entries is the second part.
146+
Purging (then) unreferenced `Obj`s the third part.
147+
148+
See [maintenance service](#maintenance-service) below.
149+
150+
## Realms (aka tenants)
151+
152+
Bootstrapping but more importantly, deleting/purging a realm is a non-trivial operation, which requires its own
153+
lifecycle.
154+
Bootstrapping is a straight forward operation as the necessary information can be validated and enhanced if necessary.
155+
156+
Both the logical but also the physical process of realm deletion are more complex.
157+
From a logical point of view,
158+
users want to disable the realm for a while before they eventually are okay with deleting the information.
159+
160+
The process to delete a realm's data from the database can be quite time-consuming, and how that happens is
161+
database-specific.
162+
While some databases can do bulk-deletions, which "just" take some time (RDBMS, BigTable), other databases
163+
require that the process of deleting a realm must happen during a full scan of the database (for example, RocksDB
164+
and Apache Cassandra).
165+
Since scanning the whole database itself can take quite long, and no more than one instance should scan the database
166+
at any time.
167+
168+
The realm has a status to reflect its lifecycle.
169+
The initial status of a realm is `CREATED`, which effectively only means that the realm-ID has been reserved and that
170+
the necessary data needs to be populated (bootstrap).
171+
Once a realm has been fully bootstrapped, its status is changed to `ACTIVE`.
172+
Only `ACTIVE` realms can be used for user requests.
173+
174+
Between `CREATED` and `ACTIVE`/`INACTIVE` there are two states that are mutually exclusive.
175+
The state `INITIALIZING` means that Polaris will initialize the realm as a fresh, new realm.
176+
The state `LOADING` means that realm data, which has been exported from another Polaris instance, is to be imported.
177+
178+
Realm deletion is a multistep approach as well: Realms are first put into `INACTIVE` state, which can be reverted
179+
to `ACTIVE` state or into `PURGING` state.
180+
The state `PURGING` means that the realm's data is being deleted from the database,
181+
once purging has been started, the realm's information in the database is inconsistent and cannot be restored.
182+
Once the realm's data has been purged, the realm is put into `PURGED` state. Only realms that are in state `PURGED`
183+
can be deleted.
184+
185+
The multi-state approach also prevents that a realm can only be used when the system knows that all necessary
186+
information is present.
187+
188+
**Note**: the realm state machine is not fully implemented yet.
189+
190+
## `::system::` realm
191+
192+
Polaris NoSQL persistence uses a system realm which is used for node ID leases and realm management.
193+
The realm-IDs starting with two colons (`::`) are reserved for system use.
194+
195+
### Named pointers in the `::system::` realm
196+
197+
| Named pointer | Meaning |
198+
|---------------|-----------------|
199+
| `realms` | Realms, by name |
200+
201+
## "User" realms
202+
203+
### Named pointers in the user realms
204+
205+
| Named pointer | Meaning |
206+
|-------------------|------------------------------|
207+
| `root` | Pointer to the "root" entity |
208+
| `catalogs` | Catalogs |
209+
| `principals` | Principals |
210+
| `principal-roles` | Principal roles |
211+
| `grants` | All grants |
212+
| `immediate-tasks` | Immediately scheduled tasks |
213+
| `policy-mappings` | Policy mappings |
214+
215+
Per catalog named pointers, where `%d` refers to the catalog's integer ID:
216+
217+
| Named pointer | Meaning |
218+
|---------------------|--------------------------------------------------|
219+
| `cat/%d/roles` | Catalog roles |
220+
| `cat/%d/heads/main` | Catalog content (namespaces, tables, views, etc) |
221+
| `cat/%d/grants` | Catalog related grants (*) |
222+
223+
(*) = currently not used, stored in the realm grants.
224+
225+
## Maintenance Service
226+
227+
**Note**: maintenance service not yet in the code base.
228+
229+
The maintenance service is a mechanism to scan the backend database and perform necessary maintenance operations
230+
as a background service.
231+
232+
The most important maintenance operation is to purge unreferenced objects from the database.
233+
Pluggable "identifiers" are used to "mark" objects to retain.
234+
235+
The implementation calls all per-realm "identifiers," which then "mark" the named pointers and objects that have to be
236+
retained.
237+
Plugins and/or extensions can provide per-object-type "identifiers," which get called for "identified" objects.
238+
The second phase of the maintenance service scans the whole backend database and purges those objects,
239+
which have not been "marked" to be retained.
240+
241+
Maintenance service invocations require two sets of realm-ids: the set of realms to retain and the set of realms
242+
to purge.
243+
These sets can be derived using `RealmManagement.list()` and grouping realms by their status.
244+
245+
### Purging realms
246+
247+
Eventually, purging realm from the backend database can happen in two different ways, depending on the database.
248+
Some databases support deleting one or more realms using bulk deletions.
249+
Other databases do not support this kind of bulk deletion.
250+
Both ways are supported by the maintenance service.
251+
252+
Eventually, purging realms is a responsibility of the [maintenance service](#maintenance-service).
Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,71 @@
1+
/*
2+
* Licensed to the Apache Software Foundation (ASF) under one
3+
* or more contributor license agreements. See the NOTICE file
4+
* distributed with this work for additional information
5+
* regarding copyright ownership. The ASF licenses this file
6+
* to you under the Apache License, Version 2.0 (the
7+
* "License"); you may not use this file except in compliance
8+
* with the License. You may obtain a copy of the License at
9+
*
10+
* http://www.apache.org/licenses/LICENSE-2.0
11+
*
12+
* Unless required by applicable law or agreed to in writing,
13+
* software distributed under the License is distributed on an
14+
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
* KIND, either express or implied. See the License for the
16+
* specific language governing permissions and limitations
17+
* under the License.
18+
*/
19+
20+
plugins {
21+
id("org.kordamp.gradle.jandex")
22+
id("polaris-server")
23+
}
24+
25+
description = "Polaris NoSQL persistence API, no concrete implementations"
26+
27+
dependencies {
28+
api(project(":polaris-version"))
29+
api(project(":polaris-misc-types"))
30+
31+
implementation(project(":polaris-idgen-api"))
32+
implementation(project(":polaris-nodes-api"))
33+
implementation(project(":polaris-persistence-nosql-varint"))
34+
35+
implementation(platform(libs.jackson.bom))
36+
implementation("com.fasterxml.jackson.core:jackson-annotations")
37+
implementation("com.fasterxml.jackson.core:jackson-core")
38+
implementation("com.fasterxml.jackson.core:jackson-databind")
39+
runtimeOnly("com.fasterxml.jackson.datatype:jackson-datatype-guava")
40+
runtimeOnly("com.fasterxml.jackson.datatype:jackson-datatype-jdk8")
41+
runtimeOnly("com.fasterxml.jackson.datatype:jackson-datatype-jsr310")
42+
43+
implementation(libs.guava)
44+
implementation(libs.slf4j.api)
45+
46+
compileOnly(libs.smallrye.config.core)
47+
compileOnly(platform(libs.quarkus.bom))
48+
compileOnly("io.quarkus:quarkus-core")
49+
50+
compileOnly(project(":polaris-immutables"))
51+
annotationProcessor(project(":polaris-immutables", configuration = "processor"))
52+
53+
compileOnly(libs.jakarta.annotation.api)
54+
compileOnly(libs.jakarta.validation.api)
55+
compileOnly(libs.jakarta.inject.api)
56+
compileOnly(libs.jakarta.enterprise.cdi.api)
57+
58+
testImplementation(platform(libs.jackson.bom))
59+
testImplementation("com.fasterxml.jackson.dataformat:jackson-dataformat-smile")
60+
61+
testImplementation(libs.junit.pioneer)
62+
63+
testFixturesImplementation(platform(libs.jackson.bom))
64+
testFixturesImplementation("com.fasterxml.jackson.core:jackson-databind")
65+
66+
testFixturesCompileOnly(project(":polaris-immutables"))
67+
testFixturesAnnotationProcessor(project(":polaris-immutables", configuration = "processor"))
68+
69+
testFixturesCompileOnly(libs.jakarta.annotation.api)
70+
testFixturesCompileOnly(libs.jakarta.validation.api)
71+
}

0 commit comments

Comments
 (0)