<!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements. See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership. The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied. See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# Database-agnostic persistence framework

The NoSQL persistence API and functional implementations are based on the assumption that all databases targeted as
backing stores for Polaris support "compare and swap" (CAS) operations on a single row.
These CAS operations are the only requirement.

Some databases enforce hard size limits; for example, DynamoDB has a hard 400kB row-size limit, and
MariaDB/MySQL has a default 512kB packet-size limit.
Other databases have row-size recommendations around similar sizes.
Polaris persistence respects those limits and recommendations by using a common hard limit of 350kB.

Objects exposed via the `Persistence` interface are typed Java objects that must be immutable and serializable using
Jackson.
Each type is described via an implementation of the `ObjType` interface using a name, which must be unique
in Polaris, and a target Java type: the Jackson-serializable Java type.
Object types are registered through the Java service API as `ObjType` implementations.
The actual Java target types must extend the `Obj` interface.
The (logical) key for each `Obj` is a composite of the `ObjType.id()` and a `long` ID (a 64-bit signed integer),
combined using the `ObjId` composite type.
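
As a hedged illustration, a custom object type might look roughly like the following; the exact method
names on `Obj` and `ObjType` are assumptions for illustration, not the literal Polaris API:

```java
// Hypothetical example type; exact Obj/ObjType method names are assumptions.
public interface ExampleObj extends Obj {
  String someAttribute(); // immutable, Jackson-serializable payload
}

// Describes the type: a Polaris-wide unique name plus the target Java type.
// Discovered through the Java service API (e.g. a META-INF/services entry).
public class ExampleObjType implements ObjType {
  @Override
  public String id() {
    return "example"; // must be unique in Polaris
  }

  @Override
  public Class<? extends Obj> targetClass() {
    return ExampleObj.class;
  }
}
```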

The "primary key" of each object in a database is always _realm-ID + object-ID_, where the realm-ID is a string and
the object-ID is a 64-bit integer.
This allows, but does not enforce, storing multiple realms in one backend database.

Data in/for/of each Polaris realm (think: _tenant_) is isolated using the realm's ID (string).
The base `Persistence` API interface is always scoped to exactly one realm ID.

## Supporting more databases

The code to support a particular database is isolated in its own project, for example `polaris-persistence-nosql-inmemory` and
`polaris-persistence-nosql-mongodb`.

When adding another database, it must also be wired up to Quarkus in `polaris-persistence-nosql-cdi-quarkus`, preferably
using Quarkus extensions, added to the `polaris-persistence-correctness` tests, and made available in
`polaris-persistence-nosql-benchmark` for low-level benchmarks.

## Named pointers

Polaris represents a catalog for data lakehouses, which means that the information of and for catalog entities like
Iceberg tables, views, and namespaces must be consistent, even if multiple catalog entities are changed in a single
atomic operation.

Polaris leverages a concept called "named pointers."
The state of the whole catalog is referenced via the so-called HEAD (think: Git HEAD),
which _points to_ all catalog entities.
This state is persisted as an `Obj` with an index of the catalog entities;
the ID of that "current catalog state `Obj`" is maintained in one named pointer.

Named pointers are also used for purposes other than catalog entities, for example, to maintain realms or
configurations.
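
Conceptually, a read first resolves the named pointer and then follows the referenced `Obj`s. A minimal
sketch, where `fetchReference`, `CatalogStateObj`, `TableObj`, and `keyFor` are illustrative assumptions:

```java
// Conceptual read path (illustrative names, not the literal API).
TableObj lookupTable(Persistence persistence) {
  // 1. Resolve the named pointer to the ID of the current catalog-state Obj.
  ObjId headId = persistence.fetchReference("cat/42/heads/main").pointer();
  // 2. Load that Obj, which carries an index of all catalog entities.
  CatalogStateObj head = persistence.fetch(headId, CatalogStateObj.class);
  // 3. Look up an individual entity, for example a table, via the index.
  ObjId tableId = head.index().get(keyFor("my_namespace", "my_table"));
  return persistence.fetch(tableId, TableObj.class);
}
```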

## Committing changes

Changes are persisted using a commit mechanism, providing atomic changes across multiple entities against one named
pointer.
The commit logic ensures that even high-frequency concurrent changes neither cause clients to fail
nor lead to timeouts.
The behavior and achievable throughput depend on the database being used; some databases perform
_much_ better than others.

A use-case-agnostic "committer" abstraction exists to ease implementing committing operations.
For catalog operations, there is a more specialized abstraction.
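
The core idea behind committing can be sketched as a CAS retry loop; all names below are illustrative
assumptions, not the actual committer API:

```java
// Conceptual CAS retry loop behind a commit (illustrative names).
void commit(Persistence persistence) {
  while (true) {
    // Read the current state of the named pointer.
    Reference ref = persistence.fetchReference("cat/42/heads/main");
    CatalogStateObj oldHead = persistence.fetch(ref.pointer(), CatalogStateObj.class);

    // Build and persist the new state; a new Obj gets a new ID, so writing
    // it cannot conflict with concurrent writers.
    CatalogStateObj newHead = applyChanges(oldHead);
    persistence.write(newHead);

    // Atomically swing the named pointer. On failure another commit won the
    // race; re-read and retry (with backoff in practice).
    if (persistence.compareAndSwapReference(ref, newHead.id())) {
      return;
    }
  }
}
```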

## `long` IDs

Polaris NoSQL persistence uses so-called Snowflake IDs, which are 64-bit integers that represent a timestamp, a
node-ID, and a sequence number.
The epoch of these timestamps is 2025-03-01T00:00:00.0 GMT.
Timestamps occupy 41 bits at millisecond precision, which lasts for about 69 years.
Node-IDs occupy 10 bits, which allows 1024 concurrently active "JVMs running Polaris."
Twelve (12) bits are used by the sequence number, which allows each node to generate 4096 IDs per
millisecond.
One bit is reserved for future use.
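
With these bit widths, composing an ID could look like the following; the exact ordering of the fields
within the 64 bits is an assumption for illustration:

```java
// Illustrative snowflake-ID composition: 1 reserved bit, 41-bit timestamp,
// 10-bit node-ID, 12-bit sequence number. The field order is an assumption.
static final long EPOCH_MILLIS = 1_740_787_200_000L; // 2025-03-01T00:00:00.0 GMT

static long makeId(long wallClockMillis, long nodeId, long sequence) {
  long timestamp = wallClockMillis - EPOCH_MILLIS; // 41 bits, lasts ~69 years
  return (timestamp << 22) // bits 22..62; the top bit stays 0 (reserved)
      | (nodeId << 12)     // 10 bits: up to 1024 concurrent nodes
      | sequence;          // 12 bits: up to 4096 IDs per node and millisecond
}
```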

Node-IDs are leased by every "JVM running Polaris" for a period of time.
The ID generator implementation guarantees that no IDs will be generated for a timestamp that exceeds the "lease time."
Leases can be extended.
The implementation leverages atomic database operations (CAS) for the lease implementation.

ID generators must not use timestamps before or after the lease period, nor may they re-use an older timestamp.
This requirement is satisfied using a monotonic clock implementation.
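
A monotonic clock can be sketched as "never return a value smaller than the last one handed out", for
example:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal monotonic-clock sketch: returned values never go backwards, even
// if the wall clock does, so an older timestamp is never re-used.
final class MonotonicClock {
  private final AtomicLong lastMillis = new AtomicLong();

  long currentTimeMillis() {
    return lastMillis.updateAndGet(prev -> Math.max(prev, System.currentTimeMillis()));
  }
}
```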

## Caching

Since most `Obj`s are by default assumed to be immutable, caching is very straightforward and does not require any
coordination, which simplifies the design and implementation quite a bit.
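
Because a given `ObjId` always resolves to the same immutable content, cache entries never need to be
invalidated, only evicted for space. A trivial sketch, with a hypothetical loader method:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Coordination-free cache sketch: an entry is either absent or correct,
// so eviction is purely a size/memory concern.
final class ObjCache {
  private final ConcurrentMap<ObjId, Obj> cache = new ConcurrentHashMap<>();
  private final Persistence persistence; // realm-scoped, per the API above

  ObjCache(Persistence persistence) {
    this.persistence = persistence;
  }

  Obj fetchCached(ObjId id) {
    return cache.computeIfAbsent(id, this::fetchFromDatabase);
  }

  private Obj fetchFromDatabase(ObjId id) {
    return persistence.fetch(id, Obj.class); // hypothetical loader call
  }
}
```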

## Strong vs. eventual consistency

Polaris NoSQL persistence offers two ways to persist `Obj`s: strongly consistent and eventually consistent.
The former is slower than the latter.

Since Polaris NoSQL persistence respects the hard size limitations mentioned above, it cannot persist the serialized
representation of objects that exceed those limits in a single database row.
However, some objects legitimately exceed those limits.
Polaris NoSQL persistence allows such "big object serializations" and writes those into multiple database rows,
with the restriction that this is only supported for eventually consistent write operations.
The serialized representation for strongly consistent writes must always be within the hard limit.
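
A simplified sketch of the chunking idea, splitting a serialized representation against the common 350kB
limit (the function and names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified sketch: split a "big object serialization" into parts that each
// fit the common hard limit; each part is then persisted as its own row.
static final int HARD_LIMIT = 350 * 1024;

static List<byte[]> split(byte[] serialized) {
  List<byte[]> parts = new ArrayList<>();
  for (int off = 0; off < serialized.length; off += HARD_LIMIT) {
    parts.add(Arrays.copyOfRange(serialized, off, Math.min(off + HARD_LIMIT, serialized.length)));
  }
  return parts;
}
```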

## Indexes

The state of a data-lakehouse catalog can contain many thousands, potentially a few hundred thousand, tables/views/namespaces.
Even a space-efficient serialization of an index for that many entries can exceed the common hard 350kB limit.
New changes end up in the index that is "embedded" in the "current catalog state `Obj`".
When the size limit of this "embedded" index is approached,
the index is spilled out to separate rows in the database.
The implementation is built to split and combine index parts as needed.

## Change log / events / notifications

The commit mechanism described above builds a commit log.
All changes can be inspected via that log in exactly the order in which they happened (think: `git log`).
Since the log of changes is already present, it is possible to retrieve the changes from some point in time or
commit log ID.
This allows clients to receive all changes that have happened since the last known commit ID,
offering a mechanism to poll for changes.
Since the necessary `Obj`s are immutable,
such change-log requests likely hit already-cached data rather than the database.
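
Such a poll can be sketched as walking the commit log backwards from HEAD; `CommitObj`, `parentId()`,
and the pointer name are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative polling sketch: walk the commit log backwards from HEAD until
// the last commit the client has already seen, then return oldest-first.
List<CommitObj> changesSince(Persistence persistence, ObjId lastKnownCommitId) {
  List<CommitObj> changes = new ArrayList<>();
  ObjId current = persistence.fetchReference("cat/42/heads/main").pointer();
  while (current != null && !current.equals(lastKnownCommitId)) {
    CommitObj commit = persistence.fetch(current, CommitObj.class); // likely a cache hit
    changes.add(commit);
    current = commit.parentId(); // step towards older commits
  }
  Collections.reverse(changes);
  return changes;
}
```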

## Clean up old commits / unused data

Despite the beauty of having a "commit log" and all metadata representations in the backing database,
the size of that database would otherwise grow forever.

Purging unused table/view metadata memoized in the database is one piece.
Purging old commit log entries is the second piece.
Purging (then-)unreferenced `Obj`s is the third piece.

See [maintenance service](#maintenance-service) below.

## Realms (aka tenants)

Bootstrapping, but more importantly deleting/purging, a realm is a non-trivial operation, which requires its own
lifecycle.
Bootstrapping is a straightforward operation, as the necessary information can be validated and enhanced if necessary.

Both the logical and the physical process of realm deletion are more complex.
From a logical point of view,
users want to disable a realm for a while before they eventually are okay with deleting the information.

The process to delete a realm's data from the database can be quite time-consuming, and how that happens is
database-specific.
While some databases can do bulk deletions, which "just" take some time (RDBMS, BigTable), other databases
require that the deletion of a realm happens during a full scan of the database (for example, RocksDB
and Apache Cassandra).
Scanning the whole database can itself take quite long, and no more than one instance should scan the database
at any time.

The realm has a status to reflect its lifecycle.
The initial status of a realm is `CREATED`, which effectively only means that the realm-ID has been reserved and that
the necessary data needs to be populated (bootstrap).
Once a realm has been fully bootstrapped, its status is changed to `ACTIVE`.
Only `ACTIVE` realms can be used for user requests.

Between `CREATED` and `ACTIVE`/`INACTIVE` there are two states that are mutually exclusive.
The state `INITIALIZING` means that Polaris will initialize the realm as a fresh, new realm.
The state `LOADING` means that realm data, which has been exported from another Polaris instance, is to be imported.

Realm deletion is a multistep approach as well: realms are first put into the `INACTIVE` state, which can be reverted
to the `ACTIVE` state or moved on to the `PURGING` state.
The state `PURGING` means that the realm's data is being deleted from the database;
once purging has been started, the realm's information in the database is inconsistent and cannot be restored.
Once the realm's data has been purged, the realm is put into the `PURGED` state. Only realms in the `PURGED` state
can be deleted.

The multi-state approach also ensures that a realm can only be used when the system knows that all necessary
information is present.

**Note**: the realm state machine is not fully implemented yet.
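
The states and transitions described above could be captured roughly as follows; since the state machine
is not fully implemented yet, this is a sketch, not the final code:

```java
// Sketch of the realm lifecycle described above (not the final implementation).
enum RealmStatus {
  CREATED,       // realm-ID reserved, data not yet populated
  INITIALIZING,  // being set up as a fresh, new realm (exclusive with LOADING)
  LOADING,       // importing realm data exported from another Polaris instance
  ACTIVE,        // fully bootstrapped, usable for user requests
  INACTIVE,      // disabled; may go back to ACTIVE or on to PURGING
  PURGING,       // data being deleted; inconsistent and cannot be restored
  PURGED         // all data removed; only now can the realm itself be deleted
}
```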

## `::system::` realm

Polaris NoSQL persistence uses a system realm for node-ID leases and realm management.
Realm-IDs starting with two colons (`::`) are reserved for system use.

### Named pointers in the `::system::` realm

| Named pointer | Meaning         |
|---------------|-----------------|
| `realms`      | Realms, by name |

## "User" realms

### Named pointers in the user realms

| Named pointer     | Meaning                      |
|-------------------|------------------------------|
| `root`            | Pointer to the "root" entity |
| `catalogs`        | Catalogs                     |
| `principals`      | Principals                   |
| `principal-roles` | Principal roles              |
| `grants`          | All grants                   |
| `immediate-tasks` | Immediately scheduled tasks  |
| `policy-mappings` | Policy mappings              |

Per-catalog named pointers, where `%d` refers to the catalog's integer ID:

| Named pointer       | Meaning                                           |
|---------------------|---------------------------------------------------|
| `cat/%d/roles`      | Catalog roles                                     |
| `cat/%d/heads/main` | Catalog content (namespaces, tables, views, etc.) |
| `cat/%d/grants`     | Catalog-related grants (*)                        |

(*) = currently not used; these grants are stored in the realm-wide `grants` named pointer.

## Maintenance Service

**Note**: the maintenance service is not yet in the code base.

The maintenance service is a mechanism to scan the backend database and perform the necessary maintenance operations
as a background service.

The most important maintenance operation is to purge unreferenced objects from the database.
Pluggable "identifiers" are used to "mark" the objects to retain.

The implementation calls all per-realm "identifiers," which then "mark" the named pointers and objects that have to be
retained.
Plugins and/or extensions can provide per-object-type "identifiers," which get called for "identified" objects.
The second phase of the maintenance service scans the whole backend database and purges those objects
that have not been "marked" to be retained.
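
The two phases might be outlined roughly as follows; all names are illustrative assumptions, not the
actual service API:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative mark-and-sweep outline (not the actual service API).
void runMaintenance(List<RealmIdentifier> identifiers, Backend backend) {
  Set<ObjId> marked = new HashSet<>();

  // Phase 1: per-realm "identifiers" mark named pointers and objects to
  // retain; per-object-type identifiers are invoked for identified objects.
  for (RealmIdentifier identifier : identifiers) {
    identifier.markRetainedObjects(marked::add);
  }

  // Phase 2: scan the whole backend database, purge everything unmarked.
  backend.scanAllObjects(obj -> {
    if (!marked.contains(obj.id())) {
      backend.delete(obj.id());
    }
  });
}
```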

Maintenance service invocations require two sets of realm-IDs: the set of realms to retain and the set of realms
to purge.
These sets can be derived using `RealmManagement.list()` and grouping realms by their status.
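
Assuming `RealmManagement.list()` returns realms together with their status, deriving the two sets could
look like this:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative derivation of the retain/purge sets from the realm status.
Set<String> realmsToPurge = new HashSet<>();
Set<String> realmsToRetain = new HashSet<>();
for (Realm realm : realmManagement.list()) {
  if (realm.status() == RealmStatus.PURGING) {
    realmsToPurge.add(realm.id());
  } else {
    realmsToRetain.add(realm.id());
  }
}
```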

### Purging realms

Purging a realm from the backend database can happen in two different ways, depending on the database.
Some databases support deleting one or more realms using bulk deletions.
Other databases do not support this kind of bulk deletion.
Both ways are supported by the maintenance service.

Eventually, purging realms is the responsibility of the [maintenance service](#maintenance-service).