Skip to content

Conversation

@akhaku
Copy link
Contributor

@akhaku akhaku commented Apr 22, 2023

No description provided.

Comment on lines 108 to 115
String.format(
"Too many fields in encoded UDT value, expected %d",
cqlType.getFieldTypes().size()));
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand not failing in this case, but logging a (no-spam) warning seems appropriate. Based on my understanding, this should only happen if a schema change happens on the server to add fields to a UDT, and the client gets an updated UDT in a response before finding out about the new schema.

Are there any other situations when this might occur?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the scenario we ran into, and I can't think of any other scenarios.
Fair, a no-spam warn is a good idea but it doesn't look like the client currently has a no-spam logger. I could copy the one from the server but maybe not the best thing to do in this PR... and a vanilla log warn would be bad here since it would get logged every time we decode a UDT field matching that scenario. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought about this some more. The driver seeing extra fields in a UDT is expected for a short period after a schema change to add fields, since the control connection may find out later than the coordinator or replicas, and EVENTs are sent over the control connection async. It may be worth a warning if this state persists for too long, since that's an indication that a schema change EVENT wasn't delivered, but delivery failures are expected to be tolerated as well.

I think it makes sense to drop this to a DEBUG or TRACE log, rather than using a no-spam logger.

In the more general case of EVENT delivery failing and client metadata going out of sync, there's more we could do to retry EVENTs or support bounded metadata lag, but all of that is out of scope for this ticket.

The only other change I'm considering is whether we should add a hint to refresh client metadata in this situation where the server has indicated that it has a new schema the client isn't yet aware of. Metadata refreshes should be debounced so we don't attempt a refresh on every response decode. Refreshing client metadata would only help in the case when the control instance does know about the schema but EVENT delivery failed, not in cases where the control instance has not yet found out about the new schema but a particular coordinator has, which is more likely the case for larger clusters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been a while but my memory is a little fuzzy as to the exact scenario with which we ran into this, but it had to do with the object mapper caching schemas somewhere too. In any case, the client metadata refresh hint sounds reasonable for when we tackle what you said earlier about EVENT delivery failing, since that would be the logical place for the hint to propagate and affect. For now I'll add a debug log.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we had an existing NoSpamLogger, logging this once at WARN would be pretty informative; Maybe something we can do as a follow on PR later? Wondering if there's anything else in the driver that could benefit from a no spam logger...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spent a while paging in the context for this tonight, another piece of it is that the old schema will usually get cached by the CachingCodecRegistry and you can't overwrite it because the driver thinks it collides, so the driver will have the old UdtCodec until it gets bounced/reinitialized.

@lukasz-antoniak
Copy link
Member

I've had hard time trying to replicate this issue. Attempts to ignore schema updates on control connection did not replicate the problem. In fact, DefaultCodecRegistry will always attempt to create new codec if it does not find corresponding one in the cache (source). Cache key takes into account UDT name, but also names and types of fields.

This triggered my attention to the way application attempts to read UDT from row. Leveraging row.getUdtValue(...) will again register new codec (source). However, trying to read the data with given UdtCodec will indeed expose the issue. Since there is a public API method to read the column using certain codec, I think that this issue is valid.

Below is the integration test that exposes the problem. Shall we add it to the PR?

/**
 * @jira_ticket JAVA-3057
 */
@Test
public void should_decoding_udt_be_backward_compatible() {
    CqlSession session = sessionRule.session();
    session.execute("CREATE TYPE test_type_1 (a text, b int)");
    session.execute("CREATE TABLE test_table_1 (e int primary key, f frozen<test_type_1>)");
    // insert a row using version 1 of the UDT schema
    session.execute("INSERT INTO test_table_1(e, f) VALUES(1, {a: 'a', b: 1})");
    UserDefinedType udt = session
            .getMetadata()
            .getKeyspace(sessionRule.keyspace())
            .flatMap(ks -> ks.getUserDefinedType("test_type_1"))
            .orElseThrow(IllegalStateException::new);
    TypeCodec<?> oldCodec = session.getContext().getCodecRegistry().codecFor(udt);
    // update UDT schema
    session.execute("ALTER TYPE test_type_1 add i text");
    // insert a row using version 2 of the UDT schema
    session.execute("INSERT INTO test_table_1(e, f) VALUES(2, {a: 'b', b: 2, i: 'b'})");
    Row row = session.execute("SELECT f FROM test_table_1 WHERE e = ?", 2).one();
    // Try to read new row with old codec. Using row.getUdtValue() would not cause any issues,
    // because new codec will be automatically registered (using all 3 attributes).
    // If application leverages generic row.get(String, Codec) method, data reading with old codec should
    // be backward-compatible.
    UdtValue value = (UdtValue) row.get("f", oldCodec);
    assertThat(value.getString("a")).isEqualTo("b");
    assertThat(value.getInt("b")).isEqualTo(2);
    assertThatThrownBy(() -> value.getString("i")).hasMessage("i is not a field in this UDT");
}

@akhaku
Copy link
Contributor Author

akhaku commented May 26, 2024

Awesome, thank you for doing that work for me Lukasz! I added it in a new integration test since I couldn't find another existing one that fit. Also rebased.

The case we ran into this issue had to do with the object mapper since it holds on to schema somewhere, and uses that to generate the cql queries for example. But this test is great too!

@tolbertam tolbertam self-requested a review August 27, 2024 00:07
Copy link
Contributor

@tolbertam tolbertam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I think this should be safe. There's a couple issues with the test, but +1 assuming those get fixed.


@Test
public void should_decoding_udt_be_backward_compatible() {
try (CqlSession session = sessionRule.session()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tiny problem here, while the test passes, the CqlSession returned by SessionRule.session() gets used in teardown to drop keyspaces, so shouldn't put it in try (as it would get closed when it exists the block).

java.lang.IllegalStateException: Session is closed

	at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:232)
	at com.datastax.oss.driver.api.testinfra.session.SessionUtils.dropKeyspace(SessionUtils.java:223)
	at com.datastax.oss.driver.api.testinfra.session.SessionRule.after(SessionRule.java:198)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:59)

UdtValue value = cqlType.newValue();
int i = 0;
while (input.hasRemaining()) {
if (i == cqlType.getFieldTypes().size()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels ok given that the only operations you can do on ALTER TYPE are ADD and RENAME,

Given that:

  1. you can't drop a udt field;
  2. adding a field is always additive (e.g. if you have existing fields, any added fields should come after, at least that is what it looks like this does: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/statements/schema/AlterTypeStatement.java#L161-L162)

This means you won't risk reading the wrong field as position-ally everything should be retained after any ALTER TYPE operation.

Also considering that you can still write a udt value with the old udtCodec when the type is altered, it seems reasonable that you'd be able to read partially from it as well.

I suppose there is some inherent risk of someone creating a UdtCodec by hand and potentially reading the wrong interpretation of things; but I think that should probably done at your own peril anyways.

If you pull a UdtCodec from the registry and the type is later changed, the codec would still be usable after this change, this seems of more value than protecting against the risk I outlined.

import org.junit.rules.RuleChain;
import org.junit.rules.TestRule;

@Category(ParallelizableTests.class)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you reformat this with?:

mvn fmt:format

Copy link
Contributor

@tolbertam tolbertam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you 👍

@absurdfarce
Copy link
Contributor

I'm a bit reluctant to remove the safeguard of decoding a UDT whose set of fields is different from what we expect. If we have a mismatch there it seems like we can't say anything with certainty about what's actually in the UDT in that case.

It seems like the best case we can do from a client perspective in this case is something like the following:

  1. Try to read the UDT
  2. If the read fails because of the indicated exception try a schema refresh (since def on server might have changed since schema was originally loaded)
  3. Try to read again with new UDT def. If this fails again you've just got bad data and there's nothing you can do about it.

Am I missing something here?

@akhaku
Copy link
Contributor Author

akhaku commented Sep 16, 2024

That will work in some cases, but a few things that make me feel like the current approach is better:

  1. I'm hesitant to do a schema refresh in the request path of the driver, it will cause that request to look latent from the caller's perspective. The closest existing parallel I can think of is preparing a statement on an UNPREPARED response but that particular case seems warranted.
  2. A refresh in that code path isn't enough, the codec is likely be cached in the CachingCodecRegistry and there's (currently) no way to invalidate that. For example when using the object mapper it validates the schema against your POJOs by default and ends up caching the codec at init time. You can't even re-register your UdtCodec on an error since it hits the collides check and doesn't get re-registered - changing that will be a fair bit more invasive.

whose set of fields is different from what we expect.

  1. The only changes you can make to a UDT are to add fields, and this code path deals specifically with having more fields than we expect. Additionally, given Cassandra is byte buffers all over the place, there's nothing stopping a user from trying to read bytes with the wrong codec except for schema verification, and that's what protects us from the general "wrong UDT definition" case. I think that means we can assume users only make "sane" changes to a UDT while a client app is running (i.e. UDT field additions) and so the "right" thing to do is ignore fields that were added that we don't recognize, to make it easier to roll out a UDT change across the writer apps and reader apps.

Let me know if you want to chat about this over a call, it took me a couple hours to page in enough context about this to be able to respond to you, would be nice to resolve this before it gets paged out again :)

@tolbertam
Copy link
Contributor

tolbertam commented Nov 13, 2024

@absurdfarce , was wondering if you could give this a second look? I see we discussed that we had some thoughts about whether we could provoke a schema refresh in this case, but I think we should strongly consider just merging this as is, because it seems relatively safe and unbreaks applications that hit this case (e.g. as mentioned in #1942)

@netudima
Copy link

netudima commented Nov 13, 2024

Hi, I faced the same issue and came independently to the same kind of patch in https://issues.apache.org/jira/browse/CASSJAVA-4. We started to observe the issue once we switched from 3.x to 4.x java driver. 3.x driver version has the same behaviour as in the patch.
A context: our app is low latency and highly available, so we want to do schema changes without business apps downtime. Also as a way to reduce value columns overhead we use UDTs extensively. Currently, as a WA for the issue we have to use a custom codec, but is not convenient and error-prone to add it everywhere.

I also think that a retry on the request path as an attempt to fix the type caching is not the best option:

  1. it will make the query/several queries at least 2 times slower
  2. it does not guarantee correctness: Cassandra server (at least 4.x versions) does not manage schema in a transactional way, so there is no a guarantee that all server nodes are in agreement about the UDT change yet and a schema can be retrieved from an node with an older version again...

patch by Ammar Khaku; reviewed by Andy Tolbert and Bret McGuire
reference: apache#1635
@akhaku
Copy link
Contributor Author

akhaku commented Nov 23, 2024

Preemptively rebased and updated the commit message to include Bret as a reviewer 🙏 so it should be good to merge once I get the 👍

@absurdfarce
Copy link
Contributor

Apologies, gang, this one has been entirely held up by me. After pondering this for a bit I retract my concerns: I accept @tolbertam's argument that this is relatively safe.

To be clear: my argument wasn't necessarily to perform a schema refresh under the covers... I was thinking of providing a meaningful error back to the user so that they could execute a schema refresh within their application. There would be no "hidden latency" in that case... but it does still require educating users that they should consider taking this action in this case, which is always error-prone. It's certainly more error-prone than what is proposed here.

Regardless, I've kept this PR waiting too long and it legitimately does unblock users... so I'm moving it forward.

Copy link
Contributor

@absurdfarce absurdfarce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do still have some reservations but (a) I don't have anything better right now and (b) this definitely does make life better for people who are stuck on it. So let's do it.

@absurdfarce absurdfarce merged commit 7ca013f into apache:4.x Jan 6, 2025
1 check failed
@absurdfarce
Copy link
Contributor

Thanks everybody for the good debate on approaches here. And special thanks to @akhaku for his patience with me on this one! :)

dkropachev added a commit to scylladb/java-driver that referenced this pull request Mar 19, 2025
* CASSANDRA-19635: Run integration tests with C* 5.x

patch by Lukasz Antoniak; reviewed by Andy Tolbert, and Bret McGuire for CASSANDRA-19635

* CASSANDRA-19635: Configure Jenkins to run integration tests with C* 5.x

patch by Lukasz Antoniak; reviewed by Bret McGuire for CASSANDRA-19635

* update badge URL to org.apache.cassandra/java-driver-core

* Limit calls to Conversions.resolveExecutionProfile

Those repeated calls account for a non-negligible portion of my application
CPU (0.6%) and can definitly be a final field so that it gets resolved only
once per CqlRequestHandler.

patch by Benoit Tellier; reviewed by Andy Tolbert, and Bret McGuire
reference: apache#1623

* autolink JIRA tickets in commit messages

patch by Stefan Miklosovic; reviewed by Michael Semb Wever for CASSANDRA-19854

* Don't return empty routing key when partition key is unbound

DefaultBoundStatement#getRoutingKey has logic to infer the
routing key when no one has explicitly called setRoutingKey
or otherwise set the routing key on the statement.
It however doesn't check for cases where nothing has been
bound yet on the statement.
This causes more problems if the user decides to get a
BoundStatementBuilder from the PreparedStatement, set some
fields on it, and then copy it by constructing new
BoundStatementBuilder objects with the BoundStatement as a
parameter, since the empty ByteBuffer gets copied to all
bound statements, resulting in all requests being targeted
to the same Cassandra node in a token-aware load balancing
policy.

patch by Ammar Khaku; reviewed by Andy Tolbert, and Bret McGuire
reference: apache#1620

* JAVA-3167: CompletableFutures.allSuccessful() may return never completed future

patch by Lukasz Antoniak; reviewed by Andy Tolbert, and Bret McGuire for JAVA-3167

* ninja-fix Various test fixes

* Run integration tests with DSE 6.9.0

patch by Lukasz Antoniak; reviewed by Bret McGuire
reference: apache#1955

* JAVA-3117: Call CcmCustomRule#after if CcmCustomRule#before fails to allow subsequent tests to run

patch by Henry Hughes; reviewed by Alexandre Dutra and Andy Tolbert for JAVA-3117

* JAVA-3149: Support request cancellation in request throttler
patch by Lukasz Antoniak; reviewed by Andy Tolbert and Chris Lohfink for JAVA-3149

* Fix C* 3.0 tests failing on Jenkins
patch by Lukasz Antoniak; reviewed by Bret McGuire
reference: apache#1939

* Reduce lock held duration in ConcurrencyLimitingRequestThrottler

It might take some (small) time for callback handling when the
throttler request proceeds to submission.

Before this change, the throttler proceed request will happen while
holding the lock, preventing other tasks from proceeding when there is
spare capacity and even preventing tasks from enqueuing until the
callback completes.

By tracking the expected outcome, we can perform the callback outside
of the lock. This means that request registration and submission can
proceed even when a long callback is being processed.

patch by Jason Koch; Reviewed by Andy Tolbert and Chris Lohfink for CASSANDRA-19922

* Annotate BatchStatement, Statement, SimpleStatement methods with CheckReturnValue

Since the driver's default implementation is for
BatchStatement and SimpleStatement methods to be immutable,
we should annotate those methods with @CheckReturnValue.
Statement#setNowInSeconds implementations are immutable so
annotate that too.

patch by Ammar Khaku; reviewed by Andy Tolbert and Bret McGuire
reference: apache#1607

* Remove "beta" support for Java17 from docs

patch by Bret McGuire; reviewed by Andy Tolbert and Alexandre Dutra
reference: apache#1962

* Fix uncaught exception during graceful channel shutdown

after exceeding max orphan ids

patch by Christian Aistleitner; reviewed by Andy Tolbert, and Bret McGuire for apache#1938

* Build a public CI for Apache Cassandra Java Driver

 patch by Siyao (Jane) He; reviewed by Mick Semb Wever for CASSANDRA-19832

* CASSANDRA-19932: Allow to define extensions while creating table
patch by Lukasz Antoniak; reviewed by Bret McGuire and Chris Lohfink

* Fix DefaultSslEngineFactory missing null check on close

patch by Abe Ratnofsky; reviewed by Andy Tolbert and Chris Lohfink for CASSANDRA-20001

* Query builder support for NOT CQL syntax

patch by Bret McGuire; reviewed by Bret McGuire and Andy Tolbert for CASSANDRA-19930

* Fix CustomCcmRule to drop `CURRENT` flag no matter what

If super.after() throws an Exception `CURRENT` flag is never dropped
which leads next tests to fail with IllegalStateException("Attempting to use a Ccm rule while another is in use.  This is disallowed")

Patch by Dmitry Kropachev; reviewed by Andy Tolbert and Bret McGuire for JAVA-3117

* JAVA-3051: Memory leak

patch by Jane He; reviewed by Alexandre Dutra and Bret McGuire for JAVA-3051

* Automate latest Cassandra versions when running CI

 patch by Siyao (Jane) He; reviewed by Mick Semb Wever for CASSJAVA-25

* Refactor integration tests to support multiple C* distributions. Test with DataStax HCD 1.0.0

patch by Lukasz Antoniak; reviewed by Bret McGuire
reference: apache#1958

* Fix TableMetadata.describe() when containing a vector column

patch by Stefan Miklosovic; reviewed by Bret McGuire for CASSJAVA-2

* Move Apache Cassandra 5.x off of beta1 and remove some older Apache Cassandra versions.

patch by Bret McGuire; reviewed by Bret McGuire for CASSJAVA-54

* Update link to Jira to be CASSJAVA

Updating the link to Jira.

Previously we had a component in the CASSANDRA Jira project but now we have a project for each driver - in the case of Java, it's CASSJAVA.

Added CASSJAVA to .asf.yaml

patch by Jeremy Hanna; reviewed by Bret McGuire for CASSJAVA-61

* Move DataStax shaded Guava module into Java driver

patch by Lukasz Antoniak; reviewed by Alexandre Dutra and Bret McGuire for CASSJAVA-52

* JAVA-3057 Allow decoding a UDT that has more fields than expected

patch by Ammar Khaku; reviewed by Andy Tolbert and Bret McGuire
reference: apache#1635

* CASSJAVA-55 Remove setting "Host" header for metadata requests.

With some sysprops enabled this will actually be respected which completely borks Astra routing.

patch by Bret McGuire; reviewed by Alexandre Dutra and Bret McGuire for CASSJAVA-55

* JAVA-3118: Add support for vector data type in Schema Builder, QueryBuilder
patch by Jane He; reviewed by Mick Semb Wever and Bret McGuire for JAVA-3118
reference: apache#1931

* Upgrade Guava to 33.3.1-jre

patch by Lukasz Antoniak; reviewed by Alexandre Dutra and Bret McGuire for CASSJAVA-53

* Do not always cleanup Guava shaded module before packaging

* Revert "Do not always cleanup Guava shaded module before packaging"

This reverts commit 5be52ec.

* Conditionally compile shaded Guava module

* JAVA-3143: Extend driver vector support to arbitrary subtypes and fix handling of variable length types (OSS C* 5.0)

patch by Jane He; reviewed by Bret McGuire and João Reis
reference: apache#1952

* JAVA-3168 Copy node info for contact points on initial node refresh only from first match by endpoint

patch by Alex Sasnouskikh; reviewed by Andy Tolbert and Alexandre Dura for JAVA-3168

* JAVA-3055: Prevent PreparedStatement cache to be polluted if a request is cancelled.

There was a critical issue when the external code cancels a request, indeed the cached CompletableFuture will then always throw a CancellationException.
This may happens, for example, when used by reactive like Mono.zip or Mono.firstWithValue.

patch by Luc Boutier; reviewed by Alexandre Dutra and Bret McGuire
reference: apache#1757

* Expose a decorator for CqlPrepareAsyncProcessor cache rather than the ability to specify an
arbitrary cache from scratch.

Also bringing tests from apache#2003 forward
with a few minor changes due to this implementation

patch by Bret McGuire; reviewed by Bret McGuire and Andy Tolbert
reference: apache#2008

* ninja-fix Using shaded Guava classes for import in order to make OSGi class paths happy.

Major hat tip to Dmitry Konstantinov for the find here!

* Changelog updates for 4.19.0

* [maven-release-plugin] prepare release 4.19.0

* [maven-release-plugin] prepare for next development iteration

* Specify maven-clean-plugin version

Sets the version to 3.4.1 in parent pom.
Having it unspecified causes the following warning:
```
[WARNING] Some problems were encountered while building the effective model for com.scylladb:java-driver-guava-shaded:jar:4.19.0.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-clean-plugin is missing. @ line 97, column 15
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
```

* Install guava-shaded before running core's compile in CI

Without this module available "Build" and "Unit tests" job fail with
"package <x> does not exist" or "cannot find symbol" pointing to
`[...].shaded.guava.[...]` packages.

* Remove exception catch in `prepared_stmt_metadata_update_loopholes_test`

Since incorporating "JAVA-3057 Allow decoding a UDT that has more fields than
expected" the asuumptions of removed check are no longer valid.

* Switch shaded guava's groupId in osgi-tests

Switches from `org.apache.cassandra` to `com.scylladb`
in `BundleOptions#commonBundles()`.

* Typo: increase line's loglevel in CcmBridge

---------

Co-authored-by: Lukasz Antoniak <[email protected]>
Co-authored-by: Brad Schoening <[email protected]>
Co-authored-by: Benoit Tellier <[email protected]>
Co-authored-by: Stefan Miklosovic <[email protected]>
Co-authored-by: Ammar Khaku <[email protected]>
Co-authored-by: absurdfarce <[email protected]>
Co-authored-by: Henry Hughes <[email protected]>
Co-authored-by: Jason Koch <[email protected]>
Co-authored-by: Christian Aistleitner <[email protected]>
Co-authored-by: janehe <[email protected]>
Co-authored-by: Abe Ratnofsky <[email protected]>
Co-authored-by: absurdfarce <[email protected]>
Co-authored-by: Dmitry Kropachev <[email protected]>
Co-authored-by: janehe <[email protected]>
Co-authored-by: Jeremy Hanna <[email protected]>
Co-authored-by: SiyaoIsHiding <[email protected]>
Co-authored-by: Alex Sasnouskikh <[email protected]>
Co-authored-by: Luc Boutier <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants