Async multi packet fixes #3534
Conversation
@ErikEJ this might reduce memory usage for string reads. It might be worth benching the artifacts if the CI runs green.
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
I've added an additional fix which is the same as the 0-length-left-in-terminator case and which occurs on the varchar (not nvarchar) read path.
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
@dotnet/sqlclientdevteam can I get a CI run on this please. I've added a new commit which forces ProcessSni mode to compatibility mode (and by extension disables async-continue mode) and adds a fix for the pending read counter imbalance that we discussed and that @rhuijben has been assisting with tracking down today. This is a possible stable current codebase state to evaluate.
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
I've aligned the appcontext switch test with the new defaults. Can I get another run please @dotnet/sqlclientdevteam.
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
Asking for some clarity on `>> 1` versus `/ 2`.
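For context on why a reviewer might raise the `>> 1` versus `/ 2` question, a small illustration (not code from this PR): for non-negative values the two are identical, so for a length that can never be negative the shift is purely a style or micro-optimisation choice; they only diverge for negative operands.

```csharp
using System;

int len = 503;
Console.WriteLine(len >> 1);  // 251
Console.WriteLine(len / 2);   // 251

// Divergence only appears for negative operands:
int neg = -503;
Console.WriteLine(neg >> 1);  // -252 (arithmetic shift rounds toward negative infinity)
Console.WriteLine(neg / 2);   // -251 (integer division truncates toward zero)
```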
src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParser.cs
@Wraith2 when I run the testcase from the other issue against this branch I get a failure in DEBUG mode. In release mode the test passes.
Multitasking here, can you link me to the exact repro you're talking about?
The testcase from the other issue. I'm currently trying to get things reproduced against a docker instance of SQL Server 2019 so we can look at the same thing (and maybe even test this on GitHub Actions, like I do in the RepoDB project).
This case rhuijben@1964bc1 fails for me on this docker setup. Too bad it is not the error I'm seeing myself, but it is still a valid testcase. I'm trying to extend this to include my case. It fails on the first (smallest) packet size of 512.
`Debug.Assert(TdsEnums.HEADER_LEN + Packet.GetDataLengthFromHeader(buffer) == read, "partially read packets cannot be appended to the snapshot");` fires with read=512 while 503 + 8 = 511, so there is a mismatch. It looks like the first byte of the next packet is already in the buffer here.
That assert will fire periodically when packet multiplexing is disabled. We should add the context switch to the assertion. That might be correct behaviour: I saw something similar while looking at the multi-part XML reads with a weird packet size. If the packet status does not include the last-packet bit and the required length is less than the total packet, then as long as the transferred data amount is the same as the buffer size it's technically correct, I think. I'm referring to these as padded packets. I hadn't seen them before two weeks ago, but the spec doesn't preclude them. When I saw them, the remaining data space in the packet buffer was filled with FF. This is part of the reason that I added the DumpPackets and DumpInBuff functions to my debug branch.
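To make the padded-packet idea concrete, here is a minimal sketch (not SqlClient's actual code; the helper names and the relaxed condition are assumptions) of a header check that tolerates a non-final packet whose logical length is shorter than the bytes transferred. Per the TDS header layout, byte 1 is the status (0x01 = end of message) and bytes 2–3 hold the big-endian total packet length, header included.

```csharp
using System;

// Payload length = header-declared total length minus the 8-byte TDS header.
static int GetDataLengthFromHeader(byte[] buffer) =>
    ((buffer[2] << 8) | buffer[3]) - 8;

// Hypothetical relaxed check: an exact match is always fine; a "padded"
// non-final packet is tolerated when the transfer filled the negotiated
// packet size and the logical length fits inside it.
static bool IsAcceptableAppend(byte[] buffer, int bytesRead, int negotiatedPacketSize)
{
    int logicalLength = 8 + GetDataLengthFromHeader(buffer);
    bool lastPacket = (buffer[1] & 0x01) != 0;
    return logicalLength == bytesRead
        || (!lastPacket && bytesRead == negotiatedPacketSize && logicalLength <= bytesRead);
}
```

With the 503 + 8 = 511 header from the assert above and a 512-byte read, the exact-match form fires but the relaxed form accepts the packet.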
With the packet size configured as 512 I see 511-byte packets (which fail these asserts), but also one really large packet (>= 60 KB), so I'm not sure the debug assert does the right thing. It looks like the demultiplexer handles these cases just fine. With packet code like this you also always have to handle short reads caused by network security layers and TCP segmentation. There are standard proxies for that last case, so you can always get small (or large) reads from the network layer; the .NET core project uses fuzzing with that to catch HTTP errors, as do a lot of other libraries. These asserts look like they are on the wrong layer, because from the network you can receive far fewer or far more bytes than one TDS packet: fewer when processing really fast, and more when the network has already delivered more data than a single packet. That can also happen on slow networks when one packet is lost and re-delivered while others are already in the queue.
With the multiplexer active, every packet which is appended to the snapshot is required to have a correct header and be a complete packet. If you look at the locations where packets are assigned to _inBuff from the snapshot, you'll see that it gets the status and data length from the packet, which requires a correct packet header. With the multiplexer inactive, which is ProcessSni compatibility mode, those requirements no longer hold. So the correct thing to do is to change the asserts to take the active mode into account.
In terms of packet lengths I refer to the TDS definition of a packet, which is that all packets are of the negotiated packet size unless they are the last packet in a stream. Even though in network terms it's possible to receive any number of bytes up to the buffer size passed to the function, we will generally find full packets unless the server is quite slow. Unfortunately the slow case is very hard to replicate reliably, so there have been holes in the code. The demultiplexer currently does not handle padded packets (physical length > logical length) correctly. I need to get some example packets with padding and then write some new test infra for the multiplexer to make sure that CurrentLength and RequiredLength are used correctly.
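The short-read point above can be illustrated with a toy reassembler (names are illustrative, not SqlClient's): because the network can deliver any number of bytes, whole TDS packets have to be rebuilt by buffering until the header-declared length is available.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

sealed class PacketAccumulator
{
    private readonly List<byte> _pending = new List<byte>();

    // Feed arbitrary-sized network chunks in; complete packets come out.
    public IEnumerable<byte[]> Feed(byte[] chunk)
    {
        _pending.AddRange(chunk);
        while (_pending.Count >= 8) // a full TDS header is available
        {
            // Header bytes 2–3: big-endian total packet length (header included).
            int required = (_pending[2] << 8) | _pending[3];
            if (_pending.Count < required)
                yield break;               // short read: wait for more bytes
            byte[] packet = _pending.Take(required).ToArray();
            _pending.RemoveRange(0, required);
            yield return packet;
        }
    }
}
```

A layer like this is where length checks naturally belong; above it, every consumer can assume complete packets.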
On https://github.com/rhuijben/SqlClient/tree/test/repro-1 (your branch + 1 commit) I now have the tests running against a docker with a clean SQL Server 2019 container (the create script is in that commit).
(Don't look at the timings. My laptop is in silent mode, which is very slow with today's weather.)
Token packet, last packet of the stream. No idea what the content is supposed to represent. This would silently fail in release mode because all that diagnostic output is inside a #if DEBUG block.
Not sure if I read this correctly, but it looks like:
src/Microsoft.Data.SqlClient/tests/ManualTests/SQL/DataReaderTest/DataReaderTest.cs
Yes, that was my interpretation. Since I've no idea what the bytes that have been read were part of, I have no way to tell whether the extra bytes are meaningful or not. It seems likely that they should be. If you've got my additional logging enabled from the -dev branch then you could look at the output of DumpLog() and see what operation consumed those bytes.
/azp run
FYI, we're having issues with macOS tests at the moment, separate from this PR.
Any specific thing blocking merging?
Waiting on #3597 to fix our macOS CI issues. Once that merges, you can re-merge main and we can kick off another CI run.
Fix 0 length read at the start of a packet in a plp stream returning 0 when continuing. Handle char array sizing better. Change the existing test to use multiple packet sizes.
The PR to fix the CI has merged and this PR is rebased. Please run the CI.
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
So to take it to 11, what do I need to do with the latest main artifact? (Planning to share the simplest possible approach in a blog post) |
This makes main stable. It doesn't contain the speed improvements because this code can never enter an async-continue state. If you want to use this PR artifact for the sake of being on main and getting stability then that's fine, but that seems like a small use case given that we have 6.1.1, which is stable. I have some work in a branch which will add the speed improvements back in; I'll get started on that this evening.
@Wraith2 will that be opt-in via a switch then?
Yes.
Fixes #3519
Fixes #3572
Fix 1:
When reading multi-packet strings it is possible for multiple such strings to occur in a single row. When reading asynchronously, a snapshot is used which contains a linked list of packets. The current codebase keeps a cleared spare linked list node when the snapshot is cleared. The logic to clear the spare packet was faulty and did not clear all the fields, leaving the data length in the node. In specific circumstances it is possible to re-use the spare linked list node containing an old data value as the first packet in a new linked list of packets. When this happens in a read which reaches the continue stage (3 or more packets), the size calculation is incorrect and various errors can occur.
The spare packet functionality is not very useful because it can store only a single node, and it doesn't retain the byte[] buffer, so the memory saving is tiny. I have removed it and changed the linked list node fields to be readonly. This resolves the bug.
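A toy reduction of the bug described above (the type and member names here are hypothetical, not the real SqlClient internals): a recycled node whose reset forgets one field silently injects a stale length into the next snapshot's size calculation.

```csharp
sealed class PacketNode
{
    public byte[] Buffer;
    public int DataLength;
    public PacketNode Next;

    // Buggy recycle: Buffer and Next are cleared but DataLength survives,
    // so a reused node contributes a stale length to the next snapshot.
    public void ClearBuggy() { Buffer = null; Next = null; }

    // A full reset avoids the stale value; the PR goes further and removes
    // the spare-node recycling entirely, making the node fields readonly.
    public void ClearFixed() { Buffer = null; Next = null; DataLength = 0; }
}
```

Removing the recycling altogether is the stronger fix because it makes the stale-field class of bug unrepresentable.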
Fix 2:
When reading a multi-packet string the plp chunks are read from each packet and the end is signalled by a terminator. It is possible for the data to align such that the contents of a string complete exactly at the end of a packet and the terminator is in the next packet. In this case some pre-existing logic checks for 0 chars remaining and exits early.
This logic needed to be updated so that when continuing it returns the entire existing length read and not a 0 value.
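The shape of that fix, as a hedged sketch (this is not the exact SqlClient method; the function and parameter names are assumptions): when zero chars remain because the terminator landed in the next packet, a continuing read must report the chars accumulated so far instead of 0.

```csharp
using System;

static int FinishChunk(bool isContinuing, int charsAlreadyRead, int charsLeftInChunk)
{
    if (charsLeftInChunk == 0)
    {
        // Pre-fix behaviour returned 0 unconditionally here, discarding
        // everything already read on the continue path.
        return isContinuing ? charsAlreadyRead : 0;
    }
    return charsAlreadyRead + charsLeftInChunk;
}
```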
Fix 3:
While debugging the first two issues the buffer sizes and calculations were confusing me. I eventually realised that the code was directly using _longlenleft, which is measured in bytes, to size a char array, meaning that all char arrays were twice as long as needed. I have updated the code to handle that and use smaller, appropriately sized arrays.
I have updated the existing test to iterate from 512 (the minimum packet size) to 2048 bytes in size. This can cause lots of interesting alignments in the data, testing the paths through the string reading code more effectively. The range could be increased, but I considered that the runtime needed to be low enough to not time out CI runs, and most higher packet sizes will behave similarly to lower-sized runs due to factoring.
Thanks to @erenes and @Suchiman for their help finding the reproduction that worked on my machine; without that I would have been unable to fix anything.
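The sizing point in a nutshell (a sketch; only the `_longlenleft` name comes from the PR text): nvarchar data is UTF-16, two bytes per char, so sizing a char array from the byte count doubles the allocation, and halving the byte count (the `>> 1` discussed in review) gives the right size.

```csharp
using System;

ulong longLenLeftBytes = 4096;                        // bytes left in the PLP stream
char[] oversized  = new char[longLenLeftBytes];       // 4096 chars (8 KB): twice what's needed
char[] rightSized = new char[longLenLeftBytes >> 1];  // 2048 chars (4 KB): one char per UTF-16 code unit
Console.WriteLine(oversized.Length);   // 4096
Console.WriteLine(rightSized.Length);  // 2048
```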
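Why sweeping every packet size matters can be shown with a little arithmetic (illustrative only, not the test itself): each packet size puts the same payload on different packet boundaries, so the fill level of the final packet, and hence the code path exercised, varies widely across the range.

```csharp
using System;
using System.Collections.Generic;

int payloadBytes = 100_000;                // an arbitrary example payload
var finalPacketFill = new HashSet<int>();
for (int packetSize = 512; packetSize <= 2048; packetSize++)
{
    int dataPerPacket = packetSize - 8;    // 8-byte TDS header per packet
    finalPacketFill.Add(payloadBytes % dataPerPacket);
}
// Many distinct final-packet fill levels are covered by the sweep.
Console.WriteLine(finalPacketFill.Count);
</csharp>
```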
@dotnet/sqlclientdevteam can I get a CI run please.
/cc @Jakimar
Edits:
Fixes #3572.
When reading data sequentially and calling GetFieldValueAsync it is possible for the call to TryReadColumnInternal to fail because there is not enough data. When this happens an async read has been started, so we should wait for it to return, but the code does not do this and instead falls through to attempting a call to TryReadColumnValue. Under a debugger it's clear that there is a problem because you hit an assert, but at runtime this was silently failing, which could cause data loss as an async read collided with a sync read. The fix is to prevent the fallthrough and queue the wait for the next packet.
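A toy reduction of that fix (every name here is hypothetical; the real methods live in SqlDataReader/TdsParser): when the buffered data is insufficient, an async packet read is already in flight, so the code must await the next packet rather than fall through into a synchronous read that races with it.

```csharp
using System;
using System.Threading.Tasks;

sealed class ToyReader
{
    private int _bufferedBytes;
    private readonly int _neededBytes;

    public ToyReader(int buffered, int needed)
    {
        _bufferedBytes = buffered;
        _neededBytes = needed;
    }

    private bool TryReadColumnInternal(out string value)
    {
        if (_bufferedBytes >= _neededBytes) { value = "row-value"; return true; }
        value = string.Empty;
        return false; // not enough data: an async read is now pending
    }

    private async Task WaitForNextPacketAsync()
    {
        await Task.Yield();            // stand-in for the network completing
        _bufferedBytes = _neededBytes; // the next packet delivers the rest
    }

    public async Task<string> GetFieldValueAsync()
    {
        if (!TryReadColumnInternal(out string value))
        {
            // The bug: falling through to a sync read here collided with
            // the pending async read. The fix queues the wait instead.
            await WaitForNextPacketAsync();
            TryReadColumnInternal(out value);
        }
        return value;
    }
}
```

The essential property is that only one reader (sync or async) ever touches the buffer at a time; queuing the wait preserves that invariant.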