
Conversation

francisjervis

This PR adds callbacks on_socket_connect and on_socket_disconnect to enable (among other things) state restoration in the absence of authentication/built-in data persistence and handling of connectivity issues with the client.

The current on_chat_restore handler depends on the Chainlit data layer, and on_chat_start/on_chat_end are ambiguous with respect to their handling of (1) WS status changes and (2) user intent - for instance, on_chat_end is called if the server times out the connection (#2198).

on_chat_resume does not appear to be called (even if the data layer is enabled) after a WS drop/reconnect cycle. on_chat_start is not called if the WS connection is dropped and re-established, only on the first connection for a session. Note that the current documentation for on_chat_start describes it as a way "to react to the user websocket connection event", which is ambiguous/misleading.

In at least some contexts this leads to a silent failure state (on both ends) where user inputs are not processed, there is no "disconnected" message shown to the user, and there is no backend signal. This is particularly problematic when the WS connection drops after an AskUserMessage was sent.

While there is certainly some overlap with the existing chat lifecycle callbacks, adding unambiguous/unopinionated network state change handling both avoids breaking changes (ie renaming/refactoring the chat start/end callbacks) and adds the ability to use "hand-rolled" state persistence approaches which do not depend on the Chainlit auth/data layer features.
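In usage terms, the proposed hooks enable something like the following sketch. This is not Chainlit code - a plain registry stands in for the framework's wiring, and `save_state`/`restore_state` are hypothetical handlers - but it shows how unopinionated connect/disconnect signals support hand-rolled persistence keyed by session id:

```python
# Stand-in sketch of the proposed hooks (NOT the Chainlit API; a plain
# registry simulates how on_socket_connect / on_socket_disconnect would
# let an app hand-roll state persistence keyed by session id).

_callbacks = {}

def on_socket_connect(fn):
    _callbacks["connect"] = fn
    return fn

def on_socket_disconnect(fn):
    _callbacks["disconnect"] = fn
    return fn

# Hand-rolled store -- no Chainlit auth/data layer involved.
STATE = {}

@on_socket_disconnect
def save_state(session_id, state):
    # Snapshot whatever the chat needs to survive a WS drop.
    STATE[session_id] = state

@on_socket_connect
def restore_state(session_id):
    # Re-hydrate after reconnect; empty dict for a brand-new session.
    return STATE.get(session_id, {})

# Simulated drop/reconnect cycle:
_callbacks["disconnect"]("sess-1", {"pending_ask": "consent"})
restored = _callbacks["connect"]("sess-1")
print(restored)  # {'pending_ask': 'consent'}
```

The point of the sketch is the separation of concerns: the framework reports network state changes, and the application decides what "resume" means.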

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. backend Pertains to the Python backend. labels Sep 2, 2025
@hayescode
Contributor

@francisjervis thank you for your contribution! My preference would be to limit the number of callbacks and duplication, as I think this could introduce even more ambiguity, especially for newer developers. Would you be able to fire the existing callbacks on a new WebSocket connection instead, to fix the existing callbacks? I agree with you that this should be covered and there are unhandled edge cases, but I worry about increasing complexity with this approach, especially as we're now community managed and looking for clean scalability and maintainability.

@francisjervis
Author

@francisjervis thank you for your contribution! My preference would be to limit the number of callbacks and duplication, as I think this could introduce even more ambiguity, especially for newer developers.

The current callbacks are, frankly, quite poorly named (and confusingly documented). They do not reflect the actual chat start and end - e.g., it is not safe to include clean-up/finalization logic in on_chat_end because it is called whenever the socket drops. I suppose they were designed (1) with the intention that the LiteralAI hosted data layer would be used and (2) without adequate testing, which would have revealed a major unhandled not-really-edge case: intermittent dropping of the WS connection during anything other than a "vanilla" cl.Message exchange.

To expand on that last point: if the connection is dropped/resumed & the last server/assistant action was sending a cl.Message, and the user sends a regular message after that, the user session does appear to be restored correctly. If the last server/assistant message was an AskUserMessage, the "answer" triggers the on_message callback instead of being handled as the response to the AskUserMessage. Theoretically one could try to recover the state there...but...

If the last server/assistant action was an AskActionMessage or AskFileMessage (tested; I presume the same applies to custom actions), the user sees a "no callback" error message and on_message does not fire.

Would you be able to fire the existing callbacks on a new WebSocket connection instead, to fix the existing callbacks?

No, that would be a breaking change. Moreover it would not resolve ambiguity/naming issues. Quite the opposite.

Obviously the best solution would be for the original Ask* callback to seamlessly pick up handling the response but that is probably intractable.

The second best option - again, breaking existing implementations, probably for everyone - would be to remove the current chat start/end/resume callbacks completely and replace them with unambiguous network state signals, leaving the developer to handle the logic for new/existing chat sessions (i.e., checking whether the X-Chainlit-Session-Id header matches an existing one).

What I am proposing would at least allow developers to detect connection drops and re-send the last Ask* message so the response can be handled appropriately...
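That re-send pattern reduces to roughly the following self-contained sketch. `AskUserMessage` here is a stub standing in for cl.AskUserMessage, and `pending_ask` is a hypothetical per-session slot, not a Chainlit internal:

```python
# Sketch of "detect the drop, re-send the last Ask* message". The class
# below is a stub standing in for cl.AskUserMessage.

class AskUserMessage:
    def __init__(self, content):
        self.content = content
        self.send_count = 0

    def send(self):
        self.send_count += 1
        return self

pending_ask = None  # in a real app this would live in the user session

def ask_user(content):
    global pending_ask
    pending_ask = AskUserMessage(content).send()
    return pending_ask

def on_socket_connect():
    # A question was in flight when the socket dropped: re-issue it so the
    # answer is routed to the Ask* handler instead of falling into
    # on_message (or producing a "no callback" error).
    if pending_ask is not None:
        pending_ask.send()

msg = ask_user("Do you consent? (yes/no)")
on_socket_connect()  # simulate the reconnect
print(msg.send_count)  # 2
```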

I agree with you that this should be covered and there are unhandled edge cases, but I worry about increasing complexity with this approach, especially as we're now community managed and looking for clean scalability and maintainability.

This is a scalability issue because you cannot run a Chainlit application that uses the Ask* message functionality in production without setting the server request timeout to its maximum value to avoid premature, silent chat termination. Even then, the chat session length is limited to the server request timeout value. When that is reached, the chat silently ends!

To reiterate: at present, if the application uses Ask* messages, Chainlit demands that the end user has a stable network connection, does not switch away, even momentarily, from the chat window (at least on mobile, i.e., 70% of web traffic), and finishes their chat before the server request times out (on GCR, 1 hour). This is not really an edge case - it is the difference between being OK for a demo, where you can avoid actions that will drop the connection, and usable "in the wild."

Also note that installing the package from a fork does not work (#2467) - so that is not an option.

@hayescode
Contributor

@francisjervis your description of the socket connection drops not being handled by the existing callbacks makes sense based on their definitions, but I have not experienced this at all and I have hundreds of users with auth/persistence/etc. Based on your description users are leaving/switching away and over an hour later (after maximizing timeout) they're coming back and it fails because the session has expired? That does seem like an edge case to me. Waiting over an hour between interactions in the same thread is not typical LLM usage and even then chat resume is a thing that enables this. I'd like to see some reproducible code demonstrating this issue and the circumstances it happens in to get a better understanding of why new callbacks are required or if there's another way.

@asvishnyakov
Member

asvishnyakov commented Sep 2, 2025

No, that would be a breaking change.

But then moving on_chat_end after session and user session cleanup IS a breaking change too, because all code in users' on_chat_end callbacks which use session or user session will break after this PR is merged

Thank you for #2467. I think it should be fixed ASAP, and it can be done quite easily. I'll look into it today

@asvishnyakov
Member

@hayescode I experienced WebSocket connection drops when streaming large AI responses, and we also have user connections lasting more than 2 hours - people just leave the tab open in their browsers for days, if not weeks or months, and browsers only put it into sleep mode after some time. Based on my experience with WebSockets in other technologies (we used them in GraphQL subscriptions), connection drops happen quite often

@francisjervis
Author

francisjervis commented Sep 2, 2025

@francisjervis your description of the socket connection drops not being handled by the existing callbacks makes sense based on their definitions, but I have not experienced this at all and I have hundreds of users with auth/persistence/etc.

Are you using AskUserMessage or AskFileMessage or response-sending custom elements? If not, no, you wouldn't.

Based on your description users are leaving/switching away and over an hour later (after maximizing timeout) they're coming back and it fails because the session has expired?

No, that is not what I mean. Mobile browsers - particularly Chrome - drop WS connections if the user minimizes the app/switches to another app, etc. So someone can be using a Chainlit-based web app, momentarily (and I literally mean for any non-zero amount of time) lock/reopen their phone or check a message in another app, and this bug occurs.

It also occurs if they are actively using the chat for more than the server request timeout period, which, in the case of Google Cloud Run, cannot be set to more than 3600 seconds. No leaving/switching away needed.

Setting the timeout to this long a period as a work-around is really not a good solution at scale.

That does seem like an edge case to me. Waiting over an hour between interactions in the same thread is not typical LLM usage and even then chat resume is a thing that enables this. I'd like to see some reproducible code demonstrating this issue and the circumstances it happens in to get a better understanding of why new callbacks are required or if there's another way.

As I said, that isn't the issue. Though that would still not be great.

By the way, I set up a test project with auth/data persistence, and on_chat_resume is absolutely not called on socket connection restoration. It's only called when a user re-opens a chat from their history. This is what I mean about the confusing naming! on_chat_restore would be more descriptive.

In my use case - which is a postdoc research project - I am using CL to run an interviewing agent. The first message is a "consent screen" with a yes/no AskActionMessage - recruiting participants (e.g., with a QR-code link to the interviewer site) is completely broken, because most people will not immediately start completing the interview (never mind stay on the page for the duration thereof). When they return to the page it is jammed - the response payload cannot be sent to continue, and there is no informative message. The interview continues with a series of AskUserMessages (and sometimes Action/File), so any socket drop kills the interview flow.

I also kinda disagree that this is not "typical LLM usage" lol, I have Claude etc chats open that I come back to after days. I do not think people are somehow OK with this as just what happens with AI chat apps because I cannot think of any other services which are affected by this.

Try running this file. To demonstrate the issue:

1. Connect using ngrok and open the URL on your phone. While the yes/no question is displayed but before responding, lock, then immediately re-open your phone.
2. Try it again (after reloading, of course, because the original chat is now borked): answer the yes/no question, then switch away while the "what is your name" question is displayed. Try answering the question. Watch the logs.
3. Try it on desktop: use the dev tools connection control to go "offline" then reconnect, simulating network loss at the same stages (this will not work if you are on localhost:8000 - you have to use ngrok to simulate), so you can once again watch the logs.

app.py

@francisjervis
Author

francisjervis commented Sep 2, 2025

user connections lasting more than 2 hours

Are you using Ask* messages? This does not occur with vanilla cl.Message based apps. I suspect you would see connections dropped/reestablished a lot if you were watching the connection status (which, of course, you can't do without the callbacks in this PR). I'd pretty much guarantee you are not maintaining WS connectivity for the whole period.

browsers only put it into sleep mode after some time

Chrome for iOS drops it basically immediately. I was surprised too...

But then moving on_chat_end after session and user session cleanup IS a breaking change too, because all code in users' on_chat_end callbacks which use session or user session will break after this PR is merged

I'm not quite sure who this was addressed to. My point was that it would be worse to change the opinionated "start" and "end" callback logic for precisely this reason. The docs should be (and, per my matching PR, are being) updated to clarify exactly when these events will (not) fire. This PR does not modify the existing callbacks at all.

@hayescode
Contributor

@francisjervis so this only occurs when using AskUserMessage or AskFileMessage? If so, why can't those contain socket checks with resend logic? Or, put another way, why is adding 2 new callbacks the only way to address this?

I also kinda disagree that this is not "typical LLM usage" lol, I have Claude etc chats open that I come back to after days. I do not think people are somehow OK with this as just what happens with AI chat apps because I cannot think of any other services which are affected by this.

You mention you aren't using authentication/data persistence. This is why (or one of the reasons why) you're seeing this behavior. Of course people come back to chats, which is why we persist them and use the on_chat_resume I mentioned to resume/continue the conversation. This is common; storing conversation state in memory and polling the socket connection is not so common. If you don't want to add data persistence, I wonder if you could use cl.context.session inside of on_chat_start/on_message to accomplish your goal.
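For concreteness, the cl.context.session idea amounts to something like this sketch. A plain dict stands in for Chainlit's session objects (this is not the real API), and it only helps when the session id survives the reconnect within one server process:

```python
# Sketch: key hand-rolled state on the stable session id (cl.context.session.id
# in a real app) so it survives a socket reconnect, which keeps the id but
# re-fires no chat lifecycle callback. A plain dict stands in for Chainlit's
# session objects here.

SESSION_STATE = {}

def state_for(session_id):
    # Same dict before and after a reconnect, because the id is stable.
    return SESSION_STATE.setdefault(session_id, {"history": []})

def on_message(session_id, content):
    state = state_for(session_id)
    state["history"].append(content)
    return state

on_message("sess-1", "hello")
# ...WS drops and reconnects; no callback fires, but the id is unchanged...
state = on_message("sess-1", "are you there?")
print(state["history"])  # ['hello', 'are you there?']
```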

Please understand we're aiming to simplify and reduce the complexity of Chainlit now that there are no full-time devs. I understand this PR may solve your issue, but we have to think of all current and future developers.

@asvishnyakov asvishnyakov reopened this Sep 2, 2025
@asvishnyakov
Member

@francisjervis Sorry, accidentally clicked on close instead of comment :)

@hayescode I use persistence, but I also use LangGraph with InMemoryStore even in production, because Chainlit’s native data persistence requires less effort to implement and integrates directly with it. I need to call thread deletion in LangChain’s InMemoryStore in on_chat_end to prevent memory leaks, which leads to additional logic for restoring it again from Chainlit’s persistence after a WebSocket connection drop in on_message, since on_chat_end is triggered both on connection drops and real chat ends (i.e., when the user closes the chat window). The issue is that on_chat_start or on_chat_resume aren’t called on reconnection, so this forces unnecessary workarounds, unclear logic, and code complications in my project.
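That workaround reduces to roughly the following shape. The two dicts stand in for LangGraph's InMemoryStore and Chainlit's data-layer persistence (the names are illustrative, not real APIs):

```python
# Sketch of the workaround: on_chat_end must persist-and-wipe because it
# fires on both socket drops and real chat ends; on_message then lazily
# restores, since no start/resume callback runs on reconnect. The dicts
# stand in for LangGraph's InMemoryStore and Chainlit's data layer.

in_memory_store = {}    # stand-in for LangGraph InMemoryStore
persisted_threads = {}  # stand-in for Chainlit data-layer persistence

def on_chat_end(thread_id):
    # Can't tell a WS drop from a real end, so snapshot before deleting
    # (the delete is what prevents the memory leak).
    persisted_threads[thread_id] = in_memory_store.pop(thread_id, [])

def on_message(thread_id, content):
    if thread_id not in in_memory_store:
        # Reconnect path: rebuild in-memory state from persistence,
        # because no start/resume callback ran.
        in_memory_store[thread_id] = persisted_threads.get(thread_id, [])
    in_memory_store[thread_id].append(content)
    return in_memory_store[thread_id]

on_message("t1", "hi")
on_chat_end("t1")            # a socket drop, not a real chat end
history = on_message("t1", "back again")
print(history)  # ['hi', 'back again']
```

Separate connection and thread lifecycle callbacks would collapse the restore branch into a single reconnect handler.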

If we had a clear workflow with separate callbacks for real thread lifecycle events and connection lifecycle events, this problem wouldn’t exist.

My only concern is breaking changes, so I’d prefer that we first discuss and build a diagram/workflow, define the callbacks we really need, and then proceed with deprecating on_chat_start, on_chat_resume and on_chat_end in favor of introducing:

  • on_thread_start
  • on_thread_resume
  • on_thread_end
  • on_connect
  • on_disconnect

and maybe making the first two aliases for the existing callbacks.

I don’t think these callbacks would overcomplicate Chainlit, since they’re simple both for Chainlit developers and end-user developers and just need to be well documented. I’ve already started working on describing all callbacks in a diagram here, but haven’t completed it yet - this PR could be a good push to finalize that work.

@francisjervis
Author

So the thing is, @hayescode, that isn't actually what data persistence/chat "resume" does. See this video (app is the file posted earlier):

Screen.Recording.2025-09-02.at.9.53.30.PM.mov

That is without auth/data persistence. As you can see, even a very brief disconnection breaks the AskActionMessage in an ugly way. Note the repeated on_chat_end calls and lack of any server side event on reconnection.

Now here's the same test with auth enabled:

cl.auth.vid.mov

As you can see, the behavior on disconnect/reconnect is exactly the same. The error on clicking the now-defunct AskActionMessage Action is the same. That is why I insist that this is not the solution - I have tried it!

You mention you aren't using authentication/data persistence. This is why (or one of the reasons why) you're seeing this behavior. Of course people come back to chats, which is why we persist them and use the on_chat_resume I mentioned to resume/continue the conversation.

Respectfully, this is a misunderstanding of how auth/persistence work in practice. As you can see in the second video, on_chat_resume does not fire when the socket is reconnected. Only when the user reloads a chat from the history sidebar. Maybe in some other circumstances too, but not on socket reconnection.

What is not common (but probably should be) is the "questioner" design pattern - when an agent is designed to elicit information from the user proactively. Compare to the "assistant" pattern, where the user initiates the conversation and the AI responds. CL, LangChain, and honestly most other LLM scaffolding projects are more or less obnoxiously opinionated in this regard - LangGraph, I believe, can handle it, but I do not recommend it. They are all built with "ChatGPT clone" as the model, which is unfortunate to say the least.

@francisjervis
Author

@francisjervis Sorry, accidentally clicked on close instead of comment :)

haha no worries xD

@hayescode I use persistence, but I also use LangGraph with InMemoryStore even in production, because Chainlit’s native data persistence requires less effort to implement and integrates directly with it. I need to call thread deletion in LangChain’s InMemoryStore in on_chat_end to prevent memory leaks, which leads to additional logic for restoring it again from Chainlit’s persistence after a WebSocket connection drop in on_message, since on_chat_end is triggered both on connection drops and real chat ends (i.e., when the user closes the chat window). The issue is that on_chat_start or on_chat_resume aren’t called on reconnection, so this forces unnecessary workarounds, unclear logic, and code complications in my project.

this is exactly the kind of scenario I believe this PR addresses, yes. the naming is only logical with a very specific design pattern (and user behavior/infrastructure) in mind, where the window is opened, chat happens, chat ends, all with a stable connection...

If we had a clear workflow with separate callbacks for real thread lifecycle events and connection lifecycle events, this problem wouldn’t exist.

indeed.

My only concern is breaking changes, so I’d prefer that we first discuss and build a diagram/workflow, define the callbacks we really need, and then proceed with deprecating on_chat_start, on_chat_resume and on_chat_end in favor of introducing:

  • on_thread_start
  • on_thread_resume
  • on_thread_end
  • on_connect
  • on_disconnect

and maybe making the first two aliases for the existing callbacks.

I don’t think these callbacks would overcomplicate Chainlit, since they’re simple both for Chainlit developers and end-user developers and just need to be well documented. I’ve already started working on describing all callbacks in a diagram here, but haven’t completed it yet - this PR could be a good push to finalize that work.

I would argue for clearer separation between functionality which depends on built-in auth/persistence and agnostic event signals. this even seems to be confusing internally (see prev comment). there is a strong argument for unopinionated callbacks, since this project is no longer the "top of funnel" for a hosted cloud service ;) if your thread callbacks refer to the data layer's threads, that should be "on the tin" and project organization should reflect that (i.e., import from the data layer), IMO.

agree the whole callbacks universe is a mess rn, but this level of refactoring feels like it should be a major version change, i.e., deprecate on_chat_start etc. at 3.0.0...
