Feature/websocket lifecycle callbacks #2479
@francisjervis thank you for your contribution! My preference would be to limit the number of callbacks and the duplication, as I think this could introduce even more ambiguity, especially for newer developers. Would you be able to fire the existing callbacks during a new websocket connection instead, to fix the existing callbacks? I agree with you that this should be covered and there are unhandled edge cases, but I worry about increasing complexity with this approach; especially now that we're community managed, we're looking for clean scalability and maintainability.
The current callbacks are, frankly, quite poorly named (and confusingly documented). They do not reflect the actual chat start and end - eg it is not safe to include clean-up/finalization logic in `on_chat_end`.

To expand on that last point: if the connection is dropped/resumed and the last server/assistant action was sending a `cl.Message`, and the user sends a regular message after that, the user session does appear to be restored correctly. If the last server/assistant action was an `Ask*` message, however, the pending response is never handled after the reconnect.
No, that would be a breaking change. Moreover it would not resolve the ambiguity/naming issues - quite the opposite. Obviously the best solution would be for the original `Ask*` callback to seamlessly pick up handling the response, but that is probably intractable. The second best option - again, breaking existing implementations, probably for everyone - would be to remove the current chat start/end/resume callbacks completely and replace them with unambiguous network state signals, leaving the developer to handle the logic for new/existing chat sessions (ie checking whether the session already exists).

What I am proposing would at least allow developers to detect connection drops and re-send the last `Ask*` message so the response can be handled appropriately...
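The re-send idea can be sketched framework-free. This is a minimal illustration only - the names (`send_ask`, `on_socket_reconnect`, etc.) are hypothetical and are not Chainlit's actual API: the point is simply that tracking the last unanswered `Ask*` prompt per session lets a reconnect callback replay it instead of silently dropping it.

```python
# Illustrative sketch only - hypothetical names, not Chainlit's API.
# Idea: remember the last unanswered Ask* prompt per session so a
# reconnect handler can re-send it instead of losing it.

pending_asks: dict[str, str] = {}      # session_id -> unanswered prompt
sent_log: list[tuple[str, str]] = []   # every (session_id, prompt) (re)sent


def send_ask(session_id: str, prompt: str) -> None:
    """Send an Ask* prompt and record it as pending until answered."""
    pending_asks[session_id] = prompt
    sent_log.append((session_id, prompt))


def on_answer(session_id: str) -> None:
    """The user answered, so nothing is pending any more."""
    pending_asks.pop(session_id, None)


def on_socket_reconnect(session_id: str) -> None:
    """Proposed callback: replay the pending prompt, if any."""
    prompt = pending_asks.get(session_id)
    if prompt is not None:
        send_ask(session_id, prompt)


# Simulated flow: prompt sent, connection drops before the answer arrives.
send_ask("s1", "Do you consent? (yes/no)")
on_socket_reconnect("s1")  # prompt is re-sent, not lost
on_answer("s1")
on_socket_reconnect("s1")  # nothing pending, nothing re-sent
print(sent_log)
```

The same bookkeeping would work with any external store (Redis, a database) in place of the in-process dict.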
This is a scalability issue because you cannot run a Chainlit application which uses the `Ask*` message functionality in production without setting the server request timeout to its maximum value to avoid premature, silent chat termination. Even then, the chat session length is limited to the server request timeout value. When that is reached, the chat silently ends!

To reiterate: at present, if the application uses `Ask*` messages, Chainlit demands that the end user has a stable network connection, does not switch away - even momentarily - from the chat window (at least on mobile, ie 70% of web traffic), and finishes their chat before the server request times out (on GCR, 1 hour). This is not really an edge case - it is the difference between being OK for a demo, where you can avoid actions which will drop the connection, and being usable "in the wild."

Also note that installing the package from a fork does not work (#2467) - so that is not an option.
@francisjervis your description of the socket connection drops not being handled by the existing callbacks makes sense based on their definitions, but I have not experienced this at all, and I have hundreds of users with auth/persistence/etc. Based on your description, users are leaving/switching away, coming back over an hour later (after maximizing the timeout), and it fails because the session has expired? That does seem like an edge case to me. Waiting over an hour between interactions in the same thread is not typical LLM usage, and even then chat resume is a thing that enables this. I'd like to see some reproducible code demonstrating this issue and the circumstances it happens in, to get a better understanding of why new callbacks are required or if there's another way.
But then moving …

Thank you for #2467. I think it should be fixed ASAP, and it can be done quite easily. I'll look into it today.
@hayescode I experienced WebSocket connection drops when streaming large AI responses, and we also have user connections lasting more than 2 hours - people just leave the tab open in their browsers for days, if not weeks or months, and browsers only put a tab into sleeping mode after some time. Based on my experience with WebSockets in other technologies (we used them in GraphQL subscriptions), connection drops happen quite often.
Are you using …?
No, that is not what I mean. Mobile browsers - particularly Chrome - drop WS connections if the user minimizes the app/switches to another app, etc. So someone can be using a Chainlit-based web app, momentarily (and I literally mean for any non-zero amount of time) lock/reopen their phone or check a message in another app, and this bug occurs. It also occurs if they are actively using the chat for more than the server request timeout period, which, in the case of Google Cloud Run, cannot be set to more than 3600 seconds. No leaving/switching away needed. Setting the timeout to this long a period as a work-around is really not a good solution at scale.
As I said, that isn't the issue. Though that would still not be great.

By the way, I set up a test project with auth/data persistence, and the behavior is the same.

In my use case - which is a postdoc research project - I am using CL to run an interviewing agent. The first message is a "consent screen" with a yes/no `Ask*` prompt.

I also kinda disagree that this is not "typical LLM usage" lol - I have Claude etc. chats open that I come back to after days. I do not think people are somehow OK with this as just what happens with AI chat apps, because I cannot think of any other services which are affected by this.

Try running this file. To demonstrate the issue, connect using ngrok and open the URL on your phone. While the yes/no question is displayed, but before responding, lock and then immediately re-open your phone. Try it again (after reloading, of course, because the original chat is now borked), answering the yes/no question and then switching away while the "what is your name" question is displayed. Try answering the question. Watch the logs. Try it on desktop and use the dev tools connection control to go "offline" then reconnect to simulate network loss at the same stages (this will not work if you are on localhost:8000 - you have to use ngrok) so you can once again watch the logs.
Are you using `Ask*` messages? This does not occur with vanilla `cl.Message` exchanges.
Chrome for iOS drops it basically immediately. I was surprised too...
I'm not quite sure who this was addressed to. My point was that it would be worse to change the opinionated "start" and "end" callback logic, for precisely this reason. The docs should be (and, per my matching PR, are being) updated to clarify exactly when these events will (not) fire. This PR does not modify the existing callbacks at all.
@francisjervis so this only occurs when using `Ask*` messages?
You mention you aren't using authentication/data persistence. This is why (or one of the reasons why) you're seeing this behavior. Of course people come back to chats, which is why we persist them and use the `on_chat_resume` callback.

Please understand we're aiming to simplify and reduce the complexity of Chainlit now that there are no full-time devs. I understand this PR may solve your issue, but we have to think of all current and future developers.
@francisjervis Sorry, accidentally clicked on close instead of comment :)

@hayescode I use persistence, but I also use LangGraph with …

If we had a clear workflow with separate callbacks for real thread lifecycle events and connection lifecycle events, this problem wouldn't exist. My only concern is breaking changes, so I'd prefer that we first discuss and build a diagram/workflow, define the callbacks we really need, and then proceed with deprecating the old ones, and maybe with making the first two aliases to the existing callbacks. I don't think these callbacks would overcomplicate Chainlit, since they're simple both for Chainlit developers and end-user developers and just need to be well documented. I've already started working on describing all callbacks in a diagram here, but haven't completed it yet - this PR could be a good push to finalize that work.
So the thing is, @hayescode, that isn't actually what data persistence/chat "resume" does. See this video (app is the file posted earlier):

Screen.Recording.2025-09-02.at.9.53.30.PM.mov

That is without auth/data persistence. As you can see, even a very brief disconnection breaks the `Ask*` flow.

Now here's the same test with auth enabled:

cl.auth.vid.mov

As you can see, the behavior on disconnect/reconnect is exactly the same. The error on clicking the now-defunct button is the same in both cases.
Respectfully, this is a misunderstanding of how auth/persistence work in practice. As you can see in the second video, enabling them does not change this behavior.

What is not common (but probably should be) is the "questioner" design pattern - where an agent is designed to elicit information from the user proactively. Compare to the "assistant" pattern, where the user initiates the conversation and the AI responds. CL, LangChain and honestly most other LLM scaffolding projects are more or less obnoxiously opinionated in this regard - LangGraph I believe can handle it, but I do not recommend it. They are all built with a "ChatGPT clone" as the model, which is unfortunate to say the least.
haha no worries xD
this is exactly the kind of scenario i believe this PR addresses, yes. the naming is only logical with a very specific design pattern (and user behavior/infrastructure) in mind, where window is opened, chat happens, chat ends, all with a stable connection...
indeed.
i would argue for clearer separation between functionality which depends on built-in auth/persistence and agnostic event signals. this even seems to be confusing internally (see prev comment). there is a strong argument for unopinionated callbacks, since this project is no longer the "top of funnel" for a hosted cloud service ;) if your thread callbacks refer to the data layer's threads, that should be "on the tin" and project organization should reflect that (ie import from the data layer), IMO. agree the whole callbacks universe is a mess rn but this level of refactoring feels like it should be a major version change, ie deprecate the current callbacks first.
This PR adds callbacks `on_socket_connect` and `on_socket_disconnect` to enable (among other things) state restoration in the absence of authentication/built-in data persistence, and handling of connectivity issues with the client.

The current `on_chat_resume` handler depends on the Chainlit data layer, and `on_chat_start`/`on_chat_end` are ambiguous wrt their handling of 1: WS status changes and 2: user intent - for instance, `on_chat_end` is called if the server times out the connection (#2198). `on_chat_resume` does not appear to be called (even if the data layer is enabled) after a WS drop/reconnect cycle. `on_chat_start` is not called if the WS connection is dropped and re-established, only on the first connection for a session. Note that the current documentation for `on_chat_start` says "to react to the user websocket connection event", which is ambiguous/misleading.

In at least some contexts this leads to a silent failure state (on both ends) where user inputs are not processed, there is no "disconnected" message shown to the user, and there is no backend signal. This is particularly problematic when the WS connection drops after an `AskUserMessage` was sent.

While there is certainly some overlap with the existing chat lifecycle callbacks, adding unambiguous/unopinionated network state change handling both avoids breaking changes (ie renaming/refactoring the chat start/end callbacks) and adds the ability to use "hand-rolled" state persistence approaches which do not depend on the Chainlit auth/data layer features.
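As a framework-free illustration of the "hand-rolled persistence" use case (the function names below are hypothetical and do not reproduce the PR's actual implementation), connection callbacks let an app snapshot per-session state on disconnect and restore it on reconnect, without the built-in auth/data layer:

```python
# Illustrative, framework-free sketch - hypothetical names only;
# the PR's actual decorators live inside Chainlit itself.

saved_state: dict[str, dict] = {}    # survives the socket (e.g. Redis in practice)
live_sessions: dict[str, dict] = {}  # in-memory state for connected sockets


def on_socket_connect(session_id: str) -> dict:
    """Restore any previously saved state for this session."""
    live_sessions[session_id] = dict(saved_state.get(session_id, {}))
    return live_sessions[session_id]


def on_socket_disconnect(session_id: str) -> None:
    """Snapshot the live state so a reconnect can pick it back up."""
    state = live_sessions.pop(session_id, None)
    if state is not None:
        saved_state[session_id] = state


# Simulated drop/reconnect cycle:
state = on_socket_connect("s1")
state["history"] = ["consent: yes"]
on_socket_disconnect("s1")           # browser backgrounded, WS dropped
restored = on_socket_connect("s1")   # user returns
print(restored)  # {'history': ['consent: yes']}
```

The same shape works with any external store keyed by session id, which is exactly the kind of persistence that does not require the Chainlit data layer.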