Assistant: Initial pass at implementing a data summary tool for Python #8208

Merged
wesm merged 7 commits into main from feature/ai-data on Jul 18, 2025

Conversation

Contributor

@melissa-barca melissa-barca commented Jun 19, 2025

First pass at #7114

Provides Assistant with a getDataSummary tool, currently implemented only for Python, that returns a JSON-structured summary of a data object by using the Positron API to communicate with the Variables comm. I updated the Variables Python backend to reuse existing functionality from the data explorer.

I used the inspectVariables tool as a guide for retrieving info from the variables comm.

image

Release Notes

New Features

  • N/A

Bug Fixes

  • N/A

QA Notes

@:data-explorer
@:assistant
@:variables
@:plots
@:viewer

@melissa-barca melissa-barca requested a review from wesm June 19, 2025 21:15

github-actions bot commented Jun 19, 2025

E2E Tests 🚀
This PR will run tests tagged with: @:critical @:data-explorer @:assistant @:variables @:plots @:viewer

Contributor

@wesm wesm left a comment

This looks like a great start. My main suggestion is to rename the API that routes requests to the variables comm to something more generic (and it can just query a single session variable at a time) so that we can use it to add more data querying tools without having to modify the Positron API each time

The other change we will want to make is to handle these tool calls "asynchronously" so that they do not block the functioning of the variables comm. This basically means copying the pattern from the data explorer comm for the get_column_profiles request (and its corresponding return_column_profiles front-end API; see https://github.com/posit-dev/positron/blob/main/extensions/positron-python/python_files/posit/positron/data_explorer.py#L492-L519).

"type_display": column.type_display,
"summary_stats": summary_stats,
}
)
Contributor

It's a good starting point to have this tool surfaced in the variables comm. Since computing summary stats or other computed profiles can be expensive (and thus block other message handling in the variables comm), we'll probably want to separate "expensive" requests (e.g. summary stats, frequency tables, histograms, etc.) from "cheap" requests (like asking for the schema), and make sure that the expensive requests are performed in an asynchronous-response pattern like the get_column_profiles request in the data explorer. This doesn't all have to get done in this PR, so it can be follow-up work.
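The cheap/expensive split described in this comment could look roughly like the following. This is a minimal sketch and not Positron's implementation: `SketchComm`, `send_event`, and `callback_id` are hypothetical names, and a single-worker thread pool stands in for whatever job queue the kernel actually uses.

```python
import statistics
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class SketchComm:
    """Hypothetical stand-in for a comm handler (not a Positron class)."""

    def __init__(self, send_event: Callable[[dict], None]) -> None:
        # send_event is assumed to deliver an event to the frontend.
        self._send_event = send_event
        self._pool = ThreadPoolExecutor(max_workers=1)

    def get_schema(self, table: dict[str, list[float]]) -> dict:
        # Cheap request: answered inline by the message handler.
        return {"columns": list(table)}

    def get_summary_stats(
        self, callback_id: str, table: dict[str, list[float]]
    ) -> Future:
        # Expensive request: return immediately, compute on a worker thread,
        # then push the result as a separate event tagged with callback_id.
        def work() -> None:
            stats = {
                name: {
                    "min": min(values),
                    "max": max(values),
                    "mean": statistics.fmean(values),
                }
                for name, values in table.items()
            }
            self._send_event({"callback_id": callback_id, "summary_stats": stats})

        return self._pool.submit(work)
```

The caller matches the later event to its request through `callback_id`, so the handler stays free to answer other messages while the statistics are computed.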

@melissa-barca melissa-barca force-pushed the feature/ai-data branch 2 times, most recently from 29b64a0 to 94cb220 Compare June 27, 2025 04:03
@melissa-barca melissa-barca requested a review from jmcphers June 27, 2025 04:42
@melissa-barca melissa-barca marked this pull request as ready for review June 27, 2025 04:49
Contributor

@wesm wesm left a comment

This is close to a good stopping point for the initial pass. I think the main thing that needs to get fixed is the return type for the query_variable_data RPC. Since it isn't easy to access all of the data explorer comm types in all the layers where this function is called, we can just return serialized JSON from the function for now (effectively schema: string, column_profiles: string[]).
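A rough sketch of that "serialized JSON" return shape, assuming a hypothetical `ColumnProfile` dataclass and a hypothetical helper named `serialize_query_result` (neither is taken from the PR; only the `schema` / `column_profiles` shape and the `type_display` and `summary_stats` field names come from this thread):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ColumnProfile:
    """Hypothetical typed profile, for illustration only."""

    column_name: str
    type_display: str
    summary_stats: dict


def serialize_query_result(schema: dict, profiles: list[ColumnProfile]) -> dict:
    # Flatten the typed values into plain strings so that no data explorer
    # comm types need to be exposed in the layers that forward the result.
    return {
        "schema": json.dumps(schema),
        "column_profiles": [json.dumps(asdict(p)) for p in profiles],
    }
```

The trade-off is that every consumer has to call `json.loads` and trust the payload's shape, which is why well-typed results remain the better long-term option.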

# Create a temporary table view with a temporary comm
temp_state = DataExplorerState("temp_summary")
temp_comm = PositronComm.create(target_name="temp_summary", comm_id="temp_summary_comm")
table_view = _get_table_view(value, temp_comm, temp_state, self.kernel.job_queue)
Contributor

Maybe later we can set up a persistent data explorer comm to use for Assistant tool calls. (I realized just now, after my earlier comment about the async column profiles, which are not needed for now, that these depend on there being a live comm available to send the frontend event with the asynchronous result. We can look more closely at this later.)

"description": "Result of the summarize operation",
"type": "object",
"properties": {
"children": {
Contributor

This is returning a different return type right now (with the schema and column profiles, so a lot more complex). I think to avoid having to drag along the schema and profile result type (and mainly having to expose these in the Positron runtime / extHost API) we can just return the schema and profiles as a serialized JSON string to sidestep this issue for now. It would be good to make these results well-typed everywhere, but there's a bunch of plumbing needed.

@wesm wesm changed the title initial pass at implementing a data summary tool for Python Assistant: Initial pass at implementing a data summary tool for Python Jun 30, 2025
@wesm wesm force-pushed the feature/ai-data branch from b902acc to df11174 Compare June 30, 2025 18:59
Contributor

wesm commented Jun 30, 2025

I rebased this today and will work on some unit tests on the Python backend portion before it can be merged

@wesm wesm force-pushed the feature/ai-data branch 3 times, most recently from b0bb2d8 to 3d81ecd Compare July 2, 2025 23:58
Contributor

wesm commented Jul 3, 2025

@sharon-wang @jmcphers I think I've got this to a good stopping point on the Python side — I can go ahead and merge but it will be broken for R until #8343 is tackled (shouldn't be too difficult, I don't think!). Let me know how you would like to proceed

Member

@sharon-wang sharon-wang left a comment

I'm uninitiated with the python data explorer code, so I mostly looked at the assistant changes!

For #8343, is the leftover work to implement the equivalent to the extensions/positron-python changes in this PR for extensions/positron-r?

Comment on lines +318 to +336
// temporarily only enable for Python sessions
let session: positron.LanguageRuntimeSession | undefined;
const sessions = await positron.runtime.getActiveSessions();
if (sessions && sessions.length > 0) {
    session = sessions.find(
        (session) => session.metadata.sessionId === options.input.sessionIdentifier,
    );
}
if (!session) {
    return new vscode.LanguageModelToolResult([
        new vscode.LanguageModelTextPart('[[]]')
    ]);
}

if (session.runtimeMetadata.languageId !== 'python') {
    return new vscode.LanguageModelToolResult([
        new vscode.LanguageModelTextPart('[[]]')
    ]);
}
Member

Thoughts on filtering out this tool when there's no session and (temporarily) if it's not a python session? Maybe we can also temporarily add a note to the tool description that the tool is only available for Python. This way, the tool shouldn't be available or run at all.

We have some tool filtering logic here:

// List of tools for use by the language model.
const tools: vscode.LanguageModelChatTool[] = vscode.lm.tools.filter(
    tool => {
        // Don't allow any tools in the terminal.
        if (this.id === ParticipantID.Terminal) {
            return false;
        }
        // Define more readable variables for filtering.
        const inChatPane = request.location2 === undefined;
        const inEditor = request.location2 instanceof vscode.ChatRequestEditorData;
        const hasSelection = inEditor && request.location2.selection?.isEmpty === false;
        const isAgentMode = this.id === ParticipantID.Agent;
        // If streaming edits are enabled, don't allow any tools in inline editor chats.
        if (isStreamingEditsEnabled() && this.id === ParticipantID.Editor) {
            return false;
        }
        // If the tool requires a workspace, but no workspace is open, don't allow the tool.
        if (tool.tags.includes(TOOL_TAG_REQUIRES_WORKSPACE) && !isWorkspaceOpen()) {
            return false;
        }
        switch (tool.name) {
            // Only include the execute code tool in the Chat pane; the other
            // panes do not have an affordance for confirming executions.
            //
            // CONSIDER: It would be better for us to introspect the tool itself
            // to see if it requires confirmation, but that information isn't
            // currently exposed in `vscode.LanguageModelChatTool`.
            case PositronAssistantToolName.ExecuteCode:
                return inChatPane &&
                    // The execute code tool does not yet support notebook sessions.
                    positronContext.activeSession?.mode !== positron.LanguageRuntimeSessionMode.Notebook &&
                    isAgentMode;
            // Only include the documentEdit tool in an editor and if there is
            // no selection.
            case PositronAssistantToolName.DocumentEdit:
                return inEditor && !hasSelection;
            // Only include the selectionEdit tool in an editor and if there is
            // a selection.
            case PositronAssistantToolName.SelectionEdit:
                return inEditor && hasSelection;
            // Only include the edit file tool in edit or agent mode i.e. for the edit participant.
            case PositronAssistantToolName.EditFile:
                return this.id === ParticipantID.Edit || isAgentMode;
            // Only include the documentCreate tool in the chat pane and if the user is an agent.
            case PositronAssistantToolName.DocumentCreate:
                return inChatPane && isAgentMode;
            // Otherwise, include the tool if it is tagged for use with Positron Assistant.
            // Allow all tools in Agent mode.
            default:
                return isAgentMode ||
                    tool.tags.includes('positron-assistant');
        }
    }
);

Otherwise, we could throw an Error noting that this is only available for Python or return a string instead of returning an empty text part, just so it's clear to the user and the model why we were unable to grab the table summary info?
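The two options in this comment (hide the tool up front, or explain why it cannot run) can be sketched together. The real code is TypeScript; the sketch below uses Python only to keep the logic compact, and `Session`, `is_data_summary_tool_available`, and `unavailable_reason` are hypothetical names rather than Positron APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    """Hypothetical stand-in for a runtime session."""

    session_id: str
    language_id: str


def is_data_summary_tool_available(active: Optional[Session]) -> bool:
    # Filter predicate: offer the tool only when a Python session is active.
    return active is not None and active.language_id == "python"


def unavailable_reason(active: Optional[Session]) -> Optional[str]:
    # Fallback: return an explanatory message instead of an empty '[[]]'
    # result, so both the user and the model can see why nothing came back.
    if active is None:
        return "No active session was found, so no data summary is available."
    if active.language_id != "python":
        return (
            "Data summaries are currently only available for Python sessions, "
            f"not '{active.language_id}'."
        )
    return None
```

Filtering keeps the tool out of the model's tool list entirely, while the message path covers the case where the tool is still invoked against a session that cannot serve it.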

Contributor

Agreed, let me see if I can figure out how to do that

Contributor

image

Comment on lines +313 to +315
return new vscode.LanguageModelToolResult([
    new vscode.LanguageModelTextPart('[[]]')
]);
Member

Should we throw an error here?

Contributor

This code is copy-pasted from the inspectVariablesTool, so I'm a bit out of my depth on what this should be changed to. I'll push my changes now; let me know if you'd like to change this to something else and I'll let you edit the branch directly.

@wesm wesm force-pushed the feature/ai-data branch from 3d81ecd to a6f95c6 Compare July 16, 2025 10:29
Contributor

wesm commented Jul 16, 2025

Just rebased this and addressed everything but the question about whether to raise an error at https://github.com/posit-dev/positron/pull/8208/files#diff-e480e08db3fbdac969a0529ab74c8ff701d647882e9610bc2eec7b5e2a9f45f2 — I think this is mergeable in its current state and we can make improvements in follow up PRs. The tool is only available in Python sessions for now

@sharon-wang sharon-wang self-requested a review July 17, 2025 00:21
sharon-wang previously approved these changes Jul 17, 2025
Member

@sharon-wang sharon-wang left a comment

LGTM! I'm seeing the tool only available for Python sessions. Thank you for adding this 🙌

Contributor

wesm commented Jul 18, 2025

@sharon-wang this failed with

[compile           ] [16:33:11] Error: /home/runner/work/positron/positron/extensions/positron-assistant/src/participants.ts(288,30): Property 'activeSession' does not exist on type 'ChatContext'.

I'll try rebasing again to see if it is fixed on top of main

improve logging performance to satisfy linter

clean up code

provide temp comm to satisfy pyright

modify OpenRPC specs to autogen comms code and fix bug with passing
'path' parameter, also rename summarizeData function to make it more
generic

create data explorer helper functions

revert formatting change
@wesm wesm force-pushed the feature/ai-data branch from c7c3878 to 080c254 Compare July 18, 2025 14:54
@sharon-wang sharon-wang self-requested a review July 18, 2025 18:41
Member

@sharon-wang sharon-wang left a comment

Yay checks are green!

@wesm wesm merged commit b74dc34 into main Jul 18, 2025
30 checks passed
@wesm wesm deleted the feature/ai-data branch July 18, 2025 19:02
@github-actions github-actions bot locked and limited conversation to collaborators Jul 18, 2025
3 participants