-
Notifications
You must be signed in to change notification settings - Fork 45
add gemini agent and computer use tool #169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
lorenss-m
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See two comments
| logger = logging.getLogger(__name__) | ||
|
|
||
| # Maximum number of recent turns to keep screenshots for | ||
| MAX_RECENT_TURN_WITH_SCREENSHOTS = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something every agent should have?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% sure how other agent work on this, this is something I found from the Google's doc of gemini computer use agent in their reference implementation. The intention was to keep only the last few terms of screenshots to reduce the big chunck of multimodal context for the API.
| for key in ( | ||
| "text", | ||
| "press_enter", | ||
| "clear_before_typing", | ||
| "safety_decision", | ||
| "safetyDecision", | ||
| "direction", | ||
| "magnitude", | ||
| "url", | ||
| "keys", | ||
| "x", | ||
| "y", | ||
| "destination_x", | ||
| "destination_y", | ||
| ): | ||
| if key in raw_args: | ||
| normalized_args[key] = raw_args[key] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these hardcoded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are intentionally hard‑coded as an allowlist of supported tool-arg keys, prevents arbitrary/unknown fields from leaking into actions, or I can also declare them at start of the code just like PREDEFINED_COMPUTER_USE_FUNCTIONS
|
can we also add gemini agent to hud eval |
|
|
||
| # Only create a new page if we didn't already reuse one above | ||
| if self.page is None: | ||
| self.page = await self._browser_context.new_page() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Redundant Page Creation Logic
In the _ensure_browser method, lines 236-238 create a new page if self.page is None, but this logic is immediately overridden by lines 240-246, which unconditionally execute and re-check the pages from _browser_context.pages. This means the condition check at line 237 is ineffective - the code at lines 240-246 will always execute and bypass the intended early-exit logic. This can cause unnecessary page recreation and defeats the purpose of the check at line 237. The code should either use an else block to prevent the redundant logic, or remove lines 237-238 entirely.
Parth220
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just got a chance to review this. Mostly looks good! Cross referenced with the API and reference repo, so only a few nits.
We should do the following order of operations:
- Add new tool and agent into hud-python sdk and make a release
- Update environments to include the new hud-python sdk verison and push updated canonical images for our core environments
- Simplify the boilerplate to support Gemini after updating environments (look at the difference between claude_example vs gemini_example).
Additionally here's my few pieces of feedback:
- Screenshot trimming count is hardcoded (not configurable). Let's add an env var with the default set to 3, then use that env var when initializing the agent/tool.
- Update the safety checks for the correct variable name.
| "safety_decision", | ||
| "safetyDecision", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need both of these?
I only see safety_decision in the reference docs
Also are we handling those checks correctly?
https://github.com/google-gemini/computer-use-preview/blob/main/agent.py#L310-L326
| # (include multiple canonical spellings to maximize compatibility) | ||
| response_dict["acknowledged"] = True | ||
| response_dict["acknowledged_safety"] = True | ||
| response_dict["acknowledgedSafetyDecision"] = True | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# (include multiple canonical spellings to maximize compatibility)
This doesn't make sense, since the expected field by the gemini reference docs issafety_acknowledgement https://github.com/google-gemini/computer-use-preview/blob/main/agent.py#L310-L326
Please fix this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # (include multiple canonical spellings to maximize compatibility) | |
| response_dict["acknowledged"] = True | |
| response_dict["acknowledged_safety"] = True | |
| response_dict["acknowledgedSafetyDecision"] = True | |
| # For Gemini Computer Use actions, always acknowledge safety decisions | |
| requires_ack = False | |
| if tool_call.arguments: | |
| requires_ack = bool(tool_call.arguments.get("safety_decision")) | |
| if gemini_name in PREDEFINED_COMPUTER_USE_FUNCTIONS and requires_ack: | |
| # Per Gemini Computer Use API docs: safety_acknowledgement is the documented field | |
| response_dict["safety_acknowledgement"] = True``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that makes more sense now, will be fixed in the new pr
Summary
Adds a new
GeminiAgentand a Gemini computer-use tool, wiring them into thehud/agentsandhud/toolsfolders with minimal tests and example. This is a draft for early review and feedback.Changes
hud/agents/__init__.py: exportGeminiAgenthud/agents/gemini.py: new MCP-based Gemini agenthud/agents/tests/test_gemini.py: some unit testshud/tools/__init__.py,hud/tools/computer/__init__.py: exposeGeminiComputerToolhud/tools/computer/gemini.py: Gemini Computer Use tool (maps some gemini predefined actions → executor)hud/tools/computer/settings.py: Gemini environment resolution width/height/rescale settingshud/tools/playwright.py: page/context reuse +wait_for_load_statehud/tools/types.py: addurltoContentResultto support URL for gemini tool call requestpyproject.toml: addgoogle-genaiWhy
Usage
Default model in tests/examples:
gemini-2.5-computer-use-preview-10-2025