@alberthu233
Contributor

Summary

Adds a new GeminiAgent and a Gemini computer-use tool, wiring them into the hud/agents and hud/tools folders with minimal tests and an example. This is a draft for early review and feedback.

Changes

  • hud/agents/__init__.py: export GeminiAgent
  • hud/agents/gemini.py: new MCP-based Gemini agent
  • hud/agents/tests/test_gemini.py: some unit tests
  • hud/tools/__init__.py, hud/tools/computer/__init__.py: expose GeminiComputerTool
  • hud/tools/computer/gemini.py: Gemini Computer Use tool (maps some of Gemini's predefined actions → executor)
  • hud/tools/computer/settings.py: Gemini environment resolution width/height/rescale settings
  • hud/tools/playwright.py: page/context reuse + wait_for_load_state
  • hud/tools/types.py: add a url field to ContentResult so the current URL can be returned for Gemini tool call requests
  • pyproject.toml: add google-genai

Why

  • Enables first-class Gemini support alongside existing agents.
  • Implements Gemini Computer Use parity (click/hover/type/scroll/navigate/key combos/drag).

Usage

export GEMINI_API_KEY=***
python examples/gemini_agent.py

Default model in tests/examples: gemini-2.5-computer-use-preview-10-2025

Contributor

@lorenss-m lorenss-m left a comment

See two comments

logger = logging.getLogger(__name__)

# Maximum number of recent turns to keep screenshots for
MAX_RECENT_TURN_WITH_SCREENSHOTS = 3
Contributor

Is this something every agent should have?

Contributor Author

I am not 100% sure how other agents handle this. This is something I found in Google's docs for the Gemini computer use agent, in their reference implementation. The intention is to keep screenshots only for the last few turns, to reduce the large chunk of multimodal context sent to the API.
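To make the trimming idea concrete, here is a minimal sketch of keeping screenshots only for the most recent turns. It is not the code from this PR: the turn shape (a dict with a "screenshot" key) and the function name `trim_old_screenshots` are assumptions made purely for illustration.

```python
from typing import Any

# Same default as the constant in this PR.
MAX_RECENT_TURN_WITH_SCREENSHOTS = 3


def trim_old_screenshots(
    turns: list[dict[str, Any]],
    keep_last: int = MAX_RECENT_TURN_WITH_SCREENSHOTS,
) -> list[dict[str, Any]]:
    """Return a copy of `turns` where only the newest `keep_last`
    screenshot-bearing turns still carry their screenshot.

    Hypothetical turn shape: {"text": str, "screenshot": bytes | None}.
    """
    kept = 0
    trimmed: list[dict[str, Any]] = []
    # Walk from newest to oldest so the most recent screenshots survive.
    for turn in reversed(turns):
        new_turn = dict(turn)  # shallow copy; the input is left untouched
        if new_turn.get("screenshot") is not None:
            if kept < keep_last:
                kept += 1
            else:
                new_turn["screenshot"] = None
        trimmed.append(new_turn)
    trimmed.reverse()  # restore chronological order
    return trimmed
```

Text content of older turns is preserved; only the image payload is dropped, which is where most of the context size comes from.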

Comment on lines +263 to +279
for key in (
    "text",
    "press_enter",
    "clear_before_typing",
    "safety_decision",
    "safetyDecision",
    "direction",
    "magnitude",
    "url",
    "keys",
    "x",
    "y",
    "destination_x",
    "destination_y",
):
    if key in raw_args:
        normalized_args[key] = raw_args[key]
Contributor

Why are these hardcoded?

Contributor Author

@alberthu233 alberthu233 Oct 29, 2025

These are intentionally hard-coded as an allowlist of supported tool-arg keys, which prevents arbitrary or unknown fields from leaking into actions. Alternatively, I can declare them at the top of the module, just like PREDEFINED_COMPUTER_USE_FUNCTIONS.
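As a rough sketch of the module-level alternative mentioned above: the constant name `ALLOWED_TOOL_ARG_KEYS` and the helper `normalize_args` are hypothetical, and the key set is copied from the snippet under review as-is.

```python
from typing import Any

# Hypothetical module-level allowlist; keys copied from the reviewed snippet.
ALLOWED_TOOL_ARG_KEYS: frozenset[str] = frozenset(
    {
        "text",
        "press_enter",
        "clear_before_typing",
        "safety_decision",
        "safetyDecision",
        "direction",
        "magnitude",
        "url",
        "keys",
        "x",
        "y",
        "destination_x",
        "destination_y",
    }
)


def normalize_args(raw_args: dict[str, Any]) -> dict[str, Any]:
    """Keep only allowlisted keys; unknown fields are silently dropped."""
    return {k: v for k, v in raw_args.items() if k in ALLOWED_TOOL_ARG_KEYS}
```

Declaring the set once at module level keeps the allowlist next to the other constants and makes it importable from tests.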

@jdchawla29
Collaborator

Can we also add the Gemini agent to hud eval?

@alberthu233 alberthu233 marked this pull request as ready for review October 29, 2025 02:55

# Only create a new page if we didn't already reuse one above
if self.page is None:
    self.page = await self._browser_context.new_page()

Bug: Redundant Page Creation Logic

In the _ensure_browser method, lines 236-238 create a new page if self.page is None, but this logic is immediately overridden by lines 240-246, which unconditionally execute and re-check the pages from _browser_context.pages. This means the condition check at line 237 is ineffective - the code at lines 240-246 will always execute and bypass the intended early-exit logic. This can cause unnecessary page recreation and defeats the purpose of the check at line 237. The code should either use an else block to prevent the redundant logic, or remove lines 237-238 entirely.
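One way to express the intended control flow is a single if/elif/else chain, so that page creation happens only when nothing can be reused. The sketch below is not the PR's `_ensure_browser`; `ensure_page` and `FakeContext` are hypothetical names, and the fake only mimics the `pages` attribute and async `new_page()` method of a Playwright browser context. Reusing the first existing page is also an assumption.

```python
import asyncio
from typing import Any, Optional


class FakeContext:
    """Illustrative stand-in for a browser context (not Playwright itself)."""

    def __init__(self, pages: Optional[list] = None) -> None:
        self.pages: list = list(pages or [])

    async def new_page(self) -> Any:
        page = object()
        self.pages.append(page)
        return page


async def ensure_page(current_page: Optional[Any], context: Any) -> Any:
    """Return a usable page, creating one only as a last resort."""
    if current_page is not None:
        # Already holding a page: nothing to do.
        return current_page
    if context.pages:
        # Reuse a page that already exists in the context.
        return context.pages[0]
    # No page anywhere: create exactly one.
    return await context.new_page()
```

Because each branch returns, the reuse check and the creation check can no longer override each other.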

Contributor

@Parth220 Parth220 left a comment

Just got a chance to review this. Mostly looks good! Cross referenced with the API and reference repo, so only a few nits.

We should do the following order of operations:

  • Add new tool and agent into hud-python sdk and make a release
  • Update environments to include the new hud-python sdk version and push updated canonical images for our core environments
  • Simplify the boilerplate to support Gemini after updating environments (look at the difference between claude_example vs gemini_example).

Additionally, here are a few pieces of feedback:

  • Screenshot trimming count is hardcoded (not configurable). Let's add an env var with the default set to 3, then use that env var when initializing the agent/tool.
  • Update the safety checks for the correct variable name.
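The env-var suggestion for the screenshot trimming count could look roughly like this. The variable name `GEMINI_MAX_RECENT_TURN_WITH_SCREENSHOTS` and the helper name are assumptions for illustration; the PR may settle on a different name or read the value through hud's settings instead.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_RECENT_TURNS = 3
# Hypothetical env var name, chosen only for this sketch.
ENV_VAR = "GEMINI_MAX_RECENT_TURN_WITH_SCREENSHOTS"


def max_recent_turns_from_env(environ: Optional[Mapping[str, str]] = None) -> int:
    """Read the screenshot-retention count, falling back to the default of 3
    when the variable is unset, empty, non-numeric, or negative."""
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR)
    if raw is None or not raw.strip():
        return DEFAULT_MAX_RECENT_TURNS
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MAX_RECENT_TURNS
    return value if value >= 0 else DEFAULT_MAX_RECENT_TURNS
```

The agent or tool constructor could then take this value as its default, while still accepting an explicit argument that overrides it.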

Comment on lines +267 to +268
"safety_decision",
"safetyDecision",
Contributor

Do we need both of these?

I only see safety_decision in the reference docs

Also are we handling those checks correctly?

https://github.com/google-gemini/computer-use-preview/blob/main/agent.py#L310-L326

Comment on lines +375 to +379
# (include multiple canonical spellings to maximize compatibility)
response_dict["acknowledged"] = True
response_dict["acknowledged_safety"] = True
response_dict["acknowledgedSafetyDecision"] = True

Contributor

           # (include multiple canonical spellings to maximize compatibility)

This doesn't make sense, since the field expected by the Gemini reference docs is safety_acknowledgement: https://github.com/google-gemini/computer-use-preview/blob/main/agent.py#L310-L326

Please fix this!

Contributor

Suggested change
# (include multiple canonical spellings to maximize compatibility)
response_dict["acknowledged"] = True
response_dict["acknowledged_safety"] = True
response_dict["acknowledgedSafetyDecision"] = True
# For Gemini Computer Use actions, always acknowledge safety decisions
requires_ack = False
if tool_call.arguments:
    requires_ack = bool(tool_call.arguments.get("safety_decision"))
if gemini_name in PREDEFINED_COMPUTER_USE_FUNCTIONS and requires_ack:
    # Per Gemini Computer Use API docs: safety_acknowledgement is the documented field
    response_dict["safety_acknowledgement"] = True

Contributor Author

Yes, that makes more sense now. It will be fixed in the new PR.

alberthu233 added a commit to alberthu233/hud-python that referenced this pull request Oct 29, 2025
Parth220 pushed a commit that referenced this pull request Oct 29, 2025
* Add Gemini sub pr from #169

* add cli support for gemini agent

* fix lint

* fix ruff
4 participants