add gemini agent and computer use tool #169

alberthu233 · 2025-10-13T22:06:47Z

Summary

Adds a new GeminiAgent and a Gemini computer-use tool, wiring them into the hud/agents and hud/tools folders with minimal tests and an example. This is a draft for early review and feedback.

Changes

hud/agents/__init__.py: export GeminiAgent
hud/agents/gemini.py: new MCP-based Gemini agent
hud/agents/tests/test_gemini.py: some unit tests
hud/tools/__init__.py, hud/tools/computer/__init__.py: expose GeminiComputerTool
hud/tools/computer/gemini.py: Gemini Computer Use tool (maps some gemini predefined actions → executor)
hud/tools/computer/settings.py: Gemini environment resolution width/height/rescale settings
hud/tools/playwright.py: page/context reuse + wait_for_load_state
hud/tools/types.py: add a url field to ContentResult to support the URL in Gemini tool-call requests
pyproject.toml: add google-genai

Why

Enables first-class Gemini support alongside existing agents.
Implements Gemini Computer Use parity (click/hover/type/scroll/navigate/key combos/drag).

Usage

export GEMINI_API_KEY=***
python examples/gemini_agent.py

Default model in tests/examples: gemini-2.5-computer-use-preview-10-2025

lorenss-m

See two comments

lorenss-m · 2025-10-14T21:24:56Z

hud/agents/gemini.py

+logger = logging.getLogger(__name__)
+
+# Maximum number of recent turns to keep screenshots for
+MAX_RECENT_TURN_WITH_SCREENSHOTS = 3


Is this something every agent should have?

alberthu233 · 2025-10-15

I am not 100% sure how other agents handle this. It comes from the reference implementation in Google's docs for the Gemini computer-use agent. The intent is to keep screenshots only for the last few turns, which cuts down the large chunk of multimodal context sent to the API.
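The trimming described in this reply can be sketched roughly as follows. This is a minimal illustration, not the code in hud/agents/gemini.py: the `Turn` dataclass and the `trim_screenshots` helper are hypothetical names, a screenshot is modeled as optional bytes, and "last few turns" is assumed to mean the last few turns that actually carry a screenshot.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Same default as the constant in the diff above.
MAX_RECENT_TURN_WITH_SCREENSHOTS = 3


@dataclass(frozen=True)
class Turn:
    """Hypothetical stand-in for one conversation turn."""

    text: str
    screenshot: Optional[bytes] = None


def trim_screenshots(
    turns: list[Turn], keep: int = MAX_RECENT_TURN_WITH_SCREENSHOTS
) -> list[Turn]:
    """Return a copy of `turns` in which only the most recent `keep`
    screenshot-bearing turns still hold their screenshot."""
    remaining = keep
    trimmed: list[Turn] = []
    # Walk from newest to oldest so the newest screenshots are kept.
    for turn in reversed(turns):
        if turn.screenshot is not None:
            if remaining > 0:
                remaining -= 1
            else:
                # Drop the image but keep the turn's text.
                turn = replace(turn, screenshot=None)
        trimmed.append(turn)
    trimmed.reverse()
    return trimmed
```

The helper returns new objects rather than mutating the history in place, so the original turns stay available for logging or debugging.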

lorenss-m · 2025-10-14T21:31:03Z

hud/agents/gemini.py

+                        for key in (
+                            "text",
+                            "press_enter",
+                            "clear_before_typing",
+                            "safety_decision",
+                            "safetyDecision",
+                            "direction",
+                            "magnitude",
+                            "url",
+                            "keys",
+                            "x",
+                            "y",
+                            "destination_x",
+                            "destination_y",
+                        ):
+                            if key in raw_args:
+                                normalized_args[key] = raw_args[key]


Why are these hardcoded?

alberthu233 · 2025-10-29

These are intentionally hard-coded as an allowlist of supported tool-argument keys, which prevents arbitrary or unknown fields from leaking into actions. Alternatively, I can declare them at the top of the module, like PREDEFINED_COMPUTER_USE_FUNCTIONS.
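The alternative mentioned in this reply (declaring the allowlist once at module level) might look like the sketch below. The names `ALLOWED_TOOL_ARG_KEYS` and `normalize_args` are hypothetical; the key names are copied from the diff hunk above, including both spellings of the safety-decision key.

```python
from typing import Any, Mapping

# Hypothetical module-level allowlist; keys taken from the diff above.
ALLOWED_TOOL_ARG_KEYS: frozenset[str] = frozenset(
    {
        "text",
        "press_enter",
        "clear_before_typing",
        "safety_decision",
        "safetyDecision",
        "direction",
        "magnitude",
        "url",
        "keys",
        "x",
        "y",
        "destination_x",
        "destination_y",
    }
)


def normalize_args(raw_args: Mapping[str, Any]) -> dict[str, Any]:
    """Keep only allowlisted keys; unknown fields are silently dropped."""
    return {
        key: value
        for key, value in raw_args.items()
        if key in ALLOWED_TOOL_ARG_KEYS
    }
```

One difference from the loop in the diff: this version preserves the key order of `raw_args` rather than the order of the allowlist, which should not matter for keyword-style arguments.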

jdchawla29 · 2025-10-15T23:06:33Z

can we also add gemini agent to hud eval

cursor · 2025-10-29T02:56:29Z

hud/tools/playwright.py


+            # Only create a new page if we didn't already reuse one above
+            if self.page is None:
+                self.page = await self._browser_context.new_page()


Bug: Redundant Page Creation Logic

In the _ensure_browser method, lines 236-238 create a new page if self.page is None, but this logic is immediately overridden by lines 240-246, which unconditionally execute and re-check the pages from _browser_context.pages. This means the condition check at line 237 is ineffective - the code at lines 240-246 will always execute and bypass the intended early-exit logic. This can cause unnecessary page recreation and defeats the purpose of the check at line 237. The code should either use an else block to prevent the redundant logic, or remove lines 237-238 entirely.
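One way to make the branches mutually exclusive, as the comment suggests, is sketched below. It is not the actual `_ensure_browser` code: `FakeContext` is a hypothetical stand-in that mimics only the `pages` attribute and `new_page()` coroutine of a Playwright browser context, `ensure_page` is a hypothetical helper, and reusing the first open page is an assumption.

```python
import asyncio


class FakeContext:
    """Hypothetical stand-in for a Playwright BrowserContext."""

    def __init__(self, pages=None):
        self.pages = list(pages or [])

    async def new_page(self):
        page = object()
        self.pages.append(page)
        return page


async def ensure_page(context, current_page=None):
    """Return a usable page, creating one only as a last resort."""
    if current_page is not None:
        # A page is already tracked: nothing to do.
        return current_page
    if context.pages:
        # Reuse an existing page (assumption: the first one).
        return context.pages[0]
    # No page anywhere: create exactly one.
    return await context.new_page()


async def demo():
    context = FakeContext()
    first = await ensure_page(context)
    second = await ensure_page(context)
    return first is second, len(context.pages)
```

Because each branch returns early, the reuse check and the creation check can never both run, which is the property the bot's comment asks for.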

Parth220

Just got a chance to review this. Mostly looks good! Cross referenced with the API and reference repo, so only a few nits.

We should do the following order of operations:

Add new tool and agent into hud-python sdk and make a release
Update environments to include the new hud-python sdk version and push updated canonical images for our core environments
Simplify the boilerplate to support Gemini after updating environments (look at the difference between claude_example vs gemini_example).

Additionally, here are a few pieces of feedback:

Screenshot trimming count is hardcoded (not configurable). Let's add an env var with the default set to 3, then use that env var when initializing the agent/tool.
Update the safety checks for the correct variable name.
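The first feedback item above (a configurable screenshot-trimming count) could be sketched as follows. The environment variable name `HUD_GEMINI_MAX_RECENT_TURNS_WITH_SCREENSHOTS` and the helper name are hypothetical, and falling back to the default on invalid input is an assumption, not something the review specifies.

```python
import os
from typing import Mapping, Optional

DEFAULT_MAX_RECENT_TURNS = 3
# Hypothetical variable name, chosen only for this illustration.
ENV_VAR = "HUD_GEMINI_MAX_RECENT_TURNS_WITH_SCREENSHOTS"


def max_recent_turns_with_screenshots(
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Read the trimming count from the environment, defaulting to 3.

    Missing, empty, non-integer, or negative values fall back to the default.
    """
    env = os.environ if environ is None else environ
    raw = env.get(ENV_VAR)
    if raw is None or not raw.strip():
        return DEFAULT_MAX_RECENT_TURNS
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MAX_RECENT_TURNS
    return value if value >= 0 else DEFAULT_MAX_RECENT_TURNS
```

Accepting an optional mapping keeps the helper easy to test without touching the real process environment.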

Parth220 · 2025-10-29T05:08:27Z

hud/agents/gemini.py

+                            "safety_decision",
+                            "safetyDecision",


Do we need both of these?

I only see safety_decision in the reference docs

Also are we handling those checks correctly?

https://github.com/google-gemini/computer-use-preview/blob/main/agent.py#L310-L326

Parth220 · 2025-10-29T05:10:52Z

hud/agents/gemini.py

+                # (include multiple canonical spellings to maximize compatibility)
+                response_dict["acknowledged"] = True
+                response_dict["acknowledged_safety"] = True
+                response_dict["acknowledgedSafetyDecision"] = True
+


# (include multiple canonical spellings to maximize compatibility)

This doesn't make sense, since the field expected by the Gemini reference docs is safety_acknowledgement: https://github.com/google-gemini/computer-use-preview/blob/main/agent.py#L310-L326

Please fix this!

Suggested change

-                # (include multiple canonical spellings to maximize compatibility)
-                response_dict["acknowledged"] = True
-                response_dict["acknowledged_safety"] = True
-                response_dict["acknowledgedSafetyDecision"] = True
+            # For Gemini Computer Use actions, always acknowledge safety decisions
+            requires_ack = False
+            if tool_call.arguments:
+                requires_ack = bool(tool_call.arguments.get("safety_decision"))
+            if gemini_name in PREDEFINED_COMPUTER_USE_FUNCTIONS and requires_ack:
+                # Per Gemini Computer Use API docs: safety_acknowledgement is the documented field
+                response_dict["safety_acknowledgement"] = True

alberthu233 · 2025-10-29

Yes, that makes more sense now. It will be fixed in the new PR.
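For reference, the logic of the suggested change can be exercised in isolation with a sketch like the one below. The function `build_response` is a hypothetical name, and the contents of `PREDEFINED_COMPUTER_USE_FUNCTIONS` here are an illustrative subset only; the real constant is defined in hud/agents/gemini.py and is not shown in this thread.

```python
from typing import Any, Mapping, Optional

# Illustrative subset only; the real constant is not shown in this thread.
PREDEFINED_COMPUTER_USE_FUNCTIONS = frozenset(
    {"click_at", "type_text_at", "navigate"}
)


def build_response(
    gemini_name: str,
    arguments: Optional[Mapping[str, Any]],
    result: Mapping[str, Any],
) -> dict[str, Any]:
    """Copy `result` and add the acknowledgement field only when the
    tool call carried a safety_decision argument."""
    response_dict = dict(result)
    requires_ack = bool(arguments and arguments.get("safety_decision"))
    if gemini_name in PREDEFINED_COMPUTER_USE_FUNCTIONS and requires_ack:
        # Field name as given in the review comment above.
        response_dict["safety_acknowledgement"] = True
    return response_dict
```

Calls without a `safety_decision` argument, and calls to functions outside the predefined set, pass through unchanged.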

alberthu233 added 3 commits October 13, 2025 16:30

add gemini agent and computer use tool

b4a319b

update testing script

17eba8b

resolution handelling and added unit test

5219653

lorenss-m reviewed Oct 15, 2025

add GeminiComputeruseTool to remote browser and add url support

f5f0dcc

Merge branch 'main' into feature/genmini-agent

cfb496b

alberthu233 marked this pull request as ready for review October 29, 2025 02:55

alberthu233 had a problem deploying to pre-release October 29, 2025 02:55 — with GitHub Actions Failure

cursor bot reviewed Oct 29, 2025

Parth220 reviewed Oct 29, 2025

alberthu233 added a commit to alberthu233/hud-python that referenced this pull request Oct 29, 2025

Add Gemini sub pr from hud-evals#169

8dddac7

alberthu233 mentioned this pull request Oct 29, 2025

Add Gemini agent, tools and cli support, from #169 #189

Merged

Parth220 pushed a commit that referenced this pull request Oct 29, 2025

Add Gemini agent, tools and cli support, from #169 (#189)

638d78c

* Add Gemini sub pr from #169 * add cli support for gemini agent * fix lint * fix ruff
