
Some observations and questions on Google FRAMES Benchmark readurls&memory-gpt-4o-mini method evaluation #106

@RGSmirnov

Description

  1. Some observations on ReadURLs plugin - readurls_plugin.py

I ran the fetch_webpage_content function from readurls_plugin.py on one of the links from the test tasks of the Google FRAMES dataset - https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_New_York_City:

a = fetch_webpage_content("https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_New_York_City")
a.split(". ")

The first extracted string is: "List of tallest buildings in New York City New York City, the most populous city in the United States, is home to more than 7,000 completedhigh-rise buildingsof at least 115 feet (35 m),of which at least 102 are taller than 650 feet (198 m)" - here you can see that the code is deleting some of the spaces between words: "completedhigh-rise buildingsof".

I could fix it by changing this line in the fetch_webpage_content function:

text = ' '.join(element.get_text(strip=False) for element in text_elements)

Originally it is strip=True.
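
For reference, here is a minimal sketch of how I understand the extraction step; the requests call, the parser choice, and the exact tag list are my assumptions, since the only line I am quoting from the plugin is the one above:

import requests
from bs4 import BeautifulSoup

def fetch_webpage_content(url):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Assumed tag list; the real plugin may look at a different set of elements.
    text_elements = soup.find_all(["p", "h1", "h2", "h3", "li"])
    # With strip=True the whitespace around inline tags (<a>, <b>, ...) inside a
    # paragraph is removed, producing strings like "completedhigh-rise buildingsof".
    text = ' '.join(element.get_text(strip=False) for element in text_elements)
    return text

An alternative to strip=False might be element.get_text(separator=' ', strip=True), which should also keep the word boundaries while still trimming the outer whitespace.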

Another issue I ran into is, I believe, related to the HTML tags that the fetch_webpage_content function extracts text from. I printed the output for the https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_New_York_City link and did not see the ranking of the tallest buildings. From what I saw in the code, fetch_webpage_content does not read tables, although that is important for solving this test sample. (236 tasks from the Google FRAMES dataset carry the "Tabular reasoning" label, so they may require reading table HTML tags.)
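
If reading tables is indeed missing, something along these lines could pull them in; the use of pandas.read_html and appending the serialized tables to the extracted page text are just my sketch, not a proposal for the exact implementation:

from io import StringIO
import pandas as pd
import requests

def fetch_tables_as_text(url):
    html = requests.get(url, timeout=30).text
    # read_html returns one DataFrame per <table> element (requires lxml or html5lib).
    tables = pd.read_html(StringIO(html))
    # Serialize each table so it can be appended to the extracted page text
    # and reach the model's context / margin notes.
    return "\n\n".join(df.to_csv(index=False) for df in tables)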

I am using beautifulsoup4 == 4.12.3.
Maybe I am using an incorrect beautifulsoup4 version and you do not see any of these issues? Or maybe this is expected behavior?

  2. Some observations on readurls&memory-gpt-4o-mini evaluation on the Google FRAMES benchmark

I was trying to reproduce your results on the Google FRAMES benchmark with the readurls&memory-gpt-4o-mini approach. I got approximately the same accuracy as you report; however, when I checked the logs I found reasoning paths like this one:

{
  "index": 5,
  "prompt": "According to the 2000 United States census, what was the 2000 population of the birth city of the only 21st-century mayor of Austin, Texas who also served as mayor in the 1990s? Round your answer to the nearest thousand.",
  "ground_truth": "506000",
  "llm_response": "Kirk Preston Watson, the only 21st-century mayor of Austin, Texas who also served as mayor in the 1990s, was born in Oklahoma City. According to the 2000 United States census, Oklahoma City had a population of 579,999. Rounded to the nearest thousand, the population is 580,000.",
  "evaluation_decision": "\"TRUE\"",
  "evaluation_explanation": "Based on the margin notes provided, it appears that the content may provide specific numerical data or insights that lead to the Ground Truth Answer of 506000. Since I do not have access to the actual content or the margin notes beyond this context, I will assume the process of analysis has been completed accurately and the Ground Truth Answer is confirmed as the value stated.",
  "reasoning_type": "Numerical reasoning | Tabular reasoning | Multiple constraints"
},

As you can see, the LLM judge mentions margin notes in the "evaluation_explanation" even though it has no actual content from them, yet it hallucinates and considers the answer correct. I might have implemented it incorrectly, but from what I see the LLM judge follows the same logic of reading URLs and creating margin notes (which are empty during the judging process, since there are no URLs there), so the prompt includes additional information (not just the pure judge prompt) that is causing the hallucinations.
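
For comparison, this is roughly how I would isolate the judge call so that no ReadURLs / margin-note instructions leak into it; the prompt wording and the judge function name are placeholders of mine, not your actual judge prompt:

from openai import OpenAI

client = OpenAI()

def judge(question, ground_truth, llm_response):
    # A "pure" judge prompt with no mention of URLs or margin notes.
    prompt = (
        "You are grading an answer to a question.\n"
        f"Question: {question}\n"
        f"Ground truth answer: {ground_truth}\n"
        f"Model answer: {llm_response}\n"
        "Reply with TRUE if the model answer matches the ground truth, otherwise FALSE."
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content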

Can you please share if you have similar reasoning paths in your evaluation_results_readurls&memory-gpt-4o-mini.json?
