
Some observations and questions on Google FRAMES Benchmark readurls&memory-gpt-4o-mini method evaluation #106

@RGSmirnov

Description

  1. Some observations on ReadURLs plugin - readurls_plugin.py

I ran the fetch_webpage_content function from readurls_plugin.py on one of the links from the test tasks of the Google FRAMES dataset - https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_New_York_City:

a = fetch_webpage_content("https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_New_York_City")
a.split(". ")

The first extracted string is: "List of tallest buildings in New York City New York City, the most populous city in the United States, is home to more than 7,000 completedhigh-rise buildingsof at least 115 feet (35 m),of which at least 102 are taller than 650 feet (198 m)" - here you can see that the code is deleting some of the spaces between words: "completedhigh-rise buildingsof".

I could fix it by changing this line in the fetch_webpage_content function:

text = ' '.join(element.get_text(strip=False) for element in text_elements)

Originally it is strip=True.
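
For reference, here is a minimal sketch of how I understand the extraction step; the requests call, the parser choice, and the exact tag list are my assumptions, since the only line I am quoting from the plugin is the one above:

import requests
from bs4 import BeautifulSoup

def fetch_webpage_content(url):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Assumed tag list; the real plugin may look at a different set of elements.
    text_elements = soup.find_all(["p", "h1", "h2", "h3", "li"])
    # With strip=True the whitespace around inline tags (<a>, <b>, ...) inside a
    # paragraph is removed, producing strings like "completedhigh-rise buildingsof".
    text = ' '.join(element.get_text(strip=False) for element in text_elements)
    return text

An alternative to strip=False might be element.get_text(separator=' ', strip=True), which should also keep the word boundaries while still trimming the outer whitespace.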

Another issue I ran into is, I believe, related to the HTML tags that the fetch_webpage_content function extracts text from. I printed the output for the https://en.wikipedia.org/wiki/List_of_tallest_buildings_in_New_York_City link and did not see the ranking of the tallest buildings. From what I saw in the code, fetch_webpage_content does not read tables, although that is important for solving this test sample. (236 tasks from the Google FRAMES dataset carry the "Tabular reasoning" label, so they may require reading table HTML tags.)
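
If reading tables is indeed missing, something along these lines could pull them in; the use of pandas.read_html and appending the serialized tables to the extracted page text are just my sketch, not a proposal for the exact implementation:

from io import StringIO
import pandas as pd
import requests

def fetch_tables_as_text(url):
    html = requests.get(url, timeout=30).text
    # read_html returns one DataFrame per <table> element (requires lxml or html5lib).
    tables = pd.read_html(StringIO(html))
    # Serialize each table so it can be appended to the extracted page text
    # and reach the model's context / margin notes.
    return "\n\n".join(df.to_csv(index=False) for df in tables)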

I am using beautifulsoup4 == 4.12.3.
Maybe I am using an incorrect beautifulsoup4 version and you do not see any of these issues? Or maybe this is expected behavior?

  2. Some observations on readurls&memory-gpt-4o-mini evaluation on the Google FRAMES benchmark

I was trying to reproduce your results on the Google FRAMES benchmark with the readurls&memory-gpt-4o-mini approach. I got approximately the same accuracy as you report; however, when I checked the logs I found reasoning paths like this one:

{
  "index": 5,
  "prompt": "According to the 2000 United States census, what was the 2000 population of the birth city of the only 21st-century mayor of Austin, Texas who also served as mayor in the 1990s? Round your answer to the nearest thousand.",
  "ground_truth": "506000",
  "llm_response": "Kirk Preston Watson, the only 21st-century mayor of Austin, Texas who also served as mayor in the 1990s, was born in Oklahoma City. According to the 2000 United States census, Oklahoma City had a population of 579,999. Rounded to the nearest thousand, the population is 580,000.",
  "evaluation_decision": "\"TRUE\"",
  "evaluation_explanation": "Based on the margin notes provided, it appears that the content may provide specific numerical data or insights that lead to the Ground Truth Answer of 506000. Since I do not have access to the actual content or the margin notes beyond this context, I will assume the process of analysis has been completed accurately and the Ground Truth Answer is confirmed as the value stated.",
  "reasoning_type": "Numerical reasoning | Tabular reasoning | Multiple constraints"
},

As you can see, the LLM judge mentions margin notes in the "evaluation_explanation" even though it has no actual content from them, yet it hallucinates and considers the answer correct. I might have implemented it incorrectly, but from what I see the LLM judge follows the same logic of reading URLs and creating margin notes (which are empty during the judging process, since there are no URLs there), so the prompt includes additional information (not just the pure judge prompt) that is causing the hallucinations.
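
For comparison, this is roughly how I would isolate the judge call so that no ReadURLs / margin-note instructions leak into it; the prompt wording and the judge function name are placeholders of mine, not your actual judge prompt:

from openai import OpenAI

client = OpenAI()

def judge(question, ground_truth, llm_response):
    # A "pure" judge prompt with no mention of URLs or margin notes.
    prompt = (
        "You are grading an answer to a question.\n"
        f"Question: {question}\n"
        f"Ground truth answer: {ground_truth}\n"
        f"Model answer: {llm_response}\n"
        "Reply with TRUE if the model answer matches the ground truth, otherwise FALSE."
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content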

Can you please share if you have similar reasoning paths in your evaluation_results_readurls&memory-gpt-4o-mini.json?
