- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 19.2k
Closed
Labels
IO HTMLread_html, to_html, Styler.apply, Styler.applymapread_html, to_html, Styler.apply, Styler.applymapOutput-Formatting__repr__ of pandas objects, to_string__repr__ of pandas objects, to_string
Milestone
Description
Code Sample
url = """https://en.wikipedia.org/wiki/List_of_winners_of_the_Boston_Marathon"""
tables = pd.read_html(url, header=0)
print(tables[0].head())
Problem description
The above code ''should' just extract the displayed text in the HTML table; what's in the dataframe should be what's displayed on screen. This isn't what happens. If the HTML contains a hyperlink with a title attribute, this is picked up and added to the dataframe, duplicating the data.
Expected Output
   Year                   Athlete  \
0  1897          John J. McDermott   
1  1898         Ronald J. MacDonald   
2  1899          Lawrence Brignolia   
3  1900         John "Jack" Caffery   
4  1901         John "Jack" Caffery   
                      Country/State     Time        Notes  
0                United States (NY)  2:55:10          NaN  
1                     Canada Canada  2:42:00          NaN  
2                United States (MA)  2:54:38          NaN  
3                            Canada  2:39:44          NaN  
4                            Canada  2:29:23  2nd victory 
Output
Here's the actual output, the duplication is in the Athlete and Country/State columns.
   Year                                  Athlete  
0  1897      McDermott, John J.John J. McDermott   
1  1898  MacDonald, Ronald J.Ronald J. MacDonald   
2  1899    Brignolia, LawrenceLawrence Brignolia   
3  1900         Caffery, JohnJohn "Jack" Caffery   
4  1901         Caffery, JohnJohn "Jack" Caffery   
                      Country/State     Time        Notes  
0  United States United States (NY)  2:55:10          NaN  
1                     Canada Canada  2:42:00          NaN  
2  United States United States (MA)  2:54:38          NaN  
3                     Canada Canada  2:39:44          NaN  
4                     Canada Canada  2:29:23  2nd victory 
Metadata
Metadata
Assignees
Labels
IO HTMLread_html, to_html, Styler.apply, Styler.applymapread_html, to_html, Styler.apply, Styler.applymapOutput-Formatting__repr__ of pandas objects, to_string__repr__ of pandas objects, to_string