
Reading HTML Tables

Last updated: November 25th, 2019

rmotr



In this lecture we'll learn how to read and parse HTML tables from websites into a list of DataFrame objects to work with.


Hands on!

In [1]:
!pip install lxml
Requirement already satisfied: lxml in /Users/mati/.virtualenvs/jupyter-ext/lib/python3.6/site-packages (4.4.1)
In [1]:
import pandas as pd


 Parsing raw HTML strings

Another useful pandas method is read_html(). This method will read HTML tables from a given URL, a file-like object, or a raw string containing HTML, and return a list of DataFrame objects.

Let's try to read the following html_string into a DataFrame.

In [2]:
html_string = """
<table>
    <thead>
      <tr>
        <th>Order date</th>
        <th>Region</th> 
        <th>Item</th>
        <th>Units</th>
        <th>Unit cost</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>1/6/2018</td>
        <td>East</td> 
        <td>Pencil</td>
        <td>95</td>
        <td>1.99</td>
      </tr>
      <tr>
        <td>1/23/2018</td>
        <td>Central</td> 
        <td>Binder</td>
        <td>50</td>
        <td>19.99</td>
      </tr>
      <tr>
        <td>2/9/2018</td>
        <td>Central</td> 
        <td>Pencil</td>
        <td>36</td>
        <td>4.99</td>
      </tr>
      <tr>
        <td>3/15/2018</td>
        <td>West</td> 
        <td>Pen</td>
        <td>27</td>
        <td>19.99</td>
      </tr>
    </tbody>
</table>
"""
In [3]:
dfs = pd.read_html(html_string)

In this case, read_html returned a list containing a single DataFrame object:

In [4]:
len(dfs)
Out[4]:
1
In [5]:
df = dfs[0]

df
Out[5]:
Order date Region Item Units Unit cost
0 1/6/2018 East Pencil 95 1.99
1 1/23/2018 Central Binder 50 19.99
2 2/9/2018 Central Pencil 36 4.99
3 3/15/2018 West Pen 27 19.99

The resulting DataFrame looks just like the raw HTML table, but now it's a DataFrame object, so we can apply any pandas operation to it.
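For example, read_html parses the numeric columns into real numbers, so we can derive new columns with plain arithmetic. A minimal sketch (the DataFrame below is built by hand to mirror the parsed table, so it runs even without an HTML parser installed):

```python
import pandas as pd

# Hand-built mirror of the table parsed above
df = pd.DataFrame({
    'Order date': ['1/6/2018', '1/23/2018', '2/9/2018', '3/15/2018'],
    'Region': ['East', 'Central', 'Central', 'West'],
    'Item': ['Pencil', 'Binder', 'Pencil', 'Pen'],
    'Units': [95, 50, 36, 27],
    'Unit cost': [1.99, 19.99, 4.99, 19.99],
})

# Units and Unit cost are numeric, so arithmetic just works
df['Total'] = df['Units'] * df['Unit cost']
print(df[['Item', 'Total']])
```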

In [7]:
df.shape
Out[7]:
(4, 5)
In [8]:
df.loc[df['Region'] == 'Central']
Out[8]:
Order date Region Item Units Unit cost
1 1/23/2018 Central Binder 50 19.99
2 2/9/2018 Central Pencil 36 4.99
In [9]:
df.loc[df['Units'] > 35]
Out[9]:
Order date Region Item Units Unit cost
0 1/6/2018 East Pencil 95 1.99
1 1/23/2018 Central Binder 50 19.99
2 2/9/2018 Central Pencil 36 4.99
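These filters can also be combined with boolean operators. A small sketch on a hand-built copy of the same table, illustrating `&` and the parentheses it requires:

```python
import pandas as pd

# Hand-built mirror of the parsed table
df = pd.DataFrame({
    'Order date': ['1/6/2018', '1/23/2018', '2/9/2018', '3/15/2018'],
    'Region': ['East', 'Central', 'Central', 'West'],
    'Item': ['Pencil', 'Binder', 'Pencil', 'Pen'],
    'Units': [95, 50, 36, 27],
    'Unit cost': [1.99, 19.99, 4.99, 19.99],
})

# Parentheses are required around each condition when combining with & or |
central_bulk = df.loc[(df['Region'] == 'Central') & (df['Units'] > 35)]
print(central_bulk)
```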

 Defining header

Pandas automatically found the header row to use thanks to the <thead> tag.

But in many cases we'll find malformed or incomplete tables that make read_html parse them incorrectly, without the proper headers.

To fix them we can use the header parameter.

In [6]:
html_string = """
<table>
  <tr>
    <td>Order date</td>
    <td>Region</td> 
    <td>Item</td>
    <td>Units</td>
    <td>Unit cost</td>
  </tr>
  <tr>
    <td>1/6/2018</td>
    <td>East</td> 
    <td>Pencil</td>
    <td>95</td>
    <td>1.99</td>
  </tr>
  <tr>
    <td>1/23/2018</td>
    <td>Central</td> 
    <td>Binder</td>
    <td>50</td>
    <td>19.99</td>
  </tr>
  <tr>
    <td>2/9/2018</td>
    <td>Central</td> 
    <td>Pencil</td>
    <td>36</td>
    <td>4.99</td>
  </tr>
  <tr>
    <td>3/15/2018</td>
    <td>West</td> 
    <td>Pen</td>
    <td>27</td>
    <td>19.99</td>
  </tr>
</table>
"""
In [7]:
pd.read_html(html_string)[0]
Out[7]:
0 1 2 3 4
0 Order date Region Item Units Unit cost
1 1/6/2018 East Pencil 95 1.99
2 1/23/2018 Central Binder 50 19.99
3 2/9/2018 Central Pencil 36 4.99
4 3/15/2018 West Pen 27 19.99

In this case, we need to tell read_html which row to use as the header by passing its row number to the header parameter.

In [12]:
pd.read_html(html_string, header=0)[0]
Out[12]:
Order date Region Item Units Unit cost
0 1/6/2018 East Pencil 95 1.99
1 1/23/2018 Central Binder 50 19.99
2 2/9/2018 Central Pencil 36 4.99
3 3/15/2018 West Pen 27 19.99


 Parsing HTML tables from the web

Now that we know how read_html works, let's go one step further and try to parse HTML tables directly from a URL.

To do that, we'll call the read_html method with a URL as its parameter.

Simple example

In [8]:
html_url = "https://www.basketball-reference.com/leagues/NBA_2019_per_game.html"
In [9]:
nba_tables = pd.read_html(html_url)
In [10]:
len(nba_tables)
Out[10]:
1

We'll work with the only table found:

In [16]:
nba = nba_tables[0]
In [17]:
nba.head()
Out[17]:
Rk Player Pos Age Tm G GS MP FG FGA ... FT% ORB DRB TRB AST STL BLK TOV PF PTS
0 1 Álex Abrines SG 25 OKC 31 2 19.0 1.8 5.1 ... .923 0.2 1.4 1.5 0.6 0.5 0.2 0.5 1.7 5.3
1 2 Quincy Acy PF 28 PHO 10 0 12.3 0.4 1.8 ... .700 0.3 2.2 2.5 0.8 0.1 0.4 0.4 2.4 1.7
2 3 Jaylen Adams PG 22 ATL 34 1 12.6 1.1 3.2 ... .778 0.3 1.4 1.8 1.9 0.4 0.1 0.8 1.3 3.2
3 4 Steven Adams C 25 OKC 80 80 33.4 6.0 10.1 ... .500 4.9 4.6 9.5 1.6 1.5 1.0 1.7 2.6 13.9
4 5 Bam Adebayo C 21 MIA 82 28 23.3 3.4 5.9 ... .735 2.0 5.3 7.3 2.2 0.9 0.8 1.5 2.5 8.9

5 rows × 30 columns

Complex example

We can also use the requests module to fetch the HTML code from a URL and then parse it into DataFrame objects.

If we look at the given URL we can see multiple tables about The Simpsons TV show.

We want to keep the table with information about each season.

In [11]:
import requests

html_url = "https://en.wikipedia.org/wiki/The_Simpsons"
In [12]:
r = requests.get(html_url)

wiki_tables = pd.read_html(r.text, header=0)
In [13]:
len(wiki_tables)
Out[13]:
27
In [14]:
simpsons = wiki_tables[1]
In [15]:
simpsons.head()
Out[15]:
Season Season.1 No. ofepisodes Originally aired Originally aired.1 Originally aired.2 Viewership Viewership.1 Viewership.2
0 Season Season No. ofepisodes Season premiere Season finale Time Slot (ET) Avg. viewers(in millions) Most watched episode Most watched episode
1 Season Season No. ofepisodes Season premiere Season finale Time Slot (ET) Avg. viewers(in millions) Viewers(millions) Episode Title
2 1 1989–90 13 December 17, 1989 May 13, 1990 Sunday 8:30 PM 27.8 33.5 "Life on the Fast Lane"
3 2 1990–91 22 October 11, 1990 July 11, 1991 Thursday 8:00 PM 24.4 33.6 "Bart Gets an F"
4 3 1991–92 24 September 19, 1991 August 27, 1992 Thursday 8:00 PM 21.8 25.5 "Colonel Homer"

A quick clean-up of the table: remove the extra header rows and set Season as the index.

In [16]:
simpsons.drop([0, 1], inplace=True)
In [17]:
simpsons.set_index('Season', inplace=True)

Which season has the lowest number of episodes?

In [18]:
simpsons['No. ofepisodes'].unique()
Out[18]:
array(['13', '22', '24', '25', '23', '21', '20', 'TBA'], dtype=object)
In [27]:
simpsons = simpsons.loc[simpsons['No. ofepisodes'] != 'TBA']
In [19]:
min_season = simpsons['No. ofepisodes'].min()

min_season
Out[19]:
'13'
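Note that min() returned the string '13', not the number 13: because the column mixed numbers with 'TBA', every value was parsed as a string, so the comparison above is lexicographic. It happens to give the right answer here, but converting to real numbers first is safer. A minimal sketch using a hypothetical series with the same values:

```python
import pandas as pd

# Same values as the 'No. ofepisodes' column: strings, with a 'TBA' placeholder
episodes = pd.Series(['13', '22', '24', '25', '23', '21', '20', 'TBA'])

# errors='coerce' turns non-numeric entries like 'TBA' into NaN
numeric = pd.to_numeric(episodes, errors='coerce')

# min() now compares numbers, not strings, and NaN is skipped
print(numeric.min())  # 13.0
```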
In [29]:
simpsons.loc[simpsons['No. ofepisodes'] == min_season]
Out[29]:
Season.1 No. ofepisodes Originally aired Originally aired.1 Originally aired.2 Viewership Viewership.1 Viewership.2
Season
1 1989–90 13 December 17, 1989 May 13, 1990 Sunday 8:30 PM 27.8 33.5 "Life on the Fast Lane"


 Save to CSV file

Finally, save the DataFrame to a CSV file, as we saw in previous lectures.

In [30]:
simpsons.head()
Out[30]:
Season.1 No. ofepisodes Originally aired Originally aired.1 Originally aired.2 Viewership Viewership.1 Viewership.2
Season
1 1989–90 13 December 17, 1989 May 13, 1990 Sunday 8:30 PM 27.8 33.5 "Life on the Fast Lane"
2 1990–91 22 October 11, 1990 July 11, 1991 Thursday 8:00 PM 24.4 33.6 "Bart Gets an F"
3 1991–92 24 September 19, 1991 August 27, 1992 Thursday 8:00 PM 21.8 25.5 "Colonel Homer"
4 1992–93 22 September 24, 1992 May 13, 1993 Thursday 8:00 PM 22.4 28.6 "Lisa's First Word"
5 1993–94 22 September 30, 1993 May 19, 1994 Thursday 8:00 PM 18.9 24.0 "Treehouse of Horror IV"
In [31]:
simpsons.to_csv('out.csv')
In [32]:
pd.read_csv('out.csv', index_col='Season').head()
Out[32]:
Season.1 No. ofepisodes Originally aired Originally aired.1 Originally aired.2 Viewership Viewership.1 Viewership.2
Season
1 1989–90 13 December 17, 1989 May 13, 1990 Sunday 8:30 PM 27.8 33.5 "Life on the Fast Lane"
2 1990–91 22 October 11, 1990 July 11, 1991 Thursday 8:00 PM 24.4 33.6 "Bart Gets an F"
3 1991–92 24 September 19, 1991 August 27, 1992 Thursday 8:00 PM 21.8 25.5 "Colonel Homer"
4 1992–93 22 September 24, 1992 May 13, 1993 Thursday 8:00 PM 22.4 28.6 "Lisa's First Word"
5 1993–94 22 September 30, 1993 May 19, 1994 Thursday 8:00 PM 18.9 24.0 "Treehouse of Horror IV"

