Profile picture

Reading Data From Relational Databases

Last updated: November 25th, 20192019-11-25Project preview

rmotr


Reading data from relational databases

In this lesson you will learn how to read SQL queries and relational database tables into DataFrame objects using pandas. Also, we'll take a look at different techniques to persist that pandas DataFrame objects to database tables.

purple-divider

Hands on!

In [1]:
!pip install sqlalchemy
Requirement already satisfied: sqlalchemy in /Users/mati/.virtualenvs/jupyter-ext/lib/python3.6/site-packages (1.3.11)
In [2]:
import pandas as pd

green-divider

 Read data from SQL database

Reading data from SQL relational databases is fairly simple and pandas support a variety of methods to deal with it.

We'll start with an example using SQLite, as it's a builtin Python package, and we don't need anything extra installed.

In [3]:
import sqlite3

In order to work with a SQLite database from Python, we first have to connect to it. We can do that using the connect function, which returns a Connection object.

We'll use this example database.

In [4]:
conn = sqlite3.connect('chinook.db')

Once we have a Connection object, we can then create a Cursor object. Cursors allow us to execute SQL queries against a database:

In [5]:
cur = conn.cursor()

The Cursor created has a method execute, which will receive SQL parameters to run against the database.

The code below will fetch the first 5 rows from the employees table:

In [6]:
cur.execute('SELECT * FROM employees LIMIT 5;')
Out[6]:
<sqlite3.Cursor at 0x10d106260>

You may have noticed that we didn't assign the result of the above query to a variable. This is because we need to run another command to actually fetch the results.

We can use the fetchall method to fetch all of the results of a query:

In [7]:
results = cur.fetchall()
In [8]:
results
Out[8]:
[(1,
  'Adams',
  'Andrew',
  'General Manager',
  None,
  '1962-02-18 00:00:00',
  '2002-08-14 00:00:00',
  '11120 Jasper Ave NW',
  'Edmonton',
  'AB',
  'Canada',
  'T5K 2N1',
  '+1 (780) 428-9482',
  '+1 (780) 428-3457',
  'andrew@chinookcorp.com'),
 (2,
  'Edwards',
  'Nancy',
  'Sales Manager',
  1,
  '1958-12-08 00:00:00',
  '2002-05-01 00:00:00',
  '825 8 Ave SW',
  'Calgary',
  'AB',
  'Canada',
  'T2P 2T3',
  '+1 (403) 262-3443',
  '+1 (403) 262-3322',
  'nancy@chinookcorp.com'),
 (3,
  'Peacock',
  'Jane',
  'Sales Support Agent',
  2,
  '1973-08-29 00:00:00',
  '2002-04-01 00:00:00',
  '1111 6 Ave SW',
  'Calgary',
  'AB',
  'Canada',
  'T2P 5M5',
  '+1 (403) 262-3443',
  '+1 (403) 262-6712',
  'jane@chinookcorp.com'),
 (4,
  'Park',
  'Margaret',
  'Sales Support Agent',
  2,
  '1947-09-19 00:00:00',
  '2003-05-03 00:00:00',
  '683 10 Street SW',
  'Calgary',
  'AB',
  'Canada',
  'T2P 5G3',
  '+1 (403) 263-4423',
  '+1 (403) 263-4289',
  'margaret@chinookcorp.com'),
 (5,
  'Johnson',
  'Steve',
  'Sales Support Agent',
  2,
  '1965-03-03 00:00:00',
  '2003-10-17 00:00:00',
  '7727B 41 Ave',
  'Calgary',
  'AB',
  'Canada',
  'T3B 1Y7',
  '1 (780) 836-9987',
  '1 (780) 836-9543',
  'steve@chinookcorp.com')]

As you can see, the results are returned as a list of tuples. Each tuple corresponds to a row in the database that we accessed. Dealing with data this way is painful.

We'd need to manually add column headers, and manually parse the data. Luckily, the pandas library has an easier way, which we'll look at in the next section.

In [9]:
df = pd.DataFrame(results)
In [10]:
df.head()
Out[10]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 1 Adams Andrew General Manager NaN 1962-02-18 00:00:00 2002-08-14 00:00:00 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
1 2 Edwards Nancy Sales Manager 1.0 1958-12-08 00:00:00 2002-05-01 00:00:00 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
2 3 Peacock Jane Sales Support Agent 2.0 1973-08-29 00:00:00 2002-04-01 00:00:00 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
3 4 Park Margaret Sales Support Agent 2.0 1947-09-19 00:00:00 2003-05-03 00:00:00 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
4 5 Johnson Steve Sales Support Agent 2.0 1965-03-03 00:00:00 2003-10-17 00:00:00 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com

Before we move on, it's good practice to close Connection objects and Cursor objects that are open. This prevents the SQLite database from being locked. When a SQLite database is locked, you may be unable to update the database, and may get errors. We can close the Cursor and the Connection like this:

In [11]:
cur.close()
conn.close()

green-divider

Using pandas read_sql method

We can use the pandas read_sql function to read the results of a SQL query directly into a pandas DataFrame. The code below will execute the same query that we just did, but it will return a DataFrame. It has several advantages over the query we did above:

  • It doesn't require us to create a Cursor object or call fetchall at the end.
  • It automatically reads in the names of the headers from the table.
  • It creates a DataFrame, so we can quickly explore the data.
In [12]:
conn = sqlite3.connect('chinook.db')
In [13]:
df = pd.read_sql('SELECT * FROM employees;', conn)
In [14]:
df.head()
Out[14]:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
0 1 Adams Andrew General Manager NaN 1962-02-18 00:00:00 2002-08-14 00:00:00 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
1 2 Edwards Nancy Sales Manager 1.0 1958-12-08 00:00:00 2002-05-01 00:00:00 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
2 3 Peacock Jane Sales Support Agent 2.0 1973-08-29 00:00:00 2002-04-01 00:00:00 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
3 4 Park Margaret Sales Support Agent 2.0 1947-09-19 00:00:00 2003-05-03 00:00:00 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
4 5 Johnson Steve Sales Support Agent 2.0 1965-03-03 00:00:00 2003-10-17 00:00:00 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com
In [15]:
df = pd.read_sql('SELECT * FROM employees;', conn,
                 index_col='EmployeeId',
                 parse_dates=['BirthDate', 'HireDate'])
In [16]:
df.head()
Out[16]:
LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
EmployeeId
1 Adams Andrew General Manager NaN 1962-02-18 2002-08-14 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
2 Edwards Nancy Sales Manager 1.0 1958-12-08 2002-05-01 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
3 Peacock Jane Sales Support Agent 2.0 1973-08-29 2002-04-01 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
4 Park Margaret Sales Support Agent 2.0 1947-09-19 2003-05-03 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
5 Johnson Steve Sales Support Agent 2.0 1965-03-03 2003-10-17 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com
In [17]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 1 to 8
Data columns (total 14 columns):
LastName      8 non-null object
FirstName     8 non-null object
Title         8 non-null object
ReportsTo     7 non-null float64
BirthDate     8 non-null datetime64[ns]
HireDate      8 non-null datetime64[ns]
Address       8 non-null object
City          8 non-null object
State         8 non-null object
Country       8 non-null object
PostalCode    8 non-null object
Phone         8 non-null object
Fax           8 non-null object
Email         8 non-null object
dtypes: datetime64[ns](2), float64(1), object(11)
memory usage: 960.0+ bytes
In [18]:
df['ReportsTo'].isna().sum()
Out[18]:
1
In [19]:
df['ReportsTo'].mean()
Out[19]:
2.857142857142857
In [20]:
df['ReportsTo'] > 1.75
Out[20]:
EmployeeId
1    False
2    False
3     True
4     True
5     True
6    False
7     True
8     True
Name: ReportsTo, dtype: bool
In [21]:
df['City'] = df['City'].astype('category')
In [22]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 1 to 8
Data columns (total 14 columns):
LastName      8 non-null object
FirstName     8 non-null object
Title         8 non-null object
ReportsTo     7 non-null float64
BirthDate     8 non-null datetime64[ns]
HireDate      8 non-null datetime64[ns]
Address       8 non-null object
City          8 non-null category
State         8 non-null object
Country       8 non-null object
PostalCode    8 non-null object
Phone         8 non-null object
Fax           8 non-null object
Email         8 non-null object
dtypes: category(1), datetime64[ns](2), float64(1), object(10)
memory usage: 1008.0+ bytes

green-divider

Using pandas read_sql_query method

It turns out that the read_sql method we saw above is just a wrapper around read_sql_query and read_sql_table.

We can get the same result using read_sql_query method:

In [23]:
conn = sqlite3.connect('chinook.db')
In [24]:
df = pd.read_sql_query('SELECT * FROM employees LIMIT 5;', conn)
In [25]:
df.head()
Out[25]:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
0 1 Adams Andrew General Manager NaN 1962-02-18 00:00:00 2002-08-14 00:00:00 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
1 2 Edwards Nancy Sales Manager 1.0 1958-12-08 00:00:00 2002-05-01 00:00:00 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
2 3 Peacock Jane Sales Support Agent 2.0 1973-08-29 00:00:00 2002-04-01 00:00:00 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
3 4 Park Margaret Sales Support Agent 2.0 1947-09-19 00:00:00 2003-05-03 00:00:00 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
4 5 Johnson Steve Sales Support Agent 2.0 1965-03-03 00:00:00 2003-10-17 00:00:00 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com
In [26]:
df = pd.read_sql_query('SELECT * FROM employees;', conn,
                       index_col='EmployeeId',
                       parse_dates=['BirthDate', 'HireDate'])
In [27]:
df.head()
Out[27]:
LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
EmployeeId
1 Adams Andrew General Manager NaN 1962-02-18 2002-08-14 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
2 Edwards Nancy Sales Manager 1.0 1958-12-08 2002-05-01 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
3 Peacock Jane Sales Support Agent 2.0 1973-08-29 2002-04-01 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
4 Park Margaret Sales Support Agent 2.0 1947-09-19 2003-05-03 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
5 Johnson Steve Sales Support Agent 2.0 1965-03-03 2003-10-17 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com

green-divider

Using read_sql_table method

read_sql_table is a useful function, but it works only with SQLAlchemy, a Python SQL Toolkit and Object Relational Mapper.

This is just a demonstration of its usage where we read the whole employees table.

In [28]:
from sqlalchemy import create_engine
In [29]:
engine = create_engine('sqlite:///chinook.db')

connection = engine.connect()
In [30]:
df = pd.read_sql_table('employees', con=connection)
In [31]:
df.head()
Out[31]:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
0 1 Adams Andrew General Manager NaN 1962-02-18 2002-08-14 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
1 2 Edwards Nancy Sales Manager 1.0 1958-12-08 2002-05-01 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
2 3 Peacock Jane Sales Support Agent 2.0 1973-08-29 2002-04-01 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
3 4 Park Margaret Sales Support Agent 2.0 1947-09-19 2003-05-03 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
4 5 Johnson Steve Sales Support Agent 2.0 1965-03-03 2003-10-17 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com
In [32]:
df = pd.read_sql_table('employees', con=connection,
                       index_col='EmployeeId',
                       parse_dates=['BirthDate', 'HireDate'])
In [33]:
df.head()
Out[33]:
LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
EmployeeId
1 Adams Andrew General Manager NaN 1962-02-18 2002-08-14 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
2 Edwards Nancy Sales Manager 1.0 1958-12-08 2002-05-01 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
3 Peacock Jane Sales Support Agent 2.0 1973-08-29 2002-04-01 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
4 Park Margaret Sales Support Agent 2.0 1947-09-19 2003-05-03 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
5 Johnson Steve Sales Support Agent 2.0 1965-03-03 2003-10-17 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com
In [34]:
connection.close()

green-divider

 Create tables from DataFrame objects

Finally we can persist DataFrame objects we've working on in a database using the pandas to_sql method.

Although it is easy to implement, it could be a very slow process.

In [35]:
df.head()
Out[35]:
LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
EmployeeId
1 Adams Andrew General Manager NaN 1962-02-18 2002-08-14 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
2 Edwards Nancy Sales Manager 1.0 1958-12-08 2002-05-01 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
3 Peacock Jane Sales Support Agent 2.0 1973-08-29 2002-04-01 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
4 Park Margaret Sales Support Agent 2.0 1947-09-19 2003-05-03 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
5 Johnson Steve Sales Support Agent 2.0 1965-03-03 2003-10-17 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com
In [36]:
df.to_sql('employees2', conn)
In [37]:
pd.read_sql_query('SELECT * FROM employees2;', conn).head()
Out[37]:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
0 1 Adams Andrew General Manager NaN 1962-02-18 00:00:00 2002-08-14 00:00:00 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
1 2 Edwards Nancy Sales Manager 1.0 1958-12-08 00:00:00 2002-05-01 00:00:00 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
2 3 Peacock Jane Sales Support Agent 2.0 1973-08-29 00:00:00 2002-04-01 00:00:00 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
3 4 Park Margaret Sales Support Agent 2.0 1947-09-19 00:00:00 2003-05-03 00:00:00 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
4 5 Johnson Steve Sales Support Agent 2.0 1965-03-03 00:00:00 2003-10-17 00:00:00 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com
In [38]:
#pd.read_sql_query('DROP TABLE employees2;', conn)

Custom behaviour

The if_exists parameter define how to behave if the table already exists and adds a ton of flexibility, letting you decide wheather to replace current database data, append new data at the end, or simply fail if database already exists.

In [39]:
pd.DataFrame().to_sql('employees2',
                      conn,
                      if_exists='replace')
In [40]:
pd.read_sql_query('SELECT * FROM employees2;', conn).head()
Out[40]:
index
In [41]:
df.to_sql('employees2',
          conn,
          if_exists='replace')
In [42]:
pd.read_sql_query('SELECT * FROM employees2;', conn).head()
Out[42]:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
0 1 Adams Andrew General Manager NaN 1962-02-18 00:00:00 2002-08-14 00:00:00 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
1 2 Edwards Nancy Sales Manager 1.0 1958-12-08 00:00:00 2002-05-01 00:00:00 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
2 3 Peacock Jane Sales Support Agent 2.0 1973-08-29 00:00:00 2002-04-01 00:00:00 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
3 4 Park Margaret Sales Support Agent 2.0 1947-09-19 00:00:00 2003-05-03 00:00:00 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
4 5 Johnson Steve Sales Support Agent 2.0 1965-03-03 00:00:00 2003-10-17 00:00:00 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com
In [43]:
conn.close()

purple-divider

Notebooks AI
Notebooks AI Profile20060