
Reading External Data

Last updated: May 14th, 2019

rmotr



Pandas can easily read data stored in different file formats, such as CSV, JSON, XML, or even Excel. Parsing always involves specifying the correct structure, encoding, and other details.

We'll read different types of data, including:

  • CSV files (.csv)
  • Raw text files (.txt)
  • JSON data from a file and from an API
  • Data from a SQL query (read from database)
  • Data returned from APIs


Hands on!

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

Read CSV files

The read_csv method reads CSV files and accepts many parameters.

In [ ]:
df = pd.read_csv('data/btc-market-price.csv')
In [ ]:
df.head()

The CSV file we're reading has only two columns: timestamp and price. It doesn't have a header row, its values are separated by commas, and it may contain stray whitespace. Pandas automatically used the first row of data as the header, which is incorrect here. We can override this behavior with the header parameter.

We can also define a na_values parameter with the values we want to be recognized as NA/NaN. In this case, empty strings '' will be recognized as null values.

In [ ]:
df = pd.read_csv('data/btc-market-price.csv',
                 header=None,
                 na_values=[''])
In [ ]:
df.head()

Custom data separators

The default separator is , (comma), but any separator can be defined using the sep parameter:

  • sep=';': used when our data is separated by ; (semicolon).
  • sep='\t': used when our data has tab-separated values (TSV).
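
For example, a minimal sketch of reading a semicolon-separated file (the path data/prices-semicolon.csv is hypothetical, not part of this project's data):

In [ ]:
# Hypothetical semicolon-separated file; sep tells read_csv which delimiter to split on
df_semicolon = pd.read_csv('data/prices-semicolon.csv', sep=';')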

Custom data encoding

Custom encoding can be defined while reading data with the encoding parameter.

  • encoding='UTF-8': used when our data is UTF-8 encoded.
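
As a sketch, reading a hypothetical Latin-1 encoded file (again, not part of this project's data) would look like this:

In [ ]:
# Hypothetical file; encoding accepts Python's standard codec names
df_latin1 = pd.read_csv('data/exams.csv', encoding='latin-1')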

Column names

We can then set the names of each column explicitly by setting the df.columns attribute:

In [ ]:
df.columns = ['Timestamp', 'Price']
In [ ]:
df.head()

The type of the Price column was correctly interpreted as float, but the Timestamp was interpreted as a regular string (object in Pandas notation):

In [ ]:
df.dtypes

We can perform a vectorized operation to parse all the Timestamp values as Datetime objects:

In [ ]:
pd.to_datetime(df['Timestamp']).head()
In [ ]:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
In [ ]:
df.head()
In [ ]:
df.dtypes

The Timestamp column looks a lot like a natural index for this DataFrame: each date maps to a price. We can replace the autoincremental index generated by pandas and use the Timestamp column as the index:

In [ ]:
df.set_index('Timestamp', inplace=True)
In [ ]:
df.head()


Putting everything together

And now we've finally arrived at the final, desired version of the DataFrame parsed from our CSV file. The steps were:

In [ ]:
df = pd.read_csv('data/btc-market-price.csv', header=None)
df.columns = ['Timestamp', 'Price']
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df.set_index('Timestamp', inplace=True)
In [ ]:
df.head()

There should be a better way. And there is 😎. There usually is, especially with these repetitive pandas tasks.

The read_csv function is extremely powerful and you can specify many more parameters at import time. We can achieve the same results with only one line:

In [ ]:
df = pd.read_csv(
    'data/btc-market-price.csv',
    header=None,
    names=['Timestamp', 'Price'],
    index_col=0,
    parse_dates=True
)
In [ ]:
df.head()
In [ ]:
?pd.read_csv
In [ ]:
df.plot(figsize=(14, 7))


A more challenging parsing

To demonstrate plotting two columns together, we'll add Ether prices to our df DataFrame. The ETH price data can be found in the data/eth-price.csv file. The problem is that this CSV file seems to have been created by someone who really hated programmers. Take a look at it and see how ugly it is. We'll still use pandas to parse it.

In [ ]:
eth = pd.read_csv('data/eth-price.csv')

eth.head()

As you can see, it has a Value column (which represents the price), a Date(UTC) column containing dates as strings, and a UnixTimeStamp column representing the datetime as a Unix timestamp. The header is read automatically; let's try to parse the dates with the CSV reader:

In [ ]:
eth = pd.read_csv('data/eth-price.csv', parse_dates=True)

print(eth.dtypes)
eth.head()

Seems like the parse_dates parameter didn't work; we'll need a little more customization. Let's break the problem down and focus on date parsing first. The simplest option would be to use the UnixTimeStamp column. Pandas has a to_datetime function that can convert Unix timestamps to Datetime objects:

In [ ]:
pd.to_datetime(eth['UnixTimeStamp']).head()
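
The output above doesn't look right: by default, to_datetime interprets plain integers at nanosecond resolution. Assuming these timestamps are expressed in seconds (as Unix timestamps usually are), we can pass the unit parameter to make the conversion explicit:

In [ ]:
# unit='s' tells pandas the integers are seconds since the Unix epoch
pd.to_datetime(eth['UnixTimeStamp'], unit='s').head()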

The problem is the precision of Unix timestamps. To match both datasets we'll need to use the same index, and our df containing Bitcoin prices is indexed "per day":

In [ ]:
df.head()

We could either remove the precision from UnixTimeStamp or attempt to parse Date(UTC). As a sketch, the first option could build on the unit='s' conversion above and truncate each datetime to day precision with dt.floor:
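
In [ ]:
# Sketch: convert from seconds, then truncate each datetime down to midnight
pd.to_datetime(eth['UnixTimeStamp'], unit='s').dt.floor('D').head()

Let's do string parsing of Date(UTC) for fun instead: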

In [ ]:
pd.to_datetime(eth['Date(UTC)']).head()

That seems to work fine! Then why isn't the Date(UTC) column being parsed? Simple: the parse_dates=True parameter instructs pandas to parse only the index of the DataFrame. If you want to parse any other column, you must explicitly pass its position or name:

In [ ]:
pd.read_csv('data/eth-price.csv', parse_dates=[0]).head()

Putting everything together again:

In [ ]:
eth = pd.read_csv('data/eth-price.csv', parse_dates=True, index_col=0)
print(eth.info())

eth.head()
In [ ]:
eth['Value'].plot(figsize=(14, 7))

We can now combine both DataFrames into one. Both have the same index, so aligning the prices will be easy. Let's first create an empty DataFrame with the index from the Bitcoin prices:

In [ ]:
prices = pd.DataFrame(index=df.index)
In [ ]:
prices.head()

And we can now just set columns from the other DataFrames:

In [ ]:
prices['Bitcoin'] = df['Price']
In [ ]:
prices['Ether'] = eth['Value']
In [ ]:
prices.head()
In [ ]:
prices.plot(figsize=(14, 7))


Read raw text files

The read_csv method also reads raw text files:

In [ ]:
cars = pd.read_csv('data/mtcars.txt')

cars.head()


Read JSON files

The read_json method reads JSON files and accepts many parameters.

In [ ]:
?pd.read_json
In [ ]:
games = pd.read_json('data/games.json')

games.head()


An advanced JSON example

JSON documents are not always as straightforward to read as tables, especially when they have nested structures.

Dataset source

In [ ]:
df = pd.read_json('data/emissions.json')
df.head()
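
The nested structure doesn't translate cleanly into a table, so let's load the raw document with the json module and build the DataFrame ourselves: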
In [ ]:
import json
In [ ]:
with open('data/emissions.json') as fp:
    doc = json.load(fp)
In [ ]:
# The records live under the top-level 'value' key;
# orient='index' turns each dictionary key into a row label
df = pd.DataFrame.from_dict(doc['value'], orient='index')

df.head()


Read JSON data from an API

We'll read JSON data from SWAPI (the Star Wars API) and put it in a pandas DataFrame:

In [ ]:
import requests

request = requests.get('https://swapi.co/api/people')
df = pd.read_json(request.text)

df
In [ ]:
pd.DataFrame.from_dict(request.json()['results'])


Read data from SQL database

Reading data from SQL stores is fairly simple. Pandas supports a variety of methods. We'll start with an example using SQLite, since it's a built-in Python package and we don't need to install anything extra.

In order to work with a SQLite database from Python, we first have to connect to it. We can do that using the connect function, which returns a Connection object.

We'll use this example database.

In [ ]:
import sqlite3

conn = sqlite3.connect('data/chinook.db')

Once we have a Connection object, we can then create a Cursor object. Cursors allow us to execute SQL queries against a database:

In [ ]:
cur = conn.cursor()

The cursor we created has an execute method, which receives a SQL query to run against the database.

The code below will fetch the first 5 rows from the employees table:

In [ ]:
cur.execute('SELECT * FROM employees LIMIT 5;')

You may have noticed that we didn't assign the result of the above query to a variable. This is because we need to run another command to actually fetch the results. We can use the fetchall method to fetch all of the results of a query:

In [ ]:
results = cur.fetchall()

results

As you can see, the results are returned as a list of tuples, each corresponding to a row in the table we queried. Dealing with data this way is painful: we'd need to manually add column headers and manually parse the data. As an illustration, the manual route would look something like this sketch (column names come from the cursor's description attribute):
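
In [ ]:
# Sketch of the manual route: cur.description holds one 7-tuple per column,
# whose first item is the column name
columns = [desc[0] for desc in cur.description]
manual_df = pd.DataFrame(results, columns=columns)
manual_df.head()

Luckily, the pandas library has an easier way, which we'll look at in the next section.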

Before we move on, it's good practice to close Connection objects and Cursor objects that are open. This prevents the SQLite database from being locked. When a SQLite database is locked, you may be unable to update the database, and may get errors. We can close the Cursor and the Connection like this:

In [ ]:
cur.close()
conn.close()

Reading results into a pandas DataFrame

We can use the pandas read_sql function to read the results of a SQL query directly into a pandas DataFrame. The code below will execute the same query that we just did, but it will return a DataFrame. It has several advantages over the query we did above:

  • It doesn't require us to create a Cursor object or call fetchall at the end.
  • It automatically reads in the names of the headers from the table.
  • It creates a DataFrame, so we can quickly explore the data.
In [ ]:
?pd.read_sql
In [ ]:
conn = sqlite3.connect('data/chinook.db')

df = pd.read_sql('SELECT * FROM employees;', conn, parse_dates=['BirthDate'])

df
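
Let's explore the resulting DataFrame: check its dtypes, count the missing values in ReportsTo, and then tidy it up by setting EmployeeId as the index and converting City to a memory-friendly categorical type: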
In [ ]:
df.info()
In [ ]:
df
In [ ]:
df['ReportsTo'].isna().sum()
In [ ]:
df['ReportsTo'].mean()
In [ ]:
df['ReportsTo'] > 1.75
In [ ]:
df
In [ ]:
df.set_index('EmployeeId', inplace=True)
In [ ]:
df
In [ ]:
df['City'] = df['City'].astype('category')
In [ ]:
df.info()

It turns out that read_sql is just a wrapper around read_sql_query and read_sql_table. Our previous example corresponds to read_sql_query; you'll observe the same result as before:

In [ ]:
conn = sqlite3.connect('data/chinook.db')

df = pd.read_sql_query('SELECT * FROM employees LIMIT 5;', conn)

df

read_sql_table method

read_sql_table is a useful function, but it works only with SQLAlchemy, Python's most popular ORM. Here's a quick demonstration of its usage:

In [ ]:
import sqlalchemy as db
In [ ]:
engine = db.create_engine('sqlite:///data/chinook.db')
connection = engine.connect()
In [ ]:
df = pd.read_sql_table('employees', connection)

df.head()

