Reading External Data

Last updated: November 25th, 2019

Pandas can easily read data stored in different file formats like CSV, JSON, XML or even Excel. Parsing always involves specifying the correct structure, encoding and other details.

We'll read different types of data, including:

  • CSV files (.csv)
  • Raw text files (.txt)
  • JSON data from a file and from an API
  • Data from a SQL query (read from database)
  • Data returned from APIs


Hands on!

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

 Read CSV files

The read_csv method reads CSV files and accepts many parameters.

In [11]:
df = pd.read_csv('data/btc-market-price.csv')
In [12]:
df.head()
Out[12]:
2017-04-02 00:00:00 1099.169125
0 2017-04-03 00:00:00 1141.813000
1 2017-04-04 00:00:00 1141.600363
2 2017-04-05 00:00:00 1133.079314
3 2017-04-06 00:00:00 1196.307937
4 2017-04-07 00:00:00 1190.454250

The CSV file we're reading has only two columns: timestamp and price. It doesn't have a header row, contains some whitespace, and has values separated by commas. Pandas automatically used the first row of data as the header, which is incorrect. We can override this behavior with the header parameter.

We can also define a na_values parameter with the values we want to be recognized as NA/NaN. In this case, empty strings ('') will be recognized as null values.

In [13]:
df = pd.read_csv('data/btc-market-price.csv',
                 header=None,
                 na_values=[''])
In [14]:
df.head()
Out[14]:
0 1
0 2017-04-02 00:00:00 1099.169125
1 2017-04-03 00:00:00 1141.813000
2 2017-04-04 00:00:00 1141.600363
3 2017-04-05 00:00:00 1133.079314
4 2017-04-06 00:00:00 1196.307937

Custom data separators

The default separator is , (comma), but any separator can be defined using the sep parameter; a couple of hypothetical examples follow the list below.

  • sep=';': will be used if our data is separated by ; (semicolon).
  • sep='\t': will be used if our data has tab-separated values (TSV).
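A minimal sketch of both cases (the file names here are hypothetical, not part of this project's data):

# ';'-separated values (hypothetical file name)
df_semicolons = pd.read_csv('some-semicolon-file.csv', sep=';')

# tab-separated values, i.e. a TSV file (hypothetical file name)
df_tabs = pd.read_csv('some-tab-file.tsv', sep='\t')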

 Custom data encoding

Custom encoding can be defined while reading data with the encoding parameter, as sketched below.

  • encoding='UTF-8': will be used if our data is UTF-8 encoded.
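A minimal sketch (again with a hypothetical file name):

# e.g. a Latin-1 encoded file (hypothetical file name)
df_latin = pd.read_csv('some-latin1-file.csv', encoding='ISO-8859-1')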

 Column names

We can then set the names of each column explicitly by setting the df.columns attribute:

In [15]:
df.columns = ['Timestamp', 'Price']
In [16]:
df.head()
Out[16]:
Timestamp Price
0 2017-04-02 00:00:00 1099.169125
1 2017-04-03 00:00:00 1141.813000
2 2017-04-04 00:00:00 1141.600363
3 2017-04-05 00:00:00 1133.079314
4 2017-04-06 00:00:00 1196.307937

The type of the Price column was correctly interpreted as float, but the Timestamp was interpreted as a regular string (object in Pandas notation):

In [17]:
df.dtypes
Out[17]:
Timestamp     object
Price        float64
dtype: object

We can perform a vectorized operation to parse all the Timestamp values as Datetime objects:

In [18]:
pd.to_datetime(df['Timestamp']).head()
Out[18]:
0   2017-04-02
1   2017-04-03
2   2017-04-04
3   2017-04-05
4   2017-04-06
Name: Timestamp, dtype: datetime64[ns]
In [19]:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
In [20]:
df.head()
Out[20]:
Timestamp Price
0 2017-04-02 1099.169125
1 2017-04-03 1141.813000
2 2017-04-04 1141.600363
3 2017-04-05 1133.079314
4 2017-04-06 1196.307937
In [21]:
df.dtypes
Out[21]:
Timestamp    datetime64[ns]
Price               float64
dtype: object

The timestamp looks a lot like the natural index of this DataFrame: date → price. We can replace the auto-incremented index generated by pandas and use the Timestamp column as the index:

In [22]:
df.set_index('Timestamp', inplace=True)
In [23]:
df.head()
Out[23]:
Price
Timestamp
2017-04-02 1099.169125
2017-04-03 1141.813000
2017-04-04 1141.600363
2017-04-05 1133.079314
2017-04-06 1196.307937


Putting everything together

And now we've finally arrived at the final, desired version of the DataFrame parsed from our CSV file. The steps were:

In [24]:
df = pd.read_csv('data/btc-market-price.csv', header=None)
df.columns = ['Timestamp', 'Price']
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df.set_index('Timestamp', inplace=True)
In [25]:
df.head()
Out[25]:
Price
Timestamp
2017-04-02 1099.169125
2017-04-03 1141.813000
2017-04-04 1141.600363
2017-04-05 1133.079314
2017-04-06 1196.307937

There should be a better way. And there is 😎, as there usually is with these repetitive tasks in pandas.

The read_csv function is extremely powerful and you can specify many more parameters at import time. We can achieve the same results with only one line by doing:

In [26]:
df = pd.read_csv(
    'data/btc-market-price.csv',
    header=None,
    names=['Timestamp', 'Price'],
    index_col=0,
    parse_dates=True
)
In [27]:
df.head()
Out[27]:
Price
Timestamp
2017-04-02 1099.169125
2017-04-03 1141.813000
2017-04-04 1141.600363
2017-04-05 1133.079314
2017-04-06 1196.307937
In [28]:
?pd.read_csv
Signature:
pd.read_csv(
    filepath_or_buffer,
    sep=',',
    delimiter=None,
    header='infer',
    names=None,
    index_col=None,
    usecols=None,
    squeeze=False,
    prefix=None,
    mangle_dupe_cols=True,
    dtype=None,
    engine=None,
    converters=None,
    true_values=None,
    false_values=None,
    skipinitialspace=False,
    skiprows=None,
    nrows=None,
    na_values=None,
    keep_default_na=True,
    na_filter=True,
    verbose=False,
    skip_blank_lines=True,
    parse_dates=False,
    infer_datetime_format=False,
    keep_date_col=False,
    date_parser=None,
    dayfirst=False,
    iterator=False,
    chunksize=None,
    compression='infer',
    thousands=None,
    decimal=b'.',
    lineterminator=None,
    quotechar='"',
    quoting=0,
    escapechar=None,
    comment=None,
    encoding=None,
    dialect=None,
    tupleize_cols=None,
    error_bad_lines=True,
    warn_bad_lines=True,
    skipfooter=0,
    doublequote=True,
    delim_whitespace=False,
    low_memory=True,
    memory_map=False,
    float_precision=None,
)
Docstring:
Read CSV (comma-separated) file into DataFrame

Also supports optionally iterating or breaking of the file
into chunks.

Additional help can be found in the `online docs for IO Tools
<http://pandas.pydata.org/pandas-docs/stable/io.html>`_.

Parameters
----------
filepath_or_buffer : str, pathlib.Path, py._path.local.LocalPath or any \
object with a read() method (such as a file handle or StringIO)
    The string could be a URL. Valid URL schemes include http, ftp, s3, and
    file. For file URLs, a host is expected. For instance, a local file could
    be file://localhost/path/to/table.csv
sep : str, default ','
    Delimiter to use. If sep is None, the C engine cannot automatically detect
    the separator, but the Python parsing engine can, meaning the latter will
    be used and automatically detect the separator by Python's builtin sniffer
    tool, ``csv.Sniffer``. In addition, separators longer than 1 character and
    different from ``'\s+'`` will be interpreted as regular expressions and
    will also force the use of the Python parsing engine. Note that regex
    delimiters are prone to ignoring quoted data. Regex example: ``'\r\t'``
delimiter : str, default ``None``
    Alternative argument name for sep.
delim_whitespace : boolean, default False
    Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be
    used as the sep. Equivalent to setting ``sep='\s+'``. If this option
    is set to True, nothing should be passed in for the ``delimiter``
    parameter.

    .. versionadded:: 0.18.1 support for the Python parser.

header : int or list of ints, default 'infer'
    Row number(s) to use as the column names, and the start of the
    data.  Default behavior is to infer the column names: if no names
    are passed the behavior is identical to ``header=0`` and column
    names are inferred from the first line of the file, if column
    names are passed explicitly then the behavior is identical to
    ``header=None``. Explicitly pass ``header=0`` to be able to
    replace existing names. The header can be a list of integers that
    specify row locations for a multi-index on the columns
    e.g. [0,1,3]. Intervening rows that are not specified will be
    skipped (e.g. 2 in this example is skipped). Note that this
    parameter ignores commented lines and empty lines if
    ``skip_blank_lines=True``, so header=0 denotes the first line of
    data rather than the first line of the file.
names : array-like, default None
    List of column names to use. If file contains no header row, then you
    should explicitly pass header=None. Duplicates in this list will cause
    a ``UserWarning`` to be issued.
index_col : int or sequence or False, default None
    Column to use as the row labels of the DataFrame. If a sequence is given, a
    MultiIndex is used. If you have a malformed file with delimiters at the end
    of each line, you might consider index_col=False to force pandas to _not_
    use the first column as the index (row names)
usecols : list-like or callable, default None
    Return a subset of the columns. If list-like, all elements must either
    be positional (i.e. integer indices into the document columns) or strings
    that correspond to column names provided either by the user in `names` or
    inferred from the document header row(s). For example, a valid list-like
    `usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element
    order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
    To instantiate a DataFrame from ``data`` with element order preserved use
    ``pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]`` for columns
    in ``['foo', 'bar']`` order or
    ``pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]``
    for ``['bar', 'foo']`` order.

    If callable, the callable function will be evaluated against the column
    names, returning names where the callable function evaluates to True. An
    example of a valid callable argument would be ``lambda x: x.upper() in
    ['AAA', 'BBB', 'DDD']``. Using this parameter results in much faster
    parsing time and lower memory usage.
squeeze : boolean, default False
    If the parsed data only contains one column then return a Series
prefix : str, default None
    Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
mangle_dupe_cols : boolean, default True
    Duplicate columns will be specified as 'X', 'X.1', ...'X.N', rather than
    'X'...'X'. Passing in False will cause data to be overwritten if there
    are duplicate names in the columns.
dtype : Type name or dict of column -> type, default None
    Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}
    Use `str` or `object` together with suitable `na_values` settings
    to preserve and not interpret dtype.
    If converters are specified, they will be applied INSTEAD
    of dtype conversion.
engine : {'c', 'python'}, optional
    Parser engine to use. The C engine is faster while the python engine is
    currently more feature-complete.
converters : dict, default None
    Dict of functions for converting values in certain columns. Keys can either
    be integers or column labels
true_values : list, default None
    Values to consider as True
false_values : list, default None
    Values to consider as False
skipinitialspace : boolean, default False
    Skip spaces after delimiter.
skiprows : list-like or integer or callable, default None
    Line numbers to skip (0-indexed) or number of lines to skip (int)
    at the start of the file.

    If callable, the callable function will be evaluated against the row
    indices, returning True if the row should be skipped and False otherwise.
    An example of a valid callable argument would be ``lambda x: x in [0, 2]``.
skipfooter : int, default 0
    Number of lines at bottom of file to skip (Unsupported with engine='c')
nrows : int, default None
    Number of rows of file to read. Useful for reading pieces of large files
na_values : scalar, str, list-like, or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific
    per-column NA values.  By default the following values are interpreted as
    NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan',
    '1.#IND', '1.#QNAN', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan',
    'null'.
keep_default_na : bool, default True
    Whether or not to include the default NaN values when parsing the data.
    Depending on whether `na_values` is passed in, the behavior is as follows:

    * If `keep_default_na` is True, and `na_values` are specified, `na_values`
      is appended to the default NaN values used for parsing.
    * If `keep_default_na` is True, and `na_values` are not specified, only
      the default NaN values are used for parsing.
    * If `keep_default_na` is False, and `na_values` are specified, only
      the NaN values specified `na_values` are used for parsing.
    * If `keep_default_na` is False, and `na_values` are not specified, no
      strings will be parsed as NaN.

    Note that if `na_filter` is passed in as False, the `keep_default_na` and
    `na_values` parameters will be ignored.
na_filter : boolean, default True
    Detect missing value markers (empty strings and the value of na_values). In
    data without any NAs, passing na_filter=False can improve the performance
    of reading a large file
verbose : boolean, default False
    Indicate number of NA values placed in non-numeric columns
skip_blank_lines : boolean, default True
    If True, skip over blank lines rather than interpreting as NaN values
parse_dates : boolean or list of ints or names or list of lists or dict, default False

    * boolean. If True -> try parsing the index.
    * list of ints or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3
      each as a separate date column.
    * list of lists. e.g.  If [[1, 3]] -> combine columns 1 and 3 and parse as
      a single date column.
    * dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result
      'foo'

    If a column or index contains an unparseable date, the entire column or
    index will be returned unaltered as an object data type. For non-standard
    datetime parsing, use ``pd.to_datetime`` after ``pd.read_csv``

    Note: A fast-path exists for iso8601-formatted dates.
infer_datetime_format : boolean, default False
    If True and `parse_dates` is enabled, pandas will attempt to infer the
    format of the datetime strings in the columns, and if it can be inferred,
    switch to a faster method of parsing them. In some cases this can increase
    the parsing speed by 5-10x.
keep_date_col : boolean, default False
    If True and `parse_dates` specifies combining multiple columns then
    keep the original columns.
date_parser : function, default None
    Function to use for converting a sequence of string columns to an array of
    datetime instances. The default uses ``dateutil.parser.parser`` to do the
    conversion. Pandas will try to call `date_parser` in three different ways,
    advancing to the next if an exception occurs: 1) Pass one or more arrays
    (as defined by `parse_dates`) as arguments; 2) concatenate (row-wise) the
    string values from the columns defined by `parse_dates` into a single array
    and pass that; and 3) call `date_parser` once for each row using one or
    more strings (corresponding to the columns defined by `parse_dates`) as
    arguments.
dayfirst : boolean, default False
    DD/MM format dates, international and European format
iterator : boolean, default False
    Return TextFileReader object for iteration or getting chunks with
    ``get_chunk()``.
chunksize : int, default None
    Return TextFileReader object for iteration.
    See the `IO Tools docs
    <http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking>`_
    for more information on ``iterator`` and ``chunksize``.
compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
    For on-the-fly decompression of on-disk data. If 'infer' and
    `filepath_or_buffer` is path-like, then detect compression from the
    following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no
    decompression). If using 'zip', the ZIP file must contain only one data
    file to be read in. Set to None for no decompression.

    .. versionadded:: 0.18.1 support for 'zip' and 'xz' compression.

thousands : str, default None
    Thousands separator
decimal : str, default '.'
    Character to recognize as decimal point (e.g. use ',' for European data).
float_precision : string, default None
    Specifies which converter the C engine should use for floating-point
    values. The options are `None` for the ordinary converter,
    `high` for the high-precision converter, and `round_trip` for the
    round-trip converter.
lineterminator : str (length 1), default None
    Character to break file into lines. Only valid with C parser.
quotechar : str (length 1), optional
    The character used to denote the start and end of a quoted item. Quoted
    items can include the delimiter and it will be ignored.
quoting : int or csv.QUOTE_* instance, default 0
    Control field quoting behavior per ``csv.QUOTE_*`` constants. Use one of
    QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote : boolean, default ``True``
   When quotechar is specified and quoting is not ``QUOTE_NONE``, indicate
   whether or not to interpret two consecutive quotechar elements INSIDE a
   field as a single ``quotechar`` element.
escapechar : str (length 1), default None
    One-character string used to escape delimiter when quoting is QUOTE_NONE.
comment : str, default None
    Indicates remainder of line should not be parsed. If found at the beginning
    of a line, the line will be ignored altogether. This parameter must be a
    single character. Like empty lines (as long as ``skip_blank_lines=True``),
    fully commented lines are ignored by the parameter `header` but not by
    `skiprows`. For example, if ``comment='#'``, parsing
    ``#empty\na,b,c\n1,2,3`` with ``header=0`` will result in 'a,b,c' being
    treated as the header.
encoding : str, default None
    Encoding to use for UTF when reading/writing (ex. 'utf-8'). `List of Python
    standard encodings
    <https://docs.python.org/3/library/codecs.html#standard-encodings>`_
dialect : str or csv.Dialect instance, default None
    If provided, this parameter will override values (default or not) for the
    following parameters: `delimiter`, `doublequote`, `escapechar`,
    `skipinitialspace`, `quotechar`, and `quoting`. If it is necessary to
    override values, a ParserWarning will be issued. See csv.Dialect
    documentation for more details.
tupleize_cols : boolean, default False
    .. deprecated:: 0.21.0
       This argument will be removed and will always convert to MultiIndex

    Leave a list of tuples on columns as is (default is to convert to
    a MultiIndex on the columns)
error_bad_lines : boolean, default True
    Lines with too many fields (e.g. a csv line with too many commas) will by
    default cause an exception to be raised, and no DataFrame will be returned.
    If False, then these "bad lines" will dropped from the DataFrame that is
    returned.
warn_bad_lines : boolean, default True
    If error_bad_lines is False, and warn_bad_lines is True, a warning for each
    "bad line" will be output.
low_memory : boolean, default True
    Internally process the file in chunks, resulting in lower memory use
    while parsing, but possibly mixed type inference.  To ensure no mixed
    types either set False, or specify the type with the `dtype` parameter.
    Note that the entire file is read into a single DataFrame regardless,
    use the `chunksize` or `iterator` parameter to return the data in chunks.
    (Only valid with C parser)
memory_map : boolean, default False
    If a filepath is provided for `filepath_or_buffer`, map the file object
    directly onto memory and access the data directly from there. Using this
    option can improve performance because there is no longer any I/O overhead.

Returns
-------
result : DataFrame or TextParser
File:      /usr/local/lib/python3.6/site-packages/pandas/io/parsers.py
Type:      function
In [29]:
df.plot(figsize=(14, 7))
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f02ec04e358>


A more challenging parsing

To demonstrate plotting two columns together, we'll try to add Ether prices to our df DataFrame. The ETH price data can be found in the data/eth-price.csv file. The problem is that this CSV file seems to have been created by someone who really hated programmers. Take a look at it and see how ugly it looks. We'll still use pandas to parse it.

In [30]:
eth = pd.read_csv('data/eth-price.csv')

eth.head()
Out[30]:
Date(UTC) UnixTimeStamp Value
0 4/2/2017 1491091200 48.55
1 4/3/2017 1491177600 44.13
2 4/4/2017 1491264000 44.43
3 4/5/2017 1491350400 44.90
4 4/6/2017 1491436800 43.23

As you can see, it has a Value column (which represents the price), a Date(UTC) column containing dates as strings, and a UnixTimeStamp column representing the datetime in Unix timestamp format. The header is read automatically; let's try to parse the dates with the CSV reader:

In [31]:
eth = pd.read_csv('data/eth-price.csv', parse_dates=True)

print(eth.dtypes)
eth.head()
Date(UTC)         object
UnixTimeStamp      int64
Value            float64
dtype: object
Out[31]:
Date(UTC) UnixTimeStamp Value
0 4/2/2017 1491091200 48.55
1 4/3/2017 1491177600 44.13
2 4/4/2017 1491264000 44.43
3 4/5/2017 1491350400 44.90
4 4/6/2017 1491436800 43.23

It seems the parse_dates parameter didn't work. We'll need a little more customization. Let's break the problem down and focus on date parsing first. The simplest option would seem to be the UnixTimeStamp column: pandas has a to_datetime function that converts Unix timestamps to Datetime objects:

In [32]:
pd.to_datetime(eth['UnixTimeStamp']).head()
Out[32]:
0   1970-01-01 00:00:01.491091200
1   1970-01-01 00:00:01.491177600
2   1970-01-01 00:00:01.491264000
3   1970-01-01 00:00:01.491350400
4   1970-01-01 00:00:01.491436800
Name: UnixTimeStamp, dtype: datetime64[ns]

The problem is the unit of the Unix timestamps: pd.to_datetime assumes nanoseconds by default, while these values are expressed in seconds. We also need both DataFrames to share the same index, and our df containing Bitcoin prices is indexed per day:

In [33]:
df.head()
Out[33]:
Price
Timestamp
2017-04-02 1099.169125
2017-04-03 1141.813000
2017-04-04 1141.600363
2017-04-05 1133.079314
2017-04-06 1196.307937
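One way to fix the precision issue is to tell pd.to_datetime which unit the timestamps use; a quick sketch:

# These Unix timestamps are expressed in seconds, so pass unit='s'
pd.to_datetime(eth['UnixTimeStamp'], unit='s').head()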

So we could either use unit='s' as sketched above, or attempt to parse the Date(UTC) column. Let's do string parsing of Date(UTC) for fun:

In [34]:
pd.to_datetime(eth['Date(UTC)']).head()
Out[34]:
0   2017-04-02
1   2017-04-03
2   2017-04-04
3   2017-04-05
4   2017-04-06
Name: Date(UTC), dtype: datetime64[ns]

That seems to work fine! Why isn't read_csv parsing the Date(UTC) column, then? Simple: the parse_dates=True parameter instructs pandas to parse the index of the DataFrame. If you want to parse any other column, you must explicitly pass its position or name:

In [35]:
pd.read_csv('data/eth-price.csv', parse_dates=[0]).head()
Out[35]:
Date(UTC) UnixTimeStamp Value
0 2017-04-02 1491091200 48.55
1 2017-04-03 1491177600 44.13
2 2017-04-04 1491264000 44.43
3 2017-04-05 1491350400 44.90
4 2017-04-06 1491436800 43.23
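Since parse_dates also accepts column names, passing the name instead of the position works just as well:

pd.read_csv('data/eth-price.csv', parse_dates=['Date(UTC)']).head()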

Putting everything together again:

In [36]:
eth = pd.read_csv('data/eth-price.csv', parse_dates=True, index_col=0)
print(eth.info())

eth.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 362 entries, 2017-04-02 to 2018-04-01
Data columns (total 2 columns):
UnixTimeStamp    362 non-null int64
Value            362 non-null float64
dtypes: float64(1), int64(1)
memory usage: 8.5 KB
None
Out[36]:
UnixTimeStamp Value
Date(UTC)
2017-04-02 1491091200 48.55
2017-04-03 1491177600 44.13
2017-04-04 1491264000 44.43
2017-04-05 1491350400 44.90
2017-04-06 1491436800 43.23
In [37]:
eth['Value'].plot(figsize=(14, 7))
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f02e6eb6668>

We can now combine both DataFrames into one. Both have the same index, so aligning the prices will be easy. Let's first create an empty DataFrame with the index from the Bitcoin prices:

In [38]:
prices = pd.DataFrame(index=df.index)
In [39]:
prices.head()
Out[39]:
Timestamp
2017-04-02
2017-04-03
2017-04-04
2017-04-05
2017-04-06

And we can now just set columns from the other DataFrames:

In [40]:
prices['Bitcoin'] = df['Price']
In [41]:
prices['Ether'] = eth['Value']
In [42]:
prices.head()
Out[42]:
Bitcoin Ether
Timestamp
2017-04-02 1099.169125 48.55
2017-04-03 1141.813000 44.13
2017-04-04 1141.600363 44.43
2017-04-05 1133.079314 44.90
2017-04-06 1196.307937 43.23
In [43]:
prices.plot(figsize=(14, 7))
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f02e6e282e8>


 Read raw text files

The read_csv method also reads raw text files:

In [44]:
cars = pd.read_csv('data/mtcars.txt')

cars.head()
Out[44]:
mpg cyl disp hp drat wt qsec vs am gear carb
0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
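If a text file used whitespace instead of commas as separator, we could ask read_csv to split on whitespace instead; a hypothetical sketch (the file name is a placeholder):

# Split columns on any run of whitespace (equivalent to sep='\s+')
cars_ws = pd.read_csv('some-whitespace-file.txt', delim_whitespace=True)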


 Read JSON files

The read_json method reads JSON files and accepts many parameters.

In [45]:
?pd.read_json
Signature:
pd.read_json(
    path_or_buf=None,
    orient=None,
    typ='frame',
    dtype=True,
    convert_axes=True,
    convert_dates=True,
    keep_default_dates=True,
    numpy=False,
    precise_float=False,
    date_unit=None,
    encoding=None,
    lines=False,
    chunksize=None,
    compression='infer',
)
Docstring:
Convert a JSON string to pandas object

Parameters
----------
path_or_buf : a valid JSON string or file-like, default: None
    The string could be a URL. Valid URL schemes include http, ftp, s3, and
    file. For file URLs, a host is expected. For instance, a local file
    could be ``file://localhost/path/to/table.json``

orient : string,
    Indication of expected JSON string format.
    Compatible JSON strings can be produced by ``to_json()`` with a
    corresponding orient value.
    The set of possible orients is:

    - ``'split'`` : dict like
      ``{index -> [index], columns -> [columns], data -> [values]}``
    - ``'records'`` : list like
      ``[{column -> value}, ... , {column -> value}]``
    - ``'index'`` : dict like ``{index -> {column -> value}}``
    - ``'columns'`` : dict like ``{column -> {index -> value}}``
    - ``'values'`` : just the values array

    The allowed and default values depend on the value
    of the `typ` parameter.

    * when ``typ == 'series'``,

      - allowed orients are ``{'split','records','index'}``
      - default is ``'index'``
      - The Series index must be unique for orient ``'index'``.

    * when ``typ == 'frame'``,

      - allowed orients are ``{'split','records','index',
        'columns','values', 'table'}``
      - default is ``'columns'``
      - The DataFrame index must be unique for orients ``'index'`` and
        ``'columns'``.
      - The DataFrame columns must be unique for orients ``'index'``,
        ``'columns'``, and ``'records'``.

    .. versionadded:: 0.23.0
       'table' as an allowed value for the ``orient`` argument

typ : type of object to recover (series or frame), default 'frame'
dtype : boolean or dict, default True
    If True, infer dtypes, if a dict of column to dtype, then use those,
    if False, then don't infer dtypes at all, applies only to the data.
convert_axes : boolean, default True
    Try to convert the axes to the proper dtypes.
convert_dates : boolean, default True
    List of columns to parse for dates; If True, then try to parse
    datelike columns default is True; a column label is datelike if

    * it ends with ``'_at'``,

    * it ends with ``'_time'``,

    * it begins with ``'timestamp'``,

    * it is ``'modified'``, or

    * it is ``'date'``

keep_default_dates : boolean, default True
    If parsing dates, then parse the default datelike columns
numpy : boolean, default False
    Direct decoding to numpy arrays. Supports numeric data only, but
    non-numeric column and index labels are supported. Note also that the
    JSON ordering MUST be the same for each term if numpy=True.
precise_float : boolean, default False
    Set to enable usage of higher precision (strtod) function when
    decoding string to double values. Default (False) is to use fast but
    less precise builtin functionality
date_unit : string, default None
    The timestamp unit to detect if converting dates. The default behaviour
    is to try and detect the correct precision, but if this is not desired
    then pass one of 's', 'ms', 'us' or 'ns' to force parsing only seconds,
    milliseconds, microseconds or nanoseconds respectively.
lines : boolean, default False
    Read the file as a json object per line.

    .. versionadded:: 0.19.0

encoding : str, default is 'utf-8'
    The encoding to use to decode py3 bytes.

    .. versionadded:: 0.19.0

chunksize: integer, default None
    Return JsonReader object for iteration.
    See the `line-delimted json docs
    <http://pandas.pydata.org/pandas-docs/stable/io.html#io-jsonl>`_
    for more information on ``chunksize``.
    This can only be passed if `lines=True`.
    If this is None, the file will be read into memory all at once.

    .. versionadded:: 0.21.0

compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
    For on-the-fly decompression of on-disk data. If 'infer', then use
    gzip, bz2, zip or xz if path_or_buf is a string ending in
    '.gz', '.bz2', '.zip', or 'xz', respectively, and no decompression
    otherwise. If using 'zip', the ZIP file must contain only one data
    file to be read in. Set to None for no decompression.

    .. versionadded:: 0.21.0

Returns
-------
result : Series or DataFrame, depending on the value of `typ`.

Notes
-----
Specific to ``orient='table'``, if a :class:`DataFrame` with a literal
:class:`Index` name of `index` gets written with :func:`to_json`, the
subsequent read operation will incorrectly set the :class:`Index` name to
``None``. This is because `index` is also used by :func:`DataFrame.to_json`
to denote a missing :class:`Index` name, and the subsequent
:func:`read_json` operation cannot distinguish between the two. The same
limitation is encountered with a :class:`MultiIndex` and any names
beginning with ``'level_'``.

See Also
--------
DataFrame.to_json

Examples
--------

>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                   index=['row 1', 'row 2'],
...                   columns=['col 1', 'col 2'])

Encoding/decoding a Dataframe using ``'split'`` formatted JSON:

>>> df.to_json(orient='split')
'{"columns":["col 1","col 2"],
  "index":["row 1","row 2"],
  "data":[["a","b"],["c","d"]]}'
>>> pd.read_json(_, orient='split')
      col 1 col 2
row 1     a     b
row 2     c     d

Encoding/decoding a Dataframe using ``'index'`` formatted JSON:

>>> df.to_json(orient='index')
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
>>> pd.read_json(_, orient='index')
      col 1 col 2
row 1     a     b
row 2     c     d

Encoding/decoding a Dataframe using ``'records'`` formatted JSON.
Note that index labels are not preserved with this encoding.

>>> df.to_json(orient='records')
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
>>> pd.read_json(_, orient='records')
  col 1 col 2
0     a     b
1     c     d

Encoding with Table Schema

>>> df.to_json(orient='table')
'{"schema": {"fields": [{"name": "index", "type": "string"},
                        {"name": "col 1", "type": "string"},
                        {"name": "col 2", "type": "string"}],
                "primaryKey": "index",
                "pandas_version": "0.20.0"},
    "data": [{"index": "row 1", "col 1": "a", "col 2": "b"},
            {"index": "row 2", "col 1": "c", "col 2": "d"}]}'
File:      /usr/local/lib/python3.6/site-packages/pandas/io/json/json.py
Type:      function
In [46]:
games = pd.read_json('data/games.json')

games.head()
Out[46]:
content_rating_img genre image price provider release_date title
0 https://cdn-a.sonyentertainmentnetwork.com/grc... Shooter https://store.playstation.com/store/api/chihir... 129,99 € Activision Blizzard Int'l BV Lanzado 21 Dic 2018 Call of Duty®: WWII + Destiny 2 - Lote
1 https://cdn-a.sonyentertainmentnetwork.com/grc... Acción https://store.playstation.com/store/api/chihir... 69,99 € Sony Interactive Entertainment Europe Lanzado 20 Abr 2018 God of War® Digital Deluxe Edition
2 https://cdn-a.sonyentertainmentnetwork.com/grc... Aventura https://store.playstation.com/store/api/chihir... 69,99 € UBISOFT ENTERTAINMENT SA Lanzado 27 Mar 2018 Far Cry 5
3 https://cdn-a.sonyentertainmentnetwork.com/grc... Aventura https://store.playstation.com/store/api/chihir... 79,99 € UBISOFT ENTERTAINMENT SA Lanzado 27 Mar 2018 Far Cry 5 Edición Deluxe
4 https://cdn-a.sonyentertainmentnetwork.com/grc... Aventura https://store.playstation.com/store/api/chihir... 89,99 € UBISOFT ENTERTAINMENT SA Lanzado 27 Mar 2018 Far Cry 5 Edición Oro


 An advanced JSON example

JSON documents tend not to be as straightforward to read as tables, especially when they have nested structures.

Dataset source

In [47]:
df = pd.read_json('data/emissions.json')
df.head()
Out[47]:
value
0 {'ID': 0, 'Bronnen': 'T001176 ', 'Perioden': ...
1 {'ID': 1, 'Bronnen': 'T001176 ', 'Perioden': ...
10 {'ID': 10, 'Bronnen': 'T001176 ', 'Perioden':...
100 {'ID': 100, 'Bronnen': '346700 ', 'Perioden'...
101 {'ID': 101, 'Bronnen': '346700 ', 'Perioden'...
In [48]:
import json
In [49]:
with open('data/emissions.json') as fp:
    doc = json.load(fp)
In [50]:
df = pd.DataFrame.from_dict(doc['value'], orient='index')

df.head()
Out[50]:
ID Bronnen Perioden CO2_1 CH4_2 N2O_3
0 0 T001176 1990JJ00 163120 1278.17 59.49
1 1 T001176 1995JJ00 173520 1192.41 59.84
10 10 T001176 2008JJ00 175770 795.75 29.36
100 100 346700 1990JJ00 39810 18.37 0.38
101 101 346700 1995JJ00 47890 18.57 0.41


 Read JSON data from an API

We'll read JSON data from SWAPI (the Star Wars API) and put it in a pandas DataFrame:

In [51]:
import json 
import requests

request = requests.get('https://swapi.co/api/people')
df = pd.read_json(request.text)

df
Out[51]:
count next previous results
0 87 https://swapi.co/api/people/?page=2 NaN {'name': 'Luke Skywalker', 'height': '172', 'm...
1 87 https://swapi.co/api/people/?page=2 NaN {'name': 'C-3PO', 'height': '167', 'mass': '75...
2 87 https://swapi.co/api/people/?page=2 NaN {'name': 'R2-D2', 'height': '96', 'mass': '32'...
3 87 https://swapi.co/api/people/?page=2 NaN {'name': 'Darth Vader', 'height': '202', 'mass...
4 87 https://swapi.co/api/people/?page=2 NaN {'name': 'Leia Organa', 'height': '150', 'mass...
5 87 https://swapi.co/api/people/?page=2 NaN {'name': 'Owen Lars', 'height': '178', 'mass':...
6 87 https://swapi.co/api/people/?page=2 NaN {'name': 'Beru Whitesun lars', 'height': '165'...
7 87 https://swapi.co/api/people/?page=2 NaN {'name': 'R5-D4', 'height': '97', 'mass': '32'...
8 87 https://swapi.co/api/people/?page=2 NaN {'name': 'Biggs Darklighter', 'height': '183',...
9 87 https://swapi.co/api/people/?page=2 NaN {'name': 'Obi-Wan Kenobi', 'height': '182', 'm...
In [52]:
pd.DataFrame.from_dict(request.json()['results'])
Out[52]:
birth_year created edited eye_color films gender hair_color height homeworld mass name skin_color species starships url vehicles
0 19BBY 2014-12-09T13:50:51.644000Z 2014-12-20T21:17:56.891000Z blue [https://swapi.co/api/films/2/, https://swapi.... male blond 172 https://swapi.co/api/planets/1/ 77 Luke Skywalker fair [https://swapi.co/api/species/1/] [https://swapi.co/api/starships/12/, https://s... https://swapi.co/api/people/1/ [https://swapi.co/api/vehicles/14/, https://sw...
1 112BBY 2014-12-10T15:10:51.357000Z 2014-12-20T21:17:50.309000Z yellow [https://swapi.co/api/films/2/, https://swapi.... n/a n/a 167 https://swapi.co/api/planets/1/ 75 C-3PO gold [https://swapi.co/api/species/2/] [] https://swapi.co/api/people/2/ []
2 33BBY 2014-12-10T15:11:50.376000Z 2014-12-20T21:17:50.311000Z red [https://swapi.co/api/films/2/, https://swapi.... n/a n/a 96 https://swapi.co/api/planets/8/ 32 R2-D2 white, blue [https://swapi.co/api/species/2/] [] https://swapi.co/api/people/3/ []
3 41.9BBY 2014-12-10T15:18:20.704000Z 2014-12-20T21:17:50.313000Z yellow [https://swapi.co/api/films/2/, https://swapi.... male none 202 https://swapi.co/api/planets/1/ 136 Darth Vader white [https://swapi.co/api/species/1/] [https://swapi.co/api/starships/13/] https://swapi.co/api/people/4/ []
4 19BBY 2014-12-10T15:20:09.791000Z 2014-12-20T21:17:50.315000Z brown [https://swapi.co/api/films/2/, https://swapi.... female brown 150 https://swapi.co/api/planets/2/ 49 Leia Organa light [https://swapi.co/api/species/1/] [] https://swapi.co/api/people/5/ [https://swapi.co/api/vehicles/30/]
5 52BBY 2014-12-10T15:52:14.024000Z 2014-12-20T21:17:50.317000Z blue [https://swapi.co/api/films/5/, https://swapi.... male brown, grey 178 https://swapi.co/api/planets/1/ 120 Owen Lars light [https://swapi.co/api/species/1/] [] https://swapi.co/api/people/6/ []
6 47BBY 2014-12-10T15:53:41.121000Z 2014-12-20T21:17:50.319000Z blue [https://swapi.co/api/films/5/, https://swapi.... female brown 165 https://swapi.co/api/planets/1/ 75 Beru Whitesun lars light [https://swapi.co/api/species/1/] [] https://swapi.co/api/people/7/ []
7 unknown 2014-12-10T15:57:50.959000Z 2014-12-20T21:17:50.321000Z red [https://swapi.co/api/films/1/] n/a n/a 97 https://swapi.co/api/planets/1/ 32 R5-D4 white, red [https://swapi.co/api/species/2/] [] https://swapi.co/api/people/8/ []
8 24BBY 2014-12-10T15:59:50.509000Z 2014-12-20T21:17:50.323000Z brown [https://swapi.co/api/films/1/] male black 183 https://swapi.co/api/planets/1/ 84 Biggs Darklighter light [https://swapi.co/api/species/1/] [https://swapi.co/api/starships/12/] https://swapi.co/api/people/9/ []
9 57BBY 2014-12-10T16:16:29.192000Z 2014-12-20T21:17:50.325000Z blue-gray [https://swapi.co/api/films/2/, https://swapi.... male auburn, white 182 https://swapi.co/api/planets/20/ 77 Obi-Wan Kenobi fair [https://swapi.co/api/species/1/] [https://swapi.co/api/starships/48/, https://s... https://swapi.co/api/people/10/ [https://swapi.co/api/vehicles/38/]


 Read data from SQL database

Reading data from SQL stores is fairly simple. Pandas supports a variety of methods. We'll start with an example using SQLite, since its driver (the sqlite3 module) is built into Python and we don't need anything extra installed.

In order to work with a SQLite database from Python, we first have to connect to it. We can do that using the connect function, which returns a Connection object.

We'll use this example database.

In [53]:
import sqlite3

conn = sqlite3.connect('data/chinook.db')

Once we have a Connection object, we can then create a Cursor object. Cursors allow us to execute SQL queries against a database:

In [54]:
cur = conn.cursor()

The Cursor we created has an execute method, which receives a SQL statement to run against the database.

The code below will fetch the first 5 rows from the employees table:

In [55]:
cur.execute('SELECT * FROM employees LIMIT 5;')
Out[55]:
<sqlite3.Cursor at 0x7f02e493aab0>

You may have noticed that we didn't assign the result of the above query to a variable. This is because we need to run another command to actually fetch the results. We can use the fetchall method to fetch all of the results of a query:

In [56]:
results = cur.fetchall()

results
Out[56]:
[(1,
  'Adams',
  'Andrew',
  'General Manager',
  None,
  '1962-02-18 00:00:00',
  '2002-08-14 00:00:00',
  '11120 Jasper Ave NW',
  'Edmonton',
  'AB',
  'Canada',
  'T5K 2N1',
  '+1 (780) 428-9482',
  '+1 (780) 428-3457',
  'andrew@chinookcorp.com'),
 (2,
  'Edwards',
  'Nancy',
  'Sales Manager',
  1,
  '1958-12-08 00:00:00',
  '2002-05-01 00:00:00',
  '825 8 Ave SW',
  'Calgary',
  'AB',
  'Canada',
  'T2P 2T3',
  '+1 (403) 262-3443',
  '+1 (403) 262-3322',
  'nancy@chinookcorp.com'),
 (3,
  'Peacock',
  'Jane',
  'Sales Support Agent',
  2,
  '1973-08-29 00:00:00',
  '2002-04-01 00:00:00',
  '1111 6 Ave SW',
  'Calgary',
  'AB',
  'Canada',
  'T2P 5M5',
  '+1 (403) 262-3443',
  '+1 (403) 262-6712',
  'jane@chinookcorp.com'),
 (4,
  'Park',
  'Margaret',
  'Sales Support Agent',
  2,
  '1947-09-19 00:00:00',
  '2003-05-03 00:00:00',
  '683 10 Street SW',
  'Calgary',
  'AB',
  'Canada',
  'T2P 5G3',
  '+1 (403) 263-4423',
  '+1 (403) 263-4289',
  'margaret@chinookcorp.com'),
 (5,
  'Johnson',
  'Steve',
  'Sales Support Agent',
  2,
  '1965-03-03 00:00:00',
  '2003-10-17 00:00:00',
  '7727B 41 Ave',
  'Calgary',
  'AB',
  'Canada',
  'T3B 1Y7',
  '1 (780) 836-9987',
  '1 (780) 836-9543',
  'steve@chinookcorp.com')]

As you can see, the results are returned as a list of tuples. Each tuple corresponds to a row in the database that we accessed. Dealing with data this way is painful. We'd need to manually add column headers, and manually parse the data. Luckily, the pandas library has an easier way, which we'll look at in the next section.
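For illustration, here is roughly what that manual route looks like, using the cursor's description attribute to recover the column names (a sketch, assuming the cursor from above is still open):

# Build the DataFrame by hand from the raw tuples
column_names = [description[0] for description in cur.description]
employees = pd.DataFrame(results, columns=column_names)
employees.head()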

Before we move on, it's good practice to close Connection objects and Cursor objects that are open. This prevents the SQLite database from being locked. When a SQLite database is locked, you may be unable to update the database, and may get errors. We can close the Cursor and the Connection like this:

In [57]:
cur.close()
conn.close()
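If you'd rather not close things by hand, a common pattern is to wrap the connection and cursor with contextlib.closing so they are closed automatically when the block ends. A sketch (note that using the connection itself as a context manager only manages transactions; it does not close the connection):

import contextlib

with contextlib.closing(sqlite3.connect('data/chinook.db')) as conn:
    with contextlib.closing(conn.cursor()) as cur:
        cur.execute('SELECT * FROM employees LIMIT 5;')
        results = cur.fetchall()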

Reading results into a pandas DataFrame

We can use the pandas read_sql function to read the results of a SQL query directly into a pandas DataFrame. The code below will execute the same query that we just did, but it will return a DataFrame. It has several advantages over the query we did above:

  • It doesn't require us to create a Cursor object or call fetchall at the end.
  • It automatically reads in the names of the headers from the table.
  • It creates a DataFrame, so we can quickly explore the data.
In [58]:
?pd.read_sql
Signature:
pd.read_sql(
    sql,
    con,
    index_col=None,
    coerce_float=True,
    params=None,
    parse_dates=None,
    columns=None,
    chunksize=None,
)
Docstring:
Read SQL query or database table into a DataFrame.

This function is a convenience wrapper around ``read_sql_table`` and
``read_sql_query`` (for backward compatibility). It will delegate
to the specific function depending on the provided input. A SQL query
will be routed to ``read_sql_query``, while a database table name will
be routed to ``read_sql_table``. Note that the delegated function might
have more specific notes about their functionality not listed here.

Parameters
----------
sql : string or SQLAlchemy Selectable (select or text object)
    SQL query to be executed or a table name.
con : SQLAlchemy connectable (engine/connection) or database string URI
    or DBAPI2 connection (fallback mode)

    Using SQLAlchemy makes it possible to use any DB supported by that
    library. If a DBAPI2 object, only sqlite3 is supported.
index_col : string or list of strings, optional, default: None
    Column(s) to set as index(MultiIndex).
coerce_float : boolean, default True
    Attempts to convert values of non-string, non-numeric objects (like
    decimal.Decimal) to floating point, useful for SQL result sets.
params : list, tuple or dict, optional, default: None
    List of parameters to pass to execute method.  The syntax used
    to pass parameters is database driver dependent. Check your
    database driver documentation for which of the five syntax styles,
    described in PEP 249's paramstyle, is supported.
    Eg. for psycopg2, uses %(name)s so use params={'name' : 'value'}
parse_dates : list or dict, default: None
    - List of column names to parse as dates.
    - Dict of ``{column_name: format string}`` where format string is
      strftime compatible in case of parsing string times, or is one of
      (D, s, ns, ms, us) in case of parsing integer timestamps.
    - Dict of ``{column_name: arg dict}``, where the arg dict corresponds
      to the keyword arguments of :func:`pandas.to_datetime`
      Especially useful with databases without native Datetime support,
      such as SQLite.
columns : list, default: None
    List of column names to select from SQL table (only used when reading
    a table).
chunksize : int, default None
    If specified, return an iterator where `chunksize` is the
    number of rows to include in each chunk.

Returns
-------
DataFrame

See also
--------
read_sql_table : Read SQL database table into a DataFrame.
read_sql_query : Read SQL query into a DataFrame.
File:      /usr/local/lib/python3.6/site-packages/pandas/io/sql.py
Type:      function
In [59]:
conn = sqlite3.connect('data/chinook.db')

df = pd.read_sql('SELECT * FROM employees;', conn, parse_dates=['BirthDate'])

df
Out[59]:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
0 1 Adams Andrew General Manager NaN 1962-02-18 2002-08-14 00:00:00 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
1 2 Edwards Nancy Sales Manager 1.0 1958-12-08 2002-05-01 00:00:00 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
2 3 Peacock Jane Sales Support Agent 2.0 1973-08-29 2002-04-01 00:00:00 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
3 4 Park Margaret Sales Support Agent 2.0 1947-09-19 2003-05-03 00:00:00 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
4 5 Johnson Steve Sales Support Agent 2.0 1965-03-03 2003-10-17 00:00:00 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com
5 6 Mitchell Michael IT Manager 1.0 1973-07-01 2003-10-17 00:00:00 5827 Bowness Road NW Calgary AB Canada T3B 0C5 +1 (403) 246-9887 +1 (403) 246-9899 michael@chinookcorp.com
6 7 King Robert IT Staff 6.0 1970-05-29 2004-01-02 00:00:00 590 Columbia Boulevard West Lethbridge AB Canada T1K 5N8 +1 (403) 456-9986 +1 (403) 456-8485 robert@chinookcorp.com
7 8 Callahan Laura IT Staff 6.0 1968-01-09 2004-03-04 00:00:00 923 7 ST NW Lethbridge AB Canada T1H 1Y8 +1 (403) 467-3351 +1 (403) 467-8772 laura@chinookcorp.com
In [60]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 15 columns):
EmployeeId    8 non-null int64
LastName      8 non-null object
FirstName     8 non-null object
Title         8 non-null object
ReportsTo     7 non-null float64
BirthDate     8 non-null datetime64[ns]
HireDate      8 non-null object
Address       8 non-null object
City          8 non-null object
State         8 non-null object
Country       8 non-null object
PostalCode    8 non-null object
Phone         8 non-null object
Fax           8 non-null object
Email         8 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(12)
memory usage: 1.0+ KB
In [61]:
df
Out[61]:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
0 1 Adams Andrew General Manager NaN 1962-02-18 2002-08-14 00:00:00 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
1 2 Edwards Nancy Sales Manager 1.0 1958-12-08 2002-05-01 00:00:00 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
2 3 Peacock Jane Sales Support Agent 2.0 1973-08-29 2002-04-01 00:00:00 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
3 4 Park Margaret Sales Support Agent 2.0 1947-09-19 2003-05-03 00:00:00 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
4 5 Johnson Steve Sales Support Agent 2.0 1965-03-03 2003-10-17 00:00:00 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com
5 6 Mitchell Michael IT Manager 1.0 1973-07-01 2003-10-17 00:00:00 5827 Bowness Road NW Calgary AB Canada T3B 0C5 +1 (403) 246-9887 +1 (403) 246-9899 michael@chinookcorp.com
6 7 King Robert IT Staff 6.0 1970-05-29 2004-01-02 00:00:00 590 Columbia Boulevard West Lethbridge AB Canada T1K 5N8 +1 (403) 456-9986 +1 (403) 456-8485 robert@chinookcorp.com
7 8 Callahan Laura IT Staff 6.0 1968-01-09 2004-03-04 00:00:00 923 7 ST NW Lethbridge AB Canada T1H 1Y8 +1 (403) 467-3351 +1 (403) 467-8772 laura@chinookcorp.com
In [62]:
df['ReportsTo'].isna().sum()
Out[62]:
1
In [63]:
df['ReportsTo'].mean()
Out[63]:
2.857142857142857
In [64]:
df['ReportsTo'] > 1.75
Out[64]:
0    False
1    False
2     True
3     True
4     True
5    False
6     True
7     True
Name: ReportsTo, dtype: bool
In [65]:
df
Out[65]:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
0 1 Adams Andrew General Manager NaN 1962-02-18 2002-08-14 00:00:00 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
1 2 Edwards Nancy Sales Manager 1.0 1958-12-08 2002-05-01 00:00:00 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
2 3 Peacock Jane Sales Support Agent 2.0 1973-08-29 2002-04-01 00:00:00 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
3 4 Park Margaret Sales Support Agent 2.0 1947-09-19 2003-05-03 00:00:00 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
4 5 Johnson Steve Sales Support Agent 2.0 1965-03-03 2003-10-17 00:00:00 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com
5 6 Mitchell Michael IT Manager 1.0 1973-07-01 2003-10-17 00:00:00 5827 Bowness Road NW Calgary AB Canada T3B 0C5 +1 (403) 246-9887 +1 (403) 246-9899 michael@chinookcorp.com
6 7 King Robert IT Staff 6.0 1970-05-29 2004-01-02 00:00:00 590 Columbia Boulevard West Lethbridge AB Canada T1K 5N8 +1 (403) 456-9986 +1 (403) 456-8485 robert@chinookcorp.com
7 8 Callahan Laura IT Staff 6.0 1968-01-09 2004-03-04 00:00:00 923 7 ST NW Lethbridge AB Canada T1H 1Y8 +1 (403) 467-3351 +1 (403) 467-8772 laura@chinookcorp.com
In [66]:
df.set_index('EmployeeId', inplace=True)
In [67]:
df
Out[67]:
LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
EmployeeId
1 Adams Andrew General Manager NaN 1962-02-18 2002-08-14 00:00:00 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
2 Edwards Nancy Sales Manager 1.0 1958-12-08 2002-05-01 00:00:00 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
3 Peacock Jane Sales Support Agent 2.0 1973-08-29 2002-04-01 00:00:00 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
4 Park Margaret Sales Support Agent 2.0 1947-09-19 2003-05-03 00:00:00 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
5 Johnson Steve Sales Support Agent 2.0 1965-03-03 2003-10-17 00:00:00 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com
6 Mitchell Michael IT Manager 1.0 1973-07-01 2003-10-17 00:00:00 5827 Bowness Road NW Calgary AB Canada T3B 0C5 +1 (403) 246-9887 +1 (403) 246-9899 michael@chinookcorp.com
7 King Robert IT Staff 6.0 1970-05-29 2004-01-02 00:00:00 590 Columbia Boulevard West Lethbridge AB Canada T1K 5N8 +1 (403) 456-9986 +1 (403) 456-8485 robert@chinookcorp.com
8 Callahan Laura IT Staff 6.0 1968-01-09 2004-03-04 00:00:00 923 7 ST NW Lethbridge AB Canada T1H 1Y8 +1 (403) 467-3351 +1 (403) 467-8772 laura@chinookcorp.com
In [68]:
df['City'] = df['City'].astype('category')
In [69]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 1 to 8
Data columns (total 14 columns):
LastName      8 non-null object
FirstName     8 non-null object
Title         8 non-null object
ReportsTo     7 non-null float64
BirthDate     8 non-null datetime64[ns]
HireDate      8 non-null object
Address       8 non-null object
City          8 non-null category
State         8 non-null object
Country       8 non-null object
PostalCode    8 non-null object
Phone         8 non-null object
Fax           8 non-null object
Email         8 non-null object
dtypes: category(1), datetime64[ns](1), float64(1), object(11)
memory usage: 1008.0+ bytes

It turns out that read_sql is just a wrapper around read_sql_query and read_sql_table. Our previous example corresponds to read_sql_query, and as you can see it produces the same result as before:

In [70]:
conn = sqlite3.connect('data/chinook.db')

df = pd.read_sql_query('SELECT * FROM employees LIMIT 5;', conn)

df
Out[70]:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
0 1 Adams Andrew General Manager NaN 1962-02-18 00:00:00 2002-08-14 00:00:00 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
1 2 Edwards Nancy Sales Manager 1.0 1958-12-08 00:00:00 2002-05-01 00:00:00 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
2 3 Peacock Jane Sales Support Agent 2.0 1973-08-29 00:00:00 2002-04-01 00:00:00 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
3 4 Park Margaret Sales Support Agent 2.0 1947-09-19 00:00:00 2003-05-03 00:00:00 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
4 5 Johnson Steve Sales Support Agent 2.0 1965-03-03 00:00:00 2003-10-17 00:00:00 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com

read_sql_table method

read_sql_table is a useful function, but it works only with SQLAlchemy, Python's most popular ORM. This is just a demonstration of its usage:

In [7]:
import pandas as pd
In [8]:
import sqlalchemy as db
In [9]:
import sqlite3
In [22]:
sqlite3.sqlite_version
Out[22]:
'3.16.2'
In [20]:
db.__version__
Out[20]:
'1.3.5'
In [10]:
engine = db.create_engine('sqlite:///data/chinook.db')
connection = engine.connect()
In [11]:
df = pd.read_sql_table('employees', connection)

df.head()
Out[11]:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
0 1 Adams Andrew General Manager NaN 1962-02-18 2002-08-14 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
1 2 Edwards Nancy Sales Manager 1.0 1958-12-08 2002-05-01 825 8 Ave SW Calgary AB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
2 3 Peacock Jane Sales Support Agent 2.0 1973-08-29 2002-04-01 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
3 4 Park Margaret Sales Support Agent 2.0 1947-09-19 2003-05-03 683 10 Street SW Calgary AB Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com
4 5 Johnson Steve Sales Support Agent 2.0 1965-03-03 2003-10-17 7727B 41 Ave Calgary AB Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com
