
Complementary File Types and IO Tools

Last updated: November 25th, 2019

rmotr

The pandas I/O API has a set of top-level functions that let us work with a wide variety of file types.

In this lesson we'll show some file types pandas can work with beyond the better-known CSV, JSON and XLSX formats.


Hands on!

In [1]:
import pandas as pd


Pickling

When it comes to something like machine learning, a trained model can't be saved to a .txt or .csv file, because it is an object containing complex binary data.

Luckily, programming has a general term for the process of saving binary data to a file so it can be loaded later: serialization. In Python, this is called pickling.

pandas handles pickles in its IO module: every pandas object has a to_pickle method, and the top-level read_pickle function loads pickled objects back.

In [2]:
df = pd.DataFrame([[1,2,3], [4,5,6]],
                  columns=['A','B','C'])

df
Out[2]:
A B C
0 1 2 3
1 4 5 6

The to_pickle method uses Python's pickle module (with its fast C implementation) to save data structures to disk in the pickle format.

In [3]:
df.to_pickle('out.pkl')
In [4]:
!cat out.pkl
(binary output omitted — pickle files are not human-readable; the bytes reference internal pandas classes such as pandas.core.frame.DataFrame and pandas.core.internals.managers.BlockManager)

The read_pickle method can be used to load any pickled pandas object (or any other pickled object) from file:

In [5]:
df = pd.read_pickle('out.pkl')
In [6]:
df
Out[6]:
A B C
0 1 2 3
1 4 5 6
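to_pickle and read_pickle also accept a compression parameter, which is handy for large objects. A minimal sketch of a compressed round trip (the file name out.pkl.gz is just an example; with a .gz suffix the compression is inferred automatically):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})

# Write the pickle with gzip compression; the format is inferred
# from the '.gz' suffix, or can be set explicitly.
df.to_pickle('out.pkl.gz', compression='gzip')

# read_pickle infers the compression the same way.
restored = pd.read_pickle('out.pkl.gz')

print(restored.equals(df))  # True
```

Other supported compression values include 'bz2', 'zip' and 'xz'.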


Clipboard content

A handy way to grab data is to use the read_clipboard method, which takes the contents of the clipboard buffer and parses them with read_csv.

For instance, you can copy the following text to the clipboard (CTRL-C on many operating systems):

  A B C
x 1 4 p
y 2 5 q
z 3 6 r

And then import the data directly into a DataFrame by calling the read_clipboard method.

In [7]:
df = pd.read_clipboard()
In [8]:
df
Out[8]:
A B C
x 1 4 p
y 2 5 q
z 3 6 r

The to_clipboard method can be used to write the contents of a DataFrame to the clipboard.

You can then paste the clipboard contents into other applications (CTRL-V on many operating systems).

In [9]:
df.to_clipboard()
In [10]:
pd.read_clipboard()
Out[10]:
A B C
x 1 4 p
y 2 5 q
z 3 6 r

We can see that we got back the same content we wrote to the clipboard earlier.


SAS files

The top-level function read_sas() can read (but not write) SAS xport (.XPT) and (since v0.18.0) SAS7BDAT (.sas7bdat) format files.

SAS files only contain two value types: ASCII text and floating point values (usually 8 bytes but sometimes truncated). For xport files, there is no automatic type conversion to integers, dates, or categoricals. For SAS7BDAT files, the format codes may allow date variables to be automatically converted to dates. By default the whole file is read and returned as a DataFrame.

We are going to load the airline.sas7bdat file into a pandas DataFrame using the read_sas method.

In [11]:
!cat airline.sas7bdat
 ��0H�D�\t4@44�
4�
4p
4p
p�@@�l�?���?�w��?ף�?���?t�@����?ף�?`���? �$�?T��?x�@���?����?�m4�?�?5�? V�?|�@+�?@�?�K7�?����?�I�?��@`�@@
��?���?���?@5^�?��@��@ज़�?`���?`���? ���?��@@33@�p�?`���?@�l�?���?��@��~@`ff�?�Y�?`��@��v�?��@��@���?�u�?@�z@����?��@��@@�A�?��|�?��@�x�?��@ �@@�$�?`��?`�@�n�?��@�=�@��C�?����?���@��S�?��@�K7@ X9�?@io�?�E�@-��?��@)�@�p�?@���? ��@���?��@��L@�=
�?@6�?�x	@`���?��@ �@�&1�? ���?��
@��S@��@��J@�K7�?����?��K@@���?��@�p="@�I�?���?`��@V@��@�$%@ ���?�H.�?��K@���@��@�=
*@�l��?@io�?@
����<	����(���� ����P$����8$8H\l������Written by SASYEARyearYlevel of outputWwage rateRinterest rateLlabor inputKcapital input���0�������������������������������������0, . ��������kIW--\
In [12]:
df = pd.read_sas('airline.sas7bdat')
In [13]:
df.head()
Out[13]:
YEAR Y W R L K
0 1948.0 1.214 0.243 0.1454 1.415 0.612
1 1949.0 1.354 0.260 0.2181 1.384 0.559
2 1950.0 1.569 0.278 0.3157 1.388 0.573
3 1951.0 1.948 0.297 0.3940 1.550 0.564
4 1952.0 2.265 0.310 0.3559 1.802 0.574

We can also load the same file from a given URL:

In [14]:
sas_url = 'http://www.principlesofeconometrics.com/sas/airline.sas7bdat'
In [15]:
df = pd.read_sas(sas_url)
In [16]:
df.head()
Out[16]:
YEAR Y W R L K
0 1948.0 1.214 0.243 0.1454 1.415 0.612
1 1949.0 1.354 0.260 0.2181 1.384 0.559
2 1950.0 1.569 0.278 0.3157 1.388 0.573
3 1951.0 1.948 0.297 0.3940 1.550 0.564
4 1952.0 2.265 0.310 0.3559 1.802 0.574

Plot a variable:

In [18]:
df.loc[:,'Y'].plot()
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x11319bac8>


Stata files

The top-level function read_stata will read a DTA file and return either a DataFrame or a StataReader that can be used to read the file incrementally.
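Incremental reading is triggered by passing a chunksize (or iterator=True), in which case read_stata returns a StataReader instead of a DataFrame. A minimal, self-contained sketch (example.dta is a throwaway file name; we create it ourselves with to_stata so the snippet doesn't depend on an external file):

```python
import pandas as pd

# Build a small DataFrame and save it as a Stata .dta file.
df = pd.DataFrame({'year': [1950.0, 1951.0, 1952.0, 1953.0],
                   'q': [14.3, 15.1, 15.3, 15.2]})
df.to_stata('example.dta', write_index=False)

# With chunksize, read_stata returns a StataReader that yields
# DataFrames of at most `chunksize` rows at a time.
reader = pd.read_stata('example.dta', chunksize=2)

total_rows = 0
for chunk in reader:
    total_rows += len(chunk)

print(total_rows)  # 4
```

This pattern keeps memory usage bounded when a .dta file is too large to load at once.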

We are going to load the broiler.dta file into a pandas DataFrame using the read_stata method.

In [19]:
!cat broiler.dta
(binary output omitted — DTA is a binary format and its contents are not human-readable)
In [20]:
df = pd.read_stata('broiler.dta')
In [21]:
df.head()
Out[21]:
year q y pchick pbeef pcor pf cpi qproda pop meatex time
0 1950.0 14.3 7863.0 69.500000 31.200001 59.799999 NaN 24.100000 2628500.0 151.684006 NaN 41.0
1 1951.0 15.1 7953.0 72.900002 36.500000 72.099998 NaN 26.000000 2843000.0 154.287003 NaN 42.0
2 1952.0 15.3 8071.0 73.099998 36.200001 71.300003 NaN 26.500000 2851200.0 156.953995 NaN 43.0
3 1953.0 15.2 8319.0 71.300003 28.500000 62.700001 NaN 26.700001 2953900.0 159.565002 NaN 44.0
4 1954.0 15.8 8276.0 64.400002 27.400000 63.400002 NaN 26.900000 3099700.0 162.391006 NaN 45.0

We can also load the same file from a given URL:

In [22]:
stata_url = 'http://www.principlesofeconometrics.com/stata/broiler.dta'
In [23]:
df = pd.read_stata(stata_url)
In [24]:
df.head()
Out[24]:
year q y pchick pbeef pcor pf cpi qproda pop meatex time
0 1950.0 14.3 7863.0 69.500000 31.200001 59.799999 NaN 24.100000 2628500.0 151.684006 NaN 41.0
1 1951.0 15.1 7953.0 72.900002 36.500000 72.099998 NaN 26.000000 2843000.0 154.287003 NaN 42.0
2 1952.0 15.3 8071.0 73.099998 36.200001 71.300003 NaN 26.500000 2851200.0 156.953995 NaN 43.0
3 1953.0 15.2 8319.0 71.300003 28.500000 62.700001 NaN 26.700001 2953900.0 159.565002 NaN 44.0
4 1954.0 15.8 8276.0 64.400002 27.400000 63.400002 NaN 26.900000 3099700.0 162.391006 NaN 45.0

Plot the Consumer Price Index (CPI):

In [25]:
df.loc[:,'cpi'].plot()
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x11320d160>


Google BigQuery

Google BigQuery data can be loaded using the pandas read_gbq function.

This function requires the pandas-gbq package and a Google Cloud project with BigQuery enabled (you can create a new project from the Google Cloud console).

In [ ]:
!pip install pandas_gbq
In [ ]:
import pandas_gbq
In [ ]:
sql = """
    SELECT name, SUM(number) as count
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY count DESC
    LIMIT 10
"""

pandas_gbq.read_gbq(sql,
                    project_id='MY_PROJECT_ID')

