Profile picture

Co-founder @ RMOTR

The Ugliest CSV I've Had to Parse

Last updated: July 9th, 20192019-07-09Project preview

Today I was playing with data from the USDA Food Composition Database when I found one of the toughest CSVs I've ever had to read. It takes the pandas.read_csv method to the extreme.

A preview of the CSV file

First of all, the file format says "txt", but we call these plain CSVs files. Let's take a look

In [1]:
!head FOOD_DES.txt
~01001~^~0100~^~Butter, salted~^~BUTTER,WITH SALT~^^^~Y~^^0^^6.38^4.27^8.79^3.87
~01002~^~0100~^~Butter, whipped, with salt~^~BUTTER,WHIPPED,W/ SALT~^^^~Y~^^0^^6.38^^^
~01003~^~0100~^~Butter oil, anhydrous~^~BUTTER OIL,ANHYDROUS~^^^~Y~^^0^^6.38^4.27^8.79^3.87
~01004~^~0100~^~Cheese, blue~^~CHEESE,BLUE~^^^~Y~^^0^^6.38^4.27^8.79^3.87
~01005~^~0100~^~Cheese, brick~^~CHEESE,BRICK~^^^~Y~^^0^^6.38^4.27^8.79^3.87
~01006~^~0100~^~Cheese, brie~^~CHEESE,BRIE~^^^~Y~^^0^^6.38^4.27^8.79^3.87
~01007~^~0100~^~Cheese, camembert~^~CHEESE,CAMEMBERT~^^^~Y~^^0^^6.38^4.27^8.79^3.87
~01008~^~0100~^~Cheese, caraway~^~CHEESE,CARAWAY~^^^^^0^^6.38^4.27^8.79^3.87
~01009~^~0100~^~Cheese, cheddar (Includes foods for USDA's Food Distribution Program)~^~CHEESE,CHEDDAR~^^^~Y~^^0^^^^^
~01010~^~0100~^~Cheese, cheshire~^~CHEESE,CHESHIRE~^^^^^0^^6.38^4.27^8.79^3.87

What's all that 😱? This is when the fun begins. If you try to read it using Pandas standard read_csv method, you'll get an error:

In [3]:
import pandas as pd
In [4]:
pd.read_csv('FOOD_DES.txt')
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-4-a942efd28237> in <module>
----> 1 pd.read_csv('FOOD_DES.txt')

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
    676                     skip_blank_lines=skip_blank_lines)
    677 
--> 678         return _read(filepath_or_buffer, kwds)
    679 
    680     parser_f.__name__ = name

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    444 
    445     try:
--> 446         data = parser.read(nrows)
    447     finally:
    448         parser.close()

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1034                 raise ValueError('skipfooter not supported for iteration')
   1035 
-> 1036         ret = self._engine.read(nrows)
   1037 
   1038         # May alter columns / col_dict

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1846     def read(self, nrows=None):
   1847         try:
-> 1848             data = self._reader.read(nrows)
   1849         except StopIteration:
   1850             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 5 fields in line 12, saw 7

Delimiters (or separators)

CSV stands for Comma-separated values. Which means that the values of a file are separated by commas, for example, if we have some user data in a "pure" CSV, we'd see:

ID,Name,Email,Age
1,Jane Doe,jane@example.com,34
2,Joe Doe,joe@example.com,34

In this example, each "field" in the CSV is separated by a comma. In the file from the USDA, that is not so clear. After reading the docs (there's an associated PDF file with the download), I found out that the separator character is the caret ^.

We can use then, the attribute delimiter (or sep, an alias):

In [5]:
pd.read_csv('FOOD_DES.txt', delimiter='^', nrows=1)
Out[5]:
~01001~ ~0100~ ~Butter, salted~ ~BUTTER,WITH SALT~ Unnamed: 4 Unnamed: 5 ~Y~ Unnamed: 7 0 Unnamed: 9 6.38 4.27 8.79 3.87
0 ~01002~ ~0100~ ~Butter, whipped, with salt~ ~BUTTER,WHIPPED,W/ SALT~ NaN NaN ~Y~ NaN 0 NaN 6.38 NaN NaN NaN

Pandas read_csv expects the file to (by default) contain a header explaining what each one of the fields represent. In this case, there's no header, so we have to make that explicit using the parameter header=None.

In [18]:
pd.read_csv('FOOD_DES.txt', delimiter='^', header=None, nrows=5)
Out[18]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 ~01001~ ~0100~ ~Butter, salted~ ~BUTTER,WITH SALT~ NaN NaN ~Y~ NaN 0 NaN 6.38 4.27 8.79 3.87
1 ~01002~ ~0100~ ~Butter, whipped, with salt~ ~BUTTER,WHIPPED,W/ SALT~ NaN NaN ~Y~ NaN 0 NaN 6.38 NaN NaN NaN
2 ~01003~ ~0100~ ~Butter oil, anhydrous~ ~BUTTER OIL,ANHYDROUS~ NaN NaN ~Y~ NaN 0 NaN 6.38 4.27 8.79 3.87
3 ~01004~ ~0100~ ~Cheese, blue~ ~CHEESE,BLUE~ NaN NaN ~Y~ NaN 0 NaN 6.38 4.27 8.79 3.87
4 ~01005~ ~0100~ ~Cheese, brick~ ~CHEESE,BRICK~ NaN NaN ~Y~ NaN 0 NaN 6.38 4.27 8.79 3.87

String quotations

You'll find that each value in the returned DataFrame is surrounded by a tilde ~. That is what the USDA has chosen to be the "string quotation character". We can use the quotechar attribute of read_csv to indicate Pandas what's the "character used to denote the start and end of a quoted item":

In [19]:
pd.read_csv('FOOD_DES.txt', delimiter='^', header=None, quotechar='~', nrows=5)
Out[19]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1001 100 Butter, salted BUTTER,WITH SALT NaN NaN Y NaN 0 NaN 6.38 4.27 8.79 3.87
1 1002 100 Butter, whipped, with salt BUTTER,WHIPPED,W/ SALT NaN NaN Y NaN 0 NaN 6.38 NaN NaN NaN
2 1003 100 Butter oil, anhydrous BUTTER OIL,ANHYDROUS NaN NaN Y NaN 0 NaN 6.38 4.27 8.79 3.87
3 1004 100 Cheese, blue CHEESE,BLUE NaN NaN Y NaN 0 NaN 6.38 4.27 8.79 3.87
4 1005 100 Cheese, brick CHEESE,BRICK NaN NaN Y NaN 0 NaN 6.38 4.27 8.79 3.87

Almost there! Let's try now reading the whole file:

In [8]:
pd.read_csv('FOOD_DES.txt', delimiter='^', header=None, quotechar='~')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc6 in position 44: invalid continuation byte

During handling of the above exception, another exception occurred:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-8-cbc808c75bc2> in <module>
----> 1 pd.read_csv('FOOD_DES.txt', delimiter='^', header=None, quotechar='~')

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
    676                     skip_blank_lines=skip_blank_lines)
    677 
--> 678         return _read(filepath_or_buffer, kwds)
    679 
    680     parser_f.__name__ = name

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    444 
    445     try:
--> 446         data = parser.read(nrows)
    447     finally:
    448         parser.close()

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1034                 raise ValueError('skipfooter not supported for iteration')
   1035 
-> 1036         ret = self._engine.read(nrows)
   1037 
   1038         # May alter columns / col_dict

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1846     def read(self, nrows=None):
   1847         try:
-> 1848             data = self._reader.read(nrows)
   1849         except StopIteration:
   1850             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc6 in position 44: invalid continuation byte

Encoding errors

I'm at risk of opening Pandora's box with this, but I have to mention it: files are stored using different "encodings". You've probably heard about ASCII, UTF-8, latin1, etc.

This file is stored using ISO/IEC 8859-1, which is one of the examples of "extended ASCII". To properly read the file, we have to set the encoding to that:

In [10]:
df = pd.read_csv('FOOD_DES.txt', delimiter='^', header=None, quotechar='~', encoding='iso-8859-1')
df.head()
Out[10]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 1001 100 Butter, salted BUTTER,WITH SALT NaN NaN Y NaN 0.0 NaN 6.38 4.27 8.79 3.87
1 1002 100 Butter, whipped, with salt BUTTER,WHIPPED,W/ SALT NaN NaN Y NaN 0.0 NaN 6.38 NaN NaN NaN
2 1003 100 Butter oil, anhydrous BUTTER OIL,ANHYDROUS NaN NaN Y NaN 0.0 NaN 6.38 4.27 8.79 3.87
3 1004 100 Cheese, blue CHEESE,BLUE NaN NaN Y NaN 0.0 NaN 6.38 4.27 8.79 3.87
4 1005 100 Cheese, brick CHEESE,BRICK NaN NaN Y NaN 0.0 NaN 6.38 4.27 8.79 3.87

Now it worked!


We've figured it out, but Remember kids: try to stick to standards.

Notebooks AI
Notebooks AI Profile20060