Today I was playing with data from the USDA Food Composition Database when I found one of the toughest CSVs I've ever had to read. It takes the pandas.read_csv
method to the extreme.
A preview of the CSV file¶
First of all, the file format says "txt", but we call these plain CSVs files. Let's take a look
!head FOOD_DES.txt
What's all that š±? This is when the fun begins. If you try to read it using Pandas standard read_csv
method, you'll get an error:
import pandas as pd
pd.read_csv('FOOD_DES.txt')
Delimiters (or separators)¶
CSV
stands for Comma-separated values. Which means that the values of a file are separated by commas, for example, if we have some user data in a "pure" CSV, we'd see:
ID,Name,Email,Age
1,Jane Doe,jane@example.com,34
2,Joe Doe,joe@example.com,34
In this example, each "field" in the CSV is separated by a comma. In the file from the USDA, that is not so clear. After reading the docs (there's an associated PDF file with the download), I found out that the separator character is the caret ^
.
We can use then, the attribute delimiter
(or sep
, an alias):
pd.read_csv('FOOD_DES.txt', delimiter='^', nrows=1)
Header¶
Pandas read_csv
expects the file to (by default) contain a header explaining what each one of the fields represent. In this case, there's no header, so we have to make that explicit using the parameter header=None
.
pd.read_csv('FOOD_DES.txt', delimiter='^', header=None, nrows=5)
String quotations¶
You'll find that each value in the returned DataFrame is surrounded by a tilde ~
. That is what the USDA has chosen to be the "string quotation character". We can use the quotechar
attribute of read_csv
to indicate Pandas what's the "character used to denote the start and end of a quoted item":
pd.read_csv('FOOD_DES.txt', delimiter='^', header=None, quotechar='~', nrows=5)
Almost there! Let's try now reading the whole file:
pd.read_csv('FOOD_DES.txt', delimiter='^', header=None, quotechar='~')
Encoding errors¶
I'm at risk of opening Pandora's box with this, but I have to mention it: files are stored using different "encodings". You've probably heard about ASCII, UTF-8, latin1, etc.
This file is stored using ISO/IEC 8859-1, which is one of the examples of "extended ASCII". To properly read the file, we have to set the encoding to that:
df = pd.read_csv('FOOD_DES.txt', delimiter='^', header=None, quotechar='~', encoding='iso-8859-1')
df.head()
Now it worked!
We've figured it out, but Remember kids: try to stick to standards.