Profile picture

Column Types on Pandas DataFrames

Last updated: October 31st, 20192019-10-31Project preview

rmotr


Column Types on Pandas DataFrames

Each column in a Pandas DataFrame has a specific type. This is VERY important.

The column has a fixed, pre-defined type, which means that we can't mix types in the same column.

In this lesson we'll analyze them.

purple-divider

Hands on!

In [1]:
import numpy as np
import pandas as pd

green-divider

The first thing we'll do is load in a DataFrame the data of some orders made in an Australian shop we have in the orders.csv file.

In [2]:
df = pd.read_csv('orders.csv',
                 parse_dates=['order_date', 'ship_date'],
                 dtype={'product_category':'category'})
In [3]:
df.head()
Out[3]:
order_date city product_name product_category ship_date cost_price retail_price order_quantity shipping_price
0 2013-08-09 Sydney Cando S750 Color Inkjet Printer Technology 2013-08-16 75.00 120.97 35 26.30
1 2013-08-15 Sydney Steady Liquid Accent Tank-Style Highlighters Office Supplies 2013-08-16 1.31 2.84 13 0.93
2 2014-09-04 Sydney Apex Preferred Stainless Steel Scissors Office Supplies 2014-09-05 2.50 5.68 23 3.60
3 2013-03-23 Sydney Smiths Gold Paper Clips Office Supplies 2013-03-26 1.82 2.98 22 1.58
4 2013-06-20 Sydney OIC Colored Binder Clips, Assorted Sizes Office Supplies 2013-06-21 2.29 3.58 10 1.63

green-divider

 Analyzing column types

We can use the .info() method to see the data type of each column.

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292 entries, 0 to 291
Data columns (total 9 columns):
order_date          292 non-null datetime64[ns]
city                292 non-null object
product_name        292 non-null object
product_category    292 non-null category
ship_date           292 non-null datetime64[ns]
cost_price          292 non-null float64
retail_price        292 non-null float64
order_quantity      292 non-null int64
shipping_price      292 non-null float64
dtypes: category(1), datetime64[ns](2), float64(3), int64(1), object(2)
memory usage: 18.8+ KB

Also, we can use the dtypes attribute which will return the type of each DataFrame column.

In [5]:
# Type of our DataFrame columns
df.dtypes
Out[5]:
order_date          datetime64[ns]
city                        object
product_name                object
product_category          category
ship_date           datetime64[ns]
cost_price                 float64
retail_price               float64
order_quantity               int64
shipping_price             float64
dtype: object

green-divider

 Column types in Pandas

Data types in Pandas are known as dtypes, and the main ones are:

  • int and float: specific for integer and float numbers
  • datetime and timedelta: specific for dates and offsets
  • object: Just strings, they're used to store text, but pandas refers to them as objects.
  • bool: specific for True/False values
  • categorical:

green-divider

Numbers: int and float

These are numeric columns and support all regular arithmetic operations. In our current orders DataFrame, these are numeric columns:

  • order_quantity: int
  • cost_price: float
  • retail_price: float
  • shipping_price: float
In [6]:
df['order_quantity'].head()
Out[6]:
0    35
1    13
2    23
3    22
4    10
Name: order_quantity, dtype: int64
In [7]:
df['cost_price'].head()
Out[7]:
0    75.00
1     1.31
2     2.50
3     1.82
4     2.29
Name: cost_price, dtype: float64

Remember when we used the .describe() method, only the numeric columns were included.

Numeric operations and methods

Numeric columns support multiple operations (as we'll see in the following section) as well as some very useful methods, for example:

mean: the average value of the column

In [8]:
df['cost_price'].mean()
Out[8]:
25.202705479452057

std: the standard deviation of the column

In [9]:
df['cost_price'].std()
Out[9]:
62.11414058290448

And a few others. But we can even run a .describe in the column itself:

In [10]:
df['cost_price'].describe()
Out[10]:
count    292.000000
mean      25.202705
std       62.114141
min        0.240000
25%        1.870000
50%        3.750000
75%       13.640000
max      377.990000
Name: cost_price, dtype: float64

Arithmetic Operations

Numeric columns allow us to perform regular arithmetic operations with them. For example:

In [11]:
df['cost_price'].head()
Out[11]:
0    75.00
1     1.31
2     2.50
3     1.82
4     2.29
Name: cost_price, dtype: float64

We can increase the values by 10%:

In [12]:
df['cost_price'].head() * 1.10
Out[12]:
0    82.500
1     1.441
2     2.750
3     2.002
4     2.519
Name: cost_price, dtype: float64

Please note that the previous operation didn't update the underlying DataFrame, it just showed a result. Let's look at cost_price again:

In [13]:
df['cost_price'].head()
Out[13]:
0    75.00
1     1.31
2     2.50
3     1.82
4     2.29
Name: cost_price, dtype: float64

If you want to override the contents of a column, you'll have to set it equals to that value:

In [14]:
df['cost_price'] = df['cost_price'] * 1.08

Now the changes are permanent:

In [15]:
df['cost_price'].head()
Out[15]:
0    81.0000
1     1.4148
2     2.7000
3     1.9656
4     2.4732
Name: cost_price, dtype: float64

Other way:

In [16]:
# df['cost_price'] *= 1.08

 Creating new columns

Let's see a more useful example of a new column. Let's say we want to calculate a new column total_price, which is calculated using the following formula:

$$ total\_price = retail\_price * order\_quantity + shipping\_price $$

The way to do it is as simple as:

In [17]:
df['total_price'] = df['retail_price'] * df['order_quantity'] + df['shipping_price']

df.head()
Out[17]:
order_date city product_name product_category ship_date cost_price retail_price order_quantity shipping_price total_price
0 2013-08-09 Sydney Cando S750 Color Inkjet Printer Technology 2013-08-16 81.0000 120.97 35 26.30 4260.25
1 2013-08-15 Sydney Steady Liquid Accent Tank-Style Highlighters Office Supplies 2013-08-16 1.4148 2.84 13 0.93 37.85
2 2014-09-04 Sydney Apex Preferred Stainless Steel Scissors Office Supplies 2014-09-05 2.7000 5.68 23 3.60 134.24
3 2013-03-23 Sydney Smiths Gold Paper Clips Office Supplies 2013-03-26 1.9656 2.98 22 1.58 67.14
4 2013-06-20 Sydney OIC Colored Binder Clips, Assorted Sizes Office Supplies 2013-06-21 2.4732 3.58 10 1.63 37.43

In upcoming lessons we'll cover column creation with more detail.

green-divider

Datetimes (and timedeltas)

The following columns in our orders DataFrame are of type datetime or also called timestamp:

  • order_date
  • ship_date

We'll calculate the shipping_delay by checking how many days we have between order_date and ship_date:

In [18]:
df['ship_date'].head()
Out[18]:
0   2013-08-16
1   2013-08-16
2   2014-09-05
3   2013-03-26
4   2013-06-21
Name: ship_date, dtype: datetime64[ns]

Datetime columns also support arithmetic operations, and the results are timedeltas (or offsets). For example:

In [19]:
df['ship_delay'] = df['ship_date'] - df['order_date']

df[['ship_date', 'order_date', 'ship_delay']].head()
Out[19]:
ship_date order_date ship_delay
0 2013-08-16 2013-08-09 7 days
1 2013-08-16 2013-08-15 1 days
2 2014-09-05 2014-09-04 1 days
3 2013-03-26 2013-03-23 3 days
4 2013-06-21 2013-06-20 1 days

7 days is a "relative time", or a delta in time (a timedelta). We can also use timedeltas in combination with our columns:

In [20]:
df['ship_date'].head()
Out[20]:
0   2013-08-16
1   2013-08-16
2   2014-09-05
3   2013-03-26
4   2013-06-21
Name: ship_date, dtype: datetime64[ns]
In [21]:
df['ship_date'].head() + pd.Timedelta(days=10)
Out[21]:
0   2013-08-26
1   2013-08-26
2   2014-09-15
3   2013-04-05
4   2013-07-01
Name: ship_date, dtype: datetime64[ns]

green-divider

Strings (type object)

Any type of text (long, short, containing special characters or not) will be of type object in a DataFrame. Although in Python we know them as just strings. The following columns are strings:

  • city
  • product_name
In [22]:
df['product_name'].head()
Out[22]:
0                 Cando S750 Color Inkjet Printer
1    Steady Liquid Accent Tank-Style Highlighters
2         Apex Preferred Stainless Steel Scissors
3                         Smiths Gold Paper Clips
4        OIC Colored Binder Clips, Assorted Sizes
Name: product_name, dtype: object

Some operations are supported with strings, for example, a simple concatenation:

In [23]:
df['product_name'].head() + ' !!!'
Out[23]:
0                 Cando S750 Color Inkjet Printer !!!
1    Steady Liquid Accent Tank-Style Highlighters !!!
2         Apex Preferred Stainless Steel Scissors !!!
3                         Smiths Gold Paper Clips !!!
4        OIC Colored Binder Clips, Assorted Sizes !!!
Name: product_name, dtype: object

There are more advanced operations we could perform, using the .str attribute:

In [24]:
df['product_name'].str.replace(' ', '_').head()
Out[24]:
0                 Cando_S750_Color_Inkjet_Printer
1    Steady_Liquid_Accent_Tank-Style_Highlighters
2         Apex_Preferred_Stainless_Steel_Scissors
3                         Smiths_Gold_Paper_Clips
4        OIC_Colored_Binder_Clips,_Assorted_Sizes
Name: product_name, dtype: object
In [25]:
df['product_name'].str.contains('750').head()
Out[25]:
0     True
1    False
2    False
3    False
4    False
Name: product_name, dtype: bool

More on string handling is further developed in Data Cleaning with Pandas course.

green-divider

 Categorical

Categorical Data represents a special data type which has a finite number of possible values.

We use categories to save memory space and speed up computations, as categories are "coded". For example, Sex M, F can be represented as 0, 1 internally.

In our orders DataFrame, the product_category could be a categorical column.

In [26]:
df['product_category'].head()
Out[26]:
0         Technology
1    Office Supplies
2    Office Supplies
3    Office Supplies
4    Office Supplies
Name: product_category, dtype: category
Categories (3, object): [Furniture, Office Supplies, Technology]
In [27]:
df['product_category'].values[0:10]
Out[27]:
[Technology, Office Supplies, Office Supplies, Office Supplies, Office Supplies, Office Supplies, Technology, Office Supplies, Technology, Office Supplies]
Categories (3, object): [Furniture, Office Supplies, Technology]
In [28]:
df['product_category'].values.categories
Out[28]:
Index(['Furniture', 'Office Supplies', 'Technology'], dtype='object')
In [29]:
df['product_category'].values.codes[0:10]
Out[29]:
array([2, 1, 1, 1, 1, 1, 2, 1, 2, 1], dtype=int8)
In [30]:
df['product_category'].value_counts()
Out[30]:
Office Supplies    230
Technology          58
Furniture            4
Name: product_category, dtype: int64
In [31]:
df['product_category'].unique()
Out[31]:
[Technology, Office Supplies, Furniture]
Categories (3, object): [Technology, Office Supplies, Furniture]

Categorical cat assesor

Also categorical columns has a special cat assesor object to access categorical properties.

This assesor let us show, rename, reorder, add or remove categories from our column.

In [32]:
df['product_category'].cat.categories
Out[32]:
Index(['Furniture', 'Office Supplies', 'Technology'], dtype='object')

More on categorical type is further developed in Handling Categorical data with Pandas course.

green-divider

 Changing column type

Finally, we can change a column type by using the astype() method.

We'll need to send as first paramenter the desired column type. We can use numpy.dtype or Python type to cast entire pandas object to the same type.

In [33]:
df['cost_price'].astype('int').head() # or np.int
Out[33]:
0    81
1     1
2     2
3     1
4     2
Name: cost_price, dtype: int64

But if we want to keep the new values we need to update our data explicitly:

In [34]:
df.head()
Out[34]:
order_date city product_name product_category ship_date cost_price retail_price order_quantity shipping_price total_price ship_delay
0 2013-08-09 Sydney Cando S750 Color Inkjet Printer Technology 2013-08-16 81.0000 120.97 35 26.30 4260.25 7 days
1 2013-08-15 Sydney Steady Liquid Accent Tank-Style Highlighters Office Supplies 2013-08-16 1.4148 2.84 13 0.93 37.85 1 days
2 2014-09-04 Sydney Apex Preferred Stainless Steel Scissors Office Supplies 2014-09-05 2.7000 5.68 23 3.60 134.24 1 days
3 2013-03-23 Sydney Smiths Gold Paper Clips Office Supplies 2013-03-26 1.9656 2.98 22 1.58 67.14 3 days
4 2013-06-20 Sydney OIC Colored Binder Clips, Assorted Sizes Office Supplies 2013-06-21 2.4732 3.58 10 1.63 37.43 1 days
In [35]:
df['rounded_cost_price'] = df['cost_price'].astype('int')
In [36]:
df.head()
Out[36]:
order_date city product_name product_category ship_date cost_price retail_price order_quantity shipping_price total_price ship_delay rounded_cost_price
0 2013-08-09 Sydney Cando S750 Color Inkjet Printer Technology 2013-08-16 81.0000 120.97 35 26.30 4260.25 7 days 81
1 2013-08-15 Sydney Steady Liquid Accent Tank-Style Highlighters Office Supplies 2013-08-16 1.4148 2.84 13 0.93 37.85 1 days 1
2 2014-09-04 Sydney Apex Preferred Stainless Steel Scissors Office Supplies 2014-09-05 2.7000 5.68 23 3.60 134.24 1 days 2
3 2013-03-23 Sydney Smiths Gold Paper Clips Office Supplies 2013-03-26 1.9656 2.98 22 1.58 67.14 3 days 1
4 2013-06-20 Sydney OIC Colored Binder Clips, Assorted Sizes Office Supplies 2013-06-21 2.4732 3.58 10 1.63 37.43 1 days 2

Let's try casting the same column to another types:

In [37]:
df['cost_price'].astype('float').head() # or np.float
Out[37]:
0    81.0000
1     1.4148
2     2.7000
3     1.9656
4     2.4732
Name: cost_price, dtype: float64
In [38]:
df['cost_price'].astype('str').head() # or np.str
Out[38]:
0                  81.0
1                1.4148
2                   2.7
3    1.9656000000000002
4    2.4732000000000003
Name: cost_price, dtype: object
In [39]:
df['cost_price'].astype('bool').head() # or np.bool
Out[39]:
0    True
1    True
2    True
3    True
4    True
Name: cost_price, dtype: bool
In [40]:
df['cost_price'].astype('object').head() # or np.object
Out[40]:
0        81
1    1.4148
2       2.7
3    1.9656
4    2.4732
Name: cost_price, dtype: object

Finally, if we need to cast to datetime type, we'll need to use the pd.to_datetime method:

In [41]:
pd.to_datetime(df['cost_price'])
Out[41]:
0     1970-01-01 00:00:00.000000081
1     1970-01-01 00:00:00.000000001
2     1970-01-01 00:00:00.000000002
3     1970-01-01 00:00:00.000000001
4     1970-01-01 00:00:00.000000002
                   ...             
287   1970-01-01 00:00:00.000000019
288   1970-01-01 00:00:00.000000002
289   1970-01-01 00:00:00.000000004
290   1970-01-01 00:00:00.000000002
291   1970-01-01 00:00:00.000000004
Name: cost_price, Length: 292, dtype: datetime64[ns]

Casting to another types are covered in Data Cleaning with Pandas course.

purple-divider

Notebooks AI
Notebooks AI Profile20060