Profile picture

First Steps in Pandas for Excel Users

Last updated: July 25th, 20192019-07-25Project preview

In this lesson we'll have our first interactions with Pandas, the main Python library used for Data management.

If you're using Notebooks.ai or MyBinder, Pandas should already be installed. If you're using your own computer, you'll have to install it doing pip install pandas (assuming you're using pip).

Pandas' main datastructure, the DataFrame, is a 2-dimensional(*) object which resembles our Excel tables. In this lesson we'll analyze the parallelism between the two.

(*) DataFrames actually support multiple dimensions (>2), but that's outside of the scope of this lesson. If you're interested, read about MultiIndices.

In [1]:
import pandas as pd

We'll start by reading the Excel file of this lesson.

In [2]:
excel_file = pd.ExcelFile('1. Orders.xlsx')
In [3]:
excel_file.sheet_names
Out[3]:
['Orders']

Reading the Orders sheet into a DataFrame

In [4]:
df = excel_file.parse('Orders')

The variable df contains a Pandas DataFrame, which includes all the information of the Orders sheet in our excel file.

A DataFrame is similar to an Excel table:

In [6]:
df.head(10)
Out[6]:
order_date city product_name product_category ship_date cost_price retail_price order_quantity shipping_price
0 2013-08-09 Sydney Cando S750 Color Inkjet Printer Technology 2013-08-16 75.00 120.97 35 26.30
1 2013-08-15 Sydney Steady Liquid Accent Tank-Style Highlighters Office Supplies 2013-08-16 1.31 2.84 13 0.93
2 2014-09-04 Sydney Apex Preferred Stainless Steel Scissors Office Supplies 2014-09-05 2.50 5.68 23 3.60
3 2013-03-23 Sydney Smiths Gold Paper Clips Office Supplies 2013-03-26 1.82 2.98 22 1.58
4 2013-06-20 Sydney OIC Colored Binder Clips, Assorted Sizes Office Supplies 2013-06-21 2.29 3.58 10 1.63
5 2013-07-06 Sydney TypeRight Top-Opening Peel & Seel Envelopes,... Office Supplies 2013-07-06 21.56 35.94 28 6.66
6 2013-06-29 Sydney HFX 6S Scientific Calculator Technology 2013-07-01 7.92 12.99 49 9.44
7 2014-03-10 Sydney Artisan 481 Labels Office Supplies 2014-03-11 1.94 3.08 1 0.99
8 2013-07-21 Sydney Cando PC940 Copier Technology 2013-07-23 278.99 449.99 12 49.00
9 2014-08-11 Melbourne Steady Major Accent Highlighters Office Supplies 2014-08-13 3.75 7.08 47 2.35

The main difference is that, when we work with DataFrames, we don't always have a visual reference of the data. We know it's in memory, and we know what each column contains, but we don't have a view of the entire DataFrame. The reason for this is that Pandas is designed to work with millions of rows of data.

We'll feel a lot like this fella:

Matrix-media-monitoring

green-divider

The Big Picture

Working with large volumes of data can be problematic. So, instead of always looking at the table of data directly, we like focus on the big picture: what does my DataFrame contain?. As you've seen before, the methods .head() and .tail() are very useful to have an idea of the data we're handling:

In [7]:
df.tail()
Out[7]:
order_date city product_name product_category ship_date cost_price retail_price order_quantity shipping_price
287 2013-12-27 Sydney Binding Machine Supplies Office Supplies 2013-12-29 18.38 29.17 37 6.27
288 2016-07-05 Sydney Artisan Poly Binder Pockets Office Supplies 2016-07-07 2.26 3.58 36 5.47
289 2015-11-28 Sydney Smiths Pen Style Liquid Stix; Assorted (yellow... Office Supplies 2015-11-30 3.88 6.47 16 1.22
290 2016-08-11 Sydney Artisan 487 Labels Office Supplies 2016-08-13 2.29 3.69 13 0.50
291 2013-11-07 Sydney Steady Major Accent Highlighters Office Supplies 2013-11-08 3.75 7.08 29 2.35

But we can also employ a few other methods or attributes to complement this reference. For example, we can inspect the shape of our DataFrame, to understand how our data is structured:

In [8]:
df.shape
Out[8]:
(292, 9)

In this case, shape tells you that this DataFrame contains 292 rows/records and 11 columns. We can also check the detail of our columns:

In [9]:
df.columns
Out[9]:
Index(['order_date', 'city', 'product_name', 'product_category', 'ship_date',
       'cost_price', 'retail_price', 'order_quantity', 'shipping_price'],
      dtype='object')

Important: Note that shape and columns were invoked without parenthesis: just df.shape. That's because it's an attribute and not a method.

We also use a few methods that aggregate the whole data to give us a bigger picture. One of them is .describe(), we'll also see .info() when discussing column types:

In [10]:
df.describe()
Out[10]:
cost_price retail_price order_quantity shipping_price
count 292.000000 292.000000 292.000000 292.000000
mean 25.202705 47.997568 25.188356 6.381575
std 62.114141 110.931564 14.938205 9.552268
min 0.240000 1.140000 1.000000 0.490000
25% 1.870000 3.462500 11.000000 1.390000
50% 3.750000 7.280000 24.500000 3.140000
75% 13.640000 22.380000 38.000000 7.010000
max 377.990000 599.990000 50.000000 69.300000

As you can see, .describe() provides a summary of all the "numeric" columns including count (how many rows) and some other statistical methods: mean, std, etc.

green-divider

Column Selection

To "zoom in" on your DataFrame, and select a specific column, the syntax we use is:

In [11]:
df.head()
Out[11]:
order_date city product_name product_category ship_date cost_price retail_price order_quantity shipping_price
0 2013-08-09 Sydney Cando S750 Color Inkjet Printer Technology 2013-08-16 75.00 120.97 35 26.30
1 2013-08-15 Sydney Steady Liquid Accent Tank-Style Highlighters Office Supplies 2013-08-16 1.31 2.84 13 0.93
2 2014-09-04 Sydney Apex Preferred Stainless Steel Scissors Office Supplies 2014-09-05 2.50 5.68 23 3.60
3 2013-03-23 Sydney Smiths Gold Paper Clips Office Supplies 2013-03-26 1.82 2.98 22 1.58
4 2013-06-20 Sydney OIC Colored Binder Clips, Assorted Sizes Office Supplies 2013-06-21 2.29 3.58 10 1.63
In [12]:
df['order_date'].head()
Out[12]:
0   2013-08-09
1   2013-08-15
2   2014-09-04
3   2013-03-23
4   2013-06-20
Name: order_date, dtype: datetime64[ns]
In [13]:
df['product_name'].head()
Out[13]:
0                 Cando S750 Color Inkjet Printer
1    Steady Liquid Accent Tank-Style Highlighters
2         Apex Preferred Stainless Steel Scissors
3                         Smiths Gold Paper Clips
4        OIC Colored Binder Clips, Assorted Sizes
Name: product_name, dtype: object
In [14]:
df['cost_price'].head()
Out[14]:
0    75.00
1     1.31
2     2.50
3     1.82
4     2.29
Name: cost_price, dtype: float64

We can also "select" specific columns to work with a smaller representation of our DataFrame:

In [15]:
df[['order_date', 'product_name', 'cost_price']].head()
Out[15]:
order_date product_name cost_price
0 2013-08-09 Cando S750 Color Inkjet Printer 75.00
1 2013-08-15 Steady Liquid Accent Tank-Style Highlighters 1.31
2 2014-09-04 Apex Preferred Stainless Steel Scissors 2.50
3 2013-03-23 Smiths Gold Paper Clips 1.82
4 2013-06-20 OIC Colored Binder Clips, Assorted Sizes 2.29
In [16]:
simpler_df = df[['order_date', 'product_name', 'cost_price']]
In [17]:
simpler_df.head()
Out[17]:
order_date product_name cost_price
0 2013-08-09 Cando S750 Color Inkjet Printer 75.00
1 2013-08-15 Steady Liquid Accent Tank-Style Highlighters 1.31
2 2014-09-04 Apex Preferred Stainless Steel Scissors 2.50
3 2013-03-23 Smiths Gold Paper Clips 1.82
4 2013-06-20 OIC Colored Binder Clips, Assorted Sizes 2.29

green-divider

Column types

Each column in a Pandas DataFrame has a specific type. This is VERY important. The column has a fixed, pre-defined type, which means that we can't mix types in the same column. This is different from Excel, in which we set a "formatting" type to the column, but we can store any value we want.

We can use the .info() method to see the data type of each column:

In [18]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292 entries, 0 to 291
Data columns (total 9 columns):
order_date          292 non-null datetime64[ns]
city                292 non-null object
product_name        292 non-null object
product_category    292 non-null object
ship_date           292 non-null datetime64[ns]
cost_price          292 non-null float64
retail_price        292 non-null float64
order_quantity      292 non-null int64
shipping_price      292 non-null float64
dtypes: datetime64[ns](2), float64(3), int64(1), object(3)
memory usage: 20.6+ KB

Column types in Pandas

In Excel there are just three data types:

  • Numeric (Number, Currency, Accounting, Date)
  • Text
  • Formula

Data types in Pandas are known as dtypes, and the main ones are:

  • int and float: specific for integer and float numbers
  • datetime and timedelta: specific for dates and offsets
  • object: Just strings, they're used to store text, but pandas refers to them as objects.
  • bool: specific for True/False values
In [19]:
df.head()
Out[19]:
order_date city product_name product_category ship_date cost_price retail_price order_quantity shipping_price
0 2013-08-09 Sydney Cando S750 Color Inkjet Printer Technology 2013-08-16 75.00 120.97 35 26.30
1 2013-08-15 Sydney Steady Liquid Accent Tank-Style Highlighters Office Supplies 2013-08-16 1.31 2.84 13 0.93
2 2014-09-04 Sydney Apex Preferred Stainless Steel Scissors Office Supplies 2014-09-05 2.50 5.68 23 3.60
3 2013-03-23 Sydney Smiths Gold Paper Clips Office Supplies 2013-03-26 1.82 2.98 22 1.58
4 2013-06-20 Sydney OIC Colored Binder Clips, Assorted Sizes Office Supplies 2013-06-21 2.29 3.58 10 1.63

Numbers: int and float

These are numeric columns and support all regular arithmetic operations. In our current orders DataFrame, these are numeric columns:

  • order_quantity: int
  • cost_price: float
  • retail_price: float
  • shipping_price: float
In [20]:
df['order_quantity'].head()
Out[20]:
0    35
1    13
2    23
3    22
4    10
Name: order_quantity, dtype: int64
In [21]:
df['cost_price'].head()
Out[21]:
0    75.00
1     1.31
2     2.50
3     1.82
4     2.29
Name: cost_price, dtype: float64

Remember when we used the .describe() method, only the numeric columns were included.

Numeric operations and methods

Numeric columns support multiple operations (as we'll see in the following section) as well as some very useful methods, for example:

mean: the average value of the column

In [22]:
df['cost_price'].mean()
Out[22]:
25.202705479452057

std: the standard deviation of the column

In [23]:
df['cost_price'].std()
Out[23]:
62.11414058290448

And a few others. But we can even run a .describe in the column itself:

In [24]:
df['cost_price'].describe()
Out[24]:
count    292.000000
mean      25.202705
std       62.114141
min        0.240000
25%        1.870000
50%        3.750000
75%       13.640000
max      377.990000
Name: cost_price, dtype: float64

Arithmetic Operations

Numeric columns allow us to perform regular arithmetic operations with them. For example:

In [25]:
df['cost_price'].head()
Out[25]:
0    75.00
1     1.31
2     2.50
3     1.82
4     2.29
Name: cost_price, dtype: float64

We can increase the values by 10%:

In [26]:
df['cost_price'].head() * 1.10
Out[26]:
0    82.500
1     1.441
2     2.750
3     2.002
4     2.519
Name: cost_price, dtype: float64

Please note that the previous operation didn't update the underlying DataFrame, it just showed a result. Let's look at cost_price again:

In [27]:
df['cost_price'].head()
Out[27]:
0    75.00
1     1.31
2     2.50
3     1.82
4     2.29
Name: cost_price, dtype: float64

If you want to override the contents of a column, you'll have to set it equals to that value:

In [28]:
df['cost_price'] = df['cost_price'] * 1.08

Now the changes are permanent:

In [ ]:
df['cost_price'].head()

We could have also written:

In [ ]:
# df['cost_price'] *= 1.08

We could also create new columns in our DataFrame. We do that by just setting the value of the column equals to a certain value. For example, I'm going to create the column Dummy Column with only values -1.

In [33]:
df['New Column Video'] = df['cost_price'] * 2
In [29]:
df['Dummy Column'] = -1
In [34]:
df.head()
Out[34]:
order_date city product_name product_category ship_date cost_price retail_price order_quantity shipping_price Dummy Column New Column Video
0 2013-08-09 Sydney Cando S750 Color Inkjet Printer Technology 2013-08-16 81.0000 120.97 35 26.30 -1 162.0000
1 2013-08-15 Sydney Steady Liquid Accent Tank-Style Highlighters Office Supplies 2013-08-16 1.4148 2.84 13 0.93 -1 2.8296
2 2014-09-04 Sydney Apex Preferred Stainless Steel Scissors Office Supplies 2014-09-05 2.7000 5.68 23 3.60 -1 5.4000
3 2013-03-23 Sydney Smiths Gold Paper Clips Office Supplies 2013-03-26 1.9656 2.98 22 1.58 -1 3.9312
4 2013-06-20 Sydney OIC Colored Binder Clips, Assorted Sizes Office Supplies 2013-06-21 2.4732 3.58 10 1.63 -1 4.9464

I can delete the column by using the del keyword:

In [35]:
del df['Dummy Column']
In [36]:
del df['New Column Video']
In [37]:
df.head()
Out[37]:
order_date city product_name product_category ship_date cost_price retail_price order_quantity shipping_price
0 2013-08-09 Sydney Cando S750 Color Inkjet Printer Technology 2013-08-16 81.0000 120.97 35 26.30
1 2013-08-15 Sydney Steady Liquid Accent Tank-Style Highlighters Office Supplies 2013-08-16 1.4148 2.84 13 0.93
2 2014-09-04 Sydney Apex Preferred Stainless Steel Scissors Office Supplies 2014-09-05 2.7000 5.68 23 3.60
3 2013-03-23 Sydney Smiths Gold Paper Clips Office Supplies 2013-03-26 1.9656 2.98 22 1.58
4 2013-06-20 Sydney OIC Colored Binder Clips, Assorted Sizes Office Supplies 2013-06-21 2.4732 3.58 10 1.63

Let's see a more useful example of a new column. Let's say we want to calculate a new column total_price, which is calculated using the following formula:

$$ total\_price = retail\_price * order\_quantity + shipping\_price $$

The way to do it is as simple as:

In [38]:
df['total_price'] = df['retail_price'] * df['order_quantity'] + df['shipping_price']

df.head()
Out[38]:
order_date city product_name product_category ship_date cost_price retail_price order_quantity shipping_price total_price
0 2013-08-09 Sydney Cando S750 Color Inkjet Printer Technology 2013-08-16 81.0000 120.97 35 26.30 4260.25
1 2013-08-15 Sydney Steady Liquid Accent Tank-Style Highlighters Office Supplies 2013-08-16 1.4148 2.84 13 0.93 37.85
2 2014-09-04 Sydney Apex Preferred Stainless Steel Scissors Office Supplies 2014-09-05 2.7000 5.68 23 3.60 134.24
3 2013-03-23 Sydney Smiths Gold Paper Clips Office Supplies 2013-03-26 1.9656 2.98 22 1.58 67.14
4 2013-06-20 Sydney OIC Colored Binder Clips, Assorted Sizes Office Supplies 2013-06-21 2.4732 3.58 10 1.63 37.43

Datetimes (and timedeltas)

The following columns in our orders DataFrame are of type datetime or also called timestamp:

  • order_date
  • ship_date

We'll calculate the shipping_delay by checking how many days we have between order_date and ship_date:

In [39]:
df['ship_date'].head()
Out[39]:
0   2013-08-16
1   2013-08-16
2   2014-09-05
3   2013-03-26
4   2013-06-21
Name: ship_date, dtype: datetime64[ns]

Datetime columns also support arithmetic operations, and the results are timedeltas (or offsets). For example:

In [40]:
df['ship_delay'] = df['ship_date'] - df['order_date']

df[['ship_date', 'order_date', 'ship_delay']].head()
Out[40]:
ship_date order_date ship_delay
0 2013-08-16 2013-08-09 7 days
1 2013-08-16 2013-08-15 1 days
2 2014-09-05 2014-09-04 1 days
3 2013-03-26 2013-03-23 3 days
4 2013-06-21 2013-06-20 1 days

7 days is a "relative time", or a delta in time (a timedelta). We can also use timedeltas in combination with our columns:

In [41]:
df['ship_date'].head()
Out[41]:
0   2013-08-16
1   2013-08-16
2   2014-09-05
3   2013-03-26
4   2013-06-21
Name: ship_date, dtype: datetime64[ns]
In [42]:
df['ship_date'].head() + pd.Timedelta(days=10)
Out[42]:
0   2013-08-26
1   2013-08-26
2   2014-09-15
3   2013-04-05
4   2013-07-01
Name: ship_date, dtype: datetime64[ns]
In [43]:
df[['ship_date', 'order_date', 'ship_delay']].head()
Out[43]:
ship_date order_date ship_delay
0 2013-08-16 2013-08-09 7 days
1 2013-08-16 2013-08-15 1 days
2 2014-09-05 2014-09-04 1 days
3 2013-03-26 2013-03-23 3 days
4 2013-06-21 2013-06-20 1 days

Strings (type object)

Any type of text (long, short, containing special characters or not) will be of type object in a DataFrame. Although in Python we know them as just strings. The following columns are strings:

  • city
  • product_name
  • product_category
In [44]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 292 entries, 0 to 291
Data columns (total 11 columns):
order_date          292 non-null datetime64[ns]
city                292 non-null object
product_name        292 non-null object
product_category    292 non-null object
ship_date           292 non-null datetime64[ns]
cost_price          292 non-null float64
retail_price        292 non-null float64
order_quantity      292 non-null int64
shipping_price      292 non-null float64
total_price         292 non-null float64
ship_delay          292 non-null timedelta64[ns]
dtypes: datetime64[ns](2), float64(4), int64(1), object(3), timedelta64[ns](1)
memory usage: 25.2+ KB
In [45]:
df['product_name'].head()
Out[45]:
0                 Cando S750 Color Inkjet Printer
1    Steady Liquid Accent Tank-Style Highlighters
2         Apex Preferred Stainless Steel Scissors
3                         Smiths Gold Paper Clips
4        OIC Colored Binder Clips, Assorted Sizes
Name: product_name, dtype: object

Some operations are supported with strings, for example, a simple concatenation:

In [48]:
df['product_name'] = df['product_name'] + ' !!!'
In [49]:
df.head()
Out[49]:
order_date city product_name product_category ship_date cost_price retail_price order_quantity shipping_price total_price ship_delay
0 2013-08-09 Sydney Cando S750 Color Inkjet Printer !!! Technology 2013-08-16 81.0000 120.97 35 26.30 4260.25 7 days
1 2013-08-15 Sydney Steady Liquid Accent Tank-Style Highlighters !!! Office Supplies 2013-08-16 1.4148 2.84 13 0.93 37.85 1 days
2 2014-09-04 Sydney Apex Preferred Stainless Steel Scissors !!! Office Supplies 2014-09-05 2.7000 5.68 23 3.60 134.24 1 days
3 2013-03-23 Sydney Smiths Gold Paper Clips !!! Office Supplies 2013-03-26 1.9656 2.98 22 1.58 67.14 3 days
4 2013-06-20 Sydney OIC Colored Binder Clips, Assorted Sizes !!! Office Supplies 2013-06-21 2.4732 3.58 10 1.63 37.43 1 days

There are more advanced operations we could perform, using the .str attribute:

In [50]:
df['product_name'].str.replace(' ', '_').head()
Out[50]:
0                 Cando_S750_Color_Inkjet_Printer_!!!
1    Steady_Liquid_Accent_Tank-Style_Highlighters_!!!
2         Apex_Preferred_Stainless_Steel_Scissors_!!!
3                         Smiths_Gold_Paper_Clips_!!!
4        OIC_Colored_Binder_Clips,_Assorted_Sizes_!!!
Name: product_name, dtype: object
In [53]:
df['product_name'] = df['product_name'].str.replace(' ', '-')
In [54]:
df.head()
Out[54]:
order_date city product_name product_category ship_date cost_price retail_price order_quantity shipping_price total_price ship_delay
0 2013-08-09 Sydney Cando-S750-Color-Inkjet-Printer-!!! Technology 2013-08-16 81.0000 120.97 35 26.30 4260.25 7 days
1 2013-08-15 Sydney Steady-Liquid-Accent-Tank-Style-Highlighters-!!! Office Supplies 2013-08-16 1.4148 2.84 13 0.93 37.85 1 days
2 2014-09-04 Sydney Apex-Preferred-Stainless-Steel-Scissors-!!! Office Supplies 2014-09-05 2.7000 5.68 23 3.60 134.24 1 days
3 2013-03-23 Sydney Smiths-Gold-Paper-Clips-!!! Office Supplies 2013-03-26 1.9656 2.98 22 1.58 67.14 3 days
4 2013-06-20 Sydney OIC-Colored-Binder-Clips,-Assorted-Sizes-!!! Office Supplies 2013-06-21 2.4732 3.58 10 1.63 37.43 1 days
In [51]:
df['product_name'].str.contains('750').head()
Out[51]:
0     True
1    False
2    False
3    False
4    False
Name: product_name, dtype: bool

But this is enough for now, we'll have a lesson related to string handling later.

green-divider

Next steps

We're just scratching the surface in terms of data handling with Pandas. There are multiple things we have to keep exploring, like for example Pandas Indices, selection and filtering, cleaning data, grouping and aggregations, etc.

For now, it's important that you keep practicing the parallelism between data in an excel spreadsheet, where you can constantly see the data, and the way Pandas works "in memory".

purple-divider

Notebooks AI
Notebooks AI Profile20060