Profile picture

Co-founder @ RMOTR

Pandas Intro

Last updated: November 14th, 20182018-11-14Project preview

Open In Colab

Let's start by importing Pandas. Click on a cell to activate it. Press Shift + Enter to execute. Cells should be executed in order.

In [1]:
import pandas as pd

Load / Create Data Frame

We can read CSV or excel files to import data. We can also create dataframes programatically like this:

In [2]:
df = pd.DataFrame({'Artist':['Billie Holiday','Jimi Hendrix', 'Miles Davis', 'SIA'],
              'Genre': ['Jazz', 'Rock', 'Jazz', 'Pop'],
              'Listeners': [1300000, 2700000, 1500000, 2000000],
              'Plays': [27000000, 70000000, 48000000, 74000000]})

The variable df now contains a dataframe. Executing this cell shows the contents of the dataframe

In [3]:
df
Out[3]:
Artist Genre Listeners Plays
0 Billie Holiday Jazz 1300000 27000000
1 Jimi Hendrix Rock 2700000 70000000
2 Miles Davis Jazz 1500000 48000000
3 SIA Pop 2000000 74000000

Selection

We can select any column using its label:

In [4]:
df['Artist']
Out[4]:
0    Billie Holiday
1      Jimi Hendrix
2       Miles Davis
3               SIA
Name: Artist, dtype: object

We can select one or multiple rows using their numbers (inclusive of both bounding row numbers):

In [5]:
df.loc[1:3]
Out[5]:
Artist Genre Listeners Plays
1 Jimi Hendrix Rock 2700000 70000000
2 Miles Davis Jazz 1500000 48000000
3 SIA Pop 2000000 74000000
In [6]:
# Exercise: Select only row #0

We can select any slice of the table using a both column label and row numbers using loc:

In [7]:
df.loc[1:3,['Artist']]
Out[7]:
Artist
1 Jimi Hendrix
2 Miles Davis
3 SIA
In [8]:
# Exercise: Select 'Artist' and 'Plays' for rows #1 and #2

Filtering

Now it gets more interesting. We can easily filter rows using the values of a specific row. For example, here are our jazz musicians:

In [9]:
df[df['Genre'] == "Jazz" ]
Out[9]:
Artist Genre Listeners Plays
0 Billie Holiday Jazz 1300000 27000000
2 Miles Davis Jazz 1500000 48000000
In [10]:
# Exercise: Select the row where the Genre is 'Rock'

Here are the artists who have more than 1,800,000 listeners:

In [11]:
df[df['Listeners'] > 1800000 ]
Out[11]:
Artist Genre Listeners Plays
1 Jimi Hendrix Rock 2700000 70000000
3 SIA Pop 2000000 74000000
In [12]:
# Exercise: Select the rows where 'Plays' is less than 50,000,000

Grouping

In [13]:
df.groupby('Genre').sum()
Out[13]:
Listeners Plays
Genre
Jazz 2800000 75000000
Pop 2000000 74000000
Rock 2700000 70000000
In [14]:
# Exercise: Group by Genre, use mean() as the aggregation function 
In [15]:
# Exercise: Group by Genre, use max() as the aggregation function 
Notebooks AI
Notebooks AI Profile20060