Let's start by importing Pandas. Click on a cell to activate it. Press Shift + Enter to execute. Cells should be executed in order.
import pandas as pd
Load / Create Data Frame¶
We can read CSV or excel files to import data. We can also create dataframes programatically like this:
df = pd.DataFrame({'Artist':['Billie Holiday','Jimi Hendrix', 'Miles Davis', 'SIA'],
'Genre': ['Jazz', 'Rock', 'Jazz', 'Pop'],
'Listeners': [1300000, 2700000, 1500000, 2000000],
'Plays': [27000000, 70000000, 48000000, 74000000]})
The variable df
now contains a dataframe. Executing this cell shows the contents of the dataframe
df
Selection¶
We can select any column using its label:
df['Artist']
We can select one or multiple rows using their numbers (inclusive of both bounding row numbers):
df.loc[1:3]
# Exercise: Select only row #0
We can select any slice of the table using a both column label and row numbers using loc:
df.loc[1:3,['Artist']]
# Exercise: Select 'Artist' and 'Plays' for rows #1 and #2
Filtering¶
Now it gets more interesting. We can easily filter rows using the values of a specific row. For example, here are our jazz musicians:
df[df['Genre'] == "Jazz" ]
# Exercise: Select the row where the Genre is 'Rock'
Here are the artists who have more than 1,800,000 listeners:
df[df['Listeners'] > 1800000 ]
# Exercise: Select the rows where 'Plays' is less than 50,000,000
Grouping¶
df.groupby('Genre').sum()
# Exercise: Group by Genre, use mean() as the aggregation function
# Exercise: Group by Genre, use max() as the aggregation function