Profile picture

Intro to Data Analysis With Pandas

Last updated: January 13th, 20202020-01-13Project preview

rmotr


Intro to Data Analysis with Pandas - Workshop

purple-divider

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

green-divider

Reading data

We'll use the following Pokemon Data.

In [2]:
df = pd.read_csv('datasets/Pokemon.csv')
In [3]:
df.head()
Out[3]:
# Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
3 4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
4 5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1 False

DataFrames have an index. We can set it using set_index() method:

In [4]:
df.set_index('#', inplace=True)
In [5]:
df.index.name = 'Id'
In [6]:
df.index
Out[6]:
Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            712, 713, 714, 715, 716, 717, 718, 719, 720, 721],
           dtype='int64', name='Id', length=721)

Internally, pandas uses a very important library in the Data Science space which is numpy. We can see the internal representation with .values:

In [7]:
df.values
Out[7]:
array([['Bulbasaur', 'Grass', 'Poison', ..., 45, 1, False],
       ['Ivysaur', 'Grass', 'Poison', ..., 60, 1, False],
       ['Venusaur', 'Grass', 'Poison', ..., 80, 1, False],
       ...,
       ['Diancie', 'Rock', 'Fairy', ..., 50, 6, True],
       ['HoopaHoopa Confined', 'Psychic', 'Ghost', ..., 70, 6, True],
       ['Volcanion', 'Fire', 'Water', ..., 70, 6, True]], dtype=object)
In [8]:
df.head()
Out[8]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
Id
1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1 False

green-divider

Composition

In [9]:
# dataframe elements
df.size
Out[9]:
8652
In [10]:
# dataframe dimensions
df.ndim
Out[10]:
2
In [11]:
# dataframe shape
df.shape
Out[11]:
(721, 12)
In [12]:
# traspose dataframe
df.T
Out[12]:
Id 1 2 3 4 5 6 7 8 9 10 ... 712 713 714 715 716 717 718 719 720 721
Name Bulbasaur Ivysaur Venusaur Charmander Charmeleon Charizard Squirtle Wartortle Blastoise Caterpie ... Bergmite Avalugg Noibat Noivern Xerneas Yveltal Zygarde50% Forme Diancie HoopaHoopa Confined Volcanion
Type 1 Grass Grass Grass Fire Fire Fire Water Water Water Bug ... Ice Ice Flying Flying Fairy Dark Dragon Rock Psychic Fire
Type 2 Poison Poison Poison NaN NaN Flying NaN NaN NaN NaN ... NaN NaN Dragon Dragon NaN Flying Ground Fairy Ghost Water
Total 318 405 525 309 405 534 314 405 530 195 ... 304 514 245 535 680 680 600 600 600 600
HP 45 60 80 39 58 78 44 59 79 45 ... 55 95 40 85 126 126 108 50 80 80
Attack 49 62 82 52 64 84 48 63 83 30 ... 69 117 30 70 131 131 100 100 110 110
Defense 49 63 83 43 58 78 65 80 100 35 ... 85 184 35 80 95 95 121 150 60 120
Sp. Atk 65 80 100 60 80 109 50 65 85 20 ... 32 44 45 97 131 131 81 100 150 130
Sp. Def 65 80 100 50 65 85 64 80 105 20 ... 35 46 40 80 98 98 95 150 130 90
Speed 45 60 80 65 80 100 43 58 78 45 ... 28 28 55 123 99 99 95 50 70 70
Generation 1 1 1 1 1 1 1 1 1 1 ... 6 6 6 6 6 6 6 6 6 6
Legendary False False False False False False False False False False ... False False False False True True True True True True

12 rows × 721 columns

green-divider

 Pandas Data Types

A data type is essentially an internal construct that a programming language uses to understand how to store and manipulate data.

Dataframes have columns, which each one has types:

  • object
  • int64
  • float64
  • datetime64
  • bool
  • category

We can see the column names of our DataFrame:

In [13]:
df.columns
Out[13]:
Index(['Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense',
       'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')

We can set these the types of the columns if we need it to:

In [14]:
df['Type 1'] = df['Type 1'].astype('category')
df['Type 2'] = df['Type 2'].astype('category')
In [15]:
df.dtypes
Out[15]:
Name            object
Type 1        category
Type 2        category
Total            int64
HP               int64
Attack           int64
Defense          int64
Sp. Atk          int64
Sp. Def          int64
Speed            int64
Generation       int64
Legendary         bool
dtype: object
In [16]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 721 entries, 1 to 721
Data columns (total 12 columns):
Name          721 non-null object
Type 1        721 non-null category
Type 2        359 non-null category
Total         721 non-null int64
HP            721 non-null int64
Attack        721 non-null int64
Defense       721 non-null int64
Sp. Atk       721 non-null int64
Sp. Def       721 non-null int64
Speed         721 non-null int64
Generation    721 non-null int64
Legendary     721 non-null bool
dtypes: bool(1), category(2), int64(8), object(1)
memory usage: 80.0+ KB

Special types will have benefits based on that:

In [17]:
df['Type 1'].value_counts()
Out[17]:
Water       105
Normal       93
Grass        66
Bug          63
Psychic      47
Fire         47
Rock         41
Electric     36
Ground       30
Poison       28
Dark         28
Fighting     25
Dragon       24
Ice          23
Ghost        23
Steel        22
Fairy        17
Flying        3
Name: Type 1, dtype: int64

Also, we can do all of this at read time:

In [18]:
df = pd.read_csv('datasets/Pokemon.csv',
                 index_col='#',
                 dtype={
                     'Type 1': 'category',
                     'Type 2': 'category'
                 })
In [19]:
df.head()
Out[19]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1 False
In [20]:
df.dtypes
Out[20]:
Name            object
Type 1        category
Type 2        category
Total            int64
HP               int64
Attack           int64
Defense          int64
Sp. Atk          int64
Sp. Def          int64
Speed            int64
Generation       int64
Legendary         bool
dtype: object
In [21]:
df.tail()
Out[21]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
717 Yveltal Dark Flying 680 126 131 95 131 98 99 6 True
718 Zygarde50% Forme Dragon Ground 600 108 100 121 81 95 95 6 True
719 Diancie Rock Fairy 600 50 100 150 100 150 50 6 True
720 HoopaHoopa Confined Psychic Ghost 600 80 110 60 150 130 70 6 True
721 Volcanion Fire Water 600 80 110 120 130 90 70 6 True

green-divider

 Indexing and selection

In [22]:
df['Attack'].head()
Out[22]:
#
1    49
2    62
3    82
4    52
5    64
Name: Attack, dtype: int64
In [23]:
df[['Attack', 'Defense']].head()
Out[23]:
Attack Defense
#
1 49 49
2 62 63
3 82 83
4 52 43
5 64 58
In [24]:
df.loc[3]
Out[24]:
Name          Venusaur
Type 1           Grass
Type 2          Poison
Total              525
HP                  80
Attack              82
Defense             83
Sp. Atk            100
Sp. Def            100
Speed               80
Generation           1
Legendary        False
Name: 3, dtype: object
In [25]:
df.loc[3: 10]
Out[25]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1 False
6 Charizard Fire Flying 534 78 84 78 109 85 100 1 False
7 Squirtle Water NaN 314 44 48 65 50 64 43 1 False
8 Wartortle Water NaN 405 59 63 80 65 80 58 1 False
9 Blastoise Water NaN 530 79 83 100 85 105 78 1 False
10 Caterpie Bug NaN 195 45 30 35 20 20 45 1 False
In [26]:
df.loc[3, ['Name', 'Attack', 'Defense']]
Out[26]:
Name       Venusaur
Attack           82
Defense          83
Name: 3, dtype: object
In [27]:
df.loc[[2, 3], ['Name', 'Attack', 'Defense']]
Out[27]:
Name Attack Defense
#
2 Ivysaur 62 63
3 Venusaur 82 83
In [28]:
df.loc[2:6, ['Name', 'Attack', 'Defense']]
Out[28]:
Name Attack Defense
#
2 Ivysaur 62 63
3 Venusaur 82 83
4 Charmander 52 43
5 Charmeleon 64 58
6 Charizard 84 78

We can also use sequential position:

In [29]:
df.iloc[0]
Out[29]:
Name          Bulbasaur
Type 1            Grass
Type 2           Poison
Total               318
HP                   45
Attack               49
Defense              49
Sp. Atk              65
Sp. Def              65
Speed                45
Generation            1
Legendary         False
Name: 1, dtype: object
In [30]:
df.iloc[-1]
Out[30]:
Name          Volcanion
Type 1             Fire
Type 2            Water
Total               600
HP                   80
Attack              110
Defense             120
Sp. Atk             130
Sp. Def              90
Speed                70
Generation            6
Legendary          True
Name: 721, dtype: object

green-divider

 Summary statistics

A large number of methods collectively compute descriptive statistics and other related operations on DataFrame.

In [31]:
df.describe()
Out[31]:
Total HP Attack Defense Sp. Atk Sp. Def Speed Generation
count 721.000000 721.000000 721.000000 721.000000 721.000000 721.000000 721.000000 721.000000
mean 417.945908 68.380028 75.124827 70.697642 68.848821 69.180305 65.714286 3.323162
std 109.663671 25.848272 29.070335 29.194941 28.898590 26.899364 27.277920 1.669873
min 180.000000 1.000000 5.000000 5.000000 10.000000 20.000000 5.000000 1.000000
25% 320.000000 50.000000 54.000000 50.000000 45.000000 50.000000 45.000000 2.000000
50% 424.000000 65.000000 75.000000 65.000000 65.000000 65.000000 65.000000 3.000000
75% 499.000000 80.000000 95.000000 85.000000 90.000000 85.000000 85.000000 5.000000
max 720.000000 255.000000 165.000000 230.000000 154.000000 230.000000 160.000000 6.000000
In [32]:
df['Total'].mean()
Out[32]:
417.94590846047157
In [33]:
df['Total'].sum()
Out[33]:
301339
In [34]:
df['Total'].max()
Out[34]:
720
In [35]:
df['Total'].min()
Out[35]:
180

 Column distribution analysis

In [36]:
sns.distplot(df['Attack'], color='red')
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e8708890>
In [37]:
sns.distplot(df['Defense'], color='blue')
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e65d0bd0>
In [38]:
sns.distplot(df['Speed'], color='green')
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e6508750>

Correlation between base stats

In [39]:
correlations = df[['Total' ,'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].corr()

sns.heatmap(correlations, annot=True)
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e64474d0>
In [40]:
sns.lmplot(x='Attack', y='Defense', data=df, height=7)
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x7fc9e63b1850>
In [41]:
sns.lmplot(x='Attack', y='Total', data=df, height=7)
Out[41]:
<seaborn.axisgrid.FacetGrid at 0x7fc9e630db10>

green-divider

Filtering / Conditional Selection / Boolean arrays

Will allow you to select different elements from a DataFrame or Series.

In [42]:
df_small = df.loc[0:10]

df_small
Out[42]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1 False
6 Charizard Fire Flying 534 78 84 78 109 85 100 1 False
7 Squirtle Water NaN 314 44 48 65 50 64 43 1 False
8 Wartortle Water NaN 405 59 63 80 65 80 58 1 False
9 Blastoise Water NaN 530 79 83 100 85 105 78 1 False
10 Caterpie Bug NaN 195 45 30 35 20 20 45 1 False
In [43]:
df_small[[True, True, True, True, True, True, True, True, True, True]]
Out[43]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1 False
6 Charizard Fire Flying 534 78 84 78 109 85 100 1 False
7 Squirtle Water NaN 314 44 48 65 50 64 43 1 False
8 Wartortle Water NaN 405 59 63 80 65 80 58 1 False
9 Blastoise Water NaN 530 79 83 100 85 105 78 1 False
10 Caterpie Bug NaN 195 45 30 35 20 20 45 1 False
In [44]:
df_small[[True, False, False, True, False, False, True, False, False, True]]
Out[44]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
7 Squirtle Water NaN 314 44 48 65 50 64 43 1 False
10 Caterpie Bug NaN 195 45 30 35 20 20 45 1 False

But broadcasting operations will allow us to create those Boolean arrays dynamically.

For example, let's filter the Attack values higher than 80:

In [45]:
df_small['Attack'] > 80
Out[45]:
#
1     False
2     False
3      True
4     False
5     False
6      True
7     False
8     False
9      True
10    False
Name: Attack, dtype: bool

So we can just use that expression to select just the Pokemons with Attack values higher than 50:

In [46]:
df_small.loc[df_small['Attack'] > 80]
Out[46]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
6 Charizard Fire Flying 534 78 84 78 109 85 100 1 False
9 Blastoise Water NaN 530 79 83 100 85 105 78 1 False
In [47]:
df_small.loc[df_small['Attack'] > 80, 'Name']
Out[47]:
#
3     Venusaur
6    Charizard
9    Blastoise
Name: Name, dtype: object
In [48]:
df_small.loc[df_small['Attack'] > 80, ['Name', 'Type 1']]
Out[48]:
Name Type 1
#
3 Venusaur Grass
6 Charizard Fire
9 Blastoise Water
In [49]:
df_small.loc[df_small['Type 1'] == 'Grass']
Out[49]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False

 Combined operations

"Filters" can be combined with and or or operations:

In [50]:
# AND
(df_small['Attack'] > 80) & (df_small['Speed'] > 80)
Out[50]:
#
1     False
2     False
3     False
4     False
5     False
6      True
7     False
8     False
9     False
10    False
dtype: bool
In [51]:
df_small.loc[(df_small['Attack'] > 80) & (df_small['Speed'] > 80)]
Out[51]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
6 Charizard Fire Flying 534 78 84 78 109 85 100 1 False
In [52]:
# OR
(df_small['Attack'] > 80) | (df_small['Speed'] > 80)
Out[52]:
#
1     False
2     False
3      True
4     False
5     False
6      True
7     False
8     False
9      True
10    False
dtype: bool
In [53]:
df_small.loc[(df_small['Attack'] > 80) | (df_small['Speed'] > 80)]
Out[53]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
6 Charizard Fire Flying 534 78 84 78 109 85 100 1 False
9 Blastoise Water NaN 530 79 83 100 85 105 78 1 False
In [54]:
# isin()
df_small.loc[df_small['Type 1'].isin(['Grass', 'Water'])]
Out[54]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
7 Squirtle Water NaN 314 44 48 65 50 64 43 1 False
8 Wartortle Water NaN 405 59 63 80 65 80 58 1 False
9 Blastoise Water NaN 530 79 83 100 85 105 78 1 False

 Summary statistics after selecting

In [55]:
df_small.loc[df_small['Type 1'] == 'Fire', 'Total'].mean()
Out[55]:
416.0
In [56]:
df_small.loc[df_small['Type 1'] == 'Water', 'Total'].mean()
Out[56]:
416.3333333333333
In [57]:
df_small.loc[df_small['Type 1'] == 'Dragon']
Out[57]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
In [58]:
df_small.loc[df_small['Type 1'] == 'Dragon', 'Total'].mean()
Out[58]:
nan

Also we can use the whole data:

In [59]:
df.loc[df['Legendary']]
Out[59]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
144 Articuno Ice Flying 580 90 85 100 95 125 85 1 True
145 Zapdos Electric Flying 580 90 90 85 125 90 100 1 True
146 Moltres Fire Flying 580 90 100 90 125 85 90 1 True
150 Mewtwo Psychic Fighting 680 106 110 90 154 90 130 1 True
243 Raikou Electric NaN 580 90 85 75 115 100 115 2 True
244 Entei Fire NaN 580 115 115 85 90 75 100 2 True
245 Suicune Water NaN 580 100 75 115 90 115 85 2 True
249 Lugia Psychic Flying 680 106 90 130 90 154 110 2 True
250 Ho-oh Fire Flying 680 106 130 90 110 154 90 2 True
377 Regirock Rock NaN 580 80 100 200 50 100 50 3 True
378 Regice Ice NaN 580 80 50 100 100 200 50 3 True
379 Registeel Steel NaN 580 80 75 150 75 150 50 3 True
380 Latias Dragon Psychic 600 80 80 90 110 130 110 3 True
381 Latios Dragon Psychic 600 80 90 80 130 110 110 3 True
382 Kyogre Water NaN 670 100 100 90 150 140 90 3 True
383 Groudon Ground Fire 670 100 150 140 100 90 90 3 True
384 Rayquaza Dragon Flying 680 105 150 90 150 90 95 3 True
385 Jirachi Steel Psychic 600 100 100 100 100 100 100 3 True
386 DeoxysNormal Forme Psychic NaN 600 50 150 50 150 50 150 3 True
480 Uxie Psychic NaN 580 75 75 130 75 130 95 4 True
481 Mesprit Psychic NaN 580 80 105 105 105 105 80 4 True
482 Azelf Psychic NaN 580 75 125 70 125 70 115 4 True
483 Dialga Steel Dragon 680 100 120 120 150 100 90 4 True
484 Palkia Water Dragon 680 90 120 100 150 120 100 4 True
485 Heatran Fire Steel 600 91 90 106 130 106 77 4 True
486 Regigigas Normal NaN 670 110 160 110 80 110 100 4 True
487 GiratinaAltered Forme Ghost Dragon 680 150 100 120 100 120 90 4 True
491 Darkrai Dark NaN 600 70 90 90 135 90 125 4 True
492 ShayminLand Forme Grass Flying 600 100 100 100 100 100 100 4 True
493 Arceus Normal NaN 720 120 120 120 120 120 120 4 True
494 Victini Psychic Fire 600 100 100 100 100 100 100 5 True
638 Cobalion Steel Fighting 580 91 90 129 90 72 108 5 True
639 Terrakion Rock Fighting 580 91 129 90 72 90 108 5 True
640 Virizion Grass Fighting 580 91 90 72 90 129 108 5 True
641 TornadusIncarnate Forme Flying NaN 580 79 115 70 125 80 111 5 True
642 ThundurusIncarnate Forme Electric Flying 580 79 115 70 125 80 111 5 True
643 Reshiram Dragon Fire 680 100 120 100 150 120 90 5 True
644 Zekrom Dragon Electric 680 100 150 120 120 100 90 5 True
645 LandorusIncarnate Forme Ground Flying 600 89 125 90 115 80 101 5 True
646 Kyurem Dragon Ice 660 125 130 90 130 90 95 5 True
716 Xerneas Fairy NaN 680 126 131 95 131 98 99 6 True
717 Yveltal Dark Flying 680 126 131 95 131 98 99 6 True
718 Zygarde50% Forme Dragon Ground 600 108 100 121 81 95 95 6 True
719 Diancie Rock Fairy 600 50 100 150 100 150 50 6 True
720 HoopaHoopa Confined Psychic Ghost 600 80 110 60 150 130 70 6 True
721 Volcanion Fire Water 600 80 110 120 130 90 70 6 True
In [60]:
total_count = df.shape[0]
legendary_count = df.loc[df['Legendary']].shape[0]
no_legendary_count = total_count - legendary_count
In [61]:
plt.pie([no_legendary_count, legendary_count],
        labels=['Not Legendary', 'Legendary'],
        autopct='%1.1f%%',
        shadow=True,
        startangle=90,
        explode=(0, 0.1))

plt.title('Legendary Pokemons', fontsize=12)

plt.show()
In [62]:
df.loc[df['Generation'] == 1]
Out[62]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
4 Charmander Fire NaN 309 39 52 43 60 50 65 1 False
5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1 False
... ... ... ... ... ... ... ... ... ... ... ... ...
147 Dratini Dragon NaN 300 41 64 45 50 50 50 1 False
148 Dragonair Dragon NaN 420 61 84 65 70 70 70 1 False
149 Dragonite Dragon Flying 600 91 134 95 100 100 80 1 False
150 Mewtwo Psychic Fighting 680 106 110 90 154 90 130 1 True
151 Mew Psychic NaN 600 100 100 100 100 100 100 1 False

151 rows × 12 columns

In [63]:
df.loc[df['Name'].str.contains('saur')]
Out[63]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1 False
2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1 False
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1 False
In [64]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))

ax1.tick_params(axis='x', rotation=45)
ax2.tick_params(axis='x', rotation=45)

sns.countplot(x='Type 1', data=df, ax=ax1)

sns.countplot(x='Type 2', data=df, ax=ax2)
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e61cfb10>

Data Cleaning with Pandas

green-divider

 Adding new elements

Let's add some new Pokemons to our data:

In [65]:
df.loc[1003] = ['Venusaur Mega', 'Grass', 'Poisson', 905, 150, 182, 143, 150, 140, 140, 1, False]
df.loc[1006] = ['Charizard Mega', 'Fire', 'Flying', 974, 178, 184, 138, 149, 165, 160, 1, False]
df.loc[1009] = ['Wartortle Mega', 'Water', np.nan, 950, 169, 183, 150, 165, 145, 138, 1, False]
df.loc[1015] = ['Beedrill Mega', 'Bug', 'Poisson', 495, 65, 150, 40, 15, 80, 145, np.nan, False]
df.loc[1018] = ['Pidgeot Mega', 'Normal', 'Flying', 579, 83, 80, 80, 135, 80, 121, np.nan, False]
df.loc[1099] = [np.nan, np.nan, np.nan, 579, 83, 80, 80, 135, 80, 121, 1, False]

df.tail()
Out[65]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
#
1006 Charizard Mega Fire Flying 974 178 184 138 149 165 160 1.0 False
1009 Wartortle Mega Water NaN 950 169 183 150 165 145 138 1.0 False
1015 Beedrill Mega Bug Poisson 495 65 150 40 15 80 145 NaN False
1018 Pidgeot Mega Normal Flying 579 83 80 80 135 80 121 NaN False
1099 NaN NaN NaN 579 83 80 80 135 80 121 1.0 False

green-divider

 Vectorized operations

We'll calculate a new Total Calculated column using the following formula:

$$ Total\ Calculated = HP + Attack + Defense + Sp. Atk + Sp. Def + Speed $$
In [66]:
df['Total Calculated'] = df[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].sum(axis=1)

df.head()
Out[66]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary Total Calculated
#
1 Bulbasaur Grass Poison 318 45 49 49 65 65 45 1.0 False 318
2 Ivysaur Grass Poison 405 60 62 63 80 80 60 1.0 False 405
3 Venusaur Grass Poison 525 80 82 83 100 100 80 1.0 False 525
4 Charmander Fire NaN 309 39 52 43 60 50 65 1.0 False 309
5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1.0 False 405

Has our new Total Calculated the same values that the original Total column has?

In [67]:
(df['Total Calculated'] - df['Total']).sum() == 0
Out[67]:
True
In [68]:
df.shape[0] == (df['Total Calculated'] == df['Total']).sum()
Out[68]:
True
In [69]:
df['Total'].plot.hist()
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e5eeb110>
In [70]:
sns.distplot(df['Total'])
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e6083ad0>

 Another calculated columns

We'll build two new Attack(Sp. Atk) and Defense(Sp. Def) columns containing both base and special stats at the same time:

In [71]:
df['Attack(Sp. Atk)'] = df.apply(lambda p: "{}({})".format(p['Attack'], p['Sp. Atk']), axis=1)
df['Defense(Sp. Def)'] = df.apply(lambda p: "{}({})".format(p['Defense'], p['Sp. Def']), axis=1)

df[['Name', 'Attack(Sp. Atk)', 'Defense(Sp. Def)']].head()
Out[71]:
Name Attack(Sp. Atk) Defense(Sp. Def)
#
1 Bulbasaur 49(65) 49(65)
2 Ivysaur 62(80) 63(80)
3 Venusaur 82(100) 83(100)
4 Charmander 52(60) 43(50)
5 Charmeleon 64(80) 58(65)

 Outliers detection

In [72]:
#df['Total'].plot.box()
sns.boxplot(y='Total', data=df)
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e6046f90>
In [73]:
sns.boxplot(x='Generation', y='Total', data=df)
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e6036650>

There are some outlier values... Let's get rid of them.

green-divider

Cleaning and dealing with missing data

The first thing we're gonna do is to remove the outlier values detected.

boxplot

In [74]:
q1 = df['Total'].quantile(0.25)
q3 = df['Total'].quantile(0.75)
iqr = q3 - q1 #Interquartile range

fence_low = q1 - 1.5*iqr
fence_high = q3 + 1.5*iqr
In [75]:
# "Outside" boxplot Pokemons
df.loc[(df['Total'] < fence_low) | (df['Total'] > fence_high)]
Out[75]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary Total Calculated Attack(Sp. Atk) Defense(Sp. Def)
#
1003 Venusaur Mega Grass Poisson 905 150 182 143 150 140 140 1.0 False 905 182(150) 143(140)
1006 Charizard Mega Fire Flying 974 178 184 138 149 165 160 1.0 False 974 184(149) 138(165)
1009 Wartortle Mega Water NaN 950 169 183 150 165 145 138 1.0 False 950 183(165) 150(145)
In [76]:
# Keep just the "inside" boxplot Pokemons
df = df.loc[(df['Total'] > fence_low) & (df['Total'] < fence_high)]
In [77]:
df['Total'].plot.box()
Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e62b56d0>

 Cleaning missing values

Now we're gonna check missing values:

In [78]:
df.isna().sum()
Out[78]:
Name                  1
Type 1                1
Type 2              363
Total                 0
HP                    0
Attack                0
Defense               0
Sp. Atk               0
Sp. Def               0
Speed                 0
Generation            2
Legendary             0
Total Calculated      0
Attack(Sp. Atk)       0
Defense(Sp. Def)      0
dtype: int64

We have missing values on the following columns:

  • Name column: we'll drop that rows with missing values.
  • Type 2 column: we'll just leave that as it is. It is possible for a Pokemon not to have Type 2 assigned.
  • Generation column: we'll fill that missing values with method='ffill' (propagate last valid observation forward to next valid).
In [79]:
# Drop rows with missing value at 'Name' column
df.dropna(subset=['Name'], inplace=True)
In [80]:
# Fill rows with missing value at 'Generation' column with a '1' value
df['Generation'].fillna(method='ffill', inplace=True)

Another fill possibility is to fill using a fixed value:

In [81]:
df['Generation'].fillna(1, inplace=True)

green-divider

Removing rows and columns

In [82]:
# remove row
df.drop([1, 2, 3]).head()
Out[82]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary Total Calculated Attack(Sp. Atk) Defense(Sp. Def)
#
4 Charmander Fire NaN 309 39 52 43 60 50 65 1.0 False 309 52(60) 43(50)
5 Charmeleon Fire NaN 405 58 64 58 80 65 80 1.0 False 405 64(80) 58(65)
6 Charizard Fire Flying 534 78 84 78 109 85 100 1.0 False 534 84(109) 78(85)
7 Squirtle Water NaN 314 44 48 65 50 64 43 1.0 False 314 48(50) 65(64)
8 Wartortle Water NaN 405 59 63 80 65 80 58 1.0 False 405 63(65) 80(80)
In [83]:
# remove column
df.drop(['Attack', 'Generation'], axis=1).head()
Out[83]:
Name Type 1 Type 2 Total HP Defense Sp. Atk Sp. Def Speed Legendary Total Calculated Attack(Sp. Atk) Defense(Sp. Def)
#
1 Bulbasaur Grass Poison 318 45 49 65 65 45 False 318 49(65) 49(65)
2 Ivysaur Grass Poison 405 60 63 80 80 60 False 405 62(80) 63(80)
3 Venusaur Grass Poison 525 80 83 100 100 80 False 525 82(100) 83(100)
4 Charmander Fire NaN 309 39 43 60 50 65 False 309 52(60) 43(50)
5 Charmeleon Fire NaN 405 58 58 80 65 80 False 405 64(80) 58(65)

green-divider

Sorting values

Weaker pokemons:

In [84]:
df.sort_values(by='Defense').head()
Out[84]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary Total Calculated Attack(Sp. Atk) Defense(Sp. Def)
#
440 Happiny Normal NaN 220 100 5 5 15 65 30 4.0 False 220 5(15) 5(65)
113 Chansey Normal NaN 450 250 5 5 35 105 50 1.0 False 450 5(35) 5(105)
242 Blissey Normal NaN 540 255 10 10 75 135 55 2.0 False 540 10(75) 10(135)
238 Smoochum Ice Psychic 305 45 30 15 85 65 65 2.0 False 305 30(85) 15(65)
172 Pichu Electric NaN 205 20 40 15 35 35 60 2.0 False 205 40(35) 15(35)

Strongest pokemons:

In [85]:
df.sort_values(by='Attack', ascending=False).head()
Out[85]:
Name Type 1 Type 2 Total HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary Total Calculated Attack(Sp. Atk) Defense(Sp. Def)
#
409 Rampardos Rock NaN 495 97 165 60 65 50 58 4.0 False 495 165(65) 60(50)
486 Regigigas Normal NaN 670 110 160 110 80 110 100 4.0 True 670 160(80) 110(110)
289 Slaking Normal NaN 670 150 160 100 95 65 100 3.0 False 670 160(95) 100(65)
384 Rayquaza Dragon Flying 680 105 150 90 150 90 95 3.0 True 680 150(150) 90(90)
1015 Beedrill Mega Bug Poisson 495 65 150 40 15 80 145 6.0 False 495 150(15) 40(80)

green-divider

Grouping and Aggregations

Let's group our Pokemons per Generation to get some aggregated information:

In [86]:
generations = df.groupby('Generation')

generations_attack = generations['Attack'].mean().sort_values(ascending=False)

generations_attack
Out[86]:
Generation
5.0    80.839744
4.0    80.214953
6.0    75.000000
3.0    73.111111
1.0    72.549669
2.0    68.260000
Name: Attack, dtype: float64
In [87]:
sns.barplot(x=generations_attack.index,
            y=generations_attack,
            order=generations_attack.index)
Out[87]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e5c71e10>
In [88]:
generations = df.groupby('Generation')

generations_legendary = generations['Legendary'].sum().sort_values(ascending=False)

generations_legendary
Out[88]:
Generation
4.0    11.0
5.0    10.0
3.0    10.0
6.0     6.0
2.0     5.0
1.0     4.0
Name: Legendary, dtype: float64
In [89]:
sns.barplot(x=generations_legendary.index,
            y=generations_legendary,
            order=generations_legendary.index)
Out[89]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc9e5bfd2d0>

Also we can group using two columns at the same type.

In this case, we'll group by Generation and Type 1 then get the mean Attack per group:

In [90]:
generations_type = df.groupby(['Generation', 'Type 1'])

generations_type_attack = generations_type['Attack'].mean()

generations_type_attack.head(20)
Out[90]:
Generation  Type 1  
1.0         Bug          63.750000
            Dragon       94.000000
            Electric     62.000000
            Fairy        57.500000
            Fighting    102.857143
            Fire         83.916667
            Ghost        50.000000
            Grass        70.666667
            Ground       81.875000
            Ice          67.500000
            Normal       67.727273
            Poison       74.428571
            Psychic      60.125000
            Rock         82.222222
            Water        70.250000
2.0         Bug          69.000000
            Dark         79.000000
            Electric     59.666667
            Fairy        57.000000
            Fighting     65.000000
Name: Attack, dtype: float64
In [91]:
formatted_df = generations_type_attack.unstack()

formatted_df.head(10)
Out[91]:
Type 1 Bug Dark Dragon Electric Fairy Fighting Fire Flying Ghost Grass Ground Ice Normal Poison Psychic Rock Steel Water
Generation
1.0 63.750000 NaN 94.000000 62.000000 57.500000 102.857143 83.916667 NaN 50.000000 70.666667 81.875000 67.500000 67.727273 74.428571 60.125000 82.222222 NaN 70.250000
2.0 69.000000 79.000000 NaN 59.666667 57.000000 65.000000 76.250000 NaN 60.000000 57.000000 85.000000 58.750000 63.133333 90.000000 69.285714 95.500000 82.500000 68.111111
3.0 55.833333 87.500000 99.285714 52.500000 NaN 70.000000 85.000000 NaN 75.000000 75.250000 88.333333 60.000000 69.888889 72.000000 52.250000 79.625000 88.333333 72.416667
4.0 59.750000 111.666667 96.666667 79.714286 50.000000 90.000000 85.000000 NaN 80.333333 78.846154 104.750000 90.000000 79.941176 77.166667 79.285714 86.500000 77.666667 72.461538
5.0 77.611111 81.384615 124.428571 86.428571 NaN 108.571429 94.625000 115.0 41.000000 65.333333 97.777778 73.333333 82.058824 72.500000 52.500000 116.000000 81.250000 73.352941
6.0 64.750000 92.333333 81.250000 50.333333 66.222222 99.333333 69.375000 50.0 84.000000 82.200000 NaN 93.000000 60.400000 67.500000 68.666667 81.625000 105.000000 68.000000
In [92]:
fig, ax = plt.subplots(figsize=(16, 10))

sns.lineplot(data=formatted_df,
             dashes=False,
             ax=ax)

plt.legend(loc='center right', bbox_to_anchor=(1.1, 0.5))
Out[92]:
<matplotlib.legend.Legend at 0x7fc9e5b2afd0>

 Dynamic plots

In [93]:
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook
from bokeh.palettes import Category10

output_notebook()
Loading BokehJS ...
In [94]:
mypalette = Category10[10]

p = figure(plot_width=800, plot_height=500, x_range=(1, 7))

p.title.text = 'Click on legend entries to mute the corresponding lines'

for index, name in enumerate(formatted_df.columns):
    current_color = mypalette[index % len(mypalette)]
    
    line = p.line(x=formatted_df.index,
                  y=formatted_df.loc[:, name],
                  line_width=3,
                  color=current_color,
                  muted_color=current_color,
                  muted_alpha=0.05,
                  legend=name)

    line.muted = True if index > 3 else False

p.legend.location = 'top_right'
p.legend.click_policy = 'mute'

show(p)

purple-divider

Notebooks AI
Notebooks AI Profile20060