Profile picture

Add and Modify Elements on Pandas DataFrames

Last updated: May 27th, 20192019-05-27Project preview

rmotr


Add and Modify elements on Pandas DataFrames

purple-divider

Hands on!

In [1]:
import numpy as np
import pandas as pd

green-divider

The first thing we'll do is create again the DataFrame from our previous lecture:

In [2]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    'Continent': ['America', 'Europe', 'Europe', 'Europe',
                  'Asia', 'Europe', 'America']
})

df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent']

df.index = ['Canada', 'France', 'Germany', 'Italy',
            'Japan', 'United Kingdom', 'United States']

green-divider

Adding new elements to a DataFrame

In [3]:
df
Out[3]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387 9984670 0.913 America
France 63.951 2833687 640679 0.888 Europe
Germany 80.940 3874437 357114 0.916 Europe
Italy 60.665 2167744 301336 0.873 Europe
Japan 127.061 4602367 377930 0.891 Asia
United Kingdom 64.511 2950039 242495 0.907 Europe
United States 318.523 17348075 9525067 0.915 America

 Adding new rows

Empty values will be filled with NaN values.

In [4]:
new_row = pd.Series({
    'Population': 50233,
    'GDP': 1485387,
    'Surface Area': 8923670
}, name='Brazil')
new_row
Out[4]:
Population        50233
GDP             1485387
Surface Area    8923670
Name: Brazil, dtype: int64
In [5]:
df = df.append(new_row)

df
Out[5]:
Population GDP Surface Area HDI Continent
Canada 35.467 1785387.0 9984670.0 0.913 America
France 63.951 2833687.0 640679.0 0.888 Europe
Germany 80.940 3874437.0 357114.0 0.916 Europe
Italy 60.665 2167744.0 301336.0 0.873 Europe
Japan 127.061 4602367.0 377930.0 0.891 Asia
United Kingdom 64.511 2950039.0 242495.0 0.907 Europe
United States 318.523 17348075.0 9525067.0 0.915 America
Brazil 50233.000 1485387.0 8923670.0 NaN NaN

You can directly set the new index and values to the DataFrame:

In [6]:
df.loc['China'] = pd.Series({'Population': 1_400_000_000, 'Continent': 'Asia'})

df
Out[6]:
Population GDP Surface Area HDI Continent
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America
France 6.395100e+01 2833687.0 640679.0 0.888 Europe
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe
United States 3.185230e+02 17348075.0 9525067.0 0.915 America
Brazil 5.023300e+04 1485387.0 8923670.0 NaN NaN
China 1.400000e+09 NaN NaN NaN Asia

 Adding new column

In [ ]:
df
In [7]:
df['Currency'] = ['Canadian Dolar', 'Euro', 'Euro', 'Euro', 'Yen', 'Pound sterling', 'American Dolar', 'Real', 'Yuan']

df
Out[7]:
Population GDP Surface Area HDI Continent Currency
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America Canadian Dolar
France 6.395100e+01 2833687.0 640679.0 0.888 Europe Euro
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe Euro
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe Euro
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia Yen
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe Pound sterling
United States 3.185230e+02 17348075.0 9525067.0 0.915 America American Dolar
Brazil 5.023300e+04 1485387.0 8923670.0 NaN NaN Real
China 1.400000e+09 NaN NaN NaN Asia Yuan

green-divider

Modifying DataFrames

It's simple and intuitive, you can add columns, or replace values for columns without issues:

 Replacing values per column

In [8]:
df['Language'] = 'English'
In [9]:
df
Out[9]:
Population GDP Surface Area HDI Continent Currency Language
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America Canadian Dolar English
France 6.395100e+01 2833687.0 640679.0 0.888 Europe Euro English
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe Euro English
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe Euro English
Japan 1.270610e+02 4602367.0 377930.0 0.891 Asia Yen English
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe Pound sterling English
United States 3.185230e+02 17348075.0 9525067.0 0.915 America American Dolar English
Brazil 5.023300e+04 1485387.0 8923670.0 NaN NaN Real English
China 1.400000e+09 NaN NaN NaN Asia Yuan English

 Replacing specific row value

In [10]:
df.loc['Japan', 'Population'] = 130.061
In [11]:
df
Out[11]:
Population GDP Surface Area HDI Continent Currency Language
Canada 3.546700e+01 1785387.0 9984670.0 0.913 America Canadian Dolar English
France 6.395100e+01 2833687.0 640679.0 0.888 Europe Euro English
Germany 8.094000e+01 3874437.0 357114.0 0.916 Europe Euro English
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe Euro English
Japan 1.300610e+02 4602367.0 377930.0 0.891 Asia Yen English
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe Pound sterling English
United States 3.185230e+02 17348075.0 9525067.0 0.915 America American Dolar English
Brazil 5.023300e+04 1485387.0 8923670.0 NaN NaN Real English
China 1.400000e+09 NaN NaN NaN Asia Yuan English
In [12]:
df.loc['Canada':'Germany', 'HDI'] = 0.9
In [13]:
df
Out[13]:
Population GDP Surface Area HDI Continent Currency Language
Canada 3.546700e+01 1785387.0 9984670.0 0.900 America Canadian Dolar English
France 6.395100e+01 2833687.0 640679.0 0.900 Europe Euro English
Germany 8.094000e+01 3874437.0 357114.0 0.900 Europe Euro English
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe Euro English
Japan 1.300610e+02 4602367.0 377930.0 0.891 Asia Yen English
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe Pound sterling English
United States 3.185230e+02 17348075.0 9525067.0 0.915 America American Dolar English
Brazil 5.023300e+04 1485387.0 8923670.0 NaN NaN Real English
China 1.400000e+09 NaN NaN NaN Asia Yuan English

Transpose DataFrames

We can transpose rows by columns. This will change column indexes to rows indexes.

In [14]:
df
Out[14]:
Population GDP Surface Area HDI Continent Currency Language
Canada 3.546700e+01 1785387.0 9984670.0 0.900 America Canadian Dolar English
France 6.395100e+01 2833687.0 640679.0 0.900 Europe Euro English
Germany 8.094000e+01 3874437.0 357114.0 0.900 Europe Euro English
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe Euro English
Japan 1.300610e+02 4602367.0 377930.0 0.891 Asia Yen English
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe Pound sterling English
United States 3.185230e+02 17348075.0 9525067.0 0.915 America American Dolar English
Brazil 5.023300e+04 1485387.0 8923670.0 NaN NaN Real English
China 1.400000e+09 NaN NaN NaN Asia Yuan English
In [15]:
df.T
Out[15]:
Canada France Germany Italy Japan United Kingdom United States Brazil China
Population 35.467 63.951 80.94 60.665 130.061 64.511 318.523 50233 1.4e+09
GDP 1.78539e+06 2.83369e+06 3.87444e+06 2.16774e+06 4.60237e+06 2.95004e+06 1.73481e+07 1.48539e+06 NaN
Surface Area 9.98467e+06 640679 357114 301336 377930 242495 9.52507e+06 8.92367e+06 NaN
HDI 0.9 0.9 0.9 0.873 0.891 0.907 0.915 NaN NaN
Continent America Europe Europe Europe Asia Europe America NaN Asia
Currency Canadian Dolar Euro Euro Euro Yen Pound sterling American Dolar Real Yuan
Language English English English English English English English English English

green-divider

Removing elements from a DataFrame

Opposed to the concept of selection, we have "dropping". Instead of pointing out which values you'd like to select you could point which ones you'd like to drop:

In [16]:
df
Out[16]:
Population GDP Surface Area HDI Continent Currency Language
Canada 3.546700e+01 1785387.0 9984670.0 0.900 America Canadian Dolar English
France 6.395100e+01 2833687.0 640679.0 0.900 Europe Euro English
Germany 8.094000e+01 3874437.0 357114.0 0.900 Europe Euro English
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe Euro English
Japan 1.300610e+02 4602367.0 377930.0 0.891 Asia Yen English
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe Pound sterling English
United States 3.185230e+02 17348075.0 9525067.0 0.915 America American Dolar English
Brazil 5.023300e+04 1485387.0 8923670.0 NaN NaN Real English
China 1.400000e+09 NaN NaN NaN Asia Yuan English

 Removing rows from a DataFrame

In [17]:
df.drop('Brazil')
Out[17]:
Population GDP Surface Area HDI Continent Currency Language
Canada 3.546700e+01 1785387.0 9984670.0 0.900 America Canadian Dolar English
France 6.395100e+01 2833687.0 640679.0 0.900 Europe Euro English
Germany 8.094000e+01 3874437.0 357114.0 0.900 Europe Euro English
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe Euro English
Japan 1.300610e+02 4602367.0 377930.0 0.891 Asia Yen English
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe Pound sterling English
United States 3.185230e+02 17348075.0 9525067.0 0.915 America American Dolar English
China 1.400000e+09 NaN NaN NaN Asia Yuan English
In [19]:
df.drop('Brazil', inplace=True)

df
Out[19]:
Population GDP Surface Area HDI Continent Currency Language
Canada 3.546700e+01 1785387.0 9984670.0 0.900 America Canadian Dolar English
France 6.395100e+01 2833687.0 640679.0 0.900 Europe Euro English
Germany 8.094000e+01 3874437.0 357114.0 0.900 Europe Euro English
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe Euro English
Japan 1.300610e+02 4602367.0 377930.0 0.891 Asia Yen English
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe Pound sterling English
United States 3.185230e+02 17348075.0 9525067.0 0.915 America American Dolar English
China 1.400000e+09 NaN NaN NaN Asia Yuan English
In [20]:
# will return a new dataframe
df.drop(['Canada', 'Japan'])
Out[20]:
Population GDP Surface Area HDI Continent Currency Language
France 6.395100e+01 2833687.0 640679.0 0.900 Europe Euro English
Germany 8.094000e+01 3874437.0 357114.0 0.900 Europe Euro English
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe Euro English
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe Pound sterling English
United States 3.185230e+02 17348075.0 9525067.0 0.915 America American Dolar English
China 1.400000e+09 NaN NaN NaN Asia Yuan English
In [21]:
df
Out[21]:
Population GDP Surface Area HDI Continent Currency Language
Canada 3.546700e+01 1785387.0 9984670.0 0.900 America Canadian Dolar English
France 6.395100e+01 2833687.0 640679.0 0.900 Europe Euro English
Germany 8.094000e+01 3874437.0 357114.0 0.900 Europe Euro English
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe Euro English
Japan 1.300610e+02 4602367.0 377930.0 0.891 Asia Yen English
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe Pound sterling English
United States 3.185230e+02 17348075.0 9525067.0 0.915 America American Dolar English
China 1.400000e+09 NaN NaN NaN Asia Yuan English
In [25]:
# will return a new dataframe
df.drop(['Italy', 'Canada'])
Out[25]:
Population GDP Surface Area HDI Continent Currency
France 6.395100e+01 2833687.0 640679.0 0.900 Europe Euro
Germany 8.094000e+01 3874437.0 357114.0 0.900 Europe Euro
Japan 1.300610e+02 4602367.0 377930.0 0.891 Asia Yen
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe Pound sterling
United States 3.185230e+02 17348075.0 9525067.0 0.915 America American Dolar
China 1.400000e+09 NaN NaN NaN Asia Yuan

 Removing columns from a DataFrame

In [22]:
# will return a new dataframe
df.drop(columns=['Language'], inplace=True)
df
Out[22]:
Population GDP Surface Area HDI Continent Currency
Canada 3.546700e+01 1785387.0 9984670.0 0.900 America Canadian Dolar
France 6.395100e+01 2833687.0 640679.0 0.900 Europe Euro
Germany 8.094000e+01 3874437.0 357114.0 0.900 Europe Euro
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe Euro
Japan 1.300610e+02 4602367.0 377930.0 0.891 Asia Yen
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe Pound sterling
United States 3.185230e+02 17348075.0 9525067.0 0.915 America American Dolar
China 1.400000e+09 NaN NaN NaN Asia Yuan
In [26]:
#del df['Currency']
df.drop('Currency', axis=1, inplace=True)

df
Out[26]:
Population GDP Surface Area HDI Continent
Canada 3.546700e+01 1785387.0 9984670.0 0.900 America
France 6.395100e+01 2833687.0 640679.0 0.900 Europe
Germany 8.094000e+01 3874437.0 357114.0 0.900 Europe
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe
Japan 1.300610e+02 4602367.0 377930.0 0.891 Asia
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe
United States 3.185230e+02 17348075.0 9525067.0 0.915 America
China 1.400000e+09 NaN NaN NaN Asia
In [ ]:
# will return a new dataframe
df.drop(['Population', 'HDI'], axis=1)
In [27]:
# will return a new dataframe
df.drop(['Population', 'HDI'], axis='columns')
Out[27]:
GDP Surface Area Continent
Canada 1785387.0 9984670.0 America
France 2833687.0 640679.0 Europe
Germany 3874437.0 357114.0 Europe
Italy 2167744.0 301336.0 Europe
Japan 4602367.0 377930.0 Asia
United Kingdom 2950039.0 242495.0 Europe
United States 17348075.0 9525067.0 America
China NaN NaN Asia
In [28]:
df.drop(['Canada', 'Germany'], axis='rows')
Out[28]:
Population GDP Surface Area HDI Continent
France 6.395100e+01 2833687.0 640679.0 0.900 Europe
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe
Japan 1.300610e+02 4602367.0 377930.0 0.891 Asia
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe
United States 3.185230e+02 17348075.0 9525067.0 0.915 America
China 1.400000e+09 NaN NaN NaN Asia

By default, the drop method returns a new DataFrame. If you'd like to modify it "in place", you can use the inplace attribute (there's an example below).

green-divider

Creating columns from other columns

Altering a DataFrame often involves combining different columns into another. For example, in our Countries analysis, we could try to calculate the "GDP per capita", which is just, GDP / Population.

In [29]:
df[['Population', 'GDP']]
Out[29]:
Population GDP
Canada 3.546700e+01 1785387.0
France 6.395100e+01 2833687.0
Germany 8.094000e+01 3874437.0
Italy 6.066500e+01 2167744.0
Japan 1.300610e+02 4602367.0
United Kingdom 6.451100e+01 2950039.0
United States 3.185230e+02 17348075.0
China 1.400000e+09 NaN

The regular pandas way of expressing that, is just dividing each series:

In [30]:
df['GDP']
Out[30]:
Canada             1785387.0
France             2833687.0
Germany            3874437.0
Italy              2167744.0
Japan              4602367.0
United Kingdom     2950039.0
United States     17348075.0
China                    NaN
Name: GDP, dtype: float64
In [31]:
df['Population']
Out[31]:
Canada            3.546700e+01
France            6.395100e+01
Germany           8.094000e+01
Italy             6.066500e+01
Japan             1.300610e+02
United Kingdom    6.451100e+01
United States     3.185230e+02
China             1.400000e+09
Name: Population, dtype: float64
In [32]:
df['GDP'] / df['Population']
Out[32]:
Canada            50339.385908
France            44310.284437
Germany           47868.013343
Italy             35733.025633
Japan             35386.218774
United Kingdom    45729.239975
United States     54464.120330
China                      NaN
dtype: float64

The result of that operation is just another series that you can add to the original DataFrame:

In [33]:
df['GDP Per Capita'] = df['GDP'] / df['Population']
In [34]:
df
Out[34]:
Population GDP Surface Area HDI Continent GDP Per Capita
Canada 3.546700e+01 1785387.0 9984670.0 0.900 America 50339.385908
France 6.395100e+01 2833687.0 640679.0 0.900 Europe 44310.284437
Germany 8.094000e+01 3874437.0 357114.0 0.900 Europe 47868.013343
Italy 6.066500e+01 2167744.0 301336.0 0.873 Europe 35733.025633
Japan 1.300610e+02 4602367.0 377930.0 0.891 Asia 35386.218774
United Kingdom 6.451100e+01 2950039.0 242495.0 0.907 Europe 45729.239975
United States 3.185230e+02 17348075.0 9525067.0 0.915 America 54464.120330
China 1.400000e+09 NaN NaN NaN Asia NaN

green-divider

 Checking existance of a key (membership)

In [ ]:
df

Column existance

In [ ]:
'Population' in df
In [ ]:
'Currency' in df

Row existance

In [ ]:
'Canada' in df.index
In [ ]:
'Brasil' in df.index

purple-divider

Notebooks AI
Notebooks AI Profile20060