Profile picture

Vectorized Operations on Pandas DataFrames

Last updated: October 31st, 20192019-10-31Project preview

rmotr


Vectorized Operations on Pandas DataFrames

As we saw on previous Series lectures, DataFrame's also support vectorized operations and aggregation functions as Numpy, on this lecture we'll see most common ones.

purple-divider

Hands on!

In [1]:
import numpy as np
import pandas as pd
In [2]:
pd.options.display.float_format = '{:,.2f}'.format

green-divider

The first thing we'll do is create again the DataFrame from our previous lecture:

In [3]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387.0, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    'Continent': ['America', 'Europe', 'Europe', 'Europe',
                  'Asia', 'Europe', 'America']
})

df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent']

df.index = ['Canada', 'France', 'Germany', 'Italy',
            'Japan', 'United Kingdom', 'United States']
In [4]:
df
Out[4]:
Population GDP Surface Area HDI Continent
Canada 35.47 1,785,387.00 9984670 0.91 America
France 63.95 2,833,687.00 640679 0.89 Europe
Germany 80.94 3,874,437.00 357114 0.92 Europe
Italy 60.66 2,167,744.00 301336 0.87 Europe
Japan 127.06 4,602,367.00 377930 0.89 Asia
United Kingdom 64.51 2,950,039.00 242495 0.91 Europe
United States 318.52 17,348,075.00 9525067 0.92 America

green-divider

Head and tail

To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number.

In [5]:
df.head(3)
Out[5]:
Population GDP Surface Area HDI Continent
Canada 35.47 1,785,387.00 9984670 0.91 America
France 63.95 2,833,687.00 640679 0.89 Europe
Germany 80.94 3,874,437.00 357114 0.92 Europe
In [6]:
df.tail(3)
Out[6]:
Population GDP Surface Area HDI Continent
Japan 127.06 4,602,367.00 377930 0.89 Asia
United Kingdom 64.51 2,950,039.00 242495 0.91 Europe
United States 318.52 17,348,075.00 9525067 0.92 America

green-divider

Counting Things

There are two handy methods to get summaries of columns in DataFrames. Please note that these will also work on Series (after all, a DataFrame column is just a Series). The first one is unique():

In [7]:
df['Continent'].unique()
Out[7]:
array(['America', 'Europe', 'Asia'], dtype=object)

If you want to get a summary of the count of unique elements, use value_counts():

In [8]:
df['Continent'].value_counts()
Out[8]:
Europe     4
America    2
Asia       1
Name: Continent, dtype: int64

green-divider

DataFrames vectorized operations

DataFrames also support vectorized operations as Numpy.

In [9]:
df['Population'] + 10
Out[9]:
Canada            45.47
France            73.95
Germany           90.94
Italy             70.66
Japan            137.06
United Kingdom    74.51
United States    328.52
Name: Population, dtype: float64
In [10]:
df['Population'] * 1_000_000
Out[10]:
Canada            35,467,000.00
France            63,951,000.00
Germany           80,940,000.00
Italy             60,665,000.00
Japan            127,061,000.00
United Kingdom    64,511,000.00
United States    318,523,000.00
Name: Population, dtype: float64

This operations were executed element-wise for every row.

We can also apply that operations for a subset of our data:

In [11]:
df[['Population', 'GDP']]
Out[11]:
Population GDP
Canada 35.47 1,785,387.00
France 63.95 2,833,687.00
Germany 80.94 3,874,437.00
Italy 60.66 2,167,744.00
Japan 127.06 4,602,367.00
United Kingdom 64.51 2,950,039.00
United States 318.52 17,348,075.00
In [12]:
df[['Population', 'GDP']] * 1_000_000
Out[12]:
Population GDP
Canada 35,467,000.00 1,785,387,000,000.00
France 63,951,000.00 2,833,687,000,000.00
Germany 80,940,000.00 3,874,437,000,000.00
Italy 60,665,000.00 2,167,744,000,000.00
Japan 127,061,000.00 4,602,367,000,000.00
United Kingdom 64,511,000.00 2,950,039,000,000.00
United States 318,523,000.00 17,348,075,000,000.00

 Operations between DataFrame and DataFrame

In [13]:
df + df
Out[13]:
Population GDP Surface Area HDI Continent
Canada 70.93 3,570,774.00 19969340 1.83 AmericaAmerica
France 127.90 5,667,374.00 1281358 1.78 EuropeEurope
Germany 161.88 7,748,874.00 714228 1.83 EuropeEurope
Italy 121.33 4,335,488.00 602672 1.75 EuropeEurope
Japan 254.12 9,204,734.00 755860 1.78 AsiaAsia
United Kingdom 129.02 5,900,078.00 484990 1.81 EuropeEurope
United States 637.05 34,696,150.00 19050134 1.83 AmericaAmerica
In [14]:
df[['Population', 'GDP']] + df[['Population', 'GDP']]
Out[14]:
Population GDP
Canada 70.93 3,570,774.00
France 127.90 5,667,374.00
Germany 161.88 7,748,874.00
Italy 121.33 4,335,488.00
Japan 254.12 9,204,734.00
United Kingdom 129.02 5,900,078.00
United States 637.05 34,696,150.00
In [15]:
df[['Population', 'GDP']] / df[['Population', 'GDP']]
Out[15]:
Population GDP
Canada 1.00 1.00
France 1.00 1.00
Germany 1.00 1.00
Italy 1.00 1.00
Japan 1.00 1.00
United Kingdom 1.00 1.00
United States 1.00 1.00

Calculating "GDP per capita", as a vectorized operation between 2 columns:

In [16]:
df['GDP'] / df['Population']
Out[16]:
Canada           50,339.39
France           44,310.28
Germany          47,868.01
Italy            35,733.03
Japan            36,221.71
United Kingdom   45,729.24
United States    54,464.12
dtype: float64

green-divider

 Comparisson functions

We can use pandas' comparisson methods (eq, ne, lt, gt, le, and ge) between DataFrames.

These operations produce a pandas object of the same type as the left-hand-side input that is of dtype bool.

In [17]:
# get equal to of dataframe and other, element-wise
df.eq(df)
Out[17]:
Population GDP Surface Area HDI Continent
Canada True True True True True
France True True True True True
Germany True True True True True
Italy True True True True True
Japan True True True True True
United Kingdom True True True True True
United States True True True True True
In [18]:
# get greater than of dataframe and other, element-wise
df.gt(df)
Out[18]:
Population GDP Surface Area HDI Continent
Canada False False False False False
France False False False False False
Germany False False False False False
Italy False False False False False
Japan False False False False False
United Kingdom False False False False False
United States False False False False False

 Boolean reductions

Also, you can apply the reductions: empty, any(), all(), and bool() to provide a way to summarize a boolean result.

In [19]:
# return whether any element is True, potentially over an axis.
df.any()
Out[19]:
Population      True
GDP             True
Surface Area    True
HDI             True
Continent       True
dtype: bool
In [20]:
df.any(axis=1)
Out[20]:
Canada            True
France            True
Germany           True
Italy             True
Japan             True
United Kingdom    True
United States     True
dtype: bool

green-divider

 Broadcasting and Operations with Series

We can make operations between DataFrames and Series, but we need to know that operations with Series work at a column level, broadcasting down the rows, which can be counter intuitive.

Operations between a DataFrame and a Series are similar to operations between a two-dimensional and one-dimensional NumPy array.

In [21]:
df[['GDP', 'HDI']]
Out[21]:
GDP HDI
Canada 1,785,387.00 0.91
France 2,833,687.00 0.89
Germany 3,874,437.00 0.92
Italy 2,167,744.00 0.87
Japan 4,602,367.00 0.89
United Kingdom 2,950,039.00 0.91
United States 17,348,075.00 0.92
In [22]:
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])

crisis
Out[22]:
GDP   -1,000,000.00
HDI           -0.30
dtype: float64
In [23]:
df[['GDP', 'HDI']] + crisis
Out[23]:
GDP HDI
Canada 785,387.00 0.61
France 1,833,687.00 0.59
Germany 2,874,437.00 0.62
Italy 1,167,744.00 0.57
Japan 3,602,367.00 0.59
United Kingdom 1,950,039.00 0.61
United States 16,348,075.00 0.61
In [24]:
df[['GDP', 'HDI']] + (crisis * 0.8)
Out[24]:
GDP HDI
Canada 985,387.00 0.67
France 2,033,687.00 0.65
Germany 3,074,437.00 0.68
Italy 1,367,744.00 0.63
Japan 3,802,367.00 0.65
United Kingdom 2,150,039.00 0.67
United States 16,548,075.00 0.68

purple-divider

Notebooks AI
Notebooks AI Profile20060