 Pandas DataFrame - Vectorized Operations and Sorting

Last updated: May 25th, 2019

As we saw in the previous Series lectures, DataFrames also support vectorized operations and aggregation functions, just like NumPy arrays. In this lecture we'll look at the most common ones.

Hands on!

In [ ]:
import numpy as np
import pandas as pd

In [ ]:
pd.options.display.float_format = '{:,.2f}'.format

The first thing we'll do is recreate the DataFrame from our previous lecture:

In [ ]:
df = pd.DataFrame({
'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
'GDP': [1785387.0, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
'Continent': ['America', 'Europe', 'Europe', 'Europe',
'Asia', 'Europe', 'America']
})

df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent']

df.index = ['Canada', 'France', 'Germany', 'Italy',
'Japan', 'United Kingdom', 'United States']

In [ ]:
df

Counting Things

There are two handy methods to get summaries of columns in DataFrames. Please note that these will also work on Series (after all, a DataFrame column is just a Series). The first one is unique():

In [ ]:
df['Continent'].unique()


If you want to get a summary of the count of unique elements, use value_counts():

In [ ]:
df['Continent'].value_counts()
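
As a minimal, self-contained sketch of the two methods side by side (the `continents` Series below is a stand-in for the `Continent` column, and the `normalize=True` option is an addition not covered above):

```python
import pandas as pd

# Stand-in for the 'Continent' column used in this lecture
continents = pd.Series(
    ['America', 'Europe', 'Europe', 'Europe', 'Asia', 'Europe', 'America'],
    name='Continent',
)

# unique() returns the distinct values in order of first appearance
distinct = continents.unique()

# value_counts() returns a Series of counts, most frequent value first
counts = continents.value_counts()

# normalize=True turns the counts into relative frequencies
shares = continents.value_counts(normalize=True)

print(distinct)
print(counts)
print(shares)
```

unique() answers "which values appear?", while value_counts() answers "how often does each one appear?".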

In [ ]:
df.head(3)

In [ ]:
df.tail(3)

DataFrame methods and operations

DataFrames also support vectorized operations and aggregation functions, just like NumPy arrays:

In [ ]:
df['Population'] * 1_000_000

In [ ]:
df[['Population', 'GDP']]

In [ ]:
df[['Population', 'GDP']] * 1_000_000


Calculating "GDP per capita", as a vectorized operation between 2 columns:

In [ ]:
df['GDP'] / df['Population']
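
The sketch below repeats the division on a three-row subset to show that the operation matches rows by index label rather than by position. The units are an assumption here: if Population is in millions of people and GDP in millions of US dollars, the ratio reads as dollars per person.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94],
     'GDP': [1785387.0, 2833687.0, 3874437.0]},
    index=['Canada', 'France', 'Germany'],
)

# Element-wise division between two columns of the same DataFrame
gdp_per_capita = df['GDP'] / df['Population']

# Reordering one operand does not change the value computed for each
# country, because pandas aligns the two Series on their labels
shuffled = df['Population'].loc[['Germany', 'Canada', 'France']]
aligned = df['GDP'] / shuffled

print(gdp_per_capita)
print(aligned)
```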


Operations between a DataFrame and a Series match the Series index against the DataFrame's column names, then broadcast each value down the rows of its column (which can be counterintuitive).
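
A small sketch of that alignment, using a two-row subset. The second half is an addition to the lecture: it assumes you sometimes want to match on row labels instead, which the method form `add(..., axis=0)` allows. The names `per_column` and `per_row` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {'GDP': [1785387.0, 2833687.0], 'HDI': [0.913, 0.888]},
    index=['Canada', 'France'],
)

# The Series labels match the column names, so each value is applied
# to every row of its matching column
per_column = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
by_column = df + per_column

# To match on row labels instead, use the method form with axis=0
per_row = pd.Series([10.0, 20.0], index=['Canada', 'France'])
by_row = df.add(per_row, axis=0)

print(by_column)
print(by_row)
```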

In [ ]:
df[['GDP', 'HDI']]

In [ ]:
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])

crisis

In [ ]:
df[['GDP', 'HDI']] + crisis

Using Universal Functions (Ufuncs) to obtain statistical info

We can apply any Universal Function to a DataFrame column.
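
For instance, a NumPy ufunc such as `np.log10` or `np.sqrt` can be called directly on a column. This is a minimal sketch on a stand-in `gdp` Series; the choice of functions is only illustrative.

```python
import numpy as np
import pandas as pd

gdp = pd.Series(
    [1785387.0, 2833687.0, 3874437.0],
    index=['Canada', 'France', 'Germany'],
    name='GDP',
)

# A ufunc is applied element by element and returns a Series
# that keeps the original index labels
log_gdp = np.log10(gdp)
root_gdp = np.sqrt(gdp)

print(log_gdp)
print(root_gdp)
```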

You've already seen the describe method, which gives you a good "summary" of the whole DataFrame or any specific column. Let's explore other methods in more detail:

In [ ]:
df.describe()


The aggregation methods that we've used for Series also work on entire DataFrames, returning one result per column:

In [ ]:
df[['GDP', 'Population']].max()

In [ ]:
df[['GDP', 'Population']].min()

In [ ]:
df[['GDP', 'Population']].sum()
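
When several of these aggregations are needed together, one option (not covered in this lecture, so treat it as a sketch) is `DataFrame.agg` with a list of method names:

```python
import pandas as pd

df = pd.DataFrame(
    {'GDP': [1785387.0, 2833687.0, 3874437.0],
     'Population': [35.467, 63.951, 80.94]},
    index=['Canada', 'France', 'Germany'],
)

# A single aggregation returns one value per column
col_max = df.max()

# agg() with a list returns a DataFrame: one row per aggregation,
# one column per original column
summary = df.agg(['min', 'max', 'sum'])

print(col_max)
print(summary)
```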

In [ ]:
df[['GDP', 'Population']].quantile(.25)

Sorting DataFrame values

In many cases DataFrame values need to be sorted.

Sorting in pandas is straightforward. Two methods, available on both Series and DataFrames, take care of the job: sort_values and sort_index.

In [ ]:
df

In [ ]:
df.sort_values(['Population'])


Remember that these operations return a new DataFrame; the original one hasn't been modified:

In [ ]:
df


As you can see, sorting is as simple as invoking the sort_values method. By default, values are sorted in ascending order, which you can customize with the ascending parameter.

In [ ]:
df.sort_values(['Population'], ascending=False)
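
Because sort_values takes a list of columns, it can also sort by more than one key. The sketch below assumes a list passed to `ascending` gives one direction per column; the four-row DataFrame is a stand-in for the lecture data.

```python
import pandas as pd

df = pd.DataFrame(
    {'Continent': ['America', 'Europe', 'Europe', 'Asia'],
     'Population': [35.467, 63.951, 80.94, 127.061]},
    index=['Canada', 'France', 'Germany', 'Japan'],
)

# Continent A-Z, and within each continent the largest population first
ordered = df.sort_values(['Continent', 'Population'], ascending=[True, False])

print(ordered)
```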

In [ ]:
df


Note that we have to pass inplace=True if we want the change to be applied to our DataFrame. In the next lecture we'll look at this parameter in detail.

In [ ]:
df.sort_values(['Population'], ascending=False, inplace=True)

In [ ]:
df
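
A short sketch of the difference between the two calling styles, on a stand-in DataFrame. With inplace=True the method returns None, so its result should not be assigned to a variable you intend to use afterwards.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [63.951, 35.467, 80.94]},
    index=['France', 'Canada', 'Germany'],
)

# Without inplace, a sorted copy is returned and df is left untouched
sorted_copy = df.sort_values('Population')

# With inplace=True, df itself is reordered and the call returns None
result = df.sort_values('Population', ascending=False, inplace=True)

print(sorted_copy)
print(result)
print(df)
```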


Sorting index

sort_index works in exactly the same way, but orders by index labels instead of values:

In [ ]:
df['GDP'].sort_index()
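
As a sketch on a stand-in `gdp` Series, sort_index accepts the same `ascending` parameter and, like sort_values, returns a new object:

```python
import pandas as pd

gdp = pd.Series(
    [17348075.0, 4602367.0, 1785387.0],
    index=['United States', 'Japan', 'Canada'],
    name='GDP',
)

# Labels in alphabetical order
by_label = gdp.sort_index()

# Labels in reverse alphabetical order
by_label_desc = gdp.sort_index(ascending=False)

print(by_label)
print(by_label_desc)
```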


Reindexing

In [ ]:
df.index

In [ ]:
# Reorder the current index labels; labels left out (Canada) are dropped
df.reindex(['France',
'Germany',
'Italy',
'Japan',
'United Kingdom',
'United States'])

In [ ]:
# Adding a new index value to a DataFrame
df.reindex(['France',
'Germany',
'Italy',
'Japan',
'United Kingdom',
'United States',
'Brazil'])
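
The following sketch, on a one-column stand-in DataFrame, summarizes what reindex does with labels that are missing from either side. The `fill_value=0` choice is an assumption made for illustration, not a recommended default.

```python
import pandas as pd

df = pd.DataFrame(
    {'GDP': [1785387.0, 2833687.0, 3874437.0]},
    index=['Canada', 'France', 'Germany'],
)

# Labels not listed (Canada) are dropped; labels that do not exist
# yet (Brazil) are added with NaN values
reindexed = df.reindex(['Germany', 'France', 'Brazil'])

# fill_value replaces the NaN used for newly introduced labels
filled = df.reindex(['Germany', 'France', 'Brazil'], fill_value=0)

print(reindexed)
print(filled)
```

reindex returns a new DataFrame in both cases, so `df` keeps its original three rows.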

In [ ]:
# Adding a new index value to a DataFrame, with default fill value
df.reindex(['France',
'Germany',
'Italy', 