Pandas DataFrame - Vectorized Operations and Sorting

Last updated: May 25th, 2019

rmotr


As we saw in the previous Series lectures, DataFrames also support vectorized operations and aggregation functions, just like NumPy. In this lecture we'll look at the most common ones.

Hands on!

In [ ]:
import numpy as np
import pandas as pd
In [ ]:
pd.options.display.float_format = '{:,.2f}'.format

The first thing we'll do is recreate the DataFrame from our previous lecture:

In [ ]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387.0, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    'Continent': ['America', 'Europe', 'Europe', 'Europe',
                  'Asia', 'Europe', 'America']
})

df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent']

df.index = ['Canada', 'France', 'Germany', 'Italy',
            'Japan', 'United Kingdom', 'United States']
In [ ]:
df

Counting Things

There are two handy methods to get summaries of columns in DataFrames. Please note that these will also work on Series (after all, a DataFrame column is just a Series). The first one is unique():

In [ ]:
df['Continent'].unique()

If you want to know how many times each unique value appears, use value_counts():

In [ ]:
df['Continent'].value_counts()
In [ ]:
df.head(3)
In [ ]:
df.tail(3)
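
As a small complement to unique() and value_counts(), the sketch below shows two related calls: nunique(), which returns only the number of distinct values, and value_counts(normalize=True), which returns relative frequencies instead of absolute counts. It rebuilds the 'Continent' column as a standalone Series so that it runs on its own; the names continents, n_continents and shares are illustrative and not part of the lecture.

```python
import pandas as pd

# Standalone copy of the lecture's 'Continent' column (illustrative name)
continents = pd.Series(
    ['America', 'Europe', 'Europe', 'Europe', 'Asia', 'Europe', 'America'],
    index=['Canada', 'France', 'Germany', 'Italy',
           'Japan', 'United Kingdom', 'United States'],
    name='Continent',
)

# Number of distinct values, i.e. the length of continents.unique()
n_continents = continents.nunique()

# Share of rows per continent instead of the raw count
shares = continents.value_counts(normalize=True)

print(n_continents)
print(shares)
```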

DataFrame methods and operations

Like NumPy arrays, DataFrames support vectorized operations and aggregation functions:

In [ ]:
df['Population'] * 1_000_000
In [ ]:
df[['Population', 'GDP']]
In [ ]:
df[['Population', 'GDP']] * 1_000_000

Calculating "GDP per capita", as a vectorized operation between 2 columns:

In [ ]:
df['GDP'] / df['Population']
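
To make "vectorized" concrete, here is a minimal sketch that compares the column division with an explicit Python loop over the same values. It uses a three-row subset of the lecture's data so that it is self-contained; the names vectorized and looped are illustrative.

```python
import pandas as pd

# Three rows of the lecture's data, enough to illustrate the idea
df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94],
     'GDP': [1785387.0, 2833687.0, 3874437.0]},
    index=['Canada', 'France', 'Germany'],
)

# Vectorized: one expression, applied element by element, aligned on the index
vectorized = df['GDP'] / df['Population']

# The same computation written as an explicit loop
looped = pd.Series(
    [gdp / pop for gdp, pop in zip(df['GDP'], df['Population'])],
    index=df.index,
)

# The largest absolute difference between both results
print((vectorized - looped).abs().max())
```

The vectorized form is shorter, keeps the index labels automatically, and is usually much faster on large DataFrames.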

Broadcasting

When a DataFrame is combined with a Series, the index of the Series is matched against the columns of the DataFrame, and the operation is then broadcast down the rows (which can be counterintuitive).

In [ ]:
df[['GDP', 'HDI']]
In [ ]:
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])

crisis
In [ ]:
df[['GDP', 'HDI']] + crisis
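
The `+` operator always matches the Series index against the DataFrame columns. The sketch below shows the equivalent explicit call, DataFrame.add with axis='columns', and the alternative axis='index', which matches the Series against the row labels instead. It uses a two-row subset of the data, and the bonus Series is a made-up example that does not appear in the lecture.

```python
import pandas as pd

df = pd.DataFrame(
    {'GDP': [1785387.0, 2833687.0], 'HDI': [0.913, 0.888]},
    index=['Canada', 'France'],
)
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])

# Explicit form of `df + crisis`: match the Series index against the columns
by_columns = df.add(crisis, axis='columns')

# Hypothetical per-country adjustment: match the Series index against the rows
bonus = pd.Series([100.0, 200.0], index=['Canada', 'France'])
by_rows = df.add(bonus, axis='index')

print(by_columns)
print(by_rows)
```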

Using Universal Functions (Ufuncs) to obtain statistical info

We can apply any Universal Function to a DataFrame column.

You've already seen the describe method, which gives you a good "summary" of the whole DataFrame or any specific column. Let's explore other methods in more detail:

In [ ]:
df.describe()

The aggregation methods that we've used for Series (max, min, sum, quantile and so on) also work on entire DataFrames, returning one value per column:

In [ ]:
df[['GDP', 'Population']].max()
In [ ]:
df[['GDP', 'Population']].min()
In [ ]:
df[['GDP', 'Population']].sum()
In [ ]:
df[['GDP', 'Population']].quantile(.25)
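
The calls above reduce each column to a single value, whereas a NumPy universal function such as np.log works element by element and returns an object of the same shape. The sketch below contrasts the two on a three-row subset of the data; it also uses DataFrame.agg to compute several aggregations in one call. The names summary and log_gdp are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94],
     'GDP': [1785387.0, 2833687.0, 3874437.0]},
    index=['Canada', 'France', 'Germany'],
)

# Aggregations: one row per function, one column per DataFrame column
summary = df.agg(['min', 'max', 'sum'])

# Ufunc: applied to every element, index labels are preserved
log_gdp = np.log(df['GDP'])

print(summary)
print(log_gdp)
```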

Sorting DataFrame values

In many cases DataFrame values need to be sorted.

Sorting in Pandas is extremely easy. There are two important methods to be used for Series and DataFrames that will take care of the job: sort_values and sort_index.

In [ ]:
df
In [ ]:
df.sort_values(['Population'])

Remember that these operations return a new, sorted DataFrame; the original DataFrame hasn't been modified:

In [ ]:
df

As you can see, sorting is as simple as invoking the sort_values method. By default, values are sorted in ascending order, which you can customize with the ascending parameter.

In [ ]:
df.sort_values(['Population'], ascending=False)
In [ ]:
df

Note that we have to add the inplace parameter if we want to keep the changes in our DataFrame. In the next lecture we'll look at this parameter in detail.

In [ ]:
df.sort_values(['Population'], ascending=False, inplace=True)
In [ ]:
df
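
sort_values also accepts several columns, with one ascending flag per column. As a minimal sketch on a five-row subset of the data, the example below sorts by 'Continent' alphabetically and, within each continent, by 'Population' from largest to smallest. It assigns the result to a new name (ordered, illustrative) rather than using inplace, so the original DataFrame is left untouched.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.94, 127.061, 318.523],
     'Continent': ['America', 'Europe', 'Europe', 'Asia', 'America']},
    index=['Canada', 'France', 'Germany', 'Japan', 'United States'],
)

# First key ascending, second key descending
ordered = df.sort_values(['Continent', 'Population'],
                         ascending=[True, False])

print(ordered)
print(df.index[0])  # the original order is unchanged
```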

Sorting index

sort_index works in exactly the same way, but orders the rows by their index labels instead of by their values:

In [ ]:
df['GDP'].sort_index()
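
Like sort_values, sort_index takes an ascending parameter, and on a DataFrame it can also sort the column labels when called with axis=1. The following sketch uses a small three-row DataFrame built for this example; descending and by_column_name are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [127.061, 35.467, 80.94],
     'GDP': [4602367.0, 1785387.0, 3874437.0]},
    index=['Japan', 'Canada', 'Germany'],
)

# Rows in reverse alphabetical order of their labels
descending = df['GDP'].sort_index(ascending=False)

# Columns in alphabetical order of their names
by_column_name = df.sort_index(axis=1)

print(descending)
print(by_column_name)
```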

Reindexing

In [ ]:
df.index
In [ ]:
# Reorder current DataFrame indexes
df.reindex(['France',
            'Germany',
            'Italy',
            'Canada',
            'Japan',
            'United Kingdom',
            'United States'])
In [ ]:
# Adding a new index value to a DataFrame (its row is filled with NaN)
df.reindex(['France',
            'Germany',
            'Italy',
            'Canada',
            'Japan',
            'United Kingdom',
            'United States',
            'Brazil'])
In [ ]:
# Adding a new index value to a DataFrame, with default fill value
df.reindex(['France',
            'Germany',
            'Italy',
            'Canada',
            'Japan',
            'United Kingdom',
            'United States',
            'Brazil'], fill_value=0)
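
The sketch below repeats the idea on a two-row subset and checks the result: reindex returns a new DataFrame, a label that is not in the original index produces a row of NaN values, and fill_value replaces those NaN values with the given default. The names with_missing and with_zero are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'GDP': [1785387.0, 2833687.0], 'HDI': [0.913, 0.888]},
    index=['Canada', 'France'],
)

# 'Brazil' is not in df.index, so its row holds NaN in every column
with_missing = df.reindex(['France', 'Canada', 'Brazil'])

# Same labels, but the new row is filled with 0 instead of NaN
with_zero = df.reindex(['France', 'Canada', 'Brazil'], fill_value=0)

print(with_missing)
print(with_zero)
print('Brazil' in df.index)  # the original DataFrame is unchanged
```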
