Profile picture

Pandas DataFrames - Sorting

Last updated: October 31st, 20192019-10-31Project preview

rmotr


Pandas DataFrames - Sorting

In many cases DataFrame values or indices need to be sorted.

Sorting in Pandas is extremely easy. There are two important methods to be used for Series and DataFrames that will take care of the job: sort_values and sort_index.

purple-divider

Hands on!

In [1]:
import numpy as np
import pandas as pd
In [2]:
pd.options.display.float_format = '{:,.2f}'.format

green-divider

The first thing we'll do is create again the DataFrame from our previous lecture:

In [3]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387.0, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    'Continent': ['America', 'Europe', 'Europe', 'Europe',
                  'Asia', 'Europe', 'America']
})

df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent']

df.index = ['Canada', 'France', 'Germany', 'Italy',
            'Japan', 'United Kingdom', 'United States']
In [4]:
df
Out[4]:
Population GDP Surface Area HDI Continent
Canada 35.47 1,785,387.00 9984670 0.91 America
France 63.95 2,833,687.00 640679 0.89 Europe
Germany 80.94 3,874,437.00 357114 0.92 Europe
Italy 60.66 2,167,744.00 301336 0.87 Europe
Japan 127.06 4,602,367.00 377930 0.89 Asia
United Kingdom 64.51 2,950,039.00 242495 0.91 Europe
United States 318.52 17,348,075.00 9525067 0.92 America

green-divider

Sorting DataFrame values

In [5]:
df
Out[5]:
Population GDP Surface Area HDI Continent
Canada 35.47 1,785,387.00 9984670 0.91 America
France 63.95 2,833,687.00 640679 0.89 Europe
Germany 80.94 3,874,437.00 357114 0.92 Europe
Italy 60.66 2,167,744.00 301336 0.87 Europe
Japan 127.06 4,602,367.00 377930 0.89 Asia
United Kingdom 64.51 2,950,039.00 242495 0.91 Europe
United States 318.52 17,348,075.00 9525067 0.92 America
In [6]:
df.sort_values(['Population'])
Out[6]:
Population GDP Surface Area HDI Continent
Canada 35.47 1,785,387.00 9984670 0.91 America
Italy 60.66 2,167,744.00 301336 0.87 Europe
France 63.95 2,833,687.00 640679 0.89 Europe
United Kingdom 64.51 2,950,039.00 242495 0.91 Europe
Germany 80.94 3,874,437.00 357114 0.92 Europe
Japan 127.06 4,602,367.00 377930 0.89 Asia
United States 318.52 17,348,075.00 9525067 0.92 America

Remember that these operations are immutable; the original DataFrame hasn't been modified:

In [7]:
df
Out[7]:
Population GDP Surface Area HDI Continent
Canada 35.47 1,785,387.00 9984670 0.91 America
France 63.95 2,833,687.00 640679 0.89 Europe
Germany 80.94 3,874,437.00 357114 0.92 Europe
Italy 60.66 2,167,744.00 301336 0.87 Europe
Japan 127.06 4,602,367.00 377930 0.89 Asia
United Kingdom 64.51 2,950,039.00 242495 0.91 Europe
United States 318.52 17,348,075.00 9525067 0.92 America

As you can see, sorting is as simple as invoking the sort_values method. By default, values are sorted in ascending order, which you can customize with the ascending parameter.

In [8]:
df.sort_values(['Population'], ascending=False)
Out[8]:
Population GDP Surface Area HDI Continent
United States 318.52 17,348,075.00 9525067 0.92 America
Japan 127.06 4,602,367.00 377930 0.89 Asia
Germany 80.94 3,874,437.00 357114 0.92 Europe
United Kingdom 64.51 2,950,039.00 242495 0.91 Europe
France 63.95 2,833,687.00 640679 0.89 Europe
Italy 60.66 2,167,744.00 301336 0.87 Europe
Canada 35.47 1,785,387.00 9984670 0.91 America
In [9]:
df
Out[9]:
Population GDP Surface Area HDI Continent
Canada 35.47 1,785,387.00 9984670 0.91 America
France 63.95 2,833,687.00 640679 0.89 Europe
Germany 80.94 3,874,437.00 357114 0.92 Europe
Italy 60.66 2,167,744.00 301336 0.87 Europe
Japan 127.06 4,602,367.00 377930 0.89 Asia
United Kingdom 64.51 2,950,039.00 242495 0.91 Europe
United States 318.52 17,348,075.00 9525067 0.92 America

Note that we have to add the inplace parameter if we want to keep changes on our DataFrame. On next lecture we'll see this parameter on detail.

In [10]:
df.sort_values(['Population'], ascending=False, inplace=True)
In [11]:
df
Out[11]:
Population GDP Surface Area HDI Continent
United States 318.52 17,348,075.00 9525067 0.92 America
Japan 127.06 4,602,367.00 377930 0.89 Asia
Germany 80.94 3,874,437.00 357114 0.92 Europe
United Kingdom 64.51 2,950,039.00 242495 0.91 Europe
France 63.95 2,833,687.00 640679 0.89 Europe
Italy 60.66 2,167,744.00 301336 0.87 Europe
Canada 35.47 1,785,387.00 9984670 0.91 America

green-divider

 Sorting index

sort_index works exactly in the same way:

In [12]:
df['GDP'].sort_index()
Out[12]:
Canada            1,785,387.00
France            2,833,687.00
Germany           3,874,437.00
Italy             2,167,744.00
Japan             4,602,367.00
United Kingdom    2,950,039.00
United States    17,348,075.00
Name: GDP, dtype: float64

Reindexing

In [13]:
df.index
Out[13]:
Index(['United States', 'Japan', 'Germany', 'United Kingdom', 'France',
       'Italy', 'Canada'],
      dtype='object')
In [14]:
# Reorder current DataFrame indices
df.reindex(['France',
            'Germany',
            'Italy',
            'Canada',
            'Japan',
            'United Kingdom',
            'United States'])
Out[14]:
Population GDP Surface Area HDI Continent
France 63.95 2,833,687.00 640679 0.89 Europe
Germany 80.94 3,874,437.00 357114 0.92 Europe
Italy 60.66 2,167,744.00 301336 0.87 Europe
Canada 35.47 1,785,387.00 9984670 0.91 America
Japan 127.06 4,602,367.00 377930 0.89 Asia
United Kingdom 64.51 2,950,039.00 242495 0.91 Europe
United States 318.52 17,348,075.00 9525067 0.92 America
In [15]:
# Adding a new index value to a DataFrame
df.reindex(['France',
            'Germany',
            'Italy',
            'Canada',
            'Japan',
            'United Kingdom',
            'United States',
            'Brazil'])
Out[15]:
Population GDP Surface Area HDI Continent
France 63.95 2,833,687.00 640,679.00 0.89 Europe
Germany 80.94 3,874,437.00 357,114.00 0.92 Europe
Italy 60.66 2,167,744.00 301,336.00 0.87 Europe
Canada 35.47 1,785,387.00 9,984,670.00 0.91 America
Japan 127.06 4,602,367.00 377,930.00 0.89 Asia
United Kingdom 64.51 2,950,039.00 242,495.00 0.91 Europe
United States 318.52 17,348,075.00 9,525,067.00 0.92 America
Brazil nan nan nan nan NaN
In [16]:
# Adding a new index value to a DataFrame, with default fill value
df.reindex(['France',
            'Germany',
            'Italy',
            'Canada',
            'Japan',
            'United Kingdom',
            'United States',
            'Brasil'], fill_value=0)
Out[16]:
Population GDP Surface Area HDI Continent
France 63.95 2,833,687.00 640679 0.89 Europe
Germany 80.94 3,874,437.00 357114 0.92 Europe
Italy 60.66 2,167,744.00 301336 0.87 Europe
Canada 35.47 1,785,387.00 9984670 0.91 America
Japan 127.06 4,602,367.00 377930 0.89 Asia
United Kingdom 64.51 2,950,039.00 242495 0.91 Europe
United States 318.52 17,348,075.00 9525067 0.92 America
Brasil 0.00 0.00 0 0.00 0

purple-divider

Notebooks AI
Notebooks AI Profile20060