
Pandas Series Selection and Indexing

Last updated: May 14th, 2019

rmotr


Pandas Series - Selection and Indexing

A Pandas Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. Keeping these two overlapping analogies in mind helps us understand the patterns of data indexing and selection in this data structure.
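To make the two analogies concrete, here is a minimal sketch. The `prices` Series and its labels are hypothetical data made up for illustration; they are not part of the lecture's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not taken from the lecture.
prices = pd.Series([1.5, 2.0, 3.25], index=['apple', 'banana', 'cherry'])

# Array-like behaviour: positional access and vectorised arithmetic.
first_price = prices.iloc[0]       # value at position 0
doubled = prices * 2               # element-wise, like a NumPy array
as_array = prices.to_numpy()       # the values as a NumPy ndarray

# Dictionary-like behaviour: label lookup, membership and key listing.
banana_price = prices['banana']    # lookup by label, like a dict key
has_cherry = 'cherry' in prices    # membership is checked against the labels
labels = list(prices.keys())       # the index plays the role of dict keys

print(first_price, banana_price, has_cherry, labels)
```

The same object answers both kinds of question, which is why the rest of this lecture distinguishes selection by label from selection by position.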


Hands on!

In [1]:
import pandas as pd
import numpy as np


The first thing we'll do is recreate the Series from our previous lecture:

In [45]:
data_dic = {
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}

g7_pop = pd.Series(data_dic,
                   name='G7 Population in millions')
In [3]:
g7_pop
Out[3]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64


Indexing

Indexing works similarly to lists and dictionaries.

Indexing by index

Use the index label of the element you're looking for:

In [4]:
g7_pop['Canada']
Out[4]:
35.467
In [5]:
g7_pop['Japan']
Out[5]:
127.061
In [6]:
g7_pop['United Kingdom']
Out[6]:
64.511
In [7]:
g7_pop.Japan
Out[7]:
127.061
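Attribute-style access such as `g7_pop.Japan` is convenient, but it only works for some labels. The sketch below uses a hypothetical Series (the `'max'` label is invented purely to show a name clash) to illustrate two cases where bracket indexing is still needed.

```python
import pandas as pd

# Hypothetical data; the 'max' label exists only to demonstrate a name clash.
pop = pd.Series({'Japan': 127.061, 'United Kingdom': 64.511, 'max': 1.0})

# Works: 'Japan' is a valid Python identifier and not an existing attribute.
japan = pop.Japan

# A label containing a space cannot be written as an attribute,
# so bracket indexing is the only option.
uk = pop['United Kingdom']

# A label that matches an existing Series method is shadowed by that method:
# pop.max is the max() method, not the element labelled 'max'.
max_is_method = callable(pop.max)
max_element = pop['max']

print(japan, uk, max_is_method, max_element)
```

Bracket indexing behaves the same way for every label, so it is the safer default in reusable code.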

Slicing and multi-selection

Slicing also works, but note an important difference: when slicing by label, pandas includes the upper limit:

In [8]:
g7_pop['Germany': 'Japan']
Out[8]:
Germany     80.940
Italy       60.665
Japan      127.061
Name: G7 Population in millions, dtype: float64

Selecting several elements at once with a list of labels also works (similar to NumPy):

In [9]:
g7_pop[['Italy', 'France', 'United States']]
Out[9]:
Italy             60.665
France            63.951
United States    318.523
Name: G7 Population in millions, dtype: float64

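Indexing by sequential position

Indexing elements by their sequential position also works. In this case pandas evaluates the object received; if it doesn't exist as an index label, it falls back to the sequential position. Recent pandas releases deprecate this fallback for integer keys, so iloc (introduced below) is the safer way to select by position.

When slicing by sequential position, the upper limit is not included.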

In [10]:
g7_pop
Out[10]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
In [11]:
g7_pop[2]
Out[11]:
80.94
In [12]:
g7_pop[4]
Out[12]:
127.061
In [13]:
g7_pop[2:4]
Out[13]:
Germany    80.940
Italy      60.665
Name: G7 Population in millions, dtype: float64
In [14]:
g7_pop[[3, 1, 6]]
Out[14]:
Italy             60.665
France            63.951
United States    318.523
Name: G7 Population in millions, dtype: float64
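The two slicing rules are easy to mix up, so here is a small side-by-side sketch. It uses a hypothetical `letters` Series and `iloc` (introduced later in this lecture) to make the positional slice explicit.

```python
import pandas as pd

# Hypothetical data for illustration only.
letters = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Label slice: both endpoints are included.
by_label = letters['b':'c']

# Positional slice: the stop position is excluded, as with Python lists.
by_position = letters.iloc[1:2]

print(list(by_label.index))     # ['b', 'c']
print(list(by_position.index))  # ['b']
```

One way to read the design choice: with labels there is no obvious "next label" to use as an exclusive stop, so including the endpoint is the more practical convention.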


Adding new elements to a Series

In many cases we'll want to add new values to our Series. To do that, we index the Series with a new label and assign a value to it. Let's add two new records:

In [15]:
g7_pop['Brasil'] = 20.124
g7_pop['India'] = 32.235
In [16]:
g7_pop
Out[16]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Brasil             20.124
India              32.235
Name: G7 Population in millions, dtype: float64


Modifying Series elements

In [17]:
g7_pop['Canada'] = 40.5

g7_pop
Out[17]:
Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Brasil             20.124
India              32.235
Name: G7 Population in millions, dtype: float64
In [18]:
g7_pop['France'] = np.nan

g7_pop
Out[18]:
Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Brasil             20.124
India              32.235
Name: G7 Population in millions, dtype: float64


Removing elements from a Series

In [19]:
del g7_pop['Brasil']

g7_pop
Out[19]:
Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
India              32.235
Name: G7 Population in millions, dtype: float64
In [20]:
del g7_pop['India']

g7_pop
Out[20]:
Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
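`del` modifies the Series in place. When the original should stay intact, `Series.drop` is an alternative that returns a new Series instead. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data for illustration only.
stock = pd.Series({'pens': 12, 'books': 4, 'lamps': 7})

# drop returns a new Series and leaves the original untouched.
without_books = stock.drop('books')

# del removes the element from the original Series itself.
del stock['lamps']

print(list(without_books.index))  # ['pens', 'lamps']
print(list(stock.index))          # ['pens', 'books']
```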


Checking existence of a key (membership)

In [21]:
'France' in g7_pop
Out[21]:
True
In [22]:
'Brasil' in g7_pop
Out[22]:
False
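Note that `in` tests the index labels, not the stored values, just as it tests keys in a dictionary. The sketch below uses a hypothetical `scores` Series to show the difference and one way to search the values instead.

```python
import pandas as pd

# Hypothetical data for illustration only.
scores = pd.Series({'ana': 90, 'bob': 75})

label_found = 'ana' in scores          # True: 'ana' is an index label
value_as_label = 90 in scores          # False: 90 is a value, not a label
value_found = 90 in scores.to_numpy()  # True: search the values explicitly

print(label_found, value_as_label, value_found)
```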


Introducing loc & iloc

What's the problem with the indexing we've seen so far? It's not explicit. Pandas receives an object to index with and tries to figure out whether we meant to select an element by its key or by its sequential position. Check out the following example:

In [23]:
s = pd.Series(
    ['a', 'b', 'c'],
    index=[1, 2, 3])
s
Out[23]:
1    a
2    b
3    c
dtype: object

What happens if we index s[1]? Should it return a or b?

In [25]:
s[1]
Out[25]:
'a'

In this case the lookup is resolved by index label, not by sequential position. But again, it's not intuitive or explicit.

Enter loc and iloc:

  • loc is the preferred way to select elements in Series (and DataFrames) by their index label
  • iloc is the preferred way to select elements by their sequential position
In [26]:
s.loc[1]
Out[26]:
'a'
In [27]:
s.iloc[1]
Out[27]:
'b'
In [28]:
g7_pop
Out[28]:
Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
In [29]:
g7_pop.iloc[-1]
Out[29]:
318.523
In [30]:
g7_pop.iloc[[0, 1]]
Out[30]:
Canada    40.5
France     NaN
Name: G7 Population in millions, dtype: float64

Using our previous series:

In [31]:
g7_pop
Out[31]:
Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
In [32]:
g7_pop.loc['Japan']
Out[32]:
127.061
In [33]:
g7_pop.iloc[-1]
Out[33]:
318.523
In [34]:
g7_pop
Out[34]:
Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
In [35]:
g7_pop.loc['Canada']
Out[35]:
40.5
In [36]:
g7_pop.iloc[0]
Out[36]:
40.5
In [37]:
g7_pop.iloc[-1]
Out[37]:
318.523
In [38]:
g7_pop.loc[['Japan', 'Canada']]
Out[38]:
Japan     127.061
Canada     40.500
Name: G7 Population in millions, dtype: float64
In [39]:
g7_pop.iloc[[0, -1]]
Out[39]:
Canada            40.500
United States    318.523
Name: G7 Population in millions, dtype: float64
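The difference between the two accessors is clearest with an integer index and with slices. The following sketch reuses the small `s` Series from above; the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])

# loc always works with labels; a label slice includes the stop label.
by_label = s.loc[1:2]

# iloc always works with positions; the stop position is excluded.
by_position = s.iloc[1:2]

# Position 0 exists, but there is no label 0, so loc raises KeyError.
first_by_position = s.iloc[0]
try:
    s.loc[0]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True

print(list(by_label))        # ['a', 'b']
print(list(by_position))     # ['b']
print(first_by_position, missing_label_raises)
```

Because each accessor has a single interpretation of its argument, the result never depends on what the index happens to contain.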

loc & iloc to modify Series

In [40]:
g7_pop.loc['United States'] = 1000

g7_pop
Out[40]:
Canada              40.500
France                 NaN
Germany             80.940
Italy               60.665
Japan              127.061
United Kingdom      64.511
United States     1000.000
Name: G7 Population in millions, dtype: float64
In [41]:
g7_pop.iloc[-1] = 500

g7_pop
Out[41]:
Canada             40.500
France                NaN
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.000
Name: G7 Population in millions, dtype: float64


Introduction to conditional selection

Another way to select certain values within a Series is conditional selection, also known as boolean selection.

We can index our Series using a list of boolean values:

In [42]:
g7_pop[[False, False,  True, False,  True, False,  True]]
Out[42]:
Germany           80.940
Japan            127.061
United States    500.000
Name: G7 Population in millions, dtype: float64

Or we can index our Series using another Series with boolean values:

In [43]:
condition = pd.Series([
    False, False,  True, False,  True, False,  True
], index=[
    'Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom', 'United States'
])

condition
Out[43]:
Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
dtype: bool
In [44]:
g7_pop[condition]
Out[44]:
Germany           80.940
Japan            127.061
United States    500.000
Name: G7 Population in millions, dtype: float64
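One detail worth knowing: a boolean Series is matched to the data by index label rather than by position, whereas a plain list of booleans is applied positionally. The sketch below uses hypothetical `temps` and `mask` Series whose labels are deliberately listed in different orders.

```python
import pandas as pd

# Hypothetical data for illustration only.
temps = pd.Series({'mon': 18.0, 'tue': 25.5, 'wed': 21.0})

# Same labels as temps, deliberately listed in a different order.
mask = pd.Series({'wed': True, 'mon': False, 'tue': True})

# The mask is aligned to temps by label before it is applied,
# so 'tue' and 'wed' are selected and 'mon' is dropped.
selected = temps[mask]

print(list(selected.index))  # ['tue', 'wed']
```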

In the next lecture we'll see how to use more complex conditional selections.
