Pandas Series

Last updated: May 7th, 2019

rmotr


Intro to Pandas

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed to work with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical arrays.

Data structures

To get started with pandas, you will need to get comfortable with its two main data structures: Series and DataFrames.

purple-divider

Hands on!

Importing pandas with the pd alias is a convention, similar to np for numpy:

In [1]:
import pandas as pd
import numpy as np

Pandas Series

A Series is a one-dimensional array-like object containing a typed sequence of values and an associated array of data labels, called its index.

Series creation

pd.Series' constructor accepts the following parameters:

  • data: (required) the data to store in the Series. It can be a scalar value, a Python sequence, a dict or a one-dimensional NumPy ndarray.
  • index: (optional) the labels to assign to the data values. It can be a Python sequence or a one-dimensional NumPy ndarray. Default value: RangeIndex(0, len(data)), i.e. the integers from 0 to len(data) - 1.
  • dtype: (optional) any NumPy data type.
In [10]:
series = pd.Series([1, 2, 3, 4, 5])
series
Out[10]:
0    1
1    2
2    3
3    4
4    5
dtype: int64

Series have an associated type:

In [11]:
series.dtype
Out[11]:
dtype('int64')
In [13]:
series = pd.Series([1, 2, 3, 4, 5], dtype=np.float64)
series
Out[13]:
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64
In [14]:
series.dtype
Out[14]:
dtype('float64')
In [15]:
series = pd.Series(['a', 'b', 'c', 'd', 'e'])
series
Out[15]:
0    a
1    b
2    c
3    d
4    e
dtype: object
In [6]:
# Using an ndarray
array = np.array([2, 4, 6, 8, 10])
series = pd.Series(array)
series
Out[6]:
0     2
1     4
2     6
3     8
4    10
dtype: int64
In [7]:
# With predefined index
series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
series
Out[7]:
a    1
b    2
c    3
d    4
e    5
dtype: int64
In [8]:
# Using a dictionary (index will be defined using keys)
series = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}, dtype=np.float64)
series
Out[8]:
a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
dtype: float64
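
The scalar case mentioned in the parameter list is not exercised by the cells above. A minimal sketch of it follows; the variable name `constant` is only illustrative and not part of the original notebook.

```python
import pandas as pd

# A scalar is repeated once per label, so an index is supplied
# to tell pandas how many elements to create.
constant = pd.Series(5, index=['a', 'b', 'c'])
print(constant)
```

Every label ends up holding the same value, which can be handy for initializing a Series before filling it in.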

green-divider

Series attributes

These are the most common attributes to get information about a Series:

In [ ]:
series = pd.Series(data=[1, 2, 3, 4, 5],
                   index=['a', 'b', 'c', 'd', 'e'],
                   dtype=np.float64)
series
In [ ]:
# Type of our Series
series.dtype
In [ ]:
# Values of a series
series.values
In [ ]:
type(series.values)
In [ ]:
# Index of a series
series.index
In [ ]:
# Dimension of the Series
series.ndim
In [ ]:
# Shape of the Series
series.shape
In [ ]:
# Number of Series elements
series.size
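
The attribute cells above are shown without their outputs. As a rough guide, this sketch rebuilds the same Series and prints each attribute, with the values one would expect noted in comments.

```python
import numpy as np
import pandas as pd

series = pd.Series(data=[1, 2, 3, 4, 5],
                   index=['a', 'b', 'c', 'd', 'e'],
                   dtype=np.float64)

print(series.dtype)          # float64
print(type(series.values))   # <class 'numpy.ndarray'>
print(list(series.index))    # ['a', 'b', 'c', 'd', 'e']
print(series.ndim)           # 1
print(series.shape)          # (5,)
print(series.size)           # 5
```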

green-divider

The Group of Seven

We'll start by analyzing "The Group of Seven" (G7), a political forum formed by Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. We'll begin with population, and for that, we'll use a pandas.Series object.

In [16]:
# In millions
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

g7_pop
Out[16]:
0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

Someone might not know we're representing population in millions of inhabitants. Series can have a name, to better document the purpose of the Series:

In [17]:
g7_pop.name = 'G7 Population in millions'

g7_pop
Out[17]:
0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

Series are pretty similar to NumPy arrays:

In [18]:
g7_pop.dtype
Out[18]:
dtype('float64')
In [19]:
type(g7_pop.values)
Out[19]:
numpy.ndarray
In [20]:
g7_pop.ndim
Out[20]:
1
In [21]:
g7_pop.shape
Out[21]:
(7,)
In [22]:
g7_pop.size
Out[22]:
7

At first glance they look like simple Python lists or NumPy arrays, but they're actually more similar to Python dicts.

In [23]:
g7_pop
Out[23]:
0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64
In [24]:
g7_pop.index
Out[24]:
RangeIndex(start=0, stop=7, step=1)

But, in contrast to lists, we can explicitly define the index:

In [25]:
g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]
In [26]:
g7_pop
Out[26]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

Compare it with the following table:

image

We can say that Series look like "ordered dictionaries". We can actually create Series out of dictionaries:

In [27]:
pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}, name='G7 Population in millions')
Out[27]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
In [28]:
pd.Series(
    [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'],
    name='G7 Population in millions')
Out[28]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

You can also create Series out of other series, specifying indexes:

In [29]:
pd.Series(g7_pop, index=['France', 'Germany', 'Italy', 'Spain'])
Out[29]:
France     63.951
Germany    80.940
Italy      60.665
Spain         NaN
Name: G7 Population in millions, dtype: float64
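
The NaN for Spain appears because the new Series is built by looking up each requested label in the source Series. A small sketch, using a shortened version of the data and an illustrative variable name `subset`, shows how such gaps can be detected or dropped:

```python
import pandas as pd

g7_pop = pd.Series({'France': 63.951, 'Germany': 80.940, 'Italy': 60.665})

# 'Spain' is not a label of g7_pop, so its value is missing (NaN).
subset = pd.Series(g7_pop, index=['France', 'Italy', 'Spain'])

print(subset.isna())    # True only for 'Spain'
print(subset.dropna())  # keeps 'France' and 'Italy'
```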

green-divider

Adding new elements to a Series

In [ ]:
g7_pop
In [ ]:
g7_pop['Brasil'] = 20.124

g7_pop
In [ ]:
g7_pop['India'] = 32.235

g7_pop

green-divider

Removing elements from a Series

In [ ]:
del g7_pop['Brasil']

g7_pop
In [ ]:
del g7_pop['India']

g7_pop
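
Both assignment to a new label and del modify the Series in place. As a hedged alternative, the sketch below uses Series.drop, which returns a new Series and leaves the original untouched. The names `pop` and `trimmed` are illustrative only.

```python
import pandas as pd

pop = pd.Series({'Canada': 35.467, 'France': 63.951})

# Assigning to a label that does not exist yet appends it in place.
pop['Brasil'] = 20.124

# drop returns a new Series without the label; pop itself is unchanged.
trimmed = pop.drop('Brasil')

print(pop)
print(trimmed)
```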

green-divider

Indexing

Indexing works similarly to lists and dictionaries.

Indexing by index

Use the index label of the element you're looking for:

In [ ]:
g7_pop['Canada']
In [ ]:
g7_pop['Japan']
In [ ]:
g7_pop['United Kingdom']
In [ ]:
g7_pop.Japan

Slicing also works but, importantly, in pandas the upper limit of a label slice is also included:

In [ ]:
g7_pop['Canada': 'Italy']

Selecting several elements at once with a list of labels also works (similar to NumPy):

In [ ]:
g7_pop[['Italy', 'France']]

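Indexing by sequential position

Indexing elements by their sequential position also works. In this case pandas evaluates the object received; if it doesn't exist as an index label, it falls back to sequential position. Note that recent pandas releases deprecate this fallback for integer keys and emit a FutureWarning, so iloc (covered in the next section) is the safer choice.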

In [ ]:
g7_pop
In [ ]:
g7_pop[2]
In [ ]:
g7_pop[4]
In [ ]:
g7_pop[0:2]
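
To make the difference between the two kinds of slicing concrete, here is a small sketch on a shortened Series. It uses loc and iloc explicitly; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665],
                   index=['Canada', 'France', 'Germany', 'Italy'])

# Label-based slice: the upper limit ('Germany') IS included.
by_label = g7_pop.loc['Canada':'Germany']

# Position-based slice: the upper limit (position 2) is NOT included.
by_position = g7_pop.iloc[0:2]

print(len(by_label))     # 3
print(len(by_position))  # 2
```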

Using loc & iloc

What's the problem with the indexing we've seen? It's not explicit. Pandas receives an element to index and it tries figuring out if we meant to select an element by its key, or its sequential position. Check out the following example:

In [ ]:
s = pd.Series(
    ['a', 'b', 'c'],
    index=[1, 2, 3])
s

What happens if we index s[1]? Should it return a or b?

In [ ]:
s[1]

In this case, the lookup is resolved by index label, not by sequential position, so s[1] returns a. But again, it's not intuitive or explicit.

Enter loc and iloc:

  • loc is the preferred way to select elements in Series (and DataFrames) by their index
  • iloc is the preferred way to select by sequential position
In [ ]:
s.loc[1]
In [ ]:
s.iloc[1]
In [ ]:
g7_pop
In [ ]:
g7_pop.iloc[-1]
In [ ]:
g7_pop.iloc[[0, 1]]
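
For the integer-labelled Series s defined earlier, the two accessors give different answers for the same key. A minimal sketch:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])

# loc looks the key up among the index labels 1, 2, 3.
print(s.loc[1])   # a

# iloc counts positions from 0, so position 1 is the second element.
print(s.iloc[1])  # b
```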

Using our previous series:

In [ ]:
g7_pop
In [ ]:
g7_pop.loc['Japan']
In [ ]:
g7_pop.iloc[-1]
In [35]:
g7_pop
Out[35]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
In [36]:
g7_pop.loc['Canada']
Out[36]:
35.467
In [37]:
g7_pop.iloc[0]
Out[37]:
35.467
In [38]:
g7_pop.iloc[-1]
Out[38]:
318.523
In [39]:
g7_pop.loc[['Japan', 'Canada']]
Out[39]:
Japan     127.061
Canada     35.467
Name: G7 Population in millions, dtype: float64
In [40]:
g7_pop.iloc[[0, -1]]
Out[40]:
Canada            35.467
United States    318.523
Name: G7 Population in millions, dtype: float64
In [52]:
g7_pop[[False, False,  True, False,  True, False,  True]]
Out[52]:
Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64
In [53]:
g7_pop > 70
Out[53]:
Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 Population in millions, dtype: bool
In [56]:
condition = pd.Series([
    False, False,  True, False,  True, False,  True
], index=[
    'Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom', 'United States'
])
In [57]:
g7_pop[condition]
Out[57]:
Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64
In [54]:
g7_pop.loc[g7_pop > 70]
Out[54]:
Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

Checking existence of a key:

In [ ]:
'France' in g7_pop
In [ ]:
'Brasil' in g7_pop
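
As with dictionaries, the in operator tests the index labels, not the stored values. The sketch below, on a shortened Series, illustrates the difference and one possible way to test values instead.

```python
import pandas as pd

g7_pop = pd.Series({'France': 63.951, 'Japan': 127.061})

print('France' in g7_pop)         # True: 'France' is a label
print(63.951 in g7_pop)           # False: 63.951 is a value, not a label
print(63.951 in g7_pop.values)    # True: checks the underlying array

# isin builds a boolean Series; any() collapses it to a single answer.
has_value = bool(g7_pop.isin([63.951]).any())
print(has_value)                  # True
```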

green-divider

Reindexing

Index objects are immutable, so we can't change individual labels in place. However, we can replace the whole index, or use reindex to build a new Series whose data conforms to a new index (the original Series is left unchanged).

In [ ]:
g7_pop
In [ ]:
g7_pop.index
In [ ]:
# Reorder current Series indexes
g7_pop.reindex(['France',
                'Germany',
                'Italy',
                'Canada',
                'Japan',
                'United Kingdom',
                'United States'])
In [ ]:
# Adding a new index value to a Series
g7_pop.reindex(['France',
                'Germany',
                'Italy',
                'Canada',
                'Japan',
                'United Kingdom',
                'United States',
                'Brasil'])
In [ ]:
# Adding a new index value to a Series, with default fill value
g7_pop.reindex(['France',
                'Germany',
                'Italy',
                'Canada',
                'Japan',
                'United Kingdom',
                'United States',
                'Brasil'], fill_value=0)
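
A short sketch on a two-element Series summarizes what the reindex calls above do. The names `reordered` and `filled` are illustrative.

```python
import pandas as pd

g7_pop = pd.Series({'Canada': 35.467, 'France': 63.951})

# Labels missing from the original get NaN by default...
reordered = g7_pop.reindex(['France', 'Canada', 'Brasil'])

# ...or the given fill_value instead.
filled = g7_pop.reindex(['France', 'Canada', 'Brasil'], fill_value=0)

print(reordered)
print(filled)
print(g7_pop)  # the original keeps its two labels, in the original order
```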

green-divider

Conditional selection (boolean arrays)

The same boolean array techniques we saw applied to NumPy arrays can be used for pandas Series:

In [30]:
g7_pop
Out[30]:
Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
In [ ]:
g7_pop
In [ ]:
g7_pop > 70
In [ ]:
g7_pop[g7_pop > 70]
In [ ]:
g7_pop.loc[g7_pop > 70]
In [ ]:
g7_pop.mean()
In [ ]:
g7_pop[g7_pop > g7_pop.mean()]
In [ ]:
g7_pop[(g7_pop > 70) | (g7_pop < 40)]
In [ ]:
g7_pop[(g7_pop > 80) & (g7_pop < 200)]
In [ ]:
g7_pop[g7_pop > g7_pop.mean()]
In [ ]:
g7_pop.std()
In [ ]:
g7_pop[(g7_pop > g7_pop.mean() - g7_pop.std() / 2) | (g7_pop > g7_pop.mean() + g7_pop.std() / 2)]
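
When combining conditions, each comparison needs its own parentheses, because | and & bind more tightly than > and <. The sketch below also shows ~ for negation; the variable names are illustrative.

```python
import pandas as pd

g7_pop = pd.Series({
    'Canada': 35.467, 'France': 63.951, 'Germany': 80.940,
    'Italy': 60.665, 'Japan': 127.061, 'United Kingdom': 64.511,
    'United States': 318.523,
})

above_mean = g7_pop[g7_pop > g7_pop.mean()]          # Japan, United States
not_above = g7_pop[~(g7_pop > g7_pop.mean())]        # the other five
mid_sized = g7_pop[(g7_pop > 60) & (g7_pop < 100)]   # both conditions hold

print(above_mean)
print(not_above)
print(mid_sized)
```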

green-divider

Operations and methods

Like NumPy arrays, Series support vectorized operations and aggregation functions:

In [ ]:
g7_pop.head(3)
In [ ]:
g7_pop * 1_000_000

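Series provide aggregation methods such as max, min, mean and std, and we can also apply any NumPy universal function (for example np.log) to a Series: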

In [ ]:
g7_pop.max()
In [ ]:
g7_pop.min()
In [ ]:
g7_pop.mean()
In [ ]:
g7_pop.std()
In [ ]:
g7_pop.quantile(.2)
In [ ]:
g7_pop.quantile(.8)
In [ ]:
np.log(g7_pop)
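
One detail worth noting is that element-wise operations return a new Series that keeps the original index. A minimal sketch on a shortened Series, with illustrative variable names:

```python
import numpy as np
import pandas as pd

g7_pop = pd.Series({'Canada': 35.467, 'Japan': 127.061})

# Arithmetic is applied element by element; labels are preserved.
inhabitants = g7_pop * 1_000_000

# NumPy universal functions also return a Series with the same index.
logs = np.log(g7_pop)

print(inhabitants)
print(logs)
```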

green-divider

Sorting Values

In [ ]:
g7_pop
In [ ]:
g7_pop.sort_values()
In [ ]:
g7_pop.sort_values(ascending=False)
In [ ]:
g7_pop
In [ ]:
g7_pop.sort_values(ascending=False, inplace=True)
In [ ]:
g7_pop
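
The cells above rely on the fact that sort_values returns a new Series unless inplace=True is passed. A short sketch on three countries, with `descending` as an illustrative name:

```python
import pandas as pd

g7_pop = pd.Series({'Canada': 35.467, 'Japan': 127.061, 'France': 63.951})

# Returns a sorted copy; g7_pop keeps its original order.
descending = g7_pop.sort_values(ascending=False)
print(list(descending.index))  # ['Japan', 'France', 'Canada']
print(list(g7_pop.index))      # ['Canada', 'Japan', 'France']

# With inplace=True the Series itself is reordered.
g7_pop.sort_values(inplace=True)
print(list(g7_pop.index))      # ['Canada', 'France', 'Japan']
```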

green-divider

Modifying series

In [ ]:
g7_pop['Canada'] = 40.5

g7_pop
In [ ]:
g7_pop.iloc[-1] = 500

g7_pop
In [ ]:
g7_pop[g7_pop < 70] = 99.99

g7_pop
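
Assignments like the ones above change the Series in place. If the original data should be preserved, one option is to work on a copy, as in this sketch (the name `updated` is illustrative):

```python
import pandas as pd

g7_pop = pd.Series({'Canada': 35.467, 'Germany': 80.940, 'Japan': 127.061})

# copy() gives an independent Series, so g7_pop is not affected below.
updated = g7_pop.copy()

# Every element matching the boolean condition receives the new value.
updated[updated < 70] = 99.99

print(updated)
print(g7_pop)
```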

purple-divider
