
Intro to Vectorized Operations Using NumPy

Last updated: October 31st, 2019

rmotr


Intro to Vectorized Operations using NumPy

"Vectorization" is an important concept in numeric computing. It refers to operating on entire arrays at once, in a concise and efficient way. Let's take a look at a couple of examples.

NumPy also uses the closely related term "array broadcasting", which describes how vectorized operations are applied to arrays of different shapes. We'll use both terms interchangeably.


Hands on!

We've already seen some "linear algebra" examples in our previous lessons. Those were examples of vectorized operations. For example, given these two arrays:

In [1]:
import numpy as np
In [2]:
a = np.arange(0, 5)
a
Out[2]:
array([0, 1, 2, 3, 4])
In [3]:
b = np.arange(5, 10)
b
Out[3]:
array([5, 6, 7, 8, 9])

Any arithmetic binary operation you apply to them as whole arrays is a vectorized operation:

In [4]:
a + b
Out[4]:
array([ 5,  7,  9, 11, 13])

In this case, the operation was performed "element-wise", i.e., "vectorized". Let's compare it with a regular for-loop operation:

In [5]:
res = []
for i in range(len(a)):
    res.append(a[i] + b[i])
np.array(res)
Out[5]:
array([ 5,  7,  9, 11, 13])

It's much easier to express the operation as a + b than to write an explicit for loop that performs the element-wise sum. This is the difference between declarative and imperative programming.
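
Besides being shorter, the vectorized form is usually faster on large arrays, because the element-wise loop runs in compiled code rather than in the Python interpreter. Here is a minimal sketch to check this on your own machine. The array size and the names x, y and loop_sum are assumptions chosen for illustration, and the timings will vary between machines:

```python
import timeit

import numpy as np

# Hypothetical larger arrays, so the difference is measurable
x = np.arange(100_000)
y = np.arange(100_000)

def loop_sum(x, y):
    # Element-wise sum with an explicit Python loop
    res = []
    for i in range(len(x)):
        res.append(x[i] + y[i])
    return np.array(res)

# Both approaches produce the same values
same = np.array_equal(loop_sum(x, y), x + y)
print(same)

# Time each approach (results depend on your machine)
loop_time = timeit.timeit(lambda: loop_sum(x, y), number=3)
vec_time = timeit.timeit(lambda: x + y, number=3)
print(f"loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s")
```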

Finally, compare this to the usage of regular Python lists:

In [6]:
l1 = list(range(5))
l1
Out[6]:
[0, 1, 2, 3, 4]
In [7]:
l2 = list(range(5, 10))
l2
Out[7]:
[5, 6, 7, 8, 9]
In [8]:
l1 + l2
Out[8]:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Python lists don't operate "algebraically"; for lists, the + operator is just concatenation.
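
To get an element-wise sum from plain lists, you have to pair the elements yourself or convert the lists to arrays first. A short sketch of both options (the names list_sum and array_sum are only illustrative):

```python
import numpy as np

l1 = list(range(5))
l2 = list(range(5, 10))

# Element-wise sum of two lists needs an explicit pairing with zip
list_sum = [x + y for x, y in zip(l1, l2)]
print(list_sum)  # [5, 7, 9, 11, 13]

# Converting to arrays gives the vectorized behavior of +
array_sum = np.array(l1) + np.array(l2)
print(array_sum)  # [ 5  7  9 11 13]
```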

Vectorized operations always work between arrays of the same shape, but they can also work between arrays of different shapes. For example:

In [9]:
A = np.array([
    [0, 0, 0],
    [1, 1, 1],
    [2, 2, 2],
    [3, 3, 3]
])
A
Out[9]:
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])
In [10]:
A.shape
Out[10]:
(4, 3)
In [11]:
b = np.array([[1, 2, 3]])
b
Out[11]:
array([[1, 2, 3]])
In [12]:
b.shape
Out[12]:
(1, 3)

Even though the shapes are different, we can still perform a vectorized operation between A and b:

In [13]:
A + b
Out[13]:
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5],
       [4, 5, 6]])

This is possible thanks to the following rule:

The Broadcasting Rule: Two arrays are compatible for broadcasting if for each trailing dimension (i.e., starting from the end) the axis lengths match or if either of the lengths is 1. Broadcasting is then performed over the missing or length 1 dimensions.
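
The rule can be sketched as a small check on two shape tuples. The helper is_broadcastable below is hypothetical (it is not part of NumPy) and only illustrates the rule as stated above. np.broadcast_shapes, available in NumPy 1.20 or later, computes the resulting shape directly:

```python
import numpy as np

def is_broadcastable(shape_a, shape_b):
    # Walk both shapes from the last axis backwards; zip stops at the
    # shorter shape, so missing leading axes are treated as compatible
    for len_a, len_b in zip(reversed(shape_a), reversed(shape_b)):
        if len_a != len_b and len_a != 1 and len_b != 1:
            return False
    return True

print(is_broadcastable((4, 3), (1, 3)))  # True
print(is_broadcastable((4, 3), (3,)))    # True
print(is_broadcastable((2, 3), (2,)))    # False

# NumPy (1.20 or later) can compute the resulting shape directly
print(np.broadcast_shapes((4, 3), (1, 3)))  # (4, 3)
```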

Graphically speaking, this is the operation we've performed:

[Figure: broadcasting]
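
The "stretching" of the smaller array can also be made explicit with np.broadcast_to. This is only a sketch to visualize the operation: NumPy does not need to build the stretched array in memory, and np.broadcast_to itself returns a read-only view rather than a copy. The name b_stretched is an assumption for illustration:

```python
import numpy as np

A = np.array([[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3]])
b = np.array([[1, 2, 3]])

# Conceptually, b is repeated along its length-1 axis to match A's shape
b_stretched = np.broadcast_to(b, A.shape)
print(b_stretched)

# Adding the stretched version gives the same result as A + b
print(np.array_equal(A + b_stretched, A + b))  # True
```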

Any operation between arrays that doesn't follow the Broadcasting Rule will fail with a broadcast-related error. For example:

In [14]:
A = np.arange(6).reshape(2, 3)
A
Out[14]:
array([[0, 1, 2],
       [3, 4, 5]])
In [15]:
b = np.array([5, 5])
b
Out[15]:
array([5, 5])
In [16]:
A + b
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-48207f55069c> in <module>
----> 1 A + b

ValueError: operands could not be broadcast together with shapes (2,3) (2,) 
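
The error occurs because the trailing axis lengths (3 and 2) differ and neither is 1. If the intent was to add one value per row (an assumption; the example above only demonstrates the error), one possible fix is to give b a trailing axis of length 1, so that its shape becomes (2, 1):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)
b = np.array([5, 5])

# b has shape (2,); adding a trailing axis gives shape (2, 1),
# which is compatible with (2, 3)
b_col = b[:, np.newaxis]
print(b_col.shape)  # (2, 1)

result = A + b_col
print(result)
# [[ 5  6  7]
#  [ 8  9 10]]
```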


Broadcasting with Scalars

Vectorized operations between arrays are usually reserved for more "advanced" or "scientific" uses. In your day-to-day work as a data analyst, you'll more often use broadcasting with scalars.

Broadcasting with scalars is probably the most intuitive of the vectorized operations:

In [17]:
a = np.arange(0, 6)
a
Out[17]:
array([0, 1, 2, 3, 4, 5])
In [18]:
a + 10
Out[18]:
array([10, 11, 12, 13, 14, 15])

As you can see, the + 10 operation was "broadcast" (or "distributed") across all the elements of the array. In pure Python, we'd usually do this with a list comprehension:

In [19]:
[x + 10 for x in a]
Out[19]:
[10, 11, 12, 13, 14, 15]

In NumPy, however, this "broadcasting" behavior is the default. Here are a few other examples:

In [20]:
a - 5
Out[20]:
array([-5, -4, -3, -2, -1,  0])
In [21]:
a * 3
Out[21]:
array([ 0,  3,  6,  9, 12, 15])
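
One way to think about scalar broadcasting is that the scalar behaves like an array of the same shape filled with that value. The sketch below builds that array explicitly with np.full_like to show the two forms agree. The name tens is illustrative only:

```python
import numpy as np

a = np.arange(0, 6)

# An explicit array holding the scalar 10 in every position
tens = np.full_like(a, 10)
print(tens)  # [10 10 10 10 10 10]

# Broadcasting the scalar gives the same result without building that array
print(np.array_equal(a + 10, a + tens))  # True
```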

A practical example is standardization, a common technique in Machine Learning that follows this formula:

$$ \large x'={\frac {x-{\bar {x}}}{\sigma }} $$

The objective is to make the data have mean 0 and standard deviation 1. Standardizing an array using NumPy broadcasting is as simple as:

In [22]:
(a - a.mean()) / a.std()
Out[22]:
array([-1.46385011, -0.87831007, -0.29277002,  0.29277002,  0.87831007,
        1.46385011])
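
A quick way to confirm the result is to check its mean and standard deviation. Note that a.std() computes the population standard deviation by default (ddof=0). The name standardized is an assumption for illustration:

```python
import numpy as np

a = np.arange(0, 6)
standardized = (a - a.mean()) / a.std()

# The standardized values should have mean 0 and standard deviation 1,
# up to floating point rounding
print(np.isclose(standardized.mean(), 0.0))  # True
print(np.isclose(standardized.std(), 1.0))   # True
```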


Vectorized boolean operations

As you might already know, aside from the regular arithmetic operators (+, -), we also have comparison operators (>, ==) that produce boolean values. These operations are also broadcast over arrays. Example:

In [23]:
a
Out[23]:
array([0, 1, 2, 3, 4, 5])
In [24]:
a > 2
Out[24]:
array([False, False, False,  True,  True,  True])
In [25]:
a == 0
Out[25]:
array([ True, False, False, False, False, False])

As you can see, the result of a broadcast boolean operation is a boolean array: an array containing the result of applying the boolean operation to each individual element of the array.

A common pattern is to use the function np.sum to count how many elements in the array satisfy a given condition. This works because True counts as 1 and False as 0 when summed. For example, how many elements of our array a are greater than 2? The answer is 3 (the elements 3, 4 and 5).

In [26]:
a > 2
Out[26]:
array([False, False, False,  True,  True,  True])
In [27]:
np.sum(a > 2)
Out[27]:
3
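
The same count can be obtained in a few equivalent ways, and the mean of a boolean array gives the fraction of elements that satisfy the condition. A short sketch (the name mask is illustrative):

```python
import numpy as np

a = np.arange(0, 6)
mask = a > 2

print(mask.sum())              # 3
print(np.count_nonzero(mask))  # 3
print(mask.mean())             # 0.5
```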

Boolean arrays will be very important for masks and filters later, so keep an eye on them.

Finally, it's important to note that vectorized operations (algebraic and boolean) with scalars will also work for multidimensional arrays:

In [28]:
A = np.arange(1, 10).reshape(3, 3)
A
Out[28]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
In [29]:
A + 100
Out[29]:
array([[101, 102, 103],
       [104, 105, 106],
       [107, 108, 109]])
In [30]:
A % 2
Out[30]:
array([[1, 0, 1],
       [0, 1, 0],
       [1, 0, 1]])
In [31]:
A % 2 == 0
Out[31]:
array([[False,  True, False],
       [ True, False,  True],
       [False,  True, False]])
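
The counting pattern with np.sum carries over to multidimensional boolean arrays, where the axis argument gives a count per column or per row. A minimal sketch, with is_even as an illustrative name:

```python
import numpy as np

A = np.arange(1, 10).reshape(3, 3)
is_even = A % 2 == 0

print(np.sum(is_even))          # 4
print(np.sum(is_even, axis=0))  # [1 2 1]
print(np.sum(is_even, axis=1))  # [1 2 1]
```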
