
Type of Plots

Last updated: July 2nd, 2019

rmotr


Type of plots

Previously, we saw an overview of how the pandas plot method works and how to use the basic matplotlib API. This lesson goes into more detail.

In this lesson we'll go through the most common plot types.

Hands on!

Matplotlib's default pyplot API has a global, MATLAB-style interface, as we've already seen:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

We'll use the mtcars and diamonds datasets to explain each plot type:

In [2]:
mtcars = pd.read_csv('data/mtcars.txt')

mtcars.head()
Out[2]:
mpg cyl disp hp drat wt qsec vs am gear carb
0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
In [3]:
diamonds = pd.read_csv('data/diamonds.txt')

diamonds.head()
Out[3]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326.0 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326.0 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327.0 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334.0 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335.0 4.34 4.35 2.75

OOP Interface

In [4]:
fig, axes = plt.subplots(figsize=(12, 6))
In [5]:
x = np.arange(-10, 11)
In [6]:
axes.plot(
    x, (x ** 2), color='red', linewidth=3,
    marker='o', markersize=8, label='X^2')

axes.plot(x, -1 * (x ** 2), 'b--', label='-X^2')

axes.set_xlabel('X')
axes.set_ylabel('X Squared')

axes.set_title("My Nice Plot")

axes.legend()

fig
Out[6]:

 Lines

This is the most common plot type. You can create it with the plot function which, according to the documentation, takes two kinds of parameters:

  • args: one or more groups of positional arguments of the form [x], y, [fmt]:
    • x: values to use on the X axis (optional; defaults to range(len(y)) for plain sequences).
    • y: values to use on the Y axis.
    • fmt: optional format string that sets the marker, line style and color, for example 'o--r'.
  • kwargs: keyword arguments (Line2D properties such as color, linewidth or marker) that set the style of all the lines drawn by the call.

Full value specification can be found in this link.
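
As a minimal sketch of the two styling approaches, the cell below draws one line with a format string and another with the equivalent keyword arguments. The data is made up for illustration (it does not come from mtcars), and the variable names are our own.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data for illustration only
x = np.arange(10)
y = x ** 2

fig, ax = plt.subplots()

# Format string: marker 'o', line style '--', color 'r' (red)
(line_fmt,) = ax.plot(x, y, 'o--r')

# The same style spelled out with keyword arguments (Line2D properties)
(line_kw,) = ax.plot(x, y + 10, color='r', linestyle='--', marker='o')
```

The format string is shorter, while keyword arguments are easier to read and give access to properties such as linewidth and markersize that the format string cannot express.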

In [7]:
plt.plot(mtcars['mpg'])
Out[7]:
[<matplotlib.lines.Line2D at 0x7f4a427b67f0>]
In [8]:
plt.plot(mtcars['mpg'])
plt.plot(mtcars['hp'])
Out[8]:
[<matplotlib.lines.Line2D at 0x7f4a42716a58>]
In [9]:
plt.plot(mtcars['mpg'], 'o--r',
         mtcars['hp'], ':^m')
Out[9]:
[<matplotlib.lines.Line2D at 0x7f4a426f2f98>,
 <matplotlib.lines.Line2D at 0x7f4a426fb4e0>]
In [10]:
plt.plot(mtcars['mpg'],
         color="red",
         linestyle="dashed",
         marker="o",
         linewidth=5,
         markersize=10)

plt.plot(mtcars['hp'],
         color = "magenta",
         linestyle="--",
         marker="^",
         linewidth=0.5,
         markersize=10)
Out[10]:
[<matplotlib.lines.Line2D at 0x7f4a42665390>]

 Area

Another commonly used plot is the area plot. It is a line plot in which the region between the line and the X axis is filled (below the line for positive values, above it for negative ones).

The function matplotlib offers for this type of graph is called stackplot. It takes the X values followed by one or more series of Y values, and stacks the series on top of each other.

Full value specification can be found in this link.
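
To make the stacking behaviour concrete, here is a small sketch with two made-up series (not taken from the datasets above). It assumes nothing beyond numpy and matplotlib. The top edge of the upper layer is the sum of both series.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up series for illustration only
x = np.arange(5)
y1 = np.array([1, 2, 3, 2, 1])
y2 = np.array([2, 2, 1, 3, 2])

fig, ax = plt.subplots()
layers = ax.stackplot(x, y1, y2, labels=['y1', 'y2'], alpha=0.5)
ax.legend(loc='upper left')

# The top edge of the second layer is the running total of both series
top_edge = np.cumsum([y1, y2], axis=0)[-1]
```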

In [11]:
mpg_cyl = mtcars['mpg']

plt.stackplot(mpg_cyl.index, mpg_cyl.values)
Out[11]:
[<matplotlib.collections.PolyCollection at 0x7f4a42646160>]
In [12]:
mpg_cyl = mtcars['mpg']
hp_cyl = mtcars['hp']

plt.stackplot(mpg_cyl.index, mpg_cyl.values, hp_cyl.values)
Out[12]:
[<matplotlib.collections.PolyCollection at 0x7f4a425ad358>,
 <matplotlib.collections.PolyCollection at 0x7f4a425ad5f8>]
In [13]:
mpg_cyl = mtcars['mpg']
hp_cyl = mtcars['hp']

plt.stackplot(mpg_cyl.index, mpg_cyl.values, hp_cyl.values,
              colors = ["red", "magenta"], alpha = 0.5)
Out[13]:
[<matplotlib.collections.PolyCollection at 0x7f4a42567f28>,
 <matplotlib.collections.PolyCollection at 0x7f4a4250c7b8>]

 Scatter

One of the most widely used plots is the dot plot, or scatter plot. For it, matplotlib offers the scatter function, which takes, at a minimum, a set of values for the X axis and a set of values for the Y axis.

Additionally, it provides a set of parameters that will allow us to control different visual characteristics of the represented points: size, alpha, color, point type, etc. In other words, the function includes specific parameters for each "aesthetic" of the graphic.

Full value specification can be found in this link.
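
As a rough sketch of how these aesthetics map to parameters, the cell below sets size (s), color (c with a cmap), transparency (alpha) and point type (marker) in a single call. The data is random and the variable names are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed so the figure is reproducible

# Made-up data for illustration only
n_points = 30
x = rng.random(n_points)
y = rng.random(n_points)
values = rng.random(n_points)        # mapped to color through the colormap
sizes = 200 * rng.random(n_points)   # marker areas, in points squared

fig, ax = plt.subplots()
points = ax.scatter(x, y, s=sizes, c=values, cmap='viridis',
                    alpha=0.6, marker='^')
fig.colorbar(points, ax=ax)
```

Note that s is an area rather than a radius, so doubling it does not double the width of a marker.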

In [14]:
plt.scatter(diamonds['carat'], diamonds['price'])
Out[14]:
<matplotlib.collections.PathCollection at 0x7f4a424f5208>
In [15]:
plt.scatter(diamonds['carat'], diamonds['price'],
            color="black",
            alpha=0.1)
Out[15]:
<matplotlib.collections.PathCollection at 0x7f4a4244f748>
In [16]:
colors = ['red' if x == 'Ideal' else 'blue' for x in diamonds['cut']]

plt.scatter(diamonds['carat'], diamonds['price'],
            alpha=0.5,
            color=colors)
Out[16]:
<matplotlib.collections.PathCollection at 0x7f4a42427f98>
In [17]:
# cut vs clarity, with the number of diamonds in each combination used as point size
x_values = []
y_values = []
sizes = []

for i, element_i in enumerate(diamonds['cut'].unique()):
    for j, element_j in enumerate(diamonds['clarity'].unique()):
        x_values.append(i)
        y_values.append(j)
        # count the rows in this (cut, clarity) group and scale the count down
        mask = (diamonds['cut'] == element_i) & (diamonds['clarity'] == element_j)
        sizes.append(mask.sum() / 10)

plt.scatter(x_values, y_values, s=sizes)
Out[17]:
<matplotlib.collections.PathCollection at 0x7f4a40b95780>
In [18]:
# cut vs price
cuts = list(diamonds['cut'].unique())

x_values = [cuts.index(element) for element in diamonds['cut']]
y_values = diamonds['price']

plt.scatter(x_values, y_values,
            color="darkblue", alpha=0.1)
Out[18]:
<matplotlib.collections.PathCollection at 0x7f4a40b7ea20>
In [19]:
# manual Jitter
cuts = list(diamonds['cut'].unique())

x_values = [cuts.index(element) + np.random.uniform(-0.3, 0.3) for element in diamonds['cut']]
y_values = diamonds['price']

plt.scatter(x_values, y_values,
            color="darkblue", alpha=0.1)
Out[19]:
<matplotlib.collections.PathCollection at 0x7f4a40a633c8>

Another example:

In [20]:
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (20 * np.random.rand(N))**2  # 0 to 20 point radii
In [21]:
plt.figure(figsize=(14, 6))

plt.scatter(x, y, s=area, c=colors, alpha=0.5, cmap='Spectral')
plt.colorbar()

plt.show()

We can also split the figure into two scatter plots:

In [22]:
fig = plt.figure(figsize=(14, 6))

ax1 = fig.add_subplot(1,2,1)
plt.scatter(x, y, s=area, c=colors, alpha=0.5, cmap='Pastel1')
plt.colorbar()

ax2 = fig.add_subplot(1,2,2)
plt.scatter(x, y, s=area, c=colors, alpha=0.5, cmap='Pastel2')
plt.colorbar()

plt.show()

The full list of available colormaps is at https://matplotlib.org/users/colormaps.html

In [23]:
diamonds.head(2)
Out[23]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326.0 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326.0 3.89 3.84 2.31

We can also build a matrix of scatter plots with the built-in pandas scatter_matrix() function. Only numeric columns are plotted, so cut is ignored here:

In [24]:
pd.plotting.scatter_matrix(diamonds[['carat', 'depth', 'cut']],
                           figsize=(10, 8))
Out[24]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f4a408b1a90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f4a409f12e8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f4a40a436a0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f4a409e0d30>]],
      dtype=object)

 Bars

To create bar charts, matplotlib provides two functions: bar and barh (for vertical and horizontal bars, respectively). Both take the positions of the bars along one axis and the length of each bar along the other.

Additionally, as with previous graphics, we will have specific parameters to control the different visual characteristics: bars centered on their value, width of the bars, ticks to be used in each of the bars, etc.

Full value specification can be found in this link.
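
The difference between the two functions can be sketched as follows, using made-up categories and counts. Here tick_label replaces the numeric positions with category names, and the variable names are our own.

```python
import matplotlib.pyplot as plt

# Made-up categories and counts for illustration only
labels = ['a', 'b', 'c']
counts = [3, 7, 5]
positions = list(range(len(labels)))

fig, (ax_v, ax_h) = plt.subplots(1, 2, figsize=(10, 4))

# Vertical bars: positions on the X axis, heights on the Y axis
bars_v = ax_v.bar(positions, counts, width=0.5, tick_label=labels)

# Horizontal bars: positions on the Y axis, widths on the X axis
bars_h = ax_h.barh(positions, counts, height=0.5, tick_label=labels)
```

In barh the roles of width and height are swapped: height is the thickness of each bar and width is its length.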

In [25]:
Y = np.random.rand(5)
Y2 = np.random.rand(5)
In [26]:
plt.bar(np.arange(len(Y)), Y,
        width=0.5,
        color='#00b894')
Out[26]:
<BarContainer object of 5 artists>
In [27]:
plt.bar(np.arange(len(Y)) - 0.15, Y,
        width=0.3,
        color='#00b894',
        label='Label Y')

plt.bar(np.arange(len(Y2)) + 0.15, Y2,
        width=0.3,
        color='#e17055',
        label='Label Y2')
Out[27]:
<BarContainer object of 5 artists>

Bars can also be stacked, and we can add a legend to the plot:

In [28]:
plt.bar(np.arange(len(Y)), Y,
        width=0.5,
        color='#00b894',
        label='Label Y')

plt.bar(np.arange(len(Y2)), Y2,
        width=0.5,
        color='#e17055',
        bottom=Y,
        label='Label Y2')

plt.legend()
Out[28]:
<matplotlib.legend.Legend at 0x7f4a406a0828>

 Stacked bars example

In [29]:
df = pd.DataFrame([[3,2,3,4,5], [3,2,3,4,5], [4,3,2,1,2], [4,3,2,1,2]])
df.columns = ['DevType1', 'DevType2', 'DevType3', 'DevType4', 'DevType5']
df.index = pd.Series(['0-2 years', '2-4 years', '4-6 years', '6-8 years'], name='YearsCoding')

df
Out[29]:
DevType1 DevType2 DevType3 DevType4 DevType5
YearsCoding
0-2 years 3 2 3 4 5
2-4 years 3 2 3 4 5
4-6 years 4 3 2 1 2
6-8 years 4 3 2 1 2
In [30]:
df.sum(axis=1)
Out[30]:
YearsCoding
0-2 years    17
2-4 years    17
4-6 years    12
6-8 years    12
dtype: int64

 Using matplotlib

In [31]:
index = np.arange(df.index.size)
bar_size = 0.4
plt.figure(figsize=(16,8))

for i, x in enumerate(df.columns):
    if i==0:
        plt.bar(x=index, height=df.iloc[:,i], width=bar_size)
    else:
        # we should define 'bottom' to make our stack
        plt.bar(x=index, height=df.iloc[:,i], width=bar_size, bottom=df.iloc[:,:i].sum(axis=1))

plt.title('DevType per YearsCoding')
plt.ylabel('Count')
plt.ylim(0, df.sum(axis=1).max() + 1)
plt.xticks(index, df.index, rotation=0)
plt.legend(df.columns)
Out[31]:
<matplotlib.legend.Legend at 0x7f4a4063ba20>
In [32]:
index = np.arange(df.index.size)
bar_size = 0.4
plt.figure(figsize=(16,8))

for i, x in enumerate(df.columns):
    if i==0:
        plt.barh(y=index, width=df.iloc[:,i], height=bar_size)
    else:
        # we should define 'left' to make our stack
        plt.barh(y=index, width=df.iloc[:,i], height=bar_size, left=df.iloc[:,:i].sum(axis=1))

plt.title('DevType per YearsCoding')
plt.xlabel('Count')
plt.xlim(0, df.sum(axis=1).max() + 1)
plt.yticks(index, df.index, rotation=0)
plt.legend(df.columns)
Out[32]:
<matplotlib.legend.Legend at 0x7f4a40314dd8>

 Using pandas

In [33]:
fig, ax = plt.subplots(figsize=(16,8))

df.plot(kind='bar', stacked=True, ax=ax)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4a4055eba8>
In [34]:
fig, ax = plt.subplots(figsize=(16,8))

df.plot(kind='barh', stacked=True, ax=ax)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4a404f90f0>

 Pie

Matplotlib offers a direct way to create pie charts: the pie function.

To build a pie chart, we pass the function a single set of values, and it draws each value as a wedge proportional to its share of the total.

As with the previous plot types, the function offers a set of visual parameters to customize the resulting graphic: colors, percentages shown inside the wedges, highlighted wedges, etc.

Full value specification can be found in this link.
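
As a small sketch of how the proportions are derived, the cell below uses made-up counts and checks the share of each value by hand before plotting. The names and numbers are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up counts for illustration only
labels = ['a', 'b', 'c']
counts = np.array([50, 30, 20])

# Each wedge spans its share of the total
shares = counts / counts.sum()

fig, ax = plt.subplots(figsize=(5, 5))
wedges, texts, autotexts = ax.pie(
    counts,
    labels=labels,
    autopct='%.1f%%',      # print the percentage inside each wedge
    explode=[0.1, 0, 0],   # pull the first wedge away from the center
)
```

The first value is half of the total, so its wedge covers half of the circle.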

In [35]:
cut_count = diamonds['cut'].value_counts()

plt.pie(cut_count)
plt.show()
In [36]:
plt.figure(figsize=(6, 6))

cut_count = diamonds['cut'].value_counts()

plt.pie(cut_count, autopct="%.2f%%")
plt.show()
In [37]:
plt.figure(figsize=(6, 6))

cut_count = diamonds['cut'].value_counts()

plt.pie(cut_count, autopct="%.2f%%", labels=cut_count.index)
plt.show()
In [38]:
plt.figure(figsize=(6, 6))

cut_count = diamonds['cut'].value_counts()

# no offset for any wedge, except the largest one
explode = pd.Series(0.0, index=cut_count.index)
explode[cut_count.idxmax()] = 0.25

plt.pie(cut_count, autopct="%.2f", labels=cut_count.index, explode=explode)
plt.show()

 Histograms

The hist function creates a histogram that represents the distribution of a numeric variable. The set of numeric values is the only mandatory parameter.

Additionally, we will have parameters to control: number of bins in the histogram, normalization of the histogram (so that we obtain densities), indication of whether a cumulative histogram should be performed, etc.

Full value specification can be found in this link.
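
To see what these parameters do to the bar heights, here is a sketch on a made-up random sample (not the diamonds data). It compares the raw counts with the density and cumulative versions of the same histogram.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed so the figure is reproducible
data = rng.normal(size=1000)    # made-up sample for illustration only

fig, (ax_count, ax_density, ax_cum) = plt.subplots(1, 3, figsize=(14, 4))

# Raw counts per bin
counts, edges, _ = ax_count.hist(data, bins=20)

# density=True rescales the bars so that their total area is 1
density, _, _ = ax_density.hist(data, bins=20, density=True)

# cumulative=True makes each bar the running total of the counts so far
cumulative, _, _ = ax_cum.hist(data, bins=20, cumulative=True)
```

The last bar of the cumulative histogram equals the sample size, which matches the final value seen in the cumulative output for carat.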

In [39]:
plt.hist(diamonds['carat'],
         color='#3498db')
Out[39]:
(array([2.3350e+03, 2.4980e+03, 1.0068e+04, 5.4230e+03, 2.2380e+03,
        3.2000e+02, 1.5000e+02, 2.4000e+01, 7.0000e+00, 9.0000e+00]),
 array([0.2  , 0.491, 0.782, 1.073, 1.364, 1.655, 1.946, 2.237, 2.528,
        2.819, 3.11 ]),
 <a list of 10 Patch objects>)
In [40]:
plt.hist(diamonds['carat'],
         bins=20,
         color='#3498db')
Out[40]:
(array([1.979e+03, 3.560e+02, 1.940e+02, 2.304e+03, 3.024e+03, 7.044e+03,
        3.481e+03, 1.942e+03, 7.980e+02, 1.440e+03, 2.890e+02, 3.100e+01,
        1.260e+02, 2.400e+01, 1.700e+01, 7.000e+00, 2.000e+00, 5.000e+00,
        0.000e+00, 9.000e+00]),
 array([0.2   , 0.3455, 0.491 , 0.6365, 0.782 , 0.9275, 1.073 , 1.2185,
        1.364 , 1.5095, 1.655 , 1.8005, 1.946 , 2.0915, 2.237 , 2.3825,
        2.528 , 2.6735, 2.819 , 2.9645, 3.11  ]),
 <a list of 20 Patch objects>)
In [41]:
plt.hist(diamonds['carat'],
         bins=20,
         color='#3498db',
         histtype='step')
Out[41]:
(array([1.979e+03, 3.560e+02, 1.940e+02, 2.304e+03, 3.024e+03, 7.044e+03,
        3.481e+03, 1.942e+03, 7.980e+02, 1.440e+03, 2.890e+02, 3.100e+01,
        1.260e+02, 2.400e+01, 1.700e+01, 7.000e+00, 2.000e+00, 5.000e+00,
        0.000e+00, 9.000e+00]),
 array([0.2   , 0.3455, 0.491 , 0.6365, 0.782 , 0.9275, 1.073 , 1.2185,
        1.364 , 1.5095, 1.655 , 1.8005, 1.946 , 2.0915, 2.237 , 2.3825,
        2.528 , 2.6735, 2.819 , 2.9645, 3.11  ]),
 <a list of 1 Patch objects>)
In [42]:
plt.hist(diamonds['carat'],
         bins=20,
         color='#3498db',
         orientation='horizontal')
Out[42]:
(array([1.979e+03, 3.560e+02, 1.940e+02, 2.304e+03, 3.024e+03, 7.044e+03,
        3.481e+03, 1.942e+03, 7.980e+02, 1.440e+03, 2.890e+02, 3.100e+01,
        1.260e+02, 2.400e+01, 1.700e+01, 7.000e+00, 2.000e+00, 5.000e+00,
        0.000e+00, 9.000e+00]),
 array([0.2   , 0.3455, 0.491 , 0.6365, 0.782 , 0.9275, 1.073 , 1.2185,
        1.364 , 1.5095, 1.655 , 1.8005, 1.946 , 2.0915, 2.237 , 2.3825,
        2.528 , 2.6735, 2.819 , 2.9645, 3.11  ]),
 <a list of 20 Patch objects>)
In [43]:
plt.hist(diamonds['carat'],
         bins=20,
         color='#3498db',
         cumulative=True)
Out[43]:
(array([ 1979.,  2335.,  2529.,  4833.,  7857., 14901., 18382., 20324.,
        21122., 22562., 22851., 22882., 23008., 23032., 23049., 23056.,
        23058., 23063., 23063., 23072.]),
 array([0.2   , 0.3455, 0.491 , 0.6365, 0.782 , 0.9275, 1.073 , 1.2185,
        1.364 , 1.5095, 1.655 , 1.8005, 1.946 , 2.0915, 2.237 , 2.3825,
        2.528 , 2.6735, 2.819 , 2.9645, 3.11  ]),
 <a list of 20 Patch objects>)

Also, histograms can be grouped by a certain column:

In [44]:
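cuts = diamonds['cut'].unique()
# one call with a list of arrays: stacked=True has no effect on a single dataset
plt.hist([diamonds.loc[diamonds['cut'] == c, 'carat'].values for c in cuts],
         stacked=True, label=cuts)
plt.legend()
plt.show()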
In [45]:
diamonds.hist(column='carat', by='cut',
              sharex=True,
              sharey=True,
              layout=(1, 5),
              color='#3498db',
              figsize=(12, 4))
Out[45]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f4a3be319e8>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f4a3bde1be0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f4a3bd8afd0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f4a3bdbd400>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f4a3bd667f0>],
      dtype=object)

KDE (kernel density estimation)

In [46]:
from scipy import stats

values = np.random.randn(1000)

density = stats.gaussian_kde(values)

density
Out[46]:
<scipy.stats.kde.gaussian_kde at 0x7f4a3bcb19e8>
In [47]:
plt.subplots(figsize=(12, 6))

values2 = np.linspace(min(values)-10, max(values)+10, 100)

plt.plot(values2, density(values2), color='#FF7F00')
plt.fill_between(values2, 0, density(values2), alpha=0.5, color='#FF7F00')
plt.xlim(xmin=-5, xmax=5)

plt.show()

 Combine plots

In [48]:
plt.subplots(figsize=(12, 6))

plt.hist(values, bins=100, alpha=0.8, density=True,
         histtype='bar', color='steelblue',
         edgecolor='green')

plt.plot(values2, density(values2), color='#FF7F00', linewidth=3.0)
plt.xlim(xmin=-5, xmax=5)

plt.show()

 Boxplots

Another plot type is the well-known boxplot. For this, we have the boxplot function that will receive a set of values on which to calculate the ranges, medians, whiskers and outliers.

As before, a wide set of optional parameters lets us control, among other things, the marker used for outliers and whether the boxes and whiskers are shown.

Full value specification can be found in this link.

 But what exactly is a boxplot?

A boxplot shows the quartiles (and outliers) of one or more numerical variables using the five-number summary:

  • min = minimum value
  • 25% = first quartile (Q1) = median of the lower half of the data
  • 50% = second quartile (Q2) = median of the data
  • 75% = third quartile (Q3) = median of the upper half of the data
  • max = maximum value

This summary is more useful than the mean and standard deviation for describing skewed distributions.

The box spans the interquartile range (IQR):

$$ \text{IQR} = Q3 - Q1 $$

Points outside the following limits are drawn as outliers:

$$ x < Q1 - 1.5 \times \text{IQR} $$

$$ x > Q3 + 1.5 \times \text{IQR} $$
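
The quartiles, IQR and outlier limits can be computed by hand, as in this minimal sketch on a made-up sample. It assumes np.percentile with its default linear interpolation, so the quartiles can differ slightly from the "median of each half" definition given in the list.

```python
import numpy as np

# Made-up sample for illustration only; 40 is far from the rest
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])

# Quartiles (np.percentile interpolates linearly between data points)
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which a point counts as an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = data[(data < lower_limit) | (data > upper_limit)]
```

For this sample the only value outside the limits is 40, which a boxplot would draw as a separate point.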
In [49]:
plt.boxplot(diamonds['carat'])

plt.show()

Also, boxplots can be grouped by a certain column:

In [50]:
values_boxplot = []

for i in diamonds['cut'].unique():
    values_boxplot.append(list(diamonds[diamonds['cut'] == i].loc[:, 'carat'].values))

plt.boxplot(values_boxplot)

plt.show()