
Type of Plots

Last updated: March 27th, 2019

rmotr



Previously, we saw an overview of how the pandas plot method works and how to use the basic matplotlib API. This lesson goes into more detail.

In this lecture we'll go through the most common plot types.


Hands on!

Matplotlib's default pyplot API has a global, MATLAB-style interface, as we've already seen:

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

We'll use the mtcars and diamonds datasets to explain each plot type:

In [ ]:
mtcars = pd.read_table('data/mtcars.txt', sep=",")

mtcars.head()
In [ ]:
diamonds = pd.read_table('data/diamonds.txt', sep=",")

diamonds.head()


OOP Interface

In [ ]:
fig, axes = plt.subplots(figsize=(12, 6))
In [ ]:
x = np.arange(-10, 11)
In [ ]:
axes.plot(
    x, (x ** 2), color='red', linewidth=3,
    marker='o', markersize=8, label='X^2')

axes.plot(x, -1 * (x ** 2), 'b--', label='-X^2')

axes.set_xlabel('X')
axes.set_ylabel('X Squared')

axes.set_title("My Nice Plot")

axes.legend()

fig


 Lines

This is the most common plot type. You can create it with the plot function which, according to the documentation, takes two parameters:

  • args: one or more groups of positional arguments of the form x, y, fmt:
    • x: values to use on the X axis (optional; if omitted, the positions 0..N-1, or the index of a pandas Series, are used).
    • y: values to use on the Y axis.
    • fmt: optional format string that sets marker, line style and color in one go (for example 'o--r').
  • kwargs: keyword arguments that set the style of all the lines drawn by the call (color, linestyle, marker, linewidth, etc.).

Full value specification can be found in this link.
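
As a small illustration of how the positional groups are read, the sketch below draws the same style twice: once with a format string and once with keyword arguments. It uses made-up data (x and its square) instead of the lesson's datasets, and the variable names are only for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(10)  # made-up data for illustration

fig, ax = plt.subplots()

# Format string: marker 'o', line style '--', color 'r'
(line_fmt,) = ax.plot(x, x ** 2, 'o--r')

# The same style spelled out with keyword arguments
(line_kw,) = ax.plot(x, -(x ** 2), marker='o', linestyle='--', color='r')
```

Both calls should produce lines with identical styling; the format string is simply a shorthand for the three keyword arguments.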

In [ ]:
plt.plot(mtcars['mpg'])
In [ ]:
plt.plot(mtcars['mpg'])
plt.plot(mtcars['hp'])
In [ ]:
plt.plot(mtcars['mpg'], 'o--r',
         mtcars['hp'], ':^m')
In [ ]:
plt.plot(mtcars['mpg'],
         color="red",
         linestyle="dashed",
         marker="o",
         linewidth=5,
         markersize=10)

plt.plot(mtcars['hp'],
         color = "magenta",
         linestyle="--",
         marker="^",
         linewidth=0.5,
         markersize=10)


 Area

Another commonly used plot is the area plot. It is a line plot in which the area between the line and the X axis is filled (below the line for positive values, above it for negative values).

The function matplotlib offers for this type of graph is called stackplot. Like plot, it takes the X values first, followed by one or more series of Y values, which are drawn stacked on top of each other.

Full value specification can be found in this link.
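
To make the stacking behaviour concrete, here is a minimal sketch with two small made-up series (y1 and y2 are hypothetical, not taken from mtcars). The second layer is drawn on top of the first, so the upper edge of the chart is the sum of both series.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data for illustration
x = np.arange(5)
y1 = np.array([1, 2, 3, 2, 1])
y2 = np.array([2, 2, 1, 3, 2])

fig, ax = plt.subplots()

# One filled layer is returned per Y series
layers = ax.stackplot(x, y1, y2, labels=['y1', 'y2'], alpha=0.5)
ax.legend(loc='upper left')

# The top edge of the stacked chart is the element-wise sum
top_edge = y1 + y2
```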

In [ ]:
mpg_cyl = mtcars['mpg']

plt.stackplot(mpg_cyl.index, mpg_cyl.values)
In [ ]:
mpg_cyl = mtcars['mpg']
hp_cyl = mtcars['hp']

plt.stackplot(mpg_cyl.index, mpg_cyl.values, hp_cyl.values)
In [ ]:
mpg_cyl = mtcars['mpg']
hp_cyl = mtcars['hp']

plt.stackplot(mpg_cyl.index, mpg_cyl.values, hp_cyl.values,
              colors = ["red", "magenta"], alpha = 0.5)


 Scatter

One of the most used graphics is the so-called dot plot or scatter plot. For this, matplotlib offers the scatter function, which receives, as a minimum, a set of values for the X axis and a set of values for the Y axis.

Additionally, it provides a set of parameters that will allow us to control different visual characteristics of the represented points: size, alpha, color, point type, etc. In other words, the function includes specific parameters for each "aesthetic" of the graphic.

Full value specification can be found in this link.

In [ ]:
plt.scatter(diamonds['carat'], diamonds['price'])
In [ ]:
plt.scatter(diamonds['carat'], diamonds['price'],
            color="black",
            alpha=0.1)
In [ ]:
colors = ['red' if x == 'Ideal' else 'blue' for x in diamonds['cut']]

plt.scatter(diamonds['carat'], diamonds['price'],
            alpha=0.5,
            color=colors)
In [ ]:
# cut vs clarity, with its proportion used as point size
x_values = []
y_values = []
sizes = []

for i, element_i in enumerate(diamonds['cut'].unique()):
    for j, element_j in enumerate(diamonds['clarity'].unique()):
        x_values.append(i)
        y_values.append(j)
        sizes.append(diamonds[(diamonds['cut'] == element_i) & (diamonds['clarity'] == element_j)].size / 100)
        
plt.scatter(x_values, y_values, s=sizes)
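
The nested loop above can probably be written more compactly with pd.crosstab, which counts the rows for every pair of categories in one step. The sketch below uses a tiny made-up frame as a stand-in for diamonds; the scaling factor of 100 for the point size is an arbitrary choice for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Tiny stand-in for the diamonds data (hypothetical values)
df = pd.DataFrame({
    'cut': ['Ideal', 'Ideal', 'Good', 'Good', 'Good', 'Fair'],
    'clarity': ['SI1', 'VS1', 'SI1', 'SI1', 'VS1', 'VS1'],
})

# Count rows for every (cut, clarity) pair
counts = pd.crosstab(df['cut'], df['clarity'])

# Grid positions: xs[i, j] = i (cut), ys[i, j] = j (clarity)
xs, ys = np.meshgrid(np.arange(counts.shape[0]),
                     np.arange(counts.shape[1]),
                     indexing='ij')

fig, ax = plt.subplots()
ax.scatter(xs.ravel(), ys.ravel(), s=counts.values.ravel() * 100)

# Label the ticks with the category names
ax.set_xticks(np.arange(counts.shape[0]))
ax.set_xticklabels(counts.index)
ax.set_yticks(np.arange(counts.shape[1]))
ax.set_yticklabels(counts.columns)
```

Unlike the loop version, this also labels the axes with the category names instead of bare positions.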
In [ ]:
# cut vs price
cuts = list(diamonds['cut'].unique())

x_values = [cuts.index(element) for element in diamonds['cut']]
y_values = diamonds['price']

plt.scatter(x_values, y_values,
            color="darkblue", alpha=0.1)
In [ ]:
# manual Jitter
cuts = list(diamonds['cut'].unique())

x_values = [cuts.index(element) + np.random.uniform(-0.3, 0.3) for element in diamonds['cut']]
y_values = diamonds['price']

plt.scatter(x_values, y_values,
            color="darkblue", alpha=0.1)

Another example:

In [ ]:
N = 50
x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (20 * np.random.rand(N))**2  # 0 to 20 point radii
In [ ]:
plt.figure(figsize=(14, 6))

plt.scatter(x, y, s=area, c=colors, alpha=0.5, cmap='Spectral')
plt.colorbar()

plt.show()

We can also split the figure into two scatter plots:

In [ ]:
fig = plt.figure(figsize=(14, 6))

ax1 = fig.add_subplot(1,2,1)
plt.scatter(x, y, s=area, c=colors, alpha=0.5, cmap='Pastel1')
plt.colorbar()

ax2 = fig.add_subplot(1,2,2)
plt.scatter(x, y, s=area, c=colors, alpha=0.5, cmap='Pastel2')
plt.colorbar()

plt.show()

The full list of available colormaps is here: https://matplotlib.org/users/colormaps.html

In [ ]:
diamonds.head(2)

We can also build a matrix of scatter plots using the built-in pandas scatter_matrix() function (only numeric columns are plotted):

In [ ]:
pd.plotting.scatter_matrix(diamonds[['carat', 'depth', 'cut']],
                           figsize=(10, 8))


 Bars

To create bar charts, matplotlib provides two functions, bar and barh, depending on the orientation we want (vertical or horizontal). As before, we pass the positions of the bars along one axis and their values along the other.

Additionally, as with previous graphics, we will have specific parameters to control the different visual characteristics: bars centered on their value, width of the bars, ticks to be used in each of the bars, etc.

Full value specification can be found in this link.
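
Since the examples below only use bar, here is a minimal sketch of the horizontal variant, barh. The values and labels are made up for illustration. Note that barh takes height for the thickness of each bar, where bar takes width.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data for illustration
values = np.array([3, 7, 5, 2, 6])
labels = ['A', 'B', 'C', 'D', 'E']
positions = np.arange(len(values))

fig, ax = plt.subplots()

# Horizontal bars: positions go on the Y axis, values along the X axis
bars = ax.barh(positions, values, height=0.5, color='#00b894')

# Replace the numeric positions with readable labels
ax.set_yticks(positions)
ax.set_yticklabels(labels)
```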

In [ ]:
Y = np.random.rand(1, 5)[0]
Y2 = np.random.rand(1, 5)[0]
In [ ]:
plt.bar(np.arange(len(Y)), Y,
        width=0.5,
        color='#00b894')
In [ ]:
plt.bar(np.arange(len(Y)) - 0.15, Y,
        width=0.3,
        color='#00b894',
        label='Label Y')

plt.bar(np.arange(len(Y2)) + 0.15, Y2,
        width=0.3,
        color='#e17055',
        label='Label Y2')

Bars can also be stacked, and we can add a legend to the plot:

In [ ]:
plt.bar(np.arange(len(Y)), Y,
        width=0.5,
        color='#00b894',
        label='Label Y')

plt.bar(np.arange(len(Y2)), Y2,
        width=0.5,
        color='#e17055',
        bottom=Y,
        label='Label Y2')

plt.legend()


 Pie

Matplotlib offers a direct way to create pie charts: the pie function.

To build a pie chart, we provide the function with a single set of values, and matplotlib draws each value's share of the total.

As before, the function offers a set of visual parameters to customize the resulting chart: colors, percentages shown inside the wedges, highlighted wedges, etc.

Full value specification can be found in this link.
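
As a quick check of how the proportions are derived, the sketch below computes the shares by hand and compares them with the percentage labels that pie draws when autopct is set. The counts are made up; they are not the real diamonds counts.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up counts for illustration
counts = pd.Series([50, 30, 20], index=['Ideal', 'Premium', 'Good'])

# Share of each value over the total, in percent
shares = counts * 100 / counts.sum()

fig, ax = plt.subplots()

# With autopct set, pie also returns the percentage text objects
wedges, texts, autotexts = ax.pie(counts, labels=counts.index, autopct='%.1f%%')

percent_labels = [t.get_text() for t in autotexts]
```

Here percent_labels should match the values in shares, formatted with one decimal place.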

In [ ]:
cut_count = diamonds['cut'].value_counts()

plt.pie(cut_count)

plt.show()
In [ ]:
cut_count = diamonds['cut'].value_counts()

plt.pie(cut_count, autopct="%.2f%%")

plt.show()
In [ ]:
cut_count = diamonds['cut'].value_counts()

plt.pie(cut_count, autopct="%.2f%%", labels=cut_count.index)

plt.show()
In [ ]:
cut_count = diamonds['cut'].value_counts()

# Offset of each wedge from the center: 0 for all but the largest one
explode = pd.Series(0.0, index=cut_count.index)
explode[cut_count.idxmax()] = 0.25

plt.pie(cut_count, autopct="%.2f%%", labels=cut_count.index, explode=explode)

plt.show()


 Histograms

Using the hist function, matplotlib lets us create histograms that show the distribution of a numeric variable. The set of numeric values is the only mandatory parameter.

Additionally, we will have parameters to control: number of bins in the histogram, normalization of the histogram (so that we obtain densities), indication of whether a cumulative histogram should be performed, etc.

Full value specification can be found in this link.
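
To illustrate what the normalization does, the sketch below plots a density histogram of random data (not the diamonds data) and checks that the bars cover a total area of 1. The seed and sample size are arbitrary choices for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random sample for illustration
rng = np.random.default_rng(0)
data = rng.normal(size=500)

fig, ax = plt.subplots()

# hist returns the bar heights, the bin edges and the drawn patches
heights, edges, patches = ax.hist(data, bins=20, density=True, color='#3498db')

# With density=True, height * bin width sums to 1 over all bins
area = np.sum(heights * np.diff(edges))
```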

In [ ]:
plt.hist(diamonds['carat'],
         color='#3498db')
In [ ]:
plt.hist(diamonds['carat'],
         bins=20,
         color='#3498db')
In [ ]:
plt.hist(diamonds['carat'],
         bins=20,
         color='#3498db',
         histtype='step')
In [ ]:
plt.hist(diamonds['carat'],
         bins=20,
         color='#3498db',
         orientation='horizontal')
In [ ]:
plt.hist(diamonds['carat'],
         bins=20,
         color='#3498db',
         cumulative=True)

Also, histograms can be grouped by a certain column:

In [ ]:
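cuts = diamonds['cut'].unique()
carat_by_cut = [diamonds.loc[diamonds['cut'] == cut, 'carat'].values for cut in cuts]

# stacked=True only has an effect when all groups are passed in a single call
plt.hist(carat_by_cut, stacked=True, label=list(cuts))
plt.legend()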
In [ ]:
diamonds.hist(column='carat', by='cut',
              sharex=True,
              sharey=True,
              layout=(1, 5),
              color='#3498db',
              figsize=(12, 4))


KDE (kernel density estimation)

In [ ]:
from scipy import stats

values = np.random.randn(1000)

density = stats.gaussian_kde(values)

density
In [ ]:
plt.subplots(figsize=(12, 6))

values2 = np.linspace(min(values)-10, max(values)+10, 100)

plt.plot(values2, density(values2), color='#FF7F00')
plt.fill_between(values2, 0, density(values2), alpha=0.5, color='#FF7F00')
plt.xlim(-5, 5)

plt.show()


 Combine plots

In [ ]:
plt.subplots(figsize=(12, 6))

plt.hist(values, bins=100, alpha=0.8, density=True,
         histtype='bar', color='steelblue',
         edgecolor='green')

plt.plot(values2, density(values2), color='#FF7F00', linewidth=3.0)
plt.xlim(-5, 5)

plt.show()


 Boxplots

Another plot type is the well-known boxplot. For this, we have the boxplot function that will receive a set of values on which to calculate the ranges, medians, whiskers and outliers.

We will have, as before, a wide set of optional parameters that will allow us to control: the type of point used to represent outliers, whether or not boxes and whiskers are shown, etc.

Full value specification can be found in this link.

 But what exactly is a boxplot?

A boxplot shows the quartiles (and outliers) of one or more numerical variables using the five-number summary:

  • min = minimum value
  • 25% = first quartile (Q1) = median of the lower half of the data
  • 50% = second quartile (Q2) = median of the data
  • 75% = third quartile (Q3) = median of the upper half of the data
  • max = maximum value

This summary is more useful than the mean and standard deviation for describing skewed distributions.

Then it calculates the IQR (box):

$$ \text{IQR} = Q3 - Q1 $$

And outliers (outside points):

$$ \text{below } Q1 - 1.5 \times \text{IQR} $$

$$ \text{above } Q3 + 1.5 \times \text{IQR} $$
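
The rule above can be worked through by hand on a small made-up sample. The sketch below computes the quartiles with np.percentile, which interpolates linearly between data points, so Q1 and Q3 can differ slightly from the "median of each half" definition. It then compares the result with the outlier points that boxplot draws using its default settings.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up sample with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences beyond which points count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# The same points appear as "fliers" in the boxplot
fig, ax = plt.subplots()
box = ax.boxplot(values)
fliers = box['fliers'][0].get_ydata()
```

For this sample the fences are -3.5 and 14.5, so 30 is the only outlier.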
In [ ]:
plt.boxplot(diamonds['carat'])

plt.show()

Also, boxplots can be grouped by a certain column:

In [ ]:
values_boxplot = []

for i in diamonds['cut'].unique():
    values_boxplot.append(list(diamonds[diamonds['cut'] == i].loc[:, 'carat'].values))

plt.boxplot(values_boxplot)

plt.show()

An easier way to get the same result:

In [ ]:
diamonds.boxplot(column='carat', by='cut')


Boxplots and outlier detection

In [ ]:
values = np.concatenate([np.random.randn(10), np.array([10, 15, -10, -15])])
In [ ]:
plt.figure(figsize=(12, 4))

plt.hist(values)
In [ ]:
plt.figure(figsize=(12, 4))

plt.boxplot(values)


Violin

Although, as we have seen, matplotlib does not offer a specific function for drawing density curves, it does let us draw violin plots (a hybrid between a boxplot and a density curve). To build this type of chart we use the violinplot function, which must always be given the set of values whose distribution is to be estimated.

As always, we have a set of additional parameters with which to control different characteristics of the chart: orientation, number of points used to calculate the distribution, calculation method, etc.

Full value specification can be found in this link.
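
As a short sketch of some of these optional parameters, the example below draws two violins from random data (not the diamonds data) and turns on the median marker. violinplot returns a dictionary with the drawn parts, which can be used to restyle them afterwards. The group names are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

# Two random samples for illustration
rng = np.random.default_rng(0)
groups = [rng.normal(0, 1, 200), rng.normal(2, 0.5, 200)]

fig, ax = plt.subplots()

# showmedians adds a line at the median of each violin
parts = ax.violinplot(groups, showmedians=True, showextrema=True)

# Violins are placed at positions 1..N by default
ax.set_xticks([1, 2])
ax.set_xticklabels(['group A', 'group B'])
```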

In [ ]:
plt.violinplot(diamonds['carat'])
In [ ]:
values_violin = []

for i in diamonds['cut'].unique():
    values_violin.append(list(diamonds[diamonds['cut'] == i].loc[:, 'carat'].values))

plt.violinplot(values_violin)

plt.show()
In [ ]:
values_violin = []

for i in diamonds['cut'].unique():
    values_violin.append(list(diamonds[diamonds['cut'] == i].loc[:, 'carat'].values))

plt.violinplot(values_violin, vert=False)

plt.show()


Line range

Finally, we will look at line range graphs, which show two Y values (a minimum and a maximum) for each value of X. To build this type of chart, matplotlib provides the vlines function, which must be given 3 data series: the X position of each line, its lower Y value and its upper Y value.

Full value specification can be found in this link.
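
The three series can be seen clearly on a small made-up frame. The sketch below groups hypothetical values, draws one vertical line per group from its minimum to its maximum, and labels the ticks with the group names.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data for illustration
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b', 'c', 'c'],
    'value': [1.0, 4.0, 2.0, 3.0, 0.5, 5.0],
})

# Minimum and maximum per group
min_max = df.groupby('group')['value'].agg(['min', 'max']).reset_index()

fig, ax = plt.subplots()

# x position, lower end and upper end of each line
lines = ax.vlines(min_max.index, min_max['min'], min_max['max'], linewidth=4)

# Show the group names instead of the numeric positions
ax.set_xticks(min_max.index)
ax.set_xticklabels(min_max['group'])
```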

In [ ]:
min_max_x = diamonds.groupby('clarity')['x'].agg(['min', 'max'])
min_max_x = min_max_x.reset_index()

plt.vlines(min_max_x.index, min_max_x['min'], min_max_x['max'])
In [ ]:
plt.vlines(min_max_x.index, min_max_x['min'], min_max_x['max'],
           linewidth=10, color="red")


 More types of plots

Many other types of plots are available within Matplotlib. A full list and other examples are available here and here.
