
Intro to Pandas Plotting

Last updated: October 31st, 2019

rmotr


Plotting with Pandas

The pandas library has become popular not just for enabling powerful data analysis, but also for its handy pre-canned plotting methods. Interestingly though, pandas plotting methods are really just convenient wrappers around existing matplotlib calls.

That is, the plot() method on pandas' Series and DataFrame is a wrapper around matplotlib's plt.plot(), which we'll cover in more detail in upcoming lectures.
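One way to see the wrapper relationship is to inspect what plot() returns. The following is a minimal sketch, assuming a small made-up Series (the name toy and its values are hypothetical, not part of the lecture's dataset):

```python
import matplotlib.axes
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, not the lecture's dataset
toy = pd.Series([1, 4, 9, 16], name='squares')

# Series.plot() draws with matplotlib and returns the Axes it drew on
ax = toy.plot()

# The returned object is a regular matplotlib Axes holding one line
is_matplotlib_axes = isinstance(ax, matplotlib.axes.Axes)
n_lines = len(ax.get_lines())
print(is_matplotlib_axes, n_lines)
```

Because the result is a plain matplotlib Axes, anything matplotlib offers (titles, labels, legends) can be applied to a pandas plot afterwards.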


Hands on!

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

Pandas can easily read data stored in different file formats like CSV, JSON, XML or even Excel, as we saw in the previous lecture.

Let's read some data and plot some basic figures.

In [2]:
# load data
df = pd.read_csv('nba_data.csv')

# show first rows
df.head()
Out[2]:
   Unnamed: 0  Rk             Player  Position  Age    Mp    Fg   Fga    Fg%   3P  ...  Team  Gp   Mpg  Orpm   Drpm   Rpm  Wins_Rpm   Pie    Pace   W
0           0   1  Russell Westbrook        PG   28  34.6  10.2  24.0  0.425  2.5  ...   OKC  81  34.6  6.74  -0.47  6.27     17.34  23.0  102.31  46
1           1   2       James Harden        PG   27  36.4   8.3  18.9  0.440  3.2  ...   HOU  81  36.4  6.38  -1.57  4.81     15.54  19.0  102.98  54
2           2   3      Isaiah Thomas        PG   27  33.8   9.0  19.4  0.463  3.2  ...   BOS  76  33.8  5.72  -3.89  1.83      8.19  16.1   99.84  51
3           3   4      Anthony Davis         C   23  36.1  10.3  20.3  0.505  0.5  ...    NO  75  36.1  0.45   3.90  4.35     12.81  19.2  100.19  31
4           4   5      DeMar DeRozan        SG   27  35.4   9.7  20.9  0.467  0.4  ...   TOR  74  35.4  2.21  -2.04  0.17      5.46  15.5   97.69  47

5 rows × 38 columns


Plotting basics

Let's look at some commonly used types of plots:

In [3]:
df.plot?


Scatter plot

We'll use Fg and Fga to draw a scatter plot and see the relation between both columns.

In [4]:
df.plot(kind='scatter',
        x='Fg',
        y='Fga',
        color='orange',
        figsize=(12, 6))
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x117d7f4a8>
In [5]:
df.plot.scatter(x='Fg',
                y='Fga',
                color='orange',
                figsize=(12, 6))
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x119e639e8>
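
The two cells above use different spellings of the same call: kind='scatter' and the .plot.scatter accessor. A small sketch can check that both draw the same points. It assumes a toy DataFrame built from the first four Fg and Fga values shown in the table earlier; the variable names are made up for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# First four Fg / Fga values from the df.head() output shown earlier
toy = pd.DataFrame({'Fg': [10.2, 8.3, 9.0, 10.3],
                    'Fga': [24.0, 18.9, 19.4, 20.3]})

# Same plot requested in two ways
ax_kind = toy.plot(kind='scatter', x='Fg', y='Fga')
ax_accessor = toy.plot.scatter(x='Fg', y='Fga')

# Each scatter plot stores its points in the first collection of the Axes
points_kind = np.asarray(ax_kind.collections[0].get_offsets())
points_accessor = np.asarray(ax_accessor.collections[0].get_offsets())
same_points = np.allclose(points_kind, points_accessor)
print(same_points)
```

Which spelling to use is mostly a matter of taste; the accessor form tends to work better with editor autocompletion.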

Now let's see the relation between Fg and Age:

In [6]:
df.plot(kind='scatter',
        x='Fg',
        y='Age',
        color='purple',
        figsize=(12, 6))
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a041860>


Bar plot

We can also make some nice bar plots. Let's see the age of the first 10 players:

In [7]:
df.head(10).plot(kind='bar',
                 x='Player',
                 y='Age')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a1b7a20>
In [8]:
df.head(10).plot.bar(x='Player',
                     y='Age')
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a34b550>
In [9]:
df.head(10)['Age'].plot(kind='bar')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a3e9e10>

Bar plots are also possible for columns without numeric values, by counting how often each value appears:

In [10]:
df['Position'].value_counts()
Out[10]:
PF      94
PG      93
SG      90
C       90
SF      78
PF-C     1
Name: Position, dtype: int64
In [11]:
df['Position'].value_counts().plot(kind='bar')
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a47a6a0>
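
To make the link between value_counts() and the bars explicit, here is a minimal sketch on a made-up Series of positions (the data is hypothetical). It reads the bar heights and tick labels back from the Axes:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical data, standing in for the Position column
positions = pd.Series(['PG', 'PG', 'C', 'SG', 'PG', 'C'], name='Position')

# value_counts() turns categories into numbers, sorted from most to least frequent
counts = positions.value_counts()

fig, ax = plt.subplots()
counts.plot(kind='bar', ax=ax)

# One bar per category, with height equal to its count
bar_heights = [bar.get_height() for bar in ax.patches]
tick_labels = [tick.get_text() for tick in ax.get_xticklabels()]
print(tick_labels, bar_heights)
```

The plot never sees the raw strings: it only draws the numeric counts, using the category names as tick labels.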


Line plot

Another useful type of plot is the line plot.

In [12]:
df.head(10).plot(kind='line',
                 x='Player',
                 y='Age')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a4ea7f0>


Pie plot

To see what share of a column's total each value represents, we can make a pie plot.

In [13]:
df.head(5)['Fg'].plot(kind='pie')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a5cd208>
In [14]:
df.head(5)['Fg'].plot(kind='pie',
                      title='FG per Player',
                      labels=df.head(5)['Player'],
                      figsize=(6,6))
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a6ab8d0>
In [15]:
df.head(5)['Fg'].plot(kind='pie',
                      labels=df.head(5)['Player'],
                      title='FG per Player',
                      legend=True,
                      figsize=(6,6))
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a70f0f0>

Pie plots also work for columns without numeric values, once we count how often each value appears:

In [16]:
df['Position'].value_counts()
Out[16]:
PF      94
PG      93
SG      90
C       90
SF      78
PF-C     1
Name: Position, dtype: int64
In [17]:
df['Position'].value_counts().plot(kind='pie', figsize=(6,6))
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a806fd0>


Histogram

Histograms, which show how the values of a numeric column are distributed, can also be drawn.

In [18]:
df['Age'].plot(kind='hist',
               rwidth=0.85)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a530da0>
In [19]:
df['Age'].plot(kind='hist',
               rwidth=0.85,
               orientation='horizontal')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a017048>


KDE plot

We can also generate a Kernel Density Estimate (KDE) plot.

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable. This function uses Gaussian kernels and includes automatic bandwidth determination.
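
The idea can be sketched by hand with NumPy: place one Gaussian bump on each observation and average them. This is an illustration, not the code pandas runs. It assumes the five Age values from the table shown earlier, and it assumes Scott's rule of thumb for the bandwidth, which to my knowledge is the default of the SciPy estimator that pandas delegates to:

```python
import numpy as np

# Age of the first five players shown earlier
ages = np.array([28, 27, 27, 23, 27], dtype=float)
n = ages.size

# Assumption: Scott's rule of thumb for the bandwidth of 1-D data
bandwidth = ages.std(ddof=1) * n ** (-1 / 5)

# Evaluate the estimate on a grid that extends well past the data
grid = np.linspace(ages.min() - 5 * bandwidth,
                   ages.max() + 5 * bandwidth, 1000)

# One Gaussian kernel per observation, averaged over all observations
z = (grid[:, None] - ages[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A probability density should integrate to (approximately) 1
dx = grid[1] - grid[0]
area = density.sum() * dx
peak = grid[density.argmax()]
print(round(area, 3), round(peak, 1))
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one gives a smoother curve that hides detail.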

In [20]:
df['Age'].plot(kind='kde')
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ab01828>
In [21]:
df[['Fg', 'Fga']].plot(kind='kde')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1226f00b8>


Box plot

Box plots can also be drawn.

In [22]:
df.head(10)[['Fg', 'Fga']].plot(kind='box')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x12483c630>
In [23]:
df.head(10)[['Fg', 'Fga']].plot(kind='box', subplots=True, layout=(1,2))
Out[23]:
Fg        AxesSubplot(0.125,0.125;0.352273x0.755)
Fga    AxesSubplot(0.547727,0.125;0.352273x0.755)
dtype: object
In [24]:
df.head(10)[['Fg', 'Fga']].plot(kind='box', subplots=True, layout=(2,1))
Out[24]:
Fg     AxesSubplot(0.125,0.536818;0.775x0.343182)
Fga       AxesSubplot(0.125,0.125;0.775x0.343182)
dtype: object
In [25]:
df.head(10)[['Age']].plot(kind='box')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x124b5ed30>
In [26]:
df.head(10)[['Age']].plot.box()
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x124c2ef60>
In [27]:
df.head(10)[['Age', 'Position']].boxplot(by='Position')
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x124ce8390>


Dive into a real world example

Now let's take a closer look at matplotlib, the library pandas relies on for plotting.

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

To let matplotlib render graphics inside a notebook, we run %matplotlib inline, as we did at the top of this notebook.

Let's read some Bitcoin price data and plot some basic figures.

In [28]:
# load data
df = pd.read_csv('btc-market-price.csv',
                 header=None,
                 names=['Timestamp', 'Price'],
                 index_col=0,
                 parse_dates=True)

# show first rows
df.head()
Out[28]:
Price
Timestamp
2017-04-02 1099.169125
2017-04-03 1141.813000
2017-04-04 1141.600363
2017-04-05 1133.079314
2017-04-06 1196.307937
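
The read_csv arguments used above can be tried without the file. The sketch below feeds three rows through io.StringIO; the exact row layout of btc-market-price.csv is an assumption here (a date, a comma, a price, no header line), and the name prices is made up:

```python
import io
import pandas as pd

# Assumed file layout: no header row, "date,price" on each line
csv_text = (
    "2017-04-02,1099.169125\n"
    "2017-04-03,1141.813000\n"
    "2017-04-04,1141.600363\n"
)

prices = pd.read_csv(io.StringIO(csv_text),
                     header=None,                   # the data has no header line
                     names=['Timestamp', 'Price'],  # so we supply column names
                     index_col=0,                   # use the first column as index
                     parse_dates=True)              # and parse it as dates

# The index is now a DatetimeIndex, which enables date-based selection
is_datetime_index = isinstance(prices.index, pd.DatetimeIndex)
april_3rd = prices.loc[pd.Timestamp('2017-04-03'), 'Price']
last_two_days = prices.loc['2017-04-03':'2017-04-04']
print(is_datetime_index, april_3rd, len(last_two_days))
```

Having real dates in the index is also what lets df.plot() put a proper time axis on the chart.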
In [29]:
df.plot(figsize=(14, 7), title='Bitcoin Price 2017-2018')
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x124dfd9e8>

plt.plot() accepts many parameters, but the first two are the most important: the values for the X and Y axes.

Let's make a plot showing $f(x) = x^2$:

In [30]:
x_values = np.arange(10)
y_values = [value ** 2 for value in x_values]

plt.plot(x_values, y_values)
Out[30]:
[<matplotlib.lines.Line2D at 0x125364d68>]
In [31]:
x_values = np.arange(-10, 11)
y_values = [value ** 2 for value in x_values]

plt.plot(x_values, y_values)
Out[31]:
[<matplotlib.lines.Line2D at 0x1253c8c18>]

We're using matplotlib's global API, which is clunky but also the most popular one. We'll learn later how to use the object-oriented (OOP) API, which will make our work much easier.

In [32]:
plt.plot(x_values, x_values ** 2)
plt.plot(x_values, -1 * (x_values ** 2))
Out[32]:
[<matplotlib.lines.Line2D at 0x125468da0>]

Each plt function alters the global state. To configure the figure itself (for example, its size), use the plt.figure function. Others, like plt.title, keep altering the current global plot:

In [33]:
plt.figure(figsize=(12, 8))
plt.plot(x_values, x_values ** 2)
plt.plot(x_values, -1 * (x_values ** 2))

plt.title('My Nice Plot')
Out[33]:
Text(0.5, 1.0, 'My Nice Plot')
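
For comparison, the same figure can be sketched with the object-oriented API mentioned earlier. Instead of relying on global state, plt.subplots() hands back explicit Figure and Axes objects, and every call is made on them. This is only a preview under the same data as the cell above:

```python
import numpy as np
import matplotlib.pyplot as plt

x_values = np.arange(-10, 11)

# Create the figure and its axes explicitly instead of using global state
fig, ax = plt.subplots(figsize=(12, 8))

# Methods on the Axes object replace the plt.* functions
ax.plot(x_values, x_values ** 2)
ax.plot(x_values, -1 * (x_values ** 2))
ax.set_title('My Nice Plot')

n_lines = len(ax.get_lines())
title = ax.get_title()
print(n_lines, title)
```

With explicit objects it is always clear which plot a call affects, which matters once a notebook holds several figures.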

Go a step further and:

  • Plot sine and cosine series.
  • Add axis names.
  • Add a legend.
In [34]:
plt.figure(figsize=(12, 8))

x_values = np.linspace(0, 2 * np.pi, 100)
sin_line, = plt.plot(x_values, np.sin(x_values))
cos_line, = plt.plot(x_values, np.cos(x_values))

plt.xlabel('x values')
plt.ylabel('y values')
plt.legend([sin_line, cos_line], ['y = sin(x)', 'y = cos(x)'])
plt.show()

Replicating Pandas' plot with matplotlib

At its core, pandas uses matplotlib to draw its plots on screen. We can reproduce the same steps with matplotlib to understand a little better how it works. This was our original pandas-based plot:

In [35]:
df.plot(figsize=(14, 7), title='Bitcoin Price 2017-2018')
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x1257e9e48>

Using matplotlib, we need to take a few more steps:

In [36]:
plt.figure(figsize=(14, 7))
plt.title("Bitcoin Price 2017-2018")
plt.plot(df.index, df['Price'], label="Bitcoin Price", color='orange')
plt.legend()
Out[36]:
<matplotlib.legend.Legend at 0x125a85208>
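
The two approaches can also be combined: pandas plotting methods accept an ax argument, so pandas can draw onto an Axes created with matplotlib. A minimal sketch, assuming a small DataFrame rebuilt from the five rows shown earlier (the name prices and the title are made up):

```python
import pandas as pd
import matplotlib.pyplot as plt

# The five rows shown in the df.head() output above
prices = pd.DataFrame(
    {'Price': [1099.169125, 1141.813000, 1141.600363,
               1133.079314, 1196.307937]},
    index=pd.date_range('2017-04-02', periods=5, freq='D'))
prices.index.name = 'Timestamp'

# matplotlib creates the canvas...
fig, ax = plt.subplots(figsize=(14, 7))

# ...pandas draws on it...
prices.plot(ax=ax, title='Bitcoin Price (first five rows)', color='orange')

# ...and matplotlib can keep customizing the same Axes
ax.set_ylabel('Price')

title = ax.get_title()
n_lines = len(ax.get_lines())
print(title, n_lines)
```

This pattern keeps the convenience of df.plot() while leaving full control over the figure in matplotlib's hands.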

Saving the plot

Plots can be saved as images in multiple formats with the plt.savefig function:

In [37]:
df.plot(figsize=(14, 7), title='Bitcoin Price 2017-2018')
plt.savefig('plot1.png')
In [38]:
plt.figure(figsize=(14, 7))
plt.title("Bitcoin Price 2017-2018")
plt.plot(df.index, df['Price'], label="Bitcoin Price", color='orange')
plt.legend()
plt.savefig('plot2.png')
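
As a sketch of the "multiple formats" point, the output format is inferred from the file extension, and options such as dpi and bbox_inches control resolution and cropping. The file names and the temporary directory below are made up for illustration:

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

x_values = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x_values, np.sin(x_values))

# Write into a throwaway directory so the sketch leaves no files behind
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'plot.png')
svg_path = os.path.join(out_dir, 'plot.svg')

# Raster output: dpi sets the resolution, bbox_inches='tight' trims margins
fig.savefig(png_path, dpi=150, bbox_inches='tight')

# Vector output: the .svg extension selects the format
fig.savefig(svg_path)

saved = sorted(os.listdir(out_dir))
print(saved)
```

Calling savefig on the Figure object, rather than on plt, avoids any doubt about which figure is being written.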

