
What Is Pandas?

Last updated: May 14th, 2019

rmotr


What is pandas?

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

The pandas package is probably the most important tool for Data Scientists and Analysts working with Python today. The powerful machine learning and glamorous visualization tools may get all the attention, but pandas is the backbone of most data-related projects.

Fun fact 🎁: pandas is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. (Source: Wikipedia)

The popularity of pandas has grown rapidly in recent years. A chart from The Atlas showing the popularity of data science tools on Stack Overflow indicates that pandas has become the dominant tool among Python data scientists.


What is pandas used for?

If you're thinking about data science as a career, then it is imperative that one of the first things you do is learn pandas.

This tool will help you get, clean, transform and analyze your data.

For example, say you want to explore a dataset stored in a CSV file on your computer. The first step is to use pandas to load the data from that CSV into a DataFrame (a table-like data structure that we'll see more about later). Then we proceed with the routine data analysis tasks:

  • Run a quick Exploratory Data Analysis (EDA);
  • Calculate statistics such as the average, median, max, or min of each column;
  • Create visualizations: plot bars, lines, histograms, bubbles, and more;
  • Clean the data by doing things like removing missing values and filtering rows or columns by some criteria;
  • Build machine learning models to create predictions or classifications;
  • Store the cleaned, transformed data back into a CSV, another file, or a database.
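Several of the steps above can be sketched in a few lines. This is only an illustration: the CSV content, column names, and the 15-degree threshold are made up, and the data is read from an in-memory string rather than a real file.

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory so the example is self-contained.
csv_text = """city,temperature,humidity
Lima,19.5,80
Quito,14.0,
Bogota,13.5,75
Lima,21.0,78
"""

# 1. Load the CSV into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# 2. Quick exploration and summary statistics.
print(df.head())
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
mean_temp = df['temperature'].mean()

# 3. Clean: drop rows with missing values, then filter rows by a criterion.
clean = df.dropna()
warm = clean[clean['temperature'] > 15]

# 4. Store the result back as CSV (returned as a string because no path is given).
out_csv = warm.to_csv(index=False)
print(out_csv)
```

Passing a file path to `read_csv` and `to_csv` instead of the in-memory buffer would read from and write to disk.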


Why not just use Excel?

Excel is one of the most popular and widely-used data tools; it's hard to find an organization that doesn't work with it in some way. From analysts, to sales VPs, to CEOs, professionals use Excel for both quick stats and accounting and serious data crunching.

Using pandas with Microsoft Excel can give you the best of both worlds and optimize your workflow.

pandas manipulates and analyzes data held in Python. Unlike Excel, Python is completely free to download and use.

pandas runs on top of Python, and its core operations are implemented with optimized, vectorized code. As a result, it is fast and efficient, and its methods let you automate data processing tasks, including processing Excel files, more easily than Excel itself does.

In Excel, once you exceed roughly 50K rows, things start to slow down considerably. pandas, on the other hand, can comfortably handle millions of data points. In terms of pure capacity, Excel caps a single worksheet at exactly 1,048,576 rows. Well before that point, your calculations can take a very long time to compute, or Excel may simply crash. A million rows may seem like a lot of data, but for data scientists it is just a drop in the bucket.

pandas, however, has no built-in limit on the number of data points you can have in a DataFrame (its version of a data set). It is limited only by the amount of memory (RAM) of the computer it is running on.
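To get a feel for that memory limit, you can ask a DataFrame how many bytes it occupies. The sketch below builds a made-up two-column DataFrame with two million rows (more than a single Excel worksheet holds); the column names and sizes are arbitrary.

```python
import numpy as np
import pandas as pd

n_rows = 2_000_000  # more rows than one Excel worksheet can hold

big = pd.DataFrame({
    'id': np.arange(n_rows, dtype='int64'),
    'value': np.random.default_rng(0).random(n_rows),
})

# memory_usage reports the bytes used by the index and by each column.
bytes_used = big.memory_usage(deep=True).sum()
print(f"{len(big):,} rows use about {bytes_used / 1e6:.1f} MB")
```

Each of the two columns stores 8 bytes per row, so the total is on the order of tens of megabytes, well within the RAM of a typical laptop.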

It is also easier to create and apply complex equations and calculations to your data. With pandas you can apply hundreds of computations to millions of data points in very little time. Because Python has a large open source ecosystem, there are also hundreds of existing libraries that can shorten the time these calculations take.
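As a rough illustration of applying a calculation to a million rows at once, the following sketch uses randomly generated prices and quantities. The column names and the 10% discount rule are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
prices = pd.DataFrame({
    'price': rng.uniform(1, 100, size=1_000_000),
    'quantity': rng.integers(1, 10, size=1_000_000),
})

# One expression computes a new column for every row, with no explicit loop.
prices['total'] = prices['price'] * prices['quantity']

# Keep totals below 500 as they are; apply a 10% discount to the rest.
prices['discounted'] = prices['total'].where(
    prices['total'] < 500, prices['total'] * 0.9
)

print(prices['total'].sum())
```

The same formula in a spreadsheet would have to be filled down a million cells; here it is a single column-wise operation.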


Hands on!

We'll just import pandas and other useful libraries such as numpy, matplotlib and seaborn to work with.

Note that to import pandas and numpy we use the aliases pd and np. This is just a convention, which means it's not strictly necessary, but it is recommended.

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from utils import apply_flatui_theme

%matplotlib inline
apply_flatui_theme()


NumPy and pandas

pandas is built on top of the NumPy package, which means that all the efficient structures and functions we saw for NumPy in previous lessons also apply to pandas.

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed to work with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical (possibly multidimensional) arrays.
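The difference between homogeneous and heterogeneous data can be seen by inspecting dtypes. This is a small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype: mixing integers and a float
# upcasts every element to float64.
arr = np.array([0, 30, 8.5])
print(arr.dtype)

# A DataFrame keeps a separate dtype for each column,
# so text and numbers can live side by side.
df = pd.DataFrame({
    'Name': ['Avery Bradley', 'John Holland'],
    'Number': [0, 30],
})
print(df.dtypes)
```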

Data structures overview: Series and DataFrame

To get started with pandas, you will need to get comfortable with its two main data structures: Series and DataFrames.

A Series is essentially a single column of data, and a DataFrame is a two-dimensional table made up of a collection of Series. pandas relies on NumPy arrays to store this data, which means it also uses NumPy's data types.

DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.
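For instance, `mean()` and `fillna()` exist on both structures. The values below are made up for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 30, 8, np.nan, 12], name='Number')
df = pd.DataFrame({
    'Number': [0, 30, 8, np.nan, 12],
    'Age': [25, 27, np.nan, 21, 22],
})

# mean() skips missing values by default on both structures.
print(s.mean())   # a single number
print(df.mean())  # one mean per column, returned as a Series

# fillna() returns a new object with the missing values replaced.
s_filled = s.fillna(0)
df_filled = df.fillna(0)
```

The main difference is the shape of the result: a Series reduces to a scalar, while a DataFrame reduces to one value per column.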

Let's define some data within Python lists:

In [ ]:
names = ['Avery Bradley', 'John Holland', 'Jonas Jerebko',
         'Jordan Mickey', 'Terry Rozier', 'Jared Sullinger', 'Evan Turner']

teams = ['Boston Celtics', 'Boston Celtics', 'Boston Celtics',
        'Boston Celtics', 'Boston Celtics', 'Boston Celtics', 'Boston Celtics']

numbers = [0, 30, 8, np.nan, 12, 7, 11]

Series creation

In [42]:
my_series = pd.Series(names, name='Name')

my_series.to_frame()
Out[42]:
Name
0 Avery Bradley
1 John Holland
2 Jonas Jerebko
3 Jordan Mickey
4 Terry Rozier
5 Jared Sullinger
6 Evan Turner

Each value in a Series can be accessed by its index label, either with square brackets or with .loc:

In [45]:
my_series[3]
Out[45]:
'Jordan Mickey'
In [46]:
my_series.loc[3]
Out[46]:
'Jordan Mickey'
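With the default index, labels and positions happen to coincide. The distinction becomes visible once the index holds other values. In this sketch the jersey numbers used as index labels are only an assumption for illustration:

```python
import pandas as pd

names = ['Avery Bradley', 'John Holland', 'Jonas Jerebko']

# Hypothetical jersey numbers used as index labels.
by_number = pd.Series(names, index=[0, 30, 8], name='Name')

by_label = by_number.loc[30]      # label-based: the row labelled 30
by_position = by_number.iloc[1]   # position-based: the second row

print(by_label, '|', by_position)
```

Both lookups return the same player here, but `.loc[1]` would raise a `KeyError` because no row carries the label 1.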

DataFrame creation

There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.

In [33]:
data = {
    'Name': names,
    'Team': teams,
    'Number': numbers
}
In [66]:
my_df = pd.DataFrame(data)

my_df
Out[66]:
Name Team Number
0 Avery Bradley Boston Celtics 0.000
1 John Holland Boston Celtics 30.000
2 Jonas Jerebko Boston Celtics 8.000
3 Jordan Mickey Boston Celtics nan
4 Terry Rozier Boston Celtics 12.000
5 Jared Sullinger Boston Celtics 7.000
6 Evan Turner Boston Celtics 11.000
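A dict of lists is not the only option. As a brief sketch with made-up values, a DataFrame can also be built from a list of per-row dicts or from a 2-D NumPy array:

```python
import numpy as np
import pandas as pd

# From a list of records (one dict per row).
records = [
    {'Name': 'Avery Bradley', 'Number': 0},
    {'Name': 'John Holland', 'Number': 30},
]
df_records = pd.DataFrame(records)

# From a 2-D NumPy array, naming the columns explicitly.
df_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])

print(df_records)
print(df_array)
```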

Each value in a DataFrame can be accessed by combining its column name with its row index label:

In [67]:
my_df['Name']
Out[67]:
0      Avery Bradley
1       John Holland
2      Jonas Jerebko
3      Jordan Mickey
4       Terry Rozier
5    Jared Sullinger
6        Evan Turner
Name: Name, dtype: object
In [57]:
my_df['Name'][3]
Out[57]:
'Jordan Mickey'
In [58]:
my_df.loc[3, 'Name']
Out[58]:
'Jordan Mickey'
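A few more access patterns, shown on a small made-up DataFrame. This is a sketch of common `.loc` and `.iloc` usage, not an exhaustive list:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Avery Bradley', 'John Holland', 'Jonas Jerebko'],
    'Number': [0, 30, 8],
})

name_by_label = df.loc[1, 'Name']    # row label 1, column 'Name'
name_by_position = df.iloc[1, 0]     # second row, first column
first_row = df.loc[0]                # a whole row comes back as a Series

# Boolean filter on rows combined with a column selection.
subset = df.loc[df['Number'] > 5, ['Name']]
print(subset)
```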

In future lectures we'll see more on locating and extracting data from a DataFrame, so don't worry if you don't get it all right now.

Let's move on to some quick methods for creating DataFrames from various other sources.


Reading external data

pandas allows us to read different types of external data files, such as CSV, TXT and XLS.

With CSV files all you need is a single line to load in the data:

In [59]:
df = pd.read_csv('bitcoin_data.csv')

df.head()
Out[59]:
Timestamp Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
0 1/1/17 0:00 966.340 1,005.000 960.530 997.750 6,850.590 6,764,742.060 987.470
1 1/2/17 0:00 997.750 1,032.000 990.010 1,012.540 8,167.380 8,273,576.990 1,013.000
2 1/3/17 0:00 1,011.440 1,039.000 999.990 1,035.240 9,089.660 9,276,500.310 1,020.560
3 1/4/17 0:00 1,035.510 1,139.890 1,028.560 1,114.920 21,562.460 23,469,644.960 1,088.450
4 1/5/17 0:00 1,114.380 1,136.720 885.410 1,004.740 36,018.860 36,211,399.530 1,005.350

There are also many options when loading data. For example, CSVs don't have indexes like our DataFrames do, so we'll designate the index column with index_col when reading. Here we also parse the dates and keep only the columns from Open to Close:

In [60]:
df = pd.read_csv(
    'bitcoin_data.csv',
    index_col=0,
    parse_dates=True
).loc[:, 'Open':'Close']

df.head()
Out[60]:
Open High Low Close
Timestamp
2017-01-01 966.340 1,005.000 960.530 997.750
2017-01-02 997.750 1,032.000 990.010 1,012.540
2017-01-03 1,011.440 1,039.000 999.990 1,035.240
2017-01-04 1,035.510 1,139.890 1,028.560 1,114.920
2017-01-05 1,114.380 1,136.720 885.410 1,004.740
In [5]:
fig, ax = plt.subplots(figsize=(16, 6))

df.plot(ax=ax)

plt.title("Bitcoin price (USD)", fontsize=16, fontweight='bold', color='white')
Out[5]:
Text(0.5,1,'Bitcoin price (USD)')
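The `index_col` and `parse_dates` options used above can be tried without the original file. The following is a minimal sketch that assumes a few made-up rows with ISO-formatted timestamps, read from an in-memory string instead of bitcoin_data.csv:

```python
import io
import pandas as pd

# Hypothetical rows; the real file's timestamp format may differ.
csv_text = """Timestamp,Open,High,Low,Close
2017-01-01 00:00,966.34,1005.00,960.53,997.75
2017-01-02 00:00,997.75,1032.00,990.01,1012.54
2017-01-03 00:00,1011.44,1039.00,999.99,1035.24
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,        # use the first column (Timestamp) as the index
    parse_dates=True,   # try to convert the index to datetimes
)

print(df.index)
close_jan_2 = df.loc[pd.Timestamp('2017-01-02'), 'Close']
print(close_jan_2)
```

With a datetime index in place, rows can be looked up by date and the x-axis of `df.plot()` is labelled with dates.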


Plotting example: Bollinger bands

As a sneak peek of what we'll see in upcoming lectures, let's make some basic plots using pandas.

Bollinger Bands are a technical trading tool created by John Bollinger in the early 1980s. They arose from the need for adaptive trading bands and the observation that volatility was dynamic, not static as was widely believed at the time.

Calculate Bollinger bands

To demonstrate the strategy we will use a 30-period rolling mean window and 1.5 standard deviations for each of the bands. This might not be the optimal configuration for this dataset, but we will talk more about optimizing these two arguments later.

In [6]:
# set number of days and standard deviations to use for rolling 
# lookback period for Bollinger band calculation
window = 30
no_of_std = 1.5

# calculate rolling mean and standard deviation
rolling_mean = df['Close'].rolling(window).mean()
rolling_std = df['Close'].rolling(window).std()

# create two new DataFrame columns to hold values of upper and lower Bollinger bands
df['Rolling Mean'] = rolling_mean
df['Bollinger High'] = rolling_mean + (rolling_std * no_of_std)
df['Bollinger Low'] = rolling_mean - (rolling_std * no_of_std)
In [7]:
df.tail()
Out[7]:
Open High Low Close Rolling Mean Bollinger High Bollinger Low
Timestamp
2018-03-24 8,917.990 9,020.000 8,505.000 8,547.000 9,533.030 11,150.406 7,915.654
2018-03-25 8,541.960 8,680.000 8,368.630 8,453.900 9,475.957 11,109.229 7,842.684
2018-03-26 8,451.120 8,500.000 7,831.150 8,149.660 9,424.612 11,096.248 7,752.976
2018-03-27 8,152.260 8,211.620 7,742.110 7,791.700 9,364.668 11,094.048 7,635.287
2018-03-28 7,791.690 8,104.980 7,723.030 8,039.860 9,288.506 11,032.616 7,544.396
In [8]:
fig, ax = plt.subplots(figsize=(16, 6))

df[['Close','Bollinger High','Bollinger Low']].plot(ax=ax)

plt.title("Bitcoin - Bollinger bands (USD)", fontsize=16, fontweight='bold', color='white')
Out[8]:
Text(0.5,1,'Bitcoin - Bollinger bands (USD)')
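The calculation above can also be wrapped in a small helper so the window and band width are easy to vary. The function name `bollinger_bands` and the sample prices are hypothetical; the short window is chosen only so the numbers are easy to check by hand.

```python
import pandas as pd

def bollinger_bands(close, window=30, no_of_std=1.5):
    """Return the rolling mean and upper/lower bands for a price Series."""
    rolling = close.rolling(window)
    mean = rolling.mean()
    std = rolling.std()  # sample standard deviation (ddof=1)
    return pd.DataFrame({
        'Rolling Mean': mean,
        'Bollinger High': mean + std * no_of_std,
        'Bollinger Low': mean - std * no_of_std,
    })

# Hypothetical closing prices.
close = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])
bands = bollinger_bands(close, window=3, no_of_std=2)
print(bands)
```

The first `window - 1` rows are NaN because a full window of observations is not yet available, which is also why the bands in a plot start later than the price line.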

You can get our full article explaining this Bollinger bands strategy here!
