# What is Pandas?

`pandas` is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

**The pandas package is probably the most important tool for Data Scientists and Analysts working with Python today**. The powerful machine learning and glamorous visualization tools may get all the attention, but pandas is the backbone of most data-related projects.

Fun fact 🎁: the name `pandas` is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals (Wikipedia).

The popularity of pandas has **grown rapidly** in recent years. A chart from The Atlas tracking the popularity of data science tools on Stack Overflow shows that pandas has become the dominant tool among Python data scientists.

## What is pandas used for?

If you're thinking about data science as a career, then it is imperative that one of the first things you do is learn pandas.

This tool will help you get, clean, transform and analyze your data.

For example, say you want to explore a dataset stored in a CSV on your computer. The first step is to use pandas to load the data from that CSV into a DataFrame (a table-like data structure that we'll see more about later). Then we proceed with the routine data analysis tasks:

- Running a quick exploratory data analysis (EDA);
- Calculating statistics such as the average, median, max, or min of each column;
- Creating visualizations: bar charts, line plots, histograms, bubble charts, and more;
- Cleaning the data by removing missing values and filtering rows or columns by some criteria;
- Building machine learning models to make predictions or classifications;
- Storing the cleaned, transformed data back into a CSV, another file, or a database.
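Several of the steps above can be sketched in a few lines. This is only an illustration: the CSV contents, column names, and the 15-degree threshold are made up, and the file is kept in memory so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV contents, held in memory so the example is self-contained.
csv_text = """city,temperature,humidity
Lima,19.5,80
Quito,14.0,
Bogota,13.5,75
Lima,21.0,78
"""

# Load the CSV into a DataFrame.
raw = pd.read_csv(io.StringIO(csv_text))

# Quick exploration: shape and summary statistics of the numeric columns.
n_rows, n_cols = raw.shape
summary = raw.describe()
mean_temperature = raw["temperature"].mean()

# Cleaning: drop rows with missing values, then keep rows matching a criterion.
clean = raw.dropna()
warm = clean[clean["temperature"] > 15]

# Store the result back as CSV text (pass a file path instead to write a file).
output_csv = warm.to_csv(index=False)
print(output_csv)
```

Visualization and machine learning are left out here; they build on the same DataFrame once it is loaded and cleaned.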

## Why not just use Excel?

Excel is one of the most popular and widely-used data tools; it's hard to find an organization that doesn't work with it in some way. From analysts, to sales VPs, to CEOs, professionals use Excel for both quick stats and accounting and serious data crunching.

Using pandas with Microsoft Excel can give you the best of both worlds and optimize your workflow.

Pandas manipulates and analyzes data held in Python. Unlike Excel, Python is completely **free to download and use**.

Pandas runs on top of Python. As a result, it is **fast and efficient**, and its methods **let you automate data processing tasks better than Excel does**, including the processing of Excel files.

In Excel, once you exceed roughly 50K rows, it starts to slow down considerably. Pandas, on the other hand, **has no fixed row limit and handles millions of data points comfortably**. In terms of pure space, Excel caps a single spreadsheet at exactly 1,048,576 rows. Near that point, your calculations would take forever to compute; more likely, Excel would just crash. A million rows may seem like a lot of data, but for data scientists it is a drop in the bucket.

Pandas, however, places no limit on the number of data points you can have in a `DataFrame` (its version of a data set). It's limited only by the amount of memory (RAM) of the computer it is running on.

It is also **easier to create and use complex equations and calculations on your data**. You can apply hundreds of computations to millions of data points quickly with pandas. And since Python is open source, there are already hundreds of libraries that can cut down the time these calculations take.
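As a rough illustration of working past Excel's row cap, the sketch below builds a two-million-row DataFrame and computes a new column in one expression. The column names and values are invented for the example, and how fast it runs depends on your machine.

```python
import numpy as np
import pandas as pd

# Two million rows of made-up data; this exceeds Excel's per-sheet row cap.
n_rows = 2_000_000
prices = pd.DataFrame({
    "quantity": np.arange(n_rows) % 10 + 1,   # values cycle through 1..10
    "unit_price": np.full(n_rows, 2.5),
})

# One vectorized expression computes the new column for every row at once,
# with no explicit Python loop.
prices["total"] = prices["quantity"] * prices["unit_price"]

print(len(prices))
print(prices["total"].head(3).tolist())
```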

## Hands on!

We'll just import pandas and other useful libraries such as numpy, matplotlib and seaborn to work with.

Note that to import pandas and numpy we use the aliases `pd` and `np`. This is just a convention, which means it's not strictly necessary, but it is recommended.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
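# apply_theme is a helper from a local utils module (not part of pandas)
from utils import apply_theme
# Jupyter magic: render matplotlib figures inline in the notebook
%matplotlib inline
apply_theme()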
```

## NumPy and pandas

**Pandas is built on top of the NumPy package**, which means that the efficient structures and functions we saw in the previous NumPy lessons also apply to pandas.

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed to work with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical (possibly multidimensional) arrays.
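The difference can be seen by inspecting data types. In this small sketch (the values and column names are made up), the NumPy array ends up with a single dtype, while the DataFrame keeps one dtype per column.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype for every element: mixing ints and a float
# upcasts the whole array to float64.
arr = np.array([1, 2.5, 3])
print(arr.dtype)

# A DataFrame keeps a separate dtype per column, so text, integers and
# floats can sit side by side in one table. The data here is made up.
table = pd.DataFrame({
    "label": ["a", "b", "c"],
    "count": [1, 2, 3],
    "ratio": [0.1, 0.2, 0.3],
})
print(table.dtypes)
```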

## Overview Data Structures - Series and DataFrame

To get started with pandas, you will need to get comfortable with its two main data structures: `Series` and `DataFrame`s.

A `Series` is essentially a single column of data, and a `DataFrame` is a two-dimensional table made up of a collection of `Series`. Pandas relies on NumPy arrays to store this data, which means it also uses NumPy's data types.

`DataFrame`s and `Series` are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.
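A minimal sketch of that overlap, using made-up scores: `mean` and `fillna` are called the same way on a `Series` and on a `DataFrame`, and only the shape of the result differs.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry in each column.
scores = pd.DataFrame({
    "math": [8.0, np.nan, 6.0],
    "art": [7.0, 9.0, np.nan],
})
math_scores = scores["math"]           # selecting one column gives a Series

# The same method names work on both structures.
series_mean = math_scores.mean()       # a single number; NaN is skipped
frame_means = scores.mean()            # one mean per column, as a Series

series_filled = math_scores.fillna(0)  # Series with NaN replaced by 0
frame_filled = scores.fillna(0)        # DataFrame with every NaN replaced by 0

print(series_mean)
print(frame_means)
```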

Let's define some data within Python lists:

```
names = ['Avery Bradley', 'John Holland', 'Jonas Jerebko',
         'Jordan Mickey', 'Terry Rozier', 'Jared Sullinger', 'Evan Turner']
teams = ['Boston Celtics', 'Boston Celtics', 'Boston Celtics',
         'Boston Celtics', 'Boston Celtics', 'Boston Celtics', 'Boston Celtics']
numbers = [0, 30, 8, np.nan, 12, 7, 11]
```

```
names
```

```
teams
```

```
numbers
```

### Series creation

```
my_series = pd.Series(names, name='Name')
my_series.to_frame()
```

Each value in a `Series` can be accessed by its index label, either with square brackets or with `.loc`:

```
my_series[3]
```

```
my_series.loc[3]
```
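With the default integer index, a label and a position are the same number, so the distinction is easy to miss. The sketch below uses a made-up `Series` with string labels to show label-based lookup with `.loc` next to position-based lookup with `.iloc`.

```python
import pandas as pd

# A small Series with string labels instead of the default 0, 1, 2, ...
# The names and labels are made up for illustration.
players = pd.Series(
    ["Avery", "John", "Jonas"],
    index=["guard", "forward", "center"],
    name="Name",
)

by_label = players.loc["forward"]   # look up by index label
by_position = players.iloc[1]       # look up by integer position

print(by_label, by_position)
```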

### DataFrame creation

There are many ways to create a `DataFrame` from scratch, but a great option is to just use a simple `dict`.

```
data = {
    'Name': names,
    'Team': teams,
    'Number': numbers
}
```

```
my_df = pd.DataFrame(data)
my_df
```
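As one example of another construction route, a `DataFrame` can also be built from a list of dicts, one per row. The sketch below uses a shortened, illustrative subset of the data and compares the two approaches.

```python
import pandas as pd

# Column-oriented input: one list per column.
from_columns = pd.DataFrame({
    "Name": ["Avery Bradley", "John Holland"],
    "Number": [0, 30],
})

# Row-oriented input: one dict per row.
from_rows = pd.DataFrame([
    {"Name": "Avery Bradley", "Number": 0},
    {"Name": "John Holland", "Number": 30},
])

# Check whether both routes produced the same table.
print(from_columns.equals(from_rows))
```

The dict-of-lists form tends to be convenient when data already lives in columns, and the list-of-dicts form when records arrive one at a time.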

Each value in a `DataFrame` can be accessed by combining its column name with its row index label:

```
my_df['Name']
```

```
type(my_df['Name'])
```

```
my_df['Name'][3]
```

```
my_df.loc[3, 'Name']
```

We'll see more on locating and extracting data from a DataFrame in future lectures, so don't worry if you don't get it right away.

Let's move on to some quick methods for creating DataFrames from various other sources.

## Reading external data

pandas allows us to read different types of external data files such as CSV, TXT and XLS.

With CSV files all you need is a single line to load in the data:

```
df = pd.read_csv('bitcoin_data.csv')
df.head()
```

There are also many options when loading data. For example, CSV files don't have an index like our DataFrames do, so we'll designate the `index_col` when reading:

```
df = pd.read_csv(
    'bitcoin_data.csv',
    index_col=0,
    parse_dates=True
).loc[:, 'Open':'Close']
df.head()
```
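To see what `index_col` and `parse_dates` do without needing the data file, here is a sketch on an in-memory CSV. The column layout and values are assumptions for illustration; the real `bitcoin_data.csv` may differ.

```python
import io

import pandas as pd

# Hypothetical file contents; the real bitcoin_data.csv may use other columns.
csv_text = """Date,Open,High,Low,Close,Volume
2021-01-01,100.0,110.0,95.0,105.0,1000
2021-01-02,105.0,120.0,101.0,118.0,1500
2021-01-03,118.0,119.0,108.0,110.0,1200
"""

prices = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,        # use the first column (Date) as the row index
    parse_dates=True,   # convert the index to datetime values
).loc[:, "Open":"Close"]  # keep the columns from Open through Close, inclusive

print(type(prices.index).__name__)
print(list(prices.columns))
```

Note that label slices with `.loc` include both endpoints, so `Close` is kept and only `Volume` is dropped.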

```
fig, ax = plt.subplots(figsize=(16, 6))
df.plot(ax=ax)
plt.title("Bitcoin price (USD)", fontsize=16, fontweight='bold', color='white')
```

## Plotting example: Bollinger bands

As a sneak peek of what we'll see in upcoming lectures, let's make some basic plots using *pandas*.

Bollinger Bands are a technical trading tool created by John Bollinger in the early 1980s. They arose from the need for adaptive trading bands and the observation that volatility was dynamic, not static as was widely believed at the time.

### Calculate Bollinger bands

To demonstrate the strategy we will use a 30-period rolling mean window and 1.5 standard deviations for each of the bands. This might not be the optimal configuration for this dataset, but we will talk more about optimizing these two arguments later.

```
# set number of days and standard deviations to use for rolling
# lookback period for Bollinger band calculation
window = 30
no_of_std = 1.5
# calculate rolling mean and standard deviation
rolling_mean = df['Close'].rolling(window).mean()
rolling_std = df['Close'].rolling(window).std()
# create two new DataFrame columns to hold values of upper and lower Bollinger bands
df['Rolling Mean'] = rolling_mean
df['Bollinger High'] = rolling_mean + (rolling_std * no_of_std)
df['Bollinger Low'] = rolling_mean - (rolling_std * no_of_std)
```

```
df.tail()
```
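The same calculation is easier to check by hand on a tiny series. The sketch below uses five made-up closing prices and a window of 3 instead of 30; it follows the same steps as the cell above and is not meant as a trading configuration.

```python
import pandas as pd

# Made-up closing prices; a short window keeps the numbers easy to check.
close = pd.Series([10.0, 12.0, 14.0, 13.0, 15.0], name="Close")
window = 3
no_of_std = 1.5

rolling_mean = close.rolling(window).mean()
rolling_std = close.rolling(window).std()   # sample standard deviation (ddof=1)

bands = pd.DataFrame({
    "Close": close,
    "Rolling Mean": rolling_mean,
    "Bollinger High": rolling_mean + rolling_std * no_of_std,
    "Bollinger Low": rolling_mean - rolling_std * no_of_std,
})

# The first window - 1 rows are NaN because the window is not yet full.
print(bands)
```

For the third row, the window holds 10, 12 and 14: the mean is 12 and the sample standard deviation is 2, so the bands sit at 12 ± 1.5 × 2.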

```
fig, ax = plt.subplots(figsize=(16, 6))
df[['Close','Bollinger High','Bollinger Low']].plot(ax=ax)
plt.title("Bitcoin - Bollinger bands (USD)", fontsize=16, fontweight='bold', color='white')
```

Check out the blog post we wrote about Bollinger bands here!