Intro to Pandas DataFrames

Last updated: May 14th, 2019

rmotr


Intro to Pandas DataFrames

The DataFrame is probably the most important data structure in pandas. It's a tabular structure tightly integrated with Series.

A DataFrame is a tabular structure with the following properties:

  • It's composed of an ordered sequence of rows and an ordered sequence of columns.
  • It uses an index to reference individual rows.
  • Each column can have a different NumPy-related type.
  • It can be seen as a collection of Series that all share the same index.
  • It can be "sliced" horizontally (per row) or vertically (per column).
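The properties listed above can be checked on a small example. This is only a minimal sketch: the names ages, cities and people, and the data in them, are made up for illustration.

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical data).
ages = pd.Series([31, 45, 27], index=['ann', 'bob', 'cy'])
cities = pd.Series(['Lima', 'Oslo', 'Kyiv'], index=['ann', 'bob', 'cy'])

# A DataFrame can be built as a collection of Series sharing one index.
people = pd.DataFrame({'age': ages, 'city': cities})

# Each column keeps its own type.
print(people.dtypes)

# "Vertical" slice: one column, returned as a Series.
age_column = people['age']

# "Horizontal" slice: one row, selected by its index label.
bob_row = people.loc['bob']
print(bob_row)
```

Both slices come back as Series: the column is indexed by the row labels, and the row is indexed by the column names.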

Hands on!

In [1]:
import numpy as np
import pandas as pd

DataFrames creation

The DataFrame constructor accepts the following parameters:

  • data: (optional, but normally provided) holds all the data we want to store in the DataFrame. It can be a dictionary of Series, a dictionary of sequences, a two-dimensional ndarray, a Series or another DataFrame.
  • index: (optional) holds the labels we want to assign to the rows of our DataFrame. It can be a Python sequence or a one-dimensional ndarray. Default value: a RangeIndex (0, 1, ..., number of rows - 1).
  • columns: (optional) holds the labels we want to assign to the columns of our DataFrame. It can be a Python sequence or a one-dimensional ndarray. Default value: a RangeIndex (0, 1, ..., number of columns - 1).
  • dtype: (optional) a single NumPy data type to force on all columns.
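As a quick check of the defaults, the sketch below builds a DataFrame without index or columns and inspects the labels pandas generates. The variable names defaults and as_float32 are placeholders chosen for this example.

```python
import numpy as np
import pandas as pd

# No index or columns given: pandas generates integer labels for both.
defaults = pd.DataFrame(np.zeros((2, 3)))
print(defaults.index)    # row labels
print(defaults.columns)  # column labels

# dtype applies a single type to the whole DataFrame.
as_float32 = pd.DataFrame([[1, 2], [3, 4]], dtype=np.float32)
print(as_float32.dtypes)
```

The generated labels are a RangeIndex, which behaves like the integer sequence 0, 1, 2, ... without storing every value.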
In [2]:
# Using a dictionary of sequences
dataframe = pd.DataFrame({'var1': [1, 2, 3],
                          'var2': ['one', 'two', 'three'],
                          'var3': [1.0, 2.0, 3.0]})

dataframe
Out[2]:
var1 var2 var3
0 1 one 1.0
1 2 two 2.0
2 3 three 3.0
In [3]:
# Using a dictionary of Series
dataframe = pd.DataFrame({'var1': pd.Series([1, 2, 3], dtype=np.float64),
                          'var2': pd.Series(['a', 'b'])})

dataframe
Out[3]:
var1 var2
0 1.0 a
1 2.0 b
2 3.0 NaN
In [4]:
# Using an ndarray with explicit row and column labels
dataframe = pd.DataFrame(np.arange(16).reshape(4, 4),
                         index=['r1', 'r2', 'r3', 'r4'],
                         columns=['c1', 'c2', 'c3', 'c4'])
dataframe
Out[4]:
c1 c2 c3 c4
r1 0 1 2 3
r2 4 5 6 7
r3 8 9 10 11
r4 12 13 14 15
In [5]:
# Using an ndarray with default row and column labels and a fixed dtype
dataframe = pd.DataFrame(np.arange(16).reshape(4,4), dtype=np.int32)

dataframe
Out[5]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [6]:
dataframe.dtypes
Out[6]:
0    int32
1    int32
2    int32
3    int32
dtype: object

DataFrame elements

DataFrames expose some useful attributes:

In [7]:
dataframe = pd.DataFrame({'var1': pd.Series([1, 2, 3], dtype=np.float64),
                          'var2': pd.Series(['a', 'b'])})

dataframe
Out[7]:
var1 var2
0 1.0 a
1 2.0 b
2 3.0 NaN
In [8]:
# Show first rows of our DataFrame
dataframe.head()
Out[8]:
var1 var2
0 1.0 a
1 2.0 b
2 3.0 NaN
In [9]:
# Type of our DataFrame columns
dataframe.dtypes
Out[9]:
var1    float64
var2     object
dtype: object
In [10]:
dataframe
Out[10]:
var1 var2
0 1.0 a
1 2.0 b
2 3.0 NaN
In [11]:
# Values of a DataFrame
dataframe.values
Out[11]:
array([[1.0, 'a'],
       [2.0, 'b'],
       [3.0, nan]], dtype=object)
In [12]:
dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
var1    3 non-null float64
var2    2 non-null object
dtypes: float64(1), object(1)
memory usage: 128.0+ bytes
In [13]:
type(dataframe.values)
Out[13]:
numpy.ndarray
In [14]:
# Using an ndarray with explicit row and column labels
dataframe = pd.DataFrame(np.arange(16).reshape(4, 4),
                         index=['r1', 'r2', 'r3', 'r4'],
                         columns=['c1', 'c2', 'c3', 'c4'])
dataframe
Out[14]:
c1 c2 c3 c4
r1 0 1 2 3
r2 4 5 6 7
r3 8 9 10 11
r4 12 13 14 15
In [15]:
# Index of a DataFrame
dataframe.index
Out[15]:
Index(['r1', 'r2', 'r3', 'r4'], dtype='object')
In [16]:
# Columns of a DataFrame
dataframe.columns
Out[16]:
Index(['c1', 'c2', 'c3', 'c4'], dtype='object')
In [17]:
# Dimension of the DataFrame
dataframe.ndim
Out[17]:
2
In [18]:
# Shape of the DataFrame
dataframe.shape
Out[18]:
(4, 4)
In [19]:
# Number of DataFrame elements
dataframe.size
Out[19]:
16

Index objects are immutable, so we can't change individual labels in place. However, we can replace a complete index with a new one.

In [20]:
dataframe = pd.DataFrame({'var1': pd.Series([1, 2, 3], dtype=np.float64),
                          'var2': pd.Series(['a', 'b'])})

dataframe
Out[20]:
var1 var2
0 1.0 a
1 2.0 b
2 3.0 NaN
In [21]:
# Modifying a row index will give us an error
dataframe.index[0] = 4
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-96113c2111c7> in <module>
      1 # Modifying a row index will give us an error
----> 2 dataframe.index[0] = 4

/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   2048 
   2049     def __setitem__(self, key, value):
-> 2050         raise TypeError("Index does not support mutable operations")
   2051 
   2052     def __getitem__(self, key):

TypeError: Index does not support mutable operations
In [ ]:
# Modifying a column index will give us an error
dataframe.columns[0] = 4
In [ ]:
# Replacing the whole row index works
dataframe.index = ['r1', 'r2', 'r3']
dataframe
In [ ]:
# Replacing the whole column index works
dataframe.columns = ['c1', 'c2']
dataframe
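One detail worth checking when replacing a whole index: the new labels must match the number of rows. The sketch below uses a throwaway DataFrame named frame (a name chosen for this example) to show both cases.

```python
import pandas as pd

frame = pd.DataFrame({'var1': [1.0, 2.0, 3.0]})

# Replacing the whole index works when the new labels match the row count.
frame.index = ['r1', 'r2', 'r3']

# A list of the wrong length is rejected with a ValueError.
try:
    frame.index = ['only', 'two']
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(mismatch_rejected)
print(list(frame.index))
```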

The Group of Seven

We'll continue our analysis of the "G7 countries", now using DataFrames. As said, a DataFrame looks a lot like a table (like the one shown here):

image

Creating DataFrames manually can be tedious. Most of the time you'll be pulling the data from a database, a CSV file or the web. Still, you can create a DataFrame by specifying the columns and values:
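For comparison, here is a minimal sketch of loading a DataFrame from CSV text with pd.read_csv. The CSV content and the names csv_text and countries are made up for this example; with real data you would pass a file path or URL instead of the in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or URL.
csv_text = """country,population,continent
Canada,35.467,America
France,63.951,Europe
Japan,127.061,Asia
"""

# read_csv accepts a path, a URL, or any file-like object.
countries = pd.read_csv(io.StringIO(csv_text), index_col='country')
print(countries)
```

Passing index_col='country' uses that column as the row index instead of the default numeric one.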

In [ ]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

(The columns parameter is optional. I'm using it to keep the same column order as in the picture above.)

In [ ]:
df
In [ ]:
df.dtypes
In [ ]:
type(df.values)
In [ ]:
df.ndim
In [ ]:
df.shape
In [ ]:
df.size
In [ ]:
df.info()
In [ ]:
df.describe()
In [ ]:
# get_dtype_counts() was removed in pandas 1.0; this is the equivalent
df.dtypes.value_counts()
In [ ]:
df.columns
In [ ]:
df.index

Changing column type

In [ ]:
# Returns a new Series (df is not modified); np.int was removed from NumPy, so use int
df['Population'].astype(int)
In [ ]:
df
In [ ]:
df['Rounded Population'] = df['Population'].astype(int)
In [ ]:
df
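Note that converting floats with astype(int) truncates rather than rounds. The sketch below, using the first three population values from the table and the placeholder names truncated and rounded, shows the difference.

```python
import pandas as pd

population = pd.Series([35.467, 63.951, 80.94])

# astype(int) drops the fractional part (truncation toward zero).
truncated = population.astype(int)

# round() first, then convert, to get the nearest integer instead.
rounded = population.round().astype(int)

print(truncated.tolist())  # [35, 63, 80]
print(rounded.tolist())    # [35, 64, 81]
```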

Changing a DataFrame's column index

In [ ]:
df.columns = ['P', 'GDP', 'SA', 'HDI', 'C', 'RP']

df

(We'll restore the original column labels.)

In [ ]:
df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent', 'Rounded Population']

Changing a DataFrame's row index

DataFrames also have a row index. As you can see in the "table" above, pandas automatically assigned an auto-incrementing numeric index to each row of our DataFrame. In our case, we know that each row represents a country, so we'll just reassign the index:

In [ ]:
df
In [ ]:
df.index
In [ ]:
df.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]
In [ ]:
df
In [ ]:
df.index

Removing indexes

We can also discard the current index of our DataFrame at any time, keeping it as a new column of our data. To do that we use the reset_index() method. The new index will be a numerical sequence.

Note 1: reset_index() returns a new DataFrame, so if we want to keep the result we need to assign it to a variable.

Note 2: if we don't want to keep the old index as a column, we can drop it using the drop=True parameter.

In [ ]:
df = df.reset_index()

df

We can also turn one or more columns back into the DataFrame index:

In [ ]:
df = df.set_index(['index'])

df
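The two notes on reset_index() and the set_index() call can be sketched together on a small DataFrame. The names g7, with_column, without_column and restored are placeholders for this example.

```python
import pandas as pd

g7 = pd.DataFrame({'HDI': [0.913, 0.888]}, index=['Canada', 'France'])

# reset_index() returns a new DataFrame; the original is left unchanged.
with_column = g7.reset_index()

# drop=True discards the old index instead of keeping it as a column.
without_column = g7.reset_index(drop=True)

# set_index() moves a column back into the index.
restored = with_column.set_index('index')

print(list(with_column.columns))
print(list(without_column.columns))
print(list(restored.index))
```

Because the original index has no name, reset_index() stores it in a column called 'index', which is why set_index('index') restores it.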

Changing a DataFrame's row and column index at once

In [ ]:
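# rename() returns a new DataFrame; labels that don't exist are ignored
df.rename(
    columns={
        'HDI': 'Human Development Index',
        'Annual Popcorn Consumption': 'APC'  # not a column of df
    },
    index={
        'United States': 'USA',
        'United Kingdom': 'UK',
        'Argentina': 'AR'  # not in the index of df
    }
)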
In [ ]:
df.rename(index=str.upper)
In [ ]:
df.rename(index=lambda x: x.lower())
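Two behaviours of rename() are easy to miss: labels that are not present are skipped without an error, and the original DataFrame is not modified. A minimal sketch, using a small stand-in DataFrame named scores (a hypothetical name for this example):

```python
import pandas as pd

scores = pd.DataFrame({'HDI': [0.907, 0.915]},
                      index=['United Kingdom', 'United States'])

# Labels missing from the DataFrame ('Argentina' here) are silently skipped.
renamed = scores.rename(
    columns={'HDI': 'Human Development Index'},
    index={'United States': 'USA', 'Argentina': 'AR'},
)

# rename() returns a new DataFrame; the original labels are untouched.
print(list(renamed.index))
print(list(scores.index))
```

To keep the renamed labels, assign the result back to a variable, for example df = df.rename(...).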
