
Intro to Pandas DataFrames

Last updated: November 28th, 2019

rmotr



The DataFrame is probably the most important data structure in pandas. It's a tabular structure tightly integrated with Series.

A DataFrame is a tabular structure with the following properties:

  • It's composed of an ordered sequence of rows and an ordered sequence of columns.
  • It uses an index to reference individual rows.
  • Each column can have a different NumPy-related type.
  • It can be seen as a collection of multiple Series, all sharing the same index.
  • It can be "sliced" horizontally (per row) or vertically (per column).
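
To make the last two properties concrete, here is a minimal sketch. The Series contents, the labels 'a', 'b', 'c', and the column names are made up for illustration. It builds a DataFrame from two Series that share an index, then takes one vertical and one horizontal slice.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative values)
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series(['x', 'y', 'z'], index=['a', 'b', 'c'])

# The DataFrame keeps the shared index and stores each Series as a column
frame = pd.DataFrame({'numbers': s1, 'letters': s2})

# Vertical slice: selecting one column returns a Series
column = frame['numbers']

# Horizontal slice: selecting one row by its label also returns a Series
row = frame.loc['b']

print(type(column).__name__)
print(row['letters'])
```

Both slices come back as Series: the column is indexed by the row labels, and the row is indexed by the column names.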


Hands on!

In [1]:
import numpy as np
import pandas as pd


DataFrames creation

The DataFrame constructor accepts the following parameters:

  • data: (usually provided; if omitted, the DataFrame is empty) holds the data we want to store in the DataFrame. It can be a dictionary of Series, a dictionary of sequences, a two-dimensional ndarray, a Series, or another DataFrame.
  • index: (optional) holds the labels we want to assign to the rows of our DataFrame. It can be a Python sequence or a one-dimensional ndarray. Default value: integer labels equivalent to np.arange(0, len(rows)).
  • columns: (optional) holds the labels we want to assign to the columns of our DataFrame. It can be a Python sequence or a one-dimensional ndarray. Default value: integer labels equivalent to np.arange(0, len(columns)).
  • dtype: (optional) a single NumPy data type to force on all columns.
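
Before the main examples, here is a short sketch of a few of the less common input types listed above: a single Series, another DataFrame, and an unlabeled ndarray. The variable names and values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# From a single Series: the Series name becomes the column label
series = pd.Series([10, 20, 30], name='values')
from_series = pd.DataFrame(series)

# From another DataFrame, forcing one dtype on every column
base = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
as_float = pd.DataFrame(base, dtype=np.float64)

# From a two-dimensional ndarray without labels:
# rows and columns both get default integer labels
unlabeled = pd.DataFrame(np.zeros((2, 3)))

print(from_series.columns.tolist())
print(unlabeled.shape)
```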
In [2]:
# Using a dictionary of sequences
dataframe = pd.DataFrame({'var1': [1, 2, 3],
                          'var2': ['one', 'two', 'three'],
                          'var3': [1.0, 2.0, 3.0]})

dataframe
Out[2]:
var1 var2 var3
0 1 one 1.0
1 2 two 2.0
2 3 three 3.0
In [3]:
# Using a dictionary of Series
dataframe = pd.DataFrame({'var1': pd.Series([1, 2, 3], dtype=np.float64),
                          'var2': pd.Series(['a', 'b'])})

dataframe
Out[3]:
var1 var2
0 1.0 a
1 2.0 b
2 3.0 NaN
In [4]:
# Using an ndarray with labels for rows and columns
dataframe = pd.DataFrame(np.arange(16).reshape(4, 4),
                         index=['r1', 'r2', 'r3', 'r4'],
                         columns=['c1', 'c2', 'c3', 'c4'])
dataframe
Out[4]:
c1 c2 c3 c4
r1 0 1 2 3
r2 4 5 6 7
r3 8 9 10 11
r4 12 13 14 15
In [5]:
# Using an ndarray with default row and column labels and a fixed dtype
dataframe = pd.DataFrame(np.arange(16).reshape(4,4), dtype=np.int32)

dataframe
Out[5]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [6]:
dataframe.dtypes
Out[6]:
0    int32
1    int32
2    int32
3    int32
dtype: object


DataFrame elements

DataFrames expose some useful attributes and methods:

In [7]:
dataframe = pd.DataFrame({'var1': pd.Series([1, 2, 3], dtype=np.float64),
                          'var2': pd.Series(['a', 'b'])})

dataframe
Out[7]:
var1 var2
0 1.0 a
1 2.0 b
2 3.0 NaN
In [8]:
# Show first rows of our DataFrame
dataframe.head()
Out[8]:
var1 var2
0 1.0 a
1 2.0 b
2 3.0 NaN
In [9]:
# Type of our DataFrame columns
dataframe.dtypes
Out[9]:
var1    float64
var2     object
dtype: object
In [10]:
dataframe
Out[10]:
var1 var2
0 1.0 a
1 2.0 b
2 3.0 NaN
In [11]:
# Values of a DataFrame
dataframe.values
Out[11]:
array([[1.0, 'a'],
       [2.0, 'b'],
       [3.0, nan]], dtype=object)
In [12]:
dataframe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
var1    3 non-null float64
var2    2 non-null object
dtypes: float64(1), object(1)
memory usage: 176.0+ bytes
In [13]:
type(dataframe.values)
Out[13]:
numpy.ndarray
In [14]:
# Using an ndarray with labels for rows and columns
dataframe = pd.DataFrame(np.arange(16).reshape(4, 4),
                         index=['r1', 'r2', 'r3', 'r4'],
                         columns=['c1', 'c2', 'c3', 'c4'])
dataframe
Out[14]:
c1 c2 c3 c4
r1 0 1 2 3
r2 4 5 6 7
r3 8 9 10 11
r4 12 13 14 15
In [15]:
# Index of a DataFrame
dataframe.index
Out[15]:
Index(['r1', 'r2', 'r3', 'r4'], dtype='object')
In [16]:
# Columns of a DataFrame
dataframe.columns
Out[16]:
Index(['c1', 'c2', 'c3', 'c4'], dtype='object')
In [17]:
# Dimension of the DataFrame
dataframe.ndim
Out[17]:
2
In [18]:
# Shape of the DataFrame
dataframe.shape
Out[18]:
(4, 4)
In [19]:
# Number of DataFrame elements
dataframe.size
Out[19]:
16
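
The attributes above are related to one another. The following sketch rebuilds the same 4x4 DataFrame under the hypothetical name `frame` and checks those relationships explicitly.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['r1', 'r2', 'r3', 'r4'],
                     columns=['c1', 'c2', 'c3', 'c4'])

n_rows, n_cols = frame.shape

# ndim is the number of axes, i.e. the length of the shape tuple
print(frame.ndim == len(frame.shape))

# size is the total number of cells: rows times columns
print(frame.size == n_rows * n_cols)

# The index and columns lengths match the two entries of shape
print(len(frame.index) == n_rows and len(frame.columns) == n_cols)
```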

Index objects are immutable, so we can't change individual labels in place. However, we can replace an entire index with a new one.

In [20]:
dataframe = pd.DataFrame({'var1': pd.Series([1, 2, 3], dtype=np.float64),
                          'var2': pd.Series(['a', 'b'])})

dataframe
Out[20]:
var1 var2
0 1.0 a
1 2.0 b
2 3.0 NaN
In [21]:
# Modifying a row index will give us an error
dataframe.index[0] = 4
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-96113c2111c7> in <module>
      1 # Modifying a row index will give us an error
----> 2 dataframe.index[0] = 4

/usr/local/lib/python3.7/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   4258 
   4259     def __setitem__(self, key, value):
-> 4260         raise TypeError("Index does not support mutable operations")
   4261 
   4262     def __getitem__(self, key):

TypeError: Index does not support mutable operations
In [22]:
# Modifying a column index will give us an error
dataframe.columns[0] = 4
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-e244ac47fdd7> in <module>
      1 # Modifying a column index will give us an error
----> 2 dataframe.columns[0] = 4

/usr/local/lib/python3.7/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value)
   4258 
   4259     def __setitem__(self, key, value):
-> 4260         raise TypeError("Index does not support mutable operations")
   4261 
   4262     def __getitem__(self, key):

TypeError: Index does not support mutable operations
In [23]:
# This will work
dataframe.index = ['r1', 'r2', 'r3']
dataframe
Out[23]:
var1 var2
r1 1.0 a
r2 2.0 b
r3 3.0 NaN
In [24]:
# This will work
dataframe.columns = ['c1', 'c2']
dataframe
Out[24]:
c1 c2
r1 1.0 a
r2 2.0 b
r3 3.0 NaN
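
When only one label needs to change, two common workarounds are sketched below. The new labels 'first' and 'value' are arbitrary examples. The first option edits a plain Python list and assigns it back as a whole index. The second uses `rename`, which returns a new DataFrame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'c1': [1.0, 2.0, 3.0], 'c2': ['a', 'b', np.nan]},
                     index=['r1', 'r2', 'r3'])

# Option 1: copy the labels into a list, edit the list,
# then assign the whole list back as the new index
labels = frame.index.tolist()
labels[0] = 'first'
frame.index = labels

# Option 2: rename() maps old labels to new ones and returns a new DataFrame
renamed = frame.rename(columns={'c1': 'value'})

print(frame.index.tolist())
print(renamed.columns.tolist())
```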


The Group of Seven

We'll continue our analysis of the "G7 countries", now using DataFrames. As mentioned, a DataFrame looks a lot like a table (like the one shown here):

image

Creating DataFrames manually can be tedious. Most of the time you'll be pulling the data from a database, a CSV file, or the web. Still, you can create a DataFrame by specifying the columns and values:

In [25]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])

(The columns parameter is optional. I'm using it to keep the same column order as in the picture above.)
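
As a small illustration of what the columns parameter does with dictionary data, consider the sketch below. The dictionary `data` and its keys are made up, and the default ordering shown assumes a reasonably recent pandas and Python, where dictionaries keep insertion order.

```python
import pandas as pd

data = {'b': [1, 2], 'a': [3, 4]}

# Without columns, the dictionary's insertion order is kept
default_order = pd.DataFrame(data)

# With columns, the listed order is used instead
reordered = pd.DataFrame(data, columns=['a', 'b'])

# Listing only some keys selects just those columns
subset = pd.DataFrame(data, columns=['a'])

print(default_order.columns.tolist())
print(reordered.columns.tolist())
print(subset.columns.tolist())
```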

In [37]:
df.head()
Out[37]:
Population GDP Surface Area HDI Continent
0 35.467 1785387 9984670 0.913 America
1 63.951 2833687 640679 0.888 Europe
2 80.940 3874437 357114 0.916 Europe
3 60.665 2167744 301336 0.873 Europe
4 127.061 4602367 377930 0.891 Asia
In [27]:
df.dtypes
Out[27]:
Population      float64
GDP               int64
Surface Area      int64
HDI             float64
Continent        object
dtype: object
In [28]:
type(df.values)
Out[28]:
numpy.ndarray
In [29]:
df.ndim
Out[29]:
2
In [30]:
df.shape
Out[30]:
(7, 5)
In [31]:
df.size
Out[31]:
35
In [32]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
Population      7 non-null float64
GDP             7 non-null int64
Surface Area    7 non-null int64
HDI             7 non-null float64
Continent       7 non-null object
dtypes: float64(2), int64(2), object(1)
memory usage: 408.0+ bytes
In [33]:
df.describe()
Out[33]:
Population GDP Surface Area HDI
count 7.000000 7.000000e+00 7.000000e+00 7.000000
mean 107.302571 5.080248e+06 3.061327e+06 0.900429
std 97.249970 5.494020e+06 4.576187e+06 0.016592
min 35.467000 1.785387e+06 2.424950e+05 0.873000
25% 62.308000 2.500716e+06 3.292250e+05 0.889500
50% 64.511000 2.950039e+06 3.779300e+05 0.907000
75% 104.000500 4.238402e+06 5.082873e+06 0.914000
max 318.523000 1.734808e+07 9.984670e+06 0.916000
In [34]:
# get_dtype_counts() is deprecated; count the columns per dtype with value_counts() instead
df.dtypes.value_counts()
Out[34]:
float64    2
int64      2
object     1
dtype: int64
In [35]:
df.columns
Out[35]:
Index(['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'], dtype='object')
In [36]:
df.index
Out[36]:
RangeIndex(start=0, stop=7, step=1)
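
Since most real data comes from files rather than hand-written dictionaries, here is a hedged sketch of loading a few of the same values from CSV text with `pd.read_csv`. The CSV content is an in-memory stand-in built with `io.StringIO`. In practice you would pass a file path or URL, which is not shown here.

```python
import io
import pandas as pd

# Illustrative CSV content reusing three rows from the table above;
# a real workflow would read from a file path instead
csv_text = """Population,HDI,Continent
35.467,0.913,America
63.951,0.888,Europe
80.94,0.916,Europe
"""

# read_csv takes the header row as column labels and infers column types
csv_df = pd.read_csv(io.StringIO(csv_text))

print(csv_df.shape)
print(csv_df.columns.tolist())
```

The result supports the same attributes explored earlier, such as `dtypes`, `shape`, and `index`.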

