Profile picture

Useful Statistical Methods for DataFrames

Last updated: October 31st, 20192019-10-31Project preview

rmotr


Useful statistical methods for DataFrames

There exists a large number of methods for computing descriptive statistics and other related operations on DataFrames.

Most of these are aggregations (hence producing a lower-dimensional result) like sum(), mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same size.

purple-divider

Hands on!

In [1]:
import numpy as np
import pandas as pd
In [2]:
pd.options.display.float_format = '{:,.2f}'.format

green-divider

The first thing we'll do is create again the DataFrame from our previous lecture:

In [3]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387.0, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    'Continent': ['America', 'Europe', 'Europe', 'Europe',
                  'Asia', 'Europe', 'America']
})

df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent']

df.index = ['Canada', 'France', 'Germany', 'Italy',
            'Japan', 'United Kingdom', 'United States']
In [4]:
df
Out[4]:
Population GDP Surface Area HDI Continent
Canada 35.47 1,785,387.00 9984670 0.91 America
France 63.95 2,833,687.00 640679 0.89 Europe
Germany 80.94 3,874,437.00 357114 0.92 Europe
Italy 60.66 2,167,744.00 301336 0.87 Europe
Japan 127.06 4,602,367.00 377930 0.89 Asia
United Kingdom 64.51 2,950,039.00 242495 0.91 Europe
United States 318.52 17,348,075.00 9525067 0.92 America

green-divider

Using Universal Functions (Ufuncs) to obtain statistical info

We can apply any Universal Function to a DataFrame column.

You've already seen the describe method, which gives you a good "summary" of the whole DataFrame or any specific column. Let's explore other methods in more detail:

In [5]:
df.describe()
Out[5]:
Population GDP Surface Area HDI
count 7.00 7.00 7.00 7.00
mean 107.30 5,080,248.00 3,061,327.29 0.90
std 97.25 5,494,020.16 4,576,186.57 0.02
min 35.47 1,785,387.00 242,495.00 0.87
25% 62.31 2,500,715.50 329,225.00 0.89
50% 64.51 2,950,039.00 377,930.00 0.91
75% 104.00 4,238,402.00 5,082,873.00 0.91
max 318.52 17,348,075.00 9,984,670.00 0.92

Every other ufunc that we've used for Series, also works for entire DataFrames.

Here is a quick reference summary table of common functions:

Function Description
count Number of non-NA observations
sum Sum of values
mean Mean of values
mad Mean absolute deviation
median Arithmetic median of values
min Minimum
max Maximum
mode Mode
abs Absolute Value
prod Product of values
std Bessel-corrected sample standard deviation
var Unbiased variance
sem Standard error of the mean
skew Sample skewness (3rd moment)
kurt Sample kurtosis (4th moment)
quantile Sample quantile (value at %)
cumsum Cumulative sum
cumprod Cumulative product
cummax Cumulative maximum
cummin Cumulative minumum

Note that by chance some NumPy methods, like mean, std, and sum, will exclude NAs values by default

In [6]:
df[['GDP', 'Population']].max()
Out[6]:
GDP          17,348,075.00
Population          318.52
dtype: float64
In [7]:
df[['GDP', 'Population']].min()
Out[7]:
GDP          1,785,387.00
Population          35.47
dtype: float64
In [8]:
df[['GDP', 'Population']].sum()
Out[8]:
GDP          35,561,736.00
Population          751.12
dtype: float64
In [12]:
df[['GDP', 'Population']].mean()
Out[12]:
GDP          5,080,248.00
Population         107.30
dtype: float64
In [13]:
df[['GDP', 'Population']].std()
Out[13]:
GDP          5,494,020.16
Population          97.25
dtype: float64
In [14]:
df[['GDP', 'Population']].quantile(.25)
Out[14]:
GDP          2,500,715.50
Population          62.31
Name: 0.25, dtype: float64
In [18]:
df[['GDP', 'Population']].quantile(.9)
Out[18]:
GDP          9,700,650.20
Population         203.65
Name: 0.9, dtype: float64

Note that methods like cumsum() and cumprod() preserve the location of NaN values.

In [20]:
df[['GDP', 'Population']].cumsum()
Out[20]:
GDP Population
Canada 1,785,387.00 35.47
France 4,619,074.00 99.42
Germany 8,493,511.00 180.36
Italy 10,661,255.00 241.02
Japan 15,263,622.00 368.08
United Kingdom 18,213,661.00 432.60
United States 35,561,736.00 751.12

 Standarization

Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation 1), very concisely:

In [21]:
(df['GDP'] - df['GDP'].mean()) / df['GDP'].std()
Out[21]:
Canada           -0.60
France           -0.41
Germany          -0.22
Italy            -0.53
Japan            -0.09
United Kingdom   -0.39
United States     2.23
Name: GDP, dtype: float64

Summary Statistics for Categorical Columns

Also, there are some summary statistics we can use to help us better understand our categorical columns.

In [24]:
df['Continent'].unique()
Out[24]:
array(['America', 'Europe', 'Asia'], dtype=object)
In [25]:
df['Continent'].value_counts()
Out[25]:
Europe     4
America    2
Asia       1
Name: Continent, dtype: int64

purple-divider

Notebooks AI
Notebooks AI Profile20060