
Probability Distributions and Random Variables

Last updated: June 14th, 2019

rmotr



In probability theory and statistics, a probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment.

👉 External resource: To learn more about comparing means of distributions, check out this video from Khan Academy: https://youtu.be/pPnxPrhf6Ww


Hands on!

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


Random variables

A random variable is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables: discrete and continuous.

  • The probability distribution of a discrete random variable applies to scenarios where the set of possible outcomes is discrete (such as a coin toss or a roll of a die). It can be encoded by a discrete list of the probabilities of the outcomes, known as a probability mass function.

Some examples of discrete probability distributions are Bernoulli distribution, Binomial distribution and Poisson distribution.

  • The probability distribution of a continuous random variable applies to scenarios where the set of possible outcomes can take on values in a continuous range (e.g. real numbers), such as the temperature on a given day. It is typically described by a probability density function (with the probability of any individual outcome actually being 0).

Some examples of continuous probability distributions are Uniform distribution, Normal distribution, Exponential distribution and Beta distribution.
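
The difference between a probability mass function and a probability density function can be sketched numerically. The example below is only an illustration: the fair die and the uniform density on [0, 10] (a hypothetical temperature range) are assumptions chosen for simplicity and are not part of the original notebook.

```python
import numpy as np

# Discrete case: a fair six-sided die. The PMF lists one probability per outcome.
die_outcomes = np.arange(1, 7)
die_pmf = np.full(6, 1 / 6)
pmf_total = float(die_pmf.sum())  # the probabilities of all outcomes add up to 1

# Continuous case: a constant density on [0, 10] (hypothetical range).
low, high = 0.0, 10.0
grid = np.linspace(low, high, 1001)
density = np.full_like(grid, 1 / (high - low))
dx = grid[1] - grid[0]

# Riemann-sum approximation of the total area under the density curve.
pdf_area = float(np.sum(density[:-1] * dx))

# The probability of an interval is the area under the curve over that interval.
mask = (grid >= 2.0) & (grid < 5.0)
prob_2_to_5 = float(np.sum(density[mask] * dx))

print(pmf_total, pdf_area, prob_2_to_5)
```

A single point has zero width, so its area under the density curve is 0, which is why individual outcomes of a continuous variable have probability 0.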


Normal distribution

The normal distribution, also known as the Gaussian distribution, is probably the most common distribution. You will encounter it in many places, especially in statistical inference. Normality is also an assumption of many data science algorithms.

A normal distribution has a bell-shaped density curve described by its mean $\mu$ and standard deviation $\sigma$. The density curve is symmetrical and centered about its mean, with its spread determined by its standard deviation; data near the mean occur more frequently than data far from the mean.

The probability density function of a normal distribution with mean $\mu$ and standard deviation $\sigma$ at a given point $x$ is given by:

$$ f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$
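
As a quick sanity check, the formula above can be evaluated directly with NumPy. This is a minimal sketch: the helper name `normal_pdf` and the integration grid are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate the normal density f(x | mu, sigma^2) from the formula above."""
    coef = 1.0 / np.sqrt(2 * np.pi * sigma**2)
    return coef * np.exp(-((x - mu) ** 2) / (2 * sigma**2))

# Height of the standard normal curve at its mean: 1 / sqrt(2 * pi)
peak = float(normal_pdf(0.0))

# The curve is symmetric about the mean.
left, right = float(normal_pdf(-1.5)), float(normal_pdf(1.5))

# Riemann-sum approximation of the area under the curve on [-8, 8].
xs = np.linspace(-8, 8, 16001)
dx = xs[1] - xs[0]
area = float(np.sum(normal_pdf(xs)) * dx)

print(peak, left, right, area)
```

The area comes out very close to 1, as expected for a density, and the two symmetric points have the same height.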
In [ ]:
normal = pd.DataFrame()
normal['x'] = np.random.normal(0, 1, 10000) # loc, scale, size
normal['y'] = np.random.normal(5, 1, 10000)
normal['z'] = normal['x'] + normal['y'] # sum of two independent normal samples

plt.figure(figsize=(12,6))

# distplot is deprecated since seaborn 0.11; histplot with kde=True draws a comparable plot
sns.histplot(normal['z'], kde=True, stat='density')

# Mean line
plt.axvline(normal['z'].mean(), color='#e74c3c', linestyle='dashed', linewidth=2)

# Standard deviation lines
plt.axvline(normal['z'].mean() + normal['z'].std(), color="#2c3e50", linestyle='dotted', linewidth=2)
plt.axvline(normal['z'].mean() - normal['z'].std(), color="#2c3e50", linestyle='dotted', linewidth=2)

plt.show()
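
The plotted variable is the sum of two independent normal samples, so it should itself look normal, with mean 0 + 5 = 5 and standard deviation $\sqrt{1^2 + 1^2} \approx 1.41$. The sketch below checks this numerically; the seed, the larger sample size and the use of `np.random.default_rng` are assumptions made here for reproducibility.

```python
import numpy as np

# Seeded generator so the numbers are reproducible (an assumption of this sketch).
rng = np.random.default_rng(42)

x = rng.normal(0, 1, 100_000)  # loc, scale, size
y = rng.normal(5, 1, 100_000)
z = x + y

z_mean = float(z.mean())  # expected to be close to 5
z_std = float(z.std())    # expected to be close to sqrt(2)

# Share of samples within one standard deviation of the mean
# (roughly 68% for a normal distribution).
within_one_std = float(np.mean(np.abs(z - z_mean) <= z_std))

print(f"mean={z_mean:.3f}, std={z_std:.3f}, within 1 std={within_one_std:.3f}")
```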


Uniform distribution

The probability density function of the continuous uniform distribution on the interval $[a, b]$ is:

$$ f(x) = \left\{\begin{matrix} \frac{1}{b-a} & \text{for } a \leq x \leq b, \\ 0 & \text{for } x < a \text{ or } x > b \end{matrix}\right. $$

Since any two intervals of equal width inside the range have an equal probability of being observed, the curve describing the distribution is a rectangle, with constant height across the interval and 0 height elsewhere.
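
The equal-width property can be checked by simulation. In the sketch below, the interval endpoints, the seed and the sample size are assumptions for illustration; the mean $(a+b)/2$ and standard deviation $(b-a)/\sqrt{12}$ are standard results for the uniform distribution rather than something stated in the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (assumption)

a, b = 1, 50
samples = rng.uniform(a, b, 100_000)  # low, high, size

# Two intervals of equal width (10) should be hit equally often.
p_10_20 = float(np.mean((samples >= 10) & (samples < 20)))
p_30_40 = float(np.mean((samples >= 30) & (samples < 40)))
expected_p = 10 / (b - a)  # width times the constant height 1 / (b - a)

# Standard results for the uniform distribution.
expected_mean = (a + b) / 2
expected_std = (b - a) / np.sqrt(12)

print(p_10_20, p_30_40, expected_p)
print(samples.mean(), expected_mean, samples.std(), expected_std)
```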

In [ ]:
uniform = np.random.uniform(1, 50, 10000) # low, high, size

plt.figure(figsize=(12,6))

sns.histplot(uniform, kde=True, stat='density') # replaces the deprecated distplot

# Mean line
plt.axvline(uniform.mean(), color='#e74c3c', linestyle='dashed', linewidth=2)

# Standard deviation lines
plt.axvline(uniform.mean() + uniform.std(), color="#2c3e50", linestyle='dotted', linewidth=2)
plt.axvline(uniform.mean() - uniform.std(), color="#2c3e50", linestyle='dotted', linewidth=2)

plt.show()


Gamma distribution

The gamma distribution is a two-parameter family of continuous probability distributions.

Although it is rarely used in its raw form, other widely used distributions such as the exponential, chi-squared and Erlang distributions are special cases of the gamma distribution.
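
Two of these special cases can be illustrated by sampling. The sketch below relies on the standard identities that a gamma distribution with shape 1 is an exponential distribution, and that a gamma distribution with shape $k/2$ and scale 2 is a chi-squared distribution with $k$ degrees of freedom. The parameter values, the seed and the sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded for reproducibility (assumption)
n = 200_000

# Gamma with shape=1 behaves like an exponential with the same scale.
gamma_as_exp = rng.gamma(1.0, 0.5, n)  # shape, scale, size
expo = rng.exponential(0.5, n)         # scale, size

# Gamma with shape=k/2 and scale=2 behaves like a chi-squared with k degrees of freedom.
k = 4
gamma_as_chi2 = rng.gamma(k / 2, 2.0, n)
chi2 = rng.chisquare(k, n)             # df, size

print(gamma_as_exp.mean(), expo.mean())   # both close to 0.5
print(gamma_as_chi2.mean(), chi2.mean())  # both close to k = 4
print(gamma_as_chi2.var(), chi2.var())    # both close to 2k = 8
```

Matching means and variances do not prove that the distributions are identical, but they are a cheap first check.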

In [ ]:
gamma = np.random.gamma(2, 200, 10000) # shape, scale, size

plt.figure(figsize=(12,6))

sns.histplot(gamma, kde=True, stat='density') # replaces the deprecated distplot

# Mean line
plt.axvline(gamma.mean(), color='#e74c3c', linestyle='dashed', linewidth=2)

# Standard deviation lines
plt.axvline(gamma.mean() + gamma.std(), color="#2c3e50", linestyle='dotted', linewidth=2)
plt.axvline(gamma.mean() - gamma.std(), color="#2c3e50", linestyle='dotted', linewidth=2)

plt.show()


Beta distribution

The beta distribution is a family of continuous probability distributions defined on the interval [0, 1] parametrized by two positive shape parameters, denoted by $\alpha$ and $\beta$, that appear as exponents of the random variable and control the shape of the distribution. It is a special case of the Dirichlet distribution.
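
To see how the two shape parameters control the distribution, the sketch below compares sample statistics with the standard closed-form mean $\alpha / (\alpha + \beta)$ and variance $\alpha\beta / ((\alpha+\beta)^2 (\alpha+\beta+1))$. These formulas are well-known results, not taken from the original notebook; the seed and the sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)  # seeded for reproducibility (assumption)

alpha, beta_param = 10, 200
samples = rng.beta(alpha, beta_param, 100_000)  # a, b, size

# Closed-form mean and variance of the beta distribution.
expected_mean = alpha / (alpha + beta_param)
expected_var = (alpha * beta_param) / (
    (alpha + beta_param) ** 2 * (alpha + beta_param + 1)
)

print(samples.min(), samples.max())  # every sample lies inside [0, 1]
print(samples.mean(), expected_mean)
print(samples.var(), expected_var)
```

With $\alpha$ much smaller than $\beta$, the mass concentrates near 0, which matches the histogram produced by the cell that follows.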

In [ ]:
beta = np.random.beta(10, 200, 10000) # a, b, size

plt.figure(figsize=(12,6))

sns.histplot(beta, kde=True, stat='density') # replaces the deprecated distplot

# Mean line
plt.axvline(beta.mean(), color='#e74c3c', linestyle='dashed', linewidth=2)

# Standard deviation lines
plt.axvline(beta.mean() + beta.std(), color="#2c3e50", linestyle='dotted', linewidth=2)
plt.axvline(beta.mean() - beta.std(), color="#2c3e50", linestyle='dotted', linewidth=2)

plt.show()


F-distribution

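The F-distribution, also known as Snedecor's F distribution or the Fisher–Snedecor distribution, is a continuous probability distribution parametrized by two degrees-of-freedom values, one for the numerator and one for the denominator.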
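
As a rough illustration, an F-distributed variable can be built as the ratio of two independent chi-squared variables, each divided by its degrees of freedom. The sketch below compares that construction with NumPy's own sampler. The construction and the mean $d_2 / (d_2 - 2)$ are standard results rather than something stated in the original notebook; the seed and the sample size are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)  # seeded for reproducibility (assumption)
n = 200_000
d1, d2 = 2, 200  # numerator and denominator degrees of freedom

# Direct sampling from the F-distribution.
f_direct = rng.f(d1, d2, n)  # dfnum, dfden, size

# Construction from two independent chi-squared samples.
u1 = rng.chisquare(d1, n)
u2 = rng.chisquare(d2, n)
f_ratio = (u1 / d1) / (u2 / d2)

expected_mean = d2 / (d2 - 2)  # valid for d2 > 2

print(f_direct.mean(), f_ratio.mean(), expected_mean)
```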

In [ ]:
f = np.random.f(2, 200, 10000) # dfnum, dfden, size

plt.figure(figsize=(12,6))

sns.histplot(f, kde=True, stat='density') # replaces the deprecated distplot

# Mean line
plt.axvline(f.mean(), color='#e74c3c', linestyle='dashed', linewidth=2)

# Standard deviation lines
plt.axvline(f.mean() + f.std(), color="#2c3e50", linestyle='dotted', linewidth=2)
plt.axvline(f.mean() - f.std(), color="#2c3e50", linestyle='dotted', linewidth=2)

plt.show()


Exponential distribution

The exponential distribution describes the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate. It has a single parameter $\lambda$, called the rate parameter, and its probability density function is:

$$ f(x; \lambda) = \left\{\begin{matrix} \lambda e^{-\lambda x} & \text{for } x \geq 0, \\ 0 & \text{for } x < 0 \end{matrix}\right. $$
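
Note that NumPy's exponential sampler is parametrized by the scale, which is the inverse of the rate: scale $= 1/\lambda$. The sketch below makes the conversion explicit and checks the sample against the mean $1/\lambda$ and the tail probability $P(X > t) = e^{-\lambda t}$, which follows from integrating the density above. The rate value, the seed and the sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)  # seeded for reproducibility (assumption)

lam = 2.0          # rate parameter (events per unit of time)
scale = 1.0 / lam  # NumPy expects the scale, i.e. the mean waiting time

samples = rng.exponential(scale, 100_000)  # scale, size

sample_mean = float(samples.mean())  # expected to be close to 1 / lam = 0.5

# Probability of waiting longer than t = 1, empirical vs. exp(-lam * t).
t = 1.0
tail_empirical = float(np.mean(samples > t))
tail_theory = float(np.exp(-lam * t))

print(sample_mean, tail_empirical, tail_theory)
```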
In [ ]:
exponential = np.random.exponential(0.5, 10000) # scale (= 1/lambda), size

plt.figure(figsize=(12,6))

sns.histplot(exponential, kde=True, stat='density') # replaces the deprecated distplot

# Mean line
plt.axvline(exponential.mean(), color='#e74c3c', linestyle='dashed', linewidth=2)

# Standard deviation lines
plt.axvline(exponential.mean() + exponential.std(), color="#2c3e50", linestyle='dotted', linewidth=2)
plt.axvline(exponential.mean() - exponential.std(), color="#2c3e50", linestyle='dotted', linewidth=2)

plt.show()


Binomial distribution

The binomial distribution describes the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes (such as success or failure, gain or loss, win or lose) and the probability of success is the same for every trial.

The parameters of a binomial distribution are $n$ and $p$, where $n$ is the total number of trials and $p$ is the probability of success in each trial. Its probability mass function, the probability of observing exactly $k$ successes, is given by:

$$ f(k; n, p) = \Pr(X=k) = \binom{n}{k} p^k (1-p)^{n-k} $$

where:

$$ \binom{n}{k} = \frac{n!}{k!(n-k)!} $$
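
The formula can be evaluated directly with the standard library. This is a minimal sketch: the helper name `binomial_pmf` and the small example values ($n = 10$, $p = 0.5$) are assumptions chosen so the numbers are easy to verify by hand.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5

# Probability of exactly 5 successes: C(10, 5) / 2**10 = 252 / 1024
p_five = binomial_pmf(5, n, p)

# The probabilities over all possible k add up to 1.
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
total = sum(pmf)

# The expected number of successes equals n * p.
mean = sum(k * pk for k, pk in enumerate(pmf))

print(p_five, total, mean)
```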
In [ ]:
binomial = np.random.binomial(1000, 0.5, 10000) # n, p, size

plt.figure(figsize=(12,6))

sns.histplot(binomial, kde=True, stat='density') # replaces the deprecated distplot

# Mean line
plt.axvline(binomial.mean(), color='#e74c3c', linestyle='dashed', linewidth=2)

# Standard deviation lines
plt.axvline(binomial.mean() + binomial.std(), color="#2c3e50", linestyle='dotted', linewidth=2)
plt.axvline(binomial.mean() - binomial.std(), color="#2c3e50", linestyle='dotted', linewidth=2)

plt.show()


Poisson distribution

A Poisson random variable is typically used to model the number of times an event happens in a fixed time interval.

The Poisson distribution is described in terms of the rate at which the events happen. An event can occur 0, 1, 2, … times in an interval. The average number of events in an interval is designated $\lambda$ (lambda), which is the event rate, also called the rate parameter. The probability of observing $k$ events in an interval is given by the equation:

$$ P(k\ events\ in\ interval)=e^{-\lambda} \frac{\lambda^k}{k!} $$
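
The sketch below evaluates this formula for a small rate and then checks a simulated sample against the standard property that both the mean and the variance of a Poisson distribution equal $\lambda$. The helper name `poisson_pmf`, the rate values, the seed and the sample size are assumptions for illustration.

```python
from math import exp, factorial

import numpy as np

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the average is lam."""
    return exp(-lam) * lam**k / factorial(k)

lam = 3.0
p_zero = poisson_pmf(0, lam)  # exp(-3)

# Summing over enough values of k gets arbitrarily close to 1.
pmf = [poisson_pmf(k, lam) for k in range(60)]
total = sum(pmf)
pmf_mean = sum(k * pk for k, pk in enumerate(pmf))  # close to lam

# Simulation: sample mean and variance should both be close to the rate.
rng = np.random.default_rng(5)  # seeded for reproducibility (assumption)
sample = rng.poisson(100, 100_000)  # lam, size

print(p_zero, total, pmf_mean)
print(sample.mean(), sample.var())
```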
In [ ]:
poisson = np.random.poisson(100, 10000) # lam, size

plt.figure(figsize=(12,6))

sns.histplot(poisson, kde=True, stat='density') # replaces the deprecated distplot

# Mean line
plt.axvline(poisson.mean(), color='#e74c3c', linestyle='dashed', linewidth=2)

# Standard deviation lines
plt.axvline(poisson.mean() + poisson.std(), color="#2c3e50", linestyle='dotted', linewidth=2)
plt.axvline(poisson.mean() - poisson.std(), color="#2c3e50", linestyle='dotted', linewidth=2)

plt.show()

