Random Processes

Last updated: July 30th, 2020

rmotr



We simulated four variables using different random processes. The first variable (VAR1) is categorical with three classes and has no missing values. If a variable is MAR, its missingness can depend only on that categorical column (VAR1). Hint: among the remaining variables there is one MCAR, one MAR, and one MNAR variable.
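To make the three mechanisms concrete, here is a minimal sketch of how missing values of each kind could be generated. It is not the procedure used to build `dataframe_r.csv`; the column names (`group`, `x`), the probabilities, and the threshold 205.0 are all hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 1200

# Hypothetical data: one complete categorical column and one numeric column.
demo = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=n),
    "x": rng.normal(200.0, 10.0, size=n),
})

# MCAR: every row has the same 15% chance of being missing.
mcar_mask = rng.random(n) < 0.15

# MAR: the chance of being missing depends only on the observed group.
p_mar = demo["group"].map({"A": 0.05, "B": 0.20, "C": 0.0}).to_numpy()
mar_mask = rng.random(n) < p_mar

# MNAR: the chance of being missing depends on the value that goes missing.
p_mnar = np.where(demo["x"] > 205.0, 0.5, 0.0)
mnar_mask = rng.random(n) < p_mnar

# Series.mask replaces the entries where the condition is True with NaN.
demo["x_mcar"] = demo["x"].mask(mcar_mask)
demo["x_mar"] = demo["x"].mask(mar_mask)
demo["x_mnar"] = demo["x"].mask(mnar_mask)
```

In this sketch, `x_mar` never goes missing in group C, and `x_mnar` only goes missing where `x` itself is large, which is the kind of pattern the checks in this notebook look for.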

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [5]:
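# Load the simulated dataset (contains missing values)
data_nan = pd.read_csv('dataframe_r.csv')
In [7]:
# Pairwise scatter plots of the numeric variables, colored by VAR1 class
sns.pairplot(data_nan, hue='VAR1')
plt.show()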

MAR with respect to VAR1

In [10]:
# Count missing VAR2 values within each VAR1 class
data_nan.VAR2.isnull().groupby([data_nan['VAR1']]).sum().astype(int).reset_index(name='count')
Out[10]:
  VAR1  count
0    A     53
1    B     67
2    C     62
In [11]:
# Count missing VAR3 values within each VAR1 class
data_nan.VAR3.isnull().groupby([data_nan['VAR1']]).sum().astype(int).reset_index(name='count')
Out[11]:
  VAR1  count
0    A     18
1    B     67
2    C      0
In [12]:
# Count missing VAR4 values within each VAR1 class
data_nan.VAR4.isnull().groupby([data_nan['VAR1']]).sum().astype(int).reset_index(name='count')
Out[12]:
  VAR1  count
0    A     63
1    B     64
2    C     71
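The counts above are only comparable across classes if the classes have similar sizes. A possible refinement, sketched here under the assumption that a per-class rate is wanted, is to report the count together with the class size and the share of missing values. The helper name `missing_rate_by_group` and the toy frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_rate_by_group(df, column, by):
    """Count and share of missing values of `column` within each level of `by`."""
    is_missing = df[column].isnull().astype(int)
    grouped = is_missing.groupby(df[by])
    out = pd.DataFrame({
        "count": grouped.sum(),   # number of missing values per level
        "n": grouped.size(),      # number of rows per level
        "rate": grouped.mean(),   # share of missing values per level
    })
    return out.reset_index()

# Toy data (hypothetical) just to show the shape of the result.
toy = pd.DataFrame({
    "VAR1": ["A", "A", "B", "B", "B", "C"],
    "VARX": [1.0, np.nan, np.nan, np.nan, 3.0, 4.0],
})
rates = missing_rate_by_group(toy, "VARX", "VAR1")
```

On the notebook's data the same helper could be called as `missing_rate_by_group(data_nan, 'VAR3', 'VAR1')`. A rate that differs strongly between classes is consistent with MAR with respect to VAR1, while roughly equal rates are consistent with no dependence on VAR1.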

MNAR

In [16]:
# Split VAR4 into 15 equal-width intervals; rows with missing VAR4 get NaN
data_nan['VAR4-intervalos'] = pd.cut(data_nan.VAR4, bins=15)
data_nan
Out[16]:
      VAR1         VAR2         VAR3        VAR4       V4-intervalos     VAR4-intervalos
0        A  2298.522712  1044.101309  191.036303  (188.985, 193.015]  (188.985, 193.015]
1        A  2149.507106  1010.003930  180.372680  (176.893, 180.923]  (176.893, 180.923]
2        A  2589.640405  1024.468450  215.848205  (213.168, 217.199]  (213.168, 217.199]
3        A  2510.375014  1056.022330  206.479678  (205.107, 209.138]  (205.107, 209.138]
4        A  2279.743088  1046.688950  188.609918  (184.954, 188.985]  (184.954, 188.985]
...    ...          ...          ...         ...                 ...                 ...
1195     C  2452.422078  1025.939642  205.746089  (205.107, 209.138]  (205.107, 209.138]
1196     C  2358.216721  1000.469795  195.101718  (193.015, 197.046]  (193.015, 197.046]
1197     C          NaN   985.155564         NaN                 NaN                 NaN
1198     C  2336.303029   949.702992  194.030829  (193.015, 197.046]  (193.015, 197.046]
1199     C  2420.844020  1014.742590  197.770408  (197.046, 201.077]  (197.046, 201.077]

1200 rows × 6 columns

In [18]:
# Count missing VAR2 values within each VAR4 interval
vf_V4V5 = data_nan.VAR2.isnull().groupby([data_nan['VAR4-intervalos']]).sum().astype(int).reset_index(name='count')
vf_V4V5
Out[18]:
       VAR4-intervalos  count
0   (168.771, 172.862]      3
1   (172.862, 176.893]      2
2   (176.893, 180.923]      2
3   (180.923, 184.954]      7
4   (184.954, 188.985]      8
5   (188.985, 193.015]     21
6   (193.015, 197.046]     19
7   (197.046, 201.077]     27
8   (201.077, 205.107]     19
9   (205.107, 209.138]     17
10  (209.138, 213.168]     12
11  (213.168, 217.199]      3
12   (217.199, 221.23]      0
13    (221.23, 225.26]      0
14   (225.26, 229.291]      0
In [22]:
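# Split VAR2 into 15 equal-width bins
data_nan['VAR2-bins'] = pd.cut(data_nan.VAR2, bins=15)
In [23]:
# Count missing VAR4 values within each VAR2 bin
aux = data_nan.VAR4.isnull().groupby([data_nan['VAR2-bins']]).sum().astype(int).reset_index(name='count')
In [25]:
# Divide by the number of rows in each VAR2 bin to get the share of missing VAR4 values
aux['freq'] = aux['count']/data_nan.groupby([data_nan['VAR2-bins']]).count()['VAR2'].values
aux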
Out[25]:
               VAR2-bins  count      freq
0   (1995.575, 2046.031]      0  0.000000
1   (2046.031, 2095.741]      0  0.000000
2   (2095.741, 2145.452]      0  0.000000
3   (2145.452, 2195.162]      0  0.000000
4   (2195.162, 2244.873]      1  0.017241
5   (2244.873, 2294.583]      2  0.019802
6   (2294.583, 2344.294]      8  0.054795
7   (2344.294, 2394.004]     11  0.068750
8   (2394.004, 2443.715]     13  0.090909
9   (2443.715, 2493.425]     28  0.200000
10  (2493.425, 2543.136]     34  0.373626
11  (2543.136, 2592.846]     30  0.400000
12  (2592.846, 2642.557]     17  0.586207
13  (2642.557, 2692.267]      8  0.500000
14  (2692.267, 2741.977]      4  0.800000
In [26]:
# Plot the share of missing VAR4 values per VAR2 bin (bin index on the x-axis)
plt.scatter(np.arange(len(aux['freq'])), aux['freq'])
Out[26]:
<matplotlib.collections.PathCollection at 0x7ff0a5771220>
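The binning steps above (cut one variable into intervals, then compute the share of missing values of another variable per interval) can be wrapped in a small function. This is only a sketch: the name `binned_missing_rate`, its parameters, and the toy data are hypothetical. It also only shows an association with an observed variable, and reading a rising trend as a sign of MNAR is an interpretation that relies on the exercise's rule that MAR can depend only on VAR1.

```python
import numpy as np
import pandas as pd

def binned_missing_rate(df, target, driver, bins=15):
    """Share of missing `target` values within equal-width bins of `driver`."""
    observed = df[df[driver].notnull()]            # only rows where the driver can be binned
    intervals = pd.cut(observed[driver], bins=bins)
    is_missing = observed[target].isnull().astype(int)
    grouped = is_missing.groupby(intervals, observed=False)  # keep empty bins too
    out = pd.DataFrame({
        "count": grouped.sum(),   # missing target values per bin
        "n": grouped.size(),      # rows per bin
        "freq": grouped.mean(),   # share of missing target values per bin
    })
    return out.reset_index()

# Toy data (hypothetical): the target is missing whenever the driver is 80 or more.
driver = np.arange(100, dtype=float)
toy = pd.DataFrame({
    "driver": driver,
    "target": np.where(driver >= 80, np.nan, driver),
})
result = binned_missing_rate(toy, "target", "driver", bins=5)
```

With the notebook's data, a call such as `binned_missing_rate(data_nan, 'VAR4', 'VAR2')` should give a table comparable to `aux`, and `result['freq']` can be passed to `plt.scatter` in the same way.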
