Profile picture

Co-founder @ RMOTR

Planes Telecentro

Last updated: March 19th, 20202020-03-19Project preview

rmotr


Analyzing the epidemiological outbreak of COVID‐19

A visual exploratory data analysis approach.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import plotly.express as px
import theme

%matplotlib inline

green-divider

Step 1: Reading Data

We will load COVID-19 data from the GitHub data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). Also, Supported by ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).

This data is daily-updated, so we can keep our project up-to-date just by loading this data again.

Let's load the data and quickly analyze it's columns and values:

In [4]:
COVID_CONFIRMED_URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv'

covid_confirmed = pd.read_csv(COVID_CONFIRMED_URL)

print(covid_confirmed.shape)

covid_confirmed.head()
(463, 59)
Out[4]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 2/1/20 2/2/20 2/3/20 2/4/20 2/5/20 2/6/20 2/7/20 2/8/20 2/9/20 2/10/20 2/11/20 2/12/20 2/13/20 2/14/20 2/15/20 2/16/20 2/17/20 2/18/20 2/19/20 2/20/20 2/21/20 2/22/20 2/23/20 2/24/20 2/25/20 2/26/20 2/27/20 2/28/20 2/29/20 3/1/20 3/2/20 3/3/20 3/4/20 3/5/20 3/6/20 3/7/20 3/8/20 3/9/20 3/10/20 3/11/20 3/12/20 3/13/20 3/14/20 3/15/20 3/16/20
0 NaN Thailand 15.000 101.000 2 3 5 7 8 8 14 14 14 19 19 19 19 25 25 25 25 32 32 32 33 33 33 33 33 34 35 35 35 35 35 35 35 35 37 40 40 41 42 42 43 43 43 47 48 50 50 50 53 59 70 75 82 114 147
1 NaN Japan 36.000 138.000 2 1 2 2 4 4 7 7 11 15 20 20 20 22 22 45 25 25 26 26 26 28 28 29 43 59 66 74 84 94 105 122 147 159 170 189 214 228 241 256 274 293 331 360 420 461 502 511 581 639 639 701 773 839 825
2 NaN Singapore 1.283 103.833 0 1 3 3 4 5 7 7 10 13 16 18 18 24 28 28 30 33 40 45 47 50 58 67 72 75 77 81 84 84 85 85 89 89 91 93 93 93 102 106 108 110 110 117 130 138 150 150 160 178 178 200 212 226 243
3 NaN Nepal 28.167 84.250 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
4 NaN Malaysia 2.500 112.500 0 0 0 3 4 4 4 7 8 8 8 8 8 10 12 12 12 16 16 18 18 18 19 19 22 22 22 22 22 22 22 22 22 22 22 22 23 23 25 29 29 36 50 50 83 93 99 117 129 149 149 197 238 428 566
In [5]:
COVID_DEATHS_URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv'

covid_deaths = pd.read_csv(COVID_DEATHS_URL)

print(covid_confirmed.shape)

covid_deaths.head()
(463, 59)
Out[5]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 2/1/20 2/2/20 2/3/20 2/4/20 2/5/20 2/6/20 2/7/20 2/8/20 2/9/20 2/10/20 2/11/20 2/12/20 2/13/20 2/14/20 2/15/20 2/16/20 2/17/20 2/18/20 2/19/20 2/20/20 2/21/20 2/22/20 2/23/20 2/24/20 2/25/20 2/26/20 2/27/20 2/28/20 2/29/20 3/1/20 3/2/20 3/3/20 3/4/20 3/5/20 3/6/20 3/7/20 3/8/20 3/9/20 3/10/20 3/11/20 3/12/20 3/13/20 3/14/20 3/15/20 3/16/20
0 NaN Thailand 15.000 101.000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 NaN Japan 36.000 138.000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 2 4 4 5 6 6 6 6 6 6 6 6 10 10 15 16 19 22 22 27
2 NaN Singapore 1.283 103.833 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 NaN Nepal 28.167 84.250 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 NaN Malaysia 2.500 112.500 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [6]:
COVID_RECOVERED_URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv'

covid_recovered = pd.read_csv(COVID_RECOVERED_URL)

print(covid_recovered.shape)

covid_recovered.head()
(463, 59)
Out[6]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 1/28/20 1/29/20 1/30/20 1/31/20 2/1/20 2/2/20 2/3/20 2/4/20 2/5/20 2/6/20 2/7/20 2/8/20 2/9/20 2/10/20 2/11/20 2/12/20 2/13/20 2/14/20 2/15/20 2/16/20 2/17/20 2/18/20 2/19/20 2/20/20 2/21/20 2/22/20 2/23/20 2/24/20 2/25/20 2/26/20 2/27/20 2/28/20 2/29/20 3/1/20 3/2/20 3/3/20 3/4/20 3/5/20 3/6/20 3/7/20 3/8/20 3/9/20 3/10/20 3/11/20 3/12/20 3/13/20 3/14/20 3/15/20 3/16/20
0 NaN Thailand 15.000 101.000 0 0 0 0 2 2 5 5 5 5 5 5 5 5 5 5 5 10 10 10 10 10 12 12 12 14 15 15 15 15 17 17 21 21 22 22 22 28 28 28 31 31 31 31 31 31 31 31 33 34 34 35 35 35 35
1 NaN Japan 36.000 138.000 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 9 9 9 9 12 12 12 13 18 18 22 22 22 22 22 22 22 22 32 32 32 43 43 43 46 76 76 76 101 118 118 118 118 118 144
2 NaN Singapore 1.283 103.833 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 9 15 15 17 18 18 24 29 34 34 37 37 51 51 53 62 62 62 72 72 78 78 78 78 78 78 78 78 78 96 96 97 105 105 109
3 NaN Nepal 28.167 84.250 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
4 NaN Malaysia 2.500 112.500 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 3 3 3 3 7 7 7 13 15 15 15 15 15 18 18 18 18 18 18 18 18 22 22 22 22 23 24 24 24 26 26 26 35 42 42

You can learn how to read other type of files using Pandas on our Reading Data with Pandas and Python course!

We are using DataFrames to store our data. A pandas DataFrame is two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

So far we have all our datasets loaded, let's analyze them!

green-divider

Step 2: Cleaning our data

Another important step before diving into data analysis is cleaning the data.

As the data is already really clean, we'll just replace Mainland china with just China, and fill some missing values.

You can learn more on data cleaning on our Data Cleaning with pandas course!

In [7]:
covid_confirmed['Country/Region'].replace('Mainland China', 'China', inplace=True)
covid_deaths['Country/Region'].replace('Mainland China', 'China', inplace=True)
covid_recovered['Country/Region'].replace('Mainland China', 'China', inplace=True)
In [8]:
covid_confirmed[['Province/State']] = covid_confirmed[['Province/State']].fillna('')
covid_confirmed.fillna(0, inplace=True)

covid_deaths[['Province/State']] = covid_deaths[['Province/State']].fillna('')
covid_deaths.fillna(0, inplace=True)

covid_recovered[['Province/State']] = covid_recovered[['Province/State']].fillna('')
covid_recovered.fillna(0, inplace=True)

Final checks:

In [9]:
covid_confirmed.isna().sum().sum()
Out[9]:
0
In [10]:
covid_deaths.isna().sum().sum()
Out[10]:
0
In [11]:
covid_recovered.isna().sum().sum()
Out[11]:
0

green-divider

Step 3 & 4: Analysis (worldwide impact) and Data Wrangling

With the data loaded, we will start by aggregating all the cases so we can quickly see what's going on in the world.

To do that we'll use the pandas Python library.

pandas is the most popular Python library for Data Science. You can learn data analysis fundamentals using pandas on our Intro to Pandas for Data Analysis course!

In [12]:
covid_confirmed_count = covid_confirmed.iloc[:, 4:].sum().max()

covid_confirmed_count
Out[12]:
181546
In [13]:
covid_deaths_count = covid_deaths.iloc[:, 4:].sum().max()

covid_deaths_count
Out[13]:
7126
In [14]:
covid_recovered_count = covid_recovered.iloc[:, 4:].sum().max()

covid_recovered_count
Out[14]:
78088

Store that values on a DataFrame, and calculate a new active cases value with the following formula:

$$ Active = Confirmed - Deaths - Recovered $$
In [15]:
world_df = pd.DataFrame({
    'confirmed': [covid_confirmed_count],
    'deaths': [covid_deaths_count],
    'recovered': [covid_recovered_count],
    'active': [covid_confirmed_count - covid_deaths_count - covid_recovered_count]
})

world_df
Out[15]:
confirmed deaths recovered active
0 181546 7126 78088 96332
In [16]:
world_long_df = world_df.melt(value_vars=['active', 'deaths', 'recovered'],
                              var_name="status",
                              value_name="count")

world_long_df['upper'] = 'confirmed'

world_long_df
Out[16]:
status count upper
0 active 96332 confirmed
1 deaths 7126 confirmed
2 recovered 78088 confirmed
In [17]:
fig = px.treemap(world_long_df, path=["upper", "status"], values="count",
                 color_discrete_sequence=['#3498db', '#2ecc71', '#e74c3c'],
                 template='plotly_dark')

fig.show()