
Dealing With Missing Data in Pandas

Last updated: June 11th, 2019

rmotr


Exercises


In [ ]:
import numpy as np
import pandas as pd
import missingno as msno


We are going to use a dataset of 5,000 movies scraped from IMDB. It contains information on the actors, directors, budget, and gross, as well as the IMDB rating and release year.


Exercise 1

Read the movies dataset from data/movies.csv into the df variable.

In [ ]:
# your code goes here
In [ ]:
df = pd.read_csv('data/movies.csv')

df.head(15)


Exercise 2

Calculate the percentage of missing values per column.

In [ ]:
# your code goes here
In [ ]:
df.isna().mean() * 100
In [ ]:
msno.bar(df)
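
As a quick illustration of why `isna().mean()` yields the share of missing values, here is a minimal sketch on a tiny, made-up DataFrame (the `toy` frame and its values are hypothetical and not part of the movies dataset). The mean of a boolean column is the fraction of `True` entries, so multiplying by 100 gives a percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the calculation
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'budget': [1.0, np.nan, 3.0, np.nan],
    'language': ['English', None, 'French', 'English'],
})

# isna() returns a boolean frame; the mean of booleans is the fraction of True values
missing_pct = toy.isna().mean() * 100

# The same figures computed explicitly from the missing counts
missing_pct_from_counts = toy.isna().sum() / len(toy) * 100

print(missing_pct)
```

Both expressions should agree; the first is simply the shorter way to write it.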

We now need to start dealing with those missing values we found.


Exercise 3

At this point, the df DataFrame has 28 columns.

Drop all the rows that don't have at least 26 non-null values.

In [ ]:
df.shape
In [ ]:
# your code goes here

Remember that the dropna() method has a thresh parameter, which sets the minimum number of non-null values a row needs in order to be kept.

In [ ]:
df.dropna(thresh=26, inplace=True)

df.shape
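
The `thresh` parameter is easy to misread: it counts non-null values, not missing ones. The sketch below uses a small, hypothetical `toy` DataFrame to show this, alongside what should be an equivalent boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 has 3 non-null values, row 1 has 2, row 2 has 1
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, 5.0, np.nan],
    'c': [3.0, 6.0, 9.0],
})

# Keep only rows with at least 2 non-null values
kept = toy.dropna(thresh=2)

# Equivalent filter written as an explicit mask
kept_mask = toy[toy.notna().sum(axis=1) >= 2]

print(kept)
```

Here only the last row is dropped, because it has a single non-null value.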


Exercise 4

Drop all the columns whose values are all NaN.

In [ ]:
# your code goes here

Remember that the dropna() method has a how parameter, which accepts 'any' or 'all'.

In [ ]:
df.dropna(axis='columns', how='all', inplace=True)

df.shape
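
To make the difference between `how='all'` and `how='any'` concrete, here is a minimal sketch on a hypothetical `toy` DataFrame (the column names are invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one complete, one partially missing, and one empty column
toy = pd.DataFrame({
    'full': [1.0, 2.0, 3.0],
    'partial': [1.0, np.nan, 3.0],
    'empty': [np.nan, np.nan, np.nan],
})

# how='all' drops a column only when every value in it is missing
only_all_nan_removed = toy.dropna(axis='columns', how='all')

# how='any' drops a column as soon as it contains a single missing value
any_nan_removed = toy.dropna(axis='columns', how='any')

print(list(only_all_nan_removed.columns))
print(list(any_nan_removed.columns))
```

With `how='all'` the partially filled column survives, which is usually what you want when a column still carries some information.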


Exercise 5

Drop all rows that contain missing values in the language column.

In [ ]:
# your code goes here
In [ ]:
# Equivalent alternative: df = df.loc[df['language'].notna()]
df.dropna(subset=['language'], inplace=True)

print(df['language'].isna().sum())

print(df.shape)
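
The `subset` argument restricts the missing-value check to the listed columns, so missing values elsewhere do not cause a row to be dropped. A minimal sketch, assuming a hypothetical `toy` DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 lacks a language, row 0 lacks a budget
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'language': ['English', np.nan, 'French'],
    'budget': [np.nan, 2.0, 3.0],
})

# Only the 'language' column is inspected when deciding which rows to drop
by_subset = toy.dropna(subset=['language'])

# The same filter expressed as a boolean mask
by_mask = toy.loc[toy['language'].notna()]

print(by_subset)
```

Row 0 is kept even though its budget is missing, because `budget` is not in `subset`.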


Exercise 6

Drop the color, gross, plot_keywords and aspect_ratio columns from the df DataFrame.

In [ ]:
# your code goes here
In [ ]:
cols_to_drop = ['color', 'gross', 'plot_keywords', 'aspect_ratio']

df.drop(columns=cols_to_drop, inplace=True)

df.shape
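
One practical note when working in a notebook: re-running a cell that drops columns in place raises a `KeyError` the second time, because the columns are already gone. The sketch below, on a hypothetical `toy` DataFrame, shows how `errors='ignore'` makes the call safe to repeat; whether to use it is a judgment call, since it also hides typos in column names.

```python
import pandas as pd

# Hypothetical toy data reusing two of the column names from the exercise
toy = pd.DataFrame({'color': ['Color'], 'gross': [1.0], 'title': ['A']})

# First drop: both columns exist, so this works as usual
trimmed = toy.drop(columns=['color', 'gross'])

# Second drop: the columns are already gone; errors='ignore' skips missing labels
trimmed_again = trimmed.drop(columns=['color', 'gross'], errors='ignore')

print(list(trimmed_again.columns))
```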


Exercise 7

Replace (fill) all the missing values of the column director_name with the string value 'Anonymous'.

In [ ]:
# your code goes here
In [ ]:
df['director_name'] = df['director_name'].fillna('Anonymous')

df['director_name'].isnull().sum()


Exercise 8

Fill missing values of the column country with an empty string value ('').

In [ ]:
# your code goes here
In [ ]:
df['country'] = df['country'].fillna('')

df['country'].isnull().sum()
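
The two previous fills can also be written as a single call, since `DataFrame.fillna` accepts a dictionary that maps column names to fill values. This is a sketch on a hypothetical `toy` DataFrame, not a replacement for the solutions above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data reusing column names from the exercises
toy = pd.DataFrame({
    'director_name': ['Jane Doe', np.nan],
    'country': [np.nan, 'USA'],
    'duration': [90.0, np.nan],
})

# Each key is a column name and each value is the fill for that column;
# columns not listed in the dictionary (here 'duration') are left untouched
filled = toy.fillna({'director_name': 'Anonymous', 'country': ''})

print(filled)
```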


Exercise 9

Fill missing values of the duration column with the mean of all the durations.

In [ ]:
# your code goes here
In [ ]:
df['duration'] = df['duration'].fillna(df['duration'].mean())

df['duration'].isnull().sum()
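
It may help to see what mean imputation does on a small example. In the sketch below (the `durations` Series is hypothetical), `mean()` skips missing values by default, and filling with that mean leaves the overall mean unchanged. The median is shown as a common alternative when the data contains outliers; that is a suggestion, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes, with one missing value
durations = pd.Series([90.0, np.nan, 110.0, 100.0])

# mean() ignores NaN by default (skipna=True)
mean_value = durations.mean()

# Replace the missing entry with the mean of the observed values
filled = durations.fillna(mean_value)

# Alternative fill value that is less sensitive to outliers
median_value = durations.median()

print(filled)
```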


Exercise 10

Check again how many missing values we have per column.

In [ ]:
# your code goes here
In [ ]:
df.isna().mean() * 100
In [ ]:
msno.bar(df)
