 Dealing With Missing Data in Pandas

Last updated: June 11th, 2019

Exercises

In [ ]:
import numpy as np
import pandas as pd
import missingno as msno

We are going to use a dataset of 5,000 movies scraped from IMDB. It contains information on the actors, directors, budget, and gross, as well as the IMDB rating and release year.

Exercise 1

Read the movies dataset from data/movie.csv into the df variable.

In [ ]:
# your code goes here

In [ ]:
df = pd.read_csv('data/movies.csv')

Exercise 2

Calculate the percentage of missing values per column.

In [ ]:
# your code goes here

In [ ]:
df.isna().mean() * 100
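
Why this works: isna() returns a boolean mask, and the mean of a boolean column is the fraction of True values. The sketch below shows this on a small made-up DataFrame; the name toy and its columns are hypothetical and are not part of the movies dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'budget': [1.0, np.nan, 3.0, np.nan],
    'year': [2001, 2002, np.nan, 2004],
})

# True marks a missing cell; averaging the mask per column gives
# the fraction of missing values, and * 100 turns it into a percentage.
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```

Here budget has 2 of 4 values missing and year has 1 of 4, so they come out as 50% and 25% respectively.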

In [ ]:
msno.bar(df)


Now we need to deal with the missing values we found.

Exercise 3

At this point, the df DataFrame has 28 columns.

Drop all the rows that don't have at least 26 non-null values.

In [ ]:
df.shape

In [ ]:
# your code goes here


Remember that the dropna() function has a thresh parameter: the minimum number of non-null values a row must have in order to be kept.
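
As a small illustration of thresh, here is a sketch on a made-up three-column DataFrame. The name toy and its data are hypothetical, and the threshold of 2 is chosen only to fit this tiny example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows have 3, 2, and 1 non-null values.
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, 5.0, np.nan],
    'c': [3.0, 6.0, 9.0],
})

# Keep only the rows that have at least 2 non-null values.
kept = toy.dropna(thresh=2)
print(kept)
```

The last row has a single non-null value, so it is the only one dropped. Note that thresh counts the values that are present, not the values that are missing.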

In [ ]:
df.dropna(thresh=26, inplace=True)

df.shape

Exercise 4

Drop all the columns whose values are all NaN.

In [ ]:
# your code goes here


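Remember that the dropna() function has a how parameter: 'any' drops a row or column that contains at least one missing value, while 'all' drops it only when every value is missing.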
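
To make the difference between the two how settings concrete, here is a sketch on a made-up DataFrame. The names toy, all_dropped, and any_dropped are hypothetical and only serve this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'a' is complete, 'b' is partly missing,
# 'c' is entirely missing.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [np.nan, 5.0, 6.0],
    'c': [np.nan, np.nan, np.nan],
})

# how='all' removes only the columns in which every value is NaN.
all_dropped = toy.dropna(axis='columns', how='all')

# how='any' removes every column that has at least one NaN.
any_dropped = toy.dropna(axis='columns', how='any')

print(list(all_dropped.columns))
print(list(any_dropped.columns))
```

With how='all' only column c is removed, whereas how='any' removes both b and c.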

In [ ]:
df.dropna(axis='columns', how='all', inplace=True)

df.shape

Exercise 5

Drop all rows that contain missing values in the language column.

In [ ]:
# your code goes here

In [ ]:
# Alternative using a boolean mask: df = df.loc[~df['language'].isna()]
df.dropna(subset=['language'], inplace=True)

print(df['language'].isna().sum())

print(df.shape)

Exercise 6

Drop the color, gross, plot_keywords and aspect_ratio columns from the df DataFrame.

In [ ]:
# your code goes here

In [ ]:
cols_to_drop = ['color', 'gross', 'plot_keywords', 'aspect_ratio']

df.drop(columns=cols_to_drop, inplace=True)

df.shape

Exercise 7

Replace (fill) all the missing values of the column director_name with the string value 'Anonymous'.

In [ ]:
# your code goes here

In [ ]:
df['director_name'] = df['director_name'].fillna('Anonymous')

df['director_name'].isnull().sum()

Exercise 8

Fill missing values of the column country with an empty string value ('').

In [ ]:
# your code goes here

In [ ]:
df['country'] = df['country'].fillna('')

df['country'].isnull().sum()

Exercise 9

Fill missing values of the duration column with the mean of all the durations.
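
One detail worth knowing here: Series.mean() skips NaN by default, so the fill value is computed from the observed values only. The sketch below shows this on a made-up Series; the names durations, fill_value, and filled are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three observed durations and two gaps.
durations = pd.Series([90.0, np.nan, 110.0, np.nan, 100.0], name='duration')

# mean() ignores NaN, so this averages only 90, 110 and 100.
fill_value = durations.mean()

# Replace each missing entry with that mean.
filled = durations.fillna(fill_value)
print(filled)
```

Filling with the mean keeps the column's average unchanged, although it does reduce its variance, which is worth keeping in mind for later analysis.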

In [ ]:
# your code goes here

In [ ]:
df['duration'] = df['duration'].fillna(df['duration'].mean())

df['duration'].isnull().sum()

Exercise 10

Check again how many missing values we have per column.

In [ ]:
# your code goes here

In [ ]:
df.isna().mean() * 100

In [ ]:
msno.bar(df) 