
2.2 - Handling Missing Data With Pandas

Last updated: February 16th, 2019

rmotr


Handling Missing Data with Pandas - Exercises

In [ ]:
import numpy as np
import pandas as pd


We are going to use a dataset of 5,000 movies scraped from IMDB. It contains information on the actors, directors, budget, and gross, as well as the IMDB rating and release year.

Exercise 1

Read the movies dataset from data/movie_metadata.csv into the df variable.

In [ ]:
# your code goes here
In [ ]:
df = pd.read_csv('data/movie_metadata.csv')

df.head(15)


Exercise 2

Check how many missing values each column has.

In [ ]:
# your code goes here

First get a boolean value for each element indicating whether it is missing, then sum those values per column.

In [ ]:
df.isnull().sum().to_frame()
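
To see why the two-step approach above works, here is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not part of the movie dataset). It shows the boolean mask first and the per-column count second.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the movie dataset.
toy = pd.DataFrame({
    'title': ['A', 'B', None],
    'budget': [1.0, np.nan, np.nan],
})

# isnull() returns a same-shaped frame of booleans (True where a value is missing).
mask = toy.isnull()

# Summing counts True as 1 and False as 0, giving one count per column.
missing_per_column = mask.sum()
print(missing_per_column)
```

The same pattern applies unchanged to `df`, since `isnull()` treats both `None` and `NaN` as missing.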


We now need to deal with those missing values.

Exercise 3

Replace (fill) all the missing values of the column director_name with the string value 'Anonymous'.

In [ ]:
# your code goes here

Use the fillna() method.

In [ ]:
df['director_name'] = df['director_name'].fillna('Anonymous')

df['director_name'].isnull().sum()
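
As a small illustration of the assignment pattern above, the sketch below fills missing names in a made-up column (the `toy` frame and its names are hypothetical). Note that `fillna()` returns a new Series by default, which is why the result is assigned back to the column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing director names.
toy = pd.DataFrame({'director_name': ['Jane Doe', np.nan, 'John Roe', np.nan]})

# fillna() does not modify the column in place; assign the result back.
toy['director_name'] = toy['director_name'].fillna('Anonymous')

# Count how many rows received the placeholder value.
anonymous_count = (toy['director_name'] == 'Anonymous').sum()
print(anonymous_count)
```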


Exercise 4

Fill missing values of the column country with an empty string value ('').

In [ ]:
# your code goes here
In [ ]:
df['country'] = df['country'].fillna('')

df['country'].isnull().sum()


Exercise 5

Fill missing values of the duration with the mean of all the durations.

In [ ]:
# your code goes here
In [ ]:
df['duration'] = df['duration'].fillna(df['duration'].mean())

df['duration'].isnull().sum()
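
The mean-fill above relies on `mean()` skipping missing values by default. A minimal sketch with made-up durations (the `toy` frame is an assumption for illustration) makes that visible:

```python
import numpy as np
import pandas as pd

# Hypothetical toy durations; two of the four values are missing.
toy = pd.DataFrame({'duration': [90.0, np.nan, 110.0, np.nan]})

# mean() ignores NaN by default, so only 90 and 110 are averaged.
mean_duration = toy['duration'].mean()

# Every missing entry is replaced by that single mean value.
filled = toy['duration'].fillna(mean_duration)
print(filled.tolist())
```

Filling with the mean keeps the column average unchanged but reduces its variance, which may or may not be acceptable depending on the analysis.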


Exercise 6

Get the percentage/proportion of missing values per column.

In [ ]:
# your code goes here
In [ ]:
(df.isnull().sum() / df.shape[0]).to_frame()
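
The solution above yields a proportion between 0 and 1. As a sketch on a made-up frame (the `toy` data is hypothetical), the same number can also be obtained by taking the mean of the boolean mask, and multiplying by 100 gives a percentage:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with four rows.
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [1.0, 2.0, 3.0, np.nan],
})

# Count of missing values divided by the number of rows.
by_division = toy.isnull().sum() / toy.shape[0]

# The mean of a boolean mask is the fraction of True values.
by_mean = toy.isnull().mean()

# Scale to a percentage if that reads better.
percentage = by_mean * 100
print(percentage)
```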


Exercise 7

Drop the columns color, gross, plot_keywords and aspect_ratio from the DataFrame df.

In [ ]:
# your code goes here
In [ ]:
df.drop(columns=['color', 'gross', 'plot_keywords', 'aspect_ratio'], inplace=True)

df.head()
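
The solution above modifies `df` in place. As an alternative sketch on a made-up frame (the `toy` data is hypothetical), `drop()` without `inplace=True` returns a new DataFrame and leaves the original untouched, which can be handy while experimenting:

```python
import pandas as pd

# Hypothetical toy data with three columns.
toy = pd.DataFrame({'color': ['Color'], 'gross': [1.0], 'title': ['A']})

# Without inplace=True, drop() returns a copy without the listed columns.
trimmed = toy.drop(columns=['color', 'gross'])

print(list(trimmed.columns))
print(list(toy.columns))
```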


Exercise 8

Drop every column whose values are all NaN.

In [ ]:
# your code goes here

Remember that the dropna() method has a how parameter.

In [ ]:
df.dropna(axis=1, how='all', inplace=True)
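
To make the effect of `how` concrete, here is a minimal sketch on a made-up frame (the `toy` data and its column names are assumptions for illustration). It contrasts `how='all'`, used above, with the default `how='any'`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one complete column, one partly missing, one entirely missing.
toy = pd.DataFrame({
    'full': [1.0, 2.0, 3.0],
    'partial': [1.0, np.nan, 3.0],
    'empty': [np.nan, np.nan, np.nan],
})

# how='all' drops a column only when every value in it is missing.
dropped_all = toy.dropna(axis=1, how='all')

# how='any' (the default) drops a column as soon as one value is missing.
dropped_any = toy.dropna(axis=1, how='any')

print(list(dropped_all.columns))
print(list(dropped_any.columns))
```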


Exercise 9

Your df DataFrame should have 24 columns by now. Drop all the rows that don't have at least 22 non-null values.

In [ ]:
df.columns.size
In [ ]:
# your code goes here

Remember that the dropna() method has a thresh parameter.

In [ ]:
df.dropna(thresh=22, inplace=True)
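
The `thresh` argument is the minimum number of non-null values a row needs in order to be kept. A small sketch on a made-up three-column frame (the `toy` data is hypothetical) shows the same idea with `thresh=2`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows have 3, 2 and 1 non-null values respectively.
toy = pd.DataFrame({
    'a': [1.0, 1.0, np.nan],
    'b': [2.0, np.nan, np.nan],
    'c': [3.0, 3.0, 3.0],
})

# Keep only the rows with at least 2 non-null values.
kept = toy.dropna(thresh=2)
print(kept.index.tolist())
```

With 24 columns and `thresh=22`, a row survives if it has at most two missing values.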


Exercise 10

Check again how many missing values we have per column.

In [ ]:
df.isnull().sum().to_frame()
In [ ]:
df.head(15)
