# Dealing With Missing Data in Pandas

Last updated: June 11th, 2019

# Exercises

## Dealing with missing data in Pandas

In [ ]:
import numpy as np
import pandas as pd
import missingno as msno


We are going to use a dataset of 5,000 movies scraped from IMDb. It contains information on the actors, directors, budget, and gross, as well as the IMDb rating and release year.

### Exercise 1

Read the movies dataset from data/movies.csv into the df variable.

In [ ]:
# your code goes here

In [ ]:
df = pd.read_csv('data/movies.csv')



### Exercise 2

Calculate the percentage of missing values per column.

In [ ]:
# your code goes here

In [ ]:
df.isna().mean() * 100

In [ ]:
msno.bar(df)
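
The expression above works because `isna()` returns a boolean mask, and the mean of a boolean column is the fraction of `True` values. A minimal sketch on a small hypothetical DataFrame (the `toy` data is invented for illustration and is not the movies dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the movies dataset
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'budget': [1.0, np.nan, 3.0, np.nan],
    'rating': [7.5, 8.0, np.nan, 6.1],
})

# isna() marks each missing cell as True; mean() turns that into a fraction
missing_pct = toy.isna().mean() * 100

# title has 0 of 4 missing, budget 2 of 4, rating 1 of 4
print(missing_pct)
```

Multiplying by 100 only rescales the fraction to a percentage; `toy.isna().sum()` would give the raw counts instead.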


We now need to start dealing with those missing values we found.

### Exercise 3

At this point, the df DataFrame has 28 columns.

Drop all the rows that don't have at least 26 non-null values.

In [ ]:
df.shape

In [ ]:
# your code goes here


Remember that the dropna() method has a thresh parameter: it sets the minimum number of non-null values a row needs in order to be kept.

In [ ]:
df.dropna(thresh=26, inplace=True)

df.shape
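
To see how `thresh` decides which rows survive, here is a small sketch on a hypothetical three-row DataFrame (the `toy` and `kept` names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows have 3, 2 and 1 non-null values respectively
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [2.0, 5.0, np.nan],
    'c': [3.0, 6.0, 9.0],
})

# Keep only rows with at least 2 non-null values
kept = toy.dropna(thresh=2)

print(kept.index.tolist())  # rows 0 and 1 remain; row 2 is dropped
```

Note that `thresh` counts non-null values, so a higher threshold is a stricter filter.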


### Exercise 4

Drop all the columns whose values are all NaN.

In [ ]:
# your code goes here


Remember that the dropna() method has a how parameter: 'any' drops on a single missing value, while 'all' drops only when every value is missing.

In [ ]:
df.dropna(axis='columns', how='all', inplace=True)

df.shape
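
The difference between a partly empty column and a fully empty one can be sketched on a hypothetical DataFrame (column names here are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the 'empty' column is entirely NaN
toy = pd.DataFrame({
    'a': [1.0, np.nan],
    'empty': [np.nan, np.nan],
    'b': [np.nan, 4.0],
})

# how='all' removes a column only if every one of its values is missing
trimmed = toy.dropna(axis='columns', how='all')

print(trimmed.columns.tolist())  # 'a' and 'b' remain despite containing NaN
```

With `how='any'` instead, all three columns would be dropped, since each one contains at least one NaN.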


### Exercise 5

Drop all rows that contain missing values in the language column.

In [ ]:
# your code goes here

In [ ]:
# Equivalent alternative: df = df.loc[~df['language'].isna()]
df.dropna(subset=['language'], inplace=True)

print(df['language'].isna().sum())

print(df.shape)
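
The `subset` argument restricts the missing-value check to the listed columns, so NaN values elsewhere do not cause a row to be dropped. A minimal sketch on hypothetical data, also showing the equivalent boolean-mask form:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 lacks a budget, row 1 lacks a language
toy = pd.DataFrame({
    'language': ['English', None, 'French'],
    'budget': [np.nan, 10.0, 20.0],
})

# Only the 'language' column is inspected when deciding what to drop
cleaned = toy.dropna(subset=['language'])

# Same result using a boolean mask
masked = toy.loc[toy['language'].notna()]

print(cleaned.index.tolist())  # rows 0 and 2; row 0 keeps its missing budget
```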


### Exercise 6

Drop the color, gross, plot_keywords and aspect_ratio columns from the df DataFrame.

In [ ]:
# your code goes here

In [ ]:
cols_to_drop = ['color', 'gross', 'plot_keywords', 'aspect_ratio']

df.drop(columns=cols_to_drop, inplace=True)

df.shape
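
As a side note, `drop` can also return a new DataFrame instead of modifying in place, and `errors='ignore'` lets it skip labels that are not present. A small sketch on hypothetical data (the `not_there` label is deliberately made up):

```python
import pandas as pd

# Hypothetical toy data with a subset of the column names used above
toy = pd.DataFrame({'color': ['Color'], 'gross': [1.0], 'title': ['A']})

# Without inplace=True, drop returns a new DataFrame and leaves toy untouched
slim = toy.drop(columns=['color', 'gross'])

# errors='ignore' silently skips labels that do not exist
safe = toy.drop(columns=['color', 'gross', 'not_there'], errors='ignore')

print(slim.columns.tolist())  # only 'title' remains
```

The non-in-place form can be convenient when re-running a notebook cell, since running an in-place drop twice raises a KeyError for the already-removed columns.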


### Exercise 7

Replace (fill) all the missing values of the column director_name with the string value 'Anonymous'.

In [ ]:
# your code goes here

In [ ]:
df['director_name'] = df['director_name'].fillna('Anonymous')

df['director_name'].isnull().sum()


### Exercise 8

Fill missing values of the column country with an empty string value ('').

In [ ]:
# your code goes here

In [ ]:
df['country'] = df['country'].fillna('')

df['country'].isnull().sum()
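
Exercises 7 and 8 fill one column at a time. As an alternative sketch, `DataFrame.fillna` also accepts a dict mapping column names to fill values, which handles several columns in one call. The data below is hypothetical:

```python
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    'director_name': ['Jane Doe', None],
    'country': [None, 'USA'],
})

# Each key is a column name, each value the replacement for that column
filled = toy.fillna({'director_name': 'Anonymous', 'country': ''})

print(filled.isna().sum().sum())  # no missing values remain
```

Columns not listed in the dict are left unchanged.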


### Exercise 9

Fill missing values of the duration column with the mean of all the durations.

In [ ]:
# your code goes here

In [ ]:
df['duration'] = df['duration'].fillna(df['duration'].mean())

df['duration'].isnull().sum()
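
Mean imputation relies on `mean()` skipping NaN values by default, so the average is computed from the known durations only. A minimal sketch on a hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes, two of them missing
duration = pd.Series([90.0, np.nan, 120.0, np.nan])

# mean() ignores NaN by default: (90 + 120) / 2
mean_duration = duration.mean()

# Every missing entry receives the same value
filled = duration.fillna(mean_duration)

print(filled.tolist())
```

Filling with the mean leaves the column mean unchanged but reduces its variance, which is worth keeping in mind when many values are missing.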


### Exercise 10

Check again how many missing values we have per column.

In [ ]:
# your code goes here

In [ ]:
df.isna().mean() * 100

In [ ]:
msno.bar(df)