Dealing With Duplicated Values in Pandas

Last updated: June 11th, 2019

rmotr


Exercises

In [ ]:
import numpy as np
import pandas as pd

We are going to use a dataset of 5,000 movies scraped from IMDB. It contains information on the actors, directors, budget, and gross, as well as the IMDB rating and release year.

Exercise 1

Read the movies dataset from data/movies.csv into the df variable.

In [ ]:
# your code goes here
In [ ]:
df = pd.read_csv('data/movies.csv')

df.head(15)

Exercise 2

Can you find any duplicated rows in the data? How many duplicated rows did you find?

In [ ]:
# your code goes here
In [ ]:
df.loc[df.duplicated()]
In [ ]:
df.duplicated().sum()
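
To see which rows `duplicated()` flags, here is a minimal sketch on a small made-up frame. The `toy` data and its column names are illustrative assumptions, not part of the movies dataset. By default only the second and later copies of a row are flagged; passing `keep=False` flags every row that belongs to a duplicate group.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the movies dataset)
toy = pd.DataFrame({
    'title': ['A', 'B', 'A', 'C', 'A'],
    'year':  [2001, 2002, 2001, 2003, 2001],
})

# Default keep='first': the first occurrence of each row is NOT flagged
later_copies = toy.duplicated()

# keep=False: every row that has at least one identical twin is flagged
all_copies = toy.duplicated(keep=False)

print(later_copies.sum())  # 2
print(all_copies.sum())    # 3
```

The default count is the number of rows that `drop_duplicates()` would remove, which is why `df.duplicated().sum()` answers the question in this exercise.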

Exercise 3

Remove all duplicated rows, keeping the first occurrence of each.

In [ ]:
# your code goes here
In [ ]:
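# Drop repeated rows in place, keeping the first occurrence of each
df.drop_duplicates(keep='first', inplace=True)

# Verify that no duplicated rows remain (should be 0)
df.duplicated().sum()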
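
As an alternative to `inplace=True`, `drop_duplicates()` can return a new frame and leave the original untouched. The sketch below uses a made-up `toy` frame (an assumption for illustration, not the movies data) and also shows that the surviving rows keep their original index labels unless you reset them.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'title': ['A', 'B', 'A', 'C'],
    'year':  [2001, 2002, 2001, 2003],
})

# Without inplace=True a new DataFrame is returned; toy is unchanged
deduped = toy.drop_duplicates(keep='first')

print(len(toy), len(deduped))   # 4 3
print(deduped.index.tolist())   # [0, 1, 3]

# Optionally renumber the rows 0..n-1 after dropping
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Returning a new frame makes it easy to compare the data before and after the operation.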

Exercise 4

Now suppose you want to keep just the last movie of each main actor (actor_1_name) and drop all of their earlier appearances. Also check how many rows you have before and after the deletion.

In [ ]:
# your code goes here
In [ ]:
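# Row count before removing duplicates
print('before:', df.shape)

# Keep only the last row found for each main actor
df.drop_duplicates(subset=['actor_1_name'], keep='last', inplace=True)

# Row count after removing duplicates
print('after:', df.shape)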
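
Note that `keep='last'` means the last row in the frame's current order, which is not necessarily the most recent release. If "last movie" should mean the latest by year, one option is to sort first. The sketch below runs on a made-up `toy` frame; the `movie_title` and `title_year` column names are assumptions for illustration, so check the actual column names in `df` before adapting it.

```python
import pandas as pd

# Hypothetical toy data; movie_title and title_year are assumed column names
toy = pd.DataFrame({
    'actor_1_name': ['Ann', 'Bob', 'Ann', 'Bob'],
    'movie_title':  ['W', 'X', 'Y', 'Z'],
    'title_year':   [2010, 2005, 2003, 2012],
})

# Without sorting, 'last' is simply the last row per actor in file order
by_position = toy.drop_duplicates(subset=['actor_1_name'], keep='last')

# Sorting by year first makes 'last' the most recent release per actor
by_year = (
    toy.sort_values('title_year')
       .drop_duplicates(subset=['actor_1_name'], keep='last')
)

print(by_position['movie_title'].tolist())  # ['Y', 'Z']
print(by_year['movie_title'].tolist())      # ['W', 'Z']
```

For Ann, the positional version keeps 'Y' (2003) because it appears later in the frame, while the sorted version keeps 'W' (2010).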
