Profile picture

2.3 - Cleaning Not Null Values

Last updated: February 16th, 20192019-02-16Project preview

rmotr


Cleaning not-null values - Exercises

In [ ]:
import numpy as np
import pandas as pd

green-divider

We are going to use a real estate dataset with data of many properties.

  • STREET_NUM: Street number (numeric)
  • STREET_NAME: Street name (string)
  • OWN_OCCUPIED: Is the residence owner occupied (boolean)
  • BEDROOMS: Number of bedrooms (numeric)
  • BATHROOMS: Number of bathrooms (numeric)
  • SQUARE_FEET: Total area (numeric)
  • LAST_UPDATE_YEAR: Year of last update of the property (int)

Exercise 1

Read the real estate dataset from data/property-data.csv and save it as a DataFrame in df variable.

In [ ]:
# your code goes here
In [ ]:
df = pd.read_csv('data/property-data.csv')

df

green-divider

Exercise 2

Check how many missing values have each column.

In [ ]:
# your code goes here

You can try using any() Numpy method on the matrix.

In [ ]:
df.isnull().sum().to_frame()

There are no missing values, but there are a lot of invalid ones.

green-divider

Exercise 3

Find unique values on OWN_OCCUPIED column and replace wrong values.

Valid values should be just True and False.

In [ ]:
# your code goes here
In [ ]:
df['OWN_OCCUPIED'].unique()
In [ ]:
df['OWN_OCCUPIED'].replace({'Y': True,
                            'N': False,
                            '12': True,
                            'na': False,
                            'Z': True}, inplace=True)

df

green-divider

Exercise 4

Find wrong values on BEDROOMS and BATHROOMS columns and replace them with correct values.

Valid values should be positive numbers.

In [ ]:
# your code goes here
In [ ]:
df['BEDROOMS'].unique()
In [ ]:
df['BATHROOMS'].unique()
In [ ]:
df.replace({
    'BEDROOMS': {
        'na': '0',
        'no': '0'
    },
    'BATHROOMS': {
        'no': '0',
        'HURLEY': '0'
    }
}, inplace=True)

df

green-divider

Exercise 5

Find wrong values on SQUARE_FEET column and:

  • 1) Mark as NaN non-numeric values.
  • 2) Make positive all the negative values.
  • 3) Replace NaN values with the mean of the whole column values.
In [ ]:
# your code goes here
# 1-
# 2-
# 3-

Use pd.to_numeric() function with error='coerce' parameter to keep numeric values and mark as NaN the wrong ones. Then use abs() and fillna() using the mean.

In [ ]:
# 1-
df['SQUARE_FEET'] = pd.to_numeric(df['SQUARE_FEET'], errors='coerce')

df['SQUARE_FEET']
In [ ]:
# 2-
df['SQUARE_FEET'] = abs(df['SQUARE_FEET'])

df['SQUARE_FEET']
In [ ]:
# 3-
df['SQUARE_FEET'].fillna(df['SQUARE_FEET'].mean(), inplace=True)

df

green-divider

Exercise 6

Drop duplicated rows and keep just the first appearance.

In [ ]:
# your code goes here

Remember using drop_duplicates() function with keep parameter.

In [ ]:
df.drop_duplicates(keep='first', inplace=True)

df

green-divider

Exercise 7

Split LAST_UPDATE_YEAR and keep just the year.

In [ ]:
# your code goes here
In [ ]:
df['LAST_UPDATE_YEAR'] = df['LAST_UPDATE_YEAR'].str.split('-', expand=True)[1]

df

green-divider

Exercise 8

Cast LAST_UPDATE_YEAR column type to int and add 2000 to make the value full year.

In [ ]:
# your code goes here

Use astype() function.

In [ ]:
df['LAST_UPDATE_YEAR'] = df['LAST_UPDATE_YEAR'].astype('int') + 2000

df

purple-divider

Notebooks AI
Notebooks AI Profile20060