EDA Google Playstore

Last updated: March 14th, 2019

Google Playstore App EDA

Dataset

Data Cleaning Process

Step by step process:
1) Data Types
2) Null objects


Strategies:
1) Drop
2) Fix based on heuristics
3) Fill with values (scalars, computed (mean), ffill/bfill)
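
As a rough illustration of the three strategies above, here is a minimal sketch on a toy DataFrame. The column names and values are hypothetical and not taken from the Play Store dataset, and the "price 0 means Free" rule is only an assumed heuristic.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the Play Store dataset).
toy = pd.DataFrame({
    "rating": [4.1, np.nan, 3.9, np.nan],
    "type": ["Free", "Paid", None, "Free"],
    "price": [0.0, 4.99, 0.0, 0.0],
})

# 1) Drop: discard every row that has at least one missing value.
dropped = toy.dropna()

# 2) Fix based on heuristics: assume an app with price 0 is "Free".
fixed = toy.copy()
free_mask = fixed["type"].isna() & (fixed["price"] == 0)
fixed.loc[free_mask, "type"] = "Free"

# 3) Fill with values: a computed scalar (the mean) or a forward fill.
mean_filled = toy["rating"].fillna(toy["rating"].mean())
forward_filled = toy["rating"].ffill()
```

Which strategy fits depends on the column: dropping is the safest when only a handful of rows are affected, while filling changes the distribution of the data.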

DEVELOPMENT IDEAS:

  • Genres: To dummies
  • New column combining Last Updated with Android Ver for a "freshness score".
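
The first idea could be sketched as follows, assuming the `;` character separates multiple genres (as in values like "Art & Design;Pretend Play"). This is only one possible approach, shown on a small hand-made Series rather than on the real column.

```python
import pandas as pd

# A few genre strings in the same shape as the dataset's Genres column.
genres = pd.Series([
    "Art & Design",
    "Art & Design;Pretend Play",
    "Education;Pretend Play",
])

# One indicator column per individual genre, splitting on ';'.
dummies = genres.str.get_dummies(sep=";")
print(dummies)
```

Note that `str.get_dummies` works on string data, so a column loaded as `category` would likely need `.astype(str)` first.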
In [1]:
#!unzip google-play-store-apps.zip
In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
In [3]:
!head -n 1 googleplaystore.csv
App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
In [4]:
pd.read_csv('googleplaystore.csv', nrows=2)
Out[4]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
In [5]:
df = pd.read_csv('googleplaystore.csv', dtype={
    'Category': 'category',
    'Type': 'category',
    'Content Rating': 'category',
    'Genres': 'category',
}, parse_dates=['Last Updated'])
df.head()
Out[5]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null category
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null category
Price             10841 non-null object
Content Rating    10840 non-null category
Genres            10841 non-null category
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: category(4), float64(1), object(8)
memory usage: 812.4+ KB

Data Cleaning Process

In [7]:
df['Android Ver'].value_counts()
Out[7]:
4.1 and up            2451
4.0.3 and up          1501
4.0 and up            1375
Varies with device    1362
4.4 and up             980
2.3 and up             652
5.0 and up             601
4.2 and up             394
2.3.3 and up           281
2.2 and up             244
4.3 and up             243
3.0 and up             241
2.1 and up             134
1.6 and up             116
6.0 and up              60
7.0 and up              42
3.2 and up              36
2.0 and up              32
5.1 and up              24
1.5 and up              20
4.4W and up             12
3.1 and up              10
2.0.1 and up             7
8.0 and up               6
7.1 and up               3
1.0 and up               2
5.0 - 8.0                2
4.0.3 - 7.1.1            2
4.1 - 7.1.1              1
7.0 - 7.1.1              1
5.0 - 7.1.1              1
2.2 - 7.1.1              1
5.0 - 6.0                1
Name: Android Ver, dtype: int64

1. Data Types

Things to fix:

  • Why aren't Reviews and Price read as numeric types?
  • What's the real data type of Installs and Android Ver?
Fixing Reviews
In [8]:
df['Reviews'].str.isnumeric().head()
Out[8]:
0    True
1    True
2    True
3    True
4    True
Name: Reviews, dtype: bool
In [9]:
df.loc[~df['Reviews'].str.isnumeric()]
Out[9]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
10472 Life Made WI-Fi Touchscreen Photo Frame 1.9 19.0 3.0M 1,000+ Free 0 Everyone NaN February 11, 2018 1.0.19 4.0 and up NaN

(Link to app in Google Play)

But this app doesn't seem to have 3 MILLION reviews on its Google Play page.

And what's worse, all the other values seem to be incorrect too 🤦:

  • Rating: 19
  • Size: 1,000+
  • Installs: Free
  • Price: Everyone


It's clearly the result of bad scraping. What we'll do is remove it altogether:

In [10]:
df.shape
Out[10]:
(10841, 13)
In [11]:
df.loc[~df['Reviews'].str.isnumeric()]
Out[11]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
10472 Life Made WI-Fi Touchscreen Photo Frame 1.9 19.0 3.0M 1,000+ Free 0 Everyone NaN February 11, 2018 1.0.19 4.0 and up NaN
In [12]:
df.loc[~df['Reviews'].str.isnumeric()].index
Out[12]:
Int64Index([10472], dtype='int64')
In [13]:
df.drop(df.loc[~df['Reviews'].str.isnumeric()].index, inplace=True)
In [14]:
df.shape
Out[14]:
(10840, 13)
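
An alternative to dropping in place is to keep only the rows that pass the check and rebuild the index, which avoids the gap left behind by the removed row. The sketch below uses a toy frame with made-up app names; only the "3.0M" value mirrors the bad row above.

```python
import pandas as pd

# Toy frame: the middle row has a non-numeric Reviews value.
toy = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Reviews": ["159", "3.0M", "967"],
})

# Keep only rows whose Reviews value is made of digits, then renumber.
is_numeric = toy["Reviews"].str.isnumeric()
clean = (
    toy.loc[is_numeric]
       .reset_index(drop=True)
       .assign(Reviews=lambda d: pd.to_numeric(d["Reviews"]))
)
```

Whether to reset the index is a judgment call: keeping the original labels makes it easier to trace rows back to the CSV file.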

We can finally convert Reviews to its proper data type:

In [15]:
df.head()
Out[15]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
In [16]:
df['Reviews'] = pd.to_numeric(df['Reviews'])
In [17]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
App               10840 non-null object
Category          10840 non-null category
Rating            9366 non-null float64
Reviews           10840 non-null int64
Size              10840 non-null object
Installs          10840 non-null object
Type              10839 non-null category
Price             10840 non-null object
Content Rating    10840 non-null category
Genres            10840 non-null category
Last Updated      10840 non-null object
Current Ver       10832 non-null object
Android Ver       10838 non-null object
dtypes: category(4), float64(1), int64(1), object(7)
memory usage: 897.0+ KB
Fixing Price

Let's take a quick look at the Price column to see what's going on:

In [18]:
df.loc[~df['Price'].str.isnumeric()]
Out[18]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
234 TurboScan: scan documents and receipts in PDF BUSINESS 4.7 11442 6.8M 100,000+ Paid $4.99 Everyone Business March 25, 2018 1.5.2 4.0 and up
235 Tiny Scanner Pro: PDF Doc Scan BUSINESS 4.8 10295 39M 100,000+ Paid $4.99 Everyone Business April 11, 2017 3.4.6 3.0 and up
290 TurboScan: scan documents and receipts in PDF BUSINESS 4.7 11442 6.8M 100,000+ Paid $4.99 Everyone Business March 25, 2018 1.5.2 4.0 and up
291 Tiny Scanner Pro: PDF Doc Scan BUSINESS 4.8 10295 39M 100,000+ Paid $4.99 Everyone Business April 11, 2017 3.4.6 3.0 and up
427 Puffin Browser Pro COMMUNICATION 4.0 18247 Varies with device 100,000+ Paid $3.99 Everyone Communication July 5, 2018 7.5.3.20547 4.1 and up
476 Moco+ - Chat, Meet People DATING 4.2 1545 Varies with device 10,000+ Paid $3.99 Mature 17+ Dating June 19, 2018 2.6.139 4.1 and up
477 Calculator DATING 2.6 57 6.2M 1,000+ Paid $6.99 Everyone Dating October 25, 2017 1.1.6 4.0 and up
478 Truth or Dare Pro DATING NaN 0 20M 50+ Paid $1.49 Teen Dating September 1, 2017 1.0 4.0 and up
479 Private Dating, Hide App- Blue for PrivacyHider DATING NaN 0 18k 100+ Paid $2.99 Everyone Dating July 25, 2017 1.0.1 4.0 and up
480 Ad Blocker for SayHi DATING NaN 4 1.2M 100+ Paid $3.99 Teen Dating August 2, 2018 1.2 4.0.3 and up
481 AMBW Dating App: Asian Men Black Women Interra... DATING 3.5 2 17M 100+ Paid $7.99 Mature 17+ Dating January 21, 2017 1.0.1 4.0 and up
571 Moco+ - Chat, Meet People DATING 4.2 1546 Varies with device 10,000+ Paid $3.99 Mature 17+ Dating June 19, 2018 2.6.139 4.1 and up
851 Sago Mini Hat Maker EDUCATION 4.9 11 63M 1,000+ Paid $3.99 Everyone Education;Pretend Play July 24, 2017 1.0 4.0.3 and up
852 Fuzzy Numbers: Pre-K Number Foundation EDUCATION 4.7 21 44M 1,000+ Paid $5.99 Everyone Education;Education July 21, 2017 1.3 4.1 and up
853 Toca Life: City EDUCATION 4.7 31085 24M 500,000+ Paid $3.99 Everyone Education;Pretend Play July 6, 2018 1.5-play 4.4 and up
854 Toca Life: Hospital EDUCATION 4.7 3528 24M 100,000+ Paid $3.99 Everyone Education;Pretend Play June 12, 2018 1.1.1-play 4.4 and up
995 My Talking Pet ENTERTAINMENT 4.6 6238 Varies with device 100,000+ Paid $4.99 Everyone Entertainment June 30, 2018 Varies with device Varies with device
1001 Meme Generator ENTERTAINMENT 4.6 3771 53M 100,000+ Paid $2.99 Mature 17+ Entertainment August 3, 2018 4.426 4.1 and up
1227 My CookBook Pro (Ad Free) FOOD_AND_DRINK 4.6 2129 Varies with device 10,000+ Paid $3.49 Everyone Food & Drink June 28, 2018 Varies with device Varies with device
1228 Paprika Recipe Manager FOOD_AND_DRINK 4.1 1268 2.3M 50,000+ Paid $4.99 Everyone Food & Drink June 3, 2018 1.4.4 4.0 and up
1327 Pocket Yoga HEALTH_AND_FITNESS 4.4 2107 Varies with device 100,000+ Paid $2.99 Everyone Health & Fitness December 22, 2015 Varies with device Varies with device
1335 Meditation Studio HEALTH_AND_FITNESS 4.6 1026 29M 10,000+ Paid $3.99 Everyone Health & Fitness May 15, 2018 1.0.6 4.3 and up
1341 Relax Melodies P: Sleep Sounds HEALTH_AND_FITNESS 4.8 19543 Varies with device 100,000+ Paid $2.99 Everyone Health & Fitness January 19, 2018 Varies with device Varies with device
1347 Pocket Yoga HEALTH_AND_FITNESS 4.4 2107 Varies with device 100,000+ Paid $2.99 Everyone Health & Fitness December 22, 2015 Varies with device Varies with device
1831 The Game of Life GAME 4.4 18621 63M 100,000+ Paid $2.99 Everyone Board July 4, 2018 2.1.2 4.4 and up
1832 Clue GAME 4.6 19922 35M 100,000+ Paid $1.99 Everyone 10+ Board July 30, 2018 2.2.5 5.0 and up
1833 The Room: Old Sins GAME 4.9 21119 48M 100,000+ Paid $4.99 Everyone Puzzle April 18, 2018 1.0.1 4.4 and up
1834 The Escapists GAME 4.4 7412 84M 100,000+ Paid $4.99 Teen Strategy April 26, 2018 1.1.0 2.3 and up
1835 Farming Simulator 18 GAME 4.5 18125 15M 100,000+ Paid $4.99 Everyone Simulation;Education July 9, 2018 Varies with device 4.4 and up
1836 RollerCoaster Tycoon® Classic GAME 4.6 10795 69M 100,000+ Paid $5.99 Everyone Simulation December 21, 2017 1.2.1.1712080 4.0.3 and up
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10453 Talkie Pro - Wi-Fi Calling, Chats, File Sharing COMMUNICATION 4.5 201 Varies with device 1,000+ Paid $2.99 Everyone Communication January 6, 2018 Varies with device Varies with device
10457 WiFi Monitor Pro - analyzer of Wi-Fi networks TOOLS 4.6 85 2.4M 1,000+ Paid $2.99 Everyone Tools July 5, 2018 1.9 4.0 and up
10459 SCI-FI UI FAMILY 4.7 15 3.9M 100+ Paid $1.99 Everyone Entertainment April 16, 2018 0.0.53 1.6 and up
10460 Wi-Fi Rabbit Unlock Key TOOLS 4.5 142 26k 5,000+ Paid $1.00 Everyone Tools June 26, 2011 1.0.0 2.1 and up
10517 FJ Toolkit TOOLS NaN 1 2.5M 100+ Paid $1.49 Everyone Tools December 21, 2015 14 4.0 and up
10531 Kernel Manager for Franco Kernel ✨ TOOLS 4.8 12700 10M 100,000+ Paid $3.49 Everyone Tools August 3, 2018 3.2.5 5.0 and up
10540 Ray Financial Calculator Pro FINANCE 4.0 67 2.4M 10,000+ Paid $2.99 Everyone Finance July 3, 2017 4 3.2 and up
10570 FL SW Fishing Regulations SPORTS 4.6 60 24M 1,000+ Paid $1.99 Everyone Sports March 7, 2014 1.03 2.2 and up
10583 Florida Tides & Weather WEATHER 3.8 30 2.0M 1,000+ Paid $6.99 Everyone Weather May 6, 2015 2.0.0 2.3 and up
10586 FL Racing Manager 2015 Pro SPORTS 4.4 656 22M 5,000+ Paid $0.99 Everyone Sports March 12, 2016 0.858 3.0 and up
10594 FL Racing Manager 2018 Pro SPORTS 4.3 340 15M 5,000+ Paid $1.99 Everyone Sports March 17, 2018 1.18 3.0 and up
10645 Football Manager Mobile 2018 SPORTS 3.9 11460 Varies with device 100,000+ Paid $8.99 Everyone Sports June 27, 2018 Varies with device 4.1 and up
10650 FN pistol Model 1906 explained BOOKS_AND_REFERENCE NaN 1 5.3M 10+ Paid $5.49 Everyone Books & Reference March 9, 2017 Android 3.0 - 2017 1.6 and up
10651 FN pistol model 1903 explained BOOKS_AND_REFERENCE NaN 1 19M 10+ Paid $6.49 Everyone Books & Reference September 5, 2015 Android 3.0 - 2015 1.6 and up
10661 The FN "Baby" pistol explained BOOKS_AND_REFERENCE NaN 1 8.8M 10+ Paid $5.99 Everyone Books & Reference September 6, 2015 Android 3.0 - 2015 1.6 and up
10662 FN FAL rifle explained BOOKS_AND_REFERENCE NaN 1 7.3M 10+ Paid $6.49 Everyone Books & Reference September 6, 2015 Android 3.0 - 2015 1.6 and up
10664 The FN HP pistol explained BOOKS_AND_REFERENCE NaN 1 8.5M 10+ Paid $6.49 Everyone Books & Reference September 6, 2015 Android 3.0 - 2015 1.6 and up
10668 FN model 1900 pistol explained BOOKS_AND_REFERENCE NaN 0 8.2M 10+ Paid $6.49 Everyone Books & Reference September 5, 2015 Android 3.0 - 2015 1.6 and up
10669 Pistolet FN GP35 expliqué BOOKS_AND_REFERENCE NaN 2 7.9M 5+ Paid $5.99 Everyone Books & Reference August 19, 2014 Android 2.0 - 2014 1.6 and up
10674 Pistolet FN 1906 expliqué BOOKS_AND_REFERENCE NaN 0 5.2M 10+ Paid $5.49 Everyone Books & Reference August 17, 2014 Android 2.0 - 2014 1.6 and up
10675 Circle Colors Pack-FN Theme PERSONALIZATION 4.2 6 89k 50+ Paid $0.99 Everyone Personalization August 9, 2013 1.0 2.2 and up
10679 Solitaire+ GAME 4.6 11235 Varies with device 100,000+ Paid $2.99 Everyone Card July 30, 2018 Varies with device Varies with device
10682 Fruit Ninja Classic GAME 4.3 85468 36M 1,000,000+ Paid $0.99 Everyone Arcade June 8, 2018 2.4.1.485300 4.0.3 and up
10690 FO Bixby PERSONALIZATION 5.0 5 861k 100+ Paid $0.99 Everyone Personalization April 25, 2018 0.2 7.0 and up
10697 Mu.F.O. GAME 5.0 2 16M 1+ Paid $0.99 Everyone Arcade March 3, 2017 1.0 2.3 and up
10735 FP VoiceBot FAMILY NaN 17 157k 100+ Paid $0.99 Mature 17+ Entertainment November 25, 2015 1.2 2.1 and up
10760 Fast Tract Diet HEALTH_AND_FITNESS 4.4 35 2.4M 1,000+ Paid $7.99 Everyone Health & Fitness August 8, 2018 1.9.3 4.2 and up
10782 Trine 2: Complete Story GAME 3.8 252 11M 10,000+ Paid $16.99 Teen Action February 27, 2015 2.22 5.0 and up
10785 sugar, sugar FAMILY 4.2 1405 9.5M 10,000+ Paid $1.20 Everyone Puzzle June 5, 2018 2.7 2.3 and up
10798 Word Search Tab 1 FR FAMILY NaN 0 1020k 50+ Paid $1.04 Everyone Puzzle February 6, 2012 1.1 3.0 and up

800 rows × 13 columns

The $ character is messing up our data. I'll just strip it out using the str accessor:

In [19]:
df['Price'] = df['Price'].str.replace('$', '', regex=False)  # treat '$' literally, not as a regex anchor

But wait, is it fixed? Let's check again:

In [20]:
df.loc[~df['Price'].str.isnumeric()]
Out[20]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
234 TurboScan: scan documents and receipts in PDF BUSINESS 4.7 11442 6.8M 100,000+ Paid 4.99 Everyone Business March 25, 2018 1.5.2 4.0 and up
235 Tiny Scanner Pro: PDF Doc Scan BUSINESS 4.8 10295 39M 100,000+ Paid 4.99 Everyone Business April 11, 2017 3.4.6 3.0 and up
290 TurboScan: scan documents and receipts in PDF BUSINESS 4.7 11442 6.8M 100,000+ Paid 4.99 Everyone Business March 25, 2018 1.5.2 4.0 and up
291 Tiny Scanner Pro: PDF Doc Scan BUSINESS 4.8 10295 39M 100,000+ Paid 4.99 Everyone Business April 11, 2017 3.4.6 3.0 and up
427 Puffin Browser Pro COMMUNICATION 4.0 18247 Varies with device 100,000+ Paid 3.99 Everyone Communication July 5, 2018 7.5.3.20547 4.1 and up
476 Moco+ - Chat, Meet People DATING 4.2 1545 Varies with device 10,000+ Paid 3.99 Mature 17+ Dating June 19, 2018 2.6.139 4.1 and up
477 Calculator DATING 2.6 57 6.2M 1,000+ Paid 6.99 Everyone Dating October 25, 2017 1.1.6 4.0 and up
478 Truth or Dare Pro DATING NaN 0 20M 50+ Paid 1.49 Teen Dating September 1, 2017 1.0 4.0 and up
479 Private Dating, Hide App- Blue for PrivacyHider DATING NaN 0 18k 100+ Paid 2.99 Everyone Dating July 25, 2017 1.0.1 4.0 and up
480 Ad Blocker for SayHi DATING NaN 4 1.2M 100+ Paid 3.99 Teen Dating August 2, 2018 1.2 4.0.3 and up
481 AMBW Dating App: Asian Men Black Women Interra... DATING 3.5 2 17M 100+ Paid 7.99 Mature 17+ Dating January 21, 2017 1.0.1 4.0 and up
571 Moco+ - Chat, Meet People DATING 4.2 1546 Varies with device 10,000+ Paid 3.99 Mature 17+ Dating June 19, 2018 2.6.139 4.1 and up
851 Sago Mini Hat Maker EDUCATION 4.9 11 63M 1,000+ Paid 3.99 Everyone Education;Pretend Play July 24, 2017 1.0 4.0.3 and up
852 Fuzzy Numbers: Pre-K Number Foundation EDUCATION 4.7 21 44M 1,000+ Paid 5.99 Everyone Education;Education July 21, 2017 1.3 4.1 and up
853 Toca Life: City EDUCATION 4.7 31085 24M 500,000+ Paid 3.99 Everyone Education;Pretend Play July 6, 2018 1.5-play 4.4 and up
854 Toca Life: Hospital EDUCATION 4.7 3528 24M 100,000+ Paid 3.99 Everyone Education;Pretend Play June 12, 2018 1.1.1-play 4.4 and up
995 My Talking Pet ENTERTAINMENT 4.6 6238 Varies with device 100,000+ Paid 4.99 Everyone Entertainment June 30, 2018 Varies with device Varies with device
1001 Meme Generator ENTERTAINMENT 4.6 3771 53M 100,000+ Paid 2.99 Mature 17+ Entertainment August 3, 2018 4.426 4.1 and up
1227 My CookBook Pro (Ad Free) FOOD_AND_DRINK 4.6 2129 Varies with device 10,000+ Paid 3.49 Everyone Food & Drink June 28, 2018 Varies with device Varies with device
1228 Paprika Recipe Manager FOOD_AND_DRINK 4.1 1268 2.3M 50,000+ Paid 4.99 Everyone Food & Drink June 3, 2018 1.4.4 4.0 and up
1327 Pocket Yoga HEALTH_AND_FITNESS 4.4 2107 Varies with device 100,000+ Paid 2.99 Everyone Health & Fitness December 22, 2015 Varies with device Varies with device
1335 Meditation Studio HEALTH_AND_FITNESS 4.6 1026 29M 10,000+ Paid 3.99 Everyone Health & Fitness May 15, 2018 1.0.6 4.3 and up
1341 Relax Melodies P: Sleep Sounds HEALTH_AND_FITNESS 4.8 19543 Varies with device 100,000+ Paid 2.99 Everyone Health & Fitness January 19, 2018 Varies with device Varies with device
1347 Pocket Yoga HEALTH_AND_FITNESS 4.4 2107 Varies with device 100,000+ Paid 2.99 Everyone Health & Fitness December 22, 2015 Varies with device Varies with device
1831 The Game of Life GAME 4.4 18621 63M 100,000+ Paid 2.99 Everyone Board July 4, 2018 2.1.2 4.4 and up
1832 Clue GAME 4.6 19922 35M 100,000+ Paid 1.99 Everyone 10+ Board July 30, 2018 2.2.5 5.0 and up
1833 The Room: Old Sins GAME 4.9 21119 48M 100,000+ Paid 4.99 Everyone Puzzle April 18, 2018 1.0.1 4.4 and up
1834 The Escapists GAME 4.4 7412 84M 100,000+ Paid 4.99 Teen Strategy April 26, 2018 1.1.0 2.3 and up
1835 Farming Simulator 18 GAME 4.5 18125 15M 100,000+ Paid 4.99 Everyone Simulation;Education July 9, 2018 Varies with device 4.4 and up
1836 RollerCoaster Tycoon® Classic GAME 4.6 10795 69M 100,000+ Paid 5.99 Everyone Simulation December 21, 2017 1.2.1.1712080 4.0.3 and up
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10453 Talkie Pro - Wi-Fi Calling, Chats, File Sharing COMMUNICATION 4.5 201 Varies with device 1,000+ Paid 2.99 Everyone Communication January 6, 2018 Varies with device Varies with device
10457 WiFi Monitor Pro - analyzer of Wi-Fi networks TOOLS 4.6 85 2.4M 1,000+ Paid 2.99 Everyone Tools July 5, 2018 1.9 4.0 and up
10459 SCI-FI UI FAMILY 4.7 15 3.9M 100+ Paid 1.99 Everyone Entertainment April 16, 2018 0.0.53 1.6 and up
10460 Wi-Fi Rabbit Unlock Key TOOLS 4.5 142 26k 5,000+ Paid 1.00 Everyone Tools June 26, 2011 1.0.0 2.1 and up
10517 FJ Toolkit TOOLS NaN 1 2.5M 100+ Paid 1.49 Everyone Tools December 21, 2015 14 4.0 and up
10531 Kernel Manager for Franco Kernel ✨ TOOLS 4.8 12700 10M 100,000+ Paid 3.49 Everyone Tools August 3, 2018 3.2.5 5.0 and up
10540 Ray Financial Calculator Pro FINANCE 4.0 67 2.4M 10,000+ Paid 2.99 Everyone Finance July 3, 2017 4 3.2 and up
10570 FL SW Fishing Regulations SPORTS 4.6 60 24M 1,000+ Paid 1.99 Everyone Sports March 7, 2014 1.03 2.2 and up
10583 Florida Tides & Weather WEATHER 3.8 30 2.0M 1,000+ Paid 6.99 Everyone Weather May 6, 2015 2.0.0 2.3 and up
10586 FL Racing Manager 2015 Pro SPORTS 4.4 656 22M 5,000+ Paid 0.99 Everyone Sports March 12, 2016 0.858 3.0 and up
10594 FL Racing Manager 2018 Pro SPORTS 4.3 340 15M 5,000+ Paid 1.99 Everyone Sports March 17, 2018 1.18 3.0 and up
10645 Football Manager Mobile 2018 SPORTS 3.9 11460 Varies with device 100,000+ Paid 8.99 Everyone Sports June 27, 2018 Varies with device 4.1 and up
10650 FN pistol Model 1906 explained BOOKS_AND_REFERENCE NaN 1 5.3M 10+ Paid 5.49 Everyone Books & Reference March 9, 2017 Android 3.0 - 2017 1.6 and up
10651 FN pistol model 1903 explained BOOKS_AND_REFERENCE NaN 1 19M 10+ Paid 6.49 Everyone Books & Reference September 5, 2015 Android 3.0 - 2015 1.6 and up
10661 The FN "Baby" pistol explained BOOKS_AND_REFERENCE NaN 1 8.8M 10+ Paid 5.99 Everyone Books & Reference September 6, 2015 Android 3.0 - 2015 1.6 and up
10662 FN FAL rifle explained BOOKS_AND_REFERENCE NaN 1 7.3M 10+ Paid 6.49 Everyone Books & Reference September 6, 2015 Android 3.0 - 2015 1.6 and up
10664 The FN HP pistol explained BOOKS_AND_REFERENCE NaN 1 8.5M 10+ Paid 6.49 Everyone Books & Reference September 6, 2015 Android 3.0 - 2015 1.6 and up
10668 FN model 1900 pistol explained BOOKS_AND_REFERENCE NaN 0 8.2M 10+ Paid 6.49 Everyone Books & Reference September 5, 2015 Android 3.0 - 2015 1.6 and up
10669 Pistolet FN GP35 expliqué BOOKS_AND_REFERENCE NaN 2 7.9M 5+ Paid 5.99 Everyone Books & Reference August 19, 2014 Android 2.0 - 2014 1.6 and up
10674 Pistolet FN 1906 expliqué BOOKS_AND_REFERENCE NaN 0 5.2M 10+ Paid 5.49 Everyone Books & Reference August 17, 2014 Android 2.0 - 2014 1.6 and up
10675 Circle Colors Pack-FN Theme PERSONALIZATION 4.2 6 89k 50+ Paid 0.99 Everyone Personalization August 9, 2013 1.0 2.2 and up
10679 Solitaire+ GAME 4.6 11235 Varies with device 100,000+ Paid 2.99 Everyone Card July 30, 2018 Varies with device Varies with device
10682 Fruit Ninja Classic GAME 4.3 85468 36M 1,000,000+ Paid 0.99 Everyone Arcade June 8, 2018 2.4.1.485300 4.0.3 and up
10690 FO Bixby PERSONALIZATION 5.0 5 861k 100+ Paid 0.99 Everyone Personalization April 25, 2018 0.2 7.0 and up
10697 Mu.F.O. GAME 5.0 2 16M 1+ Paid 0.99 Everyone Arcade March 3, 2017 1.0 2.3 and up
10735 FP VoiceBot FAMILY NaN 17 157k 100+ Paid 0.99 Mature 17+ Entertainment November 25, 2015 1.2 2.1 and up
10760 Fast Tract Diet HEALTH_AND_FITNESS 4.4 35 2.4M 1,000+ Paid 7.99 Everyone Health & Fitness August 8, 2018 1.9.3 4.2 and up
10782 Trine 2: Complete Story GAME 3.8 252 11M 10,000+ Paid 16.99 Teen Action February 27, 2015 2.22 5.0 and up
10785 sugar, sugar FAMILY 4.2 1405 9.5M 10,000+ Paid 1.20 Everyone Puzzle June 5, 2018 2.7 2.3 and up
10798 Word Search Tab 1 FR FAMILY NaN 0 1020k 50+ Paid 1.04 Everyone Puzzle February 6, 2012 1.1 3.0 and up

800 rows × 13 columns

Seems like str.isnumeric() isn't working correctly. Actually, it's just a pretty "limited" method: all it does is "Return true if all characters in the string are numeric characters", the same as Python's string method isnumeric(). A decimal point is not a numeric character, so a value like '4.99' fails the check:

In [21]:
pd.Series(['one', '4.9', '321', '1', '']).str.isnumeric()
Out[21]:
0    False
1    False
2     True
3     True
4    False
dtype: bool

The most robust check is trying to coerce your values into numbers:

In [22]:
pd.to_numeric(pd.Series(['one', '4.9', '1', '']), errors='coerce')
Out[22]:
0    NaN
1    4.9
2    1.0
3    NaN
dtype: float64

We can then check which values came out as NaN. Putting it all together:

In [23]:
pd.to_numeric(df['Price'], errors='coerce').isna().sum()
Out[23]:
0
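
To see which values fail the conversion (rather than just how many), the same idea can be wrapped in a small helper. `non_numeric_values` is a hypothetical name introduced here for illustration, and the sample prices are made up.

```python
import pandas as pd

def non_numeric_values(series):
    """Return the entries that are present but cannot be parsed as numbers."""
    coerced = pd.to_numeric(series, errors="coerce")
    # Exclude values that were already missing before the coercion.
    return series[coerced.isna() & series.notna()]

prices = pd.Series(["0", "$4.99", "4.99", None, "Everyone"])
bad = non_numeric_values(prices)
print(bad)
```

Separating "was already missing" from "could not be parsed" matters because `errors='coerce'` turns both into NaN.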

All the values in Price are now numeric, so we can replace the column altogether:

In [24]:
df['Price'] = pd.to_numeric(df['Price'])
In [25]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
App               10840 non-null object
Category          10840 non-null category
Rating            9366 non-null float64
Reviews           10840 non-null int64
Size              10840 non-null object
Installs          10840 non-null object
Type              10839 non-null category
Price             10840 non-null float64
Content Rating    10840 non-null category
Genres            10840 non-null category
Last Updated      10840 non-null object
Current Ver       10832 non-null object
Android Ver       10838 non-null object
dtypes: category(4), float64(2), int64(1), object(6)
memory usage: 897.0+ KB
In [26]:
df.head()
Out[26]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0.0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0.0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0.0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0.0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0.0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
The curious case of Installs

Installs looks like a numeric column at first glance. But exploring Google Play carefully, the numbers turn out to be not that detailed; they're a bit more "categorical". That is, if your app has 20,381 installs, it won't be shown as "20,000+"; it'll still be in the 10,000+ category. We can quickly verify this with the unique/value_counts methods:

In [27]:
df['Installs'].unique()
Out[27]:
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
       '10+', '1+', '5+', '0+', '0'], dtype=object)
In [28]:
df['Installs'].value_counts()
Out[28]:
1,000,000+        1579
10,000,000+       1252
100,000+          1169
10,000+           1054
1,000+             907
5,000,000+         752
100+               719
500,000+           539
50,000+            479
5,000+             477
100,000,000+       409
10+                386
500+               330
50,000,000+        289
50+                205
5+                  82
500,000,000+        72
1+                  67
1,000,000,000+      58
0+                  14
0                    1
Name: Installs, dtype: int64
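
The bucketing behaviour described above can be sketched with a sorted list of thresholds. The thresholds below are simply read off the labels seen in this dataset; that Google Play computes its labels this way is an assumption, and `install_bucket` is a hypothetical helper.

```python
from bisect import bisect_right

# Lower bounds inferred from the labels in the dataset (an assumption).
THRESHOLDS = [
    0, 1, 5, 10, 50, 100, 500,
    1_000, 5_000, 10_000, 50_000, 100_000, 500_000,
    1_000_000, 5_000_000, 10_000_000, 50_000_000,
    100_000_000, 500_000_000, 1_000_000_000,
]

def install_bucket(count):
    """Map an exact install count to the largest threshold not above it."""
    if count < 0:
        raise ValueError("install count cannot be negative")
    idx = bisect_right(THRESHOLDS, count) - 1
    return f"{THRESHOLDS[idx]:,}+"

print(install_bucket(20_381))
```

With these thresholds, 20,381 installs land in the 10,000+ bucket, matching the example in the text.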

This looks like a good candidate for "ordered" categories:

In [29]:
print("\n".join(df['Installs'].unique()))
10,000+
500,000+
5,000,000+
50,000,000+
100,000+
50,000+
1,000,000+
10,000,000+
5,000+
100,000,000+
1,000,000,000+
1,000+
500,000,000+
50+
100+
500+
10+
1+
5+
0+
0
In [30]:
ordered_categories = """
1,000,000,000+
500,000,000+
100,000,000+
50,000,000+
10,000,000+
5,000,000+
1,000,000+
500,000+
100,000+
50,000+
10,000+
5,000+
1,000+
500+
100+
50+
10+
5+
1+
0+
0
"""
ordered_categories.split()[::-1]
Out[30]:
['0',
 '0+',
 '1+',
 '5+',
 '10+',
 '50+',
 '100+',
 '500+',
 '1,000+',
 '5,000+',
 '10,000+',
 '50,000+',
 '100,000+',
 '500,000+',
 '1,000,000+',
 '5,000,000+',
 '10,000,000+',
 '50,000,000+',
 '100,000,000+',
 '500,000,000+',
 '1,000,000,000+']
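
Instead of ordering the labels by hand, the same order could probably be derived from the labels themselves. The sketch below parses the number out of each label and uses the trailing `+` as a tie-breaker so that '0' sorts before '0+'; `install_sort_key` is a hypothetical helper and the label list is a shortened sample.

```python
from pandas.api.types import CategoricalDtype

# A shuffled subset of the labels found in the Installs column.
labels = ['10,000+', '500,000+', '0', '0+', '1+', '1,000,000,000+', '50+']

def install_sort_key(label):
    """Sort by the numeric part; a plain number sorts before its '+' form."""
    number = int(label.rstrip('+').replace(',', ''))
    return (number, label.endswith('+'))

ordered_labels = sorted(labels, key=install_sort_key)
installs_dtype = CategoricalDtype(ordered_labels, ordered=True)
print(ordered_labels)
```

On the real data, `labels` would be `df['Installs'].unique()`, so the order stays correct even if new buckets appear.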
In [31]:
from pandas.api.types import CategoricalDtype
In [32]:
installs_cat = CategoricalDtype(ordered_categories.split()[::-1], ordered=True)

We can finally replace our Installs column:

In [33]:
df['Installs'] = df['Installs'].astype(installs_cat)
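
What the ordering buys us is that comparisons, sorting and min/max follow the category order rather than alphabetical order. A minimal sketch on a toy Series, using only four of the labels:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# A reduced, ordered set of install buckets (subset of the real labels).
dtype = CategoricalDtype(
    ['1,000+', '10,000+', '100,000+', '1,000,000+'], ordered=True
)
installs = pd.Series(
    ['10,000+', '1,000,000+', '1,000+', '100,000+'], dtype=dtype
)

# Comparisons against a category label respect the declared order.
popular = installs[installs >= '100,000+']

# Sorting and max() also follow the category order.
sorted_installs = installs.sort_values()
largest = installs.max()
```

With plain strings, '10,000+' would sort before '1,000+' is even considered numerically; the ordered dtype is what makes these operations meaningful.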
In [34]:
fig, ax = plt.subplots(figsize=(17, 7))
sns.countplot(x='Installs', data=df, ax=ax)  # pass x by keyword; newer seaborn versions reject it positionally
ax.tick_params(axis='x', labelrotation=-45)
In [35]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
App               10840 non-null object
Category          10840 non-null category
Rating            9366 non-null float64
Reviews           10840 non-null int64
Size              10840 non-null object
Installs          10840 non-null category
Type              10839 non-null category
Price             10840 non-null float64
Content Rating    10840 non-null category
Genres            10840 non-null category
Last Updated      10840 non-null object
Current Ver       10832 non-null object
Android Ver       10838 non-null object
dtypes: category(5), float64(2), int64(1), object(5)
memory usage: 1.1+ MB
In [36]:
df.head()
Out[36]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0.0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0.0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0.0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0.0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0.0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
Fixing Last Updated

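Last Updated can now be parsed as a date, since we've removed the problematic row (which is presumably why parse_dates left the column as object when the file was loaded):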

In [37]:
df['Last Updated'] = pd.to_datetime(df['Last Updated'])
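
A slightly more defensive variant is to give the expected format explicitly and coerce failures to NaT, so that unparseable values can be inspected instead of raising an error. The format string below is an assumption based on values such as "January 7, 2018"; "1.0.19" is the value the shifted row held in this column.

```python
import pandas as pd

raw = pd.Series(["January 7, 2018", "August 1, 2018", "1.0.19"])

# Values that do not match the format become NaT instead of raising.
parsed = pd.to_datetime(raw, format="%B %d, %Y", errors="coerce")

# Show the original strings that failed to parse.
unparseable = raw[parsed.isna()]
print(unparseable)
```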
Making Android Ver a categorical value and squashing its categories:
In [38]:
df['Android Ver'].value_counts()
Out[38]:
4.1 and up            2451
4.0.3 and up          1501
4.0 and up            1375
Varies with device    1362
4.4 and up             980
2.3 and up             652
5.0 and up             601
4.2 and up             394
2.3.3 and up           281
2.2 and up             244
4.3 and up             243
3.0 and up             241
2.1 and up             134
1.6 and up             116
6.0 and up              60
7.0 and up              42
3.2 and up              36
2.0 and up              32
5.1 and up              24
1.5 and up              20
4.4W and up             12
3.1 and up              10
2.0.1 and up             7
8.0 and up               6
7.1 and up               3
1.0 and up               2
5.0 - 8.0                2
4.0.3 - 7.1.1            2
4.1 - 7.1.1              1
7.0 - 7.1.1              1
5.0 - 7.1.1              1
2.2 - 7.1.1              1
5.0 - 6.0                1
Name: Android Ver, dtype: int64
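
One possible way to squash these values, shown only as a sketch, is to keep the major number of the minimum supported version and leave everything else untouched. The "N.x" labels and the variable names are choices made for this example, not something the notebook defines.

```python
import pandas as pd

android_ver = pd.Series([
    "4.0.3 and up", "4.4W and up", "5.0 - 8.0", "Varies with device", None,
])

# Leading digits = major number of the minimum supported version.
major = android_ver.str.extract(r"^(\d+)", expand=False)

# Use "N.x" where a version was found; otherwise keep the original value.
squashed = (major + ".x").where(major.notna(), android_ver).astype("category")
print(squashed.value_counts())
```

This collapses ranges such as "5.0 - 8.0" into their lower bound, which loses the upper limit; whether that is acceptable depends on the analysis.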
In [ ]:
df.loc[df['Android Ver'].isin(['5.0 - 7.1.1', '4.1 - 7.1.1', '7.0 - 7.1.1'])]
In [ ]:
df['Size'].value_counts()
In [ ]:
sns.jointplot(x='Installs', y='Rating', data=df)
In [ ]:
df.plot.scatter('Installs', 'Rating', figsize=(14, 7))
In [ ]:
df.loc[df['Android Ver'].isin(['5.0 - 7.1.1', '4.1 - 7.1.1', '7.0 - 7.1.1']), 'Android Ver']
In [ ]:
df.loc[df['Android Ver'].isin(['5.0 - 7.1.1', '4.1 - 7.1.1', '7.0 - 7.1.1']), 'Android Ver'] = "Sonia's Category"
In [ ]:
df.loc[df['Android Ver'].isin(["Sonia's Category"])]
In [ ]:
df['Android Ver'].value_counts()
In [ ]:
pd.qcut(versions.drop_duplicates(), 8)
In [ ]:
df.head()
In [41]:
df.dtypes
Out[41]:
App                       object
Category                category
Rating                   float64
Reviews                    int64
Size                      object
Installs                category
Type                    category
Price                    float64
Content Rating          category
Genres                  category
Last Updated      datetime64[ns]
Current Ver               object
Android Ver               object
dtype: object
In [40]:
df.head()
Out[40]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0.0 Everyone Art & Design 2018-01-07 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0.0 Everyone Art & Design;Pretend Play 2018-01-15 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0.0 Everyone Art & Design 2018-08-01 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0.0 Teen Art & Design 2018-06-08 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0.0 Everyone Art & Design;Creativity 2018-06-20 1.1 4.4 and up