MoSayed Proj 2

Last updated: September 19th, 20202020-09-19Project preview

This project is about getting insights from a database of google play apps stats (ratings, reviews...etc)

In [173]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [174]:
google_data = pd.read_csv('googleplaystore.csv')
google_data.head()
Out[174]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up

Quick look stats about the db

In [175]:
google_data.shape
Out[175]:
(10841, 13)
In [176]:
google_data.describe()
Out[176]:
Rating
count 9367.000000
mean 4.193338
std 0.537431
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 19.000000

But, where are the other stats?

In [177]:
google_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB

So, the reason other columns aren't shown in "describe" is that they are of "object" type

Also we have some null values in some of the db columns


Data Cleaning

In [178]:
google_data.isnull().sum()
Out[178]:
App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

So, we have null values in columns: Rating, Type, Content Rating, Current Ver & Android Ver

lets check for outliers, too (rating shouldn't be > 5):

In [179]:
google_data[google_data['Rating']>5]
Out[179]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
10472 Life Made WI-Fi Touchscreen Photo Frame 1.9 19.0 3.0M 1,000+ Free 0 Everyone NaN February 11, 2018 1.0.19 4.0 and up NaN
In [180]:
google_data.drop([10472], inplace=True)
In [181]:
google_data[google_data['Rating']>5]
Out[181]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver

Ratings outliers are cleared

Now, let's fill the missing cells with values that has the least effect on the stats (like avg value or most common)

In [182]:
google_data['Rating'] = google_data['Rating'].fillna(google_data['Rating'].median())
In [183]:
google_data.isnull().sum()
Out[183]:
App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              1
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       8
Android Ver       2
dtype: int64

Type, Current Ver & Android Ver are strings, so median trick won't work here, we ll use most common value

In [184]:
google_data['Type'].mode().values[0]
Out[184]:
'Free'
In [185]:
google_data['Current Ver'].mode().values[0]
Out[185]:
'Varies with device'
In [186]:
google_data['Android Ver'].mode().values[0]
Out[186]:
'4.1 and up'
In [187]:
google_data['Type'].fillna(str(google_data['Type'].mode().values[0]), inplace=True)
google_data['Current Ver'].fillna(str(google_data['Current Ver'].mode().values[0]), inplace=True)
google_data['Android Ver'].fillna(str(google_data['Android Ver'].mode().values[0]), inplace=True)
In [188]:
google_data.isnull().sum()
Out[188]:
App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64

Now, missing values problem is resolved

Reviews, Price & installs are of object type, while they should be intigers

In [189]:
google_data['Price'] = google_data['Price'].apply(lambda x: str(x).replace('$', '') if '$' in str(x) else str(x))
google_data['Price'] = google_data['Price'].apply(lambda x: float(x))
google_data['Installs'] = google_data['Installs'].apply(lambda x: str(x).replace('+', '') if '+' in str(x) else str(x))
google_data['Installs'] = google_data['Installs'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else str(x))
google_data['Installs'] = google_data['Installs'].apply(lambda x: float(x))
google_data['Reviews'] = pd.to_numeric(google_data['Reviews'], errors='coerce')
In [190]:
google_data.describe()
Out[190]:
Rating Reviews Installs Price
count 10840.000000 1.084000e+04 1.084000e+04 10840.000000
mean 4.206476 4.441529e+05 1.546434e+07 1.027368
std 0.480342 2.927761e+06 8.502936e+07 15.949703
min 1.000000 0.000000e+00 0.000000e+00 0.000000
25% 4.100000 3.800000e+01 1.000000e+03 0.000000
50% 4.300000 2.094000e+03 1.000000e+05 0.000000
75% 4.500000 5.477550e+04 5.000000e+06 0.000000
max 5.000000 7.815831e+07 1.000000e+09 400.000000

Data cleaning is done successfully


Data Visualization

In [191]:
google_data['Category'].value_counts().plot(kind='pie', figsize=(16,16))
Out[191]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5d9395100>

Visually, Family, Game, Tools, Medical & Business are the top no of apps on google play store

Market Segments Exploration

In [193]:
plt.figure(figsize=(12,6))
plt.title('Mrkt Segments Exploration')
sns.scatterplot(google_data.Installs, google_data.Rating, hue=google_data['Content Rating'] , s=90)
Out[193]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5d8cf1bb0>

Visually, we have a profitable segment of high rating high installs (and so high revenues), that segment has the majority of the orange color which is categorized under the 'Teen' content rating

Now, let's view some important top stats:

In [194]:
google_data['Revenues'] = google_data['Installs'] * google_data['Price']
In [195]:
grp = google_data.groupby('Category')
x = grp['Price'].sum().sort_values(ascending=False).head(10)
x
Out[195]:
Category
FINANCE            2900.83
FAMILY             2434.78
LIFESTYLE          2360.87
MEDICAL            1439.96
GAME                287.30
TOOLS               267.25
PRODUCTIVITY        250.93
BUSINESS            185.27
PERSONALIZATION     153.96
PHOTOGRAPHY         134.21
Name: Price, dtype: float64

Top Prices Bar chart

In [196]:
plt.figure(figsize=(12,6))
plt.title('Top Price Categories')
x.plot(kind='bar', rot=90, color='green')
plt.ylabel('Price USD')
Out[196]:
Text(0, 0.5, 'Price USD')
In [197]:
y = grp['Rating'].mean().sort_values(ascending=False).head(10)
y
Out[197]:
Category
EVENTS                 4.395313
EDUCATION              4.388462
ART_AND_DESIGN         4.355385
BOOKS_AND_REFERENCE    4.335498
PERSONALIZATION        4.328827
PARENTING              4.300000
GAME                   4.286888
BEAUTY                 4.283019
HEALTH_AND_FITNESS     4.280059
SHOPPING               4.263077
Name: Rating, dtype: float64

Top Ratings Bar chart

In [198]:
plt.figure(figsize=(12,6))
plt.title('Top Rating Categories')
y.plot(kind='bar', rot=90, color='black', ylim = (4,5))
plt.ylabel('Rating out of 5')
Out[198]:
Text(0, 0.5, 'Rating out of 5')
In [199]:
z = grp['Reviews'].sum().sort_values(ascending=False).head(10)
z
Out[199]:
Category
GAME               1585422349
COMMUNICATION       815462260
SOCIAL              621241422
FAMILY              410226330
TOOLS               273185044
PHOTOGRAPHY         213516650
SHOPPING            115041222
PRODUCTIVITY        114116975
VIDEO_PLAYERS       110380188
PERSONALIZATION      89346140
Name: Reviews, dtype: int64

Top Reviews Bar chart

In [200]:
plt.figure(figsize=(12,6))
plt.title('Top Reviews Categories')
z.plot(kind='bar', rot=90, color='blue')
plt.ylabel('No. of Reviews (billions)')
Out[200]:
Text(0, 0.5, 'No. of Reviews (billions)')
In [201]:
r = grp['Revenues'].sum().sort_values(ascending=False).head(10)
r
Out[201]:
Category
FAMILY             1.860526e+08
LIFESTYLE          5.758394e+07
GAME               4.098764e+07
FINANCE            2.572668e+07
PHOTOGRAPHY        2.092277e+07
MEDICAL            1.357484e+07
PERSONALIZATION    1.027695e+07
TOOLS              5.464821e+06
SPORTS             4.706212e+06
PRODUCTIVITY       4.313375e+06
Name: Revenues, dtype: float64

Top Revenues Bar chart

In [202]:
plt.figure(figsize=(12,6))
plt.title('Top Revenues Categories')
r.plot(kind='bar', rot=90, color='red')
plt.ylabel('Revenues (100 Million $)')
Out[202]:
Text(0, 0.5, 'Revenues (100 Million $)')

Now let's use the Correlation & Heat Map to explore some trends:

In [203]:
corr = google_data.corr()
fig = plt.figure(figsize=(8,8))
plt.matshow(corr, cmap='Purples', fignum=fig.number)
plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical');
plt.yticks(range(len(corr.columns)), corr.columns);

Visually, there is a relationship between price & revenues, and a strong relationship between installs & reviews, which seems to be logic

In [204]:
google_NewIndex = google_data.set_index('Category')
google_Top5 = google_NewIndex.loc[['FAMILY', 'GAME', 'TOOLS', 'MEDICAL', 'BUSINESS']]
In [205]:
google_Top5.loc[['FAMILY', 'GAME', 'TOOLS', 'MEDICAL', 'BUSINESS']] 
Out[205]:
App Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver Revenues
Category
FAMILY YouTube Kids 4.5 470694 Varies with device 50000000.0 Free 0.0 Everyone Entertainment;Music & Video August 3, 2018 3.43.3 4.1 and up 0.0
FAMILY Candy Bomb 4.4 42145 20M 10000000.0 Free 0.0 Everyone Casual;Brain Games July 4, 2018 2.9.3181 4.0.3 and up 0.0
FAMILY ROBLOX 4.5 4449910 67M 100000000.0 Free 0.0 Everyone 10+ Adventure;Action & Adventure July 31, 2018 2.347.225742 4.1 and up 0.0
FAMILY Jewels Crush- Match 3 Puzzle 4.4 14774 19M 1000000.0 Free 0.0 Everyone Casual;Brain Games July 23, 2018 1.9.3901 4.0.3 and up 0.0
FAMILY Coloring & Learn 4.4 12753 51M 5000000.0 Free 0.0 Everyone Educational;Creativity July 17, 2018 1.49 4.0.3 and up 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
BUSINESS FQ Load Board for Transporters 4.3 0 3.9M 100.0 Free 0.0 Everyone Business February 16, 2018 1.1.3 5.0 and up 0.0
BUSINESS Fr Lupupa Sermons 4.8 19 21M 100.0 Free 0.0 Everyone Business June 12, 2018 1 4.4 and up 0.0
BUSINESS DICT.fr Mobile 4.3 20 2.7M 10000.0 Free 0.0 Everyone Business July 17, 2018 2.1.10 4.1 and up 0.0
BUSINESS FieldBi FR Offline 4.3 2 6.8M 100.0 Free 0.0 Everyone Business August 6, 2018 2.1.8 4.1 and up 0.0
BUSINESS FR Forms 4.3 0 9.6M 10.0 Free 0.0 Everyone Business September 29, 2016 1.1.5 4.0 and up 0.0

4882 rows × 13 columns

In [206]:
# google_Top5['Category'] = google_Top5.index
# google_Top5.drop(columns='Category', inplace=True)
Top5_table = google_Top5.pivot_table(index = "Category", columns = "Content Rating", values = "Rating")
Top5_table
Out[206]:
Content Rating Everyone Everyone 10+ Mature 17+ Teen Unrated
Category
BUSINESS 4.178427 4.300000 4.300000 4.300000 NaN
FAMILY 4.207587 4.241221 4.206000 4.167816 4.3
GAME 4.284539 4.346565 4.262162 4.273112 NaN
MEDICAL 4.206787 4.377778 4.442857 4.440000 NaN
TOOLS 4.077725 NaN 4.000000 4.500000 4.1
In [208]:
plt.title('Rating Variation with Category & Content')
sns.heatmap(Top5_table, annot=True, cmap='Blues')
Out[208]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5d896a820>

The heat map shows that:

1) Tools/Teen & Medical/Teen,Mature17,Everyone10+ are the top rated apps mix (4.5) among Google's top apps

2) Tools/Everyone,Unrated are the least rated apps mix (4.1) among Google's top apps

Notebooks AI
Notebooks AI Profile20060