Hands on!¶
In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
!head data/StudentsPerformance.csv
In [ ]:
df = pd.read_csv('data/StudentsPerformance.csv')
The data at a glance:¶
In [ ]:
df.head()
In [ ]:
df.shape
In [ ]:
df.info()
In [ ]:
df.describe()
What's the median of writing score?¶
- Show the raw median value.
- Show a histogram plot of the writing scores.
In [ ]:
# your code goes here
In [ ]:
df['writing score'].median()
In [ ]:
# your code goes here
In [ ]:
ax = df['writing score'].plot(kind='hist', figsize=(14,6))
ax.set_ylabel('Count of marks')
ax.set_xlabel('score')
How is reading score distributed?¶
- Show a box plot of the reading scores.
- Show a density plot of the reading scores.
- Add a red line on the mean.
- Add a green line on the median.
In [ ]:
# your code goes here
In [ ]:
df['reading score'].plot(kind='box', vert=False, figsize=(14,6))
In [ ]:
# your code goes here
In [ ]:
ax = df['reading score'].plot(kind='density', figsize=(14,6))
ax.axvline(df['reading score'].mean(), color='red')
ax.axvline(df['reading score'].median(), color='green')
What's the most common parental level of education?¶
- Show the raw count of each education level.
- Show a bar plot with all possible education levels.
In [ ]:
# your code goes here
In [ ]:
df['parental level of education'].value_counts()
In [ ]:
# your code goes here
In [ ]:
df['parental level of education'].value_counts().plot(kind='bar', figsize=(14,6))
Analyze race/ethnicity of the students¶
- Show the count of each value.
- Show a pie plot with all the race/ethnicity values.
- Show a bar plot with all the race/ethnicity values.
In [ ]:
# your code goes here
In [ ]:
df['race/ethnicity'].value_counts()
In [ ]:
# your code goes here
In [ ]:
df['race/ethnicity'].value_counts().plot(kind='pie', figsize=(6,6))
In [ ]:
# your code goes here
In [ ]:
df['race/ethnicity'].value_counts().plot(kind='bar', figsize=(14,6))
Show a correlation matrix between numerical columns¶
In [ ]:
# your code goes here
In [ ]:
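# Keep only numeric columns; recent pandas versions raise an error on text columns
corr = df.select_dtypes(include='number').corr()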
fig = plt.figure(figsize=(6,6))
plt.matshow(corr, cmap='coolwarm', fignum=fig.number)
plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical');
plt.yticks(range(len(corr.columns)), corr.columns);
In [ ]:
corr.style.background_gradient(cmap='coolwarm').format(precision=2)
Relationship between columns¶
Can you find any significant relationship?
- Show a scatter plot between reading score and writing score.
- Show a scatter plot between reading score and math score.
In [ ]:
# your code goes here
In [ ]:
df.plot(kind='scatter', x='reading score', y='writing score', figsize=(6,6))
In [ ]:
# your code goes here
In [ ]:
df.plot(kind='scatter', x='reading score', y='math score', figsize=(6,6))
Does the reading score vary depending on race/ethnicity?¶
Show a grouped box plot per race/ethnicity value with the reading score.
In [ ]:
# your code goes here
In [ ]:
ax = df[['reading score', 'race/ethnicity']].boxplot(by='race/ethnicity', figsize=(10,6))
ax.set_ylabel('reading score')
Does the writing score vary depending on parental level of education?¶
Show a grouped box plot per parental level of education with the writing score.
In [ ]:
# your code goes here
In [ ]:
df[['writing score', 'parental level of education']].boxplot(by='parental level of education', figsize=(14,6))
Analyze the distribution of writing score¶
- Calculate the mean of writing score.
- Show a density (KDE) of writing score.
In [ ]:
# your code goes here
In [ ]:
df['writing score'].mean()
In [ ]:
# your code goes here
In [ ]:
ax = df['writing score'].plot(kind='density', figsize=(14,6))
ax.axvline(df['writing score'].mean(), color='red')
In [ ]:
# your code goes here
In [ ]:
df['writing_math_score'] = (df['writing score'] + df['math score']) / 2
df['writing_math_score'].head()
Analyze the distribution of writing_math_score¶
- Show a density (KDE) of writing_math_score, writing score and math score at the same time.
- Show a histogram plot of writing_math_score.
In [ ]:
# your code goes here
In [ ]:
df['writing_math_score'].plot(kind='density', figsize=(14,6))
df['writing score'].plot(kind='density', figsize=(14,6))
df['math score'].plot(kind='density', figsize=(14,6))
plt.legend()
In [ ]:
# your code goes here
In [ ]:
df['writing_math_score'].plot(kind='hist', figsize=(14,6))
Add and calculate a new final score column¶
The values of this column should follow this formula:
$$ final\ score = 0.4 * writing\ score + 0.4 * reading\ score + 0.2 * math\ score $$
In [ ]:
# your code goes here
In [ ]:
df['final score'] = (df['writing score'] * 0.4) + (df['reading score'] * 0.4) + (df['math score'] * 0.2)
df['final score'].head()
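As a quick sanity check of the formula, the weighted sum can be worked on a tiny made-up frame. This is only a sketch: the three rows below are hypothetical values, not rows from StudentsPerformance.csv, and `toy` and `weights` are illustrative names.

```python
import pandas as pd

# Hypothetical scores, not taken from the real dataset.
toy = pd.DataFrame({
    'writing score': [80, 50, 100],
    'reading score': [70, 60, 100],
    'math score': [90, 40, 100],
})

# Weights from the formula above; they sum to 1.
weights = pd.Series({'writing score': 0.4, 'reading score': 0.4, 'math score': 0.2})

# Multiply each column by its weight (aligned on column names), then sum per row.
toy['final score'] = (toy[weights.index] * weights).sum(axis=1)
print(toy)
```

Because the weights sum to 1, a student with 100 in every subject ends up with a final score of 100, which is an easy check that no weight was mistyped.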
Analyze the distribution of final score¶
- Calculate the mean of final score.
- Show a histogram of final score.
- Add a red line on the mean.
- Add a green line on the median.
In [ ]:
# your code goes here
In [ ]:
df['final score'].mean()
In [ ]:
# your code goes here
In [ ]:
ax = df['final score'].plot(kind='hist', figsize=(14,6))
ax.axvline(df['final score'].mean(), color='red')
ax.axvline(df['final score'].median(), color='green')
Add and calculate a new grade column¶
The values of this column should follow these thresholds, checked from top to bottom:
- A: final score >= 80
- B: final score >= 70
- C: final score >= 60
- D: final score >= 50
- E: final score >= 35
- F: final score < 35
In [ ]:
# your code goes here
In [ ]:
def get_grade(student):
    # Thresholds are checked from highest to lowest; the first match wins
    final_score = student['final score']
    if final_score >= 80:
        return 'A'
    if final_score >= 70:
        return 'B'
    if final_score >= 60:
        return 'C'
    if final_score >= 50:
        return 'D'
    if final_score >= 35:
        return 'E'
    else:
        return 'F'

df['grade'] = df.apply(get_grade, axis=1)
df['grade'].head()
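A vectorized alternative to the row-wise `apply` is `pd.cut`, sketched here on a few hypothetical final scores (the values and the names `scores`, `bins`, `labels` and `grades` are illustrative only). With `right=False` each bin includes its lower edge, so a score of exactly 80 maps to A, matching the `>=` thresholds listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical final scores chosen to sit on and around the grade boundaries.
scores = pd.Series([92.0, 80.0, 79.9, 65.0, 50.0, 35.0, 34.9])

# Bin edges from lowest to highest; labels follow the same order.
bins = [-np.inf, 35, 50, 60, 70, 80, np.inf]
labels = ['F', 'E', 'D', 'C', 'B', 'A']

# right=False makes the intervals closed on the left: [35, 50), [50, 60), ...
grades = pd.cut(scores, bins=bins, labels=labels, right=False)
print(grades.tolist())
```

Note that `pd.cut` returns a categorical column, which keeps the grade order when plotting counts.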
Analyze the distribution of grade¶
- Show the count of each grade.
- Show a pie plot with each grade value.
- Show a bar plot with each grade value.
In [ ]:
# your code goes here
In [ ]:
df['grade'].value_counts()
In [ ]:
# your code goes here
In [ ]:
df['grade'].value_counts().plot(kind='pie', figsize=(6,6))
In [ ]:
# your code goes here
In [ ]:
df['grade'].value_counts().plot(kind='bar', figsize=(14,6))
List students with the lowest reading score¶
In [ ]:
# your code goes here
In [ ]:
df.loc[df['reading score'] == df['reading score'].min()]
List students with the highest reading score¶
In [ ]:
# your code goes here
In [ ]:
df.loc[df['reading score'] == df['reading score'].max()]
How many students got a writing score lower than 30?¶
In [ ]:
# your code goes here
In [ ]:
df.loc[df['writing score'] < 30].shape[0]
How many students got a final score higher than 95?¶
In [ ]:
# your code goes here
In [ ]:
df.loc[df['final score'] > 95].shape[0]
In [ ]:
# your code goes here
In [ ]:
df.loc[df['gender'] == 'female', 'grade'].value_counts()
In [ ]:
df.loc[df['gender'] == 'female', 'grade'].value_counts().plot(kind='bar', figsize=(14,6))
Get the mean reading score of students with high school parental level of education¶
In [ ]:
# your code goes here
In [ ]:
df.loc[df['parental level of education'] == 'high school', 'reading score'].mean()
How many students belong to group C (race/ethnicity) or have some college parental level of education?¶
In [ ]:
# your code goes here
In [ ]:
df.loc[(df['race/ethnicity'] == 'group C') | (df['parental level of education'] == 'some college')].shape[0]
Get the minimum reading score among students of group B (race/ethnicity) with standard lunch¶
In [ ]:
# your code goes here
In [ ]:
df.loc[(df['race/ethnicity'] == 'group B') & (df['lunch'] == 'standard'), 'reading score'].min()
List students with the highest reading score and highest writing score at the same time¶
In [ ]:
# your code goes here
In [ ]:
df.loc[(df['reading score'] == df['reading score'].max()) & (df['writing score'] == df['writing score'].max())]
How many students scored more than 80 in reading or more than 80 in writing?¶
In [ ]:
# your code goes here
In [ ]:
df.loc[(df['reading score'] > 80) | (df['writing score'] > 80)].shape[0]
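A question like this one combines two boolean masks with `|`, and a student who clears 80 on both scores is still counted only once. A minimal sketch on made-up rows (the `toy` frame and its values are hypothetical, not from the dataset):

```python
import pandas as pd

# Hypothetical scores for four students.
toy = pd.DataFrame({
    'reading score': [85, 90, 70, 60],
    'writing score': [82, 75, 88, 65],
})

high_reading = toy['reading score'] > 80   # True, True, False, False
high_writing = toy['writing score'] > 80   # True, False, True, False

# Element-wise OR: True where at least one condition holds.
either = high_reading | high_writing

# Summing a boolean Series counts its True values.
print(either.sum())
```

Summing the mask gives the same count as filtering with `.loc` and reading `.shape[0]`, without building the filtered frame.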
Show a histogram of the reading score of students of group B (race/ethnicity) with standard lunch¶
- Go ahead and add an axvline on the minimum and maximum values.
In [ ]:
# your code goes here
In [ ]:
df_selected = df.loc[(df['race/ethnicity'] == 'group B') & (df['lunch'] == 'standard'), 'reading score']
ax = df_selected.plot(kind='hist', figsize=(14,6))
ax.axvline(df_selected.min(), color='red')
ax.axvline(df_selected.max(), color='red')