# DS Project - Student Performance Analysis

Last updated: July 2nd, 2019

# Student performance analysis

We'll try to understand how the parents' background, test preparation and other factors influence students' performance.

## Hands on!

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline


This data set contains the marks obtained by high school students in the United States across several subjects.

In [ ]:
!head data/StudentsPerformance.csv

In [ ]:
df = pd.read_csv('data/StudentsPerformance.csv')


## The data at a glance:

In [ ]:
df.head()

In [ ]:
df.shape

In [ ]:
df.info()

In [ ]:
df.describe()


### What's the median writing score?

• Show the raw median value.
• Show a histogram plot of the writing scores.
In [ ]:
# your code goes here

In [ ]:
df['writing score'].median()

In [ ]:
# your code goes here

In [ ]:
ax = df['writing score'].plot(kind='hist', figsize=(14,6))
ax.set_ylabel('Count of marks')
ax.set_xlabel('score')
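
The median and the mean can differ noticeably when a few scores sit far from the rest. Here is a minimal sketch of that difference, using made-up values (the `scores` Series is hypothetical and not taken from the data set):

```python
import pandas as pd

# Hypothetical scores, invented for illustration only
scores = pd.Series([10, 20, 30, 40, 100], name='writing score')

mean_score = scores.mean()      # sum of the values divided by their count
median_score = scores.median()  # middle value once the scores are sorted

print(mean_score, median_score)
```

The single high value pulls the mean up, while the median stays at the middle observation.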


### How is reading score distributed?

• Show a box plot of the reading scores.
• Show a density plot of the reading scores.
• Add a red line on the mean.
• Add a green line on the median.
In [ ]:
# your code goes here

In [ ]:
df['reading score'].plot(kind='box', vert=False, figsize=(14,6))

In [ ]:
# your code goes here

In [ ]:
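ax = df['reading score'].plot(kind='density', figsize=(14,6))
ax.axvline(df['reading score'].mean(), color='red')
ax.axvline(df['reading score'].median(), color='green')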


### What's the most common parental level of education?

• Show the raw count of each education level.
• Show a bar plot with all possible education levels.
In [ ]:
# your code goes here

In [ ]:
df['parental level of education'].value_counts()

In [ ]:
# your code goes here

In [ ]:
df['parental level of education'].value_counts().plot(kind='bar', figsize=(14,6))


### Analyze race/ethnicity of the students

• Show the count of each value.
• Show a pie plot with all the race/ethnicity values.
• Show a bar plot with all the race/ethnicity values.
In [ ]:
# your code goes here

In [ ]:
df['race/ethnicity'].value_counts()

In [ ]:
# your code goes here

In [ ]:
df['race/ethnicity'].value_counts().plot(kind='pie', figsize=(6,6))

In [ ]:
# your code goes here

In [ ]:
df['race/ethnicity'].value_counts().plot(kind='bar', figsize=(14,6))


### Show a correlation matrix between numerical columns

In [ ]:
# your code goes here

In [ ]:
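# Restrict to numeric columns; recent pandas versions no longer drop text columns automatically
corr = df.select_dtypes(include='number').corr()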

fig = plt.figure(figsize=(6,6))
plt.matshow(corr, cmap='coolwarm', fignum=fig.number)

plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical');
plt.yticks(range(len(corr.columns)), corr.columns);

In [ ]:
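# format() with a format string replaces set_precision(), which newer pandas versions no longer provide
corr.style.background_gradient(cmap='coolwarm').format('{:.2f}')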
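
To see what the correlation matrix reports, it can help to try it on a tiny frame where the relationships are known in advance. The sketch below uses a hypothetical `toy` DataFrame with invented values; only the column names mirror the data set.

```python
import pandas as pd

# Hypothetical values, invented for illustration only
toy = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'male'],
    'math score': [50, 60, 70, 80],
    'reading score': [55, 65, 75, 85],   # always math score + 5
    'writing score': [80, 70, 60, 50],   # falls as math score rises
})

# Keep only the numeric columns, then compute pairwise Pearson correlations
toy_corr = toy.select_dtypes(include='number').corr()
print(toy_corr.round(2))
```

Columns that move together in a perfectly linear way get a coefficient of 1, and columns that move in exactly opposite directions get -1.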


### Relationship between columns

Can you find any significant relationship?

• Show a scatter plot between reading score and writing score.
• Show a scatter plot between reading score and math score.
In [ ]:
# your code goes here

In [ ]:
df.plot(kind='scatter', x='reading score', y='writing score', figsize=(6,6))

In [ ]:
# your code goes here

In [ ]:
df.plot(kind='scatter', x='reading score', y='math score', figsize=(6,6))


### Does the reading score vary depending on race/ethnicity?

Show a grouped box plot per race/ethnicity value with the reading score.

In [ ]:
# your code goes here

In [ ]:
ax = df[['reading score', 'race/ethnicity']].boxplot(by='race/ethnicity', figsize=(10,6))
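
A grouped box plot draws the quartiles of each group, so the same comparison can be checked numerically. This is a minimal sketch on a hypothetical `toy` frame with invented values; the `summary` name is also an assumption for illustration.

```python
import pandas as pd

# Hypothetical values, invented for illustration only
toy = pd.DataFrame({
    'race/ethnicity': ['group A', 'group A', 'group A',
                       'group B', 'group B', 'group B'],
    'reading score': [50, 60, 70, 80, 90, 100],
})

# Quartiles per group: the box edges (0.25, 0.75) and the median line (0.5)
summary = (
    toy.groupby('race/ethnicity')['reading score']
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(summary)
```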


### Does the writing score vary depending on parental level of education?

Show a grouped box plot per parental level of education with the writing score.

In [ ]:
# your code goes here

In [ ]:
df[['writing score', 'parental level of education']].boxplot(by='parental level of education', figsize=(14,6))


### Analyze the distribution of writing score

• Calculate the mean of writing score.
• Show a density (KDE) of writing score.
In [ ]:
# your code goes here

In [ ]:
df['writing score'].mean()

In [ ]:
# your code goes here

In [ ]:
ax = df['writing score'].plot(kind='density', figsize=(14,6))
ax.axvline(df['writing score'].mean(), color='red')


### Add and calculate a new writing_math_score column

To do that, use the mean of these two scores.

In [ ]:
# your code goes here

In [ ]:
df['writing_math_score'] = (df['writing score'] + df['math score']) / 2
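
An equivalent way to average columns, which scales to more than two of them, is a row-wise `mean(axis=1)`. The sketch below runs on a hypothetical `toy` frame with invented values.

```python
import pandas as pd

# Hypothetical values, invented for illustration only
toy = pd.DataFrame({'writing score': [70, 85], 'math score': [60, 95]})

# axis=1 averages across the selected columns for each row
toy['writing_math_score'] = toy[['writing score', 'math score']].mean(axis=1)
print(toy)
```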



### Analyze the distribution of writing_math_score

• Show a density (KDE) of writing_math_score, writing score and math score at the same time.
• Show a histogram plot of the writing_math_score values.
In [ ]:
# your code goes here

In [ ]:
df['writing_math_score'].plot(kind='density', figsize=(14,6))
df['writing score'].plot(kind='density', figsize=(14,6))
df['math score'].plot(kind='density', figsize=(14,6))

plt.legend()

In [ ]:
# your code goes here

In [ ]:
df['writing_math_score'].plot(kind='hist', figsize=(14,6))


### Add and calculate a new final score column

The values in this column should follow this formula:

$$final\ score = 0.4 * writing\ score + 0.4 * reading\ score + 0.2 * math\ score$$
In [ ]:
# your code goes here

In [ ]:
df['final score'] = (df['writing score'] * 0.4) + (df['reading score'] * 0.4) + (df['math score'] * 0.2)
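
The formula is a weighted average whose weights add up to 1. As a sketch of the same calculation, the weights can be kept in a Series and applied with a dot product; the `weights` and `toy` names and all values below are assumptions made for illustration.

```python
import pandas as pd

# Weights taken from the formula above; they sum to 1
weights = pd.Series({'writing score': 0.4, 'reading score': 0.4, 'math score': 0.2})

# Hypothetical values, invented for illustration only
toy = pd.DataFrame({
    'writing score': [80, 50],
    'reading score': [90, 50],
    'math score': [70, 100],
})

# Row-wise weighted sum: 0.4 * writing + 0.4 * reading + 0.2 * math
toy['final score'] = toy[weights.index].dot(weights)
print(toy)
```

For the first row this works out to 32 + 36 + 14 = 82.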



### Analyze the distribution of final score

• Calculate the mean of final score.
• Show a histogram of final score.
• Add a red line on the mean.
• Add a green line on the median.
In [ ]:
# your code goes here

In [ ]:
df['final score'].mean()

In [ ]:
# your code goes here

In [ ]:
ax = df['final score'].plot(kind='hist', figsize=(14,6))
ax.axvline(df['final score'].mean(), color='red')
ax.axvline(df['final score'].median(), color='green')


### Add and calculate a new grade column

The values in this column should follow these thresholds (the first matching rule applies):

- A: final score >= 80
- B: final score >= 70
- C: final score >= 60
- D: final score >= 50
- E: final score >= 35
- F: final score < 35
In [ ]:
# your code goes here

In [ ]:
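def get_grade(student):
    final_score = student['final score']

    # Thresholds are checked from highest to lowest, so the first match wins
    if final_score >= 80:
        return 'A'
    if final_score >= 70:
        return 'B'
    if final_score >= 60:
        return 'C'
    if final_score >= 50:
        return 'D'
    if final_score >= 35:
        return 'E'
    return 'F'

# Apply the function row by row to create the new column
df['grade'] = df.apply(get_grade, axis=1)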
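
The same thresholds can also be expressed as bins, which avoids a row-by-row function. This is a sketch using `pd.cut` on a hypothetical `scores` Series with invented values chosen to sit on and around the boundaries; it is an alternative technique, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical final scores, invented to exercise the boundaries
scores = pd.Series([100, 80, 79.9, 70, 60, 50, 35, 34.9, 0], name='final score')

# right=False makes each bin include its lower edge: [35, 50), [50, 60), ...
bins = [-np.inf, 35, 50, 60, 70, 80, np.inf]
labels = ['F', 'E', 'D', 'C', 'B', 'A']
grades = pd.cut(scores, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'final score': scores, 'grade': grades}))
```

Setting `right=False` matters here: with the default, a score of exactly 80 would fall into the B bin instead of A.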



### Analyze the distribution of grade

• Show the count of each grade.
• Show a pie plot with each grade value.
• Show a bar plot with each grade value.
In [ ]:
# your code goes here

In [ ]:
df['grade'].value_counts()

In [ ]:
# your code goes here

In [ ]:
df['grade'].value_counts().plot(kind='pie', figsize=(6,6))

In [ ]:
# your code goes here

In [ ]:
df['grade'].value_counts().plot(kind='bar', figsize=(14,6))


### List students with the lowest reading score

In [ ]:
# your code goes here

In [ ]:
df.loc[df['reading score'] == df['reading score'].min()]


### List students with the highest reading score

In [ ]:
# your code goes here

In [ ]:
df.loc[df['reading score'] == df['reading score'].max()]
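
Comparing against `max()` (or `min()`) returns every student tied for that score, whereas `idxmax()` only gives the label of the first one. A small sketch with a hypothetical `toy` frame and invented index labels shows the difference:

```python
import pandas as pd

# Hypothetical values and index labels, invented for illustration only
toy = pd.DataFrame({'reading score': [100, 72, 100, 55]},
                   index=['s1', 's2', 's3', 's4'])

# Boolean mask: keeps every row that shares the top score
all_top = toy.loc[toy['reading score'] == toy['reading score'].max()]

# idxmax: returns only the first label where the maximum occurs
first_top = toy['reading score'].idxmax()

print(all_top)
print(first_top)
```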


### How many students got a writing score lower than 30?

In [ ]:
# your code goes here

In [ ]:
df.loc[df['writing score'] < 30].shape[0]


### How many students got a final score higher than 95?

In [ ]:
# your code goes here

In [ ]:
df.loc[df['final score'] > 95].shape[0]


### How many females got each final grade value?

Show a bar plot with each grade value count.

In [ ]:
# your code goes here

In [ ]:
df.loc[df['gender'] == 'female', 'grade'].value_counts()

In [ ]:
df.loc[df['gender'] == 'female', 'grade'].value_counts().plot(kind='bar', figsize=(14,6))
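
If you want the counts for every gender side by side rather than for females only, a cross-tabulation is one option. The sketch below uses `pd.crosstab` on a hypothetical `toy` frame with invented values.

```python
import pandas as pd

# Hypothetical values, invented for illustration only
toy = pd.DataFrame({
    'gender': ['female', 'female', 'male', 'female', 'male'],
    'grade':  ['A', 'B', 'A', 'A', 'C'],
})

# Rows are genders, columns are grades, cells are student counts
counts = pd.crosstab(toy['gender'], toy['grade'])
print(counts)

# The female row matches a filtered value_counts, with zeros for missing grades
print(counts.loc['female'])
```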


### Get the mean reading score of students whose parental level of education is high school

In [ ]:
# your code goes here

In [ ]:
df.loc[df['parental level of education'] == 'high school', 'reading score'].mean()


### How many students belong to group C (race/ethnicity) or have some college parental level of education?

In [ ]:
# your code goes here

In [ ]:
df.loc[(df['race/ethnicity'] == 'group C') | (df['parental level of education'] == 'some college')].shape[0]


### Get the minimum reading score among students of group B (race/ethnicity) with standard lunch

In [ ]:
# your code goes here

In [ ]:
df.loc[(df['race/ethnicity'] == 'group B') & (df['lunch'] == 'standard'), 'reading score'].min()


### List students with the highest reading score and highest writing score at the same time

In [ ]:
# your code goes here

In [ ]:
df.loc[(df['reading score'] == df['reading score'].max()) & (df['writing score'] == df['writing score'].max())]


### How many students scored more than 80 in reading or more than 80 in writing?

In [ ]:
# your code goes here

In [ ]:
df.loc[(df['reading score'] > 80) | (df['writing score'] > 80)].shape[0]
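
When combining conditions, `|` means "either" and `&` means "both", and each comparison needs its own parentheses. Naming the masks first can make mistakes easier to spot. Here is a minimal sketch on a hypothetical `toy` frame with invented values:

```python
import pandas as pd

# Hypothetical values, invented for illustration only
toy = pd.DataFrame({
    'reading score': [85, 70, 90, 60],
    'writing score': [75, 88, 95, 50],
})

high_reading = toy['reading score'] > 80
high_writing = toy['writing score'] > 80

# Summing a boolean mask counts the True values
either_count = (high_reading | high_writing).sum()
both_count = (high_reading & high_writing).sum()

print(either_count, both_count)
```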


### Show a histogram of the reading scores of students of group B (race/ethnicity) with standard lunch

• Go ahead and add an axvline on minimum and maximum values.
In [ ]:
# your code goes here

In [ ]:
df_selected = df.loc[(df['race/ethnicity'] == 'group B') & (df['lunch'] == 'standard'), 'reading score']

ax = df_selected.plot(kind='hist', figsize=(14,6))
ax.axvline(df_selected.min(), color='red')
ax.axvline(df_selected.max(), color='red')