
DS Project - Student Performance Analysis

Last updated: July 2nd, 2019

rmotr


Data Science Project

Student performance analysis

We'll try to understand how the parents' background, test preparation, and other factors influence students' performance.

purple-divider

Hands on!

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

green-divider

Loading our data:

This data set contains the marks obtained by high school students in the United States across several subjects.

In [ ]:
!head data/StudentsPerformance.csv
In [ ]:
df = pd.read_csv('data/StudentsPerformance.csv')

green-divider

The data at a glance:

In [ ]:
df.head()
In [ ]:
df.shape
In [ ]:
df.info()
In [ ]:
df.describe()

green-divider

What's the median of writing score?

  • Show the raw median value.
  • Show a histogram plot of the writing scores.
In [ ]:
# your code goes here
In [ ]:
df['writing score'].median()
In [ ]:
# your code goes here
In [ ]:
ax = df['writing score'].plot(kind='hist', figsize=(14,6))
ax.set_ylabel('Count of marks')
ax.set_xlabel('score')

green-divider

How is reading score distributed?

  • Show a box plot of the reading scores.
  • Show a density plot of the reading scores.
  • Add a red line on the mean.
  • Add a green line on the median.
In [ ]:
# your code goes here
In [ ]:
df['reading score'].plot(kind='box', vert=False, figsize=(14,6))
In [ ]:
# your code goes here
In [ ]:
ax = df['reading score'].plot(kind='density', figsize=(14,6))
ax.axvline(df['reading score'].mean(), color='red')
ax.axvline(df['reading score'].median(), color='green')

green-divider

What's the most common parental level of education?

  • Show the raw count of each education level.
  • Show a bar plot with all possible education levels.
In [ ]:
# your code goes here
In [ ]:
df['parental level of education'].value_counts()
In [ ]:
# your code goes here
In [ ]:
df['parental level of education'].value_counts().plot(kind='bar', figsize=(14,6))

green-divider

Analyze race/ethnicity of the students

  • Show the count of each value.
  • Show a pie plot with all the race/ethnicity values.
  • Show a bar plot with all the race/ethnicity values.
In [ ]:
# your code goes here
In [ ]:
df['race/ethnicity'].value_counts()
In [ ]:
# your code goes here
In [ ]:
df['race/ethnicity'].value_counts().plot(kind='pie', figsize=(6,6))
In [ ]:
# your code goes here
In [ ]:
df['race/ethnicity'].value_counts().plot(kind='bar', figsize=(14,6))

green-divider

 Show a correlation matrix between numerical columns

In [ ]:
# your code goes here
In [ ]:
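# Restrict to numeric columns: recent pandas versions raise an error
# when corr() is called on a frame that still contains text columns.
corr = df.select_dtypes(include='number').corr()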

fig = plt.figure(figsize=(6,6))
plt.matshow(corr, cmap='coolwarm', fignum=fig.number)

plt.xticks(range(len(corr.columns)), corr.columns, rotation='vertical');
plt.yticks(range(len(corr.columns)), corr.columns);
In [ ]:
# Styler.set_precision was removed in pandas 2.0; a format string works across versions.
corr.style.background_gradient(cmap='coolwarm').format('{:.2f}')

green-divider

 Relationship between columns

Can you find any significant relationship?

  • Show a scatter plot between reading score and writing score.
  • Show a scatter plot between reading score and math score.
In [ ]:
# your code goes here
In [ ]:
df.plot(kind='scatter', x='reading score', y='writing score', figsize=(6,6))
In [ ]:
# your code goes here
In [ ]:
df.plot(kind='scatter', x='reading score', y='math score', figsize=(6,6))
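
A scatter plot shows a relationship visually; a correlation coefficient puts a number on it. The sketch below uses a small hypothetical data frame (its values are made up for illustration and are not taken from StudentsPerformance.csv) and computes the Pearson correlation for both pairs of columns with `Series.corr`.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from StudentsPerformance.csv.
sample = pd.DataFrame({
    'reading score': [50, 60, 70, 80, 90],
    'writing score': [48, 62, 69, 83, 88],
    'math score':    [70, 55, 90, 60, 75],
})

# Series.corr uses the Pearson coefficient by default.
r_read_write = sample['reading score'].corr(sample['writing score'])
r_read_math = sample['reading score'].corr(sample['math score'])

print(f'reading vs writing: {r_read_write:.2f}')
print(f'reading vs math:    {r_read_math:.2f}')
```

On the real data, the same two calls on `df` would tell you which of the two scatter plots reflects the stronger linear relationship.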

green-divider

 Does the reading score vary depending on race/ethnicity?

Show a grouped box plot per race/ethnicity value with the reading score.

In [ ]:
# your code goes here
In [ ]:
ax = df[['reading score', 'race/ethnicity']].boxplot(by='race/ethnicity', figsize=(10,6))
ax.set_ylabel('reading score')
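
A numeric summary per group can complement the grouped box plot. This is a minimal sketch on a hypothetical five-row frame (the values are invented for illustration); it assumes the same column names as the data set and uses `groupby` with `agg`.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from StudentsPerformance.csv.
sample = pd.DataFrame({
    'race/ethnicity': ['group A', 'group A', 'group B', 'group B', 'group B'],
    'reading score': [60, 70, 80, 90, 100],
})

# One row per group, with the statistics a box plot summarizes visually.
summary = (
    sample.groupby('race/ethnicity')['reading score']
    .agg(['count', 'median', 'mean'])
)

print(summary)
```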

green-divider

 Does the writing score vary depending on parental level of education?

Show a grouped box plot per parental level of education with the writing score.

In [ ]:
# your code goes here
In [ ]:
df[['writing score', 'parental level of education']].boxplot(by='parental level of education', figsize=(14,6))

green-divider

Analyze the distribution of writing score

  • Calculate the mean of writing score.
  • Show a density (KDE) of writing score.
In [ ]:
# your code goes here
In [ ]:
df['writing score'].mean()
In [ ]:
# your code goes here
In [ ]:
ax = df['writing score'].plot(kind='density', figsize=(14,6))
ax.axvline(df['writing score'].mean(), color='red')

green-divider

Add and calculate a new writing_math_score column

To do that, use the mean of these two scores.

In [ ]:
# your code goes here
In [ ]:
df['writing_math_score'] = (df['writing score'] + df['math score']) / 2

df['writing_math_score'].head()

green-divider

Analyze the distribution of writing_math_score

  • Show a density (KDE) of writing_math_score, writing score and math score at the same time.
  • Show a histogram plot of the writing_math_score values.
In [ ]:
# your code goes here
In [ ]:
df['writing_math_score'].plot(kind='density', figsize=(14,6))
df['writing score'].plot(kind='density', figsize=(14,6))
df['math score'].plot(kind='density', figsize=(14,6))

plt.legend()
In [ ]:
# your code goes here
In [ ]:
df['writing_math_score'].plot(kind='hist', figsize=(14,6))

green-divider

 Add and calculate a new final score column

The values in this column should follow this formula:

$$ final\ score = 0.4 * writing\ score + 0.4 * reading\ score + 0.2 * math\ score $$
In [ ]:
# your code goes here
In [ ]:
df['final score'] = (df['writing score'] * 0.4) + (df['reading score'] * 0.4) + (df['math score'] * 0.2)

df['final score'].head()
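
To check the formula by hand, here is a small worked example on two hypothetical students (the scores are invented, not taken from the data set). It also sketches an alternative way to express the weighted sum, assuming the weights are kept in a Series named `weights` and applied with `DataFrame.dot`.

```python
import pandas as pd

# Hypothetical students, NOT taken from StudentsPerformance.csv.
sample = pd.DataFrame({
    'writing score': [80, 50],
    'reading score': [90, 60],
    'math score':    [70, 100],
})

# Weights from the formula above; they add up to 1.
weights = pd.Series({
    'writing score': 0.4,
    'reading score': 0.4,
    'math score':    0.2,
})

# Row-wise weighted sum: each column is multiplied by its weight and added up.
final = sample[weights.index].dot(weights)

# Student 0: 0.4*80 + 0.4*90 + 0.2*70  = 32 + 36 + 14 = 82
# Student 1: 0.4*50 + 0.4*60 + 0.2*100 = 20 + 24 + 20 = 64
print(final)
```

Keeping the weights in one place makes it easy to confirm that they sum to 1 and to change them later.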

green-divider

Analyze the distribution of final score

  • Calculate the mean of final score.
  • Show a histogram of final score.
  • Add a red line on the mean.
  • Add a green line on the median.
In [ ]:
# your code goes here
In [ ]:
df['final score'].mean()
In [ ]:
# your code goes here
In [ ]:
ax = df['final score'].plot(kind='hist', figsize=(14,6))
ax.axvline(df['final score'].mean(), color='red')
ax.axvline(df['final score'].median(), color='green')

green-divider

 Add and calculate a new grade column

The values in this column should follow these thresholds:

- A: final score >= 80
- B: final score >= 70
- C: final score >= 60
- D: final score >= 50
- E: final score >= 35
- F: final score < 35
In [ ]:
# your code goes here
In [ ]:
def get_grade(student):
    final_score = student['final score']
    
    if final_score >= 80:
        return 'A'
    if final_score >= 70:
        return 'B'
    if final_score >= 60:
        return 'C'
    if final_score >= 50:
        return 'D'
    if final_score >= 35:
        return 'E'
    else:
        return 'F'
    
df['grade'] = df.apply(get_grade, axis=1)

df['grade'].head()
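
As an alternative to a row-wise function, the same thresholds can be expressed as bins. This is a sketch, not the notebook's solution: it assumes `pd.cut` with `right=False`, so that each bin includes its lower edge (matching the `>=` rules above), and it runs on a hypothetical Series of scores chosen to sit on the boundaries.

```python
import numpy as np
import pandas as pd

# Hypothetical final scores chosen to exercise each boundary.
scores = pd.Series([95.0, 80.0, 79.9, 70.0, 65.0, 50.0, 35.0, 34.9],
                   name='final score')

# Bin edges in ascending order; right=False makes intervals [low, high).
bins = [-np.inf, 35, 50, 60, 70, 80, np.inf]
labels = ['F', 'E', 'D', 'C', 'B', 'A']

grades = pd.cut(scores, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'final score': scores, 'grade': grades}))
```

The result is an ordered categorical, so later bar plots or `value_counts()` calls can be sorted from F to A with `sort_index()`.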

green-divider

Analyze the distribution of grade

  • Show the count of each grade.
  • Show a pie plot with each grade value.
  • Show a bar plot with each grade value.
In [ ]:
# your code goes here
In [ ]:
df['grade'].value_counts()
In [ ]:
# your code goes here
In [ ]:
df['grade'].value_counts().plot(kind='pie', figsize=(6,6))
In [ ]:
# your code goes here
In [ ]:
df['grade'].value_counts().plot(kind='bar', figsize=(14,6))

green-divider

List students with the lowest reading score

In [ ]:
# your code goes here
In [ ]:
df.loc[df['reading score'] == df['reading score'].min()]

green-divider

List students with the highest reading score

In [ ]:
# your code goes here
In [ ]:
df.loc[df['reading score'] == df['reading score'].max()]

green-divider

How many students got a writing score lower than 30?

In [ ]:
# your code goes here
In [ ]:
df.loc[df['writing score'] < 30].shape[0]

green-divider

How many students got a final score higher than 95?

In [ ]:
# your code goes here
In [ ]:
df.loc[df['final score'] > 95].shape[0]

green-divider

How many females got each final grade value?

Show a bar plot with each grade value count.

In [ ]:
# your code goes here
In [ ]:
df.loc[df['gender'] == 'female', 'grade'].value_counts()
In [ ]:
df.loc[df['gender'] == 'female', 'grade'].value_counts().plot(kind='bar', figsize=(14,6))

green-divider

Get the mean reading score of students with high school parental level of education

In [ ]:
# your code goes here
In [ ]:
df.loc[df['parental level of education'] == 'high school', 'reading score'].mean()

green-divider

How many students belong to group C (race/ethnicity) or have "some college" as parental level of education?

In [ ]:
# your code goes here
In [ ]:
df.loc[(df['race/ethnicity'] == 'group C') | (df['parental level of education'] == 'some college')].shape[0]
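
When combining conditions with `|` (or) and `&` (and), it helps to name each boolean mask and count the `True` values directly. The sketch below uses a hypothetical four-row frame (invented values, same column names as the data set) to show that "or" counts rows matching either condition once, even when they match both.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from StudentsPerformance.csv.
sample = pd.DataFrame({
    'race/ethnicity': ['group C', 'group A', 'group C', 'group B'],
    'parental level of education': ['some college', 'some college',
                                    'high school', 'high school'],
})

in_group_c = sample['race/ethnicity'] == 'group C'
some_college = sample['parental level of education'] == 'some college'

# Summing a boolean Series counts its True values.
n_either = int((in_group_c | some_college).sum())
n_both = int((in_group_c & some_college).sum())

print(n_either, n_both)
```

Each comparison needs its own parentheses when written inline, because `|` and `&` bind more tightly than `==`.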

green-divider

Get the minimum reading score among students of group B (race/ethnicity) with standard lunch

In [ ]:
# your code goes here
In [ ]:
df.loc[(df['race/ethnicity'] == 'group B') & (df['lunch'] == 'standard'), 'reading score'].min()

green-divider

List students with the highest reading score and highest writing score at the same time

In [ ]:
# your code goes here
In [ ]:
df.loc[(df['reading score'] == df['reading score'].max()) & (df['writing score'] == df['writing score'].max())]

green-divider

How many students got a reading score above 80 or a writing score above 80?

In [ ]:
# your code goes here
In [ ]:
df.loc[(df['reading score'] > 80) | (df['writing score'] > 80)].shape[0]

green-divider

Show a histogram of the reading scores of group B (race/ethnicity) students with standard lunch

  • Go ahead and add an axvline on minimum and maximum values.
In [ ]:
# your code goes here
In [ ]:
df_selected = df.loc[(df['race/ethnicity'] == 'group B') & (df['lunch'] == 'standard'), 'reading score']

ax = df_selected.plot(kind='hist', figsize=(14,6))
ax.axvline(df_selected.min(), color='red')
ax.axvline(df_selected.max(), color='red')

purple-divider
