
Data Scientist @ RMOTR

StackOverflow Survey Bootcamp Respondents

Last updated: May 15th, 2019


StackOverflow Developer Survey 2018

Analyzing Bootcamp respondents

Each year, StackOverflow asks the developer community about everything from their favorite technologies to their job preferences, then publishes the survey results from over 100,000 developers.

We'll analyze bootcamp respondents and how they compare with non-bootcamp respondents.


In [1]:
!pip install geopandas descartes mapclassify
Requirement already satisfied: geopandas in /usr/local/lib/python3.6/site-packages (0.4.1)
Requirement already satisfied: descartes in /usr/local/lib/python3.6/site-packages (1.1.0)
Requirement already satisfied: mapclassify in /usr/local/lib/python3.6/site-packages (2.0.1)
Requirement already satisfied: pandas in /usr/local/lib/python3.6/site-packages (from geopandas) (0.23.0)
Requirement already satisfied: pyproj in /usr/local/lib/python3.6/site-packages (from geopandas) (2.1.3)
Requirement already satisfied: fiona in /usr/local/lib/python3.6/site-packages (from geopandas) (1.8.6)
Requirement already satisfied: shapely in /usr/local/lib/python3.6/site-packages (from geopandas) (1.6.4.post2)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.6/site-packages (from descartes) (2.2.3)
Requirement already satisfied: numpy>=1.3 in /usr/local/lib/python3.6/site-packages (from mapclassify) (1.14.5)
Requirement already satisfied: scipy>=0.11 in /usr/local/lib/python3.6/site-packages (from mapclassify) (1.1.0)
Requirement already satisfied: pytz>=2011k in /usr/local/lib/python3.6/site-packages (from pandas->geopandas) (2018.9)
Requirement already satisfied: python-dateutil>=2.5.0 in /usr/local/lib/python3.6/site-packages (from pandas->geopandas) (2.7.5)
Requirement already satisfied: munch in /usr/local/lib/python3.6/site-packages (from fiona->geopandas) (2.3.2)
Requirement already satisfied: click<8,>=4.0 in /usr/local/lib/python3.6/site-packages (from fiona->geopandas) (7.0)
Requirement already satisfied: cligj>=0.5 in /usr/local/lib/python3.6/site-packages (from fiona->geopandas) (0.5.0)
Requirement already satisfied: click-plugins>=1.0 in /usr/local/lib/python3.6/site-packages (from fiona->geopandas) (1.1.1)
Requirement already satisfied: attrs>=17 in /usr/local/lib/python3.6/site-packages (from fiona->geopandas) (18.2.0)
Requirement already satisfied: six>=1.7 in /usr/local/lib/python3.6/site-packages (from fiona->geopandas) (1.12.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/site-packages (from matplotlib->descartes) (1.0.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/site-packages (from matplotlib->descartes) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/site-packages (from matplotlib->descartes) (2.3.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib->descartes) (40.7.3)
You are using pip version 19.0.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
from shapely.geometry import Point, MultiPoint, Polygon

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)
pd.set_option('display.float_format', '{:,.3f}'.format)

%matplotlib inline
In [3]:
flatui = ["#2e86de", "#ff4757", "#feca57", "#2ed573", "#ff7f50", "#00cec9", "#fd79a8", "#a4b0be"]
flatui_palette = sns.color_palette(flatui)
sns.palplot(flatui_palette)
sns.set_palette(flatui_palette)

sns.set_style("darkgrid", {
    'axes.edgecolor': '#2b2b2b',
    'axes.facecolor': '#2b2b2b',
    'axes.labelcolor': '#919191',
    'figure.facecolor': '#2b2b2b',
    'grid.color': '#545454',
    'patch.edgecolor': '#2b2b2b',
    'text.color': '#bababa',
    'xtick.color': '#bababa',
    'ytick.color': '#bababa',

})


Import data

In [ ]:
!mkdir data && cd data/ && wget --no-check-certificate -r 'https://docs.google.com/uc?export=download&id=19AUNctJbJ2CFWHmxZIULb2qOph8iNpk3' -O survey.zip
!cd data/ && unzip survey.zip
In [4]:
df = pd.read_csv('data/survey_results_public.csv')
/usr/local/lib/python3.6/site-packages/IPython/core/interactiveshell.py:3020: DtypeWarning: Columns (8,12,13,14,15,16,50,51,52,53,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98855 entries, 0 to 98854
Columns: 129 entries, Respondent to SurveyEasy
dtypes: float64(41), int64(1), object(87)
memory usage: 97.3+ MB
In [6]:
df.head(1)
Out[6]:
Respondent Hobby OpenSource Country Student Employment FormalEducation UndergradMajor CompanySize DevType YearsCoding YearsCodingProf JobSatisfaction CareerSatisfaction HopeFiveYears JobSearchStatus LastNewJob AssessJob1 AssessJob2 AssessJob3 AssessJob4 AssessJob5 AssessJob6 AssessJob7 AssessJob8 AssessJob9 AssessJob10 AssessBenefits1 AssessBenefits2 AssessBenefits3 AssessBenefits4 AssessBenefits5 AssessBenefits6 AssessBenefits7 AssessBenefits8 AssessBenefits9 AssessBenefits10 AssessBenefits11 JobContactPriorities1 JobContactPriorities2 JobContactPriorities3 JobContactPriorities4 JobContactPriorities5 JobEmailPriorities1 JobEmailPriorities2 JobEmailPriorities3 JobEmailPriorities4 JobEmailPriorities5 JobEmailPriorities6 JobEmailPriorities7 UpdateCV Currency Salary SalaryType ConvertedSalary CurrencySymbol CommunicationTools TimeFullyProductive EducationTypes SelfTaughtTypes TimeAfterBootcamp HackathonReasons AgreeDisagree1 AgreeDisagree2 AgreeDisagree3 LanguageWorkedWith LanguageDesireNextYear DatabaseWorkedWith DatabaseDesireNextYear PlatformWorkedWith PlatformDesireNextYear FrameworkWorkedWith FrameworkDesireNextYear IDE OperatingSystem NumberMonitors Methodology VersionControl CheckInCode AdBlocker AdBlockerDisable AdBlockerReasons AdsAgreeDisagree1 AdsAgreeDisagree2 AdsAgreeDisagree3 AdsActions AdsPriorities1 AdsPriorities2 AdsPriorities3 AdsPriorities4 AdsPriorities5 AdsPriorities6 AdsPriorities7 AIDangerous AIInteresting AIResponsible AIFuture EthicsChoice EthicsReport EthicsResponsible EthicalImplications StackOverflowRecommend StackOverflowVisit StackOverflowHasAccount StackOverflowParticipate StackOverflowJobs StackOverflowDevStory StackOverflowJobsRecommend StackOverflowConsiderMember HypotheticalTools1 HypotheticalTools2 HypotheticalTools3 HypotheticalTools4 HypotheticalTools5 WakeTime HoursComputer HoursOutside SkipMeals ErgonomicDevices Exercise Gender SexualOrientation EducationParents RaceEthnicity Age Dependents MilitaryUS SurveyTooLong SurveyEasy
0 1 Yes No Kenya No Employed part-time Bachelor’s degree (BA, BS, B.Eng., etc.) Mathematics or statistics 20 to 99 employees Full-stack developer 3-5 years 3-5 years Extremely satisfied Extremely satisfied Working as a founder or co-founder of my own company I’m not actively looking, but I am open to new opportunities Less than a year ago 10.000 7.000 8.000 1.000 2.000 5.000 3.000 4.000 9.000 6.000 nan nan nan nan nan nan nan nan nan nan nan 3.000 1.000 4.000 2.000 5.000 5.000 6.000 7.000 2.000 1.000 4.000 3.000 My job status or other personal status changed NaN NaN Monthly nan KES Slack One to three months Taught yourself a new language, framework, or tool without taking a formal course;Participated in a hackathon The official documentation and/or standards for the technology;A book or e-book from O’Reilly, Apress, or a similar publisher;Questions & answers on Stack Overflow;Online developer communities other than Stack Overflow (ex. forums, listservs, IRC channels, etc.) NaN To build my professional network Strongly agree Strongly agree Neither Agree nor Disagree JavaScript;Python;HTML;CSS JavaScript;Python;HTML;CSS Redis;SQL Server;MySQL;PostgreSQL;Amazon RDS/Aurora;Microsoft Azure (Tables, CosmosDB, SQL, etc) Redis;SQL Server;MySQL;PostgreSQL;Amazon RDS/Aurora;Microsoft Azure (Tables, CosmosDB, SQL, etc) AWS;Azure;Linux;Firebase AWS;Azure;Linux;Firebase Django;React Django;React Komodo;Vim;Visual Studio Code Linux-based 1 Agile;Scrum Git Multiple times per day Yes No NaN Strongly agree Strongly agree Strongly agree Saw an online advertisement and then researched it (without clicking on the ad);Stopped going to a website because of their advertising 1.000 5.000 4.000 7.000 2.000 6.000 3.000 Artificial intelligence surpassing human intelligence ("the singularity") Algorithms making important decisions The developers or the people creating the AI I'm excited about the possibilities more than worried about the dangers. 
No Yes, and publicly Upper management at the company/organization Yes 10 (Very Likely) Multiple times per day Yes I have never participated in Q&A on Stack Overflow No, I knew that Stack Overflow had a jobs board but have never used or visited it Yes NaN Yes Extremely interested Extremely interested Extremely interested Extremely interested Extremely interested Between 5:00 - 6:00 AM 9 - 12 hours 1 - 2 hours Never Standing desk 3 - 4 times per week Male Straight or heterosexual Bachelor’s degree (BA, BS, B.Eng., etc.) Black or of African descent 25 - 34 years old Yes NaN The survey was an appropriate length Very easy


Country analysis

The first thing we'll do is analyze where the respondents are from. To do that, we'll use the GeoPandas library.

In [7]:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

world.columns = ['pop_est', 'continent', 'Country', 'iso_a3', 'gdp_md_est', 'geometry']
In [8]:
respondents = df.groupby('Country')['Country'].count()

world = world.join(respondents, on='Country', rsuffix='respondents')
world['Countryrespondents'] = world['Countryrespondents'].fillna(0)

world['Countryrespondents'] = world['Countryrespondents'] / world['Countryrespondents'].sum() * 100
In [9]:
world.plot(column='Countryrespondents',
           cmap='coolwarm',
           scheme='quantiles',
           legend=True,
           figsize=(16,8))

plt.title("Respondents per Country (%)", fontsize=16, fontweight='bold', color='white')
Out[9]:
Text(0.5,1,'Respondents per Country (%)')
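
A note on the normalization used for the map: `world.join(...)` keeps only the countries present in `world`, so respondents whose country name has no exact match there drop out of the denominator. The mapped values are therefore shares of matched respondents, not of all respondents. The sketch below illustrates the difference on made-up counts. The country labels and numbers are hypothetical, and only pandas is assumed.

```python
import pandas as pd

# Hypothetical stand-in for the `world` table: only the name column matters here.
world = pd.DataFrame({"Country": ["Kenya", "Canada", "Fiji"]})

# Hypothetical respondent counts; "Other Country" has no match in `world`.
counts = pd.Series({"Kenya": 10, "Canada": 30, "Other Country": 60}, name="respondents")

# Left join on the country name, as in the notebook; unmatched map rows get 0.
joined = world.join(counts, on="Country")
joined["respondents"] = joined["respondents"].fillna(0)

# Share of matched respondents (what the notebook's normalization computes).
share_of_matched = joined["respondents"] / joined["respondents"].sum() * 100

# Share of all respondents, including those whose country did not match.
share_of_all = joined["respondents"] / counts.sum() * 100

print(pd.DataFrame({"Country": joined["Country"],
                    "of matched (%)": share_of_matched.round(1),
                    "of all (%)": share_of_all.round(1)}))
```

If the two columns differ noticeably on the real data, it may be worth checking which survey country names are missing from `world`.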


Bootcamp analysis

We'll focus our analysis on people who attended a bootcamp.

Only ~7% of respondents attended a bootcamp.

In [10]:
df['EducationTypes'].str.split(';', expand=True).stack().unique()
Out[10]:
array(['Taught yourself a new language, framework, or tool without taking a formal course',
       'Participated in a hackathon',
       'Contributed to open source software',
       'Completed an industry certification program (e.g. MCPD)',
       'Taken a part-time in-person course in programming or software development',
       'Received on-the-job training in software development',
       'Participated in online coding competitions (e.g. HackerRank, CodeChef, TopCoder)',
       'Taken an online course in programming or software development (e.g. a MOOC)',
       'Participated in a full-time developer training program or bootcamp'],
      dtype=object)
In [11]:
df['Took Bootcamp?'] = df['EducationTypes'].str.contains('Participated in a full-time developer training program or bootcamp').fillna(False)
In [12]:
just_bootcamp = df.loc[df['Took Bootcamp?']]
In [13]:
df.shape
Out[13]:
(98855, 130)
In [14]:
df.loc[df['Took Bootcamp?']].shape
Out[14]:
(6987, 130)
In [15]:
df.loc[~df['Took Bootcamp?']].shape
Out[15]:
(91868, 130)
In [16]:
91868/98855
Out[16]:
0.9293207222699914
In [17]:
bootcamp = pd.Series(df.loc[df['Took Bootcamp?'], 'Took Bootcamp?'].count() / df.shape[0] * 100)
no_bootcamp = pd.Series(df.loc[~df['Took Bootcamp?'], 'Took Bootcamp?'].count() / df.shape[0] * 100)

bootcamp_df = pd.concat([bootcamp, no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1)
bootcamp_df
Out[17]:
Took Bootcamp No took Bootcamp
0 7.068 92.932
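
As a cross-check, the share can also be read directly from the boolean flag, since the mean of a boolean column is the fraction of `True` values. The sketch below uses a small made-up `education` Series (hypothetical data) and then repeats the arithmetic with the row counts reported above (6987 of 98855).

```python
import pandas as pd

BOOTCAMP = "Participated in a full-time developer training program or bootcamp"

# Hypothetical answers in the same semicolon-separated format as EducationTypes.
education = pd.Series([
    "Participated in a hackathon;" + BOOTCAMP,
    "Contributed to open source software",
    None,  # respondent skipped the question
    BOOTCAMP,
])

# na=False treats missing answers as "no bootcamp"; regex=False matches the text literally.
took_bootcamp = education.str.contains(BOOTCAMP, na=False, regex=False)

# The mean of a boolean Series is the fraction of True values.
toy_share = took_bootcamp.mean() * 100

# Same arithmetic with the counts reported by the shape cells above.
reported_share = 6987 / 98855 * 100

print(round(toy_share, 1), round(reported_share, 3))
```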
In [18]:
fig, ax = plt.subplots(figsize=(16, 6))

bootcamp_df.plot(kind="bar", ax=ax)
ax.legend(["Bootcamp - YES", "Bootcamp - NO"])

plt.axhline(bootcamp_df['Took Bootcamp'][0], linestyle=':', color=flatui_palette[0])
plt.text(-0.5, bootcamp_df['Took Bootcamp'][0]+1, round(bootcamp_df['Took Bootcamp'][0], 2), color=flatui_palette[0])

plt.axhline(bootcamp_df['No took Bootcamp'][0], linestyle=':', color=flatui_palette[1])
plt.text(-0.5, bootcamp_df['No took Bootcamp'][0]+1, round(bootcamp_df['No took Bootcamp'][0], 2), color=flatui_palette[1])

plt.xticks(rotation=0)
plt.title("Respondents that attended to a Bootcamp (%)", fontsize=16, fontweight='bold', color='white')
Out[18]:
Text(0.5,1,'Respondents that attended to a Bootcamp (%)')


Time after Bootcamp to get a Developer job

We'll compute the distribution of how long respondents took to get a developer job after attending a bootcamp.

We see that ~45% already had a full-time developer job when they began the program; ~43% got a developer job within a year and another ~3% took longer than a year (~46% in total); and only ~9% haven't gotten a developer job.

In [19]:
arr = just_bootcamp['TimeAfterBootcamp'].dropna().value_counts()
In [20]:
arr
Out[20]:
I already had a full-time job as a developer when I began the program    3025
Immediately after graduating                                             1085
One to three months                                                       668
I haven’t gotten a developer job                                          581
Less than a month                                                         496
Four to six months                                                        347
Six months to a year                                                      239
Longer than a year                                                        211
Name: TimeAfterBootcamp, dtype: int64
In [21]:
arr = arr.drop('I already had a full-time job as a developer when I began the program')
In [22]:
arr.loc[arr.index != 'I haven’t gotten a developer job'].sum()
Out[22]:
3046
In [23]:
arr.loc[arr.index != 'I haven’t gotten a developer job'].sum() / arr.sum()
Out[23]:
0.8398125172318721
In [24]:
time_index = ['I already had a full-time job as a developer when I began the program',
              'Immediately after graduating', 'Less than a month',
              'One to three months', 'Four to six months',
              'Six months to a year', 'Longer than a year', 'I haven’t gotten a developer job']

time_index_short = ['I already had a full-time', 'Immediately after graduating',
                    '<1 month', '1-3 months', '4-6 months', '6-12 months', '>12 months',
                    'I haven’t gotten a developer job']

time_after_bootcamp = just_bootcamp['TimeAfterBootcamp'].dropna().value_counts() / just_bootcamp['TimeAfterBootcamp'].dropna().shape[0] * 100
time_after_bootcamp
Out[24]:
I already had a full-time job as a developer when I began the program   45.475
Immediately after graduating                                            16.311
One to three months                                                     10.042
I haven’t gotten a developer job                                         8.734
Less than a month                                                        7.456
Four to six months                                                       5.216
Six months to a year                                                     3.593
Longer than a year                                                       3.172
Name: TimeAfterBootcamp, dtype: float64
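
The headline figures for this section can be reproduced from the raw counts in `Out[20]` with plain arithmetic. The sketch below hard-codes those counts (with shortened labels, which are our own) and groups them into four buckets.

```python
# Counts copied from Out[20]; the short labels are our own abbreviations.
counts = {
    "already had a developer job": 3025,
    "immediately after graduating": 1085,
    "one to three months": 668,
    "no developer job yet": 581,
    "less than a month": 496,
    "four to six months": 347,
    "six months to a year": 239,
    "longer than a year": 211,
}

total = sum(counts.values())
already = counts["already had a developer job"]
no_job = counts["no developer job yet"]
over_a_year = counts["longer than a year"]
within_a_year = total - already - no_job - over_a_year

def pct(n, base=total):
    """Percentage of `base`, rounded to one decimal."""
    return round(n / base * 100, 1)

print("already employed:", pct(already))
print("within a year:   ", pct(within_a_year))
print("over a year:     ", pct(over_a_year))
print("no job yet:      ", pct(no_job))

# Among those who did not already have a developer job, the share who found one.
found = within_a_year + over_a_year
print("found a job (excluding already employed):", pct(found, total - already))
```

The last figure matches the ~84% computed in `Out[23]`.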
In [25]:
fig, ax = plt.subplots(figsize=(16, 6))

time_after_bootcamp[time_index].plot(kind="bar", ax=ax)
plt.xticks(np.arange(len(time_index)), time_index_short, rotation=30)
plt.title("Time after Bootcamp to get a Developer job (%)", fontsize=16, fontweight='bold', color='white')
Out[25]:
Text(0.5,1,'Time after Bootcamp to get a Developer job (%)')


Programming as a Hobby

In [26]:
hobby_bootcamp = df.loc[df['Took Bootcamp?'], 'Hobby'].value_counts() / df[df['Took Bootcamp?']].shape[0] * 100
hobby_no_bootcamp = df.loc[~(df['Took Bootcamp?']), 'Hobby'].value_counts() / df[~(df['Took Bootcamp?'])].shape[0] * 100

hobby_df = pd.concat([hobby_bootcamp, hobby_no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1)
hobby_df
Out[26]:
Took Bootcamp No took Bootcamp
Yes 81.408 80.778
No 18.592 19.222
In [27]:
fig, ax = plt.subplots(figsize=(16, 6))

hobby_df.plot(kind="bar", ax=ax)

plt.xticks(rotation=0)
plt.title("Programming as a Hobby (%)", fontsize=16, fontweight='bold', color='white')
Out[27]:
Text(0.5,1,'Programming as a Hobby (%)')


Contributing to Open Source projects

In [28]:
os_bootcamp = df.loc[df['Took Bootcamp?'], 'OpenSource'].value_counts() / df[df['Took Bootcamp?']].shape[0] * 100
os_no_bootcamp = df.loc[~(df['Took Bootcamp?']), 'OpenSource'].value_counts() / df[~(df['Took Bootcamp?'])].shape[0] * 100

os_df = pd.concat([os_bootcamp, os_no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1, sort=False)
os_df
Out[28]:
Took Bootcamp No took Bootcamp
No 54.287 56.577
Yes 45.713 43.423
In [29]:
fig, ax = plt.subplots(figsize=(16, 6))

os_df.plot(kind="bar", ax=ax)

plt.xticks(rotation=0)
plt.title("Contributing to Open Source projects (%)", fontsize=16, fontweight='bold', color='white')
Out[29]:
Text(0.5,1,'Contributing to Open Source projects (%)')


Employment analysis

We see here that people who work full-time represent ~76% of bootcamp attendees.

In [30]:
employment_bootcamp = df.loc[df['Took Bootcamp?'], 'Employment'].value_counts() / df[df['Took Bootcamp?']].shape[0] * 100
employment_no_bootcamp = df.loc[~(df['Took Bootcamp?']), 'Employment'].value_counts() / df[~(df['Took Bootcamp?'])].shape[0] * 100

employment_df = pd.concat([employment_bootcamp, employment_no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1)
employment_df
Out[30]:
Took Bootcamp No took Bootcamp
Employed full-time 76.170 70.942
Independent contractor, freelancer, or self-employed 9.904 9.350
Not employed, but looking for work 6.970 5.789
Employed part-time 3.592 5.583
Not employed, and not looking for work 2.032 4.343
Retired 0.343 0.221
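
Note that the columns above do not add up to 100% (the "Took Bootcamp" column sums to about 99.0%). That is because `value_counts()` skips missing answers while `shape[0]` counts every row, so respondents who left `Employment` blank stay in the denominator. The sketch below shows both conventions on a made-up Series (hypothetical data); which one is preferable depends on whether non-answers should count.

```python
import pandas as pd

# Hypothetical answers; None stands for a respondent who skipped the question.
employment = pd.Series(["Employed full-time", "Employed full-time", "Retired", None])

# Convention used in the notebook: divide by all rows, missing answers included.
share_of_all_rows = employment.value_counts() / employment.shape[0] * 100

# Alternative: divide by the number of respondents who answered.
share_of_answers = employment.value_counts(normalize=True) * 100

print(share_of_all_rows.round(1))   # sums to less than 100
print(share_of_answers.round(1))    # sums to 100
```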
In [31]:
fig, ax = plt.subplots(figsize=(16, 6))

employment_df.plot(kind="bar", ax=ax)
ax.legend(["Bootcamp - YES", "Bootcamp - NO"])
plt.xticks(rotation=15)
plt.title("Employment type (%)", fontsize=16, fontweight='bold', color='white')
Out[31]:
Text(0.5,1,'Employment type (%)')


Age distribution

We see that respondents aged 35 or older are overrepresented among bootcamp attendees, perhaps because they are looking to get a new job or advance their careers.

In [32]:
age_short = ['Under 18 years old', '18 - 24 years old', '25 - 34 years old',
             '35 - 44 years old', '45 - 54 years old', '55 - 64 years old', '65 years or older']

age_bootcamp = df.loc[df['Took Bootcamp?'], 'Age'].dropna().value_counts()
age_no_bootcamp = df.loc[~(df['Took Bootcamp?']), 'Age'].dropna().value_counts()

age_df = pd.concat([age_bootcamp, age_no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1, sort=False).loc[age_short]
age_df
Out[32]:
Took Bootcamp No took Bootcamp
Under 18 years old 79 1559
18 - 24 years old 1272 13977
25 - 34 years old 2887 28872
35 - 44 years old 1239 10238
45 - 54 years old 488 2825
55 - 64 years old 176 783
65 years or older 44 135
In [33]:
fig, ax = plt.subplots(figsize=(16, 6))

age_df[['Took Bootcamp', 'No took Bootcamp']].plot(kind="bar", ax=ax)
ax.legend(["Bootcamp - YES", "Bootcamp - NO"])
plt.xticks(rotation=0)

plt.title("Age distribution (count)", fontsize=16, fontweight='bold', color='white')
Out[33]:
Text(0.5,1,'Age distribution (count)')
In [34]:
age_short = ['Under 18 years old', '18 - 24 years old', '25 - 34 years old',
             '35 - 44 years old', '45 - 54 years old', '55 - 64 years old', '65 years or older']

age_bootcamp = df.loc[df['Took Bootcamp?'], 'Age'].dropna().value_counts() / df.loc[df['Took Bootcamp?'], 'Age'].dropna().shape[0] * 100
age_no_bootcamp = df.loc[~(df['Took Bootcamp?']), 'Age'].dropna().value_counts() / df.loc[~(df['Took Bootcamp?']), 'Age'].dropna().shape[0] * 100

age_df = pd.concat([age_bootcamp, age_no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1, sort=False).loc[age_short]
age_df
Out[34]:
Took Bootcamp No took Bootcamp
Under 18 years old 1.277 2.670
18 - 24 years old 20.566 23.938
25 - 34 years old 46.677 49.448
35 - 44 years old 20.032 17.534
45 - 54 years old 7.890 4.838
55 - 64 years old 2.846 1.341
65 years or older 0.711 0.231
In [35]:
fig, ax = plt.subplots(figsize=(16, 6))

age_df.plot(kind="bar", ax=ax)
ax.legend(["Bootcamp - YES", "Bootcamp - NO"])
plt.xticks(rotation=0)

plt.title("Age distribution (%)", fontsize=16, fontweight='bold', color='white')
Out[35]:
Text(0.5,1,'Age distribution (%)')
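
To make the comparison explicit, we can divide each age group's share among bootcamp attendees by its share among everyone else. A ratio above 1 means the group is overrepresented among attendees. The sketch below hard-codes the counts from `Out[32]`; the ratio itself is our own illustrative measure, not something computed in the cells above.

```python
# (took bootcamp, did not take bootcamp) counts copied from Out[32].
age_counts = {
    "Under 18 years old": (79, 1559),
    "18 - 24 years old": (1272, 13977),
    "25 - 34 years old": (2887, 28872),
    "35 - 44 years old": (1239, 10238),
    "45 - 54 years old": (488, 2825),
    "55 - 64 years old": (176, 783),
    "65 years or older": (44, 135),
}

bootcamp_total = sum(took for took, _ in age_counts.values())
other_total = sum(other for _, other in age_counts.values())

# Share among attendees divided by share among non-attendees.
ratio = {
    age: (took / bootcamp_total) / (other / other_total)
    for age, (took, other) in age_counts.items()
}

overrepresented = [age for age, r in ratio.items() if r > 1]

for age, r in ratio.items():
    print(f"{age:<20} {r:.2f}")
print("Overrepresented:", overrepresented)
```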


Salary

The worldwide median salary is $59,729 for people who attended a bootcamp and $55,075 for people who didn't.

In [36]:
fig, ax = plt.subplots(figsize=(16, 6))
ax.set_xlim(0, 300_000)

df_salaries = df.dropna(subset=['ConvertedSalary'])
salary_no_bootcamp = df_salaries.loc[~df_salaries['Took Bootcamp?'], 'ConvertedSalary']
salary_bootcamp = df_salaries.loc[df_salaries['Took Bootcamp?'], 'ConvertedSalary']

sns.distplot(salary_bootcamp.dropna(), bins=250, label='Took Bootcamp')
sns.distplot(salary_no_bootcamp.dropna(), bins=250, label='No took Bootcamp')
ax.legend(["Bootcamp - YES", "Bootcamp - NO"])

ax.axvline(salary_bootcamp.median(), linestyle=':', color=flatui_palette[0])
ax.text(salary_bootcamp.median(), 0, 'Median {}'.format(int(salary_bootcamp.median())), rotation=90, color=flatui_palette[0])

ax.axvline(salary_no_bootcamp.median(), linestyle=':', color=flatui_palette[1])
ax.text(salary_no_bootcamp.median(), 0, 'Median {}'.format(int(salary_no_bootcamp.median())), rotation=90, color=flatui_palette[1])

ax.set_title("Salary distribution (US$/year)", fontsize=16, fontweight='bold', color='white')
Out[36]:
Text(0.5,1,'Salary distribution (US$/year)')
In [37]:
pd.concat([salary_bootcamp.describe(), salary_no_bootcamp.describe()], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1)
Out[37]:
Took Bootcamp No took Bootcamp
count 4,615.000 43,087.000
mean 96,581.306 95,695.127
std 201,249.148 202,467.718
min 0.000 0.000
25% 22,818.000 24,000.000
50% 59,729.000 55,075.000
75% 97,729.000 92,000.000
max 2,000,000.000 2,000,000.000

Salary by country

We see that large countries such as the United States, Germany, Canada, and the United Kingdom have median salaries above the worldwide median.

In [38]:
respondents = df.groupby('Country')['Country'].count()

df_representative = df[df['Country'].isin(respondents[respondents > 80].index)]
In [39]:
selected_countries = ['United States', 'United Kingdom', 'Canada', 'Germany']
In [40]:
country_salary_bootcamp = df_representative.loc[df_representative['Took Bootcamp?']].groupby(['Country'])['ConvertedSalary'].median()[selected_countries]
country_salary_no_bootcamp = df_representative.loc[(~df_representative['Took Bootcamp?'])].groupby(['Country'])['ConvertedSalary'].median()[selected_countries]

country_salary_df = pd.concat([country_salary_bootcamp, country_salary_no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1)
country_salary_df
Out[40]:
Took Bootcamp No took Bootcamp
Country
United States 98,000.000 100,000.000
United Kingdom 69,452.000 62,507.000
Canada 70,456.000 64,417.000
Germany 71,107.000 61,194.000
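
The same kind of table can also be built in a single step by grouping on both the country and the bootcamp flag and then pivoting the flag into columns. This is only a sketch of an alternative, shown on a made-up frame: the salaries are hypothetical, while the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical rows; one salary is missing to show that median() skips NaN.
toy = pd.DataFrame({
    "Country": ["Canada", "Canada", "Canada", "Germany", "Germany", "Germany"],
    "Took Bootcamp?": [True, True, False, True, False, False],
    "ConvertedSalary": [60000.0, 80000.0, 50000.0, 70000.0, 40000.0, None],
})

medians = (
    toy.groupby(["Country", "Took Bootcamp?"])["ConvertedSalary"]
       .median()                      # median per (country, flag) pair
       .unstack("Took Bootcamp?")     # flag values become columns
       .rename(columns={True: "Took Bootcamp", False: "No took Bootcamp"})
)

print(medians)
```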
In [41]:
fig, ax = plt.subplots(figsize=(16, 6))

country_salary_df.sort_values(by='Took Bootcamp', ascending=False).plot(kind="bar", ax=ax)

plt.axhline(salary_bootcamp.median(), linestyle=':', color=flatui_palette[0])
plt.text(-0.75, salary_bootcamp.median()+1000, 'World median {}'.format(int(salary_bootcamp.median())), color=flatui_palette[0])

plt.axhline(salary_no_bootcamp.median(), linestyle=':', color=flatui_palette[1])
plt.text(-0.75, salary_no_bootcamp.median()+1000, 'World median {}'.format(int(salary_no_bootcamp.median())), color=flatui_palette[1])

plt.xticks(rotation=0)
plt.title("Median Salary per Country (US$/year)", fontsize=16, fontweight='bold', color='white')
Out[41]:
Text(0.5,1,'Median Salary per Country (US$/year)')
In [42]:
fig, ax = plt.subplots(figsize=(16, 6))

df_selected_countries = df_representative.loc[df_representative['Country'].isin(selected_countries), :]

ax = sns.boxplot(x='ConvertedSalary', y='Country', hue="Took Bootcamp?", data=df_selected_countries,
                 order=selected_countries, orient='h', fliersize=3, palette=flatui_palette)

for i, box in enumerate(ax.artists):    
    for j in range(i*6,i*6+6):
        line = ax.lines[j]
        line.set_color('#888888')

    plt.setp(ax.lines[i*6+5], mfc='#bababa', mec='#bababa')

plt.legend()
plt.title("Salary Boxplot per Country (US$/year)", fontsize=16, fontweight='bold', color='white')
Out[42]:
Text(0.5,1,'Salary Boxplot per Country (US$/year)')


Developer types

In [43]:
df['DevType'].str.split(';', expand=True).stack().unique()
Out[43]:
array(['Full-stack developer', 'Database administrator',
       'DevOps specialist', 'System administrator', 'Engineering manager',
       'Data or business analyst',
       'Desktop or enterprise applications developer',
       'Game or graphics developer', 'QA or test developer', 'Student',
       'Back-end developer', 'Front-end developer', 'Designer',
       'C-suite executive (CEO, CTO, etc.)', 'Mobile developer',
       'Data scientist or machine learning specialist',
       'Marketing or sales professional', 'Product manager',
       'Embedded applications or devices developer',
       'Educator or academic researcher'], dtype=object)
In [44]:
selected_dev_types = ['Full-stack developer', 'Back-end developer', 'Front-end developer',
                      'Data scientist or machine learning specialist', 'Data or business analyst',
                      'DevOps specialist', 'System administrator',
                      'Database administrator', 'Mobile developer']

devtypes_bootcamp = pd.Series([df.loc[df['Took Bootcamp?'], 'DevType'].str.contains(devType).sum() for devType in selected_dev_types],
                              index=selected_dev_types,
                              name='Took Bootcamp') / df[df['Took Bootcamp?']].shape[0] * 100

devtypes_no_bootcamp = pd.Series([df.loc[~df['Took Bootcamp?'], 'DevType'].str.contains(devType).sum() for devType in selected_dev_types],
                              index=selected_dev_types,
                              name='No took Bootcamp') / df[~df['Took Bootcamp?']].shape[0] * 100

devtypes_df = pd.concat([devtypes_bootcamp, devtypes_no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1)
devtypes_df
Out[44]:
Took Bootcamp No took Bootcamp
Full-stack developer 54.401 44.142
Back-end developer 58.738 53.551
Front-end developer 41.363 34.759
Data scientist or machine learning specialist 8.129 7.097
Data or business analyst 11.121 7.382
DevOps specialist 11.579 9.514
System administrator 10.863 10.467
Database administrator 15.600 13.199
Mobile developer 22.485 18.758
In [45]:
fig, ax = plt.subplots(figsize=(16, 6))

devtypes_df.sort_values(by='Took Bootcamp', ascending=False).plot(kind="bar", ax=ax)

plt.xticks(rotation=15)
plt.title("Developer types (%)", fontsize=16, fontweight='bold', color='white')
Out[45]:
Text(0.5,1,'Developer types (%)')
In [47]:
devtype_index_bootcamp = pd.Series([df.loc[df['Took Bootcamp?'], 'DevType'].str.contains(devType) for devType in selected_dev_types])
salary_devtype_bootcamp = pd.Series([df.loc[df['Took Bootcamp?']][devtype_index_bootcamp[i].fillna(False)]['ConvertedSalary'].median() for i in np.arange(len(selected_dev_types))],
                                       index=selected_dev_types,
                                       name='Took Bootcamp')

devtype_index_no_bootcamp = pd.Series([df.loc[~df['Took Bootcamp?'], 'DevType'].str.contains(devType) for devType in selected_dev_types])
salary_devtype_no_bootcamp = pd.Series([df.loc[~df['Took Bootcamp?']][devtype_index_no_bootcamp[i].fillna(False)]['ConvertedSalary'].median() for i in np.arange(len(selected_dev_types))],
                                       index=selected_dev_types,
                                       name='No took Bootcamp')

salary_devtypes_df = pd.concat([salary_devtype_bootcamp, salary_devtype_no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1)
salary_devtypes_df
Out[47]:
Took Bootcamp No took Bootcamp
Full-stack developer 61,437.000 58,752.000
Back-end developer 58,598.000 55,075.000
Front-end developer 55,593.500 51,408.000
Data scientist or machine learning specialist 57,946.000 60,000.000
Data or business analyst 63,000.000 58,752.000
DevOps specialist 80,000.000 71,457.000
System administrator 60,000.000 55,562.000
Database administrator 53,076.000 51,000.000
Mobile developer 40,701.000 43,812.500
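
Because `DevType` is a multi-select column, each respondent can count toward several developer types. The per-type median can be expressed as a small helper that filters rows by label and takes the median of the value column. The sketch below is one possible way to write it: `median_by_label` is a hypothetical helper name and the toy rows are made up.

```python
import pandas as pd

def median_by_label(frame, labels, column="DevType", value="ConvertedSalary"):
    """Median of `value` over the rows whose `column` text mentions each label."""
    return pd.Series({
        # na=False drops rows with no answer; regex=False matches the label literally.
        label: frame.loc[frame[column].str.contains(label, na=False, regex=False), value].median()
        for label in labels
    })

# Hypothetical respondents; the first one counts toward both developer types.
toy = pd.DataFrame({
    "DevType": ["Full-stack developer;Back-end developer", "Back-end developer",
                None, "Full-stack developer"],
    "ConvertedSalary": [100.0, 50.0, 70.0, 60.0],
})

result = median_by_label(toy, ["Full-stack developer", "Back-end developer"])
print(result)
```

Applied to the real data, the helper would be called once on the bootcamp rows and once on the remaining rows, and the two results concatenated as in the cell above.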
In [48]:
fig, ax = plt.subplots(figsize=(16, 6))

salary_devtypes_df.sort_values(by='Took Bootcamp', ascending=False).plot(kind="bar", ax=ax)
ax.legend(["Bootcamp - YES", "Bootcamp - NO"])

ax.axhline(salary_bootcamp.median(), linestyle=':', color=flatui_palette[0])
ax.text(7.5, salary_bootcamp.median()+1000, 'World median {}'.format(int(salary_bootcamp.median())), color=flatui_palette[0])

ax.axhline(salary_no_bootcamp.median(), linestyle=':', color=flatui_palette[1])
ax.text(7.5, salary_no_bootcamp.median()+1000, 'World median {}'.format(int(salary_no_bootcamp.median())), color=flatui_palette[1])

ax.tick_params(axis='x', rotation=15)
ax.set_title("Median Salary per Developer type (US$/year)", fontsize=16, fontweight='bold', color='white')
Out[48]:
Text(0.5,1,'Median Salary per Developer type (US$/year)')


Developer types in the US

In [49]:
usa_df = df.loc[df['Country'] == 'United States']
In [50]:
devtype_index_bootcamp = pd.Series([usa_df.loc[usa_df['Took Bootcamp?'], 'DevType'].str.contains(devType) for devType in selected_dev_types])

salary_devtype_bootcamp = pd.Series([usa_df.loc[usa_df['Took Bootcamp?']][devtype_index_bootcamp[i].fillna(False)]['ConvertedSalary'].median() for i in np.arange(len(selected_dev_types))],
                                       index=selected_dev_types,
                                       name='Took Bootcamp')

devtype_index_no_bootcamp = pd.Series([usa_df.loc[~usa_df['Took Bootcamp?'], 'DevType'].str.contains(devType) for devType in selected_dev_types])
salary_devtype_no_bootcamp = pd.Series([usa_df.loc[~usa_df['Took Bootcamp?']][devtype_index_no_bootcamp[i].fillna(False)]['ConvertedSalary'].median() for i in np.arange(len(selected_dev_types))],
                                       index=selected_dev_types,
                                       name='No took Bootcamp')

salary_devtypes_df = pd.concat([salary_devtype_bootcamp, salary_devtype_no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1)
salary_devtypes_df
Out[50]:
Took Bootcamp No took Bootcamp
Full-stack developer 97,500.000 100,000.000
Back-end developer 105,000.000 102,000.000
Front-end developer 90,000.000 95,000.000
Data scientist or machine learning specialist 105,000.000 102,000.000
Data or business analyst 100,000.000 88,500.000
DevOps specialist 115,000.000 110,000.000
System administrator 102,000.000 90,650.000
Database administrator 100,000.000 90,000.000
Mobile developer 100,000.000 101,380.000
In [51]:
fig, ax = plt.subplots(figsize=(16, 6))

salary_devtypes_df.sort_values(by='Took Bootcamp', ascending=False).plot(kind="bar", ax=ax)
ax.legend(["Bootcamp - YES", "Bootcamp - NO"])

ax.axhline(salary_bootcamp.median(), linestyle=':', color=flatui_palette[0])
ax.text(7.5, salary_bootcamp.median()+1000, 'World median {}'.format(int(salary_bootcamp.median())), color=flatui_palette[0])

ax.axhline(salary_no_bootcamp.median(), linestyle=':', color=flatui_palette[1])
ax.text(7.5, salary_no_bootcamp.median()+1000, 'World median {}'.format(int(salary_no_bootcamp.median())), color=flatui_palette[1])

ax.tick_params(axis='x', rotation=15)
ax.set_title("Median Salary in the US per Developer type (US$/year)", fontsize=16, fontweight='bold', color='white')
Out[51]:
Text(0.5,1,'Median Salary in the US per Developer type (US$/year)')
In [52]:
# Average number of languages worked with per bootcamp respondent
round(df.loc[df['Took Bootcamp?'], 'LanguageWorkedWith'].str.split(';').fillna('').apply(len).mean(), 3)
Out[52]:
6.295
In [53]:
# Average number of languages worked with per non-bootcamp respondent
round(df.loc[~df['Took Bootcamp?'], 'LanguageWorkedWith'].str.split(';').fillna('').apply(len).mean(), 3)
Out[53]:
4.808
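One detail of the two averages above: `fillna('')` turns a missing `LanguageWorkedWith` answer into an empty string, so respondents who skipped the question are counted as working with zero languages. Whether that is intended is not stated in the notebook. The sketch below uses a made-up `answers` Series (an assumption for illustration) to show how the mean changes when missing answers are dropped instead.

```python
import pandas as pd

# Made-up answers; None stands for a respondent who skipped the question.
answers = pd.Series(['Python;SQL;R', 'JavaScript', None])

# Same expression as in the cells above: a missing answer becomes '' and len('') is 0.
counts_with_zeros = answers.str.split(';').fillna('').apply(len)
mean_all = counts_with_zeros.mean()

# Alternative: average only over respondents who answered the question.
counts_answered = answers.dropna().str.split(';').apply(len)
mean_answered = counts_answered.mean()

print(counts_with_zeros.tolist(), round(mean_all, 3), mean_answered)
```

If the share of missing answers differs between the two groups, the gap between the two averages could partly reflect response rates rather than the number of languages used.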

green-divider

Developer languages worked with

The more people work with a certain language, the more popular that language is.

In [54]:
df['LanguageWorkedWith'].str.split(';', expand=True).stack().unique()
Out[54]:
array(['JavaScript', 'Python', 'HTML', 'CSS', 'Bash/Shell', 'C#', 'SQL',
       'TypeScript', 'C', 'C++', 'Java', 'Matlab', 'R', 'Assembly',
       'CoffeeScript', 'Erlang', 'Go', 'Lua', 'Ruby', 'PHP', 'VB.NET',
       'Swift', 'Groovy', 'Kotlin', 'Objective-C', 'Scala', 'F#',
       'Haskell', 'Rust', 'Julia', 'VBA', 'Perl', 'Cobol',
       'Visual Basic 6', 'Delphi/Object Pascal', 'Hack', 'Clojure',
       'Ocaml'], dtype=object)
In [55]:
selected_languages = ['Python', 'R', 'Matlab', 'JavaScript', 'TypeScript', 'HTML',
                      'Bash/Shell', 'SQL', 'Swift', 'Go', 'Java', 'Ruby', 'PHP',
                      'Objective-C', 'Scala', 'Rust']

languages_bootcamp = pd.Series([df.loc[df['Took Bootcamp?'], 'LanguageWorkedWith'].str.contains(language).sum() for language in selected_languages],
                              index=selected_languages,
                              name='Took Bootcamp') / df[df['Took Bootcamp?']].shape[0] * 100

languages_no_bootcamp = pd.Series([df.loc[~df['Took Bootcamp?'], 'LanguageWorkedWith'].str.contains(language).sum() for language in selected_languages],
                              index=selected_languages,
                              name='No took Bootcamp') / df[~df['Took Bootcamp?']].shape[0] * 100

languages_df = pd.concat([languages_bootcamp, languages_no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1)
languages_df
Out[55]:
             Took Bootcamp  No took Bootcamp
Python              33.476            30.500
R                   18.420            13.129
Matlab               5.768             4.529
JavaScript          71.461            54.092
TypeScript          19.150            13.376
HTML                71.003            52.975
Bash/Shell          39.116            30.956
SQL                 60.341            44.035
Swift                8.387             6.231
Go                   6.083             5.559
Java                83.040            64.342
Ruby                12.738             7.642
PHP                 28.424            24.040
Objective-C          7.772             5.407
Scala                4.365             3.391
Rust                 1.703             1.892
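A caveat when reading these percentages: `str.contains(language)` matches substrings, so `'Java'` also matches answers that contain only `'JavaScript'`, and `'R'` also matches `'Ruby'` and `'Rust'`. This may explain why Java appears more widely used than JavaScript here, and the same matching is used in the salary-per-language cell. The sketch below, on a made-up `answers` Series (an assumption, not survey data), contrasts substring matching with exact matching of the `;`-separated entries via `Series.str.get_dummies`.

```python
import pandas as pd

# Made-up answers for illustration; None is a skipped question.
answers = pd.Series(['JavaScript;HTML', 'Java;SQL', 'Ruby;Rust', None])

# Substring matching, as in the cell above: 'Java' also hits 'JavaScript',
# and 'R' hits 'Ruby;Rust' although nobody listed R itself.
substring_java = answers.str.contains('Java', na=False).sum()
substring_r = answers.str.contains('R', na=False).sum()

# Exact matching: one 0/1 column per distinct ';'-separated entry.
dummies = answers.str.get_dummies(sep=';')
# reindex adds an all-zero column for labels nobody listed (here 'R').
exact = dummies.reindex(columns=['Java', 'R'], fill_value=0).sum()

print(substring_java, substring_r)
print(exact)
```

On the real data, `df['LanguageWorkedWith'].str.get_dummies(sep=';')` restricted to `selected_languages` and averaged per group would give exact-match percentages. Those numbers are not computed here, so the table above is left as originally reported.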
In [56]:
fig, ax = plt.subplots(figsize=(16, 6))

languages_df.sort_values(by='Took Bootcamp', ascending=False).plot(kind="bar", ax=ax)

plt.xticks(rotation=15)
plt.title("Developer languages worked with (%)", fontsize=16, fontweight='bold', color='white')
Out[56]:
Text(0.5,1,'Developer languages worked with (%)')
In [57]:
language_index_bootcamp = pd.Series([df.loc[df['Took Bootcamp?'], 'LanguageWorkedWith'].str.contains(language) for language in selected_languages])
salary_language_bootcamp = pd.Series([df.loc[df['Took Bootcamp?']][language_index_bootcamp[i].fillna(False)]['ConvertedSalary'].median() for i in np.arange(len(selected_languages))],
                                       index=selected_languages,
                                       name='Took Bootcamp')

language_index_no_bootcamp = pd.Series([df.loc[~df['Took Bootcamp?'], 'LanguageWorkedWith'].str.contains(language) for language in selected_languages])
salary_language_no_bootcamp = pd.Series([df.loc[~df['Took Bootcamp?']][language_index_no_bootcamp[i].fillna(False)]['ConvertedSalary'].median() for i in np.arange(len(selected_languages))],
                                       index=selected_languages,
                                       name='No took Bootcamp')

salary_languages_df = pd.concat([salary_language_bootcamp, salary_language_no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1)
salary_languages_df
Out[57]:
             Took Bootcamp  No took Bootcamp
Python          60,000.000        60,000.000
R               70,000.000        68,443.000
Matlab          40,000.000        42,836.000
JavaScript      60,000.000        55,812.000
TypeScript      64,417.000        62,001.000
HTML            58,381.000        54,361.000
Bash/Shell      69,761.000        64,314.000
SQL             60,084.000        56,000.000
Swift           60,000.000        59,729.000
Go              75,000.000        75,850.000
Java            59,485.000        55,000.000
Ruby            72,500.000        73,000.000
PHP             38,556.000        42,674.000
Objective-C     62,418.000        62,418.000
Scala           82,000.000        73,433.000
Rust            76,369.000        72,000.000
In [58]:
fig, ax = plt.subplots(figsize=(16, 6))

salary_languages_df.sort_values(by='Took Bootcamp', ascending=False).plot(kind="bar", ax=ax)

plt.axhline(salary_bootcamp.median(), linestyle=':', color=flatui_palette[0])
plt.text(13.5, salary_bootcamp.median()+1000, 'World median {}'.format(int(salary_bootcamp.median())), color=flatui_palette[0])

plt.axhline(salary_no_bootcamp.median(), linestyle=':', color=flatui_palette[1])
plt.text(13.5, salary_no_bootcamp.median()+1000, 'World median {}'.format(int(salary_no_bootcamp.median())), color=flatui_palette[1])

plt.xticks(rotation=15)
plt.title("Median Salary per Developer language (US$/year)", fontsize=16, fontweight='bold', color='white')
Out[58]:
Text(0.5,1,'Median Salary per Developer language (US$/year)')

green-divider

Programming Frameworks worked with

The more people work with a certain programming framework, the more popular that framework is.

In [59]:
df['FrameworkWorkedWith'].str.split(';', expand=True).stack().unique()
Out[59]:
array(['Django', 'React', 'Angular', 'Node.js', 'Hadoop', 'Spark',
       'Spring', '.NET Core', 'Cordova', 'Xamarin', 'TensorFlow',
       'Torch/PyTorch'], dtype=object)
In [60]:
selected_frameworks = ['Django', 'React', 'Angular', 'Node.js',
                      'Spark', 'Spring', 'TensorFlow', 'Torch/PyTorch']

frameworks_bootcamp = pd.Series([df.loc[df['Took Bootcamp?'], 'FrameworkWorkedWith'].str.contains(framework).sum() for framework in selected_frameworks],
                              index=selected_frameworks,
                              name='Took Bootcamp') / df[df['Took Bootcamp?']].shape[0] * 100

frameworks_no_bootcamp = pd.Series([df.loc[~df['Took Bootcamp?'], 'FrameworkWorkedWith'].str.contains(framework).sum() for framework in selected_frameworks],
                              index=selected_frameworks,
                              name='No took Bootcamp') / df[~df['Took Bootcamp?']].shape[0] * 100

frameworks_df = pd.concat([frameworks_bootcamp, frameworks_no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1)
frameworks_df
Out[60]:
               Took Bootcamp  No took Bootcamp
Django                 9.060             6.629
React                 21.955            13.936
Angular               29.913            18.432
Node.js               37.055            25.046
Spark                  3.821             2.403
Spring                14.727             8.763
TensorFlow             4.981             4.004
Torch/PyTorch          1.302             0.846
In [61]:
fig, ax = plt.subplots(figsize=(16, 6))

frameworks_df.sort_values(by='Took Bootcamp', ascending=False).plot(kind="bar", ax=ax)

plt.xticks(rotation=15)
plt.title("Programming Frameworks worked with (%)", fontsize=16, fontweight='bold', color='white')
Out[61]:
Text(0.5,1,'Programming Frameworks worked with (%)')
In [62]:
framework_index_bootcamp = pd.Series([df.loc[df['Took Bootcamp?'], 'FrameworkWorkedWith'].str.contains(framework) for framework in selected_frameworks])
salary_framework_bootcamp = pd.Series([df.loc[df['Took Bootcamp?']][framework_index_bootcamp[i].fillna(False)]['ConvertedSalary'].median() for i in np.arange(len(selected_frameworks))],
                                       index=selected_frameworks,
                                       name='Took Bootcamp')

framework_index_no_bootcamp = pd.Series([df.loc[~df['Took Bootcamp?'], 'FrameworkWorkedWith'].str.contains(framework) for framework in selected_frameworks])
salary_framework_no_bootcamp = pd.Series([df.loc[~df['Took Bootcamp?']][framework_index_no_bootcamp[i].fillna(False)]['ConvertedSalary'].median() for i in np.arange(len(selected_frameworks))],
                                       index=selected_frameworks,
                                       name='No took Bootcamp')

salary_frameworks_df = pd.concat([salary_framework_bootcamp, salary_framework_no_bootcamp], keys=['Took Bootcamp', 'No took Bootcamp'], axis=1)
salary_frameworks_df
Out[62]:
               Took Bootcamp  No took Bootcamp
Django            50,000.000        52,692.000
React             67,313.000        64,417.000
Angular           60,000.000        55,075.000
Node.js           63,000.000        59,629.500
Spark             84,000.000        73,433.000
Spring            55,562.000        55,562.000
TensorFlow        59,940.000        60,607.000
Torch/PyTorch     40,977.000        54,507.000
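Some of these medians may rest on few salaries. Torch/PyTorch, for example, is used by only about 1.3% of bootcamp respondents according to the usage table, and the notebook does not report how many of them gave a salary. Showing the number of salaries next to each median makes it easier to judge the larger gaps. The sketch below does this on a made-up `toy` DataFrame; the data and the `report` layout are assumptions for illustration only.

```python
import pandas as pd

# Made-up data for illustration; not taken from the survey.
toy = pd.DataFrame({
    'Took Bootcamp?': [True, True, True, False, False, False, False],
    'FrameworkWorkedWith': ['React;Node.js', 'React', 'Torch/PyTorch',
                            'React', 'Torch/PyTorch;TensorFlow',
                            'Torch/PyTorch', None],
    'ConvertedSalary': [60.0, 70.0, 40.0, 65.0, 50.0, 58.0, 45.0],
})

took = toy['Took Bootcamp?']
groups = [('Took Bootcamp', toy[took]), ('No took Bootcamp', toy[~took])]

records = []
for group_name, group in groups:
    for label in ['React', 'Torch/PyTorch']:
        # Literal match of the framework name; missing answers never match.
        mask = group['FrameworkWorkedWith'].str.contains(label, regex=False, na=False)
        salaries = group.loc[mask, 'ConvertedSalary'].dropna()
        records.append({'framework': label, 'group': group_name,
                        'median': salaries.median(), 'n': len(salaries)})

# One row per (group, framework) with the median and the number of salaries behind it.
report = pd.DataFrame(records)
print(report)
```

A median with a small `n` is better read as anecdotal than as evidence of a real difference between the two groups.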
In [63]:
fig, ax = plt.subplots(figsize=(16, 6))

salary_frameworks_df.sort_values(by='Took Bootcamp', ascending=False).plot(kind="bar", ax=ax)

plt.axhline(salary_bootcamp.median(), linestyle=':', color=flatui_palette[0])
plt.text(6.5, salary_bootcamp.median()+1000, 'World median {}'.format(int(salary_bootcamp.median())), color=flatui_palette[0])

plt.axhline(salary_no_bootcamp.median(), linestyle=':', color=flatui_palette[1])
plt.text(6.5, salary_no_bootcamp.median()+1000, 'World median {}'.format(int(salary_no_bootcamp.median())), color=flatui_palette[1])

plt.xticks(rotation=15)
plt.title("Median Salary per programming Framework (US$/year)", fontsize=16, fontweight='bold', color='white')
Out[63]:
Text(0.5,1,'Median Salary per programming Framework (US$/year)')