Python DataFrames and Basic Graphs¶
Python is a popular programming language that can be used to analyze and graph large datasets. There are many free resources online for learning Python:
- https://www.codecademy.com
- https://www.datacamp.com
- https://docs.python.org/3.8/tutorial/index.html
- Coursera courses such as "Programming for Everybody (Getting Started with Python)" from University of Michigan
- Linkedin Learning courses (free with UIC login) such as "Learning Python"
To use this Jupyter Notebook you do not have to download anything, but one way to download and use Python and many of its packages is to install the Anaconda platform (https://www.anaconda.com).
In this short walk-through we are going to focus on practical Python packages for data analysis called Pandas and Seaborn. Packages add functionality to basic Python. Here are links to the documentation for these packages:
- Pandas: https://pandas.pydata.org/docs/
- Seaborn: https://seaborn.pydata.org/index.html
In Part 1 we will learn how to import csv files (comma-separated values files) containing data and convert them into data frames. We'll see how to manipulate the data frames and compute summary statistics.
In Part 2 we will use Seaborn to make several basic, yet exciting, graph types from the data.
These commands will import the Python packages we are using:
#import numpy, a numerical python package that allows fast analysis of data in matrices
import numpy as np
#import pandas, a python package that introduces data frames and requires numpy
import pandas as pd
#import matplotlib.pyplot, a python package for making graphs
import matplotlib.pyplot as plt
#import seaborn, a python package that can make nice scientific graphs and requires matplotlib
import seaborn as sns
This next command is so that the graphs will be plotted to the right size in the notebook. It only works in Jupyter Notebooks.
%matplotlib inline
Part 1: Creating the data frame and summary statistics¶
We are going to use a built-in dataset from the Seaborn package called 'mpg' that includes data on different types of cars. This dataset can be accessed directly from Seaborn, but we are going to save it to a csv (comma-separated values) file and reopen it to demonstrate importing files with Python.
#accessing the built-in mpg dataset and saving it as the variable 'df'
df = sns.load_dataset('mpg')
#saving 'df' as a file called 'mpg.csv'. You can also use a longer file path to save to a different location.
df.to_csv('mpg.csv', index = False)
#reading the 'mpg.csv' file
df = pd.read_csv('mpg.csv')
Try checking all the read_csv function options by hitting 'shift + tab' while your cursor is inside the parentheses! This works for any Python function in Jupyter Notebooks.
#display the df
df
You can select and view specific columns or rows:
#selecting a specific column of the data frame
df['weight']
#selecting a specific row of the data frame
#note when using iloc that numbering starts at 0 not 1 for both rows and columns
df.iloc[396 , : ]
To select multiple rows or columns you can use the ":" symbol. For example:
df.iloc[1 , 5] gives the second row and the sixth column.
df.iloc[0:2 , 5] gives the first two rows and the sixth column.
The ":" symbol used alone means ALL columns or ALL rows.
df.iloc[0:2 , 5]
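To make the ":" behavior concrete, here is a minimal sketch using a small toy data frame (so it runs on its own; the same calls work on df):

```python
import pandas as pd

# a small toy data frame standing in for df
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# ':' alone in the row position selects ALL rows of column index 1 ('b')
all_rows_col_b = toy.iloc[:, 1]
print(list(all_rows_col_b))  # [4, 5, 6]

# ':' alone in the column position selects ALL columns of row index 0
first_row = toy.iloc[0, :]
print(list(first_row))  # [1, 4, 7]
```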
Note that the index is just a list of numbers, and it would be more useful to use the column "name" as the index. You can change the index as follows:
#making one of the columns into the index of the dataframe
df = df.set_index('name')
df
Now we can select rows by name of vehicle:
#selecting a specific row by name
df.loc['buick skylark 320', : ]
#note the method is loc, not iloc now
To delete rows or columns use the 'drop' method.
An important thing to note is that, by default, any function which modifies the data returns a modified copy of the data frame and does not change the data frame itself. This is to prevent accidental errors. To apply changes to the data frame itself, the option "inplace=True" needs to be included.
#deleting a column (does not modify df permanently, but lets you see the result)
df.drop('origin', axis = 1)
#note that our original data frame still includes the 'origin' column!
df
#deleting a row in the original data frame
df.drop('buick skylark 320', axis = 0, inplace = True)
#note that by using inplace=True the original data frame is now permanently changed!
df
New columns can be added to the data frame as well. For example, if you wanted a normalized version of one of the columns, you could create a new column using mathematical operations on existing columns:
#creating a new column
df['normalized displacement'] = (df['displacement'] - df['displacement'].mean()) / df['displacement'].std() * 100
Now lets look at some info about the data frame:
df.info()
Note that the data types (dtypes) include float64, int64 and object. These were automatically interpreted by pandas when you loaded in the csv file! Numbers with decimals are dtype float64, numbers that are integers are dtype int64, and strings are stored as objects.
Also note that there are 6 null values in the 'horsepower' column, meaning some data is missing.
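One quick way to confirm which columns have missing values is isnull().sum(). A minimal sketch using a tiny stand-in frame with one gap (the same call works on df):

```python
import pandas as pd
import numpy as np

# toy frame with one missing value, standing in for df
toy = pd.DataFrame({'mpg': [18.0, 15.0], 'horsepower': [130.0, np.nan]})

# count the missing (NaN) values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)  # 'horsepower' shows 1 missing value, 'mpg' shows 0
```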
#removing rows with missing data
df.dropna(inplace = True)
You can change the dtype of any column, although sometimes you will run into errors if there are still missing values. The dtype 'category' is more versatile than 'object' for categorical variables when it comes to graphing.
#change to category dtype
df['cylinders'] = df['cylinders'].astype('category')
df['origin'] = df['origin'].astype('category')
#change to integer dtype (will round decimals)
df['displacement'] = df['displacement'].astype('int16')
To see the summary statistics for each numerical column you can use the describe method:
df.describe()
You can also calculate summary statistics by group for categorical variables!
#creating the groups based off a categorical column
by_origin = df.groupby('origin')
#you can use any built-in statistics function on the groups
#(in newer pandas versions, numeric_only = True is needed to skip non-numeric columns)
by_origin.mean(numeric_only = True)
Some other useful data representations are correlation tables and pivot tables:
#creating a correlation matrix for the numerical variables
#(in newer pandas versions, numeric_only = True is needed to skip non-numeric columns)
df.corr(numeric_only = True)
#creating a pivot table (reorganization of data)
df.pivot_table(index = "origin", columns = "model_year", values = "mpg")
Part 2: Making graphs with Seaborn¶
First we can set some basic plot style parameters with sns.set.
If you ever want to know more about the options in a function, you can check its documentation page by searching online, or you can hit 'shift + tab' while typing inside its parentheses in a Jupyter Notebook!
sns.set(context = 'notebook', style = 'ticks', palette = 'muted', font = 'sans-serif', font_scale = 1.5)
You can also use the matplotlib.rcParams function to modify a lot of plot features.
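For example, rcParams entries can set defaults for all subsequent plots (the values here are just illustrative, adjust to taste):

```python
import matplotlib.pyplot as plt

# set a default figure size (inches) and base font size for all later plots
plt.rcParams['figure.figsize'] = (10, 5)
plt.rcParams['font.size'] = 14

print(plt.rcParams['figure.figsize'])  # [10.0, 5.0]
```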
Bar Plots¶
These are good to use when plotting a categorical variable against a continuous variable.
The default error bars in Seaborn are bootstrapped 95% confidence intervals.
You can change this by setting the option ci = 90 for a 90% confidence interval, or ci = 'sd' for standard deviation.
sns.barplot(x = 'cylinders', y = "weight", ci = 'sd', data = df)
Swapping the x and y variables makes the plot horizontal instead of vertical.
You can also adjust the order of the bars using the order option, and passing it a list of the categories in the order you want.
sns.barplot(x = 'weight', y = 'cylinders', data = df, order = [8,6,5,4,3])
Changing plot aesthetics¶
You can have a lot of control over plot aesthetics in Python. However, learning how to change aesthetics can be challenging, and the details sometimes differ for each plot type. The best way to learn is by using the package documentation and by searching for what you want online; there are lots of tutorials.
Note that you can use commands from Matplotlib to change aesthetics in Seaborn plots too, because Seaborn is built on top of Matplotlib.
Below I show how to make some stylistic changes to bar plots:
#change plot size and dimensions (inches (width,height))
#You must run this command first
plt.figure(figsize=(10,5))
#adding error bar caps, changing the error bar color and width, and adding bar edges
sns.barplot(x = 'cylinders', y = 'weight', data = df, capsize = 0.2, errcolor = '0', edgecolor = '0', errwidth = 1)
#remove the right and top edges of the plot
sns.despine()
#rotate the x-axis labels 45 degrees
plt.xticks(rotation = 45)
#change y axis scale
plt.yscale("log")
#set y axis tick marks
plt.yticks([2000, 3000, 4000, 5000])
#add labels and title
plt.xlabel("Number of Cylinders")
plt.ylabel("Weight (lbs)")
plt.title("Car Weight Increases with Cylinders")
Box Plots¶
The box is the interquartile range (contains 50% of the data). The center line is the median. The whiskers can be modified with the "whis" option, the default is whis = 1.5, which is the "Proportion of the IQR past the low and high quartiles to extend the plot whiskers".
sns.boxplot(x = 'model_year', y = 'mpg', data = df)
sns.boxplot(x = 'model_year', y = 'mpg', data = df, notch = True)
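The "whis" option mentioned above can be passed directly to boxplot. A minimal sketch using a tiny stand-in frame so it runs on its own (the same call works on the full df):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import pandas as pd
import seaborn as sns

# tiny stand-in for df
toy = pd.DataFrame({'model_year': [70, 70, 70, 71, 71, 71],
                    'mpg': [18.0, 15.0, 16.0, 25.0, 24.0, 27.0]})

# whis = 2 extends the whiskers to 2x the IQR instead of the default 1.5
ax = sns.boxplot(x = 'model_year', y = 'mpg', data = toy, whis = 2)
```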
Cat Plot¶
This is one way you can add a second categorical variable to the plot (the 'hue' option).
By specifying 'kind' you can make a 'bar' plot, 'box' plot, 'swarm' plot, and others.
sns.catplot(x = 'cylinders', y = "mpg", hue = "origin", data = df, kind = 'box')
sns.catplot(x = 'cylinders', y = "mpg", hue = "origin", data = df, kind = 'swarm')
Distplot¶
Useful for checking the distribution of a continuous variable.
This is a histogram with a KDE (kernel density estimate) overlaid on top of it.
sns.distplot(df['horsepower'])
It is possible to change the number of bins and remove the KDE line:
sns.distplot(df['horsepower'], bins = 5,kde = False, color = "black")
Lineplot¶
Good for plotting a continuous variable against an ordered variable like 'model_year'.
This automatically includes a 95% confidence interval, which you can change to standard deviation using ci = 'sd', just as with the bar plots earlier.
sns.lineplot(x = "model_year", y = "horsepower", data = df)
#style='origin' makes different style lines for each origin
#err_style='bars' converts to using bars to represent error
sns.lineplot(x = "model_year", y = "horsepower", style = "origin", err_style = "bars", color = 'black', data = df)
#choose legend location
plt.legend(loc = "upper right", fontsize = "small")
#hue='origin' uses a different color line for each origin
sns.lineplot(x = 'model_year', y = 'mpg', hue = "origin", data = df)
#choose legend location, and place legend outside graph. bbox_to_anchor moves legend from starting position.
plt.legend(loc = "center left", bbox_to_anchor = (1,0.5))
Scatterplots¶
Good for plotting two continuous variables
sns.scatterplot(x = 'horsepower', y = 'displacement', data = df)
#the alpha option determines how transparent the dots are, so you can see where they pile up
sns.scatterplot(x = 'horsepower', y = 'displacement', data = df, alpha = 0.5, color = 'black')
sns.scatterplot(x = 'horsepower', y = "displacement", hue = "origin", data = df)
plt.legend(loc = "center left", bbox_to_anchor = (1,0.5), fontsize="x-small")
Jointplot¶
Jointplot combines a scatter plot with distribution plots on the margins. You can change how the data is graphed using the 'kind' option.
sns.jointplot(x = 'horsepower', y = 'displacement', data = df, kind = 'reg')
df["horsepower"].corr(df["displacement"],method="pearson")
#note that .corr() returns only the correlation coefficient, not a p-value
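A p-value for the correlation can be obtained from SciPy's stats module (assuming SciPy is installed); on the data frame you would call stats.pearsonr(df['horsepower'], df['displacement']). A small self-contained sketch:

```python
from scipy import stats

# Pearson correlation coefficient and two-sided p-value
r, p = stats.pearsonr([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(r, p)  # a perfect linear relationship gives r = 1.0 and a tiny p-value
```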
Heatmap plots¶
pt = df.pivot_table(index = "origin", columns = "model_year", values = "mpg")
sns.heatmap(pt, cmap = 'coolwarm')
Pairplots¶
These are a great way to explore your data. The diagonal shows the distribution of each variable, and the off-diagonal panels show scatterplots of the variables plotted against each other. You can use the 'hue' option to color points by an additional variable.
Here I used only a subset of 'df' because there were initially too many columns. I used df.iloc[:,[0,4,5,7]] to specify all the rows (:) and a list of columns ([0,4,5,7]). Remember that numbering begins at 0 in Python!
sns.pairplot(data = df.iloc[:,[0,4,5,7]], hue = 'origin')
Saving a plot¶
You can save a plot as any image type just by adding .jpg, .tiff, .png, etc. to the file name. The image resolution can be set using the 'dpi' option. Note that you can include a longer file path to save to a specific location; otherwise it will save to your current working directory.
my_plot = sns.pairplot(data = df.iloc[:,[0,4,5,7]], hue = 'origin')
my_plot.savefig('my_plot.tiff', dpi = 300)
#pdf save option where text is editable in programs such as Illustrator:
plt.rcParams['pdf.fonttype'] = 42
my_plot.savefig('filename.pdf')
If you can't find your saved plot, you can check your current working directory using the following:
import os
print(os.getcwd())