Merge and Concatenation¶
pandas
provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
Hands on!¶
import pandas as pd
import numpy as np
Merge: Database-style Joins¶
The pd.merge()
function implements a number of types of joins:
- One-to-one join
- Many-to-one join
- Many-to-many join
As there are different types of 'merge', we'll be using the optional parameter
validate
to be sure that the merge will be done the way its supposed.
One-to-one join¶
This is the simplest type of merging data and its like a column-wise concatenation. The pd.merge()
function will recognize the column that both DataFrames have in common and automatically joins using that column.
Also, we can specify which column or index level names we want to use to join both DataFrames using the on
parameter.
df1 = pd.DataFrame({
'name': ['Lucas', 'Peter', 'Jane'],
'years': [24, 27, 24]
})
df1
df2 = pd.DataFrame({
'name': ['Lucas', 'Peter', 'Jane'],
'eye_color': ['blue', 'brown', 'brown']
})
df2
pd.merge(df1, df2, validate='one_to_one')
pd.merge(df1, df2, on='name', validate='one_to_one')
Many-to-one join¶
This type of join happens when one of the two key columns contains duplicate entries, and the resulting merged DataFrame will preserve those duplicate entries as appropriate.
df1 = pd.DataFrame({
'name': ['Lucas', 'Peter', 'Jane'],
'years': [24, 18, 32],
'university': ['Harvard', 'MIT', 'MIT']
})
df1
df2 = pd.DataFrame({
'university': ['Harvard', 'MIT'],
'established': [1636, 1861]
})
df2
In our case, merged DataFrame has a new established
column, where its information is repeated in two rows: MIT - 1861.
pd.merge(df1, df2, on='university', validate='many_to_one')
Many-to-many join¶
This type of join happens when the key column in both DataFrames contains duplicated values.
df1 = pd.DataFrame({
'name': ['Lucas', 'Peter', 'Jane'],
'years': [24, 18, 32],
'university': ['Harvard', 'MIT', 'MIT']
})
df1
df2 = pd.DataFrame({
'degree': ['economics', 'data science', 'computer science'],
'university': ['Harvard', 'MIT', 'MIT'],
})
df2
pd.merge(df1, df2, on='university', validate='many_to_many')
Merging on different column or index names¶
Sometimes we'll need to join different DataFrames on columns or index names that don't have the same name in both DataFrames.
The left_on
and right_on
keywords¶
To merge two datasets with different column names we can specify the column or index name to join on the left and right DataFrame using the left_on
and right_on
parameters.
Let's see an example:
directors = pd.DataFrame({
'Name': [
'Steven Spielberg',
'Martin Scorsese',
'Christopher Nolan',
'Quentin Tarantino',
'Francis Ford Coppola'
],
'Age': [71, 75, 47, 55, 79]
}, columns=['Name', 'Age'])
directors
movies = pd.DataFrame({
'Title': [
'The Godfather',
'Taxi Driver',
'Inception',
'Alien',
'Goodfellas',
'Pulp Fiction',
'Inglourious Basterds',
'The Godfather Part II',
'The Prestige'
],
'Directed By': [
'Francis Ford Coppola',
'Martin Scorsese',
'Christopher Nolan',
'Ridley Scott',
'Martin Scorsese',
'Quentin Tarantino',
'Quentin Tarantino',
'Francis Ford Coppola',
'Christopher Nolan'
],
'Year': [1972, 1976, 2010, 1979, 1990, 1994, 2009, 1974, 2006]
})
movies
directors
In this case we want to merge directors
and movies
DataFrames by its director's name.
In directors
DataFrame the director name is on Name
column but in movies
DataFrame it's on Directed By
column, so we'll need to use the left_on
and right_on
parameters to specify the right column names of each DataFrame we want to merge on:
pd.merge(directors, movies, left_on='Name', right_on='Directed By')
pd.merge(directors, movies, left_on='Name', right_on='Directed By', copy=False)
Specifying Set Arithmetic for Joins¶
By default, merge will match only rows that have keys present in both dataframes. That means that data is missing in both DataFrames: Steven Spielberg from directors
and Alien
from movies.
We can perform other style of merges:
inner
(default)outer
left
right
Inner¶
It's the default behaviour when merging two dataframes. The result contains the intersection of the two sets of inputs, as we have seen before.
pd.merge(directors, movies, left_on='Name', right_on='Directed By', how='inner')
Outer¶
Outer join will return a join over the union of the input columns, including all the rows redardless if they match or not and filling in all missing values with NaN
values.
pd.merge(directors, movies, left_on='Name', right_on='Directed By', how='outer')
As you can see, Alien is there, but there's no director information. The same happens with Steven Spielberg, without movie info.
Left and Right¶
As you might imagine, left and right will just include the missing values from dataframes defined at the left or right of the merge method:
pd.merge(directors, movies, left_on='Name', right_on='Directed By', how='left')
In this case, for left, ALL the directors are included, even if they don't have movies (Spielberg).
pd.merge(directors, movies, left_on='Name', right_on='Directed By', how='right')
And for right, it's the opposite, all movies were included, even though we don't have director info for Alien.
Merging on Indexes¶
directors_index = directors.copy().set_index('Name')
directors_index
pd.merge(directors_index, movies, left_index=True, right_on='Directed By')
Concatenation¶
This method is used to concatenate pandas objects along a particular axis with optional set logic along the other axes.
s1 = pd.Series(np.random.randint(0, 500, 3),
index=['A', 'B', 'C'])
s1.to_frame()
s2 = pd.Series(np.random.randint(0, 500, 2),
index=['D', 'E'])
s2.to_frame()
s3 = pd.Series(np.random.randint(0, 500, 3),
index=['F', 'G', 'H'])
s3.to_frame()
pd.concat([s1, s2, s3]).to_frame()
pd.concat([s1, s2, s3], axis=1, sort=True)
s4 = pd.Series(np.random.randint(0, 500, 3),
index=['B', 'A', 'D'])
s4.to_frame()
Concatenation with join¶
As we saw with merge
, the concatenation
function also has two join
modes:
outer
(default)inner
- Specific indexes
pd.concat?
Outer¶
Concatenate all values including all the values, regardless if they match or not. If they don't, there will be data missing.
pd.concat([s1, s4], axis=1, sort=True)
Inner¶
Concatenate just the values that match on both Series or DataFrame.
pd.concat([s1, s4], axis=1, join='inner')
Specific indexes¶
Concatenate just specified indexes instead of performing inner/outer set logic.
pd.concat([s1, s4], axis=1, join_axes=[['A', 'D']])
Dealing with duplicated indexes¶
When concatenating pandas objects, new concatenated axis could contains duplicated indexes.
To deal with this problem and keep our data consistent and without duplicated indexes, one possibility is removing the original indexes using the ignore_index
parameter:
s1 = pd.Series(np.random.randint(0, 500, 3),
index=['A', 'B', 'C'])
s1.to_frame()
s2 = pd.Series(np.random.randint(0, 500, 3),
index=['C', 'D', 'E'])
s2.to_frame()
pd.concat([s1, s2], ignore_index=True).to_frame()
Another possibility to deal with duplicated indexes is to construct a hierarchical index on the new concatenated axis using the keys
parameter:
new = pd.concat([s1, s2, s3], keys=['First', 'Second', 'Third'])
new.to_frame()
new.unstack().stack().to_frame()