Profile picture

4.2 - Merge and Concatenation

Last updated: April 3rd, 20192019-04-03Project preview

rmotr


Merge and Concatenation

pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

purple-divider

Hands on!

In [ ]:
import pandas as pd
import numpy as np

green-divider

Merge: Database-style Joins

The pd.merge() function implements a number of types of joins:

  • One-to-one join
  • Many-to-one join
  • Many-to-many join

As there are different types of 'merge', we'll be using the optional parameter validate to be sure that the merge will be done the way its supposed.

 One-to-one join

This is the simplest type of merging data and its like a column-wise concatenation. The pd.merge() function will recognize the column that both DataFrames have in common and automatically joins using that column.

Also, we can specify which column or index level names we want to use to join both DataFrames using the on parameter.

In [ ]:
df1 = pd.DataFrame({
    'name': ['Lucas', 'Peter', 'Jane'],
    'years': [24, 27, 24]
})

df1
In [ ]:
df2 = pd.DataFrame({
    'name': ['Lucas', 'Peter', 'Jane'],
    'eye_color': ['blue', 'brown', 'brown']
})

df2
In [ ]:
pd.merge(df1, df2, validate='one_to_one')
In [ ]:
pd.merge(df1, df2, on='name', validate='one_to_one')

 Many-to-one join

This type of join happens when one of the two key columns contains duplicate entries, and the resulting merged DataFrame will preserve those duplicate entries as appropriate.

In [ ]:
df1 = pd.DataFrame({
    'name': ['Lucas', 'Peter', 'Jane'],
    'years': [24, 18, 32],
    'university': ['Harvard', 'MIT', 'MIT']
})

df1
In [ ]:
df2 = pd.DataFrame({
    'university': ['Harvard', 'MIT'],
    'established': [1636, 1861]
})

df2

In our case, merged DataFrame has a new established column, where its information is repeated in two rows: MIT - 1861.

In [ ]:
pd.merge(df1, df2, on='university', validate='many_to_one')

 Many-to-many join

This type of join happens when the key column in both DataFrames contains duplicated values.

In [ ]:
df1 = pd.DataFrame({
    'name': ['Lucas', 'Peter', 'Jane'],
    'years': [24, 18, 32],
    'university': ['Harvard', 'MIT', 'MIT']
})

df1
In [ ]:
df2 = pd.DataFrame({
    'degree': ['economics', 'data science', 'computer science'],
    'university': ['Harvard', 'MIT', 'MIT'],
})

df2
In [ ]:
pd.merge(df1, df2, on='university', validate='many_to_many')

green-divider

Merging on different column or index names

Sometimes we'll need to join different DataFrames on columns or index names that don't have the same name in both DataFrames.

The left_on and right_on keywords

To merge two datasets with different column names we can specify the column or index name to join on the left and right DataFrame using the left_on and right_on parameters.

Let's see an example:

In [ ]:
directors = pd.DataFrame({
    'Name': [
        'Steven Spielberg',
        'Martin Scorsese',
        'Christopher Nolan',
        'Quentin Tarantino',
        'Francis Ford Coppola'
    ],
    'Age': [71, 75, 47, 55, 79]
}, columns=['Name', 'Age'])

directors
In [ ]:
movies = pd.DataFrame({
    'Title': [
        'The Godfather',
        'Taxi Driver',
        'Inception',
        'Alien',
        'Goodfellas',
        'Pulp Fiction',
        'Inglourious Basterds',
        'The Godfather Part II',
        'The Prestige'
    ],
    'Directed By': [
        'Francis Ford Coppola',
        'Martin Scorsese',
        'Christopher Nolan',
        'Ridley Scott',
        'Martin Scorsese',
        'Quentin Tarantino',
        'Quentin Tarantino',
        'Francis Ford Coppola',
        'Christopher Nolan'
    ],
    'Year': [1972, 1976, 2010, 1979, 1990, 1994, 2009, 1974, 2006]
})

movies
In [ ]:
directors

In this case we want to merge directors and movies DataFrames by its director's name.

In directors DataFrame the director name is on Name column but in movies DataFrame it's on Directed By column, so we'll need to use the left_on and right_on parameters to specify the right column names of each DataFrame we want to merge on:

In [ ]:
pd.merge(directors, movies, left_on='Name', right_on='Directed By')
In [ ]:
pd.merge(directors, movies, left_on='Name', right_on='Directed By', copy=False)

green-divider

Specifying Set Arithmetic for Joins

By default, merge will match only rows that have keys present in both dataframes. That means that data is missing in both DataFrames: Steven Spielberg from directors and Alien from movies.

We can perform other style of merges:

  • inner (default)
  • outer
  • left
  • right

Inner

It's the default behaviour when merging two dataframes. The result contains the intersection of the two sets of inputs, as we have seen before.

In [ ]:
pd.merge(directors, movies, left_on='Name', right_on='Directed By', how='inner')

Outer

Outer join will return a join over the union of the input columns, including all the rows redardless if they match or not and filling in all missing values with NaN values.

In [ ]:
pd.merge(directors, movies, left_on='Name', right_on='Directed By', how='outer')

As you can see, Alien is there, but there's no director information. The same happens with Steven Spielberg, without movie info.

Left and Right

As you might imagine, left and right will just include the missing values from dataframes defined at the left or right of the merge method:

In [ ]:
pd.merge(directors, movies, left_on='Name', right_on='Directed By', how='left')

In this case, for left, ALL the directors are included, even if they don't have movies (Spielberg).

In [ ]:
pd.merge(directors, movies, left_on='Name', right_on='Directed By', how='right')

And for right, it's the opposite, all movies were included, even though we don't have director info for Alien.

green-divider

Merging on Indexes

In [ ]:
directors_index = directors.copy().set_index('Name')
directors_index
In [ ]:
pd.merge(directors_index, movies, left_index=True, right_on='Directed By')

green-divider

Concatenation

This method is used to concatenate pandas objects along a particular axis with optional set logic along the other axes.

In [ ]:
s1 = pd.Series(np.random.randint(0, 500, 3),
               index=['A', 'B', 'C'])

s1.to_frame()
In [ ]:
s2 = pd.Series(np.random.randint(0, 500, 2),
               index=['D', 'E'])

s2.to_frame()
In [ ]:
s3 = pd.Series(np.random.randint(0, 500, 3),
               index=['F', 'G', 'H'])

s3.to_frame()
In [ ]:
pd.concat([s1, s2, s3]).to_frame()
In [ ]:
pd.concat([s1, s2, s3], axis=1, sort=True)
In [ ]:
s4 = pd.Series(np.random.randint(0, 500, 3),
               index=['B', 'A', 'D'])

s4.to_frame()

green-divider

Concatenation with join

As we saw with merge, the concatenation function also has two join modes:

  • outer (default)
  • inner
  • Specific indexes
In [ ]:
pd.concat?

 Outer

Concatenate all values including all the values, regardless if they match or not. If they don't, there will be data missing.

In [ ]:
pd.concat([s1, s4], axis=1, sort=True)

 Inner

Concatenate just the values that match on both Series or DataFrame.

In [ ]:
pd.concat([s1, s4], axis=1, join='inner')

 Specific indexes

Concatenate just specified indexes instead of performing inner/outer set logic.

In [ ]:
pd.concat([s1, s4], axis=1, join_axes=[['A', 'D']])

green-divider

Dealing with duplicated indexes

When concatenating pandas objects, new concatenated axis could contains duplicated indexes.

To deal with this problem and keep our data consistent and without duplicated indexes, one possibility is removing the original indexes using the ignore_index parameter:

In [ ]:
s1 = pd.Series(np.random.randint(0, 500, 3),
               index=['A', 'B', 'C'])

s1.to_frame()
In [ ]:
s2 = pd.Series(np.random.randint(0, 500, 3),
               index=['C', 'D', 'E'])

s2.to_frame()
In [ ]:
pd.concat([s1, s2], ignore_index=True).to_frame()

Another possibility to deal with duplicated indexes is to construct a hierarchical index on the new concatenated axis using the keys parameter:

In [ ]:
new = pd.concat([s1, s2, s3], keys=['First', 'Second', 'Third'])

new.to_frame()
In [ ]:
new.unstack().stack().to_frame()

purple-divider

Notebooks AI
Notebooks AI Profile20060