Co-founder @ RMOTR

Fuzzy String Matching

Last updated: October 18th, 2018

Find repeated strings in different text files

The problem: we have two text files containing company names, and the same company is sometimes spelled differently in each file. For example, one file has "Western Digital" and the other has "Western Digital Corp.".

This problem will help us demonstrate the difference between Imperative Programming and Declarative Programming.

We'll use Pandas to read the company names from CSV files and for all the data processing (especially in the declarative solution). We'll also use an excellent Python library for string comparison: fuzzywuzzy.

In [1]:
!pip install fuzzywuzzy
Requirement already satisfied: fuzzywuzzy in /usr/local/lib/python3.6/site-packages (0.17.0)
You are using pip version 18.0, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
In [2]:
import pandas as pd
import numpy as np
import itertools
from fuzzywuzzy import fuzz, process
/usr/local/lib/python3.6/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

Read each file into its own DataFrame

In [3]:
df1 = pd.read_csv('test_CSV_1.csv')
df2 = pd.read_csv('test_CSV_2.csv')

Each file contains a few hundred companies, so neither is too large.

In [4]:
df1.size, df2.size
Out[4]:
(266, 368)

We can extract the company names as regular NumPy arrays:

In [5]:
csv_1 = df1['CLIENT'].values
csv_2 = df2['Firm Name'].values

Here are the first 10 companies in each file:

In [6]:
csv_1[:10]
Out[6]:
array(['Adobe Systems, Inc.', 'Adventist Health', 'AECOM',
       'Aerojet Rockedyne Holdings (GenCorp)',
       'Alameda-Contra Costa Transit District',
       'Alaska Community Foundation',
       'Alaska Retirement Management Board', 'Alexander & Baldwin, Inc.',
       'Allergan, Inc.', 'Alyeska Pipeline Service Company'], dtype=object)
In [7]:
csv_2[:10]
Out[7]:
array(['AAA Northern California, Nevada & Utah Auto Exchange',
       'ACCO Engineered Systems', 'Adams County Retirement Plan',
       'Adidas America, Inc.', 'Adobe Systems, Inc.',
       'Advanced Micro Devices, Inc.', 'AECOM Technology Corporation',
       'Aera Energy LLC', 'Aerojet Rocketdyne Holdings, Inc.',
       'Agilent Technologies, Inc.'], dtype=object)

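Even in these first ten rows you can already spot "similar" companies that appear in both files: "AECOM" in csv_1 and "AECOM Technology Corporation" in csv_2, or "Aerojet Rockedyne Holdings (GenCorp)" in csv_1 and "Aerojet Rocketdyne Holdings, Inc." in csv_2.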
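To see why exact comparison is not enough here, the following is a small sketch that uses only the first ten names printed above, copied into plain Python lists. The names sample_1, sample_2, exact_matches and prefix_matches are ours for illustration and are not part of the notebook.

```python
# First ten names from each file, copied from the outputs above.
sample_1 = [
    'Adobe Systems, Inc.', 'Adventist Health', 'AECOM',
    'Aerojet Rockedyne Holdings (GenCorp)',
    'Alameda-Contra Costa Transit District',
    'Alaska Community Foundation',
    'Alaska Retirement Management Board', 'Alexander & Baldwin, Inc.',
    'Allergan, Inc.', 'Alyeska Pipeline Service Company',
]
sample_2 = [
    'AAA Northern California, Nevada & Utah Auto Exchange',
    'ACCO Engineered Systems', 'Adams County Retirement Plan',
    'Adidas America, Inc.', 'Adobe Systems, Inc.',
    'Advanced Micro Devices, Inc.', 'AECOM Technology Corporation',
    'Aera Energy LLC', 'Aerojet Rocketdyne Holdings, Inc.',
    'Agilent Technologies, Inc.',
]

# Exact matching keeps only names spelled identically in both lists.
exact_matches = set(sample_1) & set(sample_2)
print(exact_matches)   # {'Adobe Systems, Inc.'}

# A naive prefix check catches one more pair, but still misses the
# Aerojet pair, whose spellings differ in the middle of the name.
prefix_matches = [(a, b) for a in sample_1 for b in sample_2
                  if a != b and b.startswith(a)]
print(prefix_matches)  # [('AECOM', 'AECOM Technology Corporation')]
```

Neither check is tolerant of typos or extra words, which is the gap fuzzy matching is meant to fill.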

Fuzzy Matching

The fuzzywuzzy library offers several scoring functions, each returning a similarity score from 0 to 100. Here is how four of them rate the AECOM pair:

In [7]:
fuzz.ratio('AECOM', 'AECOM Technology Corporation')
Out[7]:
30
In [6]:
fuzz.partial_ratio('AECOM', 'AECOM Technology Corporation')
Out[6]:
100
In [8]:
fuzz.token_sort_ratio('AECOM', 'AECOM Technology Corporation')
Out[8]:
30
In [9]:
fuzz.token_set_ratio('AECOM', 'AECOM Technology Corporation')
Out[9]:
100
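The import warning above says fuzzywuzzy is falling back to Python's SequenceMatcher, so these scores can be approximated with difflib from the standard library. The sketch below is a simplified illustration, not the library's actual code: simple_ratio and token_set_sketch are hypothetical helpers, and the real functions do more preprocessing of the input strings.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two whole strings on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_sketch(a, b):
    """Compare the shared tokens against each string's sorted token set."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    shared = " ".join(sorted(tokens_a & tokens_b))
    full_a = (shared + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    full_b = (shared + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    # The best of the three pairwise comparisons is the final score.
    return max(simple_ratio(shared, full_a),
               simple_ratio(shared, full_b),
               simple_ratio(full_a, full_b))


short_name, long_name = 'AECOM', 'AECOM Technology Corporation'

# 5 matching characters out of 5 + 28 in total: 2 * 5 / 33 is about 0.30.
print(simple_ratio(short_name, long_name))      # 30

# The shared token "aecom" is all of the shorter name, hence a full score.
print(token_set_sketch(short_name, long_name))  # 100
```

Under these assumptions, the whole-string score is penalized by every extra character in the longer name, while the token-set score only looks at what the two names have in common.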

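We're going to use partial_ratio, as it's the one that best fits our use case: it scores the shorter string against the best-matching part of the longer one, so extra words such as "Corp." don't drag the score down. A few more examples: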

In [10]:
fuzz.partial_ratio('Dignity Health (Catholic Healthcare West)', 'Dignity Health')
Out[10]:
100
In [11]:
fuzz.partial_ratio('Western Digital', 'Western Digital Corp.')
Out[11]:
100
In [12]:
fuzz.partial_ratio('University of Southern California (USC)', 'University of Southern California')
Out[12]:
100
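The idea behind these partial scores can be sketched with a brute-force sliding window: compare the shorter string against every window of the same length in the longer string and keep the best score. This is an illustrative approximation built on difflib, and partial_ratio_sketch is a hypothetical helper. The library chooses its candidate windows differently, so scores for inexact matches may not agree exactly.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a, b):
    """Best 0-100 score of the shorter string against any
    same-length window of the longer string."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(100 * best))


# The shorter name appears verbatim inside the longer one, so one
# window matches it exactly and the score is 100.
print(partial_ratio_sketch('Western Digital', 'Western Digital Corp.'))  # 100
print(partial_ratio_sketch('Dignity Health (Catholic Healthcare West)',
                           'Dignity Health'))                            # 100
```

The argument order does not matter in this sketch, because the shorter string is always the one slid across the longer one.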

Next

Head over to 2. Imperative to see our imperative approach to tackling this problem.
