Find repeated strings in different text files¶
The problem: we have two text files containing company names, and the same company is sometimes spelled differently in each file. For example, one of them has "Western Digital" and the other has "Western Digital Corp.".
This problem will help us demonstrate the difference between Imperative Programming and Declarative Programming.
We'll use Pandas to read the company names from CSV files and for all the data processing (especially in the declarative solution). We'll also use an excellent Python library for string comparison: fuzzywuzzy.
!pip install fuzzywuzzy
import pandas as pd
import numpy as np
import itertools
from fuzzywuzzy import fuzz, process
Read each file into its own DataFrame:
df1 = pd.read_csv('test_CSV_1.csv')
df2 = pd.read_csv('test_CSV_2.csv')
Each file has roughly 200 to 300 companies, so neither is very large.
len(df1), len(df2)
And we can extract the company names as regular NumPy arrays:
csv_1 = df1['CLIENT'].values
csv_2 = df2['Firm Name'].values
Here are the first 10 companies in each file:
csv_1[:10]
csv_2[:10]
In this example you can already notice repeated "similar" companies: "AECOM" in csv_1 and "AECOM Technology Corporation" in csv_2.
Fuzzy Matching¶
The fuzzywuzzy library provides several scoring functions, each returning a similarity score between 0 and 100. Here are four of them applied to the same pair of names:
fuzz.ratio('AECOM', 'AECOM Technology Corporation')
fuzz.partial_ratio('AECOM', 'AECOM Technology Corporation')
fuzz.token_sort_ratio('AECOM', 'AECOM Technology Corporation')
fuzz.token_set_ratio('AECOM', 'AECOM Technology Corporation')
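We're going to use partial_ratio, as it's the one that best fits our use case: it scores the shorter string against the best-matching part of the longer one, so a name with an extra suffix still matches. A few more examples: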
fuzz.partial_ratio('Dignity Health (Catholic Healthcare West)', 'Dignity Health')
fuzz.partial_ratio('Western Digital', 'Western Digital Corp.')
fuzz.partial_ratio('University of Southern California (USC)', 'University of Southern California')
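To build some intuition for why a partial comparison suits these names, here is a minimal sketch that uses only the standard library's difflib. It is not fuzzywuzzy's actual implementation: the helper names simple_ratio and simple_partial_ratio are hypothetical, and the sliding-window search is a simplification of how a partial ratio can be computed.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Whole-string similarity on a 0-100 scale (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_partial_ratio(a: str, b: str) -> int:
    """Best similarity between the shorter string and any equally long
    window of the longer string (simplified sliding-window version)."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        # Assumption: an empty string matches nothing.
        return 0
    width = len(shorter)
    return max(
        simple_ratio(shorter, longer[start:start + width])
        for start in range(len(longer) - width + 1)
    )


short_name, long_name = 'AECOM', 'AECOM Technology Corporation'

# Whole-string score: penalised because most of the long name is unmatched.
print(simple_ratio(short_name, long_name))

# Partial score: 'AECOM' appears verbatim in the long name, so one window matches exactly.
print(simple_partial_ratio(short_name, long_name))
```

The whole-string score drops as the longer name gains extra words, while the partial score stays high whenever the shorter name appears inside the longer one, which is the pattern in our two files.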
Next¶
Head over to 2. Imperative to see our imperative approach to tackling this problem.