Find repeated strings in different text files
The problem: we have two text files containing company names, but the same company is sometimes spelled differently in each file. For example, one file has "Western Digital" and the other has "Western Digital Corp.".
This problem will help us demonstrate the difference between Imperative Programming and Declarative Programming.
We'll use Pandas to read the company names from the CSV files and for all the data processing (especially in the declarative solution). We'll also use an excellent Python library for string comparison: fuzzywuzzy.
!pip install fuzzywuzzy
Requirement already satisfied: fuzzywuzzy in /usr/local/lib/python3.6/site-packages (0.17.0) You are using pip version 18.0, however version 18.1 is available. You should consider upgrading via the 'pip install --upgrade pip' command.
import pandas as pd
import numpy as np
import itertools
from fuzzywuzzy import fuzz, process
/usr/local/lib/python3.6/site-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
Read each file into its own DataFrame:
df1 = pd.read_csv('test_CSV_1.csv')
df2 = pd.read_csv('test_CSV_2.csv')
Each file has roughly 200 to 300 companies, so the data is not too large.
We can extract the company names as regular NumPy arrays:
csv_1 = df1['CLIENT'].values
csv_2 = df2['Firm Name'].values
Here are the first 10 companies in each file:
array(['Adobe Systems, Inc.', 'Adventist Health', 'AECOM', 'Aerojet Rockedyne Holdings (GenCorp)', 'Alameda-Contra Costa Transit District', 'Alaska Community Foundation', 'Alaska Retirement Management Board', 'Alexander & Baldwin, Inc.', 'Allergan, Inc.', 'Alyeska Pipeline Service Company'], dtype=object)
array(['AAA Northern California, Nevada & Utah Auto Exchange', 'ACCO Engineered Systems', 'Adams County Retirement Plan', 'Adidas America, Inc.', 'Adobe Systems, Inc.', 'Advanced Micro Devices, Inc.', 'AECOM Technology Corporation', 'Aera Energy LLC', 'Aerojet Rocketdyne Holdings, Inc.', 'Agilent Technologies, Inc.'], dtype=object)
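In this example you can already notice repeated "similar" companies:
"AECOM" in the first file and "AECOM Technology Corporation" in the second.
fuzzywuzzy provides several ways to score how similar two such strings are: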
fuzz.ratio('AECOM', 'AECOM Technology Corporation')
fuzz.partial_ratio('AECOM', 'AECOM Technology Corporation')
fuzz.token_sort_ratio('AECOM', 'AECOM Technology Corporation')
fuzz.token_set_ratio('AECOM', 'AECOM Technology Corporation')
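To get a feel for why the scorers can disagree, here is a minimal sketch built only on the standard library's difflib. It is not fuzzywuzzy's implementation: `plain_score` and `sorted_token_score` are hypothetical helpers that only approximate the ideas behind `ratio` and `token_sort_ratio`, and "Digital Western" is a made-up input used purely for illustration.

```python
from difflib import SequenceMatcher


def plain_score(a, b):
    """Similarity of the raw strings on a 0-100 scale (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def sorted_token_score(a, b):
    """Lowercase, split on whitespace, sort the tokens, then compare.

    This only mimics the idea of a token-sort comparison; it is an
    assumption-laden approximation, not the library's code.
    """
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return plain_score(norm_a, norm_b)


# Word order hurts the plain comparison but not the token-sorted one.
print(plain_score("Digital Western", "Western Digital"))
print(sorted_token_score("Digital Western", "Western Digital"))

# Sorting tokens does not help when one name is just a longer version
# of the other, which is the situation we have with the company names.
print(sorted_token_score("AECOM", "AECOM Technology Corporation"))
```

Sorting tokens fixes reordered words, but a short name compared against a longer variant of itself still scores low, because the extra words count as differences.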
We're going to use partial_ratio, as it's the one that best fits our use case. A few more examples:
fuzz.partial_ratio('Dignity Health (Catholic Healthcare West)', 'Dignity Health')
fuzz.partial_ratio('Western Digital', 'Western Digital Corp.')
fuzz.partial_ratio('University of Southern California (USC)', 'University of Southern California')
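The intuition behind a partial comparison can be sketched with the standard library alone: slide the shorter string over the longer one and keep the best score. This is a simplified illustration under that assumption, not how fuzzywuzzy actually computes `partial_ratio`, and `simple_ratio` and `simple_partial_ratio` are hypothetical names.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings on a 0-100 scale (hypothetical helper)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def simple_partial_ratio(a, b):
    """Best score of the shorter string against any same-length window
    of the longer string (brute-force sketch, not the library's code)."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    window = len(shorter)
    best = 0
    for start in range(len(longer) - window + 1):
        candidate = longer[start:start + window]
        best = max(best, simple_ratio(shorter, candidate))
    return best


# The short name appears verbatim inside the long one, so the best
# window is a perfect match even though the full strings differ.
print(simple_ratio("Western Digital", "Western Digital Corp."))
print(simple_partial_ratio("Western Digital", "Western Digital Corp."))
print(simple_partial_ratio("AECOM", "AECOM Technology Corporation"))
```

Because suffixes such as "Corp." or a parenthesised former name fall outside the best-matching window, they are ignored, which is why a partial comparison suits company names that differ only by such additions.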