Terms like "fuzzy matching", "similarity searching", "string distance search", and "entity name resolution" have been used over the years to describe the process of matching organization and company names that appear as account names, vendor names, customer names, or anything else they represent in a database or dataset. The goal has been to identify spelling variations, non-standardized names, and other inconsistencies in data that cause significant issues in various types of data analysis. However, these traditional methods of duplicate identification have often achieved only limited success. If these data quality issues go unaddressed in any business or organizational scenario, they can have a tremendously negative effect in using data for reporting, decision-making, customer/prospect communication, and in effectiveness of AI models, delivering information technology ROI a serious blow.
Now, however, with the use of various innovative algorithmic techniques enhanced by modern, sophisticated AI models, we can achieve results that are vastly superior to previous approaches, including international data.
A great way to showcase this cutting-edge approach to organization name matching is by using pandas data frames within Python. Pandas data frames are incredibly versatile and useful for handling datasets due to their powerful and flexible data structures. They allow for efficient manipulation and analysis of data, offering a range of functions to filter and transform datasets seamlessly. The tabular format, similar to spreadsheets, makes it intuitive for users to visualize and manipulate large volumes of data. Additionally, pandas supports diverse data types, making it easy to merge, join, and concatenate different datasets.
This flexibility makes it easier to integrate and process datasets using Interzoid's cloud-native data quality platform. Simple cloud-native, JSON API calls can enable a new dimension of higher quality data to pandas datasets, significantly enhancing data analysis, decision-making, customer communication, AI model building, and other data-led purposes.
Additionally, pandas data frames integrate well with other Python libraries, making it a powerful tool for data processing, statistical analysis, data pipelines, and business intelligence workflows. This versatility and ease of use make pandas indispensable for data scientists, data engineers, and data analysts.
For an example of how similarity keys generated from Interzoid's APIs are used to identify inconsistent yet matching data, especially with international data, see the following blog entry.
To achieve this kind of matching in our code example, we will use Interzoid's Company & Organization Matching API. This is a scalar API, meaning we will call it once for each row we analyze. Since it is a JSON API, it can be used almost anywhere, making it easy to implement in this example.
Functionally, the API will be sent the name of an entity, such as an organization or company name, from each row in a data frame. The API will analyze and process the name using specialized algorithms, knowledge bases, machine learning techniques, and an AI language model. It will respond with a generated similarity key, which is essentially a hashed canonical key encapsulating the many different variations the organization or company name could have within a dataset. This makes it easy to match up names despite differences in their actual electronic, data-described representation. Refer to the aforementioned blog entry to learn more about similarity keys.
Here is the API endpoint we will use to process row values for matching purposes in this example:
url = 'https://api.interzoid.com/getcompanymatchadvanced'
We will use a few Python libraries in this code example. If not already installed in your environment, please install the libraries as follows:
$ pip install pandas
$ pip install requests
$ pip install tabulate
In this first example, we will call the Interzoid matching API for each row of the data frame we have created, obtaining a similarity key for each value in the 'org' column with the data frame's 'apply' method. You can obtain an API key to enable the API calls from www.interzoid.com, required to provide access to the matching API.
import pandas as pd
import requests
# Sample DataFrame
data = {
'org': ['ibm inc', 'Microsoft Corp.', 'go0gle llc','IBM','Google','Microsot', 'Amazon', 'microsfttt']
}
df = pd.DataFrame(data)
# API details
url = 'https://api.interzoid.com/getcompanymatchadvanced'
headers = {
'x-api-key': 'Your-Interzoid-API-Key' # Get key at interzoid.com
}
# Function to call the API and get the simkey
def get_simkey(org):
params = {
'company': org,
'algorithm': 'ai-plus'
}
response = requests.get(url, params=params, headers=headers)
if response.status_code == 200:
data = response.json()
return data.get("SimKey", None)
else:
return None
# Apply the function to each row in the DataFrame
df['simkey'] = df['org'].apply(get_simkey)
# Sort the DataFrame by simkey
df_sorted = df.sort_values(by='simkey')
# Display the sorted DataFrame
print(df_sorted)
When executed, this Python code will call the matching API for each row value in the org column within the data frame. The generated similarity key will then be placed in the simkey column. After each row is processed, the data frame is sorted so that organization names with the same similarity key line up next to each other:
org simkey
0 ibm inc edplDLsBWcH9Sa7ZECaJx8KiEl5lvMWAa6ackCA4azs
3 IBM edplDLsBWcH9Sa7ZECaJx8KiEl5lvMWAa6ackCA4azs
2 go0gle llc pGWzK9MrYZzcyOrW5AkpnJYiOgI3qnO0EhwsuNh_dxk
4 Google pGWzK9MrYZzcyOrW5AkpnJYiOgI3qnO0EhwsuNh_dxk
6 Amazon tyGzXZjfZUqhgqt6mqNZF8MCsn-QQV1NJbysxSTB7aI
1 Microsoft Corp. xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
5 Microsot xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
7 microsfttt xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
To make the results more readable and resembling something more like a report, let's make the following changes to our code. We will add a space between the records of each matching set of similarity keys. Additionally, let's not show the entries where an organization or company name has no other data value that shares the same similarity key. This will ensure that we will only display rows that have matches, enabling us to clearly see the data redundancy that exists in our dataset.
import pandas as pd
import requests
from tabulate import tabulate
# Sample DataFrame
data = {
'org': ['ibm inc', 'Microsoft Corp.', 'go0gle llc','IBM','Google','Microsot', 'Amazon', 'microsfttt']
}
df = pd.DataFrame(data)
# API details
url = 'https://api.interzoid.com/getcompanymatchadvanced'
headers = {
'x-api-key': 'Your-Interzoid-API-Key' # Get key at interzoid.com
}
# Function to call the API and get the simkey
def get_simkey(org):
params = {
'company': org,
'algorithm': 'ai-plus'
}
response = requests.get(url, params=params, headers=headers)
if response.status_code == 200:
data = response.json()
return data.get("SimKey", None)
else:
return None
# Apply the function to each row in the DataFrame
df['simkey'] = df['org'].apply(get_simkey)
# Sort the DataFrame by simkey
df_sorted = df.sort_values(by='simkey').reset_index(drop=True)
# Filter out records that don't have at least one other record with the same simkey
filtered_df = df_sorted[df_sorted.duplicated(subset=['simkey'], keep=False)]
# Proceed only if there are records with duplicate simkeys
if not filtered_df.empty:
# Insert blank lines where simkey changes
output_rows = []
previous_simkey = None
for index, row in filtered_df.iterrows():
if previous_simkey is not None and row['simkey'] != previous_simkey:
# Insert a blank row (as a dictionary of NaN values)
blank_row = pd.Series([None] * len(filtered_df.columns), index=filtered_df.columns)
output_rows.append(blank_row)
output_rows.append(row)
previous_simkey = row['simkey']
# Create a new DataFrame from the rows with blank lines inserted
output_df = pd.concat(output_rows, axis=1).T.reset_index(drop=True)
# Replace None with empty string
output_df.fillna('', inplace=True)
# Convert the DataFrame to a table with left-justified columns
table = tabulate(output_df, headers='keys', tablefmt='plain', stralign='left')
# Print the table
print(table)
else:
print("No records with duplicate simkeys found.")
Here are the formatted matched results with the refined, tabulated report showing only matched data rows:
ibm inc edplDLsBWcH9Sa7ZECaJx8KiEl5lvMWAa6ackCA4azs
IBM edplDLsBWcH9Sa7ZECaJx8KiEl5lvMWAa6ackCA4azs
go0gle llc pGWzK9MrYZzcyOrW5AkpnJYiOgI3qnO0EhwsuNh_dxk
Google pGWzK9MrYZzcyOrW5AkpnJYiOgI3qnO0EhwsuNh_dxk
Microsoft Corp. xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
Microsot xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
microsfttt xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
From here, the possibilities are endless. You can add additional business logic and columns to refine matches further if desired. These similarity keys can also be used for searching, matching data across datasets for data enhancement, and much more.
Questions or would like to put it to use with your own data? www.interzoid.com