Introducing our Snowflake Data Cloud Native Application: AI-Driven Data Quality built into SQL statements! Learn More

Identifying Duplicate Organization Names in Data using AI with Python and Pandas

by Interzoid Team


Python Pandas Duplicate Data AI Matching

This is a Python example to generate AI-enriched match reports that identify redundant organization and company entities in datasets with pandas.

Terms like "fuzzy matching", "similarity searching", "string distance search", and "entity name resolution" have been used over the years to describe the process of matching organization and company names that appear as account names, vendor names, customer names, or anything else they represent in a database or dataset. The goal has been to identify spelling variations, non-standardized names, and other inconsistencies in data that cause significant issues in various types of data analysis. However, these traditional methods of duplicate identification have often achieved only limited success. If these data quality issues go unaddressed in any business or organizational scenario, they can have a tremendously negative effect in using data for reporting, decision-making, customer/prospect communication, and in effectiveness of AI models, delivering information technology ROI a serious blow.

Now, however, with the use of various innovative algorithmic techniques enhanced by modern, sophisticated AI models, we can achieve results that are vastly superior to previous approaches, including international data.

A great way to showcase this cutting-edge approach to organization name matching is by using pandas data frames within Python. Pandas data frames are incredibly versatile and useful for handling datasets due to their powerful and flexible data structures. They allow for efficient manipulation and analysis of data, offering a range of functions to filter and transform datasets seamlessly. The tabular format, similar to spreadsheets, makes it intuitive for users to visualize and manipulate large volumes of data. Additionally, pandas supports diverse data types, making it easy to merge, join, and concatenate different datasets.

This flexibility makes it easier to integrate and process datasets using Interzoid's cloud-native data quality platform. Simple cloud-native, JSON API calls can enable a new dimension of higher quality data to pandas datasets, significantly enhancing data analysis, decision-making, customer communication, AI model building, and other data-led purposes.

Additionally, pandas data frames integrate well with other Python libraries, making it a powerful tool for data processing, statistical analysis, data pipelines, and business intelligence workflows. This versatility and ease of use make pandas indispensable for data scientists, data engineers, and data analysts.

For an example of how similarity keys generated from Interzoid's APIs are used to identify inconsistent yet matching data, especially with international data, see the following blog entry.

To achieve this kind of matching in our code example, we will use Interzoid's Company & Organization Matching API. This is a scalar API, meaning we will call it once for each row we analyze. Since it is a JSON API, it can be used almost anywhere, making it easy to implement in this example.

Functionally, the API will be sent the name of an entity, such as an organization or company name, from each row in a data frame. The API will analyze and process the name using specialized algorithms, knowledge bases, machine learning techniques, and an AI language model. It will respond with a generated similarity key, which is essentially a hashed canonical key encapsulating the many different variations the organization or company name could have within a dataset. This makes it easy to match up names despite differences in their actual electronic, data-described representation. Refer to the aforementioned blog entry to learn more about similarity keys.

Here is the API endpoint we will use to process row values for matching purposes in this example:


    url = 'https://api.interzoid.com/getcompanymatchadvanced'
                

We will use a few Python libraries in this code example. If not already installed in your environment, please install the libraries as follows:


    $ pip install pandas
    $ pip install requests
    $ pip install tabulate
                

In this first example, we will call the Interzoid matching API for each row of the data frame we have created, obtaining a similarity key for each value in the 'org' column with the data frame's 'apply' method. You can obtain an API key to enable the API calls from www.interzoid.com, required to provide access to the matching API.


    import pandas as pd
    import requests

    # Sample DataFrame
    data = {
        'org': ['ibm inc', 'Microsoft Corp.', 'go0gle llc','IBM','Google','Microsot', 'Amazon', 'microsfttt']
    }
    df = pd.DataFrame(data)

    # API details
    url = 'https://api.interzoid.com/getcompanymatchadvanced'
    headers = {
        'x-api-key': 'Your-Interzoid-API-Key'   # Get key at interzoid.com
    }

    # Function to call the API and get the simkey
    def get_simkey(org):
        params = {
            'company': org,
            'algorithm': 'ai-plus'
        }

        response = requests.get(url, params=params, headers=headers)

        if response.status_code == 200:
            data = response.json()
            return data.get("SimKey", None)
        else:
            return None

    # Apply the function to each row in the DataFrame
    df['simkey'] = df['org'].apply(get_simkey)

    # Sort the DataFrame by simkey
    df_sorted = df.sort_values(by='simkey')

    # Display the sorted DataFrame
    print(df_sorted)
        

When executed, this Python code will call the matching API for each row value in the org column within the data frame. The generated similarity key will then be placed in the simkey column. After each row is processed, the data frame is sorted so that organization names with the same similarity key line up next to each other:


            org                          simkey
    0          ibm inc  edplDLsBWcH9Sa7ZECaJx8KiEl5lvMWAa6ackCA4azs
    3              IBM  edplDLsBWcH9Sa7ZECaJx8KiEl5lvMWAa6ackCA4azs
    2       go0gle llc  pGWzK9MrYZzcyOrW5AkpnJYiOgI3qnO0EhwsuNh_dxk
    4           Google  pGWzK9MrYZzcyOrW5AkpnJYiOgI3qnO0EhwsuNh_dxk
    6           Amazon  tyGzXZjfZUqhgqt6mqNZF8MCsn-QQV1NJbysxSTB7aI
    1  Microsoft Corp.  xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
    5         Microsot  xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
    7       microsfttt  xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
            

To make the results more readable and resembling something more like a report, let's make the following changes to our code. We will add a space between the records of each matching set of similarity keys. Additionally, let's not show the entries where an organization or company name has no other data value that shares the same similarity key. This will ensure that we will only display rows that have matches, enabling us to clearly see the data redundancy that exists in our dataset.


    import pandas as pd
    import requests
    from tabulate import tabulate

    # Sample DataFrame
    data = {
        'org': ['ibm inc', 'Microsoft Corp.', 'go0gle llc','IBM','Google','Microsot', 'Amazon', 'microsfttt']
    }
    df = pd.DataFrame(data)

    # API details
    url = 'https://api.interzoid.com/getcompanymatchadvanced'
    headers = {
        'x-api-key': 'Your-Interzoid-API-Key'   # Get key at interzoid.com
    }

    # Function to call the API and get the simkey
        def get_simkey(org):
            params = {
                'company': org,
                'algorithm': 'ai-plus'
            }

            response = requests.get(url, params=params, headers=headers)

            if response.status_code == 200:
                data = response.json()
                return data.get("SimKey", None)
            else:
                return None

        # Apply the function to each row in the DataFrame
        df['simkey'] = df['org'].apply(get_simkey)

        # Sort the DataFrame by simkey
        df_sorted = df.sort_values(by='simkey').reset_index(drop=True)

        # Filter out records that don't have at least one other record with the same simkey
        filtered_df = df_sorted[df_sorted.duplicated(subset=['simkey'], keep=False)]

        # Proceed only if there are records with duplicate simkeys
        if not filtered_df.empty:
            # Insert blank lines where simkey changes
            output_rows = []
            previous_simkey = None

            for index, row in filtered_df.iterrows():
                if previous_simkey is not None and row['simkey'] != previous_simkey:
                    # Insert a blank row (as a dictionary of NaN values)
                    blank_row = pd.Series([None] * len(filtered_df.columns), index=filtered_df.columns)
                    output_rows.append(blank_row)
                output_rows.append(row)
                previous_simkey = row['simkey']

            # Create a new DataFrame from the rows with blank lines inserted
            output_df = pd.concat(output_rows, axis=1).T.reset_index(drop=True)

            # Replace None with empty string
            output_df.fillna('', inplace=True)

            # Convert the DataFrame to a table with left-justified columns
            table = tabulate(output_df, headers='keys', tablefmt='plain', stralign='left')

            # Print the table
            print(table)
        else:
            print("No records with duplicate simkeys found.")
            

Here are the formatted matched results with the refined, tabulated report showing only matched data rows:


    ibm inc          edplDLsBWcH9Sa7ZECaJx8KiEl5lvMWAa6ackCA4azs
    IBM              edplDLsBWcH9Sa7ZECaJx8KiEl5lvMWAa6ackCA4azs

    go0gle llc       pGWzK9MrYZzcyOrW5AkpnJYiOgI3qnO0EhwsuNh_dxk
    Google           pGWzK9MrYZzcyOrW5AkpnJYiOgI3qnO0EhwsuNh_dxk

    Microsoft Corp.  xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
    Microsot         xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA
    microsfttt       xUhcrilUNsRiCthe7rXkIupHiCbhhgyLrKNAcXruwoA

From here, the possibilities are endless. You can add additional business logic and columns to refine matches further if desired. These similarity keys can also be used for searching, matching data across datasets for data enhancement, and much more.

Questions or would like to put it to use with your own data? www.interzoid.com

See our Snowflake Native Application. Achieve Data Quality built-in to SQL statements.
Identify inconsistent and duplicate data quickly and easily in data tables and files.
More...
Connect Directly to Cloud SQL Databases and Perform Data Quality Analysis
Achieve better, more consistent, more usable data.
More...
Try our Pay-as-you-Go Option
Start increasing the usability and value of your data - start small and grow with success.
More...
Launch Our Entire Data Quality Matching System on an AWS EC2 Instance
Deploy to the instance type of your choice in any AWS data center globally. Start analyzing data and identifying matches across many databases and file types in minutes.
More...
Free Usage Credits
Register for an Interzoid API account and receive free usage credits. Improve the value and usability of your strategic data assets now.
Automate API Integration into Cloud Databases
Run live data quality exception and enhancement reports on major Cloud Data Platforms direct from your browser.
More...
Check out our APIs and SDKs
Easily integrate better data everywhere.
More...
Example API Usage Code on Github
Sample Code for invoking APIs on Interzoid in multiple programming languages
Business Case: Cloud APIs and Cloud Databases
See the business case for API-driven data enhancement - directly within your important datasets
More...
Documentation and Overview
See our documentation site.
More...
Product Newsletter
Receive Interzoid product and technology updates.
More...