
Matching and Merging Data Across Files using Similarity Keys

by Interzoid Team


Posted on March 10th, 2020



Merging multiple files of data can yield clearer, more comprehensive pictures of prospects, customers, and business opportunities. Data points spread across multiple sources are a classic case where the whole is greater than the sum of its parts, if only the data could be combined. Combining them opens up many uses for the newly merged data, including analytics, machine learning, and other forms of data analysis.

However, matching data across tables or merging files can be a challenge, especially if the only overlapping field is a textual field such as "company name", where the same data could be represented in countless ways. Inconsistently represented data, which is quite common, makes it very difficult to match data using SQL queries or simple merge logic. Match rates will typically be very low, which stands in the way of aggregating data from multiple sources.

One technique that can make great headway in merging data sources is the use of API-generated similarity keys, where algorithms are used to eliminate data variance to generate a canonical key. The key is then used for matching rather than the actual data. This can increase match rates tremendously, making the desired data merge a reality.
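To make the idea of a canonical key concrete, here is a minimal sketch in Python. It is a toy stand-in, not the algorithm behind any real matching API: the function name `toy_similarity_key` and the `NOISE_WORDS` list are assumptions made purely for illustration, and a production service would apply far richer rules.

```python
import re

# Hypothetical list of words to ignore; chosen only for this illustration.
NOISE_WORDS = {
    "the", "and", "inc", "incorporated", "corp", "corporation",
    "co", "company", "llc", "ltd", "limited",
}

def toy_similarity_key(company_name: str) -> str:
    """Collapse common variations of a company name into one canonical key."""
    # Lowercase and treat "&" as a word separator.
    text = company_name.lower().replace("&", " ")
    # Drop punctuation so "I.B.M." and "IBM" look the same.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Remove noise words such as legal suffixes.
    tokens = [t for t in text.split() if t not in NOISE_WORDS]
    # Join without spaces so spacing differences do not matter.
    return "".join(tokens).upper()

# Three different spellings produce the same key.
keys = {toy_similarity_key(n) for n in ["I.B.M. Corp.", "IBM Corporation", "ibm, inc"]}
```

Because all three spellings reduce to one key, an exact-match comparison on the key succeeds where a comparison on the raw names would fail.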

For example, here is a snapshot case of matching two files where the only common field is a company name field:

[Figure: Matching and Merging Company Name Data. Note the inconsistency in company name data, which makes it difficult to match and merge.]

We can use each company name value to generate a similarity key by calling a Company Name Matching API once per record, where each request passes the company name string and receives the corresponding key. In this example, we will store the keys in the same table as a new column. Alternatively, the keys could be stored in a separate table, linked back to the company names through a primary key.

We will use this similarity key, rather than the actual company name text, as the basis for matching in our queries. This produces a much higher match rate.
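The step of adding a key column to each record might be sketched as follows. Everything here is an assumption for illustration: `add_similarity_keys`, the `simkey` column name, and `stub_key_fn` are hypothetical, and `stub_key_fn` is a local placeholder for whatever function actually calls the matching API. Identical names are cached so each distinct value is only looked up once.

```python
from typing import Callable, Dict, List

def add_similarity_keys(
    records: List[dict],
    name_field: str,
    key_fn: Callable[[str], str],
    key_field: str = "simkey",  # hypothetical column name
) -> List[dict]:
    """Return copies of the records with a similarity key column added."""
    cache: Dict[str, str] = {}
    keyed = []
    for rec in records:
        name = rec[name_field]
        if name not in cache:
            # One lookup per distinct name; key_fn stands in for the API call.
            cache[name] = key_fn(name)
        keyed.append({**rec, key_field: cache[name]})
    return keyed

def stub_key_fn(name: str) -> str:
    """Local placeholder for an API call; NOT a real matching algorithm."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    words = [w for w in kept.split() if w not in {"inc", "corp", "corporation"}]
    return "".join(words).upper()

rows = [{"company": "Acme Inc."}, {"company": "ACME Corp"}]
keyed_rows = add_similarity_keys(rows, "company", stub_key_fn)
```

Passing the key function in as a parameter keeps the column-building logic independent of how the keys are obtained, so the stub can be swapped for a real API client without changing the rest of the code.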


[Figure: Adding Matching Keys to Company Name Data]

In the output below, the merged file shows the combined data (similarity keys don't need to be included in the merged file of course).

[Figure: Examples of Catching Duplicate, Inconsistent Company Name Data]

We now have a combined dataset as a result of being able to match across files with data that is inherently inconsistent in its representation.
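As a rough sketch of the merge itself, the following Python example loads two keyed datasets into an in-memory SQLite database and joins them on the similarity key instead of the company name. The table names, the `revenue` and `contact` columns, the sample rows, and the key values `K1` to `K3` are all made up for illustration; real keys would come from the matching API.

```python
import sqlite3

def merge_on_similarity_key(file_a, file_b):
    """Join two keyed datasets on simkey and return the merged rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE file_a (company TEXT, simkey TEXT, revenue TEXT)")
    conn.execute("CREATE TABLE file_b (company TEXT, simkey TEXT, contact TEXT)")
    conn.executemany("INSERT INTO file_a VALUES (?, ?, ?)", file_a)
    conn.executemany("INSERT INTO file_b VALUES (?, ?, ?)", file_b)
    # Match on the key column; the key itself is left out of the output.
    rows = conn.execute(
        """
        SELECT a.company, a.revenue, b.contact
        FROM file_a AS a
        JOIN file_b AS b ON a.simkey = b.simkey
        ORDER BY a.company
        """
    ).fetchall()
    conn.close()
    return rows

# Hypothetical sample data: names differ across files, keys agree.
file_a = [("Acme Inc.", "K1", "10M"), ("Globex Corporation", "K2", "25M")]
file_b = [
    ("ACME, Incorporated", "K1", "Pat"),
    ("Globex Corp", "K2", "Lee"),
    ("Initech", "K3", "Sam"),
]
merged = merge_on_similarity_key(file_a, file_b)
```

A join on the raw `company` column would return no rows for this sample data, since no two names are spelled identically; the join on `simkey` pairs both Acme and Globex records.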

