Posted on March 10th, 2021
In this step-by-step walkthrough example, Interzoid’s Cloud Data Connect Product is used to generate a "Company Name Match Report" showing possible data redundancy using data within and while connected to a CockroachDB instance in the CockroachCloud, and without any coding required. We will first launch a free instance of the CockroachCloud (in just a few clicks), create a database table, and then populate it with data for analysis.
The CockroachCloud is a distributed-database-as-a-service offering from Cockroach Labs, the creators of CockroachDB. By offering CockroachDB as-a-service, customers can take advantage of the CockroachDB distributed database platform without having to manage any of the underlying infrastructure or perform any standard database management duties (such as backups). This includes having CockroachCloud site reliability managers, actual humans, constantly monitoring the system and ensuring its reliability.
The advantages of a distributed database platform is that data is replicated across multiple geographic locations. This reduces the latency of queries from different geographies as data is physically closer to the user, and also becomes fault-tolerant, remaining up and running even if a given data center becomes offline. CockroachDB, a relational, SQL-based database management system, is now being offered as a managed database platform called the CockroachCloud.
Cockroach Labs is making an entry-level configuration of CockroachDB available FREE for the CockroachCloud, which includes one virtual CPU and 5GB of a storage. This doesn't enable the distributed features, but does enable testing, proof-of-concepts, and other development to occur that can then be scaled up as need dictates when additional horsepower is needed.
Since CockroachDB is SQL-based, including on the CockroachCloud, Interzoid is able to provide connectivity to it within our Interzoid Cloud Data Connect product. This automates connecting to the CockroachCloud, retrieving data as desired, calling our APIs to perform some advanced data quality analysis, and reporting the findings - all without any coding.
The premise of matching similar data for data quality purposes is that Interzoid’s Company Matching API generates “similarity keys” based on matching algorithms, machine learning, and data-specific knowledge bases. In this case, these keys help identify similar company or organization names within a single database or across multiple data sources.
The use of these similarity keys helps to overcome the issues and challenges caused by inconsistent alphanumeric data for representing company names within a dataset. The technique can provide significantly greater impact and value for Cloud Data Warehousing efforts in making the resident data cleaner, more consistent, more accurate, more usable, and therefore more valuable. It can also help identify data quality challenges in underlying source data that feed a data warehouse that needs to be addressed.
It is important and useful not only to identify cases of redundant data records in a single dataset, but data matching via similarity keys can also significantly enhance the ability to match across multiple datasets where data can be inconsistently represented. The ability to get higher match rates when combining data from multiple sources to create new datasets or enhance existing ones can also add significant value.
For example, the following similarity keys were generated for the company names below.
Using the Interzoid Company Matching API, likely matches will generate the same similarity key. You can see in the above example that similar company names generate the same similarity key. This makes them easy to identify as part of a data query.
To run the match report for this walkthrough using CockroachDB data, you can download Interzoid's Cloud Data Connect here.
You will need to register-api-account at the Interzoid Web site to get your trial API license key, required to use the product, which accesses the APIs over the Cloud at runtime.
If you would like to run larger tests, contact support@interzoid.com and mention this blog to receive a larger number of free usage credits.
How to connect to and create match reports within CockroachCloud:
There are four steps in this walkthrough:
Step #1: Signup for your free instance of CockroachCloud.
Step #2: Load data into your instance.
Step #3: Execute the match report.
Step #4: View the results.
Step #1 – Signup for your free instance of CockroachCloud.
Visit https://www.cockroachlabs.com/product/cockroachcloud/ to setup your free CockroachDB database instance on the CockroachCloud. As of this blog publishing, the CockroachCloud is still in beta. A free instance, listed as "free forever", provides one virtual CPU and five gigabytes of storage.
After you provide the requested signup information, you will be given a choice to deploy the database instance to either Google Cloud or AWS. Your choice is immaterial to this blog, so go ahead and choose either.
Next, download the CockroachCloud SQL Client and connect to it as follows. To do so, log in to the Cockroach Cloud if you have not already. Once logged in, you will be at the Cluster overview page.
Click the Connect button in the upper-right corner.
A connection information dialog window will appear. Across the top, you will see three tabs, "Command Line", "Connection String", and "Connection Parameters". Explore these tabs, as we will make use of all three in this walkthrough.
We are going to start with the "Command Line" tab.
First, it will ask you to download a certificate to enable secure mode communication with your CockroachCloud database instance. Of course you should do that in proof-of-concept and production uses, but for simplicity, we will skip that for the purposes of this blog (feel free to do download it and incorporate into the walkthrough yourself if so inclined).
In this walkthrough, we are going to be using the Linux SQL client to setup the initial database (you can use the Mac or Windows clients as well if you would like, as the SQL is straight-forward.)
Here is the current Linux download command to get the client set up:
$ wget -qO- https://binaries.cockroachdb.com/cockroach-v20.2.4.linux-amd64.tgz | tar xvz;
cp -i cockroach-v20.2.4.linux-amd64/cockroach /usr/local/bin/
Once the client is installed, we will then connect to the CockroachCloud via the terminal. Note the change to the sslmode parameter below. Here is an example (edit yours as necessary)
$ cockroach sql --url
'postgres://[yourusername]@free-tier.gcp-us-central1.cockroachlabs.cloud:26257/defaultdb?sslmode=require&options=--cluster=[yourclustername]'
Note the sslmode option in the connection string, allowing us to bypass using a certificate for simplicity.
You will be prompted for your password. We will then be connected to the SQL shell.
For help or additional setup information, assistance in setting up on Darwin platforms, and to explore other configuration options, visit CockroachCloud Quick Start.
Next, we will create a data table in your instance of the CockroachCloud and populate it with some data that we will use for our analysis.
Step #2 - Load data into your instance of the CockroachCloud
First, we will create a relational data table to store our data. We don't have to create a database as we are automatically connected to the default one.
CREATE TABLE companies (
ID SERIAL PRIMARY KEY NOT NULL,
company TEXT NOT NULL);
We can use SQL Insert statements to load some sample data into the table we have just created. Just copy and paste the entire list of SQL commands below:
INSERT INTO companies (company)
VALUES ('Apple Computer');
INSERT INTO companies (company)
VALUES ('Apple Inc.');
INSERT INTO companies (company)
VALUES ('Arctic Floors');
INSERT INTO companies (company)
VALUES ('Hilton');
INSERT INTO companies (company)
VALUES ('Arctic Flooring Inc.');
INSERT INTO companies (company)
VALUES ('Ford');
INSERT INTO companies (company)
VALUES ('Ford Motors');
INSERT INTO companies (company)
VALUES ('IBM');
INSERT INTO companies (company)
VALUES ('Google');
INSERT INTO companies (company)
VALUES ('Hilten Hotels Corp');
INSERT INTO companies (company)
VALUES ('Ford Motor Co');
INSERT INTO companies (company)
VALUES ('APPLE');
To ensure all of our data is now in the table, we can run a quick SQL query that will show us all of the data that we have just entered.
SELECT * FROM companies;
We now have data in our table and are ready to perform the analysis.
Step #3 - Execute the match report
To support both Windows and Linux and ease of inserting into workflows, the Interzoid Cloud Data Connect product runs from the command line. Once installed, you simply run it from a Windows or Linux command prompt with the following parameters (see documentation for more details):
function – name of Interzoid API function to call
target – native, odbc, or textfile
dsn – ODBC connection name if using ODBC
license – your Interzoid API license key
algo – matching algorithm used with the Company Match API
table – database table name within CockroachDB
column – name of table column in database to use as basis for match algorithms
To execute, from the command line of either Windows or Linux in your install directory, enter the following:
$ izdc -function=companymatch -target=cockroachdb -license=[Your Interzoid API License Key] -dsn="postgres://[user]:[password]@free-tier.[yoururl]/defaultdb?sslmode=require&options=--cluster=[yourcluster]" -table=companies -column=company
...where "user" is your CockroachCloud user name, "password" is your CockroachCloud password, "yoururl" is the url of your CockroachCloud connection string, and "yourcluster" is your unique CockroachCloud cluster identifier generated by the CockroachCloud at instance creation. See the "Connection string" tab in the "Connection Info" panel after connecting on the CockroachCloud site to get the correct information. Be sure to include your password in the connection string.
…and click [Enter]
Step #4: View the results
You will then see the match report showing candidate matches in text form upon successful execution. The similarity keys used to match similar company names are also displayed for reference.
Match Report
================
Ford Motor Co,YuH67lS2iG-euwCx3Alp_QHNDGM29jYSSHTbNW_nhGE
Ford Motors,YuH67lS2iG-euwCx3Alp_QHNDGM29jYSSHTbNW_nhGE
Ford,YuH67lS2iG-euwCx3Alp_QHNDGM29jYSSHTbNW_nhGE
Arctic Floors,afS-_5jFe8WrYxY2pM5mk7V0YFwcOIKJP62ficb0Vac
Arctic Flooring Inc.,afS-_5jFe8WrYxY2pM5mk7V0YFwcOIKJP62ficb0Vac
Hilton,bYJkyBAaMbvxkPMDf74Mul-LfR3jO8LxDyRvS-B-_k8
Hilton,bYJkyBAaMbvxkPMDf74Mul-LfR3jO8LxDyRvS-B-_k8
Hilten Hotels Corp,bYJkyBAaMbvxkPMDf74Mul-LfR3jO8LxDyRvS-B-_k8
Apple Computer,cZdRqd6Ya6FBDPmFpn4_USiTu2DVoYO32ANw1Z5NYN0
Apple Inc.,cZdRqd6Ya6FBDPmFpn4_USiTu2DVoYO32ANw1Z5NYN0
APPLE,cZdRqd6Ya6FBDPmFpn4_USiTu2DVoYO32ANw1Z5NYN0
You can continue to enter or load more data into CockroachCloud and run again to perform additional testing. If you need more Interzoid usage credits for exploration and testing, contact us at support@interzoid.com and mention this blog.
Now that matches have been identified, there are several potential courses of action to take, depending on business requirements.
For example, data can be exported from the CockroachCloud for further analysis. Results can be saved/exported to a delimited file for review or additional processing.
If this is a raw marketing list, redundant data could be eliminated to save outreach costs.
If this is a customer or prospect list, logic can be created to collapse redundant records into a single instance of a company name while maintaining important data from each of the records to keep in the surviving record.
This walkthrough is only one example of integrating Interzoid APIs within your CockroachCloud instance. There are also other matching APIs available from Interzoid that include specific matching algorithms for other types of data such as individual names (Bob = Robert, Johnson = Jonsen). In addition, there are data enrichment APIs available to add external data elements via the Cloud to existing data that can enhance various data initiatives using CockroachDB.
The possibilities are limitless.
For more information, questions, or assistance, contact support@interzoid.com or visit https://www.interzoid.com