In this walkthrough, we will set up a PostgreSQL database on Ubuntu, connect to it, populate it with data, and then use a third-party tool to analyze the data for data quality issues.
Part One: Set up PostgreSQL on AWS RDS
1. Log into your AWS console and select the RDS service.
2. Click "Create Database" and select "PostgreSQL" as the engine. Be sure to scroll down to the RDS "Create Database", rather than the more complex Aurora database choice.
3. Set your DBInstance Name (this will identify it and be used in your connection string).
4. Set your username and password.
5. Select your DB instance size - choose an instance type and storage amount that fits your needs. The defaults are ok for this walkthrough. The burstable classes are a good choice for lower resource usage.
6. Configure the DB instance settings:
- Select the VPC you want to launch the database within (ok to just use the default).
- Set public accessibility to "yes".
7. For connectivity, under Security Group, create a new one. At minimum, allow port 5432 access. To make it easy for testing, you can allow ingress (inbound access) on port 5432 from 0.0.0.0/0. This will allow access to the PostgreSQL port from any IP address so that you can connect from anywhere, however it is recommended that you lock it down to the specific IP address of a server you will be connecting to it from, especially in a production environment.
8. Click "Create Database". It will now launch and provision the RDS Postgres instance.
Part Two: Preparing to Connect to the Database
To connect:
1. Once the status shows "Available", click on the instance name to view details.
2. Scroll down to "Connectivity & Security". Copy the Endpoint.
3. Using a SQL client like psql, pgAdmin, or DBeaver, connect to the endpoint using the master username and password as part of a connection string (see below). We will use psql to load the database with sample data.
4. You will then be able to use SQL statements to create tables, insert rows, run queries and more on your RDS Postgres database.
Part Three: Installing psql to Connect to the Database
In order to run SQL statements on this data, you can install psql, available on many platforms.
For example, here are the steps to install psql on Ubuntu:
1. Update the package repository(this ensures you get the latest version of psql):
$ sudo apt update
2. Install the postgresql client package:
$ sudo apt install postgresql-client-common
3. Install the libpq-dev package that provides the library needed for software to communicate with PostgreSQL databases:
$ sudo apt install libpq-dev
4. Install postgresql:
$ sudo apt install postgresql
5. Verify psql installed correctly:
$ psql --version
You should see the version of psql installed.
Part Four: Loading Sample Data into the PostgreSQL database:
The following will provide access to your Postgres instance on AWS (you will be prompted for the password)
$ psql --host=your-specific-endpoint.rds.amazonaws.com --port=5432 --username=postgres --password --
You should then be at the psql command line.
For a sample database script to populate a database, get it here: https://dl.interzoid.com/csv/companiessql.txt.
You can execute these statements and the table with data will be created in the default postgres database. You can try it from the psql prompt here:
> select * from companies;
Part Five: Running Redundant Data Match Reports using a Third Party Tool
Once you have data populated into the database, you can test it from a third-party product such as Interzoid's Cloud Data Connect Wizard, which will identify inconsistent and duplicate from a SQL table.
This is the connection string for the database we have created and loaded with data:
postgres://user:password@your-custom-endpoint.rds.amazonaws.com/database-name?sslmode=require
To run a match job with the Interzoid Cloud Data Connect Wizard, choose AWS RDS PostgreSQL as the data source, "companies", as the table name, and "company" as the column name to match on. Run the "Match Report" and view the results. It's that easy!
To register for an API key with Interzoid (free tier available), click here.
For more documentation on using PostgreSQL data sources for matching and cleansing database tables, click here.