PostgreSQL High Availability on OneOps (1)

postgresql_ha

(Disclaimer: the blogs posted here only represent the author’s respective, OneOps project does not guarantee the support or warranty of any code, tutorial, documentation discussed here)

When I first introduced OneOps, I ever said “OneOps has a rich set of ‘best-practice’ based application designs”. Today I will use an application design (or called “pack”) to explain what ‘best-practice’ really means.

Actually, “pack” is not a new terminology, as some cloud or configuration management tools already open-sourced their application “cookbook” or “playbook”, that are similar to OneOps pack in concept, but they may have the following issues:

(1) Most open-sourced cookbooks are more focused on the deployment workflow:

  • expose the application config parameters.
  • install the application binaries.
  • lay down the configuration files.
  • start the application.

The above workflow typically does not meet production requirements – missing high availability, load balancing, automatic failover and etc.

(2) Operational supports are missing, e.g. monitoring, alerting, and easy access to repair/replace the bad instances and scale the application.

(3) To be qualified for production, users either pay premiums for the proprietary cookbooks or subscribe the enterprise services from the vendors.

In this and next few blogs, we will take PostgreSQL as an example to illustrate how PostgreSQL on OneOps follows the best practices available from the industry.

PostgreSQL High Availability

PostgreSQL is one of the most popular transactional databases. However it does not ship with a very decent HA solution out-of-the-box. When searching for “PostgreSQL HA”,  people are easily overwhelmed with diverse solutions, which creates a high technical bar to deploy PostgreSQL in HA mode.

Recently I noticed Compose, Inc. published a blog about open sourcing their implementation of running PostgreSQL HA, which has been used in production for a while. After an independent research in this area, I believe their solution (called “Governor“) is supposed to be “state-of-the-art” for PostgreSQL HA.

Though Governor is open sourced, the example provided in its github is for experimental purposes. Moreover, Governor depends on other components, such as Etcd and HAProxy, so automating their deployments and configure them to work together should be very helpful.

Deploy Governor-based PostgreSQL on OneOps

In “Design” phase, choose “Governor based PostgreSQL” from “Pack Name” to create a new platform.

Screen Shot 2016-07-01 at 9.57.45 AM

Then we may check postgresql-governor component to review the PostgreSQL config parameters.

Screen Shot 2016-05-22 at 1.16.17 AM

Create a new user (“Username” is your local login name) and add your local SSH key, so that you could directly ssh into the virtual machines after the deployment.

Screen Shot 2016-05-22 at 1.19.36 AM

Save the design, move to Transition phase to create a new environment.

Please note that, (1) Availability Mode should be set to Redundant, (2) choose 1 cloud as Primary cloud for this demo.

Screen Shot 2016-05-22 at 1.26.18 AM

Save the environment, then “Commit & Deploy”. The deployment plan should show up now.

Screen Shot 2016-08-04 at 7.41.19 AM

As seen above, step 6 wants to deploy Etcd and HAproxy, then step 7 is to deploy Governor-based PostgreSQL. Specifically step 6 calls the existing Etcd and HAproxy packs on OneOps, which could be independently used to create a self-contained service, or serve and co-exist with other application, like Governor in this case. Again they are also packaged with their best practices.

Also note that the above plan will deploy 2 PostgreSQL instances – one of them will be the leader PostgreSQL that serves read and write requests, while the other one will actively follow and stream the changes from leader. In next section, I will describe how to identify the leader PostgreSQL.

After the deployment complete, the Governor-based PostgreSQL cluster is up and running. Next, we would finish the Post Deployment section mentioned in PostgreSQL-Governor pack main page.

Test out High Availability

To connect PostgreSQL server, we need to figure out the hostname or IP address of PostgreSQL. Since each virtual machine runs (1) PostgreSQL (2) Etcd (3) HAproxy , they are identical to each other and connecting to any one of their IP addresses should be working.

However there is a better way to do this. In my previous post, I mentioned that in most cases, OneOps will deploy a FQDN component (based on DNS service), which could serve a Round-Robin DNS (e.g. load balancing) to the application. Here are some benefits of using FQDN to connect the application:

  1. If the applications are deployed over multiple VM, we do not need to remember or hard-code the multiple IP addresses. FQDN could automatically load balance to one of VM, by default based on the Round-Robin.
  2. VM may become unavailable or die over the time, FQDN could route the requests to and return from the first working VM seamlessly.

To figure out the FQDN of a deployment, go to “Operate” section, click PostgreSQL platform on right-hand side, choose  fqdn component. Now we could see two “DNS Entries”, the shorter one is the platform-level FQDN that we will use for PostgreSQL connection.

On a machine that has PostgreSQL client installation, or one of the PostgreSQL VM we just deployed, type the following to connect the Governor-based PostgreSQL server:

/usr/pgsql-9.4/bin/psql --host your_fqdn_here --port 5000 -U postgres postgres

If everything is set up correctly, we are now connected to the server:

psql (9.4.8)
Type "help" for help.
postgres=#

Now let’s intentionally fail the leader to see how the failover works automatically. Identifying the leader is simple – keep a terminal open for each virtual machine, login to them ssh your_local_username@machine_ip,  then tail -f /var/log/messages, the leader will print out the following messages:

May 23 06:29:42 postgres-238213-1-20776470 bash: 2016-05-23 06:29:42,541 INFO: Governor Running: no action.  i am the leader with the lock
May 23 06:29:42 postgres-238213-1-20776470 bash: 2016-05-23 06:29:42,542 INFO: Governor Running: I am the Leader

On the leader machine, type sudo -s; service governor stop which will bring down the Governor service, equivalently the PostgreSQL service. (Note: this is not a clean or proper shutdown, so we need to do something before we bring it online, otherwise it may not catch up with the new leader. The most simple way is to delete the Postgres data directory and let it re-sync from the Postgres leader)

Watch closely the other terminals and after around 30 seconds, one of the PostgreSQL followers will be elected to be the new leader. So your PostgreSQL client should be able to connect the PostgreSQL server again. Now after removing the Postgres data directory, let’s restart the ex-leader: sudo -s; service governor start, the ex-leader will only become a follower because there is already a leader.

people may question about the “3o seconds” failover gap (TTL): by default Etcd service needs 30 seconds to realize the leader is down. During the 30 seconds PostgreSQL service may be unavailable. But the pack exposes the Etcd TTL value, so that users could shorten the TTL if this may help in some cases. Please see the picture about postgresql-governor configuration for tuning the Etcd TTL.

PostgreSQL Performance Stats Monitoring

The PostgreSQL pack not only have HA supported, but also instrument some monitors. Let’s look at one of them: Performance Stats (perfstat).

After finishing “Post Deployment” setup mentioned in PostgreSQL-Governor pack main page, perfstat monitor should start to work. Once there are some running database workload, several key database performance stats could be visualized from OneOps UI. Here is the list of performance stats:

active_queries, disk_usage, heap_hit, heap_hit_ratio, heap_read, index_hit, index_hit_ratio, index_read, locks, wait_locks

In “Operate” section, click PostgreSQL platform on right-hand side, choose the postgresql component and then the “leader” PostgreSQL instance. Next click the “monitor” tab, then choose “default metrics”, several graphs will show up and each includes some of performance stats. For example, the following picture shows the lock usage stats over the past hour.

Screen Shot 2016-05-22 at 8.57.57 PM

What is Next?

This is the first post to introduce Governor based PostgreSQL on OneOps, the automatic failover is verified within the same cloud or data center. A more real-world scenario is: one cloud or data center could entirely go down because of the power outage, so a seamless failover and replication solution is preferred across multiple clouds or data centers.

Also regularly backing up the PostgreSQL data to the remote storage, e.g. AWS S3, is also a good practice to add another layer of data redundancy. I plan to discuss above in the next few posts and please stay tuned!

Advertisements

One thought on “PostgreSQL High Availability on OneOps (1)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s