Onboarding New Cloud on OneOps

One significant advantage of OneOps is that it is cloud agnostic: developers and operators can freely move applications managed by OneOps from one cloud provider to another, so they can “shop around” and take advantage of better technology support, price, capacity, scalability, security, and customer service on demand.

From the beginning, OneOps has supported many major cloud providers: OpenStack, AWS, Rackspace, Azure, and more. As new cloud providers emerge and others focus on markets outside the US, it is useful for OneOps owners to know how to onboard a new cloud provider.

In this blog, we will use Alibaba Cloud (a.k.a. Aliyun) as the new cloud. As background, Alibaba is a Chinese internet company that hosts the largest e-commerce websites in China. Aliyun is part of Alibaba and provides overseas the core cloud services that other major players offer, such as Elastic Compute Service, Storage, Load Balancer, and VPC. Within China, Aliyun offers a near “apple-to-apple” counterpart for almost every service that AWS offers, across all levels of the product line.

For demonstration purposes, this blog will just illustrate adding the compute service to OneOps. For quick reference, the pull requests for this effort can be found at: request #1 and request #2.

These two pull requests collectively add/change the following files:

Configuration Files:

Keypair cookbook:

  • add.rb: Aliyun currently does not provide a place to store a public key. If a cloud does support keypairs, we are supposed to implement something like add_keypair_openstack.rb

Security Group cookbook:

Compute cookbook:

OneOps-Admin (request #2):

Finally, we may want to sync the newly added cloud to OneOps, and restart the OneOps UI so that it will be picked up and shown in the next run:

bundle exec knife model sync aliyun
bundle exec knife cloud sync aliyun
service display restart

Application Deployment on Aliyun

We can follow one of my previous blogs on creating a new cloud on OneOps. The only difference is to use Aliyun instead of OpenStack. To deploy anything on Aliyun, we first need an Aliyun account. The AccessKey and SecretKey come from that account and should be pre-configured on OneOps for authentication.
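To authenticate with those keys, the compute cookbook has to sign every Aliyun ECS API request. The following plain-Ruby sketch illustrates Aliyun's documented RPC signature scheme (HMAC-SHA1 over the sorted, percent-encoded query string); it is not the actual cookbook code, and the parameter values are made up:

```ruby
require "openssl"
require "base64"
require "cgi"

# Aliyun's RPC-style percent encoding: like URL encoding, but with
# "+" -> "%20", "*" -> "%2A", and "~" left unescaped.
def percent_encode(s)
  CGI.escape(s.to_s).gsub("+", "%20").gsub("*", "%2A").gsub("%7E", "~")
end

# Build the Signature parameter for a request with the given query params.
def aliyun_signature(params, secret_key, http_method = "GET")
  canonical = params.sort.map { |k, v| "#{percent_encode(k)}=#{percent_encode(v)}" }.join("&")
  string_to_sign = "#{http_method}&#{percent_encode('/')}&#{percent_encode(canonical)}"
  # Aliyun appends "&" to the SecretKey before signing.
  digest = OpenSSL::HMAC.digest(OpenSSL::Digest.new("sha1"), secret_key + "&", string_to_sign)
  Base64.strict_encode64(digest)
end

sig = aliyun_signature({ "Action" => "DescribeInstances", "RegionId" => "cn-beijing" }, "testsecret")
puts sig
```

The resulting Base64 string is attached to the request as the Signature parameter; the trailing "&" appended to the SecretKey is a detail specific to Aliyun's scheme.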

At this moment, the pack may only support the Aliyun Beijing data center and its availability zones (e.g. cn-beijing-a, cn-beijing-b, cn-beijing-c), but it is straightforward to add more Aliyun data centers in the compute service setup page.

In this blog, we choose MySQL as the application pack to deploy. The following picture shows the deployment was successful on Aliyun.


During the deployment, we could also see from Aliyun web console that there was a compute being launched for this deployment.




Alibaba Cloud is one of the largest cloud providers, primarily serving users in Asia. It is now expanding its data centers overseas, and onboarding Alibaba Cloud should bring more adoption of OneOps in the near future. Please stay tuned for more updates!






ActiveMQ Deployment on OneOps


Apache ActiveMQ is an open source message broker written in Java, with a full Java Message Service (JMS) client. Oriented toward enterprise use cases, it supports multiple standards, languages, and protocols:

  • JMS 1.1, J2EE 1.4, JCA 1.5 and XA
  • Java, C, C++, C#, Ruby, Perl, Python, PHP
  • TCP, SSL, NIO, UDP, multicast, JGroups and JXTA transports

In this blog, I would like to introduce how to deploy ActiveMQ on OneOps.

First, choose ActiveMQ pack in the Design phase.


For demonstration purposes, in the activemq component we can keep every default but set Auth Type to None, as follows:

Screen Shot 2016-09-12 at 12.51.56 AM.png

In more serious cases, JAAS or other authentication methods should be used.

There are also two optional components, queue and topic, which can be added upfront or later as needed. It is highly recommended to create/edit/delete queues and topics at the OneOps layer, for two reasons: (1) it takes advantage of the OneOps audit trail, so we know which changes have been made; (2) more statistics about each topic or queue (e.g. messages in/out per second, pending messages) can be collected and visualized on the OneOps UI.

Last, we could add our own SSH key to the user-activemq component so that we could log into the VM later on.

Commit the design and we are ready to deploy ActiveMQ. Create a new environment with Availability Mode = single and 1 cloud as the primary cloud, since ActiveMQ is not a distributed system by nature and the current ActiveMQ pack does not fully support a High Availability (HA) deployment out of the box. Having the pack support a Master-Slave deployment may be future work.

The deployment plan may resemble the following:

Screen Shot 2016-09-12 at 1.56.55 AM.png

After the deployment, we could SSH into the ActiveMQ broker and run a simple JMS program to test the ActiveMQ.

On the broker machine, download the hello world JMS code and save it as App.java, then use an editor to point the broker connection URL in the code at the local broker (the stock ActiveMQ transport default is tcp://localhost:61616).

Save the change and compile the code by: javac -cp /opt/activemq/activemq-all-5.13.0.jar App.java

Then run the code by: java -classpath /opt/activemq/activemq-all-5.13.0.jar:. App

The output of the code run would be:

Sent message: 140686497 : Thread-0
Sent message: 104414848 : Thread-1
Received: Hello world! From: Thread-23 : 2112836430
Sent message: 125493280 : Thread-9
Sent message: 1026563844 : Thread-6
Received: Hello world! From: Thread-0 : 1499689553
Received: Hello world! From: Thread-6 : 336615606
Received: Hello world! From: Thread-9 : 1948548290
Sent message: 907776190 : Thread-17
Sent message: 645246177 : Thread-16
Sent message: 1998048735 : Thread-12
Received: Hello world! From: Thread-30 : 1818405427
Received: Hello world! From: Thread-9 : 574906110
Sent message: 1335593353 : Thread-19
Sent message: 1799775865 : Thread-25
Sent message: 935344593 : Thread-30
Sent message: 1964711431 : Thread-23
Received: Hello world! From: Thread-0 : 1694803203
Received: Hello world! From: Thread-19 : 686363848
Sent message: 271300696 : Thread-27
Received: Hello world! From: Thread-25 : 2112836430
Received: Hello world! From: Thread-23 : 756278511
Received: Hello world! From: Thread-1 : 321452604
Received: Hello world! From: Thread-25 : 523145999

The above output shows that the ActiveMQ broker is working properly.

Another place to check ActiveMQ is its web admin console. Open a browser and go to http://platform_level_fqdn_or_ip:8161/admin. The platform_level_fqdn can be retrieved as follows: go to the Operate phase, click your_activemq_platform_name on the right, find the fqdn component and click into it; the shorter URL is the platform_level_fqdn.

The browser will pop up a window asking for a username and password, both admin by default.

After logging into the web admin console, the main page should look like:

Screen Shot 2016-09-12 at 2.17.59 AM.png

We could further check the information of the queue that is just created for the “hello world” JMS code.

Screen Shot 2016-09-12 at 2.18.46 AM.png

Monitoring & Alerting

In the Operate phase, some statistics about the ActiveMQ broker can be visualized on the OneOps UI. For example, for an ActiveMQ instance, click into its Operate tab, which will show the following 3 monitors:

Screen Shot 2016-09-12 at 2.28.54 AM.png

The BrokerStatus shows some broker related statistics:

Screen Shot 2016-09-12 at 2.30.25 AM.png

The Memory Status shows some memory related statistics on the broker:

Screen Shot 2016-09-12 at 2.31.53 AM.png

The Log monitor will alert, within 15 minutes, if there is a critical exception in the ActiveMQ broker log file.

Screen Shot 2016-09-12 at 2.32.44 AM.png


Here is a more detailed user guide for the ActiveMQ pack on OneOps, especially for setting up SSL, authorization and authentication. To make it production-ready, future work may be to add clustering that supports HA and Disaster Recovery.


Building an ElasticSearch, Logstash, Kibana (ELK) Stack on OneOps


ElasticSearch, Logstash, and Kibana are massively popular open source projects that together compose an end-to-end stack delivering actionable insights in real time from almost any type of structured or unstructured data source.

In short:

  • Logstash is a tool for collecting, parsing, and transporting the logs for downstream use.
  • Kibana is a web interface that can be used to search and view the logs that Logstash has indexed.
  • ElasticSearch sits between Logstash and Kibana, and stores the logs in a highly scalable, durable and available manner.

The following picture illustrates the relationship among them:


In the OneOps application repository, we have all three of Logstash, Kibana and ElasticSearch, so in this blog I would like to introduce how to build an ElasticSearch, Logstash, Kibana (ELK) stack on OneOps, by reproducing the demo shown in Visualizing Data with ELK.

Deploy Logstash and ElasticSearch

In fact, Logstash ships as an optional component of every application pack on OneOps: although Logstash is not required by most applications, it is generic enough to collect and transport application logs, and can be conveniently enabled whenever it is needed.

For conciseness and demonstration purposes, I will show the deployment of ElasticSearch together with Logstash, so that Logstash runs on every ElasticSearch node.

First, in Design phase, create a new ElasticSearch platform.

Screen Shot 2016-08-26 at 4.25.29 PM

After this, we may need to configure the elasticsearch, download and logstash components.

(1) elasticsearch component: if using a small compute (e.g. less than 2 GB of memory), we may need to set Allocated Memory (MB) to 512, otherwise ElasticSearch may run into a JVM out-of-memory issue, because Logstash also runs in the same box (virtual machine) and requires an additional 512 MB of heap for its own JVM.

Screen Shot 2016-08-26 at 4.28.38 PM

(2) download component: since we want to reproduce the demo in Visualizing Data with ELK, the data set used by that demo should be downloaded in advance. Fortunately, OneOps provides the download component, so anything hosted on the internet can be automatically downloaded onto every VM during the deployment. (Generally, when we need to install some package, library or dependency right after the VM boots up, the download component does the job.)

Screen Shot 2016-08-29 at 9.35.31 AM

Save the download component and overall it should resemble:

Screen Shot 2016-08-26 at 4.54.57 PM.png

(3) logstash component: as we will run Logstash in the same box as ElasticSearch, we need to add a logstash component so that it will be deployed together with ElasticSearch. Note that the configuration steps described here also apply to any other application that may want Logstash.

  • add a new logstash component
  • set Inputs to file {path => "/app/data.csv" start_position => "beginning" sincedb_path => "/app/sincedb.iis-logs"}
  • set Filters to csv {separator => "," columns => ["Date","Open","High","Low","Close","Volume","Adj Close"]} mutate {convert => ["High", "float"]} mutate {convert => ["Open", "float"]} mutate {convert => ["Low", "float"]} mutate {convert => ["Close", "float"]} mutate {convert => ["Volume", "float"]}
  • set Outputs to elasticsearch {action => "index" host => "localhost" index => "stock" workers => 1} stdout {}


The logstash settings finally look like:

Screen Shot 2016-08-26 at 5.05.00 PM.png

Save the logstash component and overall it should resemble:
Screen Shot 2016-08-26 at 5.09.53 PM.png
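To see what these Filters do to each line of /app/data.csv, here is a plain-Ruby sketch of the csv parse plus the float conversions (the sample line is made up; this mimics the Logstash configuration above, it is not Logstash itself):

```ruby
require "csv"

# Columns and converted fields taken from the Filters setting above.
COLUMNS = ["Date", "Open", "High", "Low", "Close", "Volume", "Adj Close"]
FLOAT_FIELDS = ["High", "Open", "Low", "Close", "Volume"]

def parse_event(line)
  event = COLUMNS.zip(CSV.parse_line(line)).to_h
  FLOAT_FIELDS.each { |f| event[f] = event[f].to_f }  # mutate { convert => [f, "float"] }
  event
end

event = parse_event("2016-08-26,769.25,769.50,758.34,760.59,1164498,760.59")
puts event["Close"]       # 760.59 as a Float, ready for the "stock" index
puts event["Adj Close"]   # still a String: the config above does not convert this field
```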

Last we could add our own SSH key to the user-app component so that we could log into the VM later on.

Now we are ready to deploy ElasticSearch and Logstash. Create a new environment with Availability Mode = redundant and choose 1 cloud as primary cloud. Regarding how to set up a cloud in OneOps, please refer to one of my previous blogs.

By default, an ElasticSearch cluster with 2 VMs will be created. For serious use cases, a cluster with 3 nodes is needed, because discovery.zen.minimum_master_nodes should be set to 2 to avoid split brain while still tolerating the loss of 1 node. The number of nodes can be adjusted in the Scaling section after clicking your_elasticsearch_platform_name in the Transition phase.

Screen Shot 2016-08-26 at 11.56.13 PM
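The 3-node recommendation follows from the usual quorum arithmetic, sketched below (a generic illustration, not ElasticSearch code):

```ruby
# minimum_master_nodes should be a strict majority of master-eligible nodes,
# so two halves of a partitioned cluster can never both elect a master.
def minimum_master_nodes(cluster_size)
  cluster_size / 2 + 1
end

# A cluster survives losing one node only if the survivors still form a quorum.
def tolerates_one_node_loss?(cluster_size)
  (cluster_size - 1) >= minimum_master_nodes(cluster_size)
end

puts minimum_master_nodes(3)      # => 2, matching the setting above
puts tolerates_one_node_loss?(2)  # => false: a 2-VM cluster cannot lose a node safely
puts tolerates_one_node_loss?(3)  # => true
```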

The deployment plan will resemble the following: (number of compute instances is 3, denoting 3 VMs and 3 ElasticSearch instances will be created)

Screen Shot 2016-08-27 at 12.03.46 AM.png

After the deployment, ElasticSearch and Logstash should be running automatically.

Deploy Kibana

As introduced, Kibana typically pairs with ElasticSearch to provide a visualization dashboard of the search results.

First, choose to create a Kibana platform in the Design phase.

Screen Shot 2016-08-27 at 12.15.32 AM.png

Then we need to configure the kibana component; the only setting we need to take care of is ElasticSearch Cluster FQDN including PORT. We can get your_elasticsearch_platform_fqdn from the following steps:

In the Transition phase, first choose the ElasticSearch environment, then go to the Operate phase, click your_elasticsearch_platform_name on the right, find the fqdn component and click into it; the shorter URL is your_elasticsearch_platform_fqdn.

Prefixed with “http://” and suffixed with “:9200/”, the ElasticSearch Cluster FQDN including PORT will look like:

http://your_elasticsearch_platform_fqdn:9200/

The entire section configuring the kibana component may resemble the following:

Screen Shot 2016-08-27 at 12.21.41 AM

Again we could add our own SSH key to the user-app component in order to log into the VM later on.

After saving the platform, we could start to create an environment followed by the deployment. Same as before, Availability Mode = redundant and choose 1 cloud as primary cloud.

By default, two independent Kibana instances will be deployed, providing some redundancy when one Kibana goes down.

The deployment plan will resemble the following: (number of compute instances is 2, denoting 2 VMs and 2 Kibana instances will be created)

Screen Shot 2016-08-27 at 12.54.32 AM.png

After the deployment, we could check the platform-level FQDN of Kibana and use it for accessing the Kibana dashboard.

Open a web browser and go to: http://your_kibana_platform_fqdn:5601

Then follow the steps in Visualizing Data with ELK to create the visualization dashboards on Kibana.

Note that the data set used in Visualizing Data with ELK is historical, so we may need to increase the search time span on Kibana in order to pull the historical data from ElasticSearch and present it. This change can be made at the top-right corner. For example:

Screen Shot 2016-08-27 at 1.44.12 AM
Click “Last 15 minutes” to change search span

In the following picture, we set the search span to 30 years back relative to today, so that a visualization similar to the one in Visualizing Data with ELK will be shown.

Screen Shot 2016-08-24 at 12.01.35 AM.png


In this blog, I introduced how to build an ELK stack on OneOps and verified that it works end-to-end by reproducing a demo from Visualizing Data with ELK. The ELK stack discussed in this blog is still preliminary; in a production environment, it is more scalable and practical to include Filebeat (previously logstash-forwarder) and Redis in the pipeline.

Filebeat is a lightweight tool installed on every node to tail the system or application log files and forward them to Logstash. Redis can serve as a buffer to absorb the huge volume of logs aggregated from all nodes.

Also, ElasticSearch could follow a better deployment architecture that separates the master-eligible nodes from the data nodes, and potentially has dedicated client nodes for routing requests and aggregating search results. (In this blog, every ElasticSearch instance in the cluster acts as both a master-eligible and a data node.)



Orchestrating Couchbase on OneOps


Couchbase is an open-source, distributed, document-oriented NoSQL database that is well suited for powering high-performance web, mobile and IoT applications. For authoritative Couchbase use cases and notable users, please visit link1 and link2.

OneOps offers a Couchbase pack for both the “community” and “enterprise” editions (please see here for a comparison). Recently, Couchbase, Inc. announced support for Couchbase (Enterprise) deployed and managed by OneOps, which is a win-win for both parties and a good example of how technology vendors play a crucial role in the OneOps ecosystem.

In this post, I plan to introduce the Couchbase OneOps pack from three aspects:

  • Deployment
  • Operation
  • Monitoring


As we will see later in this blog, Couchbase emits metric data to Graphite for monitoring purposes, so we need a running Graphite instance upfront. We can follow the steps in my previous blog about Graphite on OneOps to deploy Graphite first (possibly in a different OneOps assembly).

Then in the Design phase, create a Couchbase platform by choosing “CouchBase” pack.

Screen Shot 2016-08-11 at 11.38.06 PM.png

After creating a Couchbase platform, there may be several parameters to review:

  • couchbase component: by default it deploys the “community” edition, but the “enterprise” edition is also available if commercial support and more features are needed. Change the Admin User and Password if the defaults do not suit (the password is “password” by default). In a future blog we may review how to set up an email server to deliver alerts. For this demo we do not need to change anything.

Screen Shot 2016-08-12 at 12.57.06 AM.png

  • bucket component: one default bucket will be created after the Couchbase deployment. Here the bucket name, password and number of replicas can be tuned (the default bucket password is “password”). If we want more buckets, we can create them now, or later when we actually need them.

Screen Shot 2016-08-12 at 12.52.24 AM.png

  • diagnostic_cache component: add the list of Graphite servers (IP or FQDN) of a single Graphite cluster, such as graphite_url_1:2003,graphite_url_2:2003. Note that the metric data will be sent to the first working Graphite server in the list, so these URLs should all belong to one Graphite cluster. If we use an FQDN to access the Graphite cluster, it is a plus to put graphite_fqdn:2003 multiple times on purpose (graphite_fqdn:2003,graphite_fqdn:2003). The benefit: if the FQDN-to-IP resolution fails the first time (e.g. a network transient), this gives a few more chances to retry the resolution.
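The “first working server” behavior, and why a duplicated graphite_fqdn:2003 entry buys a retry, can be sketched as follows (the reachability check is stubbed out; the real component opens a TCP connection to port 2003):

```ruby
# Metrics are sent to the FIRST endpoint in the list that is reachable, so a
# repeated FQDN entry is simply a second chance at DNS resolution.
def first_working_endpoint(endpoints)
  endpoints.find { |ep| yield(ep) }
end

endpoints = ["graphite_fqdn:2003", "graphite_fqdn:2003"]
attempts = 0
chosen = first_working_endpoint(endpoints) do |_ep|
  attempts += 1
  attempts > 1  # simulate: the first resolution fails transiently, the retry succeeds
end
puts chosen  # the duplicate entry, found on the second attempt
```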

Also we could add our own SSH key to the user-app component so that we could log into the VM later on.

After committing the Design, create a new environment with “Availability Mode = redundant” and choose 1 cloud as “primary cloud”. Regarding how to set up a cloud in OneOps, please refer to one of my previous blogs.

By default, a Couchbase cluster with 3 VMs will be created. The deployment plan will resemble the following: (number of compute instances is 3, denoting 3 VMs will be created)

Screen Shot 2016-08-12 at 1.53.37 AM.png

After the deployment, we can open a web browser and visit the Couchbase Web Console (your_couchbase_platform_fqdn:8091) to verify the cluster information and even do some operational work (covered later). To get the Couchbase platform FQDN: go to the Operate phase, click your_couchbase_platform_name on the right, find the fqdn component and click into it; the shorter URL is your_couchbase_platform_fqdn.

Screen Shot 2016-08-12 at 2.01.37 AM.png
By Default, Username: Administrator, Password: password


Typical operations for Couchbase could be done at Couchbase Web Console:

  1. Add Server
  2. Fail over
  3. Remove Server
  4. Rebalance, etc.

Screen Shot 2016-08-12 at 9.51.43 AM

Interestingly, some of the above operations can be done on the OneOps UI as well. For example, go to the Operate phase, click your_couchbase_platform_name on the right, find the couchbase component and click into it; we may find multiple instances of couchbase.

Choose any one of them, then click Choose Action To Execute; we will see a drop-down list of actions that can run on this couchbase instance.

Screen Shot 2016-08-12 at 10.07.36 AM.png

One distinction between OneOps and some automation tools is that OneOps provides full flexibility to define operational actions associated with a pack. Take the Couchbase cookbook for example: in the “recipes” folder we can find the corresponding recipe for each operational action, for instance “add-to-cluster.rb”. The magic that presents those operational actions on the front end is the cookbook metadata file (the actions are typically defined at the bottom of the metadata file).

Another operational highlight is cluster-wide operations. Go to the “Operate” tab, click your_couchbase_platform_name on the right, find the couchbase-cluster component and click into it; we will then see only one couchbase cluster instance. The following picture shows the list of operational actions that can be run cluster-wide.

Screen Shot 2016-08-12 at 11.24.40 AM

For this demo, we can run cluster-health-check, which checks the following items to make sure the cluster is running in a good state:

  1. if automatic fail over is enabled
  2. if the node (VM) is in healthy state
  3. if data is highly available in each bucket (e.g. replica exists and spread evenly over all nodes)
  4. if the nodes seen by OneOps are the same ones that are seen by Couchbase
  5. if the buckets seen by OneOps are the same ones that are seen by Couchbase
  6. if quota reset is not needed
  7. if multiple nodes  (VMs) are not sitting on the same hypervisor

If the answer to any of the above questions is NO, cluster-health-check will show a fail status and point out at which step it failed. For example, more than one node/VM could be launched on the same hypervisor, leading to a higher risk that when that hypervisor goes down, multiple VMs will be offline at the same time.

Screen Shot 2016-08-12 at 1.33.03 PM

If everything looks good, we will not see any red output from the cluster-health-check operation.
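Check #7 above (no two nodes on the same hypervisor) can be sketched like this, with a made-up node-to-hypervisor mapping; the real check obtains the placement from the cloud provider:

```ruby
# Group nodes by hypervisor and flag any hypervisor hosting more than one node.
def co_located_nodes(placement)
  placement.group_by { |_node, hypervisor| hypervisor }
           .select { |_hypervisor, pairs| pairs.size > 1 }
end

placement = { "cb-node-1" => "hv-a", "cb-node-2" => "hv-b", "cb-node-3" => "hv-a" }
bad = co_located_nodes(placement)
puts bad.empty? ? "PASS" : "FAIL: shared hypervisor(s): #{bad.keys.join(', ')}"
```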


A production system cannot live without extensive monitoring, and the Couchbase pack is a great example of monitoring and alerting.

The monitoring of Couchbase will be introduced in two parts: Graphite, and the monitors on the OneOps UI.


Remember that we mentioned the Couchbase deployment needs a Graphite instance upfront to present the Couchbase performance metrics. Now let’s look at Graphite and see what we can get from it.

After opening the Graphite dashboard, we could navigate to the folder that contains Couchbase metrics. See below for an example.

Screen Shot 2016-08-12 at 3.01.34 PM

The root directory contains many metrics about disk, memory usage, healthy-node info and rebalance. Two sub-directories, buckets and nodes, contain the metrics for all buckets and all nodes, which we can drill into further. Taking a node as an example: if we want to visualize the number of operations (ops) on a certain Couchbase node, we can pick a node, click the “ops” icon, and visualize the metric over time.


OneOps UI Monitor

The Couchbase pack also emits some metrics to the OneOps monitors on the UI. To visualize them, go to the “Operate” tab, click your_couchbase_platform_name on the right, find the diagnostic_cache component, and choose any one of the multiple diagnostic_cache instances (each corresponds to a Couchbase node, identified by the trailing numbers). Then click the monitors tab, which shows a list of monitored metrics on the OneOps UI:

Screen Shot 2016-08-17 at 11.08.21 AM

For example, if we want to look at Disk Performance, we just click Cluster Health Info and scroll down to find the corresponding chart about Disk Performance.

Screen Shot 2016-08-17 at 11.14.48 AM.png

Alerting can optionally be associated with a monitored metric. For example, if the Cache Miss Ratio is too high (e.g. over 50%), an alert will be fired: an alerting message will show up on the OneOps “Operate” UI (and will be sent to the signed-up email account after email notification is enabled).

Another metric for checking whether Couchbase is used effectively is Docs Resident. By default, if less than 100% of documents can reside in memory for over 5 minutes, the alert will be fired. Conversely, the alert will clear after all documents sit in memory for over 5 minutes.

Screen Shot 2016-08-12 at 4.05.15 PM
Alerting Message about “High Active Doc Resident”
Screen Shot 2016-08-12 at 4.03.04 PM
Recovery Message about “High Active Doc Resident”
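The fire-and-clear behavior described above can be sketched as a small state machine over timestamped samples (the thresholds mirror the text; the actual evaluation is done by the OneOps monitor, not by code like this):

```ruby
ALERT_AFTER_SECONDS = 5 * 60  # both firing and clearing need a 5-minute streak

# samples: [[seconds, resident_percent], ...] in time order
def alert_state(samples)
  state = :ok
  breach_started = recover_started = nil
  samples.each do |t, percent|
    if percent < 100
      recover_started = nil
      breach_started ||= t
      state = :alert if t - breach_started >= ALERT_AFTER_SECONDS
    else
      breach_started = nil
      recover_started ||= t
      state = :ok if t - recover_started >= ALERT_AFTER_SECONDS
    end
  end
  state
end

puts alert_state([[0, 100], [60, 95], [360, 95]])               # => alert (below 100% for 5 min)
puts alert_state([[0, 95], [300, 95], [310, 100], [620, 100]])  # => ok (cleared after recovery)
```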


It is also very flexible to customize the criteria that trigger the alert, case by case, as shown in the picture below.

Screen Shot 2016-08-17 at 11.17.44 AM


The Couchbase pack is a great example of an application pack in the OneOps ecosystem, achieving:

  • fully automated deployment
  • “one-click” operational support at both node and cluster level
  • extensive monitoring and alerting

One huge benefit of OneOps is that it not only automates the deployment, as other automation tools already do, but also provides:

  1. an interface to implement operational work as “code” once and present it simply as a button on the UI, so that anyone (e.g. the Ops team, an engineer) can repeatedly launch the operational work with “one click”.
  2. 100% flexibility to define and customize any monitor that only a specific application or team cares about, and to visualize the metrics on demand on the OneOps UI.
  3. seamless integration between alerting and monitoring, so that any monitored metric can optionally raise alerts by defining a threshold.

Given that many infrastructure technologies are based on open-source offerings nowadays, the challenges for many organizations become: (1) picking the right technology, and (2) operating it well in production.

I hope to see more OneOps application packs with a rich set of monitoring and operational support, which are “must-haves” for a system running in production!




PostgreSQL High Availability on OneOps (2)


(Disclaimer: the blogs posted here represent only the author’s own perspective; the OneOps project does not guarantee support or warranty of any code, tutorial or documentation discussed here.)

In my last post, I introduced the Governor-based Postgres on OneOps, an automated approach to deploying Postgres with High Availability (HA). Since then, I have started a new round of development on the Governor Postgres, Etcd and HAProxy cookbooks to take advantage of OneOps multi-layer clouds, bringing the Postgres pack on OneOps one big step closer to production readiness!

In the last Postgres HA blog, the scenario we demonstrated kept all Postgres instances in the same cloud or data center, so the limitation is: what if the entire cloud or data center goes down?

In this blog, I will introduce a seamless HA failover solution across multi-layer clouds or data centers to guarantee the availability of Postgres, even after a sudden failure of an entire data center.

OneOps offers a concept of multi-layer clouds: technically speaking, “primary” clouds and “secondary” clouds. If an application pack is deployed over both primary and secondary clouds, the application instances in the secondary clouds (secondary instances) typically serve as backups of the instances in the primary clouds (primary instances). Here is one more difference from the network’s point of view:

Primary Cloud: Global Load Balancer (global vip) forwards all the traffic to primary clouds.

Secondary Cloud: all secondary clouds are disabled and do not receive traffic from the Global Load Balancer.

Moreover, if the application is stateless (e.g. a REST application), the secondary instances are not required to replicate state or data from the primary instances, which is the simplest case for a OneOps pack developer.

Otherwise, the secondary instances are supposed to replicate state or data from the primary instances, so that when the system admins flip the primary and secondary clouds (because the primary clouds are down), the secondary instances can immediately start receiving traffic with consistent state or data. Postgres belongs to this case, and the following is the architectural picture:


From the above picture: when a client wants to connect to Postgres, its connection URL should be the FQDN of the Global Load Balancer (GLB). The request is then forwarded to HAProxy, which looks up the Postgres master in the primary cloud. From then on, the client can talk to the master directly, without having to go through the GLB and HAProxy (until a master failover happens).

In terms of data replication, Governor-based Postgres uses streaming replication, and all Postgres slaves directly and independently replicate data from the master. However, when the master fails, only the slaves in the primary clouds can be elected as the new master; the slaves in the secondary clouds have no chance in the master election. The reason: only IP addresses from the primary cloud are covered by the OneOps GLB, and clients should always use the GLB to connect to the Postgres master. If a slave from a secondary cloud were elected as the new master, users would not be able to connect to it.
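This constraint can be sketched as a simple candidate filter (the member data is made up; the actual election is run by Governor over Etcd, not by code like this):

```ruby
# Only live slaves in the primary cloud are eligible, because only primary-cloud
# IPs sit behind the GLB that clients use to reach the master.
def eligible_candidates(members)
  members.select { |m| m[:cloud] == :primary && m[:role] == :slave && m[:alive] }
end

members = [
  { name: "pg1", cloud: :primary,   role: :master, alive: false },  # failed master
  { name: "pg2", cloud: :primary,   role: :slave,  alive: true  },
  { name: "pg3", cloud: :secondary, role: :slave,  alive: true  },
]
puts eligible_candidates(members).map { |m| m[:name] }.inspect  # => ["pg2"]
```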

After all Postgres instances in the primary clouds fail, either at one shot (e.g. a power outage) or in sequence, the primary and secondary clouds need to be flipped so that the Postgres service stays available.

In most cases, the primary and secondary clouds are mapped to different geographical data centers so that the outage of one data center will not bring down both primary and secondary clouds. The flip between primary and secondary clouds is triggered on OneOps UI, possibly by system admin or on-call engineers.

Now let’s quickly go through the deployment process and then see how the seamless failover solution will work.

Deploy Governor-based PostgreSQL on OneOps

Similar to the steps in the last blog, we create a Governor-based Postgres platform and add our SSH key to the user component. The difference: when creating the environment, we need at least one cloud as the primary cloud and at least one cloud as the secondary cloud, which means at least two clouds should be created on OneOps beforehand. In this demo, I will use 1 cloud as primary and 1 cloud as secondary. (Again, “Availability Mode” should be set to “Redundant”.)

Screen Shot 2016-07-28 at 4.57.52 PM

After creating the environment (before “Commit & Deploy”), we can click the Postgres platform name to review and change the “scaling” factor. In this demo, we use 3 computes per cloud, 6 computes in total for the two clouds (primary and secondary). In practice, I personally recommend at least 3 computes for the primary cloud (as well as for the secondary cloud), because Etcd runs best with a minimum of 3 nodes, while more than 5 nodes may be a waste, with too many Postgres slaves.

Screen Shot 2016-07-29 at 10.23.26 AM

The deployment plan may resemble the following (a long plan with 15 steps in total, spanning both primary and secondary clouds; deployment on the primary cloud goes first):

Screen Shot 2016-08-03 at 5.20.24 PM.png
Governor-based Postgres deployment plan (15 steps in total; only up to step 12 is shown here due to space limitations. From step 9, the deployment moves onto the secondary clouds)


After the deployment is done and the Post Deployment section mentioned on the PostgreSQL-Governor pack main page is finished, from a Postgres client machine we can start to use the FQDN to connect to the Postgres master:

/usr/pgsql-9.4/bin/psql --host your_fqdn_here --port 5000 -U postgres postgres

psql (9.4.8)
Type "help" for help.

Please note that the port number to connect to is 5000, rather than 5432. If you do not like 5000, it can be changed in the HAProxy component in the Design phase to some other port (but not 5432).
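For context on why a non-5432 port is used: the pack puts HAProxy in front of all Postgres instances, and clients talk to HAProxy's listener. The following is only a rough sketch of what such a listener could look like; the server names, IPs, and health-check port here are assumptions, not the pack's actual template:

```
listen postgres
    bind *:5000
    option httpchk
    # Governor exposes an HTTP health check that succeeds only on the
    # current master, so traffic always lands on exactly one node.
    server pg-node-1 10.0.0.1:5432 check port 8008
    server pg-node-2 10.0.0.2:5432 check port 8008
    server pg-node-3 10.0.0.3:5432 check port 8008
```

Postgres itself still listens on 5432 locally on each VM, which is why 5432 cannot be reused for the HAProxy listener.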

Screen Shot 2016-08-04 at 7.34.59 AM.png

Test Postgres Master Failover

First, let's create a table called “company” and insert a record into it.

CREATE TABLE COMPANY(
   ID             INT     PRIMARY KEY NOT NULL,
   NAME           TEXT    NOT NULL,
   AGE            INT     NOT NULL,
   ADDRESS        CHAR(50),
   SALARY         REAL,
   JOIN_DATE      DATE
);

INSERT INTO COMPANY (ID, NAME, AGE, ADDRESS, SALARY, JOIN_DATE)
VALUES (1, 'Paul', 32, 'California', 20000.00, '2001-07-13');

Now let's use the method mentioned in the last blog to identify the current Postgres master, SSH into the master VM, and run “service governor stop” to kill the current master. Later, I will talk about how to bring the failed ex-master back online.

After 30–60 seconds (depending on the value of the Etcd TTL), a new Postgres master should be elected from the remaining computes in the primary cloud. Now let's re-connect to Postgres from the client and run the following select query:

postgres=# select * from company;
 id | name | age |         address        | salary | join_date
----+------+-----+------------------------+--------+-----------
  1 | Paul |  32 | California             |  20000 | 2001-07-13
(1 row)

Next insert another record:

INSERT INTO COMPANY (ID, NAME, AGE, ADDRESS, JOIN_DATE)
VALUES (2, 'Allen', 25, 'Texas', '2007-12-13');

Query the table again:

postgres=# select * from company;
 id | name  | age |         address        | salary | join_date
----+-------+-----+------------------------+--------+-----------
  1 | Paul  |  32 | California             |  20000 | 2001-07-13
  2 | Allen |  25 | Texas                  |        | 2007-12-13
(2 rows)

From the above, we can see the new master is available to serve both read and write requests.

Now let's shut down the current master with “service governor stop” to simulate another VM going down.

Again, after a short period of time, the last slave in the primary cloud should be promoted to the new Postgres master, and we can repeat similar read and write queries to verify that the new master is working normally.

After shutting down the last Postgres instance (the master) in the primary cloud, no new master will be elected, because the Postgres instances in the secondary cloud are not allowed to participate in the election. At this point, we need to flip the primary and secondary clouds, in the following sequence:

(1) Change the primary cloud to a secondary cloud; we will temporarily see two secondary clouds.

Go to the “Transition” phase, click the environment and then the Postgres platform name (on the right), find the “Cloud Status” section (at the bottom), identify the primary cloud, and choose “Make Secondary”.

Screen Shot 2016-07-29 at 1.27.31 AM

Then “Commit & Deploy” (only 4 steps).

Screen Shot 2016-08-03 at 5.16.15 PM

(2) Change the secondary cloud (not the one that just got flipped) to primary, so we again have one primary and one secondary cloud.

Identify the (right) secondary cloud and choose “Make Primary”.

Screen Shot 2016-07-29 at 1.35.33 AM.png

Then “Commit & Deploy” (now 7 steps).

Screen Shot 2016-08-03 at 5.18.42 PM

After the deployment completes, we try to re-connect to Postgres from the client via the FQDN:

/usr/pgsql-9.4/bin/psql --host your_fqdn_here --port 5000 -U postgres postgres

psql (9.4.8)
Type "help" for help.

And try a query:

postgres=# select * from company;
 id | name  | age |         address        | salary | join_date
----+-------+-----+------------------------+--------+-----------
  1 | Paul  |  32 | California             |  20000 | 2001-07-13
  2 | Allen |  25 | Texas                  |        | 2007-12-13
(2 rows)

Run an insert command:

INSERT INTO COMPANY (ID, NAME, AGE, ADDRESS, SALARY, JOIN_DATE)
VALUES (3, 'Teddy', 23, 'Norway', 20000.00, DEFAULT);

Query the table one more time:

postgres=# select * from company;
 id | name  | age |         address        | salary | join_date
----+-------+-----+------------------------+--------+-----------
  1 | Paul  |  32 | California             |  20000 | 2001-07-13
  2 | Allen |  25 | Texas                  |        | 2007-12-13
  3 | Teddy |  23 | Norway                 |  20000 |
(3 rows)

As seen above, the two old records were not lost, and the new record could be added to the existing table.

At this point, we have used simple queries to verify that the Postgres OneOps pack supports seamless master failover within a cloud and even across clouds. The next question is: how do we bring the failed Postgres instances back online?

This question is slightly outside the scope of this blog. There are materials to refer to for the details, and a DBA may already know what to do based on previous experience setting up Postgres HA. The one-sentence reminder is: do not simply reboot the failed master without properly handling the recovery tasks.

In this blog, we simply delete the Postgres data directory and restart the Postgres Governor process with “service governor start”, letting the ex-master re-sync all data from the current master.

rm /db/*
systemctl daemon-reload
service governor start
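To make the “do not blindly wipe” warning concrete, the steps above can be wrapped with a small guard. This is a hypothetical helper, not part of the pack; the leader log message it looks for is taken from the Governor output shown in these posts:

```shell
#!/bin/sh
# Hypothetical resync helper: refuses to wipe the data directory if the
# node's recent log still shows it holding the Governor leader lock.
resync_from_master() {
  datadir="$1"   # e.g. /db
  logfile="$2"   # e.g. /var/log/messages
  if tail -n 50 "$logfile" | grep -q "i am the leader with the lock"; then
    echo "refusing to wipe: this node still looks like the leader" >&2
    return 1
  fi
  rm -rf "${datadir:?}"/*     # ${datadir:?} guards against an empty path
  systemctl daemon-reload
  service governor start
}
```

Run it as, for example, `resync_from_master /db /var/log/messages` on the failed node; it only proceeds when the node no longer claims the lock.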

The ex-master should now be running as a slave (verify with “tail -f /var/log/messages”):

Jul 29 08:49:06 pg-238343-1-25880159 bash: 2016-07-29 08:49:06,558 INFO: does not have lock
Jul 29 08:49:06 pg-238343-1-25880159 bash: 2016-07-29 08:49:06,573 INFO: Governor Running: no action.  i am a secondary and i am following a leader

Up to this point, we have seen a full cycle of Postgres master failover and recovery.


The Governor-based Postgres OneOps pack achieves several firsts in the OneOps application ecosystem:

  1. The first pack that supports deployment over both primary and secondary clouds.
  2. The first pack that provides a seamless HA failover solution between primary and secondary instances, with an integrated user experience and data/state consistency.
  3. The first pack that stitches together many existing packs (Etcd, HAProxy) into a complex but transparent HA system, without reinventing the wheel.

There will be ongoing improvements; for example, we may provide a knob for users to choose between synchronous and asynchronous replication (currently it is asynchronous only).

PostgreSQL High Availability on OneOps (1)


(Disclaimer: the blogs posted here represent only the author's own perspective; the OneOps project does not guarantee support or warranty for any code, tutorial, or documentation discussed here.)

When I first introduced OneOps, I said “OneOps has a rich set of ‘best-practice’ based application designs”. Today I will use an application design (also called a “pack”) to explain what ‘best practice’ really means.

Actually, “pack” is not a new concept: some cloud and configuration management tools have already open-sourced their application “cookbooks” or “playbooks”, which are similar to OneOps packs in concept, but they may have the following issues:

(1) Most open-sourced cookbooks are more focused on the deployment workflow:

  • expose the application config parameters.
  • install the application binaries.
  • lay down the configuration files.
  • start the application.

The above workflow typically does not meet production requirements – it is missing high availability, load balancing, automatic failover, etc.

(2) Operational support is missing, e.g. monitoring, alerting, and easy ways to repair/replace bad instances and scale the application.

(3) To be production-ready, users must either pay a premium for proprietary cookbooks or subscribe to the vendors' enterprise services.

In this and the next few blogs, we will take PostgreSQL as an example to illustrate how PostgreSQL on OneOps follows the best practices available in the industry.

PostgreSQL High Availability

PostgreSQL is one of the most popular transactional databases. However, it does not ship with a decent HA solution out of the box. When searching for “PostgreSQL HA”, people are easily overwhelmed by the diversity of solutions, which creates a high technical bar for deploying PostgreSQL in HA mode.

Recently I noticed that Compose, Inc. published a blog about open-sourcing their implementation of PostgreSQL HA, which had been used in production for a while. After independent research in this area, I believe their solution (called “Governor”) is arguably the state of the art for PostgreSQL HA.

Though Governor is open-sourced, the example provided in its GitHub repository is for experimental purposes. Moreover, Governor depends on other components, such as Etcd and HAProxy, so automating their deployments and configuring them to work together should be very helpful.

Deploy Governor-based PostgreSQL on OneOps

In “Design” phase, choose “Governor based PostgreSQL” from “Pack Name” to create a new platform.

Screen Shot 2016-07-01 at 9.57.45 AM

Then we may check the postgresql-governor component to review the PostgreSQL config parameters.

Screen Shot 2016-05-22 at 1.16.17 AM

Create a new user (“Username” is your local login name) and add your local SSH key, so that you can directly SSH into the virtual machines after the deployment.

Screen Shot 2016-05-22 at 1.19.36 AM

Save the design, then move to the Transition phase to create a new environment.

Please note: (1) Availability Mode should be set to Redundant; (2) choose 1 cloud as the Primary cloud for this demo.

Screen Shot 2016-05-22 at 1.26.18 AM

Save the environment, then “Commit & Deploy”. The deployment plan should show up now.

Screen Shot 2016-08-04 at 7.41.19 AM

As seen above, step 6 deploys Etcd and HAProxy, then step 7 deploys Governor-based PostgreSQL. Specifically, step 6 calls the existing Etcd and HAProxy packs on OneOps, which can be used independently to create self-contained services, or co-exist with and serve other applications, as with Governor here. They, too, are packaged with their own best practices.

Also note that the above plan will deploy 2 PostgreSQL instances – one of them will be the leader that serves read and write requests, while the other will actively follow and stream changes from the leader. In the next section, I will describe how to identify the leader.

After the deployment completes, the Governor-based PostgreSQL cluster is up and running. Next, we should finish the Post Deployment section mentioned on the PostgreSQL-Governor pack main page.

Test out High Availability

To connect to the PostgreSQL server, we need to figure out the hostname or IP address of a PostgreSQL node. Since each virtual machine runs (1) PostgreSQL, (2) Etcd, and (3) HAProxy, the machines are identical to each other, and connecting to any one of their IP addresses should work.

However, there is a better way. In my previous post, I mentioned that in most cases OneOps will deploy an FQDN component (based on a DNS service), which can provide Round-Robin DNS (i.e. load balancing) for the application. Here are some benefits of using the FQDN to connect to the application:

  1. If the application is deployed over multiple VMs, we do not need to remember or hard-code the individual IP addresses; the FQDN automatically balances to one of the VMs, by default using Round-Robin.
  2. VMs may become unavailable or die over time; with the FQDN, requests can be routed to the first working VM seamlessly.

To figure out the FQDN of a deployment, go to the “Operate” section, click the PostgreSQL platform on the right-hand side, and choose the fqdn component. We will see two “DNS Entries”; the shorter one is the platform-level FQDN that we will use for the PostgreSQL connection.

On a machine that has the PostgreSQL client installed, or on one of the PostgreSQL VMs we just deployed, type the following to connect to the Governor-based PostgreSQL server:

/usr/pgsql-9.4/bin/psql --host your_fqdn_here --port 5000 -U postgres postgres

If everything is set up correctly, we are now connected to the server:

psql (9.4.8)
Type "help" for help.

Now let's intentionally fail the leader to see how failover works automatically. Identifying the leader is simple: keep a terminal open for each virtual machine, log in to them with ssh your_local_username@machine_ip, then run tail -f /var/log/messages. The leader will print the following messages:

May 23 06:29:42 postgres-238213-1-20776470 bash: 2016-05-23 06:29:42,541 INFO: Governor Running: no action.  i am the leader with the lock
May 23 06:29:42 postgres-238213-1-20776470 bash: 2016-05-23 06:29:42,542 INFO: Governor Running: I am the Leader

On the leader machine, type sudo -s; service governor stop, which brings down the Governor service and, equivalently, the PostgreSQL service. (Note: this is not a clean or proper shutdown, so we need to do something before bringing it back online, otherwise it may not catch up with the new leader. The simplest way is to delete the Postgres data directory and let it re-sync from the Postgres leader.)

Watch the other terminals closely; after around 30 seconds, one of the PostgreSQL followers will be elected as the new leader, and your PostgreSQL client should be able to connect to the server again. Now, after removing the Postgres data directory, restart the ex-leader with sudo -s; service governor start; the ex-leader will come back only as a follower, because there is already a leader.

People may question the “30 seconds” failover gap (TTL): by default, the Etcd service needs 30 seconds to realize the leader is down, and during those 30 seconds the PostgreSQL service may be unavailable. But the pack exposes the Etcd TTL value, so users can shorten the TTL if that helps in their case. Please see the picture of the postgresql-governor configuration for tuning the Etcd TTL.
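For reference, the TTL is a Governor/Etcd setting. Below is a minimal sketch of the relevant configuration fragment, with key names borrowed from the open-source Governor examples; the config the pack actually renders may differ:

```yaml
# Timing-related keys only; lowering ttl shortens the failover gap but
# makes leader flapping more likely on a slow or congested network.
ttl: 30          # seconds before a dead leader's Etcd lock expires
loop_wait: 10    # seconds between Governor health-check loops
etcd:
  scope: postgres-cluster   # placeholder cluster name
  host: 127.0.0.1:4001
```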

PostgreSQL Performance Stats Monitoring

The PostgreSQL pack not only supports HA, but also instruments several monitors. Let's look at one of them: Performance Stats (perfstat).

After finishing the “Post Deployment” setup mentioned on the PostgreSQL-Governor pack main page, the perfstat monitor should start to work. Once there is some running database workload, several key database performance stats can be visualized from the OneOps UI. Here is the list of performance stats:

active_queries, disk_usage, heap_hit, heap_hit_ratio, heap_read, index_hit, index_hit_ratio, index_read, locks, wait_locks
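To give a feel for what these stats measure, here is an illustrative query for the heap hit ratio, computed from PostgreSQL's built-in statistics views. This is a hypothetical example of such a computation, not necessarily the exact query perfstat runs:

```sql
-- Fraction of heap block reads served from shared buffers (cache)
-- rather than disk, across all user tables.
SELECT sum(heap_blks_hit)::float
       / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS heap_hit_ratio
  FROM pg_statio_user_tables;
```

A ratio close to 1.0 means the working set fits in shared buffers; a low ratio suggests heavy disk reads.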

In the “Operate” section, click the PostgreSQL platform on the right-hand side, choose the postgresql component, and then the “leader” PostgreSQL instance. Next, click the “monitor” tab and choose “default metrics”; several graphs will show up, each including some of the performance stats. For example, the following picture shows lock usage stats over the past hour.

Screen Shot 2016-05-22 at 8.57.57 PM

What is Next?

This is the first post introducing Governor-based PostgreSQL on OneOps; automatic failover has been verified within the same cloud or data center. A more real-world scenario is that one cloud or data center goes down entirely because of a power outage, so a seamless failover and replication solution across multiple clouds or data centers is preferred.

Regularly backing up the PostgreSQL data to remote storage, e.g. AWS S3, is also a good practice that adds another layer of data redundancy. I plan to discuss the above in the next few posts – please stay tuned!

Running Tomcat WebApp on OneOps (1)


(Disclaimer: the blogs posted here represent only the author's own perspective; the OneOps project does not guarantee support or warranty for any code, tutorial, or documentation discussed here.)

Apache Tomcat is one of the most widely used web application servers, powering everything from small websites to large-scale enterprise systems. If your applications are built on Java technologies (JSP, Servlets), Apache Tomcat is an excellent choice.

Today I would like to introduce how to deploy a Tomcat Web Application on OneOps.

First we need to choose a Tomcat web application for demonstration purposes. Without loss of generality, the following war file, hosted on a public Nexus server, will be used (it is totally fine to choose another war file, as long as it is hosted on Nexus and accessible from your local environment):


Next on OneOps UI, create a new Tomcat platform in the Design phase:

Screen Shot 2016-07-10 at 2.41.08 PM

After the platform is created, we need to configure a couple of components:

(1) Click the “variables” tab (between “summary” and “diff”) and add the following key-value pairs:

  • groupId: org.jboss.seam.examples-ee6.remoting.helloworld
  • appVersion: 2.3.0.Beta2-20120521.053313-26
  • artifactId: helloworld-web
  • extension: war
  • deployContext: hello

Overall it should resemble the following:

Screen Shot 2016-07-10 at 2.48.06 PM

(2) Add a new artifact component, perhaps called artifact-app, which defines the Tomcat web application. Then input the following information:

  • Repository URL: https://repository.jboss.org
  • Repository Name: public
  • Identifier: $OO_LOCAL{groupId}:$OO_LOCAL{artifactId}:$OO_LOCAL{extension}
  • Version: $OO_LOCAL{appVersion}
  • Install Directory: /app/$OO_LOCAL{artifactId}
  • Deploy as user: app
  • Deploy as group: app
  • Restart: execute “rm -fr /app/tomcat7/webapps/$OO_LOCAL{deployContext}”

Screen Shot 2016-07-10 at 3.03.19 PM

As noted, $OO_LOCAL{} is typically used for defining artifacts, and the variables used in $OO_LOCAL{} are defined in the “variables” tab (between “summary” and “diff”). For more use cases of $OO_LOCAL{}, please refer to http://oneops.github.io/user/references/#variables
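As a concrete illustration, substituting this demo's variables into the identifier template yields the Maven-style coordinate below. This is plain shell just to show the substitution; OneOps performs the equivalent expansion internally:

```shell
# Values from the "variables" tab defined earlier in this post.
groupId="org.jboss.seam.examples-ee6.remoting.helloworld"
artifactId="helloworld-web"
extension="war"
appVersion="2.3.0.Beta2-20120521.053313-26"

# Identifier: $OO_LOCAL{groupId}:$OO_LOCAL{artifactId}:$OO_LOCAL{extension}
identifier="${groupId}:${artifactId}:${extension}"
echo "$identifier"   # org.jboss.seam.examples-ee6.remoting.helloworld:helloworld-web:war

# Install Directory: /app/$OO_LOCAL{artifactId}
echo "/app/${artifactId}"   # /app/helloworld-web
```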


(3) Update Tomcat component:

  • User: app
  • Group: app

It may also be worth reviewing the following settings if they matter to you (not required for this demonstration):

  • Max Threads: the maximum number of active threads in the pool; default is 50.
  • Min Spare Threads: the minimum number of threads kept alive; default is 25.
  • Java Options: JVM command-line options.
  • System Properties: key-value pairs passed as -D arguments to the JVM.
  • Startup Parameters: -XX arguments. For example,

Screen Shot 2016-07-10 at 3.23.12 PM

(4) Create a new user (“Username” is your local login name) and add your local SSH key, so that you can directly SSH into the virtual machines after the deployment.

Save & commit the platform.

Now move to the Transition phase and create a new environment. Please note: (1) Availability Mode should be set to Redundant; (2) choose 1 cloud as the Primary cloud for this demonstration.

Save the environment, then “Commit & Deploy”. The deployment plan should show up as follows. In this case, we will deploy 2 web application instances with a load balancer (fqdn) in front.

Screen Shot 2016-07-10 at 3.41.26 PM

After the deployment, we still need to add some (dependent) jar files to the library folder:


The additional jar files are:

After the jar files are added, restart both Tomcat instances so that the new jar files are loaded into the runtime. Go to the “Operate” phase, click the Tomcat platform name, then the Tomcat component. Tick both Tomcat instances and click “restart” from the “Action” dropdown list, as shown below:

Screen Shot 2016-07-10 at 3.52.27 PM

To access the web application, we only need the address of the load balancer, which is the platform-level FQDN. Go to the “Operate” phase, click the Tomcat platform name, then the fqdn component. The shorter address is the platform-level FQDN; input the following URL into the web browser:


The web application should look like:

Screen Shot 2016-07-11 at 10.37.38 PM

Regarding availability: since we deployed 2 instances of the web application, losing 1 instance should not hurt availability, as long as we use the load balancer address to access the application.

What is Next?

This is the first blog introducing Tomcat and web application deployment on OneOps. More production-driven features and use cases may be covered next, for example:

Please stay tuned!