Tag: DevOps

Building an ElasticSearch, Logstash, Kibana (ELK) Stack on OneOps


ElasticSearch, Logstash, and Kibana are massively popular open-source projects that together compose an end-to-end stack delivering actionable insights in real time from almost any type of structured or unstructured data source.

In short:

  • Logstash is a tool for collecting, parsing, and transporting the logs for downstream use.
  • Kibana is a web interface that can be used to search and view the logs that Logstash has indexed.
  • ElasticSearch sits between Logstash and Kibana, storing the logs in a highly scalable, durable, and available manner.

The following picture illustrates the relationship among them:


The OneOps application repository includes all three of Logstash, Kibana, and ElasticSearch, so in this blog I would like to show how to build an ElasticSearch, Logstash, Kibana (ELK) stack on OneOps by reproducing the demo shown in Visualizing Data with ELK.

Deploy Logstash and ElasticSearch

In fact, Logstash ships as an optional component of every application pack on OneOps: while most applications do not require it, it is a generic way to collect and transport application logs, and it can be conveniently enabled whenever it is needed.

For conciseness, I will demonstrate deploying ElasticSearch together with Logstash, so that Logstash runs on every ElasticSearch node.

First, in Design phase, create a new ElasticSearch platform.

Screen Shot 2016-08-26 at 4.25.29 PM

After this, we need to configure the elasticsearch, download, and logstash components.

(1) elasticsearch component: if using a small compute (e.g. less than 2 GB of memory), we may need to set Allocated Memory (MB) to 512; otherwise ElasticSearch may run into JVM out-of-memory issues, because Logstash also runs in the same box (virtual machine) and requires an additional 512 MB of heap for its own JVM.

Screen Shot 2016-08-26 at 4.28.38 PM

(2) download component: since we want to reproduce the demo in Visualizing Data with ELK, the data set used by that demo should be downloaded in advance. Fortunately, OneOps provides the download component, so anything hosted on the internet can be downloaded automatically onto every VM during the deployment. (In general, whenever we need to install a package, library, or dependency right after the VM boots up, the download component does this job.)

Screen Shot 2016-08-29 at 9.35.31 AM

Save the download component and overall it should resemble:

Screen Shot 2016-08-26 at 4.54.57 PM.png

(3) logstash component: as we will run Logstash in the same box as ElasticSearch, we need to add a logstash component so that it is deployed together with ElasticSearch. Note that the configuration steps described here also apply to any other application that may want Logstash.

  • add a new logstash component
  • set Inputs to file {path => "/app/data.csv" start_position => "beginning" sincedb_path => "/app/sincedb.iis-logs"}
  • set Filters to csv {separator => "," columns => ["Date","Open","High","Low","Close","Volume","Adj Close"]} mutate {convert => ["High", "float"]} mutate {convert => ["Open", "float"]} mutate {convert => ["Low", "float"]} mutate {convert => ["Close", "float"]} mutate {convert => ["Volume", "float"]}
  • set Outputs to elasticsearch {action => "index" host => "localhost" index => "stock" workers => 1} stdout {}
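Taken together, the three fields above correspond to a single Logstash pipeline configuration roughly like the following (a sketch of how the Inputs/Filters/Outputs values are assembled; the exact file the pack renders may differ slightly):

```
input {
  file {
    path => "/app/data.csv"
    start_position => "beginning"
    sincedb_path => "/app/sincedb.iis-logs"
  }
}
filter {
  csv {
    separator => ","
    columns => ["Date","Open","High","Low","Close","Volume","Adj Close"]
  }
  mutate { convert => ["High", "float"] }
  mutate { convert => ["Open", "float"] }
  mutate { convert => ["Low", "float"] }
  mutate { convert => ["Close", "float"] }
  mutate { convert => ["Volume", "float"] }
}
output {
  elasticsearch {
    action => "index"
    host => "localhost"
    index => "stock"
    workers => 1
  }
  stdout {}
}
```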


The final configuration looks like:

Screen Shot 2016-08-26 at 5.05.00 PM.png

Save the logstash component and overall it should resemble:
Screen Shot 2016-08-26 at 5.09.53 PM.png

Last, we can add our own SSH key to the user-app component so that we can log into the VMs later on.

Now we are ready to deploy ElasticSearch and Logstash. Create a new environment with Availability Mode = redundant and choose 1 cloud as primary cloud. Regarding how to set up a cloud in OneOps, please refer to one of my previous blogs.

By default, an ElasticSearch cluster with 2 VMs will be created. For serious use cases, a cluster with at least 3 nodes is needed: discovery.zen.minimum_master_nodes should be set to 2 to avoid split brain while still tolerating the loss of 1 node. The number of nodes can be adjusted in the Scaling section after clicking your_elasticsearch_platform_name in the Transition phase.

Screen Shot 2016-08-26 at 11.56.13 PM
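The split-brain reasoning above is just the majority-quorum rule. A small illustrative sketch (not part of the pack):

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum size needed to avoid split brain: a strict majority
    of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1

# With 3 nodes the quorum is 2, so the cluster tolerates losing 1 node.
# With 2 nodes the quorum is also 2, so losing either node halts the
# cluster, which is why a serious cluster needs at least 3 nodes.
```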

The deployment plan will resemble the following: (number of compute instances is 3, denoting 3 VMs and 3 ElasticSearch instances will be created)

Screen Shot 2016-08-27 at 12.03.46 AM.png

After the deployment, ElasticSearch and Logstash should be running automatically.

Deploy Kibana

As introduced, Kibana typically pairs with ElasticSearch to provide a visualization dashboard of the search results.

First, choose to create a Kibana platform in the Design phase.

Screen Shot 2016-08-27 at 12.15.32 AM.png

Then we need to configure the kibana component; the only setting we need to take care of is ElasticSearch Cluster FQDN including PORT. We can get your_elasticsearch_platform_fqdn with the following steps:

In the Transition phase, first choose the ElasticSearch environment, then go to the Operate phase, click your_elasticsearch_platform_name on the right, find the fqdn component and click into it; the shorter URL is your_elasticsearch_platform_fqdn.

Prefix it with “http://” and suffix it with “:9200/”, so that ElasticSearch Cluster FQDN including PORT looks like: http://your_elasticsearch_platform_fqdn:9200/


The entire kibana component configuration may resemble the following:

Screen Shot 2016-08-27 at 12.21.41 AM

Again, we can add our own SSH key to the user-app component in order to log into the VMs later on.

After saving the platform, we can create an environment and deploy. As before, set Availability Mode = redundant and choose 1 cloud as the primary cloud.

By default, two independent Kibana instances will be deployed, which provides redundancy when one Kibana goes down.

The deployment plan will resemble the following: (number of compute instances is 2, denoting 2 VMs and 2 Kibana instances will be created)

Screen Shot 2016-08-27 at 12.54.32 AM.png

After the deployment, we could check the platform-level FQDN of Kibana and use it for accessing the Kibana dashboard.

Open a web browser and go to: http://your_kibana_platform_fqdn:5601

Now follow the steps in Visualizing Data with ELK to create the visualization dashboards on Kibana.

Note that the data set used in Visualizing Data with ELK is historical, so we may need to widen the time range searched on Kibana in order to pull the historical data from ElasticSearch and present it. This can be done at the top-right corner. For example:

Screen Shot 2016-08-27 at 1.44.12 AM
Click “Last 15 minutes” to change search span

In the following picture, we set the search span to the last 30 years relative to today, so that a visualization similar to the one in Visualizing Data with ELK is shown.

Screen Shot 2016-08-24 at 12.01.35 AM.png


In this blog, I introduced how to build an ELK stack on OneOps and verified that it works end-to-end by reproducing a demo from Visualizing Data with ELK. The ELK stack discussed here is still preliminary: in a production environment, it is more scalable and practical to include Filebeat (previously logstash-forwarder) and Redis in the pipeline.

Filebeat is a lightweight tool installed on every node to tail system or application log files and forward them to Logstash. Redis can serve as a buffer that absorbs the large volume of logs aggregated from all nodes.
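For illustration, a Filebeat configuration that tails application logs and forwards them to Logstash might look roughly like this (the log path, hostname, and port are placeholders, and the exact keys vary across Filebeat versions):

```yaml
filebeat:
  prospectors:
    - paths:
        - /var/log/app/*.log
output:
  logstash:
    hosts: ["logstash_fqdn:5044"]
```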

Also, ElasticSearch could follow a better deployment architecture that separates the master-eligible nodes from the data nodes, and potentially adds dedicated client nodes for routing requests and aggregating search results. (In this blog, every ElasticSearch instance in the cluster acts as both a master-eligible and a data node.)




Orchestrating Couchbase on OneOps


Couchbase is an open-source, distributed, document-oriented NoSQL database well suited for powering high-performance web, mobile, and IoT applications. For authoritative Couchbase use cases and notable users, please visit link1 and link2.

OneOps offers a Couchbase pack for both the “community” and “enterprise” editions (please see here for a comparison). Recently, Couchbase, Inc. announced support for Couchbase Enterprise deployed and managed by OneOps, a win-win for both parties and a good example of how technology vendors play a crucial role in the OneOps ecosystem.

In this post, I plan to introduce the Couchbase OneOps pack from three aspects:

  • Deployment
  • Operation
  • Monitoring


As we will see later in this blog, Couchbase emits metric data to Graphite for monitoring purposes, so we need a running Graphite instance upfront. My previous blog about Graphite on OneOps covers the steps to deploy Graphite first (possibly in a different OneOps assembly).

Then in the Design phase, create a Couchbase platform by choosing “CouchBase” pack.

Screen Shot 2016-08-11 at 11.38.06 PM.png

After creating a Couchbase platform, there may be several parameters to review:

  • couchbase component: by default it deploys the “community” edition, but the “enterprise” edition is also available if commercial support and more features are needed. Change the Admin User and Password if the defaults do not suit you (the password is “password” by default). In a future blog, we may review how to set up an email server to deliver alerts; for this demo we do not need to change these now.

Screen Shot 2016-08-12 at 12.57.06 AM.png

  • bucket component: 1 default bucket will be created after the Couchbase deployment. The bucket name, password, and number of replicas can be tuned here (the default bucket password is “password”). If we want more buckets, we can create them now, or later when we actually need them.

Screen Shot 2016-08-12 at 12.52.24 AM.png

  • diagnostic_cache component: add the list of Graphite servers (IP or FQDN) of a single Graphite cluster, such as graphite_url_1:2003,graphite_url_2:2003. Note that the metric data will be sent to the first working Graphite server in the list, so these URLs should belong to only one Graphite cluster. If we use an FQDN to access a Graphite cluster, it is a plus to deliberately list graphite_fqdn:2003 multiple times (graphite_fqdn:2003,graphite_fqdn:2003): if the FQDN-to-IP resolution fails the first time (e.g. a network transient), this gives a few more chances to retry the resolution.
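The attribute is simply a comma-separated host:port list; a sketch of how such a list could be parsed (illustrative only, not the pack's actual code):

```python
def parse_graphite_endpoints(value):
    """Parse a 'host:port,host:port' list, such as the diagnostic_cache
    Graphite servers attribute, into (host, port) pairs. Metrics go to
    the first endpoint that accepts a connection, so repeating the same
    FQDN simply buys extra resolution/connection retries."""
    endpoints = []
    for item in value.split(","):
        host, _, port = item.strip().rpartition(":")
        endpoints.append((host, int(port)))
    return endpoints
```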

Also we could add our own SSH key to the user-app component so that we could log into the VM later on.

After committing the Design, create a new environment with “Availability Mode = redundant” and choose 1 cloud as “primary cloud”. Regarding how to set up a cloud in OneOps, please refer to one of my previous blogs.

By default, a Couchbase cluster with 3 VMs will be created. The deployment plan will resemble the following: (number of compute instances is 3, denoting 3 VMs will be created)

Screen Shot 2016-08-12 at 1.53.37 AM.png

After the deployment, we can open a web browser and visit the Couchbase Web Console (your_couchbase_platform_fqdn:8091) to verify the cluster information and even do some operational work (covered later). To get the Couchbase platform FQDN, go to the Operate phase, click your_couchbase_platform_name on the right, find the fqdn component and click into it; the shorter URL is your_couchbase_platform_fqdn.

Screen Shot 2016-08-12 at 2.01.37 AM.png
By Default, Username: Administrator, Password: password


Typical operations for Couchbase could be done at Couchbase Web Console:

  1. Add Server
  2. Fail over
  3. Remove Server
  4. Rebalance, and so on

Screen Shot 2016-08-12 at 9.51.43 AM

Interestingly, some of the above operations can also be done on the OneOps UI. For example, go to the Operate phase, click your_couchbase_platform_name on the right, find the couchbase component and click into it; we will find multiple instances of couchbase.

Choose any one of them, then click Choose Action To Execute; we will see a drop-down list of actions that can run on this couchbase instance.

Screen Shot 2016-08-12 at 10.07.36 AM.png

One distinction between OneOps and some automation tools is that OneOps provides full flexibility to define the operational actions associated with a pack. Take the Couchbase cookbook as an example: in the “recipes” folder we can find the corresponding recipe for each operational action, for instance “add-to-cluster.rb“. The magic that presents those operational actions on the front end is the cookbook metadata file (the actions are typically defined at the bottom of the metadata file).
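As a rough illustration of that metadata convention (the action names here are examples; only add-to-cluster.rb is confirmed by the cookbook mentioned above), the bottom of a metadata.rb could declare:

```ruby
# Each "recipe" entry maps an operational recipe in the recipes folder
# to an action shown in the OneOps "Choose Action To Execute" drop-down.
recipe "add-to-cluster", "Add To Cluster"
recipe "failover-node", "Failover Node"       # hypothetical example
recipe "rebalance-cluster", "Rebalance"       # hypothetical example
```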

Another operational highlight is cluster-wide operations. Go to the “Operate” tab, click your_couchbase_platform_name on the right, find the couchbase-cluster component and click into it; we will see only one couchbase cluster instance. The following picture shows the list of operational actions that can run cluster-wide.

Screen Shot 2016-08-12 at 11.24.40 AM

For this demo, we can run cluster-health-check, which checks the following items to make sure the cluster is in a good state:

  1. if automatic failover is enabled
  2. if each node (VM) is in a healthy state
  3. if data is highly available in each bucket (e.g. replicas exist and are spread evenly over all nodes)
  4. if the nodes seen by OneOps are the same ones seen by Couchbase
  5. if the buckets seen by OneOps are the same ones seen by Couchbase
  6. if a quota reset is not needed
  7. if multiple nodes (VMs) are not sitting on the same hypervisor

If the answer to any of the above questions is NO, cluster-health-check will report a fail status and point out the step at which it failed. For example, more than one node/VM could be launched on the same hypervisor, raising the risk that when that hypervisor goes down, multiple VMs go offline at the same time.

Screen Shot 2016-08-12 at 1.33.03 PM

If everything looks good, we will see no red output from the cluster-health-check operation.
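Check item 7, for instance, boils down to grouping nodes by hypervisor and flagging any hypervisor hosting more than one node. A minimal sketch of that logic (illustrative, not the pack's implementation):

```python
from collections import defaultdict

def colocated_nodes(node_to_hypervisor):
    """Given a mapping of node name -> hypervisor, return the hypervisors
    hosting more than one node (the risk flagged by the health check)."""
    groups = defaultdict(list)
    for node, hypervisor in node_to_hypervisor.items():
        groups[hypervisor].append(node)
    return {h: sorted(nodes) for h, nodes in groups.items() if len(nodes) > 1}
```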


A production system cannot live without extensive monitoring, and the Couchbase pack is a great example of monitoring and alerting done well.

The monitoring of Couchbase will be introduced in two parts: the Graphite dashboard and the OneOps UI Monitor.

Graphite

Remember that the Couchbase deployment needs a running Graphite instance upfront to present the Couchbase performance metrics. Now let's look at Graphite and see what we can get from it.

After opening the Graphite dashboard, we could navigate to the folder that contains Couchbase metrics. See below for an example.

Screen Shot 2016-08-12 at 3.01.34 PM

The root directory contains many metrics about disk, memory usage, healthy-node info, and rebalance. Two sub-directories, buckets and nodes, contain the metrics for all buckets and all nodes, which we can drill into further. Taking a node as an example: if we want to visualize the number of operations (ops) on a certain Couchbase node, we can pick a node, click the “ops” icon, and visualize the metric over time.


OneOps UI Monitor

The Couchbase pack also emits some metrics to the OneOps Monitor on the UI. To visualize them, go to the “Operate” tab, click your_couchbase_platform_name on the right, find the diagnostic_cache component, and choose any one of the multiple diagnostic_cache instances (each corresponds to a Couchbase node, identified by the trailing numbers). Then click the monitors tab, which shows the list of monitored metrics on the OneOps UI:

Screen Shot 2016-08-17 at 11.08.21 AM

For example, to look at Disk Performance, we just click Cluster Health Info and scroll down to find the corresponding Disk Performance chart.

Screen Shot 2016-08-17 at 11.14.48 AM.png

Alerting can optionally be attached to a monitored metric. For example, if the Cache Miss Ratio is too high (e.g. over 50%), an alert will fire: an alerting message shows up on the OneOps “Operate” UI (and is sent to the signed-up email account once email notification is enabled).

Another metric for checking whether Couchbase is being used effectively is Docs Resident. By default, if fewer than 100% of documents fit in memory for over 5 minutes, an alert fires. Conversely, the alert clears once all documents have been resident in memory for over 5 minutes.

Screen Shot 2016-08-12 at 4.05.15 PM
Alerting Message about “High Active Doc Resident”
Screen Shot 2016-08-12 at 4.03.04 PM
Recovery Message about “High Active Doc Resident”
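The fire/clear behavior described above is a windowed threshold check. A simplified sketch (the 5-minute window matches the stated default; this is not the actual monitor code):

```python
ALERT_WINDOW_SECONDS = 300  # 5 minutes

def resident_alert_firing(samples, window=ALERT_WINDOW_SECONDS):
    """samples: list of (timestamp_seconds, resident_pct), oldest first.
    The alert fires when every sample in the trailing window is below
    100% resident; it clears once the whole window is back at 100%."""
    if not samples:
        return False
    cutoff = samples[-1][0] - window
    recent = [pct for ts, pct in samples if ts >= cutoff]
    return all(pct < 100 for pct in recent)
```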


It is also very flexible to customize the alerting criteria case by case, as shown in the picture below.

Screen Shot 2016-08-17 at 11.17.44 AM


Couchbase pack is a great example of the application packs in the OneOps ecosystem which achieves:

  • fully automated deployment
  • “one-click” operational support at node level and cluster-wide
  • extensive monitoring and alerting

One huge benefit of OneOps is that it not only automates the deployment, as other automation tools already do, but also provides:

  1. an interface to implement operational work as code once and present it simply as a button on the UI, so that anyone (e.g. the Ops team, an engineer) can repeatedly launch the operational work with one click.
  2. 100% flexibility to define and customize any monitor that only a specific application or team cares about, and to visualize the metrics on demand on the OneOps UI.
  3. seamless integration between alerting and monitoring, so that any metric being monitored can optionally raise alerts via a defined threshold.

Given that much infrastructure technology is based on open-source offerings nowadays, the challenges for many organizations become: (1) picking the right technology, and (2) operating it well in production.

I hope to see more OneOps application packs with a rich set of monitoring and operational support, the “must-haves” for a system to live in production!




PostgreSQL High Availability on OneOps (2)


(Disclaimer: the blogs posted here only represent the author’s perspective; the OneOps project does not guarantee support or warranty of any code, tutorial, or documentation discussed here)

In my last post, I introduced the Governor-based Postgres on OneOps, an automated approach to deploying Postgres with High Availability (HA). Since then, I have started a new round of development on the Governor Postgres, Etcd, and HAProxy cookbooks to take advantage of OneOps’s multi-layer clouds, so the Postgres pack on OneOps is one big step closer to production readiness!

In the last Postgres HA blog, the demonstrated scenario had all Postgres instances in the same cloud or data center, so the limitation was: what if the entire cloud or data center goes down?

In this blog, I will introduce a seamless HA failover solution across multi-layer clouds or data centers to guarantee the availability of Postgres, even after a sudden failure of an entire data center.

OneOps offers a concept of multi-layer clouds: technically speaking, “primary” clouds and “secondary” clouds. If an application pack can be deployed over both primary and secondary clouds, the application instances in the secondary clouds (secondary instances) typically serve as backups of the instances in the primary clouds (primary instances). Here is one more difference, from the network’s point of view:

Primary Cloud: Global Load Balancer (global vip) forwards all the traffic to primary clouds.

Secondary Cloud: all secondary clouds are disabled and do not receive traffic from the Global Load Balancer.

Moreover, if the application is stateless, e.g. a REST application, the secondary instances are not required to replicate state or data from the primary instances, which is the simplest case for an OneOps pack developer.

Otherwise, the secondary instances are supposed to replicate state or data from the primary instances, so that when the system admins flip the primary and secondary clouds (because the primary clouds are down), the secondary instances can immediately start receiving traffic with consistent state or data. Postgres belongs to this second case, and the following is the architectural picture:


As the picture above shows, when a client wants to connect to Postgres, its connection URL should be the FQDN of the Global Load Balancer (GLB). The request is forwarded to HAProxy, which looks up the Postgres master in the primary cloud. The client can then talk to the master directly, without going through the GLB and HAProxy again (until a master failover happens).

In terms of data replication, Governor-based Postgres uses streaming replication, and all Postgres slaves directly and independently replicate data from the master. However, when the master fails, only the slaves in the primary clouds can be elected as the new master; the slaves in the secondary clouds have no chance in the election. The reason: only the IP addresses from the primary cloud are covered by the OneOps GLB, and clients should always use the GLB to connect to the Postgres master. If a slave from a secondary cloud were elected master, users would not be able to connect to it.

After all Postgres instances in the primary clouds fail, either at one shot (e.g. a power outage) or in sequence, the primary and secondary clouds need to be flipped so that the Postgres service stays available.

In most cases, the primary and secondary clouds are mapped to different geographical data centers so that the outage of one data center will not bring down both primary and secondary clouds. The flip between primary and secondary clouds is triggered on OneOps UI, possibly by system admin or on-call engineers.

Now let’s quickly go through the deployment process and then see how the seamless failover solution will work.

Deploy Governor-based PostgreSQL on OneOps

Similar to the steps in the last blog, we create a Governor-based Postgres platform and add our SSH key to the user component. The difference: when creating the environment, we need at least one cloud as primary and at least one cloud as secondary, which means at least two clouds should be created on OneOps beforehand. In this demo, I will use 1 cloud as primary and 1 as secondary. (Again, “Availability Mode” should be set to “Redundant”.)

Screen Shot 2016-07-28 at 4.57.52 PM

After creating the environment (before “Commit & Deploy”), we can click the Postgres platform name to review and change the “scaling” factor. In this demo, we use 3 computes per cloud, 6 in total across the two clouds (primary and secondary). In practice, I personally recommend at least 3 computes for the primary cloud (as well as the secondary), because Etcd runs best with a minimum of 3 nodes, and more than 5 nodes may waste resources on too many Postgres slaves.

Screen Shot 2016-07-29 at 10.23.26 AM

The deployment plan may resemble the following (a long plan with 15 steps in total, covering both primary and secondary clouds; deployment on the primary cloud goes first):

Screen Shot 2016-08-03 at 5.20.24 PM.png
Governor-based Postgres deployment plan (15 steps in total; only up to step 12 is shown due to space limitations. From step 9 on, the deployment moves to the secondary clouds)


After the deployment is done and the Post Deployment section mentioned on the PostgreSQL-Governor pack main page is finished, we can use the FQDN from a Postgres client machine to connect to the Postgres master.

/usr/pgsql-9.4/bin/psql --host your_fqdn_here --port 5000 -U postgres postgres

psql (9.4.8)
Type "help" for help.

Please note that the port number to connect to is 5000, rather than 5432. If you do not like 5000, it can be changed in the HAProxy component in the Design phase to something else (but not 5432).

Screen Shot 2016-08-04 at 7.34.59 AM.png
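For context, the Governor project's sample HAProxy setup follows this shape: HAProxy listens on port 5000 and health-checks each Postgres node through Governor's HTTP endpoint, so only the current master passes the check and receives connections. A sketch under those assumptions (server names, IPs, and the check port are illustrative, not taken from the pack):

```
listen postgres
    bind *:5000
    option httpchk
    http-check expect status 200
    default-server inter 3s fall 3 rise 2
    server postgres_0 10.0.0.1:5432 maxconn 100 check port 8008
    server postgres_1 10.0.0.2:5432 maxconn 100 check port 8008
```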

Test Postgres Master Failover

First, let's create a table called “company” and insert a record into it.

CREATE TABLE COMPANY(
   ID INT PRIMARY KEY     NOT NULL,
   NAME           TEXT    NOT NULL,
   AGE            INT     NOT NULL,
   ADDRESS        CHAR(50),
   SALARY         REAL,
   JOIN_DATE      DATE
);

INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY,JOIN_DATE)
VALUES (1, 'Paul', 32, 'California', 20000.00, '2001-07-13');

Now let's use the method mentioned in the last blog to identify the current Postgres master, SSH into the master VM, and run “service governor stop” to kill the current master. Later, I will talk about how to bring the failed ex-master back online.

After 30 to 60 seconds (depending on the Etcd TTL value), a new Postgres master should be elected from the remaining computes in the primary cloud. Now let's reconnect to Postgres from the client and run the following select query:

postgres=# select * from company;
 id | name | age |         address        | salary | join_date
----+------+-----+------------------------+--------+------------
  1 | Paul |  32 | California             |  20000 | 2001-07-13
(1 row)

Next, insert another record:

INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,JOIN_DATE)
VALUES (2, 'Allen', 25, 'Texas', '2007-12-13');

Query the table again:

postgres=# select * from company;
 id | name  | age |         address        | salary | join_date
----+-------+-----+------------------------+--------+------------
  1 | Paul  |  32 | California             |  20000 | 2001-07-13
  2 | Allen |  25 | Texas                  |        | 2007-12-13
(2 rows)

From the above, we can see the new master is available to serve both read and write requests.

Now let's shut down the current master with “service governor stop” to simulate another VM going down.

Again, after a short period of time, the last slave in the primary cloud should be promoted to the new Postgres master, and we can repeat similar read and write queries to verify whether the new master works normally.

After shutting down the last Postgres instance (the master) in the primary cloud, no new master will be elected, because the Postgres instances in the secondary cloud are not allowed to participate in the election. At this point, we need to flip the primary and secondary clouds; here is the sequence:

(1) Change the primary cloud to a secondary cloud; we will temporarily see two secondary clouds.

Go to the “Transition” phase, click the environment and then the Postgres platform name (on the right), find the “Cloud Status” section (at the bottom), identify the primary cloud, and choose “Make Secondary”.

Screen Shot 2016-07-29 at 1.27.31 AM

Then “Commit & Deploy” (only 4 steps).

Screen Shot 2016-08-03 at 5.16.15 PM

(2) Change the secondary cloud (not the one just flipped) to the primary cloud, so we again have one primary and one secondary cloud.

Identify the (right) secondary cloud and choose “Make Primary”.

Screen Shot 2016-07-29 at 1.35.33 AM.png

Then “Commit & Deploy” (now 7 steps).

Screen Shot 2016-08-03 at 5.18.42 PM

After the deployment completes, we try to reconnect to Postgres from the client via FQDN:

/usr/pgsql-9.4/bin/psql --host your_fqdn_here --port 5000 -U postgres postgres

psql (9.4.8)
Type "help" for help.

And try some query:

postgres=# select * from company;
 id | name  | age |         address        | salary | join_date
----+-------+-----+------------------------+--------+------------
  1 | Paul  |  32 | California             |  20000 | 2001-07-13
  2 | Allen |  25 | Texas                  |        | 2007-12-13
(2 rows)

Run an insert command:

INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY,JOIN_DATE)
VALUES (3, 'Teddy', 23, 'Norway', 20000.00, DEFAULT);

Query the table one more time:

postgres=# select * from company;
 id | name  | age |         address        | salary | join_date
----+-------+-----+------------------------+--------+------------
  1 | Paul  |  32 | California             |  20000 | 2001-07-13
  2 | Allen |  25 | Texas                  |        | 2007-12-13
  3 | Teddy |  23 | Norway                 |  20000 |
(3 rows)

As seen above, the old 2 records were not lost, and the new record was added to the existing table.

At this point, we have used simple queries to verify that the Postgres OneOps pack supports seamless master failover within a cloud and even across clouds. The next question: how do we bring the failed Postgres instances back online?

This question is a bit outside the scope of this blog. There are materials to refer to for the details, and a DBA may already know what to do from previous experience setting up Postgres HA. The one-sentence reminder: do not simply reboot the failed master without properly handling the recovery tasks.

In this blog we just delete the Postgres data directory and restart the Postgres Governor process with “service governor start”, letting the ex-master re-sync all data from the current master.

rm /db/*
systemctl daemon-reload
service governor start

Now the ex-master should be a slave (“tail -f /var/log/messages”):

Jul 29 08:49:06 pg-238343-1-25880159 bash: 2016-07-29 08:49:06,558 INFO: does not have lock
Jul 29 08:49:06 pg-238343-1-25880159 bash: 2016-07-29 08:49:06,573 INFO: Governor Running: no action.  i am a secondary and i am following a leader

Up to this point, we have seen a full cycle of Postgres master failover and recovery.


The Governor-based Postgres OneOps pack achieved several firsts in the OneOps application ecosystem:

  1. the first pack that nicely supports deployment over both primary and secondary clouds;
  2. the first pack that seamlessly provides an HA failover solution between primary and secondary instances, with an integrated user experience and data/state consistency;
  3. the first pack that stitches together many existing packs (Etcd, HAProxy) to make a complex but transparent HA system, without reinventing the wheel.

There will be ongoing improvements; for example, we may provide a knob for users to choose between synchronous and asynchronous replication (currently only asynchronous is supported).

PostgreSQL High Availability on OneOps (1)


(Disclaimer: the blogs posted here only represent the author’s perspective; the OneOps project does not guarantee support or warranty of any code, tutorial, or documentation discussed here)

When I first introduced OneOps, I said “OneOps has a rich set of ‘best-practice’ based application designs”. Today I will use one application design (or “pack”) to explain what ‘best practice’ really means.

Actually, “pack” is not new terminology: some cloud or configuration-management tools have already open-sourced their application “cookbooks” or “playbooks”, which are similar in concept to OneOps packs, but they may have the following issues:

(1) Most open-sourced cookbooks are more focused on the deployment workflow:

  • expose the application config parameters.
  • install the application binaries.
  • lay down the configuration files.
  • start the application.

The above workflow typically does not meet production requirements: it misses high availability, load balancing, automatic failover, and so on.

(2) Operational supports are missing, e.g. monitoring, alerting, and easy ways to repair or replace bad instances and to scale the application.

(3) To be production-qualified, users either pay a premium for proprietary cookbooks or subscribe to the vendors’ enterprise services.

In this and next few blogs, we will take PostgreSQL as an example to illustrate how PostgreSQL on OneOps follows the best practices available from the industry.

PostgreSQL High Availability

PostgreSQL is one of the most popular transactional databases. However, it does not ship with a decent HA solution out of the box. When searching for “PostgreSQL HA”, people are easily overwhelmed by the diversity of solutions, which creates a high technical bar for deploying PostgreSQL in HA mode.

Recently I noticed that Compose, Inc. published a blog about open-sourcing their implementation of PostgreSQL HA, which has been used in their production for a while. After independent research in this area, I believe their solution (called “Governor“) can be considered state-of-the-art for PostgreSQL HA.

Though Governor is open sourced, the example provided in its GitHub repository is for experimental purposes. Moreover, Governor depends on other components, such as Etcd and HAProxy, so automating their deployments and configuring them to work together is very helpful.

Deploy Governor-based PostgreSQL on OneOps

In “Design” phase, choose “Governor based PostgreSQL” from “Pack Name” to create a new platform.

Screen Shot 2016-07-01 at 9.57.45 AM

Then we may check the postgresql-governor component to review the PostgreSQL config parameters.

Screen Shot 2016-05-22 at 1.16.17 AM

Create a new user (“Username” is your local login name) and add your local SSH key, so that you could directly ssh into the virtual machines after the deployment.

Screen Shot 2016-05-22 at 1.19.36 AM

Save the design, then move to the Transition phase to create a new environment.

Please note: (1) Availability Mode should be set to Redundant; (2) choose 1 cloud as the Primary cloud for this demo.

Screen Shot 2016-05-22 at 1.26.18 AM

Save the environment, then “Commit & Deploy”. The deployment plan should show up now.

Screen Shot 2016-08-04 at 7.41.19 AM

As seen above, step 6 deploys Etcd and HAProxy, and step 7 deploys Governor-based PostgreSQL. Specifically, step 6 calls the existing Etcd and HAProxy packs on OneOps, which can be used independently to create self-contained services, or co-exist with and serve another application, like Governor in this case. They, too, are packaged with their own best practices.

Also note that the above plan will deploy 2 PostgreSQL instances – one of them will be the leader PostgreSQL that serves read and write requests, while the other will actively follow and stream changes from the leader. In the next section, I will describe how to identify the leader.

After the deployment completes, the Governor-based PostgreSQL cluster is up and running. Next, we should finish the Post Deployment section mentioned on the PostgreSQL-Governor pack main page.

Test out High Availability

To connect to the PostgreSQL server, we need to figure out its hostname or IP address. Since each virtual machine runs (1) PostgreSQL, (2) Etcd, and (3) HAProxy, the machines are identical to each other, and connecting to any one of their IP addresses should work.

However, there is a better way. In my previous post, I mentioned that in most cases OneOps will deploy an FQDN component (based on a DNS service), which can provide Round-Robin DNS (i.e. load balancing) for the application. Here are some benefits of using the FQDN to connect to the application:

  1. If the application is deployed over multiple VMs, we do not need to remember or hard-code multiple IP addresses. The FQDN automatically load balances to one of the VMs, by default via Round-Robin.
  2. VMs may become unavailable or die over time; the FQDN seamlessly routes requests to a working VM.
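The round-robin behavior in point 1 can be pictured as rotating through the VM addresses on successive requests. The toy shell sketch below (the 10.0.0.x addresses are made-up placeholders, not from this deployment) illustrates the rotation:

```shell
# Toy illustration of round-robin selection over three VM IPs.
# The addresses are placeholders for illustration only.
IPS="10.0.0.11 10.0.0.12 10.0.0.13"
PICKS=""
i=0
for request in 1 2 3 4; do
  # pick the (i mod 3)-th address, as a round-robin resolver would
  set -- $IPS
  shift $(( i % 3 ))
  PICKS="$PICKS$1 "
  i=$(( i + 1 ))
done
echo "$PICKS"   # 10.0.0.11 10.0.0.12 10.0.0.13 10.0.0.11
```

The fourth request wraps around to the first address, which is the load-balancing behavior described in point 1.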

To figure out the FQDN of a deployment, go to the “Operate” section, click the PostgreSQL platform on the right-hand side, and choose the fqdn component. Now we can see two “DNS Entries”; the shorter one is the platform-level FQDN that we will use for the PostgreSQL connection.

On a machine that has the PostgreSQL client installed, or on one of the PostgreSQL VMs we just deployed, type the following to connect to the Governor-based PostgreSQL server:

/usr/pgsql-9.4/bin/psql --host your_fqdn_here --port 5000 -U postgres postgres

If everything is set up correctly, we are now connected to the server:

psql (9.4.8)
Type "help" for help.

Now let’s intentionally fail the leader to see how failover works automatically. Identifying the leader is simple – keep a terminal open for each virtual machine, log in with ssh your_local_username@machine_ip, then run tail -f /var/log/messages; the leader will print out the following messages:

May 23 06:29:42 postgres-238213-1-20776470 bash: 2016-05-23 06:29:42,541 INFO: Governor Running: no action.  i am the leader with the lock
May 23 06:29:42 postgres-238213-1-20776470 bash: 2016-05-23 06:29:42,542 INFO: Governor Running: I am the Leader
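Rather than eyeballing every terminal, you can grep /var/log/messages on each host for the leader message. The sketch below runs that check against a sample line captured from the excerpt above (on a real VM you would grep the log file instead of an inline string):

```shell
# Classify a Governor log line as leader or follower output.
SAMPLE='May 23 06:29:42 postgres-238213-1-20776470 bash: 2016-05-23 06:29:42,542 INFO: Governor Running: I am the Leader'
if echo "$SAMPLE" | grep -qi 'i am the leader'; then
  ROLE=leader
else
  ROLE=follower
fi
echo "$ROLE"   # leader
```

On a real VM, the equivalent check is tail -n 100 /var/log/messages | grep -qi 'i am the leader' && echo leader.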

On the leader machine, type sudo -s; service governor stop, which will bring down the Governor service and, equivalently, the PostgreSQL service. (Note: this is not a clean or proper shutdown, so we need to do something before bringing it back online, otherwise it may not catch up with the new leader. The simplest way is to delete the Postgres data directory and let it re-sync from the new Postgres leader.)

Watch the other terminals closely: after around 30 seconds, one of the PostgreSQL followers will be elected as the new leader, and your PostgreSQL client should be able to connect to the server again. Now, after removing the Postgres data directory, let’s restart the ex-leader with sudo -s; service governor start; the ex-leader will come back as a follower because there is already a leader.

People may question the “30 seconds” failover gap (TTL): by default, the Etcd service needs 30 seconds to realize the leader is down, and during those 30 seconds the PostgreSQL service may be unavailable. But the pack exposes the Etcd TTL value, so users can shorten the TTL if that helps in their case. Please see the picture of the postgresql-governor configuration for tuning the Etcd TTL.

PostgreSQL Performance Stats Monitoring

The PostgreSQL pack not only supports HA, but also instruments a number of monitors. Let’s look at one of them: Performance Stats (perfstat).

After finishing the “Post Deployment” setup mentioned on the PostgreSQL-Governor pack main page, the perfstat monitor should start to work. Once there is some database workload running, several key database performance stats can be visualized from the OneOps UI. Here is the list of performance stats:

active_queries, disk_usage, heap_hit, heap_hit_ratio, heap_read, index_hit, index_hit_ratio, index_read, locks, wait_locks

In the “Operate” section, click the PostgreSQL platform on the right-hand side, choose the postgresql component, and then the “leader” PostgreSQL instance. Next click the “monitor” tab and choose “default metrics”; several graphs will show up, each including some of the performance stats. For example, the following picture shows the lock usage stats over the past hour.

Screen Shot 2016-05-22 at 8.57.57 PM

What is Next?

This is the first post introducing Governor-based PostgreSQL on OneOps; here, automatic failover is verified within the same cloud or data center. A more real-world scenario is that one cloud or data center goes down entirely because of a power outage, so a seamless failover and replication solution across multiple clouds or data centers is preferred.

Regularly backing up the PostgreSQL data to remote storage, e.g. AWS S3, is also a good practice that adds another layer of data redundancy. I plan to discuss the above in the next few posts – please stay tuned!

Running Tomcat WebApp on OneOps (1)


(Disclaimer: the blogs posted here only represent the author’s perspective; the OneOps project does not guarantee support or warranty for any code, tutorial, or documentation discussed here)

Apache Tomcat is one of the most widely used web application servers, powering everything from small websites to large-scale enterprise networks. If your applications are developed on top of Java technologies (JSP, Servlet), Apache Tomcat is a great choice.

Today I would like to introduce how to deploy a Tomcat Web Application on OneOps.

First we need to choose a Tomcat web application for demonstration purposes. Without loss of generality, the following war file hosted on a public Nexus server will be used (it is totally fine to choose another war file, as long as it is hosted on Nexus and accessible from your local environment):


Next on OneOps UI, create a new Tomcat platform in the Design phase:

Screen Shot 2016-07-10 at 2.41.08 PM

After the platform is created, we need to configure a couple of components:

(1) Click the “variables” tab (between “summary” and “diff”) to add the following key-value pairs:

  • groupId: org.jboss.seam.examples-ee6.remoting.helloworld
  • appVersion: 2.3.0.Beta2-20120521.053313-26
  • artifactId: helloworld-web
  • extension: war
  • deployContext: hello

Overall it should resemble the following:

Screen Shot 2016-07-10 at 2.48.06 PM

(2) Add a new artifact component, for example called artifact-app, which defines the Tomcat web application. Then input the following information:

  • Repository URL: https://repository.jboss.org
  • Repository Name: public
  • Identifier: $OO_LOCAL{groupId}:$OO_LOCAL{artifactId}:$OO_LOCAL{extension}
  • Version: $OO_LOCAL{appVersion}
  • Install Directory: /app/$OO_LOCAL{artifactId}
  • Deploy as user: app
  • Deploy as group: app
  • Restart: execute “rm -fr /app/tomcat7/webapps/$OO_LOCAL{deployContext}”

Screen Shot 2016-07-10 at 3.03.19 PM

As noted, $OO_LOCAL{} is typically used for defining artifacts, and the variables referenced in $OO_LOCAL{} are defined in the “variables” tab (between “summary” and “diff”). For more use cases of $OO_LOCAL{}, please refer to http://oneops.github.io/user/references/#variables
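To make the substitution concrete, here is a small shell sketch (my own illustration, not OneOps code) showing how the Identifier and Install Directory fields resolve once the variables defined earlier are substituted:

```shell
# Values from the "variables" tab defined earlier.
groupId="org.jboss.seam.examples-ee6.remoting.helloworld"
artifactId="helloworld-web"
extension="war"

# $OO_LOCAL{groupId}:$OO_LOCAL{artifactId}:$OO_LOCAL{extension}
identifier="${groupId}:${artifactId}:${extension}"
# /app/$OO_LOCAL{artifactId}
install_dir="/app/${artifactId}"

echo "$identifier"
echo "$install_dir"   # /app/helloworld-web
```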


(3) Update Tomcat component:

  • User: app
  • Group: app

It may also be worth reviewing the following attributes if they matter to you (not required for this demonstration):

  • Max Threads: The max number of active threads in the pool, default is 50
  • Min Spare Threads: The minimum number of threads kept alive, default is 25
  • Java Options: JVM command line options.
  • System Properties: key-value pairs for -D args to JVM
  • Startup Parameters: -XX arguments. For example,

Screen Shot 2016-07-10 at 3.23.12 PM
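To make these fields concrete, illustrative values might look like the following (hypothetical examples only, not the pack's defaults; the right values for your app depend on its memory and GC needs):

```
Java Options:       -Xms512m -Xmx1024m
System Properties:  env=demo
Startup Parameters: -XX:+UseConcMarkSweepGC -XX:MaxGCPauseMillis=200
```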

(4) Create a new user (“Username” is your local login name) and add your local SSH key, so that you could directly ssh into the virtual machines after the deployment.

Save & commit the platform.

Now move to the Transition phase and create a new environment. Please note: (1) Availability Mode should be set to Redundant; (2) choose 1 cloud as the Primary cloud for this demonstration.

Save the environment, then “Commit & Deploy”. The deployment plan should show up as follows. In this case, we will deploy 2 web application instances with a load balancer (fqdn) in front.

Screen Shot 2016-07-10 at 3.41.26 PM

After the deployment, we still need to add some (dependent) jar files to the library folder:


The additional jar files are:

After the jar files are added, restart both Tomcat instances to load the new jar files into the runtime. Go to the “Operate” phase, click the Tomcat platform name, and then the Tomcat component. Tick all Tomcat instances and choose “restart” from the “Action” dropdown list. Please see the following:

Screen Shot 2016-07-10 at 3.52.27 PM

To access the web application, we only need to know the load balancer address, which is the platform-level FQDN. Go to the “Operate” phase, click the Tomcat platform name, and then the fqdn component. The shorter address is the platform-level FQDN; input the following URL into the web browser:


The web application should look like:

Screen Shot 2016-07-11 at 10.37.38 PM

Regarding availability: as we deployed 2 instances of the web application, losing 1 instance should not hurt availability, as long as we use the load balancer address to access the application.

What is Next?

This is the first blog introducing Tomcat web application deployment on OneOps. More production-driven features and use cases may be the next topics, for example:

Please stay tuned!





Orchestrate Redis on OneOps


(Disclaimer: the blogs posted here only represent the author’s perspective; the OneOps project does not guarantee support or warranty for any code, tutorial, or documentation discussed here)

Redis is a very popular in-memory NoSQL cache and store that is used in production in many places. Here is the list of who is using Redis.

Regarding Redis deployment and automation, there is a well-known public Chef Redis cookbook. In this post I would like to show how transparently and easily it can be transplanted to make Redis deployment happen on OneOps. In general, I hope this post will open more avenues for bringing the best existing public DevOps practices into the OneOps ecosystem with very little effort.

As mentioned, the OneOps Redis cookbook was mostly mirrored from the well-known public Redis cookbook, so they are 99.99% the same! The only difference is that the OneOps Redis cookbook is more self-contained, so it does not refer to other cookbooks.

For example, in recipe/install.rb, it does not cross-reference the build-essential cookbook (as opposed to what the public Redis cookbook does). Instead, in recipe/_install_prereqs.rb, installing the packages “make automake gcc” makes sure the necessary build tools are installed from the Linux repositories, which has a similar result to running the build-essential cookbook.

In addition, Redis deployment through OneOps currently follows Cluster mode: a Redis cluster is created by running the redis-trib command on only one of the nodes. Please see the following piece of code for the slightly “tricky” cluster creation process:
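Conceptually, one node assembles a single redis-trib create command listing every cluster member and runs it once. The shell sketch below (my illustration, not the pack's actual code; the 10.0.0.x node IPs and port 6379 are placeholders) builds such a command for a 3-master/3-slave cluster:

```shell
# Build the one-shot cluster-creation command over six placeholder nodes.
NODES="10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4 10.0.0.5 10.0.0.6"
CMD="redis-trib.rb create --replicas 1"
for n in $NODES; do
  CMD="$CMD $n:6379"
done
echo "$CMD"
```

Here --replicas 1 asks redis-trib to pair each master with one slave, so six nodes yield the 3-master/3-slave layout deployed below.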


I guess the fundamental reason why OneOps cookbooks stay self-contained is to keep the cookbook codebase thin and lightweight. But if we do want to refer to a cookbook that is not available on OneOps, a temporary workaround is to copy that cookbook (and its dependencies) into the OneOps cookbook directory.

Redis Deployment on OneOps

In the OneOps “Design” phase, choose the “Redis” pack. After creating the Redis design, you may click the “redisio” component to review some Redis attributes (Redis version 3.0 and above is recommended).

Screen Shot 2016-06-29 at 12.14.42 PM

Add your local SSH key to the “user-app” component so that you can directly log into the Redis VMs after the deployment.

Screen Shot 2016-06-29 at 12.17.29 PM

After saving the Design, create a new environment with “Availability Mode” = redundant and choose 1 cloud as “primary cloud”.

By default, a Redis cluster with 6 VMs will be deployed: 3 VMs will serve as masters, and the other 3 VMs will be slaves that replicate data from the 3 masters. The deployment plan will look like the following:

Screen Shot 2016-06-29 at 12.37.37 PM

After the deployment, the Redis cluster is up and running. We can validate this by checking the cluster members: log into any VM and use the redis-cli command to output all cluster members.

>> ssh app@
-bash-4.2$ sudo -s
[root@redis-11075986-6-24531380 app]#
>> /usr/local/bin/redis-cli cluster nodes
xxxxc30d9 slave xxxx13fb 0 1467312088802 4 connected
xxxxb03a master - 0 1467312090305 3 connected 10923-16383
xxxx588c master - 0 1467312088802 2 connected 5461-10922
xxxx13fb master - 0 1467312088301 1 connected 0-5460
xxxx546e slave xxxxb03a 0 1467312089804 6 connected
xxxx01c9 myself,slave 3xxxx588c 0 0 5 connected

From the above output, we can see the 3 master nodes evenly split the keyspace (e.g. 0-5460), and each master serves the requests that fall into its own slot range. Each slave replicates data from one master. Next let’s verify how masters and slaves provide data redundancy.
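As a quick sanity check on the output above, the three masters' slot ranges should tile the entire 16384-slot Redis Cluster keyspace; a few lines of shell confirm the ranges add up:

```shell
# Sum the sizes of the three advertised slot ranges from the output above.
TOTAL=0
for range in 0-5460 5461-10922 10923-16383; do
  lo=${range%-*}
  hi=${range#*-}
  TOTAL=$(( TOTAL + hi - lo + 1 ))
done
echo "$TOTAL"   # 16384
```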

Verify the Redundancy of Redis Cluster

Put a key-value pair into Redis cluster:

>> /usr/local/bin/redis-cli -c
> set hello world
-> Redirected to slot [866] located at

Get the value by the key:

> get hello

Now let’s shut down the master that stores the “hello world” key-value pair (in this example, the node shown in the redirect above). Open a new terminal, SSH into it, run service redis@6379 stop to terminate the running Redis instance, then go back to the first terminal and output all cluster members again:

>> /usr/local/bin/redis-cli cluster nodes
xxxxc30d9 master xxxx13fb 0 1467312088802 4 connected
xxxxb03a master - 0 1467312090305 3 connected 10923-16383
xxxx588c master - 0 1467312088802 2 connected 5461-10922
xxxx13fb master,fail - 0 1467312088301 1 connected 0-5460
xxxx546e slave xxxxb03a 0 1467312089804 6 connected
xxxx01c9 myself,slave 3xxxx588c 0 0 5 connected

From the above, the Redis instance on the stopped node has been marked as fail, while its slave has been promoted to be the new master, covering the Redis failure.

Try to get the value by the key:

>> /usr/local/bin/redis-cli -c
> get hello
-> Redirected to slot [866] located at

The value can still be read from the Redis cluster, and the machine serving that request is now the newly promoted master.


The focus of this article is to demonstrate how easily a well-recognized public Chef cookbook can be transplanted and transparently integrated with OneOps, which may open opportunities for Chef users to migrate their existing cookbooks and scripts onto OneOps.

This article does not discuss the operational benefits of running a Redis cluster on OneOps. This is mostly because OneOps adopted the public Redis cookbook, which does not have full-fledged operational support. However, OneOps specializes in operational excellence, e.g. auto-repair, auto-replace, and auto-scale, as introduced in the Cassandra OneOps Pack. Making a highly resilient Redis deployment with strong operational support is future work.

High-performance Graphite on OneOps

(Disclaimer: the blogs posted here only represent the author’s perspective; the OneOps project does not guarantee support or warranty for any code, tutorial, or documentation discussed here)

Graphite is a notable enterprise-level time-series monitoring tool that runs well on commodity hardware. From an architectural perspective, it consists of 3 components:

  • Carbon: responsible for receiving metrics over the network and writing them down to disk using a storage backend.
  • Whisper: a file-based time-series database format for Graphite.
  • Graphite-web: A Django webapp that renders graphs

Although Graphite was originally written in 2006, it is still widely used by many organizations for production monitoring. However, as Graphite was not designed with a “distributed system” in mind back in 2006, using Graphite to handle a large volume of read/write requests is not trivial.

Lifting Graphite up to be a highly available, scalable, and redundant system is the top priority when it is considered for production nowadays. Many optimizations have been discussed in the last few years, covering every possible aspect of Graphite, such as replacing the backend data store with more scalable ones (Cassandra, InfluxDB), changing the metric file format, using SSDs for faster I/O, adding a front-end cache, and so on.

The Graphite pack on OneOps evolved from Walmart's internal Graphite production deployment, which has been used for Real User Monitoring (RUM) on the Walmart Global E-commerce websites (walmart.com, walmart.ca, asda.com…) for several years. There was news coverage from ABC News about Walmart presenting hot deals on the e-commerce websites during holiday seasons – the background charts were generated by Graphite!

Screen Shot 2016-07-21 at 6.17.17 PM.png

Video Link: http://abcnews.go.com/GMA/video/shoppers-head-online-cyber-monday-35486571

Now let me quickly introduce the architecture of Graphite pack on OneOps.


The Graphite cluster consists of n homogeneous nodes and a Round-Robin DNS load balancer. Each node is installed and configured with carbon, whisper, graphite-web (served by uwsgi + nginx), and memcached.

From the top, the raw metric data are ingested into the Graphite backend via the Round-Robin DNS load balancer, which evenly distributes the write requests over the Graphite nodes. There are 2 levels of carbon-relay:

The first-level relay runs consistent hashing to horizontally spread the write workload across all Graphite nodes. At the first level, users can specify how many times the metric data will be replicated in Graphite.

The second-level relay also runs consistent hashing, but only sends data locally to multiple carbon-cache instances. The number of carbon-cache instances, each of which independently writes to Whisper, equals the number of CPU cores on the node, in order to fully utilize the hardware resources.
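As a rough sketch of the second level (a hypothetical carbon.conf fragment, not the pack's generated configuration; instance names and ports are placeholders), a node with two CPU cores might run a local relay fanning out to two carbon-cache instances:

```
[relay]
RELAY_METHOD = consistent-hashing
# local second-level destinations: one carbon-cache per CPU core
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b

[cache:a]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104

[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
```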

Graphite-web is served by uwsgi with nginx, rather than Apache, mostly because of the faster response time (for more details, please refer to this comparison between uwsgi and Apache). Memcached is also configured with Graphite-web to boost throughput.

Finally, the Round-Robin DNS sits at the front end to load balance the read requests over the multiple Graphite-web instances.

Next I will show how to deploy Graphite in the cloud via OneOps.

Graphite Deployment via OneOps

In OneOps “Design” phase, choose “Graphite” pack:

Screen Shot 2016-06-21 at 3.11.25 PM

After creating the Graphite design, you may click the “graphite” component to review and update some Graphite attributes, such as the Graphite version, the replication factor, storage-schemas.conf, and so on.

Screen Shot 2016-06-21 at 3.14.12 PM

Add your local SSH key to the “user-graphite” component so that you can directly log into the Graphite VMs after the deployment.

Screen Shot 2016-06-21 at 3.21.26 PM

After saving the Design, create a new environment with “Availability Mode” = redundant and choose 1 cloud as “primary cloud”. Regarding setting up a cloud in OneOps, please refer to one of my previous blogs.

By default, a Graphite cluster with 3 VMs will be created. The deployment plan will look like the following: (number of compute instances is 3, denoting 3 VMs will be created)

Screen Shot 2016-06-21 at 3.24.53 PM

To access the Graphite GUI after the deployment, we need to know the DNS or load balancer address. To get this, go to “Operate” -> your_graphite_platform_name -> fqdn. The shorter URL is the platform-level FQDN, which resolves to the IP addresses of all the VMs.

Copy & paste the shorter URL into your browser and you should see the Graphite GUI:

Screen Shot 2016-06-21 at 4.14.45 PM

To test Graphite, it is convenient to use the command line to send some raw metric data and see whether it can be visualized on the GUI. The following test script runs in a loop, sending random metric values to Graphite. Replace your_fqdn_shorter_url with yours.

SERVER=your_fqdn_shorter_url
PORT=2003    # carbon's default plaintext line receiver port
while true; do
  echo "local.random.diceroll $RANDOM `date +%s`" | nc -c ${SERVER} ${PORT}
  sleep 3
done
On the left panel, navigate to the metric name “local.random.diceroll” and the graph should show up on the right. (Make sure to let the graph auto-refresh, and adjust the date and time range.)

Screen Shot 2016-06-21 at 5.12.03 PM

Monitoring and Alerting

The Graphite pack comes with basic process-level up/down monitoring, covering processes such as “carbon” and “memcached”. If a monitored process goes down for any reason, alerts are triggered and delivered to the subscribed email address.

Screen Shot 2016-06-21 at 4.32.22 PM

Additional Graphite tools

In addition to deploying Graphite itself, the pack provides a couple of useful tools to manage and operate the Graphite cluster:

Graphite Dashboard CLI Tool: helps synchronize and delete Graphite dashboards.

Note that:

  • Always put “http://” in front of the IP addresses, and NO “/” at the end. For example,
    graphite-dashboardcli sync '*'
  • Issue the command only on one node.
  • Regardless of which node you issue the command on, include all Graphite node IP addresses in the “sync” and “delete” commands.
  • Currently there is no cron-like job to periodically synchronize dashboards; run the command when needed.
  • If you want to create multiple dashboards “in a row”, the best practice is to sync each dashboard right after you create it (create N dashboards, sync N times).

Carbonate: helps manage the Graphite cluster, such as re-balancing data and syncing data across nodes. The configuration file is auto-generated and located at /opt/graphite/conf/carbonate.conf; among other settings, it contains:

SSH_USER = root


The Graphite pack on OneOps has a simple yet high-performance architecture that has served as the internal monitoring backbone at Walmart Global E-commerce for several years. Many optimizations could certainly be applied to the current architecture to further improve performance, so meaningful contributions and suggestions to the pack are highly appreciated.

Making a Grafana pack available on OneOps would also be a great contribution from a different angle, because more and more deployments use Graphite and Grafana together to present beautiful analytical dashboards and graph visualizations.