Deploy Apache Zookeeper via OneOps

(Disclaimer: the blogs posted here represent only the author’s own perspective. The OneOps project does not guarantee support or warranty for any code, tutorial, or documentation discussed here.)

Zookeeper is a very popular centralized coordination service. Its use cases cover a lot of ground: leader election, group membership, configuration maintenance, event notification, locking, and so on.

Many open-source projects also depend on Zookeeper. Here are some examples:

  • Kafka: Zookeeper saves broker info, consumer info and does leader election
  • Storm: Zookeeper coordinates the various “actors” (e.g. Nimbus, Supervisor daemons)
  • HBase: Zookeeper acts as a distributed coordination service that maintains server state (which servers are available) and provides server failure notification
  • Solr: Zookeeper is used to maintain the configuration between Solr servers.

OneOps provides a Zookeeper pack with proactive monitoring and alerting features. Let’s first go through the steps of deploying Zookeeper via OneOps.

First, select the Zookeeper pack:

[Screenshot: selecting the Zookeeper pack]

Many Zookeeper configuration settings can be updated and tuned. For experimental purposes, we can leave them as-is for now.

[Screenshot: Zookeeper configuration settings]

Add your local SSH key to the “user-zookeeper” component so that you can log in to the Zookeeper VMs directly after the deployment.

[Screenshot: adding an SSH key to the “user-zookeeper” component]

After saving the Design, create a new environment with “Availability Mode” set to redundant, and choose one cloud as the “primary cloud”. For details on setting up a cloud in OneOps, please refer to one of my previous blogs.

By default, a Zookeeper cluster with 3 VMs will be created. The deployment plan will look like the following (the number of compute instances is 3, meaning that 3 VMs will be created):

[Screenshot: deployment plan with 3 compute instances]

After the deployment finishes, we can SSH into the VMs. To find the IP address of each VM, go to “Operate” -> your_zookeeper_platform_name -> “compute”. This page lists the details of every compute instance, including its IP address.

[Screenshot: compute instance details showing the IP addresses]

From your local machine, SSH into each Zookeeper VM separately, then run “sudo -s” to get “root” access. The commands in the rest of this blog are run as root.

ssh zookeeper@vm_ip_address
Last login: Mon Jun 13 22:32:12 2016 from 172.29.238.216
[zookeeper@zookeeper-238213-2-23206984 ~]$ sudo -s

`/etc/init.d/zookeeper-server` is the start/stop/status script. To check the status of the Zookeeper process on the current VM:

>> /etc/init.d/zookeeper-server status
JMX enabled by default
Using config: /etc/zookeeper/zoo.cfg
Mode: follower
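When checking several VMs, it can help to reduce the status output to a single word. The following is a minimal sketch, assuming the status script prints a “Mode:” line as in the output above. The function name `zk_mode` is hypothetical and not part of the Zookeeper pack.

```shell
# Hypothetical helper: reads status output on stdin and prints the value of
# the "Mode:" line (e.g. follower or leader). Prints "unknown" if there is
# no such line, which is what we expect when the process is not running.
zk_mode() {
  awk -F': ' '/^Mode:/ { print $2; found = 1 } END { if (!found) print "unknown" }'
}

# Usage on a Zookeeper VM (as root):
#   /etc/init.d/zookeeper-server status 2>&1 | zk_mode
```

In a 3-VM cluster, one would expect one VM to report leader and the other two to report follower.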

Now let’s look at the proactive monitoring and alerting features that come with OneOps and its Zookeeper pack. With these features, operators receive push-style notifications by email, on mobile, and/or in collaboration tools (e.g. Slack) when something goes wrong or recovers.

Currently there are two monitoring scripts that run on every Zookeeper VM. The first one is called ZookeeperProcess, which monitors whether the Zookeeper process is up or down. Let’s shut down the Zookeeper process on one VM to simulate a failure, e.g. someone killing all Java processes by mistake.

>> /etc/init.d/zookeeper-server stop
JMX enabled by default
Using config: /etc/zookeeper/zoo.cfg
Stopping zookeeper ... STOPPED

Check the Zookeeper status again. It now shows that the Zookeeper process is not running.

>> /etc/init.d/zookeeper-server status
JMX enabled by default
Using config: /etc/zookeeper/zoo.cfg
Error contacting service. It is probably not running.

If the OneOps installation on your premises has “monitoring and alerting” enabled, you should receive an email (and see an alert message on the OneOps “Operate” page) reporting “Zookeeper Process Down”.

Integration with mobile and collaboration tools requires additional setup.

[Screenshot: “Zookeeper Process Down” alert]

Since Zookeeper runs on the basis of a quorum and the cluster we just created has a size of 3 (i.e. 3 VMs), losing 1 Zookeeper process will not interrupt the Zookeeper service. To verify this, let’s connect to the Zookeeper service.

>> /usr/lib/zookeeper/zookeeper-3.4.5/bin/zkCli.sh -server fqdn:2181
Welcome to ZooKeeper!
JLine support is enabled

WATCHER::

WatchedEvent state:SyncConnected type:None path:null

Here, replace fqdn with your platform-level FQDN. To find it, go to “Operate” ->  your_zookeeper_platform_name -> “fqdn”. To understand more about FQDN in OneOps, please refer to one of my previous blogs.

List the current znodes in Zookeeper with “ls /”:

[zk: fqdn:2181(CONNECTED) 0] ls /
[zookeeper, aliases.json, clusterstate.json]

The console output above shows that the Zookeeper service is still running, even though one Zookeeper process is down.

Now let’s bring down another Zookeeper process on a different VM. This time we expect the Zookeeper service to become unavailable, since 2 out of 3 Zookeeper processes are down, which leaves more than 50% of the nodes in the cluster unavailable.

This scenario will trigger the second monitor, called Cluster Health, which fires alerts indicating that the Zookeeper cluster has lost quorum and can no longer serve requests.

[Screenshot: Cluster Health alert]
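The quorum arithmetic behind these two scenarios can be sketched in a few lines of shell. This is only an illustration of the majority rule (more than half of the servers must be running), not the check that the Cluster Health monitor actually performs. The function names `quorum_size` and `has_quorum` are hypothetical.

```shell
# Majority quorum for an ensemble of n servers: floor(n / 2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds if the running servers still form a majority.
# $1 = ensemble size, $2 = number of servers currently running
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

# Walk through the 3-VM cluster from this blog as processes are stopped.
for running in 3 2 1; do
  if has_quorum 3 "$running"; then
    echo "$running of 3 running: quorum held"
  else
    echo "$running of 3 running: quorum lost"
  fi
done
```

For the 3-VM cluster the quorum size is 2, so the service survives one stopped process but not two.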

Again, we can connect to the Zookeeper shell and try the “ls /” command.

[zk: fqdn:2181(CONNECTING) 0] ls /
Exception in thread "main" org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1496)
    at org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:725)
    at org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:593)
    at org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:365)
    at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:323)
    at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:282)

The output above shows that the client cannot reach the Zookeeper service and fails with a ConnectionLoss exception.

Finally, let’s restore the Zookeeper service by starting the stopped Zookeeper processes again.

>> /etc/init.d/zookeeper-server start
JMX enabled by default
Using config: /etc/zookeeper/zoo.cfg
Starting zookeeper ... STARTED

After a few minutes, you should receive notifications from OneOps that the service and the processes have been restored.
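While waiting for the notifications, recovery can also be confirmed from the shell by polling the status script. The following is a minimal sketch of a generic retry loop. The function name `wait_until` is hypothetical, and the usage line assumes that a running server prints a “Mode:” line, as in the status output shown earlier.

```shell
# Hypothetical helper: run a command repeatedly until it succeeds.
# $1 = maximum attempts, $2 = seconds to sleep between attempts,
# remaining arguments = the command to run.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Only sleep if another attempt is coming.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage on a Zookeeper VM (as root): try up to 30 times, 10 seconds apart.
#   wait_until 30 10 sh -c '/etc/init.d/zookeeper-server status 2>&1 | grep -q "^Mode:"'
```

The function returns 0 as soon as the command succeeds and 1 if every attempt fails, so it can be chained with `&&` in a script.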
