(Disclaimer: the blogs posted here represent only the author’s own views; the OneOps project does not guarantee support or warranty for any code, tutorial, or documentation discussed here.)
ZooKeeper is a very popular centralized coordination service. Its use cases cover a large space: leader election, group membership, configuration maintenance, event notification, locking, and more.
Many open-source projects depend on ZooKeeper. Here are some examples:
- Kafka: ZooKeeper stores broker and consumer information and performs leader election
- Storm: ZooKeeper coordinates the various “actors” (e.g. the Nimbus and Supervisor daemons)
- HBase: ZooKeeper acts as a distributed coordination service to maintain server state (tracking which servers are available and providing server failure notification)
- Solr: ZooKeeper is used to maintain the configuration across Solr servers
OneOps provides a ZooKeeper pack with proactive monitoring and alerting features. Let’s first go through the steps of deploying ZooKeeper via OneOps.
First, select the ZooKeeper pack:
Many ZooKeeper configuration options can be updated and tuned. For experimental purposes, we can just leave them as-is for now.
Add your local SSH key to the “user-zookeeper” component so that you can log directly into the ZooKeeper VMs after the deployment.
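If you do not have a key pair yet, you can create one and print its public half to paste into the component. A minimal sketch, assuming the default key path `~/.ssh/id_rsa` (adjust if you keep keys elsewhere):

```shell
# Generate an SSH key pair if one does not already exist
# (assumption: the default path ~/.ssh/id_rsa).
mkdir -p ~/.ssh
[ -f ~/.ssh/id_rsa.pub ] || ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

# Print the public key; this is what goes into the
# "user-zookeeper" component.
cat ~/.ssh/id_rsa.pub
```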
After saving the design, create a new environment with “Availability Mode” set to redundant and choose one cloud as the “primary cloud”. For setting up a cloud in OneOps, please refer to one of my previous blogs.
By default, a ZooKeeper cluster with 3 VMs will be created. The deployment plan will look like the following (the number of compute instances is 3, meaning 3 VMs will be created):
After the deployment finishes, we can SSH into the VMs. To find the IP address of a VM, go to “Operate” -> your_zookeeper_platform_name -> “compute”. There you will see the details of each compute instance, including its IP address.
From your local machine, SSH into each ZooKeeper VM separately, then run “sudo -s” to get root access for the commands in the rest of this blog.
```
ssh zookeeper@vm_ip_address
Last login: Mon Jun 13 22:32:12 2016 from 172.29.238.216
[zookeeper@zookeeper-238213-2-23206984 ~]$ sudo -s
```
`/etc/init.d/zookeeper-server` is the start/stop/status script. To check the status of the ZooKeeper process on the current VM:
```
>> /etc/init.d/zookeeper-server status
JMX enabled by default
Using config: /etc/zookeeper/zoo.cfg
Mode: follower
```
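Besides the init script, ZooKeeper answers simple “four-letter” commands on its client port, which makes scripted health checks easy. A minimal sketch, assuming `nc` (netcat) is available on the VM; the host, port, and structure here are illustrative, not how any particular OneOps monitor is implemented:

```shell
# Classify the reply to the "ruok" four-letter command: a healthy
# ZooKeeper process answers "imok"; anything else (including no
# reply at all) means the process is down or unreachable.
classify_reply() {
  if [ "$1" = "imok" ]; then echo "UP"; else echo "DOWN"; fi
}

# Query a ZooKeeper process (assumption: nc is installed and
# ZooKeeper listens on the given host/port).
check_zk() {
  reply=$(echo ruok | nc -w 2 "$1" "$2" 2>/dev/null)
  classify_reply "$reply"
}

# Usage against a live process:
#   check_zk localhost 2181
```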
Now let’s check out the proactive monitoring and alerting features that come with OneOps and its ZooKeeper pack: operators receive “push”-style notifications in their email inbox, on mobile, and/or in collaboration tools (e.g. Slack) when something goes wrong or recovers.
Currently two monitoring scripts run on every ZooKeeper VM. The first, called ZookeeperProcess, monitors whether the ZooKeeper process is up or down. Let’s shut down the ZooKeeper process on one VM to simulate something bad happening, e.g. someone killing all Java processes by mistake.
```
>> /etc/init.d/zookeeper-server stop
JMX enabled by default
Using config: /etc/zookeeper/zoo.cfg
Stopping zookeeper ... STOPPED
```
Check the ZooKeeper status again; it shows the process is not running.
```
>> /etc/init.d/zookeeper-server status
JMX enabled by default
Using config: /etc/zookeeper/zoo.cfg
Error contacting service. It is probably not running.
```
If the OneOps installation on your premises has “monitoring and alerting” enabled, you should receive an email (and see an alert on the OneOps “Operate” page) indicating “Zookeeper Process Down”.
Integration with mobile and collaboration tools requires additional setup.
Since ZooKeeper operates on a quorum basis and the cluster we just created has size 3 (i.e. 3 VMs), losing 1 ZooKeeper process will not affect the continuity of the ZooKeeper service. To verify this, let’s connect to the ZooKeeper service.
```
>> /usr/lib/zookeeper/zookeeper-3.4.5/bin/zkCli.sh -server fqdn:2181
Welcome to ZooKeeper!
JLine support is enabled

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
```
Here, replace fqdn with your platform-level FQDN. To find it, go to “Operate” -> your_zookeeper_platform_name -> “fqdn”. To learn more about FQDNs in OneOps, please refer to one of my previous blogs.
Get the list of current znodes in ZooKeeper with “ls /”:
```
[zk: fqdn:2181(CONNECTED) 0] ls /
[zookeeper, aliases.json, clusterstate.json]
```
The console output above shows that the ZooKeeper service is still running, even though one ZooKeeper process went down.
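The arithmetic behind this behavior: an ensemble of N servers needs a majority quorum of floor(N/2) + 1 to serve requests, so it tolerates N minus quorum failures. A quick sketch of the math:

```shell
# Majority quorum for a ZooKeeper ensemble of size N: floor(N/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# With 3 servers, quorum is 2, so the ensemble survives 1 failure.
N=3
Q=$(quorum "$N")
echo "ensemble=$N quorum=$Q tolerated_failures=$(( N - Q ))"
# prints: ensemble=3 quorum=2 tolerated_failures=1
```

This is also why ensembles are usually sized with an odd number of servers: a 4-node ensemble needs a quorum of 3 and tolerates only 1 failure, the same as a 3-node ensemble.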
Now let’s bring down another ZooKeeper process on a different VM. This time we expect the ZooKeeper service to become unavailable, since 2 out of 3 ZooKeeper processes have gone down, leaving more than 50% of the nodes in the cluster unavailable.
This scenario triggers the second monitor, called Cluster Health, to fire alerts indicating that ZooKeeper has lost quorum and is no longer serviceable.
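A cluster-health probe in this spirit can look at the `mntr` four-letter command, which reports (among other things) `zk_server_state` as leader, follower, or standalone. A minimal sketch of parsing that output; this is an illustration only, and the actual Cluster Health monitor in OneOps may be implemented differently:

```shell
# Extract zk_server_state from "mntr" output, which is a series of
# tab-separated key/value lines. A node that has lost quorum stops
# reporting a leader/follower state.
server_state() {
  awk -F'\t' '$1 == "zk_server_state" { print $2 }'
}

# Against a live node, you would pipe nc output into the parser:
#   echo mntr | nc -w 2 localhost 2181 | server_state

# Example with canned output from a healthy follower:
printf 'zk_version\t3.4.5\nzk_server_state\tfollower\n' | server_state
# prints: follower
```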
Again, we can connect with the ZooKeeper shell and try the “ls /” command.
```
[zk: fqdn:2181(CONNECTING) 0] ls /
Exception in thread "main" org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1496)
	at org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:725)
	at org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:593)
	at org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:365)
	at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:323)
	at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:282)
```
From the output above, we can see that the client now fails with a ConnectionLoss exception: the ensemble has lost quorum and cannot serve requests.
Finally, let’s restore the ZooKeeper service by starting the ZooKeeper processes back up.
```
>> /etc/init.d/zookeeper-server start
JMX enabled by default
Using config: /etc/zookeeper/zoo.cfg
Starting zookeeper ... STARTED
```
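Rather than watching manually, you can poll until the process reports healthy again. A small sketch; the status command and timings are illustrative:

```shell
# Poll a status command until it succeeds or the attempt budget
# runs out.
#   $1 = command to run (e.g. "/etc/init.d/zookeeper-server status")
#   $2 = maximum number of attempts
wait_for_zk() {
  attempts=$2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if $1 >/dev/null 2>&1; then
      echo "zookeeper is up after $i check(s)"
      return 0
    fi
    i=$(( i + 1 ))
    sleep 1
  done
  echo "zookeeper still down after $attempts check(s)" >&2
  return 1
}

# Usage on a VM:
#   wait_for_zk "/etc/init.d/zookeeper-server status" 30
```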
After a few minutes, you should receive notifications from OneOps that the service and process have been restored.