Highly Resilient Cassandra on OneOps

(Disclaimer: the blogs posted here neither represent Walmart’s official perspective on the technologies discussed, nor provide any support or warranty for the code, tutorials, documentation, etc.)

Apache Cassandra is a top-level Apache project born at Facebook and built on ideas from Amazon’s Dynamo and Google’s BigTable. It is a distributed database for managing large amounts of structured or unstructured data across commodity servers, and it is highly available, scalable, and redundant, with no single point of failure.

[Image: An overview of Apache Cassandra]

Cassandra’s architecture is designed for scaling and continuous uptime. Instead of the common master-slave or sharded architectures, Cassandra uses a “ring” design similar to Dynamo’s, which is elegant and easy to set up and maintain.

In this blog, I would like to cover the following three topics:

  • Cassandra deployment on the cloud via OneOps
  • Easy scaling and automatic failure handling with OneOps
  • Cassandra performance monitoring on OneOps

Cassandra Deployment

In the OneOps “Design” phase, choose the “Cassandra” pack:

[Screenshot: selecting the Cassandra pack in the Design phase]

After creating the Cassandra design, you may click the “cassandra” component to change the default configurations of Cassandra.

[Screenshot: default configuration of the “cassandra” component]

Pay particular attention to the “topology” section. The Cassandra pack uses GossipingPropertyFileSnitch by default, since it is recommended by the community for production. Because Cassandra is datacenter- and rack-aware, the “Map of Cloud to DC:Rack” attribute is where you specify the mapping between OneOps clouds and <Data Center:Rack> pairs; it applies to GossipingPropertyFileSnitch and PropertyFileSnitch. For experimental purposes, it is fine to leave it blank. For more details about snitches and Cassandra topology, please refer to the Cassandra documentation on snitches.
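Under the hood, GossipingPropertyFileSnitch reads each node’s datacenter and rack from cassandra-rackdc.properties. As a rough sketch (the exact file location depends on how the pack lays out the installation), the mapping ends up on each node as something like the following, using the datacenter (dal) and rack (2) that appear in the nodetool output later in this post:

# cassandra-rackdc.properties (location varies by installation)
dc=dal
rack=2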

Also in the Design phase, add a new “user” component with your local SSH public key, so that you can log directly into the Cassandra VMs after the deployment.
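If you do not have an SSH key pair yet, a standard OpenSSH key works; generate one locally and paste the public half into the “user” component (a generic example, not specific to OneOps):

ssh-keygen -t rsa -b 4096 -C "your_email@example.com"   # generate a key pair under ~/.ssh/
cat ~/.ssh/id_rsa.pub                                    # copy this public key into the "user" component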

[Screenshot: the “user” component with an SSH key]

After saving the design, create a new environment with “Availability Mode” = redundant and choose one cloud as the “primary cloud” (one cloud is enough for experimental purposes). For setting up a cloud in OneOps, please refer to one of my previous blogs.

By default, a Cassandra cluster with 3 VMs will be created. The deployment plan will look like the following (the number of compute instances is 3, meaning 3 VMs will be created):

[Screenshot: deployment plan showing 3 compute instances]

After the deployment finishes, we can SSH into the VMs. To check a VM’s IP address, go to “Operate” -> your_cassandra_platform_name -> “compute”, which shows the details of the compute instances, including their IP addresses.

[Screenshot: compute instances and their IP addresses in the Operate view]
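With an IP address from the compute list and the key added to the “user” component, logging in is a plain SSH call (the username and IP below are the ones from this walkthrough; yours will differ):

ssh -i ~/.ssh/id_rsa nzhang2@10.65.227.135   # use the private key matching the public key in the "user" component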

Once logged into one of the VMs, run nodetool, Cassandra’s command-line administration utility, to show the status of the cluster:

[nzhang2@cas-238213-1-23632914 ~]$ nodetool status
Datacenter: dal
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.65.227.135  51.6 KB    256     67.3%             4d99d6ad-889a-42d5-b288-b79c02dd1c9a  2
UN  10.65.227.121  66.08 KB   256     63.3%             1e3e77b4-c39f-40f8-85ed-830a22d61afc  2
UN  10.65.227.12   99.01 KB   256     69.4%             e4867580-d77c-47ce-9c8d-1b8ddc8f781d  2

Easy Scaling and Automatic Failure Handling by OneOps

It is typical for a Cassandra admin to scale up cluster capacity as the workload grows. Without automation software like OneOps, the admin has to manually follow several steps to safely add new nodes to an existing cluster (sketched below). OneOps hides those manual, error-prone details behind a couple of mouse clicks. Let’s see how this happens.
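For context, here is roughly what “manually adding a node” means for Cassandra. This is a sketch of the usual steps, not what the OneOps pack literally executes; file locations and service names depend on the installation:

# On the new machine: install the same Cassandra version, then edit cassandra.yaml:
#   cluster_name: must match the existing cluster
#   seeds: point at a few existing nodes, e.g. "10.65.227.135,10.65.227.121"
#   listen_address / rpc_address: the new node's own IP
service cassandra start    # start the daemon and let the node bootstrap (stream in its share of data)
nodetool status            # wait until the new node shows up as UN (Up/Normal)
# Finally, on each pre-existing node, drop the data it no longer owns
nodetool cleanup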

In the OneOps “Transition” phase, choose the Cassandra environment, then click your_cassandra_platform_name on the right. You will see the “Scaling” section:

[Screenshot: the “Scaling” section in the Transition phase]

Change the value of “Current” from 3 to 4 to add one more node into the existing cluster. Save and deploy.

[Screenshot: changing the “Current” value from 3 to 4]

After the deployment is done, run the “nodetool status” command again to see the output:

[nzhang2@cas-238213-1-23632914 ~]$ nodetool status
Datacenter: dal
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.65.227.135  109.28 KB  256     51.6%             4d99d6ad-889a-42d5-b288-b79c02dd1c9a  2
UN  10.65.227.121  123.55 KB  256     51.5%             1e3e77b4-c39f-40f8-85ed-830a22d61afc  2
UN  10.65.227.12   156.38 KB  256     48.3%             e4867580-d77c-47ce-9c8d-1b8ddc8f781d  2
UN  10.65.227.14   66.57 KB   256     48.7%             60812954-0529-45bd-8f39-12a15a02ac01  2

Now the cluster has one more node and the load can be evenly distributed over all 4 nodes, and we did not have to type a single command to make this happen!

Decommissioning a node is also just a couple of mouse clicks, without touching the CLI (command-line interface): in the “Scaling” section, decrease the value of “Current” by 1. After the deployment, one node has been safely removed from the cluster without service interruption.
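For reference, the manual equivalent that OneOps is automating here is Cassandra’s decommission flow, run on the node being removed:

nodetool decommission      # stream this node's data to the remaining replicas and leave the ring
nodetool netstats          # optional: watch the streaming progress from another terminal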

Auto-scaling is also available on OneOps, as it is in other automation software and services such as AWS, Ansible, and SaltStack. What is really unique to OneOps is AUTO-REPLACE: when a VM and the service on it (together called an “instance”) are truly down, OneOps will automatically replace the bad instance with a new one and make it join the existing service or cluster.

To take advantage of the auto-replace feature, go to the OneOps “Operate” page, click your_cassandra_platform_name, and in the “summary” tab find the “auto-replace” section:

[Screenshot: the “auto-replace” section in the summary tab]

Click “Enable autoreplace”, then “Change conditions” to set up the thresholds that trigger auto-replace:

[Screenshot: auto-replace conditions]

Before auto-replace is triggered, OneOps will first try to repair the failed instance. As shown in the picture above:

  • Replace unhealthy after repairs #: how many times OneOps will attempt to repair the instance before running auto-replace.
  • Replace unhealthy after minutes: how long to wait after all repair attempts have been made before executing auto-replace.

For experimental purposes, we could define the auto-replace feature like this:

[Screenshot: example auto-replace settings]

Now let’s simulate an instance going down hard by blocking a wide range of Linux ports via iptables. As far as I know, this is by far the simplest and most effective way to simulate a VM going down in the cloud. (On a side note, just shutting down the applications, such as the cassandra daemon, does not have the same effect as blocking Linux ports. Correctly “simulating a machine down” is very important for debugging and testing a distributed system.)

# Make sure the iptables service starts on boot so the rules survive a reboot
chkconfig --level 345 iptables on
# Drop inbound TCP traffic on ports 22-65535 (including SSH and Cassandra's ports) and persist the rules
iptables -A INPUT -p tcp --match multiport --dport 22:65535 -j DROP;service iptables save
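If you ever want to undo the simulation by hand (instead of letting auto-replace recycle the VM), flushing the rules reverses it; this assumes there are no other iptables rules on the VM that you care about:

iptables -F; service iptables save     # drop all filter rules and persist the now-empty rule set
chkconfig iptables off                 # stop re-enabling iptables at boot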

Then, on a healthy VM, run “nodetool status” to see that one node is now down, in “DN” (Down/Normal) status (the last node in the following example output):

[nzhang2@cas-238213-1-23632914 ~]$ nodetool status
Datacenter: dal
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.65.227.135  197.28 KB  256     51.6%             4d99d6ad-889a-42d5-b288-b79c02dd1c9a  2
UN  10.65.227.121  211.52 KB  256     51.5%             1e3e77b4-c39f-40f8-85ed-830a22d61afc  2
UN  10.65.227.12   244.12 KB  256     48.3%             e4867580-d77c-47ce-9c8d-1b8ddc8f781d  2
DN  10.65.227.14   173.96 KB  256     48.7%             60812954-0529-45bd-8f39-12a15a02ac01  2

After a few minutes, we should receive alert notifications (via email) about “Heartbeat is violated” and “repair starting”:

[Screenshot: alert notification about the heartbeat violation and repair]

Since the port-blocking rules persist across reboots, OneOps will not be able to repair the VM, so after several minutes it triggers the auto-replace:

[Screenshot: auto-replace being triggered]

On the OneOps “Operate” page, you should see that a deployment triggered by the auto-replace is underway:

[Screenshot: deployment triggered by auto-replace]

After the deployment is done, “nodetool status” shows that all four nodes are up and working; note that the failed node (10.65.227.14) has been replaced by a brand-new instance (10.65.227.90):

[nzhang2@cas-238213-1-23632914 ~]$ nodetool status
Datacenter: dal
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.65.227.135  157.05 KB  256     48.1%             4d99d6ad-889a-42d5-b288-b79c02dd1c9a  2
UN  10.65.227.121  189.84 KB  256     49.8%             1e3e77b4-c39f-40f8-85ed-830a22d61afc  2
UN  10.65.227.12   189.53 KB  256     51.7%             e4867580-d77c-47ce-9c8d-1b8ddc8f781d  2
UN  10.65.227.90   83.09 KB   256     50.3%             503e39c5-3b79-4a50-b74f-de17dc3e4f46  2

In general, replacing a node in a distributed system requires two steps: (1) rotate the bad node out of the cluster, and (2) add a replacement node into it. Each step involves multiple manual operations, especially for Cassandra (see the sketch below). On cloud-based infrastructure there is a relatively high chance that a VM becomes “dead” or inaccessible for many reasons (e.g. hypervisor errors, overloaded host servers), so replacing a bad node can become a “routine” job for the operations folks.
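For comparison, here is a sketch of the usual manual procedure for Cassandra, using the dead node’s host ID and IP from the “nodetool status” output above; neither step is needed when auto-replace handles it:

# Option 1: from any live node, remove the dead node by its host ID,
# then bootstrap a brand-new node as in the scaling section
nodetool removenode 60812954-0529-45bd-8f39-12a15a02ac01
# Option 2: start the replacement node with the replace_address option
# (typically appended to JVM_OPTS in cassandra-env.sh)
#   -Dcassandra.replace_address=10.65.227.14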

As the example above shows, OneOps can auto-detect bad instances, try to repair them, and, after several unsuccessful attempts, replace them and bootstrap new instances to join the existing service. All of this happens automatically: OneOps “auto-pilots” the services and applications orchestrated on it!

Cassandra Performance Monitoring on OneOps

Several Cassandra performance monitors can also be visualized in the OneOps UI, for example Read Operations, Write Operations, and Logs.

Go to the OneOps “Operate” page, choose the Cassandra environment, click your_cassandra_platform_name, then click the “cassandra” component:

[Screenshot: the “cassandra” component in the Operate view]

Select any one of the Cassandra instances:

[Screenshot: list of Cassandra instances]

The details of the selected instance are then shown. Click the “monitors” tab, then “Write Operations”. On the right side, the write operations for that Cassandra instance (e.g. cassandra-238213-1) are visualized over time.

[Screenshot: “Write Operations” monitor]

Click “Read Operations” to see the monitoring graph for reads:

[Screenshot: “Read Operations” monitor]

The “Log” tab shows the counts of different types of log lines (total, error, critical, etc.):

[Screenshot: “Log” monitor]
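The same kind of read/write statistics can also be pulled from the command line on any node, which is handy for cross-checking the graphs; these are standard nodetool commands, not something specific to OneOps:

nodetool tpstats      # per-stage counters, including the read and mutation (write) stages
nodetool cfstats      # per-keyspace/table read and write counts and latencies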

For production or other serious use cases, we can also set up alerts on the monitored statistics, so that when the read/write operations drop below a threshold, or the number of error log lines exceeds a threshold, push notifications are sent to the registered email, mobile, or collaboration tools. In future blogs, I plan to show how to add customized monitoring and alerting for more production-driven cases.

What is Next?

Cassandra admins will have a better experience and higher productivity if Cassandra can be managed and operated through a GUI such as OpsCenter. There is an effort underway to develop an OpsCenter pack on OneOps, and I plan to post a blog about it when it is ready.
