High-performance Graphite on OneOps

(Disclaimer: the posts on this blog represent only the author’s own views. The OneOps project does not guarantee support or warranty for any code, tutorial, or documentation discussed here.)

Graphite is a well-known, enterprise-grade time-series monitoring tool that runs well on commodity hardware. Architecturally, it consists of three components:

  • Carbon: receives metrics over the network and writes them to disk through a storage backend.
  • Whisper: a file-based time-series database format for Graphite.
  • Graphite-web: a Django web app that renders graphs.

Although Graphite was originally written in 2006, it is still widely used by many organizations for production monitoring. However, because Graphite was not designed with a “distributed system” in mind back in 2006, using it to handle a large volume of read and write requests is not trivial.

Making Graphite highly available, scalable, and redundant is the top priority when it is considered for production use today. Many optimizations have been discussed over the last few years, covering nearly every aspect of Graphite: replacing the backend data store with more scalable ones (Cassandra, InfluxDB), changing the metric file format, using SSDs for faster I/O, adding a front-end cache, and so on.

The Graphite pack on OneOps evolved from Walmart’s internal production Graphite deployment, which has been used for Real User Monitoring (RUM) of the Walmart Global E-commerce websites (walmart.com, walmart.ca, asda.com…) for several years. ABC News once covered the hot deals Walmart offered on its e-commerce websites during the holiday season, and the charts in the background of that segment were generated by Graphite!

[Screenshot: ABC News segment with Graphite charts in the background]

Video Link: http://abcnews.go.com/GMA/video/shoppers-head-online-cyber-monday-35486571

Now let me quickly introduce the architecture of the Graphite pack on OneOps.

[Figure: architecture of the Graphite pack on OneOps]

The Graphite cluster consists of n homogeneous nodes behind a round-robin DNS load balancer. Each node is installed and configured with carbon, whisper, graphite-web (served by uwsgi + nginx), and memcached.

From the top, raw metric data is ingested into the Graphite backend through the round-robin DNS load balancer, which evenly distributes the write requests over the Graphite nodes. There are two levels of carbon-relay:

The first-level relay uses consistent hashing to spread the write workload horizontally across all Graphite nodes. At this level, users can specify how many times each metric is replicated in Graphite.

The second-level relay also uses consistent hashing, but it only sends data locally to multiple carbon-cache instances. Each carbon-cache instance writes to Whisper independently, and the number of instances equals the number of CPU cores on the node so that the hardware is fully utilized.
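To make the two relay levels more concrete, here is a minimal Python sketch of consistent hashing with replication. It is a toy illustration, not carbon-relay's actual implementation: the class name HashRing, the number of virtual nodes, the hashing details, and the node and instance names are all assumptions made for this example.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring (illustrative only, not carbon's own code)."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring `vnodes` times to even out the load.
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._position(f"{node}:{i}"), node))
        self._ring.sort()

    @staticmethod
    def _position(key):
        # Map a string to a 32-bit ring position using part of its MD5 digest.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def get_nodes(self, metric, replication_factor=1):
        """Return up to `replication_factor` distinct nodes for a metric name."""
        start = bisect.bisect(self._ring, (self._position(metric), ""))
        chosen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == replication_factor:
                    break
        return chosen


# First level: pick the Graphite nodes (hypothetical names) for a metric.
nodes = HashRing(["graphite-1", "graphite-2", "graphite-3"])
# Second level: pick a local carbon-cache instance, assuming a 4-core node.
caches = HashRing([f"carbon-cache-{i}" for i in range(4)])

metric = "local.random.diceroll"
for node in nodes.get_nodes(metric, replication_factor=2):
    print(node, "->", caches.get_nodes(metric)[0])
```

The point of the sketch is that the same metric name always maps to the same nodes and the same local carbon-cache instance, so each Whisper file is written by exactly one process per node.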

Graphite-web is served by uwsgi with nginx rather than Apache, mostly because of the faster response time (for more details, please refer to this comparison between uwsgi and Apache). Memcached is also configured with Graphite-web to boost throughput.

Finally, the round-robin DNS sits at the front end to load-balance read requests over the multiple Graphite-web instances.

Next, I will show how to deploy Graphite on a cloud via OneOps.

Graphite Deployment via OneOps

In the OneOps “Design” phase, choose the “Graphite” pack:

[Screenshot: choosing the “Graphite” pack in the OneOps “Design” phase]

After creating the Graphite design, you may click the “graphite” component to review and update Graphite attributes such as the Graphite version, replication factor, storage-schemas.conf, and so on.

[Screenshot: attributes of the “graphite” component]

Add your local SSH key to the “user-graphite” component so that you can log directly into the Graphite VMs after the deployment.

[Screenshot: adding an SSH key to the “user-graphite” component]

After saving the design, create a new environment with “Availability Mode” = redundant and choose one cloud as the “primary cloud”. For details on setting up a cloud in OneOps, please refer to one of my previous blog posts.

By default, a Graphite cluster with 3 VMs will be created. The deployment plan will look like the following (the number of compute instances is 3, meaning that 3 VMs will be created):

[Screenshot: deployment plan showing 3 compute instances]

To access the Graphite GUI after the deployment, we need to know the DNS or load balancer address. To find it, go to “Operate” -> your_graphite_platform_name -> fqdn. The shorter URL is the platform FQDN, which resolves to the IP addresses of all VMs.
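Because the platform FQDN is a round-robin DNS name, one quick sanity check is to list every address it resolves to. The following is a minimal Python sketch using only the standard library. The helper name resolve_all is made up for this example, and your_fqdn_shorter_url stands for the FQDN copied from OneOps.

```python
import socket


def resolve_all(fqdn, port=80):
    """Return the sorted, de-duplicated IP addresses a host name resolves to."""
    infos = socket.getaddrinfo(fqdn, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Hypothetical usage (replace the placeholder with your own FQDN):
# print(resolve_all("your_fqdn_shorter_url"))
```

With the default deployment described above, the list would be expected to contain one address per Graphite VM.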

Copy and paste the shorter URL into your browser and you should see the Graphite GUI:

[Screenshot: Graphite GUI]

To test Graphite, it is convenient to send some raw metric data from the command line and check that it can be visualized in the GUI. The following test script runs in a loop and sends random metric values to Graphite. Replace your_fqdn_shorter_url with your own.

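# Carbon's plaintext listener port
PORT=2003
SERVER=your_fqdn_shorter_url
while true; do
  # Plaintext protocol line: <metric path> <value> <Unix timestamp>
  # Note: the meaning of nc's -c flag depends on the netcat variant.
  echo "local.random.diceroll $RANDOM $(date +%s)" | nc -c "${SERVER}" "${PORT}"
  sleep 3
done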
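If netcat is not available, or its flags differ on your system, the same test can be sketched in Python with the standard library. This is only an illustrative equivalent of the script above. The function names format_metric and send_metric are made up for this example, and the default port 2003 matches the PORT value used in the script.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one plaintext-protocol line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(server, path, value, port=2003, timeout=5.0):
    """Open a TCP connection to Carbon, send a single metric line, and close."""
    line = format_metric(path, value)
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical usage, mirroring the shell loop above:
# import random
# while True:
#     send_metric("your_fqdn_shorter_url", "local.random.diceroll",
#                 random.randint(0, 32767))
#     time.sleep(3)
```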

In the left panel, navigate to the metric name “local.random.diceroll” and the graph should show up on the right. (Make sure to let the graph auto-refresh and adjust the date and time range.)

[Screenshot: graph of local.random.diceroll in the Graphite GUI]
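The same check can also be done without the GUI by asking Graphite-web's render endpoint for JSON instead of an image. Below is a minimal Python sketch under a few assumptions: the function names render_url and fetch_datapoints are invented for this example, graphite.example.com is a placeholder host, and the “-5min” window is an arbitrary choice.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(base_url, target, since="-5min"):
    """Build a Graphite-web /render URL that returns JSON for one target."""
    query = urlencode({"target": target, "from": since, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"


def fetch_datapoints(base_url, target, since="-5min", timeout=10):
    """Return the [value, timestamp] pairs of the first matching series."""
    with urlopen(render_url(base_url, target, since), timeout=timeout) as resp:
        series = json.load(resp)
    return series[0]["datapoints"] if series else []


# Hypothetical usage (replace the placeholder with your platform FQDN):
# print(fetch_datapoints("http://your_fqdn_shorter_url", "local.random.diceroll"))
```

If the test script is running, the returned list should contain recent non-null values. An empty list suggests the metric has not reached Graphite yet.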

Monitoring and Alerting

The Graphite pack comes with basic process-level up/down monitoring for processes such as “carbon” and “memcached”. If a monitored process goes down for any reason, an alert is triggered and delivered to the signed-up email address.

[Screenshot: monitoring and alerting in OneOps]

Additional Graphite tools

In addition to deploying Graphite itself, the pack also provides a couple of useful tools to manage and operate the Graphite cluster:

Graphite Dashboard CLI Tool: helps to synchronize and delete Graphite dashboards.

Note that:

  • Always put “http://” in front of the IP address, with no “/” at the end. For example,
    graphite-dashboardcli sync '*' http://10.247.198.50 http://10.247.198.5
  • Issue the command on only one node.
  • Regardless of which node you issue the command on, include the IP addresses of all Graphite nodes in the “sync” and “delete” commands.
  • Currently there is no cron-like job to synchronize dashboards periodically, so run the command when needed.
  • If you want to create multiple dashboards in a row, the best practice is to sync right after you create each one (create N dashboards and sync N times).
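The URL rules in the notes above are easy to get wrong by hand, so a small helper can normalize the node addresses before the command is run. The Python sketch below only assembles the argument list and does not execute anything. The helper names are invented for this example, and the only things taken from the post are the “sync” subcommand and the argument order shown in the example above.

```python
def normalize_node_url(address):
    """Ensure an address starts with a scheme and has no trailing '/'."""
    address = address.strip().rstrip("/")
    if not address.startswith(("http://", "https://")):
        address = "http://" + address
    return address


def build_sync_command(pattern, addresses):
    """Return the argument list for a sync across all Graphite nodes."""
    urls = [normalize_node_url(a) for a in addresses]
    return ["graphite-dashboardcli", "sync", pattern] + urls


# Reproduces the example command from the notes, starting from raw inputs.
print(build_sync_command("*", ["10.247.198.50/", "http://10.247.198.5"]))
```

Passing the resulting list to something like subprocess.run on a single node would match the “issue the command on only one node” rule, while still listing every node address.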

Carbonate: helps to manage the Graphite cluster, for example by re-balancing data and syncing data across nodes. The configuration file is auto-generated and located at /opt/graphite/conf/carbonate.conf:

[main]
DESTINATIONS = 10.65.224.94:2004:carbon01, 10.65.224.217:2004:carbon01

REPLICATION_FACTOR = 1
SSH_USER = root
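Since the file is plain INI, it can be inspected programmatically, for example to confirm that every node appears in DESTINATIONS after a scale-up. The following is a minimal Python sketch using the standard configparser module. The function name parse_destinations is made up for this example, and the sample text simply repeats the configuration shown above.

```python
import configparser

# Sample text copied from the configuration shown above.
SAMPLE = """\
[main]
DESTINATIONS = 10.65.224.94:2004:carbon01, 10.65.224.217:2004:carbon01

REPLICATION_FACTOR = 1
SSH_USER = root
"""


def parse_destinations(text, cluster="main"):
    """Return ([(host, port, instance), ...], replication_factor) for a cluster."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser[cluster]
    destinations = []
    for item in section["DESTINATIONS"].split(","):
        # Each entry has the form host:port:instance.
        host, port, instance = item.strip().split(":")
        destinations.append((host, int(port), instance))
    return destinations, section.getint("REPLICATION_FACTOR")


destinations, replication_factor = parse_destinations(SAMPLE)
print(len(destinations), "destinations, replication factor", replication_factor)
```

To check a real node, the same function could be fed the contents of /opt/graphite/conf/carbonate.conf instead of the sample string.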

Summary

The Graphite pack on OneOps has a simple yet high-performance architecture that has served as the monitoring backbone at Walmart Global E-commerce for several years. Many optimizations could certainly be applied to the current architecture to further improve performance, so meaningful contributions and suggestions to the pack are highly appreciated.

Making the Grafana pack available on OneOps is also a great contribution from a different angle, because Graphite and Grafana are increasingly used together to present analytical dashboards and graph visualizations.
