We love graphs here at Liquidstate HQ. After over a decade of building scalable infrastructure, we've learned the hard way that if you're not monitoring it, then you're not managing it!

What is Graphite?

Graphite is a scalable, real-time graphing system commonly used in conjunction with Collectd and Statsd for tracking system and application performance metrics. There's plenty of documentation out there about how wonderful it is, but that's not what I want to talk about today.

Unfortunately, there seems to be a lack of real-world documentation on some of the operational factors, in particular managing its storage.

Whisper

To store metrics, Graphite uses Whisper, a fixed-size database similar in design to RRD (round-robin database). It provides fast, reliable storage of numeric data over time, and it downsamples data as it ages to maintain a chosen retention policy.

At its simplest, imagine you have a stream of metrics coming in every 10 seconds. Whisper would automatically average out 1 minute of data and store a single data point for that minute in a .wsp file. It will keep the last 7 days of such samples, automatically discarding out-of-range data. You can specify multiple retentions: for example, "1m:7d,15m:30d,1h:1y" would instruct Whisper to keep 7 days of 1-minute data, 30 days of 15-minute data and a whole year of 1-hour data.
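
To make the averaging idea concrete, here is a minimal Python sketch. It only illustrates the bucketing described above and is not Whisper's own code; the function name downsample_average and the sample readings are made up for the example.

```python
from collections import defaultdict


def downsample_average(points, bucket_seconds=60):
    """Average (timestamp, value) pairs into fixed-width time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each timestamp to the start of the bucket it falls in.
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    # One averaged value per bucket, ordered by bucket start time.
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Hypothetical readings arriving every 10 seconds (timestamps in seconds).
raw = [(0, 1.0), (10, 2.0), (20, 3.0), (30, 4.0), (40, 5.0), (50, 6.0), (60, 10.0)]
print(downsample_average(raw))  # {0: 3.5, 60: 10.0}
```

The first six readings share the same minute and collapse into a single value; the seventh starts a new bucket.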

Capacity Planning

So you've had a think and decided what resolution of data you want to keep and for how long. How much disk space are you going to need? It's fairly easy to work out...

So, let's take the above example retention policy of "1m:7d,15m:30d,1h:1y". For the first part we will need to store (7 days x 24 hours x 60 minutes) = 10,080 data points. Then (30 x 96) = 2,880 and (365 x 24) = 8,760 data points respectively for the second and third parts of the policy. That gives us a total of 21,720 data points.
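
If you would rather not do this by hand, the same sum can be scripted. The sketch below is my own helper, not part of Graphite: the name points_per_archive and the unit table are assumptions, and a year is taken as 365 days to match the working above.

```python
import re

# Seconds per unit; this table is an assumption made for the sketch.
UNIT_SECONDS = {"s": 1, "m": 60, "min": 60, "h": 3600, "d": 86400, "y": 365 * 86400}


def to_seconds(token):
    """Convert a token such as '15m' or '1y' into seconds."""
    match = re.fullmatch(r"(\d+)([a-z]+)", token.strip().lower())
    if match is None or match.group(2) not in UNIT_SECONDS:
        raise ValueError(f"unrecognised time token: {token!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def points_per_archive(policy):
    """Return the number of data points each part of a retention policy holds."""
    counts = []
    for archive in policy.split(","):
        precision, duration = archive.split(":")
        counts.append(to_seconds(duration) // to_seconds(precision))
    return counts


counts = points_per_archive("1m:7d,15m:30d,1h:1y")
print(counts, sum(counts))  # [10080, 2880, 8760] 21720
```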

So, how much storage do we need? Well, the documentation gives an example for statsd where, for a retention policy of "10s:6h,1min:7d,10min:5y", they suggest a maximum on-disk size of 3.2 MB. This would be 2,160 + 10,080 + 262,800 = 275,040 data points and, assuming a linear relationship, this would be approximately 12 bytes per data point, which matches what I've seen in our environment too. Neato!
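
Here is that back-of-the-envelope division as a few lines of Python. It assumes, as above, that the size scales linearly with the number of points, and it tries both common readings of "MB" since the documentation figure is rounded.

```python
# Points held by each part of "10s:6h,1min:7d,10min:5y" (a year taken as 365 days).
points = (6 * 3600) // 10 + 7 * 24 * 60 + 5 * 365 * 24 * 6

# The quoted 3.2 MB, read as binary (1024 * 1024) and decimal (1000 * 1000) megabytes.
per_point_binary = 3.2 * 1024 * 1024 / points
per_point_decimal = 3.2 * 1000 * 1000 / points

print(points)                        # 275040
print(round(per_point_binary, 1))    # 12.2
print(round(per_point_decimal, 1))   # 11.6
```

Either way it rounds to about 12 bytes per data point.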

For a retention policy of "1m:7d,15m:30d,1h:1y", we now know we would store a maximum of 21,720 data points at 12 bytes each, thus requiring approximately 255 KB of on-disk storage.

Remember, this is for a single metric (one .wsp file). In a production environment, you will likely have lots of things you want to monitor across a number of servers. Trust me, it all adds up, so please be sure to do your maths first! To give you an idea, here at Liquidstate, we're handling a stream of over 120,000 metrics per minute.
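
To see how quickly it adds up, here is a rough sketch that scales the per-file estimate by the number of metrics. The 12 bytes per point is the estimate worked out above, any per-file header overhead is ignored, and the figure of 1,000 metrics is purely hypothetical; plug in your own count.

```python
BYTES_PER_POINT = 12  # estimate from the working above, not an exact file-format figure


def estimate_bytes(total_points, metric_count=1, bytes_per_point=BYTES_PER_POINT):
    """Rough on-disk size for metric_count files, each holding total_points points."""
    return total_points * bytes_per_point * metric_count


def human(size):
    """Format a byte count using 1024-based units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1024 or unit == "TB":
            return f"{size:.1f} {unit}"
        size /= 1024


# 21,720 points per file for "1m:7d,15m:30d,1h:1y".
print(human(estimate_bytes(21720)))                     # 254.5 KB
print(human(estimate_bytes(21720, metric_count=1000)))  # 248.6 MB
```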

Changing retention settings

So, perhaps you got a little carried away and started recording metrics of everything that moves (and a few that don't, just in case). You perhaps forgot that all those metric sources add up and now you're running out of disk space?

Well, good news! You can retrospectively change the retention of a .wsp file using the whisper-resize tool that comes with Graphite.

$ cd /var/lib/carbon/whisper/path/to/metrics
$ find ./ -type f -name '*.wsp' -exec whisper-resize --nobackup {} 1m:2d 5m:7d 15m:60d \;

Note: the retention policy is separated by spaces rather than commas.
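
Before and after a resize, it can be handy to measure how much space your .wsp files actually take. The following is a small sketch of my own, not a Graphite tool: the function name whisper_disk_usage is made up, and the path is the one used in the commands above, so adjust it to your install.

```python
import os


def whisper_disk_usage(root):
    """Return (file_count, total_bytes) for every .wsp file under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".wsp"):
                file_count += 1
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return file_count, total_bytes


# Path taken from the commands above; a missing directory simply reports zero files.
count, size = whisper_disk_usage("/var/lib/carbon/whisper")
print(f"{count} files, {size / (1024 * 1024):.1f} MB")
```

Run it once before the resize and once after to see how much space you have reclaimed.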